I first indexed a document:
➜ ~ curl http://localhost:9200/test1/test1/1/ -d '{"text":"a b d"}'
{"_index":"test1","_type":"test1","_id":"1","_version":1,"_shards":{"total":2,"successful":1,"failed":0},"created":true}
Then I searched with a query string:
➜ ~ curl http://localhost:9200/test1/test1/_search\?q\=a+AND+b+OR+c+AND+d
{"took":53,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":0,"max_score":null,"hits":[]}}
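Why is there no hit?
It seems to me that whether a AND b OR c AND d is interpreted as a AND (b OR c) AND d, as (a AND b) OR (c AND d), or as ((a AND b) OR c) AND d, the document should match.
Moreover, the official Lucene documentation https://lucene.apache.org/core ... .html gives an example stating explicitly that it is parsed as (a AND b) OR (c AND d):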
The Precedence Query Parser extends the Standard Query Parser and enables the boolean precedence. So, the query <a AND b OR c AND d> is parsed to <(+a +b) (+c +d)> instead of <+a +b +c +d>
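As a quick sanity check of the reasoning above, here is a minimal Python sketch that evaluates each candidate grouping against the terms of the indexed document. It assumes the text "a b d" is tokenized on whitespace; the names `doc_terms` and `groupings` are illustrative only and have nothing to do with the ES API.

```python
# Terms of the indexed document {"text": "a b d"}, assuming whitespace tokenization.
doc_terms = set("a b d".split())

# True when the document contains the term.
a, b, c, d = (t in doc_terms for t in "abcd")

# Each candidate reading of "a AND b OR c AND d", evaluated as plain boolean logic.
groupings = {
    "a AND (b OR c) AND d": a and (b or c) and d,
    "(a AND b) OR (c AND d)": (a and b) or (c and d),
    "((a AND b) OR c) AND d": ((a and b) or c) and d,
    "+a +b +c +d": a and b and c and d,
}

for expr, matched in groupings.items():
    print(f"{expr:24} -> {matched}")
```

Only the last reading, where every term is required, fails to match, which is consistent with the zero hits returned by the search.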
Then I tried the _explain API:
{
"_index" : "test1",
"_type" : "test1",
"_id" : "1",
"matched" : false,
"explanation" : {
"value" : 0.0,
"description" : "Failure to meet condition(s) of required/prohibited clause(s)",
"details" : [
{
"value" : 0.25316024,
"description" : "weight(_all:a in 0) [PerFieldSimilarity], result of:",
"details" : [
{
"value" : 0.25316024,
"description" : "score(doc=0,freq=1.0 = termFreq=1.0\n), product of:",
"details" : [
{
"value" : 0.2876821,
"description" : "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
"details" : [
{
"value" : 1.0,
"description" : "docFreq",
"details" : [ ]
},
{
"value" : 1.0,
"description" : "docCount",
"details" : [ ]
}
]
},
{
"value" : 0.88,
"description" : "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
"details" : [
{
"value" : 1.0,
"description" : "termFreq=1.0",
"details" : [ ]
},
{
"value" : 1.2,
"description" : "parameter k1",
"details" : [ ]
},
{
"value" : 0.75,
"description" : "parameter b",
"details" : [ ]
},
{
"value" : 3.0,
"description" : "avgFieldLength",
"details" : [ ]
},
{
"value" : 4.0,
"description" : "fieldLength",
"details" : [ ]
}
]
}
]
}
]
},
{
"value" : 0.25316024,
"description" : "weight(_all:b in 0) [PerFieldSimilarity], result of:",
"details" : [
{
"value" : 0.25316024,
"description" : "score(doc=0,freq=1.0 = termFreq=1.0\n), product of:",
"details" : [
{
"value" : 0.2876821,
"description" : "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
"details" : [
{
"value" : 1.0,
"description" : "docFreq",
"details" : [ ]
},
{
"value" : 1.0,
"description" : "docCount",
"details" : [ ]
}
]
},
{
"value" : 0.88,
"description" : "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
"details" : [
{
"value" : 1.0,
"description" : "termFreq=1.0",
"details" : [ ]
},
{
"value" : 1.2,
"description" : "parameter k1",
"details" : [ ]
},
{
"value" : 0.75,
"description" : "parameter b",
"details" : [ ]
},
{
"value" : 3.0,
"description" : "avgFieldLength",
"details" : [ ]
},
{
"value" : 4.0,
"description" : "fieldLength",
"details" : [ ]
}
]
}
]
}
]
},
{
"value" : 0.0,
"description" : "no match on required clause (_all:c)",
"details" : [
{
"value" : 0.0,
"description" : "no matching term",
"details" : [ ]
}
]
},
{
"value" : 0.25316024,
"description" : "weight(_all:d in 0) [PerFieldSimilarity], result of:",
"details" : [
{
"value" : 0.25316024,
"description" : "score(doc=0,freq=1.0 = termFreq=1.0\n), product of:",
"details" : [
{
"value" : 0.2876821,
"description" : "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
"details" : [
{
"value" : 1.0,
"description" : "docFreq",
"details" : [ ]
},
{
"value" : 1.0,
"description" : "docCount",
"details" : [ ]
}
]
},
{
"value" : 0.88,
"description" : "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
"details" : [
{
"value" : 1.0,
"description" : "termFreq=1.0",
"details" : [ ]
},
{
"value" : 1.2,
"description" : "parameter k1",
"details" : [ ]
},
{
"value" : 0.75,
"description" : "parameter b",
"details" : [ ]
},
{
"value" : 3.0,
"description" : "avgFieldLength",
"details" : [ ]
},
{
"value" : 4.0,
"description" : "fieldLength",
"details" : [ ]
}
]
}
]
}
]
}
]
}
}
This says "no match on required clause (_all:c)", but I cannot see where the OR in the middle went.
So how exactly does ES parse multiple ANDs and ORs in query string syntax?
8 replies
kennywu76 - Wood
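Upvoted by: rockybean, novia, liujia, medcl, strglee
Second, from a quick look at the ES source: the QueryString Query is built by QueryStringQueryBuilder, which calls the parse method of MapperQueryParser, a class written by ES itself, to do the parsing. That class ultimately extends org.apache.lucene.queryparser.classic.QueryParser, so the org.apache.lucene.queryparser.flexible.precedence.PrecedenceQueryParser class quoted in your question is not used at all.
So my guess is that ES's QueryString Query resolves the precedence of operators such as AND and OR differently from PrecedenceQueryParser. The Lucene documentation's description of PrecedenceQueryParser also implies that the standard QueryParser parses <a AND b OR c AND d> as <+a +b +c +d>.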
Testing with your example (with profile enabled), I found that ES parses "a AND b OR c AND d" so that a, b, c, and d are all must clauses.
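The document at https://www.elastic.co/guide/e ... ators specifically explains that boolean operators are more complicated and less intuitive than they look. Based on the test results, I infer that without grouping operators (parentheses), ES (Lucene's standard parser) simply applies AND, OR, and the like to the operands on either side, and uses precedence only to decide which operator each operand gets (corresponding to + and -), rather than considering how boolean expressions bind together, as PrecedenceQueryParser does.
So when using AND and OR, it is best to group boolean expressions explicitly with parentheses. Better still, use the + and - operators together with parenthesized groups: + and - affect only the operand to their right and map one-to-one onto the must, should, and must_not clauses of a bool query, which makes the query clearer to write.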
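To make the advice about explicit grouping concrete, here is a hedged Python sketch that builds two request bodies for the intended logic (a AND b) OR (c AND d): a bool query with explicit must/should clauses, and a query_string query with parentheses. The helper names `match`, `all_of`, and `any_of` are hypothetical conveniences, and the field name `text` is taken from the example document; the bodies are only constructed and printed, not sent to a cluster.

```python
import json

FIELD = "text"  # field used in the example document


def match(field, term):
    """Build a match query for a single term."""
    return {"match": {field: term}}


def all_of(*clauses):
    """Every clause is required (bool must)."""
    return {"bool": {"must": list(clauses)}}


def any_of(*clauses):
    """At least one clause has to match (bool should)."""
    return {"bool": {"should": list(clauses), "minimum_should_match": 1}}


# (a AND b) OR (c AND d), written as an explicit bool query.
explicit_bool = any_of(
    all_of(match(FIELD, "a"), match(FIELD, "b")),
    all_of(match(FIELD, "c"), match(FIELD, "d")),
)

# The same intent as a query_string query with explicit parentheses.
grouped_query_string = {
    "query_string": {"default_field": FIELD, "query": "(a AND b) OR (c AND d)"}
}

body = {"query": explicit_bool}
print(json.dumps(body, indent=2))
```

Either body could be posted to the `_search` endpoint; the bool form avoids relying on the query string parser altogether.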
liujia
```
POST test1/_search
{
  "profile": true,
  "query": {
    "query_string": {
      "default_field": "text",
      "query": "a OR c AND d"
    }
  }
}
```
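"a AND b OR c AND d" and "a AND b AND c AND d" return the same results.
I just saw the Lucene documentation you posted, and the behavior is clearly different. I suppose that should count as a bug.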
medcl - 今晚打老虎。
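The QueryString documentation actually covers this:
(Paraphrased:)
Precedence: NOT > AND > OR
Lucene syntax: + and - apply only to the expression to the right of the operator
Lucene syntax: AND and OR apply to the expressions on both sides of the operator
Wood's answer above already covers all of this.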
kennywu76 - Wood
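In particular, it stresses that operators must be grouped with parentheses to produce boolean logic in the usual sense.
The article also says that prefix operators such as + and - are the best way to use the QueryString Query.
@medcl I suggest the official ES documentation add a dedicated note about this pitfall in the QueryString Query, perhaps with a link to the article above.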
dizhuang
ferraborghini
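If the expression is a AND b OR c AND d, then in addClause, adding the third clause is fine, but when the fourth is added, the previous clause is taken out, forcibly converted to MUST, and put back. That is what produces this result. So it is not really a precedence issue.
As for precedence, has anyone found anything new?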
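The clause handling described above can be modelled roughly in Python. This is a simplified sketch of how the classic parser's left-to-right clause handling appears to behave, not the actual Lucene source: the function name `flat_parse` is hypothetical, and it ignores parentheses, explicit + and - prefixes, fields, and analysis.

```python
MUST, SHOULD, MUST_NOT = "+", "", "-"


def flat_parse(query, default_operator="OR"):
    """Assign an occur flag to each term of a parenthesis-free query, left to right."""
    clauses = []    # list of [occur, term]
    conj = None     # conjunction seen just before the current term
    negate = False  # True when the current term is introduced by NOT
    for tok in query.split():
        if tok in ("AND", "OR"):
            conj = tok
            continue
        if tok == "NOT":
            negate = True
            continue
        prev_editable = bool(clauses) and clauses[-1][0] != MUST_NOT
        # A term introduced by AND makes the preceding clause required.
        if prev_editable and conj == "AND":
            clauses[-1][0] = MUST
        # With a default operator of AND, an explicit OR makes the preceding clause optional.
        if prev_editable and default_operator == "AND" and conj == "OR":
            clauses[-1][0] = SHOULD
        # Decide the occur flag of the current term.
        if negate:
            occur = MUST_NOT
        elif default_operator == "OR":
            occur = MUST if conj == "AND" else SHOULD
        else:
            occur = SHOULD if conj == "OR" else MUST
        clauses.append([occur, tok])
        conj, negate = None, False
    return " ".join(occur + term for occur, term in clauses)


for q in ("a AND b OR c AND d", "a AND b AND c AND d", "a OR c AND d"):
    print(f"{q!r:24} -> {flat_parse(q)}")
```

Under this model, c starts out optional when it follows OR, and is promoted to required as soon as d arrives with AND, which matches the "no match on required clause (_all:c)" line in the explain output.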
liujia
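By default all operators sit at the same level; without parentheses there is naturally no boolean logic.
The official site already provides the relevant material and advice, including a description of usage that can lead to wrong results: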
https://www.elastic.co/guide/e ... ators
chamcyl
Precedence here means deciding whether a term gets + or - when it has NOT/AND/OR operators on both its left and its right.