How does ES resolve AND/OR precedence in the querystring syntax?

Author: 三斗室 | Posted 2017-08-13 | Views: 3288

I first indexed a document:
 
➜ ~ curl http://localhost:9200/test1/test1/1/ -d '{"text":"a b d"}'
{"_index":"test1","_type":"test1","_id":"1","_version":1,"_shards":{"total":2,"successful":1,"failed":0},"created":true}
 
Then I searched with a querystring:
➜ ~ curl http://localhost:9200/test1/test1/_search\?q\=a+AND+b+OR+c+AND+d
{"took":53,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":0,"max_score":null,"hits":[]}}
 
Why is there no hit?

It seems to me that whether a AND b OR c AND d is interpreted as a AND (b OR c) AND d, as (a AND b) OR (c AND d), or as ((a AND b) OR c) AND d, the document should match.
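
As a quick sanity check of that claim, here is a minimal Python sketch that evaluates the three groupings as ordinary boolean logic. It assumes the only indexed terms are a, b, and d (from the document "a b d" above); the variable names are made up for illustration.

```python
# Terms present in the indexed document {"text": "a b d"} (assumption: one
# whitespace-separated token per term, nothing else in the index).
doc_terms = {"a", "b", "d"}
a, b, c, d = (t in doc_terms for t in ("a", "b", "c", "d"))

# The three readings of "a AND b OR c AND d" listed in the question.
groupings = {
    "a AND (b OR c) AND d": a and (b or c) and d,
    "(a AND b) OR (c AND d)": (a and b) or (c and d),
    "((a AND b) OR c) AND d": ((a and b) or c) and d,
}

for expr, result in groupings.items():
    print(f"{expr}: {result}")
```

All three evaluate to True, so none of these readings explains the empty result.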
 
Moreover, the official Lucene documentation at https://lucene.apache.org/core ... .html gives an example that explicitly says the query is parsed as (a AND b) OR (c AND d):
 


The Precedence Query Parser extends the Standard Query Parser and enables the boolean precedence. So, the query <a AND b OR c AND d> is parsed to <(+a +b) (+c +d)> instead of <+a +b +c +d>


Then I tried the _explain API:
 
{
"_index" : "test1",
"_type" : "test1",
"_id" : "1",
"matched" : false,
"explanation" : {
"value" : 0.0,
"description" : "Failure to meet condition(s) of required/prohibited clause(s)",
"details" : [
{
"value" : 0.25316024,
"description" : "weight(_all:a in 0) [PerFieldSimilarity], result of:",
"details" : [
{
"value" : 0.25316024,
"description" : "score(doc=0,freq=1.0 = termFreq=1.0\n), product of:",
"details" : [
{
"value" : 0.2876821,
"description" : "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
"details" : [
{
"value" : 1.0,
"description" : "docFreq",
"details" : [ ]
},
{
"value" : 1.0,
"description" : "docCount",
"details" : [ ]
}
]
},
{
"value" : 0.88,
"description" : "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
"details" : [
{
"value" : 1.0,
"description" : "termFreq=1.0",
"details" : [ ]
},
{
"value" : 1.2,
"description" : "parameter k1",
"details" : [ ]
},
{
"value" : 0.75,
"description" : "parameter b",
"details" : [ ]
},
{
"value" : 3.0,
"description" : "avgFieldLength",
"details" : [ ]
},
{
"value" : 4.0,
"description" : "fieldLength",
"details" : [ ]
}
]
}
]
}
]
},
{
"value" : 0.25316024,
"description" : "weight(_all:b in 0) [PerFieldSimilarity], result of:",
"details" : [
{
"value" : 0.25316024,
"description" : "score(doc=0,freq=1.0 = termFreq=1.0\n), product of:",
"details" : [
{
"value" : 0.2876821,
"description" : "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
"details" : [
{
"value" : 1.0,
"description" : "docFreq",
"details" : [ ]
},
{
"value" : 1.0,
"description" : "docCount",
"details" : [ ]
}
]
},
{
"value" : 0.88,
"description" : "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
"details" : [
{
"value" : 1.0,
"description" : "termFreq=1.0",
"details" : [ ]
},
{
"value" : 1.2,
"description" : "parameter k1",
"details" : [ ]
},
{
"value" : 0.75,
"description" : "parameter b",
"details" : [ ]
},
{
"value" : 3.0,
"description" : "avgFieldLength",
"details" : [ ]
},
{
"value" : 4.0,
"description" : "fieldLength",
"details" : [ ]
}
]
}
]
}
]
},
{
"value" : 0.0,
"description" : "no match on required clause (_all:c)",
"details" : [
{
"value" : 0.0,
"description" : "no matching term",
"details" : [ ]
}
]
},
{
"value" : 0.25316024,
"description" : "weight(_all:d in 0) [PerFieldSimilarity], result of:",
"details" : [
{
"value" : 0.25316024,
"description" : "score(doc=0,freq=1.0 = termFreq=1.0\n), product of:",
"details" : [
{
"value" : 0.2876821,
"description" : "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
"details" : [
{
"value" : 1.0,
"description" : "docFreq",
"details" : [ ]
},
{
"value" : 1.0,
"description" : "docCount",
"details" : [ ]
}
]
},
{
"value" : 0.88,
"description" : "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
"details" : [
{
"value" : 1.0,
"description" : "termFreq=1.0",
"details" : [ ]
},
{
"value" : 1.2,
"description" : "parameter k1",
"details" : [ ]
},
{
"value" : 0.75,
"description" : "parameter b",
"details" : [ ]
},
{
"value" : 3.0,
"description" : "avgFieldLength",
"details" : [ ]
},
{
"value" : 4.0,
"description" : "fieldLength",
"details" : [ ]
}
]
}
]
}
]
}
]
}
}

It says "no match on required clause (_all:c)", but I cannot find where the OR in the middle went.
 
So how exactly does ES parse multiple ANDs and ORs in the querystring syntax?

kennywu76 - wood@Ctrip

Upvoted by: rockybean novia liujia medcl strglee

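First, I tested it, and the problem you describe does exist.

Second, from a quick look at the ES source, the QueryString Query is built by QueryStringQueryBuilder, which calls the parse method of MapperQueryParser, a class written by ES itself, to do the parsing. That class ultimately extends org.apache.lucene.queryparser.classic.QueryParser, which means ES does not use the org.apache.lucene.queryparser.flexible.precedence.PrecedenceQueryParser class quoted in your question.
 
So my guess is that the ES QueryString Query resolves the precedence of operators such as AND and OR differently from PrecedenceQueryParser. The Lucene documentation's description of PrecedenceQueryParser also implies that the standard QueryParser parses <a AND b OR c AND d> into <+a +b +c +d>.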


The Precedence Query Parser extends the Standard Query Parser and enables the boolean precedence. So, the query <a AND b OR c AND d> is parsed to <(+a +b) (+c +d)> instead of <+a +b +c +d>.
 


 
 
Using your example, my test (with profile enabled) shows that ES parses "a AND b OR c AND d" into the following form:
"type": "BooleanQuery",
"description": """+(+text:a +text:b +text:c +text:d) #ConstantScore(MatchNoDocsQuery("empty BooleanQuery"))""",

In other words, a, b, c, and d are all must clauses.
 
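In https://www.elastic.co/guide/e ... ators the ES documentation specifically explains that boolean operators are more complicated and less intuitive than they look. Based on the test results, my guess is that without grouping operators (parentheses), ES (that is, Lucene's standard parser) simply applies AND, OR, and similar operators to the operands on either side, and uses precedence only to decide which operator (the equivalent of + or -) each operand ends up with. Unlike PrecedenceQueryParser, it does not take the precedence of boolean expression grouping into account.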
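
The rule guessed at above can be modeled in a few lines of Python. This is only a simplified sketch, not Lucene's code: the function name `classic_occurs` is made up, it assumes the default operator is OR, and it handles only flat, whitespace-separated terms with AND, OR, and NOT (no parentheses, fields, or phrases).

```python
def classic_occurs(query):
    """Return the +/- prefixed form of a flat query under the guessed rule.

    Hypothetical model: "+" means must, "-" means must_not, no prefix
    means should. Assumes the default operator is OR.
    """
    clauses = []     # each entry is [prefix, term]
    conj = None      # conjunction seen just before the current term
    negated = False  # True if NOT precedes the current term
    for token in query.split():
        if token == "AND":
            conj = "AND"
        elif token == "OR":
            conj = "OR"
        elif token == "NOT":
            negated = True
        else:
            # AND promotes the clause on its left to required,
            # unless that clause is prohibited.
            if conj == "AND" and clauses and clauses[-1][0] != "-":
                clauses[-1][0] = "+"
            if negated:
                prefix = "-"
            elif conj == "AND":
                prefix = "+"
            else:
                prefix = ""  # OR, or no operator, leaves the term optional
            clauses.append([prefix, token])
            conj, negated = None, False
    return " ".join(prefix + term for prefix, term in clauses)


print(classic_occurs("a AND b OR c AND d"))  # +a +b +c +d
print(classic_occurs("a OR c AND d"))        # a +c +d
```

Under this model the OR never demotes anything: the AND that follows c makes c required, which is consistent with the "no match on required clause (_all:c)" line in the explain output.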
 
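So when using AND and OR, it is best to group the boolean expression explicitly with parentheses. Better still, use the "+" and "-" operators together with parenthesized groups: + and - affect only the operand to their right and correspond directly to the must, should, and must_not clauses of a bool query, which makes the query clearer to write.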
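
To see why explicit grouping changes the outcome, here is a minimal Python sketch of bool-query style matching. It is an illustration under assumptions, not ES code: the `matches` helper and the tuple representation are invented here, "+" stands for must, "-" for must_not, and an empty prefix for should, and a query with no must clause is assumed to need at least one matching should clause.

```python
def matches(query, doc_terms):
    """Evaluate a nested clause list against a set of document terms.

    A query is either a term (str) or a list of (prefix, subquery) pairs,
    where prefix is "+" (must), "-" (must_not), or "" (should).
    """
    if isinstance(query, str):
        return query in doc_terms
    must = [q for occur, q in query if occur == "+"]
    must_not = [q for occur, q in query if occur == "-"]
    should = [q for occur, q in query if occur == ""]
    if any(matches(q, doc_terms) for q in must_not):
        return False
    if not all(matches(q, doc_terms) for q in must):
        return False
    # With no required clause, at least one optional clause has to match.
    if not must:
        return any(matches(q, doc_terms) for q in should)
    return True


doc = {"a", "b", "d"}  # terms of the indexed document "a b d"

# +a +b +c +d : what the ungrouped "a AND b OR c AND d" turned into
flat = [("+", "a"), ("+", "b"), ("+", "c"), ("+", "d")]

# (+a +b) (+c +d) : what "(a AND b) OR (c AND d)" expresses
grouped = [
    ("", [("+", "a"), ("+", "b")]),
    ("", [("+", "c"), ("+", "d")]),
]

print(matches(flat, doc))     # False
print(matches(grouped, doc))  # True
```

The flat form fails because c is required, while the grouped form matches through its first should clause.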
 

liujia


This is also different from what I had understood. I am not sure whether it is a bug or the intended behavior. Using profile makes it clearer:
 
```
POST test1/_search
{
  "profile": true,
  "query": {
    "query_string": {
      "default_field": "text",
      "query": "a OR c AND d"
    }
  }
}
```
 
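"a AND b OR c AND d" and "a AND b AND c AND d" give the same result.
 
I just saw the Lucene documentation you posted, and the behavior is very clearly different. So I suppose this should count as a bug.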

medcl - Elastic 🇨🇳 !


👍 to Wood's reply.
Actually, the QueryString documentation does mention this:


The familiar operators AND, OR and NOT (also written &&, || and !) are also supported. However, the effects of these operators can be more complicated than is obvious at first glance. NOT takes precedence over AND, which takes precedence over OR. While the + and - only affect the term to the right of the operator, AND and OR can affect the terms to the left and right.


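(In short:)
Precedence: NOT > AND > OR
Lucene syntax: + and - apply only to the expression to the right of the operator
Lucene syntax: AND and OR apply to the expressions on both sides of the operator
 
Wood's answer above already covers all of this.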
 
 

kennywu76 - wood@Ctrip


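I just found, via Google, an article that explains this issue very clearly: https://lucidworks.com/2011/12 ... -not/
 
In particular, it stresses that parentheses must be used to combine operators in order to get boolean logic in the usual sense.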


Please note that it is important to use parentheses to combine multiple operators in order to generate queries that correctly model boolean logic. As mentioned before, the BooleanQuery class supports one or more clauses, meaning that (X OR Y OR Z) will create a single BooleanQuery with three clauses — strictly speaking it is not equivalent to either ((X OR Y) OR Z) or (X OR (Y OR Z)) because those result in a BooleanQuery with two clauses, one of which is a nested BooleanQuery. While the scores of all three of those queries will typically be the same using Lucene’s default Similarity class, those queries are structurally different, and other usages (or other Similarity functions) may produce subtly different results.
 


 
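The article also says that prefix operators such as "+" and "-" are the best way to use the QueryString Query.
 
@medcl I suggest that the official ES documentation add a specific note about this pitfall in the QueryString Query, possibly with a link to the article above.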
 
 

dizhuang


666
