How does ES parse the precedence of AND and OR in query string syntax?

Elasticsearch | Author: 三斗室 | Posted: 2017-08-13 | Views: 18738

I first indexed a document:

➜ ~ curl http://localhost:9200/test1/test1/1/ -d '{"text":"a b d"}'
{"_index":"test1","_type":"test1","_id":"1","_version":1,"_shards":{"total":2,"successful":1,"failed":0},"created":true}

Then I searched with a query string:
➜ ~ curl http://localhost:9200/test1/test1/_search\?q\=a+AND+b+OR+c+AND+d
{"took":53,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":0,"max_score":null,"hits":[]}}

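Why is there no hit?

It seems to me that whether a AND b OR c AND d is interpreted as a AND (b OR c) AND d, as (a AND b) OR (c AND d), or as ((a AND b) OR c) AND d, the document should match.

Moreover, the official Lucene documentation at https://lucene.apache.org/core ... .html gives an example that explicitly says it is parsed as (a AND b) OR (c AND d):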

The Precedence Query Parser extends the Standard Query Parser and enables the boolean precedence. So, the query <a AND b OR c AND d> is parsed to <(+a +b) (+c +d)> instead of <+a +b +c +d>

Then I tried the _explain API:

{
  "_index" : "test1",
  "_type" : "test1",
  "_id" : "1",
  "matched" : false,
  "explanation" : {
    "value" : 0.0,
    "description" : "Failure to meet condition(s) of required/prohibited clause(s)",
    "details" : [
      {
        "value" : 0.25316024,
        "description" : "weight(_all:a in 0) [PerFieldSimilarity], result of:",
        "details" : [
          {
            "value" : 0.25316024,
            "description" : "score(doc=0,freq=1.0 = termFreq=1.0\n), product of:",
            "details" : [
              {
                "value" : 0.2876821,
                "description" : "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
                "details" : [
                  {
                    "value" : 1.0,
                    "description" : "docFreq",
                    "details" : [ ]
                  },
                  {
                    "value" : 1.0,
                    "description" : "docCount",
                    "details" : [ ]
                  }
                ]
              },
              {
                "value" : 0.88,
                "description" : "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
                "details" : [
                  {
                    "value" : 1.0,
                    "description" : "termFreq=1.0",
                    "details" : [ ]
                  },
                  {
                    "value" : 1.2,
                    "description" : "parameter k1",
                    "details" : [ ]
                  },
                  {
                    "value" : 0.75,
                    "description" : "parameter b",
                    "details" : [ ]
                  },
                  {
                    "value" : 3.0,
                    "description" : "avgFieldLength",
                    "details" : [ ]
                  },
                  {
                    "value" : 4.0,
                    "description" : "fieldLength",
                    "details" : [ ]
                  }
                ]
              }
            ]
          }
        ]
      },
      {
        "value" : 0.25316024,
        "description" : "weight(_all:b in 0) [PerFieldSimilarity], result of:",
        "details" : [
          {
            "value" : 0.25316024,
            "description" : "score(doc=0,freq=1.0 = termFreq=1.0\n), product of:",
            "details" : [
              {
                "value" : 0.2876821,
                "description" : "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
                "details" : [
                  {
                    "value" : 1.0,
                    "description" : "docFreq",
                    "details" : [ ]
                  },
                  {
                    "value" : 1.0,
                    "description" : "docCount",
                    "details" : [ ]
                  }
                ]
              },
              {
                "value" : 0.88,
                "description" : "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
                "details" : [
                  {
                    "value" : 1.0,
                    "description" : "termFreq=1.0",
                    "details" : [ ]
                  },
                  {
                    "value" : 1.2,
                    "description" : "parameter k1",
                    "details" : [ ]
                  },
                  {
                    "value" : 0.75,
                    "description" : "parameter b",
                    "details" : [ ]
                  },
                  {
                    "value" : 3.0,
                    "description" : "avgFieldLength",
                    "details" : [ ]
                  },
                  {
                    "value" : 4.0,
                    "description" : "fieldLength",
                    "details" : [ ]
                  }
                ]
              }
            ]
          }
        ]
      },
      {
        "value" : 0.0,
        "description" : "no match on required clause (_all:c)",
        "details" : [
          {
            "value" : 0.0,
            "description" : "no matching term",
            "details" : [ ]
          }
        ]
      },
      {
        "value" : 0.25316024,
        "description" : "weight(_all:d in 0) [PerFieldSimilarity], result of:",
        "details" : [
          {
            "value" : 0.25316024,
            "description" : "score(doc=0,freq=1.0 = termFreq=1.0\n), product of:",
            "details" : [
              {
                "value" : 0.2876821,
                "description" : "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
                "details" : [
                  {
                    "value" : 1.0,
                    "description" : "docFreq",
                    "details" : [ ]
                  },
                  {
                    "value" : 1.0,
                    "description" : "docCount",
                    "details" : [ ]
                  }
                ]
              },
              {
                "value" : 0.88,
                "description" : "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
                "details" : [
                  {
                    "value" : 1.0,
                    "description" : "termFreq=1.0",
                    "details" : [ ]
                  },
                  {
                    "value" : 1.2,
                    "description" : "parameter k1",
                    "details" : [ ]
                  },
                  {
                    "value" : 0.75,
                    "description" : "parameter b",
                    "details" : [ ]
                  },
                  {
                    "value" : 3.0,
                    "description" : "avgFieldLength",
                    "details" : [ ]
                  },
                  {
                    "value" : 4.0,
                    "description" : "fieldLength",
                    "details" : [ ]
                  }
                ]
              }
            ]
          }
        ]
      }
    ]
  }
}

It says "no match on required clause (_all:c)", but I cannot see where the OR in the middle went.

So how exactly does ES parse multiple ANDs and ORs in query string syntax?

8 replies

kennywu76 - Wood

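Upvoted by: rockybean, novia, liujia, medcl, strglee

First, I tested it, and the problem you describe does exist.
Second, I skimmed the ES source code. A QueryString Query is built by QueryStringQueryBuilder, which calls the parse method of MapperQueryParser, a class that ES wrote itself, to do the parsing. That class ultimately extends org.apache.lucene.queryparser.classic.QueryParser, which means it does not use the org.apache.lucene.queryparser.flexible.precedence.PrecedenceQueryParser class quoted in your question.

So my guess is that the ES QueryString Query resolves the precedence of operators such as AND and OR differently from PrecedenceQueryParser. Lucene's description of PrecedenceQueryParser also implies that the standard QueryParser parses <a AND b OR c AND d> as <+a +b +c +d>: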

The Precedence Query Parser extends the Standard Query Parser and enables the boolean precedence. So, the query <a AND b OR c AND d> is parsed to <(+a +b) (+c +d)> instead of <+a +b +c +d>.

Using your example, I tested with profile enabled, and ES parsed "a AND b OR c AND d" into the following form:

"type": "BooleanQuery",
"description": """+(+text:a +text:b +text:c +text:d) #ConstantScore(MatchNoDocsQuery("empty BooleanQuery"))""",

In other words, a, b, c, and d are all must clauses.
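To make the difference concrete, here is a minimal sketch in plain Python (no Elasticsearch involved) that checks the document "a b d" against the two readings. The helper names `matches_all_required` and `matches_any_group` are made up for this illustration, and terms are modeled as simple set membership rather than real analysis or scoring.

```python
def matches_all_required(doc_terms, required):
    """Flat reading <+a +b +c +d>: every term is a must clause."""
    return all(term in doc_terms for term in required)


def matches_any_group(doc_terms, groups):
    """Grouped reading <(+a +b) (+c +d)>: at least one group must fully match."""
    return any(all(term in doc_terms for term in group) for group in groups)


# The document from the question: {"text": "a b d"}
doc = set("a b d".split())

flat_result = matches_all_required(doc, ["a", "b", "c", "d"])
grouped_result = matches_any_group(doc, [["a", "b"], ["c", "d"]])

print(flat_result)     # False: c is required but missing
print(grouped_result)  # True: the (a AND b) group matches
```

The flat reading fails only because of the missing c, which agrees with the "no match on required clause (_all:c)" line in the _explain output.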

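The documentation at https://www.elastic.co/guide/e ... ators specifically explains that the boolean operators are more complicated and less intuitive than they look. Based on the test results, my inference is that when no grouping operators (parentheses) are used, ES (Lucene's standard parser) simply applies AND, OR, and so on to the operands on either side, and uses precedence only to decide which marker (the equivalent of +, -, and so on) a given operand receives. It does not consider how the boolean expression should be grouped by precedence, the way PrecedenceQueryParser does.

So when using AND and OR, it is best to group the boolean expression explicitly with parentheses. Better still, combine the "+" and "-" operators with parenthesized groups: + and - only affect the operand on their right and correspond one-to-one to the must, should, and must_not clauses of a bool query, which makes the query clearer to write.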
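As a rough sketch of that advice, the snippet below builds request bodies as Python dicts: one query_string with explicit parentheses, one with "+" prefixes inside groups, and a bool query that I believe expresses the same intent. The field name `text` comes from the original post; the helper names `query_string_body` and `grouped_bool_body` are hypothetical, and nothing here is sent to a cluster.

```python
import json


def query_string_body(query, field="text"):
    """Build a query_string search body for the given query text."""
    return {"query": {"query_string": {"default_field": field, "query": query}}}


def grouped_bool_body(field="text"):
    """Bool query intended to mean (a AND b) OR (c AND d)."""
    def match(term):
        return {"match": {field: term}}

    return {
        "query": {
            "bool": {
                # With only should clauses, at least one of them has to match.
                "should": [
                    {"bool": {"must": [match("a"), match("b")]}},
                    {"bool": {"must": [match("c"), match("d")]}},
                ]
            }
        }
    }


with_parentheses = query_string_body("(a AND b) OR (c AND d)")
with_prefixes = query_string_body("(+a +b) (+c +d)")

print(json.dumps(with_parentheses))
print(json.dumps(with_prefixes))
print(json.dumps(grouped_bool_body()))
```

Any of these bodies could be posted to test1/_search with profile enabled to check how the query is actually rewritten on a given ES version.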

liujia

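This differs from my earlier understanding as well. I am not sure whether it counts as a bug or is intended. Looking at it with profile makes it clearer: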

```
POST test1/_search?
{
  "profile": true,
  "query": {
    "query_string": {
      "default_field": "text",
      "query": "a OR c AND d"
    }
  }
}
```

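"a AND b OR c AND d" and "a AND b AND c AND d" produce the same result.

I just saw the Lucene documentation you posted, and the behavior is clearly different from it. So I suppose this should count as a bug.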

medcl - Off to fight tigers tonight.

+1 to Wood's reply.
The QueryString documentation actually does mention this:

The familiar operators AND, OR and NOT (also written &&, || and !) are also supported. However, the effects of these operators can be more complicated than is obvious at first glance. NOT takes precedence over AND, which takes precedence over OR. While the + and - only affect the term to the right of the operator, AND and OR can affect the terms to the left and right.

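(In short:)
Precedence: NOT > AND > OR
Lucene syntax: + and - only affect the expression to the right of the operator
Lucene syntax: AND and OR affect the expressions on both sides of the operator

Wood's answer above already covers all of this.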

kennywu76 - Wood

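I just found, via Google, an article that explains this problem very clearly: https://lucidworks.com/2011/12 ... -not/

In particular, it stresses that parentheses must be used to combine operators in order to get boolean logic in the usual sense.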

Please note that it is important to use parentheses to combine multiple operators in order in order to generate queries that correctly model boolean logic. As mentioned before, the BooleanQuery class supports one or more clauses, meaning that (X OR Y OR Z) will create a single BooleanQuery with three clauses — strictly speaking it is not equivalent to either ((X OR Y) OR Z) or (X OR (Y OR Z)) because those result in a BooleanQuery with two clauses, one of which is a nested BooleanQuery. While the scores of all three of those queries will typically be the same using Lucene’s default Similarity class, those queries are structurally different, and other usages (or other Similarity functions) may produce subtly different results.

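The article also says that the "+" and "-" prefix operators are the best way to use QueryString Query.

@medcl I suggest the official ES documentation add a specific note about this pitfall in QueryString Query, and perhaps link to the article above.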

dizhuang

666

ferraborghini

I took a look at the Lucene code.

protected void addClause(List<BooleanClause> clauses, int conj, int mods, Query q) {
  boolean required, prohibited;

  // If this term is introduced by AND, make the preceding term required,
  // unless it's already prohibited
  if (clauses.size() > 0 && conj == CONJ_AND) {
    BooleanClause c = clauses.get(clauses.size()-1);
    if (!c.isProhibited())
      clauses.set(clauses.size() - 1, new BooleanClause(c.getQuery(), Occur.MUST));
  }

  if (clauses.size() > 0 && operator == AND_OPERATOR && conj == CONJ_OR) {
    // If this term is introduced by OR, make the preceding term optional,
    // unless it's prohibited (that means we leave -a OR b but +a OR b-->a OR b)
    // notice if the input is a OR b, first term is parsed as required; without
    // this modification a OR b would parsed as +a OR b
    BooleanClause c = clauses.get(clauses.size()-1);
    if (!c.isProhibited())
      clauses.set(clauses.size() - 1, new BooleanClause(c.getQuery(), Occur.SHOULD));
  }
  // (the remainder of the method is not quoted in this post)
}

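For the expression a AND b OR c AND d, adding the third clause in addClause is actually fine. But when the fourth clause is added, the previous clause (c) is taken out, forcibly converted to MUST, and put back. That is how c ends up required and the query fails to match. So it is not really a precedence issue either.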
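The walk-through above can be sketched as a small Python model of the quoted addClause logic. This is a simplification, not Lucene's code: it only handles bare terms joined by AND, OR, and NOT, the function names `add_clause` and `parse_flat` are made up, and the rule for choosing the new clause's occurrence is my assumption about the part of the method that the post does not quote.

```python
MUST, SHOULD, MUST_NOT = "+", "", "-"


def add_clause(clauses, conj, term, prohibited=False, default_op="OR"):
    """Append a (term, occur) pair, possibly rewriting the previous clause."""
    # Introduced by AND: make the preceding clause required unless prohibited.
    if clauses and conj == "AND" and clauses[-1][1] != MUST_NOT:
        clauses[-1] = (clauses[-1][0], MUST)
    # Introduced by OR under a default AND operator: make the preceding
    # clause optional unless prohibited.
    if clauses and default_op == "AND" and conj == "OR" and clauses[-1][1] != MUST_NOT:
        clauses[-1] = (clauses[-1][0], SHOULD)
    # Assumption (not in the quoted snippet): occurrence of the new clause.
    if prohibited:
        occur = MUST_NOT
    elif default_op == "OR":
        occur = MUST if conj == "AND" else SHOULD
    else:
        occur = SHOULD if conj == "OR" else MUST
    clauses.append((term, occur))


def parse_flat(query, default_op="OR"):
    """Parse a parenthesis-free query, one clause at a time, left to right."""
    clauses, conj, prohibited = [], None, False
    for token in query.split():
        if token in ("AND", "OR"):
            conj = token
        elif token == "NOT":
            prohibited = True
        else:
            add_clause(clauses, conj, token, prohibited, default_op)
            conj, prohibited = None, False
    return " ".join(occur + term for term, occur in clauses)


print(parse_flat("a AND b OR c AND d"))  # +a +b +c +d
print(parse_flat("a OR c AND d"))        # a +c +d
```

In this model c is first added as optional and only becomes required when "AND d" arrives, which matches the explanation above and the +a +b +c +d form seen in the profile output.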

Has anyone found anything new regarding precedence?

liujia

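medcl already said it:

By default all clauses are at the same level; without parentheses there is naturally no boolean logic.
The official site in fact already provides the relevant material and advice, and describes usage that can lead to mistakes:
https://www.elastic.co/guide/e ... ators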

chamcyl

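What NOT/AND/OR do is put a + or - on the two expressions to their left and right.

Precedence only decides whether + or - is used when an expression has NOT/AND/OR on both its left and its right.

When you need to express logical operations as we normally understand them, it is best to use parentheses.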
