Elasticsearch performance problems

Elasticsearch • laoyang360 replied to the question • 3 followers • 2 replies • 3814 views • 2019-04-18 20:49 • from related topics

Performance problem with range queries on strings

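Elasticsearch • steven123 replied to the question • 4 followers • 3 replies • 3958 views • 2018-09-27 16:03 • from related topics

Mmap fs may make access to large indices slow

Elasticsearch • kennywu76 published an article • 2 comments • 7560 views • 2018-08-13 17:52 • from related topics

More than a year ago I wrote an article on Elasticsearch 5 pitfalls (Elasticsearch 5 入坑指南). In it I mentioned that after one of our production ES clusters was upgraded from 2.4 to 5.0, search performance on the cold data nodes got worse: when searching large indices, io read would stay high for a long time, putting heavy load on the system, sometimes to the point where it stopped responding. Further analysis with the Linux sar -B command showed a large amount of data being paged in to memory. By trial and error we traced the problem to the mmap fs store used since 5.0, and switching to nio fs resolved it, but the underlying root cause was never found. Recently I had time to revisit this and found on Github an issue, submitted and analyzed by someone more familiar with OS internals, that describes a similar problem: Avoid file cache trashing on Linux with mmapfs by using madvise. After reading it closely, I think it captures the essence of the problem. The following is a summary based on it: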

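Compared with nio fs, mmap fs avoids the system calls for disk io and needs no io buffering inside the JVM, which also reduces GC pressure. So in general mmapfs should perform better. This is why Lucene recommends mmap fs, and why ES has used it as the default store setting since 5.0.
However, at the kernel level the mmap system call has a default read-ahead size of 2MB. That is, once a large file is mapped, reading even 1k bytes from it makes mmap read 2MB of data ahead into the cache. This strategy assumes that file access is mostly sequential.
In the specific case of ES, if the indices on a data node are not very large and there is enough spare system cache, this is usually not a problem. In big data scenarios, though, typically ELK deployments with massive log volumes, searches and aggregations over large indices can produce a lot of random disk access. The mmap read-ahead strategy can then pull large amounts of useless data from disk into the system cache. When available cache is not plentiful, in some extreme cases hot data gets evicted from memory too often and read back in repeatedly, overwhelming disk IO.
Lucene has a native call, NativePosixUtil.madvise(buffer,NativePosixUtil.RANDOM), which can tell the kernel to disable read-ahead when reading mmapped files. The author of the issue hacked this call into the Lucene code and ran comparative search tests. The conclusion: in terms of disk io and cache consumption, niofs did better than mmapfs, and the patched mmapfs did better than niofs.
The author's tests only covered search. Other kinds of io, such as indexing and merges, were not tested in detail, so the trade-offs there are unknown.
The ES developers considered this an interesting finding worth deeper investigation. For the user reports of mmap fs performing worse than nio fs, their guess was that when reading large indices, the extra cost of read-ahead offsets the savings in system calls and similar overhead relative to niofs.
The ES developers mentioned that Lucene already has a store with similar functionality, NativeUnixDirectory (ES evidently does not support this store yet). Users comfortable with hands-on work should be able to write an ES plugin around it. They also mentioned that JAVA 10 provides O_DIRECT to streams / channels, and it seems they plan to wait for that before deciding how to handle this problem.
Note that this read-ahead happens at the mmap level and is a different thing from block device read-ahead. We tried disabling block device read-ahead with the Linux command blockdev --setra, and confirmed that it does not solve the problem.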
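
As a rough illustration of the madvise idea described above, the Python sketch below maps a file and asks the kernel to treat access as random before reading a few bytes. It is not the Lucene patch from the issue. The function name read_with_random_advice and the temporary demo file are made up for this example, and mmap.madvise is only available on Python 3.8+ on platforms that support madvise(2), so the hint is skipped elsewhere.

```python
import mmap
import os
import tempfile


def read_with_random_advice(path, offset, length):
    """Read `length` bytes at `offset` from a memory-mapped file,
    hinting the kernel that access will be random (no read-ahead needed)."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Only call madvise where both the method and the constant exist.
            if hasattr(mm, "madvise") and hasattr(mmap, "MADV_RANDOM"):
                mm.madvise(mmap.MADV_RANDOM)
            return mm[offset:offset + length]


# Demo on a throwaway file of about 4 MiB (a stand-in for a large index file).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(os.urandom(1024) + b"needle" + bytes(4 * 1024 * 1024))
    demo_path = tmp.name

chunk = read_with_random_advice(demo_path, 1024, 6)
os.remove(demo_path)
print(chunk)
```

Whether the hint reduces page-in traffic depends on the kernel and workload; the sketch only shows where such a hint is placed relative to the mapping and the read.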

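Conclusion: if an ES node stores massive amounts of index data and frequently runs searches and aggregations over large indices (e.g. 1TB+), using NIOFS is safer and avoids many odd performance problems.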
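
For readers who want to try the conclusion above, here is a minimal Python sketch of a create-index request body that pins the store type through the index.store.type setting. The helper name create_index_body and the index name in the comment are hypothetical; the store type is a static setting, so the assumption here is that it is applied when the index is created (or node-wide in elasticsearch.yml).

```python
import json


def create_index_body(store_type="niofs"):
    """Build a create-index body that sets the index store type."""
    return {"settings": {"index.store.type": store_type}}


body = create_index_body()
# Would be sent as: PUT /my-large-index   (hypothetical index name)
print(json.dumps(body, indent=2))
```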

Large gap between Elasticsearch took and measured latency, low qps

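Elasticsearch • cwn1986 replied to the question • 4 followers • 3 replies • 7833 views • 2018-06-04 15:43 • from related topics

Analysis of a performance problem caused by the Query Cache

Elasticsearch • kennywu76 published an article • 6 comments • 9295 views • 2017-09-29 19:18 • from related topics

[Wu Xiaogang, Ctrip]

Note: this article is an analysis and summary of question#2484 in the Elastic Chinese community.

Problem overview

A production cluster runs queries whose Query DSL is always the same, differing only in parameters. Statistics show that 98-99% of queries respond quickly, in just 4-6ms, but about 1% take 100-200ms. The cluster hardware is high-end and uses SSDs, available system memory far exceeds the total size of the index files, and the cluster has been running in production for a while, so the data should be fully warmed up.

Diagnosis and conclusion

As it happens, the person who raised the question is a developer at our own company, so we got in touch internally, which made a quick diagnosis much easier.

First I went through all the key cluster metrics in the company's monitoring system and found no performance bottleneck that could explain the slow queries. So my initial suspicion was that some queries were simply slow in themselves. Fortunately the company has an application tracing system and a logging system, so it was easy to obtain samples of slow queries sent by the application, including request bodies and latencies.

Below is a query recorded by the tracing system that took 150ms (sensitive information hidden, non-essential parts removed):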

POST /xxxindex/xxxdb/_search?routing=Mxxxxxxx
{
  "from": 0,
  "size": 100,
  "query": {
    "bool": {
      "filter": [
        {
          "bool": {
            "must": [
              {
                "bool": {
                  "must": [
                    {
                      "bool": {
                        "should": [
                          {
                            "match_phrase": {
                              "ord_orders_uid": {
                                "query": "Mxxxxxxx",
                                "slop": 0,
                                "boost": 1
                              }
                            }
                          }
                        ],
                        "disable_coord": false,
                        "adjust_pure_negative": true,
                        "boost": 1
                      }
                    },
                    {
                      "range": {
                        "ord_orders_orderdate": {
                          "from": "1405032032",
                          "to":   "1504014193",
                          "include_lower": true,
                          "include_upper": true,
                          "boost": 1
                        }
                      }
                    },
                    {
                      "term": {
                        "ord_orders_ispackageorder": {
                          "value": 0,
                          "boost": 1
                        }
                      }
                    },
                    {
                      "bool": {
                        "must_not": [
                          {
                            "exists": {
                              "field": "ord_hideorder_orderid",
                              "boost": 1
                            }
                          }
                        ],
                        "disable_coord": false,
                        "adjust_pure_negative": true,
                        "boost": 1
                      }
                    }
                  ],
                  "disable_coord": false,
                  "adjust_pure_negative": true,
                  "boost": 1
                }
              }
            ],
            "disable_coord": false,
            "adjust_pure_negative": true,
            "boost": 1
          }
        }
      ],
      "disable_coord": false,
      "adjust_pure_negative": true,
      "boost": 1
    }
  }
}

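After getting the query I ran it by hand: 0 hits, 1ms. I understood right away that it had hit the Query Cache, which is why it was so fast.

So I cleared the Query Cache with the clear api and ran the query several more times, with these findings:

The first two queries took about 38ms. With no cache, the inverted index has to be accessed, so this latency is as expected. The two runs took the same time because the index has 1 replica, and the two queries were routed to the primary and the replica shard respectively.
The next two queries took about 150ms. This is where a big question mark belongs.
After that, no matter how many more times the query ran, it always took 1ms, because it was hitting the cache again.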
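
The experiment above (clear the cache, then repeat the same query and watch the latency) can be sketched in Python as follows. Both helpers are hypothetical names for this example, and the query=true parameter reflects my understanding of the 5.x clear cache API rather than anything stated in the original post. run_query stands for whatever callable actually sends the search request.

```python
import time


def clear_query_cache_request(index):
    """Method, path and parameters for clearing only the query cache."""
    return "POST", f"/{index}/_cache/clear", {"query": "true"}


def time_repeated(run_query, repeats=6):
    """Call run_query() `repeats` times and return each latency in ms."""
    latencies = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


print(clear_query_cache_request("xxxindex"))
# A no-op stands in for the real search call in this demo.
print(time_repeated(lambda: None, repeats=3))
```

Measuring client-side like this includes network time, so the numbers are only comparable with each other, not with the took value reported by ES.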

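At this point it was roughly clear that the slow queries recorded by the tracing system were the two runs in step 2. What operation takes that long? From experience, I judged that the time mainly went into building the cache for the range filter, that is, generating the bitmap of the matching document list and storing it in the Query Cache.

The cluster runs 5.1.1, and I remembered that starting with some 5.x release, ES stopped caching term filters, on the grounds that term filters are fast enough that caching them often costs more than it gains. The official release notes confirmed that this change arrived in exactly 5.1.1, in #21566. That ruled out the term filters in the query above and focused attention on the only range filter in the query.

I ran this range filter on its own; it matched documents on the order of tens of millions. I asked the user why the range filter hits so many documents, and learned that the user mostly queries data from the current time back over the past year, in effect a now-1y TO now filter. This gave a preliminary conclusion: because the range filter matches so many documents, building its bitmap in the Query Cache takes a fairly long time, and that should be what adds the extra 100ms or more.

One question still needed an explanation: why is the share of these slow queries so high? On reflection it makes sense. The cluster's search concurrency is fairly high, around 300-400/s, and the time field has second precision. So at the start of any given second, the first 2 queries have no cache and may take around 38ms; the next 2 queries have to cache the range filter and take about 150-200ms; the remaining queries in that second all hit the cache and take a few ms each, until the next second starts and the cycle repeats. Because every second produces 2 slow queries that have to build the cache, against a few hundred queries per second, the resulting percentage is fairly high.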
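
The percentage argument above can be checked with a few lines of Python. This is only the arithmetic from the paragraph (2 cache-building queries per second against 300-400 queries per second); the function name is made up for the example.

```python
def slow_query_share(qps, cache_building_per_second=2):
    """Fraction of queries in each second that pay the cache-building cost."""
    return cache_building_per_second / qps


for qps in (300, 400):
    print(f"{qps} qps -> {slow_query_share(qps):.2%} of queries are slow")
```

That gives a share between roughly 0.5% and 0.7%, which is in line with the "about 1%" of slow queries reported in the problem overview.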

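So how can this be fixed? For heavy use of range queries of the form now-xxx TO now, the official documentation actually describes a speed-up technique: tune-for-search-speed.html#_search_rounded_dates. The idea is to round the upper and lower bounds of the query time to the whole minute or whole hour, so that the range filter can stay cached longer and the cache is not rebuilt so frequently.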

{
  "range": {
    "my_date": {
      "gte": "now-1y/h",
      "lte": "now/h"
    }
  }
}

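Writing the range filter in the original query in the form above and testing by hand confirmed that it works: the range filter stays valid for 1 hour, so within each hour the cache for the range filter only has to be rebuilt 2 times. With that, the problem was solved.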
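
To see why rounding helps, the following Python sketch counts how many distinct range bounds appear over one hour with second-precision bounds versus hour-rounded bounds. It is a simplified simulation, not ES code: a year is approximated as 365 days, the lower bound is floored to the hour and the upper bound is pushed to the last millisecond of the hour, which is my reading of how /h rounding behaves for gte and lte.

```python
from datetime import datetime, timedelta, timezone


def second_precision_bounds(now):
    """Epoch-second bounds for the last 365 days; they change every second."""
    return int((now - timedelta(days=365)).timestamp()), int(now.timestamp())


def hour_rounded_bounds(now):
    """Bounds rounded to the hour, so they stay identical within an hour."""
    floored = now.replace(minute=0, second=0, microsecond=0)
    lower = floored - timedelta(days=365)
    upper = floored + timedelta(hours=1) - timedelta(milliseconds=1)
    return lower, upper


start = datetime(2017, 9, 29, 19, 0, 0, tzinfo=timezone.utc)
moments = [start + timedelta(seconds=i) for i in range(3600)]

distinct_plain = len({second_precision_bounds(t) for t in moments})
distinct_rounded = len({hour_rounded_bounds(t) for t in moments})
print(distinct_plain, distinct_rounded)
```

Each distinct pair of bounds is effectively a different filter from the cache's point of view, so fewer distinct pairs means fewer cache rebuilds.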

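Summary:

More cache is not always better, because building and evicting cache entries adds overhead. Especially for filters with very large result sets, the cost of caching can be very high relative to the query itself.
Starting with ES 5.1.1 the term filter cache was removed, because term filters run very fast, and in most cases not caching them actually improves performance.
When Range filters of the form Now-xxxd To Now are used heavily, the rounded-date technique can extend how long the cache stays valid and reduce the performance cost of frequent cache rebuilds.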

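Kibana performance problem caused by ES 5.4+

Elasticsearch • kennywu76 published an article • 8 comments • 6425 views • 2017-09-11 18:22 • from related topics

[Wu Xiaogang, Ctrip]

Last week a user posted a Kibana read timeout problem in the community: question#2319. Over the weekend I found time to help investigate, and discovered that some newer ES versions, paired with Kibana, produce an unexpected slowness problem. Since the problem is fairly common, I am summarizing its root cause and fix here, in the hope that users on the affected versions can avoid the pitfall.

The symptoms are described in the question linked above. In short, on a cluster with fairly high-end hardware, an index receiving about 2 billion documents per day always timed out when its data was viewed through Kibana's Discover panel. Even with the time range narrowed to the last half hour the timeouts persisted, which was odd.

Over the weekend I got a test account from the user and logged in to look at the cluster state. Judging by the hardware and the cluster and index configuration, nothing looked particularly wrong. Yet clicking into the Discover panel indeed showed no data. Cluster monitoring showed that no other users were running queries, and cpu utilization and cluster load were both low. So the initial judgment was that the query itself was slow.

To diagnose slow queries, my usual approach is to copy the query behind the panel, run it by hand in the Kibana Dev Console, and add the "profile": true option to see how the query is parsed and executed. The query looks like this: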

{
  "profile": true,
  "query": {
    "bool": {
      "must": [
        {
          "query_string": {
            "analyze_wildcard": true,
            "query": "*"
          }
        },
        {
          "range": {
            "@timestamp": {
              "gte": "now-1h",
              "lte": "now",
              "format": "epoch_millis"
            }
          }
        }
      ]
    }
  }
}

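Because the user had typed nothing into the query box, Kibana set the default query string to *, then added a range query based on the selected time range. The profile output surprised me a little: the * in query_string was parsed into a very complex DisjunctionMaxQuery, and most of the query time was spent there.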

{
                    "type": "DisjunctionMaxQuery",
                    "description": "(ConstantScore(_field_names:remote_addr.keyword) | ConstantScore(_field_names:geoip.country_isocode) | ConstantScore(_field_names:geoip.country_name.keyword) | ConstantScore(_field_names:via) | ConstantScore(_field_names:domain.keyword) | ConstantScore(_field_names:request_method.keyword) | ConstantScore(_field_names:protocol) | ConstantScore(_field_names:xff.keyword) | ConstantScore(_field_names:host) | ConstantScore(_field_names:geoip.city_name.keyword) | ConstantScore(_field_names:client_ip) | ConstantScore(_field_names:host.keyword) | ConstantScore(_field_names:geoip.longitude) | ConstantScore(_field_names:geoip.subdivision_name.keyword) | ConstantScore(_field_names:geoip.country_code) | ConstantScore(_field_names:upstream_addr.keyword) | ConstantScore(_field_names:@version.keyword) | ConstantScore(_field_names:request_uri) | ConstantScore(_field_names:tags) | ConstantScore(_field_names:idc_tag) | ConstantScore(_field_names:size) | ConstantScore(_field_names:http_referer) | ConstantScore(_field_names:message.keyword) | ConstantScore(_field_names:domain) | ConstantScore(_field_names:geoip.latitude) | ConstantScore(_field_names:xff) | ConstantScore(_field_names:protocol.keyword) | ConstantScore(_field_names:geoip.country_code.keyword) | ConstantScore(_field_names:status) | ConstantScore(_field_names:upstream_addr) | ConstantScore(_field_names:http_referer.keyword) | ConstantScore(_field_names:tags.keyword) | ConstantScore(_field_names:client_ip.keyword) | ConstantScore(_field_names:request_method) | ConstantScore(_field_names:upstream_status) | ConstantScore(_field_names:request_time) | ConstantScore(_field_names:geoip.location) | ConstantScore(_field_names:@version) | ConstantScore(_field_names:geoip.country_name) | ConstantScore(_field_names:user_agent) | ConstantScore(_field_names:idc_tag.keyword) | ConstantScore(_field_names:remote_addr) | ConstantScore(_field_names:geoip.country_isocode.keyword) | 
ConstantScore(_field_names:geoip.city_name) | ConstantScore(_field_names:via.keyword) | ConstantScore(_field_names:message) | ConstantScore(_field_names:user_agent.keyword) | ConstantScore(_field_names:request_uri.keyword) | ConstantScore(_field_names:@timestamp) | ConstantScore(_field_names:upstream_response_time) | ConstantScore(_field_names:geoip.subdivision_name))",
                    "time": "5535.127008ms",
                    "time_in_nanos": 5535127008

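In other words, ES parsed a query_string query containing only * into a field:* query against every field it could find in the mapping, and then merged all the results. Understandably, this query is very expensive for an index that is large and has many fields. My understanding of * was that it should simply be rewritten into a match_all query, which has almost no overhead.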
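
To make the expected rewrite concrete, this Python sketch builds the Discover-style request body in two variants: one with the bare * query_string and one with match_all in its place. The function name discover_query and its parameters are invented for the illustration; the structure follows the request shown earlier.

```python
import json


def discover_query(time_from="now-1h", time_to="now", use_match_all=True):
    """Build a bool query with a text clause plus a time range clause."""
    if use_match_all:
        text_clause = {"match_all": {}}
    else:
        text_clause = {"query_string": {"analyze_wildcard": True, "query": "*"}}
    time_clause = {"range": {"@timestamp": {"gte": time_from, "lte": time_to}}}
    return {"profile": True, "query": {"bool": {"must": [text_clause, time_clause]}}}


print(json.dumps(discover_query(use_match_all=True), indent=2))
```

Running both variants with profiling on is one way to compare how each is parsed on a given cluster version.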

Why does this happen? I looked up the official ES documentation for the Query String Query, where default_field and all_fields play a part: elasticsearch/reference/5.5/query-dsl-query-string-query.html

default_field

The default field for query terms if no prefix field is specified. Defaults to the index.query.default_field index settings, which in turn defaults to _all.

all_fields

Perform the query on all fields detected in the mapping that can be queried. Will be used by default when the _all field is disabled and no default_field is specified (either in the index settings or in the request body) and no fields are specified.

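According to this, a query can carry a default_field option, whose default value is the index-level setting index.query.default_field; if that setting is not set, it defaults to _all. But users indexing logs generally disable the _all field to save disk space and speed up indexing. What is default_field then? The answer is all_fields: ES turns the query into a query against all fields.

To verify that this was the problem, I added a default_field setting to the index, picking a field at random. Sure enough the problem went away, and the Discover panel rendered almost 10 times faster.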
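
A minimal Python sketch of the two places the quoted documentation says default_field can come from: the request body and the index settings. The helper names are hypothetical, and message is used only because it appears among the fields in the profile output; the original post does not say which field was picked. Whether index.query.default_field can be changed on a live index may depend on the ES version, so treat the settings fragment as something to verify before use.

```python
def query_string_with_default_field(query="*", default_field="message"):
    """query_string clause that names its default field in the request body."""
    return {
        "query_string": {
            "query": query,
            "default_field": default_field,
            "analyze_wildcard": True,
        }
    }


def default_field_setting(default_field="message"):
    """Index settings fragment for index.query.default_field."""
    return {"index.query.default_field": default_field}


print(query_string_with_default_field())
print(default_field_setting())
```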

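But on reflection, this only works around the problem. The real question is why * is not rewritten into match_all.

Then I recalled that our own production cluster did not seem to have this problem, so I tested on it, and * was indeed parsed into match_all as expected. Comparing ES versions, our working cluster was on 5.3.2 and the user's cluster was on 5.5.0.

Next I wanted to find what had changed at the source level in how ES parses query strings between these versions. After some digging I found the following change history:

As it shows, in pull/23433, to fix an ambiguity in parsing foo:*, a Query string query with an empty field, such as a bare *, is no longer parsed into match_all but expanded into a DisjunctionMaxQuery over all fields. As a result, Kibana's default * causes a very serious performance problem.

This problem affects two minor versions of ES/Kibana, 5.4 and 5.5.

Following the links in that issue, I found the related Kibana discussion, issues#12097, and the corresponding fix, pull/13047; in fixed versions the query sent by default is match all.

The fixed versions are 5.5.2 and 5.6.0, so ELK users on any version from 5.4.0 through 5.5.1 should definitely plan an upgrade!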
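
A quick way to check a version string against the affected range stated above (5.4.0 through 5.5.1) is a tuple comparison, sketched here in Python. The helper names are made up, and the parser assumes plain MAJOR.MINOR.PATCH strings without suffixes.

```python
def parse_version(text):
    """Turn '5.5.0' into (5, 5, 0) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def is_affected(version):
    """True if the version lies in the affected range 5.4.0 to 5.5.1."""
    return (5, 4, 0) <= parse_version(version) <= (5, 5, 1)


for v in ("5.3.2", "5.4.0", "5.5.0", "5.5.2", "5.6.0"):
    print(v, is_affected(v))
```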

