ES queries (containing a large number of agg operations) frequently exceed the breaker.total limit

Elasticsearch | Author: talon | Posted 2019-07-15 | Views: 2837

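The ES cluster has 10 nodes. Each node is allocated 30g of memory for ES, and each node has 256G of total memory. After the cluster starts, Kibana shows JVM memory usage on each ES node averaging about 6% to 7%.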

2019-07-15_105907.png

 
After the application starts, ES JVM memory usage climbs steadily to about 76%, falls back to about 20%, and then repeats the same cycle. Some nodes, however, report exceptions:
2019-07-15_110658.png



2019-07-15_110750.png


  
The index being queried: 1 replica, 5 shards
 

2019-07-15_111256.png

 
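Because of business requirements, the query contains more than 20 aggs, and some of them have multi-level nested aggs. According to Kibana, queryCache, fielddataCache, and requestCache each use little memory, all under 200M.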
 
The ES GC log:

2019-07-15_142221.png

 
Current breaker settings:
parent: 70%
fielddata: 40%
request: 60%
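
To put these percentages in absolute terms, here is a rough sketch of what they would mean for a 30g heap. It assumes the three values correspond to `indices.breaker.total.limit`, `indices.breaker.fielddata.limit`, and `indices.breaker.request.limit`, and that the JVM reports the full 30 GiB as its maximum heap (in practice it usually reports slightly less). The helper name `breaker_limits` is made up for illustration.

```python
GIB = 1024 ** 3

def breaker_limits(heap_bytes, percents):
    """Convert breaker percentages into byte limits for a given heap size."""
    return {name: heap_bytes * pct // 100 for name, pct in percents.items()}

# Assumption: the usable heap is exactly 30 GiB.
heap = 30 * GIB
limits = breaker_limits(heap, {"parent": 70, "fielddata": 40, "request": 60})

for name, limit in limits.items():
    print(f"{name}: {limit / GIB:.1f} GiB")
```

Under these assumptions the parent breaker trips at roughly 21 GiB of tracked memory, so a single request that builds a very large number of buckets can get there even while the caches stay under 200M.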
 
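Adjustments and performance investigation done so far:
0. Set bootstrap.memory_lock: true
1. In the terms aggs, tried both the map execution hint and the default global_ordinals; for this application the two make little difference.
2. Took a JVM heap dump from one ES node, about 23G in size, and analyzed it with Eclipse MAT. MAT raised an error while generating its three reports (according to the MAT website this is a known bug that is still unresolved). As a result, the reports show suspected memory leaks adding up to only about 3G, which makes it hard to tell which objects actually occupy most of the heap. Within that roughly 3G, Netty-related objects take the largest share, so after some reading I set the following parameters on every node: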
-Dio.netty.allocator.type=unpooled
-Dio.netty.recycler.maxCapacityPerThread=0
-Dio.netty.noUnsafe=true
-Dio.netty.noKeySetOptimization=true
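After setting these, the same error is still reported.
The dump is too large, so exporting it to a local machine and analyzing it directly with the plugin is not realistic either.
3. Used jmap -histo:live to check per-object memory usage at runtime. The largest consumer is a byte array, followed by DirectByteBuffer.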
 
I really cannot tell where the problem is. Does anyone have a good method or idea to suggest?
 
 
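Attached is the DSL involved (this is one of several, but it is one of the slower ones; running it over one day of data sometimes takes about 1 minute):
The DSL is long and I wanted to upload it as an attachment, but attachments do not support text files...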
GET md_detail/_search
{
"size": 0,
"query": {
"bool": {
"filter": [{
"term": {
"d_type": {
"value": "d"
}
}
}, {
"term": {
"app_id": {
"value": "1"
}
}
}, {
"range": {
"part": {
"gte": 0,
"lte": 99999
}
}
}, {
"range": {
"date_from": {
"lt": "2019-03-08",
"format": "yyyy-MM-dd"
}
}
}]
}
},
"aggregations": {
"person_term": {
"terms": {
"field": "person_id",
"size": 2147483647,
"execution_hint": "map",
"order": [{
"_count": "desc"
}, {
"_key": "asc"
}],
"collect_mode": "breadth_first"
},
"aggregations": {
"p_id": {
"nested": {
"path": "ps"
},
"aggregations": {
"nested_p_id": {
"terms": {
"field": "ps.c_id",
"size": 2147483647,
"execution_hint": "map",
"order": [{
"_count": "desc"
}, {
"_key": "asc"
}],
"collect_mode": "breadth_first"
},
"aggregations": {
"reversed_page": {
"reverse_nested": {},
"aggregations": {
"from_last_prd": {
"max": {
"field": "date_from",
"format": "yyyy-MM-dd"
}
},
"last_15_date": {
"filter": {
"range": {
"date_from": {
"from": "2019-02-22",
"to": "2019-03-07",
"format": "yyyy-MM-dd"
}
}
},
"aggregations": {
"date_from": {
"terms": {
"field": "d_desc",
"size": 14,
"execution_hint": "map",
"order": [{
"_count": "desc"
}, {
"_key": "asc"
}],
"collect_mode": "breadth_first"
}
}
}
},
"date_30": {
"filter": {
"range": {
"date_from": {
"gte": "2019-02-07",
"lte": "2019-03-07",
"format": "yyyy-MM-dd"
}
}
},
"aggregations": {
"day_count": {
"terms": {
"field": "date_from",
"format": "yyyy-MM-dd",
"size": 29,
"execution_hint": "map",
"order": [{
"_count": "desc"
}, {
"_key": "asc"
}],
"collect_mode": "breadth_first"
}
}
}
}
}
}
}
},
"nested_p_c": {
"nested": {
"path": "ps.cs"
},
"aggregations": {
"term_c_id": {
"terms": {
"field": "ps.cs.c_id",
"size": 2147483647,
"execution_hint": "map",
"order": [{
"_count": "desc"
}, {
"_key": "asc"
}],
"collect_mode": "breadth_first"
},
"aggregations": {
"revsered_c_id": {
"reverse_nested": {},
"aggregations": {
"from_last_prd": {
"max": {
"field": "date_from",
"format": "yyyy-MM-dd"
}
},
"last_15_date": {
"filter": {
"range": {
"date_from": {
"gte": "2019-02-22",
"lte": "2019-03-07",
"format": "yyyy-MM-dd"
}
}
},
"aggregations": {
"date_from": {
"terms": {
"field": "d_desc",
"size": 14,
"execution_hint": "map",
"order": [{
"_count": "desc"
}, {
"_key": "asc"
}],
"collect_mode": "breadth_first"
}
}
}
},
"date_30": {
"filter": {
"range": {
"date_from": {
"gte": "2019-02-07",
"lte": "2019-03-07",
"format": "yyyy-MM-dd"
}
}
},
"aggregations": {
"day_count": {
"terms": {
"field": "date_from",
"format": "yyyy-MM-dd",
"size": 29,
"execution_hint": "map",
"order": [{
"_count": "desc"
}, {
"_key": "asc"
}],
"collect_mode": "breadth_first"
}
}
}
}
}
}
}
}
}
}
}
},
"f_id": {
"nested": {
"path": "fs"
},
"aggregations": {
"nested_f_id": {
"terms": {
"field": "fs.f_id",
"size": 2147483647,
"execution_hint": "map",
"order": [{
"_count": "desc"
}, {
"_key": "asc"
}],
"collect_mode": "breadth_first"
},
"aggregations": {
"reversed_f": {
"reverse_nested": {},
"aggregations": {
"from_last_prd": {
"max": {
"field": "date_from",
"format": "yyyy-MM-dd"
}
},
"last_15_date": {
"filter": {
"range": {
"date_from": {
"gte": "2019-02-22",
"lte": "2019-03-07",
"format": "yyyy-MM-dd"
}
}
},
"aggregations": {
"date_from": {
"terms": {
"field": "d_desc",
"size": 14,
"execution_hint": "map",
"order": [{
"_count": "desc"
}, {
"_key": "asc"
}],
"collect_mode": "breadth_first"
}
}
}
},
"date_30": {
"filter": {
"range": {
"date_from": {
"gte": "2019-02-07",
"lte": "2019-03-07",
"format": "yyyy-MM-dd"
}
}
},
"aggregations": {
"day_count": {
"terms": {
"field": "date_from",
"format": "yyyy-MM-dd",
"size": 29,
"execution_hint": "map",
"order": [{
"_count": "desc"
}, {
"_key": "asc"
}],
"collect_mode": "breadth_first"
}
}
}
}
}
}
}
}
}
},
"from_last_prd": {
"max": {
"field": "date_from",
"format": "yyyy-MM-dd"
}
},
"last_15_date": {
"filter": {
"range": {
"date_from": {
"gte": "2019-02-22",
"lte": "2019-03-07",
"format": "yyyy-MM-dd"
}
}
},
"aggregations": {
"date_from": {
"terms": {
"field": "d_desc",
"size": 14,
"execution_hint": "map",
"order": [{
"_count": "desc"
}, {
"_key": "asc"
}],
"collect_mode": "breadth_first"
}
}
}
},
"max_install_time": {
"max": {
"field": "d_install_time",
"format": "yyyy-MM-dd"
}
},
"date_30": {
"filter": {
"range": {
"date_from": {
"gte": "2019-02-07",
"lte": "2019-03-07",
"format": "yyyy-MM-dd"
}
}
},
"aggregations": {
"day_count": {
"terms": {
"field": "date_from",
"format": "yyyy-MM-dd",
"size": 29,
"execution_hint": "map",
"order": [{
"_count": "desc"
}, {
"_key": "asc"
}],
"collect_mode": "breadth_first"
}
}
}
},
"max_part_no": {
"max": {
"field": "part"
}
},
"earliest_date_from": {
"min": {
"field": "date_from",
"format": "yyyy-MM-dd"
}
},
"top_is_loyal": {
"top_hits": {
"from": 0,
"size": 1,
"_source": {
"includes": ["is_loyal"],
"excludes": []
},
"sort": [{
"date_from": {
"order": "desc"
}
}]
}
}
}
}
}
}
Ombres

"Because of business requirements, the query contains more than 20 aggs, and some of them have multi-level nested aggs."
 
The root cause is probably the sentence of yours quoted above: too many nested buckets can lead to this kind of problem.
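
As a back-of-the-envelope illustration of why nested buckets add up, the worst case for a chain of nested terms aggs is the product of the bucket counts at each level. The sketch below uses made-up cardinalities (100,000 distinct person_id values and 50 distinct ps.c_id values are assumptions, not figures from the question); only the 29 comes from the day_count agg in the DSL.

```python
def worst_case_buckets(cardinalities):
    """Upper bound on leaf buckets for a chain of nested terms aggs."""
    total = 1
    for count in cardinalities:
        total *= count
    return total

# Hypothetical cardinalities: person_id, ps.c_id, then the 29 day_count buckets.
chain = [100_000, 50, 29]
print(worst_case_buckets(chain))
```

Real data is usually far sparser than this upper bound. Still, as far as I understand, breadth_first only saves memory when the parent level is pruned to a small top N, and with "size": 2147483647 every parent bucket is kept.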
 
50G of data is not much. Each node has 256G of memory and the JVM gets 30G of it, so you could raise the heap size somewhat.
 
-----------------------------------------
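Besides the heap size, there are a few other things worth checking:
1. Run GET _stats/ before and after the search and compare the index state, for example fielddata.
2. Check whether the index mapping is reasonable and whether the fields used in aggs are configured sensibly, for example whether an agg runs on an analyzed text field.
3. Optimize the DSL.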
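
As one possible direction for point 3, the outer terms agg on person_id could be split into partitions, so that each request only builds a fraction of the buckets. This is a minimal sketch, assuming the `include` option with `partition` / `num_partitions` of the terms aggregation is available in the ES version in use. The helper name `partitioned_bodies`, the partition count, the per-partition `size`, and the single sub-agg are illustrative choices, not part of the original query.

```python
import json

def partitioned_bodies(num_partitions, sub_aggs, size=10000):
    """Build one search body per person_id partition.

    size is a hypothetical bound; it should be at least the number of
    distinct person_id values expected in a single partition.
    """
    bodies = []
    for partition in range(num_partitions):
        bodies.append({
            "size": 0,
            "query": {"bool": {"filter": [
                {"term": {"d_type": {"value": "d"}}},
                {"term": {"app_id": {"value": "1"}}},
            ]}},
            "aggregations": {
                "person_term": {
                    "terms": {
                        "field": "person_id",
                        "include": {
                            "partition": partition,
                            "num_partitions": num_partitions,
                        },
                        "size": size,
                    },
                    "aggregations": sub_aggs,
                }
            },
        })
    return bodies

# Only one of the original sub-aggs is shown to keep the sketch short.
sub_aggs = {"from_last_prd": {"max": {"field": "date_from", "format": "yyyy-MM-dd"}}}
bodies = partitioned_bodies(4, sub_aggs)
print(json.dumps(bodies[0], indent=2))
```

Each body would be sent to md_detail/_search in turn and the results merged on the client, trading one very large request for several smaller ones.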

 
 
 

laoyang360 - Author of 《一本书讲透Elasticsearch》, Elastic Certified Engineer. [死磕Elasticsearch] Knowledge Planet: http://t.cn/RmwM3N9; WeChat official account: 铭毅天下; blog: https://elastic.blog.csdn.net

The complex DSL is the crux here. I suggest posting it so everyone can discuss it.
Please also describe the data storage scenario.
