Memory problem caused by aggregations

Elasticsearch | Author: novia | Posted 2017-07-18 | Views: 3266

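Two days ago I cleaned up expired data in the cluster and deleted roughly 1/3 of the data (deletes only; no segment merge has been run yet). Today, running aggs returns the following error: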
 
{
"took": 74,
"timed_out": false,
"_shards": {
"total": 12,
"successful": 11,
"failed": 1,
"failures": [
{
"shard": 3,
"index": "business-log-v1",
"node": "svHkYmFOQkeANqyvXNUGvQ",
"reason": {
"type": "circuit_breaking_exception",
"reason": "[request] Data too large, data for [<reused_arrays>] would be larger than limit of [5569511424/5.1gb]",
"bytes_wanted": 5569656320,
"bytes_limit": 5569511424
}
}
]
},
"hits": {
"total": 0,
"max_score": null,
"hits": [

]
},
"aggregations": {
"decision": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [

]
}
}
}
At first I assumed the data being aggregated was too large, but I tested against an index with very little data and got the same error.
 
I suspect some memory is being held and never released. Has anyone run into this? Thanks!
Cluster info is as follows:
 
{
"timestamp": 1500365084544,
"cluster_name": "hl_es",
"status": "green",
"indices": {
"count": 20,
"shards": {
"total": 834,
"primaries": 417,
"replication": 1,
"index": {
"shards": {
"min": 2,
"max": 48,
"avg": 41.7
},
"primaries": {
"min": 1,
"max": 24,
"avg": 20.85
},
"replication": {
"min": 1,
"max": 1,
"avg": 1
}
}
},
"docs": {
"count": 943666181,
"deleted": 131650241
},
"store": {
"size_in_bytes": 6051156505786,
"throttle_time_in_millis": 0
},
"fielddata": {
"memory_size_in_bytes": 2370079544,
"evictions": 174252
},
"query_cache": {
"memory_size_in_bytes": 15304728,
"total_count": 184931521718,
"hit_count": 4993960685,
"miss_count": 179937561033,
"cache_size": 4600,
"cache_count": 141518775,
"evictions": 141514175
},
"completion": {
"size_in_bytes": 0
},
"segments": {
"count": 17287,
"memory_in_bytes": 4540006104,
"terms_memory_in_bytes": 2566102272,
"stored_fields_memory_in_bytes": 1813961248,
"term_vectors_memory_in_bytes": 0,
"norms_memory_in_bytes": 7374912,
"doc_values_memory_in_bytes": 152567672,
"index_writer_memory_in_bytes": 423376439,
"index_writer_max_memory_in_bytes": 16768949826,
"version_map_memory_in_bytes": 1521793,
"fixed_bit_set_memory_in_bytes": 0
},
"percolate": {
"total": 0,
"time_in_millis": 0,
"current": 0,
"memory_size_in_bytes": -1,
"memory_size": "-1b",
"queries": 0
}
},
"nodes": {
"count": {
"total": 15,
"master_only": 3,
"data_only": 12,
"master_data": 0,
"client": 0
},
"versions": [
"2.1.1"
],
"os": {
"available_processors": 54,
"allocated_processors": 54,
"mem": {
"total_in_bytes": 0
},
"names": [
{
"name": "Linux",
"count": 1
}
]
},
"process": {
"cpu": {
"percent": 110
},
"open_file_descriptors": {
"min": 477,
"max": 2592,
"avg": 2097
}
},
"jvm": {
"max_uptime_in_millis": 5954479327,
"versions": [
{
"version": "1.8.0_111",
"vm_name": "Java HotSpot(TM) 64-Bit Server VM",
"vm_version": "25.111-b14",
"vm_vendor": "Oracle Corporation",
"count": 15
}
],
"mem": {
"heap_used_in_bytes": 76959532864,
"heap_max_in_bytes": 176696721408
},
"threads": 957
},
"fs": {
"total_in_bytes": 13111924850688,
"free_in_bytes": 6990873366528,
"available_in_bytes": 6324724187136,
"spins": "true"
},
"plugins": [
{
"name": "head",
"version": "master",
"description": "head - A web front end for an elastic search cluster",
"url": "/_plugin/head/",
"jvm": false,
"site": true
},
{
"name": "statistics2",
"version": "2.0.1",
"description": "statistics2",
"jvm": true,
"classname": "com.hylanda.statistics2.plugin.StatisticsPlugin",
"isolated": true,
"site": false
},
{
"name": "delete-by-query",
"version": "2.1.1",
"description": "The Delete By Query plugin allows to delete documents in Elasticsearch with a single query.",
"jvm": true,
"classname": "org.elasticsearch.plugin.deletebyquery.DeleteByQueryPlugin",
"isolated": true,
"site": false
},
{
"name": "kopf",
"version": "2.0.1",
"description": "kopf - simple web administration tool for Elasticsearch",
"url": "/_plugin/kopf/",
"jvm": false,
"site": true
},
{
"name": "knapsack",
"version": "2.1.1.0",
"description": "Knapsack export/import for Elasticsearch",
"jvm": true,
"classname": "org.xbib.elasticsearch.plugin.knapsack.KnapsackPlugin",
"isolated": true,
"site": false
},
{
"name": "siren-join",
"version": "2.1.1",
"description": "SIREn plugin that adds join capabilities to Elasticsearch",
"jvm": true,
"classname": "solutions.siren.join.SirenJoinPlugin",
"isolated": true,
"site": false
},
{
"name": "Hanlp",
"version": "1.0.1",
"description": "Hanlp",
"jvm": true,
"classname": "com.hylanda.hanlp.plugin.HanlpPlugin",
"isolated": true,
"site": false
}
]
}
}

laududu


The message says some kind of limit was exceeded. I have not seen this problem before.

kennywu76 - wood@Ctrip


What is the total data volume of the index being aggregated? And what does the aggregation request look like?

kennywu76 - wood@Ctrip


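@novia: It should have nothing to do with the previous aggregation. If aggregating a small index of 10,000 documents also produces the error above, the data itself is very likely the problem. Are you running a date histogram aggregation?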

novia - 1&0


There is no complex aggregation. The request is as follows:
 
business-log-v1/business-log/_search?search_type=count
{
"query": {},
"aggs": {
"decision": {
"terms": {
"field": "kid"
}
}
}
}

This index has only a little over 100,000 documents, and each document is very small.

kennywu76 - wood@Ctrip


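@novia
1. Run GET /_cat/indices/business-log-v1 to see what the index stats look like.
2. What data type is the kid field?

In the error you posted, the query hit 12 shards; 11 succeeded and only shard 3 failed, which is rather odd. You can run GET /_cat/shards/business-log-v1 to check the shard distribution and see whether the node hosting shard 3 is healthy.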
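The failing shard and its node can be read straight out of the `_shards.failures` array in the search response. The following is a minimal sketch, not part of the original reply: the embedded JSON is a trimmed copy of the error quoted in the question, and the helper name `failed_shards` is hypothetical.

```python
import json

# Trimmed copy of the "_shards" section from the error response quoted in the question.
RESPONSE = json.loads("""
{
  "_shards": {
    "total": 12,
    "successful": 11,
    "failed": 1,
    "failures": [
      {
        "shard": 3,
        "index": "business-log-v1",
        "node": "svHkYmFOQkeANqyvXNUGvQ",
        "reason": {
          "type": "circuit_breaking_exception",
          "bytes_wanted": 5569656320,
          "bytes_limit": 5569511424
        }
      }
    ]
  }
}
""")


def failed_shards(response):
    """Return (index, shard, node id, reason type) for every shard failure.

    The helper name is hypothetical; it only walks the response dictionary.
    """
    failures = response.get("_shards", {}).get("failures", [])
    return [(f["index"], f["shard"], f["node"], f["reason"]["type"]) for f in failures]


for index, shard, node, reason in failed_shards(RESPONSE):
    print(f"{index} shard {shard} failed on node {node}: {reason}")
```

The node ID extracted this way can then be used to look at that one node, for example by passing it to GET /_nodes/<node id>/stats.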

novia - 1&0


@kennywu76 Thanks a lot. I dug into it but never pinpointed the cause; in the end I restarted the cluster, which resolved it.

novia - 1&0


@kennywu76 When the error occurred yesterday, the metrics in the red box had all climbed above 5G and stayed there without being released.
 

mi.png
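Per-node breaker usage can also be read from the node stats API rather than from a dashboard. This is a minimal sketch, assuming the response of GET /_nodes/stats/breaker has already been fetched and parsed into a dictionary. The helper name `request_breaker_usage` is hypothetical, and the sample below is made up for illustration: only the limit value comes from the error in the question, while the node names and estimated sizes are not real data from this cluster.

```python
def request_breaker_usage(nodes_stats):
    """Map node name -> (estimated bytes, limit bytes, fraction of limit used)
    for the "request" circuit breaker of each node in a parsed stats response."""
    usage = {}
    for node_id, node in nodes_stats.get("nodes", {}).items():
        breaker = node.get("breakers", {}).get("request")
        if breaker is None:
            continue  # node reported no request breaker section
        estimated = breaker["estimated_size_in_bytes"]
        limit = breaker["limit_size_in_bytes"]
        fraction = estimated / limit if limit else 0.0
        usage[node.get("name", node_id)] = (estimated, limit, fraction)
    return usage


# HYPOTHETICAL sample shaped like a /_nodes/stats/breaker response.
SAMPLE = {
    "nodes": {
        "node-a-id": {
            "name": "node-a",
            "breakers": {
                "request": {
                    "limit_size_in_bytes": 5569511424,
                    "estimated_size_in_bytes": 5569000000,
                    "tripped": 1,
                }
            },
        },
        "node-b-id": {
            "name": "node-b",
            "breakers": {
                "request": {
                    "limit_size_in_bytes": 5569511424,
                    "estimated_size_in_bytes": 0,
                    "tripped": 0,
                }
            },
        },
    }
}

for name, (estimated, limit, fraction) in sorted(request_breaker_usage(SAMPLE).items()):
    print(f"{name}: {estimated} of {limit} bytes ({fraction:.1%})")
```

A node whose estimated size stays close to the limit while no searches are running would match the "not released" behavior described above.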

 

BrickXu - BlackOps@Qunar


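Judging by the message, a circuit breaker was tripped. Check what memory limit your circuit breaker is configured with.
 
PS: If the fielddata usage bothers you, clear the cache first.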
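The numbers in the error message already say something about the limit. A rough sketch of the arithmetic, under two assumptions that are not confirmed anywhere in this thread: that `bytes_wanted` is the total the `[request]` breaker would hold after the new allocation, and that the limit is the Elasticsearch 2.x default of 40% of heap for `indices.breaker.request.limit`.

```python
# Values copied from the circuit_breaking_exception in the question.
BYTES_WANTED = 5569656320
BYTES_LIMIT = 5569511424

# ASSUMPTION: the request breaker limit is the 2.x default of 40% of the JVM heap.
ASSUMED_REQUEST_LIMIT_FRACTION = 0.40

# How far past the limit the rejected allocation would have gone.
overshoot = BYTES_WANTED - BYTES_LIMIT

# Heap size per data node implied by the limit, if the default fraction applies.
implied_heap = BYTES_LIMIT / ASSUMED_REQUEST_LIMIT_FRACTION

print(f"over the limit by {overshoot} bytes ({overshoot / 1024:.1f} KiB)")
print(f"implied heap per node: {implied_heap / 1024 ** 3:.2f} GiB")
```

If those assumptions hold, the request was rejected for going only slightly over the limit, which would mean the breaker's accounting was already near 5.1gb before this small aggregation ran. Note also that the error names the `[request]` breaker, which is tracked separately from the fielddata breaker.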
