ES cluster GC problem

Elasticsearch | Author: wwwlll | Posted 2017-09-25 | Views: 11903

[2017-09-22T09:19:00,655][WARN ][o.e.m.j.JvmGcMonitorService] . [gc][old][1779980][148521] duration [23.9s], collections [1]/[24.6s], total [23.9s]/[4.1h], memory [29.7gb]->[29.3gb]/[30.8gb], all_pools {[young] [803.4mb]->[347.8mb]/[1.4gb]}{[survivor] [40.8mb]->[0b]/[191.3mb]}{[old] [28.8gb]->[29gb]/[29.1gb]}
[2017-09-22T09:19:00,747][WARN ][o.e.m.j.JvmGcMonitorService] . [gc][1779980] overhead, spent [24.1s] collecting in the last [24.6s]
[2017-09-22T09:19:23,515][WARN ][o.e.m.j.JvmGcMonitorService] . [gc][old][1779981][148522] duration [21.9s], collections [1]/[22.8s], total [21.9s]/[4.1h], memory [29.3gb]->[29.1gb]/[30.8gb], all_pools {[young] [347.8mb]->[76.2mb]/[1.4gb]}{[survivor] [0b]->[0b]/[191.3mb]}{[old] [29gb]->[29.1gb]/[29.1gb]}
[2017-09-22T09:19:23,515][WARN ][o.e.m.j.JvmGcMonitorService] . [gc][1779981] overhead, spent [22.3s] collecting in the last [22.8s]
[2017-09-22T09:19:48,815][WARN ][o.e.m.j.JvmGcMonitorService] . [gc][old][1779983][148523] duration [23.7s], collections [1]/[24.2s], total [23.7s]/[4.1h], memory [30.7gb]->[30.4gb]/[30.8gb], all_pools {[young] [1.4gb]->[1.3gb]/[1.4gb]}{[survivor] [141.6mb]->[0b]/[191.3mb]}{[old] [29.1gb]->[29.1gb]/[29.1gb]}
[2017-09-22T09:19:48,815][WARN ][o.e.m.j.JvmGcMonitorService] . [gc][1779983] overhead, spent [23.7s] collecting in the last [24.2s]
[2017-09-22T09:20:12,046][WARN ][o.e.m.j.JvmGcMonitorService] . [gc][old][1779985][148524] duration [21.8s], collections [1]/[22.2s], total [21.8s]/[4.1h], memory [30.7gb]->[29gb]/[30.8gb], all_pools {[young] [1.4gb]->[28.7mb]/[1.4gb]}{[survivor] [122.2mb]->[0b]/[191.3mb]}{[old] [29.1gb]->[29gb]/[29.1gb]}
[2017-09-22T09:20:12,047][WARN ][o.e.m.j.JvmGcMonitorService] . [gc][1779985] overhead, spent [21.8s] collecting in the last [22.2s]
[2017-09-22T09:20:36,439][WARN ][o.e.m.j.JvmGcMonitorService] . [gc][old][1779986][148525] duration [23.8s], collections [1]/[24.3s], total [23.8s]/[4.1h], memory [29gb]->[29.1gb]/[30.8gb], all_pools {[young] [28.7mb]->[4.3mb]/[1.4gb]}{[survivor] [0b]->[0b]/[191.3mb]}{[old] [29gb]->[29.1gb]/[29.1gb]}
[2017-09-22T09:20:36,439][WARN ][o.e.m.j.JvmGcMonitorService] . [gc][1779986] overhead, spent [23.8s] collecting in the last [24.3s]
[2017-09-22T09:20:58,895][WARN ][o.e.m.j.JvmGcMonitorService] . [gc][old][1779987][148526] duration [21.9s], collections [1]/[22.4s], total [21.9s]/[4.1h], memory [29.1gb]->[29.1gb]/[30.8gb], all_pools {[young] [4.3mb]->[1.4mb]/[1.4gb]}{[survivor] [0b]->[0b]/[191.3mb]}{[old] [29.1gb]->[29.1gb]/[29.1gb]}
[2017-09-22T09:20:58,895][WARN ][o.e.m.j.JvmGcMonitorService] . [gc][1779987] overhead, spent [21.9s] collecting in the last [22.4s]
[2017-09-22T09:21:23,694][WARN ][o.e.m.j.JvmGcMonitorService] . [gc][old][1779988][148527] duration [23.7s], collections [1]/[24.7s], total [23.7s]/[4.1h], memory [29.1gb]->[30.2gb]/[30.8gb], all_pools {[young] [1.4mb]->[1gb]/[1.4gb]}{[survivor] [0b]->[0b]/[191.3mb]}{[old] [29.1gb]->[29.1gb]/[29.1gb]}
[2017-09-22T09:21:23,694][WARN ][o.e.m.j.JvmGcMonitorService] . [gc][1779988] overhead, spent [23.7s] collecting in the last [24.7s]
[2017-09-22T09:21:47,572][WARN ][o.e.m.j.JvmGcMonitorService] . [gc][old][1779990][148528] duration [22.6s], collections [1]/[22.8s], total [22.6s]/[4.1h], memory [30.7gb]->[29.1gb]/[30.8gb], all_pools {[young] [1.4gb]->[66.7mb]/[1.4gb]}{[survivor] [151.5mb]->[0b]/[191.3mb]}{[old] [29.1gb]->[29.1gb]/[29.1gb]}
[2017-09-22T09:21:47,572][WARN ][o.e.m.j.JvmGcMonitorService] . [gc][1779990] overhead, spent [22.6s] collecting in the last [22.8s]
[2017-09-22T09:22:13,027][WARN ][o.e.m.j.JvmGcMonitorService] . [gc][old][1779992][148529] duration [23.4s], collections [1]/[24.4s], total [23.4s]/[4.1h], memory [30.7gb]->[30.4gb]/[30.8gb], all_pools {[young] [1.4gb]->[1.3gb]/[1.4gb]}{[survivor] [89.9mb]->[0b]/[191.3mb]}{[old] [29.1gb]->[29.1gb]/[29.1gb]}
[2017-09-22T09:22:13,032][WARN ][o.e.m.j.JvmGcMonitorService] . [gc][1779992] overhead, spent [23.4s] collecting in the last [24.4s]
[2017-09-22T09:22:35,663][WARN ][o.e.m.j.JvmGcMonitorService] . [gc][old][1779993][148530] duration [21.6s], collections [1]/[22.6s], total [21.6s]/[4.1h], memory [30.4gb]->[29.1gb]/[30.8gb], all_pools {[young] [1.3gb]->[70.6mb]/[1.4gb]}{[survivor] [0b]->[0b]/[191.3mb]}{[old] [29.1gb]->[29.1gb]/[29.1gb]}
[2017-09-22T09:22:35,663][WARN ][o.e.m.j.JvmGcMonitorService] . [gc][1779993] overhead, spent [21.6s] collecting in the last [22.6s]
[2017-09-22T09:22:59,552][WARN ][o.e.m.j.JvmGcMonitorService] . [gc][old][1779994][148531] duration [23.3s], collections [1]/[23.8s], total [23.3s]/[4.1h], memory [29.1gb]->[29.1gb]/[30.8gb], all_pools {[young] [70.6mb]->[36.3mb]/[1.4gb]}{[survivor] [0b]->[0b]/[191.3mb]}{[old] [29.1gb]->[29.1gb]/[29.1gb]}
[2017-09-22T09:22:59,552][WARN ][o.e.m.j.JvmGcMonitorService] . [gc][1779994] overhead, spent [23.3s] collecting in the last [23.8s]
[2017-09-22T09:23:47,539][WARN ][o.e.m.j.JvmGcMonitorService] . [gc][old][1779997][148533] duration [45.2s], collections [2]/[45.8s], total [45.2s]/[4.1h], memory [30.7gb]->[29.1gb]/[30.8gb], all_pools {[young] [1.4gb]->[8.2mb]/[1.4gb]}{[survivor] [173.3mb]->[0b]/[191.3mb]}{[old] [29.1gb]->[29.1gb]/[29.1gb]}
[2017-09-22T09:23:47,539][WARN ][o.e.m.j.JvmGcMonitorService] . [gc][1779997] overhead, spent [45.2s] collecting in the last [45.8s]
[2017-09-22T09:24:11,978][WARN ][o.e.m.j.JvmGcMonitorService] . [gc][old][1779998][148534] duration [23.8s], collections [1]/[24.4s], total [23.8s]/[4.1h], memory [29.1gb]->[29gb]/[30.8gb], all_pools {[young] [8.2mb]->[4mb]/[1.4gb]}{[survivor] [0b]->[0b]/[191.3mb]}{[old] [29.1gb]->[29gb]/[29.1gb]}
[2017-09-22T09:24:11,978][WARN ][o.e.m.j.JvmGcMonitorService] . [gc][1779998] overhead, spent [23.8s] collecting in the last [24.4s]
[2017-09-22T09:24:37,304][WARN ][o.e.m.j.JvmGcMonitorService] . [gc][old][1780001][148535] duration [23.2s], collections [1]/[23.3s], total [23.2s]/[4.2h], memory [30.6gb]->[29gb]/[30.8gb], all_pools {[young] [1.4gb]->[28.7mb]/[1.4gb]}{[survivor] [162.5mb]->[0b]/[191.3mb]}{[old] [29gb]->[29gb]/[29.1gb]}
[2017-09-22T09:24:37,304][WARN ][o.e.m.j.JvmGcMonitorService] . [gc][1780001] overhead, spent [23.2s] collecting in the last [23.3s]
[2017-09-22T09:25:02,598][WARN ][o.e.m.j.JvmGcMonitorService] . [gc][old][1780003][148536] duration [23.7s], collections [1]/[24.2s], total [23.7s]/[4.2h], memory [30.3gb]->[29gb]/[30.8gb], all_pools {[young] [1.1gb]->[6mb]/[1.4gb]}{[survivor] [191.3mb]->[0b]/[191.3mb]}{[old] [29gb]->[29gb]/[29.1gb]}
[2017-09-22T09:25:02,598][WARN ][o.e.m.j.JvmGcMonitorService] . [gc][1780003] overhead, spent [24.1s] collecting in the last [24.2s]
[2017-09-22T09:25:25,435][WARN ][o.e.m.j.JvmGcMonitorService] . [gc][old][1780004][148537] duration [22.3s], collections [1]/[22.8s], total [22.3s]/[4.2h], memory [29gb]->[29.1gb]/[30.8gb], all_pools {[young] [6mb]->[32.3mb]/[1.4gb]}{[survivor] [0b]->[0b]/[191.3mb]}{[old] [29gb]->[29.1gb]/[29.1gb]}
[2017-09-22T09:25:25,435][WARN ][o.e.m.j.JvmGcMonitorService] . [gc][1780004] overhead, spent [22.3s] collecting in the last [22.8s]
[2017-09-22T09:25:50,017][WARN ][o.e.m.j.JvmGcMonitorService] . [gc][old][1780006][148538] duration [23.2s], collections [1]/[23.5s], total [23.2s]/[4.2h], memory [30.7gb]->[29.8gb]/[30.8gb], all_pools {[young] [1.4gb]->[801.2mb]/[1.4gb]}{[survivor] [177.8mb]->[0b]/[191.3mb]}{[old] [29.1gb]->[29gb]/[29.1gb]}
[2017-09-22T09:25:50,018][WARN ][o.e.m.j.JvmGcMonitorService] . [gc][1780006] overhead, spent [23.2s] collecting in the last [23.5s]
[2017-09-22T09:26:13,420][WARN ][o.e.m.j.JvmGcMonitorService] . [gc][old][1780007][148539] duration [22.7s], collections [1]/[23.1s], total [22.7s]/[4.2h], memory [29.8gb]->[29.4gb]/[30.8gb], all_pools {[young] [801.2mb]->[412.2mb]/[1.4gb]}{[survivor] [0b]->[0b]/[191.3mb]}{[old] [29gb]->[29.1gb]/[29.1gb]}


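A 12-node cluster running 5.3.2 on Java 1.8.131, with memory locking configured (31g). Every so often, one of the nodes starts logging a flood of GC records and the cluster stops working. When I manually stop the ES process on that node, all the other nodes start producing the same GC records, and in the end the only option is to restart every node together.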

Is there a fix for this, or is there something in the configuration I should be paying attention to?

kennywu76 - wood@Ctrip

Upvoted by: wwwlll luohuanfeng

cyberdak - 58.com - long-term internal referrals for 58

Upvoted by: wwwlll

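From this log it looks like CMS with far too many full GCs. The young generation is only 1.4G, which suggests the heap is not partitioned sensibly.
 
So the simplest fix is to switch straight to G1GC.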
 
 

Loading Zhang

Upvoted by: wwwlll

Use G1.

ybtsdst - focus on lucene & es

Upvoted by: wwwlll

What is the cluster doing? The old generation is full the whole time.

hufuman

Upvoted by: wwwlll

Please post the complete ES startup command line.
 
And the machine specs.

simooge - IT guy born in the 80s

Upvoted by:

Did you ever get this solved?

hapjin

Upvoted by:

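Judging from the GC log, the old generation is never being reclaimed. Elasticsearch uses the CMS collector by default, and an old-generation CMS cycle starts once the old generation is 75% full (CMSInitiatingOccupancyFraction=75). As far as I can tell, G1 is still not officially recommended, because some unresolved G1 bugs can cause ES index data to fail. See a-heap-of-trouble for details.
An earlier answer says "the young generation is only 1.4GB" and concludes that the heap is configured badly. I checked, and by default the young generation tops out at about 1.4GB anyway (the MaxNewSize parameter).
  First run ps aux | grep elastic to find the pid of your ES process, then run jmap -heap pid to see the process's current heap usage and GC configuration. Mine, for example: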
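One way to check the "never reclaimed" reading against the log lines in the question is to pull the old-generation numbers out of each `[gc][old]` record. The following is only a rough sketch: the regular expression is written against the 5.x line format pasted above, and the names `GC_LINE`, `to_mb` and `parse_old_gc` are made up for this illustration.

```python
import re

# Matches "[gc][old]" records in the format shown in the question's log.
# Assumption: sizes always carry one of the units b, kb, mb or gb.
GC_LINE = re.compile(
    r"\[gc\]\[old\]\[\d+\]\[\d+\] duration \[(?P<duration>[\d.]+)s\].*?"
    r"\{\[old\] \[(?P<before>[\d.]+)(?P<before_unit>gb|mb|kb|b)\]->"
    r"\[(?P<after>[\d.]+)(?P<after_unit>gb|mb|kb|b)\]/"
    r"\[(?P<max>[\d.]+)(?P<max_unit>gb|mb|kb|b)\]\}"
)

UNIT_TO_MB = {"b": 1 / (1024 * 1024), "kb": 1 / 1024, "mb": 1.0, "gb": 1024.0}


def to_mb(value, unit):
    """Convert a logged size such as ('29.1', 'gb') to megabytes."""
    return float(value) * UNIT_TO_MB[unit]


def parse_old_gc(line):
    """Return pause time and old-generation figures, or None for other lines."""
    m = GC_LINE.search(line)
    if m is None:
        return None
    before = to_mb(m["before"], m["before_unit"])
    after = to_mb(m["after"], m["after_unit"])
    capacity = to_mb(m["max"], m["max_unit"])
    return {
        "pause_s": float(m["duration"]),
        "freed_mb": before - after,
        "old_used_after_pct": 100.0 * after / capacity,
    }


# One record copied from the question's log.
sample = (
    "[2017-09-22T09:20:58,895][WARN ][o.e.m.j.JvmGcMonitorService] . "
    "[gc][old][1779987][148526] duration [21.9s], collections [1]/[22.4s], "
    "total [21.9s]/[4.1h], memory [29.1gb]->[29.1gb]/[30.8gb], all_pools "
    "{[young] [4.3mb]->[1.4mb]/[1.4gb]}{[survivor] [0b]->[0b]/[191.3mb]}"
    "{[old] [29.1gb]->[29.1gb]/[29.1gb]}"
)
print(parse_old_gc(sample))
```

For that record the sketch reports a 21.9 s pause that freed nothing and left the old generation at 100% of its capacity, which is the pattern that repeats throughout the log.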
GC configuration:
Heap Configuration:
   MinHeapFreeRatio         = 40
   MaxHeapFreeRatio         = 70
   MaxHeapSize              = 34359738368 (32768.0MB)
   NewSize                  = 1570308096 (1497.5625MB)
   MaxNewSize               = 1570308096 (1497.5625MB)
   OldSize                  = 32789430272 (31270.4375MB)
   NewRatio                 = 2
   SurvivorRatio            = 8
   MetaspaceSize            = 21807104 (20.796875MB)
   CompressedClassSpaceSize = 1073741824 (1024.0MB)
   MaxMetaspaceSize         = 17592186044415 MB
   G1HeapRegionSize         = 0 (0.0MB)
 
Heap usage:
Heap Usage:
New Generation (Eden + 1 Survivor Space):
   capacity = 1413283840 (1347.8125MB)
   used     = 71170512 (67.87348937988281MB)
   free     = 1342113328 (1279.9390106201172MB)
   5.035825782880246% used
Eden Space:
   capacity = 1256259584 (1198.0625MB)
   used     = 34256104 (32.669166564941406MB)
   free     = 1222003480 (1165.3933334350586MB)
   2.726833246591176% used
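To put the 1.4GB figure in proportion, the byte counts in the Heap Configuration block can be compared directly. This is a minimal sketch that assumes the `Name = <bytes> (<MB>MB)` layout shown in my output; `parse_heap_config` is a hypothetical helper, not part of jmap.

```python
import re


def parse_heap_config(text):
    """Return {name: bytes} for lines shaped like 'Name = <bytes> (<MB>MB)'."""
    sizes = {}
    for line in text.splitlines():
        m = re.match(r"\s*(\w+)\s*=\s*(\d+)\s*\(", line)
        if m:
            sizes[m.group(1)] = int(m.group(2))
    return sizes


# Values copied from the Heap Configuration output.
sample = """
   MaxHeapSize              = 34359738368 (32768.0MB)
   NewSize                  = 1570308096 (1497.5625MB)
   MaxNewSize               = 1570308096 (1497.5625MB)
   OldSize                  = 32789430272 (31270.4375MB)
"""

cfg = parse_heap_config(sample)
young_pct = 100.0 * cfg["MaxNewSize"] / cfg["MaxHeapSize"]
print(f"young generation cap: {young_pct:.1f}% of the heap")
print("new + old == max heap:", cfg["NewSize"] + cfg["OldSize"] == cfg["MaxHeapSize"])
```

With these numbers the young generation is capped at roughly 4.6% of the 32GB heap, and the rest is old generation.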
 
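Make sure the configuration is reasonable. Also, as far as I can tell, Elastic officially advises against giving a single Elasticsearch process a very large heap, and it should definitely not exceed 32GB. Beyond that the JVM can no longer use compressed object pointers and performance suffers. See: heap-size
To be on the safe side, it is best to keep the heap at no more than 26GB. The larger the heap, the longer a full GC takes. That said, my own heap is set to 32GB (awkward...).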
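The 32GB point can be checked against the MaxHeapSize value from the jmap output. The sketch below assumes a nominal cutoff of exactly 32 GiB; the real cutoff depends on the JVM and platform and can sit somewhat lower, so treat the `limit_bytes` default as an assumption. `heap_allows_compressed_oops` is a name made up for this example.

```python
GIB = 1024 ** 3


def heap_allows_compressed_oops(max_heap_bytes, limit_bytes=32 * GIB):
    """Rough check: compressed object pointers need a heap below the limit.

    Assumption: limit_bytes is a nominal 32 GiB; the actual JVM cutoff
    may be lower on a given machine.
    """
    return max_heap_bytes < limit_bytes


print(heap_allows_compressed_oops(34359738368))  # MaxHeapSize from the jmap output
print(heap_allows_compressed_oops(26 * GIB))     # the more conservative 26GB setting
```

The first call prints False, because 34359738368 bytes is exactly 32 GiB, and the second prints True.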
Also, the gc.log file that ES 6.3.2 generates automatically actually records the JVM configuration. Mine, for example:
Java HotSpot(TM) 64-Bit Server VM (25.161-b12) for linux-amd64 JRE (1.8.0_161-b12), built on Dec 19 2017 16:12:43 by "java_re" with gcc 4.3.0 20080428 (Red Hat 4.3.0-8)
Memory: 4k page, physical 65596108k(1934556k free), swap 7813116k(7813112k free)
CommandLine flags: -XX:+AlwaysPreTouch -XX:CICompilerCount=12 -XX:CMSInitiatingOccupancyFraction=75 -XX:ErrorFile=logs/hs_err_pid%p.log -XX:GCLogFileSize=67108864 -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=data -XX:InitialHeapSize=34359738368 -XX:MaxHeapSize=34359738368 -XX:MaxNewSize=1570308096 -XX:MaxTenuringThreshold=6 -XX:MinHeapDeltaBytes=196608 -XX:NewSize=1570308096 -XX:NumberOfGCLogFiles=32 -XX:OldPLABSize=16 -XX:OldSize=32789430272 -XX:-OmitStackTraceInFastThrow -XX:+PrintGC -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution -XX:ThreadStackSize=1024 -XX:+UseCMSInitiatingOccupancyOnly -XX:+UseConcMarkSweepGC -XX:+UseFastUnorderedTimeStamps -XX:+UseGCLogFileRotation -XX:+UseParNewGC
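To read the collector settings out of a flags line like this one without scanning it by eye, something along these lines should work. It is a sketch only: `parse_jvm_flags` is a hypothetical helper, it handles just the `-XX:+Name`, `-XX:-Name` and `-XX:Name=value` forms, and the sample string is a subset of the flags from my gc.log.

```python
def parse_jvm_flags(command_line):
    """Map -XX flag names to True/False (switches) or their string value."""
    flags = {}
    for token in command_line.split():
        if not token.startswith("-XX:"):
            continue  # ignore anything that is not an -XX option
        body = token[len("-XX:"):]
        if body[:1] in ("+", "-"):
            flags[body[1:]] = body[0] == "+"
        elif "=" in body:
            name, value = body.split("=", 1)
            flags[name] = value
    return flags


sample = (
    "-XX:CMSInitiatingOccupancyFraction=75 -XX:MaxNewSize=1570308096 "
    "-XX:-OmitStackTraceInFastThrow -XX:+UseCMSInitiatingOccupancyOnly "
    "-XX:+UseConcMarkSweepGC -XX:+UseParNewGC"
)
flags = parse_jvm_flags(sample)

if flags.get("UseConcMarkSweepGC"):
    collector = "CMS"
elif flags.get("UseG1GC"):
    collector = "G1"
else:
    collector = "other/default"

print(collector, flags.get("CMSInitiatingOccupancyFraction"))
```

For the sample above this reports CMS with an initiating occupancy fraction of 75, matching the 75% threshold mentioned earlier.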

 
As for why the old generation never gets reclaimed: is something being run that makes ES cache a lot of data?
 
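In an ES cluster made up of several machines, suppose one machine is stuck in full GC. A CMS full GC includes stop-the-world phases, so the ES process cannot answer ping requests from the other nodes, and the master concludes that the machine is down. Once a node is considered down, its shards are reallocated automatically (see index.unassigned.node_left.delayed_timeout). If the machine stuck in full GC is itself the master, a master election is triggered. All of this keeps the cluster from working properly.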
 
 
 

zqc0512 - andy zhou

Upvoted by:

For a large cluster, try G1GC; that may well solve it.
There are too many segments and indices. Tune and optimize them.
