Index shard suddenly crashed and went offline, and the shard cannot be reallocated

Elasticsearch | Author: code4j | Posted on October 26, 2018 | Views: 514

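I have an index that had already been created and had been receiving writes for several hours. In the afternoon, one of its primary shards suddenly disappeared: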
 

[Screenshot: 6CD477F2-89A5-4915-A7DA-CB3367253E56.png]

 
The machine did not crash, which suggests the problem is with the shard itself.
 
I checked unassigned.reason, and it is ALLOCATION_FAILED.
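
For anyone who wants to repeat this check: the unassigned reason can be read from the `_cat/shards` API by requesting the `unassigned.reason` column. Below is a minimal Python sketch, not what the original poster ran. It assumes the cluster is reachable at a hypothetical `http://localhost:9200`, and the helper names (`parse_cat_shards`, `unassigned_shards`, `fetch_cat_shards`) are made up for illustration.

```python
from urllib.request import urlopen

# Columns requested from _cat/shards; unassigned.reason is empty for assigned shards.
COLUMNS = ["index", "shard", "prirep", "state", "unassigned.reason"]


def parse_cat_shards(text):
    """Parse whitespace-separated _cat/shards output (no header row) into dicts."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        # Assigned shards print no reason, so pad the missing trailing column.
        fields += [""] * (len(COLUMNS) - len(fields))
        rows.append(dict(zip(COLUMNS, fields)))
    return rows


def unassigned_shards(rows):
    """Keep only the shards whose state is UNASSIGNED."""
    return [r for r in rows if r["state"] == "UNASSIGNED"]


def fetch_cat_shards(base_url="http://localhost:9200"):
    """Fetch the raw table; base_url is an assumed address, adjust for your cluster."""
    url = base_url + "/_cat/shards?h=" + ",".join(COLUMNS)
    with urlopen(url) as resp:
        return resp.read().decode("utf-8")
```

Calling `unassigned_shards(parse_cat_shards(fetch_cat_shards()))` would then list each unassigned shard together with its reason.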
 
Then I looked at the ES log and found this:
 
[2018-10-26 14:49:42,794][ERROR][index.engine             ] [node2783] [lc_app_business-20181026][0] failed to merge
org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=53c9e2f7 actual=ceaff7a4 (resource=BufferedChecksumIndexInput(NIOFSIndexInput(path="/data0/esdata/es-logcluster/nodes/0/indices/lc_app_business-20181026/0/index/_16l.cfs") [slice=_16l_Lucene50_0.tim]))
[2018-10-26 14:49:42,855][DEBUG][index.translog ] [node2783] [lc_app_business-20181026][0] translog closed
[2018-10-26 14:49:42,855][DEBUG][index.engine ] [node2783] [lc_app_business-20181026][0] engine closed [engine failed on: [merge failed]]
[2018-10-26 14:49:42,855][WARN ][index.engine ] [node2783] [lc_app_business-20181026][0] failed engine [merge failed]
org.apache.lucene.index.MergePolicy$MergeException: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=53c9e2f7 actual=ceaff7a4 (resource=BufferedChecksumIndexInput(NIOFSIndexInput(path="/data0/esdata/es-logcluster/nodes/0/indices/lc_app_business-20181026/0/index/_16l.cfs") [slice=_16l_Lucene50_0.tim]))
Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=53c9e2f7 actual=ceaff7a4 (resource=BufferedChecksumIndexInput(NIOFSIndexInput(path="/data0/esdata/es-logcluster/nodes/0/indices/lc_app_business-20181026/0/index/_16l.cfs") [slice=_16l_Lucene50_0.tim]))
[2018-10-26 14:49:42,866][DEBUG][index ] [node2783] [lc_app_business-20181026] [0] closing... (reason: [engine failure, reason [merge failed]])
[2018-10-26 14:49:42,869][DEBUG][index.shard ] [node2783] [lc_app_business-20181026][0] state: [STARTED]->[CLOSED], reason [engine failure, reason [merge failed]]
[2018-10-26 14:49:42,869][DEBUG][index.shard ] [node2783] [lc_app_business-20181026][0] operations counter reached 0, will not accept any further writes
[2018-10-26 14:49:42,869][DEBUG][index.store ] [node2783] [lc_app_business-20181026][0] store reference count on close: 0
[2018-10-26 14:49:42,869][DEBUG][index ] [node2783] [lc_app_business-20181026] [0] closed (reason: [engine failure, reason [merge failed]])

It looks like something went wrong during a merge, which corrupted the shard, so the shard closed itself. But I don't know exactly why this happened.
 
In the directory /data0/esdata/es-logcluster/nodes/0/indices/lc_app_business-20181026/0/index there is a file named corrupted_34ejg-QUR6e3bR_12dB6vQ, and its content is essentially the same Lucene exception shown above.
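
To see whether other shards on the same node carry the same kind of marker, one option is to walk the data directory and look for files whose names start with `corrupted_`. This is only a sketch under that naming assumption (taken from the file name above), and `find_corruption_markers` is a hypothetical helper, not an Elasticsearch tool.

```python
import os


def find_corruption_markers(data_path):
    """Return sorted paths of files named corrupted_* anywhere under data_path."""
    markers = []
    for dirpath, _dirnames, filenames in os.walk(data_path):
        for name in filenames:
            # Assumption: marker files use the corrupted_<id> naming seen in this post.
            if name.startswith("corrupted_"):
                markers.append(os.path.join(dirpath, name))
    return sorted(markers)
```

For example, `find_corruption_markers("/data0/esdata/es-logcluster/nodes/0/indices")` would scan the indices directory from this post. A marker found in only one shard is consistent with a localized disk problem, though it does not prove one.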
 
Can the data be recovered? I also can't find the cause of the merge failure.

rockybean - Elastic Certified Engineer, ElasticStack Fans, WeChat official account: ElasticTalk


Are the disks configured as RAID 0? Could there be bad sectors?
Were no replicas configured?

laoyang360 - [死磕Elasitcsearch] Knowledge Planet: http://t.cn/RmwM3N9; WeChat official account: 铭毅天下; blog: blog.csdn.net/laoyang360


This looks like a disk hardware problem. Recover the shard from a replica.
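
Since several replies come down to replicas, here is a rough Python sketch of raising `number_of_replicas` through the index settings API (`PUT <index>/_settings`). The base URL and the function names are assumptions for illustration. Note that this only protects against future failures: a replica cannot be built for a shard that has no healthy copy left.

```python
import json
from urllib.request import Request, urlopen


def build_replica_request(base_url, index, replicas):
    """Build a PUT <index>/_settings request that sets number_of_replicas."""
    body = json.dumps({"index": {"number_of_replicas": replicas}}).encode("utf-8")
    return Request(
        url="{}/{}/_settings".format(base_url.rstrip("/"), index),
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


def set_replicas(base_url, index, replicas):
    """Send the request and return the decoded JSON response."""
    with urlopen(build_replica_request(base_url, index, replicas)) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Usage would look like `set_replicas("http://localhost:9200", "lc_app_business-20181026", 1)`, with the address replaced by your own cluster's.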

zqc0512 - andy zhou


You didn't configure any replicas for this, did you?
If anything goes wrong with the disk, you're stuck...

lvwendong


I ran into this problem too. Did you ever find the root cause? Things only went back to normal after I deleted the corrupted index.
