使用 shuf 来打乱一个文件中的行或是选择文件中一个随机的行。

索引分片突然崩溃下线,分片无法重新分配

Elasticsearch | 作者 code4j | 发布于2018年10月26日 | 阅读数:6602

我有一个索引,已经创建好了然后写了好几款一个小时的数据,下午的时候突然一个主分片没了:
 

6CD477F2-89A5-4915-A7DA-CB3367253E56.png

 
机器没有crash,说明是分片自己的问题
 
看了下unsinged.reason 是 ALLOCATION_FAILED分配失败。
 
然后看了下es的日志,发现了这么一段:
 
[2018-10-26 14:49:42,794][ERROR][index.engine             ] [node2783] [lc_app_business-20181026][0] failed to merge
org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=53c9e2f7 actual=ceaff7a4 (resource=BufferedChecksumIndexInput(NIOFSIndexInput(path="/data0/esdata/es-logcluster/nodes/0/indices/lc_app_business-20181026/0/index/_16l.cfs") [slice=_16l_Lucene50_0.tim]))
[2018-10-26 14:49:42,855][DEBUG][index.translog ] [node2783] [lc_app_business-20181026][0] translog closed
[2018-10-26 14:49:42,855][DEBUG][index.engine ] [node2783] [lc_app_business-20181026][0] engine closed [engine failed on: [merge failed]]
[2018-10-26 14:49:42,855][WARN ][index.engine ] [node2783] [lc_app_business-20181026][0] failed engine [merge failed]
org.apache.lucene.index.MergePolicy$MergeException: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=53c9e2f7 actual=ceaff7a4 (resource=BufferedChecksumIndexInput(NIOFSIndexInput(path="/data0/esdata/es-logcluster/nodes/0/indices/lc_app_business-20181026/0/index/_16l.cfs") [slice=_16l_Lucene50_0.tim]))
Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=53c9e2f7 actual=ceaff7a4 (resource=BufferedChecksumIndexInput(NIOFSIndexInput(path="/data0/esdata/es-logcluster/nodes/0/indices/lc_app_business-20181026/0/index/_16l.cfs") [slice=_16l_Lucene50_0.tim]))
[2018-10-26 14:49:42,866][DEBUG][index ] [node2783] [lc_app_business-20181026] [0] closing... (reason: [engine failure, reason [merge failed]])
[2018-10-26 14:49:42,869][DEBUG][index.shard ] [node2783] [lc_app_business-20181026][0] state: [STARTED]->[CLOSED], reason [engine failure, reason [merge failed]]
[2018-10-26 14:49:42,869][DEBUG][index.shard ] [node2783] [lc_app_business-20181026][0] operations counter reached 0, will not accept any further writes
[2018-10-26 14:49:42,869][DEBUG][index.store ] [node2783] [lc_app_business-20181026][0] store reference count on close: 0
[2018-10-26 14:49:42,869][DEBUG][index ] [node2783] [lc_app_business-20181026] [0] closed (reason: [engine failure, reason [merge failed]])

看起来这意思像是merge的时候出了点问题导致分片损坏了,就自己关闭了,但是并不知道具体为什么会这样
 
在 /data0/esdata/es-logcluster/nodes/0/indices/lc_app_business-20181026/0/index 目录下能找到一个文件叫corrupted_34ejg-QUR6e3bR_12dB6vQ,说的其实也是上面那个Lucene的异常信息。
 
不知道数据是否能恢复?而且也找不到merge failed的原因呢
已邀请:

rockybean - Elastic Certified Engineer, ElasticStack Fans,公众号:ElasticTalk

赞同来自:

磁盘是做的 Raid0,是不是有坏道?
没设置副本?

laoyang360 - 《一本书讲透Elasticsearch》作者,Elastic认证工程师 [死磕Elasitcsearch]知识星球地址:http://t.cn/RmwM3N9;微信公众号:铭毅天下; 博客:https://elastic.blog.csdn.net

赞同来自:

貌似磁盘硬件问题,通过副本恢复吧。

zqc0512 - andy zhou

赞同来自:

你这玩意不设置副本不是么?
万一硬盘出点问题就傻了……

lvwendong

赞同来自:

我也遇到了这个问题,请问找到根本原因了吗,我把坏的索引库删掉才好

kr9226 - 我愿意一步一步走向我想要的世界

赞同来自:

昨天夜里我也碰到了这个问题,我在两个节点上做了备份,发现一个shard的主分片和副本都挂掉了,这种情况下,硬件同时出现故障的可能性应该很小吧;可能是主分片挂了之后,主分片和副本数据不一致,所以副本也挂掉了吧

bingo919 - bingo

赞同来自:

6.8我也遇到了同样的问题
[2020-12-26T01:38:22,612][WARN ][o.e.i.c.IndicesClusterStateService] [bd_lass_elasticsearch-data0] [[app_2020.12.23-shrink][0]] marking and sending shard failed due to [shard failure, reason [recovery]]
org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=f5dvwg actual=1ukjrs4 (resource=name [_1_Lucene50_0.pos], length [88425081], checksum [f5dvwg], writtenBy [7.7.0]) (resource=VerifyingIndexOutput(_1_Lucene50_0.pos))
        at org.elasticsearch.index.store.Store$LuceneVerifyingIndexOutput.readAndCompareChecksum(Store.java:1212) ~[elasticsearch-6.8.0.jar:6.8.0]
        at org.elasticsearch.index.store.Store$LuceneVerifyingIndexOutput.writeByte(Store.java:1190) ~[elasticsearch-6.8.0.jar:6.8.0]
        at org.elasticsearch.index.store.Store$LuceneVerifyingIndexOutput.writeBytes(Store.java:1220) ~[elasticsearch-6.8.0.jar:6.8.0]
        at org.elasticsearch.indices.recovery.MultiFileWriter.innerWriteFileChunk(MultiFileWriter.java:120) ~[elasticsearch-6.8.0.jar:6.8.0]
        at org.elasticsearch.indices.recovery.MultiFileWriter.access$000(MultiFileWriter.java:43) ~[elasticsearch-6.8.0.jar:6.8.0]
        at org.elasticsearch.indices.recovery.MultiFileWriter$FileChunkWriter.writeChunk(MultiFileWriter.java:200) ~[elasticsearch-6.8.0.jar:6.8.0]
        at org.elasticsearch.indices.recovery.MultiFileWriter.writeFileChunk(MultiFileWriter.java:68) ~[elasticsearch-6.8.0.jar:6.8.0]
        at org.elasticsearch.indices.recovery.RecoveryTarget.writeFileChunk(RecoveryTarget.java:459) ~[elasticsearch-6.8.0.jar:6.8.0]
        at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$FileChunkTransportRequestHandler.messageReceived(PeerRecoveryTargetService.java:629) ~[elasticsearch-6.8.0.jar:6.8.0]
        at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$FileChunkTransportRequestHandler.messageReceived(PeerRecoveryTargetService.java:603) ~[elasticsearch-6.8.0.jar:6.8.0]
        at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:30) ~[elasticsearch-6.8.0.jar:6.8.0]
        at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:66) ~[elasticsearch-6.8.0.jar:6.8.0]
        at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1087) ~[elasticsearch-6.8.0.jar:6.8.0]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:751) ~[elasticsearch-6.8.0.jar:6.8.0]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.8.0.jar:6.8.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_232]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_232]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_232]
[2020-12-26T22:56:07,517][INFO ][o.e.m.j.JvmGcMonitorService] [bd_lass_elasticsearch-data0] [gc][828469] overhead, spent [348ms] collecting in the last [1s]

要回复问题请先登录注册