
An index shard suddenly crashed and went offline, and the shard cannot be reallocated

Elasticsearch | Author: code4j | Published: 2018-10-26 | Views: 6444

I have an index that was already created, and I had been writing data into it for a good few hours. Then in the afternoon one of its primary shards suddenly disappeared:
 

[screenshot attachment]

 
The machine itself did not crash, so it looks like a problem with the shard itself.
 
I checked unassigned.reason and it says ALLOCATION_FAILED.
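For reference, the per-shard reason can be listed with a request along these lines (a minimal sketch only; it assumes the cluster listens on localhost:9200 and reuses the index name that appears in the logs below):

curl -s 'http://localhost:9200/_cat/shards/lc_app_business-20181026?v&h=index,shard,prirep,state,unassigned.reason,unassigned.details'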
 
Then I looked at the ES logs and found this:
 
[2018-10-26 14:49:42,794][ERROR][index.engine             ] [node2783] [lc_app_business-20181026][0] failed to merge
org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=53c9e2f7 actual=ceaff7a4 (resource=BufferedChecksumIndexInput(NIOFSIndexInput(path="/data0/esdata/es-logcluster/nodes/0/indices/lc_app_business-20181026/0/index/_16l.cfs") [slice=_16l_Lucene50_0.tim]))
[2018-10-26 14:49:42,855][DEBUG][index.translog ] [node2783] [lc_app_business-20181026][0] translog closed
[2018-10-26 14:49:42,855][DEBUG][index.engine ] [node2783] [lc_app_business-20181026][0] engine closed [engine failed on: [merge failed]]
[2018-10-26 14:49:42,855][WARN ][index.engine ] [node2783] [lc_app_business-20181026][0] failed engine [merge failed]
org.apache.lucene.index.MergePolicy$MergeException: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=53c9e2f7 actual=ceaff7a4 (resource=BufferedChecksumIndexInput(NIOFSIndexInput(path="/data0/esdata/es-logcluster/nodes/0/indices/lc_app_business-20181026/0/index/_16l.cfs") [slice=_16l_Lucene50_0.tim]))
Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=53c9e2f7 actual=ceaff7a4 (resource=BufferedChecksumIndexInput(NIOFSIndexInput(path="/data0/esdata/es-logcluster/nodes/0/indices/lc_app_business-20181026/0/index/_16l.cfs") [slice=_16l_Lucene50_0.tim]))
[2018-10-26 14:49:42,866][DEBUG][index ] [node2783] [lc_app_business-20181026] [0] closing... (reason: [engine failure, reason [merge failed]])
[2018-10-26 14:49:42,869][DEBUG][index.shard ] [node2783] [lc_app_business-20181026][0] state: [STARTED]->[CLOSED], reason [engine failure, reason [merge failed]]
[2018-10-26 14:49:42,869][DEBUG][index.shard ] [node2783] [lc_app_business-20181026][0] operations counter reached 0, will not accept any further writes
[2018-10-26 14:49:42,869][DEBUG][index.store ] [node2783] [lc_app_business-20181026][0] store reference count on close: 0
[2018-10-26 14:49:42,869][DEBUG][index ] [node2783] [lc_app_business-20181026] [0] closed (reason: [engine failure, reason [merge failed]])

It looks like something went wrong during a merge, the shard data got corrupted, and the engine then shut itself down, but I have no idea why exactly this happened.
 
In the directory /data0/esdata/es-logcluster/nodes/0/indices/lc_app_business-20181026/0/index there is a file named corrupted_34ejg-QUR6e3bR_12dB6vQ, which basically contains the same Lucene exception as above.
 
I don't know whether the data can be recovered, and I also can't find the cause of the merge failure.

rockybean - Elastic Certified Engineer, ElasticStack Fans, WeChat official account: ElasticTalk


The disks are in RAID 0, right? Could there be bad sectors?
Did you not configure any replicas?
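For context, the replica count can be checked and changed online; a minimal sketch, assuming a local cluster and the index name from the question:

curl -s 'http://localhost:9200/lc_app_business-20181026/_settings/index.number_of_replicas?pretty'
curl -s -XPUT -H 'Content-Type: application/json' 'http://localhost:9200/lc_app_business-20181026/_settings' -d '{"index": {"number_of_replicas": 1}}'

Note that a replica only protects data if it existed before the corruption; adding one after the primary has already failed cannot bring the lost shard back.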

laoyang360 - Author of 《一本书讲透Elasticsearch》, Elastic Certified Engineer. [死磕Elasticsearch] Knowledge Planet: http://t.cn/RmwM3N9; WeChat official account: 铭毅天下; Blog: https://elastic.blog.csdn.net


This looks like a disk hardware problem; recover from a replica.
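If a replica of the shard exists on another node, Elasticsearch should promote it automatically. Once the bad disk has been dealt with, allocation attempts that already failed can be retried; a sketch (the retry_failed flag exists on 5.x and later only, and localhost:9200 is an assumption):

curl -s -XPOST 'http://localhost:9200/_cluster/reroute?retry_failed=true'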

zqc0512 - andy zhou


You didn't set up any replicas on this, did you?
If a disk has even a small problem, you're stuck...

lvwendong


I ran into this problem too. Did you ever find the root cause? Things only went back to normal after I deleted the corrupted index.
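For completeness, dropping a broken index is a one-liner, but it is irreversible; a sketch that reuses the index name from the original question purely as an illustration:

curl -s -XDELETE 'http://localhost:9200/lc_app_business-20181026'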

kr9226 - I am willing to walk, step by step, toward the world I want


I hit this problem last night too. I keep copies of the data on two nodes, and I found that for one shard both the primary and the replica had failed. In that situation the chance of the hardware failing on both nodes at the same time should be tiny; more likely, after the primary failed, the primary and replica data became inconsistent, so the replica failed as well.

bingo919 - bingo


I ran into the same problem on 6.8:
[2020-12-26T01:38:22,612][WARN ][o.e.i.c.IndicesClusterStateService] [bd_lass_elasticsearch-data0] [[app_2020.12.23-shrink][0]] marking and sending shard failed due to [shard failure, reason [recovery]]
org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=f5dvwg actual=1ukjrs4 (resource=name [_1_Lucene50_0.pos], length [88425081], checksum [f5dvwg], writtenBy [7.7.0]) (resource=VerifyingIndexOutput(_1_Lucene50_0.pos))
        at org.elasticsearch.index.store.Store$LuceneVerifyingIndexOutput.readAndCompareChecksum(Store.java:1212) ~[elasticsearch-6.8.0.jar:6.8.0]
        at org.elasticsearch.index.store.Store$LuceneVerifyingIndexOutput.writeByte(Store.java:1190) ~[elasticsearch-6.8.0.jar:6.8.0]
        at org.elasticsearch.index.store.Store$LuceneVerifyingIndexOutput.writeBytes(Store.java:1220) ~[elasticsearch-6.8.0.jar:6.8.0]
        at org.elasticsearch.indices.recovery.MultiFileWriter.innerWriteFileChunk(MultiFileWriter.java:120) ~[elasticsearch-6.8.0.jar:6.8.0]
        at org.elasticsearch.indices.recovery.MultiFileWriter.access$000(MultiFileWriter.java:43) ~[elasticsearch-6.8.0.jar:6.8.0]
        at org.elasticsearch.indices.recovery.MultiFileWriter$FileChunkWriter.writeChunk(MultiFileWriter.java:200) ~[elasticsearch-6.8.0.jar:6.8.0]
        at org.elasticsearch.indices.recovery.MultiFileWriter.writeFileChunk(MultiFileWriter.java:68) ~[elasticsearch-6.8.0.jar:6.8.0]
        at org.elasticsearch.indices.recovery.RecoveryTarget.writeFileChunk(RecoveryTarget.java:459) ~[elasticsearch-6.8.0.jar:6.8.0]
        at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$FileChunkTransportRequestHandler.messageReceived(PeerRecoveryTargetService.java:629) ~[elasticsearch-6.8.0.jar:6.8.0]
        at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$FileChunkTransportRequestHandler.messageReceived(PeerRecoveryTargetService.java:603) ~[elasticsearch-6.8.0.jar:6.8.0]
        at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:30) ~[elasticsearch-6.8.0.jar:6.8.0]
        at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:66) ~[elasticsearch-6.8.0.jar:6.8.0]
        at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1087) ~[elasticsearch-6.8.0.jar:6.8.0]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:751) ~[elasticsearch-6.8.0.jar:6.8.0]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.8.0.jar:6.8.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_232]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_232]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_232]
[2020-12-26T22:56:07,517][INFO ][o.e.m.j.JvmGcMonitorService] [bd_lass_elasticsearch-data0] [gc][828469] overhead, spent [348ms] collecting in the last [1s]
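In this 6.8 case the checksum check fails while the shard is being copied during peer recovery. A sketch of how to see why the shard stays unassigned and which copy is involved (the index name is taken from the log above; localhost:9200 and the header are assumptions for a 6.x cluster):

curl -s -XGET -H 'Content-Type: application/json' 'http://localhost:9200/_cluster/allocation/explain?pretty' -d '{"index": "app_2020.12.23-shrink", "shard": 0, "primary": true}'

If the source copy itself is corrupted, the recovery will keep failing until that copy is repaired or removed.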
