Elasticsearch recovery problem

Elasticsearch | Author: jingkyks | Posted: 2015-12-17 | Views: 4971

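Our ES cluster holds more than 10 billion documents and was running version 1.5.
We are now upgrading it to 1.7 using the restart upgrade procedure.
All primaries recovered quickly and most replicas recovered as well; only one replica stays stuck in the INITIALIZING state.
The sequence was as follows:
The primary is on ndb4. The replica was first allocated to ndb6, where the exception log below appeared.
The replica was then automatically reallocated to ndb7, where the exception log appeared again.
...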
 ------
[2015-12-16 05:42:55,400][WARN ][indices.cluster          ] [ndb6] [[mgobject0][10]] marking and sending shard failed due to [failed recovery]
org.elasticsearch.indices.recovery.RecoveryFailedException: [mgobject0][10]: Recovery failed from [ndb4][NJzaKTz4QWiRwf05pcjnFw][ndb4][inet[/192.168.40.34:9300]] into [ndb6][N-QZE3-ATNeaHdsnuovq2A][ndb6][inet[/192.168.40.36:9300]]
    at org.elasticsearch.indices.recovery.RecoveryTarget.doRecovery(RecoveryTarget.java:280)
    at org.elasticsearch.indices.recovery.RecoveryTarget.access$700(RecoveryTarget.java:70)
    at org.elasticsearch.indices.recovery.RecoveryTarget$RecoveryRunner.doRun(RecoveryTarget.java:567)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.transport.RemoteTransportException: [ndb4][inet[/192.168.40.34:9300]][internal:index/shard/recovery/start_recovery]
Caused by: org.elasticsearch.index.engine.RecoveryEngineException: [mgobject0][10] Phase[1] Execution failed
    at org.elasticsearch.index.engine.InternalEngine.recover(InternalEngine.java:883)
    at org.elasticsearch.index.shard.IndexShard.recover(IndexShard.java:780)
    at org.elasticsearch.indices.recovery.RecoverySource.recover(RecoverySource.java:125)
    at org.elasticsearch.indices.recovery.RecoverySource.access$200(RecoverySource.java:49)
    at org.elasticsearch.indices.recovery.RecoverySource$StartRecoveryTransportRequestHandler.messageReceived(RecoverySource.java:146)
    at org.elasticsearch.indices.recovery.RecoverySource$StartRecoveryTransportRequestHandler.messageReceived(RecoverySource.java:132)
    at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:279)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.indices.recovery.RecoverFilesRecoveryException: [mgobject0][10] Failed to transfer [819] files with total size of [808.5gb]
    at org.elasticsearch.indices.recovery.RecoverySourceHandler.phase1(RecoverySourceHandler.java:431)
    at org.elasticsearch.index.engine.InternalEngine.recover(InternalEngine.java:878)
    ... 10 more
Caused by: java.io.IOException: Input/output error: NIOFSIndexInput(path="/data/elasticsearch/data/cerebro/nodes/0/indices/mgobject0/10/index/_4iw9.cfs")
    at org.apache.lucene.store.NIOFSDirectory$NIOFSIndexInput.readInternal(NIOFSDirectory.java:189)
    at org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:160)
    at org.elasticsearch.indices.recovery.RecoverySourceHandler$3.doRun(RecoverySourceHandler.java:312)
    ... 4 more
Caused by: java.io.IOException: Input/output error
    at sun.nio.ch.FileDispatcherImpl.pread0(Native Method)
    at sun.nio.ch.FileDispatcherImpl.pread(FileDispatcherImpl.java:52)
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:220)
    at sun.nio.ch.IOUtil.read(IOUtil.java:197)
    at sun.nio.ch.FileChannelImpl.readInternal(FileChannelImpl.java:699)
    at sun.nio.ch.FileChannelImpl.read(FileChannelImpl.java:684)
    at org.apache.lucene.store.NIOFSDirectory$NIOFSIndexInput.readInternal(NIOFSDirectory.java:179)
    ... 6 more
 
 
medcl - Hunting tigers tonight.

This file looks corrupted. Did anything go wrong during startup or indexing, such as a full disk or a sudden power-off?
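The innermost cause in the trace is an `Input/output error` raised by the operating system during `pread`, so one way to test this hypothesis is to read the named file outside Elasticsearch. Below is a minimal sketch, assuming it runs on ndb4 with read access to the data directory. The function name `find_first_read_error` is made up for illustration. A clean pass does not prove the disk is healthy, since some blocks may be served from the page cache.

```python
import os
from typing import Optional


def find_first_read_error(path: str, chunk_size: int = 1 << 20) -> Optional[int]:
    """Read `path` sequentially and return the byte offset of the first chunk
    that fails to read, or None if the whole file is readable."""
    offset = 0
    # buffering=0 gives a raw file object, so each read() is one OS-level read.
    with open(path, "rb", buffering=0) as f:
        while True:
            try:
                chunk = f.read(chunk_size)
            except OSError as exc:
                # An EIO here would match the "Input/output error" in the trace.
                code = exc.errno or 0
                print(f"read failed at offset {offset}: errno={code} ({os.strerror(code)})")
                return offset
            if not chunk:  # empty bytes means end of file
                return None
            offset += len(chunk)
```

For the file named in the trace, the call would be `find_first_read_error("/data/elasticsearch/data/cerebro/nodes/0/indices/mgobject0/10/index/_4iw9.cfs")`. If it reports a failing offset, the kernel log on that host (for example `dmesg`) is the next place to look for disk errors.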

jingkyks - Fruit pencil, 2B eraser

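Nothing went wrong during the shutdown and restart. After I set cluster.routing.allocation.enable: "all" and recovery started, a timeout exception (ElasticsearchTimeoutException) did show up, as follows: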
---------
org.elasticsearch.indices.recovery.RecoveryFailedException: [mgobject0][2]: Recovery failed from [ndb7][D2VJf9kxQl6_Vma1eYRcng][ndb7][inet[/192.168.40.37:9300]] into [ndb4][NJzaKTz4QWiRwf05pcjnFw][ndb4][inet[/192.168.40.34:9300]] (no activity after [30m])
    at org.elasticsearch.indices.recovery.RecoveriesCollection$RecoveryMonitor.doRun(RecoveriesCollection.java:235)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.ElasticsearchTimeoutException: no activity after [30m]
    ... 5 more
---------
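Shards that hit this timeout were automatically reallocated and are now healthy. The one shard that cannot be allocated fails with the exception described in the original post.
If this shard really is corrupted, is there any way to repair it? Something like a Lucene repair tool?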
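On the repair question: Lucene ships a checker, `org.apache.lucene.index.CheckIndex`, and in the Lucene 4.x line its repair option is `-fix`. That option drops every segment it cannot read, so all documents in those segments are lost. The sketch below only builds the command line and does not run it. The jar location is an assumption, the helper name `build_checkindex_command` is made up for illustration, and the index directory is derived from the path in the stack trace. The node should be stopped (or the index closed) and the shard directory backed up before anything like this is run. If the underlying disk is returning I/O errors, CheckIndex may fail on the same reads.

```python
from typing import List


def build_checkindex_command(lucene_core_jar: str, index_dir: str, fix: bool = False) -> List[str]:
    """Build (but do not execute) a Lucene 4.x CheckIndex command line.

    lucene_core_jar: path to the lucene-core jar (assumed location, not from the post).
    index_dir: the shard's Lucene index directory.
    fix: append -fix, which REMOVES unreadable segments and their documents.
    """
    cmd = ["java", "-cp", lucene_core_jar, "org.apache.lucene.index.CheckIndex", index_dir]
    if fix:
        cmd.append("-fix")
    return cmd


if __name__ == "__main__":
    # Read-only check first; the jar path below is a placeholder.
    print(" ".join(build_checkindex_command(
        "/path/to/elasticsearch/lib/lucene-core.jar",
        "/data/elasticsearch/data/cerebro/nodes/0/indices/mgobject0/10/index",
    )))
```

Running without `-fix` first shows which segments are reported as broken, which makes it possible to judge how much data a repair would discard before committing to it.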
