
Elasticsearch crashes after running for a while: "All shards failed"

Elasticsearch | Author: ww107 | Published 2019-12-09 | Views: 8755

Elasticsearch crashes on its own after running for a while, and the cluster status turns red. If I delete the data and restart, it runs for a while (usually a few days) and then crashes again. Any pointers would be much appreciated!

The ES version is 5.4.1, running on Alibaba Cloud Ubuntu 12. The disk partition is 20 GB in total, of which Elasticsearch uses only about 1.6 GB. Basic cluster info:
{
  "cluster_name" : "elasticsearch",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 5,
  "active_shards" : 5,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 5,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 50.0
}
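As a side note, the yellow status in this health output follows directly from the single-node setup; a minimal Python sketch of the reasoning (numbers copied from the output above):

```python
# Abridged cluster health from the output above.
health = {
    "status": "yellow",
    "number_of_data_nodes": 1,
    "active_shards": 5,
    "unassigned_shards": 5,
}

# A replica shard is never allocated to the node that already holds its
# primary, so on a one-node cluster every replica stays unassigned and the
# status is yellow at best. Red means at least one *primary* is unassigned.
if health["number_of_data_nodes"] == 1 and health["unassigned_shards"] > 0:
    print("yellow is expected: replicas cannot be placed on a single node")
```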

Here is the log from when the error occurred, with some lines omitted:
[2019-12-07T01:04:08,770][INFO ][o.e.c.r.a.DiskThresholdMonitor] [107room-node-1] rerouting shards: [high disk watermark exceeded on one or more nodes]
[2019-12-07T01:04:38,793][WARN ][o.e.c.r.a.DiskThresholdMonitor] [107room-node-1] high disk watermark [90%] exceeded on [xRIeFFvgTMes53cAJzhcYQ][107room-node-1][/alidata/server/elasticsearch/data/nodes/0] free: 1.3gb[6.6%], shards will be relocated away from this node
[2019-12-07T01:05:08,827][INFO ][o.e.c.r.a.DiskThresholdMonitor] [107room-node-1] rerouting shards: [one or more nodes has gone under the high or low watermark]
[2019-12-07T03:19:25,815][WARN ][o.e.i.e.Engine ] [107room-node-1] [107room][3] failed engine [already closed by tragic event on the translog]
java.nio.file.NoSuchFileException: /alidata/server/elasticsearch/data/nodes/0/indices/4ZnhGezFTlqUwWV1hcvMKQ/3/translog/translog.ckp
at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86) ~[?:?]
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102) ~[?:?]

at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69) [elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.transport.TransportService$7.doRun(TransportService.java:627) [elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638) [elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-5.4.1.jar:5.4.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_161]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_161]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_161]
[2019-12-07T03:19:25,824][WARN ][o.e.i.c.IndicesClusterStateService] [107room-node-1] [[107room][3]] marking and sending shard failed due to [shard failure, reason [already closed by tragic event on the translog]]
java.nio.file.NoSuchFileException: /alidata/server/elasticsearch/data/nodes/0/indices/4ZnhGezFTlqUwWV1hcvMKQ/3/translog/translog.ckp
at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86) ~[?:?]
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102) ~[?:?]
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107) ~[?:?]
at sun.nio.fs.UnixFileSystemProvider.newFileChannel(UnixFileSystemProvider.java:177) ~[?:?]
at java.nio.channels.FileChannel.open(FileChannel.java:287) ~[?:1.8.0_161]
at java.nio.channels.FileChannel.open(FileChannel.java:335) ~[?:1.8.0_161]
at org.elasticsearch.index.translog.Checkpoint.write(Checkpoint.java:127) ~[elasticsearch-5.4.1.jar:5.4.1]

at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-5.4.1.jar:5.4.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_161]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_161]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_161]
[2019-12-07T03:19:25,826][WARN ][o.e.c.a.s.ShardStateAction] [107room-node-1] [107room][3] received shard failed for shard id [[107room][3]], allocation id [MqwuitbpTweTbVPquCRzDg], primary term [0], message [shard failure, reason [already closed by tragic event on the translog]], failure [NoSuchFileException[/alidata/server/elasticsearch/data/nodes/0/indices/4ZnhGezFTlqUwWV1hcvMKQ/3/translog/translog.ckp]]
java.nio.file.NoSuchFileException: /alidata/server/elasticsearch/data/nodes/0/indices/4ZnhGezFTlqUwWV1hcvMKQ/3/translog/translog.ckp
at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86) ~[?:?]
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102) ~[?:?]

at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.transport.TransportService$7.doRun(TransportService.java:627) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-5.4.1.jar:5.4.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_161]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_161]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_161]
[2019-12-07T03:19:25,864][INFO ][o.e.c.r.a.AllocationService] [107room-node-1] Cluster health status changed from [YELLOW] to [RED] (reason: [shards failed [[107room][3]] ...]).
[2019-12-07T03:19:25,974][WARN ][o.e.i.c.IndicesClusterStateService] [107room-node-1] [[107room][3]] marking and sending shard failed due to [failed recovery]
org.elasticsearch.indices.recovery.RecoveryFailedException: [107room][3]: Recovery failed on {107room-node-1}{xRIeFFvgTMes53cAJzhcYQ}{mElodbIhS96k-5uqnbX8WQ}{127.0.0.1}{127.0.0.1:9300}
at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$1(IndexShard.java:1490) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:569) [elasticsearch-5.4.1.jar:5.4.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_161]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_161]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_161]
Caused by: org.elasticsearch.index.shard.IndexShardRecoveryException: failed to recover from gateway
at org.elasticsearch.index.shard.StoreRecovery.internalRecoverFromStore(StoreRecovery.java:365) ~[elasticsearch-5.4.1.jar:5.4.1]

... 4 more
Caused by: org.elasticsearch.index.engine.EngineCreationFailureException: failed to create engine
at org.elasticsearch.index.engine.InternalEngine.<init>(InternalEngine.java:154) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.index.engine.InternalEngineFactory.newReadWriteEngine(InternalEngineFactory.java:25) ~[elasticsearch-5.4.1.jar:5.4.1]

at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$1(IndexShard.java:1486) ~[elasticsearch-5.4.1.jar:5.4.1]
... 4 more
Caused by: java.nio.file.NoSuchFileException: /alidata/server/elasticsearch-5.4.1/data/nodes/0/indices/4ZnhGezFTlqUwWV1hcvMKQ/3/translog/translog.ckp
at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86) ~[?:?]

at org.elasticsearch.index.shard.IndexShard.recoverFromStore(IndexShard.java:1238) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$1(IndexShard.java:1486) ~[elasticsearch-5.4.1.jar:5.4.1]
... 4 more
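The disk numbers in the log can be checked directly. A small Python sketch of the watermark arithmetic (partition size from the question, free space from the "free: 1.3gb[6.6%]" log line; 90% used is the default high watermark in 5.x):

```python
disk_total_gb = 20.0   # partition size stated in the question
free_gb = 1.3          # free space reported in the log
high_watermark = 0.90  # default high watermark: relocate shards above 90% used

# Elasticsearch itself holds only ~1.6 GB, so most of the partition is
# consumed by other data on the same disk.
used_fraction = 1.0 - free_gb / disk_total_gb
print(f"used: {used_fraction:.1%}")
print("high watermark breached:", used_fraction > high_watermark)
```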

Qiaoqing


Run POST /_cluster/allocation/explain?pretty and see what it returns.

ww107


The result is as follows:
{
  "index" : "107room",
  "shard" : 3,
  "primary" : false,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "INDEX_CREATED",
    "at" : "2019-12-09T08:01:27.620Z",
    "last_allocation_status" : "no_attempt"
  },
  "can_allocate" : "no",
  "allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes",
  "node_allocation_decisions" : [
    {
      "node_id" : "GxrdPo2bR1mTU8IeGA-u1A",
      "node_name" : "107room-node-1",
      "transport_address" : "127.0.0.1:9300",
      "node_decision" : "no",
      "weight_ranking" : 1,
      "deciders" : [
        {
          "decider" : "same_shard",
          "decision" : "NO",
          "explanation" : "the shard cannot be allocated to the same node on which a copy of the shard already exists [[107room][3], node[GxrdPo2bR1mTU8IeGA-u1A], [P], s[STARTED], a[id=lvy08tl2S9WamXLJl0jrmA]]"
        }
      ]
    }
  ]
}

Charele - Cisco4321


This doesn't really have anything to do with /_cluster/allocation/explain. With a single node, if the replica count isn't 0, the cluster is bound to be yellow.

From the look of it, files under the data directory seem to have been physically deleted???

Also, Elasticsearch is at 7.5 now; why still use a version as old as 5.4.1?
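On that first point: a single-node cluster stays yellow unless replicas are disabled. One way to do that (a sketch; the index name `107room` is taken from the post above, so verify it matches your setup) is the index update-settings API:

```
PUT /107room/_settings
{
  "index" : { "number_of_replicas" : 0 }
}
```

Note this only clears the yellow status; it does nothing about the red status caused by the missing translog file.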

locatelli


There is too little disk space. The log shows the high disk watermark has already been exceeded, so it cannot keep writing. You need a realistic estimate of your data volume.
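For reference, the 5.x defaults are 85% (low) and 90% (high) disk used. They can be raised temporarily via the cluster settings API while space is freed up (a stopgap sketch; the values are illustrative, and expanding or cleaning the disk is the real fix):

```
PUT /_cluster/settings
{
  "transient" : {
    "cluster.routing.allocation.disk.watermark.low" : "90%",
    "cluster.routing.allocation.disk.watermark.high" : "95%"
  }
}
```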

ww107


Thanks, everyone. I'll expand the disk and keep an eye on it for a while.
