
A single machine failed and the whole cluster became unavailable. Strange exceptions.

Elasticsearch | Author: famoss | Posted 2018-07-26 | Views: 4493

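All of my indices are configured with 2 replicas, but yesterday a single machine failed and the whole cluster went down. The master kept logging the warnings below, and the cluster did not recover.
ES version: 5.2.2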


[2018-07-25T21:26:03,713][WARN ][o.e.c.a.s.ShardStateAction] [es1.1.master] [flume-hour20-2018-07-25][92] received shard failed for shard id [[flume-hour20-2018-07-25][92]], allocation id [YiY7_m-JQVypIXjoiPknOg], primary term [2], message [mark copy as stale]
[2018-07-25T21:26:03,714][WARN ][o.e.c.a.s.ShardStateAction] [es1.1.master] [flume-hour20-2018-07-25][65] received shard failed for shard id [[flume-hour20-2018-07-25][65]], allocation id [L_Q8UF-TQ5eM7PWMrezVJQ], primary term [2], message [mark copy as stale]
[2018-07-25T21:26:03,714][WARN ][o.e.c.a.s.ShardStateAction] [es1.1.master] [flume-hour20-2018-07-25][68] received shard failed for shard id [[flume-hour20-2018-07-25][68]], allocation id [TILQZxatSciBFuXtFMlT5w], primary term [1], message [mark copy as stale]
[2018-07-25T21:26:03,715][WARN ][o.e.c.a.s.ShardStateAction] [es1.1.master] [flume-hour20-2018-07-25][111] received shard failed for shard id [[flume-hour20-2018-07-25][111]], allocation id [t1v3JsoaRYO-R1PJNAfHNA], primary term [1], message [mark copy as stale]
[2018-07-25T21:26:03,715][WARN ][o.e.c.a.s.ShardStateAction] [es1.1.master] [flume-hour20-2018-07-25][131] received shard failed for shard id [[flume-hour20-2018-07-25][131]], allocation id [RaWA0kjjSe-feEoP6PY2mw], primary term [1], message [mark copy as stale]
[2018-07-25T21:26:03,716][WARN ][o.e.c.a.s.ShardStateAction] [es1.1.master] [flume-hour20-2018-07-25][19] received shard failed for shard id [[flume-hour20-2018-07-25][19]], allocation id [iu4707o9SkmxLVlYyD7UBA], primary term [1], message [mark copy as stale]
[2018-07-25T21:26:03,716][WARN ][o.e.c.a.s.ShardStateAction] [es1.1.master] [flume-hour20-2018-07-25][131] received shard failed for shard id [[flume-hour20-2018-07-25][131]], allocation id [RaWA0kjjSe-feEoP6PY2mw], primary term [1], message [mark copy as stale]
[2018-07-25T21:26:03,717][WARN ][o.e.c.a.s.ShardStateAction] [es1.1.master] [flume-hour20-2018-07-25][52] received shard failed for shard id [[flume-hour20-2018-07-25][52]], allocation id [HUCdOBhjTlOD73kpBDu6yQ], primary term [1], message [mark copy as stale]


 
EDIT:
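I went through the logs carefully today. The Elasticsearch process on the failed data node had not actually died. Its CPU load was high, but it could still be pinged occasionally, so the master never removed the node from the cluster, and writes could not go through. This had nothing to do with the warnings above. The cluster turned red because I had forgotten about one old index that had no replicas. When the problem hit I panicked and only looked at this log flooding the screen.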
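For anyone who wants to catch a forgotten no-replica index before a node fails, here is a minimal sketch in Python. It assumes you have already fetched the index settings (for example with GET /_all/_settings) and parsed the JSON into a dict. The function name and the sample data, including the index name old-index, are made up for illustration.

```python
def indices_without_replicas(settings_response):
    """Return the sorted names of indices whose number_of_replicas is 0.

    settings_response is assumed to be the parsed JSON of an index
    settings response: {index_name: {"settings": {"index": {...}}}}.
    """
    unprotected = []
    for name, body in settings_response.items():
        index_settings = body.get("settings", {}).get("index", {})
        # Setting values come back as strings (e.g. "2"), so convert.
        # Falling back to 1 when the key is missing is an assumption.
        replicas = int(index_settings.get("number_of_replicas", 1))
        if replicas == 0:
            unprotected.append(name)
    return sorted(unprotected)


# Hypothetical sample data, shaped like a settings response.
sample = {
    "flume-hour20-2018-07-25": {
        "settings": {"index": {"number_of_replicas": "2"}}
    },
    "old-index": {
        "settings": {"index": {"number_of_replicas": "0"}}
    },
}

print(indices_without_replicas(sample))
```

Any index this reports will go red as soon as the node holding one of its primary shards becomes unavailable, no matter how many replicas the other indices have.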

rochy - rochy_he


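Could you post your configuration file?
How many master-eligible nodes have you configured, and what is the minimum required?
If minimum_master_nodes is set to 2 and only one master-eligible node is currently up, the whole cluster will be unavailable.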

zqc0512 - andy zhou


From the logs, it looks like the cluster is recovering.
It should be usable.

laoyang360 - Author of 《一本书讲透Elasticsearch》, Elastic Certified Engineer. [死磕Elasitcsearch] Knowledge Planet: http://t.cn/RmwM3N9; WeChat official account: 铭毅天下; blog: https://elastic.blog.csdn.net


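Building on the first reply: if the minimum number of master nodes is 2 and one of your master-eligible nodes goes down, the cluster can fail to come back up.
The minimum master nodes setting exists to prevent split brain. For a 3-node cluster, I recommend making every node both master-eligible and a data node, with the minimum number of master nodes set to 2. That way, if one node goes down, the other two still satisfy the requirement and the cluster keeps running normally.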
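The arithmetic behind that recommendation can be sketched in a few lines of Python. This only illustrates the usual majority rule for the 5.x setting discovery.zen.minimum_master_nodes, (master-eligible nodes / 2) + 1. It is not how Elasticsearch itself is implemented, and both function names are made up.

```python
def quorum(master_eligible_nodes):
    """Majority of master-eligible nodes: floor(n / 2) + 1."""
    return master_eligible_nodes // 2 + 1


def can_elect_master(alive_master_eligible, minimum_master_nodes):
    """A master can be elected only if enough master-eligible nodes are up."""
    return alive_master_eligible >= minimum_master_nodes


# 3 master-eligible nodes: the quorum is 2, so losing one node is tolerated.
print(quorum(3))
print(can_elect_master(alive_master_eligible=2, minimum_master_nodes=2))

# Only 2 master-eligible nodes with the minimum set to 2:
# losing one leaves the cluster without a master.
print(can_elect_master(alive_master_eligible=1, minimum_master_nodes=2))
```

This is why 3 master-eligible nodes with the minimum set to 2 survives a single node failure, while 2 master-eligible nodes with the same setting does not.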
