A shard of an index cannot be recovered

Elasticsearch | Author: es_newbee | Posted 2018-04-18 | Views: 24377

ES version is 5.2.2. _cluster/allocation/explain shows:
{
  "node_id": "VCMPiqWZSYW4hnNDj_NExg",
  "node_name": "es-1",
  "transport_address": "127.0.0.1:9300",
  "node_attributes": {
    "tag": "warm"
  },
  "node_decision": "no",
  "store": {
    "in_sync": true,
    "allocation_id": "zZHXkwouS_SPJUmLzg3nWQ",
    "store_exception": {
      "type": "shard_lock_obtain_failed_exception",
      "reason": "[test-test][3]: obtaining shard lock timed out after 5000ms",
      "index_uuid": "HI8Z5vAdTqmM8rfw_JT0Lw",
      "shard": "3",
      "index": "test-test"
    }
  },
  "deciders": [
    {
      "decider": "max_retry",
      "decision": "NO",
      "explanation": "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2018-04-18T07:09:02.434Z], failed_attempts[10], delayed=false, details[failed to create shard, failure IOException[failed to obtain in-memory shard lock]; nested: ShardLockObtainFailedException[test-test][3]: obtaining shard lock timed out after 5000ms]; ], allocation_status[deciders_no]]]"
    }
  ]
}

I tried calling /_cluster/reroute?retry_failed=true, but the result stayed the same. I also tried:
POST /_cluster/reroute?pretty
{
  "commands": [
    {
      "allocate_stale_primary": {
        "index": "test-test",
        "shard": 3,
        "node": "es-2",
        "accept_data_loss": true
      }
    }
  ]
}
That fails as well:
"unassigned_info": {
  "reason": "ALLOCATION_FAILED",
  "at": "2018-04-18T07:09:02.434Z",
  "failed_attempts": 10,
  "delayed": false,
  "details": "failed to create shard, failure IOException[failed to obtain in-memory shard lock]; nested: ShardLockObtainFailedException[[test-test][3]: obtaining shard lock timed out after 5000ms]; ",
  "allocation_status": "deciders_no"
}
The only thing that works is allocate_empty_primary, but that loses the shard's data completely. I searched around and could not find a good solution.

kennywu76 - Wood

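This usually happens when a node briefly leaves the cluster and rejoins right away while a thread is still running a long operation against one of its shards, such as a bulk write or a scroll. When the node rejoins, the shard lock has not been released yet, so the master cannot allocate the shard. /_cluster/reroute?retry_failed=true normally fixes this. If, as you say, it still does not, something else may be keeping the thread that holds the lock busy on that shard for a long time so that it cannot release the lock (a long GC pause, perhaps?).

If retry_failed does not help, you can try allocate_stale_primary, provided you know which node holds the shard's primary copy. If that fails too and you do not want to lose data, you can restart that node; the in-memory lock should then be released.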
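
The first two steps of that escalation can be sketched as a small request builder. This is only an illustration: the class and method names are hypothetical, the index, shard, and node values are copied from the question, and the body is assembled by plain string concatenation on the assumption that none of the values contain characters that need JSON escaping.

```java
// Hypothetical helper, not part of any Elasticsearch client library.
final class RerouteRequests {
    private RerouteRequests() {}

    /** Step 1: path of the POST that retries shards blocked by the max_retry decider. */
    static String retryFailedPath() {
        return "/_cluster/reroute?retry_failed=true";
    }

    /**
     * Step 2: body for POST /_cluster/reroute that promotes a stale copy to primary.
     * Assumes index and node contain no characters that need JSON escaping.
     */
    static String allocateStalePrimary(String index, int shard, String node) {
        return "{\"commands\":[{\"allocate_stale_primary\":{"
                + "\"index\":\"" + index + "\","
                + "\"shard\":" + shard + ","
                + "\"node\":\"" + node + "\","
                + "\"accept_data_loss\":true}}]}";
    }

    public static void main(String[] args) {
        System.out.println("POST " + retryFailedPath());
        System.out.println("POST /_cluster/reroute");
        System.out.println(allocateStalePrimary("test-test", 3, "es-2"));
    }
}
```

The third step, restarting the node that still holds the lock, is an operational action and is not modelled here.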

yayg2008

Did you manually touch ES's files on disk, or force-kill the ES process?
The current error is a failure to obtain the shard lock.
Source code:
/**
 * Tries to lock the given shards ID. A shard lock is required to perform any kind of
 * write operation on a shards data directory like deleting files, creating a new index writer
 * or recover from a different shard instance into it. If the shard lock can not be acquired
 * a {@link ShardLockObtainFailedException} is thrown
 * @param shardId the shard ID to lock
 * @param lockTimeoutMS the lock timeout in milliseconds
 * @return the shard lock. Call {@link ShardLock#close()} to release the lock
 */
public ShardLock shardLock(final ShardId shardId, long lockTimeoutMS) throws ShardLockObtainFailedException {
    logger.trace("acquiring node shardlock on [{}], timeout [{}]", shardId, lockTimeoutMS);
    final InternalShardLock shardLock;
    final boolean acquired;
    synchronized (shardLocks) {
        if (shardLocks.containsKey(shardId)) {
            shardLock = shardLocks.get(shardId);
            shardLock.incWaitCount();
            acquired = false;
        } else {
            shardLock = new InternalShardLock(shardId);
            shardLocks.put(shardId, shardLock);
            acquired = true;
        }
    }
    if (acquired == false) {
        boolean success = false;
        try {
            shardLock.acquire(lockTimeoutMS);
            success = true;
        } finally {
            if (success == false) {
                shardLock.decWaitCount();
            }
        }
    }
    logger.trace("successfully acquired shardlock for [{}]", shardId);
    return new ShardLock(shardId) { // new instance prevents double closing
        @Override
        protected void closeInternal() {
            shardLock.release();
            logger.trace("released shard lock for [{}]", shardId);
        }
    };
}
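
To make the timeout behaviour easier to see, here is a simplified model of what the method above does when a lock for the shard already exists. It is not Elasticsearch code: the class name is hypothetical, and it assumes the per-shard lock behaves like a single-permit semaphore with a timed acquire. The timeout is shortened from 5000ms so the demo finishes quickly.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Hypothetical, simplified stand-in for a per-shard in-memory lock.
final class ShardLockSketch {
    // One permit: at most one holder per shard at any time.
    private final Semaphore mutex = new Semaphore(1);

    /** Waits up to timeoutMs for the lock; returns false if it is still held. */
    boolean tryLock(long timeoutMs) {
        try {
            return mutex.tryAcquire(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            return false;
        }
    }

    /** Releases the lock so that a later attempt can succeed. */
    void unlock() {
        mutex.release();
    }

    public static void main(String[] args) {
        ShardLockSketch lock = new ShardLockSketch();
        boolean holder = lock.tryLock(50);    // e.g. a long-running operation owns the shard
        boolean allocation = lock.tryLock(50); // the allocation attempt times out
        lock.unlock();                         // the holder finally releases the lock
        boolean retry = lock.tryLock(50);      // a later retry succeeds
        System.out.println(holder + " " + allocation + " " + retry);
    }
}
```

Running it prints `true false true`. In this model the second attempt fails purely because the first holder has not released the lock, which matches the symptom in the question: retries keep failing for as long as something on the node still holds the lock.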

JackGe

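I have also run into a lock-acquisition error: LockObtainFailedException[Can't lock shard [xxxx][5], timed out after 5000ms].
What I did at the time: first I stopped the job that writes data into ES, then closed the index (POST xxxx/_close), waited a while, reopened it (POST xxxx/_open), and watched its state (GET _cat/shards/xxxx). The index recovered, and finally I re-enabled the write job.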
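
The "watch its state" step can be automated with a small check on the text returned by GET _cat/shards/xxxx. The sketch below is an illustration only: the class name is hypothetical, it assumes the default column order (index, shard, prirep, state, ...) without a header row, and the sample rows in `main` are made up rather than taken from a real cluster.

```java
// Hypothetical helper for checking _cat/shards output; not part of any client library.
final class CatShardsCheck {
    private CatShardsCheck() {}

    /**
     * Returns true only if there is at least one shard row and every row
     * reports STARTED in the fourth column (default _cat/shards layout assumed).
     */
    static boolean allStarted(String catShardsOutput) {
        boolean sawShard = false;
        for (String line : catShardsOutput.split("\n")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // ignore blank lines
            }
            String[] cols = trimmed.split("\\s+");
            if (cols.length < 4 || !"STARTED".equals(cols[3])) {
                return false;
            }
            sawShard = true;
        }
        return sawShard;
    }

    public static void main(String[] args) {
        // Made-up sample rows, for illustration only.
        String stuck = "xxxx 5 p UNASSIGNED\nxxxx 4 p STARTED 10 1kb 127.0.0.1 es-1\n";
        String healthy = "xxxx 5 p STARTED 12 2kb 127.0.0.1 es-1\n";
        System.out.println(allStarted(stuck));
        System.out.println(allStarted(healthy));
    }
}
```

A write job could be re-enabled once such a check returns true for the reopened index.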

guopeng7216 - post-90s ops engineer

[2018-04-20T17:23:11,302][INFO ][o.e.i.s.TransportNodesListShardStoreMetaData] [ES-node1-Prod-alidc-0gow] [user][0]: failed to obtain shard lock
org.elasticsearch.env.ShardLockObtainFailedException: [user][0]: obtaining shard lock timed out after 5000ms
at org.elasticsearch.env.NodeEnvironment$InternalShardLock.acquire(NodeEnvironment.java:668) ~[elasticsearch-6.1.0.jar:6.1.0]
at org.elasticsearch.env.NodeEnvironment.shardLock(NodeEnvironment.java:587) ~[elasticsearch-6.1.0.jar:6.1.0]
at org.elasticsearch.index.store.Store.readMetadataSnapshot(Store.java:430) [elasticsearch-6.1.0.jar:6.1.0]
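Did you ever find an effective fix? Stopping writes to ES was suggested above, but that is not realistic for us because the business has to keep running. I also tried the method above: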
curl -XPOST '172.19.6.127:9200/_cluster/reroute?retry_failed=true&pretty'
but it had no effect.

shwtz - an IT guy who studied physics and wants to be an actor

I ran into the same problem. After a reroute, the shard appears to be allocatable again.
