A shard of an index cannot be recovered

Elasticsearch | Author: es_newbee | Posted 2018-04-18 | Views: 22856

ES version is 5.2.2. _cluster/allocation/explain shows:

{
  "node_id": "VCMPiqWZSYW4hnNDj_NExg",
  "node_name": "es-1",
  "transport_address": "127.0.0.1:9300",
  "node_attributes": {
    "tag": "warm"
  },
  "node_decision": "no",
  "store": {
    "in_sync": true,
    "allocation_id": "zZHXkwouS_SPJUmLzg3nWQ",
    "store_exception": {
      "type": "shard_lock_obtain_failed_exception",
      "reason": "[test-test][3]: obtaining shard lock timed out after 5000ms",
      "index_uuid": "HI8Z5vAdTqmM8rfw_JT0Lw",
      "shard": "3",
      "index": "test-test"
    }
  },
  "deciders": [
    {
      "decider": "max_retry",
      "decision": "NO",
      "explanation": "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2018-04-18T07:09:02.434Z], failed_attempts[10], delayed=false, details[failed to create shard, failure IOException[failed to obtain in-memory shard lock]; nested: ShardLockObtainFailedException[test-test][3]: obtaining shard lock timed out after 5000ms]; ], allocation_status[deciders_no]]]"
    }
  ]
}

I tried calling /_cluster/reroute?retry_failed=true, but the result is still the same. I also tried:
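For anyone who wants to script these two calls, here is a minimal Python sketch using only the standard library. The base URL `http://127.0.0.1:9200` is an assumption (the post only shows the transport address), and the helper names are made up for illustration. The first two functions only build the requests; nothing is sent until `send` is called.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:9200"  # assumption: HTTP endpoint of any node in the cluster


def retry_failed_request(base_url=BASE_URL):
    """Build POST /_cluster/reroute?retry_failed=true with an empty body."""
    return urllib.request.Request(
        base_url + "/_cluster/reroute?retry_failed=true",
        data=b"",
        method="POST",
    )


def explain_request(index, shard, primary=True, base_url=BASE_URL):
    """Build an allocation explain request for one specific shard copy."""
    body = json.dumps({"index": index, "shard": shard, "primary": primary})
    return urllib.request.Request(
        base_url + "/_cluster/allocation/explain",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send a prepared request and return the parsed JSON response."""
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `send(retry_failed_request())` and then `send(explain_request("test-test", 3))` would retry the allocation and then fetch the explanation for the primary of shard 3.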

POST /_cluster/reroute?pretty
{
    "commands" : [ {
        "allocate_stale_primary" : {
            "index" : "test-test", "shard" : 3,
            "node" : "es-2",
            "accept_data_loss" : true
        }
    } ]
}

This also reports a failure:

"unassigned_info": {
  "reason": "ALLOCATION_FAILED",
  "at": "2018-04-18T07:09:02.434Z",
  "failed_attempts": 10,
  "delayed": false,
  "details": "failed to create shard, failure IOException[failed to obtain in-memory shard lock]; nested: ShardLockObtainFailedException[[test-test][3]: obtaining shard lock timed out after 5000ms]; ",
  "allocation_status": "deciders_no"
}

The only thing that works is allocate_empty_primary, but that loses all the data in the shard. I searched around and could not find a good solution.
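To make the difference between the two forced allocations explicit: allocate_stale_primary promotes the copy that already exists on the target node (possibly missing recent writes), while allocate_empty_primary starts the shard from scratch. A small Python sketch of a helper that builds either reroute body and refuses to do so unless data loss is acknowledged; the function name and the guard are illustrative, not part of any Elasticsearch client.

```python
import json


def allocate_primary_command(index, shard, node, *, empty=False, accept_data_loss=False):
    """Return the JSON body for a forced primary allocation via /_cluster/reroute.

    empty=False -> allocate_stale_primary (reuses the on-disk copy on `node`)
    empty=True  -> allocate_empty_primary (starts from an empty shard)
    """
    if not accept_data_loss:
        # Both commands can lose data, so require an explicit opt-in.
        raise ValueError("pass accept_data_loss=True to confirm possible data loss")
    name = "allocate_empty_primary" if empty else "allocate_stale_primary"
    command = {
        name: {
            "index": index,
            "shard": shard,
            "node": node,
            "accept_data_loss": True,
        }
    }
    return json.dumps({"commands": [command]}, indent=2)
```

`allocate_primary_command("test-test", 3, "es-2", accept_data_loss=True)` produces a body equivalent to the request shown earlier in the question.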

5 replies

kennywu76 - Wood

Upvoted by: laoyang360, abia, cccthought, 小风, SpadeKing, Atom, dragon434

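This usually happens when a node briefly leaves the cluster and rejoins right away while a thread is still running a long operation against one of its shards, such as a bulk or scroll request. When the node rejoins, the shard lock has not yet been released, so the master cannot allocate the shard. Normally /_cluster/reroute?retry_failed=true fixes it. If, as you say, it still does not, something else may be keeping the thread that holds the lock busy on that shard for so long that the lock is never released (a long GC, perhaps?).

If retry_failed does not help, you can try allocate_stale_primary, provided you know which node holds the primary copy of this shard. If nothing works and you do not want to lose data, you can also restart that node; the in-memory lock should then be released.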
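One way to find out which node holds a usable copy is to read the `store` section of the allocation explain output, as in the snippet from the question. A rough Python sketch follows. It assumes the full response lists per-node entries under a `node_allocation_decisions` key (the question only shows a single entry, so that key name is an assumption), and the helper names are made up.

```python
def store_copies(explain_response):
    """Summarise the on-disk shard copies reported by allocation explain."""
    copies = []
    for decision in explain_response.get("node_allocation_decisions", []):
        store = decision.get("store")
        if store is None:
            continue  # no store information reported for this node
        exception = store.get("store_exception") or {}
        copies.append({
            "node": decision.get("node_name"),
            "in_sync": bool(store.get("in_sync")),
            "store_exception": exception.get("type"),
        })
    return copies


def in_sync_nodes(explain_response):
    """Names of nodes whose copy is marked in-sync."""
    return [c["node"] for c in store_copies(explain_response) if c["in_sync"]]
```

A node that shows `in_sync: true` together with a `shard_lock_obtain_failed_exception`, like es-1 in the question, has the data on disk but cannot open it yet because the lock is still held.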

yayg2008

Did you manually touch the ES files on disk, or force-kill the ES process?
The current error is that the shard lock cannot be obtained.
Source code:

    /**
     * Tries to lock the given shards ID. A shard lock is required to perform any kind of
     * write operation on a shards data directory like deleting files, creating a new index writer
     * or recover from a different shard instance into it. If the shard lock can not be acquired
     * a {@link ShardLockObtainFailedException} is thrown
     * @param shardId the shard ID to lock
     * @param lockTimeoutMS the lock timeout in milliseconds
     * @return the shard lock. Call {@link ShardLock#close()} to release the lock
     */
    public ShardLock shardLock(final ShardId shardId, long lockTimeoutMS) throws ShardLockObtainFailedException {
        logger.trace("acquiring node shardlock on [{}], timeout [{}]", shardId, lockTimeoutMS);
        final InternalShardLock shardLock;
        final boolean acquired;
        synchronized (shardLocks) {
            if (shardLocks.containsKey(shardId)) {
                shardLock = shardLocks.get(shardId);
                shardLock.incWaitCount();
                acquired = false;
            } else {
                shardLock = new InternalShardLock(shardId);
                shardLocks.put(shardId, shardLock);
                acquired = true;
            }
        }
        if (acquired == false) {
            boolean success = false;
            try {
                shardLock.acquire(lockTimeoutMS);
                success = true;
            } finally {
                if (success == false) {
                    shardLock.decWaitCount();
                }
            }
        }
        logger.trace("successfully acquired shardlock for [{}]", shardId);
        return new ShardLock(shardId) { // new instance prevents double closing
            @Override
            protected void closeInternal() {
                shardLock.release();
                logger.trace("released shard lock for [{}]", shardId);
            }
        };
    }
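The behaviour of this method is easy to reproduce in a toy model: the first caller for a shard gets the lock immediately, and anyone else waits up to the timeout and then fails. The Python sketch below is only an illustration of that pattern, not a port of the Elasticsearch code. It skips the wait counting and never removes entries from the map, and all class and method names are invented.

```python
import threading


class ShardLockObtainFailed(Exception):
    """Stand-in for ShardLockObtainFailedException."""


class ShardLockRegistry:
    """Toy model of per-shard in-memory locks: one holder, waiters time out."""

    def __init__(self):
        self._guard = threading.Lock()  # plays the role of synchronized (shardLocks)
        self._locks = {}                # shard id -> threading.Lock

    def acquire(self, shard_id, timeout_s):
        # Look up or create the per-shard lock under the registry guard.
        with self._guard:
            lock = self._locks.setdefault(shard_id, threading.Lock())
        # Wait for the lock outside the guard so other shards are not blocked.
        if not lock.acquire(timeout=timeout_s):
            raise ShardLockObtainFailed(
                f"{shard_id}: obtaining shard lock timed out after {round(timeout_s * 1000)}ms"
            )

    def release(self, shard_id):
        with self._guard:
            self._locks[shard_id].release()
```

In this model, as long as whoever acquired `[test-test][3]` first has not called `release`, every later `acquire` for that shard fails after the timeout, which is the situation the error in the question describes.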

JackGe

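I also ran into a failure to obtain the lock: LockObtainFailedException[Can't lock shard [xxxx][5], timed out after 5000ms].
What I did at the time: first stop the job that writes data to ES, close the index (POST xxxx/_close), wait a while, then reopen it (POST xxxx/_open) and watch the state of its shards (GET _cat/shards/xxxx). The index then recovered, and finally I re-enabled the writing job.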
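The close and reopen procedure above can be written down as an ordered list of requests plus a check on the `_cat/shards` output. A minimal Python sketch: the parser assumes the default `_cat/shards` column order (index, shard, prirep, state, ...), and both function names are made up. Keep in mind that the index cannot serve reads or writes while it is closed.

```python
def close_open_steps(index):
    """Ordered (method, path) pairs for the close and reopen procedure."""
    return [
        ("POST", f"/{index}/_close"),
        ("POST", f"/{index}/_open"),
        ("GET", f"/_cat/shards/{index}"),
    ]


def unassigned_shards(cat_shards_text):
    """Return (index, shard, prirep) for every UNASSIGNED line of _cat/shards output."""
    result = []
    for line in cat_shards_text.splitlines():
        columns = line.split()
        # Unassigned shards have no docs/store/ip/node columns, so only four fields remain.
        if len(columns) >= 4 and columns[3] == "UNASSIGNED":
            result.append((columns[0], int(columns[1]), columns[2]))
    return result
```

Polling the last step until `unassigned_shards(...)` comes back empty is one way to decide when it is safe to restart the writing job.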

guopeng7216 - post-90s ops engineer

[2018-04-20T17:23:11,302][INFO ][o.e.i.s.TransportNodesListShardStoreMetaData] [ES-node1-Prod-alidc-0gow] [user][0]: failed to obtain shard lock
org.elasticsearch.env.ShardLockObtainFailedException: [user][0]: obtaining shard lock timed out after 5000ms
at org.elasticsearch.env.NodeEnvironment$InternalShardLock.acquire(NodeEnvironment.java:668) ~[elasticsearch-6.1.0.jar:6.1.0]
at org.elasticsearch.env.NodeEnvironment.shardLock(NodeEnvironment.java:587) ~[elasticsearch-6.1.0.jar:6.1.0]
at org.elasticsearch.index.store.Store.readMetadataSnapshot(Store.java:430) [elasticsearch-6.1.0.jar:6.1.0]
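Did you ever find a fix that works? Stopping writes to ES, as suggested above, is not really realistic, since the business has to keep running. I also tried the method above: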
curl -XPOST '172.19.6.127:9200/_cluster/reroute?retry_failed=true&pretty'
but it had no effect.

shwtz - an IT guy who studied physics and wants to be an actor

Ran into the same problem; after a reroute the shard seems to have been allocated.
