Manually allocating shards fails

By luohuanfeng | Published 2018-10-08 | 155 views

ES 6.3
Four machines: one is the master node, the other three are data nodes.
This morning I found the cluster status was yellow: one index had an unassigned replica shard. That index has 6 primary shards and 6 replica shards.
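For diagnosing a case like this, ES 5.x/6.x provides the cluster allocation explain API, which reports why a shard is unassigned. A sketch (with no body it explains the first unassigned shard it finds; this is generic usage, not something from the original post):

```
GET _cluster/allocation/explain
```

The response includes the `unassigned_info` and per-node allocation decisions, which is essentially the same information shown in the error below.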

I then tried to allocate it manually with the following request:
POST _cluster/reroute?retry_failed=true
{
  "commands": [
    {
      "allocate_replica": {
        "index": "api-dailyrolling-2018.09.28",
        "shard": 1,
        "node": "elk03"
      }
    }
  ]
}

It failed with the following error:



"reason": "[allocate_replica] allocation of [api-dailyrolling-2018.09.28][1] on node {elk03}{TnpqkUYpQYeSawD9VkEogQ}{Q9_R1OpsQg-HojucR7tQOg}{172.16.10.223}{172.16.10.223:9300}{ml.machine_memory=135355260928, rack=a082, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true} is not allowed, reason: [NO(shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2018-09-28T07:57:49.862Z], failed_attempts[5], delayed=false, details[failed shard on node [-J0CYxPkTLGt_nJouKEHrw]: failed recovery, failure RecoveryFailedException[[api-dailyrolling-2018.09.28][1]: Recovery failed from {elk04}{tqzo_eGlQCSlREAWVR46ow}{1jq1D20eSy66SeXfRJa54Q}{172.16.10.224}{172.16.10.224:9300}{ml.machine_memory=135355260928, rack=a082, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true} into {elk05}{-J0CYxPkTLGt_nJouKEHrw}{e_yPSA8yQ3O9ONZJSxhhsA}{172.16.10.225}{172.16.10.225:9300}{ml.machine_memory=135355260928, rack=a082, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true}]; nested: RemoteTransportException[[elk04][172.16.10.224:9300][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[2] phase2 failed]; nested: RemoteTransportException[[elk05][172.16.10.225:9300][internal:index/shard/recovery/translog_ops]]; nested: CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [22411117510/20.8gb], which is larger than the limit of [22408154316/20.8gb]]; ], allocation_status[no_attempt]]])][YES(primary shard for this replica is already active)][YES(explicitly ignoring any disabling of allocation due to manual allocation commands via the reroute API)][YES(can allocate replica shard to a node with version [6.3.0] since this is equal-or-newer than the primary version [6.3.0])][YES(the shard is not being snapshotted)][YES(ignored as shard is not being recovered from a 
snapshot)][YES(node passes include/exclude/require filters)][YES(the shard does not exist on the same node)][YES(enough disk for shard on node, free: [2.8tb], shard size: [0b], free after allocating shard: [2.8tb])][YES(below shard recovery limit of outgoing: [0 < 2] incoming: [0 < 2])][YES(total shard limits are disabled: [index: -1, cluster: -1] <= 0)][YES(allocation awareness is not enabled, set cluster setting [cluster.routing.allocation.awareness.attributes] to enable it)]"
 


I see "Data too large" in the error, but I'm not sure whether that is the real problem. I tried clearing the caches as suggested online, but it didn't actually help.
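For reference, clearing the caches of the affected index is typically done with the clear-cache API (a generic sketch; the original post doesn't say exactly which request was used):

```
POST api-dailyrolling-2018.09.28/_cache/clear
```

Note this only clears query/request/fielddata caches; it does not release the in-flight transport request memory that the circuit breaker in the error above is accounting for, which may be why it had no effect.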

zqc0512 - andy zhou


"which is larger than the limit of [22408154316/20.8gb]" — looks like a single shard has exceeded some limit. As I recall the cap is a bit over 2 billion, and yours shows 22 billion-plus, so it refuses to proceed. Try setting the replica count +1 and then -1.
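The replica toggle this answer suggests amounts to changing `index.number_of_replicas` through the index settings API; a common variant is dropping it to 0 and then back to 1 to force a fresh replica copy from the primary (a sketch, not verified against this cluster):

```
PUT api-dailyrolling-2018.09.28/_settings
{ "index": { "number_of_replicas": 0 } }

PUT api-dailyrolling-2018.09.28/_settings
{ "index": { "number_of_replicas": 1 } }
```

This discards the stuck replica and rebuilds it, so it only helps if the underlying cause (here, the circuit breaker tripping during recovery) no longer applies.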

yayg2008


In your case, the `network.breaker.inflight_requests.limit` circuit breaker tripped during replica recovery, aborting phase 2 of recovery (the translog replay stage). This breaker's default limit is 100% of the heap.
To avoid the problem, reduce the amount of translog to replay by flushing promptly.
See https://elasticsearch.cn/article/698
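Flushing trims the translog so there is less data to replay (and hold in memory) during phase 2 of replica recovery, and breaker usage can be checked via node stats. A sketch of both (6.x still supports synced flush; a plain `_flush` also works):

```
POST api-dailyrolling-2018.09.28/_flush/synced

GET _nodes/stats/breaker
```

The breaker stats show per-node `estimated_size` versus `limit_size` for the parent and in-flight-requests breakers, which makes it easy to see how close each node is to tripping again.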

luohuanfeng


Newly created indices are fine; only that day's index has the problem.
