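As the title says: yesterday the disk on one of our data nodes developed bad sectors. Now a shard is UNASSIGNED, reroute fails, and the error is RecoveryFailedException. Could the disk problem have left a lock behind? How can I recover the data? Any pointers would be appreciated.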
shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry,
[unassigned_info[[reason=ALLOCATION_FAILED], at[2021-02-05T20:21:45.275Z], failed_attempts[5], delayed=false,
details[
failed recovery, failure RecoveryFailedException[[index-12][1]: Recovery failed from {10.1.1.51_001}{JANcofrTS8qEInyXdQ73Kg}{eT3M0aTJRviQpQJLf9AJlg}{10.1.1.51}{10.1.1.51:9301} into {10.1.1.36_001}{Dt9ccKqGR96HXE9M4oqPRw}{qH5IjT9FTE2ummdlg0XR-g}{10.1.1.36}{10.1.1.36:9301}];
nested: RemoteTransportException[[10.1.1.51_001][10.1.1.51:9301][internal:index/shard/recovery/start_recovery]];
nested: RecoveryEngineException[Phase[1] phase1 failed];
nested: RecoverFilesRecoveryException[Failed to transfer [133] files with total size of [14.1gb]];
nested: RemoteTransportException[[10.1.1.36_001][10.1.1.36:9301][internal:index/shard/recovery/file_chunk]];
nested: IOException[Input/output error]; ], allocation_status[no_attempt]]]
2 replies
Ombres
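/_cluster/reroute?retry_failed=true: use reroute to reallocate the shard copies; this retries the allocations that failed earlier.
Watch the cluster first. There are roughly two cases:
1. A replica of the shard is lost: /_cluster/reroute?retry_failed=true will reallocate the replica automatically. Decide what to do next based on the errors it reports.
2. If both the primary and the replicas of the shard are lost, the data cannot be recovered and has to be reindexed. Use allocate_empty_primary to allocate an empty primary; the replicas will then be created automatically.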
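
For reference, the two cases above could be scripted roughly as follows. This is only a sketch: the `ES_URL` address, the helper names, and the `send` function are assumptions made for illustration and are not from the original post. Only the `/_cluster/reroute` endpoint, the `retry_failed` parameter, and the `allocate_empty_primary` command (which requires `accept_data_loss`) are Elasticsearch's own.

```python
import json
import urllib.request

# Assumption: address of any node in the cluster; adjust as needed.
ES_URL = "http://localhost:9200"


def retry_failed_request(base_url=ES_URL):
    """Case 1: build POST /_cluster/reroute?retry_failed=true."""
    return urllib.request.Request(
        f"{base_url}/_cluster/reroute?retry_failed=true",
        method="POST",
    )


def allocate_empty_primary_request(index, shard, node, base_url=ES_URL):
    """Case 2: build a reroute request that allocates an EMPTY primary.

    Whatever data the shard held is discarded, which is why
    Elasticsearch requires accept_data_loss to be set explicitly.
    """
    body = {
        "commands": [
            {
                "allocate_empty_primary": {
                    "index": index,
                    "shard": shard,
                    "node": node,
                    "accept_data_loss": True,
                }
            }
        ]
    }
    return urllib.request.Request(
        f"{base_url}/_cluster/reroute",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send a prepared request and return the parsed JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Usage would look like `send(retry_failed_request())` for case 1, or `send(allocate_empty_primary_request("index-12", 1, "<node-name>"))` for case 2, where `<node-name>` is a placeholder for a node whose disk is known to be healthy. Case 2 is a last resort, since the shard's previous contents are gone once the empty primary is allocated.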
rane - senior engineer on the rise