Preface: a while back I scaled down the CPUs on two of the servers in an ES cluster and forgot to start them again afterwards. Today I checked and found the cluster status had turned yellow.
Once the nodes rejoin, Elasticsearch heals itself; the process mainly consumes disk I/O and network bandwidth. While the rebuild is in progress, a yellow status is temporary: just wait a while before judging whether anything is actually wrong.
Problem Analysis
We can run these queries in Kibana's Dev Tools; any other ES management tool works just as well.
1. Check the cluster health
GET /_cluster/health?pretty
{
"cluster_name" : "elastic-apm",
"status" : "yellow",
"timed_out" : false,
"number_of_nodes" : 3,
"number_of_data_nodes" : 3,
"active_primary_shards" : 114,
"active_shards" : 206,
"relocating_shards" : 0,
"initializing_shards" : 2,
"unassigned_shards" : 20,
"delayed_unassigned_shards" : 0,
"number_of_pending_tasks" : 0,
"number_of_in_flight_fetch" : 0,
"task_max_waiting_in_queue_millis" : 0,
"active_shards_percent_as_number" : 90.35087719298247
}
The two fields to focus on are initializing_shards and unassigned_shards.
initializing_shards is the number of shards currently being initialized, and unassigned_shards is the number of shard copies (replicas, in this case) that have not been allocated to any node.
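If you want to see exactly which shards are unassigned and why, the _cat/shards API can list them directly; the column selection below is just one convenient choice (unassigned.reason is a built-in column):
GET _cat/shards?v&h=index,shard,prirep,state,unassigned.reason&s=state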
2. Check the index status
GET _cat/indices
yellow open apm-7.16.2-error-000002 k4R-iBq-SMGQEfWKJmuBsg 1 1 29566 0 63.3mb 63.3mb
yellow open apm-7.16.2-error-000003 6NsYFOVHTHyb4j6P1RkztQ 1 1 31811 0 66.4mb 66.4mb
yellow open apm-7.16.2-error-000004 WjOYXanpS6Wmv3yngnp8FQ 1 1 25679 0 52.9mb 52.9mb
.......
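Since only the problem indices matter here, _cat/indices also accepts a health filter, which keeps the listing short on a large cluster:
GET _cat/indices?v&health=yellow&s=index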
3. Analyze why the shards are unassigned
GET /_cluster/allocation/explain
{
"note" : "No shard was specified in the explain API request, so this response explains a randomly chosen unassigned shard. There may be other unassigned shards in this cluster which cannot be assigned for different reasons. It may not be possible to assign this shard until one of the other shards is assigned correctly. To explain the allocation of other shards (whether assigned or unassigned) you must specify the target shard in the request to this API.",
"index" : "apm-7.16.2-span-000006",
"shard" : 0,
"primary" : false,
"current_state" : "unassigned",
"unassigned_info" : {
"reason" : "NODE_LEFT",
"at" : "2025-08-14T11:42:26.309Z",
"details" : "node_left [sr-o2sCnSIuTp-mSEKPbTw]",
"last_allocation_status" : "no_attempt"
},
"can_allocate" : "throttled",
"allocate_explanation" : "allocation temporarily throttled",
"node_allocation_decisions" : [
{
"node_id" : "IOKXMDeKRGqFBLFDIFHT0w",
"node_name" : "node-1",
"transport_address" : "xxxx",
"node_attributes" : {
"ml.machine_memory" : "67386658816",
"ml.max_open_jobs" : "512",
"xpack.installed" : "true",
"ml.max_jvm_size" : "8589934592",
"transform.node" : "true"
},
"node_decision" : "throttled",
"deciders" : [
{
"decider" : "throttling",
"decision" : "THROTTLE",
"explanation" : "reached the limit of outgoing shard recoveries [2] on the node [F5E_rkzmTvG48DqWTibAvA] which holds the primary, cluster setting [cluster.routing.allocation.node_concurrent_outgoing_recoveries=2] (can also be set via [cluster.routing.allocation.node_concurrent_recoveries])"
}
]
},
{
"node_id" : "sr-o2sCnSIuTp-mSEKPbTw",
"node_name" : "node-3",
"transport_address" : "xxxx",
"node_attributes" : {
"ml.machine_memory" : "67386658816",
"ml.max_open_jobs" : "512",
"xpack.installed" : "true",
"ml.max_jvm_size" : "8589934592",
"transform.node" : "true"
},
"node_decision" : "throttled",
"store" : {
"matching_size_in_bytes" : 0
},
"deciders" : [
{
"decider" : "throttling",
"decision" : "THROTTLE",
"explanation" : "reached the limit of incoming shard recoveries [2], cluster setting [cluster.routing.allocation.node_concurrent_incoming_recoveries=2] (can also be set via [cluster.routing.allocation.node_concurrent_recoveries])"
}
]
},
{
"node_id" : "F5E_rkzmTvG48DqWTibAvA",
"node_name" : "node-2",
"transport_address" : "xxxx",
"node_attributes" : {
"ml.machine_memory" : "67385413632",
"ml.max_open_jobs" : "512",
"xpack.installed" : "true",
"ml.max_jvm_size" : "8589934592",
"transform.node" : "true"
},
"node_decision" : "no",
"store" : {
"matching_size_in_bytes" : 13937383997
},
"deciders" : [
{
"decider" : "same_shard",
"decision" : "NO",
"explanation" : "a copy of this shard is already allocated to this node [[apm-7.16.2-span-000006][0], node[F5E_rkzmTvG48DqWTibAvA], [P], s[STARTED], a[id=CRWms_WnQh6nq5EdptXrxw]]"
},
{
"decider" : "throttling",
"decision" : "THROTTLE",
"explanation" : "reached the limit of outgoing shard recoveries [2] on the node [F5E_rkzmTvG48DqWTibAvA] which holds the primary, cluster setting [cluster.routing.allocation.node_concurrent_outgoing_recoveries=2] (can also be set via [cluster.routing.allocation.node_concurrent_recoveries])"
}
]
}
]
}
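As the note at the top of the response points out, the API explained a randomly chosen unassigned shard. To explain a specific shard instead, pass its coordinates in the request body; index, shard, and primary are the standard fields:
GET /_cluster/allocation/explain
{
  "index": "apm-7.16.2-span-000006",
  "shard": 0,
  "primary": false
}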
We can see that all three nodes are busy recovering data. throttling is a flow-control decider: it caps how many shard recoveries a node may run concurrently, both outgoing (on the node holding the primary) and incoming (on the node receiving a copy), to keep recovery from exhausting the node's resources. Here the limit is 2.
Once a node has that many recoveries in flight, new recovery tasks are queued and only proceed after an existing one finishes.
The limit can be changed via the two cluster settings named in the output: cluster.routing.allocation.node_concurrent_incoming_recoveries controls incoming recoveries specifically, while cluster.routing.allocation.node_concurrent_recoveries is the more general knob that governs recovery concurrency overall.
The cap exists so that too many concurrent recoveries cannot saturate a node's network bandwidth, disk I/O, or memory and destabilize the cluster as a whole. If you need shard recovery to go faster (for example after a node restart or a scale-out), you can raise this value moderately.
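Before raising anything, it helps to confirm the current values. Querying the cluster settings with include_defaults=true shows them even if they have never been set explicitly; filter_path merely trims the response down to the relevant keys:
GET /_cluster/settings?include_defaults=true&filter_path=*.cluster.routing.allocation.node_concurrent*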
Analyzing and Fixing the Problem
1. Temporarily adjust the flow-control settings. This requires no cluster restart and takes effect immediately:
# Raise the concurrency for incoming shard recoveries (incoming only), e.g. to 4
PUT /_cluster/settings
{
  "transient": {
    "cluster.routing.allocation.node_concurrent_incoming_recoveries": 4
  }
}
# Or raise the more general recovery concurrency (covers incoming and outgoing)
PUT /_cluster/settings
{
  "transient": {
    "cluster.routing.allocation.node_concurrent_recoveries": 4
  }
}
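Once recovery has finished, a transient setting can be put back to its default by setting it to null (transient settings are also cleared by a full cluster restart), for example:
PUT /_cluster/settings
{
  "transient": {
    "cluster.routing.allocation.node_concurrent_recoveries": null
  }
}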
2. Since my production ES cluster mainly stores logs, and that data is not particularly important but is large and therefore slow to recover, the quickest option is simply to delete it and speed recovery up.
# Delete the index
DELETE /<old-index-name>
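While waiting for the remaining shards, recovery progress can be watched with _cat/recovery; active_only=true restricts the output to recoveries still in flight, and the column list below is just one convenient selection:
GET _cat/recovery?v&active_only=true&h=index,shard,stage,source_node,target_node,bytes_percent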
3. A little later, query the cluster health again: it is now green
GET /_cluster/health?pretty
{
"cluster_name" : "elastic-apm",
"status" : "green",
"timed_out" : false,
"number_of_nodes" : 3,
"number_of_data_nodes" : 3,
"active_primary_shards" : 114,
"active_shards" : 228,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 0,
"delayed_unassigned_shards" : 0,
"number_of_pending_tasks" : 0,
"number_of_in_flight_fetch" : 0,
"task_max_waiting_in_queue_millis" : 0,
"active_shards_percent_as_number" : 100.0
}