yugabyte/yugabyte-db
[DocDB] Add the lag column to list_all_masters yb-admin output
Open
#28675, created on September 23, 2025
area/docdb, good first issue, kind/bug, priority/medium
Jira Link: DB-18374
Description
Steps to reproduce:
- Start a group of 3 masters:
./bin/yb-master \
--master_addresses=127.0.0.1:7100,127.0.0.2:7100,127.0.0.3:7100 \
--fs_data_dirs=$HOME/yugabyte/node1/data \
--rpc_bind_addresses=127.0.0.1:7100
sudo ifconfig lo0 alias 127.0.0.2
./bin/yb-master \
--master_addresses=127.0.0.1:7100,127.0.0.2:7100,127.0.0.3:7100 \
--fs_data_dirs=$HOME/yugabyte/node2/data \
--rpc_bind_addresses=127.0.0.2:7100
sudo ifconfig lo0 alias 127.0.0.3
./bin/yb-master \
--master_addresses=127.0.0.1:7100,127.0.0.2:7100,127.0.0.3:7100 \
--fs_data_dirs=$HOME/yugabyte/node3/data \
--rpc_bind_addresses=127.0.0.3:7100
- Check that they are healthy:
% ./bin/yb-admin --master_addresses 127.0.0.1:7100,127.0.0.2:7100,127.0.0.3:7100 list_all_masters
Master UUID                       RPC Host/Port   State  Role      Broadcast Host/Port
af08844be93d4cdf9e0b94858fe33675  127.0.0.1:7100  ALIVE  FOLLOWER  N/A
8bff6598e2624fbdbd20000c5dde8f0f  127.0.0.2:7100  ALIVE  FOLLOWER  N/A
240ce9373a8a42d18b9efa7e44021969  127.0.0.3:7100  ALIVE  LEADER    N/A
- Stop node3 and clear its data:
rm -fr $HOME/yugabyte/node3/data/yb-data/*
- Start it again:
./bin/yb-master \
--master_addresses=127.0.0.1:7100,127.0.0.2:7100,127.0.0.3:7100 \
--fs_data_dirs=$HOME/yugabyte/node3/data \
--rpc_bind_addresses=127.0.0.3:7100
- Check the list of masters:
% ./bin/yb-admin --master_addresses 127.0.0.1:7100,127.0.0.2:7100,127.0.0.3:7100 list_all_masters
Master UUID                       RPC Host/Port   State  Role      Broadcast Host/Port
af08844be93d4cdf9e0b94858fe33675  127.0.0.1:7100  ALIVE  LEADER    N/A
8bff6598e2624fbdbd20000c5dde8f0f  127.0.0.2:7100  ALIVE  FOLLOWER  N/A
6e9269eaa24740eaa5bc7bccda343917  127.0.0.3:7100  ALIVE  FOLLOWER  N/A
node3 now looks like a healthy FOLLOWER.
- But if you try to promote it to LEADER:
% ./bin/yb-admin --master_addresses 127.0.0.1:7100,127.0.0.2:7100,127.0.0.3:7100 master_leader_stepdown 6e9269eaa24740eaa5bc7bccda343917
E0923 21:02:23.128075 47841792 yb-admin_client.cc:729] LeaderStepDown for af08844be93d4cdf9e0b94858fe33675received error code: LEADER_NOT_READY_TO_STEP_DOWN status { code: ILLEGAL_STATE message: "Suggested peer is not caught up yet" source_file: "../../src/yb/consensus/raft_consensus.cc" source_line: 851 errors: "\000" }
Error running master_leader_stepdown: Illegal state (yb/consensus/raft_consensus.cc:851): Suggested peer is not caught up yet
It turns out node3 is not actually healthy. It remains in this state indefinitely, i.e. it never catches up.
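To make the mismatch concrete, here is a minimal sketch of a health check that relies only on the State column of list_all_masters. It is not part of yb-admin: `naive_health` is a hypothetical helper, and the input is the second listing pasted above, assumed to be whitespace-separated as shown. Such a check accepts node3 even though the stepdown attempt shows it has not caught up.

```shell
#!/bin/sh
# Sample input: the second list_all_masters output from this report.
masters_output='Master UUID RPC Host/Port State Role Broadcast Host/Port
af08844be93d4cdf9e0b94858fe33675 127.0.0.1:7100 ALIVE LEADER N/A
8bff6598e2624fbdbd20000c5dde8f0f 127.0.0.2:7100 ALIVE FOLLOWER N/A
6e9269eaa24740eaa5bc7bccda343917 127.0.0.3:7100 ALIVE FOLLOWER N/A'

# Hypothetical helper (not a yb-admin feature): skip the header row and
# print one verdict per master, based on the State column (field 3) only.
naive_health() {
  awk 'NR > 1 { print $2, ($3 == "ALIVE" ? "healthy" : "unhealthy") }'
}

printf '%s\n' "$masters_output" | naive_health
```

Every master, including 127.0.0.3:7100, is reported as healthy, because nothing in the current output distinguishes a caught-up follower from one that is not.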
This is very misleading and can cause serious trouble if you keep operating the cluster in this state. For example, if you then replace the disk of another yb-master, the cluster metadata becomes unavailable (presumably because the yb-master Raft group loses quorum).
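The quorum concern can be sketched with plain majority arithmetic. This is generic Raft reasoning, not YugabyteDB code, and it assumes the supposition above is right, namely that a master which has not caught up cannot count toward a usable majority. The function names `quorum_size` and `has_quorum` are hypothetical.

```shell
#!/bin/sh
# Majority (quorum) size for a Raft group of N voters: floor(N/2) + 1.
quorum_size() {
  echo $(( $1 / 2 + 1 ))
}

# Succeeds (exit status 0) if the number of caught-up replicas ($2)
# is enough to form a majority of the total voters ($1).
has_quorum() {
  total=$1
  caught_up=$2
  [ "$caught_up" -ge "$(quorum_size "$total")" ]
}

# Scenario from this report: 3 masters, node3 never caught up, and then
# a second master's disk is replaced, leaving 1 caught-up replica.
if has_quorum 3 1; then
  echo "quorum available"
else
  echo "quorum lost"
fi
```

With 3 masters the majority is 2, so a single caught-up replica is not enough and the script prints "quorum lost".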
Expected behavior: such a yb-master node is shown as unhealthy in the masters list.
Issue Type
kind/bug
Warning: Please confirm that this issue does not contain any sensitive information
- I confirm this issue does not contain any sensitive information.