yugabyte/yugabyte-db

[DocDB] Add the lag column to list_all_masters yb-admin output

Open

#28,675 opened on Sep 23, 2025

Labels: area/docdb, good first issue, kind/bug, priority/medium

Description

Jira Link: DB-18374


Steps to reproduce:

  1. Start a group of 3 masters:
./bin/yb-master \
    --master_addresses=127.0.0.1:7100,127.0.0.2:7100,127.0.0.3:7100 \
    --fs_data_dirs=$HOME/yugabyte/node1/data \
    --rpc_bind_addresses=127.0.0.1:7100

sudo ifconfig lo0 alias 127.0.0.2

./bin/yb-master \
    --master_addresses=127.0.0.1:7100,127.0.0.2:7100,127.0.0.3:7100 \
    --fs_data_dirs=$HOME/yugabyte/node2/data \
    --rpc_bind_addresses=127.0.0.2:7100

sudo ifconfig lo0 alias 127.0.0.3

./bin/yb-master \
    --master_addresses=127.0.0.1:7100,127.0.0.2:7100,127.0.0.3:7100 \
    --fs_data_dirs=$HOME/yugabyte/node3/data \
    --rpc_bind_addresses=127.0.0.3:7100
  2. Check that they are healthy:
% ./bin/yb-admin --master_addresses 127.0.0.1:7100,127.0.0.2:7100,127.0.0.3:7100 list_all_masters                                       
Master UUID                      	RPC Host/Port        	State    	Role 	Broadcast Host/Port 
af08844be93d4cdf9e0b94858fe33675 	127.0.0.1:7100       	ALIVE    	FOLLOWER 	N/A                 
8bff6598e2624fbdbd20000c5dde8f0f 	127.0.0.2:7100       	ALIVE    	FOLLOWER 	N/A                 
240ce9373a8a42d18b9efa7e44021969 	127.0.0.3:7100       	ALIVE    	LEADER 	N/A
  3. Stop node3 and clear its data:
rm -fr $HOME/yugabyte/node3/data/yb-data/*
  4. Start it again:
./bin/yb-master \
    --master_addresses=127.0.0.1:7100,127.0.0.2:7100,127.0.0.3:7100 \
    --fs_data_dirs=$HOME/yugabyte/node3/data \
    --rpc_bind_addresses=127.0.0.3:7100
  5. Check the list of masters:
% ./bin/yb-admin --master_addresses 127.0.0.1:7100,127.0.0.2:7100,127.0.0.3:7100 list_all_masters
Master UUID                      	RPC Host/Port        	State    	Role 	Broadcast Host/Port 
af08844be93d4cdf9e0b94858fe33675 	127.0.0.1:7100       	ALIVE    	LEADER 	N/A                 
8bff6598e2624fbdbd20000c5dde8f0f 	127.0.0.2:7100       	ALIVE    	FOLLOWER 	N/A                 
6e9269eaa24740eaa5bc7bccda343917 	127.0.0.3:7100       	ALIVE    	FOLLOWER 	N/A 

node3 (now with a new UUID) looks like a healthy FOLLOWER.

  6. But if you try to promote it to LEADER:
% ./bin/yb-admin --master_addresses 127.0.0.1:7100,127.0.0.2:7100,127.0.0.3:7100 master_leader_stepdown 6e9269eaa24740eaa5bc7bccda343917
E0923 21:02:23.128075 47841792 yb-admin_client.cc:729] LeaderStepDown for af08844be93d4cdf9e0b94858fe33675received error code: LEADER_NOT_READY_TO_STEP_DOWN status { code: ILLEGAL_STATE message: "Suggested peer is not caught up yet" source_file: "../../src/yb/consensus/raft_consensus.cc" source_line: 851 errors: "\000" }
Error running master_leader_stepdown: Illegal state (yb/consensus/raft_consensus.cc:851): Suggested peer is not caught up yet

It turns out that node3 is not actually healthy. It remains in this state indefinitely, i.e. it never catches up.
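To illustrate why the listing hides the problem, here is a minimal sketch that summarizes `list_all_masters` output using only the State and Role columns. It assumes the tab-separated layout shown in the listings above; the function name `summarize_masters` is hypothetical and not part of yb-admin. Fed the second listing above, it counts all three masters as ALIVE, so nothing in the current columns distinguishes the lagging peer.

```shell
#!/usr/bin/env bash
# Hypothetical helper: summarize `yb-admin list_all_masters` output from stdin.
# Assumes the tab-separated columns shown in this issue:
#   Master UUID, RPC Host/Port, State, Role, Broadcast Host/Port
summarize_masters() {
  awk -F'\t' '
    $1 ~ /^Master UUID/ { next }   # skip the header row
    NF < 4              { next }   # skip blank or malformed lines
    {
      # Trim the padding spaces around the fields we read.
      for (i = 1; i <= 4; i++) gsub(/^ +| +$/, "", $i)
      total++
      if ($3 == "ALIVE")  alive++
      if ($4 == "LEADER") leader = $2
    }
    END { printf "total=%d alive=%d leader=%s\n", total, alive, leader }
  '
}
```

It could be used as `./bin/yb-admin --master_addresses 127.0.0.1:7100,127.0.0.2:7100,127.0.0.3:7100 list_all_masters | summarize_masters`.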

This is very misleading and can cause serious trouble if you continue working on the cluster in this state. For example, if you then replace the disk of another yb-master, the cluster metadata will become unavailable (I suppose because the yb-master Raft group loses quorum).
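The quorum concern can be made concrete with the usual Raft majority rule: a group of N voting peers needs floor(N/2) + 1 of them to be caught up. The sketch below is only arithmetic under that assumption; the function names are hypothetical and nothing here queries a real cluster.

```shell
#!/usr/bin/env bash
# Majority needed for a Raft group of N voting peers: floor(N / 2) + 1.
raft_majority() {
  local peers=$1
  echo $(( peers / 2 + 1 ))
}

# Succeeds (exit status 0) when the caught-up peers still form a majority.
has_quorum() {
  local peers=$1 caught_up=$2
  [ "$caught_up" -ge "$(raft_majority "$peers")" ]
}

# Scenario from this issue: 3 masters, node3 silently not caught up,
# so only node1 and node2 hold current data.
if has_quorum 3 2; then echo "quorum held by node1 + node2"; fi

# Replacing the disk of one more master leaves a single caught-up peer.
if ! has_quorum 3 1; then echo "quorum lost"; fi
```

With three masters the majority is two, so a cluster that appears to tolerate one more failure actually tolerates none while node3 is lagging.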

Expected behavior: such a yb-master node is shown as unhealthy in the masters list.

Issue Type

kind/bug
