[DocDB] Add the lag column to list_all_masters yb-admin output · yugabyte/yugabyte-db#28675

(7 comments) (0 reactions) (1 assignee) C (1,003 forks)

area/docdb, good first issue, kind/bug, priority/medium

Repository metrics

Stars: (8,229 stars)
PR merge metrics: (Avg merge 17d 21h) (92 merged PRs in 30d)

Description

Jira Link: DB-18374

Steps to reproduce:

Start a group of 3 masters:

./bin/yb-master \
    --master_addresses=127.0.0.1:7100,127.0.0.2:7100,127.0.0.3:7100 \
    --fs_data_dirs=$HOME/yugabyte/node1/data \
    --rpc_bind_addresses=127.0.0.1:7100

sudo ifconfig lo0 alias 127.0.0.2

./bin/yb-master \
    --master_addresses=127.0.0.1:7100,127.0.0.2:7100,127.0.0.3:7100 \
    --fs_data_dirs=$HOME/yugabyte/node2/data \
    --rpc_bind_addresses=127.0.0.2:7100

sudo ifconfig lo0 alias 127.0.0.3

./bin/yb-master \
    --master_addresses=127.0.0.1:7100,127.0.0.2:7100,127.0.0.3:7100 \
    --fs_data_dirs=$HOME/yugabyte/node3/data \
    --rpc_bind_addresses=127.0.0.3:7100

Check they are healthy:

% ./bin/yb-admin --master_addresses 127.0.0.1:7100,127.0.0.2:7100,127.0.0.3:7100 list_all_masters                                       
Master UUID                      	RPC Host/Port        	State    	Role 	Broadcast Host/Port 
af08844be93d4cdf9e0b94858fe33675 	127.0.0.1:7100       	ALIVE    	FOLLOWER 	N/A                 
8bff6598e2624fbdbd20000c5dde8f0f 	127.0.0.2:7100       	ALIVE    	FOLLOWER 	N/A                 
240ce9373a8a42d18b9efa7e44021969 	127.0.0.3:7100       	ALIVE    	LEADER 	N/A

Stop node3 and clear its data:

rm -fr $HOME/yugabyte/node3/data/yb-data/*

Start it again:

./bin/yb-master \
    --master_addresses=127.0.0.1:7100,127.0.0.2:7100,127.0.0.3:7100 \
    --fs_data_dirs=$HOME/yugabyte/node3/data \
    --rpc_bind_addresses=127.0.0.3:7100

Check the list of masters:

% ./bin/yb-admin --master_addresses 127.0.0.1:7100,127.0.0.2:7100,127.0.0.3:7100 list_all_masters
Master UUID                      	RPC Host/Port        	State    	Role 	Broadcast Host/Port 
af08844be93d4cdf9e0b94858fe33675 	127.0.0.1:7100       	ALIVE    	LEADER 	N/A                 
8bff6598e2624fbdbd20000c5dde8f0f 	127.0.0.2:7100       	ALIVE    	FOLLOWER 	N/A                 
6e9269eaa24740eaa5bc7bccda343917 	127.0.0.3:7100       	ALIVE    	FOLLOWER 	N/A

node3 looks like a healthy FOLLOWER.

But if you try to promote it to LEADER, the step-down fails:

% ./bin/yb-admin --master_addresses 127.0.0.1:7100,127.0.0.2:7100,127.0.0.3:7100 master_leader_stepdown 6e9269eaa24740eaa5bc7bccda343917
E0923 21:02:23.128075 47841792 yb-admin_client.cc:729] LeaderStepDown for af08844be93d4cdf9e0b94858fe33675received error code: LEADER_NOT_READY_TO_STEP_DOWN status { code: ILLEGAL_STATE message: "Suggested peer is not caught up yet" source_file: "../../src/yb/consensus/raft_consensus.cc" source_line: 851 errors: "\000" }
Error running master_leader_stepdown: Illegal state (yb/consensus/raft_consensus.cc:851): Suggested peer is not caught up yet

It turns out that node3 is not actually healthy. It remains in this state indefinitely, i.e. it never catches up.

This is very misleading and can cause serious trouble if you keep operating the cluster in this state. For example, if you then replace the disk of another yb-master, the cluster metadata will become unavailable (I suppose because the yb-master Raft group loses quorum).
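
The quorum concern can be made concrete with a little arithmetic. The following is a minimal sketch that assumes standard Raft majority rules; MajoritySize and HasQuorum are hypothetical helper names for illustration and are not taken from the YugabyteDB codebase.

```cpp
// Smallest number of voters that forms a Raft majority.
constexpr int MajoritySize(int num_voters) { return num_voters / 2 + 1; }

// True if the group can still commit new entries when only `caught_up`
// peers are reachable and able to acknowledge them.
constexpr bool HasQuorum(int num_voters, int caught_up) {
  return caught_up >= MajoritySize(num_voters);
}

// Three masters need two up-to-date peers to make progress.
static_assert(MajoritySize(3) == 2, "majority of 3 is 2");
// With node3 stuck, node1 and node2 still form a quorum...
static_assert(HasQuorum(3, 2), "two healthy peers are enough");
// ...but taking one more master down leaves a single healthy peer.
static_assert(!HasQuorum(3, 1), "one healthy peer is not enough");
```

Under these assumptions, a three-master group with one silently lagging follower has no remaining fault tolerance, which is why the ALIVE/FOLLOWER status shown above is risky to rely on.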

Expected behavior: a yb-master node in this state is shown as unhealthy in the masters list.

Issue Type

kind/bug

Warning: Please confirm that this issue does not contain any sensitive information

I confirm this issue does not contain any sensitive information.

Contributor guide

Research direction: Examine the output format of list_all_masters in yb-admin, locate the source code that populates the table, and add a new column showing the replication lag of each master. The lag can be obtained from the Raft consensus state of each master; investigate how to retrieve this information from the master service.
Tech stack: cpp
Domain: backend, database
Issue type: Bug
Difficulty: 2
Estimated time: 1-3 hours
Activity status: Active
Clarity: Mostly clear
Prerequisites: Git, C++, Raft consensus understanding
Newbie friendliness: 50
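
As a rough illustration of the research direction above, the sketch below computes a per-master lag and appends it as an extra tab-separated column. It is only a sketch under stated assumptions: the MasterRow struct, the last_op_index field, and the OpLag and FormatRow helpers are hypothetical and do not come from the yb-admin source. How the real code obtains each peer's Raft position is exactly what the issue asks the contributor to investigate.

```cpp
#include <algorithm>
#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical per-master record; the real yb-admin code uses its own types.
struct MasterRow {
  std::string uuid;
  std::string rpc_addr;
  std::string state;
  std::string role;
  int64_t last_op_index;  // Assumed: index of the last Raft op this peer holds.
};

// Lag in Raft operations relative to the leader, clamped at zero.
int64_t OpLag(int64_t leader_index, int64_t peer_index) {
  return std::max<int64_t>(0, leader_index - peer_index);
}

// Renders one tab-separated row with a trailing lag column.
std::string FormatRow(const MasterRow& row, int64_t leader_index) {
  std::ostringstream out;
  out << row.uuid << '\t' << row.rpc_addr << '\t' << row.state << '\t'
      << row.role << '\t' << OpLag(leader_index, row.last_op_index);
  return out.str();
}
```

With a column like this, a follower that never catches up would show a persistent non-zero lag instead of looking identical to a healthy FOLLOWER.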
