cockroachdb/cockroach

kv,server: do more verbose logging of graceful drain range lease transfer, once the majority of range leases are transferred over

Open

#65659 opened on May 25, 2021

Labels: A-kv-observability, C-enhancement, O-sre, T-kv, good first issue

Description

Is your feature request related to a problem? Please describe.

Here are some example logs:

I210511 17:06:12.977887 1 cli/start.go:821 ⋮ initiating graceful shutdown of server
...
I210511 17:07:06.103135 18255896 server/drain.go:174 ⋮ [server drain process] drain remaining: 3
I210511 17:07:06.103174 18255896 server/drain.go:176 ⋮ [server drain process] drain details: range lease iterations: 3
I210511 17:07:07.360491 171 server/status/runtime.go:525 ⋮ [n1] runtime stats: 728 MiB RSS, 472 goroutines, 93 MiB/202 MiB/177 MiB GO alloc/idle/total, 350 MiB/452 MiB CGO alloc/total, 0.1 CGO/sec, 2.1/1.0 %(u/s)time, 0.0 %gc (0x), 358 KiB/1.0 MiB (r/w)net
I210511 17:07:07.978903 18255897 cli/start.go:831 ⋮ 26 running tasks
E210511 17:07:12.121121 18258961 server/server.go:1901 ⋮ [n1,client=‹10.0.2.129:53070›] serving SQL client conn: server is not accepting clients
I210511 17:07:12.978900 18255897 cli/start.go:831 ⋮ 26 running tasks

The logs above show one minute passing without the graceful drain finishing (that is, without all range leases being transferred away), which led to some impact during the update.

More detailed logs might help with debugging. It's not clear what is going wrong from the above logs.

Describe the solution you'd like

I expect it would help debugging if we logged more verbose info once we got down to N=10 or so range leases still held on the node. We could log:

  1. Range IDs, min & max keys, etc. of all ranges still leased by the node.
  2. Range lease transfer attempt info: log when a transfer attempt is made, log any error that is returned, and log how long the node was blocked on the transfer attempt. (Tracing would help here too.)

Describe alternatives you've considered

We could do nothing and rely on operators using vmodule to get the additional verbosity when needed.

Additional context

On CC, graceful drain doesn't always finish before SIGKILL time. We want to fix this.

Jira issue: CRDB-7706

Epic CRDB-54646
