etcd-io/etcd

After a large number of watch connections from one client are disconnected at the same time, new watches do not work properly.

Open

#18879 opened on November 12, 2024

Labels: help wanted, priority/important-soon, type/bug

Description

Bug report criteria

What happened?

We used the in-process etcdserver with the v3 client. We then started a client that opened a new watch connection to the same resource every second, without releasing the old ones, and let it run for more than 1 minute. Once the client held a large number of watch connections, we killed the client process. After that, when other clients established watch connections for the same resource, the new watches did not receive any new event changes.

What did you expect to happen?

After the client is killed, a new watch connection for the same resource should receive event changes normally. Our analysis shows that a blocking problem exists. Although it is unreasonable for a client to establish a large number of watch connections to the same resource at the same time, can the etcd server do something to avoid the blocking?

How can we reproduce it (as minimally and precisely as possible)?

We created a large number of watch connections to the same ConfigMap resource in a loop from a single process, using code similar to the following: main.txt. Run this program for 1 minute, then kill it. Next, run the kubectl get configmap -A -w command and modify the ConfigMap; the watch does not report the change.

Anything else we need to know?

After the client is killed, a large number of watch connections are disconnected at once. Code analysis shows that the Send() call for the WatchCancelRequest, in the case ws := <-w.closingc branch of the (w *watchGrpcStream) run() method in etcd/client/v3/watch.go, blocks and cannot make progress. We suspect that the large number of WatchCancelRequests fills the channel in watchGrpcStream, so that new WatchResponses cannot be pushed into sws.ctrlStream. As a result, both the WatchResponses obtained from ctrlStream and the new WatchResponses are blocked in the case pbresp := <-w.respc and case ws := <-w.closingc branches of (w *watchGrpcStream) run().

Etcd version (please run commands below)

$ etcd --version
# 3.5.11
$ etcdctl version
# 3.5.11

Etcd configuration (command line flags or environment variables)

paste your configuration here

Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)

$ etcdctl member list -w table
# paste output here

$ etcdctl --endpoints=<member list> endpoint status -w table
# paste output here

Relevant log output

No response
