Kafka engine — zombie consumer-group sessions after ClickHouse restart block partition assignment
#104,305 建立於 2026年5月7日
描述
Describe the situation
After ClickHouse restarts with Kafka engine tables, the Kafka consumer group is left in a broken state: old consumer sessions are never formally closed, turning them into zombie members that block partition assignment for the newly started consumers.
This was found via the test, which restarts ClickHouse multiple times while a Kafka-to-MergeTree materialized-view pipeline is running and expects all 240 000 messages to be present in the target table within 120 s. The test fails with count: 0 and unique: 0 on every retry — even after 7 restart cycles (≈ 145 s total) — because no messages are ever consumed after any restart.
This issue:
- Causes the Kafka consumer to return
NO_ASSIGNMENTafter restart, resulting in zero rows consumed - Accumulates one new zombie session per restart cycle, making recovery progressively slower
- Affects any scenario where ClickHouse is restarted with live Kafka-engine tables (e.g., rolling upgrades, crash recovery)
How to reproduce the behavior
Environment
- Version: >=26.4
- Kafka:
confluentinc/cp-kafka:5.2.0(Apache Kafka 2.2.x)
Steps
- Start ClickHouse and produce 240 000 messages to a Kafka topic with 12 partitions:
DROP TABLE IF EXISTS source_table SYNC;
CREATE TABLE source_table ENGINE = Log AS SELECT toUInt32(number) AS id FROM numbers(240000);
clickhouse client --query='SELECT * FROM source_table FORMAT JSONEachRow' \
| kafka-console-producer --broker-list kafka1:9092 --topic dummytopic
- Create the Kafka pipeline:
CREATE TABLE dummy_queue (id UInt32)
ENGINE = Kafka('kafka1:9092,kafka2:9092,kafka3:9092', 'dummytopic', 'dummytopic_consumer_group2', 'JSONEachRow');
CREATE TABLE dummy (host String, id UInt32) ENGINE = MergeTree ORDER BY id;
CREATE MATERIALIZED VIEW dummy_mv TO dummy
AS SELECT hostName() AS host, id FROM dummy_queue;
- Kill ClickHouse immediately after table creation and restart it:
kill -TERM $(cat /tmp/clickhouse-server.pid)
# wait for process to die
clickhouse server --config-file=/etc/clickhouse-server/config.xml --pidfile=/tmp/clickhouse-server.pid --daemon
- Wait 15 s, then check:
SELECT count(), uniqExact(id) FROM dummy FORMAT TabSeparated;
- Repeat steps 3–4 several times.
Expected behavior
After restarting ClickHouse, the Kafka consumer should rejoin the group and resume consuming messages from the committed (or earliest) offset. Within 15–30 s of the restart, SELECT count() FROM dummy should return progress towards 240 000.
Actual behavior
SELECT count(), uniqExact(id) FROM dummy returns 0\t0 after every restart. The consumer group cannot be deleted at the end of the test because it still shows active members.
Root cause analysis
Why this showed up in 2026 (root PR)
The regression is triggered by ClickHouse PR #100388 — Fix DROP TABLE hang on Kafka tables after consumer heartbeat error — merged 2026-03-24 (merge commit d3fa644b68e on upstream master). It closes #100511.
That PR bumps contrib/cppkafka so that cppkafka::Consumer::~Consumer() finally respects RD_KAFKA_DESTROY_F_NO_CONSUMER_CLOSE. The PR description states explicitly:
ClickHouse already sets this flag, but the destructor was ignoring it and calling
close()anyway.
So before March 2026, ClickHouse set NO_CONSUMER_CLOSE, but cppkafka still called close() → rd_kafka_consumer_close() ran → LeaveGroup was sent and the broker dropped the member promptly.
After the cppkafka fix, the flag is honored → consumer close is skipped on normal destructor teardown → no LeaveGroup → old members become zombies until session.timeout.ms, which matches the sudden failure of restart-heavy tests in 2026.
Related submodule bumps on the same theme: commits 9ee87d33151 (Update cppkafka to include fix for Consumer close deadlock), 1f30ba6a1b8 (Bump cppkafka (includes cppkafka/pull/5)).
The failure is caused by (A) that 2026 cppkafka behavior change plus (B) older ClickHouse choices that already avoided unsubscribe() in consumerGracefulStop() — together they prevent a timely LeaveGroup when the process restarts:
1. RD_KAFKA_DESTROY_F_NO_CONSUMER_CLOSE flag (now effective end-to-end)
In KafkaConsumer::createConsumer() (flag set since system.kafka_consumers work, commit 78ccf909258, July 2025):
consumer->set_destroy_flags(RD_KAFKA_DESTROY_F_NO_CONSUMER_CLOSE);
This flag asks librdkafka to skip rd_kafka_consumer_close() on destroy. Until PR #100388, cppkafka ignored it and still closed the consumer (sending LeaveGroup). After PR #100388, the skip is real: no LeaveGroup on destroy, so Kafka's coordinator keeps the consumer registered until session.timeout.ms (librdkafka default: 45 000 ms).
2. consumerGracefulStop() avoids unsubscribe()
In StorageKafkaUtils::consumerGracefulStop() (introduced in commit 360f92f74a1, March 6 2025, to avoid a deadlock caused by rebalance callbacks racing with toppar-queue teardown):
// To mitigate this, we now:
// (1) Avoid calling unsubscribe (letting rebalance occur naturally via consumer group timeout).
Instead of calling consumer->unsubscribe() (which would trigger an immediate LeaveGroup), the function just pauses partitions, disconnects queues, and drains events.
Combined effect
After ClickHouse shuts down its Kafka consumer:
- The broker does not receive a
LeaveGrouprequest. - The old consumer session remains in the group for up to
session.timeout.ms(45 s by default with librdkafka, or 10 s with some broker configs). - When ClickHouse restarts and the new consumer subscribes, a rebalance is triggered — but the rebalance cannot complete until the zombie session times out.
- The new consumer polls for assignment for up to
MAX_TIME_TO_WAIT_FOR_ASSIGNMENT_MS = 15 000 ms(KafkaConsumer.cpp:42). If no assignment arrives within 15 s, it returnsNO_ASSIGNMENTand reschedules with a 500 ms delay. - Each restart adds one more zombie to the group, progressively extending the rebalance delay. After 7 restarts the group accumulates 7 zombie sessions that all need to expire before any new consumer gets partitions.
Additional context
Related commits / PRs
- PR #100388 (2026-03-24) — cppkafka respects
NO_CONSUMER_CLOSE; this is the change that makes the regression visible in 2026 9ee87d33151,1f30ba6a1b8(2026-03-24) — cppkafka submodule bumps tied to the same fix415418b49d0(2026-04-02) — partial revert of the StorageKafka part of the heartbeat hang fix (cppkafka behavior may still include the flag respect)360f92f74a1(2025-03-06) — "let's try to avoid unsubscribe" —consumerGracefulStop()avoidsunsubscribe()78ccf909258(2025-07-02) — setsRD_KAFKA_DESTROY_F_NO_CONSUMER_CLOSE(was ineffective for LeaveGroup until cppkafka fix)
Potential fixes
- Set a shorter
session.timeout.msin the librdkafka consumer config (e.g.,10000 ms) so zombie sessions expire within theMAX_TIME_TO_WAIT_FOR_ASSIGNMENT_MSwindow. This is a config-level workaround. - Increase
MAX_TIME_TO_WAIT_FOR_ASSIGNMENT_MSto exceedsession.timeout.ms, so the consumer waits long enough for rebalance to complete. The downside is that every poll cycle takes longer. - Find a way to send
LeaveGroupsafely without triggering the toppar-queue deadlock that motivatedconsumerGracefulStop(). The original deadlock was between the rebalance callback and toppar-queue teardown — a careful ordering of unsubscribe + drain might avoid it. - Test-level fix: add a sleep before the first restart to allow initial consumption, and increase per-restart wait to 30+ s. This does not fix the underlying engine issue.
Does it reproduce on the most recent release?
Yes