Kafka engine — zombie consumer-group sessions after ClickHouse restart block partition assignment · ClickHouse/ClickHouse#104305

(0 留言) (0 反應) (0 負責人)C++ (47,419 star) (8,400 fork)batch import

help wantednot plannedpotential bug

描述

Describe the situation

After ClickHouse restarts with Kafka engine tables, the Kafka consumer group is left in a broken state: old consumer sessions are never formally closed, turning them into zombie members that block partition assignment for the newly started consumers.

This was found via the test, which restarts ClickHouse multiple times while a Kafka-to-MergeTree materialized-view pipeline is running and expects all 240 000 messages to be present in the target table within 120 s. The test fails with count: 0 and unique: 0 on every retry — even after 7 restart cycles (≈ 145 s total) — because no messages are ever consumed after any restart.

This issue:

Causes the Kafka consumer to return NO_ASSIGNMENT after restart, resulting in zero rows consumed
Accumulates one new zombie session per restart cycle, making recovery progressively slower
Affects any scenario where ClickHouse is restarted with live Kafka-engine tables (e.g., rolling upgrades, crash recovery)

How to reproduce the behavior

Environment

Version: >=26.4
Kafka: confluentinc/cp-kafka:5.2.0 (Apache Kafka 2.2.x)

Steps

Start ClickHouse and produce 240 000 messages to a Kafka topic with 12 partitions:

DROP TABLE IF EXISTS source_table SYNC;
CREATE TABLE source_table ENGINE = Log AS SELECT toUInt32(number) AS id FROM numbers(240000);

clickhouse client --query='SELECT * FROM source_table FORMAT JSONEachRow' \
  | kafka-console-producer --broker-list kafka1:9092 --topic dummytopic

Create the Kafka pipeline:

CREATE TABLE dummy_queue (id UInt32)
ENGINE = Kafka('kafka1:9092,kafka2:9092,kafka3:9092', 'dummytopic', 'dummytopic_consumer_group2', 'JSONEachRow');

CREATE TABLE dummy (host String, id UInt32) ENGINE = MergeTree ORDER BY id;

CREATE MATERIALIZED VIEW dummy_mv TO dummy
AS SELECT hostName() AS host, id FROM dummy_queue;

Kill ClickHouse immediately after table creation and restart it:

kill -TERM $(cat /tmp/clickhouse-server.pid)
# wait for process to die
clickhouse server --config-file=/etc/clickhouse-server/config.xml --pidfile=/tmp/clickhouse-server.pid --daemon

Wait 15 s, then check:

SELECT count(), uniqExact(id) FROM dummy FORMAT TabSeparated;

Repeat steps 3–4 several times.

Expected behavior

After restarting ClickHouse, the Kafka consumer should rejoin the group and resume consuming messages from the committed (or earliest) offset. Within 15–30 s of the restart, SELECT count() FROM dummy should return progress towards 240 000.

Actual behavior

SELECT count(), uniqExact(id) FROM dummy returns 0\t0 after every restart. The consumer group cannot be deleted at the end of the test because it still shows active members.

Root cause analysis

Why this showed up in 2026 (root PR)

The regression is triggered by ClickHouse PR #100388 — Fix DROP TABLE hang on Kafka tables after consumer heartbeat error — merged 2026-03-24 (merge commit d3fa644b68e on upstream master). It closes #100511.

That PR bumps contrib/cppkafka so that cppkafka::Consumer::~Consumer() finally respects RD_KAFKA_DESTROY_F_NO_CONSUMER_CLOSE. The PR description states explicitly:

ClickHouse already sets this flag, but the destructor was ignoring it and calling close() anyway.

So before March 2026, ClickHouse set NO_CONSUMER_CLOSE, but cppkafka still called close() → rd_kafka_consumer_close() ran → LeaveGroup was sent and the broker dropped the member promptly.

After the cppkafka fix, the flag is honored → consumer close is skipped on normal destructor teardown → no LeaveGroup → old members become zombies until session.timeout.ms, which matches the sudden failure of restart-heavy tests in 2026.

Related submodule bumps on the same theme: commits 9ee87d33151 (Update cppkafka to include fix for Consumer close deadlock), 1f30ba6a1b8 (Bump cppkafka (includes cppkafka/pull/5)).

The failure is caused by (A) that 2026 cppkafka behavior change plus (B) older ClickHouse choices that already avoided unsubscribe() in consumerGracefulStop() — together they prevent a timely LeaveGroup when the process restarts:

1. `RD_KAFKA_DESTROY_F_NO_CONSUMER_CLOSE` flag (now effective end-to-end)

In KafkaConsumer::createConsumer() (flag set since system.kafka_consumers work, commit 78ccf909258, July 2025):

consumer->set_destroy_flags(RD_KAFKA_DESTROY_F_NO_CONSUMER_CLOSE);

This flag asks librdkafka to skip rd_kafka_consumer_close() on destroy. Until PR #100388, cppkafka ignored it and still closed the consumer (sending LeaveGroup). After PR #100388, the skip is real: no LeaveGroup on destroy, so Kafka's coordinator keeps the consumer registered until session.timeout.ms (librdkafka default: 45 000 ms).

2. `consumerGracefulStop()` avoids `unsubscribe()`

In StorageKafkaUtils::consumerGracefulStop() (introduced in commit 360f92f74a1, March 6 2025, to avoid a deadlock caused by rebalance callbacks racing with toppar-queue teardown):

// To mitigate this, we now:
//   (1) Avoid calling unsubscribe (letting rebalance occur naturally via consumer group timeout).

Instead of calling consumer->unsubscribe() (which would trigger an immediate LeaveGroup), the function just pauses partitions, disconnects queues, and drains events.

Combined effect

After ClickHouse shuts down its Kafka consumer:

The broker does not receive a LeaveGroup request.
The old consumer session remains in the group for up to session.timeout.ms (45 s by default with librdkafka, or 10 s with some broker configs).
When ClickHouse restarts and the new consumer subscribes, a rebalance is triggered — but the rebalance cannot complete until the zombie session times out.
The new consumer polls for assignment for up to MAX_TIME_TO_WAIT_FOR_ASSIGNMENT_MS = 15 000 ms (KafkaConsumer.cpp:42). If no assignment arrives within 15 s, it returns NO_ASSIGNMENT and reschedules with a 500 ms delay.
Each restart adds one more zombie to the group, progressively extending the rebalance delay. After 7 restarts the group accumulates 7 zombie sessions that all need to expire before any new consumer gets partitions.

Additional context

Related commits / PRs

PR #100388 (2026-03-24) — cppkafka respects NO_CONSUMER_CLOSE; this is the change that makes the regression visible in 2026
9ee87d33151, 1f30ba6a1b8 (2026-03-24) — cppkafka submodule bumps tied to the same fix
415418b49d0 (2026-04-02) — partial revert of the StorageKafka part of the heartbeat hang fix (cppkafka behavior may still include the flag respect)
360f92f74a1 (2025-03-06) — "let's try to avoid unsubscribe" — consumerGracefulStop() avoids unsubscribe()
78ccf909258 (2025-07-02) — sets RD_KAFKA_DESTROY_F_NO_CONSUMER_CLOSE (was ineffective for LeaveGroup until cppkafka fix)

Potential fixes

Set a shorter session.timeout.ms in the librdkafka consumer config (e.g., 10000 ms) so zombie sessions expire within the MAX_TIME_TO_WAIT_FOR_ASSIGNMENT_MS window. This is a config-level workaround.
Increase MAX_TIME_TO_WAIT_FOR_ASSIGNMENT_MS to exceed session.timeout.ms, so the consumer waits long enough for rebalance to complete. The downside is that every poll cycle takes longer.
Find a way to send LeaveGroup safely without triggering the toppar-queue deadlock that motivated consumerGracefulStop(). The original deadlock was between the rebalance callback and toppar-queue teardown — a careful ordering of unsubscribe + drain might avoid it.
Test-level fix: add a sleep before the first restart to allow initial consumption, and increase per-restart wait to 30+ s. This does not fix the underlying engine issue.

Does it reproduce on the most recent release?

Yes

貢獻者指南

技術棧: cppkafka
領域: databasebackend
議題類型: bug
難度: 5
預計時間: over 1 week
活動狀態: fresh
清晰度: clear
前置要求: C++ proficiencyKafka consumer group conceptsClickHouse system architecturelibrdkafka basics
新手友善度: 20
研究方向: Investigate the interplay between consumerGracefulStop() in StorageKafkaUtils and the new cppkafka behavior that now respects RD KAFKA DESTROY F NO CONSUMER CLOSE. Look at the consumer destroy sequence in KafkaConsumer.cpp and consider implementing a forced LeaveGroup via consumer >unsubscribe() or setting a shorter session.timeout.ms in the librdkafka config. Also review the partial revert commit 415418b49d0 to understand if any safe path exists.

描述