ClickHouse/ClickHouse

Kafka engine — zombie consumer-group sessions after ClickHouse restart block partition assignment

Open

#104,305 建立於 2026年5月7日

在 GitHub 查看
 (0 留言) (0 反應) (0 負責人)C++ (47,419 star) (8,400 fork)batch import
help wantednot plannedpotential bug

描述

Describe the situation

After ClickHouse restarts with Kafka engine tables, the Kafka consumer group is left in a broken state: old consumer sessions are never formally closed, turning them into zombie members that block partition assignment for the newly started consumers.

This was found via the test, which restarts ClickHouse multiple times while a Kafka-to-MergeTree materialized-view pipeline is running and expects all 240 000 messages to be present in the target table within 120 s. The test fails with count: 0 and unique: 0 on every retry — even after 7 restart cycles (≈ 145 s total) — because no messages are ever consumed after any restart.

This issue:

  • Causes the Kafka consumer to return NO_ASSIGNMENT after restart, resulting in zero rows consumed
  • Accumulates one new zombie session per restart cycle, making recovery progressively slower
  • Affects any scenario where ClickHouse is restarted with live Kafka-engine tables (e.g., rolling upgrades, crash recovery)

How to reproduce the behavior

Environment

  • Version: >=26.4
  • Kafka: confluentinc/cp-kafka:5.2.0 (Apache Kafka 2.2.x)

Steps

  1. Start ClickHouse and produce 240 000 messages to a Kafka topic with 12 partitions:
DROP TABLE IF EXISTS source_table SYNC;
CREATE TABLE source_table ENGINE = Log AS SELECT toUInt32(number) AS id FROM numbers(240000);
clickhouse client --query='SELECT * FROM source_table FORMAT JSONEachRow' \
  | kafka-console-producer --broker-list kafka1:9092 --topic dummytopic
  1. Create the Kafka pipeline:
CREATE TABLE dummy_queue (id UInt32)
ENGINE = Kafka('kafka1:9092,kafka2:9092,kafka3:9092', 'dummytopic', 'dummytopic_consumer_group2', 'JSONEachRow');

CREATE TABLE dummy (host String, id UInt32) ENGINE = MergeTree ORDER BY id;

CREATE MATERIALIZED VIEW dummy_mv TO dummy
AS SELECT hostName() AS host, id FROM dummy_queue;
  1. Kill ClickHouse immediately after table creation and restart it:
kill -TERM $(cat /tmp/clickhouse-server.pid)
# wait for process to die
clickhouse server --config-file=/etc/clickhouse-server/config.xml --pidfile=/tmp/clickhouse-server.pid --daemon
  1. Wait 15 s, then check:
SELECT count(), uniqExact(id) FROM dummy FORMAT TabSeparated;
  1. Repeat steps 3–4 several times.

Expected behavior

After restarting ClickHouse, the Kafka consumer should rejoin the group and resume consuming messages from the committed (or earliest) offset. Within 15–30 s of the restart, SELECT count() FROM dummy should return progress towards 240 000.


Actual behavior

SELECT count(), uniqExact(id) FROM dummy returns 0\t0 after every restart. The consumer group cannot be deleted at the end of the test because it still shows active members.


Root cause analysis

Why this showed up in 2026 (root PR)

The regression is triggered by ClickHouse PR #100388Fix DROP TABLE hang on Kafka tables after consumer heartbeat error — merged 2026-03-24 (merge commit d3fa644b68e on upstream master). It closes #100511.

That PR bumps contrib/cppkafka so that cppkafka::Consumer::~Consumer() finally respects RD_KAFKA_DESTROY_F_NO_CONSUMER_CLOSE. The PR description states explicitly:

ClickHouse already sets this flag, but the destructor was ignoring it and calling close() anyway.

So before March 2026, ClickHouse set NO_CONSUMER_CLOSE, but cppkafka still called close()rd_kafka_consumer_close() ran → LeaveGroup was sent and the broker dropped the member promptly.

After the cppkafka fix, the flag is honored → consumer close is skipped on normal destructor teardown → no LeaveGroup → old members become zombies until session.timeout.ms, which matches the sudden failure of restart-heavy tests in 2026.

Related submodule bumps on the same theme: commits 9ee87d33151 (Update cppkafka to include fix for Consumer close deadlock), 1f30ba6a1b8 (Bump cppkafka (includes cppkafka/pull/5)).


The failure is caused by (A) that 2026 cppkafka behavior change plus (B) older ClickHouse choices that already avoided unsubscribe() in consumerGracefulStop() — together they prevent a timely LeaveGroup when the process restarts:

1. RD_KAFKA_DESTROY_F_NO_CONSUMER_CLOSE flag (now effective end-to-end)

In KafkaConsumer::createConsumer() (flag set since system.kafka_consumers work, commit 78ccf909258, July 2025):

consumer->set_destroy_flags(RD_KAFKA_DESTROY_F_NO_CONSUMER_CLOSE);

This flag asks librdkafka to skip rd_kafka_consumer_close() on destroy. Until PR #100388, cppkafka ignored it and still closed the consumer (sending LeaveGroup). After PR #100388, the skip is real: no LeaveGroup on destroy, so Kafka's coordinator keeps the consumer registered until session.timeout.ms (librdkafka default: 45 000 ms).

2. consumerGracefulStop() avoids unsubscribe()

In StorageKafkaUtils::consumerGracefulStop() (introduced in commit 360f92f74a1, March 6 2025, to avoid a deadlock caused by rebalance callbacks racing with toppar-queue teardown):

// To mitigate this, we now:
//   (1) Avoid calling unsubscribe (letting rebalance occur naturally via consumer group timeout).

Instead of calling consumer->unsubscribe() (which would trigger an immediate LeaveGroup), the function just pauses partitions, disconnects queues, and drains events.

Combined effect

After ClickHouse shuts down its Kafka consumer:

  1. The broker does not receive a LeaveGroup request.
  2. The old consumer session remains in the group for up to session.timeout.ms (45 s by default with librdkafka, or 10 s with some broker configs).
  3. When ClickHouse restarts and the new consumer subscribes, a rebalance is triggered — but the rebalance cannot complete until the zombie session times out.
  4. The new consumer polls for assignment for up to MAX_TIME_TO_WAIT_FOR_ASSIGNMENT_MS = 15 000 ms (KafkaConsumer.cpp:42). If no assignment arrives within 15 s, it returns NO_ASSIGNMENT and reschedules with a 500 ms delay.
  5. Each restart adds one more zombie to the group, progressively extending the rebalance delay. After 7 restarts the group accumulates 7 zombie sessions that all need to expire before any new consumer gets partitions.

Additional context

Related commits / PRs

  • PR #100388 (2026-03-24) — cppkafka respects NO_CONSUMER_CLOSE; this is the change that makes the regression visible in 2026
  • 9ee87d33151, 1f30ba6a1b8 (2026-03-24) — cppkafka submodule bumps tied to the same fix
  • 415418b49d0 (2026-04-02) — partial revert of the StorageKafka part of the heartbeat hang fix (cppkafka behavior may still include the flag respect)
  • 360f92f74a1 (2025-03-06) — "let's try to avoid unsubscribe" — consumerGracefulStop() avoids unsubscribe()
  • 78ccf909258 (2025-07-02) — sets RD_KAFKA_DESTROY_F_NO_CONSUMER_CLOSE (was ineffective for LeaveGroup until cppkafka fix)

Potential fixes

  1. Set a shorter session.timeout.ms in the librdkafka consumer config (e.g., 10000 ms) so zombie sessions expire within the MAX_TIME_TO_WAIT_FOR_ASSIGNMENT_MS window. This is a config-level workaround.
  2. Increase MAX_TIME_TO_WAIT_FOR_ASSIGNMENT_MS to exceed session.timeout.ms, so the consumer waits long enough for rebalance to complete. The downside is that every poll cycle takes longer.
  3. Find a way to send LeaveGroup safely without triggering the toppar-queue deadlock that motivated consumerGracefulStop(). The original deadlock was between the rebalance callback and toppar-queue teardown — a careful ordering of unsubscribe + drain might avoid it.
  4. Test-level fix: add a sleep before the first restart to allow initial consumption, and increase per-restart wait to 30+ s. This does not fix the underlying engine issue.

Does it reproduce on the most recent release?

Yes

貢獻者指南