rebalance never starts · citusdata/citus#7103


Labels: good first issue, warm-up

Description

With Citus 11.3, I added a node and triggered a rebalance. The rebalance was scheduled correctly but never starts running, despite having 1 runnable task (and 10 blocked ones).

I'm using the Docker image citusdata/citus:11.3 on all nodes. The connection between the nodes works (the primary is at 10.132.0.2):

SELECT * FROM citus_get_active_worker_nodes();
 node_name  | node_port
------------+-----------
 10.132.0.4 |      5432
 10.132.0.5 |      5432
(2 rows)

Command history:

staging=# SELECT * from citus_add_node('10.132.0.5', 5432);
 citus_add_node
----------------
             10
(1 row)

Time: 623.522 ms
staging=# SELECT citus_rebalance_start();
NOTICE:  Scheduled 10 moves as job 1
DETAIL:  Rebalance scheduled as background job
HINT:  To monitor progress, run: SELECT * FROM citus_rebalance_status();
 citus_rebalance_start
-----------------------
                     1
(1 row)

Time: 26.101 ms
staging=# SELECT * FROM citus_rebalance_status();
 job_id |   state   | job_type  |           description           | started_at | finished_at |                              details
--------+-----------+-----------+---------------------------------+------------+-------------+--------------------------------------------------------------------
      1 | scheduled | rebalance | Rebalance all colocation groups |            |             | {"tasks": [], "task_state_counts": {"blocked": 10, "runnable": 1}}
(1 row)

Time: 3.200 ms
staging=# SELECT pg_terminate_backend(pg_stat_activity.pid)
FROM pg_stat_activity
WHERE pg_stat_activity.datname = 'staging'
  AND pid <> pg_backend_pid();
 pg_terminate_backend
----------------------
 t
 t
 t
 t
 t
 t
 t
 t
(8 rows)
staging=# SELECT get_rebalance_table_shards_plan();
               get_rebalance_table_shards_plan
-------------------------------------------------------------
 (sensor_datapoint,102183,0,10.132.0.4,5432,10.132.0.5,5432)
 (sensor_datapoint,102182,0,10.132.0.2,5432,10.132.0.5,5432)
 (sensor_datapoint,102185,0,10.132.0.4,5432,10.132.0.5,5432)
 (sensor_datapoint,102184,0,10.132.0.2,5432,10.132.0.5,5432)
 (sensor_datapoint,102187,0,10.132.0.4,5432,10.132.0.5,5432)
 (sensor_datapoint,102186,0,10.132.0.2,5432,10.132.0.5,5432)
 (sensor_datapoint,102189,0,10.132.0.4,5432,10.132.0.5,5432)
 (sensor_datapoint,102188,0,10.132.0.2,5432,10.132.0.5,5432)
 (sensor_datapoint,102191,0,10.132.0.4,5432,10.132.0.5,5432)
 (sensor_datapoint,102190,0,10.132.0.2,5432,10.132.0.5,5432)
(10 rows)

Time: 4.475 ms
staging=# SELECT * from pg_dist_node;
 nodeid | groupid |  nodename  | nodeport | noderack | hasmetadata | isactive | noderole | nodecluster | metadatasynced | shouldhaveshards
--------+---------+------------+----------+----------+-------------+----------+----------+-------------+----------------+------------------
      1 |       0 | 10.132.0.2 |     5432 | default  | t           | t        | primary  | default     | t              | t
      6 |       5 | 10.132.0.4 |     5432 | default  | t           | t        | primary  | default     | t              | t
     10 |       9 | 10.132.0.5 |     5432 | default  | t           | t        | primary  | default     | t              | t
staging=# ALTER SYSTEM SET citus.max_background_task_executors_per_node = 2;
ALTER SYSTEM
Time: 9.613 ms
staging=# SELECT pg_reload_conf();
 pg_reload_conf
----------------
 t
(1 row)

Time: 1.585 ms
staging=# SELECT * FROM citus_rebalance_status() \gx
-[ RECORD 1 ]-------------------------------------------------------------------
job_id      | 1
state       | scheduled
job_type    | rebalance
description | Rebalance all colocation groups
started_at  |
finished_at |
details     | {"tasks": [], "task_state_counts": {"blocked": 10, "runnable": 1}}

Time: 3.033 ms

I've been waiting for a long time and nothing changes.
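For anyone trying to reproduce this, the queries below are a rough sketch of how the individual tasks behind job 1 might be inspected. They assume the Citus 11.x metadata tables pg_dist_background_task and pg_dist_background_task_depend, with the column names as I understand them. These may differ between versions, so treat the column lists as an assumption and fall back to SELECT * if a column is missing.

```sql
-- Per-task view of job 1. A task that an executor has picked up
-- would be expected to show a non-NULL pid and a status other than
-- 'runnable' or 'blocked'.
SELECT task_id, status, pid, retry_count, not_before, message, command
FROM pg_dist_background_task
WHERE job_id = 1
ORDER BY task_id;

-- Dependency edges between tasks of job 1: each row says that
-- task_id cannot start until depends_on has finished.
SELECT task_id, depends_on
FROM pg_dist_background_task_depend
WHERE job_id = 1
ORDER BY task_id, depends_on;
```

If the runnable task still has an empty pid after a long wait, that would suggest nothing is picking tasks off the queue, rather than a task being stuck mid-move.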

Contributor guide

Tech stack: C, PostgreSQL, SQL
Domain: database
Issue type: bug
Difficulty: 3
Estimated time: 1-3 hours
Activity status: active
Clarity: mostly clear
Prerequisites: PostgreSQL, Citus, Docker
Beginner friendliness: 20
Research direction: Investigate the Citus rebalance background task scheduler. Check whether the citus.max_background_task_executors_per_node setting is being respected and whether the background workers are actually started. Look at the source code for background task execution, particularly in src/backend/distributed/worker/ or similar. The issue reports that the rebalance remains in the 'scheduled' state despite having runnable tasks. The background worker processes may not be picking up the tasks because of a configuration or deadlock issue. Confirm by checking the logs and pg_stat_activity for any Citus background worker processes.
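A minimal sketch of the pg_stat_activity check suggested above, run on the coordinator. The exact backend_type and application_name strings that Citus uses for its background workers are not taken from this issue, so the filter is a loose pattern match on "citus" and is only an assumption. The two SHOW commands read the setting changed in the report and PostgreSQL's own max_worker_processes limit, which caps how many background workers can run at once.

```sql
-- List Citus-related backends attached to the current database.
-- The ILIKE patterns are a guess at how these workers are labelled.
SELECT pid, backend_type, application_name, state, backend_start
FROM pg_stat_activity
WHERE datname = current_database()
  AND (backend_type ILIKE '%citus%' OR application_name ILIKE '%citus%')
ORDER BY backend_start;

-- Confirm that the reloaded value is in effect.
SHOW citus.max_background_task_executors_per_node;

-- PostgreSQL-wide cap on background worker processes.
SHOW max_worker_processes;
```

Note that the pg_terminate_backend call in the command history targeted every backend in the staging database except the current session, so it is worth checking whether any Citus background processes were among those terminated and whether they came back afterwards.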