cube-js/cube

CubeJS Server crashes with "terminating connection due to conflict with recovery" (PG read replica)

Open

#3904 opened on January 10, 2022

help wanted

Description

Describe the bug

When a PostgreSQL read replica is used as the data source, the Cube server crashes when a running query is interrupted by the replica. The interruption happens when pending WAL entries have conflicted with the query for longer than max_standby_archive_delay or max_standby_streaming_delay (both default to 30s).

As per the documentation: "Note that max_standby_archive_delay is not the same as the maximum length of time a query can run before cancellation (...) if one query has resulted in significant delay, subsequent conflicting queries will have much less grace time until the standby server has caught up again."
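The shrinking grace time described in that note can be illustrated with a small calculation. This is a deliberately simplified model, not PostgreSQL's actual implementation: it assumes the standby tolerates a fixed total replay delay and that a newly conflicting query only gets whatever is left of it. The function name `remainingGraceMs` and the lag values are made up for illustration.

```javascript
// Simplified model (an assumption, not how PostgreSQL is implemented):
// the standby tolerates at most `maxStandbyDelayMs` of replay delay in total,
// so a query that starts conflicting while replay is already `replayLagMs`
// behind only gets the remainder as grace time.
function remainingGraceMs(maxStandbyDelayMs, replayLagMs) {
  return Math.max(0, maxStandbyDelayMs - replayLagMs);
}

const maxDelayMs = 30000; // default max_standby_streaming_delay (30s)

console.log(remainingGraceMs(maxDelayMs, 0));     // 30000
console.log(remainingGraceMs(maxDelayMs, 22000)); // 8000
console.log(remainingGraceMs(maxDelayMs, 45000)); // 0
```

Under this model, a query that would finish in 10s is still cancelled if an earlier query has already held up replay for 22s.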

To Reproduce

  1. (Optional) To make reproduction easier, reduce max_standby_archive_delay and max_standby_streaming_delay from the default of 30s to a few milliseconds.
  2. Trigger a long-running query against a PostgreSQL read replica.
  3. The CubeJS server crashes with the following error:
error: terminating connection due to conflict with recovery
  at Parser.parseErrorMessage (/app/index.js:402934:98)
  at Parser.handlePacket (/app/index.js:402773:29)
  at Parser.parse (/app/index.js:402686:38)
  (...)

Expected behavior

When a PostgreSQL read replica aborts a query due to a "conflict with recovery", the query should be retried once or twice with backoff instead of crashing the server.
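A minimal sketch of the requested behavior, to make the expectation concrete. This is not Cube's API or an existing driver option: `withRecoveryRetry`, `isRecoveryConflict`, and the `retries` and `baseDelayMs` options are hypothetical names, and the conflict is detected by matching the error message shown above rather than an error code.

```javascript
// Hypothetical helper, not part of Cube or the postgres driver.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Detects the standby's recovery-conflict error by its message text.
function isRecoveryConflict(err) {
  return Boolean(err) && /conflict with recovery/i.test(String(err.message));
}

// Runs `runQuery`, retrying up to `retries` times with exponential backoff
// when the failure looks like a recovery conflict. Other errors are rethrown
// immediately.
async function withRecoveryRetry(runQuery, { retries = 2, baseDelayMs = 500 } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await runQuery();
    } catch (err) {
      if (attempt >= retries || !isRecoveryConflict(err)) throw err;
      await sleep(baseDelayMs * 2 ** attempt); // 500ms, 1000ms, ...
    }
  }
}
```

Because the error message says the connection itself is terminated, `runQuery` would need to acquire a fresh connection on every attempt rather than reuse the one that failed.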

Our current CubeJS server configuration (below) already sets a query execution timeout shorter than the allowed replication delay. However, if one or more earlier queries have already used up a significant portion of that delay, even a relatively fast query can still be aborted.

  orchestratorOptions: {
    queryCacheOptions: {
      (...)
      queueOptions: {
        executionTimeout: 25, // seconds; below the 30s standby delay
        orphanedTimeout: 20, // seconds
        heartBeatInterval: 5, // seconds
      },
    },
  },

Version:

  @cubejs-backend/server-core: "0.29.17"
  @cubejs-backend/postgres-driver: "0.29.17"

Additional context

We also tried increasing max_standby_archive_delay and max_standby_streaming_delay to 60s (double the default), but the problem still occurs frequently (twice a day or more).
