bytedance/deer-flow

[Stability][BUG-004] Persistence can remain stale running or hit SQLite lock under stress

Closed

#3115 opened on May 21, 2026

Labels: help wanted, question

Description

Parent stability dashboard: #3107

This issue tracks BUG-004 from #3107.

Problem

Under long Ultra runs with repeated large writes, persistence can fail to record the final run state. The UI and the DB may keep showing a run as running even after backend logs indicate that execution has already failed or completed.

Evidence

Source: gateway log, run worker + SQLAlchemy/SQLite stack traces.

Run ... failed: database disk image is malformed
sqlite3.DatabaseError: database disk image is malformed
sqlite3.OperationalError: database is locked
sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) database is locked

Source: gateway log, SQLAlchemy failed update parameters.

[parameters: ('error', 'database disk image is malformed', ..., '<run_id>')]

A later local DB integrity check returned ok, so the durable symptom was stale run state and failed persistence, not necessarily permanent DB corruption.
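The integrity check mentioned above can be reproduced with SQLite's built-in `PRAGMA integrity_check`. The following is a minimal sketch using Python's standard `sqlite3` module. The helper name `check_integrity` and the database path are placeholders, not names from the project.

```python
import sqlite3


def check_integrity(db_path: str) -> list[str]:
    """Return the messages from PRAGMA integrity_check.

    A healthy database yields a single row containing "ok".
    """
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute("PRAGMA integrity_check").fetchall()
    finally:
        conn.close()
    return [row[0] for row in rows]


if __name__ == "__main__":
    # ":memory:" is a stand-in; point this at the real database file.
    print(check_integrity(":memory:"))
```

A result other than a single `ok` row would point toward real corruption. A clean result, as reported here, is consistent with the failure being transient.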

Relationship to other issues

This should be treated as a persistence robustness issue, but it may be triggered by upstream runtime pressure:

  • stale config can keep model output limits on old values;
  • old limits can cause truncated large artifact writes;
  • failed writes can repeatedly echo large payloads into state;
  • more retries create more checkpoint/write/update pressure;
  • SQLite then sees more concurrent writes and recovery pressure.
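The last link in this chain, write contention on SQLite, is commonly reduced with WAL journaling and a busy timeout. The sketch below shows those two settings with the standard `sqlite3` module rather than the project's SQLAlchemy setup. The function name `open_connection` and the 5-second timeout are illustrative assumptions, and whether the project already applies these settings is not stated in this issue.

```python
import sqlite3


def open_connection(db_path: str, busy_timeout_ms: int = 5000) -> sqlite3.Connection:
    """Open a connection tuned for concurrent writers (hypothetical helper)."""
    # The `timeout` argument makes sqlite3 wait for a lock instead of failing at once.
    conn = sqlite3.connect(db_path, timeout=busy_timeout_ms / 1000)
    # WAL lets readers proceed while a single writer is active.
    conn.execute("PRAGMA journal_mode=WAL")
    # Set the busy timeout explicitly as well, so it is visible via PRAGMA.
    conn.execute(f"PRAGMA busy_timeout={int(busy_timeout_ms)}")
    return conn
```

These settings reduce how often `database is locked` is raised, but they do not remove the need for retry and recovery logic around run finalization.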

Impact

  • A task can appear stuck even after backend execution has ended.
  • Token and message counters may not be persisted.
  • Users cannot trust the final run status in the UI.

Expected behavior

  • Run finalization should be durable and recoverable.
  • SQLite lock/retry behavior should not leave runs permanently stale.
  • If persistence fails, the UI should surface a clear error state rather than showing the run as running indefinitely.
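One way to approach these expectations is to retry the final status write when SQLite reports a lock, and to reconcile rows left in running that no worker still owns. The sketch below uses the standard `sqlite3` module, not the project's actual persistence layer. The `runs` table, its `run_id`, `status`, and `error` columns, and both function names are assumptions made for illustration.

```python
import sqlite3
import time


def finalize_run(conn, run_id, status, error=None, attempts=5, base_delay=0.05):
    """Persist a terminal status, retrying on 'database is locked'.

    Returns True if the write was committed, False if every attempt was locked out.
    """
    for attempt in range(attempts):
        try:
            with conn:  # commits on success, rolls back on exception
                conn.execute(
                    "UPDATE runs SET status = ?, error = ? WHERE run_id = ?",
                    (status, error, run_id),
                )
            return True
        except sqlite3.OperationalError as exc:
            if "locked" not in str(exc).lower():
                raise  # not a lock problem; let the caller see it
            time.sleep(base_delay * 2 ** attempt)  # exponential backoff
    return False


def mark_stale_runs(conn, active_run_ids):
    """Flip 'running' rows that no live worker owns to 'error'.

    Intended for startup or a periodic sweep; returns the affected run ids.
    """
    rows = conn.execute(
        "SELECT run_id FROM runs WHERE status = 'running' ORDER BY run_id"
    ).fetchall()
    stale = [row[0] for row in rows if row[0] not in active_run_ids]
    with conn:
        conn.executemany(
            "UPDATE runs SET status = 'error', error = ? WHERE run_id = ?",
            [("run ended without a persisted final state", rid) for rid in stale],
        )
    return stale
```

When `finalize_run` returns False, the caller can log the failure and rely on the sweep in `mark_stale_runs` to move the row to an explicit error state later, so the UI does not show the run as running forever.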

Contributor guide