[Stability][v2.0-m1-rc1] Release-blocking issues found in Ultra/user workflow testing · bytedance/deer-flow#3107


Description

Summary

During local Docker testing of the v2.0-m1-rc1 release candidate, several stability issues were found in realistic user workflows, especially Ultra mode, subagent execution, artifact generation, export, and frontend state rendering.

This issue intentionally groups the findings together because the failures appear connected in practice:

stale runtime config can keep model/runtime components on old settings;
old or inconsistent model output limits can trigger truncated artifact generation;
failed large artifact writes can feed large payloads back into the model context;
subagent result propagation failures cause the lead agent to repeat work;
high-volume retries increase checkpoint/persistence pressure;
frontend/export paths then expose confusing or unsafe user-visible behavior.

The examples below use code paths, representative logs, and screenshots.

Evidence note:

Log snippets are labeled with their source, for example gateway log, frontend browser log, checkpoint/state inspection, or database inspection.
Screenshots are embedded below as GitHub issue attachments.
Local thread/run identifiers are intentionally omitted; the issue focuses on reproducible failure classes and code/log evidence.

Environment

Release candidate: v2.0-m1-rc1
Deployment: local Docker stack
Model under test: deepseek-v4-pro
Test style: browser-based user workflows plus backend log/checkpoint/DB inspection
Main stress workflow: Ultra mode, high reasoning effort, multi-step web research, subagents, and HTML artifact generation

BUG-001: Gateway uses stale `AppConfig` after `config.yaml` changes

Symptom

Changing config.yaml while the gateway process is running does not reliably affect subsequent runs. In particular, model settings such as max_tokens can remain stuck on the value captured at gateway startup until the gateway process is restarted.

This is more serious than a single model-parameter refresh bug. The gateway can enter a split-brain state where:

global get_app_config() may reload the changed file;
request.app.state.config still points to the startup snapshot;
already-initialized runtime components still use the old config;
some fallback paths may see the new config while the main run path still uses the old config.

Code evidence

Startup loads config once:

# backend/app/gateway/app.py
app.state.config = get_app_config()

Per-request dependency returns the startup snapshot:

# backend/app/gateway/deps.py
def get_config(request: Request) -> AppConfig:
    return getattr(request.app.state, "config", None)

The run context passes that snapshot forward:

# backend/app/gateway/deps.py
return RunContext(..., app_config=config)

The worker injects the same snapshot into the runtime and agent factory:

# backend/packages/harness/deerflow/runtime/runs/worker.py
runtime_ctx = _build_runtime_context(..., ctx.app_config)
agent = agent_factory(config=runnable_config, app_config=ctx.app_config)

The lead agent then skips get_app_config() because runtime_app_config is already present:

# backend/packages/harness/deerflow/agents/lead_agent/agent.py
return _make_lead_agent(config, app_config=runtime_app_config or get_app_config())

get_app_config() itself does have mtime-based reload logic:

# backend/packages/harness/deerflow/config/app_config.py
should_reload = _app_config is None or _app_config_path != resolved_path or _app_config_mtime != current_mtime

The issue is that the gateway run path can bypass that reload by passing the old AppConfig object explicitly.
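The bypass described above can be reproduced in isolation. The following is a minimal sketch, not DeerFlow code: `load_config`, `run_with_snapshot`, and `run_with_reload` are hypothetical stand-ins, and the config file is reduced to a single integer. It only illustrates why an mtime-based reload has no effect once an older object is passed explicitly.

```python
import os
import tempfile

# Hypothetical module-level cache, modeled loosely on an mtime-based reload.
_cache = {"path": None, "mtime": None, "value": None}


def load_config(path: str) -> dict:
    """Stand-in for a reloading getter: re-read the file when its mtime changes."""
    mtime = os.stat(path).st_mtime_ns
    if _cache["value"] is None or _cache["path"] != path or _cache["mtime"] != mtime:
        with open(path, encoding="utf-8") as fh:
            value = {"max_tokens": int(fh.read().strip())}
        _cache.update(path=path, mtime=mtime, value=value)
    return _cache["value"]


def run_with_snapshot(snapshot: dict) -> int:
    """Run path that receives the startup object explicitly and never reloads."""
    return snapshot["max_tokens"]


def run_with_reload(path: str) -> int:
    """Run path that asks the reloading getter on every run."""
    return load_config(path)["max_tokens"]


def demo() -> tuple:
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "config.txt")
        with open(path, "w", encoding="utf-8") as fh:
            fh.write("8192")
        startup_snapshot = load_config(path)  # captured once, like app.state.config

        with open(path, "w", encoding="utf-8") as fh:
            fh.write("384000")
        # Bump the mtime so the change is visible even on coarse-grained filesystems.
        st = os.stat(path)
        os.utime(path, ns=(st.st_atime_ns, st.st_mtime_ns + 1_000_000_000))

        return run_with_snapshot(startup_snapshot), run_with_reload(path)
```

Calling `demo()` returns the old limit for the snapshot path and the new limit for the reloading path, which mirrors the split-brain state described in the symptom.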

Runtime component impact

langgraph_runtime(app) initializes these components from app.state.config at startup:

stream bridge
persistence engine
checkpointer
store
run event store
run manager

Those components are not rebuilt when config.yaml changes. This means config changes to database, checkpointer, run_events, or related runtime settings cannot be expected to take effect safely without restart.

Observed behavior

After changing max_tokens in config.yaml, runs started before the gateway restart still produced repeated completions capped at 8192 output tokens:

Source: gateway log, token usage middleware.

LLM token usage: input=32433 output=8192 total=40625 ... finish_reason=length
LLM token usage: input=40653 output=8192 total=48845 ... finish_reason=length
LLM token usage: input=57478 output=8192 total=65670 ... finish_reason=length

After restarting the gateway process, payload debug logs showed the updated model setting being sent:

Source: gateway log, patched DeepSeek request payload debug.

[deepseek-payload-debug] model=deepseek-v4-pro self.max_tokens=384000 payload.max_tokens=384000

The post-restart run then produced a write_file call above the old 8192 output ceiling:

Source: gateway log, token usage middleware.

LLM token usage: input=29460 output=10579 total=40039

Impact

Config changes appear to apply in the file but not in actual user runs.
Model/runtime behavior can differ before and after gateway restart.
Runtime persistence components can remain initialized from stale config.
This stale-config path likely contributed to downstream artifact retries and persistence pressure.

Expected behavior

Either:

config.yaml changes should be applied consistently across the request/run/runtime path; or
config hot reload should be explicitly unsupported for these fields, and users should receive a clear restart-required boundary.
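If the second option is chosen, one possible way to make the restart-required boundary explicit is to compare the startup snapshot with the freshly loaded config and report which startup-only sections differ. This is a sketch under assumptions: the section names come from the runtime settings listed in this issue, while `RESTART_REQUIRED_SECTIONS` and `changed_restart_sections` are hypothetical names and plain mappings stand in for `AppConfig`.

```python
from typing import Any, Mapping, Sequence

# Assumption: sections that are only read when runtime components are built at startup.
RESTART_REQUIRED_SECTIONS = ("database", "checkpointer", "run_events")


def changed_restart_sections(
    startup: Mapping[str, Any],
    current: Mapping[str, Any],
    sections: Sequence[str] = RESTART_REQUIRED_SECTIONS,
) -> list:
    """Return the startup-only sections whose values differ from the startup snapshot."""
    return [name for name in sections if startup.get(name) != current.get(name)]
```

A gateway could log or surface the returned list as a "restart required" warning instead of silently running with mixed settings.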

BUG-002: Subagent task completes internally but parent task result fails

Symptom

In Ultra mode, subagents are launched and complete internally, but the parent task tool reports failure instead of returning the subagent result to the lead agent.

Representative logs:

Source: gateway log, subagent executor + task tool + tool error middleware.

Subagent general-purpose completed async execution
Subagent general-purpose final messages count: ...
Task ... status: completed
Tool execution failed (async): name=task
TypeError: 'AsyncCallbackManager' object is not iterable

Stack root:

deerflow/tools/builtins/task_tool.py
  _report_subagent_usage(runtime, result)
  _find_usage_recorder(runtime)
  for cb in callbacks:
TypeError: 'AsyncCallbackManager' object is not iterable

Observed behavior

Across multiple Ultra runs:

subagents were started;
subagents logged internal completion;
parent-visible task tool results contained only wrapper errors;
lead agent explicitly fell back to direct work;
run accounting showed subagent_tokens=0.

Representative parent-visible result:

Source: checkpoint/state inspection of parent tool result.

Error: Tool 'task' failed with TypeError: 'AsyncCallbackManager' object is not iterable. Continue with available context, or choose an alternative tool.

Impact

Ultra mode loses the value of subagent work.
The lead agent repeats overlapping fetch/research work.
User-perceived latency and token usage increase.
Frontend task state can become misleading because the task completed internally but failed at the parent wrapper layer.

Expected behavior

Completed subagent results should be delivered back to the lead agent.
Usage-reporting failures should not turn a successful subagent result into a failed task tool result.
Subagent token accounting should be reflected correctly.
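A minimal sketch of the second expectation follows. It assumes the callbacks value may be a plain list or a manager-like object that exposes a `handlers` list (LangChain callback managers do expose such an attribute, but the exact object DeerFlow receives here is not confirmed). `iter_callback_handlers` and `report_usage_safely` are hypothetical helpers, not the project's functions.

```python
import logging
from typing import Any, Callable, List

logger = logging.getLogger(__name__)


def iter_callback_handlers(callbacks: Any) -> List[Any]:
    """Normalize callbacks to a list, whether given a list or a manager-like object."""
    if callbacks is None:
        return []
    handlers = getattr(callbacks, "handlers", None)
    if handlers is not None:
        return list(handlers)
    if isinstance(callbacks, (list, tuple)):
        return list(callbacks)
    return []


def report_usage_safely(
    callbacks: Any,
    result: Any,
    report: Callable[[Any, Any], None],
) -> Any:
    """Report usage on a best-effort basis and always hand the result back."""
    try:
        for handler in iter_callback_handlers(callbacks):
            report(handler, result)
    except Exception:
        # Accounting problems are logged, but they do not fail the task tool.
        logger.exception("usage reporting failed; returning subagent result anyway")
    return result
```

The design choice is that accounting is treated as a side effect: a failure there is logged and the completed subagent result still reaches the lead agent.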

BUG-003: Large `write_file` failures amplify token usage

Symptom

When generating a large HTML artifact, write_file can fail because the model output is truncated or the tool arguments become incomplete. The failure path can echo large attempted file contents back into the conversation state, causing subsequent model calls to carry much larger context.

Evidence

Representative token sequence during artifact generation:

Source: gateway log, token usage middleware.

LLM token usage: input=29324 output=8192 total=37516
LLM token usage: input=46564 output=8192 total=54756
LLM token usage: input=63274 output=3903 total=67177
LLM token usage: input=71117 output=2682 total=73799

Representative failed tool-result sizes found in checkpoint state:

Source: checkpoint/state inspection of write_file tool messages.

write_file error payload: ~23.7K chars
write_file error payload: ~24.1K chars
write_file error payload: ~10.6K chars

Another run showed a missing required argument after an 8192-token truncated output:

Source: checkpoint/state inspection of AI message usage + following write_file tool result.

write_file output=8192 finish_reason=length
write_file missing required path
tool error echoed ~23K chars of attempted HTML content

Mechanism

The token growth pattern is:

model tries to generate a large HTML report as one write_file call;
output hits a limit or tool args become incomplete;
write_file fails;
the tool error includes a large portion of the attempted content;
that large error becomes part of conversation state;
the next LLM call has a much larger input context;
the agent retries with another writing strategy.
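The growth pattern can be checked against the representative token sequence in the Evidence section. The sketch below only does arithmetic on those logged numbers: for each call it subtracts the previous call's output from the growth in input, leaving the context added by everything else (tool results, error text, and any other messages). Attributing that remainder entirely to echoed `write_file` errors would be an assumption; the function name is hypothetical.

```python
# (input_tokens, output_tokens) pairs copied from the gateway log excerpt in Evidence.
CALLS = [(29324, 8192), (46564, 8192), (63274, 3903), (71117, 2682)]


def context_growth_beyond_output(calls):
    """For each consecutive pair, return input growth minus the previous output."""
    growth = []
    for (prev_in, prev_out), (next_in, _next_out) in zip(calls, calls[1:]):
        growth.append((next_in - prev_in) - prev_out)
    return growth


# Input grew by 17240, 16710, and 7843 tokens across the three steps.
# After removing the model's own previous output, the remainder is:
print(context_growth_beyond_output(CALLS))  # [9048, 8518, 3940]
```

In the first two steps, more than half of the input growth comes from content other than the model's own previous output.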

Impact

Token usage can grow from that of a normal large task into a million-token-class run.
Runtime cost becomes hard for users to predict.
Persistence/checkpoint writes also increase, which may contribute to DB pressure.
The final artifact may eventually succeed, but after expensive retries.

Expected behavior

Tool errors should not echo large content arguments back into model context.
Large artifact generation should use a bounded, reliable writing strategy.
If an artifact cannot be written, the error returned to the model should be concise and structured.
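One way to keep tool errors concise and structured is to replace large string arguments with a length and a short preview before the error is returned to the model. This is a sketch only: `summarize_arg`, `format_tool_error`, and the 80-character preview size are illustrative assumptions, not the behavior of the existing `write_file` tool.

```python
import json
from typing import Any, Mapping


def summarize_arg(value: Any, preview_chars: int = 80) -> Any:
    """Replace long strings with their length and a short preview."""
    if isinstance(value, str) and len(value) > preview_chars:
        return {"type": "str", "length": len(value), "preview": value[:preview_chars]}
    return value


def format_tool_error(tool: str, reason: str, args: Mapping[str, Any]) -> str:
    """Build a compact JSON error that never echoes full argument bodies."""
    payload = {
        "tool": tool,
        "error": reason,
        "args": {name: summarize_arg(value) for name, value in args.items()},
    }
    return json.dumps(payload, ensure_ascii=False, default=str)
```

With this shape, a failed write of a 24K-character HTML body produces an error of a few hundred characters, so a retry does not inherit the attempted content.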

BUG-004: Persistence can leave runs stuck in running or hit SQLite locks under stress

Symptom

Under long Ultra runs with repeated large writes, persistence can fail to record final run state correctly. The UI/DB may continue to show a run as running even after the backend logs indicate the run has failed or completed internally.

Representative logs:

Source: gateway log, run worker + SQLAlchemy/SQLite stack traces.

Run ... failed: database disk image is malformed
sqlite3.DatabaseError: database disk image is malformed
sqlite3.OperationalError: database is locked
sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) database is locked

The attempted status update also failed:

Source: gateway log, SQLAlchemy failed update parameters.

[parameters: ('error', 'database disk image is malformed', ..., '<run_id>')]

In the inspected local DB, a later integrity check returned ok, so the durable symptom was stale run state and failed persistence, not necessarily permanent DB corruption.

Relationship to other bugs

This should be treated as a persistence robustness issue, but it may be triggered by upstream runtime problems:

stale config can keep model output limits on old values;
old limits can cause truncated large artifact writes;
failed writes can repeatedly echo large payloads into state;
more retries create more checkpoint/write/update pressure;
SQLite then sees more concurrent writes and recovery pressure.

Impact

A task can appear stuck even after backend execution has ended.
Token and message counters may not be persisted.
Users cannot trust the final run status in the UI.

Expected behavior

Run finalization should be durable and recoverable.
SQLite lock/retry behavior should not leave runs permanently stale.
If persistence fails, the UI should surface a clear error state rather than indefinite running.
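For the lock case specifically, a bounded retry with backoff around the final status update is one common mitigation. The sketch below uses only the standard-library `sqlite3` exception types. `retry_on_locked` is a hypothetical helper, and the attempt count and delays are placeholder values rather than tuned recommendations.

```python
import sqlite3
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_on_locked(
    operation: Callable[[], T],
    attempts: int = 5,
    base_delay: float = 0.05,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry an operation while SQLite reports a lock, with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except sqlite3.OperationalError as exc:
            is_last = attempt == attempts - 1
            # Only lock errors are retried; anything else propagates immediately.
            if "locked" not in str(exc).lower() or is_last:
                raise
            sleep(base_delay * (2 ** attempt))
    raise AssertionError("unreachable")
```

Errors such as `database disk image is malformed` are raised as `sqlite3.DatabaseError`, so they are not retried here and would still need a separate path that marks the run as failed.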

BUG-005: Active run token and raw stream observability is insufficient

Symptom

During a long run, checkpoint message metadata already contains substantial token usage, while the run row/API still shows zero totals.

Observed active-run state:

Source: database inspection while the run was still active.

runs.status=running
runs.total_tokens=0
runs.llm_call_count=0
runs.message_count=0

At the same time, checkpoint state already summed to hundreds of thousands of tokens.

Raw stream was also not durably available:

Source: gateway log, run worker stream-mode setup.

'events' stream_mode not supported in gateway (requires astream_events + checkpoint callbacks). Skipping.

Actual stream modes:

['messages', 'custom', 'updates', 'values']

run_events was memory-backed, and no durable run event rows were available after the fact.

Impact

Operators cannot monitor runaway token cost from normal run records while the run is active.
After-the-fact debugging depends on checkpoints/logs rather than a durable raw event stream.
It is difficult to tell whether a long-running task is healthy, stuck, or burning budget.

Expected behavior

Active runs should expose current token/LLM-call/message counters.
Raw stream or equivalent trace should be optionally persisted for debugging long tasks.
Cost visibility should not require manually inspecting checkpoint internals.
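As an illustration of the first expectation, live counters can be derived from the same per-message usage data that the checkpoint already holds. The sketch assumes each AI message carries a `usage_metadata` mapping with a `total_tokens` field, modeled on LangChain's message usage metadata. The dict-based message shape and `summarize_usage` are assumptions for illustration.

```python
from typing import Any, Dict, Iterable, Mapping


def summarize_usage(messages: Iterable[Mapping[str, Any]]) -> Dict[str, int]:
    """Aggregate run-level counters from per-message usage metadata."""
    totals = {"total_tokens": 0, "llm_call_count": 0, "message_count": 0}
    for message in messages:
        totals["message_count"] += 1
        usage = message.get("usage_metadata")
        if not usage:
            continue  # human and tool messages carry no usage
        totals["llm_call_count"] += 1
        totals["total_tokens"] += int(usage.get("total_tokens", 0))
    return totals
```

Updating the run row with these totals after each model call, instead of only at finalization, would let operators see cost while the run is still active.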

BUG-006: Chat export includes hidden context, memory, reasoning, and trace

Symptom

Normal chat export can include content that is not visible in the chat transcript.

Observed exported content included:

<system-reminder>
<memory>
<current_date>
Thinking/reasoning details
tool call names / trace-like information

Important distinction: this was not observed as raw system prompt leakage. The concrete issue is that hidden dynamic context, memory, reasoning content, and debug trace can be included in a normal user export.

Code evidence

The chat UI has hidden-message filtering, but export does not appear to apply the same boundary:

frontend/src/core/threads/export.ts

Markdown export includes reasoning blocks:

<details>
<summary>Thinking</summary>
...
</details>

JSON export maps raw messages more directly and can include tool-related fields.

Related state evidence

Checkpoint/state stores provider-returned reasoning content in:

Source: checkpoint/state inspection of AI messages.

AIMessage.additional_kwargs.reasoning_content

So even if the normal UI hides it, export paths must explicitly filter it.

Impact

A user export is not a clean transcript.
Memory injected into model context can be exported as if it were part of the conversation.
Reasoning/tool traces can expose internal behavior that users did not ask to export.
Product privacy/debug boundaries are ambiguous.

Expected behavior

Default export should include only the user-visible transcript:

visible user messages;
visible assistant final answers;
visible artifact/file references if already shown to the user.

Default export should exclude:

hidden messages marked hide_from_ui;
dynamic context reminders;
memory injection;
thinking/reasoning content;
tool calls and tool results.

If raw trace export is needed, it should be a separate explicit debug/admin export surface.
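The default-export boundary above can be sketched as a single filter. The real export code lives in the TypeScript frontend, so this Python version only illustrates the logic. The dict-based message shape, the location of `hide_from_ui` under `additional_kwargs`, and the assumption that the hidden context tags appear as paired blocks are all hypothetical.

```python
import re
from typing import Any, Dict, Iterable, List, Mapping

# Tags observed in exported content; assumed here to appear as paired blocks.
HIDDEN_TAGS = ("system-reminder", "memory", "current_date")
_HIDDEN_BLOCK = re.compile(r"<(%s)>.*?</\1>" % "|".join(HIDDEN_TAGS), re.DOTALL)


def to_visible_transcript(messages: Iterable[Mapping[str, Any]]) -> List[Dict[str, str]]:
    """Keep only user-visible text from human and AI messages."""
    visible = []
    for message in messages:
        kwargs = message.get("additional_kwargs") or {}
        if kwargs.get("hide_from_ui"):
            continue
        if message.get("type") not in ("human", "ai"):
            continue  # drops tool results
        content = message.get("content")
        if not isinstance(content, str):
            continue
        text = _HIDDEN_BLOCK.sub("", content).strip()
        if not text:
            continue  # for example, AI messages that only carried tool calls
        role = "user" if message["type"] == "human" else "assistant"
        # Only role and text are copied, so reasoning_content and tool_calls never leak.
        visible.append({"role": role, "content": text})
    return visible
```

Building the export from an allow-list of fields, rather than deleting known-sensitive fields from raw messages, means newly added hidden fields stay out of exports by default.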

BUG-007: Subagent completed task still rendered as running

Symptom

After a long Ultra task completed, the frontend still displayed a task card as "Subtask running" (original UI label: 子任务运行中).

Screenshot:

Backend state at the same time indicated the run was terminal:

Source: database inspection after the run had ended.

runs.status=success
threads_meta.status=idle

The relevant parent-visible tool result was:

Source: checkpoint/state inspection of parent task tool result.

Error: Tool 'task' failed with TypeError: 'AsyncCallbackManager' object is not iterable. Continue with available context, or choose an alternative tool.

Code evidence

The frontend currently maps only these task result prefixes to terminal states:

Task Succeeded. Result:
Task failed.
Task timed out

The actual result starts with:

Error: Tool 'task' failed ...

So it falls through and remains rendered as in_progress.

Impact

A completed conversation looks like it is still doing work.
Users may wait unnecessarily.
It can make task retries or duplicate subagent work harder to reason about from the UI.

Expected behavior

Any terminal task-tool error should render as failed, not in-progress. Frontend state reconstruction should not depend on only a few exact English text prefixes.
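The fall-through can be avoided by treating any non-empty tool result as terminal and using the known prefixes only to pick a more specific state. The sketch below is in Python for illustration although the real mapping is frontend code. The state names and `classify_task_result` are hypothetical.

```python
from typing import Optional


def classify_task_result(result: Optional[str]) -> str:
    """Map a task tool result to a display state; any result is terminal."""
    if result is None or not result.strip():
        return "in_progress"  # no tool result yet
    text = result.lstrip()
    if text.startswith("Task Succeeded. Result:"):
        return "completed"
    if text.startswith("Task timed out"):
        return "timed_out"
    # "Task failed." and every unrecognized result, including wrapper errors
    # such as "Error: Tool 'task' failed ...", render as failed.
    return "failed"
```

A structured status field on the tool message would be more robust than any prefix matching, but that would require a backend change that this issue does not confirm.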

BUG-008: HTML artifact preview flickers and temporarily shows mojibake during writes

Symptom

While an HTML artifact is still being generated, the right-side preview can flicker frequently and temporarily render garbled text. After the artifact is fully written and reopened, the final file can render normally.

Screenshot showing temporary mojibake:

This issue explicitly includes the mojibake shown in 2.png. It is grouped under artifact preview because the final artifact was not necessarily corrupted; the visible problem was the in-progress preview rendering partial/incomplete HTML.

Code evidence

The frontend auto-opens the artifact panel for the latest in-progress write_file step:

frontend/src/components/workspace/messages/message-group.tsx

isLoading && isLast && autoOpen && autoSelect && path && !result

The selected item can be a write-file: pseudo-artifact rather than a completed output file.

The HTML preview renders current content through a blob URL:

frontend/src/components/workspace/artifacts/artifact-file-detail.tsx

new Blob([content ?? ""], { type: "text/html" })
URL.createObjectURL(blob)
<iframe src={htmlPreviewUrl} />

Likely mechanism

The preview points at partial or intermediate write_file content.
Each streamed content update recreates the blob URL and reloads the iframe.
If the current partial HTML lacks a complete <meta charset="UTF-8">, the iframe may guess encoding incorrectly and render mojibake.
Once the final complete HTML is loaded, the artifact can render normally.

Impact

Users see a broken-looking artifact while generation is still in progress.
Long-running report generation feels unstable even when the final output is eventually valid.
The preview can mislead users into thinking the model generated corrupted text.

Expected behavior

Do not auto-render incomplete HTML as a live iframe preview, or debounce/stabilize it.
For in-progress HTML writes, prefer code/loading view until the write completes.
Switch to rendered preview only after a completed file result or present_files.

BUG-009: Chat history timestamps are timezone-shifted

Symptom

Recently completed chats can appear in the history/search list as roughly 8 hours ago when tested in Asia/Shanghai.

Representative API shape:

Source: threads/history API response inspected in the browser/runtime.

{
  "created_at": "2026-05-20T06:10:22.970977",
  "updated_at": "2026-05-20T06:12:31.333753"
}

These timestamps have no timezone suffix such as Z or +00:00.

Mechanism

Browser JavaScript parses ISO date-time strings that have no timezone offset as local time:

new Date("2026-05-20T06:12:31.333753")

In Asia/Shanghai, this is interpreted as local 06:12, not UTC 06:12. If the backend intended UTC, the displayed relative time is shifted by about 8 hours.

Code evidence

frontend/src/app/workspace/chats/page.tsx
frontend/src/core/utils/datetime.ts

The frontend passes the raw timestamp string into date formatting without normalizing timezone-less backend timestamps.

Impact

Recent threads look stale.
History/search ordering and user trust in persistence/status are affected.

Expected behavior

Backend should return timezone-aware ISO timestamps, preferably UTC with Z, for example:

2026-05-20T06:12:31.333753Z

Alternatively, the frontend should normalize DeerFlow API timestamps without timezone as UTC before formatting.
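On the backend side, the normalization could look like the sketch below. It assumes that timezone-less timestamps stored by the backend are meant as UTC, which the issue suggests but does not confirm. `to_utc_iso` and `misread_offset` are hypothetical helpers, and the second one only reproduces the size of the shift for a given local offset.

```python
from datetime import datetime, timedelta, timezone


def to_utc_iso(value: str) -> str:
    """Return an ISO 8601 timestamp in UTC with a trailing Z."""
    parsed = datetime.fromisoformat(value)
    if parsed.tzinfo is None:
        # Assumption: naive timestamps from the backend are UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    else:
        parsed = parsed.astimezone(timezone.utc)
    return parsed.isoformat().replace("+00:00", "Z")


def misread_offset(value: str, local_offset_hours: int) -> timedelta:
    """How far off a naive UTC timestamp is when a client reads it as local time."""
    naive = datetime.fromisoformat(value)
    as_utc = naive.replace(tzinfo=timezone.utc)
    as_local = naive.replace(tzinfo=timezone(timedelta(hours=local_offset_hours)))
    return as_utc - as_local
```

For the sample `updated_at` value, `to_utc_iso` yields `2026-05-20T06:12:31.333753Z`, and `misread_offset(..., 8)` is 8 hours, matching the shift seen in Asia/Shanghai.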

BUG-010: Workspace negative performance timestamp runtime error

Symptom

Opening the workspace can trigger a full-screen Next.js runtime error overlay:

Runtime TypeError
Failed to execute 'measure' on 'Performance': 'WorkspacePage' cannot have a negative time stamp.

Representative frontend log:

Source: frontend browser console log captured by the local frontend service.

[browser] Uncaught TypeError: Failed to execute 'measure' on 'Performance': 'WorkspacePage' cannot have a negative time stamp.

Impact

A tester/user can see a framework error instead of the workspace.
It disrupts the core entry flow even if persisted thread data is not corrupted.

Open question

This needs confirmation outside the local dev/Turbopack-style environment. If it only happens in development, it may be lower priority; if it can happen in the Docker/release path, it should be fixed before stable tagging.

WATCH-001: Plan/search loop may continue after enough information

Symptom

A normal research prompt can continue issuing search/fetch tool calls after the model has already reasoned that it has enough information to summarize.

Representative prompt:

Summarize this week's sports news (original prompt: 总结本周体育新闻)

Observed failure shape:

many web_search and web_fetch calls;
reasoning indicated enough information had been collected;
the model still issued more search/fetch calls;
no final answer was produced before manual interruption;
token usage reached roughly 200K tokens.

A later run of the same prompt succeeded with much lower token usage, so this appears intermittent.

Additional evidence:

A small local HTML comparison report was generated for this item: plan-search-loop-token-report.zip.
The zip contains the HTML report, which can be downloaded and opened locally. It shows the turn-by-turn comparison of the failing and successful runs, the point where the model appeared to have enough information, and the extra search calls that followed.

Impact

Common research tasks can burn tokens without a final answer.
Users have no clear signal that the agent is looping.

Expected behavior

Once the agent determines it has enough information, it should produce the answer instead of continuing search.
Tool-call loops should have a convergence or budget guard.
If the agent cannot finish, it should return a clear partial/failure response.
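A budget guard of the kind mentioned above can be very small. The following sketch counts tool calls and tokens and raises once either limit is exceeded, so the caller can stop searching and return a partial answer. `ToolCallBudget`, `ToolBudgetExceeded`, and the limits are hypothetical and not part of DeerFlow.

```python
from dataclasses import dataclass


class ToolBudgetExceeded(RuntimeError):
    """Raised when a run spends more tool calls or tokens than allowed."""


@dataclass
class ToolCallBudget:
    max_calls: int
    max_tokens: int
    calls: int = 0
    tokens: int = 0

    def record(self, tokens_used: int) -> None:
        """Account for one tool call and fail once a limit is exceeded."""
        self.calls += 1
        self.tokens += tokens_used
        if self.calls > self.max_calls or self.tokens > self.max_tokens:
            raise ToolBudgetExceeded(
                f"budget exceeded after {self.calls} calls and {self.tokens} tokens"
            )
```

The agent loop would call `record` after each search or fetch and, on `ToolBudgetExceeded`, switch to producing a clearly labeled partial result instead of issuing further calls.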

Release-blocking priority

Suggested priority before a stable community tag:

P0/P1: BUG-001, BUG-002, BUG-003, BUG-006, BUG-007
P1: BUG-004, BUG-005, BUG-008
P2 unless widely reproducible: BUG-009, BUG-010
Watch / targeted regression test: WATCH-001

Local evidence

gateway logs with AsyncCallbackManager task failures
gateway logs with repeated 8192 output-token completions
payload debug logs after gateway restart showing payload.max_tokens=384000
checkpoint summaries showing large write_file error payloads
frontend logs with negative performance timestamp runtime error
plan/search loop comparison report: plan-search-loop-token-report.zip

Contributor guide

Research direction: Trace the config flow from startup to runtime across the gateway, worker, and agent components. Identify where the stale AppConfig is used, and determine whether a consistent hot reload or a clear restart-required boundary can be implemented.
Tech stack: Python
Domain: backend
Issue type: Bug
Difficulty: 4
Estimated time: Over 1 week
Activity status: Active
Clarity: Clear
Prerequisites: Python, Git, Docker
Newbie friendliness: 25
