[Stability][v2.0-m1-rc1] Release-blocking issues found in Ultra/user workflow testing
#3107 opened on May 20, 2026
Summary
During local Docker testing of the v2.0-m1-rc1 release candidate, several stability issues were found in realistic user workflows, especially Ultra mode, subagent execution, artifact generation, export, and frontend state rendering.
This issue intentionally groups the findings together because the failures appear connected in practice:
- stale runtime config can keep model/runtime components on old settings;
- old or inconsistent model output limits can trigger truncated artifact generation;
- failed large artifact writes can feed large payloads back into the model context;
- subagent result propagation failures cause the lead agent to repeat work;
- high-volume retries increase checkpoint/persistence pressure;
- frontend/export paths then expose confusing or unsafe user-visible behavior.
The examples below use code paths, representative logs, and screenshots.
Evidence note:
- Log snippets are labeled with their source, for example gateway log, frontend browser log, checkpoint/state inspection, or database inspection.
- Screenshots are embedded below as GitHub issue attachments.
- Local thread/run identifiers are intentionally omitted; the issue focuses on reproducible failure classes and code/log evidence.
Environment
- Release candidate: v2.0-m1-rc1
- Deployment: local Docker stack
- Model under test: deepseek-v4-pro
- Test style: browser-based user workflows plus backend log/checkpoint/DB inspection
- Main stress workflow: Ultra mode, high reasoning effort, multi-step web research, subagents, and HTML artifact generation
BUG-001: Gateway uses stale AppConfig after config.yaml changes
Symptom
Changing config.yaml while the gateway process is running does not reliably affect subsequent runs. In particular, model settings such as max_tokens can remain stuck on the value captured at gateway startup until the gateway process is restarted.
This is more serious than a single model-parameter refresh bug. The gateway can enter a split-brain state where:
- global get_app_config() may reload the changed file;
- request.app.state.config still points to the startup snapshot;
- already-initialized runtime components still use the old config;
- some fallback paths may see the new config while the main run path still uses the old config.
Code evidence
Startup loads config once:
# backend/app/gateway/app.py
app.state.config = get_app_config()
Per-request dependency returns the startup snapshot:
# backend/app/gateway/deps.py
def get_config(request: Request) -> AppConfig:
    return getattr(request.app.state, "config", None)
The run context passes that snapshot forward:
# backend/app/gateway/deps.py
return RunContext(..., app_config=config)
The worker injects the same snapshot into the runtime and agent factory:
# backend/packages/harness/deerflow/runtime/runs/worker.py
runtime_ctx = _build_runtime_context(..., ctx.app_config)
agent = agent_factory(config=runnable_config, app_config=ctx.app_config)
The lead agent then skips get_app_config() because runtime_app_config is already present:
# backend/packages/harness/deerflow/agents/lead_agent/agent.py
return _make_lead_agent(config, app_config=runtime_app_config or get_app_config())
get_app_config() itself does have mtime-based reload logic:
# backend/packages/harness/deerflow/config/app_config.py
should_reload = _app_config is None or _app_config_path != resolved_path or _app_config_mtime != current_mtime
The issue is that the gateway run path can bypass that reload by passing the old AppConfig object explicitly.
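The split between a startup snapshot and an mtime-checked loader can be reproduced in isolation. The following is a minimal sketch, not the DeerFlow implementation: load_config, the key=value file format, and the _cache dictionary are hypothetical stand-ins for get_app_config() and config.yaml.

```python
import os
import tempfile
from pathlib import Path
from typing import Tuple

# Hypothetical module-level cache, loosely mirroring an mtime-based reload.
_cache = {"path": None, "mtime": None, "value": None}


def load_config(path: Path) -> dict:
    """Reload the file only when the path or its mtime changed."""
    mtime = path.stat().st_mtime_ns
    if _cache["value"] is None or _cache["path"] != path or _cache["mtime"] != mtime:
        # Assumption: simple key=value lines instead of real YAML.
        value = dict(line.split("=", 1) for line in path.read_text().splitlines() if line)
        _cache.update(path=path, mtime=mtime, value=value)
    return _cache["value"]


def demo() -> Tuple[str, str]:
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "config.txt"
        path.write_text("max_tokens=8192\n")
        startup_snapshot = load_config(path)  # plays the role of app.state.config

        path.write_text("max_tokens=384000\n")
        # Bump mtime explicitly so the demo does not depend on filesystem resolution.
        st = path.stat()
        os.utime(path, ns=(st.st_atime_ns, st.st_mtime_ns + 1_000_000_000))

        fresh = load_config(path)  # the reloading accessor sees the new value
        # The snapshot object captured at "startup" is never updated.
        return startup_snapshot["max_tokens"], fresh["max_tokens"]
```

Any code that keeps passing startup_snapshot forward behaves like the gateway run path described above, while callers of load_config observe the edited file.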
Runtime component impact
langgraph_runtime(app) initializes these components from app.state.config at startup:
- stream bridge
- persistence engine
- checkpointer
- store
- run event store
- run manager
Those components are not rebuilt when config.yaml changes. This means config changes to database, checkpointer, run_events, or related runtime settings cannot be expected to take effect safely without restart.
Observed behavior
After changing max_tokens in config.yaml, earlier runs still produced repeated completions capped at 8192 output tokens:
Source: gateway log, token usage middleware.
LLM token usage: input=32433 output=8192 total=40625 ... finish_reason=length
LLM token usage: input=40653 output=8192 total=48845 ... finish_reason=length
LLM token usage: input=57478 output=8192 total=65670 ... finish_reason=length
After restarting the gateway process, payload debug logs showed the updated model setting being sent:
Source: gateway log, patched DeepSeek request payload debug.
[deepseek-payload-debug] model=deepseek-v4-pro self.max_tokens=384000 payload.max_tokens=384000
The post-restart run then produced a write_file call above the old 8192 output ceiling:
Source: gateway log, token usage middleware.
LLM token usage: input=29460 output=10579 total=40039
Impact
- Config changes appear to apply in the file but not in actual user runs.
- Model/runtime behavior can differ before and after gateway restart.
- Runtime persistence components can remain initialized from stale config.
- This stale-config path likely contributed to downstream artifact retries and persistence pressure.
Expected behavior
Either:
- config.yaml changes should be applied consistently across the request/run/runtime path; or
- config hot reload should be explicitly unsupported for these fields, and users should receive a clear restart-required boundary.
BUG-002: Subagent task completes internally but parent task result fails
Symptom
In Ultra mode, subagents are launched and complete internally, but the parent task tool reports failure instead of returning the subagent result to the lead agent.
Representative logs:
Source: gateway log, subagent executor + task tool + tool error middleware.
Subagent general-purpose completed async execution
Subagent general-purpose final messages count: ...
Task ... status: completed
Tool execution failed (async): name=task
TypeError: 'AsyncCallbackManager' object is not iterable
Stack root:
deerflow/tools/builtins/task_tool.py
_report_subagent_usage(runtime, result)
_find_usage_recorder(runtime)
for cb in callbacks:
TypeError: 'AsyncCallbackManager' object is not iterable
Observed behavior
Across multiple Ultra runs:
- subagents were started;
- subagents logged internal completion;
- parent-visible task tool results contained only wrapper errors;
- lead agent explicitly fell back to direct work;
- run accounting showed subagent_tokens=0.
Representative parent-visible result:
Source: checkpoint/state inspection of parent tool result.
Error: Tool 'task' failed with TypeError: 'AsyncCallbackManager' object is not iterable. Continue with available context, or choose an alternative tool.
Impact
- Ultra mode loses the value of subagent work.
- The lead agent repeats overlapping fetch/research work.
- User-perceived latency and token usage increase.
- Frontend task state can become misleading because the task completed internally but failed at the parent wrapper layer.
Expected behavior
- Completed subagent results should be delivered back to the lead agent.
- Usage-reporting failures should not turn a successful subagent result into a failed task tool result.
- Subagent token accounting should be reflected correctly.
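One way to meet these expectations is to normalize the callbacks value before iterating and to treat usage reporting as best-effort. This is a hedged sketch under assumptions: FakeCallbackManager, iter_callbacks, and finish_task are hypothetical names, and the sketch assumes the manager exposes its handlers through a handlers attribute.

```python
import logging
from typing import Any, Callable, List

logger = logging.getLogger(__name__)


class FakeCallbackManager:
    """Stand-in for a callback manager that is not itself iterable."""

    def __init__(self, handlers: List[Any]) -> None:
        self.handlers = list(handlers)


def iter_callbacks(callbacks: Any) -> List[Any]:
    """Return a plain list whether given None, a list, or a manager object."""
    if callbacks is None:
        return []
    handlers = getattr(callbacks, "handlers", None)
    if handlers is not None:
        return list(handlers)
    if isinstance(callbacks, (list, tuple)):
        return list(callbacks)
    return []


def finish_task(result: str, callbacks: Any, report: Callable[[List[Any], str], None]) -> str:
    """Deliver the subagent result even if usage reporting raises."""
    try:
        report(iter_callbacks(callbacks), result)
    except Exception:
        # Accounting is best-effort; it must not mask a completed subagent.
        logger.warning("usage reporting failed; returning result anyway", exc_info=True)
    return f"Task Succeeded. Result: {result}"
```

The design choice here is to keep result delivery and accounting on separate failure paths, so a reporting bug degrades token accounting instead of discarding completed work.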
BUG-003: Large write_file failures amplify token usage
Symptom
When generating a large HTML artifact, write_file can fail because the model output is truncated or the tool arguments become incomplete. The failure path can echo large attempted file contents back into the conversation state, causing subsequent model calls to carry much larger context.
Evidence
Representative token sequence during artifact generation:
Source: gateway log, token usage middleware.
LLM token usage: input=29324 output=8192 total=37516
LLM token usage: input=46564 output=8192 total=54756
LLM token usage: input=63274 output=3903 total=67177
LLM token usage: input=71117 output=2682 total=73799
Representative failed tool-result sizes found in checkpoint state:
Source: checkpoint/state inspection of write_file tool messages.
write_file error payload: ~23.7K chars
write_file error payload: ~24.1K chars
write_file error payload: ~10.6K chars
Another run showed a missing required argument after an 8192-token truncated output:
Source: checkpoint/state inspection of AI message usage + following write_file tool result.
write_file output=8192 finish_reason=length
write_file missing required path
tool error echoed ~23K chars of attempted HTML content
Mechanism
The token growth pattern is:
- model tries to generate a large HTML report as one write_file call;
- output hits a limit or tool args become incomplete;
- write_file fails;
- the tool error includes a large portion of the attempted content;
- that large error becomes part of conversation state;
- the next LLM call has a much larger input context;
- the agent retries with another writing strategy.
Impact
- Token usage can grow from a normal large task into a million-token class run.
- Runtime cost becomes hard for users to predict.
- Persistence/checkpoint writes also increase, which may contribute to DB pressure.
- The final artifact may eventually succeed, but after expensive retries.
Expected behavior
- Tool errors should not echo large content arguments back into model context.
- Large artifact generation should use a bounded, reliable writing strategy.
- If an artifact cannot be written, the error returned to the model should be concise and structured.
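A concise, structured error could replace oversized argument values with their length before the message reaches the model. The sketch below is illustrative only: concise_tool_error and the MAX_ARG_PREVIEW cap are hypothetical and not part of the existing tool error middleware.

```python
import json
from typing import Any, Dict

MAX_ARG_PREVIEW = 200  # assumed cap on characters echoed per argument


def concise_tool_error(tool: str, error: str, args: Dict[str, Any]) -> str:
    """Build a small JSON error that reports argument sizes, not contents."""
    summarized: Dict[str, Any] = {}
    for key, value in args.items():
        text = value if isinstance(value, str) else json.dumps(value, default=str)
        if len(text) > MAX_ARG_PREVIEW:
            # Report only the size so the payload cannot re-enter model context.
            summarized[key] = {"omitted_chars": len(text)}
        else:
            summarized[key] = value
    payload = {"tool": tool, "error": error, "args": summarized}
    return json.dumps(payload, ensure_ascii=False, default=str)
```

With a 23K-character content argument, the returned message stays at a few hundred characters while still telling the model which argument was too large or missing.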
BUG-004: Persistence can remain stale running or hit SQLite lock under stress
Symptom
Under long Ultra runs with repeated large writes, persistence can fail to record final run state correctly. The UI/DB may continue to show a run as running even after the backend logs indicate the run has failed or completed internally.
Representative logs:
Source: gateway log, run worker + SQLAlchemy/SQLite stack traces.
Run ... failed: database disk image is malformed
sqlite3.DatabaseError: database disk image is malformed
sqlite3.OperationalError: database is locked
sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) database is locked
The attempted status update also failed:
Source: gateway log, SQLAlchemy failed update parameters.
[parameters: ('error', 'database disk image is malformed', ..., '<run_id>')]
In the inspected local DB, a later integrity check returned ok, so the durable symptom was stale run state and failed persistence, not necessarily permanent DB corruption.
Relationship to other bugs
This should be treated as a persistence robustness issue, but it may be triggered by upstream runtime problems:
- stale config can keep model output limits on old values;
- old limits can cause truncated large artifact writes;
- failed writes can repeatedly echo large payloads into state;
- more retries create more checkpoint/write/update pressure;
- SQLite then sees more concurrent writes and recovery pressure.
Impact
- A task can appear stuck even after backend execution has ended.
- Token and message counters may not be persisted.
- Users cannot trust the final run status in the UI.
Expected behavior
- Run finalization should be durable and recoverable.
- SQLite lock/retry behavior should not leave runs permanently stale.
- If persistence fails, the UI should surface a clear error state rather than indefinite running.
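For the lock case specifically, run finalization could be wrapped in a bounded retry with backoff. This is a minimal sketch assuming the final status update can be expressed as a callable; with_lock_retry and its parameters are hypothetical, and it does not address the "database disk image is malformed" error.

```python
import sqlite3
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_lock_retry(
    operation: Callable[[], T],
    attempts: int = 5,
    base_delay: float = 0.05,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry an operation while SQLite reports 'database is locked'."""
    for attempt in range(attempts):
        try:
            return operation()
        except sqlite3.OperationalError as exc:
            is_last = attempt == attempts - 1
            if "database is locked" not in str(exc) or is_last:
                raise  # unrelated error, or retries exhausted
            sleep(base_delay * (2 ** attempt))  # exponential backoff
    raise RuntimeError("attempts must be at least 1")
```

If the retries are exhausted, the exception still propagates, so the caller can mark the run with an explicit error state rather than leaving it as running.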
BUG-005: Active run token and raw stream observability is insufficient
Symptom
During a long run, checkpoint message metadata already contains substantial token usage, while the run row/API still shows zero totals.
Observed active-run state:
Source: database inspection while the run was still active.
runs.status=running
runs.total_tokens=0
runs.llm_call_count=0
runs.message_count=0
At the same time, checkpoint state already summed to hundreds of thousands of tokens.
Raw stream was also not durably available:
Source: gateway log, run worker stream-mode setup.
'events' stream_mode not supported in gateway (requires astream_events + checkpoint callbacks). Skipping.
Actual stream modes:
['messages', 'custom', 'updates', 'values']
run_events was memory-backed, and no durable run event rows were available after the fact.
Impact
- Operators cannot monitor runaway token cost from normal run records while the run is active.
- After-the-fact debugging depends on checkpoints/logs rather than a durable raw event stream.
- It is difficult to tell whether a long-running task is healthy, stuck, or burning budget.
Expected behavior
- Active runs should expose current token/LLM-call/message counters.
- Raw stream or equivalent trace should be optionally persisted for debugging long tasks.
- Cost visibility should not require manually inspecting checkpoint internals.
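As an illustration of the counters that could be surfaced while a run is active, the sketch below aggregates usage from a list of messages. It assumes each LLM response is a dictionary carrying a usage_metadata entry with a total_tokens field; the function name and message shape are hypothetical.

```python
from typing import Any, Dict, Iterable


def summarize_usage(messages: Iterable[Dict[str, Any]]) -> Dict[str, int]:
    """Aggregate counters matching the runs columns named above."""
    totals = {"total_tokens": 0, "llm_call_count": 0, "message_count": 0}
    for message in messages:
        totals["message_count"] += 1
        usage = message.get("usage_metadata")
        if usage:
            # Assumption: only LLM responses carry usage metadata.
            totals["llm_call_count"] += 1
            totals["total_tokens"] += int(usage.get("total_tokens", 0))
    return totals
```

Writing these totals to the run row after each LLM call, rather than only at finalization, would let operators see cost growth without inspecting checkpoints.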
BUG-006: Chat export includes hidden context, memory, reasoning, and trace
Symptom
Normal chat export can include content that is not visible in the chat transcript.
Observed exported content included:
- <system-reminder>
- <memory>
- <current_date>
- Thinking/reasoning details
- tool call names / trace-like information
Important distinction: this was not observed as raw system prompt leakage. The concrete issue is that hidden dynamic context, memory, reasoning content, and debug trace can be included in a normal user export.
Code evidence
The chat UI has hidden-message filtering, but export does not appear to apply the same boundary:
frontend/src/core/threads/export.ts
Markdown export includes reasoning blocks:
<details>
<summary>Thinking</summary>
...
</details>
JSON export maps raw messages more directly and can include tool-related fields.
Related state evidence
Checkpoint/state stores provider-returned reasoning content in:
Source: checkpoint/state inspection of AI messages.
AIMessage.additional_kwargs.reasoning_content
So even if the normal UI hides it, export paths must explicitly filter it.
Impact
- A user export is not a clean transcript.
- Memory injected into model context can be exported as if it were part of the conversation.
- Reasoning/tool traces can expose internal behavior that users did not ask to export.
- Product privacy/debug boundaries are ambiguous.
Expected behavior
Default export should include only the user-visible transcript:
- visible user messages;
- visible assistant final answers;
- visible artifact/file references if already shown to the user.
Default export should exclude:
- hidden messages marked hide_from_ui;
- dynamic context reminders;
- memory injection;
- thinking/reasoning content;
- tool calls and tool results.
If raw trace export is needed, it should be a separate explicit debug/admin export surface.
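The default-export boundary can be expressed as an allowlist. The actual export code lives in the TypeScript frontend, so the Python below is only a language-neutral sketch. It assumes messages are dictionaries with type, content, and additional_kwargs keys, and that hide_from_ui is stored under additional_kwargs; artifact references are not handled here.

```python
from typing import Any, Dict, List


def visible_transcript(messages: List[Dict[str, Any]]) -> List[Dict[str, str]]:
    """Keep only user-visible human/assistant text for a default export."""
    transcript: List[Dict[str, str]] = []
    for message in messages:
        extra = message.get("additional_kwargs") or {}
        if extra.get("hide_from_ui"):
            continue  # hidden context, reminders, memory injection
        role = message.get("type")
        if role not in ("human", "ai"):
            continue  # drops tool results and system messages
        content = message.get("content")
        if not isinstance(content, str) or not content.strip():
            continue  # e.g. AI messages that only carry tool calls
        # Copy only role and content, so reasoning_content and tool_calls never leak.
        transcript.append({"role": role, "content": content})
    return transcript
```

Building the export from an allowlist of fields, instead of removing known-hidden fields from raw messages, means newly added internal fields are excluded by default.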
BUG-007: Subagent completed task still rendered as running
Symptom
After a long Ultra task completed, the frontend still displayed a task card as 子任务运行中 ("Subtask running").
Screenshot:
Backend state at the same time indicated the run was terminal:
Source: database inspection after the run had ended.
runs.status=success
threads_meta.status=idle
The relevant parent-visible tool result was:
Source: checkpoint/state inspection of parent task tool result.
Error: Tool 'task' failed with TypeError: 'AsyncCallbackManager' object is not iterable. Continue with available context, or choose an alternative tool.
Code evidence
The frontend currently maps only these task result prefixes to terminal states:
Task Succeeded. Result:
Task failed.
Task timed out
The actual result starts with:
Error: Tool 'task' failed ...
So it falls through and remains rendered as in_progress.
Impact
- A completed conversation looks like it is still doing work.
- Users may wait unnecessarily.
- It can make task retries or duplicate subagent work harder to reason about from the UI.
Expected behavior
Any terminal task-tool error should render as failed, not in-progress. Frontend state reconstruction should not depend on only a few exact English text prefixes.
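A more robust mapping treats the presence of any tool result as terminal and only uses the text to distinguish success from failure. The frontend code is TypeScript, so this Python sketch only illustrates the decision logic; the state names "completed" and "failed" are assumptions, and only the prefixes quoted above come from the source.

```python
from typing import Optional

SUCCESS_PREFIX = "Task Succeeded. Result:"


def classify_task_result(result: Optional[str]) -> str:
    """Map a task tool result to a render state."""
    if result is None:
        return "in_progress"  # no tool result yet
    if result.strip().startswith(SUCCESS_PREFIX):
        return "completed"
    # Any other returned result is terminal, including wrapper errors such as
    # "Error: Tool 'task' failed ..." that match no known prefix.
    return "failed"
```

A structured status field on the tool result would be more reliable than text matching, but even with text matching the fallback should be a terminal state.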
BUG-008: HTML artifact preview flickers and temporarily shows mojibake during writes
Symptom
While an HTML artifact is still being generated, the right-side preview can flicker frequently and temporarily render garbled text. After the artifact is fully written and reopened, the final file can render normally.
Screenshot showing temporary mojibake:
This issue explicitly includes the mojibake shown in 2.png. It is grouped under artifact preview because the final artifact was not necessarily corrupted; the visible problem was the in-progress preview rendering partial/incomplete HTML.
Code evidence
The frontend auto-opens the artifact panel for the latest in-progress write_file step:
frontend/src/components/workspace/messages/message-group.tsx
isLoading && isLast && autoOpen && autoSelect && path && !result
The selected item can be a write-file: pseudo-artifact rather than a completed output file.
The HTML preview renders current content through a blob URL:
frontend/src/components/workspace/artifacts/artifact-file-detail.tsx
new Blob([content ?? ""], { type: "text/html" })
URL.createObjectURL(blob)
<iframe src={htmlPreviewUrl} />
Likely mechanism
- The preview points at partial or intermediate write_file content.
- Each streamed content update recreates the blob URL and reloads the iframe.
- If the current partial HTML lacks a complete <meta charset="UTF-8">, the iframe may guess encoding incorrectly and render mojibake.
- Once the final complete HTML is loaded, the artifact can render normally.
Impact
- Users see a broken-looking artifact while generation is still in progress.
- Long-running report generation feels unstable even when the final output is eventually valid.
- The preview can mislead users into thinking the model generated corrupted text.
Expected behavior
- Do not auto-render incomplete HTML as a live iframe preview, or debounce/stabilize it.
- For in-progress HTML writes, prefer code/loading view until the write completes.
- Switch to rendered preview only after a completed file result or present_files.
BUG-009: Chat history timestamps are timezone-shifted
Symptom
Recently completed chats can appear in the history/search list as roughly 8 hours ago when tested in Asia/Shanghai.
Representative API shape:
Source: threads/history API response inspected in the browser/runtime.
{
"created_at": "2026-05-20T06:10:22.970977",
"updated_at": "2026-05-20T06:12:31.333753"
}
These timestamps have no timezone suffix such as Z or +00:00.
Mechanism
Browser JavaScript parses ISO date-time strings that carry no timezone offset as local time:
new Date("2026-05-20T06:12:31.333753")
In Asia/Shanghai, this is interpreted as local 06:12, not UTC 06:12. If the backend intended UTC, the displayed relative time is shifted by about 8 hours.
Code evidence
frontend/src/app/workspace/chats/page.tsx
frontend/src/core/utils/datetime.ts
The frontend passes the raw timestamp string into date formatting without normalizing timezone-less backend timestamps.
Impact
- Recent threads look stale.
- History/search ordering and user trust in persistence/status are affected.
Expected behavior
Backend should return timezone-aware ISO timestamps, preferably UTC with Z, for example:
2026-05-20T06:12:31.333753Z
Alternatively, the frontend should normalize DeerFlow API timestamps without timezone as UTC before formatting.
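On the backend side, the normalization could look like the following sketch. It assumes the stored values are naive timestamps that were intended as UTC; to_utc_iso is a hypothetical helper name.

```python
from datetime import datetime, timezone


def to_utc_iso(timestamp: str) -> str:
    """Serialize a timestamp as UTC with an explicit Z suffix."""
    parsed = datetime.fromisoformat(timestamp)
    if parsed.tzinfo is None:
        # Assumption: naive values from the database are already in UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc).isoformat().replace("+00:00", "Z")
```

For the sample value above, to_utc_iso("2026-05-20T06:12:31.333753") returns "2026-05-20T06:12:31.333753Z", which browsers parse as UTC regardless of the local timezone.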
BUG-010: Workspace negative performance timestamp runtime error
Symptom
Opening the workspace can trigger a full-screen Next.js runtime error overlay:
Runtime TypeError
Failed to execute 'measure' on 'Performance': 'WorkspacePage' cannot have a negative time stamp.
Representative frontend log:
Source: frontend browser console log captured by the local frontend service.
[browser] Uncaught TypeError: Failed to execute 'measure' on 'Performance': 'WorkspacePage' cannot have a negative time stamp.
Impact
- A tester/user can see a framework error instead of the workspace.
- It disrupts the core entry flow even if persisted thread data is not corrupted.
Open question
This needs confirmation outside the local dev/Turbopack-style environment. If it only happens in development, it may be lower priority; if it can happen in the Docker/release path, it should be fixed before stable tagging.
WATCH-001: Plan/search loop may continue after enough information
Symptom
A normal research prompt can continue issuing search/fetch tool calls after the model has already reasoned that it has enough information to summarize.
Representative prompt:
总结本周体育新闻 ("Summarize this week's sports news")
Observed failure shape:
- many web_search and web_fetch calls;
- reasoning indicated enough information had been collected;
- the model still issued more search/fetch calls;
- no final answer was produced before manual interruption;
- token usage reached roughly the 200K class.
A later run of the same prompt succeeded with much lower token usage, so this appears intermittent.
Additional evidence:
- A small local HTML comparison report was generated for this item: plan-search-loop-token-report.zip.
- The zip contains the HTML report, which can be downloaded and opened locally. It includes the failing/successful run comparison, the point where the model appeared to have enough information, and the subsequent extra search calls.
Impact
- Common research tasks can burn tokens without a final answer.
- Users have no clear signal that the agent is looping.
Expected behavior
- Once the agent determines it has enough information, it should produce the answer instead of continuing search.
- Tool-call loops should have a convergence or budget guard.
- If the agent cannot finish, it should return a clear partial/failure response.
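A budget guard of the kind suggested above can be very small. The sketch below is hypothetical: ToolBudget and run_research are illustrative names, and a real guard would sit in the agent loop or a middleware rather than iterate over a fixed list of planned calls.

```python
from typing import List, Tuple


class ToolBudget:
    """Counts tool calls and refuses once a fixed limit is reached."""

    def __init__(self, max_calls: int) -> None:
        self.max_calls = max_calls
        self.calls = 0

    def allow(self) -> bool:
        if self.calls >= self.max_calls:
            return False
        self.calls += 1
        return True


def run_research(planned_calls: List[str], budget: ToolBudget) -> Tuple[List[str], str]:
    """Execute calls until done or until the budget forces a partial result."""
    executed: List[str] = []
    for call in planned_calls:
        if not budget.allow():
            # Stop and report explicitly instead of looping silently.
            return executed, "partial: tool-call budget exhausted"
        executed.append(call)
    return executed, "complete"
```

The point of the explicit partial status is that the user receives a clear signal and whatever was gathered so far, instead of an open-ended run.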
Release-blocking priority
Suggested priority before a stable community tag:
- P0/P1: BUG-001, BUG-002, BUG-003, BUG-006, BUG-007
- P1: BUG-004, BUG-005, BUG-008
- P2 unless widely reproducible: BUG-009, BUG-010
- Watch / targeted regression test: WATCH-001
Local evidence
- gateway logs with AsyncCallbackManager task failures
- gateway logs with repeated 8192 output-token completions
- payload debug logs after gateway restart showing payload.max_tokens=384000
- checkpoint summaries showing large write_file error payloads
- frontend logs with negative performance timestamp runtime error
- plan/search loop comparison report: plan-search-loop-token-report.zip