Repository metrics

Stars: 67,767
PR merge metrics: average merge time 2d 5h; 229 PRs merged in 30 days

Description

[RFC] SWE-bench Recursion Limits and Long-Horizon Benchmark Suggestions

Summary

We ran GAIA and SWE-bench through the DeerFlow agent harness and found that, on both benchmarks, the harness scored significantly lower than direct model calls on the same tasks. For SWE-bench in particular, the main blocker was the recursion limit: the agent often exhausts its turn budget during repository exploration before it can produce a patch. We also hit the LoopDetectionMiddleware issues described in #2517, which further degrade performance on tool-heavy workloads.

This RFC documents those limitations and proposes a practical benchmarking path that better reflects DeerFlow's strengths as a long-horizon, multi-tool agent system.

Motivation

DeerFlow's core strength is long-horizon orchestration across tools, sub-agents, and sandboxed execution. However, our recent GAIA and SWE-bench runs suggest that the current benchmark setup does not yet capture that strength reliably. In both cases, harness-based evaluation underperformed compared with direct model usage.

This does not necessarily indicate a flaw in DeerFlow itself. Rather, it points to a mismatch between benchmark demands and current harness configuration. If we want benchmark results to say something meaningful about DeerFlow's capabilities, we should first remove obvious sources of evaluation noise and failure.

Recursion Limit Threshold Suggestions

SWE-bench instances frequently fail with the following error:

agent error: Recursion limit of 150 reached without hitting a stop condition.

In practice, the agent spends most of its budget reading files and navigating the codebase, but never reaches the stage where it can actually implement a fix. A typical SWE-bench trajectory looks like this:

1. Read the issue.
2. Explore the repository, often requiring 20–30 file reads for a large project.
3. Locate the relevant source files.
4. Understand the bug and surrounding implementation.
5. Write a fix.
6. Run tests.
7. Debug and iterate if the tests fail.

Steps 2–4 alone can consume 50–80 turns. Our current limit of 150 looked reasonable on paper, but in practice it is too low for many real SWE-bench instances.
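
One way to check where the budget goes is to tally turns by phase from a recorded trajectory. The sketch below is illustrative only. The tool names (`read_file`, `list_dir`, `grep`, `write_file`, `apply_patch`), the `turn_budget_report` helper, and the trajectory format (a list of tool names, one per turn) are all assumptions, not DeerFlow's actual tool set or log schema.

```python
from collections import Counter

# Hypothetical tool-name groupings; replace with the harness's real tool names.
EXPLORE_TOOLS = {"read_file", "list_dir", "grep"}
EDIT_TOOLS = {"write_file", "apply_patch"}


def turn_budget_report(tool_calls):
    """Summarize how a trajectory spent its turns.

    tool_calls: list of tool names, one per agent turn (assumed format).
    """
    counts = Counter(tool_calls)
    explore = sum(counts[t] for t in EXPLORE_TOOLS)
    # Index of the first edit, or None if the agent never attempted one.
    first_edit = next(
        (i for i, t in enumerate(tool_calls) if t in EDIT_TOOLS), None
    )
    return {
        "turns_used": len(tool_calls),
        "explore_turns": explore,
        "explore_share": explore / len(tool_calls) if tool_calls else 0.0,
        "turns_before_first_edit": first_edit,
    }
```

A run whose report shows `turns_before_first_edit` as `None` with `turns_used` at the limit is the failure pattern described above: the budget was spent before a patch was attempted.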

For reference, here are the threshold values we'd suggest per benchmark:

Benchmark	Suggested recursion limit
GAIA	100–150
SWE-bench	250+

These are rough estimates based on our runs. The right threshold depends on repository size and task complexity, and larger monorepos may need even more headroom. We'd suggest making this configurable per benchmark run rather than using a single global default.
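
A per-benchmark limit could be expressed as a small lookup with a global fallback. This is a sketch under assumptions: the benchmark keys, the `DEFAULT_RECURSION_LIMIT` name, and the `build_run_config` helper are hypothetical. The returned dict assumes the harness accepts a LangGraph-style `recursion_limit` config key, which the wording of the error message suggests but which should be checked against the actual harness.

```python
# Starting points taken from the table above; the values are rough estimates.
BENCHMARK_RECURSION_LIMITS = {
    "gaia": 150,       # upper end of the suggested 100-150 range
    "swe-bench": 250,  # suggested floor; large monorepos may need more
}
DEFAULT_RECURSION_LIMIT = 150  # the current global limit


def build_run_config(benchmark, override=None):
    """Return a run config with a recursion limit chosen per benchmark.

    An explicit override wins; unknown benchmarks fall back to the default.
    """
    if override is not None:
        if override <= 0:
            raise ValueError("recursion limit must be a positive integer")
        limit = override
    else:
        limit = BENCHMARK_RECURSION_LIMITS.get(
            benchmark.lower(), DEFAULT_RECURSION_LIMIT
        )
    return {"recursion_limit": limit}
```

For example, `build_run_config("swe-bench", override=400)` would give a single large-repository run extra headroom without changing the default for other runs.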

That said, raising the limit alone is not sufficient. Higher limits will simply amplify the noisy behavior described in the next section unless the loop-detection issues are addressed first.

LoopDetectionMiddleware (#2517)

LoopDetectionMiddleware has a direct impact on benchmark quality. Relevant issues include #1055, #1905, #1987, #2590, and #2724.

At the moment, the middleware does not reliably distinguish between two very different behaviors:

Legitimate exploration, where the agent reads many different files while building context.
Pathological looping, where the agent repeatedly invokes the same tool without making progress.

This creates two failure modes:

False positives: normal exploration gets flagged as a loop, forcing the agent to stop early and producing incomplete solutions.
False negatives: the agent varies tool arguments just enough to avoid detection, but still fails to make progress and eventually hits the recursion limit.
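
To make the two failure modes concrete, here is a toy illustration with two deliberately naive detectors. Neither is the actual LoopDetectionMiddleware logic, and the tool name, argument shapes, and thresholds are made up for the example. It only shows why counting by tool name alone over-triggers and why exact-match comparison alone under-triggers.

```python
import json
from collections import Counter


def exact_repeat_detector(calls, threshold=3):
    """Flag a loop when the same (tool, args) pair occurs `threshold` times."""
    seen = Counter()
    for tool, args in calls:
        key = (tool, json.dumps(args, sort_keys=True))
        seen[key] += 1
        if seen[key] >= threshold:
            return True
    return False


def tool_count_detector(calls, threshold=3):
    """Flag a loop when any one tool is called `threshold` times, ignoring args."""
    seen = Counter()
    for tool, _args in calls:
        seen[tool] += 1
        if seen[tool] >= threshold:
            return True
    return False


# Legitimate exploration: the same tool, but a different file each time.
exploration = [("read_file", {"path": f"src/module_{i}.py"}) for i in range(5)]

# Pathological looping: the same file, with trivially varied arguments.
stalled = [("read_file", {"path": "src/main.py", "offset": i}) for i in range(5)]

print(tool_count_detector(exploration))  # True: a false positive
print(exact_repeat_detector(stalled))    # False: a false negative
```

Separating the two cases requires some notion of progress, such as whether new files or new information are being reached, rather than call identity alone.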

Work is already underway in #2590 and #2724, so this RFC does not propose a separate fix. However, until #2517 is resolved in a stable release, it remains a blocker for obtaining clean results on tool-heavy benchmarks.

Smaller Models and Harness Use

We also observed that smaller models such as DeepSeek do not use the harness especially well. Common patterns include unnecessary tool calls, incorrect tool arguments, and weak tool selection overall. This may be partly a model capability issue and partly a prompting issue, but the practical consequence is the same: benchmark results for smaller models under the harness are often hard to interpret.

For future benchmark runs, we suggest prioritizing newer models such as k2.6 or similar, so that we can get a cleaner signal on the harness itself rather than on basic tool-use limitations.

Suggested Benchmarks

Once the current blockers are addressed, DeerFlow would benefit from a benchmark mix that better aligns with long-horizon agent workflows:

τ-bench: multi-turn customer service tasks in domains such as retail and airlines, with state tracking, API usage, and exception handling.
WebArena: real website interaction with multi-step browser tasks such as e-commerce, forum workflows, and code management.
MINT: multi-turn interactive tool use with code execution and file manipulation.
MLE-bench: Kaggle-style machine learning workflows that are naturally long-horizon and sandbox-friendly.
AgentBench: broad coverage across 8 environments, including OS, databases, knowledge graphs, web, and games.
OSWorld: realistic operating system interaction involving the file system, terminal, and browser.

These benchmarks are more likely to exercise the parts of DeerFlow that differentiate it from direct single-shot model calls.

Next Steps

Wait for a stable release that resolves #2517, including the related work in #2590 and #2724 or equivalent fixes.
After that, raise the SWE-bench recursion limit to 250+ and rerun SWE-bench and GAIA first, since both are relatively easy to run and widely adopted.
Run those benchmark rounds with newer models such as k2.6 to get a cleaner signal on harness behavior.
Keep longer-term benchmark expansion on the roadmap, including τ-bench, MINT, WebArena, and MLE-bench.

Happy to help implement any of the above if this direction sounds reasonable.

Contributor guide

Research direction: Read the referenced issues and run benchmarks with higher recursion limits to understand the blocking behavior.
Tech stack: Python
Domain: backend
Issue type: Research
Difficulty: 2
Estimated time: Under 1 hour
Activity status: Active
Clarity: Clear
Prerequisites: Git
Newbie friendliness: 80