ggml-org/llama.cpp

Feature Request: Graceful reasoning budget termination. Avoid mid-sentence cutoff.

Open

#20632 opened on Mar 16, 2026

Labels: enhancement, good first issue, help wanted

Description

Prerequisites

  • I am running the latest code. Mention the version if possible as well.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

PR #20297 (merged ~2026-03-11) added --reasoning-budget-message, which injects a string at the hard token cutoff to signal the model to wrap up. However, the injection still happens at exactly token N, with no advance warning, so the model has no tokens left to finish its current thought before the </think> boundary.

Graceful termination before the cutoff is expected to improve reasoning performance. See Motivation and Possible Implementation for details.

Motivation

Raw truncation of the thinking trace measurably reduces answer quality compared to graceful termination. Muennighoff et al. (s1: Simple test-time scaling, arXiv:2501.19393) showed that appending an end-of-thinking delimiter with "Final Answer:" before the hard cut outperforms naïve truncation. This was confirmed by follow-up work (arXiv:2505.05315), which states directly that "the S1 approach performs better than directly truncating the full reasoning trajectory, underscoring the importance of preserving the solution segment."

The current --reasoning-budget-message implementation injects a wrap-up message at exactly token N, leaving the model zero tokens to actually act on it. This is functionally equivalent to truncation with a cosmetic suffix — the model cannot produce a meaningful conclusion within 0 remaining tokens.

Possible Implementation

Three approaches in increasing complexity:

Option A: message offset
Inject the budget message at budget - offset instead of budget, giving the model offset tokens to act on the wrap-up signal. The s1 paper (https://arxiv.org/abs/2501.19393) shows this outperforms raw truncation. Change in the COUNTING state of reasoning-budget.cpp:

if (ctx->remaining <= ctx->warn_offset)  // was: <= 0

Files touched: reasoning-budget.cpp, reasoning-budget.h, common.h, arg.cpp, sampling.cpp. A default offset of 0 preserves current behavior.
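To make the intended behavior concrete, here is a minimal, self-contained sketch of the offset logic. Everything in it is an assumption for illustration: `budget_state`, `budget_action`, `budget_next` and `budget_simulate` are hypothetical names, not the real types in reasoning-budget.cpp, and the sketch assumes the injected message itself does not consume budget.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the budget state; the real struct in
// reasoning-budget.cpp may look different.
struct budget_state {
    int32_t remaining;    // thinking tokens that may still be sampled
    int32_t warn_offset;  // tokens reserved after the wrap-up message
    bool    warned;       // true once the message has been injected
};

enum class budget_action { sample, inject_message, force_end };

// Decide what to do before sampling the next thinking token.
inline budget_action budget_next(budget_state & st) {
    assert(st.warn_offset >= 0);
    if (!st.warned && st.remaining <= st.warn_offset) {
        st.warned = true;              // inject the message only once
        return budget_action::inject_message;
    }
    if (st.remaining <= 0) {
        return budget_action::force_end;  // hard cutoff: force </think>
    }
    st.remaining--;
    return budget_action::sample;
}

struct budget_trace {
    int32_t before;  // tokens sampled before the message
    int32_t after;   // tokens sampled after the message
};

// Run the state machine to the hard cutoff and count sampled tokens.
inline budget_trace budget_simulate(int32_t budget, int32_t warn_offset) {
    budget_state st = { budget, warn_offset, false };
    budget_trace tr = { 0, 0 };
    for (;;) {
        const budget_action act = budget_next(st);
        if (act == budget_action::force_end) {
            return tr;
        }
        if (act == budget_action::sample) {
            if (st.warned) {
                tr.after++;
            } else {
                tr.before++;
            }
        }
    }
}
```

Under these assumptions, `budget_simulate(5, 2)` samples three tokens, injects the message, and leaves two tokens before </think> is forced. With an offset of 0 the message lands at the cutoff and no tokens follow it, which matches the current behavior described above.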

Option B: separate budgeting
Split the total budget into thinking (t) and conclusion (s) phases, with t + s = budget. Force </think> at token t and leave s tokens for a conclusion before the answer. Follow-up work on s1 (https://arxiv.org/abs/2505.05315) shows this outperforms the basic offset approach. This would add a second parameter, e.g. --reasoning-budget-conclusion M.
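The t + s split can be sketched as follows. This is only an illustration under stated assumptions: `budget_split`, `split_budget`, `gen_phase` and `phase_at` are hypothetical names, and clamping an oversized conclusion budget to the total is one possible policy, not something the proposal specifies.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical split of the total budget: thinking + conclusion == total.
struct budget_split {
    int32_t thinking;    // t: tokens before </think> is forced
    int32_t conclusion;  // s: tokens reserved for the conclusion
};

// Derive t from the total budget and the requested conclusion size M.
// Assumption: M is clamped to [0, total] so that t never goes negative.
inline budget_split split_budget(int32_t total, int32_t conclusion) {
    assert(total >= 0);
    const int32_t s = std::max<int32_t>(0, std::min(conclusion, total));
    return { total - s, s };
}

enum class gen_phase { thinking, conclusion, answer };

// Phase of the token at zero-based position n_generated.
inline gen_phase phase_at(const budget_split & b, int32_t n_generated) {
    if (n_generated < b.thinking) {
        return gen_phase::thinking;
    }
    if (n_generated < b.thinking + b.conclusion) {
        return gen_phase::conclusion;
    }
    return gen_phase::answer;
}
```

For example, a total budget of 1024 with a conclusion budget of 128 gives t = 896 and s = 128, so </think> would be forced at token 896 and the answer would start at token 1024.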

Option C: logit biasing toward </think>
Ramp up the logit weight of </think> as the budget runs low, then force it at the hard limit. This has no dependency on an injected phrase and is model-agnostic. NVIDIA NIM uses this approach with a 10% extension window for sentence completion. It is more invasive (sampler changes) but the cleanest long-term.
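A rough sketch of the ramp, assuming a linear schedule over the last `ramp_window` tokens of the budget. The function names, the linear shape, and the `max_bias` parameter are all assumptions made for illustration; this does not use the llama.cpp sampler API and does not model the NIM extension window.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical linear ramp: 0 outside the window, max_bias at the cutoff.
inline float end_think_ramp(int32_t remaining, int32_t ramp_window, float max_bias) {
    assert(ramp_window > 0);
    if (remaining >= ramp_window) {
        return 0.0f;
    }
    if (remaining <= 0) {
        return max_bias;
    }
    const float progress = 1.0f - static_cast<float>(remaining) / static_cast<float>(ramp_window);
    return max_bias * progress;
}

// Bias the </think> logit while budget remains; at the hard limit, mask
// every other token so that </think> is the only possible choice.
inline void bias_toward_end_think(std::vector<float> & logits, int32_t end_think_id,
                                  int32_t remaining, int32_t ramp_window, float max_bias) {
    assert(end_think_id >= 0 && static_cast<size_t>(end_think_id) < logits.size());
    const size_t id = static_cast<size_t>(end_think_id);
    if (remaining <= 0) {
        for (size_t i = 0; i < logits.size(); i++) {
            if (i != id) {
                logits[i] = -std::numeric_limits<float>::infinity();
            }
        }
        return;
    }
    logits[id] += end_think_ramp(remaining, ramp_window, max_bias);
}
```

Masking the other logits at the hard limit, rather than adding an infinite bias to </think>, avoids producing NaN in a later softmax. Whether a linear ramp is the right shape is an open question that would need testing.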

Options A and B build on the existing message injection architecture. Option C is a parallel approach that could coexist with either.

Happy to test on ROCm/gfx1100 (Qwen3.5-27B) and contribute a draft of Option A or B if there is interest.
