rtk-ai/rtk

Benchmark: RTK's effect on downstream LLM diagnosis on CI failure logs (35 cases × 3 model families)

Open

#2012 opened on May 21, 2026

Labels: area:ci, enhancement, help wanted, priority:low

Description

Related: #839 (qte77's compression-ratio benchmark — different axis), #1599, #1351

TL;DR

LogDx-CI measures what happens to downstream LLM diagnosis quality when RTK (or another token reducer) sits between a CI failure log and the LLM. It is complementary to #839: that benchmark measures RTK's compression ratio per command, while this one measures the downstream effect on diagnosis correctness.

Across 35 real GitHub Actions failure cases × 3 LLM families (Haiku 4.5, Sonnet 4.6, gpt-5-mini):

| Context method | Diagnosis score | Confident-error rate |
|---|---|---|
| raw log (baseline) | [X.XX] | [X.X%] |
| hybrid grep+tail | ~0.67 | |
| LLM-based summarizer (claude-sonnet-summarize) | ~0.63–0.66 | |
| rtk-read | 0.349 | 1.0% |
| rtk-err-cat | 0.470 | 2.9% |
| rtk-log | 0.249 | 13.3% |

Score range 0–1, higher is better. Confident-error rate = share of cases where the LLM commits to a high-confidence diagnosis that turns out to be wrong (full definition in methodology).
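
To make the confident-error rate concrete, here is a minimal Python sketch of how such a metric could be computed. The `Run` record, its field names, and the 0.8 confidence threshold are assumptions made for illustration; the benchmark's actual schema and threshold are defined in its methodology and may differ.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """Hypothetical per-run record; not the benchmark's real schema."""
    case_id: str
    confidence: float  # diagnoser's self-reported confidence, 0-1 (assumed)
    correct: bool      # whether the diagnosis matched ground truth


def confident_error_rate(runs, threshold=0.8):
    """Share of all runs that are both high-confidence and wrong.

    The 0.8 threshold is an assumed cut-off for "high confidence".
    """
    if not runs:
        return 0.0
    confident_wrong = sum(
        1 for r in runs if r.confidence >= threshold and not r.correct
    )
    return confident_wrong / len(runs)


runs = [
    Run("case-01", 0.95, False),  # confident and wrong: counted
    Run("case-02", 0.90, True),   # confident and right
    Run("case-03", 0.40, False),  # wrong, but not confident
    Run("case-04", 0.85, True),   # confident and right
]
print(confident_error_rate(runs))  # 0.25
```

If the denominator is all 35 × 3 = 105 runs per context method (an assumption, not stated above), the reported 1.0%, 2.9% and 13.3% would be consistent with 1, 3 and 14 confident-wrong runs respectively.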

rtk-log ranks 10th of 11 context methods on the full leaderboard: https://logdx-bench.github.io/leaderboard.html

Methodology

Each (case, context_method, diagnoser) triple is an independent run. Diagnoses are scored deterministically against AI-drafted, author-verified ground truth using a calibrated formula (diagnosis_score_v1_1). RTK invocations are stock — no .rtkrc tuning. Direct Anthropic and OpenAI APIs were used; no agent harness in the loop.
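
The run grid described above can be sketched as follows. This is an illustrative outline only: the case IDs are placeholders, only four of the context methods are listed, and `mean_score_by_method` is a hypothetical helper that averages already-computed scores. It does not reproduce `diagnosis_score_v1_1`, whose formula is not given here.

```python
from collections import defaultdict
from itertools import product

# Placeholder identifiers; the real case IDs and method names may differ.
cases = [f"case-{i:02d}" for i in range(1, 36)]            # 35 cases
context_methods = ["raw log", "rtk-read", "rtk-err-cat", "rtk-log"]
diagnosers = ["Haiku 4.5", "Sonnet 4.6", "gpt-5-mini"]


def run_grid(cases, methods, diagnosers):
    """Enumerate every independent (case, context_method, diagnoser) run."""
    return list(product(cases, methods, diagnosers))


def mean_score_by_method(scores):
    """Average per-run scores by context method.

    `scores` maps (case, context_method, diagnoser) -> score in [0, 1].
    """
    buckets = defaultdict(list)
    for (_case, method, _diagnoser), score in scores.items():
        buckets[method].append(score)
    return {m: sum(v) / len(v) for m, v in buckets.items()}


grid = run_grid(cases, context_methods, diagnosers)
print(len(grid))  # 420 = 35 cases x 4 methods x 3 diagnosers
```

Keeping each triple independent means a per-method score is a plain mean over 35 × 3 runs, with no state shared between runs.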

Caveats (please read before citing)

  • 35 cases — small. Treat as directional, not statistical.
  • Single-author benchmark — no third-party verification (the HF dataset mirror is by the same author). Independent replication is the strongest follow-up.
  • AI-assisted human ground-truth review — calibration is by LLM-as-judge (claude-opus-4-7) + 1 author review pass. Not inter-rater-validated.
  • Stock RTK invocation — no custom .rtkrc, no tuning. A CI-tuned config might close some of this gap; we did not test that.

How this relates to existing issues

  • #1599 (go build false-success) — our confident_error_rate metric quantifies the failure mode described there: rtk-log and rtk-err-cat lose enough signal that downstream LLMs commit to confident wrong diagnoses ~13% and ~3% of the time respectively.
  • #1351 (codex token usage) — our agent-loop measurement (Sonnet 4.6 + 4 deterministic tools on raw.log) shows rtk-log recovers diagnostic quality via tool calls but needs 2.60 tools/case, the highest of any context method (vs 0.97 for the top hybrid). Same direction as the codex finding; magnitude is smaller because CI diagnosis is more constrained than terminal-bench tasks.
  • #839 (compression-ratio benchmark) — orthogonal axis. qte77 measures "does RTK compress X% as claimed?". We measure "given RTK's compression, can the LLM still diagnose?". Combined picture: on commands where RTK doesn't compress (qte77), users see no savings; on commands where it does compress (CI logs, ~99% byte reduction), downstream cost is non-trivial.

How we can help

Concretely:

  1. Replication offer — if the RTK team wants to run the benchmark against a tuned .rtkrc, the eval pipeline is one python3 tools/run_diagnosis.py invocation per (split, diagnoser) tuple. Happy to walk through it.
  2. CI-log-tuned rewrite rules — if RTK has interest in a "CI log mode" that preserves test-failure signal better (e.g. always keep pytest summary / cargo error / GHA ##[error] blocks verbatim), we have 35 case-specific failure-mode breakdowns that could inform rule design.
  3. Combined benchmark with #839 — qte77's compression-ratio data + our downstream-LLM data could go in a joint analysis showing per-command which commands hit the "RTK doesn't compress" failure mode (#839) vs the "RTK compresses but loses signal" failure mode (this issue). Different fixes for each.
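
As a rough illustration of the "CI log mode" idea in item 2, the sketch below keeps failure-signal lines verbatim and appends a short tail of the log. It is a plain grep-plus-tail filter written in Python, not RTK's rule engine or configuration format. The regular expressions are assumptions about typical pytest, cargo and GitHub Actions output, and they assume any timestamp prefix has already been stripped from each line.

```python
import re

# Assumed patterns for lines that carry failure signal; illustrative only.
KEEP_PATTERNS = [
    re.compile(r"^##\[error\]"),                # GitHub Actions error annotation
    re.compile(r"^error(\[E\d+\])?:"),          # cargo / rustc error header
    re.compile(r"^(FAILED|ERROR) \S+"),         # pytest short test summary line
    re.compile(r"^=+ .*(failed|error).* =+$"),  # pytest final summary bar
]


def keep_signal_lines(log_text, tail=5):
    """Return matching lines plus the last `tail` lines, in original order."""
    lines = log_text.splitlines()
    keep = set(range(max(0, len(lines) - tail), len(lines)))
    for i, line in enumerate(lines):
        if any(p.search(line) for p in KEEP_PATTERNS):
            keep.add(i)
    return "\n".join(lines[i] for i in sorted(keep))


sample = "\n".join([
    "Compiling foo v0.1.0",
    "error[E0308]: mismatched types",
    "noise line 1",
    "noise line 2",
    "FAILED tests/test_x.py::test_y - AssertionError",
    "noise line 3",
    "##[error]Process completed with exit code 1.",
])
print(keep_signal_lines(sample, tail=1))
```

On the sample above, the filter drops the compile banner and the three noise lines and keeps the cargo error header, the pytest failure line and the `##[error]` annotation. Which rules generalize across CI logs is what the 35 case-specific breakdowns would need to show.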

No commitment expected — flagging that the data is CC-BY-4.0 and available if useful.
