Benchmark: RTK's effect on downstream LLM diagnosis on CI failure logs (35 cases × 3 model families)
#2,012, opened on May 21, 2026
Description
Related: #839 (qte77's compression-ratio benchmark — different axis), #1599, #1351
TL;DR
LogDx-CI measures what happens to downstream LLM diagnosis quality when RTK (or another token reducer) sits between a CI failure log and the LLM. It is complementary to #839: that benchmark measures RTK's compression ratio per command, while we measure the downstream effect on diagnosis correctness.
Across 35 real GitHub Actions failure cases × 3 LLM families (Haiku 4.5, Sonnet 4.6, gpt-5-mini):
| Context method | Diagnosis score | Confident-error rate |
|---|---|---|
| raw log (baseline) | [X.XX] | [X.X%] |
| hybrid grep+tail | ~0.67 | n/a |
| LLM-based summarizer (`claude-sonnet-summarize`) | ~0.63–0.66 | n/a |
| `rtk-read` | 0.349 | 1.0% |
| `rtk-err-cat` | 0.470 | 2.9% |
| `rtk-log` | 0.249 | 13.3% |
Score range 0–1, higher is better. Confident-error rate = share of cases where the LLM commits to a high-confidence diagnosis that turns out to be wrong (full definition in methodology).
rtk-log ranks 10th of 11 context methods on the full leaderboard: https://logdx-bench.github.io/leaderboard.html
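As a rough illustration of the confident-error rate described above, here is a minimal sketch. The `Run` record, its field names, and the 0.8 confidence threshold are assumptions made for this example only; the benchmark's actual definition lives in its methodology and may differ.

```python
from dataclasses import dataclass


@dataclass
class Run:
    # All fields are hypothetical; the real LogDx-CI schema may differ.
    context_method: str
    confidence: float  # assumed: diagnoser's self-reported confidence in [0, 1]
    correct: bool      # assumed: whether the diagnosis matched ground truth


def confident_error_rate(runs, threshold=0.8):
    """Share of runs that are both high-confidence and wrong.

    The threshold is an illustrative assumption, not the benchmark's value.
    """
    if not runs:
        return 0.0
    confident_wrong = sum(
        1 for r in runs if r.confidence >= threshold and not r.correct
    )
    return confident_wrong / len(runs)


runs = [
    Run("rtk-log", 0.90, False),  # confident and wrong: counted
    Run("rtk-log", 0.90, True),   # confident and right: not counted
    Run("rtk-log", 0.40, False),  # wrong but hedged: not counted
    Run("rtk-log", 0.95, False),  # confident and wrong: counted
]
rate = confident_error_rate(runs)
```

Note that the denominator is all runs, not only the wrong ones, which matches the "share of cases" wording: a method that is often wrong but hedges appropriately scores low on this metric.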
Methodology
Each (case, context_method, diagnoser) triple is an independent run. Diagnoses are scored deterministically against AI-drafted, author-verified ground truth using a calibrated formula (diagnosis_score_v1_1). RTK invocations are stock — no .rtkrc tuning. Direct Anthropic and OpenAI APIs were used; no agent harness in the loop.
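To make the run grid concrete, the sketch below enumerates (case, context_method, diagnoser) triples. The case IDs and diagnoser labels are placeholders, and only four of the context methods named in this issue are listed, so the resulting count is illustrative rather than the benchmark's actual run total.

```python
from itertools import product

# Placeholder identifiers; the real case IDs and model labels may differ.
cases = [f"case_{i:02d}" for i in range(1, 36)]  # 35 cases
context_methods = ["raw", "rtk-read", "rtk-err-cat", "rtk-log"]  # subset only
diagnosers = ["haiku-4.5", "sonnet-4.6", "gpt-5-mini"]

# Each triple is scored independently of every other triple.
run_grid = list(product(cases, context_methods, diagnosers))
n_runs = len(run_grid)  # 35 * 4 * 3
```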
Caveats (please read before citing)
- 35 cases: small. Treat as directional, not statistical.
- Single-author benchmark: no third-party verification (the HF dataset mirror is by the same author). Independent replication is the strongest follow-up.
- AI-assisted human ground-truth review: calibration is by LLM-as-judge (`claude-opus-4-7`) + 1 author review pass. Not inter-rater-validated.
- Stock RTK invocation: no custom `.rtkrc`, no tuning. A CI-tuned config might close some of this gap; we did not test that.
How this relates to existing issues
- #1599 (go build false-success): our `confident_error_rate` metric quantifies the failure mode described there. `rtk-log` and `rtk-err-cat` lose enough signal that downstream LLMs commit to confident wrong diagnoses ~13% and ~3% of the time respectively.
- #1351 (codex token usage): our agent-loop measurement (Sonnet 4.6 + 4 deterministic tools on `raw.log`) shows `rtk-log` recovers diagnostic quality via tool calls but needs 2.60 tools/case, the highest of any context method (vs 0.97 for the top hybrid). Same direction as the codex finding; magnitude is smaller because CI diagnosis is more constrained than terminal-bench tasks.
- #839 (compression-ratio benchmark): orthogonal axis. qte77 measures "does RTK compress X% as claimed?". We measure "given RTK's compression, can the LLM still diagnose?". Combined picture: on commands where RTK doesn't compress (qte77), users see no savings; on commands where it does compress (CI logs, ~99% byte reduction), downstream cost is non-trivial.
How we can help
Concretely:
- Replication offer: if the RTK team wants to run the benchmark against a tuned `.rtkrc`, the eval pipeline is one `python3 tools/run_diagnosis.py` invocation per (split, diagnoser) tuple. Happy to walk through it.
- CI-log-tuned rewrite rules: if RTK has interest in a "CI log mode" that preserves test-failure signal better (e.g. always keep pytest summary / cargo error / GHA `##[error]` blocks verbatim), we have 35 case-specific failure-mode breakdowns that could inform rule design.
- Combined benchmark with #839: qte77's compression-ratio data + our downstream-LLM data could go in a joint analysis showing, per command, which commands hit the "RTK doesn't compress" failure mode (#839) vs the "RTK compresses but loses signal" failure mode (this issue). Different fixes for each.
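To illustrate what a "CI log mode" rule could look like, here is a minimal sketch of a filter that keeps known failure-signal lines verbatim alongside the log tail. It is not RTK code and does not use any RTK API; the function name, the pattern list, and the tail size are assumptions chosen for the example, and real rules would need to be derived from the per-case breakdowns.

```python
import re

# Leading timestamp that GitHub Actions prepends to raw log lines.
TIMESTAMP = re.compile(r"^\d{4}-\d{2}-\d{2}T[\d:.]+Z\s+")

# Illustrative signal patterns (assumptions, not RTK rewrite rules).
SIGNAL_PATTERNS = [
    re.compile(r"^##\[error\]"),                    # GHA error annotation
    re.compile(r"^=+ short test summary info =+"),  # pytest summary header
    re.compile(r"^(FAILED|ERROR) \S+"),             # pytest summary entries
    re.compile(r"^error(\[E\d+\])?:"),              # cargo / rustc errors
]


def keep_signal(log_text, tail_lines=20):
    """Keep the last `tail_lines` lines plus every signal line, in order."""
    lines = log_text.splitlines()
    tail_start = max(len(lines) - tail_lines, 0)
    kept = []
    for i, line in enumerate(lines):
        body = TIMESTAMP.sub("", line)  # match on content, keep original line
        if i >= tail_start or any(p.search(body) for p in SIGNAL_PATTERNS):
            kept.append(line)
    return "\n".join(kept)


sample = "\n".join(
    ["noise"] * 50
    + ["2026-05-21T10:00:00.0000000Z error[E0308]: mismatched types"]
    + ["noise"] * 50
    + ["##[error]Process completed with exit code 101."]
)
reduced = keep_signal(sample, tail_lines=5)
```

In this sample the compiler error sits mid-log, far from the tail, and still survives the reduction, which is the property the proposed mode is meant to guarantee.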
No commitment expected — flagging that the data is CC-BY-4.0 and available if useful.
Links
- Leaderboard: https://logdx-bench.github.io/
- Code + data: https://github.com/eyuansu62/LogDx
- Cases on HF: https://huggingface.co/datasets/eyuansu71/logdx-ci
- v1.2 release notes: https://github.com/eyuansu62/LogDx/blob/main/RELEASE_NOTES_v1_2.md