Benchmark: RTK's effect on downstream LLM diagnosis on CI failure logs (35 cases × 3 model families) · rtk-ai/rtk#2012

Repository metrics

Stars: (48,085 stars)
PR merge metrics: (avg. merge time 11d 1h) (45 PRs merged in 30 days)

Description


Related: #839 (qte77's compression-ratio benchmark — different axis), #1599, #1351

TL;DR

LogDx-CI measures what happens to downstream LLM diagnosis quality when RTK (or another token reducer) sits between a CI failure log and the LLM. It is complementary to #839, which measures RTK's compression ratio per command; we measure the downstream effect on diagnosis correctness.

Across 35 real GitHub Actions failure cases × 3 LLM families (Haiku 4.5, Sonnet 4.6, gpt-5-mini):

Context method	Diagnosis score	Confident-error rate
raw log (baseline)	`[X.XX]`	`[X.X%]`
hybrid grep+tail	~0.67	—
LLM-based summarizer (`claude-sonnet-summarize`)	~0.63–0.66	—
`rtk-read`	0.349	1.0%
`rtk-err-cat`	0.470	2.9%
`rtk-log`	0.249	13.3%

Score range 0–1, higher is better. Confident-error rate = share of cases where the LLM commits to a high-confidence diagnosis that turns out to be wrong (full definition in methodology).
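
To make the confident-error rate concrete, here is a minimal sketch of how such a metric could be computed. The `Run` fields and the 0.8 confidence threshold are assumptions for illustration only; the benchmark's actual definition lives in its methodology and may differ.

```python
from dataclasses import dataclass


@dataclass
class Run:
    confidence: float  # hypothetical: model's self-reported confidence in [0, 1]
    correct: bool      # hypothetical: whether the diagnosis matched ground truth


def confident_error_rate(runs, threshold=0.8):
    """Share of runs that are both high-confidence and wrong.

    The threshold is an assumed cut-off, not the benchmark's value.
    """
    if not runs:
        return 0.0
    confident_wrong = sum(
        1 for r in runs if r.confidence >= threshold and not r.correct
    )
    return confident_wrong / len(runs)


runs = [Run(0.9, False), Run(0.95, True), Run(0.4, False), Run(0.85, True)]
print(confident_error_rate(runs))  # 0.25
```

Note that the low-confidence wrong answer (0.4) does not count: the metric targets cases where the model commits to a wrong diagnosis, not cases where it hedges.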

rtk-log ranks 10th of 11 context methods on the full leaderboard: https://logdx-bench.github.io/leaderboard.html

Methodology

Each (case, context_method, diagnoser) triple is an independent run. Diagnoses are scored deterministically against AI-drafted, author-verified ground truth using a calibrated formula (diagnosis_score_v1_1). RTK invocations are stock — no .rtkrc tuning. Direct Anthropic and OpenAI APIs were used; no agent harness in the loop.
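
The run grid described above can be sketched as a Cartesian product. The case IDs and the string labels below are placeholders, and only four of the context methods are listed; none of these names are taken from the LogDx code.

```python
from itertools import product

# Placeholder identifiers, not the benchmark's real case IDs.
cases = [f"case_{i:02d}" for i in range(1, 36)]
# Illustrative subset of the context methods on the leaderboard.
context_methods = ["raw", "rtk-read", "rtk-err-cat", "rtk-log"]
# Illustrative labels for the three model families.
diagnosers = ["haiku-4.5", "sonnet-4.6", "gpt-5-mini"]

# Each triple is one independent run; no state is shared between runs.
runs = list(product(cases, context_methods, diagnosers))
print(len(runs))  # 35 * 4 * 3 = 420
```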

Caveats (please read before citing)

35 cases — small. Treat as directional, not statistical.
Single-author benchmark — no third-party verification (the HF dataset mirror is by the same author). Independent replication is the strongest follow-up.
AI-assisted human ground-truth review — calibration is by LLM-as-judge (claude-opus-4-7) + 1 author review pass. Not inter-rater-validated.
Stock RTK invocation — no custom .rtkrc, no tuning. A CI-tuned config might close some of this gap; we did not test that.

How this relates to existing issues

#1599 (go build false-success) — our confident_error_rate metric quantifies the failure mode described there: rtk-log and rtk-err-cat lose enough signal that downstream LLMs commit to confident wrong diagnoses ~13% and ~3% of the time respectively.
#1351 (codex token usage) — our agent-loop measurement (Sonnet 4.6 + 4 deterministic tools on raw.log) shows rtk-log recovers diagnostic quality via tool calls but needs 2.60 tools/case, the highest of any context method (vs 0.97 for the top hybrid). Same direction as the codex finding; magnitude is smaller because CI diagnosis is more constrained than terminal-bench tasks.
#839 (compression-ratio benchmark) — orthogonal axis. qte77 measures "does RTK compress X% as claimed?". We measure "given RTK's compression, can the LLM still diagnose?". Combined picture: on commands where RTK doesn't compress (qte77), users see no savings; on commands where it does compress (CI logs, ~99% byte reduction), downstream cost is non-trivial.
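
For reference, a byte-reduction figure like the ~99% quoted above can be computed as follows. This is a generic sketch; the function name and the synthetic inputs are illustrative and are not drawn from the benchmark data.

```python
def byte_reduction(raw: bytes, reduced: bytes) -> float:
    """Fraction of bytes removed by a reducer (0.0 = no reduction)."""
    if not raw:
        raise ValueError("raw log must not be empty")
    return 1.0 - len(reduced) / len(raw)


# Synthetic example: a 10,000-byte log reduced to 100 bytes.
raw = b"x" * 10_000
reduced = b"x" * 100
print(f"{byte_reduction(raw, reduced):.1%}")  # 99.0%
```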

How we can help

Concretely:

Replication offer — if the RTK team wants to run the benchmark against a tuned .rtkrc, the eval pipeline is one python3 tools/run_diagnosis.py invocation per (split, diagnoser) tuple. Happy to walk through it.
CI-log-tuned rewrite rules — if RTK has interest in a "CI log mode" that preserves test-failure signal better (e.g. always keep pytest summary / cargo error / GHA ##[error] blocks verbatim), we have 35 case-specific failure-mode breakdowns that could inform rule design.
Combined benchmark with #839 — qte77's compression-ratio data + our downstream-LLM data could go in a joint analysis showing per-command which commands hit the "RTK doesn't compress" failure mode (#839) vs the "RTK compresses but loses signal" failure mode (this issue). Different fixes for each.
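
As a rough illustration of the "CI log mode" idea, the sketch below keeps failure-signal lines verbatim and appends a short tail of the log. It is not RTK's implementation or rule syntax. The patterns are assumptions based on the usual shape of GitHub Actions `##[error]` annotations, pytest short-summary entries, and cargo error lines, and they assume timestamps have already been stripped from the log.

```python
import re

# Assumed patterns; a real rule set would need tuning against actual logs.
KEEP_PATTERNS = [
    re.compile(r"##\[error\]"),          # GitHub Actions error annotation
    re.compile(r"^error(\[E\d+\])?:"),   # cargo / rustc error line
    re.compile(r"^(FAILED|ERROR) "),     # pytest short-summary entry
]


def keep_failure_signal(log: str, tail: int = 5) -> str:
    """Keep lines matching any pattern, plus the last `tail` lines, in order."""
    lines = log.splitlines()
    keep = {
        i for i, line in enumerate(lines)
        if any(p.search(line) for p in KEEP_PATTERNS)
    }
    keep.update(range(max(0, len(lines) - tail), len(lines)))
    return "\n".join(lines[i] for i in sorted(keep))


log = "\n".join([
    "Compiling foo v0.1.0",
    "error[E0308]: mismatched types",
    "downloading deps",
    "FAILED tests/test_a.py::test_x - AssertionError",
    "lots of noise",
    "more noise",
    "##[error]Process completed with exit code 1.",
])
print(keep_failure_signal(log, tail=1))
```

With `tail=1`, the example prints only the three failure lines and drops the surrounding noise.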

No commitment expected — flagging that the data is CC-BY-4.0 and available if useful.

Links

Leaderboard: https://logdx-bench.github.io/
Code + data: https://github.com/eyuansu62/LogDx
Cases on HF: https://huggingface.co/datasets/eyuansu71/logdx-ci
v1.2 release notes: https://github.com/eyuansu62/LogDx/blob/main/RELEASE_NOTES_v1_2.md

Contributor Guide

Research direction: Re-run the benchmark on the 35 cases with a tuned .rtkrc configuration and measure the change in diagnosis score.
Tech stack: Rust, Python
Domain: testing, performance, developer experience
Issue type: Research
Difficulty: 3
Estimated time: 1-2 days
Activity status: Active
Clarity: Clear
Prerequisites: Git, Python, basic understanding of LLMs
Beginner friendliness: 65