Benchmark: RTK's effect on downstream LLM diagnosis on CI failure logs (35 cases × 3 model families) · rtk-ai/rtk#2012

1 comment · 0 reactions · 0 assignees · Rust · 2,914 forks · batch import

Labels: area:ci, enhancement, help wanted, priority:low

Repository metrics

Stars: 48,085
PR merge metrics: average time to merge 11 days 1 hour; 45 PRs merged in the last 30 days

Description


Related: #839 (qte77's compression-ratio benchmark — different axis), #1599, #1351

TL;DR

LogDx-CI measures what happens to downstream LLM diagnosis quality when RTK (or another token reducer) sits between a CI failure log and the LLM. It complements #839, which measures RTK's compression ratio per command: we measure the downstream effect on diagnosis correctness.

Across 35 real GitHub Actions failure cases × 3 LLM families (Haiku 4.5, Sonnet 4.6, gpt-5-mini):

| Context method | Diagnosis score | Confident-error rate |
| --- | --- | --- |
| raw log (baseline) | `[X.XX]` | `[X.X%]` |
| hybrid grep+tail | ~0.67 | not listed |
| LLM-based summarizer (`claude-sonnet-summarize`) | ~0.63–0.66 | not listed |
| `rtk-read` | 0.349 | 1.0% |
| `rtk-err-cat` | 0.470 | 2.9% |
| `rtk-log` | 0.249 | 13.3% |

Score range 0–1, higher is better. Confident-error rate = share of cases where the LLM commits to a high-confidence diagnosis that turns out to be wrong (full definition in methodology).
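
To make the confident-error definition above concrete, here is a minimal sketch of how such a rate could be computed. It is not the benchmark's implementation: the `Run` record, its field names, and the 0.8 confidence threshold are all assumptions for illustration, and the full definition in the methodology may differ.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """One scored diagnosis run (hypothetical record layout)."""
    case_id: str
    confidence: float  # diagnoser's stated confidence in [0, 1] (assumed field)
    correct: bool      # whether the diagnosis matched ground truth (assumed field)


def confident_error_rate(runs, threshold=0.8):
    """Share of runs that are both high-confidence and wrong.

    The threshold is an illustrative assumption, not the benchmark's value.
    """
    if not runs:
        return 0.0
    confident_wrong = sum(
        1 for r in runs if r.confidence >= threshold and not r.correct
    )
    return confident_wrong / len(runs)
```

The denominator is all runs, not only the wrong ones, which matches the "share of cases" wording: a method that is often wrong but rarely confident still gets a low rate.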

rtk-log ranks 10th of 11 context methods on the full leaderboard: https://logdx-bench.github.io/leaderboard.html

Methodology

Each (case, context_method, diagnoser) triple is an independent run. Diagnoses are scored deterministically against AI-drafted, author-verified ground truth using a calibrated formula (`diagnosis_score_v1_1`). RTK invocations are stock, with no `.rtkrc` tuning. The Anthropic and OpenAI APIs were called directly, with no agent harness in the loop.
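
The run grid described above can be sketched as a Cartesian product, followed by a per-method aggregation. This is an illustration only: the function names are hypothetical, the scoring formula `diagnosis_score_v1_1` is not reproduced, and averaging scores with a plain mean per context method is an assumption about how the leaderboard number is formed.

```python
from collections import defaultdict
from itertools import product
from statistics import mean


def run_grid(cases, context_methods, diagnosers):
    """Enumerate every (case, context_method, diagnoser) triple once."""
    return list(product(cases, context_methods, diagnosers))


def mean_score_by_method(scored_runs):
    """Average score per context method.

    scored_runs: iterable of (case, context_method, diagnoser, score) tuples.
    A plain mean is an assumption; the benchmark may aggregate differently.
    """
    by_method = defaultdict(list)
    for _case, method, _diagnoser, score in scored_runs:
        by_method[method].append(score)
    return {method: mean(scores) for method, scores in by_method.items()}
```

With 35 cases and 3 diagnosers, each context method contributes 105 triples to the grid.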

Caveats (please read before citing)

Small sample: 35 cases. Treat the results as directional, not statistically conclusive.
Single-author benchmark: no third-party verification (the HF dataset mirror is by the same author). Independent replication is the strongest follow-up.
AI-assisted human ground-truth review: calibration is by LLM-as-judge (claude-opus-4-7) plus one author review pass. Not inter-rater-validated.
Stock RTK invocation: no custom `.rtkrc`, no tuning. A CI-tuned config might close some of this gap; we did not test that.
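
To illustrate why the sample-size caveat matters, the sketch below computes a Wilson score interval for an error rate. The counts used in the usage note are an inference, not something stated in this issue: we assume each context method has 105 runs (35 cases × 3 diagnosers), under which a 13.3% rate would correspond to 14 confident errors. The interval also treats runs as independent, which is optimistic because the three diagnosers share the same 35 cases.

```python
import math


def wilson_interval(errors, runs, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is ~95%)."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    p = errors / runs
    denom = 1 + z**2 / runs
    center = (p + z**2 / (2 * runs)) / denom
    half = z * math.sqrt(p * (1 - p) / runs + z**2 / (4 * runs**2)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the extremes.
    return max(0.0, center - half), min(1.0, center + half)
```

Calling `wilson_interval(14, 105)` under the assumed counts gives an interval several percentage points wide on each side of the point estimate, which is the practical meaning of "directional" here.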

How this relates to existing issues

#1599 (go build false-success): our confident_error_rate metric quantifies the failure mode described there. rtk-log and rtk-err-cat lose enough signal that downstream LLMs commit to confident wrong diagnoses ~13% and ~3% of the time, respectively.
#1351 (codex token usage): our agent-loop measurement (Sonnet 4.6 + 4 deterministic tools on raw.log) shows that rtk-log recovers diagnostic quality via tool calls but needs 2.60 tools/case, the highest of any context method (vs 0.97 for the top hybrid). This points in the same direction as the codex finding; the magnitude is smaller because CI diagnosis is more constrained than terminal-bench tasks.
#839 (compression-ratio benchmark): an orthogonal axis. qte77 measures "does RTK compress X% as claimed?"; we measure "given RTK's compression, can the LLM still diagnose?". Combined picture: on commands where RTK doesn't compress (qte77), users see no savings; on commands where it does compress (CI logs, ~99% byte reduction), the downstream cost is non-trivial.

How we can help

Concretely:

Replication offer: if the RTK team wants to run the benchmark against a tuned `.rtkrc`, the eval pipeline is one `python3 tools/run_diagnosis.py` invocation per (split, diagnoser) pair. Happy to walk through it.
CI-log-tuned rewrite rules: if RTK is interested in a "CI log mode" that better preserves test-failure signal (e.g. always keep pytest summary / cargo error / GHA `##[error]` blocks verbatim), we have 35 case-specific failure-mode breakdowns that could inform rule design.
Combined benchmark with #839: qte77's compression-ratio data and our downstream-LLM data could go into a joint analysis showing, per command, which failure mode applies: "RTK doesn't compress" (#839) or "RTK compresses but loses signal" (this issue). Each calls for a different fix.
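
As a rough illustration of the "CI log mode" idea in the second item, the sketch below keeps lines that look like failure signal, plus a few lines of trailing context, and drops everything else. It is a standalone Python example, not RTK's rule syntax or API: the patterns, the function name, and the context window are all assumptions, and real rules would need tuning against the case-specific breakdowns.

```python
import re

# Illustrative patterns only; not taken from RTK or from the benchmark.
SIGNAL_PATTERNS = [
    re.compile(r"##\[error\]"),                     # GitHub Actions error marker
    re.compile(r"=+ short test summary info =+"),   # pytest summary header
    re.compile(r"(?:^|\s)(?:FAILED|ERROR) \S+::"),  # pytest summary entries
    re.compile(r"(?:^|\s)error(?:\[E\d{4}\])?: "),  # cargo/rustc error lines
]


def keep_signal_lines(lines, context_after=2):
    """Return matching lines verbatim, each with `context_after` following lines."""
    keep = set()
    for i, line in enumerate(lines):
        if any(p.search(line) for p in SIGNAL_PATTERNS):
            keep.update(range(i, min(i + context_after + 1, len(lines))))
    return [lines[i] for i in sorted(keep)]
```

Patterns use `search` rather than `match` so they still fire when GitHub Actions prefixes each line with a timestamp. Kept lines are returned unmodified and in their original order, which is the "verbatim" property the proposal asks for.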

No commitment expected; we are only flagging that the data is licensed CC-BY-4.0 and available if useful.

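Contributor guide

Research direction: Re-run the benchmark on the 35 cases with a tuned .rtkrc configuration and measure the change in diagnosis score.
Tech stack: Rust, Python
Domains: testing, performance, developer experience
Issue type: Investigation
Difficulty: 3
Estimated time: 1-2 days
Activity status: Active
Clarity: Clear
Prerequisites: Git, Python, basic understanding of LLMs
Beginner friendliness: 65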