rtk-ai/rtk

Empirical benchmark: 5 repos, 2,100 measurements — how do actual savings compare to claims?

Open

#839 opened on Mar 26, 2026

Related: #590, #538, #545, #827

Context

RTK is a genuinely useful tool — ls, docker ps, and verbose git output compress well, and the Rust implementation is fast. Impressive work for a small team shipping at this pace.

This issue isn't about whether RTK has value (it does), but about closing the gap between what the README promises and what users actually experience. We ran a benchmark to quantify that gap — and we'd like to help close it.

Setup

RTK v0.33.1 was tested across 5 repos (83 to 74,785 files; Shell, Bash, and Python) in 9 command categories, with 10 iterations and 3 independent runs: 700 measurements per run, 2,100 in total. Results are deterministic: 699 of 700 operations showed less than 10 bytes of variance per run, and cross-run results are byte-identical when repo state is unchanged.
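For readers who want to reproduce a single data point, the comparison described above can be sketched roughly as follows. This is not the bench.sh script itself; the function names are hypothetical, and the only assumption about RTK is the `rtk <command>` prefix form used in this benchmark.

```python
import statistics
import subprocess


def output_bytes(argv):
    """Run a command and return the size of its stdout in bytes."""
    result = subprocess.run(argv, capture_output=True, check=False)
    return len(result.stdout)


def savings_pct(raw_bytes, filtered_bytes):
    """Percentage of bytes removed; a negative value means the output grew."""
    if raw_bytes == 0:
        return 0.0
    return 100.0 * (raw_bytes - filtered_bytes) / raw_bytes


def measure(argv, wrapper, iterations=10):
    """Compare `argv` against `wrapper + argv` over several iterations.

    The spread values (max minus min size) give a rough determinism check.
    """
    raw = [output_bytes(argv) for _ in range(iterations)]
    filtered = [output_bytes(wrapper + argv) for _ in range(iterations)]
    return {
        "savings_pct": savings_pct(
            statistics.median(raw), statistics.median(filtered)
        ),
        "raw_spread": max(raw) - min(raw),
        "filtered_spread": max(filtered) - min(filtered),
    }
```

A call such as `measure(["git", "log"], wrapper=["rtk"])` would then compare plain `git log` against `rtk git log` in the current repository.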

Results

| Category | N | Actual | Claimed | Notes |
|---|---|---|---|---|
| ls | 100 | 72% | 80% | Matches well, especially ls -laR on larger repos |
| git-log | 150 | 98% | 80% | Truncation rather than compression: 544K→1.8K |
| git-status | 100 | 46% | 80% | Verbose format compresses; -s/--porcelain pass through |
| git-diff | 100 | 20% | 75% | Full diff compresses somewhat; --stat passes through |
| docker | 20 | 38% | 80% | docker ps = 75%; docker ps -a = 0% |
| tree | 50 | 4% | 80% | Minimal compression |
| cat | 50 | 0% | 70% | Passes through unchanged |
| grep | 100 | -0% | 80% | Slightly larger output (RTK overhead) |
| ruff | 30 | -1% | 80% | Slightly larger output |

Also tested separately: pytest --co -q showed 0% savings (185,962 B unchanged, against a claimed 90%).

The gap in the aggregate claim

The README's savings table includes several categories that show no compression in practice. These account for over half the claimed total:

| Category | Claimed saved/session | Measured |
|---|---|---|
| cat/read (20x) | 28,000 | 0% |
| grep/rg (8x) | 12,800 | -0% |
| pytest (4x) | 7,200 | 0% |
| ruff (3x) | 2,400 | slightly negative |
| Subtotal | 50,400 (54% of total claim) | ~0 |
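The subtotal can be checked with a few lines of arithmetic. The sketch below uses only the figures from the table; the implied grand total is derived from the rounded 54% share, so it is approximate.

```python
# Claimed savings per session for the categories that measured ~0,
# copied from the table above.
claimed_saved = {
    "cat/read": 28_000,
    "grep/rg": 12_800,
    "pytest": 7_200,
    "ruff": 2_400,
}

subtotal = sum(claimed_saved.values())

# 54% is the rounded share quoted above, so the implied total is approximate.
share_of_total = 0.54
implied_total = subtotal / share_of_total

print(subtotal)              # 50400
print(round(implied_total))  # 93333
```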

The overall 72% grand total in our benchmark is real, but it's driven almost entirely by git log truncation and ls -laR on a large repo — not broad compression across all categories.
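One way to see how much the grand total depends on a few large outputs is to average the per-category results without weighting by bytes. The sketch below uses the N and Actual columns from the results table and treats "-0%" as 0. It assumes the 72% figure is byte-weighted (total bytes saved over total raw bytes), which this issue does not state explicitly.

```python
# (category, N, actual savings in percent), from the results table.
results = [
    ("ls", 100, 72),
    ("git-log", 150, 98),
    ("git-status", 100, 46),
    ("git-diff", 100, 20),
    ("docker", 20, 38),
    ("tree", 50, 4),
    ("cat", 50, 0),
    ("grep", 100, 0),   # reported as -0%
    ("ruff", 30, -1),
]

# Every category counts equally.
unweighted_mean = sum(pct for _, _, pct in results) / len(results)

# Every measurement counts equally.
total_n = sum(n for _, n, _ in results)
n_weighted_mean = sum(n * pct for _, n, pct in results) / total_n

print(f"{unweighted_mean:.1f}%")  # 30.8%
print(f"{n_weighted_mean:.1f}%")  # 42.0%
```

Both averages sit well below 72%, which is consistent with a handful of very large outputs accounting for most of the bytes saved.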

Environment & Repos

RTK 0.33.1, Linux 6.8.0-1044-azure x86_64, Bash 5.2.37
Direct CLI (command vs rtk command) — no CC hook involved

| Repo | Files | Size | Language |
|---|---|---|---|
| dotfiles | 83 | 568K | Shell |
| gha-github-mirror-action | 194 | 1.4M | Bash/YAML |
| ralph-template | 1,732 | 12M | Bash/Python |
| so101-biolab-automation | 8,371 | 365M | Python |
| Agents-eval | 74,785 | 8.2G | Python |

Per-repo detail tables and raw output files are available on request. The benchmark script (bench.sh) can be shared for independent reproduction.

How we can help

We'd be happy to contribute to closing the gap between claims and reality — either by improving the docs or the tool itself:

  1. README accuracy: We can PR an updated savings table with measured ranges per command, replacing the single-point estimates. Honest numbers build more trust than optimistic ones.

  2. Benchmark script as CI: Our bench.sh could be adapted into a regression test that runs on tagged releases, keeping the README table in sync with actual performance.

  3. Compression for passthrough commands: cat, grep, and tree currently pass through unchanged. If there's interest, we can explore compression strategies for these (e.g., grep result deduplication, tree structure compaction, cat truncation with line-count summaries).

  4. ruff/pytest rewrite rules: These currently add overhead or pass through. Happy to help design rewrite rules — we have deep experience with Python tooling output formats.
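As a rough illustration of the CI idea in item 2, a release check could compare measured savings against the README claims and flag any category that falls short by more than a tolerance. This is a sketch, not part of bench.sh or RTK: the function name and the 10-point tolerance are hypothetical, and the numbers are the ones from the results table.

```python
CLAIMED = {
    "ls": 80, "git-log": 80, "git-status": 80, "git-diff": 75,
    "docker": 80, "tree": 80, "cat": 70, "grep": 80, "ruff": 80,
}
MEASURED = {
    "ls": 72, "git-log": 98, "git-status": 46, "git-diff": 20,
    "docker": 38, "tree": 4, "cat": 0, "grep": 0, "ruff": -1,
}


def find_shortfalls(measured, claimed, tolerance_pts=10.0):
    """Return categories whose measured savings trail the claim by more
    than `tolerance_pts` percentage points, largest shortfall first.

    A category missing from `measured` is treated as 0% savings.
    """
    gaps = {
        name: claim - measured.get(name, 0.0)
        for name, claim in claimed.items()
    }
    failing = [name for name, gap in gaps.items() if gap > tolerance_pts]
    return sorted(failing, key=lambda name: gaps[name], reverse=True)


shortfalls = find_shortfalls(MEASURED, CLAIMED)
print(shortfalls)
# ['ruff', 'grep', 'tree', 'cat', 'git-diff', 'docker', 'git-status']
```

In a CI job, a non-empty list could fail the build or open a docs issue, so that the README table and the measured numbers cannot drift apart silently.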

RTK does real work for the commands it supports. The headline just outpaced the feature coverage — which is understandable at this growth pace. We'd rather help fix it than just point it out.
