Empirical benchmark: 5 repos, 2,100 measurements — how do actual savings compare to claims?
#839 opened on March 26, 2026
Related: #590, #538, #545, #827
Context
RTK is a genuinely useful tool — ls, docker ps, and verbose git output compress well, and the Rust implementation is fast. Impressive work for a small team shipping at this pace.
This issue isn't about whether RTK has value (it does), but about closing the gap between what the README promises and what users actually experience. We ran a benchmark to quantify that gap — and we'd like to help close it.
Setup
RTK v0.33.1 was tested across 5 repos (83 to 74,785 files; Shell/Bash/Python), 9 categories, 10 iterations, and 3 independent runs (700 operations per run, 2,100 measurements in total). Results are deterministic: 699/700 ops showed <10-byte variance per run, and cross-run results are byte-identical when repo state is unchanged.
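For anyone reproducing the numbers, the per-operation savings figure can be sketched as a byte comparison between a command and its `rtk`-prefixed form. This is a minimal illustration, not the actual `bench.sh`: the helper names `savings_pct` and `measure` are hypothetical, and it assumes savings are computed on stdout byte counts.

```python
import subprocess


def savings_pct(raw_bytes: int, rtk_bytes: int) -> float:
    """Percent of output bytes removed; negative means the output grew."""
    if raw_bytes == 0:
        return 0.0  # nothing to compress
    return 100.0 * (raw_bytes - rtk_bytes) / raw_bytes


def measure(cmd: list[str], cwd: str = ".") -> float:
    """Run `cmd` and `rtk cmd` in `cwd` and compare stdout sizes (assumed metric)."""
    raw = subprocess.run(cmd, cwd=cwd, capture_output=True).stdout
    wrapped = subprocess.run(["rtk", *cmd], cwd=cwd, capture_output=True).stdout
    return savings_pct(len(raw), len(wrapped))


# Pass-through case from the pytest measurement: 185,962 bytes in, 185,962 bytes out.
print(savings_pct(185_962, 185_962))  # 0.0
```

Calling `measure(["git", "status"], cwd="path/to/repo")` would give one data point; repeating it per repo and per iteration yields the measurement counts listed above.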
Results
| Category | N | Actual savings | Claimed savings | Notes |
|---|---|---|---|---|
| ls | 100 | 72% | 80% | Matches well, especially ls -laR on larger repos |
| git-log | 150 | 98% | 80% | Truncation rather than compression (544K→1.8K) |
| git-status | 100 | 46% | 80% | Verbose format compresses; -s/--porcelain pass through |
| git-diff | 100 | 20% | 75% | Full diff compresses somewhat; --stat passes through |
| docker | 20 | 38% | 80% | docker ps = 75%; docker ps -a = 0% |
| tree | 50 | 4% | 80% | Minimal compression |
| cat | 50 | 0% | 70% | Passes through unchanged |
| grep | 100 | -0% | 80% | Slightly larger output (RTK overhead) |
| ruff | 30 | -1% | 80% | Slightly larger output |
Also tested separately: pytest --co -q = 0% (185,962 B unchanged; claimed 90% reduction).
The gap in the aggregate claim
The README's savings table includes several categories that show no compression in practice. These account for over half the claimed total:
| Category | Claimed saved/session | Measured |
|---|---|---|
| cat/read (20x) | 28,000 | 0% |
| grep/rg (8x) | 12,800 | -0% |
| pytest (4x) | 7,200 | 0% |
| ruff (3x) | 2,400 | slightly negative |
| Subtotal | 50,400 (54% of total claim) | ~0 |
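The subtotal can be checked with a few lines of arithmetic. The sketch below only re-adds the claimed figures from the table; the implied README total is an approximation derived from the rounded 54% share, not a number taken from the README itself.

```python
# Claimed saved/session figures, copied from the table above.
claimed_saved = {
    "cat/read": 28_000,
    "grep/rg": 12_800,
    "pytest": 7_200,
    "ruff": 2_400,
}

subtotal = sum(claimed_saved.values())
# If the subtotal is 54% of the README total, the total it implies is roughly:
implied_total = subtotal / 0.54

print(subtotal)              # 50400
print(round(implied_total))  # 93333
```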
The overall 72% grand total in our benchmark is real, but it's driven almost entirely by git log truncation and ls -laR on a large repo — not broad compression across all categories.
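One way to see how much the 72% depends on a few large outputs is to average the per-category percentages without byte weighting. The sketch below assumes the grand total is computed over total bytes (inferred from the statement above, not confirmed against the script) and uses the rounded percentages from the results table.

```python
# (category, N, measured savings %) copied from the results table.
results = [
    ("ls", 100, 72),
    ("git-log", 150, 98),
    ("git-status", 100, 46),
    ("git-diff", 100, 20),
    ("docker", 20, 38),
    ("tree", 50, 4),
    ("cat", 50, 0),
    ("grep", 100, 0),  # reported as -0%
    ("ruff", 30, -1),
]

# Plain mean over the 9 categories.
unweighted = sum(pct for _, _, pct in results) / len(results)

# Mean weighted by the number of measurements per category.
total_n = sum(n for _, n, _ in results)
n_weighted = sum(n * pct for _, n, pct in results) / total_n

print(total_n)               # 700
print(round(unweighted, 1))  # 30.8
print(round(n_weighted, 1))  # 42.0
```

Both averages land well below 72%, which is consistent with the grand total being dominated by the largest outputs.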
Environment & Repos
RTK 0.33.1, Linux 6.8.0-1044-azure x86_64, Bash 5.2.37
Direct CLI (command vs rtk command) — no CC hook involved
| Repo | Files | Size | Language |
|---|---|---|---|
| dotfiles | 83 | 568K | Shell |
| gha-github-mirror-action | 194 | 1.4M | Bash/YAML |
| ralph-template | 1,732 | 12M | Bash/Python |
| so101-biolab-automation | 8,371 | 365M | Python |
| Agents-eval | 74,785 | 8.2G | Python |
Per-repo detail tables and raw output files available on request. The benchmark script (bench.sh) can be shared for independent reproduction.
How we can help
We'd be happy to contribute to closing the gap between claims and reality — either by improving the docs or the tool itself:
- README accuracy: We can PR an updated savings table with measured ranges per command, replacing the single-point estimates. Honest numbers build more trust than optimistic ones.
- Benchmark script as CI: Our `bench.sh` could be adapted into a regression test that runs on tagged releases, keeping the README table in sync with actual performance.
- Compression for passthrough commands: `cat`, `grep`, and `tree` currently pass through unchanged. If there's interest, we can explore compression strategies for these (e.g., grep result deduplication, tree structure compaction, cat truncation with line-count summaries).
- `ruff`/`pytest` rewrite rules: These currently add overhead or pass through. Happy to help design rewrite rules; we have deep experience with Python tooling output formats.
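As a rough idea of what the CI regression check could look like, the sketch below flags categories whose measured savings fall short of the README claim. It is not part of `bench.sh`; the function name `find_drift` and the 10-point tolerance are assumptions chosen for illustration, and the sample values are taken from the results table.

```python
# Claimed vs. measured savings in percent, for a few categories from the results table.
CLAIMED = {"ls": 80, "git-status": 80, "git-diff": 75, "cat": 70}
MEASURED = {"ls": 72, "git-status": 46, "git-diff": 20, "cat": 0}


def find_drift(claimed, measured, tolerance=10):
    """Return categories whose measured savings fall short of the claim
    by more than `tolerance` percentage points (tolerance is an assumed default)."""
    return sorted(
        name
        for name, claim in claimed.items()
        if claim - measured.get(name, 0) > tolerance
    )


drifted = find_drift(CLAIMED, MEASURED)
print(drifted)  # ['cat', 'git-diff', 'git-status']
```

A release job could exit non-zero whenever `drifted` is non-empty, prompting either a README update or a fix to the affected filter.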
RTK does real work for the commands it supports. The headline just outpaced the feature coverage — which is understandable at this growth pace. We'd rather help fix it than just point it out.