NF01: rtk-rewrite.sh hook adds ~56ms overhead on Apple Silicon (vs documented <5ms) · rtk-ai/rtk#1375

(2 commentaires) (0 réactions) (0 assignés)Rust (2 914 forks)batch import

area:clibugeffort-mediumenhancementhelp wantedplatform:macospriority:medium

Métriques du dépôt

Stars: (48 085 stars)
Métriques de merge PR: (Merge moyen 11j 1h) (45 PRs mergées en 30 j)

Description

Summary

On Apple Silicon (M-series), rtk-rewrite.sh hook invocations measure ~56 ms average / ~84 ms p95 per call in production use, versus the documented "<5 ms overhead" target. A 100-invocation benchmark collected during a downstream project's enforcement phase identifies the root cause as a subprocess fork of rtk for command-prefix classification inside the hook.

The latency is imperceptible in interactive use (dwarfed by typical bash wall-time) but violates the documented NF01 budget and adds up across high-frequency sessions (~5,000 hook firings / day → ~4.7 min/day of classification overhead).

Environment

rtk version: 0.35.0
Platform: Apple Silicon (M-series), macOS 14+
Invocation: PreToolUse:Bash hook (Claude Code harness) piping stdin via rtk-rewrite.sh
Shell: zsh / bash — fresh subshell per invocation

Measurement

100-invocation benchmark:

avg:  56.14 ms
p50:  54.9  ms
p95:  ~84   ms (estimated)
p99:  ~120  ms

Harness forks a fresh bash per invocation. In-session runtime (no bash re-fork) is ~30–40 ms lower because shell startup amortizes, but the rtk-side cost remains ~30–40 ms per call.

Suspected root cause

rtk-rewrite.sh forks rtk as a subprocess to classify the incoming command prefix. On Apple Silicon, cold-start of the rtk binary accounts for the bulk of the 56 ms — process setup + dylib resolution + arg parse before any classification logic runs.

Proposed upstream mitigations (any one is sufficient)

In-shell prefix parser — ship a small bash/sh function covering the top ~10 prefixes (git, grep, find, ls, cat, head, tail, wc, curl, docker) that short-circuits without forking rtk. Falls through to rtk subprocess for the long tail.
Persistent classifier daemon — optional rtk-daemon binding to a unix socket; rtk-rewrite.sh uses nc -U for classification, falls back to subprocess fork if daemon is absent. Preserves zero-install default.
Split classifier binary — ship rtk-classify as a focused, smaller binary without the full CLI surface.

Why this matters at scale

At ~5,000 hook firings / day, 56 ms/call × 5,000 = ~4.7 minutes/day of latency accrued purely from classification subprocess forks. Perceptible only on tight loops (shell-pipe-heavy scripts that trigger hooks repeatedly), but measurable.

Out of scope for this issue

Hook correctness and rewrite accuracy — functioning correctly, no regression.
rtk standalone invocation latency (rtk git status, etc.) — not measured here, may share root cause.

Downstream disposition

We've accepted the deviation internally by revising our target from p95 < 25 ms → p95 < 100 ms, and deferred a local prefix-parser optimization pending upstream response — we don't want to ship a divergent rtk-rewrite.sh that would later conflict with the upstream fix.

Happy to contribute a PR on any of the three mitigation paths if a maintainer signals preference.

Guide contributeur

Direction de recherche: Implémentez une fonction shell dans rtk rewrite.sh qui classe les préfixes courants (git, grep, find, ls, cat, head, tail, wc, curl, docker) sans bifurquer le binaire rtk. Mesurez la réduction de latence sur Apple Silicon et confirmez que la surcharge par invocation revient à <5ms.
Stack technique: rust
Domaine: clidevops
Type d'issue: Performance
Difficulté: 3
Temps estimé: Une demi journée
Statut d'activité: Active
Clarté: Claire
Prérequis: ShellRust
Accessibilité débutant: 70