feat(rewrite): canonical command digests — equivalence-aware hashing for dedup and caching · rtk-ai/rtk#1054

(1 comment) (0 reactions) (0 assignees) Rust (2,914 forks) batch import

area:cli, effort-large, enhancement, help wanted, priority:medium

Repository metrics

Stars: 48,085
PR merge metrics: average merge time 11d 1h; 45 PRs merged in the last 30 days

Description

Problem

rtk rewrite normalizes command names but doesn't normalize flags or produce structured output. This means semantically equivalent commands get different representations:

$ rtk rewrite "grep -rn pattern src/"
rtk grep -rn pattern src/

$ rtk rewrite "rg -n pattern src/"
rtk grep -n pattern src/
# Different strings despite being semantically identical

Same for:

git log --oneline vs git log --pretty=oneline --abbrev-commit
cat foo.txt vs head foo.txt vs tail foo.txt (all just "read file")

Proposed Feature

Add rtk canonicalize (or extend rtk rewrite --format json) that outputs a structured canonical form with a deterministic digest. Equivalent commands produce the same digest.

$ rtk canonicalize "grep -rn pattern src/"
{
  "tool": "grep",
  "flags": {"line-number": ""},
  "args": ["pattern", "src/"],
  "digest": "deaa1527537114cf"
}

$ rtk canonicalize "rg -n pattern src/"
{
  "tool": "grep",
  "flags": {"line-number": ""},
  "args": ["pattern", "src/"],
  "digest": "deaa1527537114cf"   # ← same digest!
}
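To make the "same canonical form, same digest" idea concrete, here is a minimal Rust sketch of what the structure and its digest could look like. Everything in it is an assumption made for illustration: the `Canonical` struct, its `encode` layout, and the `fnv1a64` helper are hypothetical names, and FNV-1a is used only as a dependency-free stand-in for the SHA-256 digest mentioned in this issue, so the values it produces will not match the digests shown above.

```rust
use std::collections::BTreeMap;

/// Hypothetical canonical form mirroring the JSON fields in the example above.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Canonical {
    pub tool: String,
    /// BTreeMap iterates in key order, so insertion order never affects the digest.
    pub flags: BTreeMap<String, String>,
    pub args: Vec<String>,
}

/// FNV-1a (64-bit). Stand-in for SHA-256 so the sketch needs no external crate.
fn fnv1a64(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325; // FNV offset basis
    for &b in bytes {
        hash ^= u64::from(b);
        hash = hash.wrapping_mul(0x100000001b3); // FNV prime
    }
    hash
}

impl Canonical {
    /// Serialize the fields in a fixed order with explicit separators,
    /// so that equal canonical forms always yield equal byte strings.
    fn encode(&self) -> String {
        let mut out = String::new();
        out.push_str(&self.tool);
        out.push('\0');
        for (key, value) in &self.flags {
            out.push_str(key);
            out.push('=');
            out.push_str(value);
            out.push('\0');
        }
        out.push('\x01'); // boundary between flags and positional args
        for arg in &self.args {
            out.push_str(arg);
            out.push('\0');
        }
        out
    }

    /// 16 hex characters, the same width as the digests shown in this issue.
    pub fn digest(&self) -> String {
        format!("{:016x}", fnv1a64(self.encode().as_bytes()))
    }
}
```

Two `Canonical` values built with the same tool, flags, and args return the same `digest()` string regardless of the order in which the flags were inserted. A real port would presumably swap `fnv1a64` for a truncated SHA-256.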

Normalization rules

Tool aliases: cat/head/tail → read, rg/ag → grep, fd → find
Flag canonicalization: short → long form (-n → --line-number), sorted by key
Combined flag expansion: -rn → -r + -n
Tool-specific: grep -r stripped (canonical grep is recursive), git log --oneline → --format=oneline
Sensitive masking: API keys and long tokens replaced with [MASKED]
Chain/pipe decomposition: && and | parsed into separate canonical segments
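A rough Rust sketch of the first four rules (tool aliases, short-to-long flags, combined flag expansion, and the grep -r rule) might look like the following. It is not the Go canon package's logic: the function names and the tiny flag table are hypothetical, tokenization is plain whitespace splitting with no quoting support, and masking and chain/pipe decomposition are left out.

```rust
use std::collections::BTreeMap;

/// Tool alias map, using the aliases listed in the rules above.
fn canonical_tool(name: &str) -> &str {
    match name {
        "cat" | "head" | "tail" => "read",
        "rg" | "ag" => "grep",
        "fd" => "find",
        other => other,
    }
}

/// Short-to-long flag table. Hypothetical and deliberately tiny.
fn long_flag(tool: &str, short: char) -> Option<&'static str> {
    match (tool, short) {
        ("grep", 'n') => Some("line-number"),
        ("grep", 'r') => Some("recursive"),
        ("grep", 'i') => Some("ignore-case"),
        _ => None,
    }
}

/// Returns (tool, flags, args), or None for an empty command.
/// Flags live in a BTreeMap, so they come out sorted by key.
pub fn normalize(command: &str) -> Option<(String, BTreeMap<String, String>, Vec<String>)> {
    let mut tokens = command.split_whitespace();
    let tool = canonical_tool(tokens.next()?).to_string();
    let mut flags = BTreeMap::new();
    let mut args = Vec::new();

    for tok in tokens {
        if let Some(long) = tok.strip_prefix("--") {
            // Long flag, optionally carrying a value: --key=value
            let (key, value) = long.split_once('=').unwrap_or((long, ""));
            flags.insert(key.to_string(), value.to_string());
        } else if let Some(shorts) = tok.strip_prefix('-') {
            // Combined flag expansion: -rn is treated as -r and -n.
            for c in shorts.chars() {
                let key = long_flag(&tool, c)
                    .map(str::to_string)
                    .unwrap_or_else(|| c.to_string());
                flags.insert(key, String::new());
            }
        } else {
            args.push(tok.to_string());
        }
    }

    // Tool-specific rule: canonical grep is recursive, so the flag is redundant.
    if tool == "grep" {
        flags.remove("recursive");
    }
    Some((tool, flags, args))
}
```

Under these assumptions, `normalize("grep -rn pattern src/")` and `normalize("rg -n pattern src/")` return the same tuple: tool `grep`, a single `line-number` flag, and args `pattern` and `src/`. A real implementation would need a proper shell tokenizer and per-tool flag tables, since the same short flag can mean different things in rg, ag, and grep.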

Use Cases

Caching: Same digest = same command = cache hit. Agents re-reading the same file across turns get instant results.
Telemetry dedup: Group execution events by canonical digest instead of raw strings. "How often do agents run grep?" works across rg/ag/grep variants.
Loop detection: Two semantically identical commands with different syntax get the same fingerprint, catching loops that raw string comparison misses.
Compression routing: Knowing the canonical tool lets you pick the right RTK filter even for aliased commands.
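As an illustration of the loop-detection use case, the sketch below counts how often each canonical digest has been seen and flags a digest once it reaches a threshold. `LoopDetector`, `observe`, and the threshold policy are hypothetical; the issue does not specify how a consumer would define a loop.

```rust
use std::collections::HashMap;

/// Hypothetical loop detector keyed on canonical digests rather than raw command strings.
pub struct LoopDetector {
    counts: HashMap<String, u32>,
    threshold: u32,
}

impl LoopDetector {
    /// `threshold` is the number of sightings at which a digest counts as a loop.
    pub fn new(threshold: u32) -> Self {
        LoopDetector {
            counts: HashMap::new(),
            threshold,
        }
    }

    /// Record one execution of the command with this digest.
    /// Returns true once the digest has been observed at least `threshold` times.
    pub fn observe(&mut self, digest: &str) -> bool {
        let count = self.counts.entry(digest.to_string()).or_insert(0);
        *count += 1;
        *count >= self.threshold
    }
}
```

With a threshold of 3, feeding the digest for `grep -rn X .` and then twice the digest for `rg -n X .` would trip the detector on the third call, because both spellings map to the same key. The same map shape would also serve as a cache index or a telemetry grouping key.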

Proof of Concept

We built this in Go as the canon package in the Chitin kernel. Working equivalence tests:

cat foo.txt  ≡ head foo.txt  ≡ tail foo.txt   → digest 95d0e907bc6c155e
grep -rn X . ≡ rg -n X .                      → digest deaa1527537114cf
git log --oneline ≡ git log --pretty=oneline   → digest bd750a13fbce75f7

14 tests covering equivalence classes, chain/pipe parsing, env var prefixes, sensitive masking, and JSON round-tripping.

Happy to port to Rust if there's interest. The core is ~400 lines: tokenizer + flag expander + tool alias map + normalizer + SHA256 digest.

Relation to Existing Issues

Extends #154 (migrate rewrite to Rust) with structured output
Addresses part of #820 (rewrite normalization) at the flag level
Complements #569 (distill/compress) since canonical tool knowledge enables schema-aware compression

Contributor guide

Research direction: Study the existing Go implementation of the canon package in the Chitin kernel (https://github.com/chitinhq/chitin/tree/main/canon). Port the core logic (tokenizer, flag expander, tool alias map, normalizer, SHA256 digest) to Rust, making sure the equivalence tests pass. Integrate it into rtk as `rtk canonicalize`, or extend `rtk rewrite --format json`.
Tech stack: Rust
Domain: cli, developer experience, performance
Issue type: Feature
Difficulty: 3
Estimated time: 1-2 days
Activity status: Active
Clarity: Clear
Prerequisites: Rust, Git
Beginner accessibility: 70