[flyte2] Executor: add custom domain metrics (TaskAction reconcile, cache, GC, plugins)
Opened on May 29, 2026
Description
Part of #7445. Best paired with #7453 (which makes promutils-scoped metrics actually scrapeable on the executor endpoint).
Summary
The executor emits no custom domain metrics. It gets controller-runtime built-ins for free (reconcile counts/errors/duration, workqueue depth/latency), but there's no visibility into the executor's own behavior: TaskAction reconcile outcomes, cache effectiveness, garbage-collection activity, or plugin execution latency. Add these using the metrics Scope that's already plumbed through.
Background
The plumbing exists but is unused: executor/pkg/plugin/setup_context.go already exposes MetricsScope() promutils.Scope, and executor/setup.go constructs promutils.NewScope("executor"). But a grep for metric instruments (MustNewCounter/MustNewGauge/MustNewStopWatch, .Inc()/.Observe()/.Set()/.Start()) across the executor's own code returns nothing — the controllers don't emit any.
What to do
Add metrics (under dedicated sub-scopes, e.g. scope.NewSubScope("taskaction"), "cache", "gc") to the core executor logic:
- TaskAction controller (executor/pkg/controller/taskaction_controller.go): reconcile outcomes labeled by result/phase (success/error/requeue), and per-reconcile latency. (Note: controller-runtime already gives generic reconcile totals; add only what the generic metrics don't cover, e.g. terminal phase counts.)
- TaskAction cache (executor/pkg/controller/taskaction_cache.go): cache hit / miss / eviction counters, and current size gauge.
- Garbage collector (executor/pkg/controller/garbage_collector.go): objects deleted (counter), deletion errors (counter), and GC sweep duration.
- Plugin execution (via the registry / setupContext.MetricsScope()): per-plugin execution latency and error counts, if not already covered.
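As a rough illustration of the cache item above, here is a minimal sketch of hit / miss / eviction counters and a size gauge on a toy cache. Everything in it is an assumption made for illustration: the counter and gauge types are stdlib stand-ins so the snippet runs without dependencies, and instrumentedCache is a hypothetical bounded cache, not the executor's TaskAction cache. In the executor the instruments would presumably come from a "cache" sub-scope via the promutils helpers instead.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// counter and gauge are stdlib stand-ins for metric instruments (assumption:
// the real code would use instruments created from a promutils sub-scope).
type counter struct{ v atomic.Int64 }

func (c *counter) Inc()         { c.v.Add(1) }
func (c *counter) Value() int64 { return c.v.Load() }

type gauge struct{ v atomic.Int64 }

func (g *gauge) Set(n int64)  { g.v.Store(n) }
func (g *gauge) Value() int64 { return g.v.Load() }

// cacheMetrics groups the instruments so they are constructed once and
// handed to the cache, rather than re-created on every call.
type cacheMetrics struct {
	hits, misses, evictions counter
	size                    gauge
}

// instrumentedCache is a hypothetical toy cache that evicts the oldest
// inserted key when full. It is not safe for concurrent use.
type instrumentedCache struct {
	capacity int
	order    []string // insertion order, oldest first
	items    map[string]string
	m        *cacheMetrics
}

func newInstrumentedCache(capacity int, m *cacheMetrics) *instrumentedCache {
	return &instrumentedCache{capacity: capacity, items: map[string]string{}, m: m}
}

// Get records exactly one hit or one miss per lookup.
func (c *instrumentedCache) Get(key string) (string, bool) {
	v, ok := c.items[key]
	if ok {
		c.m.hits.Inc()
	} else {
		c.m.misses.Inc()
	}
	return v, ok
}

// Put inserts or updates a key, counts an eviction if one was needed,
// and refreshes the size gauge.
func (c *instrumentedCache) Put(key, value string) {
	if _, exists := c.items[key]; !exists {
		if len(c.items) >= c.capacity && len(c.order) > 0 {
			oldest := c.order[0]
			c.order = c.order[1:]
			delete(c.items, oldest)
			c.m.evictions.Inc()
		}
		c.order = append(c.order, key)
	}
	c.items[key] = value
	c.m.size.Set(int64(len(c.items)))
}

func main() {
	m := &cacheMetrics{}
	c := newInstrumentedCache(2, m)
	c.Put("a", "1")
	c.Put("b", "2")
	c.Get("a")       // hit
	c.Get("missing") // miss
	c.Put("c", "3")  // evicts "a"
	fmt.Println(m.hits.Value(), m.misses.Value(), m.evictions.Value(), m.size.Value()) // prints: 1 1 1 2
}
```

Passing a single cacheMetrics value into the constructor is one way to keep instrument creation in one place, which is what the "created once" criterion below asks for.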
Acceptance criteria
- The above metrics appear on the executor metrics endpoint (depends on #7453 being resolved so default-registry metrics are exposed; until then they can be verified via the default registry in a unit test).
- Metrics use dedicated sub-scopes and are created once (no duplicate-registration panics).
- Unit tests assert the relevant counters/gauges update (e.g. cache hit increments on a hit; GC deletion counter increments on delete).
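The unit-test criterion could look roughly like the following sketch for the GC case: run one sweep against a fake delete function and check that the deletion and error counts moved as expected. The names gcMetrics, deleteFunc, sweep, and checkSweepMetrics are all hypothetical, and plain struct fields stand in for real counters and a stopwatch. An actual test would live in a _test.go file and read values from the registered instruments.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// gcMetrics is a plain-struct stand-in for the GC instruments (assumption:
// real code would use two counters and a stopwatch from a "gc" sub-scope).
type gcMetrics struct {
	deleted      int
	deleteErrors int
	sweeps       int
	lastSweep    time.Duration
}

// deleteFunc abstracts the delete call so a test can inject failures.
type deleteFunc func(name string) error

// sweep attempts to delete every named object, counting successes and
// failures separately, and records how long the whole sweep took.
func sweep(names []string, del deleteFunc, m *gcMetrics) {
	start := time.Now()
	for _, name := range names {
		if err := del(name); err != nil {
			m.deleteErrors++
			continue
		}
		m.deleted++
	}
	m.sweeps++
	m.lastSweep = time.Since(start)
}

// checkSweepMetrics plays the role of a unit test body: it returns an error
// describing the first metric that did not update as expected.
func checkSweepMetrics() error {
	m := &gcMetrics{}
	errStuck := errors.New("delete failed")
	del := func(name string) error {
		if name == "stuck" {
			return errStuck
		}
		return nil
	}

	sweep([]string{"a", "stuck", "b"}, del, m)

	if m.deleted != 2 {
		return fmt.Errorf("deleted = %d, want 2", m.deleted)
	}
	if m.deleteErrors != 1 {
		return fmt.Errorf("deleteErrors = %d, want 1", m.deleteErrors)
	}
	if m.sweeps != 1 {
		return fmt.Errorf("sweeps = %d, want 1", m.sweeps)
	}
	return nil
}

func main() {
	if err := checkSweepMetrics(); err != nil {
		fmt.Println("FAIL:", err)
		return
	}
	fmt.Println("ok")
}
```

Injecting the delete function keeps the test independent of a Kubernetes client, so the metric assertions stay fast and deterministic.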
Pointers
- executor/pkg/plugin/setup_context.go:44: MetricsScope() accessor.
- executor/pkg/controller/taskaction_controller.go, taskaction_cache.go, garbage_collector.go: instrumentation targets.
- flytestdlib/promutils/scope.go: Scope helpers (MustNewCounter, MustNewGauge, MustNewStopWatch, NewSubScope).
Notes for contributors
- Keep label cardinality bounded — label by phase/result/plugin-type, never by action/run IDs.
- This can be split among contributors by component (controller vs cache vs gc vs plugins) — comment on which piece you're taking.
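One way to enforce the cardinality note above is to map every label value onto a closed set before recording it. The sketch below is illustrative only: the allowed values are the success/error/requeue outcomes named in this issue, while the "other" fallback bucket, the resultLabel helper, and the map-based outcomeCounts are assumptions rather than existing executor code.

```go
package main

import "fmt"

// allowedResults is the closed set of label values, taken from the
// success/error/requeue outcomes listed in this issue.
var allowedResults = map[string]struct{}{
	"success": {},
	"error":   {},
	"requeue": {},
}

// resultLabel maps an arbitrary outcome string onto the closed set.
// Anything unexpected collapses into "other" (an assumed fallback name),
// so the number of distinct series stays fixed.
func resultLabel(outcome string) string {
	if _, ok := allowedResults[outcome]; ok {
		return outcome
	}
	return "other"
}

// outcomeCounts is a map-based stand-in for a labeled counter.
type outcomeCounts map[string]int

// record increments the count for the bounded label of the given outcome.
func (o outcomeCounts) record(outcome string) { o[resultLabel(outcome)]++ }

func main() {
	counts := outcomeCounts{}
	// "run-7f3a" imitates an ID leaking into a label; it lands in "other".
	for _, outcome := range []string{"success", "requeue", "run-7f3a", "success"} {
		counts.record(outcome)
	}
	fmt.Println(counts["success"], counts["requeue"], counts["other"], len(counts)) // prints: 2 1 1 3
}
```

With this shape, the worst case is four series per metric no matter what strings reach the call site.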