flyteorg/flyte

[flyte2] Executor: add custom domain metrics (TaskAction reconcile, cache, GC, plugins)

Open

Opened on 29 May 2026

Labels: flyte2, good first issue

Description

Part of #7445. Best paired with #7453 (which makes promutils-scoped metrics actually scrapeable on the executor endpoint).

Summary

The executor emits no custom domain metrics. It gets controller-runtime built-ins for free (reconcile counts/errors/duration, workqueue depth/latency), but there's no visibility into the executor's own behavior: TaskAction reconcile outcomes, cache effectiveness, garbage-collection activity, or plugin execution latency. Add these using the metrics Scope that's already plumbed through.

Background

The plumbing exists but is unused: executor/pkg/plugin/setup_context.go already exposes MetricsScope() promutils.Scope, and executor/setup.go constructs promutils.NewScope("executor"). But a grep for metric instruments (MustNewCounter/MustNewGauge/MustNewStopWatch, .Inc()/.Observe()/.Set()/.Start()) across the executor's own code returns nothing — the controllers don't emit any.

What to do

Add metrics (under dedicated sub-scopes, e.g. scope.NewSubScope("taskaction"), "cache", "gc") to the core executor logic:

  • TaskAction controller (executor/pkg/controller/taskaction_controller.go): reconcile outcomes labeled by result/phase (success/error/requeue), and per-reconcile latency. (Note: controller-runtime already gives generic reconcile totals — add only what the generic metrics don't cover, e.g. terminal phase counts.)
  • TaskAction cache (executor/pkg/controller/taskaction_cache.go): cache hit / miss / eviction counters, and current size gauge.
  • Garbage collector (executor/pkg/controller/garbage_collector.go): objects deleted (counter), deletion errors (counter), and GC sweep duration.
  • Plugin execution (via the registry / setupContext.MetricsScope()): per-plugin execution latency and error counts, if not already covered.
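
As a rough illustration of the cache bullet above, here is a minimal, dependency-free sketch. It assumes a toy FIFO cache and uses a hypothetical `counter` type built on `sync/atomic` as a stand-in for the instruments that `MustNewCounter` and `MustNewGauge` would create on a "cache" sub-scope. None of the type or function names below come from the executor code.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// counter is a hypothetical stdlib stand-in for a Prometheus counter.
type counter struct{ v atomic.Int64 }

func (c *counter) Inc()         { c.v.Add(1) }
func (c *counter) Value() int64 { return c.v.Load() }

// cacheMetrics groups the instruments requested for the cache.
// In the executor these would presumably hang off a "cache" sub-scope.
type cacheMetrics struct {
	hits, misses, evictions counter
	size                    atomic.Int64 // gauge: current number of entries
}

// instrumentedCache is a toy FIFO cache, not the real TaskAction cache.
// It is not goroutine-safe; only the metric updates are atomic.
type instrumentedCache struct {
	capacity int
	order    []string // keys, oldest first
	items    map[string]string
	metrics  *cacheMetrics
}

func newInstrumentedCache(capacity int) *instrumentedCache {
	if capacity < 1 {
		capacity = 1
	}
	return &instrumentedCache{
		capacity: capacity,
		items:    make(map[string]string),
		metrics:  &cacheMetrics{}, // created once, alongside the cache
	}
}

// Get records a hit or a miss on every lookup.
func (c *instrumentedCache) Get(key string) (string, bool) {
	v, ok := c.items[key]
	if ok {
		c.metrics.hits.Inc()
	} else {
		c.metrics.misses.Inc()
	}
	return v, ok
}

// Put evicts the oldest entry when full and keeps the size gauge current.
func (c *instrumentedCache) Put(key, value string) {
	if _, exists := c.items[key]; !exists {
		if len(c.items) >= c.capacity {
			oldest := c.order[0]
			c.order = c.order[1:]
			delete(c.items, oldest)
			c.metrics.evictions.Inc()
		}
		c.order = append(c.order, key)
	}
	c.items[key] = value
	c.metrics.size.Store(int64(len(c.items)))
}

func main() {
	c := newInstrumentedCache(2)
	c.Put("a", "1")
	c.Put("b", "2")
	c.Get("a")      // hit
	c.Get("z")      // miss
	c.Put("c", "3") // evicts "a"
	m := c.metrics
	fmt.Printf("hits=%d misses=%d evictions=%d size=%d\n",
		m.hits.Value(), m.misses.Value(), m.evictions.Value(), m.size.Load())
	// prints: hits=1 misses=1 evictions=1 size=2
}
```

The point of the sketch is placement: the counters are updated at the exact branch where the outcome is decided, and the gauge is set after every mutation rather than computed on scrape.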

Acceptance criteria

  • The above metrics appear on the executor metrics endpoint (depends on #7453 being resolved so default-registry metrics are exposed; until then they can be verified via the default registry in a unit test).
  • Metrics use dedicated sub-scopes and are created once (no duplicate-registration panics).
  • Unit tests assert the relevant counters/gauges update (e.g. cache hit increments on a hit; GC deletion counter increments on delete).

Pointers

  • executor/pkg/plugin/setup_context.go:44 (MetricsScope() accessor).
  • executor/pkg/controller/taskaction_controller.go, taskaction_cache.go, garbage_collector.go (instrumentation targets).
  • flytestdlib/promutils/scope.go (Scope helpers: MustNewCounter, MustNewGauge, MustNewStopWatch, NewSubScope).

Notes for contributors

  • Keep label cardinality bounded — label by phase/result/plugin-type, never by action/run IDs.
  • This can be split among contributors by component (controller vs cache vs gc vs plugins) — comment on which piece you're taking.
