flyteorg/flyte

[flyte2] Executor: add custom domain metrics (TaskAction reconcile, cache, GC, plugins)

Open

#7,455 opened on May 29, 2026

Labels: flyte2, good first issue

Description

Part of #7445. Best paired with #7453 (which makes promutils-scoped metrics actually scrapeable on the executor endpoint).

Summary

The executor emits no custom domain metrics. It gets controller-runtime built-ins for free (reconcile counts/errors/duration, workqueue depth/latency), but there's no visibility into the executor's own behavior: TaskAction reconcile outcomes, cache effectiveness, garbage-collection activity, or plugin execution latency. Add these using the metrics Scope that's already plumbed through.

Background

The plumbing exists but is unused: executor/pkg/plugin/setup_context.go already exposes MetricsScope() promutils.Scope, and executor/setup.go constructs promutils.NewScope("executor"). But a grep for metric instruments (MustNewCounter/MustNewGauge/MustNewStopWatch, .Inc()/.Observe()/.Set()/.Start()) across the executor's own code returns nothing — the controllers don't emit any.

What to do

Add metrics (under dedicated sub-scopes, e.g. scope.NewSubScope("taskaction"), "cache", "gc") to the core executor logic:

  • TaskAction controller (executor/pkg/controller/taskaction_controller.go): reconcile outcomes labeled by result/phase (success/error/requeue), and per-reconcile latency. (Note: controller-runtime already gives generic reconcile totals — add only what the generic metrics don't cover, e.g. terminal phase counts.)
  • TaskAction cache (executor/pkg/controller/taskaction_cache.go): cache hit / miss / eviction counters, and current size gauge.
  • Garbage collector (executor/pkg/controller/garbage_collector.go): objects deleted (counter), deletion errors (counter), and GC sweep duration.
  • Plugin execution (via the registry / setupContext.MetricsScope()): per-plugin execution latency and error counts, if not already covered.
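
As a rough illustration of the cache item above, the sketch below counts hits, misses, and evictions and tracks the current size. Everything in it is hypothetical: `Counter` and `Gauge` are standard-library stand-ins for the instruments a promutils Scope would create (for example via MustNewCounter and MustNewGauge on a "cache" sub-scope), and `actionCache` is not the real type in taskaction_cache.go. It is single-goroutine only; a real cache would need its own locking.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Counter and Gauge are hypothetical stand-ins for Prometheus instruments;
// they are NOT the promutils API.
type Counter struct{ n atomic.Int64 }

func (c *Counter) Inc()         { c.n.Add(1) }
func (c *Counter) Value() int64 { return c.n.Load() }

type Gauge struct{ v atomic.Int64 }

func (g *Gauge) Set(v int64)  { g.v.Store(v) }
func (g *Gauge) Value() int64 { return g.v.Load() }

// cacheMetrics groups the instruments so they are created once per cache.
type cacheMetrics struct {
	Hits, Misses, Evictions Counter
	Size                    Gauge
}

// actionCache is a hypothetical bounded cache that evicts its oldest entry.
type actionCache struct {
	max     int
	order   []string // keys in insertion order, oldest first
	entries map[string]string
	metrics *cacheMetrics
}

func newActionCache(max int) *actionCache {
	if max < 1 {
		max = 1
	}
	return &actionCache{max: max, entries: map[string]string{}, metrics: &cacheMetrics{}}
}

// Get records a hit or a miss on every lookup.
func (c *actionCache) Get(key string) (string, bool) {
	v, ok := c.entries[key]
	if ok {
		c.metrics.Hits.Inc()
	} else {
		c.metrics.Misses.Inc()
	}
	return v, ok
}

// Put inserts or updates a key, evicting the oldest entry when full,
// and refreshes the size gauge afterwards.
func (c *actionCache) Put(key, value string) {
	if _, exists := c.entries[key]; !exists {
		if len(c.entries) >= c.max {
			oldest := c.order[0]
			c.order = c.order[1:]
			delete(c.entries, oldest)
			c.metrics.Evictions.Inc()
		}
		c.order = append(c.order, key)
	}
	c.entries[key] = value
	c.metrics.Size.Set(int64(len(c.entries)))
}

func main() {
	c := newActionCache(2)
	c.Put("a", "1")
	c.Put("b", "2")
	c.Get("a")       // hit
	c.Get("missing") // miss
	c.Put("c", "3")  // evicts "a", the oldest entry
	m := c.metrics
	fmt.Println(m.Hits.Value(), m.Misses.Value(), m.Evictions.Value(), m.Size.Value())
}
```

Keeping the instruments in one struct next to the cache makes the unit-test assertions direct: perform an operation, then read the counter.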

Acceptance criteria

  • The above metrics appear on the executor metrics endpoint (depends on #7453 being resolved so default-registry metrics are exposed; until then they can be verified via the default registry in a unit test).
  • Metrics use dedicated sub-scopes and are created once (no duplicate-registration panics).
  • Unit tests assert the relevant counters/gauges update (e.g. cache hit increments on a hit; GC deletion counter increments on delete).

Pointers

  • executor/pkg/plugin/setup_context.go:44: MetricsScope() accessor.
  • executor/pkg/controller/taskaction_controller.go, taskaction_cache.go, garbage_collector.go: instrumentation targets.
  • flytestdlib/promutils/scope.go: Scope helpers (MustNewCounter, MustNewGauge, MustNewStopWatch, NewSubScope).

Notes for contributors

  • Keep label cardinality bounded — label by phase/result/plugin-type, never by action/run IDs.
  • This can be split among contributors by component (controller vs cache vs gc vs plugins) — comment on which piece you're taking.
