flyteorg/flyte
[flyte2] Instrument the runs service reconcilers (abort-reconciler) with Prometheus metrics
Open
#7449 opened on May 29, 2026
flyte2, good first issue
Description
Part of #7445. Depends on #7446 (the /metrics endpoint and Scope must exist first).
Summary
Add Prometheus metrics to the runs service background reconcilers (starting with the abort reconciler) to observe queue depth, processing throughput, retries, and failures.
Background
runs/service/abort_reconciler.go runs as a background worker (registered in runs/setup.go via sc.AddWorker("abort-reconciler", ...)). It has a worker pool, a bounded queue (QueueSize: 1000), and retry logic (MaxAttempts, InitialDelay, MaxDelay). None of this is currently observable via metrics.
What to do
- Thread the metrics Scope (from #7446) into service.NewAbortReconciler(...) (extend its config/constructor).
- Emit metrics such as:
  - current queue depth / pending items (gauge)
  - items processed (counter, labeled by success/failure)
  - retries / attempts (counter)
  - per-item processing latency (timer/histogram)
Acceptance criteria
- /metrics exposes abort-reconciler queue depth, processed count (success/failure), retry count, and processing latency.
- Metrics use a dedicated sub-scope, e.g. scope.NewSubScope("abort_reconciler"), created once.
- A unit test verifies that processing an item updates the relevant counters/gauges.
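The "created once" requirement can be illustrated with a small sketch. It assumes that registering the same metric name twice is an error, as it is with Prometheus registration; the registry type, newAbortReconciler, and the metric names here are hypothetical and are not the promutils API.

```go
package main

import "fmt"

// registry is a hypothetical stand-in for a metrics registry that rejects
// a second metric registered under an existing name.
type registry struct{ names map[string]bool }

func (r *registry) mustRegister(name string) {
	if r.names[name] {
		panic("duplicate metric: " + name)
	}
	r.names[name] = true
}

// abortReconciler holds only what this sketch needs.
type abortReconciler struct{ metricNames []string }

// newAbortReconciler registers every metric exactly once, under a single
// "abort_reconciler" prefix, at construction time rather than per item.
func newAbortReconciler(r *registry) *abortReconciler {
	const prefix = "abort_reconciler_"
	names := []string{"queue_depth", "processed_success", "processed_failure", "retries", "latency"}
	for i, n := range names {
		names[i] = prefix + n
		r.mustRegister(names[i])
	}
	return &abortReconciler{metricNames: names}
}

func main() {
	r := &registry{names: map[string]bool{}}
	rec := newAbortReconciler(r)
	// prints: 5 metrics registered
	fmt.Println(len(rec.metricNames), "metrics registered")
}
```

Building the metrics in the constructor means the worker loop only ever updates existing handles, and a unit test can construct the reconciler with a fresh registry per test case.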
Pointers
- runs/service/abort_reconciler.go: the reconciler implementation and its run loop.
- runs/setup.go:64-73: where NewAbortReconciler is constructed and registered as a worker.
- flytestdlib/promutils/scope.go: Scope helpers (MustNewGauge, MustNewCounter, MustNewStopWatch, NewSubScope).
Notes for contributors
- The gauge for queue depth should be updated as items are enqueued/dequeued (or sampled periodically).
- This is independent of #7447 and #7448; all three consume the same Scope from #7446.