[flyte2] Instrument the actions service (watcher metrics + dropped-updates counter)
Opened on May 29, 2026
Description
Part of #7445. Depends on #7446 (the `/metrics` endpoint + initialized `Scope` must exist first).
Summary
Instrument the actions service with Prometheus metrics: implement the existing dropped-updates counter TODO, and add throughput / latency / queue-depth metrics for the TaskAction watcher.
Background
The actions service is already partly wired for metrics — it just has nothing to plug into yet:
- `actions/setup.go:39` already passes `sc.Scope` into `NewActionsClient(...)`.
- `actions/k8s/client.go:91` already uses `scope.NewSubScope("actions_filter")` for the dedup bloom filter.
- `actions/k8s/client.go:65` has an explicit TODO: `// TODO: add a prometheus counter for dropped updates when metrics are wired up`.
Note on the metrics scope: When run via the unified manager (`manager/cmd/main.go:75`), `sc.Scope` is already initialized (`promutils.NewScope("flyte")`) before `actions.Setup` runs, so the bloom-filter sub-scope at `client.go:91` works and there is no panic. The dependency on #7446 exists because #7446 mounts the `/metrics` endpoint: without it, the metrics you add here are registered into the default registry but never exposed to a scrape. #7446 also initializes `sc.Scope` at the framework level, which makes the standalone `actions/cmd/main.go` binary safe as well. That path currently leaves `sc.Scope` nil, so the `scope.NewSubScope(...)` call at `client.go:90-91` would panic there, since `RecordFilterSize` defaults to `1 << 23`, which is greater than 0.
What to do
Using the Scope available on ActionsClient (passed in via NewActionsClient), add metrics under a dedicated sub-scope (e.g. scope.NewSubScope("watcher")):
- Dropped updates counter: implement the TODO at `actions/k8s/client.go:65`. Increment a counter whenever a watch update is dropped (e.g. buffer full / channel send would block).
- Watcher throughput: counter of TaskAction events processed, labeled by result (success/error).
- Processing latency: a timer/histogram around per-event handling in the watch worker loop.
- Queue/buffer depth: a gauge for the watch buffer occupancy (config `WatchBufferSize`), updated as events are enqueued/dequeued (or sampled periodically).
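The dropped-updates counter and the buffer-depth gauge can be sketched together around a non-blocking channel send. This is only an illustration under stated assumptions: `counter`, `gauge`, `update`, `watcher`, `enqueue`, and `dequeue` are hypothetical names, not the types in `actions/k8s/client.go`, and the stdlib stand-ins replace the real Prometheus metrics (which would presumably come from the `Scope` helpers listed under Pointers) so that the sketch runs without external dependencies.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// counter is a hypothetical stand-in for a Prometheus counter.
type counter struct{ n atomic.Int64 }

func (c *counter) Inc()         { c.n.Add(1) }
func (c *counter) Value() int64 { return c.n.Load() }

// gauge is a hypothetical stand-in for a Prometheus gauge.
type gauge struct{ v atomic.Int64 }

func (g *gauge) Set(x float64) { g.v.Store(int64(x)) }
func (g *gauge) Value() int64  { return g.v.Load() }

// update is a hypothetical placeholder for a TaskAction watch update.
type update struct{ name string }

// watcher owns the buffered channel and the metrics that describe it.
type watcher struct {
	buf     chan update
	dropped *counter
	depth   *gauge
}

// newWatcher creates the metrics exactly once, next to the buffer they track.
func newWatcher(bufferSize int) *watcher {
	return &watcher{
		buf:     make(chan update, bufferSize),
		dropped: &counter{},
		depth:   &gauge{},
	}
}

// enqueue tries a non-blocking send. If the buffer is full, the update is
// dropped and the dropped-updates counter is incremented.
func (w *watcher) enqueue(u update) bool {
	select {
	case w.buf <- u:
		w.depth.Set(float64(len(w.buf)))
		return true
	default:
		w.dropped.Inc()
		return false
	}
}

// dequeue takes one update if available and refreshes the depth gauge.
func (w *watcher) dequeue() (update, bool) {
	select {
	case u := <-w.buf:
		w.depth.Set(float64(len(w.buf)))
		return u, true
	default:
		return update{}, false
	}
}

func main() {
	w := newWatcher(2)
	for i := 0; i < 3; i++ {
		w.enqueue(update{name: fmt.Sprintf("action-%d", i)})
	}
	// The third send finds the buffer full and is dropped.
	fmt.Println("dropped:", w.dropped.Value(), "depth:", w.depth.Value())
	// prints "dropped: 1 depth: 2"
}
```

Updating the gauge on every enqueue and dequeue keeps it exact at the cost of one extra write per event; sampling `len(w.buf)` periodically is the cheaper alternative the issue mentions.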
Acceptance criteria
- `/metrics` exposes a dropped-updates counter, watcher event throughput (by result), processing latency, and buffer depth for the actions service.
- The TODO at `actions/k8s/client.go:65` is implemented and removed.
- Metrics are created once under a dedicated sub-scope (no Prometheus duplicate-registration panics).
- A unit test verifies the dropped-updates counter increments when an update is dropped, and that the throughput counter increments on event processing.
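A rough sketch of the throughput and latency instrumentation, written so the counters are easy to check from a test. Everything here is an assumption for illustration: `resultCounter`, `latencyRecorder`, `workerMetrics`, `process`, and `errSimulated` are hypothetical names, and the stdlib stand-ins take the place of a result-labeled Prometheus counter and a stopwatch/histogram. In the real test, metric values would more likely be read through the Prometheus client's test utilities (for example `testutil.ToFloat64` from `client_golang`).

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// errSimulated is a hypothetical failure used only to exercise the error path.
var errSimulated = errors.New("simulated handler failure")

// resultCounter is a hypothetical stand-in for a counter labeled by result.
type resultCounter struct {
	mu     sync.Mutex
	counts map[string]int
}

func newResultCounter() *resultCounter {
	return &resultCounter{counts: make(map[string]int)}
}

func (c *resultCounter) Inc(result string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.counts[result]++
}

func (c *resultCounter) Value(result string) int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.counts[result]
}

// latencyRecorder is a hypothetical stand-in for a timer/histogram.
type latencyRecorder struct {
	mu      sync.Mutex
	samples []time.Duration
}

func (l *latencyRecorder) Observe(d time.Duration) {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.samples = append(l.samples, d)
}

func (l *latencyRecorder) Count() int {
	l.mu.Lock()
	defer l.mu.Unlock()
	return len(l.samples)
}

// workerMetrics groups the per-event metrics of the watch worker loop.
type workerMetrics struct {
	processed *resultCounter
	latency   *latencyRecorder
}

func newWorkerMetrics() *workerMetrics {
	return &workerMetrics{processed: newResultCounter(), latency: &latencyRecorder{}}
}

// process runs one event handler, records how long it took, and counts the
// outcome under one of two fixed result labels.
func (m *workerMetrics) process(handle func() error) error {
	start := time.Now()
	err := handle()
	m.latency.Observe(time.Since(start))
	if err != nil {
		m.processed.Inc("error")
	} else {
		m.processed.Inc("success")
	}
	return err
}

func main() {
	m := newWorkerMetrics()
	_ = m.process(func() error { return nil })
	_ = m.process(func() error { return nil })
	_ = m.process(func() error { return errSimulated })
	fmt.Println("success:", m.processed.Value("success"),
		"error:", m.processed.Value("error"),
		"samples:", m.latency.Count())
	// prints "success: 2 error: 1 samples: 3"
}
```

Passing the handler as a function keeps the metrics wrapper independent of the watcher, so a test can drive both the success and the error path without a Kubernetes client.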
Pointers
- `actions/k8s/client.go`: the watcher, worker loop, buffer, and the dropped-updates TODO (line 65); constructor `NewActionsClient` (line 77) already receives a `promutils.Scope`.
- `actions/setup.go:31-40`: where `NewActionsClient` is constructed with `sc.Scope`.
- `flytestdlib/promutils/scope.go`: `Scope` helpers (`MustNewCounter`, `MustNewGauge`, `MustNewStopWatch`, `NewSubScope`).
Notes for contributors
- Keep label cardinality bounded — label by result/status, never by action/run IDs or other user input.
- This is independent of the runs-service instrumentation issues (#7447, #7448, #7449); all consume the same `Scope` from #7446.
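One way to keep label cardinality bounded is to map every outcome onto a small, fixed set of label values before touching a metric. The sketch below is an assumption-laden illustration: `resultLabel` and the example error variables are hypothetical, and the `timeout` and `canceled` values go beyond the success/error split this issue asks for. They are shown only to make the point that a finer breakdown can still stay bounded.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// The complete set of label values. Nothing outside this list is ever emitted.
const (
	resultSuccess  = "success"
	resultError    = "error"
	resultTimeout  = "timeout"  // hypothetical extra value
	resultCanceled = "canceled" // hypothetical extra value
)

// Hypothetical example errors. Note that the messages carry IDs, which is
// exactly the kind of text that must never end up in a label.
var (
	errWrappedDeadline = fmt.Errorf("action %q of run %q failed: %w", "a1", "r1", context.DeadlineExceeded)
	errOther           = errors.New("boom")
)

// resultLabel collapses any error into one of the fixed label values, so the
// number of time series does not grow with the number of actions or runs.
func resultLabel(err error) string {
	switch {
	case err == nil:
		return resultSuccess
	case errors.Is(err, context.DeadlineExceeded):
		return resultTimeout
	case errors.Is(err, context.Canceled):
		return resultCanceled
	default:
		return resultError
	}
}

func main() {
	fmt.Println(resultLabel(nil), resultLabel(errWrappedDeadline), resultLabel(errOther))
	// prints "success timeout error"
}
```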