[Improve][Zeta] Add observability for engine state stores (IMap / StateStore) · apache/seatunnel#10766


Labels: Zeta, help wanted, improve, metrics, monitor



Search before asking

I searched existing issues and found related discussions about replacing Hazelcast IMap / introducing StateStore, but did not find a dedicated issue for observability of engine state stores.

Description

Background

SeaTunnel Zeta currently stores important engine state in Hazelcast-backed maps, and there is ongoing discussion about introducing a StateStore abstraction for future backend replacement.

Today, however, the engine still lacks dedicated observability for those state stores.

Current telemetry mainly exposes node / executor / partition / thread pool metrics, but it does not show the size, growth, or pressure of engine state stores such as:

- engine_runningJobInfo
- engine_runningJobState
- engine_stateTimestamps
- engine_ownedSlotProfilesIMap
- engine_runningJobMetrics
- engine_finishedJobState
- engine_finishedJobMetrics
- engine_finishedJobVertexInfo
- engine_checkpoint_monitor
- engine_connectorJarRefCounters

This is already an operational blind spot today, regardless of whether we keep Hazelcast IMap or move to another backend later.

Why this matters

- Historical job state can accumulate and increase memory pressure.
- Large state in distributed maps can increase GC pressure and may affect cluster communication.
- When issues happen, we currently have no direct store-level metrics to confirm whether the bottleneck is state growth, cleanup lag, skew, or an abnormal store.
- There is already at least one issue that reflects a store-size-related symptom in practice: #8558.
- If we want to evaluate any future backend migration (for example RocksDB), we first need baseline observability on the current state layer.

Proposal

Add observability for engine state stores, starting with the current Hazelcast-backed implementation and keeping the design compatible with future StateStore abstractions.

Phase 1: basic store-level metrics

Expose metrics per store name, for example:

- entry count
- local owned entry count (when supported by backend / Hazelcast local stats)
- backup entry count (when supported)
- last-access / last-update related stats if available
- approximate memory / heap cost if available
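
As a rough illustration of what Phase 1 could expose, here is a minimal sketch that renders per-store gauges in the Prometheus text format. Everything in it is hypothetical: the `StoreStats` holder, the `StateStoreMetricsFormatter` class, and the `engine_state_store_` metric prefix are placeholders, not existing SeaTunnel names. With the current Hazelcast backend the values would presumably come from `IMap.size()` and `IMap.getLocalMapStats()` (for example `getOwnedEntryCount()`, `getBackupEntryCount()`, `getHeapCost()`); the sketch stays backend-agnostic and treats a negative value as "not supported".

```java
import java.util.Map;

/** Hypothetical snapshot of one store's basic stats; a negative value means "not supported by this backend". */
final class StoreStats {
    final long entryCount;
    final long ownedEntryCount;
    final long backupEntryCount;
    final long heapCostBytes;

    StoreStats(long entryCount, long ownedEntryCount, long backupEntryCount, long heapCostBytes) {
        this.entryCount = entryCount;
        this.ownedEntryCount = ownedEntryCount;
        this.backupEntryCount = backupEntryCount;
        this.heapCostBytes = heapCostBytes;
    }
}

/** Renders one gauge line per store and metric, labelled by store name. */
final class StateStoreMetricsFormatter {
    // Hypothetical metric prefix, chosen only for this example.
    static final String PREFIX = "engine_state_store_";

    static String format(Map<String, StoreStats> statsByStore) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, StoreStats> e : statsByStore.entrySet()) {
            String store = e.getKey();
            StoreStats s = e.getValue();
            appendGauge(sb, "entry_count", store, s.entryCount);
            appendGauge(sb, "owned_entry_count", store, s.ownedEntryCount);
            appendGauge(sb, "backup_entry_count", store, s.backupEntryCount);
            appendGauge(sb, "heap_cost_bytes", store, s.heapCostBytes);
        }
        return sb.toString();
    }

    private static void appendGauge(StringBuilder sb, String name, String store, long value) {
        if (value < 0) {
            return; // Skip metrics the backend cannot report instead of emitting a fake zero.
        }
        sb.append(PREFIX).append(name)
          .append("{store=\"").append(store).append("\"} ")
          .append(value).append('\n');
    }
}
```

Using the store name as a label rather than part of the metric name keeps the metric set stable if stores are added or renamed, and skipping unsupported values avoids implying that a backend reported zero.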

Phase 2: logical metrics for special stores

Some stores need business-aware metrics instead of only size():

- engine_runningJobMetrics
  - logical task metrics count
  - active partition key count
- engine_checkpoint_monitor
  - job count in overview store
  - in-progress checkpoint count
  - retained history count
- engine_finishedJob*
  - finished job record count
  - expiration / cleanup count if available
- engine_connectorJarRefCounters
  - tracked jar count
  - total reference count
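
To make the difference between outer map size and logical metrics concrete, here is a small sketch for two of the stores above. The value shapes are assumptions made for illustration only: reference counters are modelled as `Map<String, Long>` (jar name to count), and running job metrics as a nested map (partition key to per-task metrics). The real value types in the engine may differ, and `LogicalStoreMetrics` is a hypothetical helper.

```java
import java.util.Map;

/** Hypothetical helpers that derive logical metrics from store contents. */
final class LogicalStoreMetrics {

    /** Tracked jar count: one entry per jar, so this equals the map size. */
    static int trackedJarCount(Map<String, Long> refCounters) {
        return refCounters.size();
    }

    /** Total reference count: sum of all per-jar counters. */
    static long totalReferenceCount(Map<String, Long> refCounters) {
        long total = 0;
        for (long count : refCounters.values()) {
            total += count;
        }
        return total;
    }

    /** Active partition key count: number of outer keys (assumed nested shape). */
    static int activePartitionKeyCount(Map<String, Map<String, Long>> metricsByPartition) {
        return metricsByPartition.size();
    }

    /** Logical task metrics count: number of inner entries across all partitions. */
    static long logicalTaskMetricsCount(Map<String, Map<String, Long>> metricsByPartition) {
        long total = 0;
        for (Map<String, Long> inner : metricsByPartition.values()) {
            total += inner.size();
        }
        return total;
    }
}
```

Under these assumptions, a store with two partition keys can hold an arbitrary number of task metrics, which is why reporting only the outer `size()` would understate its real footprint.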

Phase 3: alerting/documentation readiness

- document the new metrics in telemetry docs
- provide Prometheus / Grafana examples
- clarify which metrics are backend-specific and which are generic StateStore metrics

Scope / non-goals

- This issue is not asking to replace Hazelcast IMap with RocksDB in the same step.
- This issue is not asking to redesign the entire state layer first.
- The goal is to make the current state layer observable, and to keep that observability usable if StateStore becomes the public internal abstraction later.

Acceptance criteria

- There is a documented metric set for engine state stores.
- Operators can identify which store is growing abnormally.
- At least the major engine stores listed above are covered.
- Metrics are available through the existing telemetry / Prometheus endpoint.
- Special stores such as engine_runningJobMetrics are not represented only by outer map size when that would be misleading.
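
One way to read the "growing abnormally" criterion is as a comparison between two samples of per-store entry counts. The sketch below shows that idea only; `StoreGrowthCheck` and the ratio threshold are hypothetical, and in practice this check would more likely be an alert rule on the exported metrics than engine code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical check that flags stores whose entry count grew beyond a ratio between two samples. */
final class StoreGrowthCheck {

    static List<String> growingStores(Map<String, Long> previous,
                                      Map<String, Long> current,
                                      double maxGrowthRatio) {
        List<String> flagged = new ArrayList<>();
        for (Map.Entry<String, Long> e : current.entrySet()) {
            Long before = previous.get(e.getKey());
            if (before == null || before <= 0) {
                continue; // No usable baseline for this store, so no ratio can be computed.
            }
            double ratio = e.getValue().doubleValue() / before.doubleValue();
            if (ratio > maxGrowthRatio) {
                flagged.add(e.getKey());
            }
        }
        return flagged;
    }
}
```

For example, with a threshold of 2.0, a store that went from 100 to 250 entries between samples would be flagged, while one that went from 10 to 11 would not.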

Related issues

- #8558
- #9851
- #10181
- #10209

Are you willing to submit a PR?

Yes I am willing to submit a PR!

Code of Conduct

I agree to follow this project's Code of Conduct

Contributor guide

Research direction: Inspect the existing metrics implementation in SeaTunnel Zeta and identify the state store classes (e.g. Hazelcast-backed maps) in order to add metrics for entry count, locally owned entry count, and memory cost. Then implement the metrics per store name and integrate them with the telemetry endpoint.
Tech stack: Java
Domain: backend, observability, devops
Issue type: Feature
Difficulty: 3
Estimated time: 3-5 days
Activity status: Active
Clarity: Clear
Prerequisites: Java, monitoring concepts
Beginner-friendliness: 60