[Improve][Zeta] Add observability for engine state stores (IMap / StateStore)
Opened on Apr 15, 2026
Search before asking
- I searched existing issues and found related discussions about replacing Hazelcast IMap / introducing StateStore, but did not find a dedicated issue for observability of engine state stores.
Description
Background
SeaTunnel Zeta currently stores important engine state in Hazelcast-backed maps, and there is ongoing discussion about introducing a StateStore abstraction for future backend replacement.
Today, however, the engine still lacks dedicated observability for those state stores.
Current telemetry mainly exposes node / executor / partition / thread pool metrics, but it does not show the size, growth, or pressure of engine state stores such as:
- engine_runningJobInfo
- engine_runningJobState
- engine_stateTimestamps
- engine_ownedSlotProfilesIMap
- engine_runningJobMetrics
- engine_finishedJobState
- engine_finishedJobMetrics
- engine_finishedJobVertexInfo
- engine_checkpoint_monitor
- engine_connectorJarRefCounters
This is an operational blind spot today, regardless of whether we keep Hazelcast IMap or move to another backend later.
Why this matters
- Historical job state can accumulate and increase memory pressure.
- Large state in distributed maps can increase GC pressure and may affect cluster communication.
- When issues happen, we currently have no direct store-level metrics to confirm whether the bottleneck is state growth, cleanup lag, skew, or an abnormal store.
- There is already at least one issue that reflects a store-size-related symptom in practice: #8558.
- If we want to evaluate any future backend migration (for example RocksDB), we first need baseline observability on the current state layer.
Proposal
Add observability for engine state stores, starting with the current Hazelcast-backed implementation and keeping the design compatible with future StateStore abstractions.
Phase 1: basic store-level metrics
Expose metrics per store name, for example:
- entry count
- local owned entry count (when supported by backend / Hazelcast local stats)
- backup entry count (when supported)
- last-access / last-update related stats if available
- approximate memory / heap cost if available
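The Phase 1 list could be modeled as a small backend-neutral snapshot, roughly as sketched below. `StoreStatsSource`, `BasicStoreMetrics`, and the gauge names are hypothetical and are not existing SeaTunnel types. For the Hazelcast backend, the values would presumably come from `IMap.size()` and `IMap.getLocalMapStats()` (for example `getOwnedEntryCount()`, `getBackupEntryCount()`, `getHeapCost()`); that wiring is left out so the sketch needs only the JDK.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.OptionalLong;

// Hypothetical backend-neutral view of one engine state store.
// Optional values are empty when the backend cannot report them.
interface StoreStatsSource {
    String storeName();                 // e.g. "engine_runningJobInfo", used as the metric label
    long entryCount();                  // total entries in the store
    OptionalLong localOwnedEntryCount(); // entries owned by this node
    OptionalLong backupEntryCount();     // backup entries held by this node
    OptionalLong approxHeapCostBytes();  // approximate heap cost in bytes
}

final class BasicStoreMetrics {
    // Returns only the gauges the backend actually supports,
    // so unsupported metrics are absent rather than reported as zero.
    static Map<String, Long> collect(StoreStatsSource source) {
        Map<String, Long> gauges = new LinkedHashMap<>();
        gauges.put("entries", source.entryCount());
        source.localOwnedEntryCount().ifPresent(v -> gauges.put("owned_entries", v));
        source.backupEntryCount().ifPresent(v -> gauges.put("backup_entries", v));
        source.approxHeapCostBytes().ifPresent(v -> gauges.put("heap_cost_bytes", v));
        return gauges;
    }
}
```

Omitting unsupported gauges instead of emitting `0` is a design choice in this sketch: it keeps "backend cannot report this" distinguishable from "store is empty".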
Phase 2: logical metrics for special stores
Some stores need business-aware metrics instead of only size():
- engine_runningJobMetrics
  - logical task metrics count
  - active partition key count
- engine_checkpoint_monitor
  - job count in overview store
  - in-progress checkpoint count
  - retained history count
- engine_finishedJob*
  - finished job record count
  - expiration / cleanup count if available
- engine_connectorJarRefCounters
  - tracked jar count
  - total reference count
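To illustrate why outer map size can mislead for a store like engine_runningJobMetrics: if each outer key holds a nested collection of task metrics, `size()` can stay small while the logical payload grows. The sketch below assumes, purely for illustration, that such a store can be viewed as a map from partition key to a map of task metrics; the actual value type in SeaTunnel may differ, and `LogicalStoreMetrics` is a hypothetical helper.

```java
import java.util.Map;

final class LogicalStoreMetrics {
    // Number of outer keys that currently hold at least one task metric.
    static long activePartitionKeys(Map<?, ? extends Map<?, ?>> store) {
        return store.values().stream()
                .filter(inner -> inner != null && !inner.isEmpty())
                .count();
    }

    // Total number of task metric entries across all outer keys.
    static long logicalTaskMetrics(Map<?, ? extends Map<?, ?>> store) {
        return store.values().stream()
                .filter(inner -> inner != null)
                .mapToLong(inner -> (long) inner.size())
                .sum();
    }
}
```

With two outer keys, one holding two task metrics and one empty, the outer `size()` is 2, while this sketch reports one active partition key and two logical task metrics.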
Phase 3: alerting/documentation readiness
- document the new metrics in telemetry docs
- provide Prometheus / Grafana examples
- clarify which metrics are backend-specific and which are generic StateStore metrics
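One possible way to make the generic versus backend-specific split visible on the Prometheus endpoint is a label convention, sketched below. The metric name `state_store_entries`, the `store` and `backend` labels, and the `PrometheusLines` helper are all assumptions for illustration and not existing SeaTunnel metric names; only the text exposition format and its label-value escaping rules (backslash, double quote, newline) follow Prometheus conventions.

```java
final class PrometheusLines {
    // Renders one gauge sample in Prometheus text exposition format.
    // Generic metrics pass backend == null; backend-specific ones add a backend label.
    static String gauge(String metric, String store, String backend, long value) {
        StringBuilder sb = new StringBuilder(metric)
                .append("{store=\"").append(escape(store)).append('"');
        if (backend != null) {
            sb.append(",backend=\"").append(escape(backend)).append('"');
        }
        return sb.append("} ").append(value).toString();
    }

    // Escapes backslash, double quote, and newline in a label value.
    static String escape(String labelValue) {
        return labelValue
                .replace("\\", "\\\\")
                .replace("\"", "\\\"")
                .replace("\n", "\\n");
    }
}
```

For example, `PrometheusLines.gauge("state_store_entries", "engine_runningJobInfo", null, 42)` yields `state_store_entries{store="engine_runningJobInfo"} 42`, which a Grafana panel could group by the `store` label.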
Scope / non-goals
- This issue is not asking to replace Hazelcast IMap with RocksDB in the same step.
- This issue is not asking to redesign the entire state layer first.
- The goal is to make the current state layer observable, and to keep that observability usable if StateStore becomes the public internal abstraction later.
Acceptance criteria
- There is a documented metric set for engine state stores.
- Operators can identify which store is growing abnormally.
- At least the major engine stores listed above are covered.
- Metrics are available through the existing telemetry / Prometheus endpoint.
- Special stores such as engine_runningJobMetrics are not represented only by outer map size when that would be misleading.
Related issues
- #8558
- #9851
- #10181
- #10209
Are you willing to submit a PR?
- Yes I am willing to submit a PR!
Code of Conduct
- I agree to follow this project's Code of Conduct