apache/seatunnel

[Improve][Zeta] Add observability for engine state stores (IMap / StateStore)

Open

Opened on Apr 15, 2026

Labels: Zeta, help wanted, improve, metrics, monitor

Search before asking

  • I searched existing issues and found related discussions about replacing Hazelcast IMap / introducing StateStore, but did not find a dedicated issue for observability of engine state stores.

Description

Background

SeaTunnel Zeta currently stores important engine state in Hazelcast-backed maps, and there is ongoing discussion about introducing a StateStore abstraction for future backend replacement.

Today, however, the engine still lacks dedicated observability for those state stores.

Current telemetry mainly exposes node / executor / partition / thread pool metrics, but it does not show the size, growth, or pressure of engine state stores such as:

  • engine_runningJobInfo
  • engine_runningJobState
  • engine_stateTimestamps
  • engine_ownedSlotProfilesIMap
  • engine_runningJobMetrics
  • engine_finishedJobState
  • engine_finishedJobMetrics
  • engine_finishedJobVertexInfo
  • engine_checkpoint_monitor
  • engine_connectorJarRefCounters

This is already an operational blind spot today, regardless of whether we keep Hazelcast IMap or move to another backend later.

Why this matters

  • Historical job state can accumulate and increase memory pressure.
  • Large state in distributed maps can increase GC pressure and may affect cluster communication.
  • When problems occur, we have no direct store-level metrics to tell whether the cause is state growth, cleanup lag, skew, or a single abnormal store.
  • There is already at least one issue that reflects a store-size-related symptom in practice: #8558.
  • If we want to evaluate any future backend migration (for example RocksDB), we first need baseline observability on the current state layer.

Proposal

Add observability for engine state stores, starting with the current Hazelcast-backed implementation and keeping the design compatible with future StateStore abstractions.

Phase 1: basic store-level metrics

Expose metrics per store name, for example:

  • entry count
  • local owned entry count (when supported by backend / Hazelcast local stats)
  • backup entry count (when supported)
  • last-access / last-update timestamps, if available
  • approximate memory / heap cost if available
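
As a rough illustration of the Phase 1 list above, the sketch below models one per-store snapshot and renders it in the Prometheus text exposition format, labelled by store name. Everything here is an assumption made for illustration: `StoreSnapshot`, `StoreMetricsRenderer`, the `engine_state_store_*` metric names and the `store` label do not exist in SeaTunnel today. With the Hazelcast backend the values could plausibly be read from `IMap.size()` and `IMap.getLocalMapStats()`, but that wiring is left out so the example stays self-contained.

```java
import java.util.Collections;
import java.util.List;
import java.util.OptionalLong;

// Hypothetical snapshot of one engine state store (not an existing SeaTunnel class).
final class StoreSnapshot {
    final String storeName;
    final long entryCount;
    // Optional values stay empty when the backend cannot report them.
    final OptionalLong ownedEntryCount;
    final OptionalLong backupEntryCount;
    final OptionalLong heapCostBytes;

    StoreSnapshot(String storeName, long entryCount, OptionalLong ownedEntryCount,
                  OptionalLong backupEntryCount, OptionalLong heapCostBytes) {
        this.storeName = storeName;
        this.entryCount = entryCount;
        this.ownedEntryCount = ownedEntryCount;
        this.backupEntryCount = backupEntryCount;
        this.heapCostBytes = heapCostBytes;
    }
}

final class StoreMetricsRenderer {
    // Renders one gauge line per available metric; unsupported metrics are omitted.
    static String render(List<StoreSnapshot> snapshots) {
        StringBuilder out = new StringBuilder();
        for (StoreSnapshot s : snapshots) {
            gauge(out, "engine_state_store_entry_count", s.storeName, s.entryCount);
            s.ownedEntryCount.ifPresent(
                    v -> gauge(out, "engine_state_store_owned_entry_count", s.storeName, v));
            s.backupEntryCount.ifPresent(
                    v -> gauge(out, "engine_state_store_backup_entry_count", s.storeName, v));
            s.heapCostBytes.ifPresent(
                    v -> gauge(out, "engine_state_store_heap_cost_bytes", s.storeName, v));
        }
        return out.toString();
    }

    // Appends a single sample line: metric{store="name"} value
    private static void gauge(StringBuilder out, String metric, String store, long value) {
        out.append(metric).append("{store=\"").append(store).append("\"} ")
           .append(value).append('\n');
    }

    public static void main(String[] args) {
        StoreSnapshot snapshot = new StoreSnapshot("engine_runningJobInfo", 3,
                OptionalLong.of(2), OptionalLong.empty(), OptionalLong.empty());
        System.out.print(render(Collections.singletonList(snapshot)));
    }
}
```

Using `OptionalLong` for the "when supported" values lets a backend that cannot report them omit the series entirely instead of exporting a misleading zero.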

Phase 2: logical metrics for special stores

Some stores need business-aware metrics instead of only size():

  • engine_runningJobMetrics
    • logical task metrics count
    • active partition key count
  • engine_checkpoint_monitor
    • job count in overview store
    • in-progress checkpoint count
    • retained history count
  • engine_finishedJob*
    • finished job record count
    • expiration / cleanup count if available
  • engine_connectorJarRefCounters
    • tracked jar count
    • total reference count
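
To show why the outer `size()` can mislead for the stores above, here is a minimal sketch that derives logical counts from nested data. The data shapes are assumptions made for illustration only: it treats a store like `engine_runningJobMetrics` as a map from partition key to an inner map of task metrics, and `engine_connectorJarRefCounters` as a map from jar identifier to reference count. The real value types in the engine may differ, and `LogicalStoreMetrics` is a hypothetical helper.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper; the map shapes are assumed, not taken from the SeaTunnel code base.
final class LogicalStoreMetrics {

    // Logical entry count of a nested store: the sum of all inner map sizes.
    // The outer size() only gives the number of partition keys.
    static long logicalEntryCount(Map<String, ? extends Map<?, ?>> store) {
        long total = 0;
        for (Map<?, ?> inner : store.values()) {
            total += inner.size();
        }
        return total;
    }

    // Total reference count across all tracked jars.
    // The tracked jar count is simply refCounters.size().
    static long totalReferenceCount(Map<String, Long> refCounters) {
        long total = 0;
        for (long count : refCounters.values()) {
            total += count;
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, Long> partition0 = new HashMap<>();
        partition0.put("task-1", 10L);
        partition0.put("task-2", 20L);
        Map<String, Long> partition1 = new HashMap<>();
        partition1.put("task-3", 30L);

        Map<String, Map<String, Long>> runningJobMetrics = new HashMap<>();
        runningJobMetrics.put("partition-0", partition0);
        runningJobMetrics.put("partition-1", partition1);

        System.out.println("active partition keys: " + runningJobMetrics.size());
        System.out.println("logical task metrics: " + logicalEntryCount(runningJobMetrics));
    }
}
```

In this example the outer map reports 2 entries while the logical task metric count is 3, which is the gap the Phase 2 metrics are meant to expose.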

Phase 3: alerting/documentation readiness

  • document the new metrics in telemetry docs
  • provide Prometheus / Grafana examples
  • clarify which metrics are backend-specific and which are generic StateStore metrics

Scope / non-goals

  • This issue is not asking to replace Hazelcast IMap with RocksDB in the same step.
  • This issue is not asking to redesign the entire state layer first.
  • The goal is to make the current state layer observable, and to keep that observability usable if StateStore becomes the public internal abstraction later.

Acceptance criteria

  • There is a documented metric set for engine state stores.
  • Operators can identify which store is growing abnormally.
  • At least the major engine stores listed above are covered.
  • Metrics are available through the existing telemetry / Prometheus endpoint.
  • Special stores such as engine_runningJobMetrics are not represented only by outer map size when that would be misleading.
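
One possible reading of "growing abnormally" is sketched below: compare two samples of per-store entry counts and flag the stores whose relative growth exceeds a threshold. This is only an illustration of the criterion. `StoreGrowthCheck`, the threshold semantics and the sample values are all hypothetical, and in practice the same check would more likely be an alert rule evaluated in Prometheus over the exported gauges.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical check; not part of SeaTunnel.
final class StoreGrowthCheck {

    // Returns the names of stores whose entry count grew by more than
    // maxGrowthRatio (1.0 means +100%) between the two samples, sorted by name.
    static List<String> abnormalStores(Map<String, Long> previous,
                                       Map<String, Long> current,
                                       double maxGrowthRatio) {
        List<String> flagged = new ArrayList<>();
        // TreeMap gives a deterministic, name-sorted iteration order.
        for (Map.Entry<String, Long> entry : new TreeMap<>(current).entrySet()) {
            Long before = previous.get(entry.getKey());
            if (before == null || before == 0L) {
                continue; // no usable baseline for a relative comparison
            }
            double ratio = (entry.getValue() - before) / (double) before;
            if (ratio > maxGrowthRatio) {
                flagged.add(entry.getKey());
            }
        }
        return flagged;
    }

    public static void main(String[] args) {
        Map<String, Long> previous = new HashMap<>();
        previous.put("engine_finishedJobState", 100L);
        previous.put("engine_runningJobInfo", 10L);

        Map<String, Long> current = new HashMap<>();
        current.put("engine_finishedJobState", 400L);
        current.put("engine_runningJobInfo", 11L);

        System.out.println(abnormalStores(previous, current, 1.0));
    }
}
```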

Related issues

  • #8558
  • #9851
  • #10181
  • #10209

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Code of Conduct

  • I agree to follow this project's Code of Conduct
