apache/seatunnel

[Discussion][Zeta] Reduce historical job runtime info stored in IMap / StateStore with differentiated retention

Open

#10767 opened on Apr 15, 2026

Labels: Zeta, discussion, help wanted, improve

Description

Search before asking

  • I searched existing issues and found related discussions about IMap growth, job history storage, the StateStore abstraction, and IMap compaction, but did not find a dedicated issue on differentiating the retention strategy for batch and streaming historical runtime information.

Description

Background

Today, SeaTunnel Zeta stores finished job runtime information in distributed engine state, including:

  • finished job status
  • finished job metrics
  • finished job DAG / vertex info
  • related historical logs / references

There is currently a single global retention control:

  • history-job-expire-minutes

This is useful, but it is too coarse for real workloads: it applies the same expiry to every finished job, regardless of job type or kind of artifact.

In practice, the engine should not keep all historical runtime information in IMap / StateStore under one retention policy shared by all job types.

Problem

1. Batch jobs and streaming jobs have different retention needs

  • Batch jobs are naturally finite, and users often only need historical details for a bounded period.
  • Streaming jobs are long-running, and their runtime information should not accumulate in distributed state in an unbounded or over-detailed way.

Using one global expiration policy for both types forces a trade-off:

  • too short: batch troubleshooting data may disappear before users have investigated a failure
  • too long: distributed engine state grows unnecessarily and increases memory / storage pressure

2. Not all history should stay in distributed engine state

Distributed engine state should mainly keep what is necessary for:

  • current coordination
  • recent troubleshooting
  • bounded recovery / UI summary needs

It should not become the long-term storage for all historical runtime detail.

3. Existing issues already show symptoms of state growth / history pressure

Related examples:

  • #8558
  • #10039
  • #10329
  • #10766

There was also an older design discussion about job history storage:

  • #2398

Proposal

Discuss and design a differentiated retention strategy for historical runtime information.

For batch jobs

Introduce retention control dedicated to finished batch jobs, for example:

  • separate retention setting from streaming jobs
  • configurable retention for finished batch job state / metrics / DAG detail
  • optional summary-only retention after detailed data expires
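
To make the batch proposal concrete for discussion, here is a minimal sketch of how a per-mode retention setting might resolve against the existing global one. It is written in Python only to keep it short; it is not SeaTunnel code. Only `history-job-expire-minutes` exists today. The names `batch_expire_minutes`, `streaming_expire_minutes`, and `batch_summary_only_minutes` are hypothetical options invented for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class JobMode(Enum):
    BATCH = "BATCH"
    STREAMING = "STREAMING"


class Retention(Enum):
    FULL_DETAIL = "full_detail"    # keep state, metrics and DAG detail
    SUMMARY_ONLY = "summary_only"  # keep only a compact summary
    EXPIRED = "expired"            # remove from distributed state


@dataclass(frozen=True)
class RetentionConfig:
    # Mirrors the existing global option history-job-expire-minutes.
    history_job_expire_minutes: int
    # Hypothetical per-mode overrides; None falls back to the global value.
    batch_expire_minutes: Optional[int] = None
    streaming_expire_minutes: Optional[int] = None
    # Hypothetical extra window in which only a summary is kept for batch jobs.
    batch_summary_only_minutes: int = 0

    def detail_minutes(self, mode: JobMode) -> int:
        override = (
            self.batch_expire_minutes
            if mode is JobMode.BATCH
            else self.streaming_expire_minutes
        )
        return self.history_job_expire_minutes if override is None else override


def decide(config: RetentionConfig, mode: JobMode, minutes_since_finish: int) -> Retention:
    """Decide what to keep for a finished job of the given age."""
    detail = config.detail_minutes(mode)
    if minutes_since_finish < detail:
        return Retention.FULL_DETAIL
    summary_extra = config.batch_summary_only_minutes if mode is JobMode.BATCH else 0
    if minutes_since_finish < detail + summary_extra:
        return Retention.SUMMARY_ONLY
    return Retention.EXPIRED
```

With this shape, deployments that set no override keep today's behavior, because both modes fall back to the global value.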

For streaming jobs

Streaming jobs need a different strategy.

Possible direction:

  • do not keep all historical runtime detail in IMap / StateStore
  • keep only bounded recent detail needed for UI / troubleshooting
  • retain only latest summary / latest checkpoint summary / latest error summary in distributed state
  • move long-term detailed history to more suitable storage if needed

This is especially important because streaming jobs can run for a very long time, so "history" for them should usually be:

  • rolling window
  • compacted summary
  • bounded recent records

rather than full detail kept in distributed engine state.
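
The "rolling window plus compacted summary" idea can be sketched as follows. This is an illustrative Python model, not the engine's data structure. `CheckpointRecord`, `StreamingJobHistory`, and `max_recent` are hypothetical names, and checkpoints stand in for any kind of runtime record.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CheckpointRecord:
    checkpoint_id: int
    duration_ms: int
    error: Optional[str] = None


class StreamingJobHistory:
    """Bounded recent detail plus a constant-size summary of everything seen."""

    def __init__(self, max_recent: int = 100) -> None:
        # Rolling window: the deque silently drops the oldest record when full.
        self.recent = deque(maxlen=max_recent)
        # Compacted summary: size does not grow with job lifetime.
        self.total_checkpoints = 0
        self.failed_checkpoints = 0
        self.max_duration_ms = 0
        self.latest_error: Optional[str] = None

    def record(self, rec: CheckpointRecord) -> None:
        self.recent.append(rec)
        self.total_checkpoints += 1
        self.max_duration_ms = max(self.max_duration_ms, rec.duration_ms)
        if rec.error is not None:
            self.failed_checkpoints += 1
            self.latest_error = rec.error
```

The state held per job is bounded by `max_recent` no matter how long the job runs, while the summary fields still answer the common UI questions (how many checkpoints, how many failed, what the last error was).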

Suggested design discussion topics

  1. Which data must stay in distributed engine state?
  2. Which data can be reduced to summary form?
  3. Which data should use bounded rolling retention instead of full historical retention?
  4. Should batch and streaming jobs have separate retention configuration?
  5. Should different artifacts have separate retention policies?
    • job state summary
    • detailed metrics
    • DAG / vertex detail
    • checkpoint / runtime detail
    • logs / log references
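
As a starting point for topics 4 and 5, per-artifact retention could be modeled as a lookup keyed by job mode and artifact type. The sketch below is Python for brevity, and every name and number in it is a placeholder assumption, not a proposed default.

```python
from enum import Enum
from typing import Dict, Set, Tuple


class Artifact(Enum):
    JOB_STATE_SUMMARY = "job_state_summary"
    DETAILED_METRICS = "detailed_metrics"
    DAG_VERTEX_DETAIL = "dag_vertex_detail"
    CHECKPOINT_RUNTIME_DETAIL = "checkpoint_runtime_detail"
    LOG_REFERENCES = "log_references"


# Hypothetical retention in minutes per (job mode, artifact).
# The values are placeholders chosen only to make the example readable.
RETENTION_MINUTES: Dict[Tuple[str, Artifact], int] = {
    ("BATCH", Artifact.JOB_STATE_SUMMARY): 7 * 24 * 60,
    ("BATCH", Artifact.DETAILED_METRICS): 24 * 60,
    ("BATCH", Artifact.DAG_VERTEX_DETAIL): 24 * 60,
    ("BATCH", Artifact.CHECKPOINT_RUNTIME_DETAIL): 60,
    ("BATCH", Artifact.LOG_REFERENCES): 3 * 24 * 60,
    ("STREAMING", Artifact.JOB_STATE_SUMMARY): 24 * 60,
    ("STREAMING", Artifact.DETAILED_METRICS): 60,
    ("STREAMING", Artifact.DAG_VERTEX_DETAIL): 60,
    ("STREAMING", Artifact.CHECKPOINT_RUNTIME_DETAIL): 60,
    ("STREAMING", Artifact.LOG_REFERENCES): 24 * 60,
}


def artifacts_to_purge(job_mode: str, minutes_since_finish: int) -> Set[Artifact]:
    """Return the artifacts of a finished job whose retention has elapsed."""
    return {
        artifact
        for (mode, artifact), keep_minutes in RETENTION_MINUTES.items()
        if mode == job_mode and minutes_since_finish >= keep_minutes
    }
```

A cleanup task could call `artifacts_to_purge` per finished job and delete only the returned artifacts, so the summary outlives the heavier detail.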

Non-goals

  • This issue is not proposing to replace Hazelcast IMap with RocksDB in the same step.
  • This issue is not proposing a big refactor before agreement on data boundaries.
  • The main goal is to define a better retention policy and reduce unnecessary historical runtime data kept in engine state.

Acceptance criteria

  • We define which historical runtime information should remain in IMap / StateStore and which should not.
  • We define different retention expectations for batch and streaming jobs.
  • We avoid relying on a single global retention policy for all job types and all historical artifacts.
  • The final direction helps reduce engine state growth without breaking troubleshooting or recovery needs.

Related issues

  • #2398
  • #8558
  • #10039
  • #10181
  • #10209
  • #10329
  • #10766

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Code of Conduct

  • I agree to follow this project's Code of Conduct
