[Discussion][Zeta] Reduce historical job runtime info stored in IMap / StateStore with differentiated retention
#10767 opened on Apr 15, 2026
Search before asking
- I searched existing issues and found related discussions about IMap growth, job history storage, the StateStore abstraction, and IMap compaction, but no dedicated issue on differentiating the retention strategy for batch and streaming historical runtime information.
Description
Background
Today, SeaTunnel Zeta stores finished job runtime information in distributed engine state, including:
- finished job status
- finished job metrics
- finished job DAG / vertex info
- related historical logs / references
There is currently a single global retention control:
history-job-expire-minutes
This is useful, but it is too coarse for real workloads: one expiration value applies to every job type and every kind of historical artifact.
The engine should not keep all historical runtime information in IMap / StateStore under a single retention policy for all job types.
Problem
1. Batch jobs and streaming jobs have different retention needs
- Batch jobs are naturally finite, and users often only need historical details for a bounded period.
- Streaming jobs are long-running, so their runtime information should not accumulate in distributed state without bound or at full detail.
Using one global expiration policy for both types can be inefficient:
- too short: batch troubleshooting data may disappear too early
- too long: distributed engine state grows unnecessarily and increases memory / storage pressure
2. Not all history should stay in distributed engine state
Distributed engine state should mainly keep what is necessary for:
- current coordination
- recent troubleshooting
- bounded recovery / UI summary needs
It should not become the long-term storage for all historical runtime detail.
3. Existing issues already show symptoms of state growth / history pressure
Related examples:
- #8558
- #10039
- #10329
- #10766
There was also an older design discussion about job history storage:
- #2398
Proposal
Discuss and design a differentiated retention strategy for historical runtime information.
For batch jobs
Introduce retention control dedicated to finished batch jobs, for example:
- separate retention setting from streaming jobs
- configurable retention for finished batch job state / metrics / DAG detail
- optional summary-only retention after detailed data expires
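To make the batch idea concrete, here is a minimal sketch of a two-stage retention decision: full detail for a first period, summary only for a longer period, then removal. All names below (`RetentionStage`, `BatchHistoryRetention`, `stageFor`) are hypothetical and do not exist in SeaTunnel; the durations are illustrative only.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical sketch: none of these types exist in SeaTunnel today.
enum RetentionStage { FULL_DETAIL, SUMMARY_ONLY, EXPIRED }

final class BatchHistoryRetention {
    private final Duration detailRetention;   // how long state / metrics / DAG detail is kept
    private final Duration summaryRetention;  // how long the compact summary is kept

    BatchHistoryRetention(Duration detailRetention, Duration summaryRetention) {
        if (summaryRetention.compareTo(detailRetention) < 0) {
            throw new IllegalArgumentException(
                    "summary retention must not be shorter than detail retention");
        }
        this.detailRetention = detailRetention;
        this.summaryRetention = summaryRetention;
    }

    // Decide what should still be kept for a batch job that finished at finishedAt.
    RetentionStage stageFor(Instant finishedAt, Instant now) {
        Duration age = Duration.between(finishedAt, now);
        if (age.compareTo(detailRetention) < 0) {
            return RetentionStage.FULL_DETAIL;
        }
        if (age.compareTo(summaryRetention) < 0) {
            return RetentionStage.SUMMARY_ONLY;
        }
        return RetentionStage.EXPIRED;
    }
}
```

A cleanup task could call `stageFor` for each finished batch job and drop the detailed entries once the result is `SUMMARY_ONLY`. Whether the summary stage is wanted at all is one of the open questions in this issue.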
For streaming jobs
Streaming jobs need a different strategy.
Possible direction:
- do not keep all historical runtime detail in IMap / StateStore
- keep only bounded recent detail needed for UI / troubleshooting
- retain only latest summary / latest checkpoint summary / latest error summary in distributed state
- move long-term detailed history to more suitable storage if needed
This is especially important because streaming jobs can run for a very long time, so "history" for them should usually be:
- rolling window
- compacted summary
- bounded recent records
rather than full detail kept in distributed engine state.
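As an illustration of "bounded recent records", the sketch below keeps only the last N entries for a streaming job and evicts the oldest one on overflow, while still counting how many entries were seen in total. `BoundedRecentHistory` is a hypothetical in-memory helper, not an existing SeaTunnel or Hazelcast class, and it says nothing about how such a window would be stored in IMap / StateStore.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch: a fixed-size window over the most recent runtime records.
final class BoundedRecentHistory<T> {
    private final int capacity;
    private final Deque<T> recent = new ArrayDeque<>();
    private long totalSeen; // cheap summary that survives eviction

    BoundedRecentHistory(int capacity) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.capacity = capacity;
    }

    // Add a record, evicting the oldest one when the window is full.
    synchronized void record(T entry) {
        if (recent.size() == capacity) {
            recent.removeFirst();
        }
        recent.addLast(entry);
        totalSeen++;
    }

    // Copy of the current window, oldest first.
    synchronized List<T> snapshot() {
        return new ArrayList<>(recent);
    }

    synchronized long totalSeen() {
        return totalSeen;
    }
}
```

With a capacity of 3, recording five checkpoint summaries leaves only the last three in the window, so memory use stays constant however long the job runs.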
Suggested design discussion topics
- Which data must stay in distributed engine state?
- Which data can be reduced to summary form?
- Which data should use bounded rolling retention instead of full historical retention?
- Should batch and streaming jobs have separate retention configuration?
- Should different artifacts have separate retention policies?
  - job state summary
  - detailed metrics
  - DAG / vertex detail
  - checkpoint / runtime detail
  - logs / log references
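If the answer to the last two questions is yes, one possible shape is a lookup keyed by job mode and artifact type that falls back to the existing global value when no override is set. The sketch below is only an assumption about how that could look: `JobMode`, `HistoryArtifact`, and `RetentionTable` are hypothetical types defined locally for illustration, and no new configuration keys are implied.

```java
import java.time.Duration;
import java.util.EnumMap;
import java.util.Map;

// Hypothetical local enums for illustration only.
enum JobMode { BATCH, STREAMING }

enum HistoryArtifact {
    JOB_STATE_SUMMARY, DETAILED_METRICS, DAG_DETAIL, CHECKPOINT_DETAIL, LOG_REFERENCES
}

final class RetentionTable {
    // Plays the role of today's single global setting (history-job-expire-minutes).
    private final Duration globalDefault;
    private final Map<JobMode, Map<HistoryArtifact, Duration>> overrides =
            new EnumMap<>(JobMode.class);

    RetentionTable(Duration globalDefault) {
        this.globalDefault = globalDefault;
    }

    // Register a retention that applies to one artifact type of one job mode.
    RetentionTable override(JobMode mode, HistoryArtifact artifact, Duration retention) {
        overrides.computeIfAbsent(mode, m -> new EnumMap<>(HistoryArtifact.class))
                .put(artifact, retention);
        return this;
    }

    // Most specific value wins; otherwise fall back to the global default.
    Duration retentionFor(JobMode mode, HistoryArtifact artifact) {
        Map<HistoryArtifact, Duration> perMode = overrides.get(mode);
        if (perMode == null) {
            return globalDefault;
        }
        return perMode.getOrDefault(artifact, globalDefault);
    }
}
```

Falling back to the global default keeps existing deployments unchanged until an operator opts into a more specific policy.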
Non-goals
- This issue is not proposing to replace Hazelcast IMap with RocksDB in the same step.
- This issue is not proposing a big refactor before agreement on data boundaries.
- The main goal is to define a better retention policy and reduce unnecessary historical runtime data kept in engine state.
Acceptance criteria
- We define which historical runtime information should remain in IMap / StateStore and which should not.
- We define different retention expectations for batch and streaming jobs.
- We avoid relying on a single global retention policy for all job types and all historical artifacts.
- The final direction helps reduce engine state growth without breaking troubleshooting or recovery needs.
Related issues
- #2398
- #8558
- #10039
- #10181
- #10209
- #10329
- #10766
Are you willing to submit a PR?
- Yes I am willing to submit a PR!
Code of Conduct
- I agree to follow this project's Code of Conduct