[Discussion][Zeta] Reduce historical job runtime info stored in IMap / StateStore with differentiated retention · apache/seatunnel#10767

2026-04-15T03:46:27.000Z

### Search before asking

- [x] I searched existing issues and found related discussions about IMAP growth, job history storage, StateStore abstraction, and IMAP compaction, but did not find a dedicated issue for differentiating retention strategy between batch and streaming historical runtime information.

### Description

## Background

Today, SeaTunnel Zeta stores finished job runtime information in distributed engine state, including:

- finished job status
- finished job metrics
- finished job DAG / vertex info
- related historical logs / references

There is currently a single global retention control:

- `history-job-expire-minutes`

This is useful, but it is still too coarse for real workloads.

In practice, the engine should not keep all historical runtime information in IMap / StateStore with a single retention policy for all job types.

## Problem

### 1. Batch jobs and streaming jobs have different retention needs

- Batch jobs are naturally finite, and users often only need historical details for a bounded period.
- Streaming jobs are long-running, and their runtime information should not accumulate in distributed state in an unbounded or over-detailed way.

Using one global expiration policy for both types can be inefficient:

- too short: batch troubleshooting data may disappear too early
- too long: distributed engine state grows unnecessarily and increases memory / storage pressure

### 2. Not all history should stay in distributed engine state

Distributed engine state should mainly keep what is necessary for:

- current coordination
- recent troubleshooting
- bounded recovery / UI summary needs

It should not become the long-term storage for all historical runtime detail.

### 3. Existing issues already show symptoms of state growth / history pressure

Related examples:

- #8558
- #10039
- #10329
- #10766

There was also an older design discussion about job history storage:

- #2398

## Proposal

Discuss and design a differentiated retention strategy for historical runtime information.

### For batch jobs

Introduce retention control dedicated to finished batch jobs, for example:

- separate retention setting from streaming jobs
- configurable retention for finished batch job state / metrics / DAG detail
- optional summary-only retention after detailed data expires

### For streaming jobs

Streaming jobs need a different strategy.

Possible direction:

- do not keep all historical runtime detail in IMap / StateStore
- keep only bounded recent detail needed for UI / troubleshooting
- retain only latest summary / latest checkpoint summary / latest error summary in distributed state
- move long-term detailed history to more suitable storage if needed

This is especially important because streaming jobs can run for a very long time, so "history" for them should usually be:

- rolling window
- compacted summary
- bounded recent records

rather than full detail kept in distributed engine state.

### Suggested design discussion topics

1. Which data must stay in distributed engine state?
2. Which data can be reduced to summary form?
3. Which data should use bounded rolling retention instead of full historical retention?
4. Should batch and streaming jobs have separate retention configuration?
5. Should different artifacts have separate retention policies?
   - job state summary
   - detailed metrics
   - DAG / vertex detail
   - checkpoint / runtime detail
   - logs / log references

## Non-goals

- This issue is not proposing to replace Hazelcast IMap with RocksDB in the same step.
- This issue is not proposing a big refactor before agreement on data boundaries.
- The main goal is to define a better retention policy and reduce unnecessary historical runtime data kept in engine state.

## Acceptance criteria

- We define which historical runtime information should remain in IMap / StateStore and which should not.
- We define different retention expectations for batch and streaming jobs.
- We avoid relying on a single global retention policy for all job types and all historical artifacts.
- The final direction helps reduce engine state growth without breaking troubleshooting or recovery needs.

### Related issues

- #2398
- #8558
- #10039
- #10181
- #10209
- #10329
- #10766

### Are you willing to submit a PR?

- [ ] Yes I am willing to submit a PR!

### Code of Conduct

- [x] I agree to follow this project's Code of Conduct
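To make discussion topics 4 and 5 more concrete, here is a minimal sketch of what a retention lookup keyed by job type and artifact kind could look like. Everything in it is hypothetical: `RetentionSketch`, `JobMode`, `Artifact`, and all the durations are illustrative names and values, not existing SeaTunnel code or configuration. The fallback duration only plays the role that the single global `history-job-expire-minutes` setting plays today.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.EnumMap;
import java.util.Map;

/** Hypothetical sketch only: none of these types or names exist in SeaTunnel. */
final class RetentionSketch {

    /** The two job types the issue wants to treat differently. */
    enum JobMode { BATCH, STREAMING }

    /** Artifact kinds taken from the discussion topics in this issue. */
    enum Artifact { STATE_SUMMARY, DETAILED_METRICS, DAG_DETAIL, CHECKPOINT_DETAIL, LOG_REFERENCES }

    // Fallback used when no specific rule is set (stands in for the single global setting).
    private final Duration globalDefault;
    private final Map<JobMode, Map<Artifact, Duration>> overrides = new EnumMap<>(JobMode.class);

    RetentionSketch(Duration globalDefault) {
        this.globalDefault = globalDefault;
    }

    /** Registers a retention period for one (job mode, artifact) pair. */
    RetentionSketch with(JobMode mode, Artifact artifact, Duration retention) {
        overrides
            .computeIfAbsent(mode, m -> new EnumMap<Artifact, Duration>(Artifact.class))
            .put(artifact, retention);
        return this;
    }

    /** Returns the specific rule if present, otherwise the global default. */
    Duration retentionFor(JobMode mode, Artifact artifact) {
        Map<Artifact, Duration> perMode = overrides.get(mode);
        return perMode == null ? globalDefault : perMode.getOrDefault(artifact, globalDefault);
    }

    /** True when the artifact is older than its retention period. */
    boolean isExpired(JobMode mode, Artifact artifact, Instant recordedAt, Instant now) {
        return Duration.between(recordedAt, now).compareTo(retentionFor(mode, artifact)) > 0;
    }

    public static void main(String[] args) {
        // Illustrative values only: detailed metrics expire sooner than summaries.
        RetentionSketch policy = new RetentionSketch(Duration.ofMinutes(1440))
            .with(JobMode.BATCH, Artifact.DETAILED_METRICS, Duration.ofHours(6))
            .with(JobMode.STREAMING, Artifact.DETAILED_METRICS, Duration.ofMinutes(30));

        Instant recordedAt = Instant.parse("2026-04-15T00:00:00Z");
        Instant now = recordedAt.plus(Duration.ofHours(7));
        for (JobMode mode : JobMode.values()) {
            for (Artifact artifact : Artifact.values()) {
                System.out.printf("%s %s expired=%b%n", mode, artifact,
                    policy.isExpired(mode, artifact, recordedAt, now));
            }
        }
    }
}
```

The point of the sketch is only the shape of the lookup: a specific rule per job type and artifact, with a fallback to one global value so existing deployments would keep their current behavior. Whether such rules belong in configuration, and which artifacts deserve their own rule, is exactly what this issue asks to discuss.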

(2 comments) (1 reaction) (0 assignees) Java (1,432 forks) batch import

Zeta, discussion, help wanted, improve

Repository metrics

Stars: 6,897
PR merge metrics: average merge time 13d 21h; 143 merged PRs in 30d

Contributor guide

Research direction: Analyze current retention mechanisms in SeaTunnel Zeta, identify where job runtime info is stored, and propose a design for separate batch and streaming retention policies.
Tech stack: Java
Domain: backend
Issue type: Research
Difficulty: 3
Estimated time: 1-2 days
Activity status: Active
Clarity: Clear
Prerequisites: Java, Git
Newbie friendliness: 55
