[Feature][Zeta] Support resuming a force-stopped streaming job from the latest successful checkpoint
#10926 opened on May 20, 2026
Description
Search before asking
- I had searched in the existing issues and found related restore discussions, but not this exact Zeta capability gap.
- Related but different issues:
#10193: Flink engine checkpoint/savepoint restore compatibility#10574: sink-side data loss during restore after failure#10899: MySQL CDC timestamp startup mode cannot recover from checkpoints
What happened
In a scheduler / orchestrator environment, a long-running Zeta streaming job may be force-stopped from outside the engine (for example: task kill, Pod restart, scheduler-side takeover failure, or other non-graceful interruption).
Even when the job has already completed multiple checkpoints successfully, there does not appear to be a clear and supported Zeta-side recovery path to resume the next run from the latest successful checkpoint.
In practice, operators often have to treat the next start as a fresh submission, which weakens the operational value of checkpointing for streaming jobs.
Why this matters
This is especially painful for CDC-style long-running jobs:
- re-running from scratch can repeat snapshot work
- resubmitting without a restore contract can create replay / duplicate / manual-reconciliation risks
- external schedulers cannot reliably implement "force stop -> restart from latest checkpoint" semantics on top of Zeta today
Checkpointing is already active during runtime, so after a forced stop, operators naturally expect to be able to recover from the latest successful checkpoint rather than start over.
Steps to reproduce
- Submit a Zeta streaming job with checkpointing enabled.
- Wait until several checkpoints complete successfully.
- Force-stop the running task from the scheduler / orchestration layer, without a graceful savepoint-style stop.
- Start the job again.
- Observe that there is no clear supported mechanism to bind the new run to the latest successful checkpoint of the previous run.
Expected behavior
Zeta should provide a clear and supported recovery contract for this scenario.
Ideally, after a streaming job has completed checkpoint N, and the running task is force-stopped externally, the next restore / restart / resubmission should be able to:
- discover the latest successful checkpoint for that job
- resume from checkpoint
Ninstead of behaving like a fresh submission - make the restore path explicit for sources / transforms / sinks
- define fallback behavior when the latest checkpoint is missing, incompatible, or already cleaned up
If this is already supported today, then the current documentation and operator-facing workflow are not clear enough and should be documented explicitly.
Actual behavior
From the operator side, the practical behavior looks like a capability gap:
- checkpoints are being completed during runtime
- but after a forced stop, the next run does not have a clear official resume-from-latest-checkpoint path
- the burden is pushed to the external scheduler / operator to guess how recovery should work
Relevant evidence
A few implementation details suggest that most of the building blocks already exist, but the end-to-end operator-facing resume contract is incomplete or unclear:
- Zeta runtime already has checkpointing / savepoint-related job states
- source translation modules already contain snapshot / restore logic, such as
CoordinatedSourceandParallelSource - the missing part seems to be the engine-level / job-lifecycle-level recovery path for "force stop -> restart from latest successful checkpoint"
SeaTunnel Version
Observed from a SeaTunnel-based Zeta deployment in the 2.6.x line.
I am filing this as a Zeta capability request / gap report. If maintainers believe this is already supported upstream, please point to the recommended recovery workflow.
SeaTunnel Config
A representative streaming setup looks like this:
env {
job.mode = "STREAMING"
parallelism = 1
checkpoint.interval = 10000
}
source {
SqlServer-CDC {
startup.mode = "INITIAL"
database-names = ["demo_db"]
table-names = ["demo_table"]
base-url = "jdbc:sqlserver://<host>:1433;databaseName=<db>"
username = "<user>"
password = "<password>"
}
}
sink {
Iceberg {
namespace = demo
table = "${table_name}"
catalog_name = default
}
}
The exact connector combination should not be the core point here. The request is about the Zeta job-lifecycle recovery contract after a forced stop.
Running Command
./bin/seatunnel-cluster.sh -d
Error Exception
There may be no single canonical engine exception for this scenario.
The core problem is the lack of a clear supported restore path from the latest successful checkpoint after a forced stop.
Zeta or Flink or Spark Version
Zeta
Java or Scala Version
Java
Screenshots
No response
Are you willing to submit PR?
- Yes I am willing to submit PR!
Code of Conduct
- I agree to follow this project's Code of Conduct
Contribution Claim
If this direction makes sense, contributors are welcome to leave a comment to claim or discuss the implementation direction.