[Feature][Zeta] Support resuming a force-stopped streaming job from the latest successful checkpoint · apache/seatunnel#10926

(3 comments) (0 reactions) (1 assignee)Java (1,432 forks)batch import

featurehelp wanted

Repository metrics

Stars: (6,897 stars)
PR merge metrics: (Avg merge 13d 21h) (143 merged PRs in 30d)

Description

Search before asking

I had searched in the existing issues and found related restore discussions, but not this exact Zeta capability gap.
Related but different issues:
- #10193: Flink engine checkpoint/savepoint restore compatibility
- #10574: sink-side data loss during restore after failure
- #10899: MySQL CDC timestamp startup mode cannot recover from checkpoints

What happened

In a scheduler / orchestrator environment, a long-running Zeta streaming job may be force-stopped from outside the engine (for example: task kill, Pod restart, scheduler-side takeover failure, or other non-graceful interruption).

Even when the job has already completed multiple checkpoints successfully, there does not appear to be a clear and supported Zeta-side recovery path to resume the next run from the latest successful checkpoint.

In practice, operators often have to treat the next start as a fresh submission, which weakens the operational value of checkpointing for streaming jobs.

Why this matters

This is especially painful for CDC-style long-running jobs:

re-running from scratch can repeat snapshot work
resubmitting without a restore contract can create replay / duplicate / manual-reconciliation risks
external schedulers cannot reliably implement "force stop -> restart from latest checkpoint" semantics on top of Zeta today

Checkpointing is already active during runtime, so after a forced stop, operators naturally expect to be able to recover from the latest successful checkpoint rather than start over.

Steps to reproduce

Submit a Zeta streaming job with checkpointing enabled.
Wait until several checkpoints complete successfully.
Force-stop the running task from the scheduler / orchestration layer, without a graceful savepoint-style stop.
Start the job again.
Observe that there is no clear supported mechanism to bind the new run to the latest successful checkpoint of the previous run.

Expected behavior

Zeta should provide a clear and supported recovery contract for this scenario.

Ideally, after a streaming job has completed checkpoint N, and the running task is force-stopped externally, the next restore / restart / resubmission should be able to:

discover the latest successful checkpoint for that job
resume from checkpoint N instead of behaving like a fresh submission
make the restore path explicit for sources / transforms / sinks
define fallback behavior when the latest checkpoint is missing, incompatible, or already cleaned up

If this is already supported today, then the current documentation and operator-facing workflow are not clear enough and should be documented explicitly.

Actual behavior

From the operator side, the practical behavior looks like a capability gap:

checkpoints are being completed during runtime
but after a forced stop, the next run does not have a clear official resume-from-latest-checkpoint path
the burden is pushed to the external scheduler / operator to guess how recovery should work

Relevant evidence

A few implementation details suggest that most of the building blocks already exist, but the end-to-end operator-facing resume contract is incomplete or unclear:

Zeta runtime already has checkpointing / savepoint-related job states
source translation modules already contain snapshot / restore logic, such as CoordinatedSource and ParallelSource
the missing part seems to be the engine-level / job-lifecycle-level recovery path for "force stop -> restart from latest successful checkpoint"

SeaTunnel Version

Observed from a SeaTunnel-based Zeta deployment in the 2.6.x line.

I am filing this as a Zeta capability request / gap report. If maintainers believe this is already supported upstream, please point to the recommended recovery workflow.

SeaTunnel Config

A representative streaming setup looks like this:

env {
  job.mode = "STREAMING"
  parallelism = 1
  checkpoint.interval = 10000
}

source {
  SqlServer-CDC {
    startup.mode = "INITIAL"
    database-names = ["demo_db"]
    table-names = ["demo_table"]
    base-url = "jdbc:sqlserver://<host>:1433;databaseName=<db>"
    username = "<user>"
    password = "<password>"
  }
}

sink {
  Iceberg {
    namespace = demo
    table = "${table_name}"
    catalog_name = default
  }
}

The exact connector combination should not be the core point here. The request is about the Zeta job-lifecycle recovery contract after a forced stop.

Running Command

./bin/seatunnel-cluster.sh -d

Error Exception

There may be no single canonical engine exception for this scenario.
The core problem is the lack of a clear supported restore path from the latest successful checkpoint after a forced stop.

Zeta or Flink or Spark Version

Zeta

Java or Scala Version

Java

Screenshots

No response

Are you willing to submit PR?

Yes I am willing to submit PR!

Code of Conduct

I agree to follow this project's Code of Conduct

Contribution Claim

If this direction makes sense, contributors are welcome to leave a comment to claim or discuss the implementation direction.

Contributor guide

Research direction: Investigate the Zeta engine's checkpoint lifecycle and implement a recovery mechanism for force stopped jobs. Focus on discovering the latest checkpoint state after restart and ensuring sources/sinks can resume from that state.
Tech stack: java
Domain: backenddata
Issue type: Feature
Difficulty: 3
Estimated time: 1-2 days
Activity status: Active
Clarity: Clear
Prerequisites: Java basicsStreaming concepts
Newbie friendliness: 60