[Bug] [seatunnel-engine-server] the slots will leak when the active master switches · apache/seatunnel#9589

Repository metrics

Stars: 6,897
PR merge metrics: average merge time 13d 21h; 143 PRs merged in 30d

Description

Search before asking

I searched the issues and found no similar issues.

What happened

Assume there are two master nodes, m1 and m2, and m1 is the active master. When m1 restarts, m2 becomes the active master and restores all running jobs. m2 restores a running job in the following steps:

1. Change the job state from RUNNING to PENDING (CoordinatorService#restoreJobFromMasterActiveSwitch).
2. Apply for slots (CoordinatorService#pendingJobSchedule).
3. Run the job.
   3.1 CoordinatorService#pendingJobSchedule
   3.2 PhysicalPlan#stateProcess

Because step 1 set the job state to PENDING, the code in the red box is executed. However, the check in the green box returns false, because step 1 changes only the job state and the pipeline state is still RUNNING instead of CREATED. As a result, the job is not deployed to the worker (I don't think deployment is necessary, because the old job is still running on the worker), and the slots applied for in step 2 are never released.
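The sequence above can be illustrated with a small toy model. This is only a sketch of the described behavior, not SeaTunnel code: `SlotLeakSketch`, `SlotPool`, `Job`, and `restoreOnNewMaster` are hypothetical names, and the slot counts (16 in total, 4 per job) are assumptions chosen to match the `slot-num: 16` setting in the report.

```java
/** Toy model of the restore path. None of these types are SeaTunnel classes. */
public class SlotLeakSketch {
    enum JobStatus { PENDING, RUNNING }
    enum PipelineStatus { CREATED, RUNNING }

    /** Fixed-size slot pool (assumption: static slots, as with dynamic-slot: false). */
    static class SlotPool {
        private int free;
        SlotPool(int total) { this.free = total; }
        boolean apply(int n) {
            if (n > free) {
                return false; // stands in for NoEnoughResourceException
            }
            free -= n;
            return true;
        }
        void release(int n) { free += n; }
        int free() { return free; }
    }

    /** A job that is already running on a worker when the master switches. */
    static class Job {
        JobStatus jobStatus = JobStatus.RUNNING;
        PipelineStatus pipelineStatus = PipelineStatus.RUNNING;
        final int slotsNeeded;
        Job(int slotsNeeded) { this.slotsNeeded = slotsNeeded; }
    }

    /** Steps 1 to 3 from the report. Returns true only if the job was deployed. */
    static boolean restoreOnNewMaster(Job job, SlotPool pool) {
        job.jobStatus = JobStatus.PENDING;          // step 1: only the job state changes
        if (!pool.apply(job.slotsNeeded)) {         // step 2: apply for slots
            throw new IllegalStateException("not enough slots");
        }
        // Step 3: deployment is gated on the pipeline state being CREATED.
        if (job.pipelineStatus == PipelineStatus.CREATED) {
            job.pipelineStatus = PipelineStatus.RUNNING;
            job.jobStatus = JobStatus.RUNNING;
            return true;
        }
        return false; // slots from step 2 are neither used nor released
    }

    public static void main(String[] args) {
        SlotPool pool = new SlotPool(16);
        pool.apply(4);                 // slots held by the job that is still running
        Job job = new Job(4);
        boolean deployed = restoreOnNewMaster(job, pool);
        System.out.println("deployed=" + deployed + ", free slots=" + pool.free());
    }
}
```

In this model each master switch takes another 4 slots for the same job without returning them, so repeated switches would eventually exhaust the pool.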

After I added some log statements, the output is shown below.

SeaTunnel Version

2.3.11

SeaTunnel Config

seatunnel:
  engine:
    slot-service:
      dynamic-slot: false
      slot-num: 16

Running Command

//

Error Exception

Eventually this results in a NoEnoughResourceException.

Zeta or Flink or Spark Version

Zeta

Java or Scala Version

JDK 1.8

Screenshots

No response

Are you willing to submit PR?

Yes I am willing to submit a PR!

Code of Conduct

I agree to follow this project's Code of Conduct

Contributor guide

Research direction: Investigate the slot creation and release logic in CoordinatorService and PhysicalPlan. Trace the flow when the active master changes and identify where slots are not released. Consider adding a mechanism to release slots when the job state is restored but the pipelines are still running.
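One possible shape for the release mechanism mentioned in the research direction is sketched below. It is a hedged illustration only: `SlotReleaseSketch`, `SlotPool`, and `restore` are hypothetical names that do not exist in SeaTunnel, and an actual fix would have to go through the engine's own resource manager rather than a counter like this.

```java
/** Toy sketch of releasing slots when a restored job is not redeployed. */
public class SlotReleaseSketch {
    enum PipelineStatus { CREATED, RUNNING }

    /** Hypothetical fixed-size slot pool, not a SeaTunnel class. */
    static class SlotPool {
        private int free;
        SlotPool(int total) { this.free = total; }
        void apply(int n) {
            if (n > free) {
                throw new IllegalStateException("not enough slots");
            }
            free -= n;
        }
        void release(int n) { free += n; }
        int free() { return free; }
    }

    /**
     * Applies for slots, then hands them back if the pipeline is still RUNNING
     * and therefore will not be deployed again. Returns true if deployable.
     */
    static boolean restore(PipelineStatus pipelineStatus, int slotsNeeded, SlotPool pool) {
        pool.apply(slotsNeeded);
        boolean deployable = pipelineStatus == PipelineStatus.CREATED;
        if (!deployable) {
            pool.release(slotsNeeded); // return the slots that will not be used
        }
        return deployable;
    }

    public static void main(String[] args) {
        SlotPool pool = new SlotPool(16);
        pool.apply(4); // slots held by the job that is still running on the worker
        boolean deployed = restore(PipelineStatus.RUNNING, 4, pool);
        System.out.println("deployed=" + deployed + ", free slots=" + pool.free());
    }
}
```

An alternative under the same assumptions would be to check the pipeline state before applying for slots at all, which avoids the apply-then-release round trip.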
Tech stack: java
Domain: backend
Issue type: Bug
Difficulty: 3
Estimated time: 3-5 days
Activity status: Active
Clarity: Clear
Prerequisites: Java, Git, Distributed Systems
Newbie friendliness: 45