[Bug] [seatunnel-engine-server] the slots will leak when the active master switches
#9589 opened on Jul 17, 2025
Description
Search before asking
- I had searched in the issues and found no similar issues.
What happened
Assume there are two master nodes, m1 and m2, and m1 is the active master. When m1 restarts, m2 becomes the active master and restores all running jobs. m2 restores a running job in the following steps:
1. Change the job state from RUNNING to PENDING (CoordinatorService#restoreJobFromMasterActiveSwitch).
2. Apply for slots (CoordinatorService#pendingJobSchedule).
3. Run the job:
   3.1 CoordinatorService#pendingJobSchedule
   3.2 PhysicalPlan#stateProcess

Because of step 1 (the job state is PENDING), the code in the red box is executed. However, the check in the green box returns false, because step 1 changes only the job state; the pipeline state is still RUNNING rather than CREATED. As a result, the job is not deployed to the worker (I don't think deployment is necessary, because the old job is still running on the worker), and the slots applied for in step 2 are never released.
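To make the sequence easier to follow, here is a minimal sketch of the behavior described above. It is a simplified model and not SeaTunnel's actual code: `SlotLeakSketch`, `SlotPool`, `Job`, `restore`, and the `releaseWhenSkipped` flag are all hypothetical names, and the 4 slots per job is an arbitrary number chosen for illustration.

```java
// Simplified, hypothetical model of the restore path described in this issue.
// None of these names are SeaTunnel's real classes or method signatures.
class SlotLeakSketch {
    enum State { CREATED, PENDING, RUNNING }

    /** Fixed-size slot pool, comparable to running with dynamic-slot: false. */
    static final class SlotPool {
        private int free;

        SlotPool(int total) { this.free = total; }

        void apply(int n) {
            if (n > free) {
                throw new IllegalStateException(
                        "not enough slots: need " + n + ", free " + free);
            }
            free -= n;
        }

        void release(int n) { free += n; }

        int free() { return free; }
    }

    /** A job that was already running when the active master switched. */
    static final class Job {
        State jobState = State.RUNNING;
        State pipelineState = State.RUNNING;
        final int slotsNeeded;

        Job(int slotsNeeded) { this.slotsNeeded = slotsNeeded; }
    }

    /** Returns true if the job was (re)deployed by the new active master. */
    static boolean restore(Job job, SlotPool pool, boolean releaseWhenSkipped) {
        job.jobState = State.PENDING;   // step 1: only the job state is reset
        pool.apply(job.slotsNeeded);    // step 2: slots are applied for
        // step 3: deployment is guarded by the pipeline state being CREATED,
        // but the pipeline state is still RUNNING after step 1.
        boolean deployed = job.pipelineState == State.CREATED;
        if (deployed) {
            job.pipelineState = State.RUNNING;
        } else if (releaseWhenSkipped) {
            // Assumption: one conceivable mitigation, not the project's fix.
            pool.release(job.slotsNeeded);
        }
        return deployed;
    }

    public static void main(String[] args) {
        SlotPool pool = new SlotPool(16);
        Job job = new Job(4); // hypothetical slot demand
        for (int i = 1; i <= 5; i++) {
            try {
                boolean deployed = restore(job, pool, false);
                System.out.println("switch " + i + ": deployed=" + deployed
                        + ", free=" + pool.free());
            } catch (IllegalStateException e) {
                System.out.println("switch " + i + ": " + e.getMessage());
            }
        }
    }
}
```

In this model, every simulated master switch leaves four fewer free slots because the deployment branch is skipped and nothing returns the slots, so the fifth switch fails to obtain slots at all. Passing `true` for `releaseWhenSkipped` keeps the pool size stable, which is only meant to show where a release could happen.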
After I added some logging, the output was as shown below.
SeaTunnel Version
2.3.11
SeaTunnel Config
seatunnel:
  engine:
    slot-service:
      dynamic-slot: false
      slot-num: 16
Running Command
//
Error Exception
Because the leaked slots are never released, jobs eventually fail with NoEnoughResourceException.
Zeta or Flink or Spark Version
Zeta
Java or Scala Version
JDK 1.8
Screenshots
No response
Are you willing to submit PR?
- Yes I am willing to submit a PR!
Code of Conduct
- I agree to follow this project's Code of Conduct