acquire_lock: TOCTOU race condition on PROJECTS_ROOT with multiple task replicas
#16447 opened on May 7, 2026
Description
Please confirm the following
- I agree to follow this project's code of conduct.
- I have checked the current issues for duplicates.
- I understand that AWX is open source software provided for free and that I might not receive a timely response.
- I am NOT reporting a (potential) security vulnerability. (These should be emailed to security@ansible.com instead.)
Bug Summary
When running AWX with multiple task replicas (>1), jobs fail immediately
with FileExistsError on /var/lib/awx/projects when triggered in parallel.
The root cause is a TOCTOU race condition in acquire_lock().
AWX Version
24.6.1 (also present in latest devel branch as of 2026-05-05)
Steps to Reproduce
- Deploy AWX with replicas: 3 and a RWX PVC for projects (CephFS/NFS)
- Trigger 2+ jobs simultaneously targeting different projects
- Observe immediate failure on some jobs
Error
File "awx/main/tasks/jobs.py", line 379, in acquire_lock
    os.mkdir(settings.PROJECTS_ROOT)
FileExistsError: [Errno 17] File exists: '/var/lib/awx/projects'
Root Cause
In awx/main/tasks/jobs.py, the acquire_lock() function uses a
non-atomic check-then-act pattern on PROJECTS_ROOT:
# Current code - TOCTOU race condition
if not os.path.exists(settings.PROJECTS_ROOT):
    os.mkdir(settings.PROJECTS_ROOT)
With multiple task pods sharing the same RWX volume, several pods can pass
the os.path.exists() check before any of them creates the directory.
The first os.mkdir() call then succeeds and every later one raises
FileExistsError.
Note: the per-project locking mechanism using fcntl.lockf() is
correctly implemented and unaffected by this bug.
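The race window described above can be reproduced deterministically in a single process. The following is a minimal sketch, not AWX code: `racy_ensure_dir` and `run_race` are hypothetical names, threads stand in for task pods, and a `threading.Barrier` is an artificial device that holds every worker between the check and the `os.mkdir()` call so the interleaving is guaranteed rather than left to timing.

```python
import os
import tempfile
import threading


def racy_ensure_dir(path, barrier, errors):
    """Check-then-act directory creation with the race window forced open."""
    if not os.path.exists(path):
        # Artificial barrier (assumption for the demo): every worker waits
        # here until all of them have passed the exists() check, mimicking
        # pods that check at the same moment.
        barrier.wait()
        try:
            os.mkdir(path)
        except FileExistsError as exc:
            # Only the first mkdir() can win; the rest land here.
            errors.append(exc)


def run_race(workers=3):
    """Run `workers` threads against one fresh path; return the errors raised."""
    errors = []
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "projects")
        barrier = threading.Barrier(workers)
        threads = [
            threading.Thread(target=racy_ensure_dir, args=(target, barrier, errors))
            for _ in range(workers)
        ]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
    return errors


if __name__ == "__main__":
    failures = run_race(workers=3)
    print(f"{len(failures)} of 3 workers raised FileExistsError")
```

With three workers, exactly one `os.mkdir()` succeeds and the other two fail, which matches the "all but the first" behaviour seen across task pods.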
Proposed Fix
Replace the check-then-act pattern with a single os.makedirs() call:
# Fix - idempotent, tolerates concurrent creation
os.makedirs(settings.PROJECTS_ROOT, exist_ok=True)
This is a one-line fix. With exist_ok=True, the call succeeds whether or
not the directory already exists, including when another pod creates it
concurrently, so the FileExistsError can no longer occur.
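To check the proposed call under contention, the sketch below releases several threads at once against a path that does not exist yet. It is an illustration under assumptions, not the AWX implementation: `ensure_dir` and `run_concurrent` are hypothetical names, a temporary directory stands in for PROJECTS_ROOT, and threads stand in for task pods.

```python
import os
import tempfile
import threading


def ensure_dir(path):
    """Create `path` if missing; succeed quietly if it is already a directory."""
    os.makedirs(path, exist_ok=True)


def run_concurrent(workers=8):
    """Call ensure_dir() from `workers` threads at once; return (errors, created)."""
    errors = []

    def worker(path, barrier):
        barrier.wait()  # release all workers together to maximise contention
        try:
            ensure_dir(path)
        except OSError as exc:
            errors.append(exc)

    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "projects")
        barrier = threading.Barrier(workers)
        threads = [
            threading.Thread(target=worker, args=(target, barrier))
            for _ in range(workers)
        ]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        created = os.path.isdir(target)
    return errors, created


if __name__ == "__main__":
    errs, created = run_concurrent()
    print(f"errors={len(errs)} created={created}")
```

An equivalent alternative is to keep `os.mkdir()` and wrap it in `try/except FileExistsError`; `os.makedirs(..., exist_ok=True)` is shorter and also creates missing parent directories.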
Workaround
Reduce task replicas to 1. This eliminates the race condition but removes task HA.
Additional Context
- Confirmed present in devel branch as of 2026-05-05
- PVC access mode: ReadWriteMany (CephFS)
- Operator version: 2.19.1
- The bug is triggered even when parallel jobs target different projects, since all jobs pass through this PROJECTS_ROOT check before reaching their individual project lock path
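The last point can be pictured as a two-step sequence: a step shared by every job, followed by a step specific to one project. The sketch below is a simplified illustration and does not reproduce the actual acquire_lock() body. The helper names and the lock file naming are assumptions, and it requires a POSIX system because it uses `fcntl`.

```python
import fcntl
import os
import tempfile


def acquire_project_lock(projects_root, project_name):
    """Hypothetical helper: ensure the shared root, then lock one project."""
    # Step 1: shared by every job, whatever project it targets. This is
    # where unrelated jobs can collide if creation is not idempotent.
    os.makedirs(projects_root, exist_ok=True)

    # Step 2: specific to one project; only jobs on the same project
    # contend here. The lock file name is an assumption for this sketch.
    lock_path = os.path.join(projects_root, f".{project_name}.lock")
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o600)
    fcntl.lockf(fd, fcntl.LOCK_EX)  # blocks until the lock is granted
    return fd


def release_project_lock(fd):
    """Release the lock taken by acquire_project_lock() and close the file."""
    fcntl.lockf(fd, fcntl.LOCK_UN)
    os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = os.path.join(tmp, "projects")
        fd_a = acquire_project_lock(root, "alpha")
        fd_b = acquire_project_lock(root, "beta")  # different project, no contention
        release_project_lock(fd_a)
        release_project_lock(fd_b)
```

Because step 1 runs before any per-project path is involved, it has to be safe for all jobs at once, independent of how well step 2 is implemented.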
AWX version
24.6.1
Select the relevant components
- UI
- UI (tech preview)
- API
- Docs
- Collection
- CLI
- Other
Installation method
kubernetes
Modifications
no
Ansible version
No response
Operating system
No response
Web browser
No response
Steps to reproduce
- Deploy AWX with replicas: 3 on Kubernetes
- Configure a RWX PVC for projects storage (CephFS or NFS)
- Create 2+ job templates pointing to different projects
- Trigger all jobs simultaneously (e.g. via scheduled jobs or API calls at the same time)
- Observe that some jobs fail immediately before playbook execution
Expected Behavior
All jobs should start normally regardless of how many task replicas are running or how many jobs are triggered simultaneously.
Actual Behavior
Some jobs fail immediately with:
File "awx/main/tasks/jobs.py", line 379, in acquire_lock
    os.mkdir(settings.PROJECTS_ROOT)
FileExistsError: [Errno 17] File exists: '/var/lib/awx/projects'
The failure rate increases with the number of task replicas and the number of simultaneous jobs.
Expected results
All jobs start normally; no FileExistsError is raised from acquire_lock.
Actual results
File "awx/main/tasks/jobs.py", line 379, in acquire_lock
    os.mkdir(settings.PROJECTS_ROOT)
FileExistsError: [Errno 17] File exists: '/var/lib/awx/projects'
Additional information
No response