EmrServerlessStartJobOperator task randomly fails after 20-21s even though the job is submitted and succeeds in EMR Serverless
#67178 opened on May 19, 2026
Description
Under which category would you file this issue?
Providers
Apache Airflow version
3.0.6
What happened and how to reproduce it?
We upgraded AWS MWAA from Airflow 2.7.2 to 3.0.6 and noticed an intermittent issue. When our DAGs submit jobs to EMR Serverless via EmrServerlessStartJobOperator, the jobs are submitted and finish successfully in EMR Serverless, but the corresponding Airflow task is marked as failed. Out of 100 tasks, 98-99 proceed fine and 1 or 2 fail. We noticed a pattern: the failing tasks fail after 20-21 seconds. The failures are completely random and not tied to any particular task.
Either something is wrong in the new version of Airflow, or some configuration is missing on our end.
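To narrow down which task instances are affected, a small filter along these lines could help. This is only a sketch: the record fields (`task_id`, `state`, `start`, `end`), the sample values, and the helper names are hypothetical placeholders, not data exported from our environment.

```python
from datetime import datetime

# Hypothetical task-instance records; field names and values are placeholders,
# not real data from our MWAA environment.
records = [
    {"task_id": "task_a", "state": "success",
     "start": "2026-05-19T10:35:30+00:00", "end": "2026-05-19T10:41:02+00:00"},
    {"task_id": "task_b", "state": "failed",
     "start": "2026-05-19T10:35:31+00:00", "end": "2026-05-19T10:35:52+00:00"},
    {"task_id": "task_c", "state": "failed",
     "start": "2026-05-19T10:35:31+00:00", "end": "2026-05-19T10:50:00+00:00"},
]


def duration_seconds(record):
    """Elapsed time between the ISO-8601 start and end timestamps."""
    start = datetime.fromisoformat(record["start"])
    end = datetime.fromisoformat(record["end"])
    return (end - start).total_seconds()


def suspicious_failures(records, low=20.0, high=22.0):
    """Return failed records whose duration falls in [low, high) seconds."""
    return [
        r for r in records
        if r["state"] == "failed" and low <= duration_seconds(r) < high
    ]


for r in suspicious_failures(records):
    print(r["task_id"], duration_seconds(r))
```

The 20-22 second window is taken from the pattern described above; failures outside that window are deliberately left out so that only the suspicious ones are listed.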
requirements.txt for both Airflow versions:
Airflow 3.0.6
--constraint "/usr/local/airflow/dags/constraints-3.11_spark_trino.txt"
apache-airflow-providers-apache-spark==5.3.2
apache-airflow-providers-amazon==9.12.0
apache-airflow-providers-ssh==4.1.3
types-paramiko==3.5.0.20250801
sshtunnel==0.4.0
requests==2.32.5
orjson==3.11.2
cachetools==5.5.2
Authlib==1.6.2
apache-airflow-providers-apache-livy==4.4.2
apache-airflow-providers-http==5.3.3
confluent-kafka==2.11.1
apache-airflow-providers-apache-kafka==1.10.2
fastavro==1.12.0
Airflow 2.7.2
--constraint "/usr/local/airflow/dags/constraints-3.7_spark_trino.txt"
apache-airflow-providers-apache-spark==3.0.0
apache-airflow-providers-amazon==6.0.0
apache-airflow-providers-ssh==3.2.0
types-paramiko==2.11.6
sshtunnel==0.4.0
requests==2.28.1
apache-airflow-providers-apache-livy==3.1.0
apache-airflow-providers-http==4.0.0
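For reference, the version changes between the two lists can be summarized with a short script like the sketch below. The pins are copied from the lists above; the helper names (`parse_pins`, `changed_pins`) are ours and only for illustration.

```python
AIRFLOW_2 = """
apache-airflow-providers-apache-spark==3.0.0
apache-airflow-providers-amazon==6.0.0
apache-airflow-providers-ssh==3.2.0
types-paramiko==2.11.6
sshtunnel==0.4.0
requests==2.28.1
apache-airflow-providers-apache-livy==3.1.0
apache-airflow-providers-http==4.0.0
"""

AIRFLOW_3 = """
apache-airflow-providers-apache-spark==5.3.2
apache-airflow-providers-amazon==9.12.0
apache-airflow-providers-ssh==4.1.3
types-paramiko==3.5.0.20250801
sshtunnel==0.4.0
requests==2.32.5
orjson==3.11.2
cachetools==5.5.2
Authlib==1.6.2
apache-airflow-providers-apache-livy==4.4.2
apache-airflow-providers-http==5.3.3
confluent-kafka==2.11.1
apache-airflow-providers-apache-kafka==1.10.2
fastavro==1.12.0
"""


def parse_pins(text):
    """Map lowercase package name -> pinned version for 'name==version' lines."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        if "==" not in line:
            continue  # skip blank lines and options such as --constraint
        name, version = line.split("==", 1)
        pins[name.lower()] = version
    return pins


def changed_pins(old, new):
    """Return {name: (old_version_or_None, new_version)} for pins that differ."""
    return {
        name: (old.get(name), new[name])
        for name in sorted(new)
        if old.get(name) != new[name]
    }


changes = changed_pins(parse_pins(AIRFLOW_2), parse_pins(AIRFLOW_3))
for name, (before, after) in changes.items():
    print(f"{name}: {before or 'not pinned'} -> {after}")
```

This only compares the explicit pins; packages resolved through the two constraint files are not covered.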
The following are the logs of a task that failed randomly:
Reading remote log from Cloudwatch log_group: arn:aws:logs:xxxxx:log-group:airflow-abc-MwaaEnvironment-Task log_stream: dag_id=xxx/run_id=manual__2026-05-19T10_35_27.159729+00_00/task_id=mytaskid/attempt=1.log
An error occurred (ResourceNotFoundException) when calling the GetLogEvents operation: The specified log stream does not exist.
Ideally, this error should be logged for other tasks as well, but I don't think the failure is caused by the missing log stream in CloudWatch. The failing task did not even log that the job was submitted to EMR successfully, as the other tasks do.
Is this a known issue?
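To check whether the log stream for a failed attempt was ever created, the stream name can be rebuilt from the task identifiers. The sketch below assumes the naming pattern seen in the single log line above, including the replacement of `:` with `_` in the run ID; that pattern is inferred from our logs and is not confirmed Airflow behavior. The function name is ours.

```python
def log_stream_name(dag_id, run_id, task_id, attempt):
    """Rebuild the CloudWatch log stream name in the format seen in our logs.

    Assumption: colons in the run ID are replaced with underscores, as in the
    stream name quoted in this issue.
    """
    safe_run_id = run_id.replace(":", "_")
    return (
        f"dag_id={dag_id}/run_id={safe_run_id}"
        f"/task_id={task_id}/attempt={attempt}.log"
    )


print(log_stream_name(
    "xxx", "manual__2026-05-19T10:35:27.159729+00:00", "mytaskid", 1
))
```

The resulting name could then be looked up in the task log group, for example with the CloudWatch Logs `DescribeLogStreams` API and its `logStreamNamePrefix` filter.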
What you think should happen instead?
If the job was submitted to EMR Serverless successfully, the task should reflect that and proceed without failing.
Operating System
No response
Deployment
Amazon (AWS) MWAA
Apache Airflow Provider(s)
amazon
Versions of Apache Airflow Providers
apache-airflow-providers-amazon==9.12.0
Official Helm Chart version
Not Applicable
Kubernetes Version
No response
Helm Chart configuration
No response
Docker Image customizations
No response
Anything else?
No response
Are you willing to submit PR?
- Yes I am willing to submit a PR!
Code of Conduct
- I agree to follow this project's Code of Conduct