apache/airflow

EmrServerlessStartJobOperator tasks fail randomly after 20-21 s even though the job is submitted and succeeds in EMR Serverless

Closed

#67178 opened on May 19, 2026

Labels: area:providers, good first issue, kind:bug, provider:amazon

Description

Under which category would you file this issue?

Providers

Apache Airflow version

3.0.6

What happened and how to reproduce it?

We upgraded AWS MWAA from Airflow 2.7.2 to 3.0.6 and noticed an intermittent issue. When our DAGs submit jobs to EMR Serverless via EmrServerlessStartJobOperator, the jobs are submitted and finish successfully in EMR, but the corresponding Airflow task is marked as failed. Out of 100 tasks, 98-99 proceed fine and 1 or 2 fail. The failures share one pattern: the task fails after 20-21 seconds. Otherwise they are completely random and not tied to any particular task.
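One way to confirm the mismatch described above is to ask EMR Serverless directly for the state of the job runs behind the tasks that Airflow marked as failed. The following is a minimal sketch, not part of the original report: the function names, the `failed_task_runs` mapping, and the placeholder IDs are hypothetical, and it assumes the job run IDs were looked up separately (for example in the EMR console), since the failed tasks did not log them. It relies only on the boto3 `emr-serverless` client's `get_job_run` call.

```python
# Sketch: find Airflow tasks marked failed whose EMR Serverless job run succeeded.
# All names below (functions, mapping, placeholder IDs) are illustrative assumptions.

SUCCESS_STATE = "SUCCESS"  # terminal success state of an EMR Serverless job run


def emr_job_state(client, application_id, job_run_id):
    """Return the current state string of one EMR Serverless job run."""
    response = client.get_job_run(
        applicationId=application_id, jobRunId=job_run_id
    )
    return response["jobRun"]["state"]


def find_false_failures(client, application_id, failed_task_runs):
    """Return task ids whose job run succeeded although Airflow marked them failed.

    failed_task_runs maps an Airflow task_id to the EMR Serverless job run id
    that the task submitted (hypothetical mapping, collected by hand).
    """
    false_failures = []
    for task_id, job_run_id in failed_task_runs.items():
        if emr_job_state(client, application_id, job_run_id) == SUCCESS_STATE:
            false_failures.append(task_id)
    return false_failures


def main():
    # boto3 is imported lazily so the helpers above can be exercised with a stub.
    import boto3

    client = boto3.client("emr-serverless")
    # Placeholder values: replace with the real application and job run ids.
    mismatches = find_false_failures(
        client, "<application-id>", {"mytaskid": "<job-run-id>"}
    )
    print("Failed in Airflow but succeeded in EMR:", mismatches)
```

Calling `main()` with real IDs prints the task ids where the two systems disagree. Passing the client in as an argument keeps the helpers easy to exercise with a stub when no AWS credentials are available.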

Either something is wrong in the new version of Airflow, or some configuration is missing on our end.

requirements.txt for both Airflow versions

Airflow 3.0.6

--constraint "/usr/local/airflow/dags/constraints-3.11_spark_trino.txt"

apache-airflow-providers-apache-spark==5.3.2
apache-airflow-providers-amazon==9.12.0
apache-airflow-providers-ssh==4.1.3
types-paramiko==3.5.0.20250801
sshtunnel==0.4.0
requests==2.32.5
orjson==3.11.2
cachetools==5.5.2
Authlib==1.6.2
apache-airflow-providers-apache-livy==4.4.2
apache-airflow-providers-http==5.3.3
confluent-kafka==2.11.1
apache-airflow-providers-apache-kafka==1.10.2
fastavro==1.12.0

Airflow 2.7.2

--constraint "/usr/local/airflow/dags/constraints-3.7_spark_trino.txt"

apache-airflow-providers-apache-spark==3.0.0
apache-airflow-providers-amazon==6.0.0
apache-airflow-providers-ssh==3.2.0
types-paramiko==2.11.6
sshtunnel==0.4.0
requests==2.28.1
apache-airflow-providers-apache-livy==3.1.0
apache-airflow-providers-http==4.0.0
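Since the two environments differ in many pins at once, a small diff makes the jump easier to see. The sketch below is an illustration added for readers, not something from the original report; the helper names `parse_pins` and `diff_pins` are hypothetical, and the two lists are copied from the requirements shown above.

```python
# Sketch: compare the pinned versions of the two requirements lists above.

def parse_pins(text):
    """Return {package_name: version} for every `name==version` line."""
    pins = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blanks, comments, and pip options such as --constraint.
        if not line or line.startswith(("#", "--")):
            continue
        name, sep, version = line.partition("==")
        if sep:
            pins[name.strip().lower()] = version.strip()
    return pins


def diff_pins(old, new):
    """Return {name: (old_version, new_version)} for pins that differ."""
    changes = {}
    for name in sorted(set(old) | set(new)):
        before, after = old.get(name), new.get(name)
        if before != after:
            changes[name] = (before, after)
    return changes


AIRFLOW_2_7_2 = """
--constraint "/usr/local/airflow/dags/constraints-3.7_spark_trino.txt"
apache-airflow-providers-apache-spark==3.0.0
apache-airflow-providers-amazon==6.0.0
apache-airflow-providers-ssh==3.2.0
types-paramiko==2.11.6
sshtunnel==0.4.0
requests==2.28.1
apache-airflow-providers-apache-livy==3.1.0
apache-airflow-providers-http==4.0.0
"""

AIRFLOW_3_0_6 = """
--constraint "/usr/local/airflow/dags/constraints-3.11_spark_trino.txt"
apache-airflow-providers-apache-spark==5.3.2
apache-airflow-providers-amazon==9.12.0
apache-airflow-providers-ssh==4.1.3
types-paramiko==3.5.0.20250801
sshtunnel==0.4.0
requests==2.32.5
orjson==3.11.2
cachetools==5.5.2
Authlib==1.6.2
apache-airflow-providers-apache-livy==4.4.2
apache-airflow-providers-http==5.3.3
confluent-kafka==2.11.1
apache-airflow-providers-apache-kafka==1.10.2
fastavro==1.12.0
"""

changes = diff_pins(parse_pins(AIRFLOW_2_7_2), parse_pins(AIRFLOW_3_0_6))
for name, (before, after) in changes.items():
    print(f"{name}: {before or 'absent'} -> {after or 'absent'}")
```

Among other changes, the diff shows that `apache-airflow-providers-amazon` moved from 6.0.0 to 9.12.0, so the operator code itself changed substantially across the upgrade, not only Airflow core.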

The following are the logs of a task that failed randomly:

Reading remote log from Cloudwatch log_group: arn:aws:logs:xxxxx:log-group:airflow-abc-MwaaEnvironment-Task log_stream: dag_id=xxx/run_id=manual__2026-05-19T10_35_27.159729+00_00/task_id=mytaskid/attempt=1.log
An error occurred (ResourceNotFoundException) when calling the GetLogEvents operation: The specified log stream does not exist.

If the missing log stream were the cause, this error would be printed for other tasks as well, so I don't think the task is failing because of the missing log stream in CloudWatch. The failed task did not even log that the job was submitted to EMR successfully, as the other tasks do.

Is this a known issue?

What you think should happen instead?

If the job was submitted to EMR successfully, the task should reflect that and proceed without failing.

Operating System

No response

Deployment

Amazon (AWS) MWAA

Apache Airflow Provider(s)

amazon

Versions of Apache Airflow Providers

apache-airflow-providers-amazon==9.12.0

Official Helm Chart version

Not Applicable

Kubernetes Version

No response

Helm Chart configuration

No response

Docker Image customizations

No response

Anything else?

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!
