Description

TLDR

There are a lot of “easy” performance wins via lazy loading across many provider packages, many of which are fairly popular.
Lazy loading makes sense as an optimization because, over a DAG file's lifecycle, the 3rd party packages are actually executed far less often than the DAG code is parsed.
The biggest wins will be in the following provider packages: google, alibaba, teradata, papermill, neo4j, databricks, samba, elasticsearch.

Motivation

Let’s say a user uses 3 different providers in a single DAG, but only one operator can be executed at a time. If each of the 3 provider imports brings along a bunch of superfluous modules into sys.modules that are never actually used, that is an unnecessary performance cost for DAG parsing and for the execution of the other tasks.

On the margin, lazy loading improves performance by reducing the memory footprint and speeding up start-ups during the scheduler’s DAG file processing and during task execution.

These changes are also fairly easy and low risk to implement. Linters tend to catch names that were removed from module globals but never properly lazy loaded. The most annoying part of implementing this is fixing unit test mocks.
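As a minimal sketch of the pattern (not the code from the PRs above), the example below defers an import to the method that needs it and shows how a test mock has to change as a result. `HypotheticalHook` and `parse_ratio` are invented names, and the standard-library `fractions` module stands in for a heavy 3rd party package such as pandas.

```python
from typing import TYPE_CHECKING
from unittest import mock

if TYPE_CHECKING:
    # Seen only by type checkers; nothing is imported here at runtime.
    from fractions import Fraction


class HypotheticalHook:
    """Illustrative stand-in for a provider hook (not a real Airflow class)."""

    def parse_ratio(self, text: str) -> "Fraction":
        # Deferred import: the module is loaded on the first call, not when
        # the DAG file that imports this hook is parsed.
        from fractions import Fraction

        return Fraction(text)


def test_parse_ratio_calls_fraction() -> None:
    # The hook's module no longer has a module-level `Fraction` name to patch,
    # so the mock targets the source module instead.
    with mock.patch("fractions.Fraction") as fake_fraction:
        HypotheticalHook().parse_ratio("1/2")
    fake_fraction.assert_called_once_with("1/2")
```

The string annotation keeps the return type visible to linters and type checkers without a runtime import. This is also why mocks need attention: a test that previously patched the name on the hook's own module has to patch it where it is defined.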

The broad objectives are:

to make OOMs less likely in memory-constrained environments.
to improve start-up time for task execution. This is especially relevant for DAGs with many tasks where only a small fraction use the 3rd party import causing the longer load. In these contexts, saving 100ms on each of 100 tasks adds up to 10 seconds saved across the DAG run (less in wall-clock terms if tasks run in parallel).
to keep Airflow reasonably performant, which I think is a virtue in its own right. Lazy loads do not result in night-and-day improvements, but they're not nothing either.

PRs done so far

A couple of my most recent PRs to Airflow were aimed at lazy-loading 3rd party packages in provider packages:

#67479
#62365

For selfish reasons, I’ve specifically targeted provider packages that I personally use (Snowflake and Slack) and with which I have experienced occasional OOM issues on small workers.

Research and Methodology

I decided to modify (read: have Claude modify) the script I was using to benchmark the performance gains in lazy-loading individual packages (Slack and Snowflake) to run on all provider packages, and find areas for performance improvement across Airflow.

The simple version: I measure the “delta” (in wall-clock time and memory) between a baseline that loads only what is necessary for task execution plus BaseHook and BaseOperator, and the state after importing all the modules in the provider package. I then sort the providers by their deltas to identify the worst offenders, and I also look at the individual packages contributing to each delta.
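The idea can be sketched roughly as follows. This is not the script from the gist: `measure_import_delta` and its parameters are hypothetical names, and `tracemalloc` only counts Python-level allocations and adds overhead of its own, so the numbers will not match the report.

```python
import json
import subprocess
import sys

# Probe executed in a fresh interpreter so earlier imports cannot skew the result.
_PROBE = """
import importlib, json, sys, time, tracemalloc

baseline, target = sys.argv[1], sys.argv[2]
for name in filter(None, baseline.split(",")):
    importlib.import_module(name)  # cost that would be paid regardless

modules_before = set(sys.modules)
tracemalloc.start()
started = time.perf_counter()
importlib.import_module(target)
seconds = time.perf_counter() - started
python_bytes, _peak = tracemalloc.get_traced_memory()

print(json.dumps({
    "seconds": seconds,
    "python_bytes": python_bytes,
    "new_modules": len(set(sys.modules) - modules_before),
}))
"""


def measure_import_delta(target: str, baseline: tuple = ()) -> dict:
    """Import `baseline` first, then report what importing `target` adds."""
    completed = subprocess.run(
        [sys.executable, "-c", _PROBE, ",".join(baseline), target],
        capture_output=True,
        text=True,
        check=True,
    )
    # The probe prints its JSON last, after anything the imports may print.
    return json.loads(completed.stdout.strip().splitlines()[-1])
```

For example, `measure_import_delta("fractions", baseline=("json",))` returns a dict with `seconds`, `python_bytes`, and `new_modules` for a standard-library module. Running the probe in a subprocess matters because a module already present in `sys.modules` costs nothing to import a second time.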

Measuring deltas for packages in isolation is conceptually imperfect: package A can appear slow only because it imports B, and if B has to be imported globally anyway, A is not contributing much overhead of its own. Still, this is good enough for getting a sense of where problems may lie.
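One way to picture the per-package attribution and its caveat is the sketch below. `top_level_breakdown` is a hypothetical helper, the snapshots are invented for illustration, and it counts modules rather than time or memory.

```python
from collections import Counter


def top_level_breakdown(before: set, after: set) -> list:
    """Count newly loaded modules per top-level package, largest first."""
    new_modules = after - before
    counts = Counter(name.split(".", 1)[0] for name in new_modules)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical sys.modules snapshots, invented for illustration.
baseline = {"sys", "os", "requests", "requests.adapters"}
after_provider = baseline | {"pandas", "pandas.core", "pandas.io", "numpy"}

print(top_level_breakdown(baseline, after_provider))
# → [('pandas', 3), ('numpy', 1)]
```

Because `requests` is already in the baseline snapshot, it is not attributed to the provider at all, which is the situation described above where B is imported regardless.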

A full markdown report is here: https://gist.github.com/dwreeves/d3c35354c2305b9a81d0d67a0280830a#file-airflow_optimization_report-md

The script is at the bottom of the gist, and you can run it to generate the report or to just investigate individual packages.

After running the script on all provider packages, I found these to be the biggest offenders:

| Provider | Time Delta | Memory Delta | Modules Loaded | Top 3rd-Party Packages |
| --- | --- | --- | --- | --- |
| `google` | 550ms (+88%) | +168.1MB (+181%) | 2191 (10 tested) | `google`, `pandas`, `requests`, `IPython` (+57) |
| `alibaba` | 687ms (+110%) | +146.1MB (+158%) | 1844 (9 tested) | `pandas`, `odps`, `requests`, `IPython` (+56) |
| `teradata` | 384ms (+61%) | +125.8MB (+136%) | 1589 (10 tested) | `pandas`, `azure`, `requests`, `numpy` (+40) |
| `papermill` | 327ms (+52%) | +103.9MB (+112%) | 1717 (2 tested) | `google`, `github`, `azure`, `requests` (+51) |
| `neo4j` | 181ms (+29%) | +95.4MB (+103%) | 536 (3 tested) | `pandas`, `neo4j`, `numpy`, `pyarrow` (+2) |
| `databricks` | 226ms (+36%) | +72.7MB (+78%) | 888 (10 tested) | `requests`, `numpy`, `chardet`, `oauthlib` (+28) |
| `samba` | 204ms (+33%) | +63.5MB (+68%) | 975 (2 tested) | `google`, `requests`, `chardet`, `aiohttp` (+36) |
| `elasticsearch` | 138ms (+22%) | +62.4MB (+67%) | 685 (1 tested) | `requests`, `elasticsearch`, `numpy`, `chardet` (+18) |
| `trino` | 154ms (+25%) | +58.5MB (+63%) | 936 (2 tested) | `google`, `requests`, `chardet`, `aiohttp` (+37) |
| `presto` | 147ms (+24%) | +57.8MB (+62%) | 929 (2 tested) | `google`, `requests`, `chardet`, `aiohttp` (+36) |
| `airbyte` | 341ms (+54%) | +55.0MB (+59%) | 1201 (3 tested) | `airbyte_api`, `requests`, `chardet`, `urllib3` (+7) |
| `akeyless` | 78ms (+13%) | +51.2MB (+55%) | 1247 (1 tested) | `akeyless`, `urllib3`, `six` |
| `influxdb` | 78ms (+12%) | +47.9MB (+52%) | 641 (4 tested) | `influxdb_client`, `numpy`, `reactivex`, `influxdb_client_3` (+3) |
| `weaviate` | 180ms (+29%) | +46.4MB (+50%) | 902 (2 tested) | `weaviate`, `requests`, `authlib`, `chardet` (+13) |
| `amazon` | 124ms (+20%) | +44.6MB (+48%) | 705 (10 tested) | `requests`, `chardet`, `botocore`, `aiohttp` (+24) |
| `qdrant` | 206ms (+33%) | +44.3MB (+48%) | 275 (2 tested) | `qdrant_client`, `numpy`, `google`, `urllib3` (+3) |
| `mysql` | 87ms (+14%) | +38.9MB (+42%) | 648 (5 tested) | `requests`, `chardet`, `botocore`, `cryptography` (+20) |
| `openlineage` | 145ms (+23%) | +36.6MB (+40%) | 619 (1 tested) | `requests`, `chardet`, `openlineage`, `urllib3` (+7) |
| `snowflake` | 126ms (+20%) | +34.3MB (+37%) | 542 (5 tested) | `requests`, `chardet`, `aiohttp`, `urllib3` (+17) |
| `yandex` | 59ms (+10%) | +34.2MB (+37%) | 550 (5 tested) | `requests`, `chardet`, `yandex`, `google` (+13) |

Learnings

The biggest wins will be in the following provider packages: google, alibaba, teradata, papermill, neo4j, databricks, samba, elasticsearch.

pandas is a common major contributor to the load delta, as are the providers' namesake packages (for example, `neo4j` for the neo4j provider). On my M4 MacBook, pandas adds 150ms of load time and 65.3MB of additional memory.

How to handle this?

This is the tricky part. The protocol is for contributors to test their own provider package changes during the alpha release. However, I honestly don’t use most of the packages identified in this analysis. So although I can write the code to modify the provider packages, I would not be able to test the changes.

I do believe there are a lot of changes that are very straightforward that I could take on based on a reasonable trade-off between complexity, performance gain, and provider package popularity.

It's unclear if this should be one big PR or individual PRs per provider package.

Are you willing to submit a PR?

Yes I am willing to submit a PR!

Code of Conduct

I agree to follow this project's Code of Conduct

Contributor Guide

Tech stack: python
Domain: performance, backend
Issue type: performance
Difficulty: 3
Estimated time: 3-5 days
Activity status: fresh
Clarity: mostly clear
Prerequisites: Understanding of Python imports; familiarity with Airflow provider packages; knowledge of lazy loading patterns
Beginner friendliness: 60
Research direction: The issue identifies several provider packages (google, alibaba, teradata, etc.) where lazy loading 3rd party packages would yield performance gains. Look at previous PRs (#67479, #62365) for implementation patterns. The report gist provides a script to measure deltas. For each provider, identify heavy imports such as pandas or the namesake package and replace them with lazy imports. Ensure unit tests are updated to mock lazy-loaded modules. Follow the project's contribution guidelines.