Provider package performance improvements via lazy loading 3rd party packages
#67,515 created on May 26, 2026
Description
TLDR
- There are a lot of “easy” performance wins via lazy loading across many provider packages, many of which are fairly popular.
- Lazy loading makes sense as an optimization because, throughout DAG code’s lifecycle, the 3rd party packages are actually executed far less often than the DAG code is parsed.
- The biggest wins will be in the following provider packages: google, alibaba, teradata, papermill, neo4j, databricks, samba, elasticsearch.
Motivation
Let’s say a user uses 3 different providers in a single DAG, but only one operator executes in any given task. If each of the 3 provider imports brings a bunch of superfluous stuff into sys.modules that is never actually used, that is an unnecessary performance cost during DAG parsing and during execution of the other tasks.
On the margin, lazy loading improves performance by reducing the memory footprint and speeding up start-ups during the scheduler’s DAG file processing and during task execution.
These changes are also fairly easy and low risk to implement. Linting tends to catch references to names that were removed from module globals but not properly lazy loaded. The most annoying part of implementing this is fixing unit test mocks.
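For readers unfamiliar with the pattern, here is a minimal sketch of what a lazy load and the corresponding mock fix can look like. It is not taken from any provider: `ExampleHook` and `to_hls` are hypothetical names, and the stdlib module `colorsys` stands in for a heavy 3rd party package such as pandas so that the snippet runs anywhere.

```python
import sys
from unittest import mock

# Stand-in for a heavy 3rd party dependency; chosen only because it ships with Python.
HEAVY_MODULE = "colorsys"


class ExampleHook:
    """Hypothetical hook, not an actual Airflow class."""

    def to_hls(self, r, g, b):
        # The import runs on the first call instead of at module import time,
        # so merely parsing a DAG file that imports this hook does not pay for it.
        import colorsys

        return colorsys.rgb_to_hls(r, g, b)


hook = ExampleHook()
result = hook.to_hls(1.0, 0.0, 0.0)
loaded_after_call = HEAVY_MODULE in sys.modules

# After the change, the dependency is no longer a global of the hook's module,
# so a test that patched "<hook module>.colorsys" would fail. Patching the
# attribute on the source module still works with the lazy import.
with mock.patch("colorsys.rgb_to_hls", return_value=(0.0, 0.0, 0.0)) as patched:
    mocked_result = ExampleHook().to_hls(1.0, 0.0, 0.0)
```

When the dependency is only needed for type annotations, an `if TYPE_CHECKING:` import at module level is another common option, since it is skipped at runtime.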
The broad objectives are:
- to make OOMs less likely in memory-constrained environments.
- to improve start-up time for task execution. This is especially relevant for DAGs with many tasks where only a small fraction use the 3rd party import that causes the longer load. In these contexts, saving 100ms per task across 100 tasks adds up to a 10 second speedup for the entire DAG to execute.
- to keep Airflow reasonably performant, which I think is a virtue in its own right. Lazy loads do not result in night-and-day improvements, but they're not nothing either.
PRs done so far
A couple of my most recent PRs to Airflow were aimed at lazy-loading 3rd party packages in provider packages:
- #67479
- #62365
For selfish reasons, I’ve specifically targeted provider packages that I personally use (Snowflake and Slack) and with which I have experienced occasional OOM issues on small workers.
Research and Methodology
I decided to modify (read: have Claude modify) the script I was using to benchmark the performance gains in lazy-loading individual packages (Slack and Snowflake) to run on all provider packages, and find areas for performance improvement across Airflow.
The simple version: I am looking at the “delta” (in clock time and memory) between a baseline that loads what is necessary for task execution plus BaseHook and BaseOperator, and the state after then importing all the modules in the provider package. I then sort the provider packages by their deltas to identify the worst offenders, and I also look at the individual 3rd party packages contributing to each delta.
Measuring deltas for packages in isolation is conceptually imperfect: A can take a long time to import only because it imports B, but if B has to be imported globally anyway then A isn’t contributing much overhead. Still, this is good enough for getting a sense of where problems may lie.
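To make the idea of a “delta” concrete, here is a minimal sketch of that kind of measurement. It is not the script from the gist. It assumes a fresh subprocess per measurement, uses `tracemalloc` (which only counts Python-level allocations, not process RSS), and uses the stdlib modules `json` and `csv` as stand-ins for the baseline and for a provider module. `import_delta` is a hypothetical helper name.

```python
import json
import subprocess
import sys

# Code run in a fresh interpreter so earlier imports cannot skew the numbers.
CHILD = """
import importlib, json, sys, time, tracemalloc
baseline, target = sys.argv[1], sys.argv[2]
importlib.import_module(baseline)           # load the baseline first
before = set(sys.modules)
tracemalloc.start()
start = time.perf_counter()
importlib.import_module(target)             # then the module under test
elapsed_ms = (time.perf_counter() - start) * 1000
mem_bytes, _ = tracemalloc.get_traced_memory()
new_modules = sorted(set(sys.modules) - before)
print(json.dumps({"ms": elapsed_ms, "bytes": mem_bytes, "modules": new_modules}))
"""


def import_delta(baseline: str, target: str) -> dict:
    """Return time, traced memory, and newly loaded modules for `target`."""
    out = subprocess.run(
        [sys.executable, "-c", CHILD, baseline, target],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return json.loads(out)


# Stand-in names; a real run would pass Airflow's base modules and a provider module.
delta = import_delta("json", "csv")
print(f"{delta['ms']:.1f}ms, {delta['bytes']} bytes, {len(delta['modules'])} new modules")
```

The list of newly loaded modules is what lets you attribute a delta to individual 3rd party packages, subject to the A-via-B caveat above.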
A full markdown report is here: https://gist.github.com/dwreeves/d3c35354c2305b9a81d0d67a0280830a#file-airflow_optimization_report-md

The script is at the bottom of the gist, and you can run it to generate the report or to just investigate individual packages.

After running it on all provider packages, I found these to be the biggest offenders, sorted by memory delta:
| Provider | Time Delta | Memory Delta | Modules Loaded | Top 3rd-Party Packages |
|---|---|---|---|---|
| google | 550ms (+88%) | +168.1MB (+181%) | 2191 (10 tested) | google, pandas, requests, IPython (+57) |
| alibaba | 687ms (+110%) | +146.1MB (+158%) | 1844 (9 tested) | pandas, odps, requests, IPython (+56) |
| teradata | 384ms (+61%) | +125.8MB (+136%) | 1589 (10 tested) | pandas, azure, requests, numpy (+40) |
| papermill | 327ms (+52%) | +103.9MB (+112%) | 1717 (2 tested) | google, github, azure, requests (+51) |
| neo4j | 181ms (+29%) | +95.4MB (+103%) | 536 (3 tested) | pandas, neo4j, numpy, pyarrow (+2) |
| databricks | 226ms (+36%) | +72.7MB (+78%) | 888 (10 tested) | requests, numpy, chardet, oauthlib (+28) |
| samba | 204ms (+33%) | +63.5MB (+68%) | 975 (2 tested) | google, requests, chardet, aiohttp (+36) |
| elasticsearch | 138ms (+22%) | +62.4MB (+67%) | 685 (1 tested) | requests, elasticsearch, numpy, chardet (+18) |
| trino | 154ms (+25%) | +58.5MB (+63%) | 936 (2 tested) | google, requests, chardet, aiohttp (+37) |
| presto | 147ms (+24%) | +57.8MB (+62%) | 929 (2 tested) | google, requests, chardet, aiohttp (+36) |
| airbyte | 341ms (+54%) | +55.0MB (+59%) | 1201 (3 tested) | airbyte_api, requests, chardet, urllib3 (+7) |
| akeyless | 78ms (+13%) | +51.2MB (+55%) | 1247 (1 tested) | akeyless, urllib3, six |
| influxdb | 78ms (+12%) | +47.9MB (+52%) | 641 (4 tested) | influxdb_client, numpy, reactivex, influxdb_client_3 (+3) |
| weaviate | 180ms (+29%) | +46.4MB (+50%) | 902 (2 tested) | weaviate, requests, authlib, chardet (+13) |
| amazon | 124ms (+20%) | +44.6MB (+48%) | 705 (10 tested) | requests, chardet, botocore, aiohttp (+24) |
| qdrant | 206ms (+33%) | +44.3MB (+48%) | 275 (2 tested) | qdrant_client, numpy, google, urllib3 (+3) |
| mysql | 87ms (+14%) | +38.9MB (+42%) | 648 (5 tested) | requests, chardet, botocore, cryptography (+20) |
| openlineage | 145ms (+23%) | +36.6MB (+40%) | 619 (1 tested) | requests, chardet, openlineage, urllib3 (+7) |
| snowflake | 126ms (+20%) | +34.3MB (+37%) | 542 (5 tested) | requests, chardet, aiohttp, urllib3 (+17) |
| yandex | 59ms (+10%) | +34.2MB (+37%) | 550 (5 tested) | requests, chardet, yandex, google (+13) |
Learnings
The biggest wins will be in the following provider packages: google, alibaba, teradata, papermill, neo4j, databricks, samba, elasticsearch.
pandas is a common major contributor to the load delta, as are the providers' namesake packages (e.g. neo4j, elasticsearch). On my M4 MacBook, pandas adds 150ms of load time and 65.3MB of additional memory.
How to handle this?
This is the tricky part. The protocol is for contributors to test their own provider package changes during the alpha release. However, most of the packages identified in this analysis are ones I don’t actually use. So although I can write the code to modify these provider packages, I would not be able to test the changes.
I do believe there are a lot of changes that are very straightforward that I could take on based on a reasonable trade-off between complexity, performance gain, and provider package popularity.
It's unclear if this should be one big PR or individual PRs per provider package.
Are you willing to submit a PR?
- Yes I am willing to submit a PR!
Code of Conduct
- I agree to follow this project's Code of Conduct