apache/airflow

Ability to annotate DAG's source (repo, tag, commit, etc)

Open

#51,321 opened on Jun 2, 2025

Labels: area:task-sdk, good first issue, kind:feature

Description

The OpenLineage integration can report which DAGs and tasks were executed, which datasets were read or written, and so on. It also emits many facets describing DAGs and tasks, e.g. scheduled time, owner, tags and so on.

One of the OpenLineage facets is SourceCodeLocation, which lets users see which repo the DAG source code originated from, as well as the branch name, commit, tag and so on.

But there is currently no way to annotate a DAG so that this information can be collected and sent as a facet. Something like this could probably be implemented:

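# Proposed API: SourceCodeLocation and the source_code_location argument
# do not exist in airflow.sdk yet.
from airflow.sdk import DAG, SourceCodeLocation

dag = DAG(
    dag_id="my_dag",
    source_code_location=SourceCodeLocation(
        repo_url="http://github.com/user/repo",
        branch="main",
        path="/some/dag.py",
        version="6f1cab2f",  # commit the DAG file was deployed from
    ),
)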
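
To make the proposal a bit more concrete, here is a minimal sketch of what such an annotation object could look like as a plain frozen dataclass. Everything in it is hypothetical: `SourceCodeLocation` is not part of `airflow.sdk` today, and the `to_facet_dict` helper and its camelCase key names are assumptions modelled on the fields of the OpenLineage SourceCodeLocation facet, not an existing API.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class SourceCodeLocation:
    """Hypothetical DAG annotation; not an existing airflow.sdk class."""

    repo_url: str
    path: str | None = None
    branch: str | None = None
    tag: str | None = None
    version: str | None = None  # e.g. a commit hash

    def to_facet_dict(self) -> dict[str, str]:
        """Return only the fields that are set, keyed by assumed facet field names."""
        candidates = {
            "repoUrl": self.repo_url,
            "path": self.path,
            "branch": self.branch,
            "tag": self.tag,
            "version": self.version,
        }
        return {key: value for key, value in candidates.items() if value is not None}


location = SourceCodeLocation(
    repo_url="http://github.com/user/repo",
    branch="main",
    path="/some/dag.py",
    version="6f1cab2f",
)
print(location.to_facet_dict())
```

Keeping the object immutable and free of Airflow imports would let the DAG processor serialize it together with the DAG, and leave the mapping to an actual facet to the OpenLineage provider.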

Use case/motivation

My use case is:

  • There is a repo with ETL scripts and DAG .py files. Multiple users work with the code in this repo and create branches with script/DAG updates.
  • Each branch (e.g. a bugfix/123 or feature/123 branch) is deployed to Airflow. To prevent users from overwriting each other's working files, the /opt/airflow/dags folder has the structure {repo_name}/{branch_name}/my_dag.py.
  • For Airflow on a host, a branch can be deployed using git clone or rsync.
  • For Airflow in Kubernetes, this is done via a custom sidecar which clones every branch of a repo to create this directory structure.
  • After a branch is merged, the directory containing the DAGs for this branch is removed from Airflow.
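
With the layout described above, the repo and branch names are already encoded in the deployed path, so a DAG file could derive them from its own location when building an annotation. The sketch below is only an illustration: `split_deployed_path` is a hypothetical helper, and it assumes the DAG file sits directly inside the branch directory, so that every path component between the repo name and the file name belongs to the branch (which keeps names like bugfix/123 intact).

```python
from __future__ import annotations

from pathlib import PurePosixPath


def split_deployed_path(
    dag_file: str, dags_root: str = "/opt/airflow/dags"
) -> tuple[str, str, str]:
    """Split <dags_root>/<repo>/<branch...>/<file> into (repo, branch, file name).

    Assumption: the DAG file lives directly in the branch directory, so all
    components between the repo name and the file name form the branch name.
    """
    # Raises ValueError if dag_file is not located under dags_root.
    relative = PurePosixPath(dag_file).relative_to(dags_root)
    parts = relative.parts
    if len(parts) < 3:
        raise ValueError(
            f"expected <repo>/<branch>/<file> under {dags_root}, got {relative}"
        )
    repo, *branch_parts, file_name = parts
    return repo, "/".join(branch_parts), file_name
```

A DAG file could call this with `__file__`. The commit hash is not recoverable from the path alone, so it would still have to be supplied by whatever deploys the branch.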

I'm not using the gitsync sidecar because it clones only one specific branch (e.g. main) and cannot be reconfigured without redeploying the Airflow workers and DAG processor. The same applies to GitDagBundle in Airflow 3.x: bundles are set up in airflow.cfg, not via the API, and changing this config requires restarting the DagProcessor.

Also, the OpenLineage integration for Airflow is set up for the entire Airflow instance, not for each DAG individually. But there can be DAGs from multiple repos, branches, or commits within the same branch, and this cannot be passed to OpenLineage via config (https://github.com/OpenLineage/OpenLineage/issues/3745).

It is possible to parse all this information from the contents of the .git folder (https://github.com/OpenLineage/OpenLineage/issues/3746), but the presence of this folder depends on how the Airflow DAGs are deployed. Also, shipping the .git folder may be a security issue in some use cases (e.g. a public SaaS instance with private Git repos).

That's why I'm only proposing some kind of DAG object annotation/option here.
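
For comparison, the .git-based approach could look roughly like the sketch below, which reads the current branch and commit straight from the repository files without calling the git CLI. `read_git_head` is a hypothetical helper written for this illustration; it only covers a regular (non-worktree) .git directory and is not how OpenLineage implements it.

```python
from __future__ import annotations

from pathlib import Path


def read_git_head(git_dir: str) -> tuple[str | None, str | None]:
    """Return (branch, commit) read from a .git directory."""
    head = (Path(git_dir) / "HEAD").read_text().strip()
    if not head.startswith("ref: "):
        # Detached HEAD: the file holds the commit hash itself.
        return None, head

    ref = head[len("ref: "):]  # e.g. refs/heads/main
    branch = ref.removeprefix("refs/heads/")

    # Loose ref: one file per branch containing the commit hash.
    ref_file = Path(git_dir) / ref
    if ref_file.is_file():
        return branch, ref_file.read_text().strip()

    # Packed refs: "<hash> <ref>" lines in a single file.
    packed = Path(git_dir) / "packed-refs"
    if packed.is_file():
        for line in packed.read_text().splitlines():
            commit, _, name = line.partition(" ")
            if name == ref:
                return branch, commit

    return branch, None
```

This only works when the deployment keeps the .git directory next to the DAG files, which is exactly the dependency described above.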

Related issues

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
