Best Practices
For Effectively Running
dbt in Airflow
Tatiana Al-Chueyr
Staff Software Engineer
George Yates
Field Engineer
Tell us about you!
dbt in Airflow
32.5% of the 2023 Apache Airflow Survey respondents use dbt
Popular OSS tools to run dbt in Airflow
Adoption of OSS tools to run dbt in Airflow
PyPI downloads for popular OSS tools used to run dbt in Airflow
dbt & Airflow community approaches
53.4% of the dbt in Airflow survey respondents don’t use any OSS tools
Objective
● What problem are you trying to solve?
● What should the dbt pipeline look like in Airflow?
Setup
● How do you install dbt alongside Airflow?
● How do you install dbt dependencies?
Rendering
● How do you parse the dbt project?
● How do you select a subset of the original dbt project?
● Where are the tests in the pipeline?
Execution
● Where do you execute the dbt commands?
Questions when running dbt in Airflow
Objective
Run a dbt command in Airflow
Simply run a dbt command, similar to how we run the dbt-core CLI locally in the terminal.
Convert dbt models into Airflow tasks
Convert the dbt pipeline into an Airflow DAG and have Airflow run the dbt models as tasks.
Submit dbt pipelines to run in dbt Cloud
Use Airflow to trigger a job to be run in dbt Cloud. Airflow should treat the dbt job as a black box.
● What problem are we trying to solve?
● What should the dbt pipeline look like in Airflow?
Objective run a dbt command in Airflow
import os
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

DBT_PROJECT_DIR = os.getenv("DBT_PROJECT_DIR", "/usr/local/airflow/dbt/jaffle_shop")  # example path, adjust to your project location

with DAG(
    "dbt_basic_dag",
    start_date=datetime(2020, 12, 23),
    description="A sample Airflow DAG to invoke dbt runs using a BashOperator",
    schedule_interval=None,
    catchup=False,
) as dag:
    dbt_seed = BashOperator(
        task_id="dbt_seed",
        bash_command=f"dbt seed --profiles-dir {DBT_PROJECT_DIR} --project-dir {DBT_PROJECT_DIR}",
    )
    dbt_run = BashOperator(
        task_id="dbt_run",
        bash_command=f"dbt run --profiles-dir {DBT_PROJECT_DIR} --project-dir {DBT_PROJECT_DIR}",
    )
    dbt_test = BashOperator(
        task_id="dbt_test",
        bash_command=f"dbt test --profiles-dir {DBT_PROJECT_DIR} --project-dir {DBT_PROJECT_DIR}",
    )

    dbt_seed >> dbt_run >> dbt_test
Recommendation: BashOperator or Kubernetes Operator
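The slide shows the BashOperator variant; for the Kubernetes Operator route, a minimal hedged sketch is below. The image name, namespace, and in-container paths are placeholders, and the import path varies with the cncf-kubernetes provider version:

from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

dbt_run_in_pod = KubernetesPodOperator(
    task_id="dbt_run_in_pod",
    namespace="default",  # placeholder namespace
    image="my-dbt-project:latest",  # placeholder image with dbt, the adapter and the project baked in
    cmds=["dbt"],
    arguments=["run", "--project-dir", "/usr/app/dbt", "--profiles-dir", "/usr/app/dbt"],
    get_logs=True,
)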
Objective run the dbt pipeline in dbt Cloud
Recommendation: apache-airflow-providers-dbt-cloud
from datetime import datetime

from airflow.decorators import dag
from airflow.operators.empty import EmptyOperator
from airflow.operators.python import ShortCircuitOperator
from airflow.providers.dbt.cloud.operators.dbt import DbtCloudRunJobOperator

@dag(
    start_date=datetime(2022, 2, 10),
    schedule_interval="@daily",
    catchup=False,
    default_view="graph",
    doc_md=__doc__,
)
def check_before_running_dbt_cloud_job():
    begin, end = [EmptyOperator(task_id=id) for id in ["begin", "end"]]
    check_job = ShortCircuitOperator(
        task_id="check_job_is_not_running",
        python_callable=_check_job_not_running,
        op_kwargs={"job_id": JOB_ID},
    )
    trigger_job = DbtCloudRunJobOperator(
        task_id="trigger_dbt_cloud_job",
        dbt_cloud_conn_id=DBT_CLOUD_CONN_ID,
        job_id=JOB_ID,
        check_interval=600,
        timeout=3600,
    )
    begin >> check_job >> trigger_job >> end

check_before_running_dbt_cloud_job()
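The DAG above calls _check_job_not_running without defining it (JOB_ID and DBT_CLOUD_CONN_ID likewise come from your dbt Cloud job and Airflow connection). A hedged sketch of the callable, assuming the provider's DbtCloudHook and DbtCloudJobRunStatus; the list_job_runs ordering and response parsing are assumptions to verify against your provider version:

from airflow.providers.dbt.cloud.hooks.dbt import DbtCloudHook, DbtCloudJobRunStatus

def _check_job_not_running(job_id):
    """Return True only if the latest run of the given dbt Cloud job is in a terminal state."""
    hook = DbtCloudHook(DBT_CLOUD_CONN_ID)
    runs = hook.list_job_runs(job_definition_id=job_id, order_by="-id")
    latest_run = runs[0].json()["data"][0]
    return DbtCloudJobRunStatus.is_terminal(latest_run["status"])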
Objective convert dbt models into Airflow tasks
Recommendation: astronomer-cosmos
import os
from datetime import datetime
from pathlib import Path

from cosmos import DbtDag, ProjectConfig, ProfileConfig
from cosmos.profiles import PostgresUserPasswordProfileMapping

DEFAULT_DBT_ROOT_PATH = Path(__file__).parent / "dbt"
DBT_ROOT_PATH = Path(os.getenv("DBT_ROOT_PATH", DEFAULT_DBT_ROOT_PATH))

profile_config = ProfileConfig(
    profile_name="jaffle_shop",
    target_name="dev",
    profile_mapping=PostgresUserPasswordProfileMapping(
        conn_id="airflow_db",
        profile_args={"schema": "public"},
    ),
)

basic_cosmos_dag = DbtDag(
    project_config=ProjectConfig(
        DBT_ROOT_PATH / "jaffle_shop",
    ),
    profile_config=profile_config,
    schedule_interval="@daily",
    start_date=datetime(2023, 1, 1),
    catchup=False,
    dag_id="basic_cosmos_dag",
)
Approaches trade-offs

Cosmos (one task / model)
● Failing dbt node is easy to identify
● Efficient retries
● Slower DAG parsing
● Worker slots proportional to dbt nodes
● Synchronous
● No vendor lock-in
● Independent downstream use cases can succeed
● One database connection per task or dbt model

dbt Cloud provider (one task / job)
● Hard to identify failing dbt node
● Inefficient retry (re-run all dbt nodes)
● Fast DAG parsing
● Few worker slots
● Asynchronous or Synchronous
● Vendor lock-in
● Downstream use cases dependent on every dbt node succeeding
● Single database connection to run all job transformations

BashOperator (one task / cmd)
● Hard to identify failing dbt node
● Inefficient retry (re-run all dbt nodes)
● Fast DAG parsing
● Few worker slots
● Synchronous
● No vendor lock-in
● Downstream use cases dependent on every dbt node succeeding
● One database connection per dbt command
dbt Cloud or dbt-core & Airflow

dbt Cloud
Proprietary hosted platform for running dbt jobs
● Schedule jobs
● Built-in CI/CD
● Host documentation
● Monitor and alert
● Built-in Cloud IDE
● Run commands locally

dbt-core & Airflow
Open-Source platform that can run dbt jobs (and others)
● Schedule jobs (Cosmos & Airflow)
● Pre-existing CI/CD
● Host documentation (Cosmos)
● Monitor and alert (Airflow)
● Pre-existing IDE
● Run commands locally (dbt-core)
● Unified tool for scheduling Analytics, DS and DE workflows
dbt docs in Airflow with Cosmos
Generate and host your dbt docs with Cosmos
● DbtDocsOperator
● DbtDocsAzureStorageOperator
● DbtDocsS3Operator
● DbtDocsGCSOperator
from cosmos.operators import DbtDocsGCSOperator

generate_dbt_docs_gcs = DbtDocsGCSOperator(
    task_id="generate_dbt_docs_gcs",
    project_dir="path/to/jaffle_shop",
    profile_config=profile_config,
    # docs-specific arguments
    connection_id="test_gcs",
    bucket_name="test_bucket",
)
But… what is Cosmos?
Open-Source library that helps you run dbt-core in Apache Airflow
Try it out:
$ pip install astronomer-cosmos
Check the repo:
http://github.com/astronomer/astronomer-cosmos
Talk to the project developers and users:
#airflow-dbt channel in the Apache Airflow Slack
Demo
Objective
● What problem are you trying to solve?
● What should the dbt pipeline look like in Airflow?
Setup
● How do you install dbt alongside Airflow?
● How do you install dbt dependencies?
Rendering
● How do you parse the dbt project?
● How do you select a subset of the original dbt project?
● Where are the tests in the pipeline?
Execution
● Where do you execute the dbt commands?
Questions when running dbt in Airflow
Setup
● How do you install dbt alongside Airflow?
● How do you install dbt dependencies?
1. Parsing
● How do you parse the dbt project?
● How do you select a subset of the original dbt project?
2. Execution
● Where do you execute the dbt commands?
ERROR: Cannot install apache-airflow, apache-airflow==2.7.0, and dbt-core==1.4.0 because these package versions have conflicting
dependencies.
The conflict is caused by:
dbt-core 1.4.0 depends on pyyaml>=6.0
connexion 2.12.0 depends on PyYAML<6 and >=5.1
dbt-core 1.4.0 depends on pyyaml>=6.0
connexion 2.11.2 depends on PyYAML<6 and >=5.1
dbt-core 1.4.0 depends on pyyaml>=6.0
connexion 2.11.1 depends on PyYAML<6 and >=5.1
dbt-core 1.4.0 depends on pyyaml>=6.0
connexion 2.11.0 depends on PyYAML<6 and >=5.1
apache-airflow 2.7.0 depends on jsonschema>=4.18.0
flask-appbuilder 4.3.3 depends on jsonschema<5 and >=3
connexion 2.10.0 depends on jsonschema<4 and >=2.5.1
Setup dbt core installation

Can you install dbt and Airflow in the same Python environment?

Yes: All set, no configuration needed!
By default, Cosmos will use:
⏺ ExecutionMode.LOCAL
⏺ InvocationMode.DBT_RUNNER (can run dbt commands 40% faster)

No: Can you create and manage a dedicated Python environment alongside Airflow?

Yes: Two additional steps:
1. If using Astro, create the virtualenv as part of your Docker image build.
2. Tell Cosmos where the dbt binary is. You will still be using the default ExecutionMode.LOCAL.

FROM quay.io/astronomer/astro-runtime:11.3.0
RUN python -m venv dbt_venv && \
    source dbt_venv/bin/activate && \
    pip install --no-cache-dir <your-dbt-adapter> && \
    deactivate

DbtDag(
    ...,
    execution_config=ExecutionConfig(
        dbt_executable_path=Path("/usr/local/airflow/dbt_venv/bin/dbt"),
    ),
    # with ExecutionMode.VIRTUALENV you would instead pass:
    # operator_args={"py_requirements": ["dbt-postgres==1.6.0b1"]},
)

(If neither is possible, see the next slide.)
Setup dbt core installation

Do you still want to have dbt installed in the Airflow nodes?

No: You don’t have to have dbt in the Airflow nodes to benefit from Cosmos!
You can leverage the:
● LoadMode.DBT_MANIFEST
● ExecutionMode.KUBERNETES
More information on these in the next slides!

Yes: Good news, Cosmos can create and manage the dbt Python virtualenv for you!
You can use the ExecutionMode.VIRTUALENV:

DbtDag(
    ...,
    execution_config=ExecutionConfig(
        execution_mode=ExecutionMode.VIRTUALENV,
    )
)
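With ExecutionMode.VIRTUALENV, Cosmos builds the virtualenv at task run time, so it needs to know which dbt adapter to install. A minimal sketch, reusing the deck's `...` placeholder for the project and profile configs; the dbt-postgres requirement is an example, and py_system_site_packages is a Cosmos virtualenv operator argument as far as I know (verify against your Cosmos version):

from cosmos import DbtDag, ExecutionConfig
from cosmos.constants import ExecutionMode

DbtDag(
    ...,
    execution_config=ExecutionConfig(
        execution_mode=ExecutionMode.VIRTUALENV,
    ),
    operator_args={
        "py_requirements": ["dbt-postgres"],  # adapter Cosmos installs into the managed venv
        "py_system_site_packages": False,
    },
)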
Setup dbt deps

Are you able to pre-install dbt deps in your Airflow deployment?

Yes: That’s the most efficient approach. Tell Cosmos to not worry about the dbt packages.yml file:

DbtDag(
    ...,
    render_config=RenderConfig(dbt_deps=False),
    operator_args={"install_deps": False},
)

No: No problem, by default Cosmos will run dbt deps before running any dbt command. No configuration needed.
Setup database connections

Do you manage your database credentials in Airflow?

Yes: Cosmos has an extensible set of ProfileMapping classes that can automatically create the dbt profiles.yml from Airflow Connections.

profile_config = ProfileConfig(
    profile_name="my_profile_name",
    target_name="my_target_name",
    profile_mapping=SnowflakeUserPasswordProfileMapping(
        conn_id="my_snowflake_conn_id",
        profile_args={
            "database": "my_snowflake_database",
            "schema": "my_snowflake_schema",
        },
    ),
)

dag = DbtDag(
    profile_config=profile_config,
)

No: No problem, Cosmos also allows you to define your own profiles.yml.

profile_config = ProfileConfig(
    profile_name="my_snowflake_profile",
    target_name="dev",
    profiles_yml_filepath="/path/to/profiles.yml",
)
Rendering
● How do you parse the dbt project?
● How do you select a subset of the original dbt project?
● How are tests represented?
1. Execution
● Where do you execute the dbt commands?
Rendering how to parse the dbt project

Are you able to pre-compile your dbt project?

Yes: Great, this will help the Airflow DAG processor to save time.

DbtDag(
    project_config=ProjectConfig(
        manifest_path="/path/to/manifest.json"),
    render_config=RenderConfig(
        load_method=LoadMode.DBT_MANIFEST))

No: Is dbt available in the Airflow scheduler?

Yes: Cosmos will use the dbt ls command to identify the pipeline topology. The output is cached and refreshed automatically. It can also be purged manually.

No: Cosmos will use its built-in dbt project parser (this implementation is a best effort, and may not support all dbt project features).
Rendering selecting a subset of nodes

Did you use dbt ls to parse the project?

Yes: Great, you can use any selector flag available in the version of dbt you’re using: select, exclude, selector.

DbtDag(
    render_config=RenderConfig(
        load_method=LoadMode.DBT_LS,
        selector="my_selector"
    )
)

No: Cosmos will use a custom implementation of dbt selectors to exclude and select nodes. The dbt YAML selector is not currently supported.
The following features are supported: selecting based on tags, paths, config.materialized, graph operators, tag intersections.

DbtDag(
    render_config=RenderConfig(
        exclude=["node_name+"],  # node and its children
    )
)

DbtDag(
    render_config=RenderConfig(  # intersection
        select=["tag:include_tag1,tag:include_tag2"]
    )
)
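Since path-based selection is listed among the supported features, a DAG can also be limited to a subfolder of models; a small sketch (models/staging is a placeholder path):

DbtDag(
    render_config=RenderConfig(
        select=["path:models/staging"],  # only nodes under this folder
    )
)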
Rendering representing test nodes

Do you want to hide dbt tests?

Yes: That’s fine, tell Cosmos to not render them with TestBehavior.NONE

DbtDag(
    render_config=RenderConfig(
        test_behavior=TestBehavior.NONE,
    )
)

No: Would you like to run all the tests by the end of the DAG?

Yes: That’s easy, use TestBehavior.AFTER_ALL

DbtDag(
    render_config=RenderConfig(
        test_behavior=TestBehavior.AFTER_ALL,
    )
)

No: This is Cosmos’ default behavior! It allows you to run the tests that relate to a specific model, snapshot or seed together with the specific dbt node they relate to.
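The default described above can also be set explicitly; as far as I know it corresponds to TestBehavior.AFTER_EACH:

DbtDag(
    render_config=RenderConfig(
        test_behavior=TestBehavior.AFTER_EACH,  # run each node's tests right after that node
    )
)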
Rendering dbt within a DAG with DbtTaskGroup
Cosmos offers DbtTaskGroup so you can mix dbt and non-dbt tasks in the same DAG:

from datetime import datetime

from airflow.decorators import dag
from airflow.operators.empty import EmptyOperator
from cosmos import DbtTaskGroup, ProjectConfig

@dag(
    schedule_interval="@daily",
    start_date=datetime(2023, 1, 1),
    catchup=False,
)
def basic_cosmos_task_group() -> None:
    pre_dbt = EmptyOperator(task_id="pre_dbt")

    customers = DbtTaskGroup(
        group_id="customers",
        project_config=ProjectConfig((DBT_ROOT_PATH / "jaffle_shop").as_posix(), dbt_vars={"var": "2"}),
        profile_config=profile_config,
        default_args={"retries": 2},
    )

    pre_dbt >> customers

basic_cosmos_task_group()
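Because the task group behaves like any other Airflow task, non-dbt work can also be chained downstream; a small sketch extending the dependency line inside the same DAG function (post_dbt is a hypothetical task added here for illustration):

    post_dbt = EmptyOperator(task_id="post_dbt")

    pre_dbt >> customers >> post_dbt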
Rendering nodes your way
Cosmos allows users to define how to convert dbt nodes into Airflow tasks:

from airflow.models.dag import DAG
from airflow.operators.empty import EmptyOperator
from airflow.utils.task_group import TaskGroup

from cosmos import DbtDag, ProjectConfig, RenderConfig
from cosmos.constants import DbtResourceType
from cosmos.dbt.graph import DbtNode

def convert_source(dag: DAG, task_group: TaskGroup, node: DbtNode, **kwargs):
    """
    Return an instance of a desired operator to represent a dbt "source" node.
    """
    return EmptyOperator(dag=dag, task_group=task_group, task_id=f"{node.name}_source")

render_config = RenderConfig(
    node_converters={
        DbtResourceType("source"): convert_source,  # known dbt node type to Cosmos (part of DbtResourceType)
    }
)

project_config = ProjectConfig(
    DBT_ROOT_PATH / "simple",
    env_vars={"DBT_SQLITE_PATH": DBT_SQLITE_PATH},
    dbt_vars={"animation_alias": "top_5_animated_movies"},
)

example_cosmos_sources = DbtDag(
    project_config=project_config,
    profile_config=profile_config,
    render_config=render_config,
)
Execution
Execution how to run the dbt commands

Is dbt available in the Airflow worker nodes?

Yes: Cosmos will use the ExecutionMode.LOCAL by default.
Users can also use a pre-created dbt Python virtualenv or ask Cosmos to create/manage one. Review “Setup dbt core installation” for more.
Uses partial parsing, which can speed up task runs by 35%.

No: Would you like to run dbt from within a Docker container in the Airflow worker node?

Yes: Create a Docker container image with your dbt project and use Cosmos ExecutionMode.DOCKER:

DbtDag(
    execution_config=ExecutionConfig(
        execution_mode=ExecutionMode.DOCKER,
    ),
    operator_args={
        "image": "dbt-jaffle-shop:1.0.0",
        "network_mode": "bridge",
    },
)

No: Are you running within a Cloud provider (AWS or Azure)?

Yes: Create a Docker container image and delegate the execution of dbt commands to AWS EKS or Azure Container Instance, for example:

DbtDag(
    execution_config=ExecutionConfig(
        execution_mode=ExecutionMode.AWS_EKS,
    ),
    operator_args={
        "image": "dbt-jaffle-shop:1.0.0",
        "cluster_name": CLUSTER_NAME,
        "get_logs": True,
        "is_delete_operator_pod": False,
    },
)

No: Run the dbt commands in Airflow Kubernetes pods, as shown below.
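Complementing the EKS example above, the slide also names Azure Container Instance. A heavily hedged sketch of that path, assuming Cosmos exposes ExecutionMode.AZURE_CONTAINER_INSTANCE and passes these Azure Container Instances arguments through to the underlying operator (the connection ids, resource group, name, and region below are assumptions to verify against the Cosmos docs):

DbtDag(
    execution_config=ExecutionConfig(
        execution_mode=ExecutionMode.AZURE_CONTAINER_INSTANCE,
    ),
    operator_args={
        "ci_conn_id": "aci",  # assumed Azure Container Instances connection id
        "registry_conn_id": "acr_docker",  # assumed container registry connection id
        "resource_group": "my-resource-group",  # placeholder
        "name": "dbt-jaffle-shop",  # placeholder container group name
        "image": "dbt-jaffle-shop:1.0.0",
        "region": "West Europe",  # placeholder
    },
)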
Execution how to run the dbt commands

Create a Docker container image with your dbt project.
Run the dbt commands using Airflow Kubernetes Pods with Cosmos ExecutionMode.KUBERNETES:

postgres_password_secret = Secret(
    deploy_type="env",
    deploy_target="POSTGRES_PASSWORD",
    secret="postgres-secrets",
    key="password",
)

k8s_cosmos_dag = DbtDag(
    # ...
    execution_config=ExecutionConfig(
        execution_mode=ExecutionMode.KUBERNETES,
    ),
    operator_args={
        "image": "dbt-jaffle-shop:1.0.0",
        "get_logs": True,
        "is_delete_operator_pod": False,
        "secrets": [postgres_password_secret],
    },
)
Troubleshooting

1. I cannot see my DbtDag
○ Do your DAG files contain the words DAG and airflow?
■ Set AIRFLOW__CORE__DAG_DISCOVERY_SAFE_MODE=False
2. I still cannot see my DbtDags
○ Are you using LoadMode.AUTOMATIC (default) or LoadMode.DBT_LS?
■ Increase AIRFLOW__CORE__DAGBAG_IMPORT_TIMEOUT
○ Check the DAG processor / scheduler logs for errors
3. The performance is suboptimal (latency or resource utilization)
● Try using the latest Cosmos release
● Leverage Cosmos caching mechanisms
● For very large dbt pipelines, we recommend using LoadMode.DBT_MANIFEST
● Pre-install dbt deps in your Airflow environment
● If possible, use ExecutionMode.LOCAL and InvocationMode.DBT_RUNNER
Review the speed tips in the slides marked with the speed symbol.
Cost Savings
● Single technology to orchestrate & schedule dbt and non-dbt jobs
● Costs are not proportional to your team size
● No limit to the number of dbt projects you can run
● No vendor lock-in
Cosmos by the numbers
Cosmos adoption
● 900k+ downloads in a month (June-July 2024)
○ almost 10% of the dbt-core downloads per month in the same period!
● 516 stars on GitHub (9 July 2024)
https://pypistats.org/packages/astronomer-cosmos
Cosmos adoption in Astro
74 Astro customers use Cosmos to run their dbt projects (18.6.24)
Cosmos adoption in the community
27.4% of the community reported using Cosmos to run dbt in Airflow.
https://bit.ly/dbt-airflow-survey-2024
Cosmos community
Out of the four Cosmos committers, two are from the OSS community and have never worked at Astronomer.
Between December 2022 and July 2024, Cosmos had 93
contributors who merged 782 commits into main.
There are 677 members in the #airflow-dbt Airflow Slack with daily
interactions.
Cosmos future state
Cosmos future possibilities
● Read and write dbt artifacts from remote shared storage
● Improve openlineage and dataset support
● Support running dbt compiled SQL using native Airflow operators (async support!)
● Allow users to create Cosmos DAGs from YAML files
● Allow users to build less granular DAGs
● Support setting arguments per dbt node
● Further improve performance
● Leverage latest Airflow features
Get involved
dbt in Airflow survey
https://bit.ly/cosmos-survey-2024
Thank you!
Any questions?