Best Practices for Effectively Running dbt in Airflow
As a popular open-source library for analytics engineering, dbt is often used in combination with Airflow. Orchestrating and executing dbt models as DAGs adds a layer of control and observability over tasks and provides a reliable, scalable environment in which to run dbt models.
This webinar will cover a step-by-step guide to Cosmos, an open-source package from Astronomer that helps you easily run your dbt Core projects as Airflow DAGs and Task Groups, all with just a few lines of code. We’ll walk through:
- Standard ways of running dbt (and when to utilize other methods)
- How Cosmos can be used to run and visualize your dbt projects in Airflow
- Common challenges and how to address them, including performance, dependency conflicts, and more
- How running dbt projects in Airflow helps with cost optimization
Webinar given on 9 July 2024
6. Adoption of OSS tools to run dbt in Airflow
PyPI downloads for popular OSS tools used to run dbt in Airflow
7. dbt & Airflow community approaches
53.4% of the dbt in Airflow survey respondents don’t use any OSS tools
8. Questions when running dbt in Airflow
Objective
● What problem are you trying to solve?
● What should the dbt pipeline look like in Airflow?
Setup
● How do you install dbt alongside Airflow?
● How do you install dbt dependencies?
Rendering
● How do you parse the dbt project?
● How do you select a subset of the original dbt project?
● Where are the tests in the pipeline?
Execution
● Where do you execute the dbt commands?
10. Objective
● What problem are we trying to solve?
● What should the dbt pipeline look like in Airflow?
There are three common objectives:
Run a dbt command in Airflow: simply run a dbt command, similar to how we run the dbt-core CLI locally in the terminal.
Convert dbt models into Airflow tasks: convert the dbt pipeline into an Airflow DAG and have Airflow run the dbt models as tasks.
Submit dbt pipelines to run in dbt Cloud: use Airflow to trigger a job to be run in dbt Cloud. Airflow should treat the dbt job as a black box.
11. Objective: run a dbt command in Airflow

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

DBT_PROJECT_DIR = "/usr/local/airflow/dbt"  # example path to your dbt project

with DAG(
    "dbt_basic_dag",
    start_date=datetime(2020, 12, 23),
    description="A sample Airflow DAG to invoke dbt runs using a BashOperator",
    schedule_interval=None,
    catchup=False,
) as dag:
    dbt_seed = BashOperator(
        task_id="dbt_seed",
        bash_command=f"dbt seed --profiles-dir {DBT_PROJECT_DIR} --project-dir {DBT_PROJECT_DIR}",
    )
    dbt_run = BashOperator(
        task_id="dbt_run",
        bash_command=f"dbt run --profiles-dir {DBT_PROJECT_DIR} --project-dir {DBT_PROJECT_DIR}",
    )
    dbt_test = BashOperator(
        task_id="dbt_test",
        bash_command=f"dbt test --profiles-dir {DBT_PROJECT_DIR} --project-dir {DBT_PROJECT_DIR}",
    )
    dbt_seed >> dbt_run >> dbt_test
Recommendation: BashOperator or Kubernetes Operator
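For the Kubernetes option, here is a minimal sketch of the same dbt run step using the KubernetesPodOperator. The image name and in-container paths are assumptions: the image must contain dbt, your adapter, and your dbt project, and the import path assumes a recent cncf-kubernetes provider.

from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

dbt_run_k8s = KubernetesPodOperator(
    task_id="dbt_run",
    name="dbt-run",
    image="my-dbt-project:1.0.0",  # hypothetical image with dbt and the project baked in
    cmds=["dbt"],
    arguments=["run", "--project-dir", "/usr/app/dbt", "--profiles-dir", "/usr/app/dbt"],
    get_logs=True,
)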
12. Objective: run the dbt pipeline in dbt Cloud
Recommendation: apache-airflow-providers-dbt-cloud

from datetime import datetime

from airflow.decorators import dag
from airflow.operators.empty import EmptyOperator
from airflow.operators.python import ShortCircuitOperator
from airflow.providers.dbt.cloud.operators.dbt import DbtCloudRunJobOperator

DBT_CLOUD_CONN_ID = "dbt_cloud_default"  # example connection ID
JOB_ID = 12345  # example dbt Cloud job ID

@dag(
    start_date=datetime(2022, 2, 10),
    schedule_interval="@daily",
    catchup=False,
    default_view="graph",
    doc_md=__doc__,
)
def check_before_running_dbt_cloud_job():
    begin, end = [EmptyOperator(task_id=id) for id in ["begin", "end"]]
    check_job = ShortCircuitOperator(
        task_id="check_job_is_not_running",
        python_callable=_check_job_not_running,  # helper sketched below
        op_kwargs={"job_id": JOB_ID},
    )
    trigger_job = DbtCloudRunJobOperator(
        task_id="trigger_dbt_cloud_job",
        dbt_cloud_conn_id=DBT_CLOUD_CONN_ID,
        job_id=JOB_ID,
        check_interval=600,
        timeout=3600,
    )
    begin >> check_job >> trigger_job >> end

check_before_running_dbt_cloud_job()
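The DAG references a _check_job_not_running helper that isn't shown on the slide. One possible sketch uses the provider's DbtCloudHook; the status codes 10/20/30 are dbt Cloud's Success/Error/Cancelled terminal statuses, and the exact response parsing is an assumption:

from airflow.providers.dbt.cloud.hooks.dbt import DbtCloudHook

def _check_job_not_running(job_id: int) -> bool:
    """Let the DAG proceed only if the job's most recent run has finished."""
    hook = DbtCloudHook(DBT_CLOUD_CONN_ID)
    runs = hook.list_job_runs(job_definition_id=job_id, order_by="-id")
    latest_run = runs[0].json()["data"][0]
    # 10 = Success, 20 = Error, 30 = Cancelled
    return latest_run["status"] in (10, 20, 30)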
15. Approaches: trade-offs
Cosmos (one task / model)
● Failing dbt node is easy to identify
● Efficient retries
● Slower DAG parsing
● Worker slots proportional to dbt nodes
● Synchronous
● No vendor lock-in
● Independent downstream use cases can succeed
● One database connection per task or dbt model
dbt Cloud provider (one task / job)
● Hard to identify failing dbt node
● Inefficient retry (re-run all dbt nodes)
● Fast DAG parsing
● Few worker slots
● Asynchronous or synchronous
● Vendor lock-in
● Downstream use cases dependent on every dbt node succeeding
● Single database connection to run all job transformations
BashOperator (one task / cmd)
● Hard to identify failing dbt node
● Inefficient retry (re-run all dbt nodes)
● Fast DAG parsing
● Few worker slots
● Synchronous
● No vendor lock-in
● Downstream use cases dependent on every dbt node succeeding
● One database connection per dbt command
16. dbt Cloud or dbt-core & Airflow
dbt Cloud
Proprietary hosted platform for running dbt jobs
● Schedule jobs
● Built-in CI/CD
● Host documentation
● Monitor and alert
● Built-in Cloud IDE
● Run commands locally
dbt-core & Airflow
Open-source platform that can run dbt jobs (and others)
● Schedule jobs (Cosmos & Airflow)
● Pre-existing CI/CD
● Host documentation (Cosmos)
● Monitor and alert (Airflow)
● Pre-existing IDE
● Run commands locally (dbt-core)
● Unified tool for scheduling Analytics, DS and DE workflows
17. dbt docs in Airflow with Cosmos
Generate and host your dbt docs with Cosmos:
● DbtDocsOperator
● DbtDocsAzureStorageOperator
● DbtDocsS3Operator
● DbtDocsGCSOperator

from cosmos.operators import DbtDocsGCSOperator

generate_dbt_docs_gcs = DbtDocsGCSOperator(
    task_id="generate_dbt_docs_gcs",
    project_dir="path/to/jaffle_shop",
    profile_config=profile_config,
    # docs-specific arguments
    connection_id="test_gcs",
    bucket_name="test_bucket",
)
18. But… what is Cosmos?
An open-source library that helps you run dbt-core in Apache Airflow.
Try it out:
$ pip install astronomer-cosmos
Check the repo:
http://github.com/astronomer/astronomer-cosmos
Talk to the project developers and users:
#airflow-dbt channel in the Apache Airflow Slack
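To make "a few lines of code" concrete, here is a minimal sketch of a DbtDag. The project path, connection ID, and profile details are placeholder assumptions; the configuration options are covered in the following slides.

from datetime import datetime

from cosmos import DbtDag, ProfileConfig, ProjectConfig
from cosmos.profiles import PostgresUserPasswordProfileMapping

jaffle_shop_dag = DbtDag(
    dag_id="jaffle_shop",
    schedule_interval="@daily",
    start_date=datetime(2024, 1, 1),
    catchup=False,
    # path to the dbt project inside the Airflow deployment (example value)
    project_config=ProjectConfig("/usr/local/airflow/dags/dbt/jaffle_shop"),
    # build profiles.yml automatically from an Airflow connection (example values)
    profile_config=ProfileConfig(
        profile_name="jaffle_shop",
        target_name="dev",
        profile_mapping=PostgresUserPasswordProfileMapping(
            conn_id="postgres_default",
            profile_args={"schema": "public"},
        ),
    ),
)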
20. Questions when running dbt in Airflow
Objective
● What problem are you trying to solve?
● What should the dbt pipeline look like in Airflow?
Setup
● How do you install dbt alongside Airflow?
● How do you install dbt dependencies?
Rendering
● How do you parse the dbt project?
● How do you select a subset of the original dbt project?
● Where are the tests in the pipeline?
Execution
● Where do you execute the dbt commands?
22. Setup
● How do you install dbt alongside Airflow?
● How do you install dbt dependencies?
1. Parsing
● How do you parse the dbt project?
● How do you select a subset of the original dbt project?
2. Execution
● Where do you execute the dbt commands?
Why does setup matter? Installing dbt-core alongside Airflow in the same Python environment can fail with conflicting dependencies:

ERROR: Cannot install apache-airflow, apache-airflow==2.7.0, and dbt-core==1.4.0 because these package versions have conflicting dependencies.
The conflict is caused by:
    dbt-core 1.4.0 depends on pyyaml>=6.0
    connexion 2.12.0 depends on PyYAML<6 and >=5.1
    dbt-core 1.4.0 depends on pyyaml>=6.0
    connexion 2.11.2 depends on PyYAML<6 and >=5.1
    dbt-core 1.4.0 depends on pyyaml>=6.0
    connexion 2.11.1 depends on PyYAML<6 and >=5.1
    dbt-core 1.4.0 depends on pyyaml>=6.0
    connexion 2.11.0 depends on PyYAML<6 and >=5.1
    apache-airflow 2.7.0 depends on jsonschema>=4.18.0
    flask-appbuilder 4.3.3 depends on jsonschema<5 and >=3
    connexion 2.10.0 depends on jsonschema<4 and >=2.5.1
23. Setup: dbt-core installation

Can you install dbt and Airflow in the same Python environment?
● Yes: all set, no configuration needed! By default, Cosmos will use ExecutionMode.LOCAL and InvocationMode.DBT_RUNNER (can run dbt commands 40% faster).
● No: can you create and manage a dedicated Python environment alongside Airflow? If not, see the next slide. If so, two additional steps are needed, and you will still be using the default ExecutionMode.LOCAL.

1. If using Astro, create the virtualenv as part of your Docker image build:

FROM quay.io/astronomer/astro-runtime:11.3.0
RUN python -m venv dbt_venv && \
    source dbt_venv/bin/activate && \
    pip install --no-cache-dir <your-dbt-adapter> && \
    deactivate

2. Tell Cosmos where the dbt binary is:

DbtDag(
    ...,
    execution_config=ExecutionConfig(
        dbt_executable_path=Path("/usr/local/airflow/dbt_venv/bin/dbt"),
    ),
    operator_args={"py_requirements": ["dbt-postgres==1.6.0b1"]},
)
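If you want to set the execution and invocation modes explicitly rather than rely on the defaults, a minimal sketch, assuming Cosmos 1.4+ where InvocationMode is configurable on ExecutionConfig:

from cosmos import DbtDag, ExecutionConfig
from cosmos.constants import ExecutionMode, InvocationMode

DbtDag(
    ...,
    execution_config=ExecutionConfig(
        execution_mode=ExecutionMode.LOCAL,
        # invokes dbt programmatically via dbtRunner, avoiding subprocess overhead
        invocation_mode=InvocationMode.DBT_RUNNER,
    ),
)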
25. Setup: dbt-core installation

Do you still want to have dbt installed in the Airflow nodes?
● Yes: good news, Cosmos can create and manage the dbt Python virtualenv for you! Use ExecutionMode.VIRTUALENV:

DbtDag(
    ...,
    execution_config=ExecutionConfig(
        execution_mode=ExecutionMode.VIRTUALENV,
    ),
)

● No: you don't have to have dbt in the Airflow nodes to benefit from Cosmos! You can leverage LoadMode.DBT_MANIFEST and ExecutionMode.KUBERNETES. More information on these in the next slides!
26. Setup: dbt deps

Are you able to pre-install dbt deps in your Airflow deployment?
● Yes: that's the most efficient approach. Tell Cosmos not to worry about the dbt packages.yml file:

DbtDag(
    ...,
    render_config=RenderConfig(dbt_deps=False),
    operator_args={"install_deps": False},
)

● No problem: by default, Cosmos will run dbt deps before running any dbt command. No configuration needed.
28. Setup: database connections

Do you manage your database credentials in Airflow?
● Yes: Cosmos has an extensible set of ProfileMapping classes that can automatically create the dbt profiles.yml from Airflow Connections:

profile_config = ProfileConfig(
    profile_name="my_profile_name",
    target_name="my_target_name",
    profile_mapping=SnowflakeUserPasswordProfileMapping(
        conn_id="my_snowflake_conn_id",
        profile_args={
            "database": "my_snowflake_database",
            "schema": "my_snowflake_schema",
        },
    ),
)

dag = DbtDag(
    profile_config=profile_config,
)

● No problem: Cosmos also allows you to define your own profiles.yml:

profile_config = ProfileConfig(
    profile_name="my_snowflake_profile",
    target_name="dev",
    profiles_yml_filepath="/path/to/profiles.yml",
)
31. Rendering
● How do you parse the dbt project?
● How do you select a subset of the original dbt project?
● How are tests represented?
Up next: Execution
● Where do you execute the dbt commands?
32. Rendering: how to parse the dbt project

Are you able to pre-compile your dbt project (i.e., generate its manifest.json ahead of time, for example in CI)?
● Yes: great, this will help the Airflow DAG processor save time:

DbtDag(
    project_config=ProjectConfig(manifest_path="/path/to/manifest.json"),
    render_config=RenderConfig(load_method=LoadMode.DBT_MANIFEST),
)

● No: is dbt available in the Airflow scheduler?
  ○ Yes: Cosmos will use the dbt ls command to identify the pipeline topology. The output is cached and refreshed automatically; it can also be purged manually.
  ○ No: Cosmos will use its built-in dbt project parser (this implementation is a best effort and may not support all dbt project features).
34. Rendering: selecting a subset of nodes

Did you use dbt ls to parse the project?
● Yes: great, you can use any selector flag available in the version of dbt you're using (select, exclude, selector):

DbtDag(
    render_config=RenderConfig(
        load_method=LoadMode.DBT_LS,
        selector="my_selector",
    )
)

● No: Cosmos will use a custom implementation of dbt selectors to exclude and select nodes. The dbt YAML selector is not currently supported. The following features are supported: selecting based on tags, paths, config.materialized, graph operators, and tag intersections:

DbtDag(
    render_config=RenderConfig(
        exclude=["node_name+"],  # node and its children
    )
)

DbtDag(
    render_config=RenderConfig(  # intersection
        select=["tag:include_tag1,tag:include_tag2"]
    )
)
36. Rendering: representing test nodes

Do you want to hide dbt tests?
● Yes: that's fine, tell Cosmos not to render them with TestBehavior.NONE:

DbtDag(
    render_config=RenderConfig(
        test_behavior=TestBehavior.NONE,
    )
)

● No: would you like to run all the tests at the end of the DAG?
  ○ Yes: that's easy, use TestBehavior.AFTER_ALL:

DbtDag(
    render_config=RenderConfig(
        test_behavior=TestBehavior.AFTER_ALL,
    )
)

  ○ No: this is Cosmos' default behavior (TestBehavior.AFTER_EACH)! It allows you to run the tests that relate to a specific model, snapshot, or seed together with the specific dbt node they relate to.
43. Execution: how to run the dbt commands

Is dbt available in the Airflow worker nodes?
● Yes: Cosmos will use ExecutionMode.LOCAL by default. Users can also use a pre-created dbt Python virtualenv or ask Cosmos to create/manage one; review "Setup: dbt-core installation" for more. Local execution uses partial parsing, which can speed up task runs by 35%.
● No: would you like to run dbt from within a Docker container on the Airflow worker node?
  ○ Yes: create a Docker container image with your dbt project and use Cosmos ExecutionMode.DOCKER:

DbtDag(
    execution_config=ExecutionConfig(
        execution_mode=ExecutionMode.DOCKER,
    ),
    operator_args={
        "image": "dbt-jaffle-shop:1.0.0",
        "network_mode": "bridge",
    },
)

  ○ No: are you running within a Cloud provider (AWS or Azure)?
    ■ Yes: create a Docker container image and delegate the execution of dbt commands to AWS EKS or Azure Container Instances, for example:

DbtDag(
    execution_config=ExecutionConfig(
        execution_mode=ExecutionMode.AWS_EKS,
    ),
    operator_args={
        "image": "dbt-jaffle-shop:1.0.0",
        "cluster_name": CLUSTER_NAME,
        "get_logs": True,
        "is_delete_operator_pod": False,
    },
)

    ■ No: see the next slide (ExecutionMode.KUBERNETES).
45. Execution: how to run the dbt commands (Kubernetes)

Create a Docker container image with your dbt project, then run the dbt commands in Airflow-managed Kubernetes pods with Cosmos ExecutionMode.KUBERNETES:

from airflow.providers.cncf.kubernetes.secret import Secret

postgres_password_secret = Secret(
    deploy_type="env",
    deploy_target="POSTGRES_PASSWORD",
    secret="postgres-secrets",
    key="password",
)

k8s_cosmos_dag = DbtDag(
    # ...
    execution_config=ExecutionConfig(
        execution_mode=ExecutionMode.KUBERNETES,
    ),
    operator_args={
        "image": "dbt-jaffle-shop:1.0.0",
        "get_logs": True,
        "is_delete_operator_pod": False,
        "secrets": [postgres_password_secret],
    },
)
47. Troubleshooting

1. I cannot see my DbtDag
○ Do your DAG files contain the words "DAG" and "airflow"? If not, set AIRFLOW__CORE__DAG_DISCOVERY_SAFE_MODE=False
2. I still cannot see my DbtDags
○ Are you using LoadMode.AUTOMATIC (the default) or LoadMode.DBT_LS?
■ Increase AIRFLOW__CORE__DAGBAG_IMPORT_TIMEOUT
○ Check the DAG processor / scheduler logs for errors
3. The performance is suboptimal (latency or resource utilization)
● Try using the latest Cosmos release
● Leverage Cosmos caching mechanisms
● For very large dbt pipelines, we recommend using LoadMode.DBT_MANIFEST
● Pre-install dbt deps in your Airflow environment
● If possible, use ExecutionMode.LOCAL and InvocationMode.DBT_RUNNER
● Review the speed tips flagged earlier in the slides
49. Cost Savings
● Single technology to orchestrate & schedule dbt and non-dbt jobs
● Costs are not proportional to your team size
● No limit to the number of dbt projects you can run
● No vendor lock-in
51. Cosmos adoption
● 900k+ downloads in a month (June-July 2024)
○ almost 10% of the dbt-core downloads per month in the same period!
● 516 stars on GitHub (9 July 2024)
https://pypistats.org/packages/astronomer-cosmos
52. Cosmos adoption in Astro
74 Astro customers use Cosmos to run their dbt projects (18 June 2024)
53. Cosmos adoption in the community
27.4% of the community reported using Cosmos to run dbt in Airflow.
https://bit.ly/dbt-airflow-survey-2024
54. Cosmos community
Out of the four Cosmos committers, two are from the OSS community and have never worked at Astronomer.
Between December 2022 and July 2024, Cosmos had 93 contributors who merged 782 commits into main.
There are 677 members in the #airflow-dbt channel of the Apache Airflow Slack, with daily interactions.
56. Cosmos future possibilities
● Read and write dbt artifacts from remote shared storage
● Improve OpenLineage and Dataset support
● Support running dbt compiled SQL using native Airflow operators (async support!)
● Allow users to create Cosmos DAGs from YAML files
● Allow users to build less granular DAGs
● Support setting arguments per dbt node
● Further improve performance
● Leverage latest Airflow features