Guillermo Sanchez presented on the pros and cons of using Python models in dbt. While Python models allow for more advanced analytics and leveraging the Python ecosystem, they also introduce more complexity in setup and divergent APIs across platforms. Additionally, dbt may not be well-suited for certain use cases like ingesting external data or building full MLOps pipelines. In general, Python models are best for the right analytical use cases, but caution is needed, especially for production environments.
dbt Python models - GoDataFest by Guillermo Sanchez
2. About The Speaker
Guillermo Sánchez
• Tech Lead at GDD Analytics unit!
• Been working in the data space for 5 years
• Based in Amsterdam, but originally from Madrid, Spain
• Big climbing fan!
8. Some facts about dbt python models
• Released in dbt-core v1.3
• Currently only supported for the BigQuery, Databricks and Snowflake adapters
• Still runs in your warehouse/cloud platform
• Python models only support table and incremental materializations.
9. Anatomy of a dbt python model
• Parameters
• dbt (access to config, nodes, etc.)
• session (spark or snowpark)
• Returns
• Spark DataFrame for Databricks & BigQuery
• Snowpark DataFrame for Snowflake
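Putting those pieces together, a dbt python model is a file that defines a `model(dbt, session)` function and returns a DataFrame. A minimal sketch (the upstream model name `stg_orders` is made up for illustration):

```python
# Sketch of a dbt python model file, e.g. models/my_python_model.py.
# "stg_orders" is a hypothetical upstream model name.

def model(dbt, session):
    # `dbt` gives access to config, refs/sources, etc.
    # Python models only support table and incremental materializations.
    dbt.config(materialized="table")

    # dbt.ref() returns a DataFrame: Spark on Databricks & BigQuery,
    # Snowpark on Snowflake. `session` is the spark/snowpark session.
    upstream_df = dbt.ref("stg_orders")

    # ...transform with the DataFrame API...

    # The returned DataFrame is materialized as the model's table.
    return upstream_df
```

Note that the function name `model` and the two parameters are fixed by dbt; everything else is ordinary Python.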
11. More use cases with dbt awesomeness
• Advanced analytics: taking advantage of the rich python ecosystem around data science.
• ML batch inference: your batch inference pipeline in the dbt project! (table → python model → table with inference, with the trained model pulled from a model registry)
• Verbosity: fewer lines of code that can do the same!
• Jinja can solve some stuff, but not everything.
• We don’t need to debug complex compiled SQL to find our errors.
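The batch-inference flow (table → python model → table with inference, pulling a trained model from a registry) could look roughly like this. A sketch only: `pull_model_from_registry`, the `customer_features` model and the feature columns are hypothetical stand-ins, and a real project would pull from an actual registry such as MLflow:

```python
import pandas as pd

FEATURE_COLS = ["amount", "n_orders"]  # hypothetical feature columns


def pull_model_from_registry(name):
    """Stand-in for a registry pull (e.g. MLflow). Returns an object with a
    scikit-learn-style .predict() so the sketch stays self-contained."""
    class _ConstantModel:
        def predict(self, X):
            return [0.5] * len(X)
    return _ConstantModel()


def model(dbt, session):
    dbt.config(materialized="table")

    # Snowpark-style: pull the upstream table into pandas for scoring
    features = dbt.ref("customer_features").to_pandas()

    clf = pull_model_from_registry("churn_model")
    features["churn_score"] = clf.predict(features[FEATURE_COLS])

    # The scored DataFrame becomes the "table with inference"
    return features
```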
16. Different levels of not so good
• Technical not-so-goods: technical inconveniences of dbt python models.
• Use-case not-so-goods: use cases that we think should not be tackled with dbt python models.
17. A not so good use case: external data source
For small, static external files there is already something called dbt seeds (CSV files loaded with `dbt seed`)!
18. A bad use case: inference from an API endpoint
Calling a model endpoint from a python model (table → python model → table with inference) means tons of API calls.
Why not?
• Need to handle retry mechanisms
• What to do when the API is down?
• How does it affect the rest of the models being run in the dbt job?
19. A bad use case: inference from API endpoint
It’s complex too!
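To make the complexity concrete: even a basic retry policy around the endpoint calls means backoff, a retry budget, and a decision about what happens when the budget runs out, all living inside your model. A minimal sketch (the endpoint-calling function is a placeholder for whatever HTTP call you would make):

```python
import time


def call_with_retries(call_endpoint, payload, max_attempts=3, base_delay=0.1):
    """Retry a flaky endpoint call with exponential backoff.

    `call_endpoint` is a placeholder for the actual HTTP call.
    If the API stays down, the exception propagates and the rest of
    the dbt job has to deal with it -- exactly the problem above.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return call_endpoint(payload)
        except ConnectionError:
            if attempt == max_attempts:
                raise  # API is down: now what happens to the rest of the job?
            time.sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff
```

And this still says nothing about rate limits, timeouts or partial failures over millions of rows.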
20. Technical not so goods: set-ups!
It used to be only running SQL against an endpoint… not anymore!

BigQuery setup
Dataproc job which requires:
• Serverless or a cluster
• Enabling Dataproc
• Ensuring Dataproc has access to your data in BQ
• Your credentials should also be able to run Dataproc jobs!

Databricks setup
Databricks job which requires:
• The job cluster should have access to the data
• Your credentials should have the ability to create & run Databricks jobs, or Command API access

Snowflake setup
Snowpark code wrapped in a stored procedure:
• Requires being enrolled in the Snowpark public preview
• Anaconda packages need to be enabled by an admin
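As an illustration of the extra set-up on the BigQuery side, the Dataproc details go into `profiles.yml`. A hedged sketch assuming dbt-bigquery's python-model settings; project, bucket and region values are placeholders:

```yaml
# profiles.yml (BigQuery target) -- extra keys needed just for python models.
# Project, bucket and region names below are placeholders.
my_profile:
  target: dev
  outputs:
    dev:
      type: bigquery
      method: oauth
      project: my-gcp-project
      dataset: analytics
      # python-model specific settings:
      gcs_bucket: my-staging-bucket   # where dbt stages the python code
      dataproc_region: europe-west4
      submission_method: serverless   # or "cluster", which also needs
      # dataproc_cluster_name: my-cluster
```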
24. Ingestion in dbt?!
dbt is not a tool that is made for data ingestion. Main reasons:
• There are tools made for this that handle all the complexities better!
• The python API is for dbt models, not sources (wink)!
• How would you go about ingesting from Change Data Capture or any event streams in dbt?
25. MLOps in dbt?!
dbt is not an MLOps platform (yet?). Main reasons:
• Doesn’t have a model registry (can a table be a model registry…?)
• Doesn’t have experiment tracking or model performance tracking metrics (a SQL test will not do it, I’m afraid)
• Doesn’t offer mechanisms to deploy a model to an API endpoint