October 2022
Guillermo Sanchez
Python models in dbt:
the good, the bad & the ugly
About The Speaker
Guillermo Sánchez
• Tech Lead at GDD Analytics unit!
• Been working in the data space for 5 years
• Based in Amsterdam, but originally from Madrid, Spain
• Big climbing fan!
Table of Contents
Context
The good
The bad
The ugly
Conclusions
Context
The roots of dbt
• Orchestrate transformations in the warehouse
• SQL first (with some Jinja to dope things up)
• Awesome docs
A classic dbt model
Recently in the dbt world…
Some facts about dbt python models
• Released in dbt-core v1.3
• Currently only supported for the BigQuery, Databricks, and Snowflake adapters
• Still runs in your warehouse/cloud platform
• Python models only support table and incremental materializations
Anatomy of a dbt python model
• Parameters
• dbt (access to config, nodes, etc.)
• session (Spark or Snowpark)
• Returns
• Spark DataFrame for Databricks & BigQuery
• Snowpark DataFrame for Snowflake (see the sketch below)
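Putting the pieces together, a minimal sketch of a Python model. The `model(dbt, session)` signature is dbt's documented contract; the table name is hypothetical.

```python
# models/my_python_model.py: a minimal dbt Python model sketch
def model(dbt, session):
    # Only "table" and "incremental" materializations are supported.
    dbt.config(materialized="table")

    # dbt.ref() returns a DataFrame: Spark on Databricks/BigQuery, Snowpark on Snowflake.
    orders = dbt.ref("stg_orders")  # hypothetical upstream model

    # Transform with the platform's DataFrame API.
    completed = orders.filter(orders["status"] == "completed")

    # The returned DataFrame is what dbt materializes as this model's table.
    return completed
```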
The good
More use cases with dbt awesomeness
Advanced analytics
• Taking advantage of the rich Python ecosystem around data science
ML batch inference
• Your batch inference pipeline in the dbt project!
[Diagram: table → Python model → table with inference; the Python model pulls a model from a model registry]
Verbosity
• Fewer lines of code that do the same!
• Jinja can solve some things, but not everything.
• We don't need to debug complex compiled SQL to find our errors.
An example of advanced analytics
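The slide's code isn't preserved here, but a minimal sketch of the idea, assuming the Snowflake adapter (table and column names are made up): statistics that take a wall of window-function SQL are a few lines of pandas.

```python
def model(dbt, session):
    # Request third-party packages via dbt's config (Anaconda channel on Snowflake).
    dbt.config(materialized="table", packages=["pandas"])

    # Pull a small-enough table into pandas for the heavy lifting.
    df = dbt.ref("daily_revenue").to_pandas()  # hypothetical upstream model

    # Per-store z-scores: two lines here vs. nested window functions in SQL.
    grouped = df.groupby("store_id")["revenue"]
    df["zscore"] = (df["revenue"] - grouped.transform("mean")) / grouped.transform("std")
    df["is_anomaly"] = df["zscore"].abs() > 3

    # Hand a Snowpark DataFrame back to dbt for materialization.
    return session.create_dataframe(df)
```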
An example of ML batch inference
src: dbt community
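The community example isn't reproduced here; a simplified sketch of the "pull model, score table" pattern on Snowflake (stage, file, and column names are all hypothetical):

```python
import joblib  # assumes joblib/scikit-learn are enabled via the packages config

FEATURE_COLS = ["recency", "frequency", "monetary"]  # hypothetical feature names

def model(dbt, session):
    dbt.config(materialized="table", packages=["joblib", "scikit-learn", "pandas"])

    features = dbt.ref("customer_features").to_pandas()  # hypothetical feature table

    # "Pull model": download a pickled model from a stage acting as the registry.
    session.file.get("@ml_models/churn_model.joblib", "/tmp")  # hypothetical stage
    clf = joblib.load("/tmp/churn_model.joblib")

    # Batch inference: score every row, keep predictions next to the features.
    features["churn_probability"] = clf.predict_proba(features[FEATURE_COLS])[:, 1]

    return session.create_dataframe(features)
```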
Is it all good though…?
The bad
Different levels of not so good
• Technical not-so-goods: technical inconveniences of dbt Python models
• Use-case not-so-goods: use cases that we think should not be tackled with dbt Python models
A not so good use case: external data source
There is something called dbt seeds!
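A sketch of the seeds route (seed and model names are made up): the CSV lives in seeds/, `dbt seed` loads it, and the Python model just refs it instead of reading external files itself.

```python
def model(dbt, session):
    countries = dbt.ref("country_codes")  # hypothetical seed (seeds/country_codes.csv)
    orders = dbt.ref("stg_orders")        # hypothetical upstream model

    # Enrich orders with the static lookup; no file I/O inside the model.
    return orders.join(countries, orders["country"] == countries["code"])
```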
A bad use case: inference from API endpoint
[Diagram: table → Python model → table with inference; the Python model makes tons of API calls to a model endpoint]
Why not?
• Need to handle retry mechanisms
• What to do when the API is down?
• How does it affect the rest of the models being run in the dbt job?
A bad use case: inference from API endpoint
It’s complex too!
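A sketch of why (endpoint and table names are hypothetical; on Snowflake the sandbox may not even allow outbound HTTP): by the time the retry plumbing is in, little of the model is transformation logic anymore.

```python
import time
import requests  # assumes the platform allows outbound HTTP at all

ENDPOINT = "https://models.example.com/score"  # hypothetical model endpoint

def score_with_retries(payload, retries=3, backoff=2.0):
    # Hand-rolled retry logic that dbt knows nothing about.
    for attempt in range(retries):
        try:
            resp = requests.post(ENDPOINT, json=payload, timeout=10)
            resp.raise_for_status()
            return resp.json()["prediction"]
        except requests.RequestException:
            if attempt == retries - 1:
                raise  # one flaky endpoint now fails the whole dbt run
            time.sleep(backoff ** attempt)

def model(dbt, session):
    df = dbt.ref("customer_features").to_pandas()
    # One API call per row: slow, rate-limit-prone, hard to make idempotent.
    df["prediction"] = [score_with_retries(row) for row in df.to_dict("records")]
    return session.create_dataframe(df)
```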
Technical not-so-goods: setups!
It used to be only running SQL against an endpoint… not anymore!
BigQuery setup
Dataproc job which requires:
• Serverless or a cluster
• Enabling Dataproc
• Ensuring Dataproc has access to your data in BQ
• Your credentials should also be able to run Dataproc jobs!
Databricks setup
Databricks job which requires:
• Job cluster should have access to the data
• Your credentials should have the ability to create & run Databricks jobs, or Command API access
Snowflake setup
Snowpark code wrapped in a stored procedure:
• Requires being enrolled in the Snowpark public preview
• Anaconda packages need to be enabled by an admin
Technical not-so-goods: divergent APIs
[Slide compares the two DataFrame APIs side by side: Snowpark vs. PySpark (sketched below)]
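A small illustration (the column logic is made up): the "same" model body has to know which flavor of DataFrame it received.

```python
def model(dbt, session):
    df = dbt.ref("orders")  # Spark or Snowpark DataFrame, depending on the adapter

    # PySpark (Databricks, BigQuery via Dataproc), camelCase API:
    #   from pyspark.sql import functions as F
    #   df = df.withColumn("amount_eur", F.col("amount") * 0.95)

    # Snowpark (Snowflake), snake_case API with subtly different semantics:
    #   from snowflake.snowpark import functions as F
    #   df = df.with_column("amount_eur", F.col("amount") * 0.95)

    return df
```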
The ugly
Ugghh…
Ingestion in dbt?!
dbt is not a tool made for data ingestion. Main reasons:
• There are tools made for this that handle all the complexities better!
• The python API is for dbt models, not sources (wink)!
• How would you go about ingesting from Change Data Capture or
any event streams in dbt?
MLOps in dbt?!
dbt is not an MLOps platform (yet?). Main reasons:
• Doesn’t have a model registry (can a table be a model registry...?)
• Doesn’t have experiment tracking or model performance tracking metrics
(a SQL test will not do it I’m afraid)
• Doesn’t offer mechanisms to deploy a model to an API endpoint
Conclusions
Would I use dbt python models?
Yes, yes, yes! But only for the right use cases
Maybe I’d wait a bit before using this in prod!
WWW.GODATADRIVEN.COM
GUILLERMOSANCHEZ@GODATADRIVEN.COM