Enabling Scalable Data Science Pipelines with MLflow and Model Registry at Thermo Fisher Scientific
Allison Wu
Data Scientist, Data Science Center of Excellence
Thermo Fisher Scientific
Key Summary
▪ We standardized development of machine learning models by integrating MLflow tracking into the development pipeline.
▪ We improved reproducibility of machine learning models by integrating GitHub and Delta Lake into our development and deployment pipelines.
▪ We streamlined deployment of machine learning models to different platforms through MLflow and a centralized Model Registry.
What do data scientists at our Data Science Center of Excellence do?
▪ Generate novel algorithms that can be applied across different divisions
▪ Work with cross-divisional teams on model migration and standardization
▪ Enable data science in different divisions and functions
▪ Establish data science best practices
[Diagram: Data Science at Thermo Fisher spans Operations, Human Resources, Commercial & Marketing, and R&D]
Commercial & Marketing Data Science Life Cycle
Actionable insights from customer interaction data create a competitive advantage and drive growth and profitability.
[Diagram: Data sources (install base, cloud, transactional, external data, web behavioral, customer interaction, call center) feed model development & deployment. Machine learning models, alongside rule-based legacy models, deliver automatic email campaigns, website marketing strategies, and prescriptive recommendations for sales reps. Relevant offers reach the customer, driving engagement, leads, and revenue, and customer feedback flows back into machine learning.]
Model Development and Deployment Cycle
Development (DEV)
▪ Exploratory analysis
▪ Model development: feature engineering, feature selection, model optimization
Deployment (DEV → PRD)
▪ Deployment to production environment
▪ Audit
▪ Scoring
Delivery (PRD)
▪ Web recommendation
▪ Email campaign
▪ Commercial dashboard
Management (PRD)
▪ Monitoring
▪ Feedback
The cycle then loops back: production models are retrained and retuned in PRD, and new model development starts again in DEV.
An Example Model Development / Deployment Cycle
A model that makes product recommendations based on customer behaviors, such as web activities and sales transactions.
• 6-8 weeks of EDA and prototyping
• Scoring daily
• Retrained/retuned on new production data every 2 weeks
• Delivered through email campaigns or commercial sales rep channels
• Model performance metrics monitored
What we used to do…
• All work is in Databricks notebooks
• No version control on either data or models
• No unit testing
• No regression testing against different versions of models
• Hard to share modularized functions across projects (lots of copy-pasting)
What we now do…
Databricks notebook (DEV)
• Exploratory analysis
• Feature engineering
Notebook & MLflow (DEV)
• ML model experiments
• Hyperparameter tracking (a minimal tracking sketch follows this list)
• Feature selection
• Model comparison
Development Model Registry (DEV)
• Streamlined regression testing against previous model versions
• Documented model review process
• Clean version management for better collaboration within the same DEV environment
ML model library
• Python modules for sharable and testable ML functions such as feature functions, utility functions, and ML tuning functions
• Version controlled on GitHub
• Integrated with Databricks Projects to version control Databricks notebooks
• Documented code review process
• Version-controlled data sources with Delta Lake
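As a rough illustration of the tracking step, here is a minimal sketch of logging hyperparameters, a metric, feature importances, and the model itself from a notebook. The experiment path, dataset, and metric name are illustrative placeholders, not the actual pipeline:

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score
from sklearn.model_selection import train_test_split

# Illustrative stand-in data; the real pipeline reads features from Delta tables.
X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

mlflow.set_experiment("/Shared/product_recommendation_dev")  # hypothetical path

with mlflow.start_run(run_name="rf_baseline"):
    params = {"n_estimators": 200, "max_depth": 8}
    mlflow.log_params(params)

    model = RandomForestClassifier(**params).fit(X_train, y_train)
    mlflow.log_metric("precision", precision_score(y_test, model.predict(X_test)))

    # Log feature importances as an artifact so versions can be compared later.
    mlflow.log_dict(
        {f"f{i}": v for i, v in enumerate(model.feature_importances_.tolist())},
        "feature_importances.json",
    )
    mlflow.sklearn.log_model(model, artifact_path="model")
```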
Tracking Feature Improvements Becomes Easy
Boss: What are the important features in this version versus the previous version?
What we used to do…
▪ “Let me find out how the features do in my….uh….model_version_10.dbc? Maybe?”
▪ “I wish I had a screenshot of the feature importance figure before….”
What we now do….
▪ “I got it. Let me pull it up from MLflow…” (a sketch of retrieving logged feature importances follows below)
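For instance, a sketch of pulling the feature-importance artifact back out of the tracking server for the two most recent runs. The experiment name and artifact file are the same hypothetical ones as in the sketch above:

```python
from mlflow.tracking import MlflowClient

client = MlflowClient()
experiment = client.get_experiment_by_name("/Shared/product_recommendation_dev")

# Grab the two most recent runs and compare their logged feature importances.
runs = client.search_runs(
    [experiment.experiment_id],
    order_by=["attributes.start_time DESC"],
    max_results=2,
)
for run in runs:
    local_path = client.download_artifacts(run.info.run_id, "feature_importances.json")
    print(run.info.run_id, open(local_path).read())
```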
Sharing ML Features Becomes Easy
Colleague: I really like the feature you used in your last model. Can I use that as well?
What we used to do…
▪ “Sure! Just copy-paste this part of the notebook…oh, but I also have a slightly different version in this other part of the notebook…. I THINK this is the one I used….”
Sharing ML Features Becomes Easy
What we now do….
▪ “Sure! I added that feature to the shared ML repo. Feel free to use it by importing the module, and if you modify the feature, just contribute it back to the repo so that I can use it next time as well!”
▪ What’s even cooler: you can log the exact version of the repo you used in MLflow, so even if the repo evolves after your model development, you can still trace back to the exact version you used for your own model. (A minimal sketch of logging the repo commit follows below.)
Internal Shared ML repo
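One simple way to capture the repo version is to log the current Git commit of the shared library as a run tag. The repo path below is hypothetical; this is a sketch, not the team's exact convention:

```python
import subprocess
import mlflow

# Hypothetical checkout location of the internal shared ML repo.
REPO_PATH = "/Workspace/Repos/shared/ml_lib"

repo_commit = subprocess.check_output(
    ["git", "-C", REPO_PATH, "rev-parse", "HEAD"], text=True
).strip()

with mlflow.start_run():
    # Record exactly which version of the shared library produced this model.
    mlflow.set_tag("shared_ml_lib_commit", repo_commit)
```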
What We Learned
• Reproducing model results relies not just on version control of code and notebooks, but also on the training data, environments, and dependencies.
• MLflow and Delta Lake let us track everything needed to reproduce model results (a sketch of recording the training table version follows below).
• GitHub allows us to:
  • establish best practices for accessing our data warehouses
  • standardize our ML models
  • encourage collaboration and review among data scientists.
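One piece of that, sketched under the assumption of a Databricks notebook where `spark` is predefined and with an illustrative table path, is recording the exact Delta table version a model was trained on:

```python
import mlflow
from delta.tables import DeltaTable

TABLE_PATH = "/mnt/lake/training_features"  # illustrative path

# Look up the latest version of the Delta table used for training.
latest = DeltaTable.forPath(spark, TABLE_PATH).history(1).collect()[0]

with mlflow.start_run():
    mlflow.set_tag("training_table_path", TABLE_PATH)
    mlflow.set_tag("training_table_version", str(latest["version"]))
```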
Let’s talk about deployment….
What we used to do…
• Manually export Databricks notebooks and dependent libraries.
• Manually set up clusters in the PRD instance to match cluster settings in DEV.
• Difficulty troubleshooting differences between the PRD and DEV shard environments, since data scientists don’t have the access required to pre-deploy in the PRD environment.
What we now can do….
Centralized Model Registry
• Regression testing in the production environment
• Model version management in a centralized workspace
• Manage production models from different DEV environments
• Streamlined deployment with logged dependencies and environment set-up (a registration sketch follows below)
[Diagram: Development Model Registry (DEV) promotes models to the Centralized Model Registry (PRD)]
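A rough sketch of promoting a tracked model into a centralized registry from a DEV workspace. The registry URI, secret scope, model name, and run ID are all hypothetical placeholders:

```python
import mlflow

# Hypothetical remote registry; on Databricks this is typically configured as
# "databricks://<secret-scope>" pointing at the central workspace.
mlflow.set_registry_uri("databricks://central-registry")

# Register the model logged under a DEV run into the centralized registry.
result = mlflow.register_model(
    model_uri="runs:/<run_id>/model",   # placeholder run ID
    name="product_recommendation",      # hypothetical registered model name
)
print(f"Registered {result.name} as version {result.version}")
```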
What we now can do….
PRD notebook
• Execute model pipelines (a sketch of loading a Production-stage model follows below)
• Deliver results through various channels
• Monitor regular model retraining/retuning and scoring processes
• Model feedback logging
Centralized Model Registry (as above)
• Regression testing in the production environment
• Model version management in a centralized workspace
• Manage production models from different DEV environments
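In a PRD notebook, the scoring step can be sketched as loading whatever version is currently in the Production stage, independent of which DEV workspace produced it. The registry URI, model name, and feature frame below are illustrative:

```python
import mlflow
import mlflow.pyfunc
import pandas as pd

mlflow.set_registry_uri("databricks://central-registry")  # hypothetical registry

# Load whichever version is currently in the Production stage.
model = mlflow.pyfunc.load_model("models:/product_recommendation/Production")

# Illustrative batch; the real pipeline reads scoring input from Delta tables.
batch = pd.DataFrame({"f0": [0.1, 0.4], "f1": [1.2, 0.7]})
predictions = model.predict(batch)
print(predictions)
```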
What we can also do….
Deploying and Managing Models Across Different Platforms through a Centralized Model Registry
[Diagram: multiple Development Model Registries (one per DEV workspace) feed the same Centralized Model Registry, which supports regression testing in the production environment, centralized model version management, and managing production models from different DEV environments. PRD notebooks then execute model pipelines, deliver results through various channels, monitor retraining/retuning and scoring, and log model feedback.]
Regression Testing Becomes Easy
Boss: How does your new model’s performance compare to the old model in production?
What we used to do…
▪ “Let me look through the previous colleague’s notebook to find out what the performance was….”
▪ After digging through the notebook, you can’t find performance metrics logged anywhere…
What we now do…
▪ “From the record in the model registry, it looks like I have improved precision by X%.” (A sketch of comparing registry versions follows below.)
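A sketch of that comparison, reading each stage's latest registered version and the precision metric from its training run. The registered model name and metric key are hypothetical:

```python
from mlflow.tracking import MlflowClient

client = MlflowClient()
MODEL_NAME = "product_recommendation"  # hypothetical registered model name

def latest_precision(stage: str) -> float:
    """Precision logged by the training run behind the latest version in `stage`."""
    version = client.get_latest_versions(MODEL_NAME, stages=[stage])[0]
    return client.get_run(version.run_id).data.metrics["precision"]

print("Production:", latest_precision("Production"))
print("Staging:   ", latest_precision("Staging"))
```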
Troubleshooting Transient Data Discrepancies Becomes Easy
Data Engineer: The daily run yesterday yielded only <1000 rows of predictions. Do you know what happened?
What we used to do…
▪ “Uh….the input table is already overwritten by today’s run. I can rerun the model and see if the prediction comes back to normal now?”
What we now do…
▪ “Let me pull up that version of the input table, since it’s saved as Delta tables. Looks like there were far fewer rows in the input table because of a delay in the data refresh job.” (A Delta time-travel sketch follows below.)
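Delta Lake's time travel makes that check possible. A minimal sketch, assuming a Databricks notebook where `spark` is predefined, with an illustrative table path and timestamp:

```python
# Read the scoring input as it existed at yesterday's run time.
yesterday_input = (
    spark.read.format("delta")
    .option("timestampAsOf", "2021-05-26 02:00:00")  # hypothetical run timestamp
    .load("/mnt/lake/scoring_input")                 # illustrative table path
)
print(yesterday_input.count())

# Version-based time travel works the same way:
# spark.read.format("delta").option("versionAsOf", 42).load("/mnt/lake/scoring_input")
```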
What We Learned
• Data scientists like the freedom to try out new platforms and tools.
• Allowing that freedom can be a nightmare for deployment in a production environment.
• The MLflow tracking server and Model Registry can log a wide range of “flavors” of ML models, from Spark ML and scikit-learn to SageMaker. This allows management and comparison across different platforms in the same centralized workspace.
Thank you!
Feedback
Your feedback is important to us.
Don’t forget to rate and
review the sessions.