Machine Learning Operations - MLOps
Getting from Good to Great
Michal Maciejewski, PhD
Acknowledgements: Dejan Golubovic, Ricardo Rocha, Christoph Obermair, Marek Grzenkowicz
2
ML Model
X Y = f(X)
Let’s share our model with users aka let’s put it into production!
Alice:
Bob: Y = f(X)
What Has to Go Right?
3
What is needed for an ML model to perform well in production?
What Can Go Wrong?
Concept and data drift are among the main challenges of production ML systems!
4
5
MLOps is about maintaining the trained model's performance* in production.
The performance may degrade due to factors outside of our control,
so we ought to monitor it and, if needed, roll out a new model to users.
*model performance = accuracy, latency, jitter, etc.
6
ML Model = Data + Code
+ Algorithm
+ Weights
+ Hyperparameters
+ Scripts
+ Libraries
+ Infrastructure
+ DevOps
MLOps = ML Model + Software
7
D. Sculley et al., Hidden Technical Debt in Machine Learning Systems, NIPS 2015
MLOps = ML Model + Software
[Figure, after D. Sculley et al.: the ML model (your code plus the ML framework) is only a small box surrounded by the supporting components: Configuration, Data Collection, Data Verification, Feature Extraction, Machine Resource Management, Analysis Tools, Process Management Tools, Serving Infrastructure, and Monitoring.]
Good news: most of these components come as ready-to-use frameworks.
MLOps Pipeline
8
MLOps is a multi-stage, iterative process.
Data Engineering Modelling Deployment Monitoring
Data Engineering
9
Reproducibility
Traceability
Data-driven ML
Data Engineering Modelling Deployment Monitoring
10
Exploratory Data Analysis
For structured data:
- schema: the required tables, columns, and datatypes
For unstructured data:
- resolution, image extension
- frequency, duration, audio codec
11
Initial exploration allows identifying the requirements for input data in production.
Data Processing Pipeline
12
Data Ingestion
• Load from file
• Load from db
Data Validation
• Schema check
• Audio/video file check
Data Cleaning
• Filling NaNs
• Filtering
• Normalization
• Standardization
Feature Engineering
• Feature selection
• Feature crossover
We need to reproduce some of these steps (e.g. subtracting the mean) in production! A minimal sketch follows below.
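To make the "reproduce the same steps in production" point concrete, here is a minimal sketch, assuming scikit-learn and joblib (neither is prescribed by the slides): the cleaning and normalization statistics are fitted on the training data only, persisted, and re-applied unchanged at serving time.

```python
# A sketch, assuming scikit-learn/joblib: fit the preprocessing on training data,
# persist it, and reuse the exact same transformation in production.
import joblib
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0]])  # toy training data

preprocessing = Pipeline([
    ("fill_nans", SimpleImputer(strategy="mean")),  # filling NaNs
    ("standardize", StandardScaler()),              # subtract mean, divide by std
])
preprocessing.fit(X_train)
joblib.dump(preprocessing, "preprocessing.joblib")  # version this artifact with the model

# In production: load the fitted pipeline and transform incoming data identically.
serving_preprocessing = joblib.load("preprocessing.joblib")
print(serving_preprocessing.transform(np.array([[2.5, 20.0]])))
```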
https://sites.google.com/princeton.edu/rep-workshop/
Reproducibility
13
Dataset + notebooks + various scripts + Excel spreadsheets → curated dataset
Keeping Track of Data Processing
• Version Input Data – DVC framework
• Version Processing Script - GitLab
• Version Computing Environment - Docker
14
Data Provenance – where does the data come from?
Data Lineage – how is the data manipulated? (a minimal data-versioning sketch follows below)
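As an illustration of versioned input data, a minimal sketch using the DVC Python API; the repository URL, file path, and tag are hypothetical placeholders.

```python
# A sketch of reading a DVC-versioned dataset; the repo URL, path, and tag are hypothetical.
import dvc.api

data = dvc.api.read(
    path="data/curated_dataset.csv",  # file tracked by DVC in that repository
    repo="https://gitlab.example.com/group/ml-project.git",
    rev="v1.2.0",                     # Git tag/commit pins the exact data version
)
print(data[:200])  # the first characters of a reproducible data snapshot
```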
Notebook Good Practices
• Linear flow of execution
• Keep the amount of code small
• Extract reusable code into a package
• Use pre-commit hooks to clean the notebook before committing it to a repository
• Set parameters at the top so that the notebook can be treated as a function (papermill and scrapbook packages; see the sketch below)
15
It is OK to do quick & dirty exploratory model development.
Once we start communicating the model outside, we need to clean it up!
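A minimal sketch of the "notebook as a function" idea with papermill; the notebook names and parameters are hypothetical.

```python
# A sketch: execute a parameterized notebook like a function; names are hypothetical.
import papermill as pm

pm.execute_notebook(
    "train_model.ipynb",               # input notebook with a "parameters"-tagged cell
    "runs/train_model_lr0.001.ipynb",  # executed copy, kept as a record of the run
    parameters={"learning_rate": 1e-3, "epochs": 20},
)
```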
From Model-driven to Data-driven ML
16
                    Model-driven ML      Data-driven ML
Fixed component     Dataset              Model Architecture
Variable component  Model Architecture   Dataset
Objective           High accuracy        Fairness, low bias
Explainability      Limited              Possible
https://datacentricai.org
https://spectrum.ieee.org/andrew-ng-data-centric-ai
Modelling
17
Training challenges
Rare events
Analyzing results
Data Engineering Modelling Deployment Monitoring
Selecting Data for Training
18
Dataset split: Training 80%, Validation 20%
train → validate → hyperparameter tuning
With this approach, the model eventually sees the entire dataset.
Selecting Data for Training
19
Dataset split: Training 75%, Validation 15%, Test 10%
train → validate → hyperparameter tuning; test → final check
Splitting the dataset in three allows us to perform a final check with unseen data.
Balancing Datasets
Consider a binary classification problem with a dataset composed of 200 entries.
There are 160 negative examples (no failure) and 40 positive ones (failure).
20
Expected (counts given as negative + positive):
Training 75% (120 + 30), Validation 15% (24 + 6), Test 10% (16 + 4)
Random:
Training 75% (131 + 19), Validation 15% (19 + 11), Test 10% (10 + 10)
For continuous values it is important to preserve the statistical distribution across the splits.
Although for big datasets this is rarely an issue, it is still a low-hanging fruit; a stratified-split sketch follows below.
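A minimal stratified-split sketch for the 200-sample example above, assuming scikit-learn; the random seed is arbitrary.

```python
# A sketch of a stratified 75/15/10 split for the 200-sample example (160 negative, 40 positive).
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(200).reshape(-1, 1)
y = np.array([0] * 160 + [1] * 40)

# Peel off the 10% test set first, then split the rest into 75%/15% of the total.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.10, stratify=y, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=15 / 90, stratify=y_rest, random_state=42)

print(y_train.sum(), y_val.sum(), y_test.sum())  # 30, 6, 4 positives, as expected
```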
Rare Events
21
C. Obermair, Extension of Signal Monitoring Applications with Machine Learning, Master Thesis, TU Graz
M. Brice, LHC tunnel Pictures during LS2, https://cds.cern.ch/images/CERN-PHOTO-201904-108-15
There were 3130 healthy signals (Y=False) and 112 faulty ones (Y=True)
22
A naive model that always predicts Y = False is guaranteed to achieve 97% average dataset accuracy?!
Rare Events
Rare Events
23
It is a valuable conversation to decide if precision or recall (or both) is more important.
Ground truth vs. model prediction (the model always predicts Y = False):

                  Y = True (ground truth)    Y = False (ground truth)
Model Y = True    0 (true positive)          0 (false positive)
Model Y = False   112 (false negative)       3130 (true negative)

Avg accuracy = (TP + TN) / (TP + TN + FP + FN) = 3130 / 3242 ≈ 97%
Precision = TP / (TP + FP) = 0 / 0 (undefined)
Recall = TP / (TP + FN) = 0 / (0 + 112) = 0
F1 score = 2 / (1/Precision + 1/Recall)
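The same numbers can be reproduced with a few lines of scikit-learn (a sketch, not the code behind the slide), which makes the contrast between the misleading accuracy and the zero precision/recall explicit.

```python
# A sketch, assuming scikit-learn: metrics for a model that always predicts False,
# with the counts from the slide (3130 healthy, 112 faulty signals).
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

y_true = np.array([1] * 112 + [0] * 3130)
y_pred = np.zeros_like(y_true)  # the naive "always False" model

print(f"accuracy : {accuracy_score(y_true, y_pred):.3f}")                     # ~0.97, misleading
print(f"precision: {precision_score(y_true, y_pred, zero_division=0):.3f}")  # 0.0, no positives predicted
print(f"recall   : {recall_score(y_true, y_pred):.3f}")                      # 0.0
print(f"f1       : {f1_score(y_true, y_pred):.3f}")                          # 0.0
```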
Data Augmentation
24
JH. Kim et al. Hybrid Integration of Solid-State Quantum Emitters on a Silicon Photonic Chip, Nano Letters 2017
New examples obtained by
shifting the region left and right
New examples obtained by
rotating/shifting/hiding
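A minimal sketch of augmenting a 1-D signal by shifting, assuming NumPy; the signal and shift values are synthetic, and np.roll wraps around, so with real recordings one would rather crop a longer window.

```python
# A sketch, assuming NumPy: create extra training examples by shifting a 1-D signal.
import numpy as np

rng = np.random.default_rng(0)
signal = np.sin(np.linspace(0, 6 * np.pi, 500)) + 0.05 * rng.standard_normal(500)

def shifted_copies(x, shifts=(-20, -10, 10, 20)):
    """New examples obtained by shifting the region left and right.

    np.roll wraps around; with real recordings one would crop a longer window instead.
    """
    return [np.roll(x, s) for s in shifts]

augmented = shifted_copies(signal)
print(len(augmented), augmented[0].shape)  # 4 extra examples of the same length
```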
What else can we do?
When one of the values of Y is rare in the population, considerable
resources in data collection can be saved by randomly selecting within
categories of Y. […]
The strategy is to select on Y by collecting observations (randomly or all
those available) for which Y = 1 (the "cases") and a random selection of
observations for which Y = 0 (the "controls").
25
G. King and L. Zeng, “Logistic Regression in Rare Events Data,” Political Analysis, p. 28, 2001.
https://en.wikipedia.org/wiki/Cross-validation_(statistics)
We can also collect more data for a particular class (if at all possible).
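A minimal sketch of the "select on Y" strategy from the quote above: keep every rare case and a random subset of the controls. The counts come from the earlier slide; the 1:4 ratio is an illustrative choice.

```python
# A sketch of case-control selection: keep all rare cases (Y = 1) and a random
# subset of controls (Y = 0); counts from the earlier slide, 1:4 ratio chosen arbitrarily.
import numpy as np

rng = np.random.default_rng(42)
y = np.array([1] * 112 + [0] * 3130)

cases = np.where(y == 1)[0]                   # all faulty signals
controls = rng.choice(np.where(y == 0)[0],    # random healthy signals
                      size=4 * len(cases), replace=False)
selected = np.concatenate([cases, controls])
print(len(selected))  # 112 + 448 = 560 examples with a 1:4 class ratio
```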
Training Tracking
1. Pen & Paper
2. Spreadsheet
3. Dedicated framework
- Weights and Biases
- Neptune.ai
- Tensorflow
- …
26
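As an illustration of option 3 (a dedicated framework), a minimal tracking sketch with the Weights & Biases client; it assumes a configured wandb account, and the project name, config, and logged values are hypothetical.

```python
# A sketch with the Weights & Biases client; requires a configured wandb account,
# and the project name, config, and loss values are placeholders.
import wandb

run = wandb.init(project="mlops-demo", config={"learning_rate": 1e-3, "epochs": 5})
for epoch in range(run.config.epochs):
    train_loss = 1.0 / (epoch + 1)               # stand-in for a real training loop
    wandb.log({"epoch": epoch, "train_loss": train_loss})
run.finish()
```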
Error Analysis
Such analysis may reveal issues with labelling or rare classes in the data.
For unstructured data, a cockpit (dashboard) could help in the analysis.
It is also useful for monitoring certain classes of inputs.
27
[Table from the slide: each signal (Magnet 1, Magnet 2, Magnet 3) is marked against error categories: Noise, Gap in signal, Bias, Wrong sampling.]
28
Deployment
29
Degrees of automation
Modes of deployment
Reproducible environments
Data Engineering Modelling Deployment Monitoring
Degrees of Automation
30
C. Obermair, Extension of Signal Monitoring Applications with Machine Learning, Master Thesis, TU Graz
Human inspection Shadow mode
Human in
the loop
Full Automation
Starting from Shadow mode we can collect more training data in production!
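A minimal sketch of shadow mode: the old model keeps serving users while the new model's predictions are only logged for offline comparison; the stand-in models are placeholders.

```python
# A sketch of shadow-mode serving: users get the old model's answer, the new
# model runs silently and its output is only logged for later comparison.
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("shadow")

def predict_with_shadow(x, old_model, new_model):
    served = old_model(x)              # this is what the user actually receives
    try:
        shadow = new_model(x)          # evaluated silently, never returned
        log.info("input=%s served=%s shadow=%s", x, served, shadow)
    except Exception:                  # the shadow model must never break serving
        log.exception("shadow model failed")
    return served

# Toy usage with stand-in models:
print(predict_with_shadow(3.0, old_model=lambda x: 2 * x, new_model=lambda x: 2.1 * x))
```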
Modes of Deployment
31
https://hbr.org/2017/09/the-surprising-power-of-online-experiments
https://en.wikipedia.org/wiki/Blue-winged_parrot
[Diagram: a router splits incoming traffic, sending X% of requests to the new version and the remaining 100-X% to the old version.]
- In canary deployment there is a gradual switch between versions.
- In blue/green deployment there is an on/off switch between versions.
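A minimal sketch of canary routing: a configurable fraction of requests is sent to the new version; the 5% fraction and stand-in models are illustrative.

```python
# A sketch of canary routing: X% of requests go to the new version, 100-X% to the old one.
import random

def route(request, old_version, new_version, canary_fraction=0.05):
    """Send roughly canary_fraction of the traffic to the new version."""
    if random.random() < canary_fraction:
        return new_version(request)
    return old_version(request)

# Toy usage: about 5% of 1000 calls hit the new model.
answers = [route(x, old_version=lambda r: "old", new_version=lambda r: "new")
           for x in range(1000)]
print(answers.count("new"), "of 1000 requests served by the new version")
```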
Reproducible Environments
32
Computing environment (OS, Python, packages)
Option 1: Docker containers on our own computing infrastructure, packaging the HTTP server, REST API, data pipeline, and ML model (request/response).
Option 2: serverless compute with KServe, serving a pool of models described by a config file (request/response).
We will play with those during the exercise sessions!
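A minimal sketch of the containerized option (HTTP server + REST API + data pipeline + model), assuming FastAPI and uvicorn, which the slides do not prescribe; the pipeline and model here are placeholders.

```python
# A sketch, assuming FastAPI: one container exposing an HTTP server with a REST API
# that runs the data pipeline and the ML model; both are placeholders here.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class PredictionRequest(BaseModel):
    features: list[float]

def data_pipeline(features: list[float]) -> list[float]:
    mean = sum(features) / len(features)
    return [f - mean for f in features]   # same preprocessing as during training

def model(features: list[float]) -> float:
    return sum(features)                  # stand-in for the real trained model

@app.post("/predict")
def predict(request: PredictionRequest) -> dict:
    return {"prediction": model(data_pipeline(request.features))}

# Inside the Docker container, run e.g.:  uvicorn serve:app --host 0.0.0.0 --port 8080
```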
Monitoring
33
Useful metrics
Relevant frameworks
Data Engineering Modelling Deployment Monitoring
34
Relevant Metrics
• Model metrics
• Distribution of input features – data/concept drift
• Missing/malformed values in the input
• Average output accuracy/classification distribution – concept drift
• Infrastructure metrics
• Logging errors
• Memory and CPU resource utilization
• Latency and jitter
35
For each of the relevant metrics one should define warning/error thresholds.
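A minimal sketch of monitoring one input feature for data drift with a two-sample Kolmogorov-Smirnov test (SciPy) and warning/error thresholds; the distributions and thresholds are illustrative.

```python
# A sketch of drift monitoring for one input feature with a two-sample KS test (SciPy);
# the reference/production data and the thresholds are illustrative.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
training_feature = rng.normal(loc=0.0, scale=1.0, size=5_000)    # reference distribution
production_feature = rng.normal(loc=0.3, scale=1.0, size=1_000)  # recent serving data

stat, p_value = ks_2samp(training_feature, production_feature)
WARN, ERROR = 0.05, 0.001  # warning/error thresholds on the p-value, to be tuned per metric
if p_value < ERROR:
    print(f"ERROR: input distribution drifted (KS={stat:.3f}, p={p_value:.1e})")
elif p_value < WARN:
    print(f"WARNING: possible drift (KS={stat:.3f}, p={p_value:.1e})")
else:
    print("input feature distribution looks stable")
```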
Monitoring Matters
36
C. Obermair, Extension of Signal Monitoring Applications with Machine Learning, Master Thesis, TU Graz
37
Data Engineering Modelling Deployment Monitoring
MLOps Pipeline with Tensorflow
38
https://www.tensorflow.org/tfx/guide
Pipeline represented as a DAG (directed acyclic graph)
Data Engineering
Modelling
Deployment
39
MLOps Pipeline with Kubeflow
https://ml.cern.ch
https://www.kubeflow.org/docs/started/
Data Engineering
Modelling
Deployment
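A minimal Kubeflow Pipelines sketch (kfp v2 SDK) of a two-step DAG; the component bodies are placeholders and this is not the pipeline used at ml.cern.ch.

```python
# A sketch with the kfp v2 SDK: a two-step DAG (data engineering -> modelling);
# component bodies are placeholders, not the actual ml.cern.ch pipelines.
from kfp import compiler, dsl

@dsl.component(base_image="python:3.11")
def prepare_data(n_samples: int) -> int:
    return n_samples                      # stand-in for the data engineering step

@dsl.component(base_image="python:3.11")
def train_model(n_samples: int) -> str:
    return f"model trained on {n_samples} samples"   # stand-in for the modelling step

@dsl.pipeline(name="mlops-demo")
def mlops_pipeline(n_samples: int = 1000):
    data = prepare_data(n_samples=n_samples)
    train_model(n_samples=data.output)

# Compile to a package that can be uploaded to a Kubeflow Pipelines instance.
compiler.Compiler().compile(mlops_pipeline, package_path="mlops_pipeline.yaml")
```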
I do hope the presented MLOps concepts will allow your models to transition
from Good to Great.
40
                  Development ML         Production ML
Objective         High-accuracy model    Efficiency of the overall system
Dataset           Fixed                  Evolving
Code quality      Secondary importance   Critical
Model training    Optimal tuning         Fast turn-arounds
Reproducibility   Secondary importance   Critical
Traceability      Secondary importance   Critical
Conclusion
Resources
41
Machine Learning Engineering for Production (MLOps) Specialization