10/29/2016 Data Science Camp, Santa Clara
Managing and Versioning Machine Learning Models in Python
Simon Frid github.com/fridiculous
Session Overview
1. Motivation
1. Image Recognition Use Case
2. Ad Conversion Use Case
3. Fraud Prediction Use Case
2. Strategies and Design Considerations
1. Data Science Workflow
2. What Can We Learn from Software Version Control
3. Python Tools
4. Solutions
1. Estimators and Django-Estimators
2. Demo
Disclaimer
Use Case 1:
Car Rental Marketplace
Identifying Cars/Inventory with Image Recognition
How do we Iterate?
✤ Help clarify features. Improve photo attributes, e.g. edge detection.
✤ Human in the loop!
✤ Add computational power & GPUs
Use Case: Image Recognition
✤ Lots of models.
✤ Time to Develop. Time to Deploy.
✤ How do we reference these models? Which one do we choose for production?
Use Case 2:
Selling Student Loans
Predicting Conversion Rate on Ads
Frequent Training
✤ Yearly Seasonality
✤ Irregular Monthly Effects
✤ Current Activity of the User’s Demographics Matters
✤ A/B testing and Multi-Armed Bandits
Selling Student Loans
✤ Lots of Models
✤ Lots of Trained Versions
✤ Lots of Data “Slicing” Options
✤ How do we Reference Models at training time? How do we Reference Models for A/B testing?
Use Case 3:
Payment Gateway
Predicting Fraudulent Transactions
Fraud Patterns Change over Time
✤ A game of Cat and Mouse
Predicting Fraud
✤ Sudden change in Signature Signal
✤ Forensic Analysis of Obsolete Models
✤ Time Relevance of the Features
✤ How do we …
“There are practical little things in housekeeping which no man really understands.”
Concept in Software Version Control: Definition and Technology Needed
✤ Repository
  Definition: The repository is where files' current and historical data are stored, often on a server. Sometimes also called a depot.
  Technology needed: Persistence & Serialization
✤ Versioning
  Definition: The process of assigning either unique version names or unique version numbers to unique states of computer software.
  Technology needed: Indexing & Hashing
✤ Commits, Tags and Labels
  Definition: A tag or label refers to an important snapshot in time, consistent across many files. These files at that point may all be tagged with a user-friendly, meaningful name or revision number.
  Technology needed: Attributes & Tags
✤ Push, Pull and Checkout
  Definition: To create a working copy from a repository. With respect to pushing and pulling, a push sends a copy of one repository to another repository. To pull retrieves a copy of a target repository.
  Technology needed: API to persist and retrieve
✤ Diff
  Definition: Represents a specific modification to a document under version control. The granularity of the modification considered a change varies between version control systems.
  Technology needed: 😃
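The Diff entry above names no technology. One possible reading, offered here as an assumption and not as the speaker's method, is to diff the hyperparameters of two model versions. The helper `diff_params` and the parameter names below are hypothetical:

```python
def diff_params(old, new):
    """Return {name: (old_value, new_value)} for every parameter that differs.

    A parameter missing on one side is reported with None on that side.
    """
    changes = {}
    for name in sorted(set(old) | set(new)):
        before, after = old.get(name), new.get(name)
        if before != after:
            changes[name] = (before, after)
    return changes


# Hypothetical hyperparameters for two versions of the same model.
v1 = {"n_estimators": 100, "max_depth": 5}
v2 = {"n_estimators": 200, "max_depth": 5, "min_samples_leaf": 2}
changes = diff_params(v1, v2)
```

A diff of this kind only covers configuration. Comparing learned weights or predictions would need a different notion of change.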
Algorithm Options
✤ scikit-learn
✤ MILK
✤ Statsmodels
✤ pylearn2
✤ nolearn
✤ nuPIC
✤ Nilearn
✤ gensim
✤ NLTK
✤ spacy
✤ scikit-image
✤ autolearn
✤ TPOT
✤ crab
✤ XGBoost
✤ pydeap
✤ pgmpy
✤ Caffe
✤ TensorFlow
✤ Keras
✤ gym
Persistence Layer Options
✤ S3 - e.g. s3://bucket/project/model.pkl
✤ Git LFS
✤ Elasticsearch and Document-based Stores
✤ Docker
✤ Pachyderm
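Whichever backend is chosen, the persistence layer can be treated as a store of opaque bytes addressed by a key. Below is a minimal sketch that uses the local filesystem as a stand-in for an object store. The `put` and `get` helpers are hypothetical, and the key only mirrors the layout of the S3 example above:

```python
import tempfile
from pathlib import Path


def put(root, key, payload):
    """Write bytes under root/key, creating parent directories as needed."""
    path = Path(root) / key
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(payload)
    return path


def get(root, key):
    """Read back the bytes stored under root/key."""
    return (Path(root) / key).read_bytes()


with tempfile.TemporaryDirectory() as root:
    # The key follows the project/model.pkl layout from the S3 example.
    put(root, "project/model.pkl", b"serialized model bytes")
    payload = get(root, "project/model.pkl")
```

Keeping the store ignorant of what the bytes contain lets the serialization format change without touching the persistence code.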
Serialization Options
✤ cPickle (py2) and pickle (py3)
✤ joblib (sklearn.externals.joblib)
✤ dill, cloudpickle and picklable-itertools
✤ PMML via jpmml-sklearn
✤ and what about transformer pipelines?
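One hedged answer to the pipeline question is to serialize the preprocessing state and the model state as a single object, so the two cannot drift apart. The sketch below uses the standard-library `pickle` module. The `pipeline` dictionary is a hypothetical stand-in built from plain Python types, not a real scikit-learn object:

```python
import pickle

# Stand-in for a fitted pipeline: preprocessing state and model state are
# kept together so they are serialized, and later restored, as one unit.
pipeline = {
    "steps": [
        ("scale", {"mean": [1.5, 0.2], "std": [0.5, 0.1]}),
        ("classify", {"weights": [0.8, -0.3], "bias": 0.1}),
    ]
}

blob = pickle.dumps(pipeline, protocol=pickle.HIGHEST_PROTOCOL)
restored = pickle.loads(blob)
```

Unpickling can execute arbitrary code, so pickled models should only be loaded from trusted sources.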
Indexing & Hashing
✤ Hashing the model
✤ Hashing the data
✤ Relational Database Table for Look Up
✤ Key Value Stores like Redis, Dynamo
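The first three bullets can be sketched together: hash the model, hash the data, and record both in a relational lookup table. This is an illustration under assumptions, not a prescribed design. The `fingerprint` helper, the table schema, and the sample values are all hypothetical, and an in-memory SQLite database stands in for the relational store:

```python
import hashlib
import pickle
import sqlite3


def fingerprint(obj):
    """SHA-256 hex digest of the object's pickled bytes."""
    return hashlib.sha256(pickle.dumps(obj, protocol=4)).hexdigest()


# Stand-ins for a trained model and its training data.
model = {"weights": [0.8, -0.3], "bias": 0.1}
data = [(1.0, 2.0, 0), (3.0, 4.0, 1)]

model_hash = fingerprint(model)
data_hash = fingerprint(data)

# Relational lookup table: which model was trained on which data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE models (model_hash TEXT PRIMARY KEY, data_hash TEXT, path TEXT)"
)
conn.execute(
    "INSERT INTO models VALUES (?, ?, ?)",
    (model_hash, data_hash, "project/model.pkl"),
)
row = conn.execute(
    "SELECT path FROM models WHERE model_hash = ?", (model_hash,)
).fetchone()
conn.close()
```

Pickled bytes can differ across Python versions and with dictionary insertion order, so a hash over a canonical representation is a safer identifier for long-lived records.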
Labels
✤ Semantic Versioning, Major.Minor.Patch
✤ Tags (django-taggit)
✤ Storing MetaData, create_dates, relationships between models
✤ Notes and learnings (from Humans in the Loop)
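As a rough sketch of how these labels might sit next to a stored model, the record below combines a Major.Minor.Patch version, tags, a creation date, a link to a parent model, and free-form notes. The field names and the `bump` and `format_version` helpers are hypothetical:

```python
from datetime import datetime, timezone


def bump(version, part):
    """Return the next (major, minor, patch) tuple; lower parts reset to zero."""
    major, minor, patch = version
    if part == "major":
        return (major + 1, 0, 0)
    if part == "minor":
        return (major, minor + 1, 0)
    if part == "patch":
        return (major, minor, patch + 1)
    raise ValueError(f"unknown version part: {part!r}")


def format_version(version):
    """Render a version tuple as a Major.Minor.Patch string."""
    return ".".join(str(part) for part in version)


# Hypothetical metadata record for one trained model.
record = {
    "name": "fraud-classifier",
    "version": (1, 4, 2),
    "tags": {"production"},
    "created": datetime.now(timezone.utc),
    "parent": None,  # relationship to the model this one was derived from
    "notes": "Retrained after reviewer feedback on false positives.",
}
next_version = bump(record["version"], "minor")
```

What counts as a major, minor, or patch change for a model (new features, new data, new hyperparameters) is a team convention that the slides leave open.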
API… components…
✤ Custom using an ORM/DAL like Django and SQLAlchemy
✤ SaaS & PaaS - Turi, ScienceOps, PredictionIO, Azure ML
✤ Asynchronous Tasks - Airflow, Luigi, Celery
✤ Flows using Docker and Pachyderm
Estimators
✤ a standalone client as an API for your ML repo
✤ current focus “to persist upon prediction”
✤ Uses SQLAlchemy and local filesystem (for now)
✤ github.com/fridiculous/estimators
✤ pip install estimators (pre-alpha development version)
Django-Estimators
✤ a Django extension for ML models
✤ current focus “to persist each object”
✤ Uses Django and local filesystem (for now)
✤ github.com/fridiculous/django-estimators
✤ pip install django-estimators (pre-alpha development version)
Demo
Fin.

Editor's Notes

  1. A handcar (also known as a pump trolley, pump car, jigger, Kalamazoo, velocipede, or draisine) is a railroad car powered by its passengers, or by people pushing the car from behind. It is mostly used as a maintenance of way or mining car, but it was also used for passenger service in some cases. A typical design consists of an arm, called the walking beam, that pivots, seesaw-like, on a base, which the passengers alternately push down and pull up to move the car. It reflects the current state of machine learning applications. “To discuss strategies and tools that help organize our ML systems.”
  2. I’m NOT an Expert. I’m a practitioner.
  3. but who knows, maybe the Pokemon mobile is the hottest rental over the weekend
  4. by Eleanor Roosevelt. We need a lot of tooling to automate and organize this information
  5. Yellow is the data science sandbox, blue is our business strategy role, and red is our product and engineering role.
  6. “Automation” - when we need to script, schedule, or repeat a particular process. It can be ETL, it can be training a model, it can be retraining models, it can be parameter optimization. In all these cases, every time we automate, we need to know what we’re automating.
  7. we need help.