Work Hard Once
Strategy and Automation applied to
building machine learning models
Franklin Sarkett
April 2, 2018
About me: franklin.sarkett@gmail.com
Audantic Real Estate Analytics, co-founder
● http://audantic.com/
● Audantic provides customized data, analytics, and predictive analytics for
residential real estate.
Facebook
● Data scientist at Facebook; developed an algorithm for the Ads Payments team that
increased revenue by over $200 million and earned a patent.
Education
● CS degree from University of Illinois at Urbana Champaign
● MS in Applied Statistics from DePaul University.
Summary
Building machine learning models from data ingestion to productionalization is challenging,
with many steps.
Of all the steps, feature engineering is the biggest differentiator between models that work
and models that do not.
Using automation and strategy we can remove some of the most challenging parts, and
focus on the area of machine learning that generates the most value: feature engineering.
John Boyd and the OODA Loop
The OODA loop is the decision cycle of observe, orient, decide, and act, developed by military strategist
and United States Air Force Colonel John Boyd.
Boyd applied the concept to the combat operations process.
It is now also often applied to understand commercial operations and learning processes.
The approach favors agility over raw power in dealing with human opponents in any endeavor.
- Wikipedia
PyData Chicago - Work Hard Once
Orient (most important)
"Orient" is the key to the OODA loop.
Since one is conditioned by one's heritage, surrounding culture, existing knowledge and
learnings, the mind combines fragments of ideas, information, conjectures, impressions,
etc. to generate our orientation.
How well your orientation matches the real world is largely a function of how well you
observe.
Stages of Machine Learning
The stages map onto the OODA loop:
● Observe: get raw data (SQL, CSV, API)
● Orient: data cleaning, feature engineering
● Decide: model training, model evaluation
● Act: deployment
Two guiding thoughts
A mentor of mine at FB was coaching me on our model building:
building models requires domain knowledge, and you should put as much data into the model as you can.
To improve the models, you need to add:
● Data quality
● Data volume
○ Breadth
○ Depth
Addressing these concerns takes Feature Engineering to the next level.
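As a toy illustration of the two data-volume axes (all field names and values below are invented): depth adds more observations, while breadth joins in more signals per observation.

```python
# Toy illustration of the two data-volume axes (fields are invented).
rows = [
    {"id": 1, "price": 250000},
    {"id": 2, "price": 410000},
]

# Depth: more observations of the same shape.
rows.append({"id": 3, "price": 330000})

# Breadth: more signals per observation, e.g. joining in a
# tax-assessment table keyed on the same id.
assessments = {1: 240000, 2: 395000, 3: 310000}
for row in rows:
    row["assessed_value"] = assessments[row["id"]]

print(len(rows), len(rows[0]))  # 3 rows, 3 fields each
```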
Automating the Observe stage
Many of the tasks in the observe stage could be classified as DevOps and Data Engineering.
My favorite tools to use for data science:
● Docker
● Jenkins
● Luigi
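The core idea Luigi contributes to the observe stage is that a task runs only when its output target is missing, which makes re-running an ingestion pipeline idempotent. Below is a minimal, dependency-free sketch of that pattern using only the standard library; the paths and fields are invented for illustration.

```python
import csv
import os
import tempfile


def fetch_raw_data(out_path, rows):
    """Observe step: write raw rows to out_path unless it already exists.

    Mimics Luigi's core contract: a task only runs when its output
    target is missing, so re-running the pipeline is idempotent.
    """
    if os.path.exists(out_path):
        return False  # target exists; skip (Luigi would mark it complete)
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "price", "sqft"])
        writer.writerows(rows)
    return True


# Usage: the first call does the work, the second is a no-op.
tmp = os.path.join(tempfile.mkdtemp(), "raw.csv")
first = fetch_raw_data(tmp, [[1, 250000, 1400], [2, 410000, 2100]])
second = fetch_raw_data(tmp, [[1, 250000, 1400]])
print(first, second)  # True False
```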
Orient - Feature Engineering
“Coming up with features is difficult, time-consuming, requires expert
knowledge. 'Applied machine learning' is basically feature engineering.”
— Prof. Andrew Ng.
Orient - Feature Engineering
“The algorithms we used are very standard for Kagglers. …We spent
most of our efforts in feature engineering. … We were also very careful
to discard features likely to expose us to the risk of over-fitting our
model.”
— Xavier Conort
Orient - Feature Engineering
“Feature engineering is the process of transforming raw data into features
that better represent the underlying problem to the predictive models,
resulting in improved model accuracy on unseen data.”
— Dr. Jason Brownlee
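To make the definition concrete, here is a minimal sketch with an invented raw record in the spirit of residential real estate data; the derived fields are illustrative, not features from the actual Audantic models.

```python
from datetime import date

# A hypothetical raw record, as it might come out of the Observe stage.
raw = {"list_date": "2018-04-02", "price": 350000, "sqft": 1750}

# Feature engineering: derive representations that expose the underlying
# problem better than the raw fields do.
listed = date.fromisoformat(raw["list_date"])
features = {
    "price_per_sqft": raw["price"] / raw["sqft"],  # ratio feature
    "list_month": listed.month,                    # seasonality signal
    "list_weekday": listed.weekday(),              # 0 = Monday
}
print(features)  # {'price_per_sqft': 200.0, 'list_month': 4, 'list_weekday': 0}
```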
Orient - Feature Engineering
At the end of the day, some machine learning projects succeed
and some fail. What makes the difference? Easily the most
important factor is the features used...It is often also one of the
most interesting parts, where intuition, creativity and “black art”
are as important as the technical stuff.
- Pedro Domingos, Professor of Computer Science at the University of Washington
Code snippet
http://bit.ly/PyDataChi-FeatureEngineering
How do we iterate feature engineering faster?
● Create a pipeline of transforms with a final estimator.
● A Pipeline can be used to chain multiple estimators into one. This is useful because there is often a fixed sequence of
steps in processing the data, for example feature selection, normalization, and classification.
● Benefits:
○ Convenience and encapsulation.
You only have to call fit and predict once on your data to fit a whole sequence of estimators.
○ Safety.
Pipelines help avoid leaking statistics from your test data into the trained model in cross-validation, by
ensuring that the same samples are used to train the transformers and predictors.
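A minimal sketch of that pattern with scikit-learn's Pipeline, using the bundled iris dataset so the example stays self-contained; the step names are arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Chain feature selection, normalization, and classification into one
# estimator; fit/predict on the pipeline runs every step in order.
pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=2)),    # feature selection
    ("scale", StandardScaler()),                # normalization
    ("clf", LogisticRegression(max_iter=200)),  # classification
])

# Because the whole chain is one estimator, cross_val_score refits the
# selector and scaler inside each fold, so no test-fold statistics leak
# into training.
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```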
Feature extraction
[Three code slides walk through feature extraction and assemble the pipeline; see the snippet linked above.]
The pipeline
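One way to plug hand-written feature extraction into such a pipeline is FunctionTransformer, sketched here with invented columns and values rather than the deck's actual snippet.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

# Toy design matrix: columns are (price, sqft); values are invented.
X = np.array([[250000.0, 1400.0],
              [410000.0, 2100.0],
              [330000.0, 1800.0]])
y = np.array([1380.0, 2080.0, 1790.0])  # arbitrary target for the demo


def add_price_per_sqft(X):
    # Feature extraction as a pure function: append a ratio column.
    ratio = (X[:, 0] / X[:, 1]).reshape(-1, 1)
    return np.hstack([X, ratio])


# The extraction step runs inside the pipeline, so it is re-applied
# consistently at both fit and predict time.
pipe = Pipeline([
    ("extract", FunctionTransformer(add_price_per_sqft)),
    ("model", LinearRegression()),
])
pipe.fit(X, y)
print(pipe.predict(X).shape)  # (3,)
```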
Summary
Building machine learning models from data ingestion to productionalization is hard.
Using automation and strategy we can remove some of the most challenging parts,
and focus on the area of machine learning that generates the most value: feature
engineering.
When we use automation and strategy to remove the most challenging parts of
machine learning, we can run through more OODA loops faster, generate better
models, learn more about our subject, and deliver more value.
franklin.sarkett@gmail.com