Work Hard Once
Strategy and Automation applied to
building machine learning models
Franklin Sarkett
April 2, 2018
About me: franklin.sarkett@gmail.com
Audantic Real Estate Analytics, co-founder
● http://audantic.com/
● Audantic provides customized data, analytics, and predictive analytics for
residential real estate.
Facebook
● Data scientist at Facebook; developed an algorithm for the Ads Payments team that
increased revenue by over $200 million and earned a patent.
Education
● CS degree from University of Illinois at Urbana Champaign
● MS in Applied Statistics from DePaul University.
Summary
Building machine learning models from data ingestion to productionalization is challenging,
with many steps.
Of all the steps, feature engineering is the biggest differentiator between models that work
and models that do not.
Using automation and strategy we can remove some of the most challenging parts, and
focus on the area of machine learning that generates the most value: feature engineering.
John Boyd and the OODA Loop
The OODA loop is the decision cycle of observe, orient, decide, and act, developed by military strategist
and United States Air Force Colonel John Boyd.
Boyd applied the concept to the combat operations process.
It is now also often applied to understand commercial operations and learning processes.
The approach favors agility over raw power in dealing with human opponents in any endeavor.
- Wikipedia
PyData Chicago - Work Hard Once
Orient (most important)
"Orient" is the key to the OODA loop.
Since one is conditioned by one's heritage, surrounding culture, existing knowledge and
learnings, the mind combines fragments of ideas, information, conjectures, impressions,
etc. to generate our orientation.
How well your orientation matches the real world is largely a function of how well you
observe.
Stages of Machine Learning
The stages map onto the OODA loop:
● Observe: get raw data (SQL, CSV, API)
● Orient: data cleaning, feature engineering
● Decide: model training, model evaluation
● Act: deployment
Two guiding thoughts
A mentor of mine at FB was coaching me on our model building:
building models requires domain knowledge, and you should put as much data into the model as you can.
To improve the models, you need to add:
● Data quality
● Data volume
○ Breadth
○ Depth
Addressing these concerns takes Feature Engineering to the next level.
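As a toy illustration of the two data-volume axes (all field names and values below are invented): depth adds more observations, while breadth joins in more signals per observation.

```python
# Toy illustration of the two data-volume axes (fields are invented).
rows = [
    {"id": 1, "price": 250000},
    {"id": 2, "price": 410000},
]

# Depth: more observations of the same shape.
rows.append({"id": 3, "price": 330000})

# Breadth: more signals per observation, e.g. joining in a
# tax-assessment table keyed on the same id.
assessments = {1: 240000, 2: 395000, 3: 310000}
for row in rows:
    row["assessed_value"] = assessments[row["id"]]

print(len(rows), len(rows[0]))  # 3 rows, 3 fields each
```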
Automating the Observe stage
Many of the tasks in the observe stage could be classified as DevOps and Data Engineering.
My favorite tools to use for data science:
● Docker
● Jenkins
● Luigi
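The core idea Luigi contributes to the observe stage is that a task runs only when its output target is missing, which makes re-running an ingestion pipeline idempotent. Below is a minimal, dependency-free sketch of that pattern using only the standard library; the paths and fields are invented for illustration.

```python
import csv
import os
import tempfile


def fetch_raw_data(out_path, rows):
    """Observe step: write raw rows to out_path unless it already exists.

    Mimics Luigi's core contract: a task only runs when its output
    target is missing, so re-running the pipeline is idempotent.
    """
    if os.path.exists(out_path):
        return False  # target exists; skip (Luigi would mark it complete)
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "price", "sqft"])
        writer.writerows(rows)
    return True


# Usage: the first call does the work, the second is a no-op.
tmp = os.path.join(tempfile.mkdtemp(), "raw.csv")
first = fetch_raw_data(tmp, [[1, 250000, 1400], [2, 410000, 2100]])
second = fetch_raw_data(tmp, [[1, 250000, 1400]])
print(first, second)  # True False
```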
Orient - Feature Engineering
“Coming up with features is difficult, time-consuming, requires expert
knowledge. 'Applied machine learning' is basically feature engineering.”
— Prof. Andrew Ng.
Orient - Feature Engineering
“The algorithms we used are very standard for Kagglers. …We spent
most of our efforts in feature engineering. … We were also very careful
to discard features likely to expose us to the risk of over-fitting our
model.”
— Xavier Conort
Orient - Feature Engineering
“Feature engineering is the process of transforming raw data into features
that better represent the underlying problem to the predictive models,
resulting in improved model accuracy on unseen data.”
— Dr. Jason Brownlee
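To make the definition concrete, here is a minimal sketch with an invented raw record in the spirit of residential real estate data; the derived fields are illustrative, not features from the actual Audantic models.

```python
from datetime import date

# A hypothetical raw record, as it might come out of the Observe stage.
raw = {"list_date": "2018-04-02", "price": 350000, "sqft": 1750}

# Feature engineering: derive representations that expose the underlying
# problem better than the raw fields do.
listed = date.fromisoformat(raw["list_date"])
features = {
    "price_per_sqft": raw["price"] / raw["sqft"],  # ratio feature
    "list_month": listed.month,                    # seasonality signal
    "list_weekday": listed.weekday(),              # 0 = Monday
}
print(features)  # {'price_per_sqft': 200.0, 'list_month': 4, 'list_weekday': 0}
```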
Orient - Feature Engineering
At the end of the day, some machine learning projects succeed
and some fail. What makes the difference? Easily the most
important factor is the features used...It is often also one of the
most interesting parts, where intuition, creativity and “black art”
are as important as the technical stuff.
- Pedro Domingos, Professor of Computer Science at the University of Washington
Code snippet
http://bit.ly/PyDataChi-FeatureEngineering
How do we iterate feature engineering faster?
● Create a pipeline of transforms with a final estimator.
● A Pipeline can be used to chain multiple estimators into one. This is useful because there is often a fixed sequence of
steps in processing the data, for example feature selection, normalization, and classification.
● Benefits:
○ Convenience and encapsulation.
You only have to call fit and predict once on your data to fit a whole sequence of estimators.
○ Safety.
Pipelines help avoid leaking statistics from your test data into the trained model in cross-validation, by
ensuring that the same samples are used to train the transformers and predictors.
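A minimal sketch of that pattern with scikit-learn's Pipeline, using the bundled iris dataset so the example stays self-contained; the step names are arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Chain feature selection, normalization, and classification into one
# estimator; fit/predict on the pipeline runs every step in order.
pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=2)),    # feature selection
    ("scale", StandardScaler()),                # normalization
    ("clf", LogisticRegression(max_iter=200)),  # classification
])

# Because the whole chain is one estimator, cross_val_score refits the
# selector and scaler inside each fold, so no test-fold statistics leak
# into training.
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```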
Feature extraction
[Three code slides walk through feature extraction and assemble the pipeline; see the snippet linked above.]
The pipeline
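One way to plug hand-written feature extraction into such a pipeline is FunctionTransformer, sketched here with invented columns and values rather than the deck's actual snippet.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

# Toy design matrix: columns are (price, sqft); values are invented.
X = np.array([[250000.0, 1400.0],
              [410000.0, 2100.0],
              [330000.0, 1800.0]])
y = np.array([1380.0, 2080.0, 1790.0])  # arbitrary target for the demo


def add_price_per_sqft(X):
    # Feature extraction as a pure function: append a ratio column.
    ratio = (X[:, 0] / X[:, 1]).reshape(-1, 1)
    return np.hstack([X, ratio])


# The extraction step runs inside the pipeline, so it is re-applied
# consistently at both fit and predict time.
pipe = Pipeline([
    ("extract", FunctionTransformer(add_price_per_sqft)),
    ("model", LinearRegression()),
])
pipe.fit(X, y)
print(pipe.predict(X).shape)  # (3,)
```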
Summary
Building machine learning models from data ingestion to productionalization is hard.
Using automation and strategy we can remove some of the most challenging parts,
and focus on the area of machine learning that generates the most value: feature
engineering.
When we use automation and strategy to remove the most challenging parts of
machine learning, we can run through more OODA loops faster, generate better
models, learn more about our subject, and deliver more value.
franklin.sarkett@gmail.com