
1

Data Science @ PMI
Tools of The Trade
Best Practices to Start, Develop and Ship a Data Science Product
Manuel Valverde
Tokyo WebHack, 17th January 2019

2

• PhD @ Granada U., Spain: Physics modelling and MC simulations for Super-Kamiokande
• PostDoc @ Osaka U., Osaka: Nuclear structure calculations (think Gaussian processes)
• Data Scientist @ Rakuten, Tokyo: Search relevancy for e-commerce
• Data Scientist @ PMI, Tokyo: Fraud prevention
About Me

3

About Philip Morris International
• Founded in 1847
• No. 108 in the 2018 Fortune 500
• 80,000 employees, 180+ markets, 150M consumers
• 6 of the world's top 15 international brands
Shifting from combustible cigarettes to smoke-free, reduced-risk products (RRP)
https://www.pmi.com/smoke-free-products

4

• We are part of PMI's Enterprise Analytics and Data (EAD) group
• 40+ Data Scientists across 4 hubs
• Offices in Amsterdam (NL), Kraków (PL), Lausanne (CH) and Tokyo (JP)
• Profiles
• Education: 30% PhD, 70% MSc/BSc
• Data Science Experience: 7.4 yrs on average
• Experience in PMI: 88% under 2yrs
• Expertise in Machine Learning, Big Data Engineering, Insights Communication
• Scrum certified (Professional Scrum Developer)
Data Science @ PMI

5

[Map: lab footprint – 2 Labs LA, 2 Labs North America, 2 Labs EE; Add 1 Lab EU, Add 2 Labs Asia]
A ( Data x Science x Communication ) = Insight
Data is only one part of the equation. We bring the scientific method. It materializes in the analytical code we
write. It is as valuable as the data itself.
B We are business driven
Whatever we do, it contributes to the business. We are diligent about making an impact.
C We invest in people
We invest in the ability to ask questions. It can’t be achieved with tools only. Tools are for generating answers,
but questions are posed by people.
D We self-organize
We choose coordination & cultivation over command & control. We believe this approach allows for the best
solutions to emerge.
E We iterate and improve
We embrace lean development, we learn from mistakes and we do it together with business.
F We co-create
The data insights ecosystem requires collaboration among all parties. We want to be active contributors.
Data Science Principles @ PMI

6

PMI’s Data Ocean

7

Why are we here?
Because a data scientist is not just someone who knows more statistics than a programmer
Data Science is Software.
The product of a Data Science effort (a Model or a Report) is essentially a small but critical part of a larger piece of sophisticated business software. Data Products must therefore be designed to play well with systems up- and downstream.
Remember that the system can work without a model, but a model is pretty much worthless without the system.

8

Writing code that implements machine learning algorithms is getting easier every year.
Building a scikit-learn Pipeline that fits a Random Forest model with GridSearchCV takes less than twenty lines of code today (see the sketch below). AutoML is around the corner.
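For illustration, a minimal sketch of such a pipeline, using only public scikit-learn APIs and a bundled toy dataset (not PMI data or code):

```python
# Minimal sketch: a Random Forest inside a scikit-learn Pipeline, tuned with GridSearchCV.
# Public scikit-learn APIs and a bundled toy dataset only; for illustration, not PMI code.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

pipeline = Pipeline([
    ("scale", StandardScaler()),                      # preprocessing step
    ("rf", RandomForestClassifier(random_state=42)),  # model step
])
param_grid = {"rf__n_estimators": [100, 300], "rf__max_depth": [5, None]}
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="roc_auc")
search.fit(X_train, y_train)
print(search.best_params_, search.score(X_test, y_test))
```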
We need to acknowledge and understand two things:
• The code, or even the model, is not our end goal.
• We're in the business of building intelligent applications, or data products.
Why are we here?
Because a data scientist is not just someone who knows more programming than a statistician

9

• Obtain: connect to DBs, download flat files
• Scrub: outliers/missing data, aggregations
• Explore: statistical analysis, feature engineering
• Model: learning algorithms, parameter optimization
• INdustrialize: reports, APIs
[Diagram: Exploratory → Production → Smart Application]
An OSEMN Data Science Process
Explore, Model, Iterate. Create a Data Product.
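To make the flow concrete, here is a hypothetical skeleton (the function names, file paths and columns are illustrative, not a PMI convention) that chains the five OSEMN stages into a single production entry point:

```python
# Hypothetical OSEMN skeleton; file paths, column names and the fraud label are illustrative.
import joblib
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def obtain() -> pd.DataFrame:
    """Obtain: connect to a DB or download flat files; a local CSV stands in here."""
    return pd.read_csv("data/raw/transactions.csv")

def scrub(df: pd.DataFrame) -> pd.DataFrame:
    """Scrub: drop missing values and obviously invalid rows."""
    return df.dropna().query("amount >= 0")

def explore(df: pd.DataFrame) -> pd.DataFrame:
    """Explore: feature engineering distilled from notebook exploration."""
    return df.assign(log_amount=np.log1p(df["amount"]))

def model(df: pd.DataFrame) -> RandomForestClassifier:
    """Model: fit a learning algorithm on the engineered features."""
    return RandomForestClassifier(random_state=0).fit(df[["log_amount"]], df["is_fraud"])

def industrialize(clf: RandomForestClassifier) -> None:
    """INdustrialize: persist the model so a report or API can consume it."""
    joblib.dump(clf, "models/fraud_rf.joblib")

if __name__ == "__main__":
    industrialize(model(explore(scrub(obtain()))))
```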

10

We define a data product as a system that
• takes raw data as input 📲
• applies a machine-learned model to it 🤖
• produces data as output to another system 💻
Additionally, a data product must
• be dynamic and maintainable, allowing periodic updates 🏃
• be responsive, performant and scalable 👨‍👨‍👦‍👦
What is a Data Product? 🤔
In a nutshell, it’s a software product with an ML Engine.
Examples
• Amazon's Product Recommendation Engine
• LinkedIn's "People You May Know"
• Autonomous Vehicles
[Diagrams: The Classic Data Science Workflow vs. the Data Product Development Workflow]

11

Challenges in Data Product Development 🤔
“Team programming isn’t a divide and conquer problem.
It is a divide, conquer, and integrate problem.”
1. The Process
Infrastructure Setup > Code > Build > Test >
Package > Release > Monitor
2. The Team
Cross-functional group of businesspeople, data
scientists, engineers and developers.
3. The Challenge
As an example, consider that we have two groups:
• Team A consists of data engineers and data scientists and works on the Prediction Engine. 👨‍💻 👨‍🎓
• Team B consists of software engineers and front-end developers working on the UI. 👨‍🎨 👨‍🔬
The goal is that every piece of the product should integrate well into the larger codebase. 🍻

12

Continuous Integration (CI), Delivery (CD) and Deployment
Development practices for overcoming integration challenges and moving faster to delivery
The CI/CD Cycle
• Continuous Integration requires multiple developers to integrate code into a shared repository frequently. Requested merges are automatically tested and reviewed.
• Enabled by git-flow, code standards and automated testing
• Continuous Delivery makes sure that the code we integrate is always in a deploy-ready state.
• Enabled by agile (iterative) methods, testing and build automation
• Continuous Deployment is the actual act of pushing updates out to the user – think of your iPhone apps or desktop browser prompting you to install updates periodically.

13

The Role of Data Scientists
Learn best practices to contribute effectively to data products
Write code that is
• Readable, so others can understand and add to it
• Testable, so others can verify it does what it advertises and integrate it into their work
• Reusable, so it may be included in other projects
• Reproducible, using libraries/packages that are available on production environments
• Usable: don't write code in SAS or R; most engineers don't speak those languages
Joel's Tests
• Do you use source control?
• Can you make a build in one step?
• Do you make daily builds?
• Do you have a bug database?
• Do you fix bugs before writing new code?

14

Data Science Best Practices @ PMI
[Diagram: Python Style Guides · Notebooks to Modules · Testing · Code Reviews · Docker · Virtual Environments · Version Control · Project Templates]


16

Agile Data Science Workflow

17

Our building blocks
Ocean Components

18

To create a workflow that is …
Our Vision
• Flexible
Adapts to the specific needs of every use-case
Accommodates changing requirements
• Inspectable
Transparency at all times
Artifacts can be audited at any time
• Reproducible
Out-of-the-box dependency management
No more ‘But-it-works-on-my-machine’ or ‘Please-industrialize-this-model’
• Easy to use
Frictionless development experience
Freedom to experiment
🔥

19

Some things we always need to be mindful of.
Our Principles
• Sensitive Data must never leave the Ocean
• Restricted open-source libraries must be avoided
• Every use-case must be industrialization-ready

20

[Architecture diagram: DS Prod Lab, on-demand infrastructure, automation, data read/write, reproducible containers, version control, the Data Product; open-source components scanned by BlackDuck]
System Architecture
The dots, connected.

21

We organize our workflow in 3 phases – Start, Develop and Ship
3 Steps to a Data Product
Start
• Get Infrastructure: DS Prod Lab, Docker container, Python environments
• Get Data: flat files, database connections
• Get Code: project repo from the Cookiecutter template
Develop
• Start the Docker container
• Check out a branch
• For each task in OSEMN, write exploratory code in notebooks
• Follow standard code styles; add documentation and tests
• Maintain dependencies
• Refactor into modules
• Push; review, merge
Ship
• Package the Python code, publish to PyPI on Artifactory
• Persist models
• Build an API to industrialize the model (see the sketch below)
• Provide endpoints for app-health checks
• Set up a Jenkins pipeline for continuous integration
• Plan for the next iteration
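As a minimal sketch of the Ship phase, assuming Flask and joblib as stand-ins (the model path, endpoints and payload handling are illustrative, not the actual PMI setup):

```python
# Minimal sketch: serve a persisted model behind an API with a health-check endpoint.
# Flask and joblib are stand-ins; model path, endpoints and feature handling are illustrative.
import joblib
import pandas as pd
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load("models/fraud_rf.joblib")  # model persisted during the Develop phase

@app.route("/health")
def health():
    """App-health check endpoint polled by monitoring and the CI pipeline."""
    return jsonify(status="ok")

@app.route("/predict", methods=["POST"])
def predict():
    """Score a JSON list of feature records with the machine-learned model."""
    records = pd.DataFrame(request.get_json())
    scores = model.predict_proba(records)[:, 1]
    return jsonify(scores=scores.tolist())

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```

The /health route gives the pipeline and monitoring something cheap to poll, while /predict wraps the persisted model for downstream systems.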

22

For Reproducibility
Docker Containers

23

Docker for Containerized Data Science
All your dependencies in one place. Code guaranteed to run anywhere.
A container is a lightweight, stand-alone package of software that includes everything needed to run it: code, runtime, system tools, system libraries, settings.
Containerized software will always run the same, regardless of the environment.
Benefits for Data Scientists
• Freedom: install all your favorite tools and libraries
• Ease of installation: set up your toolbox once and it will always work
• Reproducibility and Portability: your development environment can be reproduced anywhere
• Isolation: your Py2 setup doesn't mess up your Py3 setup, and installing a new library doesn't mess up the system Python
• Speed: get up and running in minutes with images optimized for specific applications like time-series analysis or deep learning

24

For organization and predictability
Project Templates

25

CookieCutter
Everything has a place and a purpose
The idea is borrowed from popular web frameworks like Rails and Django, where each developer uses the same template when starting a new project. This makes it easier for everyone on the team to figure out where they would find or put the various moving parts.
We will use a standard project skeleton tailored for data science projects so that every scientist knows where to put their code, notebooks, data, models, figures and references.
Benefits of a standardized directory structure:
• allows people to collaborate more easily
• empowers reproducible analysis
• enforces a "data as immutable" design philosophy
Cookiecutters help us generate this folder structure automatically (example below).
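For illustration, a skeleton can be generated from a template in one call; here the public cookiecutter-data-science template and a made-up project name stand in for PMI's internal template:

```python
# Illustrative only: generate a project skeleton from a cookiecutter template.
# The public cookiecutter-data-science template stands in for PMI's internal one.
from cookiecutter.main import cookiecutter

cookiecutter(
    "https://github.com/drivendata/cookiecutter-data-science",
    no_input=True,
    extra_context={"project_name": "fraud-prevention"},
)

# A typical generated layout (names vary by template):
#   data/raw/         immutable input data
#   data/processed/   cleaned and transformed data
#   notebooks/        exploratory analysis
#   src/              production modules and packages
#   models/           persisted models
#   reports/          generated figures and reports
#   Dockerfile, requirements.txt   dependency declarations
```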

26

CookieCutter
The standard folder structure enforces a design philosophy for faster delivery
Treat Data as Immutable
Raw data should be stored inside data/raw/ and never modified by hand. The code you write should ingest data from data/raw/, and cleaned or processed data should be written to data/processed/ (see the sketch below).
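A small sketch of what that convention looks like in a src/ script (the file name and cleaning steps are placeholders):

```python
# Sketch of the "data as immutable" convention: read from data/raw/, write to data/processed/,
# and never edit the raw files in place. File name and cleaning steps are placeholders.
from pathlib import Path

import pandas as pd

RAW = Path("data/raw")
PROCESSED = Path("data/processed")

def build_processed_dataset(filename: str = "transactions.csv") -> Path:
    df = pd.read_csv(RAW / filename)        # raw data is read-only input
    df = df.dropna().drop_duplicates()      # placeholder cleaning/aggregation steps
    PROCESSED.mkdir(parents=True, exist_ok=True)
    out_path = PROCESSED / filename
    df.to_csv(out_path, index=False)        # derived data lands next to raw, never over it
    return out_path

if __name__ == "__main__":
    print(build_processed_dataset())
```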
Reproducibility
Everyone on the team should be able to reproduce your analysis with
• the code in src/
• the data in data/raw/
• the dependencies in the Dockerfile and requirements file
Notebooks for Exploration, Scripts for Production Code
Jupyter is great for exploratory analysis, but quite challenging for version control (notebooks are stored as JSON files). Once your code works well, move it from notebooks/ to src/ and package the functions and classes into modules.

27

For being deploy-ready
Moving Code from Notebooks to Source Code

28

Notebooks for Exploration. Files for Production.
The case against Notebooks
• The main cause of unmaintainable code and bad structure in Data Science is the mixing of exploratory "throw-away" code with production code. Notebooks end up being used to write code that is ultimately deployed in production.
• This is not what notebooks were invented for; they are essentially browser-based shells and presentation tools with charts and code blocks.
• Notebooks lack refactoring and code-structuring tools, and are notoriously hard to manage under version control.
Motivation for Organizing Code
• Extract text and plots from notebooks into Markdown reports for a business audience
• Notebooks with minimal code and a clear narrative can be used as technical reports
• Move the core functionality into Python modules to speed up subsequent exploration (see the sketch below)
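A sketch of that last step, with a hypothetical module and function name: a cell that proved useful in a notebook becomes a documented, importable (and testable) function under src/.

```python
# src/features.py -- hypothetical module extracted from an exploratory notebook.
"""Feature-engineering functions promoted from notebooks/ once they proved useful."""
import pandas as pd

def add_rolling_spend(df: pd.DataFrame, window: int = 7) -> pd.DataFrame:
    """Add a rolling mean of 'amount' per customer; originally a throw-away notebook cell."""
    df = df.sort_values("date").copy()
    df["rolling_spend"] = (
        df.groupby("customer_id")["amount"]
        .transform(lambda s: s.rolling(window, min_periods=1).mean())
    )
    return df

# Back in the notebook, exploration continues with an import instead of copy-pasted cells:
#     from src.features import add_rolling_spend
#     df = add_rolling_spend(df, window=30)
```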

29

In the exploratory phase,
the code base is expanded through data analysis, feature
engineering and modelling.
In the refactoring phase,
the most useful results and tools from the exploratory phase are
translated into modules and packages.
The Production Codebase grows across sprints.

30

For integration and deployment
Automated Testing

31

• If your code is not performing as expected, will you know?
• If your data are corrupted, do you notice?
• If you re-run your analysis on different data, are the methods you used still valid?

32

Automated Testing
“Why do most developers fear to make continuous changes to their code? They are afraid they’ll break it! Why are they afraid they’ll break it? Because they don’t have tests!”
Two Types of Tests Useful for DS
• Unit Testing, to make sure individual pieces of code work
• Integration Testing, to make sure your code works with everyone else's
Challenge with Writing Tests for Data Science
For most software, the output is deterministic: a function for averaging numbers can be unit tested with a simple check that the result is accurate. You can then check your changes in, and integration tests can run against the new build with a fabricated set of results to ensure that everything works as expected.
Not so with Data Science work – the output is probabilistic. You can't always put in a 2 and a 4 and expect a 3 to come out.

33

Automated Testing for Data Science
• First, implement a unit test framework within your code; use pytest or nose
• In some cases, you can pin down a deterministic value, like the number of rows or the expected data type returned by a function, and write a test for it
• If you can't, pick a performance metric (p-value, F1-score, AUC, etc.) and check that it lies within an acceptable range (see the sketch below)
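For example, a hypothetical pytest sketch (the train_model helper, data path and the 0.80 AUC floor are placeholders, not real project code):

```python
# tests/test_model_quality.py -- hypothetical sketch; the data path, train_model helper
# and metric thresholds are placeholders, not real project code.
import pandas as pd
import pytest
from sklearn.metrics import roc_auc_score

from src.train import train_model  # assumed project function returning a fitted classifier

@pytest.fixture(scope="module")
def holdout():
    df = pd.read_csv("data/processed/holdout.csv")
    return df.drop(columns=["is_fraud"]), df["is_fraud"]

def test_scores_have_expected_shape_and_range(holdout):
    X, _ = holdout
    scores = train_model().predict_proba(X)[:, 1]
    assert len(scores) == len(X)                  # deterministic check: one score per row
    assert ((scores >= 0) & (scores <= 1)).all()  # probabilities stay in [0, 1]

def test_auc_within_acceptable_range(holdout):
    X, y = holdout
    scores = train_model().predict_proba(X)[:, 1]
    assert roc_auc_score(y, scores) >= 0.80       # probabilistic output, so test a metric floor
```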
Test-Driven Development (TDD)
“First the developer writes an (initially failing) automated test case that defines a desired improvement or new function, then produces the minimum amount of code to pass that test.” So, before actually writing any code, you should write your tests.
All tests should go into the tests/ subdirectory of the specific package. Write tests in three steps (sketched below):
• Get/make the input data
• Manually construct the result you expect
• Compare the actual result to the expected correct result
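A sketch of those three steps for a deterministic helper (the drop_invalid_rows function and its columns are hypothetical):

```python
# tests/test_scrub.py -- hypothetical sketch of the three-step pattern for deterministic code.
import pandas as pd
from pandas.testing import assert_frame_equal

from src.scrub import drop_invalid_rows  # assumed function under test

def test_drop_invalid_rows():
    # 1. Get/make the input data
    raw = pd.DataFrame({"amount": [10.0, None, -5.0], "customer_id": [1, 2, 3]})
    # 2. Manually construct the result you expect
    expected = pd.DataFrame({"amount": [10.0], "customer_id": [1]})
    # 3. Compare the actual result to the expected correct result
    assert_frame_equal(drop_invalid_rows(raw).reset_index(drop=True), expected)
```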

34

In Conclusion

35

• Engineering smart systems around a machine-learned core is difficult.
• It requires teams of exceptionally talented individuals to work together.
• What makes data scientists special is their ability to work with both business leaders and technology experts.
• We must acknowledge that we are part of something much bigger and learn to play well with each other and with all parties involved.
Our hope is that these systems, principles and best practices will help you take the first steps in that direction.

36

Questions?
