Paige Roberts: Shortcut MLOps with In-Database Machine Learning
1. Shortcut MLOps with In-Database Machine Learning
Paige Roberts
Vertica Open Source Relations Manager
2. Agenda
MLOps – Machine Learning Operations
Challenges with Operationalizing ML
Is In-Database ML Possible?
How Does In-DB ML Help MLOps?
How Does In-DB ML Work?
Questions
3. Advanced Analytics Use Cases are Everywhere
It’s not an option; competing on data is the new style of business.
Examples: network optimization, energy optimization, …
6. A lot of machine learning projects never make it to production
https://www.redapt.com/blog/why-90-of-machine-learning-models-never-make-it-to-production
https://www.forbes.com/sites/forbestechcouncil/2019/04/03/why-machine-learning-models-crash-and-burn-in-production/
https://thenewstack.io/add-it-up-how-long-does-a-machine-learning-deployment-take/
https://info.algorithmia.com/hubfs/2019/Whitepapers/The-State-of-Enterprise-ML-2020/Algorithmia_2020_State_of_Enterprise_ML.pdf
https://towardsdatascience.com/why-90-percent-of-all-machine-learning-models-never-make-it-into-production-ce7e250d5a4a
7. Data science journey – Now
Sample extraction over ODBC/JDBC → data preparation & model training → model deployment & management
8. Separation of Dev and Prod Environments is Problematic
Development Environment
Tools: notebooks (interactive); Python – Pandas, scikit-learn, TensorFlow
Workload: training, run as batch
Scale: one instance or a single computer; small sample that fits in memory; no worries about resource management
Production Environment
Tools: no notebooks (automated); Java, Scala, JS, C++
Workload: prediction/inference, streaming
Scale: many nodes or instances; full data set – could be huge; concurrent jobs, resource management, autoscaling
9. Challenges
Added Cost – additional hardware and software required; lots of components to make a whole solution.
Requires Down-Sampling – cannot process large data sets due to memory and computational limitations, resulting in less accurate predictions.
Slower Time to Development – higher turnaround times for model building/scoring, plus the need to move and transform large volumes of data between systems.
Slower Time to Deployment – inability to quickly deploy predictive models into production; models often must be rebuilt in another technology so they will scale.
10. In-Database Machine Learning
End-to-end data science workflow in place: SQL and user-defined functions running on a distributed analytical database, e.g.
SELECT IMPUTE …
SELECT NORMALIZE …
SELECT RF_CLASSIFIER …
SELECT ROC …
12. 8 Databases that Support In-Database Machine Learning
Vertica
Amazon Redshift *
BlazingSQL
Google BigQuery *
IBM Db2 *
Kinetica
Microsoft SQL Server *
Oracle *
https://www.infoworld.com/article/3607762/8-databases-supporting-in-database-machine-learning.html
* Machine learning is an add-on or integrated product purchased with the database
13. Many analytical databases have a full set of ML functions
https://www.vertica.com/python/
https://github.com/vertica/VerticaPy
Correlations: Cramér's V, Point-Biserial, Kendall, Spearman, Pearson
Heteroscedasticity: Breusch-Pagan, Goldfeld-Quandt, White's Lagrange, Engle
Trend & Stationarity: Augmented Dickey-Fuller, Mann-Kendall
Normality: Normaltest
Joins: regular joins, time series joins, spatial joins
Tree-Based Models: XGBoost, Random Forest
Linear Models: Linear Regression, Logistic Regression, LinearSVC
Clustering: KMeans, Bisecting KMeans
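For the simpler statistics in this list, plain SQL is often all that's needed. For example, a Pearson correlation can be computed directly in the database with the standard CORR aggregate (the table and column names below are hypothetical):

-- Pearson correlation between two hypothetical sensor columns, computed in-database.
SELECT CORR(temperature, vibration) AS pearson_r
FROM sensor_readings;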
14. End-to-end machine learning process in database
Business Understanding
Data Analysis/Exploration: statistical summary, sessionization, pattern matching, date/time algebra, window/partition, date type handling, sequences, and more…
Data Preparation: outlier detection, normalization, imbalanced data processing, sampling, test/validation split, time series, missing value imputation, filtering, feature selection, correlation matrices, Principal Component Analysis, and more…
Modeling: K-Means, Support Vector Machines, logistic/linear/ridge regression, Naïve Bayes, cross validation, XGBoost, Random Forests, AutoML, and more…
Evaluation: model-level stats, ROC tables, error rate, lift table, confusion matrix, R-squared, MSE
Deployment/Management: in-database scoring at speed and scale, security, table-like model management, versioning, authorization, PMML import/export, TensorFlow import, and more…
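The evaluation step maps to SQL calls as well. Below is a hedged sketch of pulling a confusion matrix and an error rate for a model scored in the database (Vertica-style syntax; the model, table, and column names are hypothetical, and exact signatures differ across databases and versions).

-- Illustrative evaluation of in-database predictions; cast types as your database requires.
SELECT CONFUSION_MATRIX(obs, pred USING PARAMETERS num_classes=2) OVER()
FROM (SELECT churned AS obs,
             PREDICT_RF_CLASSIFIER(age, income, tenure
                 USING PARAMETERS model_name='churn_rf')::INT AS pred
      FROM customer_test) AS scored;

SELECT ERROR_RATE(obs, pred USING PARAMETERS num_classes=2) OVER()
FROM (SELECT churned AS obs,
             PREDICT_RF_CLASSIFIER(age, income, tenure
                 USING PARAMETERS model_name='churn_rf')::INT AS pred
      FROM customer_test) AS scored;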
15. Modern analytical databases analyze data in many formats
– Semi-structured, streaming, text, complex, …
Unified Analytics – analyze data where it sits:
Historical – data formats optimized for long-term storage (e.g., CSV on cloud storage, Hadoop (HDFS), S3 on-prem)
Recent – database format (ROS) optimized for read/analyze performance
Streaming – data formats optimized for streaming messaging
Data includes web logs, sensor/IoT data, time series, click streams, purchase history, geospatial, and more.
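"Analyze data where it sits" usually means querying external files directly instead of loading them first. As a sketch, an external table over CSV logs in object storage might look like this (Vertica-style syntax; the bucket path and columns are hypothetical):

-- Hypothetical external table over CSV web logs sitting in object storage;
-- queries read the files in place, with no separate load step.
CREATE EXTERNAL TABLE web_logs (
    event_time TIMESTAMP,
    user_id    INT,
    url        VARCHAR(2048)
) AS COPY FROM 's3://my-bucket/weblogs/*.csv' DELIMITER ',';

SELECT COUNT(*) AS events_last_day
FROM web_logs
WHERE event_time > NOW() - INTERVAL '1 day';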
17. Advantages of In-Database Machine Learning
Each node (NODE 1 … NODE N) holds its share of the schema, tables, and trained models, connected over the cluster network. The advantages:
Eliminating overhead of data transfer
Data security and provenance
Model storage and management
Serving concurrent users
Highly scalable ML functionality
Avoiding the maintenance cost of a separate system
Identical dev and prod environments
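Because trained models live in the database next to the tables, they can be listed, inspected, and secured with the same mechanisms used for data. A sketch of what that looks like (Vertica-style; the model name and role are hypothetical, and privilege syntax varies by database):

-- Models are catalog objects: list them much like tables.
SELECT model_name, schema_name, category FROM models;

-- Inspect a trained model (hypothetical model name).
SELECT GET_MODEL_SUMMARY(USING PARAMETERS model_name='churn_rf');

-- Grant a production role permission to score with it (privilege keywords vary by database).
GRANT USAGE ON MODEL churn_rf TO prod_scoring_role;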
25. In-Database Random Forest
The training table (columns A, B, C; rows r1, r2, r3, …) is SPLIT into row samples and DISTRIBUTEd across the cluster (Node 1, Node 2, Node 3). Each node TRAINs its trees (Tree1, Tree2, Tree3) in parallel on its local sample, and the trees are then MERGEd into a single random forest model.
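All of those steps happen behind a single SQL statement; the forest trains the same way whether the cluster has three nodes or three hundred. A hedged sketch (Vertica-style syntax; the table, columns, and parameter values are hypothetical):

-- One call trains the forest; split, distribute, train, and merge are handled by the engine.
SELECT RF_CLASSIFIER('failure_rf', 'sensor_features', 'failed',
       'temperature, vibration, pressure'
       USING PARAMETERS ntree=64, mtry=2, sampling_size=0.632);

-- Scoring is an ordinary, fully distributed SQL query.
SELECT sensor_id,
       PREDICT_RF_CLASSIFIER(temperature, vibration, pressure
           USING PARAMETERS model_name='failure_rf') AS predicted_failure
FROM sensor_features;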
26. In-Database Extreme Gradient Boosting
Each node computes summary statistics over its local partition of the data (partition A on Node 1, B on Node 2, C on Node 3). The nodes exchange those statistics and find the best split for the tree being built. The process is iterative, boosting tree after tree (Tree1 … TreeN). Once training completes, the model can be exported as SQL + memModel (binary or JSON).
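As with the random forest, the iterative boosting runs behind one training call. A sketch, again with hypothetical names and Vertica-style syntax (parameter names differ by version):

-- Illustrative gradient-boosting training and scoring in SQL.
SELECT XGB_CLASSIFIER('churn_xgb', 'customer_normalized', 'churned',
       'age, income, tenure'
       USING PARAMETERS max_ntree=50, max_depth=6, learning_rate=0.1);

SELECT customer_id,
       PREDICT_XGB_CLASSIFIER(age, income, tenure
           USING PARAMETERS model_name='churn_xgb') AS churn_prediction
FROM customer_test;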
27. In-Database SARIMA
1. Seasonal decomposition of the series X into trend and seasonality.
2. AR – autoregressive terms.
3. MA – moving-average terms.
4. ARIMA – the combined model produces the fitted value, the prediction, and a confidence interval.
The coefficients are estimated in a distributed fashion: each node (Node 1 … Node N) works on lagged copies of the series (X_t, X_{t-1}, X_{t-2}, …) and computes its share of the MA coefficients θ_i (0 ≤ i ≤ q) and AR coefficients α_j (0 ≤ j ≤ p), e.g. Node 1 computes the indices up to k′ and Node N the remaining indices up to q and p. The estimator follows "A Simple Noniterative Estimator for Moving Average Models", John W. Galbraith & Victoria Zinde-Walsh.
28. Integration with other ML technologies
In-DB data science already covers geospatial, event series, time series, text analytics, pattern matching, regression, user-defined functions, and machine learning (XGBoost, Random Forest, ElasticNet, Lasso, Ridge, KMeans, Isolation Forest, SVM, PCA).
Models can be exchanged with other ML technologies through import and export in several forms – POJO, TensorFlow frozen graph, and SQL + memModel (binary or JSON).
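The import/export path is also a SQL call in most in-database ML implementations. A hedged sketch of the round trip (Vertica-style; directory paths, model names, and the supported format categories are assumptions that depend on version):

-- Export an in-database model (e.g., as PMML) for use in another system.
SELECT EXPORT_MODELS('/tmp/models', 'churn_rf' USING PARAMETERS category='PMML');

-- Import a model trained elsewhere (e.g., a TensorFlow frozen graph) so it can be scored in-database.
SELECT IMPORT_MODELS('/tmp/models/tf_churn_model' USING PARAMETERS category='TENSORFLOW');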
32. "Previously, we used Apache Spark; 80% of the time went to data I/O and only 20% to machine learning. Now that we've switched to Vertica in-database machine learning, we save 80–90% of the data-reading time, so it has huge benefits in saving time and costs for model training."
Xiaobo Ge, CTO, EOITek
33. Using Vertica’s User-Defined Extension, we have already built
a bespoke lowpass filter to run against our data – that
ensures top analytical performance without having to move
the data to other systems or tools.
Vertica analytics and machine learning ultimately delivers an
edge that leads to more points, podiums, and wins for Jaguar
Racing.
Phil Charles, Technical Manager, Jaguar Formula E racing team
34. What People Are Saying
“Vertica is the path to production
for machine learning.”
“With Vertica in Eon Mode’s sub-clustering ability, we can
also give more computation power to our users as needed.
This is really helpful for our machine learning analysts.”
“We’re using machine learning and deep learning technologies in the cloud to consume that data, leveraging things like Python and TensorFlow to build out and then operationalize those predictive models ...”
“Vertica is a key technology, bringing in those diverse data sets and building flexible data models, and being able to do it quickly and at scale …”
“If we have a particular use case, we can
easily write our own functions in any language
– Java, Python, R – and deploy this in Vertica,
harnessing the power of parallel computing.”
“Others are forced to sample.
This is about using all of your data which is fantastic.”
35. Analyze Data In Many Formats
– Semi-structured, streaming, text, complex, …
Unified analytics, as on slide 15 – analyze data where it sits, whether historical (formats optimized for long-term storage), recent (database ROS format optimized for read/analyze performance), or streaming.
36. Distributed databases automatically parallelize algorithm training
Deliver accurate predictive analytics at speed and scale
MPP architecture is a great fit – ML functions run in parallel
across hundreds of nodes in a Vertica cluster
For example, with a forest of 2n trees, Node1 trains Tree1 and Tree2, Node2 trains Tree3 and Tree4, …, and Noden trains Tree2n-1 and Tree2n.
37. High Performance + High Concurrency
Get data quickly enough to act on it, explore your data interactively, and enable everyone to make their own data-driven decisions.
Scale data volumes and scale users and jobs: SQL database + analytics & ML query engine.
38. In-Database Machine Learning Advantages
Scalability – Vertica ML is fast and scalable:
• Leverages MPP infrastructure and scale-out architecture for parallel high performance
• No data movement across the system
• No need to downsample, so accuracy is boosted
Concurrency – Manage resources for many users or jobs:
• Resource isolation for multiple user sessions
• Natural segmentation of user environments and privileges
• Configurable memory usage for ML functions
Productivity – Use a familiar interface:
• Manage, train, and deploy ML models using simple SQL calls
• Integrate ML functions with other tools via SQL
• Or use Jupyter and Python for familiar data science
Enterprise-Grade – Proven infrastructure:
• No additional hardware or software to manage
• Open integration with other solutions
• Requires fewer people to set up and maintain
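"Configurable memory usage" and "resource isolation" usually come from the database's own resource management rather than anything ML-specific. A sketch of the idea (Vertica-style resource pools; the pool name and limits are illustrative assumptions):

-- Hypothetical resource pool for data science work, so heavy model training
-- cannot starve concurrent BI queries.
CREATE RESOURCE POOL ml_pool MEMORYSIZE '8G' MAXCONCURRENCY 4;

-- Run this session's ML workload inside that pool.
SET SESSION RESOURCE_POOL = ml_pool;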
39. VerticaPy – Python API for Machine Learning at Scale
Benefits:
Jupyter/Python – popular interactive tool of choice
Graphs and visualizations – even on large data sets
Data stays in database – security, integrity
Distributed Engine - heavy computation done by optimized MPP engine
Open source – you can contribute!
Manage models – store in DB
Results where they’re needed – available in DB for other analyses, visualization, or driving applications and AI
40. VerticaPy
Python library
Open source
https://github.com/vertica/VerticaPy
Pandas and Scikit Learn functionality
Conduct data science projects in Vertica – high performance, high scale analytical database
Takes advantage of built-in parallel analytics and machine learning
You write Python in a Jupyter Notebook, the database executes it on the full data set
Uses vertica-python, the native Python client for Vertica created by Uber with contributions from Twitter, Palantir, Etsy, Kayak, GoodData, and others – https://github.com/vertica/vertica-python
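Under the hood, VerticaPy translates Pandas-style calls into SQL that the cluster executes, so only small results travel back to the notebook. Conceptually, a groupby/aggregate on a vDataFrame pushes down a query of roughly this shape (purely illustrative; the table and columns are hypothetical):

-- Roughly what a Pandas-style groupby/aggregate becomes in the database:
-- the MPP engine scans and aggregates the full data set, and the notebook
-- receives only the aggregated result.
SELECT customer_segment,
       COUNT(*)      AS n_rows,
       AVG(spend)    AS avg_spend,
       STDDEV(spend) AS std_spend
FROM sales_history
GROUP BY customer_segment;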
42. Cyberbit Surpasses Conventional Security Systems
Vertica uses big data, behavioral analysis, and machine learning to optimize detection and response.
Challenge
- Analyze large volumes of endpoint data to detect indications of cyber attacks
Result
- Provides near real-time data insertion across hundreds of thousands of endpoints; results in enterprise-grade threat detection quality
- Ensures rapid querying, essential for analysts to outpace attackers
- Delivers higher detection rates and dramatically reduces false positives
- Enables analysts to perform complex queries over large volumes of data; presents results in an easy-to-understand interface
- Provides credible, robust, and scalable big data technology that will be trusted by its customers
43. "Our Vertica platform is instrumental in many areas of our business—creating predictive algorithms, serving up product recommendations, powering insight to our mobile apps, and generating daily reports and ad hoc queries. It's crucial for enabling us to be more agile with data."
Bruce Yen, WW Business Intelligence Leader, Guess
44. Advantages of Python + MPP Analytical Database
MPP Scale
Clusters with no name
node or other single point
of failure allow unlimited
scale
Speed and
Concurrency
Query optimization and
resource management
across multiple nodes
Features
ML algorithm
parallelization, moving
windows, geospatial
analysis, time series joins,
fast data prep...
Open Architecture
Integration with many
other applications - BI, ETL,
Kafka, Spark, Data Science
Labs …
Broad Utility
Many functionalities - one
of the most broadly useful
programming languages.
Flexibility
Many right paths to do things, a lot of freedom, and it works on many platforms.
Ease of Use
High level of abstraction
makes Python one of the
easiest programming
languages.
Strong Community
Most data scientists master
Python. Many useful
packages (pandas, scikit, …)
45. Analytic database functions include machine learning algorithms
Algorithms: linear regression, logistic regression, K-means, Naive Bayes, Random Forest, SVM
Example use cases: predict customer retention, forecast sales revenues, customer segmentation, predict sensor failure, classify gene expression data for drug discovery, refine keywords to improve Click-Through Rate (CTR)
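As one worked example, a revenue-forecasting regression can be trained and applied without the data ever leaving the database. A sketch (Vertica-style syntax; the tables and columns are hypothetical):

-- Train a linear regression on historical sales.
SELECT LINEAR_REG('revenue_model', 'sales_history', 'revenue',
       'ad_spend, num_stores, season_index');

-- Apply it to next quarter's planning inputs.
SELECT region,
       PREDICT_LINEAR_REG(ad_spend, num_stores, season_index
           USING PARAMETERS model_name='revenue_model') AS forecast_revenue
FROM planning_inputs;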
48. Predictive Maintenance Demo
Analyze sensor data from cooling towers across the US, enabling equipment manufacturers to predict and prevent equipment failure.
49. Flight Tracker Demo
Vertica operates at the "edge" with flight track detail. Sensor data is collected using a Raspberry Pi with a radio receiver and antenna. Data is loaded into Vertica at thousands of records per second and builds to billions of flight data points collected within a 250-mile radius.
https://www.vertica.com/blog/blog-post-series-using-vertica-track-commercial-aircraft-near-real-time/
51. Huge improvements in stability and performance after moving to Vertica – 24 mins on Spark, 3 mins in Vertica
Can incorporate other data like weather to optimize predictive thermostat efficiency after moving to Vertica ML
Citing speed of analytics, ease of use when coding in SQL, and improvements in the accuracy of models after moving workloads to Vertica ML
Solving issues that were previously unsolvable
Minimal hardware, software, and personnel investments when differentiating with data science
52. Moving data science workloads from Spark on Hadoop to in-database
Improvements in stability and performance: creating customer segmentation via clustering algorithms on a 15 million customer dataset took 24 mins on Spark vs. 3 mins in-database, while concurrently running other algorithms without performance impact.
Cardlytics partners with more than 1,500 financial institutions to run their online and mobile banking rewards programs, which gives us a robust view into where and when consumers are spending their money.
53. Fidelis Cybersecurity protects the world's most sensitive data by identifying and removing attackers no matter where they're hiding on your network and endpoints.
The data science team was experiencing performance challenges with Spark ML. Moving workloads from Spark ML to in-database ML provided:
Speed of analytics
Ease of use when coding in SQL
Increased accuracy of models
54. Some Vertica IoT Customer Resources
Case Studies
Anritsu ROI case study: https://www.vertica.com/wp-content/uploads/2017/01/r24-HPE-Vertica-ROI-case-study-Anritsu.pdf
Infographic of ROI: https://www.vertica.com/wp-content/uploads/2017/03/Anritsu-v2.pdf
Nimble Storage ROI case study: https://www.vertica.com/wp-content/uploads/2017/08/Nimble-Storage-ROI.pdf
Optimal+ case study: https://www.vertica.com/wp-content/uploads/2017/06/Optimal-MF-rebrand-FINAL-lo-res.pdf
Climate Corp case study: https://www.vertica.com/wp-content/uploads/2019/01/Climate-Corp_Success-Story-FINAL.pdf
Webcasts – Data Disruptors
Philips: https://www.brighttalk.com/webcast/10477/277693
Climate Corp: https://www.brighttalk.com/webcast/8913/336201
Nimble Storage (HPE InfoBright): https://www.brighttalk.com/webcast/8913/330769
Zebrium: https://www.brighttalk.com/webcast/8913/332838
Simpli.fi: https://www.brighttalk.com/webcast/8913/354325/simpli-fi-delivers-advertising-insights-on-billions-of-streaming-bid-messages
Videos
Optimal+: https://www.youtube.com/watch?v=IZkkoy5ZT1M&feature=youtu.be
Anritsu: https://www.youtube.com/watch?v=QZ5vWqblVXU&feature=youtu.be