1

Spark + H2O = Machine Learning at scale
Mateusz Dymczyk
Software Engineer
Machine Learning with Spark Tokyo
30.06.2016

2

Agenda
• Spark introduction
• H2O introduction
• Spark + H2O = Sparkling Water
• Demos

3

Spark

4

What is Spark?
• Fast and general engine for large-scale data processing.
• API in Java, Scala, Python and R
• Batch and streaming APIs
• Based on immutable data structures (RDDs); a minimal sketch follows
*	http://spark.apache.org/
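A minimal sketch of that API (standard Spark 1.x Scala; the app name and numbers are illustrative):

import org.apache.spark.{SparkConf, SparkContext}

// Transformations return new, immutable RDDs; nothing runs until an action.
val conf = new SparkConf().setAppName("rdd-sketch").setMaster("local[*]")
val sc = new SparkContext(conf)
val nums = sc.parallelize(1 to 10)            // immutable distributed collection
val doubled = nums.map(_ * 2)                 // transformation: builds a new RDD
val sum = doubled.filter(_ > 5).reduce(_ + _) // action: triggers the computation
println(sum)
sc.stop()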

5

Architecture
*	http://spark.apache.org/docs/latest/cluster-overview.html

6

Why Spark?
• In-memory computation (fast)
• Ability to cache (intermediate) results in memory or on disk (caching sketch below)
• Easy API
• Plenty of out-of-the-box libraries
*	http://spark.apache.org/docs/latest/mllib-guide.html
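A hedged sketch of caching, reusing the `sc` from the previous example; the log path and filters are illustrative:

import org.apache.spark.storage.StorageLevel

// Persist an intermediate RDD so later actions reuse it instead of recomputing it.
val logs = sc.textFile("hdfs:///tmp/logs")    // illustrative path
val errors = logs.filter(_.contains("ERROR"))
                 .persist(StorageLevel.MEMORY_AND_DISK)
println(errors.count())                                // first action materializes the cache
println(errors.filter(_.contains("timeout")).count())  // served from the cache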

7

MLlib
• Spark’s machine learning library
• Supports:
• basic statistics
• classification and regression
• clustering
• dimensionality reduction
• evaluations
• …
* http://spark.apache.org/docs/latest/mllib-guide.html

8

Linear regression demo
import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}
import org.apache.spark.rdd.RDD

// Input format (CSV): V1,V2,V3,R  e.g.
// 1,1,1,0.1
// 1,0,1,0.5
val sc: SparkContext = initContext()
val data = sc.textFile(...)
// Parse each CSV line: the leading columns are features, the last one is the label
val parsedData: RDD[LabeledPoint] = data.map { line =>
  val parts = line.split(',').map(_.toDouble)
  LabeledPoint(parts.last, Vectors.dense(parts.init))
}.cache()
// Build the model
val numIterations = 100
val stepSize = 0.00000001
val model = LinearRegressionWithSGD.train(parsedData, numIterations, stepSize)
// Evaluate the model on the training examples and compute the training error
val valuesAndPreds = parsedData.map { point =>
  val prediction: Double = model.predict(point.features)
  (point.label, prediction)
}
val MSE = valuesAndPreds.map { case (v, p) => math.pow(v - p, 2) }.mean()
*	http://spark.apache.org/docs/latest/mllib-linear-methods.html

12

But…
• Are the implementations fast enough?
• Are the implementations accurate enough?
• What about other algorithms (e.g. where’s my Deep Learning)?
• What about visualisations?
*	http://spark.apache.org/docs/latest/mllib-guide.html

13

H2O

14

What is H2O?
Math platform
• Open source
• Set of math and predictive algorithms
• GLM, Random Forest, GBM, Deep Learning etc.
API
• Written in high-performance Java, with a native Java API
• Drivers for R, Python, Excel, Tableau
• REST API (a call sketch follows)
Big data focused
• Highly parallelized and distributed implementation
• Fast in-memory computation on highly compressed data
• Allows you to use all your data without sampling
• Based on mutable data structures
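As a hedged illustration of the REST API: assuming an H2O node is running locally on its default port 54321, the v3 endpoints return JSON (the /3/About endpoint below reports build metadata):

import scala.io.Source

// Query a running H2O node over REST; prints version/build metadata as JSON.
val about = Source.fromURL("http://localhost:54321/3/About").mkString
println(about)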

17

Spark + H2O = Machine Learning at scale

20

FlowUI
• Notebook-style open source interface for H2O
• Allows you to combine code execution, text, mathematics, plots, and rich media in a single document

21

Why H2O?
• Speed and accuracy
• Algorithms/functionality not present in MLlib
• Access to FlowUI
• Possibility to generate dependency-free (Java) models (scoring sketch below)
• Option to checkpoint models (though not all) and continue training later
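A hedged sketch of scoring with such a dependency-free model, using H2O’s genmodel “easy” API; the generated class name GBMModel and the feature columns are hypothetical (they come from whatever model you export):

import hex.genmodel.easy.{EasyPredictModelWrapper, RowData}

// Wrap the generated POJO (compiled against h2o-genmodel.jar) and score one row.
val raw = new GBMModel()                  // hypothetical generated class name
val model = new EasyPredictModelWrapper(raw)
val row = new RowData()                   // column -> value map for one observation
row.put("Origin", "SFO")                  // hypothetical feature columns
row.put("Dest", "NRT")
val pred = model.predictBinomial(row)
println(s"label = ${pred.label}, p = ${pred.classProbabilities.mkString(", ")}")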

22

Sparkling Water

23

What is Sparkling Water?
• Framework integrating Spark and H2O
• Transparent use of H2O data structures and algorithms with the Spark API, and vice versa (conversion sketch below)
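A hedged sketch of that transparency, using the Sparkling Water 1.6-era API that appears later in this deck; the Flight case class and its rows are illustrative:

import org.apache.spark.h2o._

// Assumes `sc` is an existing SparkContext.
val h2oContext = new H2OContext(sc).start()
import h2oContext._

case class Flight(origin: String, dest: String, delayed: Int)
val rdd = sc.parallelize(Seq(Flight("SFO", "NRT", 1), Flight("HND", "KIX", 0)))

val hf: H2OFrame = h2oContext.asH2OFrame(rdd) // Spark RDD -> H2OFrame
val back = h2oContext.asRDD[Flight](hf)       // H2OFrame -> Spark RDD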

24

Spark + H2O = Machine Learning at scale

27

Common use-cases

28

Modelling
Data Source → ETL → Modelling → Predictions
(this stage: Deep Learning, GBM, DRF, GLM, PCA, Ensembles etc.)

29

ETL
Data Source → ETL → Modelling → Predictions

30

Stream Processing
Data Source → ETL → Modelling → Predictions
Data Stream → Predictions, scored online via Spark Streaming/Storm/Flink etc. (sketch below)
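A hedged sketch of the streaming path with Spark Streaming; the socket source is purely illustrative, parsing assumes one CSV row of doubles per event, and `model` is assumed to be a previously trained MLlib model:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Score each incoming micro-batch with a previously trained model.
val ssc = new StreamingContext(sc, Seconds(10))
val lines = ssc.socketTextStream("localhost", 9999)            // illustrative source
val predictions = lines
  .map(line => Vectors.dense(line.split(',').map(_.toDouble))) // parse features
  .map(features => model.predict(features))                    // assumed trained model
predictions.print()
ssc.start()
ssc.awaitTermination()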

31

Demo #1
Sparkling Shell

32

REQUIREMENTS
• Windows/Linux/MacOS
• Java 1.7+
• Spark 1.3+
• SPARK_HOME set
INSTALLATION
1. download from http://www.h2o.ai/download
2. set the MASTER env variable
3. unzip
4. run bin/sparkling-shell

33

DEV FLOW
1. create a script file containing application code
2. run with bin/sparkling-shell -i script_name.script.scala
OR
1. run bin/sparkling-shell and simply use the REPL
import org.apache.spark.h2o._
// sc - SparkContext already provided by the shell
val h2oContext = new H2OContext(sc).start()
import h2oContext._
// Application logic

34

Airline delay classification
A model predicting flight delays:
• ETL: load data from CSVs; use Spark APIs to filter and join the data
• Modelling: build a model using H2O’s GBM
• Predictions
*	https://github.com/h2oai/sparkling-water/tree/master/examples/scripts

35

Gradient Boosting Machines
• Classification and regression predictive modelling
• Ensemble of multiple weak models (usually decision trees)
• Iteratively fits new trees to the residuals of the current ensemble (the gradient boosting step)
• Stochastic: each tree can be fit on a random subsample of the data
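A hedged sketch of training a GBM through H2O’s Java/Scala API, modelled on the sparkling-water airline example; `trainFrame` (an H2OFrame) and the response column name are assumptions:

import hex.tree.gbm.GBM
import hex.tree.gbm.GBMModel.GBMParameters

// Configure and train a GBM on an existing H2OFrame.
val params = new GBMParameters()
params._train = trainFrame._key           // assumed training H2OFrame
params._response_column = "IsDepDelayed"  // assumed response column
params._ntrees = 100
params._max_depth = 5
val gbmModel = new GBM(params).trainModel().get()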

36

Demo #2
FlowUI

37

Demo #3
Standalone app

38

REQUIREMENTS
• git
• editor of choice (IntelliJ/Eclipse supported)

39

BOOTSTRAP
1. git clone https://github.com/h2oai/h2o-droplets.git
2. cd h2o-droplets/sparkling-water-droplet
3. if using IntelliJ or Eclipse:
– ./gradlew idea
– ./gradlew eclipse
– import project in the IDE
4. develop your app

40

DEPLOYMENT
1. build with ./gradlew build shadowJar
2. submit with:
$SPARK_HOME/bin/spark-submit \
  --class water.droplets.SWTokyoDemo \
  --master local[*] \
  --packages ai.h2o:sparkling-water-core_2.10:1.6.5 \
  build/libs/sparkling-water-droplet-app.jar

41

Open Source
• GitHub:
https://github.com/h2oai/sparkling-water
• JIRA:
http://jira.h2o.ai
• Google groups:
https://groups.google.com/forum/?hl=en#!forum/h2ostream

42

More Info
• Documentation and booklets:
http://www.h2o.ai/docs/
• H2O.ai blog:
http://h2o.ai/blog
• H2O.ai YouTube channel:
https://www.youtube.com/user/0xdata
@h2oai
http://www.h2o.ai

43

Thank you!
@mdymczyk
Mateusz Dymczyk
mateusz@h2o.ai

44

Q&A
