1

Spark + H2O = Machine Learning at scale
Mateusz Dymczyk
Software Engineer
Machine Learning with Spark Tokyo
30.06.2016

2

Agenda
• Spark introduction
• H2O introduction
• Spark + H2O = Sparkling Water
• Demos

3

Spark

4

What is Spark?
• Fast and general engine for large-scale data processing.
• API in Java, Scala, Python and R
• Batch and streaming APIs
• Based on immutable data structures (RDDs); a minimal sketch follows
*	http://spark.apache.org/
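A minimal sketch of that API (standard Spark 1.x Scala; the app name and numbers are illustrative):

import org.apache.spark.{SparkConf, SparkContext}

// Transformations return new, immutable RDDs; nothing runs until an action.
val conf = new SparkConf().setAppName("rdd-sketch").setMaster("local[*]")
val sc = new SparkContext(conf)
val nums = sc.parallelize(1 to 10)            // immutable distributed collection
val doubled = nums.map(_ * 2)                 // transformation: builds a new RDD
val sum = doubled.filter(_ > 5).reduce(_ + _) // action: triggers the computation
println(sum)
sc.stop()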

5

Architecture
*	http://spark.apache.org/docs/latest/cluster-overview.html

6

Why Spark?
• In-memory computation (fast)
• Ability to cache (intermediate) results in memory or on disk (caching sketch below)
• Easy API
• Plenty of out-of-the-box libraries
*	http://spark.apache.org/docs/latest/mllib-guide.html
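A hedged sketch of caching, reusing the `sc` from the previous example; the log path and filters are illustrative:

import org.apache.spark.storage.StorageLevel

// Persist an intermediate RDD so later actions reuse it instead of recomputing it.
val logs = sc.textFile("hdfs:///tmp/logs")    // illustrative path
val errors = logs.filter(_.contains("ERROR"))
                 .persist(StorageLevel.MEMORY_AND_DISK)
println(errors.count())                                // first action materializes the cache
println(errors.filter(_.contains("timeout")).count())  // served from the cache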

7

MLlib
• Spark’s machine learning library
• Supports:
• basic statistics
• classification and regression
• clustering
• dimensionality reduction
• evaluations
• …
* http://spark.apache.org/docs/latest/mllib-guide.html

8

Linear regression demo
import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}
import org.apache.spark.rdd.RDD

// Input format (CSV): V1,V2,V3,R  e.g.
// 1,1,1,0.1
// 1,0,1,0.5
val sc: SparkContext = initContext()
val data = sc.textFile(...)
// Parse each CSV line: the leading columns are features, the last one is the label
val parsedData: RDD[LabeledPoint] = data.map { line =>
  val parts = line.split(',').map(_.toDouble)
  LabeledPoint(parts.last, Vectors.dense(parts.init))
}.cache()
// Build the model
val numIterations = 100
val stepSize = 0.00000001
val model = LinearRegressionWithSGD.train(parsedData, numIterations, stepSize)
// Evaluate the model on the training examples and compute the training error
val valuesAndPreds = parsedData.map { point =>
  val prediction: Double = model.predict(point.features)
  (point.label, prediction)
}
val MSE = valuesAndPreds.map { case (v, p) => math.pow(v - p, 2) }.mean()
*	http://spark.apache.org/docs/latest/mllib-linear-methods.html

12

But…
• Are the implementations fast enough?
• Are the implementations accurate enough?
• What about other algorithms (e.g. where’s my Deep Learning)?
• What about visualisations?
*	http://spark.apache.org/docs/latest/mllib-guide.html

13

H2O

14

What is H2O?
Math platform
• Open source
• Set of math and predictive algorithms
• GLM, Random Forest, GBM, Deep Learning etc.
API
• Written in high-performance Java, with a native Java API
• Drivers for R, Python, Excel, Tableau
• REST API (a call sketch follows)
Big data focused
• Highly parallelized and distributed implementation
• Fast in-memory computation on highly compressed data
• Allows you to use all your data without sampling
• Based on mutable data structures
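As a hedged illustration of the REST API: assuming an H2O node is running locally on its default port 54321, the v3 endpoints return JSON (the /3/About endpoint below reports build metadata):

import scala.io.Source

// Query a running H2O node over REST; prints version/build metadata as JSON.
val about = Source.fromURL("http://localhost:54321/3/About").mkString
println(about)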

17

Spark + H2O = Machine Learning at scale

20

FlowUI
• Notebook-style open source interface for H2O
• Allows you to combine code execution, text, mathematics, plots, and rich media in a single document

21

Why H2O?
• Speed and accuracy
• Algorithms/functionality not present in MLlib
• Access to FlowUI
• Possibility to generate dependency-free (Java) models (scoring sketch below)
• Option to checkpoint models (though not all) and continue training later
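A hedged sketch of scoring with such a dependency-free model, using H2O’s genmodel “easy” API; the generated class name GBMModel and the feature columns are hypothetical (they come from whatever model you export):

import hex.genmodel.easy.{EasyPredictModelWrapper, RowData}

// Wrap the generated POJO (compiled against h2o-genmodel.jar) and score one row.
val raw = new GBMModel()                  // hypothetical generated class name
val model = new EasyPredictModelWrapper(raw)
val row = new RowData()                   // column -> value map for one observation
row.put("Origin", "SFO")                  // hypothetical feature columns
row.put("Dest", "NRT")
val pred = model.predictBinomial(row)
println(s"label = ${pred.label}, p = ${pred.classProbabilities.mkString(", ")}")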

22

Sparkling Water

23

What is Sparkling Water?
• Framework integrating Spark and H2O
• Transparent use of H2O data structures and algorithms with the Spark API, and vice versa (conversion sketch below)
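A hedged sketch of that transparency, using the Sparkling Water 1.6-era API that appears later in this deck; the Flight case class and its rows are illustrative:

import org.apache.spark.h2o._

// Assumes `sc` is an existing SparkContext.
val h2oContext = new H2OContext(sc).start()
import h2oContext._

case class Flight(origin: String, dest: String, delayed: Int)
val rdd = sc.parallelize(Seq(Flight("SFO", "NRT", 1), Flight("HND", "KIX", 0)))

val hf: H2OFrame = h2oContext.asH2OFrame(rdd) // Spark RDD -> H2OFrame
val back = h2oContext.asRDD[Flight](hf)       // H2OFrame -> Spark RDD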

24

Spark + H2O = Machine Learning at scale

27

Common use-cases

28

Modelling
Data Source → ETL → Modelling → Predictions
(this stage: Deep Learning, GBM, DRF, GLM, PCA, Ensembles etc.)

29

ETL
Data Source → ETL → Modelling → Predictions

30

Stream Processing
Data Source → ETL → Modelling → Predictions
Data Stream → Predictions, scored online via Spark Streaming/Storm/Flink etc. (sketch below)
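A hedged sketch of the streaming path with Spark Streaming; the socket source is purely illustrative, parsing assumes one CSV row of doubles per event, and `model` is assumed to be a previously trained MLlib model:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Score each incoming micro-batch with a previously trained model.
val ssc = new StreamingContext(sc, Seconds(10))
val lines = ssc.socketTextStream("localhost", 9999)            // illustrative source
val predictions = lines
  .map(line => Vectors.dense(line.split(',').map(_.toDouble))) // parse features
  .map(features => model.predict(features))                    // assumed trained model
predictions.print()
ssc.start()
ssc.awaitTermination()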

31

Demo #1
Sparkling Shell

32

REQUIREMENTS
• Windows/Linux/MacOS
• Java 1.7+
• Spark 1.3+
• SPARK_HOME set
INSTALLATION
1. download from http://www.h2o.ai/download
2. set the MASTER env variable
3. unzip
4. run bin/sparkling-shell

33

DEV FLOW
1. create a script file containing application code
2. run with bin/sparkling-shell -i script_name.script.scala
OR
1. run bin/sparkling-shell and simply use the REPL
import org.apache.spark.h2o._
// sc - SparkContext already provided by the shell
val h2oContext = new H2OContext(sc).start()
import h2oContext._
// Application logic

34

Airline delay classification
A model predicting flight delays:
• ETL: load data from CSVs; use Spark APIs to filter and join the data
• Modelling: build a model using H2O’s GBM
• Predictions
*	https://github.com/h2oai/sparkling-water/tree/master/examples/scripts

35

Gradient Boosting Machines
• Classification and regression predictive modelling
• Ensemble of multiple weak models (usually decision trees)
• Iteratively fits new trees to the residuals of the current ensemble (the gradient boosting step)
• Stochastic: each tree can be fit on a random subsample of the data
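A hedged sketch of training a GBM through H2O’s Java/Scala API, modelled on the sparkling-water airline example; `trainFrame` (an H2OFrame) and the response column name are assumptions:

import hex.tree.gbm.GBM
import hex.tree.gbm.GBMModel.GBMParameters

// Configure and train a GBM on an existing H2OFrame.
val params = new GBMParameters()
params._train = trainFrame._key           // assumed training H2OFrame
params._response_column = "IsDepDelayed"  // assumed response column
params._ntrees = 100
params._max_depth = 5
val gbmModel = new GBM(params).trainModel().get()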

36

Demo #2
FlowUI

37

Demo #3
Standalone app

38

REQUIREMENTS
• git
• editor of choice (IntelliJ/Eclipse supported)

39

BOOTSTRAP
1. git clone https://github.com/h2oai/h2o-droplets.git
2. cd h2o-droplets/sparkling-water-droplet
3. if using IntelliJ or Eclipse:
– ./gradlew idea
– ./gradlew eclipse
– import project in the IDE
4. develop your app

40

DEPLOYMENT
1. build with ./gradlew build shadowJar
2. submit with:
$SPARK_HOME/bin/spark-submit \
  --class water.droplets.SWTokyoDemo \
  --master local[*] \
  --packages ai.h2o:sparkling-water-core_2.10:1.6.5 \
  build/libs/sparkling-water-droplet-app.jar

41

Open Source
• GitHub:
https://github.com/h2oai/sparkling-water
• JIRA:
http://jira.h2o.ai
• Google groups:
https://groups.google.com/forum/?hl=en#!forum/h2ostream

42

More Info
• Documentation and booklets:
http://www.h2o.ai/docs/
• H2O.ai blog:
http://h2o.ai/blog
• H2O.ai YouTube channel:
https://www.youtube.com/user/0xdata
@h2oai
http://www.h2o.ai

43

Thank you!
@mdymczyk
Mateusz Dymczyk
mateusz@h2o.ai

44

Q&A
