2. Overview
● What is Spark?
● How does Spark work?
○ Mechanism
○ Logistic Regression Model
● Why Spark?
● How to leverage Spark?
3. What is Spark?
● Hadoop/YARN:
○ strong at processing large files in parallel
○ synchronization barrier when persisting data to disk
○ MapReduce: launch mapper & reducer, read/write to disk, go back to the queue to get resources
● Spark:
○ in-memory processing
○ iterative and interactive data analysis
○ compared to MapReduce, supports more complex and interactive applications
4. Hadoop MapReduce
● Slow due to replication, serialization, and disk I/O
● Inefficient for:
○ Iterative algorithms (Machine Learning, Graphs & Network Analysis)
○ Interactive Data Mining (R, Excel, Searching)
9. How Spark Works - RDD
An RDD (Resilient Distributed Dataset) is made up of:
● Partitions of data
● Dependencies between partitions
Storage levels: MEMORY_ONLY, MEMORY_AND_DISK, DISK_ONLY, ... (see the sketch below)
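A storage level is chosen when an RDD is persisted. A minimal Java sketch, assuming an existing JavaSparkContext sc and a hypothetical input path (illustration only, not from the original deck):
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.StorageLevels;

JavaRDD<String> lines = sc.textFile("hdfs://data.txt");
// MEMORY_AND_DISK: keep partitions in memory, spill to disk when they do not fit
lines.persist(StorageLevels.MEMORY_AND_DISK);
// lines.cache() would be the shorthand for persist(StorageLevels.MEMORY_ONLY)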
10. How Spark Works - RDD operations
Transformations
● Create a new dataset from an existing one.
● Lazy in nature: executed only when some action is performed.
● Examples: map(func), filter(func), distinct()
Actions
● Return a value or export data after performing a computation.
● Examples: count(), reduce(func), collect(), take()
Persistence
● Cache a dataset in memory for future operations.
● Store on disk, in RAM, or a mix of both.
● Examples: persist(), cache()
(A combined Java sketch of these operations follows below.)
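A minimal Java sketch of the three kinds of operations, assuming an existing JavaSparkContext sc and a hypothetical input file (illustration only, not from the original deck):
import java.util.List;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;

JavaRDD<String> lines = sc.textFile("data.txt");               // transformation source: nothing is read yet
JavaRDD<String> nonEmpty = lines.filter(new Function<String, Boolean>() {
    public Boolean call(String line) { return !line.isEmpty(); }   // filter(func): drop blank lines
});
JavaRDD<String> uniqueLines = nonEmpty.distinct();             // distinct(): still lazy, only the lineage is recorded
uniqueLines.cache();                                           // persistence: keep in memory once computed
long n = uniqueLines.count();                                  // action: triggers the actual computation
List<String> firstTen = uniqueLines.take(10);                  // action: reuses the cached result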
21. Compare Hadoop and Spark
                  Traditional OS   Hadoop                     Spark
Storage           File System      HDFS                       HDFS
Schedule          Processes        MapReduce                  Computation Graph
I/O               Disk             Disk                       Cache (in memory) and shared data
Fault Tolerance   -                Duplication and disk I/O   Hash partition and auto-reconstruction
22. Overview
● What is Spark?
● How does Spark work?
○ Mechanism
○ Logistic Regression Model
● Why Spark?
● How to leverage Spark?
24. 1. Initializing Spark
1. JavaSparkContext: tells Spark how to access the cluster
2. SparkConf: settings - a map of <String, String>
a. required: AppName and Master; other settings fall back to defaults
SparkConf conf = new SparkConf().setAppName(appName).setMaster(master);
JavaSparkContext sc = new JavaSparkContext(conf);
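The master URL determines where the application runs. A hedged example with assumed values (any application name, and a local master that is convenient for testing; a real deployment would use a cluster URL such as spark://host:7077 or YARN):
SparkConf conf = new SparkConf()
    .setAppName("MyApp")       // appears in the cluster UI; the name is an assumption
    .setMaster("local[2]");    // run locally with 2 worker threads
JavaSparkContext sc = new JavaSparkContext(conf);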
25. 2. Prepare Dataset
1. From Parallelized Collections
2. From External DataSets
3. Passing Functions to Spark
List<Integer> data = Arrays.asList(1, 2, 3, 4, 5);
JavaRDD<Integer> distData = sc.parallelize(data);
JavaRDD<String> distFile = sc.textFile("data.txt");          // from a local file
// or, reading from HDFS:
JavaRDD<String> distFile = sc.textFile("hdfs://data.txt");
class ParseLabeledPoint implements Function<String, LabeledPoint> {
    public LabeledPoint call(String s) {
        // assumption for illustration: comma-separated line, features first, label in the last column
        String[] tokens = s.split(",");
        int len = tokens.length - 1;
        double[] x = new double[len];
        for (int i = 0; i < len; i++) {
            x[i] = Double.parseDouble(tokens[i]);
        }
        double y = Double.parseDouble(tokens[len]);
        return new LabeledPoint(y, Vectors.dense(x));
    }
}
---
// note: the mapped RDD must be the RDD of Strings (distFile), not distData
JavaRDD<LabeledPoint> data = distFile.map(new ParseLabeledPoint());
26. 3. Train LogisticRegressionModel
Train the model:
/*
 * @param input             RDD of (label, array of features) pairs.
 * @param numIterations     Number of iterations of gradient descent to run.
 * @param stepSize          Step size to be used for each iteration of gradient descent.
 * @param miniBatchFraction Fraction of data to be used per iteration.
 */
LogisticRegressionModel lrModel =
    LogisticRegressionWithSGD.train(data, iterations, stepSize, miniBatchFraction);
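A sketch with assumed hyperparameter values (the deck gives none); depending on the MLlib version, the underlying Scala RDD may need to be passed via data.rdd():
int numIterations = 100;          // assumed value
double stepSize = 0.1;            // assumed value
double miniBatchFraction = 1.0;   // assumed: use the full dataset in each iteration
LogisticRegressionModel lrModel =
    LogisticRegressionWithSGD.train(data.rdd(), numIterations, stepSize, miniBatchFraction);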
27. 4. Calculate Score - Evaluation
1. Convert the LogisticRegressionModel to a PMML model
pmmlModel = new PMMLSparkLogisticRegressionModel()
    .adaptMLModelToPMML(lrModel, partialPmmlModel);
2. Prepare the dataset and calculate scores
// use LogisticRegressionModel
JavaRDD<Vector> evalVectors = lines.map(new ParseVector());
List<Double> sparkEvalList = lrModel.predict(evalVectors).collect();
// use PMMLEvaluator
RegressionModelEvaluator evaluator = new RegressionModelEvaluator(pmml);
List<Double> evalResult = evaluator.evaluate(evalData);
// compare the results of the two evaluators
for (int i = 0; i < sparkEvalList.size(); i++) {
    Assert.assertEquals(getPMMLEvaluatorResult(i), sparkEvalList.get(i), DELTA);
}
28. Overview
● What is Spark?
● How does Spark work?
○ Mechanism
○ Logistic Regression Model
● Why Spark?
● How to leverage Spark?
29. Why Spark? - Scalability & Performance
1. Leverages the memory of the cluster for in-memory processing
2. Computation graph optimization for parallel execution
Related:
● Shark: Hive running on Spark (the basis of Spark SQL)
● Hive: manages large datasets in distributed storage
30. Why Spark? - Compatibility
1. Compatible with HDFS, HBase, and any Hadoop storage system
31. Why Spark? - Ease of Use API
1. Expressive API in Java, Scala, and Python
2. Supports more parallel operations
32. Expressive API - MapReduce
public static class WordCountMapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}

public static class WordCountReduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}
33. Expressive API - Spark
(The MapReduce WordCount from slide 32 is repeated here for side-by-side comparison with the Spark version below.)
Scala:
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
34. Expressive API - Spark
(The MapReduce WordCount from slide 32 is repeated here for side-by-side comparison with the Java version below.)
Java 6, Java 7:
JavaRDD<String> file = spark.textFile("hdfs://...");
JavaRDD<String> words = file.flatMap(new FlatMapFunction<String, String>() {
  public Iterable<String> call(String s) { return Arrays.asList(s.split(" ")); }
});
JavaPairRDD<String, Integer> pairs = words.mapToPair(new PairFunction<String, String, Integer>() {
  public Tuple2<String, Integer> call(String s) { return new Tuple2<String, Integer>(s, 1); }
});
JavaPairRDD<String, Integer> counts = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
  public Integer call(Integer a, Integer b) { return a + b; }
});
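For reference, a sketch of the same word count written with Java 8 lambdas (not part of the original deck), which brings the Java API close to the Scala version; note that Spark 2.x would require the flatMap lambda to return an Iterator instead of an Iterable:
JavaRDD<String> file = spark.textFile("hdfs://...");
JavaPairRDD<String, Integer> counts = file
    .flatMap(s -> Arrays.asList(s.split(" ")))      // split lines into words
    .mapToPair(word -> new Tuple2<>(word, 1))       // build (word, 1) pairs
    .reduceByKey((a, b) -> a + b);                  // sum counts per word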
35. Why Spark? - Third-Party Software
● Mahout
○ Say goodbye to MapReduce
○ Support for Apache Spark
■ Mahout-Spark shell: works with Mahout data structures such as Matrix, etc.
○ Support for H2O being explored
○ Support for Apache Flink possibly in the future
● H2O
○ Sparkling Water - brings H2O's ML algorithms to Spark's in-memory processing
        Purpose                              Language                      Storage     Stakeholder
H2O     In-memory ML / predictive analysis   Java/R                        K/V store   Data analyst
Spark   In-memory processing engine          Scala; supports Java/Python   RDD         HDFS user
36. Why Spark? - Third-Party Software
● Pig on Spark - Spork
● Other commercial software
37. Overview
● What is Spark?
● How does Spark work?
○ Mechanism
○ Logistic Regression Model
● Why Spark?
● How to leverage Spark?
38. How to Use Spark in Shifu?
1. train: LogisticRegressionTrainer
2. stats & normalize
3. eval: add more evaluation metrics (see the sketch after this list)
a. precision, recall, F-measure, precision-recall curve - pr(), precisionByThreshold(), recallByThreshold(), ...
b. area under the curve (AUC) - areaUnderPR()
c. receiver operating characteristic (ROC) - areaUnderROC(), roc()
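These metrics are provided by MLlib's BinaryClassificationMetrics. A hedged Java sketch, assuming the trained lrModel from the earlier slides and a hypothetical JavaRDD<LabeledPoint> testData:
import scala.Tuple2;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics;
import org.apache.spark.mllib.regression.LabeledPoint;

// (score, label) pairs for the evaluation set
JavaRDD<Tuple2<Object, Object>> scoreAndLabels = testData.map(
    new Function<LabeledPoint, Tuple2<Object, Object>>() {
      public Tuple2<Object, Object> call(LabeledPoint p) {
        return new Tuple2<Object, Object>(lrModel.predict(p.features()), p.label());
      }
    });
BinaryClassificationMetrics metrics = new BinaryClassificationMetrics(scoreAndLabels.rdd());
double auPR  = metrics.areaUnderPR();    // area under the precision-recall curve
double auROC = metrics.areaUnderROC();   // area under the ROC curve
JavaRDD<Tuple2<Object, Object>> precision = metrics.precisionByThreshold().toJavaRDD();  // precision per threshold
JavaRDD<Tuple2<Object, Object>> recall    = metrics.recallByThreshold().toJavaRDD();     // recall per threshold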
39. Related Projects
1. Bulk Synchronous Parallel (BSP)
a. parallel computing based on message passing
b. BSP model: local computation, global communication, barrier synchronization
c. graph processing: Pregel, Giraph
d. scientific computing: Hama
e. optimized operation DAG: Flink
[Performance chart omitted: runtime in seconds vs. number of nodes]
40. Take Away - Big Data Has Moved In-Memory
1. In-memory big data has come of age.
2. Spark leverages cluster memory for iterative and interactive operations.
3. Spark is compatible with HDFS, HBase, and any Hadoop storage system.
4. Spark powers a stack of high-level tools: Spark SQL, MLlib for machine learning, GraphX, and Spark Streaming.
5. Spark has an expressive API.