Spark Overview
Lisa Hua
7/14/2014
Overview
● What is Spark?
● How Spark works?
○ Mechanism
○ Logistic Regression Model
● Why Spark?
● How to leverage Spark?
What is Spark
● Hadoop/YARN:
○ strong at processing large files in parallel
○ synchronization barrier when persisting data to disk
○ MapReduce: launch mapper & reducer, read/write to disk, go back to the queue to get resources
● Spark:
○ in-memory processing
○ iterative and interactive data analysis
○ compared to MapReduce, supports more complex and interactive applications
Hadoop MapReduce
● Slow due to replication, serialization, and disk I/O
● Inefficient for:
○ Iterative algorithms (Machine Learning, Graphs &
Network Analysis)
○ Interactive Data Mining (R, Excel, Searching)
Spark In-memory Processing
1. Extract a working set
2. Cache it
3. Query it repeatedly
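A minimal Java sketch of this pattern (the file path, the filter conditions, and the JavaSparkContext variable sc are illustrative assumptions; imports from org.apache.spark.api.java.function are omitted, as elsewhere in the deck):

JavaRDD<String> errors = sc.textFile("hdfs://logs/*.log")              // 1. extract a working set
    .filter(new Function<String, Boolean>() {
        public Boolean call(String s) { return s.contains("ERROR"); }
    });
errors.cache();                                                         // 2. cache it in memory
long total = errors.count();                                            // 3. query it repeatedly:
long timeouts = errors.filter(new Function<String, Boolean>() {         //    later queries reuse the cached RDD
    public Boolean call(String s) { return s.contains("timeout"); }
}).count();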
Spark Ecosystem
Overview
● What is Spark?
● How Spark works?
○ Mechanism
○ Logistic Regression Model
● Why Spark?
● How to leverage Spark?
How Spark Works - SparkContext
How Spark Works - RDD
● Partitions of Data
● Dependencies between partitions
Storage Types: MEMORY_ONLY, MEMORY_AND_DISK, DISK_ONLY, ...
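The storage type is chosen when persisting an RDD; a short sketch in the Java API (the HDFS path is illustrative):

JavaRDD<String> lines = sc.textFile("hdfs://data.txt");
lines.persist(StorageLevel.MEMORY_AND_DISK());   // partitions that do not fit in memory spill to disk
// lines.cache() is shorthand for persist(StorageLevel.MEMORY_ONLY())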
How Spark Works - RDD operations
Transformations
● Create a new dataset from an existing one.
● Lazy in nature: executed only when some action is performed.
● Examples: map(func), filter(func), distinct()
Actions
● Return a value or export data after performing a computation.
● Examples: count(), reduce(func), collect(), take()
Persistence
● Cache a dataset in memory for future operations.
● Store on disk, in RAM, or a mix of both.
● Examples: persist(), cache()
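A small sketch (dataset and variable names invented) showing how the laziness plays out in the Java API: nothing executes until the action at the end.

JavaRDD<String> lines = sc.textFile("hdfs://data.txt");        // defines the RDD; no data is read yet
JavaRDD<String> upper = lines.map(new Function<String, String>() {
    public String call(String s) { return s.toUpperCase(); }   // transformation: still lazy
});
JavaRDD<String> uniq = upper.distinct();                        // transformation: still lazy
uniq.cache();                                                   // persistence: marks the RDD for caching
long n = uniq.count();                                          // action: the whole pipeline runs now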
How Spark Works: Word Count
(a sequence of slides steps through the word count example pictorially; the corresponding code appears later under "Expressive API - Spark")
How Spark Works - Actions
● Parallel Operations
How Spark Works - Stages
Each stage is executed as a series of tasks (one task for each partition).
The stages are organized into a DAG (Directed Acyclic Graph).
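As a hedged illustration (variable names invented), the stage boundary introduced by a shuffle can be inspected by printing an RDD's lineage:

JavaRDD<String> lines = sc.textFile("hdfs://data.txt");         // narrow transformations stay in one stage
JavaRDD<Integer> lengths = lines.map(new Function<String, Integer>() {
    public Integer call(String s) { return s.length(); }
});
JavaRDD<Integer> uniq = lengths.distinct();                     // distinct() shuffles, so it starts a new stage
System.out.println(uniq.toDebugString());                       // lineage shows the ShuffledRDD at the stage boundary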
Spark Programming - Tasks
A task is the fundamental unit of execution in Spark.
How Spark Works - Summary
● SparkContext
● Resilient Distributed Datasets (RDDs)
● Parallel Operations
● Shared Variables
○ Broadcast Variables - read-only
○ Accumulators
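A minimal sketch (not from the slides) of the two shared-variable types in the Java API; lines is assumed to be a JavaRDD<String>, and the original Spark 1.x Accumulator API is used to match the era of the deck:

final Broadcast<List<String>> stopWords = sc.broadcast(Arrays.asList("a", "the", "of"));   // read-only, shipped once per node
final Accumulator<Integer> blankLines = sc.accumulator(0);                                  // workers add to it, the driver reads it

JavaRDD<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
    public Iterable<String> call(String line) {
        if (line.isEmpty()) { blankLines.add(1); }              // counted on the workers
        return Arrays.asList(line.split(" "));
    }
});
long kept = words.filter(new Function<String, Boolean>() {
    public Boolean call(String w) { return !stopWords.value().contains(w); }
}).count();                                                      // the action triggers execution
System.out.println("blank lines: " + blankLines.value());        // read on the driver after the action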
Compare Hadoop and Spark

                  Traditional OS    Hadoop                      Spark
Storage           File System       HDFS                        HDFS
Schedule          Processes         MapReduce                   Computation Graph
I/O               Disk              Disk                        Cache (in memory) and shared data
Fault Tolerance                     Duplication and disk I/O    Hash partition and auto-reconstruction
Overview
● What is Spark?
● How Spark works?
○ Mechanism
○ Logistic Regression Model
● Why Spark?
● How to leverage Spark?
Spark - LogisticRegressionModel
1. Initialize spark JavaSparkContext
2. Prepare dataSet
3. Train LR model
4. Evaluation
1. Initializing Spark
1. JavaSparkContext: tells Spark how to access the cluster
2. SparkConf: configuration settings - a map of <String, String>
a. required: AppName and Master; everything else falls back to defaults
SparkConf conf = new SparkConf().setAppName(appName).setMaster(master);
JavaSparkContext sc = new JavaSparkContext(conf);
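For reference (the values below are examples, not from the slide), the master setting picks where the job runs:

// "local[4]"            - run locally with 4 worker threads
// "spark://host:7077"   - connect to a standalone Spark cluster
// "yarn-cluster"        - run inside a YARN cluster
SparkConf conf = new SparkConf().setAppName("LRTraining").setMaster("local[4]");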
2. Prepare Dataset
1. From Parallelized Collections
2. From External DataSets
3. Passing Functions to Spark
// 1. From a parallelized collection
List<Integer> data = Arrays.asList(1, 2, 3, 4, 5);
JavaRDD<Integer> distData = sc.parallelize(data);
// 2. From an external dataset (local file or HDFS)
JavaRDD<String> distFile = sc.textFile("data.txt");        // or:
JavaRDD<String> distFile = sc.textFile("hdfs://data.txt");
// 3. Passing a function to Spark to parse each line into a LabeledPoint
// (illustrative completion: assumes comma-separated features with the label as the last field)
class ParseLabeledPoint implements Function<String, LabeledPoint> {
    public LabeledPoint call(String s) {
        String[] tokens = s.split(",");
        int len = tokens.length - 1;
        double[] x = new double[len];
        double y = Double.parseDouble(tokens[len]);
        for (int i = 0; i < len; i++) {
            x[i] = Double.parseDouble(tokens[i]);
        }
        return new LabeledPoint(y, Vectors.dense(x));
    }
}
---
JavaRDD<LabeledPoint> data = distFile.map(new ParseLabeledPoint());
3. Train LogisticRegressionModel
Train the model:
/*
 * @param input             RDD of (label, array of features) pairs.
 * @param numIterations     Number of iterations of gradient descent to run.
 * @param stepSize          Step size to be used for each iteration of gradient descent.
 * @param miniBatchFraction Fraction of data to be used per iteration.
 */
LogisticRegressionModel lrModel =
    LogisticRegressionWithSGD.train(data, numIterations, stepSize, miniBatchFraction);
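A small hedged usage note (not on the slide): by default the trained model returns 0/1 class labels from predict(); clearing the threshold makes it return the raw score, which is what the score comparison on the next slide needs. The feature values below are purely illustrative.

lrModel.clearThreshold();                                        // predict() now returns the raw score instead of a 0/1 label
double score = lrModel.predict(Vectors.dense(0.5, 1.2, 3.4));    // score for a single (made-up) feature vector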
4. Calculate Score - Evaluation
1. Convert LogisticRegressionModel to a PMML model
pmmlModel = new PMMLSparkLogisticRegressionModel()
    .adaptMLModelToPMML(lrModel, partialPmmlModel);
2. Prepare the eval dataset and calculate scores
// use LogisticRegressionModel
JavaRDD<Vector> evalVectors = lines.map(new ParseVector());
List<Double> evalList = lrModel.predict(evalVectors).collect();
// use PMMLEvaluator
RegressionModelEvaluator evaluator = new RegressionModelEvaluator(pmml);
List<Double> evalResult = evaluator.evaluate(evalData);
// compare the two evaluators' results
for (...) {
    Assert.assertEquals(getPMMLEvaluatorResult(i), evalList.get(i), DELTA);
}
Overview
● What is Spark?
● How Spark works?
○ Mechanism
○ Logistic Regression Model
● Why Spark?
● How to leverage Spark?
Why Spark? - scalability & performance
1. Leverages the memory of the cluster for in-memory processing
2. Computation graph optimization for parallel execution
Shark / Spark SQL: Hive on Spark
Hive: manages large datasets in distributed storage
Why Spark? - compatibility
1. compatible with HDFS, HBase, and any
Hadoop storage system
Why Spark? - Ease of Use API
1. Expressive API in Java, Scala, and Python
2. Supports a richer set of parallel operations than MapReduce
Expressive API - MapReduce
public static class WordCountMapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();
  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}
public static class WordCountReduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}
Expressive API - Spark
(the Hadoop MapReduce word count from the previous slide, repeated for comparison)
Scala:
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
.map(word => (word, 1))
.reduceByKey(_ + _)
Expressive API - Spark (cont.)
(the Hadoop MapReduce word count again, shown beside the Java version for comparison)
Java 6, Java 7:
JavaRDD<String> file = spark.textFile("hdfs://...");
JavaRDD<String> words = file.flatMap(new FlatMapFunction<String, String>() {
    public Iterable<String> call(String s) { return Arrays.asList(s.split(" ")); }
});
JavaPairRDD<String, Integer> pairs = words.mapToPair(new PairFunction<String, String, Integer>() {
    public Tuple2<String, Integer> call(String s) { return new Tuple2<String, Integer>(s, 1); }
});
JavaPairRDD<String, Integer> counts = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
    public Integer call(Integer a, Integer b) { return a + b; }
});
Why Spark? - Third-Party Software
● Mahout
○ Say goodbye to MapReduce
○ Support for Apache Spark
■ Mahout-Spark shell: work with Mahout data structures (Matrix, etc.) on Spark
○ Support for H2O being explored
○ Support for Apache Flink possibly in the future
● H2O
○ Sparkling Water - in-memory processing with ML algorithms
         Purpose                              Language                       Storage     Stakeholder
H2O      In-memory ML / predictive analysis   Java/R                         K/V store   Data analyst
Spark    In-memory processing engine          Scala; supports Java/Python    RDD         HDFS user
Why Spark? - Third-Party Software
● Pig on Spark - Spork
● Other commercial software
Overview
● What is Spark?
● How Spark works?
○ Mechanism
○ Logistic Regression Model
● Why Spark?
● How to leverage Spark?
How to use Spark in Shifu?
1. train: LogisticRegressionTrainer
2. stats & normalize
3. eval: add more evaluation metrics (as sketched below)
a. precision, recall, F-measure, precision-recall curve
- pr(), precisionByThreshold(), recallByThreshold(), ...
b. area under the curve (AUC) - areaUnderPR()
c. receiver operating characteristic (ROC) - areaUnderROC(), roc()
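The metrics listed above come from MLlib's BinaryClassificationMetrics. A hedged sketch: testData (a JavaRDD<LabeledPoint>) and a final lrModel are assumed to exist as in the training slides, and lrModel.clearThreshold() should have been called so predict() returns raw scores rather than 0/1 labels.

JavaRDD<Tuple2<Object, Object>> scoreAndLabels = testData.map(
    new Function<LabeledPoint, Tuple2<Object, Object>>() {
        public Tuple2<Object, Object> call(LabeledPoint p) {
            return new Tuple2<Object, Object>(lrModel.predict(p.features()), p.label());
        }
    });
BinaryClassificationMetrics metrics = new BinaryClassificationMetrics(scoreAndLabels.rdd());
double auPR  = metrics.areaUnderPR();    // area under the precision-recall curve
double auROC = metrics.areaUnderROC();   // area under the ROC curve
metrics.precisionByThreshold();          // precision at each score threshold
metrics.recallByThreshold();             // recall at each score threshold
metrics.pr();                            // the full precision-recall curve
metrics.roc();                           // the full ROC curve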
Related Projects
1. Bulk Synchronous Parallel
a. parallel computing on message-passing
b. BSP: local computation, global communication,
barrier synchronization
c. graph processing: Pregel, Giraph
d. scientific computing: Hama
e. optimize operation DAG: Flink
(performance chart: running time in seconds vs. number of nodes)
Take Away - Big Data has moved in-memory
1. In-memory big data has come of age.
2. Spark leverages the cluster memory for
iterative and interactive operations
3. Spark is compatible with HDFS, HBase, and
any Hadoop storage system
4. Spark powers a stack of high-level tools
including Spark SQL, MLlib for machine
learning, GraphX, and Spark Streaming
5. Spark has an expressive API
Questions
3. Train LogisticRegressionModel (cont.)
2. Calculate weights / 3. Gradient Descent optimize()
val weightsWithIntercept = optimizer.optimize(data, initialWeightsWithIntercept)
val weights =
  if (addIntercept) { ... } else { weightsWithIntercept }
4. Training Error - not accessible from LogisticRegressionModel
logInfo("Last 10 stochastic losses %s".format(stochasticLoss.takeRight(10)))
14/07/09 14:10:40 INFO optimization.GradientDescent: Last 10 stochastic losses 0.6931471805599468, 0.5255572298404575, .., 0.3444544005102222, 0.3355921369255156