Spark Study Notes
Notes Sharing
Richard Kuo, Professional-Technical Architect, Domain 2.0 Architecture & Planning
Agenda
• Big Data
• Overview of Spark
• Main Concepts
– RDD
– Transformations
– Programming Model
• Observations
What is Apache Spark?
• Fast and general cluster computing system, interoperable with Hadoop.
• Improves efficiency through:
  – In-memory computing primitives
  – General computational graph
• Improves usability through:
  – Rich APIs in Scala, Java, Python
  – Interactive shell
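As a quick illustration of that API (a minimal sketch, not from the original slides; the master setting and input path are placeholders), a word count fits in a few lines of Scala:

import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // Local master and input path are illustrative placeholders
    val conf = new SparkConf().setAppName("WordCount").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val counts = sc.textFile("hdfs://.../input.txt")   // read lines from storage
      .flatMap(_.split(" "))                           // split lines into words
      .map(word => (word, 1))                          // pair each word with 1
      .reduceByKey(_ + _)                              // sum the counts per word
    counts.take(10).foreach(println)                   // action: runs the job
    sc.stop()
  }
}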
Big Data: Hadoop Ecosystem
Distributed Computing
Comparison with Hadoop
Hadoop                           | Spark
MapReduce framework              | Generalized computation
Data usually on disk (HDFS)      | On disk or in memory
Not ideal for iterative work     | Data can be cached in memory, great for iterative work (see the caching sketch below)
Batch processing                 | Real-time streaming or batch

Spark also offers:
• Up to 10x faster when data is on disk
• Up to 100x faster when data is in memory
• 2-5x less code to write
• Supports Scala, Java and Python
• Code re-use across modules
• Interactive shell for ad-hoc exploration
• Library support: GraphX, Machine Learning, SQL, R, Streaming, …
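To make the iterative-work row concrete, here is a minimal caching sketch (assuming an existing SparkContext named sc; the path and the loop body are hypothetical). The RDD is cached once and then reused across iterations instead of being re-read from disk:

// Assumes `sc` is an existing SparkContext; path and computation are illustrative only
val points = sc.textFile("hdfs://.../points.txt")
  .map(line => line.split(",").map(_.toDouble))
  .cache()                                 // keep the parsed data in memory

for (i <- 1 to 10) {
  // each pass reuses the cached RDD rather than re-reading and re-parsing the file
  val total = points.map(_.sum).reduce(_ + _)
  println(s"iteration $i: total = $total")
}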
Compare to Hadoop:
System performance degrades gracefully with less RAM
[Chart: execution time in seconds vs. percentage of working set cached]
• Cache disabled: 69 s
• 25% cached: 58 s
• 50% cached: 41 s
• 75% cached: 30 s
• Fully cached: 12 s
Software Components
• Spark runs as a library in your program (1 instance per app)
• Runs tasks locally or on a cluster
  – Mesos, YARN, or standalone mode (see the master-URL sketch below the diagram)
• Accesses storage systems via the Hadoop InputFormat API
  – Can use HBase, HDFS, S3, …
[Diagram: your application creates a SparkContext, which runs tasks on local threads or hands them to a cluster manager that launches Spark executors on worker nodes; executors read from HDFS or other storage.]
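In practice the choice of local threads versus a cluster manager is made through the master URL when the SparkContext is created. A hedged sketch (the app name and host addresses are placeholders):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("MyApp")                  // placeholder application name
  .setMaster("local[*]")                // run on local threads, one per core
  // .setMaster("spark://host:7077")    // Spark standalone cluster (placeholder host)
  // .setMaster("mesos://host:5050")    // Mesos (placeholder host)
  // YARN is normally selected via spark-submit --master yarn-client / yarn-cluster
val sc = new SparkContext(conf)         // one SparkContext instance per application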
Spark Architecture
• Cluster manager: Spark Standalone, Mesos, or YARN
[Diagram: client submitting to the cluster manager and worker nodes]
Key Concept: RDDs
Resilient Distributed Datasets
• Collections of objects spread across a cluster, stored in RAM or on disk
• Built through parallel transformations
• Automatically rebuilt on failure
Operations
• Transformations (e.g. map, filter, groupBy)
• Actions (e.g. count, collect, save)
Write programs in terms of operations on distributed datasets.
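A small sketch of the two kinds of operations (assuming an existing SparkContext sc; the path is hypothetical). Transformations are lazy and only record lineage; an action triggers the actual computation:

val lines  = sc.textFile("hdfs://.../logs.txt")   // hypothetical input
val errors = lines.filter(_.contains("ERROR"))    // transformation: lazy, builds lineage
val fields = errors.map(_.split("\t")(1))         // transformation: still nothing runs
val n = errors.count()                            // action: executes the job
errors.cache()                                    // optionally keep the filtered data in memory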
Fault Recovery
RDDs track the series of transformations used to build them (their lineage) and use it to recompute lost data, so no data replication across the wire is needed.

Scala:
val lines = sc.textFile(...)
lines.filter(x => x.contains("ERROR")).count()

Python:
msgs = textFile.filter(lambda s: s.startswith("ERROR")) \
               .map(lambda s: s.split("\t")[2])
[Lineage diagram: HDFS file → filter(func = startsWith(…)) → Filtered RDD → map(func = split(...)) → Mapped RDD]
Language Support
Standalone Programs
• Python, Scala, & Java
Interactive Shells
• Python & Scala
Performance
• Java & Scala are faster due to static typing
• …but Python is often fine
Python
lines = sc.textFile(...)
lines.filter(lambda s: "ERROR" in s).count()

Scala
val lines = sc.textFile(...)
lines.filter(x => x.contains("ERROR")).count()

Java
JavaRDD<String> lines = sc.textFile(...);
lines.filter(new Function<String, Boolean>() {
  public Boolean call(String s) {
    return s.contains("ERROR");
  }
}).count();
Interactive Shell
• The fastest way to learn Spark
• Available in Python and Scala
• Runs as an application on an existing Spark cluster…
• Or can run locally
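A hedged example of getting started with the shells (assuming a standard Spark distribution layout; the file name is hypothetical). In spark-shell the SparkContext is pre-created as sc:

// Launch from a Spark distribution directory (assumed layout):
//   ./bin/spark-shell --master local[*]    Scala shell
//   ./bin/pyspark     --master local[*]    Python shell
// Inside spark-shell, `sc` already exists:
val lines = sc.textFile("README.md")              // hypothetical file in the working directory
lines.filter(_.contains("Spark")).count()         // run an action interactively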
DEMO
Transformation
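A brief sketch of a few common transformations beyond map and filter (assuming an existing SparkContext sc; the data is made up, and nothing runs until an action is called):

val nums  = sc.parallelize(Seq(1, 2, 2, 3, 4))
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val other = sc.parallelize(Seq(("a", "x"), ("b", "y")))

nums.distinct()                    // remove duplicate elements
nums.flatMap(n => Seq(n, n * 10))  // map each element to zero or more outputs
pairs.groupByKey()                 // group values that share a key
pairs.reduceByKey(_ + _)           // combine values per key
pairs.join(other)                  // join two key-value RDDs on their keys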
Spark Streaming
Spark Streaming
Spark Streaming: Word Count
// From http://spark.apache.org/docs/latest/streaming-programming-guide.html
// Usage: NetworkWordCount <hostname> <port>
// To run locally, first start a Netcat server (`nc -lk 9999`), then run:
//   bin/run-example org.apache.spark.examples.streaming.NetworkWordCount localhost 9999
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.storage.StorageLevel

object NetworkWordCount {
  def main(args: Array[String]) {
    if (args.length < 2) {
      System.err.println("Usage: NetworkWordCount <hostname> <port>")
      System.exit(1)
    }
    // Helper from the Spark examples package that quiets streaming log output
    StreamingExamples.setStreamingLogLevels()

    // Create the streaming context with a 1-second batch size
    val sparkConf = new SparkConf().setAppName("NetworkWordCount")
    val ssc = new StreamingContext(sparkConf, Seconds(1))

    // Create a DStream from the socket, then map and reduce to count words
    val lines = ssc.socketTextStream(args(0), args(1).toInt, StorageLevel.MEMORY_AND_DISK_SER)
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)

    // Output the counts and start the streaming computation
    wordCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
Analytics
Conclusion
• Spark offers a rich API that makes data analytics fast: less code to write and fast to run.
• Achieves up to 100x speedups in real applications.
• Growing community.
Observations:
• A lot of data, of many different kinds, is being generated faster and needs to be analyzed in real time.
• All* products are data products.
• More complicated analytic algorithms are being applied to commercial products and services.
• Not all data analysis requires the same accuracy.
• Expectations for service delivery keep increasing.
Reference:
• AMPLab at UC Berkeley
• Databricks
• UC BerkeleyX
  – CS100.1x Introduction to Big Data with Apache Spark, starts 23 Feb 2015, 5 weeks
  – CS190.1x Scalable Machine Learning, starts 14 Apr 2015, 5 weeks
• Spark Summit 2014 Training
• Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
• An Architecture for Fast and General Data Processing on Large Clusters
• Richard’s Study Notes
  – Self Study AMPCamp
  – Hortonworks HDP 2.2 Study

Editor's Notes

  1. MPI (Message Passing Interface)
  2. http://www.eecs.berkeley.edu/Pubs/TechRpts/2014/EECS-2014-12.pdf
  3. Gracefully
  4. The barrier to entry for working with the Spark API is minimal.
  5. (word, 1L) reduceByKey(_ + _)
  6. From http://spark.apache.org/docs/latest/streaming-programming-guide.html. Usage: NetworkWordCount <hostname> <port>. To run this on your local machine, first run a Netcat server (`nc -lk 9999`), then run the example: `bin/run-example org.apache.spark.examples.streaming.NetworkWordCount localhost 9999`.