
Distributed Database Systems

Lecture 3 – Spark I

Some slides taken from Matei Zaharia and Anthony Joseph
What is Spark?
• Fast, expressive cluster computing system compatible with Apache Hadoop
  » Works with any Hadoop-supported storage system (HDFS, S3, Avro, …)

• Improves efficiency through (up to 100× faster):
  » In-memory computing primitives
  » General computation graphs

• Improves usability through (often 2-10× less code):
  » Rich APIs in Java, Scala, Python
  » Interactive shell
How to Run It
• Local multicore: just a library in your program
• EC2: scripts for launching a Spark cluster
• Private cluster: Mesos, YARN, Standalone Mode
Languages
• APIs in Java, Scala and Python
• Interactive shells in Scala and Python
This Lecture
Programming Spark
Resilient Distributed Datasets (RDDs)
Creating an RDD
Spark Transformations and Actions
Spark Programming Model
Key Idea
• Work with distributed collections as you would with
local ones

• Concept: resilient distributed datasets (RDDs)


» Immutable collections of objects spread across a cluster
» Built through parallel transformations (map, filter, etc)
» Automatically rebuilt on failure
» Controllable persistence (e.g. caching in RAM)
Operations
• Transformations (e.g. map, filter, groupBy, join)
» Lazy operations to build RDDs from other RDDs

• Actions (e.g. count, collect, save)


» Return a result or write it to storage
Spark Driver and Workers
• A Spark program is two programs: a driver program and a workers program
• Worker programs run on cluster nodes or in local threads
• RDDs are distributed across workers

[Diagram: your application (the driver program) creates a SparkContext, which connects through a cluster manager or local threads to Spark executors on worker nodes; data lives in Amazon S3, HDFS, or other storage]
Example: Mining Console Logs
• Load error messages from a log into memory, then interactively search for patterns

lines = spark.textFile("hdfs://...")                      # base RDD
errors = lines.filter(lambda s: s.startswith("ERROR"))    # transformed RDD
messages = errors.map(lambda s: s.split('\t')[2])
messages.cache()

messages.filter(lambda s: "foo" in s).count()             # action
messages.filter(lambda s: "bar" in s).count()
. . .

[Diagram: the driver ships tasks to workers; each worker reads a block of the file, caches its partition of messages, and returns results to the driver]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)
Task Scheduler
• Supports general task graphs
• Pipelines functions where possible
• Cache-aware data reuse & locality
• Partitioning-aware to avoid shuffles

[Diagram: a DAG of RDDs A-F split into stages; Stage 1 ends in a groupBy, Stage 2 applies map and filter, Stage 3 performs a join; shaded boxes mark cached partitions]
Spark Context
• A Spark program first creates a SparkContext object
» Tells Spark how and where to access a cluster
» pySpark shell and Databricks Cloud automatically create the sc variable
» iPython and programs must use a constructor to create a new
SparkContext

• Use SparkContext to create RDDs


Create a SparkContext
Scala:

import spark.SparkContext
import spark.SparkContext._

val sc = new SparkContext("masterUrl", "name", "sparkHome", Seq("app.jar"))

Java:

import spark.api.java.JavaSparkContext;

JavaSparkContext sc = new JavaSparkContext(
  "masterUrl", "name", "sparkHome", new String[] {"app.jar"});

Python:

from pyspark import SparkContext

sc = SparkContext("masterUrl", "name", "sparkHome", ["library.py"])

The constructor arguments are: the cluster URL (or local / local[N]), the app name, the Spark install path on the cluster, and the list of JARs or libraries with app code (to ship).
Spark Essentials: Master
• The master parameter for a SparkContext determines which type and size of cluster to use

Master Parameter    Description
local               run Spark locally with one worker thread (no parallelism)
local[K]            run Spark locally with K worker threads (ideally set to the number of cores)
spark://HOST:PORT   connect to a Spark standalone cluster; PORT depends on config (7077 by default)
mesos://HOST:PORT   connect to a Mesos cluster; PORT depends on config (5050 by default)
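
For example, in PySpark the same constructor shown earlier can be pointed at any of these masters. A minimal sketch (the app name and standalone host below are made-up placeholders):

from pyspark import SparkContext

# Run locally with 4 worker threads (no cluster needed)
sc = SparkContext("local[4]", "MasterParamDemo")
print(sc.master)       # => local[4]
sc.stop()

# Connecting to a standalone cluster would look similar
# (hypothetical host name; 7077 is the default standalone port):
# sc = SparkContext("spark://master-host:7077", "MasterParamDemo")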
Resilient Distributed Datasets
• The primary abstraction in Spark
» Immutable once constructed
» Track lineage information to efficiently recompute lost data
» Enable operations on collection of elements in parallel

• You construct RDDs


» by parallelizing existing Python collections (lists)
» by transforming an existing RDD
» from files in HDFS or any other storage system
RDDs
• Programmer specifies the number of partitions for an RDD (a default value is used if unspecified)
• More partitions = more parallelism

[Diagram: an RDD of 25 items (item-1 … item-25) split into 5 partitions, distributed across three workers, each running a Spark executor]
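
A minimal PySpark sketch of setting and inspecting the partition count (the data is made up; getNumPartitions() and glom() are standard RDD methods):

from pyspark import SparkContext

sc = SparkContext("local[3]", "PartitionsDemo")

items = list(range(1, 26))        # stand-ins for item-1 ... item-25
rdd = sc.parallelize(items, 5)    # ask for 5 partitions explicitly

print(rdd.getNumPartitions())     # => 5
print(rdd.glom().collect())       # the contents of each of the 5 partitions
sc.stop()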
RDDs
• Two types of operations: transformations and actions

• Transformations are lazy (not computed immediately)

• Transformed RDD is executed when action runs on it

• Persist (cache) RDDs in memory or disk


Working with RDDs
• Create an RDD from a data source: <list>
• Apply transformations to an RDD: map, filter
• Apply actions to an RDD: collect, count

[Diagram: <list> –parallelize→ RDD –filter→ filtered RDD –map→ mapped RDD –collect→ Result]

The collect action causes the parallelize, filter, and map transforms to be executed.
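
A minimal sketch of that pipeline in PySpark (the data and the predicates are made up); nothing runs until collect() is called:

from pyspark import SparkContext

sc = SparkContext("local[2]", "PipelineDemo")

rdd = sc.parallelize([1, 2, 3, 4, 5, 6])      # create an RDD from a list
filtered = rdd.filter(lambda x: x % 2 == 0)   # transformation (lazy)
mapped = filtered.map(lambda x: x * 10)       # transformation (lazy)

print(mapped.collect())                       # action: triggers execution => [20, 40, 60]
sc.stop()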
Creating an RDD
• Create RDDs from Python collections (lists)

>>> data = [1, 2, 3, 4, 5]
>>> data
[1, 2, 3, 4, 5]
>>> rDD = sc.parallelize(data, 4)
>>> rDD
ParallelCollectionRDD[0] at parallelize at PythonRDD.scala:229

No computation occurs with sc.parallelize() – Spark only records how to create the RDD with four partitions
Creating RDDs
• From HDFS, text files, Hypertable, Amazon S3, Apache HBase, SequenceFiles, any other Hadoop InputFormat, and directory or glob wildcard: /data/201404*

>>> distFile = sc.textFile("README.md", 4)
>>> distFile
MappedRDD[2] at textFile at NativeMethodAccessorImpl.java:-2
Creating an RDD from a File
distFile = sc.textFile("...", 4)

• RDD distributed in 4 partitions


• Elements are lines of input
• Lazy evaluation means
no execution happens now
Spark Transformations
• Create new datasets from an existing one
• Use lazy evaluation: results not computed right
away – instead Spark remembers set of
transformations applied to base dataset
» Spark optimizes the required calculations
» Spark recovers from failures and slow workers

• Think of this as a recipe for creating result


Some Transformations

Transformation        Description
map(func)             return a new distributed dataset formed by passing each element of the source through a function func
filter(func)          return a new dataset formed by selecting those elements of the source on which func returns true
distinct([numTasks])  return a new dataset that contains the distinct elements of the source dataset
flatMap(func)         similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item)
Review: Python lambda Functions
• Small anonymous functions (not bound to a name)
    lambda a, b: a + b
  » returns the sum of its two arguments
• Can use lambda functions wherever function objects are required
• Restricted to a single expression
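
For example, the lambda above behaves like an equivalent named function (a small sketch, names chosen for illustration):

def add(a, b):                    # named function
    return a + b

add_lambda = lambda a, b: a + b   # same behaviour, anonymous form

print(add(2, 3))                  # => 5
print(add_lambda(2, 3))           # => 5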
Transformations
>>> rdd = sc.parallelize([1, 2, 3, 4])
>>> rdd.map(lambda x: x * 2)
RDD: [1, 2, 3, 4] → [2, 4, 6, 8]

>>> rdd.filter(lambda x: x % 2 == 0)
RDD: [1, 2, 3, 4] → [2, 4]

>>> rdd2 = sc.parallelize([1, 4, 2, 2, 3])
>>> rdd2.distinct()
RDD: [1, 4, 2, 2, 3] → [1, 4, 2, 3]

Function literals are closures, automatically passed to workers
Transformations
>>> rdd = sc.parallelize([1, 2, 3])
>>> rdd.map(lambda x: [x, x+5])
RDD: [1, 2, 3] → [[1, 6], [2, 7], [3, 8]]

>>> rdd.flatMap(lambda x: [x, x+5])
RDD: [1, 2, 3] → [1, 6, 2, 7, 3, 8]

Function literals are closures, automatically passed to workers
Transforming an RDD
lines = sc.textFile("...", 4)
comments = lines.filter(isComment)

Lazy evaluation means nothing executes – Spark saves the recipe for transforming the source

[Diagram: the partitions of lines feeding into the partitions of comments]
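
isComment is not defined on the slide; a minimal runnable sketch, assuming comment lines start with '#' and using an in-memory list in place of the text file:

from pyspark import SparkContext

sc = SparkContext("local[2]", "CommentsDemo")

def isComment(line):
    # hypothetical predicate: treat lines starting with '#' as comments
    return line.strip().startswith("#")

lines = sc.parallelize(["# header", "data 1", "# note", "data 2"], 4)
comments = lines.filter(isComment)   # lazy: nothing executes yet
print(comments.count())              # the action forces evaluation => 2
sc.stop()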
Spark Actions
• Cause Spark to execute recipe to transform source

• Mechanism for getting results out of Spark


Some Actions
Action                    Description
reduce(func)              aggregate the dataset's elements using function func; func takes two arguments and returns one, and must be commutative and associative so that it can be computed correctly in parallel
take(n)                   return an array with the first n elements
collect()                 return all the elements as an array (WARNING: make sure the result will fit in the driver program)
takeOrdered(n, key=func)  return n elements in ascending order, or as specified by the optional key function
Getting Data Out of RDDs
>>> rdd = sc.parallelize([1, 2, 3])
>>> rdd.reduce(lambda a, b: a * b)
Value: 6

>>> rdd.take(2)
Value: [1,2] # as list

>>> rdd.collect()
Value: [1,2,3] # as list
Getting Data Out of RDDs

>>> rdd = sc.parallelize([5,3,1,2])


>>> rdd.takeOrdered(3, lambda s: -1 * s)
Value: [5,3,2] # as list
Spark Programming Model
lines = sc.textFile("...", 4)

print lines.count()

count() causes Spark to:
• read the data
• sum within partitions
• combine the sums in the driver

[Diagram: the four partitions of lines, each producing a partial count]
Spark Programming Model
lines = sc.textFile("...", 4)
comments = lines.filter(isComment)

print lines.count(), comments.count()

Spark recomputes lines:
• read the data (again)
• sum within partitions
• combine the sums in the driver

[Diagram: the partitions of lines and comments, each recomputed from scratch]
Caching RDDs
lines = sc.textFile("...", 4)
lines.cache()  # save, don't recompute!
comments = lines.filter(isComment)

print lines.count(), comments.count()

[Diagram: the partitions of lines are kept in RAM after the first action, so comments and later counts reuse them]
Spark Program Lifecycle
1. Create RDDs from external data or parallelize a
collection in your driver program
2. Lazily transform them into new RDDs
3. cache() some RDDs for reuse
4. Perform actions to execute parallel
computation and produce results
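
A minimal sketch tying the four lifecycle steps together (the data and app name are made up):

from pyspark import SparkContext

sc = SparkContext("local[2]", "LifecycleDemo")

# 1. Create an RDD by parallelizing a collection in the driver program
nums = sc.parallelize(range(1, 101), 4)

# 2. Lazily transform it into new RDDs
squares = nums.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

# 3. Cache an RDD that will be reused
squares.cache()

# 4. Perform actions to execute the parallel computation
print(squares.count())   # => 50
print(squares.take(3))   # => [4, 16, 36]
sc.stop()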
Spark Key-Value RDDs
• Similar to MapReduce, Spark supports key-value pairs

• Each element of a Pair RDD is a pair tuple

>>> rdd = sc.parallelize([(1, 2), (3, 4)])


RDD: [(1, 2), (3, 4)]
Working with Key-Value Pairs
• Spark’s “distributed reduce” transformations act on RDDs
of key-value pairs
• Python: pair = (a, b)
pair[0] # => a
pair[1] # => b

• Scala: val pair = (a, b)


pair._1 // => a
pair._2 // => b

• Java: Tuple2 pair = new Tuple2(a, b); // class scala.Tuple2


pair._1 // => a
pair._2 // => b
Some Key-Value Transformations

Key-Value Transformation  Description
reduceByKey(func)         return a new distributed dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V, V) → V
sortByKey()               return a new dataset of (K, V) pairs sorted by keys in ascending order
groupByKey()              return a new dataset of (K, Iterable<V>) pairs
Key-Value Transformations
>>> rdd = sc.parallelize([(1,2), (3,4), (3,6)])
>>> rdd.reduceByKey(lambda a, b: a + b)
RDD: [(1,2), (3,4), (3,6)] → [(1,2), (3,10)]

>>> rdd2 = sc.parallelize([(1,'a'), (2,'c'), (1,'b')])


>>> rdd2.sortByKey()
RDD: [(1,'a'), (2,'c'), (1,'b')] →
[(1,'a'), (1,'b'), (2,'c')]
Key-Value Transformations
>>> rdd2 = sc.parallelize([(1,'a'), (2,'c'), (1,'b')])
>>> rdd2.groupByKey()
RDD: [(1,'a'), (2,'c'), (1,'b')] →
[(1,['a','b']), (2,['c'])]

Be careful using groupByKey(): it can cause a lot of data movement across the network and create large Iterables at the workers
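
When the goal is per-key aggregation, reduceByKey combines values inside each partition before shuffling, so it usually moves far less data than groupByKey. A small sketch of the two approaches (data made up):

from pyspark import SparkContext

sc = SparkContext("local[2]", "GroupVsReduceDemo")

pairs = sc.parallelize([(1, 2), (3, 4), (3, 6), (1, 5)], 4)

# Preferred for aggregation: partial sums are computed per partition
# before any data is shuffled
print(pairs.reduceByKey(lambda a, b: a + b).collect())   # => [(1, 7), (3, 10)] (order may vary)

# groupByKey ships every individual value across the network first,
# and we still have to sum the grouped values ourselves
print(pairs.groupByKey().mapValues(sum).collect())       # same result, more data movement
sc.stop()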
Multiple Datasets
visits = sc.parallelize([("index.html", "1.2.3.4"),
                         ("about.html", "3.4.5.6"),
                         ("index.html", "1.3.3.1")])

pageNames = sc.parallelize([("index.html", "Home"), ("about.html", "About")])

visits.join(pageNames)
# ("index.html", ("1.2.3.4", "Home"))
# ("index.html", ("1.3.3.1", "Home"))
# ("about.html", ("3.4.5.6", "About"))

visits.cogroup(pageNames)
# ("index.html", (Seq("1.2.3.4", "1.3.3.1"), Seq("Home")))
# ("about.html", (Seq("3.4.5.6"), Seq("About")))
Controlling the Level of Parallelism
• All the pair RDD operations take an optional second
parameter for number of tasks

words.reduceByKey(lambda x, y: x + y, 5)
words.groupByKey(5)
visits.join(pageViews, 5)
Using Local Variables
• External variables you use in a closure will automatically
be shipped to the cluster:
query = raw_input("Enter a query:")
pages.filter(lambda x: x.startswith(query)).count()

• Some caveats:
» Each task gets a new copy (updates aren’t sent back)
» Variable must be Serializable (Java/Scala) or Pickle-able (Python)
» Don’t use fields of an outer object (ships all of it!)
Closure Mishap Example
class MyCoolRddApp {
  val param = 3.14
  val log = new Log(...)
  ...
  def work(rdd: RDD[Int]) {
    rdd.map(x => x + param)
       .reduce(...)
  }
}
// Throws NotSerializableException: MyCoolRddApp (or Log)

How to get around it:

class MyCoolRddApp {
  ...
  def work(rdd: RDD[Int]) {
    val param_ = param
    rdd.map(x => x + param_)
       .reduce(...)
  }
}
// References only the local variable param_ instead of this.param
Complete App: Scala
import spark.SparkContext
import spark.SparkContext._

object WordCount {
  def main(args: Array[String]) {
    val sc = new SparkContext("local", "WordCount", args(0), Seq(args(1)))
    val lines = sc.textFile(args(2))
    lines.flatMap(_.split(" "))
         .map(word => (word, 1))
         .reduceByKey(_ + _)
         .saveAsTextFile(args(3))
  }
}
Complete App: Python
import sys
from pyspark import SparkContext

if __name__ == "__main__":
    sc = SparkContext("local", "WordCount", sys.argv[0], None)
    lines = sc.textFile(sys.argv[1])

    lines.flatMap(lambda s: s.split(" ")) \
         .map(lambda word: (word, 1)) \
         .reduceByKey(lambda x, y: x + y) \
         .saveAsTextFile(sys.argv[2])
Example: PageRank
Why PageRank?
• Good example of a more complex algorithm
» Multiple stages of map & reduce

• Benefits from Spark’s in-memory caching


» Multiple iterations over the same data
Basic Idea
• Give pages ranks (scores) based on links to them
» Links from many pages → high rank
» Link from a high-rank page → high rank

Image: en.wikipedia.org/wiki/File:PageRank-hi-res-2.png
Algorithm
1. Start each page at a rank of 1
2. On each iteration, have page p contribute rank_p / |neighbors_p| to its neighbors
3. Set each page's rank to 0.15 + 0.85 × contribs
[Diagram sequence over several slides: an example link graph of four pages, all starting at rank 1.0; on each iteration the pages send rank/|neighbors| contributions along their out-links and ranks are updated, converging to a final state of roughly 1.44, 1.37, 0.73, and 0.46]
Scala Implementation
val links = // RDD of (url, neighbors) pairs
var ranks = // RDD of (url, rank) pairs

for (i <- 1 to ITERATIONS) {
  val contribs = links.join(ranks).flatMap {
    case (url, (links, rank)) =>
      links.map(dest => (dest, rank / links.size))
  }
  ranks = contribs.reduceByKey(_ + _)
                  .mapValues(0.15 + 0.85 * _)
}

ranks.saveAsTextFile(...)
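
A rough PySpark sketch of the same loop, assuming links is an RDD of (url, list-of-neighbor-urls) pairs; the tiny three-page graph here is made up for illustration:

from pyspark import SparkContext

sc = SparkContext("local[2]", "PageRankSketch")

# (url, [neighbor urls]) pairs; cached because they are reused every iteration
links = sc.parallelize([("a", ["b", "c"]),
                        ("b", ["c"]),
                        ("c", ["a"])]).cache()
ranks = links.mapValues(lambda _: 1.0)   # start every page at rank 1

ITERATIONS = 10
for _ in range(ITERATIONS):
    # each page sends rank / |neighbors| to each of its neighbors
    contribs = links.join(ranks).flatMap(
        lambda kv: [(dest, kv[1][1] / len(kv[1][0])) for dest in kv[1][0]])
    ranks = contribs.reduceByKey(lambda a, b: a + b) \
                    .mapValues(lambda s: 0.15 + 0.85 * s)

print(ranks.collect())
sc.stop()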
PageRank Performance

[Bar chart: iteration time in seconds on 30 and 60 machines – Hadoop: 171 s and 80 s; Spark: 23 s and 14 s]
Other Iterative Algorithms
[Bar chart: time per iteration in seconds – K-Means Clustering: Hadoop 155 s vs. Spark 4.1 s; Logistic Regression: Hadoop 110 s vs. Spark 0.96 s]
Summary
• Driver program: the programmer specifies the number of partitions
• Spark automatically pushes closures to the workers
• RDD partitions (and the code that runs on them) are distributed across workers
• The master parameter specifies the number of workers

[Diagram: driver program distributing code and RDD partitions to three workers]
Spark References
• http://spark.apache.org/docs/latest/programming-guide.html

• http://spark.apache.org/docs/latest/api/python/index.html
