
Distributed Database Systems

Lecture 3 – Spark I

Some slides taken from Matei Zaharia and Anthony Joseph
What is Spark?
• Fast, expressive cluster computing system compatible with Apache Hadoop
  » Works with any Hadoop-supported storage system (HDFS, S3, Avro, …)

• Improves efficiency through (up to 100× faster):
  » In-memory computing primitives
  » General computation graphs

• Improves usability through (often 2-10× less code):
  » Rich APIs in Java, Scala, Python
  » Interactive shell
How to Run It
• Local multicore: just a library in your program
• EC2: scripts for launching a Spark cluster
• Private cluster: Mesos, YARN, Standalone Mode
Languages
• APIs in Java, Scala and Python
• Interactive shells in Scala and Python
This Lecture
Programming Spark
Resilient Distributed Datasets (RDDs)
Creating an RDD
Spark Transformations and Actions
Spark Programming Model
Key Idea
• Work with distributed collections as you would with
local ones

• Concept: resilient distributed datasets (RDDs)


» Immutable collections of objects spread across a cluster
» Built through parallel transformations (map, filter, etc)
» Automatically rebuilt on failure
» Controllable persistence (e.g. caching in RAM)
Operations
• Transformations (e.g. map, filter, groupBy, join)
» Lazy operations to build RDDs from other RDDs

• Actions (e.g. count, collect, save)


» Return a result or write it to storage
Spark Driver and Workers
• A Spark program is two programs: a driver program and a workers program
• Worker programs run on cluster nodes or in local threads
• RDDs are distributed across workers

[Diagram: your application (the driver program) creates a SparkContext, which connects through a cluster manager or local threads to Spark executors on worker nodes; data lives in Amazon S3, HDFS, or other storage]
Example: Mining Console Logs
• Load error messages from a log into memory, then interactively search for patterns

lines = spark.textFile("hdfs://...")                      # base RDD
errors = lines.filter(lambda s: s.startswith("ERROR"))    # transformed RDD
messages = errors.map(lambda s: s.split('\t')[2])
messages.cache()

messages.filter(lambda s: "foo" in s).count()             # action
messages.filter(lambda s: "bar" in s).count()
. . .

[Diagram: the driver ships tasks to workers; each worker reads a block of the file, caches its partition of messages, and returns results to the driver]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)
Task Scheduler
• Supports general task graphs
• Pipelines functions where possible
• Cache-aware data reuse & locality
• Partitioning-aware to avoid shuffles

[Diagram: a DAG of RDDs A-F split into stages; Stage 1 ends in a groupBy, Stage 2 applies map and filter, Stage 3 performs a join; shaded boxes mark cached partitions]
Spark Context
• A Spark program first creates a SparkContext object
» Tells Spark how and where to access a cluster
» pySpark shell and Databricks Cloud automatically create the sc variable
» iPython and programs must use a constructor to create a new
SparkContext

• Use SparkContext to create RDDs


Create a SparkContext
Scala:

import spark.SparkContext
import spark.SparkContext._

val sc = new SparkContext("masterUrl", "name", "sparkHome", Seq("app.jar"))

Java:

import spark.api.java.JavaSparkContext;

JavaSparkContext sc = new JavaSparkContext(
  "masterUrl", "name", "sparkHome", new String[] {"app.jar"});

Python:

from pyspark import SparkContext

sc = SparkContext("masterUrl", "name", "sparkHome", ["library.py"])

The constructor arguments are: the cluster URL (or local / local[N]), the app name, the Spark install path on the cluster, and the list of JARs or libraries with app code (to ship).
Spark Essentials: Master
• The master parameter for a SparkContext determines which type and size of cluster to use

Master Parameter    Description
local               run Spark locally with one worker thread (no parallelism)
local[K]            run Spark locally with K worker threads (ideally set to the number of cores)
spark://HOST:PORT   connect to a Spark standalone cluster; PORT depends on config (7077 by default)
mesos://HOST:PORT   connect to a Mesos cluster; PORT depends on config (5050 by default)
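
For example, in PySpark the same constructor shown earlier can be pointed at any of these masters. A minimal sketch (the app name and standalone host below are made-up placeholders):

from pyspark import SparkContext

# Run locally with 4 worker threads (no cluster needed)
sc = SparkContext("local[4]", "MasterParamDemo")
print(sc.master)       # => local[4]
sc.stop()

# Connecting to a standalone cluster would look similar
# (hypothetical host name; 7077 is the default standalone port):
# sc = SparkContext("spark://master-host:7077", "MasterParamDemo")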
Resilient Distributed Datasets
• The primary abstraction in Spark
» Immutable once constructed
» Track lineage information to efficiently recompute lost data
» Enable operations on collection of elements in parallel

• You construct RDDs


» by parallelizing existing Python collections (lists)
» by transforming an existing RDD
» from files in HDFS or any other storage system
RDDs
• Programmer specifies the number of partitions for an RDD (a default value is used if unspecified)
• More partitions = more parallelism

[Diagram: an RDD of 25 items (item-1 … item-25) split into 5 partitions, distributed across three workers, each running a Spark executor]
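
A minimal PySpark sketch of setting and inspecting the partition count (the data is made up; getNumPartitions() and glom() are standard RDD methods):

from pyspark import SparkContext

sc = SparkContext("local[3]", "PartitionsDemo")

items = list(range(1, 26))        # stand-ins for item-1 ... item-25
rdd = sc.parallelize(items, 5)    # ask for 5 partitions explicitly

print(rdd.getNumPartitions())     # => 5
print(rdd.glom().collect())       # the contents of each of the 5 partitions
sc.stop()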
RDDs
• Two types of operations: transformations and actions

• Transformations are lazy (not computed immediately)

• Transformed RDD is executed when action runs on it

• Persist (cache) RDDs in memory or disk


Working with RDDs
• Create an RDD from a data source: <list>
• Apply transformations to an RDD: map, filter
• Apply actions to an RDD: collect, count

[Diagram: <list> –parallelize→ RDD –filter→ filtered RDD –map→ mapped RDD –collect→ Result]

The collect action causes the parallelize, filter, and map transforms to be executed.
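
A minimal sketch of that pipeline in PySpark (the data and the predicates are made up); nothing runs until collect() is called:

from pyspark import SparkContext

sc = SparkContext("local[2]", "PipelineDemo")

rdd = sc.parallelize([1, 2, 3, 4, 5, 6])      # create an RDD from a list
filtered = rdd.filter(lambda x: x % 2 == 0)   # transformation (lazy)
mapped = filtered.map(lambda x: x * 10)       # transformation (lazy)

print(mapped.collect())                       # action: triggers execution => [20, 40, 60]
sc.stop()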
Creating an RDD
• Create RDDs from Python collections (lists)

>>> data = [1, 2, 3, 4, 5]
>>> data
[1, 2, 3, 4, 5]
>>> rDD = sc.parallelize(data, 4)
>>> rDD
ParallelCollectionRDD[0] at parallelize at PythonRDD.scala:229

No computation occurs with sc.parallelize() – Spark only records how to create the RDD with four partitions
Creating RDDs
• From HDFS, text files, Hypertable, Amazon S3, Apache HBase, SequenceFiles, any other Hadoop InputFormat, and directory or glob wildcard: /data/201404*

>>> distFile = sc.textFile("README.md", 4)
>>> distFile
MappedRDD[2] at textFile at NativeMethodAccessorImpl.java:-2
Creating an RDD from a File
distFile = sc.textFile("...", 4)

• RDD distributed in 4 partitions


• Elements are lines of input
• Lazy evaluation means
no execution happens now
Spark Transformations
• Create new datasets from an existing one
• Use lazy evaluation: results not computed right
away – instead Spark remembers set of
transformations applied to base dataset
» Spark optimizes the required calculations
» Spark recovers from failures and slow workers

• Think of this as a recipe for creating result


Some Transformations

Transformation        Description
map(func)             return a new distributed dataset formed by passing each element of the source through a function func
filter(func)          return a new dataset formed by selecting those elements of the source on which func returns true
distinct([numTasks])  return a new dataset that contains the distinct elements of the source dataset
flatMap(func)         similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item)
Review: Python lambda Functions
• Small anonymous functions (not bound to a name)
    lambda a, b: a + b
  » returns the sum of its two arguments
• Can use lambda functions wherever function objects are required
• Restricted to a single expression
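
For example, the lambda above behaves like an equivalent named function (a small sketch, names chosen for illustration):

def add(a, b):                    # named function
    return a + b

add_lambda = lambda a, b: a + b   # same behaviour, anonymous form

print(add(2, 3))                  # => 5
print(add_lambda(2, 3))           # => 5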
Transformations
>>> rdd = sc.parallelize([1, 2, 3, 4])
>>> rdd.map(lambda x: x * 2)
RDD: [1, 2, 3, 4] → [2, 4, 6, 8]

>>> rdd.filter(lambda x: x % 2 == 0)
RDD: [1, 2, 3, 4] → [2, 4]

>>> rdd2 = sc.parallelize([1, 4, 2, 2, 3])
>>> rdd2.distinct()
RDD: [1, 4, 2, 2, 3] → [1, 4, 2, 3]

Function literals are closures, automatically passed to workers
Transformations
>>> rdd = sc.parallelize([1, 2, 3])
>>> rdd.map(lambda x: [x, x+5])
RDD: [1, 2, 3] → [[1, 6], [2, 7], [3, 8]]

>>> rdd.flatMap(lambda x: [x, x+5])
RDD: [1, 2, 3] → [1, 6, 2, 7, 3, 8]

Function literals are closures, automatically passed to workers
Transforming an RDD
lines = sc.textFile("...", 4)
comments = lines.filter(isComment)

Lazy evaluation means nothing executes – Spark saves the recipe for transforming the source

[Diagram: the partitions of lines feeding into the partitions of comments]
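
isComment is not defined on the slide; a minimal runnable sketch, assuming comment lines start with '#' and using an in-memory list in place of the text file:

from pyspark import SparkContext

sc = SparkContext("local[2]", "CommentsDemo")

def isComment(line):
    # hypothetical predicate: treat lines starting with '#' as comments
    return line.strip().startswith("#")

lines = sc.parallelize(["# header", "data 1", "# note", "data 2"], 4)
comments = lines.filter(isComment)   # lazy: nothing executes yet
print(comments.count())              # the action forces evaluation => 2
sc.stop()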
Spark Actions
• Cause Spark to execute recipe to transform source

• Mechanism for getting results out of Spark


Some Actions
Action                    Description
reduce(func)              aggregate the dataset's elements using function func; func takes two arguments and returns one, and must be commutative and associative so that it can be computed correctly in parallel
take(n)                   return an array with the first n elements
collect()                 return all the elements as an array (WARNING: make sure the result will fit in the driver program)
takeOrdered(n, key=func)  return n elements in ascending order, or as specified by the optional key function
Getting Data Out of RDDs
>>> rdd = sc.parallelize([1, 2, 3])
>>> rdd.reduce(lambda a, b: a * b)
Value: 6

>>> rdd.take(2)
Value: [1,2] # as list

>>> rdd.collect()
Value: [1,2,3] # as list
Getting Data Out of RDDs

>>> rdd = sc.parallelize([5,3,1,2])


>>> rdd.takeOrdered(3, lambda s: -1 * s)
Value: [5,3,2] # as list
Spark Programming Model
lines = sc.textFile("...", 4)

print lines.count()

count() causes Spark to:
• read the data
• sum within partitions
• combine the sums in the driver

[Diagram: the four partitions of lines, each producing a partial count]
Spark Programming Model
lines = sc.textFile("...", 4)
comments = lines.filter(isComment)

print lines.count(), comments.count()

Spark recomputes lines:
• read the data (again)
• sum within partitions
• combine the sums in the driver

[Diagram: the partitions of lines and comments, each recomputed from scratch]
Caching RDDs
lines = sc.textFile("...", 4)
lines.cache()  # save, don't recompute!
comments = lines.filter(isComment)

print lines.count(), comments.count()

[Diagram: the partitions of lines are kept in RAM after the first action, so comments and later counts reuse them]
Spark Program Lifecycle
1. Create RDDs from external data or parallelize a
collection in your driver program
2. Lazily transform them into new RDDs
3. cache() some RDDs for reuse
4. Perform actions to execute parallel
computation and produce results
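
A minimal sketch tying the four lifecycle steps together (the data and app name are made up):

from pyspark import SparkContext

sc = SparkContext("local[2]", "LifecycleDemo")

# 1. Create an RDD by parallelizing a collection in the driver program
nums = sc.parallelize(range(1, 101), 4)

# 2. Lazily transform it into new RDDs
squares = nums.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

# 3. Cache an RDD that will be reused
squares.cache()

# 4. Perform actions to execute the parallel computation
print(squares.count())   # => 50
print(squares.take(3))   # => [4, 16, 36]
sc.stop()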
Spark Key-Value RDDs
• Similar to MapReduce, Spark supports key-value pairs

• Each element of a Pair RDD is a pair tuple

>>> rdd = sc.parallelize([(1, 2), (3, 4)])


RDD: [(1, 2), (3, 4)]
Working with Key-Value Pairs
• Spark’s “distributed reduce” transformations act on RDDs
of key-value pairs
• Python: pair = (a, b)
pair[0] # => a
pair[1] # => b

• Scala: val pair = (a, b)


pair._1 // => a
pair._2 // => b

• Java: Tuple2 pair = new Tuple2(a, b); // class scala.Tuple2


pair._1 // => a
pair._2 // => b
Some Key-Value Transformations

Key-Value Transformation  Description
reduceByKey(func)         return a new distributed dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V, V) → V
sortByKey()               return a new dataset of (K, V) pairs sorted by keys in ascending order
groupByKey()              return a new dataset of (K, Iterable<V>) pairs
Key-Value Transformations
>>> rdd = sc.parallelize([(1,2), (3,4), (3,6)])
>>> rdd.reduceByKey(lambda a, b: a + b)
RDD: [(1,2), (3,4), (3,6)] → [(1,2), (3,10)]

>>> rdd2 = sc.parallelize([(1,'a'), (2,'c'), (1,'b')])


>>> rdd2.sortByKey()
RDD: [(1,'a'), (2,'c'), (1,'b')] →
[(1,'a'), (1,'b'), (2,'c')]
Key-Value Transformations
>>> rdd2 = sc.parallelize([(1,'a'), (2,'c'), (1,'b')])
>>> rdd2.groupByKey()
RDD: [(1,'a'), (2,'c'), (1,'b')] →
[(1,['a','b']), (2,['c'])]

Be careful using groupByKey(): it can cause a lot of data movement across the network and create large Iterables at the workers
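
When the goal is per-key aggregation, reduceByKey combines values inside each partition before shuffling, so it usually moves far less data than groupByKey. A small sketch of the two approaches (data made up):

from pyspark import SparkContext

sc = SparkContext("local[2]", "GroupVsReduceDemo")

pairs = sc.parallelize([(1, 2), (3, 4), (3, 6), (1, 5)], 4)

# Preferred for aggregation: partial sums are computed per partition
# before any data is shuffled
print(pairs.reduceByKey(lambda a, b: a + b).collect())   # => [(1, 7), (3, 10)] (order may vary)

# groupByKey ships every individual value across the network first,
# and we still have to sum the grouped values ourselves
print(pairs.groupByKey().mapValues(sum).collect())       # same result, more data movement
sc.stop()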
Multiple Datasets
visits = sc.parallelize([("index.html", "1.2.3.4"),
                         ("about.html", "3.4.5.6"),
                         ("index.html", "1.3.3.1")])

pageNames = sc.parallelize([("index.html", "Home"), ("about.html", "About")])

visits.join(pageNames)
# ("index.html", ("1.2.3.4", "Home"))
# ("index.html", ("1.3.3.1", "Home"))
# ("about.html", ("3.4.5.6", "About"))

visits.cogroup(pageNames)
# ("index.html", (Seq("1.2.3.4", "1.3.3.1"), Seq("Home")))
# ("about.html", (Seq("3.4.5.6"), Seq("About")))
Controlling the Level of Parallelism
• All the pair RDD operations take an optional second
parameter for number of tasks

words.reduceByKey(lambda x, y: x + y, 5)
words.groupByKey(5)
visits.join(pageViews, 5)
Using Local Variables
• External variables you use in a closure will automatically
be shipped to the cluster:
query = raw_input("Enter a query:")
pages.filter(lambda x: x.startswith(query)).count()

• Some caveats:
» Each task gets a new copy (updates aren’t sent back)
» Variable must be Serializable (Java/Scala) or Pickle-able (Python)
» Don’t use fields of an outer object (ships all of it!)
Closure Mishap Example
class MyCoolRddApp {
  val param = 3.14
  val log = new Log(...)
  ...
  def work(rdd: RDD[Int]) {
    rdd.map(x => x + param)
       .reduce(...)
  }
}
// Throws NotSerializableException: MyCoolRddApp (or Log)

How to get around it:

class MyCoolRddApp {
  ...
  def work(rdd: RDD[Int]) {
    val param_ = param
    rdd.map(x => x + param_)
       .reduce(...)
  }
}
// References only the local variable param_ instead of this.param
Complete App: Scala
import spark.SparkContext
import spark.SparkContext._

object WordCount {
  def main(args: Array[String]) {
    val sc = new SparkContext("local", "WordCount", args(0), Seq(args(1)))
    val lines = sc.textFile(args(2))
    lines.flatMap(_.split(" "))
         .map(word => (word, 1))
         .reduceByKey(_ + _)
         .saveAsTextFile(args(3))
  }
}
Complete App: Python
import sys
from pyspark import SparkContext

if __name__ == "__main__":
    sc = SparkContext("local", "WordCount", sys.argv[0], None)
    lines = sc.textFile(sys.argv[1])

    lines.flatMap(lambda s: s.split(" ")) \
         .map(lambda word: (word, 1)) \
         .reduceByKey(lambda x, y: x + y) \
         .saveAsTextFile(sys.argv[2])
Example: PageRank
Why PageRank?
• Good example of a more complex algorithm
» Multiple stages of map & reduce

• Benefits from Spark’s in-memory caching


» Multiple iterations over the same data
Basic Idea
• Give pages ranks (scores) based on links to them
» Links from many pages → high rank
» Link from a high-rank page → high rank

Image: en.wikipedia.org/wiki/File:PageRank-hi-res-2.png
Algorithm
1. Start each page at a rank of 1
2. On each iteration, have page p contribute rank_p / |neighbors_p| to its neighbors
3. Set each page's rank to 0.15 + 0.85 × contribs
[Diagram sequence over several slides: an example link graph of four pages, all starting at rank 1.0; on each iteration the pages send rank/|neighbors| contributions along their out-links and ranks are updated, converging to a final state of roughly 1.44, 1.37, 0.73, and 0.46]
Scala Implementation
val links = // RDD of (url, neighbors) pairs
var ranks = // RDD of (url, rank) pairs

for (i <- 1 to ITERATIONS) {
  val contribs = links.join(ranks).flatMap {
    case (url, (links, rank)) =>
      links.map(dest => (dest, rank / links.size))
  }
  ranks = contribs.reduceByKey(_ + _)
                  .mapValues(0.15 + 0.85 * _)
}

ranks.saveAsTextFile(...)
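
A rough PySpark sketch of the same loop, assuming links is an RDD of (url, list-of-neighbor-urls) pairs; the tiny three-page graph here is made up for illustration:

from pyspark import SparkContext

sc = SparkContext("local[2]", "PageRankSketch")

# (url, [neighbor urls]) pairs; cached because they are reused every iteration
links = sc.parallelize([("a", ["b", "c"]),
                        ("b", ["c"]),
                        ("c", ["a"])]).cache()
ranks = links.mapValues(lambda _: 1.0)   # start every page at rank 1

ITERATIONS = 10
for _ in range(ITERATIONS):
    # each page sends rank / |neighbors| to each of its neighbors
    contribs = links.join(ranks).flatMap(
        lambda kv: [(dest, kv[1][1] / len(kv[1][0])) for dest in kv[1][0]])
    ranks = contribs.reduceByKey(lambda a, b: a + b) \
                    .mapValues(lambda s: 0.15 + 0.85 * s)

print(ranks.collect())
sc.stop()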
PageRank Performance

[Bar chart: iteration time in seconds on 30 and 60 machines – Hadoop: 171 s and 80 s; Spark: 23 s and 14 s]
Other Iterative Algorithms
[Bar chart: time per iteration in seconds – K-Means Clustering: Hadoop 155 s vs. Spark 4.1 s; Logistic Regression: Hadoop 110 s vs. Spark 0.96 s]
Summary
• Driver program: the programmer specifies the number of partitions
• Spark automatically pushes closures to the workers
• RDD partitions (and the code that runs on them) are distributed across workers
• The master parameter specifies the number of workers

[Diagram: driver program distributing code and RDD partitions to three workers]
Spark References
• http://spark.apache.org/docs/latest/programming-guide.html

• http://spark.apache.org/docs/latest/api/python/index.html
