Distributed Database Systems
Lecture 3 – Spark I
Action:
messages.filter(lambda s: "foo" in s).count()
messages.filter(lambda s: "bar" in s).count()
. . .
[Figure: the driver dispatches tasks to workers; each worker caches its block of the messages RDD in memory]
Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)
Task Scheduler
• Supports general task graphs
• Pipelines functions where possible (see the sketch below)
[Figure: example task graph of RDDs A–F split into stages; a groupBy marks a stage boundary]
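A hedged PySpark sketch of what "pipelining" means in practice (the data and names here are made up): map and filter run back-to-back inside one stage, while groupByKey forces a shuffle and starts a new stage.

pairs = sc.parallelize(range(100)) \
          .map(lambda x: (x % 10, x)) \
          .filter(lambda kv: kv[1] > 5)        # map + filter are pipelined within one stage
sums = pairs.groupByKey() \
            .mapValues(lambda vs: sum(vs))     # groupByKey is a shuffle boundary (new stage)
print(sums.collect())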
Creating an RDD
>>> rDD = sc.parallelize([1, 2, 3, 4, 5])

Creating an RDD from a File
>>> distFile = sc.textFile("...", 4)
>>> distFile
MappedRDD[2] at textFile at NativeMethodAccessorImpl.java:-2
Transformation          Description
map(func)               return a new distributed dataset formed by passing each element of the source through a function func
filter(func)            return a new dataset formed by selecting those elements of the source on which func returns true
distinct([numTasks])    return a new dataset that contains the distinct elements of the source dataset
flatMap(func)           similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item)
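A small sketch of these four transformations on a locally parallelized RDD (the input data is invented for illustration; the result comments show what collect() would return):

rdd = sc.parallelize([1, 2, 2, 3, 4])
rdd.map(lambda x: x * 2)                      # [2, 4, 4, 6, 8]
rdd.filter(lambda x: x % 2 == 0)              # [2, 2, 4]
rdd.distinct()                                # [1, 2, 3, 4] (order may vary)
words = sc.parallelize(["hello world", "hi"])
words.flatMap(lambda line: line.split(" "))   # ['hello', 'world', 'hi']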
Review: Python lambda Functions
• Small anonymous functions (not bound to a name), e.g.:
  lambda a, b: a + b
  » returns the sum of its two arguments
>>> rdd.filter(lambda x: x % 2 == 0)
RDD: [1, 2, 3, 4] → [2, 4]
Function literals such as these lambdas are closures that are automatically passed to workers.
Transforming an RDD
lines = sc.textFile("...", 4)
comments = lines.filter(isComment)
Lazy evaluation means nothing executes yet – Spark only saves the recipe for transforming the source.
Spark Actions
• Cause Spark to execute the recipe to transform the source
>>> rdd.take(2)
Value: [1,2] # as list
>>> rdd.collect()
Value: [1,2,3] # as list
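Putting laziness and actions together, a minimal sketch (the file path and filter condition are assumptions):

lines = sc.textFile("data.txt", 4)                      # nothing is read yet
comments = lines.filter(lambda s: s.startswith("#"))    # still lazy: only the recipe is recorded
print(comments.count())     # action: Spark now reads the file and applies the filter
print(comments.take(2))     # action: returns the first two matching lines as a list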
Getting Data Out of RDDs
print lines.count()
Spark Program Lifecycle
1. Create RDDs from external data or parallelize a
collection in your driver program
2. Lazily transform them into new RDDs
3. cache() some RDDs for reuse
4. Perform actions to execute parallel
computation and produce results
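A hedged end-to-end sketch of these four steps (the file name and filter condition are made up):

# 1. create an RDD from external data
log = sc.textFile("server.log")
# 2. lazily transform it into a new RDD
errors = log.filter(lambda line: "ERROR" in line)
# 3. cache it because it is reused by several actions
errors.cache()
# 4. actions trigger the parallel computation
print(errors.count())
print(errors.take(5))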
Spark Key-Value RDDs
• Similar to MapReduce, Spark supports Key-Value pairs

Key-Value Transformation    Description
reduceByKey(func)           return a new distributed dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V, V) ⇒ V
sortByKey()                 return a new dataset of (K, V) pairs sorted by keys in ascending order
groupByKey()                return a new dataset of (K, Iterable<V>) pairs
Key-Value Transformations
>>> rdd = sc.parallelize([(1,2), (3,4), (3,6)])
>>> rdd.reduceByKey(lambda a, b: a + b)
RDD: [(1,2), (3,4), (3,6)] → [(1,2), (3,10)]
Pair-RDD operations also take an optional argument for the number of tasks:
words.reduceByKey(lambda x, y: x + y, 5)
words.groupByKey(5)
visits.join(pageViews, 5)
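A small sketch of the three pair-RDD transformations above on an invented pets RDD (the result comments show what collect() would return):

pets = sc.parallelize([("cat", 1), ("dog", 1), ("cat", 2)])
pets.reduceByKey(lambda a, b: a + b)   # [('cat', 3), ('dog', 1)]
pets.groupByKey()                      # [('cat', <values 1, 2>), ('dog', <values 1>)]
pets.sortByKey()                       # [('cat', 1), ('cat', 2), ('dog', 1)]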
Using Local Variables
• External variables you use in a closure will automatically
be shipped to the cluster:
query = raw_input("Enter a query:")
pages.filter(lambda x: x.startswith(query)).count()
• Some caveats:
» Each task gets a new copy (updates aren’t sent back)
» Variable must be Serializable (Java/Scala) or Pickle-able (Python)
» Don’t use fields of an outer object (ships all of it!)
Closure Mishap Example
class MyCoolRddApp {
  val param = 3.14
  val log = new Log(...)
  ...
  def work(rdd: RDD[Int]) {
    rdd.map(x => x + param)
       .reduce(...)
  }
}
NotSerializableException: MyCoolRddApp (or Log)

How to get around it:
class MyCoolRddApp {
  ...
  def work(rdd: RDD[Int]) {
    val param_ = param
    rdd.map(x => x + param_)
       .reduce(...)
  }
}
References only the local variable instead of this.param
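The same pitfall exists in PySpark: a lambda that mentions self drags the whole object (and any unpicklable fields) into the closure. A hedged Python sketch of the workaround, with hypothetical names mirroring the Scala example:

class MyCoolRddApp(object):
    def __init__(self):
        self.param = 3.14
        self.log = open("app.log", "w")          # not pickle-able, like Log above

    def work(self, rdd):
        # BAD: rdd.map(lambda x: x + self.param) would try to ship the whole object
        param_ = self.param                      # copy the field into a local variable
        return rdd.map(lambda x: x + param_).reduce(lambda a, b: a + b)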
Complete App: Scala
import spark.SparkContext
import spark.SparkContext._

object WordCount {
  def main(args: Array[String]) {
    val sc = new SparkContext("local", "WordCount", args(0), Seq(args(1)))
    val lines = sc.textFile(args(2))
    lines.flatMap(_.split(" "))
         .map(word => (word, 1))
         .reduceByKey(_ + _)
         .saveAsTextFile(args(3))
  }
}
Complete App: Python
import sys
from pyspark import SparkContext

if __name__ == "__main__":
    sc = SparkContext("local", "WordCount", sys.argv[0], None)
    lines = sc.textFile(sys.argv[1])
    # same word count as the Scala version above
    counts = lines.flatMap(lambda s: s.split(" ")) \
                  .map(lambda word: (word, 1)) \
                  .reduceByKey(lambda x, y: x + y)
    counts.saveAsTextFile(sys.argv[2])
Example: PageRank
Image: en.wikipedia.org/wiki/File:PageRank-hi-res-2.png
Algorithm
1. Start each page at a rank of 1
2. On each iteration, have page p contribute rank_p / |neighbors_p| to its neighbors
3. Set each page's rank to 0.15 + 0.85 × contribs
[Figure: rank values on a four-node example graph over successive iterations – every page starts at 1.0, an early iteration gives 1.85 / 1.0 / 0.58 / 0.58, a later one 1.72 / 1.31 / 0.58 / 0.39, and the final state is 1.44 / 1.37 / 0.73 / 0.46]
Scala Implementation
val links = // RDD of (url, neighbors) pairs
var ranks = // RDD of (url, rank) pairs
// ... iterative update of ranks (elided) ...
ranks.saveAsTextFile(...)
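The skeleton above elides the iteration loop. A hedged PySpark sketch of the same update rule (the example links, iteration count, and output path are assumptions):

links = sc.parallelize([("a", ["b", "c"]), ("b", ["a"]), ("c", ["a", "b"])]).cache()
ranks = links.mapValues(lambda neighbors: 1.0)           # start every page at rank 1
for i in range(10):                                      # assumed number of iterations
    # each page sends rank / |neighbors| to every neighbor
    contribs = links.join(ranks).flatMap(
        lambda url_nr: [(dest, url_nr[1][1] / len(url_nr[1][0]))
                        for dest in url_nr[1][0]])
    # new rank = 0.15 + 0.85 * sum of received contributions
    ranks = contribs.reduceByKey(lambda a, b: a + b) \
                    .mapValues(lambda s: 0.15 + 0.85 * s)
ranks.saveAsTextFile("ranks_out")                        # assumed output path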
PageRank Performance
[Chart: iteration time (s) vs. number of machines – Hadoop: 171 s on 30 machines, 80 s on 60 machines; Spark: 23 s on 30 machines, 14 s on 60 machines]
Other Iterative Algorithms
155 Hadoop
K-Means Clustering
4.1 Spark
0 30 60 90 120 150 180
110
Logistic Regression
0.96
0 25 50 75 100 125
• http://spark.apache.org/docs/latest/api/python/index.html