Transformations and Actions: A Visual Guide of the API
http://training.databricks.com/visualapi.pdf
Databricks would like to give special thanks to Jeff Thompson for contributing 67
visual diagrams depicting the Spark API under the MIT license to the Spark
community. Jeff's original, creative work can be found here, and you can read
more about Jeff's project in his blog post.
Databricks Cloud:
A unified platform for building Big Data pipelines, from ETL to Exploration and
Dashboards, to Advanced Analytics and Data Products.
Legend (RDD elements in the diagrams): key, original item, partition(s),
transformed type, user input, user functions, emitted value, input, object on driver.
Legend (markers): randomized operation; numeric calculation.
Operations = TRANSFORMATIONS + ACTIONS (each operation is rated easy or medium in the original deck)

Essential Core & Intermediate Spark Operations

TRANSFORMATIONS
General: map, filter, flatMap, mapPartitions, mapPartitionsWithIndex, groupBy, sortBy
Math / Statistical: sample, randomSplit
Set Theory / Relational: union, intersection, subtract, distinct, cartesian, zip
Data Structure / I/O: keyBy, zipWithIndex, zipWithUniqueID, zipPartitions, coalesce, repartition, repartitionAndSortWithinPartitions, pipe

ACTIONS
General: reduce, collect, aggregate, fold, first, take, forEach, top, treeAggregate, treeReduce, forEachPartition, collectAsMap
Math / Statistical: count, takeSample, max, min, sum, histogram, mean, variance, stdev, sampleVariance, countApprox, countApproxDistinct
Set Theory / Relational: takeOrdered
Data Structure / I/O: saveAsTextFile, saveAsSequenceFile, saveAsObjectFile, saveAsHadoopDataset, saveAsHadoopFile, saveAsNewAPIHadoopDataset, saveAsNewAPIHadoopFile
Essential Core & Intermediate PairRDD Operations

TRANSFORMATIONS
General: flatMapValues, groupByKey, reduceByKey, reduceByKeyLocally, foldByKey, aggregateByKey, sortByKey, combineByKey, keys, values
Math / Statistical: sampleByKey
Set Theory / Relational: cogroup (=groupWith), join, subtractByKey, fullOuterJoin, leftOuterJoin, rightOuterJoin
Data Structure: partitionBy

ACTIONS
Math / Statistical: countByKey, countByValue, countByValueApprox, countApproxDistinctByKey, countByKeyApprox, sampleByKeyExact
narrow vs wide dependencies

narrow: each partition of the parent RDD is used by at most one partition of the child RDD
wide: multiple child RDD partitions may depend on a single parent RDD partition

From the RDD paper: "We found it both sufficient and useful to classify dependencies
into two types: narrow dependencies, where each partition of the parent RDD is used by
at most one partition of the child RDD, and wide dependencies, where multiple child
partitions may depend on a single parent partition."
Examples. narrow: map, filter; union; join with co-partitioned inputs.
wide: groupByKey; join with inputs that are not co-partitioned.
A short sketch of where the boundary falls follows this list.
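The boundary can be seen from an RDD's lineage. A minimal PySpark sketch (the pipeline itself is illustrative, not from the original deck):

x = sc.parallelize(range(10), 2)
pairs = x.map(lambda n: (n % 2, n))   # narrow: each child partition depends on one parent partition
groups = pairs.groupByKey()           # wide: a shuffle; child partitions read from many parent partitions
print(groups.toDebugString())         # the lineage shows a shuffle at the wide boundary (PySpark returns this as bytes)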
TRANSFORMATIONS
Core Operations
MAP
(diagram: RDD: x is transformed into RDD: y, 3 items in the RDD; the user function is applied item by item, with before and after views of x and y)
map(f, preservesPartitioning=False)
Return a new RDD by applying a function to each element of the original RDD.
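The code for this slide did not survive extraction; a minimal PySpark sketch in the deck's style (the sample values are assumed):

x = sc.parallelize(["b", "a", "c"])
y = x.map(lambda z: (z, 1))    # each input item emits exactly one output item
print(x.collect())
print(y.collect())
x: ['b', 'a', 'c']
y: [('b', 1), ('a', 1), ('c', 1)]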
FILTER
(diagram: RDD: x is transformed into RDD: y, 3 items in the RDD; the user function is applied to each item, and the item is emitted into y only when the function returns True)
FILTER
filter(f)
Return a new RDD containing only the elements that satisfy a predicate
x = sc.parallelize([1,2,3])
y = x.filter(lambda x: x%2 == 1) #keep odd values
print(x.collect())
print(y.collect())
x: [1, 2, 3]
y: [1, 3]
val x = sc.parallelize(Array(1,2,3))
val y = x.filter(n => n%2 == 1)
println(x.collect().mkString(", "))
println(y.collect().mkString(", "))
FLATMAP
(diagram: RDD: x is transformed into RDD: y, 3 items in the RDD; each item emits zero or more values, which are flattened into y)
FLATMAP
flatMap(f, preservesPartitioning=False)
Return a new RDD by first applying a function to all elements of this RDD, and then flattening the results
x = sc.parallelize([1,2,3])
y = x.flatMap(lambda x: (x, x*100, 42))
print(x.collect())
print(y.collect())
x: [1, 2, 3]
y: [1, 100, 42, 2, 200, 42, 3, 300, 42]
val x = sc.parallelize(Array(1,2,3))
val y = x.flatMap(n => Array(n, n*100, 42))
println(x.collect().mkString(", "))
println(y.collect().mkString(", "))
GROUPBY
(diagram: RDD: x, 4 items in the RDD: James, Anna, Fred, John; each item is keyed by the user function and grouped into RDD: y, emitting [ John, James ], [ Fred ], and [ Anna ])
GROUPBY
groupBy(f, numPartitions=None)
Group the data in the original RDD. Create pairs where the key is the output of
a user function, and the value is all items for which the function yields this key.
y: [('A',['Anna']),('J',['John','James']),('F',['Fred'])]
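The code for this slide is missing from the extraction; a PySpark sketch consistent with the diagram and the output above:

x = sc.parallelize(['John', 'Fred', 'Anna', 'James'])
y = x.groupBy(lambda w: w[0])    # key each name by its first letter
print([(k, list(v)) for (k, v) in y.collect()])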
GROUPBYKEY
(diagram: Pair RDD: x, 5 items in the RDD: (B,5), (B,4), (A,3), (A,2), (A,1); the values are grouped by key into RDD: y: (A, [2,3,1]) and (B, [5,4]))
GROUPBYKEY
groupByKey(numPartitions=None)
Group the values for each key in the original RDD. Create a new pair where the
original key corresponds to this collected group of values.
x = sc.parallelize([('B',5),('B',4),('A',3),('A',2),('A',1)])
y = x.groupByKey()
print(x.collect())
print(list((j[0], list(j[1])) for j in y.collect()))
REDUCEBYKEY vs GROUPBYKEY
(diagram: both compute (a,6) and (b,5) from the same (a,1) and (b,1) pairs. With reduceByKey, pairs are combined within each partition first, so three (a,1) pairs become (a,3) locally and only small partial results such as (a,2), (a,3), (b,2) are shuffled across the network. With groupByKey, every individual (a,1) and (b,1) pair is shuffled before any combining happens, moving far more data for the same answer.)
MAPPARTITIONS
(diagram: RDD: x is transformed into RDD: y; the user function is applied once per partition, A and B, rather than per item)
mapPartitions(f, preservesPartitioning=False)
Return a new RDD by applying a function to each partition of the original RDD.
x = sc.parallelize([1,2,3], 2)
def f(iterator): yield sum(iterator); yield 42
y = x.mapPartitions(f)
print(y.glom().collect())
val x = sc.parallelize(Array(1,2,3), 2)
def f(i:Iterator[Int])={ (i.sum,42).productIterator }
val y = x.mapPartitions(f)
MAPPARTITIONSWITHINDEX
(diagram: RDD: x is transformed into RDD: y; the user function is applied once per partition and also receives the partition index as input)
mapPartitionsWithIndex(f, preservesPartitioning=False)
Return a new RDD by applying a function to each partition of the original RDD, while tracking the index of the original partition.
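Only one line of the original example survived extraction; a runnable PySpark sketch around it (the function body is an assumption consistent with the diagram):

x = sc.parallelize([1, 2, 3], 2)
def f(partitionIndex, iterator): yield (partitionIndex, sum(iterator))
y = x.mapPartitionsWithIndex(f)
print(y.glom().collect())    # e.g. [[(0, 1)], [(1, 5)]]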
SAMPLE
(diagram: RDD: x with 5 items, 1 through 5, is randomly sampled into RDD: y)
sample(withReplacement, fraction, seed=None)
Return a new RDD containing a statistical sample of the original RDD.
x = sc.parallelize([1, 2, 3, 4, 5])
y = x.sample(False, 0.4, 42)
print(x.collect())
print(y.collect())
x: [1, 2, 3, 4, 5]
val x = sc.parallelize(Array(1, 2, 3, 4, 5))
val y = x.sample(false, 0.4)
// omitting seed will yield different output
println(y.collect().mkString(", "))
y: [1, 3]
UNION
(diagram: RDD: x [1, 2, 3] and RDD: y [3, 4] are concatenated into RDD: z [1, 2, 3, 3, 4], keeping the partitions of both inputs)
UNION
Return a new RDD containing all items from two original RDDs. Duplicates are not culled.
union(otherRDD)
x = sc.parallelize([1,2,3], 2)
y = sc.parallelize([3,4], 1)
z = x.union(y)
print(z.glom().collect())
x: [1, 2, 3]
y: [3, 4]
val x = sc.parallelize(Array(1,2,3), 2)
val y = sc.parallelize(Array(3,4), 1)
val z = x.union(y)
val zOut = z.glom().collect()
JOIN
(diagram: Pair RDD: x, holding (B,2) and (A,1), is joined by key with Pair RDD: y, holding (A,3), (A,4), (B,5), producing RDD: z: (A,(1,3)), (A,(1,4)), (B,(2,5)))
JOIN
Return a new RDD containing all pairs of elements having the same key in the original RDDs
join(otherRDD, numPartitions=None)
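The code for this slide is missing; a PySpark sketch matching the diagram's data (pair order in the result may vary):

x = sc.parallelize([('B', 2), ('A', 1)])
y = sc.parallelize([('A', 3), ('A', 4), ('B', 5)])
z = x.join(y)    # inner join on the key
print(z.collect())
z: [('A', (1, 3)), ('A', (1, 4)), ('B', (2, 5))]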
DISTINCT
(diagram: RDD: x [1, 2, 3, 3, 4] is transformed into RDD: y [1, 2, 3, 4]; the duplicate 3 is removed)
DISTINCT
Return a new RDD containing distinct items from the original RDD (omitting all duplicates)
distinct(numPartitions=None)
x = sc.parallelize([1,2,3,3,4])
y = x.distinct()
print(y.collect())
x: [1, 2, 3, 3, 4]
y: [1, 2, 3, 4]
val x = sc.parallelize(Array(1,2,3,3,4))
val y = x.distinct()
println(y.collect().mkString(", "))
COALESCE
(diagram: RDD: x with three partitions A, B, C becomes RDD: y with two partitions AB and C; partitions A and B are merged without a shuffle)
coalesce(numPartitions, shuffle=False)
Return a new RDD which is reduced to a smaller number of partitions.
x = sc.parallelize([1, 2, 3, 4, 5], 3)
y = x.coalesce(2)
print(x.glom().collect())
print(y.glom().collect())
val x = sc.parallelize(Array(1, 2, 3, 4, 5), 3)
val y = x.coalesce(2)
val xOut = x.glom().collect()
val yOut = y.glom().collect()
KEYBY
(diagram: RDD: x, holding James, Anna, Fred, John, is transformed into Pair RDD: y; each item emits a pair keyed by its first letter: (J, John), (F, Fred), (A, Anna), (J, James))
KEYBY
keyBy(f)
Create a Pair RDD, forming one pair for each item in the original RDD. The
pair's key is calculated from the value via a user-supplied function.
y: [('J','John'),('F','Fred'),('A','Anna'),('J','James')]
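The code for this slide is missing; a PySpark sketch consistent with the output above:

x = sc.parallelize(['John', 'Fred', 'Anna', 'James'])
y = x.keyBy(lambda w: w[0])    # the key is the first letter of each name
print(y.collect())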
PARTITIONBY
(diagram: Pair RDD: x, holding (J,John), (A,Anna), (F,Fred), (J,James) across 3 partitions, is repartitioned into RDD: y with 2 partitions; keys below 'H', namely (A,Anna) and (F,Fred), land in partition 0, and the J pairs land in partition 1)
PARTITIONBY
Return a new RDD with the specified number of partitions, placing original
items into the partition returned by a user-supplied function.
partitionBy(numPartitions, partitioner=portable_hash)
x = sc.parallelize([('J','James'),('F','Fred'),
('A','Anna'),('J','John')], 3)
y = x.partitionBy(2, lambda w: 0 if w[0] < 'H' else 1)
print(x.glom().collect())
print(y.glom().collect())
import org.apache.spark.Partitioner
val x = sc.parallelize(Array(('J',"James"),('F',"Fred"),
('A',"Anna"),('J',"John")), 3)
val y = x.partitionBy(new Partitioner() {
val numPartitions = 2
def getPartition(k:Any) = {
if (k.asInstanceOf[Char] < 'H') 0 else 1
}
})
val yOut = y.glom().collect()
x: Array(Array((A,Anna), (F,Fred)),
Array((J,John), (J,James)))
y: Array(Array((F,Fred), (A,Anna)),
Array((J,John), (J,James)))
ZIP
(diagram: RDD: x [1, 2, 3] and RDD: y [1, 4, 9], with matching partitions A and B and matching element order, are combined into Pair RDD: z: (1,1), (2,4), (3,9))
ZIP
Return a new RDD containing pairs whose key is the item in the original RDD, and whose
value is that item's corresponding element (same partition, same index) in a second RDD.
zip(otherRDD)
x = sc.parallelize([1, 2, 3])
y = x.map(lambda n:n*n)
z = x.zip(y)
print(z.collect())
x: [1, 2, 3]
y: [1, 4, 9]
z: [(1, 1), (2, 4), (3, 9)]
val x = sc.parallelize(Array(1,2,3))
val y = x.map(n=>n*n)
val z = x.zip(y)
println(z.collect().mkString(", "))
ACTIONS
Core Operations
(diagram: distributed vs driver; actions return results from the distributed workers to the driver)
GETNUMPARTITIONS
(diagram: an RDD with two partitions, A and B, yields the partition count 2 on the driver)
getNumPartitions()
Return the number of partitions in the RDD.
y: 2
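The code is missing from the extraction; a minimal sketch consistent with y: 2 (the input values are assumed):

x = sc.parallelize([1, 2, 3], 2)
y = x.getNumPartitions()    # returns the number of partitions, not their contents
print(x.glom().collect())   # [[1], [2, 3]]
print(y)                    # 2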
COLLECT
(diagram: the items from all partitions are returned to the driver as a single list)
collect()
Return all items in the RDD to the driver in a single list.
y: [1, 2, 3]
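Likewise missing from the extraction; a minimal sketch consistent with y: [1, 2, 3]:

x = sc.parallelize([1, 2, 3], 2)
y = x.collect()    # pulls every item back to the driver as one list
print(y)           # [1, 2, 3]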
REDUCE
(diagram: the items 1, 2, 3, 4 are combined pairwise with running partial results, e.g. 1+2=3, then 3+3=6, then 6+4=10, until the single value 10 is emitted to the driver)
reduce(f)
Aggregate all the items in the RDD by applying a user function pairwise to items and partial results, and return the result to the driver.
x = sc.parallelize([1,2,3,4])
y = x.reduce(lambda a,b: a+b)
print(x.collect())
print(y)
val x = sc.parallelize(Array(1,2,3,4))
val y = x.reduce((a,b) => a+b)
println(x.collect.mkString(", "))
println(y)
x: [1, 2, 3, 4]
y: 10
AGGREGATE
(diagram: the items 1, 2, 3, 4 sit in two partitions, A and B. Starting from the zero value ([], 0) in each partition, the seqOp folds items into a per-partition accumulator: partition A builds ([1], 1) and then ([1,2], 3), while partition B builds ([3], 3) and then ([3,4], 7). The combOp then merges the two accumulators into ([1,2,3,4], 10), which is returned to the driver.)
aggregate(zeroValue, seqOp, combOp)
Aggregate all the items in the RDD by first applying a user function (seqOp) to fold each item into a per-partition accumulator, then merging the accumulators with a second user function (combOp), and finally returning the result to the driver.
x: [1, 2, 3, 4]
y: ([1, 2, 3, 4], 10)
(Scala output: (Array(3, 1, 2, 4),10); element order within the accumulator depends on partitioning)
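The code itself was lost to extraction; a PySpark sketch that reproduces the accumulators in the diagram (the seqOp and combOp bodies are assumptions consistent with the outputs above):

seqOp = lambda acc, item: (acc[0] + [item], acc[1] + item)    # fold one item into a (list, sum) accumulator
combOp = lambda a, b: (a[0] + b[0], a[1] + b[1])              # merge two per-partition accumulators
x = sc.parallelize([1, 2, 3, 4], 2)
y = x.aggregate(([], 0), seqOp, combOp)
print(y)    # ([1, 2, 3, 4], 10); list order depends on partitioning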
MAX
(diagram: the items 2, 4, 1 yield the maximum value 4 on the driver)
max()
Return the maximum item in the RDD.
x: [2, 4, 1]
val x = sc.parallelize(Array(2,4,1))
val y = x.max
println(x.collect().mkString(", "))
println(y)
y: 4
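The Python half of this slide is missing; a sketch mirroring the Scala code above:

x = sc.parallelize([2, 4, 1])
y = x.max()
print(x.collect())
print(y)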
SUM
(diagram: the items 2, 4, 1 are summed to 7 on the driver)
sum()
Return the sum of the items in the RDD.
x: [2, 4, 1]
val x = sc.parallelize(Array(2,4,1))
val y = x.sum
println(x.collect().mkString(", "))
println(y)
y: 7
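As above, the Python half is missing; a mirror of the Scala code:

x = sc.parallelize([2, 4, 1])
y = x.sum()
print(x.collect())
print(y)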
MEAN
(diagram: the items 2, 4, 1 yield the mean 2.3333333 on the driver)
mean()
Return the mean of the items in the RDD.
x: [2, 4, 1]
val x = sc.parallelize(Array(2,4,1))
val y = x.mean
println(x.collect().mkString(", "))
println(y)
y: 2.3333333
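The Python half, mirrored from the Scala code:

x = sc.parallelize([2, 4, 1])
y = x.mean()
print(x.collect())
print(y)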
STDEV
(diagram: the items 2, 4, 1 yield the standard deviation 1.2472191 on the driver)
stdev()
Return the standard deviation of the items in the RDD.
x: [2, 4, 1]
val x = sc.parallelize(Array(2,4,1))
val y = x.stdev
println(x.collect().mkString(", "))
println(y)
y: 1.2472191
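The Python half, mirrored from the Scala code:

x = sc.parallelize([2, 4, 1])
y = x.stdev()
print(x.collect())
print(y)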
COUNTBYKEY
(diagram: a Pair RDD holding (J,John), (A,Anna), (F,Fred), (J,James) yields a map of key counts, J: 2, A: 1, F: 1, on the driver)
countByKey()
Return a map of keys and counts of their occurrences in the RDD.
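The code is missing; a sketch using the diagram's pairs (countByKey returns a dict-like map on the driver):

x = sc.parallelize([('J', 'James'), ('F', 'Fred'), ('A', 'Anna'), ('J', 'John')])
y = x.countByKey()
print(y)    # e.g. {'J': 2, 'F': 1, 'A': 1}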
SAVEASTEXTFILE
saveAsTextFile(path, compressionCodecClass=None)
Save the RDD to the filesystem indicated in the path.
x: [2, 4, 1]
y: [u'2', u'4', u'1']
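The code is missing; a minimal sketch consistent with the outputs above (the path /tmp/demo is an assumption, not from the original):

x = sc.parallelize([2, 4, 1])
x.saveAsTextFile("/tmp/demo")    # writes one part-file per partition at the given path
y = sc.textFile("/tmp/demo")     # read the saved lines back as strings
print(y.collect())               # e.g. ['2', '4', '1']; order may vary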
LAB
Q&A