Spark Training in Bangalore
[Diagram: MapReduce data flow: Input → parallel Map tasks → shuffle → Reduce tasks → Output.]
www.kellytechno.com
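The flow above can be sketched in plain Python with no Hadoop installation; `map_fn` and the in-line reduce are hypothetical stand-ins for the user-supplied map and reduce functions:

```python
from collections import defaultdict

def map_fn(line):
    # Map phase: emit a (word, 1) pair for every word in the input line.
    for word in line.split():
        yield (word, 1)

def mapreduce(lines):
    # Shuffle: group intermediate values by key, as the framework would.
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):
            groups[key].append(value)
    # Reduce phase: fold each key's values into a single result.
    return {key: sum(values) for key, values in groups.items()}

print(mapreduce(["to be or not", "to be"]))
```

This is only the single-machine shape of the idea; the framework's job is running the map and reduce phases on many nodes in parallel.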
lines = spark.textFile("hdfs://...")
errors = lines.filter(_.startsWith("ERROR"))
messages = errors.map(_.split('\t')(2))
cachedMsgs = messages.cache()

cachedMsgs.filter(_.contains("foo")).count
cachedMsgs.filter(_.contains("bar")).count
. . .

[Diagram: the driver ships tasks to workers; each worker reads its HDFS block (Block 1, Block 2, Block 3), keeps the filtered messages in its cache ("Cached RDD"), and returns results to the driver. Later parallel operations reuse the cached data.]
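The same idea in a minimal Python sketch: transformations are lazy and only run when an action forces them, and `cache()` keeps the materialized data for reuse. The `MiniRDD` class and its methods are illustrative, not Spark's API:

```python
class MiniRDD:
    def __init__(self, compute):
        self._compute = compute      # thunk that produces the data
        self._cached = None          # filled in after cache() + first action
        self._keep = False

    def filter(self, pred):
        # Transformation: lazily defines a new dataset; nothing runs yet.
        return MiniRDD(lambda: [x for x in self._materialize() if pred(x)])

    def map(self, fn):
        return MiniRDD(lambda: [fn(x) for x in self._materialize()])

    def cache(self):
        self._keep = True
        return self

    def _materialize(self):
        if self._cached is not None:
            return self._cached
        data = self._compute()
        if self._keep:
            self._cached = data      # reused by every later action
        return data

    def count(self):
        # Action: forces evaluation.
        return len(self._materialize())

lines = MiniRDD(lambda: ["ERROR\tdisk\tfoo", "INFO\tok\tbar", "ERROR\tnet\tbar"])
messages = (lines.filter(lambda l: l.startswith("ERROR"))
                 .map(lambda l: l.split("\t")[2])
                 .cache())
print(messages.filter(lambda m: "bar" in m).count())
```

The second `count` over `messages` would hit the cache instead of re-reading the source, which is the point of `cache()` in the log-mining example.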
Transformations (define a new RDD):
map, filter, sample, union, groupByKey, reduceByKey, join, cache

Parallel operations (return a result to the driver):
reduce, collect, count, save, lookupKey
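The transformation/action split can be illustrated with Python generators: chaining transformations builds a lazy pipeline, and only a count-style action pulls data through it. The `log` list here is just instrumentation to show when elements are actually produced:

```python
log = []

def numbers():
    for n in range(5):
        log.append(n)        # record when an element is actually produced
        yield n

# "Transformations": nothing is computed yet.
evens = (n for n in numbers() if n % 2 == 0)
squares = (n * n for n in evens)
assert log == []             # pipeline defined, not executed

# "Action": pulls every element through the whole pipeline.
result = sum(1 for _ in squares)
print(result)                # the count of even squares
assert log == [0, 1, 2, 3, 4]
```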
Ex:
cachedMsgs = textFile(...).filter(_.contains("error"))
                          .map(_.split('\t')(2))
                          .cache()

[Lineage: HdfsRDD (path: hdfs://) → FilteredRDD (func: contains(...)) → MappedRDD (func: split(...)) → CachedRDD]
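Lineage can be sketched as a chain of small objects, each remembering its parent and the function it applies, so any dataset is recomputable from the source. The class and node names below mirror the slide's labels, not Spark internals:

```python
class LineageRDD:
    def __init__(self, name, compute, parent=None):
        self.name, self._compute, self.parent = name, compute, parent

    def filter(self, pred):
        return LineageRDD("FilteredRDD",
                          lambda: [x for x in self._compute() if pred(x)],
                          parent=self)

    def map(self, fn):
        return LineageRDD("MappedRDD",
                          lambda: [fn(x) for x in self._compute()],
                          parent=self)

    def lineage(self):
        # Walk the parent pointers back to the source dataset.
        chain, node = [], self
        while node:
            chain.append(node.name)
            node = node.parent
        return list(reversed(chain))

source = LineageRDD("HdfsRDD", lambda: ["ERROR\ta\tx", "WARN\tb\ty"])
msgs = source.filter(lambda l: "ERROR" in l).map(lambda l: l.split("\t")[2])
print(msgs.lineage())
print(msgs._compute())   # recomputable at any time by replaying the chain
```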
Concern               RDDs                                   Distr. Shared Mem.
Reads                 Fine-grained                           Fine-grained
Writes                Bulk transformations                   Fine-grained
Consistency           Trivial (immutable)                    Up to app / runtime
Fault recovery        Fine-grained and low-overhead          Requires checkpoints
                      using lineage                          and program rollback
Straggler mitigation  Possible using speculative             Difficult
                      execution
Work placement        Automatic based on data locality       Up to app (but
                                                             runtimes aim for
                                                             transparency)
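The fault-recovery row in practice: if a cached partition is lost, it is rebuilt by replaying its transformations over the surviving base partition, with no checkpoint or rollback. A toy Python sketch (all names illustrative):

```python
def recompute(partition, transformations):
    # Replay the lineage (an ordered list of functions) over the base data.
    data = list(partition)
    for fn in transformations:
        data = fn(data)
    return data

base_partition = ["ERROR a", "INFO b", "ERROR c"]
lineage = [
    lambda d: [x for x in d if x.startswith("ERROR")],  # filter step
    lambda d: [x.split()[1] for x in d],                # map step
]

cached = recompute(base_partition, lineage)     # first computation
cached = None                                   # simulate losing the cached copy
recovered = recompute(base_partition, lineage)  # rebuild from lineage alone
print(recovered)
```

Because the transformations are deterministic and the input is immutable, the recovered partition is identical to the lost one.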
Related work: DryadLINQ; RAMCloud; Piccolo; and, from relational databases, lineage/provenance, logical logging, and materialized views.
[Figure: logistic regression example: '+' data points and the target separating plane.]
var w = Vector.random(D)

for (i <- 1 to ITERATIONS) {
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}

println("Final w: " + w)
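The same gradient update can be checked in plain Python on a tiny 1-D dataset; the gradient term mirrors the Scala expression above, while the data, the zero initialization, and the 0.5 step size are made up for this illustration:

```python
import math

# Toy data: (x, y) pairs with y in {-1, +1}; positive x means label +1.
data = [(-2.0, -1), (-1.0, -1), (1.0, 1), (2.0, 1)]

w = 0.0  # start from zero instead of random, for reproducibility
for _ in range(100):
    # Same form as the slide: (1 / (1 + exp(-y * (w . x))) - 1) * y * x
    gradient = sum(
        (1 / (1 + math.exp(-y * w * x)) - 1) * y * x
        for x, y in data
    )
    w -= 0.5 * gradient  # small step size (not part of the slide)

# After training, sign(w * x) should reproduce every label.
assert all((1 if w * x > 0 else -1) == y for x, y in data)
print("Final w:", w)
```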
[Performance chart: 127 s / iteration]
Or with combiners:

res = data.flatMap(rec => myMapFunc(rec))
          .reduceByKey(myCombiner)
          .map { case (key, value) => myReduceFunc(key, value) }
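A sketch of what the combiner buys you, in plain Python: each "partition" pre-aggregates its pairs locally before a final merge, so far less data crosses the shuffle. Function and variable names are illustrative:

```python
def reduce_by_key(partitions, combiner):
    # Map-side combine: aggregate within each partition first...
    per_partition = []
    for part in partitions:
        acc = {}
        for key, value in part:
            acc[key] = combiner(acc[key], value) if key in acc else value
        per_partition.append(acc)
    # ...then merge the (much smaller) per-partition results.
    merged = {}
    for acc in per_partition:
        for key, value in acc.items():
            merged[key] = combiner(merged[key], value) if key in merged else value
    return merged

parts = [[("a", 1), ("b", 1), ("a", 1)], [("a", 1), ("b", 1)]]
print(reduce_by_key(parts, lambda x, y: x + y))
```

Without the local pass, every single pair would be shuffled; with it, at most one pair per key per partition is.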
[Diagram: Pregel data flow: Input graph → Vertex state 1 → Superstep 1 → Messages 1, grouped by vertex ID → Vertex state 2 → Superstep 2 → Messages 2, grouped by vertex ID → ...]
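One superstep of the flow above, sketched in Python: group incoming messages by vertex ID, then run the vertex program on every vertex to produce new state plus outgoing messages. The `update(vid, value, incoming)` signature is an assumption for this sketch, not Pregel's actual API:

```python
from collections import defaultdict

def superstep(state, messages, update):
    # Group messages by destination vertex ID.
    inbox = defaultdict(list)
    for dst, msg in messages:
        inbox[dst].append(msg)
    # Run the vertex program on every vertex; collect new state and messages.
    new_state, out = {}, []
    for vid, value in state.items():
        value2, sent = update(vid, value, inbox[vid])
        new_state[vid] = value2
        out.extend(sent)
    return new_state, out

# Toy vertex program: state is a running sum of received messages;
# each vertex forwards its new value to vertex "a".
def update(vid, value, incoming):
    value2 = value + sum(incoming)
    return value2, [("a", value2)]

state = {"a": 0, "b": 1}
state, msgs = superstep(state, [("a", 5)], update)
print(state, msgs)
```

Iterating `superstep` with each round's outgoing messages as the next round's input reproduces the diagram's Superstep 1 → Superstep 2 chain.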
[Diagram: PageRank as Pregel supersteps: Input graph → Vertex ranks 1 → Contributions 1 → Superstep 1 (group & add contributions by vertex) → Vertex ranks 2 → Contributions 2 → Superstep 2 (group & add contributions by vertex) → ...]
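Each pair of steps in the diagram is one PageRank iteration: every vertex divides its rank among its out-links, contributions are grouped and summed per destination, and ranks are updated with damping. The 0.15/0.85 constants are the standard choice, not taken from the slide:

```python
from collections import defaultdict

def pagerank_step(ranks, links):
    # Each page sends rank / len(out-links) along every out-link...
    contribs = defaultdict(float)
    for page, out in links.items():
        for dest in out:
            contribs[dest] += ranks[page] / len(out)
    # ...then each page's new rank is built from its grouped contributions.
    return {page: 0.15 + 0.85 * contribs[page] for page in ranks}

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = {p: 1.0 for p in links}
for _ in range(20):
    ranks = pagerank_step(ranks, links)
print(ranks)
```

Since every page has at least one out-link here, total rank is conserved across iterations, which is a handy sanity check.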
Each RDD is characterized by:
- A set of partitions
- Preferred locations for each partition
- An optional partitioning scheme (hash or range)
- A storage strategy (lazy or cached)
- Parent RDDs (forming a lineage DAG)
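That characterization can be written down as a small abstract interface; the method names mirror the bullets above and are not Spark's actual internal API:

```python
class AbstractRDD:
    """Minimal sketch of the information every RDD must expose."""

    def partitions(self):
        raise NotImplementedError       # the set of partitions

    def preferred_locations(self, partition):
        return []                       # e.g. HDFS block hosts; none by default

    def partitioner(self):
        return None                     # optional hash/range partitioning scheme

    def parents(self):
        return []                       # parent RDDs, forming the lineage DAG

class ParallelListRDD(AbstractRDD):
    # A leaf RDD: an in-memory list split into fixed-size partitions.
    def __init__(self, data, num_partitions):
        size = max(1, -(-len(data) // num_partitions))  # ceiling division
        self._parts = [data[i:i + size] for i in range(0, len(data), size)]

    def partitions(self):
        return self._parts

rdd = ParallelListRDD(list(range(10)), 3)
print(rdd.partitions())
```

A derived RDD (say, a filtered one) would return its source from `parents()`, which is exactly how the lineage DAG gets built.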
Presented By
Kelly Technologies
www.kellytechno.com