
Spark Training in Bangalore


In-Memory Cluster Computing for Iterative and Interactive Applications


Presented By
Kelly Technologies
www.kellytechno.com

Commodity clusters have become an important computing platform for a variety of applications:
- In industry: search, machine translation, ad targeting, ...
- In research: bioinformatics, NLP, climate simulation, ...

High-level cluster programming models like MapReduce power many of these apps.

Theme of this work: provide similarly powerful abstractions for a broader class of applications.

Current popular programming models for clusters transform data flowing from stable storage to stable storage.

E.g., MapReduce:

[Diagram: Input → Map tasks → Reduce tasks → Output]


Acyclic data flow is a powerful abstraction, but is not efficient for applications that repeatedly reuse a working set of data:
- Iterative algorithms (many in machine learning)
- Interactive data mining tools (R, Excel, Python)

Spark makes working sets a first-class concept to efficiently support these apps.

Goal: provide distributed memory abstractions for clusters to support apps with working sets, while retaining the attractive properties of MapReduce:
- Fault tolerance (for crashes & stragglers)
- Data locality
- Scalability

Solution: augment the data flow model with resilient distributed datasets (RDDs).

We conjecture that Spark's combination of data flow with RDDs unifies many proposed cluster programming models:
- General data flow models: MapReduce, Dryad, SQL
- Specialized models for stateful apps: Pregel (BSP), HaLoop (iterative MR), Continuous Bulk Processing

Instead of specialized APIs for one type of app, give the user first-class control of distributed datasets.

Outline:
- Spark programming model
- Example applications
- Implementation
- Demo
- Future work

Spark programming model:
- Resilient distributed datasets (RDDs)
  - Immutable collections partitioned across the cluster that can be rebuilt if a partition is lost
  - Created by transforming data in stable storage using data flow operators (map, filter, group-by, ...)
  - Can be cached across parallel operations
- Parallel operations on RDDs
  - Reduce, collect, count, save, ...
- Restricted shared variables (see the sketch below)
  - Accumulators, broadcast variables
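A minimal sketch of the two restricted shared variables, written against the same "spark" context object the deck's examples use (the blacklist contents and file path are made up, and the exact broadcast/accumulator method names have varied across Spark versions):

val blacklist = spark.broadcast(Set("spam.example", "ads.example"))  // read-only value shipped to each worker once
val badLines = spark.accumulator(0)                                  // workers may only add; the driver reads the total

val cleaned = spark.textFile("hdfs://...").filter { line =>
  val ok = !blacklist.value.exists(s => line.contains(s))
  if (!ok) badLines += 1
  ok
}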

Example: load error messages from a log into memory, then interactively search for various patterns.

lines = spark.textFile("hdfs://...")
errors = lines.filter(_.startsWith("ERROR"))
messages = errors.map(_.split('\t')(2))
cachedMsgs = messages.cache()

cachedMsgs.filter(_.contains("foo")).count
cachedMsgs.filter(_.contains("bar")).count
. . .

[Diagram: the driver sends tasks to workers; each worker reads one block (Block 1-3) of the base RDD from HDFS, applies the transformations, and keeps its partition of the cached RDD in memory for later queries.]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data).

An RDD is an immutable, partitioned, logical collection of records:
- Need not be materialized, but rather contains information to rebuild a dataset from stable storage
- Partitioning can be based on a key in each record, using hash or range partitioning (see the sketch below)
- Built using bulk transformations on other RDDs
- Can be cached for future reuse
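As an illustration of key-based partitioning, a sketch using the hash scheme (HashPartitioner and its import path are the names used in later Apache Spark releases; the file path and record layout are assumptions):

import org.apache.spark.HashPartitioner

val pairs = spark.textFile("hdfs://...")
  .map(line => (line.split('\t')(0), line))                    // key each record by its first field
val byKey = pairs.partitionBy(new HashPartitioner(8)).cache()  // 8 hash partitions, kept in memory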

Transformations (define a new RDD):
  map, filter, sample, union, groupByKey, reduceByKey, join, cache, ...

Parallel operations (return a result to the driver):
  reduce, collect, count, save, lookupKey, ...
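Since an RDD need not be materialized, transformations only define a new dataset; nothing executes until a parallel operation asks for a result. A small sketch (hypothetical input path):

val nums  = spark.textFile("hdfs://...").map(_.trim.toInt)  // nothing runs yet
val evens = nums.filter(_ % 2 == 0)                         // still only a definition
val n = evens.count()                                       // a job runs here; the result returns to the driver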

RDDs maintain lineage information that can be used to reconstruct lost partitions.

Ex:

cachedMsgs = textFile(...).filter(_.contains("error"))
                          .map(_.split('\t')(2))
                          .cache()

Lineage: HdfsRDD (path: hdfs://...) → FilteredRDD (func: contains(...)) → MappedRDD (func: split(...)) → CachedRDD
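So if a cached partition is lost, only that partition is recomputed, by re-reading the corresponding HDFS block and re-applying the filter and map. In later Apache Spark releases the lineage chain can also be inspected directly; a usage sketch:

println(cachedMsgs.toDebugString)  // prints this RDD's ancestor chain, one transformation per line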

Benefits of the RDD model:
- Consistency is easy due to immutability
- Inexpensive fault tolerance (log lineage rather than replicating/checkpointing data)
- Locality-aware scheduling of tasks on partitions
- Despite being restricted, the model seems applicable to a broad variety of applications

Concern               RDDs                               Distr. Shared Mem.
--------------------  ---------------------------------  ---------------------------------
Reads                 Fine-grained                       Fine-grained
Writes                Bulk transformations               Fine-grained
Consistency           Trivial (immutable)                Up to app / runtime
Fault recovery        Fine-grained and low-overhead      Requires checkpoints and
                      using lineage                      program rollback
Straggler mitigation  Possible using speculative         Difficult
                      execution
Work migration        Automatic based on data locality   Up to app (but runtimes aim
                                                         for transparency)

Related work:

DryadLINQ
- Language-integrated API with SQL-like operations on lazy datasets
- Cannot have a dataset persist across queries

RAMCloud
- Allows random read/write to all cells, requiring logging much like distributed shared memory systems

Iterative MapReduce (Twister and HaLoop)
- Cannot define multiple distributed datasets, run different map/reduce pairs on them, or query data interactively

Piccolo
- Parallel programs with shared distributed tables; similar to distributed shared memory

Relational databases
- Lineage/provenance, logical logging, materialized views

Outline:
- Spark programming model
- Example applications
- Implementation
- Demo
- Future work

Example: logistic regression. Goal: find the best line separating two sets of points.

[Figure: two sets of points; a random initial line is iteratively adjusted toward the target separating line]

// Load points into memory and cache them across iterations
val data = spark.textFile(...).map(readPoint).cache()

// Start from a random parameter vector
var w = Vector.random(D)

for (i <- 1 to ITERATIONS) {
  // Gradient of the logistic loss, computed over all points in parallel
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}

println("Final w: " + w)

Logistic regression performance: 127 s per iteration with Hadoop; with Spark, the first iteration takes 174 s (loading the data), and further iterations take 6 s each once the points are cached.

MapReduce data flow can be expressed using RDD transformations:

res = data.flatMap(rec => myMapFunc(rec))
          .groupByKey()
          .map((key, vals) => myReduceFunc(key, vals))

Or with combiners:

res = data.flatMap(rec => myMapFunc(rec))
          .reduceByKey(myCombiner)
          .map((key, val) => myReduceFunc(key, val))

Word count in Spark:

val lines = spark.textFile("hdfs://...")
val counts = lines.flatMap(_.split("\\s"))
                  .map(word => (word, 1))
                  .reduceByKey(_ + _)
counts.save("hdfs://...")

Pregel: a graph processing framework from Google that implements the Bulk Synchronous Parallel model:
- Vertices in the graph have state
- At each superstep, each vertex can update its state and send messages to vertices for the next step

Good fit for PageRank, shortest paths, ...

[Diagram: Pregel data flow — the input graph plus vertex state 1 and messages 1 (grouped by vertex ID) feed superstep 1; its output, vertex state 2 and messages 2, is grouped by vertex ID again for superstep 2, and so on.]

[Diagram: PageRank in these terms — the input graph plus vertex ranks 1 and contributions 1 are grouped & added by vertex in superstep 1 (add contribs), producing vertex ranks 2 and contributions 2 for superstep 2, and so on.]

Implementing Pregel in Spark:
- Separate RDDs for the immutable graph state and for the vertex states and messages at each iteration
- Use groupByKey to perform each step
- Cache the resulting vertex and message RDDs
- Optimization: co-partition the input graph and vertex state RDDs to reduce communication

A PageRank sketch along these lines follows.
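A minimal PageRank sketch following that recipe, in the deck's Scala style (the input path, ITERATIONS, and the 0.15/0.85 damping constants are assumptions; links pairs each page with its outgoing neighbors):

// Immutable graph structure, built once and cached: page -> neighbors
val links = spark.textFile("hdfs://...")
  .map { line => val parts = line.split("\\s+"); (parts(0), parts(1)) }
  .groupByKey()
  .cache()

// Per-iteration vertex state: page -> rank
var ranks = links.mapValues(_ => 1.0)

for (i <- 1 to ITERATIONS) {
  // Each page sends rank/degree contributions to its neighbors...
  val contribs = links.join(ranks).flatMap { case (page, (neighbors, rank)) =>
    neighbors.map(dest => (dest, rank / neighbors.size))
  }
  // ...which are grouped and added by vertex, as in the diagram above
  ranks = contribs.reduceByKey(_ + _).mapValues(sum => 0.15 + 0.85 * sum)
}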

Example applications:
- Twitter spam classification (Justin Ma)
- EM algorithm for traffic prediction (Mobile Millennium)
- K-means clustering
- Alternating Least Squares matrix factorization
- In-memory OLAP aggregation on Hive data
- SQL on Spark (future work)

Outline:
- Spark programming model
- Example applications
- Implementation
- Demo
- Future work

Spark runs on the Mesos cluster manager [NSDI '11], letting it share resources with Hadoop & other apps.

It can read from any Hadoop input source (e.g. HDFS).

[Diagram: Spark, Hadoop, and MPI frameworks running side by side on Mesos across cluster nodes]

~6000 lines of Scala code, thanks to building on Mesos.

Scala closures are Serializable Java objects:
- Serialize on the driver, load & run on workers

Not quite enough:
- Nested closures may reference the entire outer scope
- May pull in non-Serializable variables not used inside
- Solution: bytecode analysis + reflection (see the sketch below)

Shared variables are implemented using a custom serialized form (e.g. a broadcast variable contains a pointer to a BitTorrent tracker).
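A small sketch of the nested-closure problem described above (the class and field names are illustrative; the RDD import path is the one used in later Apache Spark releases):

import org.apache.spark.rdd.RDD

class LogAnalysis {
  val threshold = 10                      // the only field the closure actually needs
  val connection = new java.net.Socket()  // non-Serializable baggage on the same object

  // "_.length > threshold" compiles into a reference to this.threshold, so naive
  // serialization would try to ship the whole object, Socket and all; the bytecode
  // analysis mentioned above trims the closure to the fields it really uses.
  def countLong(lines: RDD[String]): Long =
    lines.filter(_.length > threshold).count()
}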

Modified the Scala interpreter to allow Spark to be used interactively from the command line. Required two changes:
- Modified wrapper code generation so that each line typed has references to the objects for its dependencies
- Place generated classes in a distributed filesystem

Enables in-memory exploration of big data.

Outline:
- Spark programming model
- Example applications
- Implementation
- Demo
- Future work

Future work:

Further extend RDD capabilities:
- Control over storage layout (e.g. column-oriented)
- Additional caching options (e.g. on disk, replicated)

Leverage lineage for debugging:
- Replay any task, rebuild any intermediate RDD

Adaptive checkpointing of RDDs

Higher-level analytics tools built on top of Spark

By making distributed datasets a first-class primitive, Spark provides a simple, efficient programming model for stateful data analytics.

RDDs provide:
- Lineage info for fault recovery and debugging
- Adjustable in-memory caching
- Locality-aware parallel operations

We plan to make Spark the basis of a suite of batch and interactive data analysis tools.

Internally, each RDD is characterized by:
- A set of partitions
- Preferred locations for each partition
- An optional partitioning scheme (hash or range)
- A storage strategy (lazy or cached)
- Parent RDDs (forming a lineage DAG)

A sketch of this interface follows.
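A hedged Scala sketch of that five-part interface (trait and member names are illustrative, not Spark's exact internals):

// Stand-in types for the sketch
trait Partition { def index: Int }
trait Partitioner  // hash- or range-based

trait RDD[T] {
  def partitions: Seq[Partition]                     // set of partitions
  def preferredLocations(p: Partition): Seq[String]  // locality hints per partition
  def partitioner: Option[Partitioner]               // optional hash/range scheme
  def parents: Seq[RDD[_]]                           // lineage DAG
  def compute(p: Partition): Iterator[T]             // how to (re)build one partition
  def cached: Boolean                                // storage strategy: lazy vs. cached
}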

Presented By
Kelly Technologies
www.kellytechno.com
