Stratosphere System Overview Big Data Beers Berlin. 20.11.2013

Stratosphere:
System Overview
Robert Metzger
mail@robertmetzger.de
Twitter: @rmetzger_
Big Data Beers Meetup, Nov. 19th, 2013

Stratosphere
… is a distributed data processing engine
… automatically handles parallelization
… brings database technology to the world of big data

Overview
● Extends MapReduce with more operators
○ Known from Hadoop: map, reduce
○ New in Stratosphere: cross, join, cogroup
● Support for advanced data flow graphs
[Figure: a single M → R pipeline (known from Hadoop) next to a graph of M, R and J operators (new in Stratosphere)]
● Compiler/Optimizer, Java/Scala Interface, YARN
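To make the operators listed above more concrete, here is a minimal single-machine sketch in plain Java of what cross, join and cogroup compute. It does not use the Stratosphere API: the class `OperatorSemantics`, the helper `Groups` and all method names are hypothetical, and the nested loops are chosen for readability, not speed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

// Hypothetical single-machine illustration; not the Stratosphere API.
final class OperatorSemantics {

    // The two groups that a cogroup hands over for one key; either may be empty.
    static final class Groups<A, B> {
        final List<A> left = new ArrayList<>();
        final List<B> right = new ArrayList<>();
    }

    // cross: pairs every left element with every right element.
    static <A, B> List<Map.Entry<A, B>> cross(List<A> left, List<B> right) {
        List<Map.Entry<A, B>> out = new ArrayList<>();
        for (A a : left) {
            for (B b : right) {
                out.add(Map.entry(a, b));
            }
        }
        return out;
    }

    // join: keeps only the pairs whose keys are equal.
    static <K, A, B> List<Map.Entry<A, B>> join(List<A> left, Function<A, K> leftKey,
                                                 List<B> right, Function<B, K> rightKey) {
        List<Map.Entry<A, B>> out = new ArrayList<>();
        for (A a : left) {
            for (B b : right) {
                if (leftKey.apply(a).equals(rightKey.apply(b))) {
                    out.add(Map.entry(a, b));
                }
            }
        }
        return out;
    }

    // cogroup: for every key seen on either side, collects both groups together.
    static <K extends Comparable<K>, A, B> Map<K, Groups<A, B>> coGroup(
            List<A> left, Function<A, K> leftKey,
            List<B> right, Function<B, K> rightKey) {
        Map<K, Groups<A, B>> out = new TreeMap<>();
        for (A a : left) {
            out.computeIfAbsent(leftKey.apply(a), k -> new Groups<>()).left.add(a);
        }
        for (B b : right) {
            out.computeIfAbsent(rightKey.apply(b), k -> new Groups<>()).right.add(b);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> left = List.of(1, 2, 3);
        List<Integer> right = List.of(2, 3, 4);
        Function<Integer, Integer> self = Function.identity();
        System.out.println("cross pairs:  " + cross(left, right).size());
        System.out.println("join pairs:   " + join(left, self, right, self));
        System.out.println("cogroup keys: " + coGroup(left, self, right, self).keySet());
    }
}
```

Unlike join, cogroup also reports keys that occur on only one side, which is why its groups may be empty.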

Stratosphere System Stack
Java API | Scala API | Meteor | ... | Hive
Stratosphere Optimizer
Stratosphere Runtime | Hadoop MR
Cluster Manager: YARN | Direct | EC2
Storage: Local Files | HDFS | S3 | ...

Stratosphere in a Cluster
● Operators are executed over the whole cluster
● Side by side with Hadoop
● Scales by adding more nodes
● Support for YARN is in development
● We have a LocalExecutor
[Figure: Job Submission goes to the Master Node (JobManager: Resource Mgmt, Compiler, Web Interface); 4 Worker Nodes each run a TaskManager (Stratosphere) next to a DataNode (Hadoop)]

1. Data Flows

2. Optimizer

3. Iterations

4. Scala Interface

Data Flows: Execution Models
● Apache Hadoop MR is limited to one data flow: M → R
● One of many possible data flows in Stratosphere:
[Figure: a data flow graph combining M, R and J operators]

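Complex Data Flows in Hadoop
[Figure: a data flow with Filtering (M), Grouping (R), Joining (J) and Grouping (R) steps, and the chain of separate M → R jobs needed to express it in Hadoop]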

Data Flows: Lessons Learned

1. Most tasks do not fit the MapReduce model
2. Very expensive
○ Always go to disk and HDFS

3. Tedious to implement
○ Custom data types and file formats between jobs

That’s why higher level abstractions for MR exist.

Advanced Data Flows in Stratosphere
● Data flow graphs are supported natively
● Stratosphere only writes to disk if necessary, otherwise in-memory
[Figure: a data flow graph of M, R and J operators]

Skeleton of a Stratosphere Program
● Input: text file, JDBC source, CSV, etc.
● Transformations
○ map, reduce, join, iterate etc.

● Output: to file etc.
● Data Types
○ PactRecord: Tuples with n fields.
○ Custom data types for vectors, images, audio (we only expect serialization and comparison)
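As a rough idea of what "serialization and comparison" can mean for a custom data type, the following plain Java sketch defines a small 2-D vector that writes itself to a byte stream, reads itself back, and defines an ordering. It deliberately implements no Stratosphere interface: the class `Vector2` and the method names `write`, `read` and `roundTrip` are assumptions made for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical custom data type; the method names are assumptions, not a Stratosphere interface.
final class Vector2 implements Comparable<Vector2> {
    private double x;
    private double y;

    // Empty constructor so that a runtime could create an instance and then call read().
    Vector2() { }

    Vector2(double x, double y) {
        this.x = x;
        this.y = y;
    }

    // Serialization: write the fields in a fixed order.
    void write(DataOutput out) throws IOException {
        out.writeDouble(x);
        out.writeDouble(y);
    }

    // Deserialization: read the fields back in the same order.
    void read(DataInput in) throws IOException {
        x = in.readDouble();
        y = in.readDouble();
    }

    // Comparison: order by x first, then by y.
    @Override
    public int compareTo(Vector2 other) {
        int byX = Double.compare(x, other.x);
        return byX != 0 ? byX : Double.compare(y, other.y);
    }

    // Serializes a value into memory and reads it back into a fresh instance.
    static Vector2 roundTrip(Vector2 value) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            value.write(new DataOutputStream(bytes));
            Vector2 copy = new Vector2();
            copy.read(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Vector2 original = new Vector2(1.5, -2.0);
        System.out.println("round trip equal: " + (roundTrip(original).compareTo(original) == 0));
    }
}
```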

Data Flows: Code Example
[Figure: data flow of the example job with operators M, R, J, R]

FileDataSource customers = new FileDataSource(TextInputFormat.class, customersPath);
FileDataSource orders = new FileDataSource(TextInputFormat.class, ordersPath);

// Filter Mapper
MapContract ordersFiltered = MapContract.builder(FilterOrders.class)
    .input(orders).build();

ReduceContract groupedCustomers = ReduceContract.builder(GroupCustomers.class)
    .input(customers)
    .keyField(PactInteger.class, 0).build();  // Define group key

MatchContract joined = MatchContract.builder(JoinOnCustomerid.class, PactInteger.class, 0, 0)
    .input1(ordersFiltered)
    .input2(groupedCustomers).build();

ReduceContract orderBy = ReduceContract.builder(MaxSum.class)
    .input(joined)
    .keyField(PactInteger.class, 0).build();

FileDataSink sink = new FileDataSink(RecordOutputFormat.class, outputPath, orderBy);

Map Stub and PactRecord by Example
MapContract ordersFiltered = MapContract.builder(FilterOrders.class)
    .input(orders).build();

public class FilterOrders extends MapStub {
    @Override
    public void map(PactRecord order, Collector<PactRecord> out)
            throws Exception {
        PactString date = order.getField(Orders.DATE_IDX, PactString.class);
        if (date.getValue().equals("11.20.2013")) {
            out.collect(order);
        }
    }
}

Joins in Hadoop
Map (Broadcast) Join

Reduce (Repartition) Join

● Which strategy to choose?
● How to configure it?
Lessons Learned:
● Joins do not naturally fit MapReduce
● Very time consuming to implement
● Hand optimization necessary
Source: Sebastian Schelter, TU Berlin

Joins with Stratosphere
● Natively implemented into the system
● Optimizer decides join strategy:
○ Sort-merge-join
○ Hybrid Hash Join
○ Data Shipping Strategy
● Hybrid Hash Join starts in-memory and
gracefully degrades to disk
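The two local join strategies named above can be sketched on a single machine in plain Java. This is a simplified, purely in-memory illustration under stated assumptions: rows are `int[]{key, payload}`, the class `JoinStrategies` is hypothetical, and the hash join shown here is the basic build/probe variant without the spilling to disk that a hybrid hash join adds.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; rows are int[]{key, payload}, results are int[]{key, leftPayload, rightPayload}.
final class JoinStrategies {

    // Hash join: build a hash table on one input, then probe it with the other.
    static List<int[]> hashJoin(List<int[]> build, List<int[]> probe) {
        Map<Integer, List<int[]>> table = new HashMap<>();
        for (int[] b : build) {
            table.computeIfAbsent(b[0], k -> new ArrayList<>()).add(b);
        }
        List<int[]> out = new ArrayList<>();
        for (int[] p : probe) {
            for (int[] b : table.getOrDefault(p[0], List.of())) {
                out.add(new int[] {p[0], b[1], p[1]});
            }
        }
        return out;
    }

    // Sort-merge join: sort both inputs by key, then advance two cursors in lockstep.
    static List<int[]> sortMergeJoin(List<int[]> left, List<int[]> right) {
        List<int[]> l = new ArrayList<>(left);
        List<int[]> r = new ArrayList<>(right);
        l.sort(Comparator.comparingInt((int[] row) -> row[0]));
        r.sort(Comparator.comparingInt((int[] row) -> row[0]));
        List<int[]> out = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < l.size() && j < r.size()) {
            int leftKey = l.get(i)[0];
            int rightKey = r.get(j)[0];
            if (leftKey < rightKey) {
                i++;
            } else if (leftKey > rightKey) {
                j++;
            } else {
                // Find the run of equal keys on both sides and emit all combinations.
                int iEnd = i;
                while (iEnd < l.size() && l.get(iEnd)[0] == leftKey) iEnd++;
                int jEnd = j;
                while (jEnd < r.size() && r.get(jEnd)[0] == leftKey) jEnd++;
                for (int a = i; a < iEnd; a++) {
                    for (int b = j; b < jEnd; b++) {
                        out.add(new int[] {leftKey, l.get(a)[1], r.get(b)[1]});
                    }
                }
                i = iEnd;
                j = jEnd;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<int[]> customers = List.of(new int[] {3, 30}, new int[] {1, 10}, new int[] {2, 20});
        List<int[]> orders = List.of(new int[] {2, 200}, new int[] {3, 300}, new int[] {2, 201});
        System.out.println("hash join rows:       " + hashJoin(customers, orders).size());
        System.out.println("sort-merge join rows: " + sortMergeJoin(customers, orders).size());
    }
}
```

Both methods return the same rows, but the sort-merge result comes out ordered by key, which a later grouping step on that key can reuse.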

Optimizer Magic
Recap example job:
[Figure: Filtering (M) and Grouping (R) feed Joining (J), followed by Grouping (R)]
● We require a grouped input for the reducer (sorting or hashing)
● Optimizer chooses Sort-Merge-Join → no sorting for reduce

Stratosphere Optimizer
● Cost-based optimizer
○ Enumerate different execution plans
○ Choose the cheapest one
● Optimizer collects statistics
○ Size of input and output
● Operators (Map, Reduce, Join) tell how they modify fields
● In-memory chaining of operators
● Memory Distribution
⇒ Focus on your application logic rather than parallel execution.
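The "enumerate plans, choose the cheapest" idea can be illustrated with a toy example in Java. Everything in it is an assumption made for the sketch: the class `ToyOptimizer`, the three candidate plans, and the cost model, which counts only bytes sent over the network (broadcasting ships one input to every node, repartitioning ships both inputs once). The slides do not describe the actual cost model of the Stratosphere optimizer.

```java
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Toy cost-based choice of a data shipping strategy; the cost model is a made-up assumption.
final class ToyOptimizer {

    static final class Plan {
        final String name;
        final long networkBytes;

        Plan(String name, long networkBytes) {
            this.name = name;
            this.networkBytes = networkBytes;
        }
    }

    // Enumerate candidate plans together with their estimated cost.
    static List<Plan> enumerate(long leftBytes, long rightBytes, int nodes) {
        return List.of(
                new Plan("broadcast left", leftBytes * nodes),
                new Plan("broadcast right", rightBytes * nodes),
                new Plan("repartition both", leftBytes + rightBytes));
    }

    // Choose the plan with the lowest estimated cost.
    static Plan cheapest(List<Plan> plans) {
        return Collections.min(plans, Comparator.comparingLong((Plan p) -> p.networkBytes));
    }

    public static void main(String[] args) {
        // A small input joined with a large one on 4 nodes.
        Plan chosen = cheapest(enumerate(1_000_000L, 1_000_000_000L, 4));
        System.out.println(chosen.name + " (" + chosen.networkBytes + " bytes shipped)");
    }
}
```

With these numbers the small input is broadcast; once both inputs are of similar size, repartitioning becomes the cheaper plan under the same model.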

Algorithms that need iterations
● K-Means
● Gradient descent
● PageRank
● Logistic Regression
● Path algorithms on graphs
● Graph communities / dense sub-components
● Inference (belief propagation)

Why Iterations?
● Many algorithms loop over the data
○ Machine learning: iteratively refine the model
○ Graph processing: propagate information hop by hop

[Figure: Example: Connected Components. Vertex labels of a graph shown for the Initial Input, after the 1st Iteration and after the 2nd Iteration]
Iterations in Hadoop
● Loop is outside the system
○ Hard to program
○ Very poor performance
Usually each iteration is more than a single map and reduce!
[Figure: a Driver spawns the 1st Iteration, the 2nd Iteration, ... and the n-th Iteration, each as a separate M → R job]
Iterations in Stratosphere
● Loop is inside the system
○ Easy to program
○ Huge performance gains
[Figure: an Iterate operator embedded in a data flow of M, C and R operators]

● Functional, object-oriented programming language
● ScaLa = Scalable Language
● Very productive (few LOC)
● Feels like a scripting language
● No more UDFs
● Easy to integrate
● Runs in the JVM, is compatible with regular Java classes
● Basis for developing embedded domain-specific languages (DSLs)

Do more, write less!
class Person(val firstName: String, val lastName: String)

public class Person {
    private final String firstName;
    private final String lastName;
    public Person(String firstName, String lastName) {
        this.firstName = firstName;
        this.lastName = lastName;
    }
    public String getFirstName() {
        return firstName;
    }
    public String getLastName() {
        return lastName;
    }
}

Let the code speak
val input = TextFile(textInput)
val words = input.flatMap { line => line.split(" ") }
val counts = words
  .groupBy { word => word }
  .count()
val output = counts.write(wordsOutput, CsvOutputFormat())
val plan = new ScalaPlan(Seq(output))
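For readers more at home in Java, the word count above can be mimicked on a single machine with Java streams. This is only an analogy for what the pipeline computes (split lines into words, group by word, count each group); it is not Stratosphere code, and the class `LocalWordCount` is a hypothetical name.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical single-machine analogy of the word count; not the Stratosphere API.
final class LocalWordCount {

    static Map<String, Long> count(List<String> lines) {
        return lines.stream()
                // flatMap: one line becomes many words
                .flatMap(line -> Arrays.stream(line.split(" ")))
                // groupBy + count: one entry per distinct word, sorted by word
                .collect(Collectors.groupingBy(Function.identity(), TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        System.out.println(count(List.of("to be or", "not to be")));
    }
}
```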

Example in Scala
[Figure: data flow of the example job with operators M, R, J, R]

Java:
FileDataSource customers = new FileDataSource(TextInputFormat.class, customersPath);
FileDataSource orders = new FileDataSource(TextInputFormat.class, ordersPath);
MapContract ordersFiltered = MapContract.builder(FilterOrders.class).input(orders).build();
ReduceContract groupedCustomers = ReduceContract.builder(GroupCustomers.class)
    .input(customers)
    .keyField(PactInteger.class, 0)
    .build();
MatchContract joined = MatchContract.builder(JoinOnCustomerid.class, PactInteger.class, 0, 0)
    .input1(ordersFiltered).input2(groupedCustomers).build();
ReduceContract orderBy = ReduceContract.builder(MaxSum.class)
    .input(joined)
    .keyField(PactInteger.class, 0)
    .build();
FileDataSink sink = new FileDataSink(RecordOutputFormat.class, outputPath, orderBy, "output: word counts");

Scala:
val customers = DataSource(customersPath, CsvInputFormat[Customer])
val orders = DataSource(ordersPath, CsvInputFormat[Order])
val ordersFiltered = orders filter { order => order.date.equals("11.20.2013") }
val groupedCustomers = customers groupBy { cust => cust.zip } reduceGroup { grp =>
  (grp.buffered.head.zip, grp.maxBy { _.total }) }
val joined = ordersFiltered.join(groupedCustomers)
  .where { ord => ord.c_id }
  .isEqualTo { cust => cust._1 }
  .map { (orders, cust) => cust }
val max = joined groupBy { cust => cust.category_id } reduceGroup { _.maxBy { _.sum } }
val output = max.write(outputPath, DelimitedOutputFormat(formatOutput.tupled))
val plan = new ScalaPlan(Seq(output), "BDB Example")

Summary: Feature Matrix
Stratosphere: Database inspired Big Data Analytics

              | Map Reduce         | Stratosphere
Operators     | Map, Reduce        | Map, Reduce (multiple sort keys), Cross, Join, CoGroup, Union, Iterate, Iterate Delta
Composition   | Only MapReduce     | Arbitrary data flows
Data Exchange | Batch through disk | Pipelined, in-memory (automatic spilling to disk)

Get In Touch
Stratosphere is the next-generation open source
Big Data Analytics Platform.
Quickstart: http://stratosphere.eu/quickstart
Website: http://stratosphere.eu
GitHub: https://github.com/stratosphere
Mailing List: https://groups.google.com/d/forum/stratosphere-dev
Twitter: @stratosphere_eu
