L11 MapReduce Dijkstra BFS
Adapted from slides by Jimmy Lin (UMD), which are licensed under a Creative
Commons Attribution-Noncommercial-Share Alike 3.0 United States license. See
http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details.
Overview
MapReduce Introduction
Simple counting, averaging
Graph problems and representations
Parallel breadth-first search
CS 4407
University College Cork,
Gregory M. Provan
MapReduce: Parallel Programming Framework
Scaling algorithms by parallel computation
– Needed for “big data”
MapReduce
– Framework introduced by Google
MapReduce Basics
Partition data
Two phases
– MAP: extract values
– REDUCE: combine values
[Figure: in the map phase a function f is applied to each data partition; in the reduce phase a function g combines the results]
Map
Records from the data source (lines out of files, rows of a
database, etc.) are fed into the map function as key-value
pairs: e.g., (filename, line).
map() produces one or more intermediate values along with
an output key from the input.
Reduce
After the map phase is over, all the intermediate values for a
given output key are combined together into a list.
reduce() combines those intermediate values into one or
more final values for that same output key
(in practice, usually only one final value per key)
[Figure: reduce illustrated with an “initial” value and a “returned” value]
[Figure: MapReduce data flow. Input pairs (k1, v1) … (k6, v6) are mapped to intermediate pairs (a, 1), (b, 2), (c, 3), (c, 6), (a, 5), (c, 2), (b, 7), (c, 8); combining merges (c, 3) and (c, 6) into (c, 9); the reducers emit (r1, s1), (r2, s2), (r3, s3)]
MapReduce: Overview
Programmers must specify:
map (k, v) → <k’, v’>*
reduce (k’, [v’]) → <k’, v’>*
– All values with the same key are reduced together
Optionally, also:
partition (k’, number of partitions) → partition for k’
– Often a simple hash of the key, e.g., hash(k’) mod n
– Divides up key space for parallel reduce operations
combine (k’, [v’]) → <k’, v’>*
– Mini-reducers that run in memory after the map phase
– Used as an optimization to reduce network traffic
The execution framework handles everything else…
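The partition function described above can be sketched in a few lines. This is a minimal illustration, assuming string keys and the "hash(k’) mod n" rule from the slide; the class and method names are hypothetical and are not part of any MapReduce framework.

```java
import java.util.ArrayList;
import java.util.List;

class HashPartitionSketch {
    // partition(k', n): assumed here to be hash(k') mod n.
    static int partition(String key, int numPartitions) {
        // floorMod keeps the result in [0, numPartitions) even when hashCode() is negative.
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        int n = 3; // number of reduce partitions (arbitrary for this demo)
        List<List<String>> buckets = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            buckets.add(new ArrayList<>());
        }
        for (String key : new String[] {"the", "quick", "brown", "fox", "the"}) {
            buckets.get(partition(key, n)).add(key);
        }
        // Both occurrences of "the" land in the same bucket, so one reducer sees them together.
        System.out.println(buckets);
    }
}
```

The only property the reducers rely on is that equal keys always map to the same partition; how evenly the keys spread depends on the hash function.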
“Everything Else”
The execution framework handles everything else…
– Scheduling: assigns workers to map and reduce tasks
– “Data distribution”: moves processes to data
– Synchronization: gathers, sorts, and shuffles intermediate
data
– Errors and faults: detects worker failures and restarts
Limited control over data and execution flow
– All algorithms must be expressed in terms of map, reduce, combine, and partition (m, r, c, p)
You don’t know:
– Where mappers and reducers run
– When a mapper or reducer begins or finishes
– Which input a particular mapper is processing
– Which intermediate key a particular reducer is processing
Word Count Example
[Figure: word count data flow through the stages Input → Map → Shuffle & Sort → Reduce → Output. Input lines such as “the quick brown fox” and “how now brown cow” are mapped to (word, 1) pairs, grouped by word, and reduced to the counts: ate, 1; brown, 2; cow, 1; fox, 2; how, 1; mouse, 1; now, 1; quick, 1; the, 3]
Word Count: Baseline
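As an illustration of the baseline idea (emit (word, 1) for every word, sum in the reducer), here is a minimal in-memory sketch. It is not Hadoop code: the class name, the `run` helper that stands in for the framework, and the third input line "the fox ate the mouse" are assumptions chosen to match the counts in the word count example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class WordCountSketch {
    // map: (docId, text) -> one (word, 1) pair per word occurrence
    static List<Map.Entry<String, Integer>> map(String docId, String text) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : text.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(Map.entry(word, 1));
            }
        }
        return out;
    }

    // reduce: (word, [counts]) -> total count for that word
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        return sum;
    }

    // Stand-in for the framework: run map, group by key (shuffle and sort), then reduce.
    static Map<String, Integer> run(List<String> docs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (int i = 0; i < docs.size(); i++) {
            for (Map.Entry<String, Integer> kv : map("doc" + i, docs.get(i))) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints {ate=1, brown=2, cow=1, fox=2, how=1, mouse=1, now=1, quick=1, the=3}
        System.out.println(run(List.of(
                "the quick brown fox", "the fox ate the mouse", "how now brown cow")));
    }
}
```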
Word Count: Version 2
Design Pattern for Local Aggregation
“In-mapper combining”
– Fold the functionality of the combiner into the mapper by
preserving state across multiple map calls
Advantages
– Speed
– Faster than actual combiners
Disadvantages
– Explicit memory management required
– Potential for order-dependent bugs
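The in-mapper combining pattern might look like the sketch below for word count. The `Mapper` class with its `map` and `close` methods is a hypothetical stand-in for a framework mapper whose state survives across calls; it is an assumption for illustration, not the Hadoop API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiConsumer;

class InMapperCombiningSketch {
    // Hypothetical mapper object that lives for the whole input split.
    static class Mapper {
        // State preserved across map() calls: partial count per word.
        private final Map<String, Integer> counts = new HashMap<>();

        // Called once per input record; aggregates locally and emits nothing yet.
        void map(String docId, String text) {
            for (String word : text.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }

        // Called once after the last record: emit one (word, partialCount) pair per distinct word.
        void close(BiConsumer<String, Integer> emit) {
            counts.forEach(emit);
            counts.clear();
        }
    }

    static Map<String, Integer> demo() {
        Mapper mapper = new Mapper();
        mapper.map("doc0", "the quick brown fox");
        mapper.map("doc1", "the fox ate the mouse");
        Map<String, Integer> emitted = new HashMap<>();
        mapper.close(emitted::put);
        return emitted;
    }

    public static void main(String[] args) {
        // 6 pairs are emitted for 9 input words.
        System.out.println(demo());
    }
}
```

The `counts` map is where the explicit memory management comes in: it grows with the number of distinct keys seen, so a real mapper would likely need to flush it when it gets too large.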
Combiner Design
Combiners and reducers share the same method
signature
– Sometimes, reducers can serve as combiners
– Often, not…
Remember: combiners are optional optimizations
– Should not affect algorithm correctness
– May be run 0, 1, or multiple times
Example: find average of all integers associated
with the same key
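The averaging example shows why a reducer cannot always double as a combiner: a mean of partial means is generally not the overall mean. One common way around this, sketched here under the assumption that values travel as (sum, count) pairs, is to combine the pairs and divide only in the reducer. All names are hypothetical.

```java
import java.util.List;

class MeanCombinerSketch {
    // combine: merge any number of (sum, count) pairs into one (sum, count) pair.
    // Safe to run 0, 1, or many times because pairs add component-wise.
    static long[] combine(List<long[]> pairs) {
        long sum = 0;
        long count = 0;
        for (long[] p : pairs) {
            sum += p[0];
            count += p[1];
        }
        return new long[] {sum, count};
    }

    // reduce: merge the pairs, then divide once at the very end.
    static double reduce(List<long[]> pairs) {
        long[] total = combine(pairs);
        return (double) total[0] / total[1];
    }

    // For contrast: averaging per mapper and then averaging the averages.
    static double meanOfMeans(List<List<Integer>> perMapperValues) {
        double sum = 0;
        for (List<Integer> values : perMapperValues) {
            sum += values.stream().mapToInt(Integer::intValue).average().orElse(0);
        }
        return sum / perMapperValues.size();
    }

    public static void main(String[] args) {
        // Mapper 1 saw the values 1, 2, 3 for some key; mapper 2 saw only 10.
        long[] a = combine(List.of(new long[] {1, 1}, new long[] {2, 1}, new long[] {3, 1}));
        long[] b = combine(List.of(new long[] {10, 1}));
        System.out.println(reduce(List.of(a, b)));                               // 4.0
        System.out.println(meanOfMeans(List.of(List.of(1, 2, 3), List.of(10)))); // 6.0
    }
}
```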
Computing the Mean: Version 1
Single Source Shortest Path
Problem: find shortest path from a source node to one or
more target nodes
– Shortest might also mean lowest weight or cost
First, a refresher: Dijkstra’s Algorithm
Pseudocode for Dijkstra
Initialize the cost of each vertex to ∞
cost[s] = 0;
heap.insert(s);
While (! heap.empty())
n = heap.deleteMin()
For (each vertex a which is adjacent to n along edge e)
if (cost[n] + edge_cost[e] < cost[a]) then
cost [a] = cost[n] + edge_cost[e]
previous_on_path_to[a] = n;
if (a is in the heap) then heap.decreaseKey(a)
else heap.insert(a)
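The pseudocode above translates fairly directly into Java. One difference is an assumption of this sketch: `java.util.PriorityQueue` has no decreaseKey operation, so an improved vertex is re-inserted and stale heap entries are skipped when they are popped. The graph is the five-node example used in these slides (nodes A to E); class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.TreeMap;

class DijkstraSketch {
    // graph: node -> (neighbour -> edge cost). Every node must appear as a key;
    // all edge costs are assumed to be non-negative.
    static Map<String, Integer> shortestPaths(Map<String, Map<String, Integer>> graph, String source) {
        Map<String, Integer> cost = new HashMap<>();
        for (String v : graph.keySet()) {
            cost.put(v, Integer.MAX_VALUE); // stands for infinity
        }
        cost.put(source, 0);

        // Heap entries are (node, cost at insertion time), ordered by cost.
        PriorityQueue<Map.Entry<String, Integer>> heap =
                new PriorityQueue<>((x, y) -> Integer.compare(x.getValue(), y.getValue()));
        heap.add(Map.entry(source, 0));

        while (!heap.isEmpty()) {
            Map.Entry<String, Integer> top = heap.poll(); // deleteMin
            String n = top.getKey();
            if (top.getValue() > cost.get(n)) {
                continue; // stale entry: a cheaper path to n was already processed
            }
            for (Map.Entry<String, Integer> e : graph.get(n).entrySet()) {
                String a = e.getKey();
                int candidate = cost.get(n) + e.getValue();
                if (candidate < cost.get(a)) {
                    cost.put(a, candidate);
                    heap.add(Map.entry(a, candidate)); // re-insert instead of decreaseKey
                }
            }
        }
        return cost;
    }

    static Map<String, Map<String, Integer>> exampleGraph() {
        Map<String, Map<String, Integer>> g = new HashMap<>();
        g.put("A", Map.of("B", 10, "D", 5));
        g.put("B", Map.of("C", 1, "D", 2));
        g.put("C", Map.of("E", 4));
        g.put("D", Map.of("B", 3, "C", 9, "E", 2));
        g.put("E", Map.of("A", 7, "C", 6));
        return g;
    }

    public static void main(String[] args) {
        // prints {A=0, B=8, C=9, D=5, E=7}
        System.out.println(new TreeMap<>(shortestPaths(exampleGraph(), "A")));
    }
}
```

The `previous_on_path_to` bookkeeping from the pseudocode is left out to keep the sketch short; it would be one more map updated next to `cost.put(a, candidate)`.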
Dijkstra’s Algorithm Example
[Figure sequence, example from CLR: Dijkstra’s algorithm on a five-node weighted graph with edge weights 10, 1, 2, 3, 9, 4, 6, 5, 7, 2 and the source at distance 0. Tentative distances of the other four nodes (top-left, top-right, bottom-left, bottom-right) after successive steps: (10, ∞, 5, ∞), (8, 14, 5, 7), (8, 13, 5, 7), (8, 9, 5, 7)]
Single Source Shortest Path
Problem: find shortest path from a source node to one or
more target nodes
– Shortest might also mean lowest weight or cost
Single processor machine: Dijkstra’s Algorithm
MapReduce: parallel Breadth-First Search (BFS)
Finding the Shortest Path
Consider simple case of equal edge weights
Solution to the problem can be defined inductively
Here’s the intuition:
– Define: b is reachable from a if b is on adjacency list of a
– DISTANCETO(s) = 0
– For all nodes p reachable from s,
DISTANCETO(p) = 1
– For all nodes n reachable from some other set of nodes M,
DISTANCETO(n) = 1 + min(DISTANCETO(m), m ∈ M)
[Figure: source s reaches nodes m1, m2, m3 at distances d1, d2, d3; each of them has an edge to node n]
Visualizing Parallel BFS
[Figure: parallel BFS frontier expanding from n0 across nodes n1 to n9]
From Intuition to Algorithm
Data representation:
– Key: node n
– Value: d (distance from start), adjacency list (list of nodes reachable
from n)
– Initialization: for all nodes except for start node, d = ∞
Mapper:
– ∀ m ∈ adjacency list: emit (m, d + 1)
Sort/Shuffle
– Groups distances by reachable nodes
Reducer:
– Selects minimum distance path for each reachable node
– Additional bookkeeping needed to keep track of actual path
Multiple Iterations Needed
Each MapReduce iteration advances the “known frontier” by
one hop
– Subsequent iterations include more and more reachable nodes as
frontier expands
– Multiple iterations are needed to explore entire graph
Preserving graph structure:
– Problem: Where did the adjacency list go?
– Solution: mapper emits (n, adjacency list) as well
BFS Pseudo-Code
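What follows is a minimal in-memory sketch of one parallel BFS iteration with equal edge weights, written for illustration rather than taken from the original slide. The mapper re-emits each node to preserve the graph structure and emits (m, d + 1) for each neighbour; the reducer keeps the minimum distance. The `Node` type, the `iterate` helper that stands in for the framework, and the small example graph are all assumptions. Unlike the worked SSSP example in these slides, unreached nodes emit nothing instead of emitting inf.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ParallelBfsSketch {
    static final int INF = Integer.MAX_VALUE; // stands for "not reached yet"

    // Value type: a distance plus, for structure records, the adjacency list.
    // adj == null marks a distance-only message sent to a neighbour.
    static final class Node {
        final int dist;
        final List<String> adj;
        Node(int dist, List<String> adj) { this.dist = dist; this.adj = adj; }
    }

    // Mapper: re-emit the node itself, then emit (m, d + 1) for each neighbour m.
    static List<Map.Entry<String, Node>> map(String id, Node n) {
        List<Map.Entry<String, Node>> out = new ArrayList<>();
        out.add(Map.entry(id, n)); // preserves the graph structure
        if (n.dist != INF) {       // unreached nodes have nothing to propagate
            for (String m : n.adj) {
                out.add(Map.entry(m, new Node(n.dist + 1, null)));
            }
        }
        return out;
    }

    // Reducer: keep the minimum distance and recover the adjacency list.
    static Node reduce(String id, List<Node> values) {
        int best = INF;
        List<String> adj = List.of();
        for (Node v : values) {
            if (v.adj != null) {
                adj = v.adj;
            }
            best = Math.min(best, v.dist);
        }
        return new Node(best, adj);
    }

    // One iteration; the TreeMap grouping stands in for the shuffle and sort.
    static Map<String, Node> iterate(Map<String, Node> graph) {
        Map<String, List<Node>> grouped = new TreeMap<>();
        for (Map.Entry<String, Node> e : graph.entrySet()) {
            for (Map.Entry<String, Node> kv : map(e.getKey(), e.getValue())) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
        }
        Map<String, Node> next = new TreeMap<>();
        grouped.forEach((id, values) -> next.put(id, reduce(id, values)));
        return next;
    }

    // Hypothetical example graph: s -> a, s -> b, a -> c, b -> c, c -> d.
    static Map<String, Node> example() {
        Map<String, Node> g = new TreeMap<>();
        g.put("s", new Node(0, List.of("a", "b")));
        g.put("a", new Node(INF, List.of("c")));
        g.put("b", new Node(INF, List.of("c")));
        g.put("c", new Node(INF, List.of("d")));
        g.put("d", new Node(INF, List.of()));
        return g;
    }

    public static void main(String[] args) {
        Map<String, Node> g = example();
        for (int i = 1; i <= 3; i++) {
            g = iterate(g);
            System.out.print("iteration " + i + ":");
            for (Map.Entry<String, Node> e : g.entrySet()) {
                int d = e.getValue().dist;
                System.out.print(" " + e.getKey() + "=" + (d == INF ? "inf" : String.valueOf(d)));
            }
            System.out.println();
        }
    }
}
```

Each call to `iterate` advances the frontier by one hop: in the example, a and b are reached after the first iteration, c after the second, and d after the third.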
Example: SSSP – Parallel BFS in MapReduce
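Adjacency matrix (entry = weight of the edge from the row node to the column node)
     A    B    C    D    E
A         10        5
B              1    2
C                        4
D         3    9         2
E    7         6
Adjacency list
A: (B, 10), (D, 5)
B: (C, 1), (D, 2)
C: (E, 4)
D: (B, 3), (C, 9), (E, 2)
E: (A, 7), (C, 6)
[Figure: the five-node example graph with source A at distance 0 and the edge weights listed above]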
Example: SSSP – Parallel BFS in MapReduce
Map input: <node ID, <dist, adj list>>
<A, <0, <(B, 10), (D, 5)>>>
<B, <inf, <(C, 1), (D, 2)>>>
<C, <inf, <(E, 4)>>>
<D, <inf, <(B, 3), (C, 9), (E, 2)>>>
<E, <inf, <(A, 7), (C, 6)>>>
Map output: <dest node ID, dist>
<B, 10> <D, 5> <A, <0, <(B, 10), (D, 5)>>>
<C, inf> <D, inf> <B, <inf, <(C, 1), (D, 2)>>>
<E, inf> <C, <inf, <(E, 4)>>>
<B, inf> <C, inf> <E, inf> <D, <inf, <(B, 3), (C, 9), (E, 2)>>>
<A, inf> <C, inf> <E, <inf, <(A, 7), (C, 6)>>>
Flushed to local disk!!
Example: SSSP – Parallel BFS in MapReduce
Reduce input: <node ID, dist>
<A, <0, <(B, 10), (D, 5)>>>
<A, inf>
<B, <inf, <(C, 1), (D, 2)>>>
<B, 10> <B, inf>
<C, <inf, <(E, 4)>>>
<C, inf> <C, inf> <C, inf>
<D, <inf, <(B, 3), (C, 9), (E, 2)>>>
<D, 5> <D, inf>
<E, <inf, <(A, 7), (C, 6)>>>
<E, inf> <E, inf>
Example: SSSP – Parallel BFS in MapReduce
Reduce output: <node ID, <dist, adj list>>
= Map input for next iteration
<A, <0, <(B, 10), (D, 5)>>>
<B, <10, <(C, 1), (D, 2)>>>
<C, <inf, <(E, 4)>>>
<D, <5, <(B, 3), (C, 9), (E, 2)>>>
<E, <inf, <(A, 7), (C, 6)>>>
Flushed to DFS!!
Map output: <dest node ID, dist>
<B, 10> <D, 5> <A, <0, <(B, 10), (D, 5)>>>
<C, 11> <D, 12> <B, <10, <(C, 1), (D, 2)>>>
<E, inf> <C, <inf, <(E, 4)>>>
<B, 8> <C, 14> <E, 7> <D, <5, <(B, 3), (C, 9), (E, 2)>>>
<A, inf> <C, inf> <E, <inf, <(A, 7), (C, 6)>>>
Flushed to local disk!!
Example: SSSP – Parallel BFS in MapReduce
Reduce input: <node ID, dist>
<A, <0, <(B, 10), (D, 5)>>>
<A, inf>
<B, <10, <(C, 1), (D, 2)>>>
<B, 10> <B, 8>
<C, <inf, <(E, 4)>>>
<C, 11> <C, 14> <C, inf>
<D, <5, <(B, 3), (C, 9), (E, 2)>>>
<D, 5> <D, 12>
<E, <inf, <(A, 7), (C, 6)>>>
<E, inf> <E, 7>
Stopping Criterion
How many iterations are needed in parallel BFS (equal edge
weight case)?
Convince yourself: when a node is first “discovered”, we’ve
found the shortest path
Now answer the question...
– Six degrees of separation?
Practicalities of implementation in MapReduce
Comparison to Dijkstra
Dijkstra’s algorithm is more efficient
– At any step it only pursues edges from the minimum-cost path inside
the frontier
MapReduce explores all paths in parallel
– Lots of “waste”
– Useful work is only done at the “frontier”
Why can’t we do better using MapReduce?
Weighted Edges
Now add positive weights to the edges
– Why can’t edge weights be negative?
Simple change: adjacency list now includes a weight w for
each edge
– In mapper, emit (m, d + w) instead of (m, d + 1) for each node m, where w is the weight of the edge to m
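To see what the weighted variant does over several iterations, the sketch below replays it as synchronous rounds on the five-node example graph from these slides. It is an illustration, not MapReduce code: each round uses only the distances from the previous round, as the mappers would, and a driver loop stops when nothing changes. The `Edge` type and the method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class WeightedBfsRoundsSketch {
    static final int INF = Integer.MAX_VALUE; // stands for "not reached yet"

    // Hypothetical edge record: from -> to with weight w (assumed positive).
    static final class Edge {
        final String from;
        final String to;
        final int w;
        Edge(String from, String to, int w) { this.from = from; this.to = to; this.w = w; }
    }

    // One round: every reached node "emits" (m, d + w) along each outgoing edge,
    // and each node keeps the minimum of its old distance and what it received.
    static Map<String, Integer> round(Map<String, Integer> dist, List<Edge> edges) {
        Map<String, Integer> next = new TreeMap<>(dist);
        for (Edge e : edges) {
            int d = dist.get(e.from); // previous round's distance only
            if (d == INF) {
                continue;
            }
            next.merge(e.to, d + e.w, Integer::min);
        }
        return next;
    }

    // Driver: repeat rounds until no distance changes.
    // Returns one snapshot of all distances per round that changed something.
    static List<Map<String, Integer>> run(List<String> nodes, List<Edge> edges, String source) {
        Map<String, Integer> dist = new TreeMap<>();
        for (String n : nodes) {
            dist.put(n, INF);
        }
        dist.put(source, 0);
        List<Map<String, Integer>> history = new ArrayList<>();
        while (true) {
            Map<String, Integer> next = round(dist, edges);
            if (next.equals(dist)) {
                return history;
            }
            history.add(next);
            dist = next;
        }
    }

    static List<Edge> exampleEdges() {
        return List.of(
                new Edge("A", "B", 10), new Edge("A", "D", 5),
                new Edge("B", "C", 1), new Edge("B", "D", 2),
                new Edge("C", "E", 4),
                new Edge("D", "B", 3), new Edge("D", "C", 9), new Edge("D", "E", 2),
                new Edge("E", "A", 7), new Edge("E", "C", 6));
    }

    public static void main(String[] args) {
        List<Map<String, Integer>> history =
                run(List.of("A", "B", "C", "D", "E"), exampleEdges(), "A");
        for (int i = 0; i < history.size(); i++) {
            System.out.println("after iteration " + (i + 1) + ": " + history.get(i));
        }
    }
}
```

On this graph, B is first discovered at distance 10 and improved to 8 one iteration later, and C goes from 11 to 9, so with weighted edges the first distance found for a node is not necessarily final.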
Stopping Criterion
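How many iterations are needed in parallel BFS (positive
edge weight case)?
– With equal edge weights, the answer was the graph diameter D
Convince yourself: when a node is first “discovered”, have we
necessarily found the shortest path?
– No: with weighted edges, a path with more hops can have a lower total cost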
Additional Complexities
[Figure: a search frontier around source s, with nodes p, q, r and n1 to n9; all edges shown have weight 1 except one of weight 10]
Stopping Criterion
How many iterations are needed in parallel BFS (positive
edge weight case)?
Practicalities of implementation in MapReduce
Graphs and MapReduce
Graph algorithms typically involve:
– Performing computations at each node: based on node features,
edge features, and local link structure
– Propagating computations: “traversing” the graph
Generic recipe:
– Represent graphs as adjacency lists
– Perform local computations in mapper
– Pass along partial results via outlinks, keyed by destination node
– Perform aggregation in reducer on inlinks to a node
– Iterate until convergence: controlled by external “driver”
– Don’t forget to pass the graph structure between iterations
http://famousphil.com/blog/2011/06/a-hadoop-mapreduce-solution-to-dijkstra%E2%80%99s-algorithm/
        for (Text val : values) { // looks like NODES/VALUES 1 0 2:3:, we need to use the first as a key
            String[] sp = val.toString().split(" "); // splits on space
            // look at first value
            if (sp[0].equalsIgnoreCase("NODES")) {
                nodes = null;
                nodes = sp[1];
            } else if (sp[0].equalsIgnoreCase("VALUE")) {
                int distance = Integer.parseInt(sp[1]);
                lowest = Math.min(distance, lowest);
            }
        }
        word.set(lowest + " " + nodes);
        context.write(key, word);
        word.clear();
    }
}
public int run(String[] args) throws Exception {
    // http://code.google.com/p/joycrawler/source/browse/NetflixChallenge/src/org/niubility/learning/knn/KNNDriver.java?r=2
    getConf().set("mapred.textoutputformat.separator", " "); // make the key -> value space separated (for iterations)
    …..
    while (isdone == false) {
        Job job = new Job(getConf());
        job.setJarByClass(Dijkstra.class);
        job.setJobName("Dijkstra");
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        job.setMapperClass(TheMapper.class);
        job.setReducerClass(TheReducer.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);