L11 MapReduce Dijkstra BFS

Map-Reduce Applications:
Counting, Graph Shortest Paths
Adapted from slides by Jimmy Lin (UMD), which are licensed under a Creative
Commons Attribution-Noncommercial-Share Alike 3.0 United States license. See
http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details.
Overview
• MapReduce introduction
• Simple counting, averaging
• Graph problems and representations
• Parallel breadth-first search
CS 4407
University College Cork,
Gregory M. Provan
MapReduce: Parallel Programming Framework
• Scaling algorithms by parallel computation
– Needed for “big data”
• MapReduce
– Google framework
– Partitions the input data across many CPUs
• We examine a few algorithms executed in parallel
– Word counting
– Dijkstra’s algorithm
– PageRank
MapReduce Basics
• Partition data
• Two phases
– MAP: extract values
– REDUCE: combine values
[Figure: Map applies a function f to each input element; Reduce combines the results with a function g.]
Map
• Records from the data source (lines out of files, rows of a
database, etc.) are fed into the map function as key/value
pairs: e.g., (filename, line).
• map() produces one or more intermediate values along with
an output key from the input.
Map
map (in_key, in_value) ->
    (out_key, intermediate_value) list
Reduce
• After the map phase is over, all the intermediate values for a
given output key are combined together into a list.
• reduce() combines those intermediate values into one or
more final values for that same output key.
• (In practice, usually only one final value per key.)
Reduce
reduce (out_key, intermediate_value list) ->
    out_value list
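The two signatures can be mirrored as Java interfaces, together with a tiny single-process driver that groups intermediate values by key. This is only a sketch: `MapFn`, `ReduceFn`, `MiniMapReduce` and `demo` are hypothetical names chosen for illustration, not Hadoop's API, and the code assumes Java 9 or later for `List.of` and `Map.entry`.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical interfaces mirroring the slide's signatures (not Hadoop's API).
interface MapFn<K1, V1, K2, V2> {
    // map (in_key, in_value) -> (out_key, intermediate_value) list
    List<Map.Entry<K2, V2>> map(K1 inKey, V1 inValue);
}

interface ReduceFn<K2, V2, V3> {
    // reduce (out_key, intermediate_value list) -> out_value list
    List<V3> reduce(K2 outKey, List<V2> intermediateValues);
}

final class MiniMapReduce {
    // Runs map over every record, groups values by output key,
    // then calls reduce once per key. Everything runs in one process.
    static <K1, V1, K2 extends Comparable<K2>, V2, V3> Map<K2, List<V3>> run(
            List<Map.Entry<K1, V1>> records,
            MapFn<K1, V1, K2, V2> mapper,
            ReduceFn<K2, V2, V3> reducer) {
        Map<K2, List<V2>> grouped = new TreeMap<>();
        for (Map.Entry<K1, V1> record : records) {
            for (Map.Entry<K2, V2> kv : mapper.map(record.getKey(), record.getValue())) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
        }
        Map<K2, List<V3>> output = new TreeMap<>();
        for (Map.Entry<K2, List<V2>> group : grouped.entrySet()) {
            output.put(group.getKey(), reducer.reduce(group.getKey(), group.getValue()));
        }
        return output;
    }

    // Demo: records are (filename, line); the result is the longest line length per file.
    static Map<String, List<Integer>> demo() {
        List<Map.Entry<String, String>> records = List.of(
                Map.entry("a.txt", "hello"),
                Map.entry("a.txt", "hi"),
                Map.entry("b.txt", "hey"));
        MapFn<String, String, String, Integer> lineLength =
                (name, line) -> List.of(Map.entry(name, line.length()));
        ReduceFn<String, Integer, Integer> longest =
                (name, lengths) -> List.of(Collections.max(lengths));
        return run(records, lineLength, longest); // {a.txt=[5], b.txt=[3]}
    }
}
```

The reducer returns a list to match the slide's signature, even though here, as is usual in practice, it holds a single value per key.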
[Figure: MapReduce data flow]
Input: (k1, v1) (k2, v2) (k3, v3) (k4, v4) (k5, v5) (k6, v6)
map (four mappers): (a, 1) (b, 2) | (c, 3) (c, 6) | (a, 5) (c, 2) | (b, 7) (c, 8)
combine: (a, 1) (b, 2) | (c, 9) | (a, 5) (c, 2) | (b, 7) (c, 8)
partition: each mapper’s output is split by key
Shuffle and Sort: aggregate values by keys
a → [1, 5]   b → [2, 7]   c → [2, 9, 8]
reduce (three reducers): (r1, s1) (r2, s2) (r3, s3)
MapReduce: Overview
• Programmers must specify:
map (k, v) → <k’, v’>*
reduce (k’, v’) → <k’, v’>*
– All values with the same key are reduced together
• Optionally, also:
partition (k’, number of partitions) → partition for k’
– Often a simple hash of the key, e.g., hash(k’) mod n
– Divides up key space for parallel reduce operations
combine (k’, v’) → <k’, v’>*
– Mini-reducers that run in memory after the map phase
– Used as an optimization to reduce network traffic
• The execution framework handles everything else…
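The “hash(k’) mod n” rule for partitioning can be sketched in a few lines. The class and method names below are hypothetical; the one design point worth noting is that Java's `hashCode()` may be negative, so the sketch uses `Math.floorMod` to keep the partition number in the range 0 to n-1. Hadoop's default hash partitioner follows the same idea, masking off the sign bit before taking the remainder.

```java
final class HashPartitionSketch {
    // partition (k', number of partitions) -> partition for k'
    // Math.floorMod keeps the result in [0, numPartitions) even for negative hash codes.
    static int partition(Object key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        // Every occurrence of the same key goes to the same reducer.
        for (String key : new String[] {"a", "b", "c", "a"}) {
            System.out.println(key + " -> reducer " + partition(key, 3));
        }
    }
}
```

Because the partition depends only on the key, all values for one key meet at a single reducer, which is what makes “all values with the same key are reduced together” possible.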
“Everything Else”
• The execution framework handles everything else…
– Scheduling: assigns workers to map and reduce tasks
– “Data distribution”: moves processes to data
– Synchronization: gathers, sorts, and shuffles intermediate
data
– Errors and faults: detects worker failures and restarts
• Limited control over data and execution flow
– All algorithms must be expressed in terms of map (m), reduce (r),
combine (c), and partition (p)
• You don’t know:
– Where mappers and reducers run
– When a mapper or reducer begins or finishes
– Which input a particular mapper is processing
– Which intermediate key a particular reducer is processing
Word Count Example
Input → Map → Shuffle & Sort → Reduce → Output
Input (one mapper per document):
    “the quick brown fox”
    “the fox ate the mouse”
    “how now brown cow”
Map output:
    (the, 1) (quick, 1) (brown, 1) (fox, 1)
    (the, 1) (fox, 1) (ate, 1) (the, 1) (mouse, 1)
    (how, 1) (now, 1) (brown, 1) (cow, 1)
Reduce output (two reducers, labelled in the figure “adjective, article” and “noun, verb”):
    brown, 2; how, 1; now, 1; quick, 1; the, 3
    ate, 1; cow, 1; fox, 2; mouse, 1
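The word count flow can be reproduced in a single process. The sketch below is an in-memory stand-in for the framework, not Hadoop code: `WordCountSketch`, `run` and `demo` are hypothetical names, and the shuffle is simulated with a sorted map. It assumes Java 9 or later.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class WordCountSketch {
    // Map: emit (word, 1) for every word in one document.
    static List<Map.Entry<String, Integer>> map(String document) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : document.trim().split("\\s+")) {
            pairs.add(Map.entry(word, 1));
        }
        return pairs;
    }

    // Reduce: sum all counts that share a word.
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        return sum;
    }

    // Driver standing in for the framework: the shuffle groups values by key.
    static Map<String, Integer> run(List<String> documents) {
        Map<String, List<Integer>> shuffled = new TreeMap<>();
        for (String doc : documents) {
            for (Map.Entry<String, Integer> kv : map(doc)) {
                shuffled.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : shuffled.entrySet()) {
            result.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return result;
    }

    // The three documents from the example.
    static Map<String, Integer> demo() {
        return run(List.of("the quick brown fox", "the fox ate the mouse", "how now brown cow"));
    }
}
```

Calling `WordCountSketch.demo()` yields brown=2, fox=2, the=3 and a count of 1 for every other word, matching the output column of the example.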
Word Count: Baseline
What’s the impact of combiners?

Word Count: Version 1
Word Count: Version 2
Design Pattern for Local Aggregation
• “In-mapper combining”
– Fold the functionality of the combiner into the mapper by
preserving state across multiple map calls
• Advantages
– Speed
– Faster than actual combiners
• Disadvantages
– Explicit memory management required
– Potential for order-dependent bugs
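A minimal sketch of in-mapper combining for word count, assuming a mapper object that lives across several `map` calls and a hypothetical `close` hook that the framework would call after the last record (in Hadoop the end-of-task hook is `Mapper.cleanup`). The class name and the demo are illustrative only; Java 9 or later is assumed.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class InMapperCombiningSketch {
    // State preserved across map() calls: partial counts held in memory.
    // This is the "explicit memory management" cost named on the slide.
    private final Map<String, Integer> partialCounts = new HashMap<>();

    // Called once per input record; nothing is emitted yet.
    void map(String line) {
        for (String word : line.trim().split("\\s+")) {
            partialCounts.merge(word, 1, Integer::sum);
        }
    }

    // Hypothetical end-of-task hook: emit one pair per distinct word.
    List<Map.Entry<String, Integer>> close() {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (Map.Entry<String, Integer> e : partialCounts.entrySet()) {
            emitted.add(Map.entry(e.getKey(), e.getValue()));
        }
        partialCounts.clear();
        return emitted;
    }

    // Returns {pairs emitted by a plain mapper, pairs emitted with in-mapper combining}.
    static int[] demo() {
        String[] lines = {"the quick brown fox", "the fox ate the mouse"};
        int plain = 0;
        InMapperCombiningSketch mapper = new InMapperCombiningSketch();
        for (String line : lines) {
            plain += line.split("\\s+").length; // baseline: one (word, 1) pair per word
            mapper.map(line);
        }
        return new int[] {plain, mapper.close().size()}; // {9, 6}
    }
}
```

For these two lines the plain mapper would emit 9 pairs while the combining mapper emits 6, one per distinct word; the saving grows with the amount of repetition seen by a single mapper.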
Combiner Design
• Combiners and reducers share the same method
signature
– Sometimes, reducers can serve as combiners
– Often, not…
• Remember: combiners are optional optimizations
– Should not affect algorithm correctness
– May be run 0, 1, or multiple times
• Example: find average of all integers associated
with the same key
Computing the Mean: Version 1
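The slide's pseudo-code did not survive in the text, so the following is only a sketch of why an averaging reducer cannot double as a combiner, and of the usual remedy of passing (sum, count) pairs. All names are hypothetical, and the split of values across two mappers is an assumption made for the demo.

```java
import java.util.List;

final class MeanCombinerSketch {
    // Combiner-safe partial aggregate: {sum, count}. Adding pairs is
    // associative, so it is correct whether this runs 0, 1, or many times.
    static long[] combine(List<long[]> partials) {
        long sum = 0, count = 0;
        for (long[] p : partials) {
            sum += p[0];
            count += p[1];
        }
        return new long[] {sum, count};
    }

    // Reducer: the division happens only here, once per key.
    static double reduce(List<long[]> partials) {
        long[] total = combine(partials);
        return (double) total[0] / total[1];
    }

    static double mean(List<Integer> values) {
        double sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum / values.size();
    }

    // One key whose values are split across two mappers: {1, 2, 3, 4} and {10}.
    static double[] demo() {
        List<Integer> mapperA = List.of(1, 2, 3, 4);
        List<Integer> mapperB = List.of(10);
        // Wrong: average locally (reducer used as combiner), then average again.
        double meanOfMeans = (mean(mapperA) + mean(mapperB)) / 2;
        // Right: combine into (sum, count) locally, divide once in the reducer.
        long[] partA = combine(List.of(new long[] {1, 1}, new long[] {2, 1},
                                       new long[] {3, 1}, new long[] {4, 1}));
        long[] partB = combine(List.of(new long[] {10, 1}));
        double trueMean = reduce(List.of(partA, partB));
        return new double[] {meanOfMeans, trueMean}; // {6.25, 4.0}
    }
}
```

The mean of the two local means is 6.25, but the true mean of the five values is 4.0, so correctness would depend on whether the combiner ran. With (sum, count) pairs the answer is the same either way.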
Single Source Shortest Path
• Problem: find shortest path from a source node to one or
more target nodes
– Shortest might also mean lowest weight or cost
• First, a refresher: Dijkstra’s Algorithm
Pseudocode for Dijkstra
Initialize the cost of each vertex to ∞
cost[s] = 0;
heap.insert(s);
While (! heap.empty())
    n = heap.deleteMin()
    For (each vertex a which is adjacent to n along edge e)
        if (cost[n] + edge_cost[e] < cost[a]) then
            cost[a] = cost[n] + edge_cost[e]
            previous_on_path_to[a] = n;
            if (a is in the heap) then heap.decreaseKey(a)
            else heap.insert(a)
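A runnable Java sketch of the pseudocode, with one substitution named plainly: `java.util.PriorityQueue` has no `decreaseKey`, so the sketch re-inserts a node with its improved cost and skips stale heap entries when they are removed (lazy deletion). It also omits `previous_on_path_to` for brevity. The class name is hypothetical, and the demo graph is the five-node graph whose adjacency list appears in the deck's later SSSP example, with A to E numbered 0 to 4.

```java
import java.util.Arrays;
import java.util.PriorityQueue;

final class DijkstraSketch {
    static final int INF = Integer.MAX_VALUE;

    // graph[u] holds {v, weight} pairs; returns cost[] measured from source s.
    static int[] shortestPaths(int[][][] graph, int s) {
        int[] cost = new int[graph.length];
        Arrays.fill(cost, INF);
        cost[s] = 0;
        // Heap entries are {cost, node}, ordered by cost.
        PriorityQueue<int[]> heap = new PriorityQueue<>((x, y) -> Integer.compare(x[0], y[0]));
        heap.add(new int[] {0, s});
        while (!heap.isEmpty()) {
            int[] top = heap.poll();
            int n = top[1];
            if (top[0] > cost[n]) {
                continue; // stale entry: n was already settled with a lower cost
            }
            for (int[] edge : graph[n]) {
                int a = edge[0], w = edge[1];
                if (cost[n] + w < cost[a]) {
                    cost[a] = cost[n] + w;
                    heap.add(new int[] {cost[a], a}); // stands in for decreaseKey/insert
                }
            }
        }
        return cost;
    }

    static int[] demo() {
        int[][][] graph = {
            {{1, 10}, {3, 5}},         // A: (B, 10), (D, 5)
            {{2, 1}, {3, 2}},          // B: (C, 1), (D, 2)
            {{4, 4}},                  // C: (E, 4)
            {{1, 3}, {2, 9}, {4, 2}},  // D: (B, 3), (C, 9), (E, 2)
            {{0, 7}, {2, 6}}           // E: (A, 7), (C, 6)
        };
        return shortestPaths(graph, 0); // [0, 8, 9, 5, 7]
    }
}
```

The result gives distances 0, 8, 9, 5, 7 for A, B, C, D, E. Lazy deletion keeps the code short at the price of a heap that can hold several entries per node.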
Dijkstra’s Algorithm Example
Example from CLR
[Figure sequence: Dijkstra’s algorithm on a five-node weighted directed graph with edge weights 10, 5, 2, 3, 1, 9, 2, 4, 6, 7. Node labels A–E follow the later SSSP example, with A as the source. Tentative distances shown in each frame:]
Frame   A   B    C    D   E
1       0   ∞    ∞    ∞   ∞
2       0   10   ∞    5   ∞
3       0   8    14   5   7
4       0   8    13   5   7
5       0   8    9    5   7
6       0   8    9    5   7
Single Source Shortest Path
• Problem: find shortest path from a source node to one or
more target nodes
– Shortest might also mean lowest weight or cost
• Single processor machine: Dijkstra’s Algorithm
• MapReduce: parallel Breadth-First Search (BFS)
Finding the Shortest Path
• Consider simple case of equal edge weights
• Solution to the problem can be defined inductively
• Here’s the intuition:
– Define: b is reachable from a if b is on adjacency list of a
– DISTANCETO(s) = 0
– For all nodes p reachable from s,
DISTANCETO(p) = 1
– For all nodes n reachable from some other set of nodes M,
DISTANCETO(n) = 1 + min(DISTANCETO(m), m ∈ M)
[Figure: source s reaches m1, m2, m3 at distances d1, d2, d3; each of them has an edge to n.]
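The inductive definition can be written as a single recurrence. This is a sketch for unit edge weights; the symbols d (for DISTANCETO) and M(n) (the set of nodes with an edge to n) are notation introduced here, not taken from the slides.

```latex
\[
\begin{aligned}
d(s) &= 0,\\
d(n) &= 1 + \min_{m \in M(n)} d(m) \quad (n \neq s),
\qquad M(n) = \{\, m : n \text{ is on the adjacency list of } m \,\}.
\end{aligned}
\]
```

The case DISTANCETO(p) = 1 for neighbours of s is the recurrence with m = s, since d(s) = 0 is the smallest possible value.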
Visualizing Parallel BFS
[Figure: graph with nodes n0–n9 illustrating the search frontier expanding across iterations.]
From Intuition to Algorithm
• Data representation:
– Key: node n
– Value: d (distance from start), adjacency list (list of nodes reachable
from n)
– Initialization: for all nodes except for start node, d = ∞
• Mapper:
– ∀m ∈ adjacency list: emit (m, d + 1)
• Sort/Shuffle
– Groups distances by reachable nodes
• Reducer:
– Selects minimum distance path for each reachable node
– Additional bookkeeping needed to keep track of actual path
Multiple Iterations Needed
• Each MapReduce iteration advances the “known frontier” by
one hop
– Subsequent iterations include more and more reachable nodes as
frontier expands
– Multiple iterations are needed to explore entire graph
• Preserving graph structure:
– Problem: Where did the adjacency list go?
– Solution: mapper emits (n, adjacency list) as well
BFS Pseudo-Code
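The slide's pseudo-code is not present in the text. As a stand-in, here is a single-process sketch of parallel BFS with unit edge weights: one `iterate` call plays the role of a map, shuffle, and reduce round, and `run` plays the role of the external driver. All names and the small demo graph are hypothetical. Two simplifications are assumed: unreached nodes send nothing (sending ∞ + 1 would not change any minimum), and the adjacency list is re-attached directly rather than emitted as a separate record. Java 9 or later is assumed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class ParallelBfsSketch {
    static final int INF = Integer.MAX_VALUE;

    // Value stored per node: current distance plus the adjacency list.
    static final class Node {
        final int dist;
        final List<String> adj;
        Node(int dist, List<String> adj) { this.dist = dist; this.adj = adj; }
    }

    // One MapReduce iteration, simulated in memory.
    static Map<String, Node> iterate(Map<String, Node> graph) {
        // Map + shuffle: every reached node n sends (m, d + 1) to each neighbour m.
        Map<String, List<Integer>> byNode = new TreeMap<>();
        for (Node n : graph.values()) {
            if (n.dist == INF) continue;
            for (String m : n.adj) {
                byNode.computeIfAbsent(m, k -> new ArrayList<>()).add(n.dist + 1);
            }
        }
        // Reduce: keep the minimum distance and carry the graph structure along.
        Map<String, Node> next = new TreeMap<>();
        for (Map.Entry<String, Node> e : graph.entrySet()) {
            int best = e.getValue().dist;
            for (int d : byNode.getOrDefault(e.getKey(), List.of())) {
                best = Math.min(best, d);
            }
            next.put(e.getKey(), new Node(best, e.getValue().adj));
        }
        return next;
    }

    // Driver: repeat until an iteration changes no distance.
    static Map<String, Integer> run(Map<String, Node> graph) {
        boolean changed = true;
        while (changed) {
            Map<String, Node> next = iterate(graph);
            changed = false;
            for (String id : graph.keySet()) {
                if (next.get(id).dist != graph.get(id).dist) changed = true;
            }
            graph = next;
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, Node> e : graph.entrySet()) {
            result.put(e.getKey(), e.getValue().dist);
        }
        return result;
    }

    static Map<String, Integer> demo() {
        Map<String, Node> graph = new TreeMap<>();
        graph.put("s", new Node(0, List.of("a", "b")));
        graph.put("a", new Node(INF, List.of("c")));
        graph.put("b", new Node(INF, List.of("c")));
        graph.put("c", new Node(INF, List.of("d")));
        graph.put("d", new Node(INF, List.of()));
        return run(graph); // {a=1, b=1, c=2, d=3, s=0}
    }
}
```

In the demo the frontier moves one hop per iteration, so node d at three hops from s receives its distance in the third iteration, and a fourth iteration confirms that nothing changes.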
Example: SSSP – Parallel BFS in MapReduce
• Adjacency matrix
      A    B    C    D    E
A          10        5
B               1    2
C                         4
D          3    9         2
E     7         6
• Adjacency list
A: (B, 10), (D, 5)
B: (C, 1), (D, 2)
C: (E, 4)
D: (B, 3), (C, 9), (E, 2)
E: (A, 7), (C, 6)
[Figure: the weighted directed graph on nodes A–E described by these edges; A is the source with distance 0, all other distances are ∞.]
• Map input: <node ID, <dist, adj list>>
<A, <0, <(B, 10), (D, 5)>>>
<B, <inf, <(C, 1), (D, 2)>>>
<C, <inf, <(E, 4)>>>
<D, <inf, <(B, 3), (C, 9), (E, 2)>>>
<E, <inf, <(A, 7), (C, 6)>>>
• Map output: <dest node ID, dist>
<B, 10> <D, 5> <A, <0, <(B, 10), (D, 5)>>>
<C, inf> <D, inf> <B, <inf, <(C, 1), (D, 2)>>>
<E, inf> <C, <inf, <(E, 4)>>>
<B, inf> <C, inf> <E, inf> <D, <inf, <(B, 3), (C, 9), (E, 2)>>>
<A, inf> <C, inf> <E, <inf, <(A, 7), (C, 6)>>>
Flushed to local disk!!
• Reduce input: <node ID, dist>
<A, <0, <(B, 10), (D, 5)>>>
<A, inf>
<B, <inf, <(C, 1), (D, 2)>>>
<B, 10> <B, inf>
<C, <inf, <(E, 4)>>>
<C, inf> <C, inf> <C, inf>
<D, <inf, <(B, 3), (C, 9), (E, 2)>>>
<D, 5> <D, inf>
<E, <inf, <(A, 7), (C, 6)>>>
<E, inf> <E, inf>
• Reduce output: <node ID, <dist, adj list>>
= Map input for next iteration
<A, <0, <(B, 10), (D, 5)>>>
<B, <10, <(C, 1), (D, 2)>>>
<C, <inf, <(E, 4)>>>
<D, <5, <(B, 3), (C, 9), (E, 2)>>>
<E, <inf, <(A, 7), (C, 6)>>>
Flushed to DFS!!
• Map output: <dest node ID, dist>
<B, 10> <D, 5> <A, <0, <(B, 10), (D, 5)>>>
<C, 11> <D, 12> <B, <10, <(C, 1), (D, 2)>>>
<E, inf> <C, <inf, <(E, 4)>>>
<B, 8> <C, 14> <E, 7> <D, <5, <(B, 3), (C, 9), (E, 2)>>>
<A, inf> <C, inf> <E, <inf, <(A, 7), (C, 6)>>>
Flushed to local disk!!
• Reduce input: <node ID, dist>
<A, <0, <(B, 10), (D, 5)>>>
<A, inf>
<B, <10, <(C, 1), (D, 2)>>>
<B, 10> <B, 8>
<C, <inf, <(E, 4)>>>
<C, 11> <C, 14> <C, inf>
<D, <5, <(B, 3), (C, 9), (E, 2)>>>
<D, 5> <D, 12>
<E, <inf, <(A, 7), (C, 6)>>>
<E, inf> <E, 7>
• Reduce output: <node ID, <dist, adj list>>
= Map input for next iteration
<A, <0, <(B, 10), (D, 5)>>>
<B, <8, <(C, 1), (D, 2)>>>
<C, <11, <(E, 4)>>>
<D, <5, <(B, 3), (C, 9), (E, 2)>>>
<E, <7, <(A, 7), (C, 6)>>>
Flushed to DFS!!
… the rest omitted …
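The trace can be checked with a small in-memory simulation of the weighted variant, in which the mapper emits (m, d + w) and the reducer keeps the minimum. This is a sketch rather than Hadoop code: the class and method names are hypothetical, nodes A to E are numbered 0 to 4, and unreached nodes send nothing because inf + w would not lower any minimum.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class WeightedParallelBfsSketch {
    static final int INF = Integer.MAX_VALUE;

    // Adjacency list of the example; each edge is {target, weight}.
    static final int[][][] ADJ = {
        {{1, 10}, {3, 5}},         // A: (B, 10), (D, 5)
        {{2, 1}, {3, 2}},          // B: (C, 1), (D, 2)
        {{4, 4}},                  // C: (E, 4)
        {{1, 3}, {2, 9}, {4, 2}},  // D: (B, 3), (C, 9), (E, 2)
        {{0, 7}, {2, 6}}           // E: (A, 7), (C, 6)
    };

    // One iteration: map emits (m, d + w); reduce keeps the minimum per node.
    static int[] iterate(int[] dist) {
        int[] next = dist.clone(); // each node also passes along its own distance
        for (int n = 0; n < ADJ.length; n++) {
            if (dist[n] == INF) continue;
            for (int[] edge : ADJ[n]) {
                next[edge[0]] = Math.min(next[edge[0]], dist[n] + edge[1]);
            }
        }
        return next;
    }

    // Distances after each iteration, stopping once an iteration changes nothing.
    static List<String> trace() {
        List<String> rows = new ArrayList<>();
        int[] dist = {0, INF, INF, INF, INF};
        while (true) {
            int[] next = iterate(dist);
            if (Arrays.equals(next, dist)) break;
            rows.add(format(next));
            dist = next;
        }
        return rows;
    }

    static String format(int[] dist) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < dist.length; i++) {
            if (i > 0) sb.append(' ');
            sb.append((char) ('A' + i)).append('=');
            sb.append(dist[i] == INF ? "inf" : String.valueOf(dist[i]));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        trace().forEach(System.out::println);
    }
}
```

The first two rows reproduce the two reduce outputs shown above (B=10, D=5, then B=8, C=11, D=5, E=7). A third iteration lowers C to 9 via B, and a fourth changes nothing, giving the same final distances as Dijkstra's algorithm on this graph.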
Stopping Criterion
• How many iterations are needed in parallel BFS (equal edge
weight case)?
• Convince yourself: when a node is first “discovered”, we’ve
found the shortest path
• Now answer the question...
– Six degrees of separation?
• Practicalities of implementation in MapReduce
Comparison to Dijkstra
• Dijkstra’s algorithm is more efficient
– At any step it only pursues edges from the minimum-cost path inside
the frontier
• MapReduce explores all paths in parallel
– Lots of “waste”
– Useful work is only done at the “frontier”
• Why can’t we do better using MapReduce?
Weighted Edges
• Now add positive weights to the edges
– Why can’t edge weights be negative?
• Simple change: adjacency list now includes a weight w for
each edge
– In mapper, emit (m, d + w) instead of (m, d + 1) for each node m
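Stopping Criterion
• How many iterations are needed in parallel BFS (positive
edge weight case)?
– Graph diameter D is no longer enough in general: a cheapest path may
use more hops than the fewest-hop path
• Caution: when a node is first “discovered”, we have not necessarily
found its shortest path (in the SSSP example, B is first reached at
distance 10 and later improved to 8)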
Additional Complexities
[Figure: search frontier around source s with nodes p, q, r and n1–n9; most edges have weight 1 and one edge has weight 10.]
Stopping Criterion
• How many iterations are needed in parallel BFS (positive
edge weight case)?
• Practicalities of implementation in MapReduce
Graphs and MapReduce
• Graph algorithms typically involve:
– Performing computations at each node: based on node features,
edge features, and local link structure
– Propagating computations: “traversing” the graph
• Generic recipe:
– Represent graphs as adjacency lists
– Perform local computations in mapper
– Pass along partial results via outlinks, keyed by destination node
– Perform aggregation in reducer on inlinks to a node
– Iterate until convergence: controlled by external “driver”
– Don’t forget to pass the graph structure between iterations
http://famousphil.com/blog/2011/06/a-hadoop-mapreduce-solution-to-dijkstra%E2%80%99s-algorithm/
import java.io.IOException;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;

public class Dijkstra extends Configured implements Tool {

    public static String OUT = "outfile";
    public static String IN = "inputlarger";

    public static class TheMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            Text word = new Text();
            String line = value.toString(); // looks like: 1 0 2:3:  (node, distance, adjacency list)
            String[] sp = line.split(" ");  // splits on space
            int distanceadd = Integer.parseInt(sp[1]) + 1; // distance to each neighbour (unit weights)
            String[] PointsTo = sp[2].split(":");
            for (int i = 0; i < PointsTo.length; i++) {
                word.set("VALUE " + distanceadd); // tells the reducer to look at a distance value
                context.write(new LongWritable(Integer.parseInt(PointsTo[i])), word);
                word.clear();
            }
            // pass in current node's distance (if it is the lowest distance)
            word.set("VALUE " + sp[1]);
            context.write(new LongWritable(Integer.parseInt(sp[0])), word);
            word.clear();
            // pass the adjacency list along so the graph structure survives the iteration
            word.set("NODES " + sp[2]); // tells the reducer to append on the final tally
            context.write(new LongWritable(Integer.parseInt(sp[0])), word);
            word.clear();
        }
    }

    public static class TheReducer extends Reducer<LongWritable, Text, LongWritable, Text> {
        public void reduce(LongWritable key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
            String nodes = "UNMODED";
            Text word = new Text();
            int lowest = 10009; // start at "infinity"
            for (Text val : values) { // each value looks like "NODES 2:3:" or "VALUE 1"; the first token is the tag
                String[] sp = val.toString().split(" "); // splits on space
                // look at first value
                if (sp[0].equalsIgnoreCase("NODES")) {
                    nodes = sp[1];
                } else if (sp[0].equalsIgnoreCase("VALUE")) {
                    int distance = Integer.parseInt(sp[1]);
                    lowest = Math.min(distance, lowest); // keep the minimum distance seen
                }
            }
            word.set(lowest + " " + nodes);
            context.write(key, word);
            word.clear();
        }
    }

    public int run(String[] args) throws Exception {
        //http://code.google.com/p/joycrawler/source/browse/NetflixChallenge/src/org/niubility/learning/knn/KNNDriver.java?r=2
        getConf().set("mapred.textoutputformat.separator", " "); // make the key -> value space separated (for iterations)
        …..
        while (isdone == false) {
            Job job = new Job(getConf());
            job.setJarByClass(Dijkstra.class);
            job.setJobName("Dijkstra");
            job.setOutputKeyClass(LongWritable.class);
            job.setOutputValueClass(Text.class);
            job.setMapperClass(TheMapper.class);
            job.setReducerClass(TheReducer.class);
            job.setInputFormatClass(TextInputFormat.class);
            job.setOutputFormatClass(TextOutputFormat.class);
            FileInputFormat.addInputPath(job, new Path(infile));
            FileOutputFormat.setOutputPath(job, new Path(outputfile));
            success = job.waitForCompletion(true);
            // remove the input file
            //http://eclipse.sys-con.com/node/1287801/mobile
            if (!infile.equals(IN)) { // compare string contents, not references
                String indir = infile.replace("part-r-00000", "");
                Path ddir = new Path(indir);
                FileSystem dfs = FileSystem.get(getConf());
                dfs.delete(ddir, true);
            }
