L11 MapReduce Dijkstra BFS

Map-Reduce Applications:

Counting, Graph Shortest Paths

Adapted from slides by Jimmy Lin (UMD), which are licensed under a Creative
Commons Attribution-Noncommercial-Share Alike 3.0 United States license. See
http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details.
• MapReduce Introduction
• Simple counting, averaging
• Graph problems and representations
• Parallel breadth-first search

CS 4407
University College Cork,
Gregory M. Provan
MapReduce: Parallel Programming Framework
• Scaling algorithms by parallel computation
– Needed for “big data”
• MapReduce
– Google framework
– Partitions the input data across many CPUs
• Examine a few algorithms executed in parallel
– Word counting
– Dijkstra’s algorithm
– PageRank
[Figure labels: Jobs, Data Centre]
MapReduce Basics
• Partition data
• Two phases
– MAP: extract values
– REDUCE: combine values
[Figure: Map applies a function f to every data partition; Reduce combines the results with a function g]
• Records from the data source (lines out of files, rows of a
database, etc.) are fed into the map function as key/value
pairs: e.g., (filename, line).
• map() produces one or more intermediate values along with
an output key from the input.

map (in_key, in_value) -> (out_key, intermediate_value) list
• After the map phase is over, all the intermediate values for a
given output key are combined together into a list.
• reduce() combines those intermediate values into one or
more final values for that same output key.
• In practice, usually only one final value per key.

reduce (out_key, intermediate_value list) -> out_value list
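The two signatures can be tried out with a minimal in-memory sketch in plain Java, with no Hadoop dependency. The class name MiniMapReduce, the pair helper and the line-length example are illustrative assumptions, not part of any framework; a real framework would spread the map and reduce calls over many machines.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiFunction;

public class MiniMapReduce {
    // Hypothetical helper that builds one (key, value) pair.
    static <K, V> Map.Entry<K, V> pair(K key, V value) {
        return new AbstractMap.SimpleEntry<>(key, value);
    }

    // map:    (in_key, in_value) -> (out_key, intermediate_value) list
    // reduce: (out_key, intermediate_value list) -> out_value list
    public static <K1, V1, K2, V2, V3> Map<K2, List<V3>> run(
            List<Map.Entry<K1, V1>> records,
            BiFunction<K1, V1, List<Map.Entry<K2, V2>>> map,
            BiFunction<K2, List<V2>, List<V3>> reduce) {
        // Map phase, then group all intermediate values by output key.
        // TreeMap keeps keys sorted, so K2 must be Comparable at run time.
        Map<K2, List<V2>> grouped = new TreeMap<>();
        for (Map.Entry<K1, V1> record : records) {
            for (Map.Entry<K2, V2> out : map.apply(record.getKey(), record.getValue())) {
                grouped.computeIfAbsent(out.getKey(), k -> new ArrayList<>()).add(out.getValue());
            }
        }
        // Reduce phase: one call per output key.
        Map<K2, List<V3>> result = new TreeMap<>();
        for (Map.Entry<K2, List<V2>> group : grouped.entrySet()) {
            result.put(group.getKey(), reduce.apply(group.getKey(), group.getValue()));
        }
        return result;
    }

    // Example: total number of characters per file, from (filename, line) records.
    public static Map<String, List<Integer>> demo() {
        List<Map.Entry<String, String>> records = Arrays.asList(
                pair("a.txt", "hello world"), pair("a.txt", "hi"), pair("b.txt", "map reduce"));
        BiFunction<String, String, List<Map.Entry<String, Integer>>> map =
                (file, line) -> Collections.singletonList(pair(file, line.length()));
        BiFunction<String, List<Integer>, List<Integer>> reduce = (file, lengths) -> {
            int total = 0;
            for (int length : lengths) total += length;
            return Collections.singletonList(total);
        };
        return run(records, map, reduce);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Here the map function emits one (filename, length) pair per line and the reduce function sums the lengths for each file, so each output key ends up with a single final value.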



Input:            (k1, v1) (k2, v2) (k3, v3) (k4, v4) (k5, v5) (k6, v6)
map:              four map tasks
Map output:       (a, 1) (b, 2) | (c, 3) (c, 6) | (a, 5) (c, 2) | (b, 7) (c, 8)
combine:          one combiner per map task
Combine output:   (a, 1) (b, 2) | (c, 9) | (a, 5) (c, 2) | (b, 7) (c, 8)
partition:        one partitioner per map task
Shuffle and Sort: aggregate values by keys
Reduce input:     a → [1, 5]   b → [2, 7]   c → [2, 9, 8]
reduce:           three reduce tasks
Output:           (r1, s1) (r2, s2) (r3, s3)
MapReduce: Overview
• Programmers must specify:
    map (k, v) → <k’, v’>*
    reduce (k’, v’) → <k’, v’>*
– All values with the same key are reduced together
• Optionally, also:
    partition (k’, number of partitions) → partition for k’
– Often a simple hash of the key, e.g., hash(k’) mod n
– Divides up key space for parallel reduce operations
    combine (k’, v’) → <k’, v’>*
– Mini-reducers that run in memory after the map phase
– Used as an optimization to reduce network traffic
• The execution framework handles everything else…
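The hash(k’) mod n rule for the partitioner can be sketched in a few lines of Java. The class and method names are illustrative assumptions; Math.floorMod is used because hashCode() may be negative, and a plain % would then give a negative partition index.

```java
public class HashPartitionSketch {
    // partition (k', number of partitions) -> partition for k'
    public static int partition(String key, int numPartitions) {
        // floorMod keeps the result in the range [0, numPartitions).
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        for (String key : new String[] {"a", "b", "c"}) {
            System.out.println(key + " -> reducer " + partition(key, 3));
        }
    }
}
```

Because the partition depends only on the key, every pair with the same key reaches the same reducer, which is what lets all values for one key be reduced together.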

“Everything Else”
• The execution framework handles everything else…
– Scheduling: assigns workers to map and reduce tasks
– “Data distribution”: moves processes to data
– Synchronization: gathers, sorts, and shuffles intermediate data
– Errors and faults: detects worker failures and restarts the failed tasks
• Limited control over data and execution flow
– All algorithms must be expressed in m, r, c, p (map, reduce, combine, partition)
• You don’t know:
– Where mappers and reducers run
– When a mapper or reducer begins or finishes
– Which input a particular mapper is processing
– Which intermediate key a particular reducer is processing
Word Count Example
Input → Map → Shuffle & Sort → Reduce → Output

Map (three map tasks, one per input):
  “the quick brown fox”   → the, 1   quick, 1   brown, 1   fox, 1
  “the fox ate the mouse” → the, 1   fox, 1   ate, 1   the, 1   mouse, 1
  “how now brown cow”     → how, 1   now, 1   brown, 1   cow, 1

Reduce (two reduce tasks, after Shuffle & Sort):
  Reduce task 1 (partition labelled “adjective, article”) → brown, 2   how, 1   now, 1   quick, 1   the, 3
  Reduce task 2 (partition labelled “noun, verb”)         → ate, 1   cow, 1   fox, 2   mouse, 1
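The word count example can be reproduced with a small single-machine sketch in plain Java. The class name WordCountSketch and its method names are assumptions made for illustration; the three input lines are the ones from the figure, and a TreeMap stands in for the shuffle and sort step.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class WordCountSketch {
    // Map: emit (word, 1) for every word in the line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            out.add(new AbstractMap.SimpleEntry<>(word, 1));
        }
        return out;
    }

    // Reduce: sum the counts collected for one word.
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int count : counts) sum += count;
        return sum;
    }

    public static Map<String, Integer> wordCount(List<String> lines) {
        // Shuffle and sort: group the emitted 1s by word (TreeMap keeps keys sorted).
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            result.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "the quick brown fox", "the fox ate the mouse", "how now brown cow");
        System.out.println(wordCount(lines));
    }
}
```

Without a combiner, the map side emits one pair per word occurrence (13 pairs for these three lines), and all of them cross the shuffle.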

Word Count: Baseline

What’s the impact of combiners?

Word Count: Version 1

Word Count: Version 2

Design Pattern for Local Aggregation
• “In-mapper combining”
– Fold the functionality of the combiner into the mapper by
preserving state across multiple map calls
• Advantages
– Speed
– Faster than actual combiners
• Disadvantages
– Explicit memory management required
– Potential for order-dependent bugs
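A minimal sketch of in-mapper combining for word count, in plain Java. The class InMapperCombining and its map/close methods are hypothetical stand-ins for a mapper's per-record and end-of-task hooks (in Hadoop, the flush would typically go in the mapper's cleanup method); the emitted list stands in for the framework's output collector.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class InMapperCombining {
    // State preserved across map() calls (this is the memory the mapper must manage).
    private final Map<String, Integer> partialCounts = new HashMap<>();
    // Stands in for the framework's output collector.
    private final List<Map.Entry<String, Integer>> emitted = new ArrayList<>();

    // Called once per input line: update local counts instead of emitting (word, 1).
    public void map(String line) {
        for (String word : line.trim().split("\\s+")) {
            partialCounts.merge(word, 1, Integer::sum);
        }
    }

    // Called once after the last map() call: emit one pair per distinct word.
    public void close() {
        for (Map.Entry<String, Integer> e : partialCounts.entrySet()) {
            emitted.add(new AbstractMap.SimpleEntry<>(e.getKey(), e.getValue()));
        }
        partialCounts.clear();
    }

    public List<Map.Entry<String, Integer>> emitted() {
        return emitted;
    }

    // One mapper that processes two of the word count example lines.
    public static List<Map.Entry<String, Integer>> demo() {
        InMapperCombining mapper = new InMapperCombining();
        mapper.map("the quick brown fox");
        mapper.map("the fox ate the mouse");
        mapper.close();
        return mapper.emitted();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

For these two lines the mapper emits 6 pairs (one per distinct word) instead of 9. The cost is that partialCounts grows with the number of distinct keys, so a real mapper would need to flush it before it exceeds available memory.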

Combiner Design
• Combiners and reducers share the same method signature
– Sometimes, reducers can serve as combiners
– Often, not…
• Remember: combiners are optional optimizations
– Should not affect algorithm correctness
– May be run 0, 1, or multiple times
• Example: find the average of all integers associated
with the same key
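The average is a case where the reducer cannot double as the combiner: a mean of partial means is generally not the overall mean. One common way around this, sketched below in plain Java, is to pass (sum, count) pairs so that the combiner's output has the same type as its input. The class MeanWithCombiner and the SumCount type are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MeanWithCombiner {
    // Partial result for one key: running sum and number of values.
    static final class SumCount {
        final long sum;
        final long count;
        SumCount(long sum, long count) { this.sum = sum; this.count = count; }
    }

    // Combiner: merges partial results. Input and output have the same type,
    // so running it zero, one, or several times does not change the answer.
    static SumCount combine(List<SumCount> parts) {
        long sum = 0;
        long count = 0;
        for (SumCount part : parts) {
            sum += part.sum;
            count += part.count;
        }
        return new SumCount(sum, count);
    }

    // Reducer: merge everything for the key, then divide exactly once.
    static double reduce(List<SumCount> parts) {
        SumCount total = combine(parts);
        return (double) total.sum / total.count;
    }

    // splits.get(i) holds the integers that mapper i saw for one key.
    public static double mean(List<List<Integer>> splits, boolean useCombiner) {
        List<SumCount> reducerInput = new ArrayList<>();
        for (List<Integer> split : splits) {
            List<SumCount> mapOutput = new ArrayList<>();
            for (int value : split) mapOutput.add(new SumCount(value, 1)); // map: emit (value, 1)
            if (useCombiner) {
                reducerInput.add(combine(mapOutput));
            } else {
                reducerInput.addAll(mapOutput);
            }
        }
        return reduce(reducerInput);
    }

    public static void main(String[] args) {
        List<List<Integer>> splits = Arrays.asList(Arrays.asList(1, 2, 3), Arrays.asList(10));
        System.out.println(mean(splits, false));
        System.out.println(mean(splits, true));
    }
}
```

With the values 1, 2, 3 on one mapper and 10 on another, both calls give 16 / 4 = 4.0, whereas averaging the two per-mapper means (2 and 10) would give 6.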

Computing the Mean: Version 1

Single Source Shortest Path
• Problem: find shortest path from a source node to one or
more target nodes
– Shortest might also mean lowest weight or cost
• First, a refresher: Dijkstra’s Algorithm
Pseudocode for Dijkstra
Initialize the cost of each vertex to ∞
cost[s] = 0;
heap.insert(s);
While (! heap.empty())
    n = heap.deleteMin()
    For (each vertex a which is adjacent to n along edge e)
        if (cost[n] + edge_cost[e] < cost[a]) then
            cost[a] = cost[n] + edge_cost[e]
            previous_on_path_to[a] = n;
            if (a is in the heap) then heap.decreaseKey(a)
            else heap.insert(a)
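A runnable Java sketch of the pseudocode, under two stated simplifications: java.util.PriorityQueue has no decreaseKey, so a vertex is re-inserted with its new cost and stale heap entries are skipped, and the previous_on_path_to bookkeeping is left out. The class name DijkstraSketch is an assumption; the graph is the five-node example whose edges are listed in the adjacency-list slide of the SSSP example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.TreeMap;

public class DijkstraSketch {
    // Heap entry: a vertex together with the cost it had when it was inserted.
    static final class Item implements Comparable<Item> {
        final String vertex;
        final int cost;
        Item(String vertex, int cost) { this.vertex = vertex; this.cost = cost; }
        @Override public int compareTo(Item other) { return Integer.compare(cost, other.cost); }
    }

    // graph.get(u) maps each neighbour v to the cost of edge u -> v.
    // Every vertex must appear as a key, even if it has no outgoing edges.
    public static Map<String, Integer> shortestPaths(Map<String, Map<String, Integer>> graph, String s) {
        Map<String, Integer> cost = new TreeMap<>();
        for (String v : graph.keySet()) cost.put(v, Integer.MAX_VALUE); // "infinity"
        cost.put(s, 0);
        PriorityQueue<Item> heap = new PriorityQueue<>();
        heap.add(new Item(s, 0));
        while (!heap.isEmpty()) {
            Item item = heap.poll(); // deleteMin
            String n = item.vertex;
            if (item.cost > cost.get(n)) continue; // stale entry: a cheaper path was found later
            for (Map.Entry<String, Integer> edge : graph.get(n).entrySet()) {
                String a = edge.getKey();
                int candidate = cost.get(n) + edge.getValue();
                if (candidate < cost.get(a)) {
                    cost.put(a, candidate);
                    heap.add(new Item(a, candidate)); // stands in for decreaseKey / insert
                }
            }
        }
        return cost;
    }

    // The five-node example graph with weighted, directed edges.
    public static Map<String, Map<String, Integer>> exampleGraph() {
        String[][] edges = {
            {"A", "B", "10"}, {"A", "D", "5"}, {"B", "C", "1"}, {"B", "D", "2"}, {"C", "E", "4"},
            {"D", "B", "3"}, {"D", "C", "9"}, {"D", "E", "2"}, {"E", "A", "7"}, {"E", "C", "6"}};
        Map<String, Map<String, Integer>> graph = new HashMap<>();
        for (String[] e : edges) {
            graph.computeIfAbsent(e[0], k -> new HashMap<>()).put(e[1], Integer.parseInt(e[2]));
        }
        return graph;
    }

    public static void main(String[] args) {
        System.out.println(shortestPaths(exampleGraph(), "A"));
    }
}
```

From source A the sketch returns A = 0, B = 8, C = 9, D = 5, E = 7.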

Dijkstra’s Algorithm Example
(Example from CLR. Source node A; each row gives the tentative distances shown in one slide of the animation.)

Slide   A    B    C    D    E
1       0    ∞    ∞    ∞    ∞
2       0    10   ∞    5    ∞
3       0    8    14   5    7
4       0    8    13   5    7
5       0    8    9    5    7
6       0    8    9    5    7

Single Source Shortest Path
• Problem: find shortest path from a source node to one or
more target nodes
– Shortest might also mean lowest weight or cost
• Single processor machine: Dijkstra’s Algorithm
• MapReduce: parallel Breadth-First Search (BFS)
Finding the Shortest Path
• Consider simple case of equal edge weights
• Solution to the problem can be defined inductively
• Here’s the intuition:
– Define: b is reachable from a if b is on adjacency list of a
– DISTANCETO(s) = 0
– For all nodes p reachable from s, DISTANCETO(p) = 1
– For all nodes n reachable from some other set of nodes M,
DISTANCETO(n) = 1 + min(DISTANCETO(m), m ∈ M)
[Figure: s reaches intermediate nodes m1, … at distances d1, …, d3, and each of them has an edge to n]
Visualizing Parallel BFS

[Figure: example graph with nodes n0, n1, n2, n3]

From Intuition to Algorithm
• Data representation:
– Key: node n
– Value: d (distance from start), adjacency list (list of nodes reachable
from n)
– Initialization: for all nodes except for start node, d = ∞
• Mapper:
– ∀m ∈ adjacency list: emit (m, d + 1)
• Sort/Shuffle
– Groups distances by reachable nodes
• Reducer:
– Selects minimum distance path for each reachable node
– Additional bookkeeping needed to keep track of actual path
Multiple Iterations Needed
• Each MapReduce iteration advances the “known frontier” by
one hop
– Subsequent iterations include more and more reachable nodes as
frontier expands
– Multiple iterations are needed to explore entire graph
• Preserving graph structure:
– Problem: Where did the adjacency list go?
– Solution: mapper emits (n, adjacency list) as well
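One iteration of parallel BFS with unit edge weights can be sketched in plain Java as follows. The class ParallelBfsStep, the Node type and the four-node graph (n0 → n1, n0 → n2, n1 → n3, n2 → n3) are assumptions made for illustration. The mapper emits the node itself alongside the candidate distances, which is how the adjacency list survives into the next iteration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ParallelBfsStep {
    public static final int INF = Integer.MAX_VALUE;

    // Value stored for each node: distance from the start and adjacency list.
    public static final class Node {
        public final int dist;
        public final List<String> adj;
        public Node(int dist, List<String> adj) { this.dist = dist; this.adj = adj; }
    }

    // One MapReduce iteration with unit edge weights.
    public static Map<String, Node> iterate(Map<String, Node> graph) {
        // Map phase. A shuffled value is either a Node (graph structure)
        // or an Integer (candidate distance for the key node).
        Map<String, List<Object>> shuffled = new TreeMap<>();
        for (Map.Entry<String, Node> e : graph.entrySet()) {
            Node n = e.getValue();
            shuffled.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(n); // emit (n, node)
            for (String m : n.adj) {
                Integer candidate = (n.dist == INF) ? INF : n.dist + 1;
                shuffled.computeIfAbsent(m, k -> new ArrayList<>()).add(candidate); // emit (m, d + 1)
            }
        }
        // Reduce phase: minimum distance per node, adjacency list reattached.
        Map<String, Node> next = new TreeMap<>();
        for (Map.Entry<String, List<Object>> e : shuffled.entrySet()) {
            int best = INF;
            List<String> adj = Collections.emptyList();
            for (Object value : e.getValue()) {
                if (value instanceof Node) {
                    Node n = (Node) value;
                    adj = n.adj;
                    best = Math.min(best, n.dist);
                } else {
                    best = Math.min(best, (Integer) value);
                }
            }
            next.put(e.getKey(), new Node(best, adj));
        }
        return next;
    }

    // Hypothetical graph: n0 -> n1, n0 -> n2, n1 -> n3, n2 -> n3; n0 is the start node.
    public static Map<String, Node> exampleGraph() {
        Map<String, Node> graph = new TreeMap<>();
        graph.put("n0", new Node(0, Arrays.asList("n1", "n2")));
        graph.put("n1", new Node(INF, Arrays.asList("n3")));
        graph.put("n2", new Node(INF, Arrays.asList("n3")));
        graph.put("n3", new Node(INF, Collections.<String>emptyList()));
        return graph;
    }

    public static void main(String[] args) {
        Map<String, Node> graph = exampleGraph();
        for (int i = 1; i <= 2; i++) {
            graph = iterate(graph);
            for (Map.Entry<String, Node> e : graph.entrySet()) {
                System.out.println("iteration " + i + ": " + e.getKey() + " dist=" + e.getValue().dist);
            }
        }
    }
}
```

After the first iteration n1 and n2 are at distance 1 while n3 is still at "infinity"; the second iteration moves the frontier one hop further and reaches n3 at distance 2.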

BFS Pseudo-Code

Example: SSSP – Parallel BFS in MapReduce
• Adjacency matrix (row = source node, column = destination node, - = no edge)
     A    B    C    D    E
A    -    10   -    5    -
B    -    -    1    2    -
C    -    -    -    -    4
D    -    3    9    -    2
E    7    -    6    -    -
• Adjacency list
A: (B, 10), (D, 5)
B: (C, 1), (D, 2)
C: (E, 4)
D: (B, 3), (C, 9), (E, 2)
E: (A, 7), (C, 6)
[Figure: the example graph on nodes A to E; A has distance 0, all other nodes ∞]

Example: SSSP – Parallel BFS in MapReduce
• Map input: <node ID, <dist, adj list>>
<A, <0, <(B, 10), (D, 5)>>>
<B, <inf, <(C, 1), (D, 2)>>>
<C, <inf, <(E, 4)>>>
<D, <inf, <(B, 3), (C, 9), (E, 2)>>>
<E, <inf, <(A, 7), (C, 6)>>>
• Map output: <dest node ID, dist>
<B, 10> <D, 5>              <A, <0, <(B, 10), (D, 5)>>>
<C, inf> <D, inf>           <B, <inf, <(C, 1), (D, 2)>>>
<E, inf>                    <C, <inf, <(E, 4)>>>
<B, inf> <C, inf> <E, inf>  <D, <inf, <(B, 3), (C, 9), (E, 2)>>>
<A, inf> <C, inf>           <E, <inf, <(A, 7), (C, 6)>>>
Flushed to local disk!!

Example: SSSP – Parallel BFS in MapReduce
• Reduce input: <node ID, dist>
<A, <0, <(B, 10), (D, 5)>>>
<A, inf>
<B, <inf, <(C, 1), (D, 2)>>>
<B, 10> <B, inf>
<C, <inf, <(E, 4)>>>
<C, inf> <C, inf> <C, inf>
<D, <inf, <(B, 3), (C, 9), (E, 2)>>>
<D, 5> <D, inf>
<E, <inf, <(A, 7), (C, 6)>>>
<E, inf> <E, inf>

Example: SSSP – Parallel BFS in MapReduce
• Reduce output: <node ID, <dist, adj list>>
= Map input for next iteration
<A, <0, <(B, 10), (D, 5)>>>
<B, <10, <(C, 1), (D, 2)>>>
<C, <inf, <(E, 4)>>>
<D, <5, <(B, 3), (C, 9), (E, 2)>>>
<E, <inf, <(A, 7), (C, 6)>>>
Flushed to DFS!!
• Map output: <dest node ID, dist>
<B, 10> <D, 5>              <A, <0, <(B, 10), (D, 5)>>>
<C, 11> <D, 12>             <B, <10, <(C, 1), (D, 2)>>>
<E, inf>                    <C, <inf, <(E, 4)>>>
<B, 8> <C, 14> <E, 7>       <D, <5, <(B, 3), (C, 9), (E, 2)>>>
<A, inf> <C, inf>           <E, <inf, <(A, 7), (C, 6)>>>
Flushed to local disk!!

Example: SSSP – Parallel BFS in MapReduce
• Reduce input: <node ID, dist>
<A, <0, <(B, 10), (D, 5)>>>
<A, inf>
<B, <10, <(C, 1), (D, 2)>>>
<B, 10> <B, 8>
<C, <inf, <(E, 4)>>>
<C, 11> <C, 14> <C, inf>
<D, <5, <(B, 3), (C, 9), (E, 2)>>>
<D, 5> <D, 12>
<E, <inf, <(A, 7), (C, 6)>>>
<E, inf> <E, 7>

Example: SSSP – Parallel BFS in MapReduce
• Reduce output: <node ID, <dist, adj list>>
= Map input for next iteration
<A, <0, <(B, 10), (D, 5)>>>
<B, <8, <(C, 1), (D, 2)>>>
<C, <11, <(E, 4)>>>
<D, <5, <(B, 3), (C, 9), (E, 2)>>>
<E, <7, <(A, 7), (C, 6)>>>
Flushed to DFS!!

… the rest omitted …

Stopping Criterion
• How many iterations are needed in parallel BFS (equal edge
weight case)?
• Convince yourself: when a node is first “discovered”, we’ve
found the shortest path
• Now answer the question...
– Six degrees of separation?
• Practicalities of implementation in MapReduce
Comparison to Dijkstra
• Dijkstra’s algorithm is more efficient
– At any step it only pursues edges from the minimum-cost path inside
the frontier
• MapReduce explores all paths in parallel
– Lots of “waste”
– Useful work is only done at the “frontier”
• Why can’t we do better using MapReduce?
Weighted Edges
• Now add positive weights to the edges
– Why can’t edge weights be negative?
• Simple change: adjacency list now includes a weight w for
each edge
– In mapper, emit (m, d + w) instead of (m, d + 1) for each node m,
where w is the weight of the edge to m
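The weighted variant, together with a driver loop that repeats iterations until no distance changes, can be sketched in plain Java. The class WeightedParallelBfs is an assumption, and for brevity the adjacency lists are kept in a separate map instead of being re-emitted by the mapper; the graph is the five-node SSSP example with source A.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class WeightedParallelBfs {
    public static final int INF = Integer.MAX_VALUE;

    // One iteration. The adjacency lists stay outside the key-value stream here;
    // in MapReduce the mapper would have to pass them along as well.
    public static Map<String, Integer> iterate(Map<String, Integer> dist,
                                               Map<String, Map<String, Integer>> adj) {
        // Map and shuffle: candidate distances grouped by destination node.
        Map<String, List<Integer>> candidates = new TreeMap<>();
        for (String n : adj.keySet()) {
            int d = dist.get(n);
            candidates.computeIfAbsent(n, k -> new ArrayList<>()).add(d); // the node's own distance
            for (Map.Entry<String, Integer> edge : adj.get(n).entrySet()) {
                int viaN = (d == INF) ? INF : d + edge.getValue();
                candidates.computeIfAbsent(edge.getKey(), k -> new ArrayList<>()).add(viaN); // emit (m, d + w)
            }
        }
        // Reduce: keep the minimum candidate for each node.
        Map<String, Integer> next = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : candidates.entrySet()) {
            next.put(e.getKey(), Collections.min(e.getValue()));
        }
        return next;
    }

    // Driver: iterate until no distance changes; returns the distances after each iteration.
    public static List<Map<String, Integer>> run(Map<String, Map<String, Integer>> adj, String source) {
        Map<String, Integer> dist = new TreeMap<>();
        for (String n : adj.keySet()) dist.put(n, INF);
        dist.put(source, 0);
        List<Map<String, Integer>> history = new ArrayList<>();
        while (true) {
            Map<String, Integer> next = iterate(dist, adj);
            if (next.equals(dist)) break; // converged
            history.add(next);
            dist = next;
        }
        return history;
    }

    // The five-node example graph; every node has at least one outgoing edge.
    public static Map<String, Map<String, Integer>> exampleGraph() {
        String[][] edges = {
            {"A", "B", "10"}, {"A", "D", "5"}, {"B", "C", "1"}, {"B", "D", "2"}, {"C", "E", "4"},
            {"D", "B", "3"}, {"D", "C", "9"}, {"D", "E", "2"}, {"E", "A", "7"}, {"E", "C", "6"}};
        Map<String, Map<String, Integer>> adj = new TreeMap<>();
        for (String[] e : edges) {
            adj.computeIfAbsent(e[0], k -> new TreeMap<>()).put(e[1], Integer.parseInt(e[2]));
        }
        return adj;
    }

    public static void main(String[] args) {
        List<Map<String, Integer>> history = run(exampleGraph(), "A");
        for (int i = 0; i < history.size(); i++) {
            System.out.println("after iteration " + (i + 1) + ": " + history.get(i));
        }
    }
}
```

On this graph the distances change in three iterations: B goes from 10 to 8 in the second and C from 11 to 9 in the third, so a node's distance at first discovery is not always final once edges carry weights.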

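Stopping Criterion
• How many iterations are needed in parallel BFS (positive
edge weight case)?
– Graph diameter D
• Careful: with weighted edges, the first time a node is “discovered” we have
not necessarily found its shortest path. A later iteration can find a cheaper
path that uses more hops (in the SSSP example, B is first reached at distance 10
and later improves to 8).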
Additional Complexities
[Figure: search frontier around source s, with nodes p, r, n2, n4, n6, n7, n9; the edges shown have weight 1 except for one edge of weight 10]
Stopping Criterion
• How many iterations are needed in parallel BFS (positive
edge weight case)?
• Practicalities of implementation in MapReduce
Graphs and MapReduce
• Graph algorithms typically involve:
– Performing computations at each node: based on node features,
edge features, and local link structure
– Propagating computations: “traversing” the graph
• Generic recipe:
– Represent graphs as adjacency lists
– Perform local computations in mapper
– Pass along partial results via outlinks, keyed by destination node
– Perform aggregation in reducer on inlinks to a node
– Iterate until convergence: controlled by external “driver”
– Don’t forget to pass the graph structure between iterations

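import java.io.IOException;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;

public class Dijkstra extends Configured implements Tool {

    public static String OUT = "outfile";
    public static String IN = "inputlarger";

    public static class TheMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            Text word = new Text();
            String line = value.toString(); // looks like "1 0 2:3:" (node, distance, adjacency list)
            String[] sp = line.split(" "); // splits on space
            int distanceadd = Integer.parseInt(sp[1]) + 1; // distance to each neighbour (unit edge weights)
            String[] PointsTo = sp[2].split(":");
            for (int i = 0; i < PointsTo.length; i++) {
                word.set("VALUE " + distanceadd); // tells the reducer to look at the distance value
                context.write(new LongWritable(Integer.parseInt(PointsTo[i])), word);
                word.clear();
            }
            // pass in current node's distance (if it is the lowest distance)
            word.set("VALUE " + sp[1]);
            context.write(new LongWritable(Integer.parseInt(sp[0])), word);
            word.set("NODES " + sp[2]); // tells the reducer to append the adjacency list to the final tally
            context.write(new LongWritable(Integer.parseInt(sp[0])), word);
        }
    }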
    public static class TheReducer extends Reducer<LongWritable, Text, LongWritable, Text> {
        @Override
        public void reduce(LongWritable key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            String nodes = "UNMODED";
            Text word = new Text();
            int lowest = 10009; // start at "infinity"

            for (Text val : values) { // each value looks like "NODES 2:3:" or "VALUE 1"
                String[] sp = val.toString().split(" "); // splits on space
                // look at first value
                if (sp[0].equalsIgnoreCase("NODES")) {
                    nodes = sp[1]; // keep the adjacency list for the next iteration
                } else if (sp[0].equalsIgnoreCase("VALUE")) {
                    int distance = Integer.parseInt(sp[1]);
                    lowest = Math.min(distance, lowest);
                }
            }
            word.set(lowest + " " + nodes);
            context.write(key, word);
        }
    }
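    public int run(String[] args) throws Exception {
        // make the key -> value output space separated (for iterations)
        getConf().set("mapred.textoutputformat.separator", " ");
        // The slide does not show these declarations; the initial values are assumptions.
        String infile = IN;
        String outputfile = OUT;
        boolean isdone = false;
        boolean success = false;
        while (isdone == false) {
            Job job = new Job(getConf());
            // Job setup is not shown on the slide; these calls are an assumed minimum.
            job.setJarByClass(Dijkstra.class);
            job.setMapperClass(TheMapper.class);
            job.setReducerClass(TheReducer.class);
            job.setOutputKeyClass(LongWritable.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, new Path(infile));
            FileOutputFormat.setOutputPath(job, new Path(outputfile));
            success = job.waitForCompletion(true);
            // remove the input file
            if (!infile.equals(IN)) {
                String indir = infile.replace("part-r-00000", "");
                Path ddir = new Path(indir);
                FileSystem dfs = FileSystem.get(getConf());
                dfs.delete(ddir, true);
            }
            // Not shown on the slide: point infile at this iteration's output, choose a
            // fresh outputfile, and set isdone once no distance has changed.
        }
        return success ? 0 : 1;
    }
}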
