L11 MapReduce Dijkstra BFS

Map-Reduce Applications:

Counting, Graph Shortest Paths

Adapted from UMD Jimmy Lin’s slides, which are licensed under a Creative
Commons Attribution-Noncommercial-Share Alike 3.0 United States license. See
http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details.
Overview
 MapReduce Introduction
 Simple counting, averaging
 Graph problems and representations
 Parallel breadth-first search

CS 4407
University College Cork,
Gregory M. Provan
MapReduce: Parallel Programming Framework
 Scaling algorithms by parallel computation
– Needed for “big data”
 MapReduce
– Google framework
– Partitions the input data across many CPUs
 Examine a few algorithms executed in parallel
– Word counting
– Dijkstra’s algorithm
– PageRank
[Figure: Jobs, Data Centre]

MapReduce Basics
 Partition data
 Two phases
– MAP: extract values
– REDUCE: combine values

[Figure: Map applies f to each data partition; Reduce combines the mapped values with g]

Map
 Records from the data source (lines out of files, rows of a
database, etc.) are fed into the map function as key/value
pairs: e.g., (filename, line).
 map() produces one or more intermediate values along with
an output key from the input.

Map

map (in_key, in_value) -> (out_key, intermediate_value) list

Reduce
 After the map phase is over, all the intermediate values for a
given output key are combined together into a list
 reduce() combines those intermediate values into one or
more final values for that same output key
 (in practice, usually only one final value per key)

Reduce

reduce (out_key, intermediate_value list) -> out_value list

[Figure: MapReduce data flow]
Input:     (k1, v1) (k2, v2) (k3, v3) (k4, v4) (k5, v5) (k6, v6)
map:       (a, 1) (b, 2) | (c, 3) (c, 6) | (a, 5) (c, 2) | (b, 7) (c, 8)
combine:   (a, 1) (b, 2) | (c, 9) | (a, 5) (c, 2) | (b, 7) (c, 8)
partition
Shuffle and Sort: aggregate values by keys
           a: [1, 5]   b: [2, 7]   c: [2, 9, 8]
reduce:    (r1, s1) (r2, s2) (r3, s3)

MapReduce: Overview
 Programmers must specify:
map (k, v) → <k’, v’>*
reduce (k’, v’) → <k’, v’>*
– All values with the same key are reduced together
 Optionally, also:
partition (k’, number of partitions) → partition for k’
– Often a simple hash of the key, e.g., hash(k’) mod n
– Divides up key space for parallel reduce operations
combine (k’, v’) → <k’, v’>*
– Mini-reducers that run in memory after the map phase
– Used as an optimization to reduce network traffic
 The execution framework handles everything else…
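
The partition step above ("hash(k’) mod n") can be sketched in a few lines of plain Java. This is only an illustration: the class name HashPartitionSketch and the helper partition are made up for this example and are not part of any framework. Masking with Integer.MAX_VALUE is one common way to keep the result non-negative when hashCode() is negative.

```java
// Hypothetical helper class for illustration; no MapReduce framework is involved.
class HashPartitionSketch {
    // Map a key to one of numPartitions reducers: hash(k') mod n.
    // The mask clears the sign bit so the remainder is never negative.
    static int partition(String key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        int n = 3;
        for (String key : new String[] {"a", "b", "c", "a"}) {
            System.out.println(key + " -> reducer " + partition(key, n));
        }
    }
}
```

Because the result depends only on the key, every occurrence of a key goes to the same reducer, which is what allows all values with the same key to be reduced together.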

“Everything Else”
 The execution framework handles everything else…
– Scheduling: assigns workers to map and reduce tasks
– “Data distribution”: moves processes to data
– Synchronization: gathers, sorts, and shuffles intermediate
data
– Errors and faults: detects worker failures and restarts
 Limited control over data and execution flow
– All algorithms must be expressed in terms of m, r, c, p (map, reduce, combine, partition)
 You don’t know:
– Where mappers and reducers run
– When a mapper or reducer begins or finishes
– Which input a particular mapper is processing
– Which intermediate key a particular reducer is processing

Word Count Example
Input → Map → Shuffle & Sort → Reduce → Output

Input:
  “the quick brown fox”
  “the fox ate the mouse”
  “how now brown cow”
Map output:
  Map 1: (the, 1) (quick, 1) (brown, 1) (fox, 1)
  Map 2: (the, 1) (fox, 1) (ate, 1) (the, 1) (mouse, 1)
  Map 3: (how, 1) (now, 1) (brown, 1) (cow, 1)
Reduce output (after Shuffle & Sort):
  Reduce 1 (adjective, article): (brown, 2) (how, 1) (now, 1) (quick, 1) (the, 3)
  Reduce 2 (noun, verb): (ate, 1) (cow, 1) (fox, 2) (mouse, 1)
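
The word count data flow can be imitated in memory with plain Java. The sketch below is an assumption-laden illustration, not Hadoop code: the class WordCountSketch and its methods map, reduce and run are invented names, and a TreeMap stands in for the shuffle and sort step.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory sketch of the map, shuffle/sort and reduce phases (no Hadoop involved).
class WordCountSketch {
    // map: (docId, line) -> list of (word, 1)
    static List<Map.Entry<String, Integer>> map(String docId, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            out.add(new AbstractMap.SimpleEntry<String, Integer>(word, 1));
        }
        return out;
    }

    // reduce: (word, [counts]) -> total count for that word
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) sum += c;
        return sum;
    }

    static Map<String, Integer> run(List<String> lines) {
        // Shuffle and sort: group intermediate values by key (TreeMap keeps keys sorted).
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (int i = 0; i < lines.size(); i++) {
            for (Map.Entry<String, Integer> kv : map("doc" + i, lines.get(i))) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<Integer>()).add(kv.getValue());
            }
        }
        // Reduce phase: one call per distinct key.
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList(
                "the quick brown fox", "the fox ate the mouse", "how now brown cow")));
    }
}
```

With the three input lines of the example, the mappers emit 13 pairs in total and the reducers see 9 distinct keys.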

Word Count: Baseline

What’s the impact of combiners?


Word Count: Version 1

Word Count: Version 2

Design Pattern for Local Aggregation
 “In-mapper combining”
– Fold the functionality of the combiner into the mapper by
preserving state across multiple map calls
 Advantages
– Speed
– Faster than actual combiners
 Disadvantages
– Explicit memory management required
– Potential for order-dependent bugs
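
A rough sketch of in-mapper combining in plain Java follows. It is a simulation under stated assumptions: the class InMapperCombiningSketch, the method names map and close, and the emitted list (standing in for the framework's output collector) are all hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of "in-mapper combining": the mapper keeps a local table across map() calls
// and emits its contents once, in close(). All names are illustrative.
class InMapperCombiningSketch {
    private final Map<String, Integer> counts = new HashMap<>();   // state preserved across calls
    final List<String> emitted = new ArrayList<>();                // stands in for the framework's output

    // Called once per input record; nothing is emitted here.
    void map(String line) {
        for (String word : line.trim().split("\\s+")) {
            counts.merge(word, 1, Integer::sum);
        }
    }

    // Called once after the last record; emits one pair per distinct word.
    void close() {
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            emitted.add(e.getKey() + "," + e.getValue());
        }
        counts.clear();
    }

    public static void main(String[] args) {
        InMapperCombiningSketch mapper = new InMapperCombiningSketch();
        mapper.map("the fox ate the mouse");
        mapper.map("the quick brown fox");
        mapper.close();
        System.out.println(mapper.emitted.size() + " pairs emitted for 9 input words");
    }
}
```

The table grows with the number of distinct keys seen by one mapper, which is where the explicit memory management mentioned above comes in: a fuller version would have to flush the table once it exceeds some size limit.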

Combiner Design
 Combiners and reducers share the same method
signature
– Sometimes, reducers can serve as combiners
– Often, not…
 Remember: combiners are optional optimizations
– Should not affect algorithm correctness
– May be run 0, 1, or multiple times
 Example: find average of all integers associated
with the same key
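
One way to see why a reducer cannot always double as a combiner is the mean: averaging partial averages gives the wrong answer unless counts travel with the sums. The sketch below is an assumed illustration in plain Java; MeanCombinerSketch, SumCount, combine and reduce are invented names.

```java
import java.util.Arrays;
import java.util.List;

// Sketch: a mean stays correct under combining only if (sum, count) pairs are passed along.
class MeanCombinerSketch {
    // Partial aggregate that is safe to merge 0, 1, or many times.
    static final class SumCount {
        final long sum;
        final long count;
        SumCount(long sum, long count) { this.sum = sum; this.count = count; }
        SumCount merge(SumCount other) { return new SumCount(sum + other.sum, count + other.count); }
        double mean() { return (double) sum / count; }
    }

    // Combiner: folds the values seen by one mapper into a single (sum, count) pair.
    static SumCount combine(List<Integer> values) {
        SumCount acc = new SumCount(0, 0);
        for (int v : values) acc = acc.merge(new SumCount(v, 1));
        return acc;
    }

    // Reducer: merges the partial pairs and only then divides.
    static double reduce(List<SumCount> partials) {
        SumCount acc = new SumCount(0, 0);
        for (SumCount p : partials) acc = acc.merge(p);
        return acc.mean();
    }

    public static void main(String[] args) {
        List<Integer> mapper1 = Arrays.asList(1, 2, 3);
        List<Integer> mapper2 = Arrays.asList(10);
        double correct = reduce(Arrays.asList(combine(mapper1), combine(mapper2)));
        double meanOfMeans = (combine(mapper1).mean() + combine(mapper2).mean()) / 2;
        System.out.println("mean = " + correct + ", mean of means = " + meanOfMeans);
    }
}
```

For the values 1, 2, 3 on one mapper and 10 on another, the true mean is 16 / 4 = 4, while the mean of the two partial means is (2 + 10) / 2 = 6.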

Computing the Mean: Version 1

Single Source Shortest Path
 Problem: find shortest path from a source node to one or
more target nodes
– Shortest might also mean lowest weight or cost
 First, a refresher: Dijkstra’s Algorithm

Pseudocode for Dijkstra
Initialize the cost of each vertex to ∞
cost[s] = 0;
heap.insert(s);
While (! heap.empty())
    n = heap.deleteMin()
    For (each vertex a which is adjacent to n along edge e)
        if (cost[n] + edge_cost[e] < cost[a]) then
            cost[a] = cost[n] + edge_cost[e]
            previous_on_path_to[a] = n;
            if (a is in the heap) then heap.decreaseKey(a)
            else heap.insert(a)
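
The pseudocode can be turned into a short runnable sketch in Java. Two things here are assumptions rather than part of the slides: the class name DijkstraSketch with its helpers, and the handling of decreaseKey. java.util.PriorityQueue has no decrease-key operation, so this variant re-inserts a vertex with its new cost and skips stale queue entries instead. The graph is the five-node graph whose adjacency list appears in the SSSP example, with A to E numbered 0 to 4.

```java
import java.util.Arrays;
import java.util.PriorityQueue;

// Sketch of Dijkstra's algorithm (A=0, B=1, C=2, D=3, E=4).
// Substitution: re-insert and skip stale entries instead of heap.decreaseKey.
class DijkstraSketch {
    static final int INF = Integer.MAX_VALUE;

    // edges[u] is a list of {v, weight} pairs for the outgoing edges of u.
    static int[] shortestPaths(int[][][] edges, int source) {
        int[] cost = new int[edges.length];
        Arrays.fill(cost, INF);
        cost[source] = 0;
        // Queue entries are {vertex, cost at insertion time}, ordered by cost.
        PriorityQueue<int[]> heap = new PriorityQueue<>((x, y) -> Integer.compare(x[1], y[1]));
        heap.add(new int[] {source, 0});
        while (!heap.isEmpty()) {
            int[] top = heap.poll();
            int n = top[0];
            if (top[1] > cost[n]) continue;   // stale entry: a cheaper one was already processed
            for (int[] e : edges[n]) {
                int a = e[0], w = e[1];
                if (cost[n] + w < cost[a]) {
                    cost[a] = cost[n] + w;
                    heap.add(new int[] {a, cost[a]});
                }
            }
        }
        return cost;
    }

    static int[][][] exampleGraph() {
        return new int[][][] {
            {{1, 10}, {3, 5}},          // A: (B, 10), (D, 5)
            {{2, 1}, {3, 2}},           // B: (C, 1), (D, 2)
            {{4, 4}},                   // C: (E, 4)
            {{1, 3}, {2, 9}, {4, 2}},   // D: (B, 3), (C, 9), (E, 2)
            {{0, 7}, {2, 6}}            // E: (A, 7), (C, 6)
        };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(shortestPaths(exampleGraph(), 0)));
    }
}
```

From source A this yields the distances 0, 8, 9, 5, 7 for A, B, C, D, E.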

Dijkstra’s Algorithm Example
(Example from CLR)

[Figure: five-node weighted directed graph with edge weights 10, 5, 1, 2, 3, 9, 4, 6, 7, 2. Six slides show the tentative distances after each step; node labels A to E follow the SSSP example of the same graph, with A as the source.]

Step    A    B    C    D    E
0       0    ∞    ∞    ∞    ∞
1       0    10   ∞    5    ∞
2       0    8    14   5    7
3       0    8    13   5    7
4       0    8    9    5    7
5       0    8    9    5    7

Single Source Shortest Path
 Problem: find shortest path from a source node to one or
more target nodes
– Shortest might also mean lowest weight or cost
 Single processor machine: Dijkstra’s Algorithm
 MapReduce: parallel Breadth-First Search (BFS)

Finding the Shortest Path
 Consider simple case of equal edge weights
 Solution to the problem can be defined inductively
 Here’s the intuition:
– Define: b is reachable from a if b is on adjacency list of a
– DISTANCETO(s) = 0
– For all nodes p reachable from s,
DISTANCETO(p) = 1
– For all nodes n reachable from some other set of nodes M,
DISTANCETO(n) = 1 + min(DISTANCETO(m), m ∈ M)
[Figure: s reaches m1, m2, m3 at distances d1, d2, d3; each of m1, m2, m3 has an edge to n]

Visualizing Parallel BFS

[Figure: example graph with nodes n0 to n9]

From Intuition to Algorithm
 Data representation:
– Key: node n
– Value: d (distance from start), adjacency list (list of nodes reachable
from n)
– Initialization: for all nodes except for start node, d = ∞
 Mapper:
– ∀ m ∈ adjacency list: emit (m, d + 1)
 Sort/Shuffle
– Groups distances by reachable nodes
 Reducer:
– Selects minimum distance path for each reachable node
– Additional bookkeeping needed to keep track of actual path

Multiple Iterations Needed
 Each MapReduce iteration advances the “known frontier” by
one hop
– Subsequent iterations include more and more reachable nodes as
frontier expands
– Multiple iterations are needed to explore entire graph
 Preserving graph structure:
– Problem: Where did the adjacency list go?
– Solution: mapper emits (n, adjacency list) as well
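
The mapper, reducer and iteration scheme described so far can be sketched as an in-memory simulation in Java. Everything named here is an assumption made for illustration (ParallelBfsSketch, Msg, map, reduce, run, and the small graph s, a, b, c, d); no MapReduce framework is used. Unlike the slides, this sketch does not emit distances for nodes whose own distance is still infinite, simply to avoid integer overflow.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory sketch of iterated parallel BFS with unit edge weights. All names are illustrative.
class ParallelBfsSketch {
    static final int INF = Integer.MAX_VALUE;

    // A message is either a candidate distance or a node record (distance + adjacency list).
    static final class Msg {
        final int dist;
        final List<String> adj;   // null for a plain distance message
        Msg(int dist, List<String> adj) { this.dist = dist; this.adj = adj; }
    }

    static void emit(Map<String, List<Msg>> out, String key, Msg value) {
        out.computeIfAbsent(key, k -> new ArrayList<Msg>()).add(value);
    }

    // Mapper: re-emit the node record so the graph structure survives, then d + 1 per neighbour.
    static void map(String n, Msg node, Map<String, List<Msg>> out) {
        emit(out, n, node);
        if (node.dist == INF) return;   // nothing useful to propagate yet
        for (String m : node.adj) emit(out, m, new Msg(node.dist + 1, null));
    }

    // Reducer: keep the minimum distance and reattach the adjacency list.
    static Msg reduce(List<Msg> values) {
        int best = INF;
        List<String> adj = new ArrayList<String>();
        for (Msg v : values) {
            best = Math.min(best, v.dist);
            if (v.adj != null) adj = v.adj;
        }
        return new Msg(best, adj);
    }

    // Driver: one simulated job per pass, until a pass changes no distance. Returns the pass count.
    static int run(Map<String, Msg> graph) {
        int iterations = 0;
        boolean changed = true;
        while (changed) {
            Map<String, List<Msg>> shuffled = new TreeMap<>();   // stands in for shuffle and sort
            for (Map.Entry<String, Msg> e : graph.entrySet()) map(e.getKey(), e.getValue(), shuffled);
            changed = false;
            for (Map.Entry<String, List<Msg>> e : shuffled.entrySet()) {
                Msg updated = reduce(e.getValue());
                Msg old = graph.get(e.getKey());
                if (old == null || updated.dist != old.dist) changed = true;
                graph.put(e.getKey(), updated);
            }
            iterations++;
        }
        return iterations;
    }

    static Map<String, Msg> exampleGraph() {
        Map<String, Msg> g = new TreeMap<>();
        g.put("s", new Msg(0, Arrays.asList("a", "b")));
        g.put("a", new Msg(INF, Arrays.asList("c")));
        g.put("b", new Msg(INF, Arrays.asList("c")));
        g.put("c", new Msg(INF, Arrays.asList("d")));
        g.put("d", new Msg(INF, new ArrayList<String>()));
        return g;
    }

    public static void main(String[] args) {
        Map<String, Msg> g = exampleGraph();
        int passes = run(g);
        for (Map.Entry<String, Msg> e : g.entrySet()) System.out.println(e.getKey() + ": " + e.getValue().dist);
        System.out.println("passes: " + passes);
    }
}
```

On this graph the farthest node d is three hops from s, so three passes change something and a fourth pass confirms that nothing changes any more. A real driver needs an equivalent signal from the reducers to decide when to stop submitting jobs.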

BFS Pseudo-Code

Example: SSSP – Parallel BFS in MapReduce
 Adjacency matrix
      A    B    C    D    E
A          10        5
B               1    2
C                         4
D          3    9         2
E     7         6
 Adjacency List
A: (B, 10), (D, 5)
B: (C, 1), (D, 2)
C: (E, 4)
D: (B, 3), (C, 9), (E, 2)
E: (A, 7), (C, 6)
[Figure: the graph with source A at distance 0 and B, C, D, E at ∞]

Example: SSSP – Parallel BFS in MapReduce
 Map input: <node ID, <dist, adj list>>
<A, <0, <(B, 10), (D, 5)>>>
<B, <inf, <(C, 1), (D, 2)>>>
<C, <inf, <(E, 4)>>>
<D, <inf, <(B, 3), (C, 9), (E, 2)>>>
<E, <inf, <(A, 7), (C, 6)>>>
 Map output: <dest node ID, dist>
<B, 10> <D, 5>                 <A, <0, <(B, 10), (D, 5)>>>
<C, inf> <D, inf>              <B, <inf, <(C, 1), (D, 2)>>>
<E, inf>                       <C, <inf, <(E, 4)>>>
<B, inf> <C, inf> <E, inf>     <D, <inf, <(B, 3), (C, 9), (E, 2)>>>
<A, inf> <C, inf>              <E, <inf, <(A, 7), (C, 6)>>>
Flushed to local disk!!

Example: SSSP – Parallel BFS in MapReduce
 Reduce input: <node ID, dist>
<A, <0, <(B, 10), (D, 5)>>>
<A, inf>
<B, <inf, <(C, 1), (D, 2)>>>
<B, 10> <B, inf>
<C, <inf, <(E, 4)>>>
<C, inf> <C, inf> <C, inf>
<D, <inf, <(B, 3), (C, 9), (E, 2)>>>
<D, 5> <D, inf>
<E, <inf, <(A, 7), (C, 6)>>>
<E, inf> <E, inf>

Example: SSSP – Parallel BFS in MapReduce
 Reduce output: <node ID, <dist, adj list>>
= Map input for next iteration
<A, <0, <(B, 10), (D, 5)>>>
<B, <10, <(C, 1), (D, 2)>>>
<C, <inf, <(E, 4)>>>
<D, <5, <(B, 3), (C, 9), (E, 2)>>>
<E, <inf, <(A, 7), (C, 6)>>>
Flushed to DFS!!
 Map output: <dest node ID, dist>
<B, 10> <D, 5>                 <A, <0, <(B, 10), (D, 5)>>>
<C, 11> <D, 12>                <B, <10, <(C, 1), (D, 2)>>>
<E, inf>                       <C, <inf, <(E, 4)>>>
<B, 8> <C, 14> <E, 7>          <D, <5, <(B, 3), (C, 9), (E, 2)>>>
<A, inf> <C, inf>              <E, <inf, <(A, 7), (C, 6)>>>
Flushed to local disk!!
[Figure: graph with current distances A = 0, B = 10, C = ∞, D = 5, E = ∞]

Example: SSSP – Parallel BFS in MapReduce
 Reduce input: <node ID, dist>
<A, <0, <(B, 10), (D, 5)>>>
<A, inf>
<B, <10, <(C, 1), (D, 2)>>>
<B, 10> <B, 8>
<C, <inf, <(E, 4)>>>
<C, 11> <C, 14> <C, inf>
<D, <5, <(B, 3), (C, 9), (E, 2)>>>
<D, 5> <D, 12>
<E, <inf, <(A, 7), (C, 6)>>>
<E, inf> <E, 7>

Example: SSSP – Parallel BFS in MapReduce
 Reduce output: <node ID, <dist, adj list>>
= Map input for next iteration
<A, <0, <(B, 10), (D, 5)>>>
<B, <8, <(C, 1), (D, 2)>>>
<C, <11, <(E, 4)>>>
<D, <5, <(B, 3), (C, 9), (E, 2)>>>
<E, <7, <(A, 7), (C, 6)>>>
Flushed to DFS!!
… the rest omitted …
[Figure: graph with current distances A = 0, B = 8, C = 11, D = 5, E = 7]

Stopping Criterion
 How many iterations are needed in parallel BFS (equal edge
weight case)?
 Convince yourself: when a node is first “discovered”, we’ve
found the shortest path
 Now answer the question...
– Six degrees of separation?
 Practicalities of implementation in MapReduce

Comparison to Dijkstra
 Dijkstra’s algorithm is more efficient
– At any step it only pursues edges from the minimum-cost path inside
the frontier
 MapReduce explores all paths in parallel
– Lots of “waste”
– Useful work is only done at the “frontier”
 Why can’t we do better using MapReduce?

Weighted Edges
 Now add positive weights to the edges
– Why can’t edge weights be negative?
 Simple change: adjacency list now includes a weight w for
each edge
– In mapper, emit (m, d + w_p) instead of (m, d + 1) for each node m, where w_p is the weight of the edge to m
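
The weighted variant can be simulated in memory on the five-node example graph (A to E numbered 0 to 4). This is a sketch under assumptions: WeightedBfsSketch, iterate and run are invented names, the shuffle is a list of lists, and nodes at infinite distance do not emit candidates (inf + w would stay inf anyway).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch of weighted parallel BFS (A=0, B=1, C=2, D=3, E=4), simulated in memory.
class WeightedBfsSketch {
    static final int INF = Integer.MAX_VALUE;

    // One simulated MapReduce iteration: returns the distances after the reduce phase.
    static int[] iterate(int[][][] adj, int[] dist) {
        // "Shuffle": candidates.get(m) collects every distance emitted for node m.
        List<List<Integer>> candidates = new ArrayList<>();
        for (int n = 0; n < dist.length; n++) candidates.add(new ArrayList<Integer>());
        // Map phase: each node re-emits its own distance, then d + w for each outgoing edge.
        for (int n = 0; n < dist.length; n++) {
            candidates.get(n).add(dist[n]);
            if (dist[n] == INF) continue;
            for (int[] edge : adj[n]) candidates.get(edge[0]).add(dist[n] + edge[1]);
        }
        // Reduce phase: minimum candidate per node.
        int[] next = new int[dist.length];
        for (int m = 0; m < dist.length; m++) {
            int best = INF;
            for (int c : candidates.get(m)) best = Math.min(best, c);
            next[m] = best;
        }
        return next;
    }

    // Driver: repeat until an iteration changes nothing; returns the distances after every iteration.
    static List<int[]> run(int[][][] adj, int source) {
        int[] dist = new int[adj.length];
        Arrays.fill(dist, INF);
        dist[source] = 0;
        List<int[]> history = new ArrayList<>();
        while (true) {
            int[] next = iterate(adj, dist);
            history.add(next);
            if (Arrays.equals(next, dist)) return history;
            dist = next;
        }
    }

    static int[][][] exampleGraph() {
        return new int[][][] {
            {{1, 10}, {3, 5}},          // A: (B, 10), (D, 5)
            {{2, 1}, {3, 2}},           // B: (C, 1), (D, 2)
            {{4, 4}},                   // C: (E, 4)
            {{1, 3}, {2, 9}, {4, 2}},   // D: (B, 3), (C, 9), (E, 2)
            {{0, 7}, {2, 6}}            // E: (A, 7), (C, 6)
        };
    }

    public static void main(String[] args) {
        for (int[] d : run(exampleGraph(), 0)) System.out.println(Arrays.toString(d));
    }
}
```

The first two iterations reproduce the reduce outputs of the worked example (B = 10, D = 5, then B = 8, C = 11, E = 7). A third iteration improves C to 9 via B, and a fourth changes nothing, so a distance can still improve after a node has first been reached.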

Stopping Criterion
 How many iterations are needed in parallel BFS (positive
edge weight case)?
– Graph diameter D
 Convince yourself: when a node is first “discovered”, have we
found the shortest path? Not necessarily: in the SSSP example, B is
first discovered at distance 10 and later improved to 8

Additional Complexities

[Figure: search frontier; nodes s, p, q, r and n1 to n9; edges of weight 1 and one edge of weight 10]

Stopping Criterion
 How many iterations are needed in parallel BFS (positive
edge weight case)?
 Practicalities of implementation in MapReduce

Graphs and MapReduce
 Graph algorithms typically involve:
– Performing computations at each node: based on node features,
edge features, and local link structure
– Propagating computations: “traversing” the graph
 Generic recipe:
– Represent graphs as adjacency lists
– Perform local computations in mapper
– Pass along partial results via outlinks, keyed by destination node
– Perform aggregation in reducer on inlinks to a node
– Iterate until convergence: controlled by external “driver”
– Don’t forget to pass the graph structure between iterations

http://famousphil.com/blog/2011/06/a-hadoop-mapreduce-solution-to-dijkstra%E2%80%99s-algorithm/

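import java.io.IOException;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;

public class Dijkstra extends Configured implements Tool {

    public static String OUT = "outfile";
    public static String IN = "inputlarger";

    public static class TheMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            Text word = new Text();
            String line = value.toString(); // looks like 1 0 2:3:
            String[] sp = line.split(" "); // splits on space
            int distanceadd = Integer.parseInt(sp[1]) + 1;
            String[] PointsTo = sp[2].split(":");
            for (int i = 0; i < PointsTo.length; i++) {
                word.set("VALUE " + distanceadd); // tells me to look at distance value
                context.write(new LongWritable(Integer.parseInt(PointsTo[i])), word);
                word.clear();
            }
            // pass in current node's distance (if it is the lowest distance)
            word.set("VALUE " + sp[1]);
            context.write(new LongWritable(Integer.parseInt(sp[0])), word);
            word.clear();
            word.set("NODES " + sp[2]); // tells me to append on the final tally
            context.write(new LongWritable(Integer.parseInt(sp[0])), word);
            word.clear();
        }
    }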
    public static class TheReducer extends Reducer<LongWritable, Text, LongWritable, Text> {
        public void reduce(LongWritable key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            String nodes = "UNMODED";
            Text word = new Text();
            int lowest = 10009; // start at infinity

            // looks like NODES/VALUES 1 0 2:3:, we need to use the first as a key
            for (Text val : values) {
                String[] sp = val.toString().split(" "); // splits on space
                // look at first value
                if (sp[0].equalsIgnoreCase("NODES")) {
                    nodes = sp[1]; // adjacency list, passed through unchanged
                } else if (sp[0].equalsIgnoreCase("VALUE")) {
                    int distance = Integer.parseInt(sp[1]);
                    lowest = Math.min(distance, lowest); // keep the minimum distance seen
                }
            }
            word.set(lowest + " " + nodes);
            context.write(key, word);
            word.clear();
        }
    }
    public int run(String[] args) throws Exception {
        //http://code.google.com/p/joycrawler/source/browse/NetflixChallenge/src/org/niubility/learning/knn/KNNDriver.java?r=2
        // make the key -> value space separated (for iterations)
        getConf().set("mapred.textoutputformat.separator", " ");
        …..
        while (isdone == false) {
            Job job = new Job(getConf());
            job.setJarByClass(Dijkstra.class);
            job.setJobName("Dijkstra");
            job.setOutputKeyClass(LongWritable.class);
            job.setOutputValueClass(Text.class);
            job.setMapperClass(TheMapper.class);
            job.setReducerClass(TheReducer.class);
            job.setInputFormatClass(TextInputFormat.class);
            job.setOutputFormatClass(TextOutputFormat.class);

            FileInputFormat.addInputPath(job, new Path(infile));
            FileOutputFormat.setOutputPath(job, new Path(outputfile));
            success = job.waitForCompletion(true);
            // remove the input file
            //http://eclipse.sys-con.com/node/1287801/mobile
            if (!infile.equals(IN)) { // compare contents; != on Strings compares references
                String indir = infile.replace("part-r-00000", "");
                Path ddir = new Path(indir);
                FileSystem dfs = FileSystem.get(getConf());
                dfs.delete(ddir, true);
            }
