Map Reduced B Seminar
MapReduce provides
Automatic parallelization & distribution
Fault tolerance
I/O scheduling
Monitoring & status updates
Map/Reduce à la Google
map(key, val) is run on each item in set
emits intermediate key / val pairs
map(key=url, val=contents):
For each word w in contents, emit (w, 1)
reduce(key=word, values=uniq_counts):
Sum all 1s in values list
Emit result (word, sum)
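The word-count pseudocode above can be sketched as a single-process Python program. This is only an illustration: the `run_mapreduce` driver, the whitespace tokenization, and the sample URLs are assumptions, not part of the MapReduce library, and nothing here is parallel or distributed.

```python
from collections import defaultdict


def map_fn(url, contents):
    """Map: emit (word, 1) for each word in the document contents."""
    for word in contents.split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce: sum the counts collected for one word."""
    yield word, sum(counts)


def run_mapreduce(inputs, mapper, reducer):
    """Hypothetical in-memory driver: map, group by key, then reduce."""
    groups = defaultdict(list)
    for key, val in inputs:
        for out_key, out_val in mapper(key, val):
            groups[out_key].append(out_val)
    results = []
    for out_key in sorted(groups):  # process keys in sorted order
        results.extend(reducer(out_key, groups[out_key]))
    return results


# Sample input; the URLs are placeholders.
docs = [("url1", "see bob throw"), ("url2", "see spot run")]
word_counts = run_mapreduce(docs, map_fn, reduce_fn)
print(word_counts)
```

The grouping step in the middle stands in for the shuffle that a real implementation would perform across machines.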
Count, Illustrated
map(key=url, val=contents):
For each word w in contents, emit (w, 1)
reduce(key=word, values=uniq_counts):
Sum all 1s in values list
Emit result (word, sum)
Input: "see bob throw", "see spot run"
Map output: (see, 1) (bob, 1) (throw, 1) (see, 1) (spot, 1) (run, 1)
Reduce output: (bob, 1) (run, 1) (see, 2) (spot, 1) (throw, 1)
Grep
Input consists of (url+offset, single line)
map(key=url+offset, val=line):
If line matches regexp, emit (line, 1)
reduce(key=line, values=uniq_counts):
Don't do anything; just emit line
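The grep formulation above can be sketched the same way in Python. The regular expression, the sample lines, and the `run_mapreduce` driver are all hypothetical stand-ins chosen for illustration; the sketch runs in one process and is not the library's API.

```python
import re
from collections import defaultdict

PATTERN = re.compile(r"error")  # hypothetical pattern to search for


def grep_map(key, line):
    """Map: key is (url, offset); emit (line, 1) when the line matches."""
    if PATTERN.search(line):
        yield line, 1


def grep_reduce(line, counts):
    """Reduce: ignore the counts and just emit the matching line."""
    yield line


def run_mapreduce(inputs, mapper, reducer):
    """Hypothetical in-memory driver: map, group by key, then reduce."""
    groups = defaultdict(list)
    for key, val in inputs:
        for out_key, out_val in mapper(key, val):
            groups[out_key].append(out_val)
    results = []
    for out_key in sorted(groups):
        results.extend(reducer(out_key, groups[out_key]))
    return results


# Sample input; URLs, offsets, and lines are made up.
lines = [
    (("http://example.com/a", 0), "disk error on node 7"),
    (("http://example.com/a", 21), "all good"),
    (("http://example.com/b", 0), "network error"),
]
matches = run_mapreduce(lines, grep_map, grep_reduce)
print(matches)
```

Because the intermediate key is the line itself, identical matching lines from different offsets collapse into one output line in this sketch.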
Model is Widely Applicable
MapReduce Programs In Google Source Tree
Example uses:
distributed grep
distributed sort
web link-graph reversal
term-vector / host
web access log stats
inverted index construction
document clustering
machine learning
statistical machine translation
...
Implementation Overview
Typical cluster:
Effect
Thousands of machines read input at local disk speed
Without this, rack switches limit read rate
Performance
Tests run on cluster of 1800 machines:
4 GB of memory
Dual-processor 2 GHz Xeons with Hyperthreading
Dual 160 GB IDE disks
Gigabit Ethernet per machine
Bisection bandwidth approximately 100 Gbps
Two benchmarks:
MR_Grep: Scan 10^10 100-byte records to extract records
matching a rare pattern (92K matching records)
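A quick back-of-the-envelope check of the grep benchmark's scale, taking the record count as 10^10 and 92K as roughly 92,000. The even split of input across all 1800 machines is an assumption made for illustration, not a statement about how the test distributed its data.

```python
records = 10**10      # number of input records (read as 10^10)
record_bytes = 100    # bytes per record
machines = 1800       # cluster size from the test setup
matching = 92_000     # approximate count of matching records

total_bytes = records * record_bytes
# Assumption: input is spread evenly over every machine.
per_machine_bytes = total_bytes / machines
selectivity = matching / records

print(f"total input: {total_bytes / 1e12:.1f} TB")
print(f"per machine: {per_machine_bytes / 1e6:.0f} MB")
print(f"selectivity: {selectivity:.1e}")
```

Under these assumptions the scan covers about a terabyte of input while fewer than one record in 100,000 matches, so almost all of the work is reading input rather than producing output.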
Fun to use:
focus on problem,
let library deal w/ messy details