MapReduce is a programming framework that allows for distributed and parallel processing of large datasets. It consists of a map step that processes key-value pairs in parallel, and a reduce step that aggregates the outputs of the map step. As an example, a word counting problem is presented where words are counted by mapping each word to a key-value pair of the word and 1, and then reducing by summing the counts of each unique word. MapReduce jobs are executed on a cluster in a reliable way using YARN to schedule tasks across nodes, restarting failed tasks when needed.
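The word count example can be sketched in a few lines of plain Python. This is a minimal single-process illustration of the map, group-by-key, and reduce steps, not Hadoop code; the function names `map_words`, `shuffle`, and `reduce_counts` are hypothetical and chosen only for this sketch.

```python
from collections import defaultdict


def map_words(line):
    """Map step: emit a (word, 1) pair for every word in the line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(word, counts):
    """Reduce step: sum the counts collected for one unique word."""
    return word, sum(counts)


def word_count(lines):
    """Run map over every line, group by word, then reduce each group."""
    pairs = [pair for line in lines for pair in map_words(line)]
    return dict(reduce_counts(w, c) for w, c in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["the quick brown fox", "the lazy dog", "the fox"]))
```

In a real cluster the map calls would run on different nodes and the grouping would be done by the framework; here everything runs in one process to keep the data flow visible.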
4. MapReduce is a programming framework for distributed, parallel processing of large data sets across a cluster.
7. Map and Reduce Operations
MAP: input data -> <key, value> pairs
REDUCE: <key, value> pairs -> <result>
[Figure: the data collection is split (split 1, split 2, ..., split n) to supply multiple processors; each split feeds a Map task, and the Map outputs feed the Reduce tasks.]
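A rough sketch of the split, map, and reduce flow, assuming a thread pool stands in for the multiple processors. The helper names (`make_splits`, `map_split`, `reduce_max`, `run_job`), the "key,value" record format, and the per-key maximum problem are illustrative assumptions, not taken from the slides.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def make_splits(records, n):
    """Divide the data collection into at most n roughly equal splits."""
    size = max(1, -(-len(records) // n))  # ceiling division, at least 1
    return [records[i:i + size] for i in range(0, len(records), size)]


def map_split(split):
    """Map task: parse each "key,value" record into a <key, value> pair."""
    pairs = []
    for record in split:
        key, value = record.split(",")
        pairs.append((key, int(value)))
    return pairs


def reduce_max(key, values):
    """Reduce task: collapse all values for one key into a single result."""
    return key, max(values)


def run_job(records, n_splits=3):
    splits = make_splits(records, n_splits)
    # One map task per split, run concurrently.
    with ThreadPoolExecutor(max_workers=n_splits) as pool:
        mapped = list(pool.map(map_split, splits))
    # Group the map outputs by key before reducing.
    groups = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)
    return dict(reduce_max(k, v) for k, v in groups.items())


if __name__ == "__main__":
    print(run_job(["a,1", "b,5", "a,7", "b,2", "c,3"]))
```

Each map task sees only its own split, so the tasks need no coordination until the grouping step.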
8. Data flow
A MapReduce job is a unit of work that the client wants performed.
It consists of the input data, the MapReduce program, and the configuration information.
The tasks are scheduled by YARN and run on nodes in the cluster.
If a task fails, it is automatically rescheduled to run on a different node.
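The failure handling described above can be sketched as a toy scheduler. This is not the YARN API: `Job`, `TaskFailed`, `schedule`, `flaky_run_task`, the node names, and the `max_attempts` setting are all hypothetical, and the round-robin node choice is only one simple way to pick a different node on retry.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Job:
    """A job bundles the input data, the program, and the configuration."""
    input_splits: list
    program: Callable
    config: dict = field(default_factory=dict)


class TaskFailed(Exception):
    """Raised by a task runner when a task cannot complete on a node."""


def schedule(job, nodes, run_task):
    """Run one task per split; after a failure, retry on a different node."""
    max_attempts = job.config.get("max_attempts", 4)  # assumed retry limit
    results = []
    for i, split in enumerate(job.input_splits):
        for attempt in range(max_attempts):
            # Each retry moves on to the next node in the list.
            node = nodes[(i + attempt) % len(nodes)]
            try:
                results.append(run_task(node, job.program, split))
                break
            except TaskFailed:
                continue
        else:
            raise RuntimeError(f"task {i} failed {max_attempts} times")
    return results


def flaky_run_task(node, program, split):
    """Hypothetical runner in which one node is always down."""
    if node == "node-1":
        raise TaskFailed(f"{node} is down")
    return program(split)


if __name__ == "__main__":
    job = Job(input_splits=[[1, 2], [3, 4], [5]], program=sum)
    print(schedule(job, ["node-0", "node-1", "node-2"], flaky_run_task))
```

The second task is first placed on the failing node and then completes on the next one, so the job still returns a result for every split.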
9. Data flow (contd.)
A good split size is the size of an HDFS block, which is 128 MB by default.
If the splits are too small, there are too many of them, and the overhead of managing the splits and creating map tasks begins to dominate the total job execution time.
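A back-of-the-envelope sketch of why very small splits hurt, assuming each map task pays a fixed startup cost. The 1 second per-task overhead and 100 MB/s processing rate are made-up illustrative numbers, and the function names are hypothetical.

```python
import math

MB = 1024 * 1024


def num_splits(file_size, split_size=128 * MB):
    """Number of splits (and so map tasks) needed to cover the file."""
    return math.ceil(file_size / split_size)


def overhead_fraction(file_size, split_size,
                      task_overhead_s=1.0, throughput_mb_s=100.0):
    """Fraction of total task time, summed over tasks, spent on overhead."""
    overhead = num_splits(file_size, split_size) * task_overhead_s
    processing = (file_size / MB) / throughput_mb_s
    return overhead / (overhead + processing)


if __name__ == "__main__":
    one_gb = 1024 * MB
    for split_mb in (128, 16, 1):
        n = num_splits(one_gb, split_mb * MB)
        frac = overhead_fraction(one_gb, split_mb * MB)
        print(f"{split_mb:>4} MB splits: {n:>5} tasks, "
              f"{frac:.0%} of task time is overhead")
```

Under these assumed numbers the processing time stays fixed while the overhead grows in proportion to the number of splits, which is the effect the slide describes.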
10. Scope of MapReduce
Data size: small
  Pipelined: Instruction level
  Concurrent: Thread level
  Service: Object level
  Indexed: File level
  Mega: Block level
  Virtual: System level
Data size: large
11. Summary
We introduced the MapReduce programming model for processing large-scale data.
We discussed the supporting Hadoop Distributed File System.
The concepts were illustrated using a simple example.
We reviewed some important parts of the source code for the example.
Relationship to Cloud Computing