MapReduce
Submitted By
GAURAV BISWAS
Contents
Traditional Big Data Processing Approach
MapReduce
Word Count Problem
Reduce Operation
Data Flow
Scope of MapReduce
Summary
Traditional Big Data Processing Approach
MapReduce
MapReduce is a programming framework that lets us process large data sets
in parallel across a distributed environment.
MapReduce
Word Counter Problem
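The word count problem can be sketched in a few lines of plain Python. This is a minimal single-process illustration of the map, shuffle, and reduce steps, not Hadoop code; the function names and the sample sentences are assumptions made up for the example.

```python
from collections import defaultdict

def map_words(line):
    """Map step: emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group emitted values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_counts(word, counts):
    """Reduce step: sum the counts collected for one word."""
    return word, sum(counts)

def word_count(lines):
    """Run map over every line, shuffle by word, then reduce each group."""
    pairs = [pair for line in lines for pair in map_words(line)]
    return dict(reduce_counts(w, c) for w, c in shuffle(pairs).items())

# Hypothetical sample input, chosen only for illustration.
sample = ["Deer Bear River", "Car Car River", "Deer Car Bear"]
result = word_count(sample)
print(result)
```

In a real cluster the map calls would run on different nodes and the shuffle would move data over the network; here everything runs in one process so the data flow is easy to follow.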
Reduce Operation
MAP: Input data → <key, value> pair
REDUCE: <key, value> pair → <result>
[Diagram: the data collection is split into split 1, split 2, ..., split n to supply multiple processors. Each split feeds a Map task, and the Map outputs feed the Reduce tasks.]
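The split, map, and reduce stages of this slide could look roughly like the sketch below. It assumes a thread pool as a stand-in for "multiple processors" and uses a made-up task (summing numbers by parity); the names `make_splits`, `map_split`, `reduce_key`, and `run` are hypothetical and not part of any Hadoop API.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def make_splits(records, n):
    """Divide the data collection into at most n roughly equal splits."""
    size = max(1, -(-len(records) // n))  # ceiling division, at least 1
    return [records[i:i + size] for i in range(0, len(records), size)]

def map_split(split):
    """MAP: input data -> <key, value> pairs (key is the parity label)."""
    return [("even" if x % 2 == 0 else "odd", x) for x in split]

def reduce_key(key, values):
    """REDUCE: <key, value> pairs -> <result> (here, a per-key sum)."""
    return key, sum(values)

def run(records, n_splits=3):
    splits = make_splits(records, n_splits)
    # One worker per split stands in for the multiple processors.
    with ThreadPoolExecutor(max_workers=n_splits) as pool:
        mapped = list(pool.map(map_split, splits))
    # Group values by key before handing them to the reducers.
    groups = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)
    return dict(reduce_key(k, v) for k, v in groups.items())

totals = run(list(range(1, 11)))
print(totals)
```

Each map call sees only its own split, which is what allows the splits to be processed independently and in parallel.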
Data Flow
A MapReduce job is a unit of work that the client wants performed.
It consists of the input data, the MapReduce program, and the configuration information.
The tasks are scheduled by YARN and run on nodes in the cluster.
If a task fails, it is automatically rescheduled to run on a different node.
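The idea of a job (input, program, configuration) and of rescheduling a failed task on another node can be illustrated with a toy simulation. This is only a sketch under stated assumptions: it does not use YARN or any Hadoop API, and `Job`, `run_job`, the node names, and the `max_attempts` setting are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Callable

class NodeFailure(Exception):
    """Raised when a simulated node cannot run a task."""

@dataclass
class Job:
    input_splits: list                    # the input data, already split
    program: Callable[[Any], Any]         # the code to run on each split
    config: dict = field(default_factory=dict)  # configuration information

def run_on_node(node, broken_nodes, program, split):
    """Run one task on one node; broken nodes always fail (simulation only)."""
    if node in broken_nodes:
        raise NodeFailure(node)
    return program(split)

def run_job(job, nodes, broken_nodes=frozenset()):
    """Schedule one task per split; on failure, retry on the next node."""
    max_attempts = job.config.get("max_attempts", 2)
    results, placements = [], []
    for i, split in enumerate(job.input_splits):
        for attempt in range(max_attempts):
            node = nodes[(i + attempt) % len(nodes)]  # a different node per attempt
            try:
                results.append(run_on_node(node, broken_nodes, job.program, split))
                placements.append(node)
                break
            except NodeFailure:
                continue
        else:
            raise RuntimeError(f"task {i} failed after {max_attempts} attempts")
    return results, placements

job = Job(input_splits=[[1, 2], [3, 4], [5, 6]], program=sum,
          config={"max_attempts": 2})
results, placements = run_job(job, nodes=["node-a", "node-b", "node-c"],
                              broken_nodes={"node-b"})
print(results, placements)
```

In this run the second task is first placed on the broken node and then rescheduled on the next one, so the job still completes.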
Contd...
A good split size is the size of an HDFS block, i.e. 128 MB by default.
If the splits are too small, there are many of them, and the overhead of managing
the splits and creating the map tasks begins to dominate the total job
execution time.
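A quick calculation shows why split size matters. The sketch below assumes one map task per split and a hypothetical fixed overhead of 1 second per task; the 10 GB file size and the overhead figure are illustrative assumptions, not measurements.

```python
MB = 1024 * 1024
HDFS_BLOCK = 128 * MB  # default HDFS block size cited on the slide

def num_splits(file_size, split_size=HDFS_BLOCK):
    """Number of splits (and map tasks) needed to cover the file."""
    return -(-file_size // split_size)  # integer ceiling division

def total_overhead_seconds(file_size, split_size, per_task_overhead_s=1.0):
    """Total task-creation overhead, assuming a fixed cost per map task."""
    return num_splits(file_size, split_size) * per_task_overhead_s

file_size = 10 * 1024 * MB  # hypothetical 10 GB input file

for split_size in (HDFS_BLOCK, 1 * MB):
    print(split_size // MB, "MB splits:",
          num_splits(file_size, split_size), "map tasks,",
          total_overhead_seconds(file_size, split_size), "s of overhead")
```

With block-sized splits the file needs 80 map tasks; with 1 MB splits it needs 10240, so under the assumed per-task cost the overhead grows by the same factor of 128.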
Scope of MapReduce
Pipelined    Instruction level
Concurrent   Thread level
Service      Object level
Indexed      File level
Mega         Block level
Virtual      System level
Data size: small to large
Summary
We introduced the MapReduce programming model for processing large-scale data.
We discussed the supporting Hadoop Distributed File System.
The concepts were illustrated using a simple example.
We reviewed some important parts of the source code for the example.
Relationship to Cloud Computing
Thank you
