
Unit 2 Topic 4 Map Reduce


Map Reduce framework and basics

Dr. Anil Kumar Dubey

Associate Professor,
Computer Science & Engineering Department,
ABES EC, Ghaziabad
Affiliated to Dr. A.P.J. Abdul Kalam Technical University, Uttar
Pradesh, Lucknow
Basics of MapReduce
MapReduce is a processing technique and a programming model for
distributed computing; Hadoop's implementation is based on Java.

The algorithm contains two important tasks:

• Map
• Reduce

Reduce task is always performed after the map job.

◦ Map takes a set of data and converts it into another set of data,
where individual elements are broken down into tuples
(key/value pairs).
◦ Reduce takes the output from a map as its input
and combines those data tuples into a smaller set of tuples.
Example: A Word Count
Let us have a text file called example.txt whose contents
are as follows:
Dear, Bear, River, Car, Car, River, Deer, Car and Bear

Now suppose we have to perform a word count on
example.txt using MapReduce.

So we will find the unique words and the number of
occurrences of each unique word.
First, divide the input into three splits as shown in the
figure. This distributes the work among all the map nodes.

Then, tokenize the words in each of the mappers and
give a hardcoded value (1) to each of the tokens or words.

The rationale behind giving a hardcoded value equal to
1 is that every word, in itself, occurs once.
Now, a list of key-value pairs is created where the key
is the individual word and the value is one. So,
the first line (Dear Bear River) gives 3 key-value pairs:
◦ Dear, 1
◦ Bear, 1
◦ River, 1
The mapping process remains the same on all the nodes.

After the mapper phase, a partition process takes place
in which sorting and shuffling happen, so that all the tuples
with the same key are sent to the same reducer.
After the sorting and shuffling phase, each reducer will have a
unique key and a list of values corresponding to that
key. For example: Bear, [1,1]; Car, [1,1,1]; etc.
Now, each reducer counts the values present
in that list of values. As shown in the figure, a reducer gets
the list of values [1,1] for the key Bear. It then
counts the number of ones in the list and gives the
final output: Bear, 2.
Finally, all the output key/value pairs are then collected
and written in the output file.
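
The walk-through above can be traced in a few lines of plain Python. This is only a single-process sketch, not a Hadoop job: the function names are illustrative, and the contents of the second and third splits are an assumption (the slides only spell out the first line, Dear Bear River).

```python
from collections import defaultdict

# Three input splits; only the first is given explicitly in the slides,
# the other two are assumed from the order of the words in the file.
SPLITS = ["Dear Bear River", "Car Car River", "Deer Car Bear"]


def map_split(split):
    """Map: tokenize one split and emit (word, 1) for every token."""
    return [(word, 1) for word in split.split()]


def shuffle_and_sort(mapped_pairs):
    """Shuffle and sort: group all values by key, keys in sorted order."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_counts(key, values):
    """Reduce: count the ones collected for a single key."""
    return key, sum(values)


def word_count(splits):
    mapped = [pair for split in splits for pair in map_split(split)]
    grouped = shuffle_and_sort(mapped)
    return [reduce_counts(key, values) for key, values in grouped.items()]


result = word_count(SPLITS)
print(result)
# [('Bear', 2), ('Car', 3), ('Dear', 1), ('Deer', 1), ('River', 2)]
```

Note that Dear and Deer are counted separately because the sample text spells them differently.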
Assignment 1
Apply word count using map reduce
The Department of Computer Science and Engineering at ABES
Engineering College Ghaziabad was established in the year 2000.
Benefits of MapReduce
• If a machine carrying a few data blocks fails in the middle of a
MapReduce job, the architecture handles the failure.
• It uses the replicated copies of those blocks on other machines for
further processing.

• Each node periodically reports its status to the master node.
• If a slave node doesn't send its notification, the master node reassigns
the tasks currently running on that slave node to other available nodes
in the cluster.
• Data processing is quick because MapReduce uses HDFS as its storage system.
• MapReduce takes minutes to process large volumes (terabytes) of
unstructured data.
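
The status-update and reassignment idea in the list above can be sketched as a toy model. This is not Hadoop's actual scheduling code; the function name, the node and task names, and the 10-second timeout are all assumptions made for illustration.

```python
def reassign_failed_tasks(last_heartbeat, assignments, now, timeout=10.0):
    """Return a new task -> node mapping with tasks moved off silent nodes.

    last_heartbeat: node -> time of its latest status update (seconds)
    assignments:    task -> node currently running it
    """
    alive = sorted(n for n, t in last_heartbeat.items() if now - t <= timeout)
    if not alive:
        raise RuntimeError("no live nodes left to take over tasks")
    new_assignments = {}
    moved = 0
    for task, node in assignments.items():
        if node in alive:
            new_assignments[task] = node
        else:
            # Round-robin the orphaned tasks over the nodes still reporting in.
            new_assignments[task] = alive[moved % len(alive)]
            moved += 1
    return new_assignments


# Hypothetical cluster state: node-b last reported 19 seconds ago.
heartbeats = {"node-a": 100.0, "node-b": 82.0, "node-c": 99.0}
tasks = {"map-0": "node-a", "map-1": "node-b", "map-2": "node-c"}
print(reassign_failed_tasks(heartbeats, tasks, now=101.0))
# {'map-0': 'node-a', 'map-1': 'node-a', 'map-2': 'node-c'}
```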

Parallel Processing
• MapReduce divides the job among multiple nodes, and each node
works on its part of the job simultaneously.
• MapReduce is based on the divide-and-conquer paradigm, which helps us to
process the data using different machines.
• As the data is processed in parallel by multiple machines instead of a
single machine, the time taken to process the data is reduced by a tremendous
amount.
• Multiple replicas of the same data are sent to numerous nodes in the
cluster.
• Thus, in case of any failure, other copies are readily available for
processing without any loss.

• Hadoop is a highly scalable platform.
• Traditional RDBMS systems do not scale as readily with the increase
in data volume.
• MapReduce lets you run applications across a huge number of nodes,
using terabytes and petabytes of data.
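
As a rough illustration of the divide-and-conquer point above, the sketch below hands each chunk of text to a separate worker and then merges the partial results. Threads stand in for cluster nodes here, which is an assumption for the sake of a runnable example; the function names are illustrative.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_words(chunk):
    """Work done by one 'node': count the words in its own chunk."""
    return Counter(chunk.split())


def parallel_word_count(chunks, workers=3):
    # Divide: each chunk is handed to a separate worker.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = list(pool.map(count_words, chunks))
    # Conquer: merge the partial results into one answer.
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total


chunks = ["Hadoop is good", "Hadoop is bad", "Welcome to Hadoop"]
print(parallel_word_count(chunks)["Hadoop"])  # prints 3
```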
Map Reduce Framework
A MapReduce job usually splits the input data-set into
independent chunks, which are processed by the map
tasks in a completely parallel manner.

The framework sorts the outputs of the maps, which are
then input to the reduce tasks.

Typically both the input and the output of the job are
stored in a file-system.
How Map Reduce works
MapReduce can perform distributed and parallel
computations on large datasets across a large number
of nodes.

A MapReduce job usually splits the input dataset, and
the map tasks then process each split independently
in a completely parallel manner.

The output is then sorted and passed as input to the reduce tasks.

Both job input and output are stored in file systems.

Tasks are scheduled and monitored by the framework.

The MapReduce architecture contains two core components,
daemon services responsible for running mapper and
reducer tasks, monitoring them, and re-executing the tasks on
failure. From Hadoop 2 onwards, the Resource Manager and
Node Manager are the daemon services.
When the job client submits a MapReduce job, these
daemons come into action. They are also responsible for
the parallel processing and fault-tolerance features of
MapReduce jobs.

From Hadoop 2 onwards, the resource management and job
scheduling or monitoring functionalities are separated
by YARN (Yet Another Resource Negotiator) into different
daemons.
Compared to Hadoop 1 with its Job Tracker and Task
Tracker, Hadoop 2 contains a global Resource Manager
(RM) and an Application Master (AM) for each
application.
The job client submits the job to the Resource Manager.

The YARN Resource Manager's scheduler is responsible for
coordinating the allocation of cluster resources
among the running applications.
The YARN Node Manager runs on each node and does
node-level resource management, coordinating with the
Resource Manager. It launches and monitors the compute
containers on its machine in the cluster.

The ApplicationMaster negotiates resources from the Resource
Manager and uses the Node Managers to run and coordinate
MapReduce tasks.

HDFS is usually used to share the job files between the other components.

Phases of the MapReduce model
The MapReduce model has three major phases and one optional
phase:
• Mapper
• Shuffle and Sort
• Reducer
• Combiner (optional)
Mapper
It is the first phase of MapReduce programming and
contains the coding logic of the mapper function.
The mapper's logic is applied to the 'n' data
blocks spread across various data nodes.
The mapper function accepts key-value pairs (k, v) as input,
where the key represents the offset address of each
record and the value represents the entire record content.
The output of the Mapper phase is also in key-value
format, as (k', v').
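
To make the (k, v) input concrete, here is a minimal sketch in which the key is the byte offset of each line and the value is the line itself. It assumes newline-delimited text records and a word-count mapper; read_records and mapper are illustrative names, not Hadoop APIs.

```python
def read_records(data: bytes):
    """Yield (byte offset, line text) pairs, the (k, v) input of a mapper."""
    offset = 0
    for raw_line in data.splitlines(keepends=True):
        yield offset, raw_line.decode("utf-8").rstrip("\r\n")
        offset += len(raw_line)


def mapper(offset, record):
    """Emit (k', v') = (word, 1); the offset key is not needed for word count."""
    for word in record.split():
        yield word, 1


data = b"Dear Bear River\nCar Car River\n"
records = list(read_records(data))
print(records)  # [(0, 'Dear Bear River'), (16, 'Car Car River')]

pairs = [kv for k, v in records for kv in mapper(k, v)]
print(pairs[:3])  # [('Dear', 1), ('Bear', 1), ('River', 1)]
```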
Shuffle and Sort
The outputs of the various mappers, (k', v'), then go into the
Shuffle and Sort phase.
Duplicate keys are merged: all the values that share the
same key are grouped together. The values themselves are kept.
The output of the Shuffle and Sort phase is again key-value
pairs, as a key and an array of values (k, v[]).
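
A minimal sketch of this grouping step, assuming three mappers whose outputs are plain Python lists (the mapper outputs below are made-up sample data). Each key appears once in the result, while repeated values such as the three 1s for Car stay in the list.

```python
from itertools import groupby
from operator import itemgetter


def shuffle_and_sort(mapper_outputs):
    """Merge the (k', v') lists of all mappers into sorted (k, v[]) pairs."""
    all_pairs = [pair for output in mapper_outputs for pair in output]
    all_pairs.sort(key=itemgetter(0))  # groupby needs equal keys to be adjacent
    return [(key, [value for _, value in group])
            for key, group in groupby(all_pairs, key=itemgetter(0))]


# Hypothetical outputs of three mappers.
mapper_1 = [("Bear", 1), ("River", 1)]
mapper_2 = [("Car", 1), ("Car", 1), ("River", 1)]
mapper_3 = [("Car", 1), ("Bear", 1)]
print(shuffle_and_sort([mapper_1, mapper_2, mapper_3]))
# [('Bear', [1, 1]), ('Car', [1, 1, 1]), ('River', [1, 1])]
```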
Reducer
The output of the Shuffle and Sort phase (k, v[]) is
the input of the Reducer phase.
In this phase the reducer function's logic is executed and all
the values are aggregated against their corresponding
keys.
The reducer consolidates the outputs of the various mappers and
computes the final job output.
The final output is then written to an output directory of HDFS
(a single file when the job runs one reducer).
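
When a job runs more than one reducer, each key is routed to exactly one of them and each reducer produces its own output. The sketch below imitates that with a stable hash. It is a simplification: in Hadoop the partition is chosen on the map side, and the CRC32 hash and function names used here are assumptions for illustration.

```python
import zlib


def partition(key, num_reducers):
    """Pick the reducer for a key (a stable hash, so equal keys always meet)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def run_reducers(grouped_pairs, num_reducers=2):
    """Return one output list per reducer, like one output file per reducer."""
    outputs = [[] for _ in range(num_reducers)]
    for key, values in grouped_pairs:
        # The reduce logic here is a sum, as in the word count example.
        outputs[partition(key, num_reducers)].append((key, sum(values)))
    return outputs


grouped = [("Bear", [1, 1]), ("Car", [1, 1, 1]), ("River", [1, 1])]
for index, part in enumerate(run_reducers(grouped)):
    print(f"reducer {index}: {part}")
```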
Combiner
• It is an optional phase in the MapReduce model.
• The combiner phase is used to optimize the performance of
MapReduce jobs.
• In this phase, the outputs of the mappers are locally
reduced at the node level.
• For example, if the mapper outputs (k, v) coming from a
single node contain duplicate keys, they are combined, i.e.
locally reduced, into a single (k, v) output per key.
• This makes the Shuffle and Sort phase quicker because less data
has to be moved, improving the performance of MapReduce jobs.
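
A small sketch of local reduction on one node, assuming word-count style (word, 1) pairs. Here the combiner folds the values of each key into one pair per key, so fewer pairs leave the node; the function name and sample data are illustrative.

```python
from collections import defaultdict


def combine(mapper_output):
    """Locally reduce one node's (k, v) pairs before they are shuffled."""
    local = defaultdict(int)
    for key, value in mapper_output:
        local[key] += value
    return sorted(local.items())


# Hypothetical output of the mappers on a single node.
node_output = [("Car", 1), ("Car", 1), ("River", 1)]
combined = combine(node_output)
print(combined)  # [('Car', 2), ('River', 1)]
print(len(node_output), "->", len(combined))  # 3 -> 2
```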
Class Discussion Query
Welcome to Hadoop Class
Hadoop is good
Hadoop is bad

Mango Banana Orange Apple Mango Orange Grapes Pineapple

Pomegranate Papaya Apple Orange Cherry Mango Papaya
Let us assume we have employee data in four different
files − A, B, C, and D. Let us also assume there are
duplicate employee records in all four files because of
importing the employee data from all database tables
repeatedly. See the following illustration.
The Map phase processes each input file and provides
the employee data in key-value pairs (<k, v> : <emp
name, salary>). See the following illustration.
The combiner phase (searching technique) accepts
the input from the Map phase as key-value pairs of
employee name and salary. Using the searching technique,
the combiner checks all the employee salaries to find
the highest-salaried employee in each file. See the
following snippet.
• Expected result
Reducer phase: From each file, you will find the
highest-salaried employee. To avoid redundancy, check
all the <k, v> pairs and eliminate duplicate entries, if
any. The same algorithm is then applied to the four <k,
v> pairs coming from the four input files. The
final output should be as follows:
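
The three phases of this example can be sketched as follows. The employee names and salaries are hypothetical, since the slides' illustrations are not reproduced here, and the function names are illustrative.

```python
def mapper(records):
    """Map: emit <employee name, salary> for every record in one file."""
    return [(name, salary) for name, salary in records]


def combiner(pairs):
    """Combiner: keep only the highest-salaried employee of one file."""
    return max(pairs, key=lambda pair: pair[1])


def reducer(per_file_maxima):
    """Reducer: drop duplicate entries, then apply the same max search."""
    unique = set(per_file_maxima)
    return max(unique, key=lambda pair: pair[1])


# Hypothetical records for files A, B, C and D, with duplicates across files.
files = {
    "A": [("asha", 26000), ("ravi", 45000)],
    "B": [("ravi", 45000), ("meena", 38000)],
    "C": [("kiran", 40000), ("asha", 26000)],
    "D": [("ravi", 45000), ("kiran", 40000)],
}
maxima = [combiner(mapper(records)) for records in files.values()]
print(reducer(maxima))  # ('ravi', 45000)
```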
Consider the following MovieLens dataset and find out how
many movies each user rated, using MapReduce.
196 242 3
186 302 3
196 377 1
244 51 2
166 346 1
186 474 4
186 265 2
Step 1: First we map the values; this happens in the 1st
phase of the MapReduce model.
196:242 ; 186:302 ; 196:377 ; 244:51 ; 166:346 ;
186:474 ; 186:265

Step 2: After mapping, we shuffle and sort the values.

166:346 ; 186:302,474,265 ; 196:242,377 ; 244:51

Step 3: After completing step 1 and step 2, we
reduce each key's values to a count.

166:1 ; 186:3 ; 196:2 ; 244:1
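
The three steps can be checked by hand with a plain-Python trace over the seven sample rows. This is a single-process sketch with illustrative function names, assuming whitespace-separated columns as printed in the sample.

```python
from collections import defaultdict

# The seven sample rows: user id, movie id, rating.
ROWS = """196 242 3
186 302 3
196 377 1
244 51 2
166 346 1
186 474 4
186 265 2"""


def map_ratings(rows):
    """Step 1: emit (user id, movie id) for every rating row."""
    for row in rows.splitlines():
        user_id, movie_id, _rating = row.split()
        yield user_id, movie_id


def shuffle_and_sort(pairs):
    """Step 2: group movie ids by user id, users in sorted order."""
    groups = defaultdict(list)
    for user_id, movie_id in pairs:
        groups[user_id].append(movie_id)
    return sorted(groups.items())


def reduce_counts(grouped):
    """Step 3: reduce each user's movie list to its length."""
    return [(user_id, len(movies)) for user_id, movies in grouped]


counts = reduce_counts(shuffle_and_sort(map_ratings(ROWS)))
print(counts)  # [('166', 1), ('186', 3), ('196', 2), ('244', 1)]
```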
Conti… (CODE)
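from mrjob.job import MRJob
from mrjob.step import MRStep


class RatingsBreak(MRJob):
    def steps(self):
        # One step: map each rating line to (user, 1), then sum per user.
        return [
            MRStep(mapper=self.mapper_get_ratings,
                   reducer=self.reducer_count_ratings)
        ]

    def mapper_get_ratings(self, _, line):
        # Assumes tab-separated lines: user id, movie id, rating, timestamp.
        (user_id, movie_id, rating, timestamp) = line.split('\t')
        yield user_id, 1

    def reducer_count_ratings(self, key, values):
        # values holds one 1 for every movie this user rated.
        yield key, sum(values)


if __name__ == '__main__':
    RatingsBreak.run()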
Example 3
Write pseudo code for map reduce
