
Unit 2, Topic 4: MapReduce


MapReduce Framework and Basics

Dr. Anil Kumar Dubey


Associate Professor,
Computer Science & Engineering Department,
ABES EC, Ghaziabad
Affiliated to Dr. A.P.J. Abdul Kalam Technical University, Uttar
Pradesh, Lucknow
Basics of MapReduce
MapReduce is a processing technique and a programming model for distributed computing, originally implemented in Java.

The algorithm contains two important tasks:

 Map
 Reduce

The Reduce task is always performed after the Map task.

Conti…
◦ The Map task takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs).
◦ The Reduce task takes the output from a Map as input and combines those data tuples into a smaller set of tuples.
Example: A Word Count
Let us have a text file called sample.txt whose contents are as follows:
Dear, Bear, River, Car, Car, River, Deer, Car and Bear

Now, suppose we have to perform a word count on sample.txt using MapReduce.

So, we will be finding the unique words and the number of occurrences of those unique words.
Conti…
First, divide the input into three splits as shown in the figure. This will distribute the work among all the map nodes.

Then, tokenize the words in each of the mappers and give a hardcoded value (1) to each of the tokens or words.

The rationale behind giving a hardcoded value equal to 1 is that every word, in itself, will occur once.
Conti…
Now, a list of key-value pairs will be created where the key is nothing but the individual word and the value is one. So, for the first line (Dear Bear River) three key-value pairs are produced:
◦ Dear, 1
◦ Bear, 1
◦ River, 1
The mapping process remains the same on all the nodes.

After the mapper phase, a partition process takes place where sorting and shuffling happen so that all the tuples with the same key are sent to the corresponding reducer.
Conti…
After the sorting & shuffling phase, each reducer will have a unique key and a list of values corresponding to that very key, for example Bear, [1,1]; Car, [1,1,1]; etc.
Now, each Reducer counts the values present in its list of values. As shown in the figure, the reducer gets the list of values [1,1] for the key Bear. Then, it counts the number of ones in the list and gives the final output as Bear, 2.
Finally, all the output key/value pairs are collected and written into the output file.
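To make this concrete, here is a minimal word-count job in Python using the mrjob library (the same library used in the MovieLens example later in these slides). The class name, file name, and punctuation handling are illustrative assumptions, not the official solution from these slides:

from mrjob.job import MRJob

# Minimal word-count sketch (illustrative). Run as:
#   python word_count.py sample.txt
class WordCount(MRJob):

    # Mapper: tokenize the line and emit (word, 1) for every token
    def mapper(self, _, line):
        for word in line.replace(',', ' ').split():
            yield word.lower(), 1

    # Reducer: receives (word, [1, 1, ...]) and sums the ones
    def reducer(self, word, counts):
        yield word, sum(counts)

if __name__ == '__main__':
    WordCount.run()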
Assignment 1
Apply word count using MapReduce to the following text:
The Department of Computer Science and Engineering at ABES Engineering College, Ghaziabad was established in the year 2000.
Benefits of MapReduce
Fault-tolerance
 If a machine carrying a few data blocks fails in the middle of a MapReduce job, the architecture handles the failure.
 It considers replicated copies of the blocks on alternate machines for further processing.

Resilience
 Each node periodically updates its status to the master node.
 If a slave node doesn't send its notification, the master node reassigns the currently running task of that slave node to other available nodes in the cluster.
Conti…
Quick
 Data processing is quick as MapReduce uses HDFS as the storage system.
 MapReduce takes minutes to process terabytes of unstructured data.

Parallel Processing
 MapReduce divides the job among multiple nodes, and each node works on a part of the job simultaneously.
 MapReduce is based on the Divide and Conquer paradigm, which helps us to process the data using different machines.
 As the data is processed by multiple machines in parallel instead of a single machine, the time taken to process the data is reduced by a tremendous amount.
Conti…
Availability
 Multiple replicas of the same data are sent to numerous nodes in the network.
 Thus, in case of any failure, other copies are readily available for processing without any loss.

Scalability
 Hadoop is a highly scalable platform.
 Traditional RDBMS systems do not scale with increases in data volume.
 MapReduce lets you run applications across a huge number of nodes, using terabytes and petabytes of data.
MapReduce Framework
A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner.

The framework sorts the outputs of the maps, which are then input to the reduce tasks.

Typically both the input and the output of the job are stored in a file-system.
Conti…
How MapReduce works
MapReduce can perform distributed and parallel computations using large datasets across a large number of nodes.

A MapReduce job usually splits the input datasets, and each split is then processed independently by the Map tasks in a completely parallel manner.

The output is then sorted and becomes the input to the Reduce tasks.
Conti…
Both job input and output are stored in file systems.

Tasks are scheduled and monitored by the framework.

The MapReduce architecture contains two core components as daemon services responsible for running the mapper and reducer tasks, monitoring them, and re-executing tasks on failure. From Hadoop 2 onwards, the Resource Manager and Node Manager are these daemon services.
Conti…
When the job client submits a MapReduce job, these daemons come into action. They are also responsible for the parallel processing and fault-tolerance features of MapReduce jobs.

From Hadoop 2 onwards, the resource management and job scheduling/monitoring functionalities are segregated by YARN (Yet Another Resource Negotiator) into different daemons.
Conti…
Compared to Hadoop 1 with its Job Tracker and Task Tracker, Hadoop 2 contains a global Resource Manager (RM) and an Application Master (AM) for each application.

The Job Client submits the job to the Resource Manager.

The YARN Resource Manager's scheduler is responsible for coordinating the allocation of cluster resources among the running applications.
Conti…
YARN Node Manager runs on each node and does node-
level resource management, coordinating with the
Resource manager. It launches and monitors the compute
containers on the machine on the cluster.

ApplicationMaster helps the resources from Resource


Manager and use Node Manager to run and coordinate
MapReduce tasks.

HDFS is usually used to share the job files between other


entities.
Conti…
Phases of the MapReduce model
The MapReduce model has three major phases and one optional phase:
 Mapper
 Shuffle and Sort
 Reducer
 Combiner
Conti…
Mapper
It is the first phase of MapReduce programming and contains the coding logic of the mapper function.
The conditional logic is applied across the ‘n’ data blocks spread over the various data nodes.
The mapper function accepts key-value pairs (k, v) as input, where the key represents the offset address of each record and the value represents the entire record content.
The output of the Mapper phase will also be in key-value format, as (k’, v’).
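As a minimal sketch in plain Python (illustrative names; the framework supplies the offset key), a mapper is just a function from one input (k, v) pair to zero or more (k’, v’) pairs:

# The framework supplies (k, v) = (record offset, record content);
# the mapper yields (k', v') pairs, here (word, 1) as in word count.
def mapper(offset, record):
    for word in record.split():
        yield word, 1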
Conti…
Shuffle and Sort
The output of the various mappers, (k’, v’), then goes into the Shuffle and Sort phase.
All the intermediate pairs are grouped based on their keys, so each distinct key appears once with the collection of all its values.
The output of the Shuffle and Sort phase will be key-value pairs again, as a key and an array of values: (k, v[]).
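Conceptually, the grouping done by Shuffle and Sort can be simulated in plain Python (a local sketch of the idea, not how Hadoop actually implements the phase):

from itertools import groupby
from operator import itemgetter

# Simulated mapper output: (k', v') pairs from the word-count example
pairs = [('Dear', 1), ('Bear', 1), ('River', 1),
         ('Car', 1), ('Car', 1), ('River', 1),
         ('Deer', 1), ('Car', 1), ('Bear', 1)]

# Sort by key, then group all values under each distinct key: (k, v[])
pairs.sort(key=itemgetter(0))
grouped = {k: [v for _, v in group]
           for k, group in groupby(pairs, key=itemgetter(0))}
print(grouped)
# {'Bear': [1, 1], 'Car': [1, 1, 1], 'Dear': [1], 'Deer': [1], 'River': [1, 1]}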
Conti…
Reducer
The output of the Shuffle and Sort phase, (k, v[]), will be the input of the Reducer phase.
In this phase the reducer function’s logic is executed and all the values are aggregated against their corresponding keys.
The Reducer consolidates the outputs of the various mappers and computes the final job output.
The final output is then written into a single file in an output directory of HDFS.
Conti…
Combiner
 It is an optional phase in the MapReduce model.
 The combiner phase is used to optimize the performance of MapReduce jobs.
 In this phase, the various outputs of the mappers are locally reduced at the node level.
 For example, if different mapper outputs (k, v) coming from a single node contain the same key, they get combined, i.e. locally reduced, into a single output for that key.
 This phase makes the Shuffle and Sort phase work even quicker, thereby enabling additional performance in MapReduce jobs; see the sketch below.
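In mrjob (used later in these slides for the MovieLens example), a combiner is declared alongside the mapper and reducer. This word-count variant is a sketch with illustrative names:

from mrjob.job import MRJob

class WordCountWithCombiner(MRJob):

    # Mapper: emit (word, 1) per token
    def mapper(self, _, line):
        for word in line.split():
            yield word.lower(), 1

    # Combiner: locally sums the counts produced on one node,
    # shrinking the data that Shuffle and Sort must move
    def combiner(self, word, counts):
        yield word, sum(counts)

    # Reducer: sums the partial counts coming from the combiners
    def reducer(self, word, counts):
        yield word, sum(counts)

if __name__ == '__main__':
    WordCountWithCombiner.run()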
Conti…
Class Discussion Query
Welcome to Hadoop
Class Hadoop is good
Hadoop is bad

Mango Banana Orange Apple Mango Orange Grapes Pineapple
Pomegranate Papaya Apple Orange Cherry Mango Papaya
Conti…
Example
Let us assume we have employee data in four different files − A, B, C, and D. Let us also assume there are duplicate employee records in all four files because the employee data was imported from all the database tables repeatedly. See the following illustration.
Conti…
The Map phase processes each input file and provides
the employee data in key-value pairs (<k, v> : <emp
name, salary>). See the following illustration.
Conti…
The Combiner phase (a searching technique) will accept the input from the Map phase as key-value pairs of employee name and salary. Using the searching technique, the combiner will check all the employee salaries to find the highest-salaried employee in each file. See the following snippet.
Conti…
 Expected result (shown in the figure)
Conti…
Reducer phase − From each file, you will find the highest-salaried employee. To avoid redundancy, check all the <k, v> pairs and eliminate duplicate entries, if any. The same algorithm is used between the four <k, v> pairs coming from the four input files. The final output should be as follows −
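A minimal sketch of this highest-salary pattern in Python with mrjob; the input format (one 'name<TAB>salary' record per line) and all names here are assumptions for illustration:

from mrjob.job import MRJob

class HighestSalary(MRJob):

    # Mapper: parse 'name<TAB>salary' and emit everything under one
    # shared key so the records can be compared against each other
    def mapper(self, _, line):
        name, salary = line.split('\t')
        yield 'max', (int(salary), name)

    # Combiner: local maximum per node (the "searching technique")
    def combiner(self, key, records):
        yield key, max(records)

    # Reducer: global maximum across all input files
    def reducer(self, key, records):
        salary, name = max(records)
        yield name, salary

if __name__ == '__main__':
    HighestSalary.run()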
Example
Consider the following MovieLens dataset and find out how many movies each user rated, using MapReduce.
USER_ID MOVIE_ID RATING
196 242 3
186 302 3
196 377 1
244 51 2
166 346 1
186 474 4
186 265 2
Conti…
Step 1: First we have to map the values; this happens in the 1st phase of the MapReduce model.
196:242 ; 186:302 ; 196:377 ; 244:51 ; 166:346 ; 186:474 ; 186:265

Step 2: After mapping, we have to shuffle and sort the values.
166:346 ; 186:302,474,265 ; 196:242,377 ; 244:51

Step 3: After completing Steps 1 and 2, we have to reduce each key's values by counting them:
166:1 ; 186:3 ; 196:2 ; 244:1
Conti… (CODE)
from mrjob.job import MRJob
from mrjob.step import MRStep

class RatingsBreak(MRJob):
    def steps(self):
        return [
            MRStep(mapper=self.mapper_get_ratings,
                   reducer=self.reducer_count_ratings)
        ]

    # MAPPER CODE: emit (user_id, 1) for every rating record
    # (the full MovieLens u.data file has a fourth, timestamp column)
    def mapper_get_ratings(self, _, line):
        (user_id, movie_id, rating, timestamp) = line.split('\t')
        yield user_id, 1

    # REDUCER CODE: sum the ones to count the movies rated per user
    def reducer_count_ratings(self, key, values):
        yield key, sum(values)

if __name__ == '__main__':
    RatingsBreak.run()
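Assuming the job above is saved as ratings_count.py (an illustrative file name) and run locally against the MovieLens u.data file, a typical mrjob invocation is:

python ratings_count.py u.data

Each output line is then a user id together with the number of movies that user rated.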
Example 3
Write pseudocode for MapReduce.
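One possible answer, written as Python-flavoured pseudocode for the generic word-count pattern (emit stands for the framework's output call; this is a sketch, not the only acceptable solution):

# Pseudocode: the generic MapReduce word-count pattern
def map(key, value):        # key: record offset, value: record text
    for word in value.split():
        emit(word, 1)       # intermediate (k', v') pair

def reduce(key, values):    # key: a word, values: list of its counts
    emit(key, sum(values))  # final (k, v) pair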
THANK YOU
