
Chapter 4

MapReduce
Introduction
• MapReduce is a parallel programming model for processing huge
amounts of data.
• It is a programming paradigm that runs in the background of Hadoop
to provide scalability and easy data-processing solutions.
• It provides analytical capabilities for analyzing huge volumes of
complex data.
• It is a framework used to write applications that process large amounts
of data in parallel on large clusters of commodity hardware.
Parallel Processing
• Traditional Enterprise Systems normally have a centralized server to
store and process data. 
• Such a centralized model is not suitable for processing huge volumes of
data, which cannot be accommodated by standard database servers.
• A parallel approach instead splits the data into smaller parts or blocks
and stores them on different machines.
Challenges in traditional way
• Critical path problem
• Reliability problem
• Equal split issue
• Single split may fail
• Aggregation of the result
• The centralized system creates too much of a bottleneck while processing
multiple files simultaneously.
• To solve these issues, the MapReduce framework allows us to perform
such parallel computations without worrying about issues like
reliability, fault tolerance, etc.
MapReduce
• MapReduce is a processing technique and a programming model for distributed
computing.
• It makes it easy to distribute tasks across nodes and performs sort or merge
operations based on distributed computing.
• It gives you the flexibility to write code logic without caring about the design
issues of the system.
• It allows us to perform distributed and parallel processing on large data sets in
a distributed environment.
Architecture
It provides,
• Automatic parallelization and distribution
• Fault-tolerance
• I/O scheduling
• Monitoring & Status updates
How does it work?
• It contains two tasks
• Map
• Reduce
• Map task takes a set of data and converts it into another set of data, where
individual elements are broken down into tuples (key-value pairs).
• Reduce task takes the output from the Map as an input and combines those
data tuples (key-value pairs) into a smaller set of tuples.
• The reduce task is always performed after the map job.
• Input Phase - a Record Reader translates each record in the input file and sends the
parsed data to the mapper in the form of key-value pairs.
• Map - is a user-defined function, which takes a series of key-value pairs and processes
each one of them to generate zero or more key-value pairs.
• Intermediate Keys - key-value pairs generated by the mapper are known as intermediate
keys.
• Combiner - is a type of local Reducer that groups similar data from the map phase into
identifiable sets.
• Shuffle & Sort - downloads the grouped key-value pairs onto the local machine, where
the Reducer is running. The individual key-value pairs are sorted by key into a larger data
list. The data list groups the equivalent keys together so that their values can be iterated
easily in the Reducer task.
• Reducer - takes the grouped key-value paired data as input and runs a Reducer function
on each pair.
• Output Phase - an output formatter translates the final key-value pairs from
the Reducer function and writes them onto a file using a record writer.
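
A minimal driver sketch, assuming the standard Hadoop MapReduce (org.apache.hadoop.mapreduce) API, showing how these phases are wired together. The record reader, shuffle & sort, and record writer are handled by the framework; TokenizerMapper and IntSumReducer are placeholder class names that are defined in the word-count sketch later in this chapter.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(TokenizerMapper.class);   // Map phase
        job.setCombinerClass(IntSumReducer.class);   // Combiner: optional local Reducer
        job.setReducerClass(IntSumReducer.class);    // Reduce phase

        job.setOutputKeyClass(Text.class);           // types of the final key-value pairs
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // Input phase (record reader)
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // Output phase (record writer)

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}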
Example: A word count
• The words Dear, Bear, River, Car, Car, River, Deer, Car and Bear are stored in a
sample.txt file.
• We have to perform a word count on sample.txt using MapReduce.
• Divide the input into three splits as shown in the figure.

• Tokenize the words in each of the mappers and give a hardcoded value (1) to each of the tokens or words.

• A list of key-value pairs will be created where the key is nothing but the individual word and the value is one. For
the first line (Dear Bear River) we have 3 key-value pairs: Dear, 1; Bear, 1; River, 1. The mapping process remains
the same on all the nodes.

• After the mapper phase, a partition process takes place where sorting and shuffling happen so that all the
tuples with the same key are sent to the corresponding reducer.

• After the sorting and shuffling phase, each reducer will have a unique key and a list of values corresponding to
that very key. For example, Bear, [1,1]; Car, [1,1,1].., etc.

• Each Reducer counts the values present in its list of values. As shown in the figure, the reducer gets the list of
values [1,1] for the key Bear. It then counts the number of ones in the list and gives the
final output: Bear, 2.

• Finally, all the output key-value pairs are collected and written to the output file.
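
A minimal sketch of the mapper and reducer for this word count, using the standard Hadoop MapReduce API (the class names are illustrative; they plug into the driver sketch shown earlier).

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: emit (word, 1) for every token in the input line, e.g. "Dear Bear River"
// produces (Dear, 1), (Bear, 1), (River, 1).
public class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reduce: sum the list of ones for each word, e.g. Bear, [1,1] becomes (Bear, 2).
class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}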
Real-time Example
• Twitter receives around 500 million tweets per day, which is nearly 3000 tweets per second. The following illustration shows how
Twitter manages its tweets with the help of MapReduce.

• Tokenize − Tokenizes the tweets into maps of tokens and writes them as key-value pairs.

• Filter − Filters unwanted words from the maps of tokens and writes the filtered maps as key-value pairs.

• Count − Generates a token counter per word.

• Aggregate Counters − Prepares an aggregate of similar counter values into small manageable units.
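
Twitter's internal pipeline is not public, but a hedged sketch of the Tokenize and Filter steps as a single Hadoop mapper might look like the following (the class name and stop-word list are purely illustrative; the Count and Aggregate Counters steps would be ordinary reducers over the emitted pairs).

import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Tokenize each tweet and filter out unwanted words before emitting (word, 1).
public class TweetTokenFilterMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "is", "to", "and")); // illustrative list
    private final Text token = new Text();

    @Override
    protected void map(LongWritable key, Text tweet, Context context)
            throws IOException, InterruptedException {
        for (String word : tweet.toString().toLowerCase().split("\\s+")) {
            if (word.isEmpty() || STOP_WORDS.contains(word)) {
                continue;                 // Filter step: drop unwanted words
            }
            token.set(word);
            context.write(token, ONE);    // Tokenize step: write tokens as key-value pairs
        }
    }
}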
Algorithm
• The MapReduce algorithm contains two important tasks, namely Map and
Reduce.
• The map task is done by means of a Mapper class.

• The reduce task is done by means of a Reducer class.

• The Mapper class takes the input, tokenizes it, maps and sorts it. The output of
the Mapper class is used as input by the Reducer class, which in turn searches for
matching pairs and reduces them.
• The algorithm works in three phases:
• Map phase
• Sort & Shuffle phase
• Reduce phase
Map Phase
• It works on key-value pair inputs.

• It takes the input tasks, divides them into smaller sub-tasks, and then
performs the required computation on each sub-task in parallel.

• For text input, the key is typically the byte offset of a line and the value is the line's content.

• A list of data elements is provided to the mapper function, called map().

• map() transforms the input data into intermediate output data elements.

• It performs two sub-steps:

• Splitting – takes the input dataset from the source and divides it into smaller sub-datasets.

• Mapping – takes the smaller sub-datasets as input and performs the required
action or computation on each sub-dataset.

Shuffle & Sort Phase
• The shuffle function is also known as the “Combine Function”.
• The mapper output is taken as input to the sort & shuffle step.
• The shuffling is the grouping of the data
from various nodes based on the key.
• Sort is used to list the shuffled inputs in
sorted order.
• Two sub-steps,
• Merging – combines all key-value pairs that
have the same key and returns <key, List<Value>>.
• Sorting – takes the output from merging and sorts
all key-value pairs by key.
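
A small illustrative sketch in plain Java (the Hadoop framework performs this step automatically; this only mimics it) showing how merging and sorting turn mapper output into <key, List<Value>> pairs ordered by key.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ShuffleSortDemo {
    public static void main(String[] args) {
        // Intermediate (key, value) pairs as produced by the mappers.
        List<Map.Entry<String, Integer>> mapperOutput = List.of(
                Map.entry("Bear", 1), Map.entry("Car", 1),
                Map.entry("Bear", 1), Map.entry("River", 1));

        // Merging: group values that share a key; a TreeMap keeps the keys sorted.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : mapperOutput) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }

        // Each entry is now <key, List<Value>>, e.g. Bear=[1, 1], ready for the reducer.
        grouped.forEach((key, values) -> System.out.println(key + " -> " + values));
    }
}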
Reducer Phase
• Reduce is inherently sequential unless it processes multiple
tasks.

• The reduce function receives an iterator of values from the output
list for a specific key.

• The Reducer combines all these values together and provides a
single output for the specific key.

• It performs the Reduce step:

• Reduce step – its <Key, Value> pairs are different from the map step's
<Key, Value> pairs; they are the computed and sorted pairs.
The cluster collects the data to form an appropriate result and
sends it back to the Hadoop server.
Algorithm Techniques

MapReduce implements various mathematical algorithms to divide a task into small
parts and assign them to multiple systems.

• Sorting - to process and analyze data.

• Searching - helps in the combiner phase (optional) and in the Reducer phase.

• Indexing - is used to point to a particular data and its address.

• TF-IDF - is a text processing algorithm which is short for Term Frequency − Inverse
Document Frequency.
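
As one common formulation (weighting variants exist), TF-IDF(t, d) = TF(t, d) × log(N / DF(t)), where TF(t, d) is the number of times term t appears in document d, N is the total number of documents, and DF(t) is the number of documents containing t. For example, if a term appears 3 times in a document and occurs in 10 of 1,000 documents, its weight is 3 × log₁₀(1000/10) = 3 × 2 = 6.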
Hadoop
• Hadoop is an open-source framework that allows us to store and process
big data in a distributed environment across clusters of computers
using simple programming models.
• It is designed to scale up from single servers to thousands of
machines, each offering local computation and storage.
Big Data
• Big Data is a collection of large datasets that cannot be processed
using traditional computing techniques. 
• It is not a single technique or a tool, rather it has become a complete
subject, which involves various tools, techniques and frameworks.
• Example:
• The volume of data that Facebook or YouTube needs to collect and manage on a
daily basis.
• Big Data includes huge volume, high velocity, and extensible variety of
data. The data in it will be of three types.
• Structured data − Relational data.
• Semi Structured data − XML data.
• Unstructured data − Word, PDF, Text, Media Logs.
• Benefits of Big Data
• Using the information kept in social networks like Facebook, marketing
agencies learn about the response to their campaigns, promotions, and
other advertising mediums.
• Using information from social media, such as the preferences and product perception
of their consumers, product companies and retail organizations plan their
production.
• Using data regarding the previous medical history of patients, hospitals
provide better and quicker service.
Big Data Challenges
• Capturing data
• Curation
• Storage
• Searching
• Sharing
• Transfer
• Analysis
• Presentation
Traditional Approach
• An enterprise will have a computer to store and process big data.
• For storage, the programmers will take the help of their choice of database
vendors such as Oracle, IBM, etc.
• In this approach, the user interacts with the application, which in turn handles
the part of data storage and analysis.
Google's Solution
• Google solved this problem using an algorithm called MapReduce.
• This algorithm divides the task into small parts and assigns them to many computers,
and collects the results from them which, when integrated, form the result dataset.
Hadoop
• Using the solution provided by Google, Doug Cutting and his team developed an Open Source Project
called HADOOP.

• Hadoop runs applications using the MapReduce algorithm, where the data is processed in parallel on different nodes.

• Hadoop is used to develop applications that could perform complete statistical analysis on huge amounts of
data.
Hadoop Architecture
• Hadoop is an Apache open-source framework written in Java that
allows distributed processing of large datasets across clusters of
computers using simple programming models.
• The Hadoop framework application works in an environment that
provides distributed storage and computation across clusters of
computers.
• Hadoop is designed to scale up from a single server to thousands of
machines, each offering local computation and storage.
• Hadoop has two major layers namely −
• Processing/Computation layer (MapReduce), and
• Storage layer (Hadoop Distributed File System).
Hadoop Distributed File System (HDFS)
• It is based on the Google File System (GFS) and provides a distributed file
system designed to run on commodity hardware.
• It is highly fault-tolerant and is designed to be deployed on low-cost
hardware.
• It provides high throughput access to application data and is suitable
for applications having large datasets.
• It is cheaper than one high-end server. 
• It runs across clustered and low-cost machines.
Hadoop works
Hadoop runs code across a cluster of computers. This process includes the following core tasks that Hadoop performs −

• Data is initially divided into directories and files. Files are divided into uniformly sized blocks of 128M or 64M
(preferably 128M).

• These files are then distributed across various cluster nodes for further processing.

• HDFS, being on top of the local file system, supervises the processing.

• Blocks are replicated for handling hardware failure.

• Checking that the code was executed successfully.

• Performing the sort that takes place between the map and reduce stages.

• Sending the sorted data to a certain computer.

• Writing the debugging logs for each job.
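
A minimal sketch using the Hadoop FileSystem API (the paths and property values are illustrative; on a real cluster, block size and replication are normally set in hdfs-site.xml) of copying a file into HDFS, where it is split into blocks and replicated across nodes.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsPutExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("dfs.blocksize", "134217728");  // 128 MB blocks (illustrative value)
        conf.set("dfs.replication", "3");        // each block stored on 3 nodes (illustrative value)

        FileSystem fs = FileSystem.get(conf);
        // Copy a local file into the cluster; HDFS splits it into blocks and replicates them.
        fs.copyFromLocalFile(new Path("sample.txt"), new Path("/user/demo/sample.txt"));
        fs.close();
    }
}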


Advantage of Hadoop
• Hadoop is open source and is compatible with all platforms since it is
Java based.
• It allows the user to quickly write and test distributed systems.
• It is efficient, and it automatically distributes the data and work across the
machines.
• It does not rely on hardware to provide fault-tolerance and high
availability (FTHA), rather Hadoop library itself has been designed to
detect and handle failures at the application layer.
• Servers can be added or removed from the cluster dynamically and
Hadoop continues to operate without interruption.
Installation
• Download Hadoop
• https://hadoop.apache.org/releases.html
• Download Java JDK 8
• https://www.oracle.com/in/java/technologies/javase/javase8u211-later-archive-downloads.html
