Hadoop Architecture
Dr. C.V. Suresh Babu
(Centre for Knowledge Transfer) institute
Discussion Topics
• Introduction
• Components of Hadoop
• MapReduce
• Map Task
• Reduce Task
• Anatomy of a MapReduce Job
Introduction
• Hadoop is a framework, written in Java, that uses a large cluster of commodity hardware to store and process very large amounts of data.
• Hadoop is built around the MapReduce programming model introduced by Google.
• Today many big-brand companies use Hadoop to deal with big data, e.g. Facebook, Yahoo, Netflix, and eBay.
Components of Hadoop
The Hadoop architecture mainly consists of four components:
• MapReduce
• HDFS (Hadoop Distributed File System)
• YARN (Yet Another Resource Negotiator)
• Common utilities (Hadoop Common)
A Hadoop cluster consists of a single master node and multiple slave nodes. The master node runs the JobTracker and the NameNode, whereas each slave node runs a TaskTracker and a DataNode.
MapReduce
• MapReduce is a programming model that runs on top of the YARN framework.
• Its major feature is distributed, parallel processing across the Hadoop cluster, which is what makes Hadoop so fast.
• When you are dealing with big data, serial processing is no longer practical.
• MapReduce consists of two tasks, executed phase by phase (a small sketch of the data flow follows this list):
  - Map Task
  - Reduce Task
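As a rough illustration of these two phases (a plain-Java sketch, not Hadoop code; the input strings and class name are invented for the example), consider counting words: the Map phase emits a (word, 1) pair for every word, the pairs are grouped by key, and the Reduce phase sums the counts for each key.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative (non-Hadoop) simulation of the two MapReduce phases
// for a tiny word-count input.
public class MapReduceIdea {
    public static void main(String[] args) {
        List<String> input = List.of("apple banana", "banana apple apple");

        // Map phase: emit a (word, 1) pair for every word in every record.
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String line : input) {
            for (String word : line.split("\\s+")) {
                mapped.add(Map.entry(word, 1));
            }
        }

        // Shuffle and sort: group the intermediate pairs by key.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : mapped) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                   .add(pair.getValue());
        }

        // Reduce phase: aggregate the values for each key (here, a sum).
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = entry.getValue().stream().mapToInt(Integer::intValue).sum();
            System.out.println(entry.getKey() + "\t" + sum); // apple 3, banana 2
        }
    }
}
```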
Map Task
Here we can see that the input is provided to the Map() function, its output is then used as the input to the Reduce() function, and after that we receive our final output. In the first phase Map is applied, and in the next phase Reduce is applied.
Map()
• An input is provided to Map(); since we are working with big data, this input is a set of data blocks.
• The Map() function breaks these data blocks into tuples, which are simply key-value pairs.
• These key-value pairs are then sent as input to Reduce().
Reduce()
• The Reduce() function combines the tuples (key-value pairs) that share the same key into sets of tuples and performs operations on them, such as sorting or summation, and the result is sent to the final output node.
• Finally, the output is obtained.
Note: The processing performed in the Reducer depends on the business requirement of the particular use case. This is how Map() and then Reduce() are applied one after the other; a minimal sketch using the Hadoop API follows.
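A minimal sketch of this flow using the Hadoop MapReduce Java API (the class names WordCountMapper and WordCountReducer are illustrative, not from the slides): Map() emits a (word, 1) pair for every word in a record, and Reduce() combines the pairs that share a key by summing their values.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map(): breaks each input record into (word, 1) key-value pairs.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE); // intermediate key-value pair
        }
    }
}

// Reduce(): combines the pairs that share a key, here by summing the values.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        context.write(word, new IntWritable(sum)); // final (word, total) pair
    }
}
```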
Map Task
• RecordReader: The purpose of the RecordReader is to break the input into records and provide key-value pairs to the Map() function. The key is the record's location information and the value is the data associated with it.
• Map: The map is a user-defined function that processes the tuples obtained from the RecordReader. The Map() function may generate zero, one, or many key-value pairs from these tuples.
• Combiner: The Combiner groups the data in the map workflow. It is similar to a local reducer: the intermediate key-value pairs generated by the map are combined with its help. Using a combiner is optional.
• Partitioner: The Partitioner takes the key-value pairs generated in the map phase and decides which reducer each pair is sent to, producing one shard (partition) per reducer. It computes the hash code of each key and takes its modulus with the number of reducers, i.e. key.hashCode() % (number of reducers). A sketch of this logic follows the list.
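A sketch of that partitioning logic as a custom Partitioner (the class name is illustrative; the body mirrors what Hadoop's built-in HashPartitioner does):

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes each intermediate (key, value) pair to a reducer using the key's
// hash code modulo the number of reducers, as described above.
public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Mask off the sign bit so the partition index is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```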
Reduce Task
• Shuffle and Sort: The reducer's task starts with this step. The process by which the intermediate key-value pairs generated by the mappers are transferred to the reducers is known as shuffling; during shuffling the system sorts the data by key. Shuffling begins as soon as some of the map tasks finish, rather than waiting for all of them, which speeds up the overall job.
• Reduce: The main task of the reducer is to gather the tuples generated by the maps and perform sorting and aggregation on those key-value pairs according to their keys.
• OutputFormat: Once all operations are complete, the key-value pairs are written to the output file by a RecordWriter, one record per line, with the key and value separated by a tab by default (the separator is configurable, as sketched below).
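A small configuration sketch for the OutputFormat step, assuming the Hadoop 2.x/3.x property name for the TextOutputFormat separator (the class name and job name are illustrative):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

// TextOutputFormat writes each (key, value) pair on its own line, separated
// by a tab by default; the separator can be overridden, e.g. to a space.
public class OutputFormatConfig {
    public static Job configure() throws Exception {
        Configuration conf = new Configuration();
        // Assumed property name for Hadoop 2.x/3.x.
        conf.set("mapreduce.output.textoutputformat.separator", " ");
        Job job = Job.getInstance(conf, "word count");
        job.setOutputFormatClass(TextOutputFormat.class);
        return job;
    }
}
```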
Anatomy of a MapReduce Job
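A minimal driver sketch tying the stages together: input splits feed the RecordReader and Map, the optional Combiner and the Partitioner shape the intermediate data, shuffle and sort hand it to Reduce, and the OutputFormat writes the result. It assumes the WordCountMapper, WordCountReducer, and WordPartitioner classes sketched earlier live in the same package; the input and output paths are placeholders passed on the command line.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Driver: configures and submits the whole MapReduce job.
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);       // Map phase
        job.setCombinerClass(WordCountReducer.class);    // optional local reduce
        job.setPartitionerClass(WordPartitioner.class);  // hash-based routing
        job.setReducerClass(WordCountReducer.class);     // Reduce phase

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Packaged into a jar, such a driver would typically be submitted with something like `hadoop jar wordcount.jar WordCountDriver <input> <output>`.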
