Hadoop Architecture
Dr. C.V. Suresh Babu
(Centre for Knowledge Transfer) institute
Discussion Topics
• Introduction
• Components of Hadoop
• MapReduce
• Map Task
• Reduce Task
• Anatomy of a MapReduce Job
Introduction
• Hadoop is a framework, written in Java, that uses a large cluster of commodity hardware to store and process very large amounts of data.
• Hadoop is built around the MapReduce programming model introduced by Google.
• Today many big-brand companies use Hadoop to deal with big data, e.g. Facebook, Yahoo, Netflix, and eBay.
Components of Hadoop
The Hadoop architecture mainly consists of four components:
• MapReduce
• HDFS (Hadoop Distributed File System)
• YARN (Yet Another Resource Negotiator)
• Common utilities (Hadoop Common)
A Hadoop cluster consists of a single master node and multiple slave nodes. The master node runs the JobTracker and the NameNode, whereas each slave node runs a TaskTracker and a DataNode.
MapReduce
• MapReduce is a programming model that runs on top of the YARN framework.
• Its major feature is distributed, parallel processing across the Hadoop cluster, which is what makes Hadoop so fast.
• When you are dealing with big data, serial processing is no longer practical.
• MapReduce consists of two tasks, executed phase by phase (a small sketch of the data flow follows this list):
  - Map Task
  - Reduce Task
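As a rough illustration of these two phases (a plain-Java sketch, not Hadoop code; the input strings and class name are invented for the example), consider counting words: the Map phase emits a (word, 1) pair for every word, the pairs are grouped by key, and the Reduce phase sums the counts for each key.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative (non-Hadoop) simulation of the two MapReduce phases
// for a tiny word-count input.
public class MapReduceIdea {
    public static void main(String[] args) {
        List<String> input = List.of("apple banana", "banana apple apple");

        // Map phase: emit a (word, 1) pair for every word in every record.
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String line : input) {
            for (String word : line.split("\\s+")) {
                mapped.add(Map.entry(word, 1));
            }
        }

        // Shuffle and sort: group the intermediate pairs by key.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : mapped) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                   .add(pair.getValue());
        }

        // Reduce phase: aggregate the values for each key (here, a sum).
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = entry.getValue().stream().mapToInt(Integer::intValue).sum();
            System.out.println(entry.getKey() + "\t" + sum); // apple 3, banana 2
        }
    }
}
```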
Map Task
Here we can see that the input is provided to the Map() function, its output is then used as the input to the Reduce() function, and after that we receive our final output. In the first phase Map is applied, and in the next phase Reduce is applied.
Map()
• An input is provided to Map(); since we are working with big data, this input is a set of data blocks.
• The Map() function breaks these data blocks into tuples, which are simply key-value pairs.
• These key-value pairs are then sent as input to Reduce().
Reduce()
• The Reduce() function combines the tuples (key-value pairs) that share the same key into sets of tuples and performs operations on them, such as sorting or summation, and the result is sent to the final output node.
• Finally, the output is obtained.
Note: The processing performed in the Reducer depends on the business requirement of the particular use case. This is how Map() and then Reduce() are applied one after the other; a minimal sketch using the Hadoop API follows.
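A minimal sketch of this flow using the Hadoop MapReduce Java API (the class names WordCountMapper and WordCountReducer are illustrative, not from the slides): Map() emits a (word, 1) pair for every word in a record, and Reduce() combines the pairs that share a key by summing their values.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map(): breaks each input record into (word, 1) key-value pairs.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE); // intermediate key-value pair
        }
    }
}

// Reduce(): combines the pairs that share a key, here by summing the values.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        context.write(word, new IntWritable(sum)); // final (word, total) pair
    }
}
```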
Map Task
• RecordReader: The purpose of the RecordReader is to break the input into records and provide key-value pairs to the Map() function. The key is the record's location information and the value is the data associated with it.
• Map: The map is a user-defined function that processes the tuples obtained from the RecordReader. The Map() function may generate zero, one, or many key-value pairs from these tuples.
• Combiner: The Combiner groups the data in the map workflow. It is similar to a local reducer: the intermediate key-value pairs generated by the map are combined with its help. Using a combiner is optional.
• Partitioner: The Partitioner takes the key-value pairs generated in the map phase and decides which reducer each pair is sent to, producing one shard (partition) per reducer. It computes the hash code of each key and takes its modulus with the number of reducers, i.e. key.hashCode() % (number of reducers). A sketch of this logic follows the list.
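A sketch of that partitioning logic as a custom Partitioner (the class name is illustrative; the body mirrors what Hadoop's built-in HashPartitioner does):

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes each intermediate (key, value) pair to a reducer using the key's
// hash code modulo the number of reducers, as described above.
public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Mask off the sign bit so the partition index is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```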
Reduce Task
• Shuffle and Sort: The reducer's task starts with this step. The process by which the intermediate key-value pairs generated by the mappers are transferred to the reducers is known as shuffling; during shuffling the system sorts the data by key. Shuffling begins as soon as some of the map tasks finish, rather than waiting for all of them, which speeds up the overall job.
• Reduce: The main task of the reducer is to gather the tuples generated by the maps and perform sorting and aggregation on those key-value pairs according to their keys.
• OutputFormat: Once all operations are complete, the key-value pairs are written to the output file by a RecordWriter, one record per line, with the key and value separated by a tab by default (the separator is configurable, as sketched below).
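A small configuration sketch for the OutputFormat step, assuming the Hadoop 2.x/3.x property name for the TextOutputFormat separator (the class name and job name are illustrative):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

// TextOutputFormat writes each (key, value) pair on its own line, separated
// by a tab by default; the separator can be overridden, e.g. to a space.
public class OutputFormatConfig {
    public static Job configure() throws Exception {
        Configuration conf = new Configuration();
        // Assumed property name for Hadoop 2.x/3.x.
        conf.set("mapreduce.output.textoutputformat.separator", " ");
        Job job = Job.getInstance(conf, "word count");
        job.setOutputFormatClass(TextOutputFormat.class);
        return job;
    }
}
```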
Anatomy of a MapReduce Job
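A minimal driver sketch tying the stages together: input splits feed the RecordReader and Map, the optional Combiner and the Partitioner shape the intermediate data, shuffle and sort hand it to Reduce, and the OutputFormat writes the result. It assumes the WordCountMapper, WordCountReducer, and WordPartitioner classes sketched earlier live in the same package; the input and output paths are placeholders passed on the command line.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Driver: configures and submits the whole MapReduce job.
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);       // Map phase
        job.setCombinerClass(WordCountReducer.class);    // optional local reduce
        job.setPartitionerClass(WordPartitioner.class);  // hash-based routing
        job.setReducerClass(WordCountReducer.class);     // Reduce phase

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Packaged into a jar, such a driver would typically be submitted with something like `hadoop jar wordcount.jar WordCountDriver <input> <output>`.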
