Hadoop Ecosystem and Their Components – A Complete Tutorial
1. Hadoop Ecosystem Components
The objective of this Apache Hadoop ecosystem components tutorial is to give an overview of the different components of the Hadoop ecosystem that make Hadoop so powerful, and thanks to which several Hadoop job roles are now available. We will also learn about Hadoop ecosystem components such as HDFS and its components, MapReduce, YARN, Hive, Apache Pig, Apache HBase and its components, HCatalog, Avro, Thrift, Drill, Apache Mahout, Sqoop, Apache Flume, Ambari, Zookeeper, and Apache Oozie, so that you can dive deep into Big Data Hadoop and acquire master-level knowledge of the Hadoop ecosystem.
Refer to the HDFS Comprehensive Guide to read about Hadoop HDFS in detail, and then proceed with the Hadoop Ecosystem tutorial.
2.2. MapReduce
Hadoop MapReduce is the core Hadoop ecosystem component that provides data processing. MapReduce is a software framework for easily writing applications that process vast amounts of structured and unstructured data stored in the Hadoop Distributed File System.
MapReduce programs are parallel in nature and are therefore very useful for performing large-scale data analysis using multiple machines in the cluster. This parallel processing improves the speed and reliability of the cluster.
Hadoop MapReduce
Working of MapReduce
Hadoop Ecosystem component ‘MapReduce’ works by breaking the processing
into two phases:
• Map phase
• Reduce phase
Each phase has key-value pairs as input and output. In addition, the programmer specifies two functions: the map function and the reduce function.
The map function takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). Read Mapper in detail.
The reduce function takes the output of the map as its input and combines those data tuples based on the key, modifying the value of the key accordingly. Read Reducer in detail.
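To make the two functions concrete, below is a minimal word-count sketch written against the standard Hadoop MapReduce Java API (org.apache.hadoop.mapreduce); the class names and the input/output paths passed on the command line are illustrative only. The map function emits a (word, 1) pair for every word in its input split, and the reduce function sums those pairs per key.

// Minimal word-count sketch for the Hadoop MapReduce Java API.
// Input and output paths are supplied as command-line arguments (illustrative only).
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every word in the input split.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts emitted for each word (key).
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}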
Features of MapReduce
• Simplicity – MapReduce jobs are easy to run. Applications can be written in any language, such as Java, C++, or Python.
• Scalability – MapReduce can process petabytes of data.
• Speed – By means of parallel processing, problems that take days to solve are solved in hours or minutes by MapReduce.
• Fault Tolerance – MapReduce takes care of failures. If one copy of the data is unavailable, another machine has a copy of the same key-value pair, which can be used to solve the same subtask.
Refer to the MapReduce Comprehensive Guide for more details.
We hope this explanation of the Hadoop ecosystem is helpful to you. The next component we take up is YARN.
2.3. YARN
Hadoop YARN (Yet Another Resource Negotiator) is the Hadoop ecosystem component that provides resource management. YARN is also one of the most important components of the Hadoop ecosystem. YARN is called the operating system of Hadoop, as it is responsible for managing and monitoring workloads. It allows multiple data processing engines, such as real-time streaming and batch processing, to handle data stored on a single platform.
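As a rough illustration of how a client hands resource management to YARN, the sketch below (using the same Hadoop Java API as the word-count example) sets the standard configuration keys that tell a MapReduce job to run on YARN and where to find the ResourceManager; the hostname is a placeholder, not something from this tutorial.

// Minimal sketch: pointing a client-side Hadoop job at YARN for resource management.
// The property names are the standard Hadoop/YARN configuration keys;
// "rm.example.com" is only a placeholder hostname.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class YarnSubmitSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Run the job on YARN rather than the local runner.
    conf.set("mapreduce.framework.name", "yarn");
    // Address of the YARN ResourceManager, the cluster-wide resource arbiter.
    conf.set("yarn.resourcemanager.hostname", "rm.example.com");

    Job job = Job.getInstance(conf, "yarn example");
    // ... set mapper/reducer and input/output paths as in the WordCount sketch ...
  }
}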
Components of HBase
There are two HBase components, namely HBase Master and RegionServer.
i. HBase Master
It is not part of the actual data storage, but it negotiates load balancing across all RegionServers.
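For orientation, here is a minimal client sketch using the standard HBase Java API (org.apache.hadoop.hbase.client); the table name "users" and the column family "info" are hypothetical. The client only issues puts and gets, while the HBase Master and the RegionServers take care of region placement and load balancing behind the scenes.

// Minimal HBase client sketch: one put and one get against a hypothetical table.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseClientSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table table = connection.getTable(TableName.valueOf("users"))) {

      // Write one cell: row "row1", column family "info", qualifier "name".
      Put put = new Put(Bytes.toBytes("row1"));
      put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("alice"));
      table.put(put);

      // Read it back; the client never needs to know which RegionServer holds the row.
      Result result = table.get(new Get(Bytes.toBytes("row1")));
      byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
      System.out.println(Bytes.toString(name));
    }
  }
}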
Benefits of HCatalog:
• Enables notifications of data availability.
• With the table abstraction, HCatalog frees the user from the overhead of data storage.
• Provides visibility for data cleaning and archiving tools.
2.8. Avro
Avro is a part of the Hadoop ecosystem and a popular data serialization system. Avro is an open source project that provides data serialization and data exchange services for Hadoop. These services can be used together or independently. Using Avro, programs written in different languages can exchange big data.
Using the serialization service, programs can serialize data into files or messages. Avro stores the data definition (schema) and the data together in one message or file, making it easy for programs to dynamically understand the information stored in an Avro file or message.
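A small sketch of this idea using the standard Avro Java API (org.apache.avro); the schema, field names, and output file below are hypothetical. The schema (the data definition) is written into the file together with the records, which is what lets readers in other languages interpret the data later.

// Minimal Avro serialization sketch: define a schema, build one record, write a file.
import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroWriteSketch {
  public static void main(String[] args) throws Exception {
    // The record layout (data definition) expressed as an Avro schema.
    String schemaJson = "{\"type\":\"record\",\"name\":\"User\","
        + "\"fields\":[{\"name\":\"name\",\"type\":\"string\"},"
        + "{\"name\":\"age\",\"type\":\"int\"}]}";
    Schema schema = new Schema.Parser().parse(schemaJson);

    // Build one record conforming to the schema.
    GenericRecord user = new GenericData.Record(schema);
    user.put("name", "alice");
    user.put("age", 30);

    // Serialize: the schema is stored in the file together with the data,
    // so any reader (in any language) can interpret it later.
    try (DataFileWriter<GenericRecord> writer =
             new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
      writer.create(schema, new File("users.avro"));
      writer.append(user);
    }
  }
}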
2.9. Thrift
It is a software framework for scalable cross-language services development. Thrift is an interface definition language for RPC (Remote Procedure Call) communication. Hadoop makes a lot of RPC calls, so there is a possibility of using the Hadoop ecosystem component Apache Thrift for performance or other reasons.
Thrift Diagram
Apache Flume
Features of Ambari:
• Simplified installation, configuration, and management – Ambari easily and efficiently creates and manages clusters at scale.
• Centralized security setup – Ambari reduces the complexity of administering and configuring cluster security across the entire platform.
• Highly extensible and customizable – Ambari is highly extensible
for bringing custom services under management.
• Full visibility into cluster health – Ambari ensures that the cluster
is healthy and available with a holistic approach to monitoring.
2.15. Zookeeper
Apache Zookeeper is a centralized service and a Hadoop Ecosystem
component for maintaining configuration information, naming, providing
distributed synchronization, and providing group services. Zookeeper
manages and coordinates a large cluster of machines.
ZooKeeper Diagram
Features of Zookeeper:
• Fast – Zookeeper is fast with workloads where reads of the data are more common than writes. The ideal read/write ratio is 10:1.
• Ordered – Zookeeper maintains a record of all transactions.
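To make the coordination role described above concrete, here is a minimal client sketch against the standard ZooKeeper Java API (org.apache.zookeeper); the ensemble address, znode path, and data are placeholders. It stores a small piece of configuration under a znode that any other client in the cluster can then read or watch.

// Minimal ZooKeeper client sketch: store and read a small piece of configuration.
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZookeeperSketch {
  public static void main(String[] args) throws Exception {
    // Connect to a ZooKeeper ensemble (placeholder address, 3-second session timeout).
    ZooKeeper zk = new ZooKeeper("zk1.example.com:2181", 3000, event -> { });

    // Store a piece of configuration under a znode.
    zk.create("/demo-config", "v1".getBytes(),
        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

    // Any client in the cluster can now read (and watch) the same znode.
    byte[] data = zk.getData("/demo-config", false, null);
    System.out.println(new String(data));

    zk.close();
  }
}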
2.16. Oozie
It is a workflow scheduler system for managing Apache Hadoop jobs. Oozie combines multiple jobs sequentially into one logical unit of work. The Oozie framework is fully integrated with the Apache Hadoop stack, with YARN as its architectural center, and it supports Hadoop jobs for Apache MapReduce, Pig, Hive, and Sqoop.
Oozie Diagram
In Oozie, users can create Directed Acyclic Graphs of workflows, which can be run in parallel and sequentially in Hadoop. Oozie is scalable and can manage the timely execution of thousands of workflows in a Hadoop cluster. Oozie is also very flexible: one can easily start, stop, suspend, and rerun jobs, and it is even possible to skip a specific failed node or rerun it in Oozie.
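As a rough sketch of how such a workflow is kicked off programmatically, the example below uses the Oozie Java client (org.apache.oozie.client.OozieClient); the Oozie URL and the HDFS path to the workflow application are placeholders, not values from this tutorial.

// Minimal sketch of submitting a workflow with the Oozie Java client.
import java.util.Properties;

import org.apache.oozie.client.OozieClient;

public class OozieSubmitSketch {
  public static void main(String[] args) throws Exception {
    OozieClient client = new OozieClient("http://oozie.example.com:11000/oozie");

    // Job properties: where the workflow definition (workflow.xml) lives on HDFS.
    Properties conf = client.createConfiguration();
    conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode:8020/user/demo/wf-app");

    // Start the workflow and print its job id; Oozie then drives the DAG of actions.
    String jobId = client.run(conf);
    System.out.println("Workflow job submitted: " + jobId);
  }
}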
There are two basic types of Oozie jobs: