Apache Hadoop Ecosystem
2. Map-Reduce
Map-Reduce is the data processing layer of Hadoop. It splits a job into small tasks, distributes those tasks across many machines joined over a network, and assembles all the intermediate results into the final output dataset. The basic unit of data required by Map-Reduce is the key-value pair: all data, whether structured or not, must be translated into key-value pairs before it is passed through the Map-Reduce model. In the Map-Reduce framework, the processing unit is moved to the data rather than moving the data to the processing unit.
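To make the key-value contract concrete, below is a minimal word-count sketch using the standard org.apache.hadoop.mapreduce Java API. The mapper emits (word, 1) pairs and the reducer sums them into the final dataset; the input and output HDFS paths are placeholders passed on the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: turn each input line into (word, 1) key-value pairs.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer tokens = new StringTokenizer(value.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts emitted for each word.
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(SumReducer.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```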
3. YARN
YARN stands for “Yet Another Resource Negotiator” and is the resource management layer of the Hadoop cluster. YARN implements resource management and job scheduling in the Hadoop cluster. The primary idea of YARN is to split job scheduling and resource management into separate processes, which makes the overall operation more efficient.
YARN provides two daemons: the first is called the Resource Manager and the second is called the Node Manager. Both components work together to process data computations in YARN. The Resource Manager runs on the master node of the Hadoop cluster and arbitrates resources among all applications, whereas a Node Manager is hosted on every slave node. The responsibility of the Node Manager is to monitor the containers and their resource usage (CPU, memory, disk, and network) and report those details to the Resource Manager.
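As a small illustration of the Resource Manager / Node Manager split, the sketch below uses the YarnClient Java API to ask the Resource Manager for the node reports that Node Managers send it. It assumes a running cluster whose Resource Manager address is available through yarn-site.xml on the classpath.

```java
import java.util.List;

import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ClusterNodes {
  public static void main(String[] args) throws Exception {
    // Connects to the Resource Manager configured in yarn-site.xml.
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(new YarnConfiguration());
    yarnClient.start();

    // Each NodeReport carries the usage a Node Manager reported back to
    // the Resource Manager (memory, vcores, running containers).
    List<NodeReport> nodes = yarnClient.getNodeReports(NodeState.RUNNING);
    for (NodeReport node : nodes) {
      System.out.printf("%s used=%s capacity=%s containers=%d%n",
          node.getNodeId(), node.getUsed(), node.getCapability(),
          node.getNumContainers());
    }
    yarnClient.stop();
  }
}
```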
4. Apache Hive
Apache Hive is the data-warehousing project of Hadoop. Hive is intended to facilitate easy data summarization, ad-hoc querying, and analysis of large volumes of data. With the help of HiveQL, a user can run ad-hoc queries on datasets stored in HDFS and use the results for further analysis. Hive also supports custom user-defined functions that users can write to perform custom analysis.
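As an example of an ad-hoc HiveQL query, the sketch below connects to a HiveServer2 endpoint over JDBC. The host, port, credentials, and the page_views table are placeholders used only for illustration.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveAdHocQuery {
  public static void main(String[] args) throws Exception {
    // Hive JDBC driver must be on the classpath (hive-jdbc).
    Class.forName("org.apache.hive.jdbc.HiveDriver");

    // HiveServer2 endpoint; host, port, and database are placeholders.
    String url = "jdbc:hive2://localhost:10000/default";
    try (Connection conn = DriverManager.getConnection(url, "hive", "");
         Statement stmt = conn.createStatement()) {
      // Ad-hoc HiveQL summarization over a table whose files live in HDFS.
      ResultSet rs = stmt.executeQuery(
          "SELECT country, COUNT(*) AS visits FROM page_views GROUP BY country");
      while (rs.next()) {
        System.out.println(rs.getString("country") + "\t" + rs.getLong("visits"));
      }
    }
  }
}
```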
5. Apache Pig
Apache Pig was developed by Yahoo to analyze large datasets stored in Hadoop HDFS. Pig provides a platform for analyzing massive data sets that consists of a high-level language (Pig Latin) for expressing data analysis programs, coupled with infrastructure for evaluating those programs.
Optimization Opportunities
Apache Pig optimizes query execution automatically, which lets users concentrate on the meaning of a program rather than on its efficiency.
Extensibility
Users can create their own functions to do special-purpose processing; a short embedded-Pig sketch follows this list.
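The sketch referenced above embeds a small Pig Latin word-count script through the PigServer Java API. The input file, output directory, and aliases are placeholders, and local mode is chosen only to keep the example self-contained; ExecType.MAPREDUCE would run the same script on the cluster.

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigWordCount {
  public static void main(String[] args) throws Exception {
    // Local mode for illustration; ExecType.MAPREDUCE runs on the cluster.
    PigServer pig = new PigServer(ExecType.LOCAL);

    // Pig Latin script; input path, output path, and aliases are placeholders.
    pig.registerQuery("lines = LOAD 'input.txt' AS (line:chararray);");
    pig.registerQuery("words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;");
    pig.registerQuery("grouped = GROUP words BY word;");
    pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(words);");

    // Pig builds and optimizes the whole plan; this store triggers execution.
    pig.store("counts", "wordcount_out");
    pig.shutdown();
  }
}
```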
6. Apache Mahout
Apache Mahout is a framework for creating machine-learning applications. It provides a rich set of components from which you can construct a customized recommender system from a selection of algorithms, and it is designed for performance, scalability, and flexibility.
The following are the important interfaces that Mahout defines for these key abstractions; a short usage sketch follows the list.
DataModel
UserSimilarity
ItemSimilarity
UserNeighborhood
Recommender
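The sketch below wires these abstractions together into a simple user-based recommender using Mahout's Taste classes. The ratings.csv file (lines of userID,itemID,preference), the Pearson similarity, and the neighborhood size are illustrative assumptions, not fixed choices.

```java
import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class UserBasedRecommenderSketch {
  public static void main(String[] args) throws Exception {
    // ratings.csv is a placeholder file of userID,itemID,preference lines.
    DataModel model = new FileDataModel(new File("ratings.csv"));
    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
    UserNeighborhood neighborhood =
        new NearestNUserNeighborhood(10, similarity, model);
    Recommender recommender =
        new GenericUserBasedRecommender(model, neighborhood, similarity);

    // Recommend three items for user 1 based on similar users' preferences.
    List<RecommendedItem> items = recommender.recommend(1, 3);
    for (RecommendedItem item : items) {
      System.out.println(item.getItemID() + " -> " + item.getValue());
    }
  }
}
```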
7. Apache HBase
Apache HBase is a distributed, open-source, versioned, non-relational database modeled after Google's Bigtable. It is an important component of the Hadoop ecosystem that leverages the fault-tolerance feature of HDFS and provides real-time read and write access to data. HBase can be called a data storage system rather than a database because it lacks RDBMS features such as triggers, a typed query language, and secondary indexes.
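A minimal read/write sketch with the HBase Java client follows. It assumes an existing table named users with a column family profile; both names, and the row key, are placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseReadWrite {
  public static void main(String[] args) throws Exception {
    // Reads the cluster location from hbase-site.xml on the classpath.
    Configuration conf = HBaseConfiguration.create();
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table table = connection.getTable(TableName.valueOf("users"))) {

      // Write: row key "user-1001", column family "profile", column "name".
      Put put = new Put(Bytes.toBytes("user-1001"));
      put.addColumn(Bytes.toBytes("profile"), Bytes.toBytes("name"), Bytes.toBytes("Asha"));
      table.put(put);

      // Read the same cell back in real time.
      Result result = table.get(new Get(Bytes.toBytes("user-1001")));
      byte[] name = result.getValue(Bytes.toBytes("profile"), Bytes.toBytes("name"));
      System.out.println(Bytes.toString(name));
    }
  }
}
```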
8. Apache Zookeeper
Apache Zookeeper acts as a coordinator between the different services of Hadoop and is used for maintaining configuration information, naming, providing distributed synchronization, and providing group services. Zookeeper helps applications that are newly deployed in a distributed environment avoid the bugs and race conditions that such environments commonly expose.
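As a small example of this coordination role, the sketch below uses the ZooKeeper Java client to publish a configuration value as a znode and read it back. The ensemble address and znode path are placeholders, and the znode is assumed not to exist yet.

```java
import java.nio.charset.StandardCharsets;
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class SharedConfig {
  public static void main(String[] args) throws Exception {
    CountDownLatch connected = new CountDownLatch(1);
    // Ensemble address is a placeholder; wait until the session is live.
    ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, (WatchedEvent event) -> {
      if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
        connected.countDown();
      }
    });
    connected.await();

    // Publish a piece of configuration as a persistent znode...
    byte[] value = "batch.size=500".getBytes(StandardCharsets.UTF_8);
    zk.create("/app-config", value, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

    // ...and read it back; other processes see the same coordinated value.
    byte[] stored = zk.getData("/app-config", false, null);
    System.out.println(new String(stored, StandardCharsets.UTF_8));
    zk.close();
  }
}
```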
9. Apache Sqoop
Apache Sqoop is a data transfer tool used to transfer bulk data between Hadoop and relational databases. It imports data from a relational database management system (such as MySQL or Oracle) or a mainframe into Hadoop (HDFS), where the data can be transformed with Hadoop MapReduce, and it can also export the data back to an RDBMS. Sqoop uses MapReduce to import and export the data, which gives it parallel processing and fault tolerance.
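For illustration, the sketch below launches a typical Sqoop import from Java by shelling out to the sqoop command-line tool with ProcessBuilder. The JDBC URL, credentials, table, and target HDFS directory are placeholders, and the sqoop CLI is assumed to be installed and on the PATH.

```java
import java.io.IOException;

public class SqoopImportLauncher {
  public static void main(String[] args) throws IOException, InterruptedException {
    // Shells out to the installed `sqoop` CLI; all connection details are placeholders.
    ProcessBuilder pb = new ProcessBuilder(
        "sqoop", "import",
        "--connect", "jdbc:mysql://dbhost:3306/sales",
        "--username", "report",
        "--password-file", "/user/report/.db-password", // password kept in HDFS
        "--table", "orders",
        "--target-dir", "/data/raw/orders",             // HDFS directory for imported files
        "--num-mappers", "4");                          // 4 parallel map tasks do the copy
    pb.inheritIO();
    int exitCode = pb.start().waitFor();
    System.out.println("sqoop import finished with exit code " + exitCode);
  }
}
```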
10. Apache Flume
Apache Flume is a log transfer tool similar to Sqoop, but it works on unstructured data (logs), whereas Sqoop is used for structured data. Flume is a reliable, distributed, and available system for efficiently collecting, aggregating, and moving large amounts of log data from many different sources into HDFS. It is not restricted to log data aggregation; it can also be used to transport massive quantities of event data. A Flume agent moves data through the following three components:
Source
Channel
Sink
11. Apache Oozie
Apache Oozie is the workflow scheduler of the Hadoop ecosystem. An Oozie workflow has the following two types of nodes, namely Control Flow Nodes and Action Nodes.
1. Control Flow Node
Control flow nodes define the start and end of a workflow and control its execution path.
2. Action Node
Action nodes trigger the execution of a processing task, such as a MapReduce, Pig, or Hive job.