DS Unit 4.1

HADOOP ECOSYSTEM

Based on Data Flair (https://data-flair.training/blogs/hadoop-tutorial/)

INTRODUCTION
• With the growth of the internet, the web expanded from a few to
millions of web pages between the 1990s and the 2000s, and compiling
search results by hand became tedious and needed automation.
• Doug Cutting and Mike Cafarella started working on an open-source
web search engine called Nutch for faster data distribution and
collection.
• In 2006, Cutting joined Yahoo and took with him the Nutch
project as well as ideas based on Google’s early work with
automating distributed data storage and processing.
• The Nutch project was divided – the web crawler portion
remained as Nutch and the distributed computing and
processing portion became Hadoop.
• "Hadoop” was the name of a yellow toy elephant owned
by the son of Doug Cutting.
• In 2006, Hadoop was released by Yahoo, and today it is
maintained and distributed by the Apache Software
Foundation (ASF).


• The Apache Hadoop software library is a framework that allows for the
distributed processing of large data sets across clusters of computers using
simple programming models.

• It offers the flexibility to store and mine any type of data, whether structured, semi-structured or unstructured.

• It is an open source project from the Apache Software Foundation.

HADOOP ARCHITECTURE
• Hadoop works in a master-slave fashion: there is
one master node and n slave nodes.

• The master manages, maintains and monitors the
slaves, while the slaves are the actual worker
nodes.

• The master stores the metadata (data about data),
while the slaves are the nodes that store the
data in the cluster.

• The client connects to the master node to
perform any task.


HADOOP DAEMONS
• Daemons are processes that run in the
background.

• There are mainly 4 daemons which run for Hadoop:

• NameNode – runs on the master node for HDFS.
• DataNode – runs on slave nodes for HDFS.
• ResourceManager – runs on the master node for YARN.
• NodeManager – runs on slave nodes for YARN.

• These 4 daemons must run for Hadoop to be functional.

• Apart from these, there can be a secondary
NameNode, a standby NameNode, a Job HistoryServer,
etc.


HADOOP DISTRIBUTED FILE SYSTEM (HDFS)


• The primary storage system of Hadoop.

• A Java-based file system that provides scalable, fault-tolerant, reliable and cost-efficient data storage for Big Data.

• A distributed file system that runs on commodity hardware.

• HDFS is designed to handle huge volumes of data; the expected file size is in the
range of GBs to TBs. A file is split into blocks (default 128 MB) and stored in a
distributed way across multiple machines. Each block is replicated according to the
replication factor and the replicas are stored on different nodes, which handles
the failure of a node in the cluster (see the sketch below).
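
Below is a minimal sketch (not from the slides) of writing a file to HDFS with the Hadoop Java FileSystem API. The NameNode address and file path are assumed placeholders; HDFS transparently splits the written data into blocks and replicates them across DataNodes.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; in a real cluster this comes from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");

        try (FileSystem fs = FileSystem.get(conf);
             FSDataOutputStream out = fs.create(new Path("/user/demo/hello.txt"))) {
            // The file is split into blocks (default 128 MB) and replicated
            // across DataNodes by HDFS; the client just writes bytes.
            out.writeUTF("Hello, HDFS!");
        }
    }
}
```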

HDFS COMPONENTS
• NameNode is also known as Master node.

• NameNode does not store the actual data or dataset.

• NameNode stores metadata, i.e. the number of blocks, their locations, on which
rack and on which DataNode the data is stored, and other details.

• Tasks of the HDFS NameNode:

• Manages the file system namespace.
• Regulates clients' access to files.
• Executes file system operations such as naming, closing and opening files and
directories.


HDFS COMPONENTS
• DataNode is also known as Slave.
• It is responsible for storing the actual data in HDFS.
• It performs read and write operations as per the requests of the clients.

• Tasks of the HDFS DataNode:

• Performs block replica creation, deletion and replication
according to the instructions of the NameNode.
• Manages the data storage of the system.


MAPREDUCE
• Hadoop MapReduce is a framework for distributed processing of huge
volumes of data over a cluster of nodes.

• It is a software framework for easily writing applications that process the vast
amounts of structured and unstructured data stored in the Hadoop
Distributed File System.

• As data is stored in a distributed manner in HDFS, this provides the way for
MapReduce to perform parallel processing.



MAPREDUCE
• ‘MapReduce’ works by breaking the processing into two phases:
• Map phase
• Reduce phase

• Each phase has key-value pairs as input and output.

• The Map function takes a set of data and converts it into another set of data,
where individual elements are broken down into tuples (key/value pairs).

• The Reduce function takes the output from the Map as its input, combines
those data tuples based on the key, and accordingly modifies the value of
the key (see the word-count sketch below).
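
A classic illustration of the two phases is word counting. The sketch below uses the Hadoop MapReduce Java API; it is a minimal example, and the job driver (input/output paths, job submission) is omitted.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
    // Map phase: each input line is broken down into (word, 1) key/value pairs.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce phase: all values for the same key (word) are combined into a count.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}
```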


YARN (YET ANOTHER RESOURCE NEGOTIATOR)

• A Hadoop ecosystem component that provides resource management.

• Called the operating system of Hadoop, as it is responsible for managing
and monitoring workloads.

• Allows multiple data processing engines, such as real-time streaming and
batch processing, to handle data stored on a single platform.
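
As a small, hedged illustration of YARN's role as the cluster resource manager, the sketch below uses the YarnClient Java API to ask the ResourceManager which NodeManagers are currently running; it assumes the cluster configuration (yarn-site.xml) is on the classpath.

```java
import java.util.List;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnNodesExample {
    public static void main(String[] args) throws Exception {
        YarnClient yarn = YarnClient.createYarnClient();
        yarn.init(new YarnConfiguration());
        yarn.start();

        // Ask the ResourceManager for all NodeManagers that are currently running.
        List<NodeReport> nodes = yarn.getNodeReports(NodeState.RUNNING);
        for (NodeReport node : nodes) {
            System.out.println(node.getNodeId() + "\t"
                    + node.getNumContainers() + " containers\t"
                    + node.getCapability());
        }

        yarn.stop();
    }
}
```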



HIVE
• an open source data warehouse system for querying and analyzing large datasets
stored in Hadoop files.

• Performs three main functions: data summarization, query, and analysis.

• Hive uses HiveQL (HQL), which is similar to SQL.

• Hive automatically translates the SQL-like HiveQL queries into MapReduce jobs which
execute on Hadoop.

• Main parts of Hive are:

• Metastore – stores the metadata.
• Driver – manages the lifecycle of a HiveQL statement.
• Query compiler – compiles HiveQL into a Directed Acyclic Graph (DAG).
• Hive server – provides a Thrift interface and a JDBC/ODBC server.
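
Since the Hive server exposes a JDBC interface, a HiveQL query can be issued from Java as in the sketch below; the HiveServer2 URL, the empty credentials and the `sales` table are assumptions for illustration.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // Assumed HiveServer2 endpoint; adjust to the actual cluster.
        String url = "jdbc:hive2://localhost:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "", "");
             Statement stmt = conn.createStatement();
             // The SQL-like HiveQL query is translated into jobs that run on Hadoop.
             ResultSet rs = stmt.executeQuery(
                     "SELECT category, COUNT(*) FROM sales GROUP BY category")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}
```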


PIG
• A high-level language platform for analyzing and querying huge datasets
that are stored in HDFS.

• Pig uses the Pig Latin language, which is very similar to SQL.

• It loads the data, applies the required filters and dumps the data in the
required format (see the sketch below).

• For program execution, Pig requires a Java runtime environment.
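
A hedged sketch of the load–filter–store flow described above, driven from Java through Pig's PigServer API; the input file, its field layout and the output directory are assumptions.

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigExample {
    public static void main(String[] args) throws Exception {
        // LOCAL mode for a quick test; MAPREDUCE mode would run on the cluster.
        PigServer pig = new PigServer(ExecType.LOCAL);

        // Load the data, apply the required filter, and store the result.
        pig.registerQuery("records = LOAD 'sales.csv' USING PigStorage(',') "
                + "AS (item:chararray, amount:int);");
        pig.registerQuery("big_sales = FILTER records BY amount > 100;");
        pig.store("big_sales", "big_sales_out");

        pig.shutdown();
    }
}
```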



HBASE
• HBase is a Hadoop ecosystem component: a distributed database designed to
store structured data in tables that can have billions of rows
and millions of columns.
• HBase is a scalable, distributed, NoSQL database that is built on top of HDFS.
• HBase provides real-time access to read or write data in HDFS.
• HBase Components:
• HBase Master
• It is not part of the actual data storage but negotiates load balancing across all
RegionServers.
• Maintains and monitors the Hadoop cluster.
• Performs administration (interface for creating, updating and deleting tables).
• Controls the failover.
• HMaster handles DDL operations.
• RegionServer
• It is the worker node which handles read, write, update and delete requests from clients.
The RegionServer process runs on every node in the Hadoop cluster; the RegionServer runs on
the HDFS DataNode.
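
A brief sketch of writing and reading a single cell with the HBase Java client API; the table name (`users`), column family (`info`) and row key are assumptions, and the connection settings are taken from hbase-site.xml on the classpath.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) {

            // Write one cell; RegionServers handle the actual read/write requests.
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);

            // Read the cell back.
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(value));
        }
    }
}
```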


MAHOUT
• Mahout is an open source framework for creating scalable machine
learning algorithms and a data mining library. Once data is stored in Hadoop
HDFS, Mahout provides the data science tools to automatically find
meaningful patterns in those big data sets.

• Algorithms of Mahout are:

• Clustering – takes items and organizes them into naturally occurring
groups, such that items belonging to the same group are similar to each other.
• Collaborative filtering – mines user behaviour and makes product
recommendations (e.g. Amazon recommendations).
• Classification – learns from existing categorizations and then assigns
unclassified items to the best category.
• Frequent pattern mining – analyzes items in a group (e.g. items in a shopping
cart or terms in a query session) and then identifies which items typically appear
together.



ZOOKEEPER
• Zookeeper is a centralized service and a Hadoop Ecosystem component for
maintaining configuration information, naming, providing distributed
synchronization, and providing group services.

• Zookeeper manages and coordinates a large cluster of machines.

• Zookeeper is fast with workloads where reads of data are more
common than writes; the ideal read/write ratio is around 10:1.

• Zookeeper maintains a record of all transactions.
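
A small sketch of the ZooKeeper Java client: connect to an ensemble, store a configuration value in a znode, and read it back. The server address, znode path and data are assumptions.

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZooKeeperExample {
    public static void main(String[] args) throws Exception {
        // Assumed ensemble address; 3-second session timeout, no watcher.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, null);

        // Store a piece of configuration information under a znode.
        zk.create("/demo-config", "replication=3".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // Reads like this are the common, fast path in ZooKeeper workloads.
        byte[] data = zk.getData("/demo-config", false, null);
        System.out.println(new String(data));

        zk.close();
    }
}
```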


OOZIE
• It is a workflow scheduler system for managing Apache Hadoop jobs.

• Oozie combines multiple jobs sequentially into one logical unit of work.

• In Oozie, users can create a Directed Acyclic Graph of workflow, which can
run in parallel and sequentially in Hadoop.

• Oozie is scalable and can manage the timely execution of thousands of
workflows in a Hadoop cluster. Oozie is also very flexible.

• There are two basic types of Oozie jobs:

• Oozie Workflow – stores and runs workflows composed of Hadoop jobs, e.g.
MapReduce, Pig, Hive.
• Oozie Coordinator – runs workflow jobs based on predefined schedules and the
availability of data.
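
A hedged sketch of submitting a workflow job from Java with the OozieClient API; the Oozie server URL, the HDFS path holding workflow.xml, and the NameNode/ResourceManager addresses are all assumptions.

```java
import java.util.Properties;
import org.apache.oozie.client.OozieClient;

public class OozieSubmitExample {
    public static void main(String[] args) throws Exception {
        // Assumed Oozie server endpoint.
        OozieClient oozie = new OozieClient("http://localhost:11000/oozie");

        Properties conf = oozie.createConfiguration();
        // Assumed HDFS path containing workflow.xml for the DAG of Hadoop jobs.
        conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode:9000/user/demo/workflow");
        conf.setProperty("nameNode", "hdfs://namenode:9000");
        conf.setProperty("jobTracker", "resourcemanager:8032");

        // Submit and start the workflow job.
        String jobId = oozie.run(conf);
        System.out.println("Workflow job submitted: " + jobId);
    }
}
```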



SQOOP

• Sqoop imports data from external sources into related Hadoop ecosystem
components like HDFS, HBase or Hive.

• It also exports data from Hadoop to other external sources.

• Sqoop works with relational databases such as Teradata, Netezza, Oracle and
MySQL.


FLUME
• Flume efficiently collects, aggregates and moves large amounts
of data from its origin and sends it to HDFS.

• It is a fault-tolerant and reliable mechanism. This Hadoop ecosystem
component allows the data to flow from the source into the Hadoop
environment.

• It uses a simple extensible data model that allows for online
analytic applications.

• Using Flume, we can get data from multiple servers
immediately into Hadoop.



AMBARI
• Ambari is a management platform for provisioning, managing,
monitoring and securing Apache Hadoop clusters.

• Hadoop management gets simpler as Ambari provides a consistent,
secure platform for operational control.

• Ambari easily and efficiently creates and manages clusters at
scale.

• Ambari reduces the complexity of administering and configuring
cluster security across the entire platform.

• Ambari ensures that the cluster is healthy and available with a
holistic approach to monitoring.


DRILL

• Remember that the main purpose of the Hadoop ecosystem components
is large-scale data processing, including structured
and semi-structured data.

• Drill is a low-latency distributed query engine that is designed to
scale to several thousands of nodes and query petabytes of data.

• Drill is the first distributed SQL query engine that has a schema-
free model.
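
Drill ships with a JDBC driver, so a schema-free query can be issued from Java as in the hedged sketch below; the connection string (an embedded Drillbit via `zk=local`) and the sample JSON file path are assumptions.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DrillQueryExample {
    public static void main(String[] args) throws Exception {
        // "zk=local" runs an embedded Drillbit; a cluster would use its ZooKeeper quorum.
        try (Connection conn = DriverManager.getConnection("jdbc:drill:zk=local");
             Statement stmt = conn.createStatement();
             // Schema-free: Drill infers the structure of the JSON file at query time.
             ResultSet rs = stmt.executeQuery(
                     "SELECT name, age FROM dfs.`/data/users.json` LIMIT 10")) {
            while (rs.next()) {
                System.out.println(rs.getString("name") + "\t" + rs.getLong("age"));
            }
        }
    }
}
```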



AVRO
• Avro is a part of the Hadoop ecosystem and a popular data serialization
system.

• Using the serialization service, programs can serialize data into files or messages. Avro
stores the data definition and the data together in one message or file, making it
easy for programs to dynamically understand the information stored in an Avro file
or message.

• Avro schema – Avro relies on schemas for serialization/deserialization and
requires the schema for data writes/reads. When Avro data is stored in a file,
its schema is stored with it, so that the file may be processed later by any
program.

• Dynamic typing – refers to serialization and deserialization without code
generation. It complements the code generation which is available in Avro
for statically typed languages as an optional optimization.
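
A small sketch of these two points with the Avro Java API: the schema is parsed at runtime, a record is built without any generated classes (dynamic typing), and the schema is written into the output file together with the data. The record layout and file name are assumptions.

```java
import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroWriteExample {
    public static void main(String[] args) throws Exception {
        // The schema is required for writing; it travels with the data in users.avro.
        Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
                + "{\"name\":\"name\",\"type\":\"string\"},"
                + "{\"name\":\"age\",\"type\":\"int\"}]}");

        // Dynamic typing: no generated classes are needed to build this record.
        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "Alice");
        user.put("age", 30);

        try (DataFileWriter<GenericRecord> writer =
                     new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.create(schema, new File("users.avro"));
            writer.append(user);
        }
    }
}
```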


THRIFT

• Thrift is a software framework for scalable cross-language services
development.

• Thrift is an interface definition language for RPC (Remote Procedure Call)
communication.



• Hadoop is downloadable from https://hadoop.apache.org/releases.html


Hadoop: Setting up a Single Node Cluster.

• Now you are ready to start your Hadoop cluster in one of the three
supported modes:

• Local (Standalone) Mode
• Pseudo-Distributed Mode
• Fully-Distributed Mode



• Standalone Operation
• By default, Hadoop is configured to run in a non-distributed mode, as a
single Java process. This is useful for debugging.

• Pseudo-Distributed Operation
• Hadoop can also be run on a single node in pseudo-distributed mode,
where each Hadoop daemon runs in a separate Java process.

• Hadoop Cluster Setup
• How to install and configure Hadoop clusters ranging from a few nodes to
extremely large clusters with thousands of nodes.
