DS Unit 4.1
HADOOP ECOSYSTEM
INTRODUCTION
• With the inception of the internet, the web grew from a few pages to
millions of pages between the 1990s and the 2000s, and compiling
search results by hand became tedious and needed automation.
• Doug Cutting and Mike Cafarella started working on an open-source
web search engine called Nutch for faster data distribution and
collection.
• In 2006, Cutting joined Yahoo and took with him the Nutch
project as well as ideas based on Google’s early work with
automating distributed data storage and processing.
• The Nutch project was divided – the web crawler portion
remained as Nutch and the distributed computing and
processing portion became Hadoop.
• "Hadoop” was the name of a yellow toy elephant owned
by the son of Doug Cutting.
• In 2006, Hadoop was released by Yahoo; today it is maintained and
distributed by the Apache Software Foundation (ASF).
• The Apache Hadoop software library is a framework that allows for the
distributed processing of large data sets across clusters of computers using
simple programming models.
• It offers the flexibility to store and mine any type of data, whether
structured, semi-structured or unstructured.
HADOOP ARCHITECTURE
• Hadoop works in a master-slave fashion: there is one master node
and there are n slave nodes. The master coordinates storage and
processing, while the slaves hold the data and run the tasks.
HADOOP DAEMONS
• Daemons are the processes that run in the
background.
HDFS
• HDFS (Hadoop Distributed File System) is a Java-based file system that provides
scalable, fault-tolerant, reliable and cost-efficient data storage for Big Data.
• HDFS is designed to handle huge volumes of data; expected file sizes range from
GBs to TBs. A file is split into blocks (default 128 MB) that are stored in a
distributed manner across multiple machines. Each block is replicated as per the
replication factor, and the replicas are stored on different nodes, which lets the
cluster tolerate the failure of a node. A minimal client-side sketch follows.
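A minimal sketch of a client interacting with HDFS through the Java FileSystem API. The NameNode address hdfs://localhost:9000 and all file paths are assumptions for illustration, not values from these slides:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode URI; substitute your cluster's address.
        conf.set("fs.defaultFS", "hdfs://localhost:9000");
        FileSystem fs = FileSystem.get(conf);

        // Copy a local file into HDFS; behind the scenes it is split into
        // blocks and replicated according to the replication factor.
        fs.copyFromLocalFile(new Path("/tmp/input.txt"),
                             new Path("/user/demo/input.txt"));

        // Inspect the block size and replication of the stored file.
        FileStatus status = fs.getFileStatus(new Path("/user/demo/input.txt"));
        System.out.println("Block size:  " + status.getBlockSize());
        System.out.println("Replication: " + status.getReplication());

        fs.close();
    }
}
```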
HDFS COMPONENTS
• NameNode is also known as the Master node.
• stores the metadata of HDFS: the file system namespace and the mapping of files
to blocks and block locations; it does not store the actual data itself.
• DataNode is also known as the Slave node.
• responsible for storing the actual data in HDFS.
• performs read and write operations as per the requests of the clients.
MAPREDUCE
• Hadoop MapReduce is a framework for the distributed processing of huge
volumes of data over a cluster of nodes.
• a software framework for easily writing applications that process vast
amounts of structured and unstructured data stored in the Hadoop
Distributed File System.
• ‘MapReduce’ works by breaking the processing into two phases:
• Map phase
• Reduce phase
• The Map function takes a set of data and converts it into another set of data,
where individual elements are broken down into tuples (key/value pairs).
• The Reduce function takes the output of the Map as its input and combines
those tuples based on the key, aggregating the values for each key, as the
word-count sketch below illustrates.
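The canonical word-count job shows both phases; this is the standard Hadoop example, given here as a sketch (input and output paths come from the command line):

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit a (word, 1) pair for every word in the input split.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts gathered for each key (word).
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```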
HIVE
• an open-source data warehouse system for querying and analyzing large datasets
stored in Hadoop files.
• Hive automatically translates HiveQL (SQL-like) queries into MapReduce jobs
that execute on Hadoop, so a query can be submitted as plain SQL text, as
sketched below.
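A minimal sketch of submitting a HiveQL query from Java over JDBC. It assumes a HiveServer2 instance at localhost:10000, an existing table named logs, and the hive-jdbc driver on the classpath; all three are assumptions for illustration:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
    public static void main(String[] args) throws Exception {
        // Hive JDBC driver; requires hive-jdbc and its dependencies on the classpath.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Assumed HiveServer2 endpoint; adjust host, port and database for your cluster.
        String url = "jdbc:hive2://localhost:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "", "");
             Statement stmt = conn.createStatement();
             // Hive translates this SQL-like query into MapReduce work on the cluster.
             ResultSet rs = stmt.executeQuery(
                 "SELECT status, COUNT(*) FROM logs GROUP BY status")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}
```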
PIG
• a high-level language platform for analyzing and querying huge datasets
that are stored in HDFS.
• It loads the data, applies the required filters and dumps the data in the
required format, as the embedded-Pig sketch below shows.
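A minimal sketch of the load-filter-store pattern using Pig's embedded Java API (PigServer). Local mode is used for illustration, and the file paths, field names and filter condition are all assumptions:

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigExample {
    public static void main(String[] args) throws Exception {
        // Local mode for experimentation; use ExecType.MAPREDUCE on a cluster.
        PigServer pig = new PigServer(ExecType.LOCAL);

        // Load, filter and store: the typical Pig pattern described above.
        pig.registerQuery("logs = LOAD 'input/logs.txt' USING PigStorage('\\t') "
                + "AS (user:chararray, status:int);");
        pig.registerQuery("errors = FILTER logs BY status >= 500;");
        pig.store("errors", "output/errors");
    }
}
```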
HBASE
• HBase is a Hadoop ecosystem component: a distributed database designed to
store structured data in tables that can have billions of rows and millions
of columns.
• HBase is a scalable, distributed, NoSQL database built on top of HDFS.
• HBase provides real-time access to read or write data in HDFS.
• HBase Components:
• HBase Master
• It is not part of the actual data storage but negotiates load balancing across all
RegionServers.
• Maintains and monitors the Hadoop cluster.
• Performs administration (interface for creating, updating and deleting tables).
• Controls the failover.
• HMaster handles DDL operation.
• RegionServer
• It is the worker node which handles read, write, update and delete requests from clients.
A RegionServer process runs on every node in the Hadoop cluster, colocated with an HDFS
DataNode. A minimal client sketch follows.
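A minimal sketch of real-time reads and writes through the HBase Java client API. It assumes a table named users with a column family info has already been created (for example via the HBase shell); the row key and values are illustrative:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {

            // Write: a Put addresses a row key, column family and qualifier.
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"),
                          Bytes.toBytes("Alice"));
            table.put(put);

            // Read the cell back in real time.
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(name));
        }
    }
}
```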
MAHOUT
• Mahout is an open-source framework for creating scalable machine
learning algorithms and a data mining library. Once data is stored in Hadoop
HDFS, Mahout provides the data science tools to automatically find
meaningful patterns in those big data sets, as in the recommender sketch below.
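A minimal sketch of a user-based collaborative-filtering recommender with Mahout's Taste API. The input file ratings.csv (lines of userID,itemID,preference), the neighborhood size and the user ID are all assumptions for illustration:

```java
import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class RecommenderExample {
    public static void main(String[] args) throws Exception {
        // ratings.csv holds userID,itemID,preference lines (illustrative file name).
        DataModel model = new FileDataModel(new File("ratings.csv"));
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood =
            new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender =
            new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Top 3 recommendations for user 1.
        List<RecommendedItem> items = recommender.recommend(1, 3);
        for (RecommendedItem item : items) {
            System.out.println(item.getItemID() + " -> " + item.getValue());
        }
    }
}
```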
ZOOKEEPER
• Zookeeper is a centralized service and a Hadoop Ecosystem component for
maintaining configuration information, naming, providing distributed
synchronization, and providing group services.
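A minimal sketch of storing and reading a shared configuration value through the ZooKeeper Java client. The ensemble address localhost:2181, the znode path and the payload are assumptions; note that create fails if the znode already exists:

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZooKeeperExample {
    public static void main(String[] args) throws Exception {
        // Assumed ensemble address; 3000 ms session timeout, no watcher.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, null);

        // Publish a configuration value under a znode (path is illustrative).
        zk.create("/demo-config", "replication=3".getBytes(),
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // Any client in the cluster can read the same value back.
        byte[] data = zk.getData("/demo-config", false, null);
        System.out.println(new String(data));

        zk.close();
    }
}
```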
OOZIE
• It is a workflow scheduler system for managing Apache Hadoop jobs.
• Oozie combines multiple jobs sequentially into one logical unit of work.
• In Oozie, users can create a Directed Acyclic Graph (DAG) of workflow actions,
which can run in parallel or sequentially in Hadoop; a submission sketch follows.
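A minimal sketch of submitting a predefined workflow through the Oozie Java client. The Oozie server URL, the HDFS path to the directory containing workflow.xml, and the user name are all assumptions:

```java
import java.util.Properties;

import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

public class OozieSubmit {
    public static void main(String[] args) throws Exception {
        // Assumed Oozie server URL; adjust for your cluster.
        OozieClient oozie = new OozieClient("http://localhost:11000/oozie");

        Properties conf = oozie.createConfiguration();
        // Points at an HDFS directory containing workflow.xml (the DAG definition).
        conf.setProperty(OozieClient.APP_PATH, "hdfs://localhost:9000/user/demo/my-wf");
        conf.setProperty("user.name", "demo");

        String jobId = oozie.run(conf);
        WorkflowJob job = oozie.getJobInfo(jobId);
        System.out.println("Submitted " + jobId + ", status: " + job.getStatus());
    }
}
```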
SQOOP
• Sqoop imports data from external sources (such as relational databases) into
Hadoop ecosystem components like HDFS, HBase or Hive.
FLUME
• Flume efficiently collects, aggregates and moves large amounts of
data from its origin and sends it to HDFS.
• It uses a simple extensible data model that allows for online
analytic applications.
AMBARI
• Ambari is a management platform for provisioning, managing,
monitoring and securing Apache Hadoop clusters.
DRILL
• Drill is the first distributed SQL query engine with a schema-free
model.
AVRO
• Avro is a part of the Hadoop ecosystem and is one of the most popular data
serialization systems.
• Using its serialization service, programs can serialize data into files or messages.
Avro stores the data definition (schema) and the data together in one message or
file, making it easy for programs to dynamically understand the information stored
in an Avro file or message, as the sketch below shows.
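A minimal sketch of writing and reading an Avro data file with the generic Java API. The record schema, file name and field values are assumptions; note how the reader recovers the schema from the file itself:

```java
import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroExample {
    public static void main(String[] args) throws Exception {
        // The data definition (schema) that travels with the data in the file.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
          + "{\"name\":\"name\",\"type\":\"string\"},"
          + "{\"name\":\"age\",\"type\":\"int\"}]}");

        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "Alice");
        user.put("age", 30);

        // Write: schema and records are stored together in users.avro.
        File file = new File("users.avro");
        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.create(schema, file);
            writer.append(user);
        }

        // Read: the schema is recovered from the file itself.
        try (DataFileReader<GenericRecord> reader =
                 new DataFileReader<GenericRecord>(file, new GenericDatumReader<GenericRecord>())) {
            for (GenericRecord rec : reader) {
                System.out.println(rec.get("name") + " is " + rec.get("age"));
            }
        }
    }
}
```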
THRIFT
• Thrift is a software framework for scalable cross-language service development.
In the Hadoop ecosystem it is used for RPC, for example by the HBase Thrift
gateway and by HiveServer2, so that clients written in many languages can talk
to these services.
• Now you are ready to start your Hadoop cluster in one of the three
supported modes:
• Standalone Operation
• By default, Hadoop is configured to run in a non-distributed mode, as a
single Java process. This is useful for debugging.
• Pseudo-Distributed Operation
• Hadoop can also be run on a single node in a pseudo-distributed mode
where each Hadoop daemon runs in a separate Java process.
• Fully-Distributed Operation
• Hadoop runs on a cluster of machines, with the daemons spread across
many nodes; this is the mode used for production clusters. A configuration
sketch follows.
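The modes differ mainly in configuration. As a minimal sketch, the fs.defaultFS property (normally set in core-site.xml) determines whether a client works against the local filesystem (standalone) or a single-node HDFS (pseudo-distributed); the localhost:9000 address is an assumption:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class ModeCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Standalone mode (the default): the local filesystem.
        // conf.set("fs.defaultFS", "file:///");
        // Pseudo-distributed mode: a single-node HDFS (assumed address).
        conf.set("fs.defaultFS", "hdfs://localhost:9000");

        FileSystem fs = FileSystem.get(conf);
        System.out.println("Working against: " + fs.getUri());
    }
}
```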