DS Unit 4.1

HADOOP ECOSYSTEM

Based on Data Flair (https://data-flair.training/blogs/hadoop-tutorial/)

INTRODUCTION
• With the growth of the internet, the web expanded from a few to
millions of web pages between the 1990s and the 2000s, and compiling
search results by hand became tedious and needed automation.
• Doug Cutting and Mike Cafarella started working on an open-source
web search engine called Nutch for faster data distribution and
collection.
• In 2006, Cutting joined Yahoo and took with him the Nutch
project as well as ideas based on Google’s early work with
automating distributed data storage and processing.
• The Nutch project was divided – the web crawler portion
remained as Nutch and the distributed computing and
processing portion became Hadoop.
• "Hadoop” was the name of a yellow toy elephant owned
by the son of Doug Cutting.
• In 2006, Hadoop was released by Yahoo, and today it is
maintained and distributed by the Apache Software
Foundation (ASF).


• The Apache Hadoop software library is a framework that allows for the
distributed processing of large data sets across clusters of computers using
simple programming models.

• It offers the flexibility to store and mine any type of data, whether structured, semi-structured or unstructured.

• It is an open source project from the Apache Software Foundation.

HADOOP ARCHITECTURE
• Hadoop works in a master-slave fashion: there is
one master node and n slave nodes.

• The master manages, maintains and monitors the
slaves, while the slaves are the actual worker
nodes.

• The master stores the metadata (data about data),
while the slaves are the nodes that store the
data in the cluster.

• The client connects to the master node to
perform any task.


HADOOP DAEMONS
• Daemons are processes that run in the
background.

• There are mainly 4 daemons which run for Hadoop:

• NameNode – runs on the master node for HDFS.
• DataNode – runs on slave nodes for HDFS.
• ResourceManager – runs on the master node for YARN.
• NodeManager – runs on slave nodes for YARN.

• These 4 daemons must run for Hadoop to be functional.

• Apart from these, there can be a secondary
NameNode, a standby NameNode, a Job HistoryServer,
etc.


HADOOP DISTRIBUTED FILE SYSTEM (HDFS)


• The primary storage system of Hadoop.

• A Java-based file system that provides scalable, fault-tolerant, reliable and cost-efficient data storage for Big Data.

• A distributed file system that runs on commodity hardware.

• HDFS is designed to handle huge volumes of data; the expected file size is in the
range of GBs to TBs. A file is split into blocks (default 128 MB) and stored in a
distributed way across multiple machines. Each block is replicated according to the
replication factor and the replicas are stored on different nodes, which handles
the failure of a node in the cluster (see the sketch below).
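
Below is a minimal sketch (not from the slides) of writing a file to HDFS with the Hadoop Java FileSystem API. The NameNode address and file path are assumed placeholders; HDFS transparently splits the written data into blocks and replicates them across DataNodes.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; in a real cluster this comes from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");

        try (FileSystem fs = FileSystem.get(conf);
             FSDataOutputStream out = fs.create(new Path("/user/demo/hello.txt"))) {
            // The file is split into blocks (default 128 MB) and replicated
            // across DataNodes by HDFS; the client just writes bytes.
            out.writeUTF("Hello, HDFS!");
        }
    }
}
```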

HDFS COMPONENTS
• NameNode is also known as Master node.

• NameNode does not store the actual data or dataset.

• NameNode stores metadata, i.e. the number of blocks, their locations, on which
rack and on which DataNode the data is stored, and other details.

• Tasks of the HDFS NameNode:

• Manages the file system namespace.
• Regulates clients' access to files.
• Executes file system operations such as naming, closing and opening files and
directories.


HDFS COMPONENTS
• DataNode is also known as Slave.
• It is responsible for storing the actual data in HDFS.
• It performs read and write operations as per the requests of the clients.

• Tasks of the HDFS DataNode:

• Performs block replica creation, deletion and replication
according to the instructions of the NameNode.
• Manages the data storage of the system.


MAPREDUCE
• Hadoop MapReduce is a framework for distributed processing of huge
volumes of data over a cluster of nodes.

• It is a software framework for easily writing applications that process the vast
amounts of structured and unstructured data stored in the Hadoop
Distributed File System.

• As data is stored in a distributed manner in HDFS, this provides the way for
MapReduce to perform parallel processing.



MAPREDUCE
• ‘MapReduce’ works by breaking the processing into two phases:
• Map phase
• Reduce phase

• Each phase has key-value pairs as input and output.

• The Map function takes a set of data and converts it into another set of data,
where individual elements are broken down into tuples (key/value pairs).

• The Reduce function takes the output from the Map as its input, combines
those data tuples based on the key, and accordingly modifies the value of
the key (see the word-count sketch below).
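
A classic illustration of the two phases is word counting. The sketch below uses the Hadoop MapReduce Java API; it is a minimal example, and the job driver (input/output paths, job submission) is omitted.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
    // Map phase: each input line is broken down into (word, 1) key/value pairs.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce phase: all values for the same key (word) are combined into a count.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}
```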


YARN (YET ANOTHER RESOURCE NEGOTIATOR)

• A Hadoop ecosystem component that provides resource management.

• Called the operating system of Hadoop, as it is responsible for managing
and monitoring workloads.

• Allows multiple data processing engines, such as real-time streaming and
batch processing, to handle data stored on a single platform.
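
As a small, hedged illustration of YARN's role as the cluster resource manager, the sketch below uses the YarnClient Java API to ask the ResourceManager which NodeManagers are currently running; it assumes the cluster configuration (yarn-site.xml) is on the classpath.

```java
import java.util.List;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnNodesExample {
    public static void main(String[] args) throws Exception {
        YarnClient yarn = YarnClient.createYarnClient();
        yarn.init(new YarnConfiguration());
        yarn.start();

        // Ask the ResourceManager for all NodeManagers that are currently running.
        List<NodeReport> nodes = yarn.getNodeReports(NodeState.RUNNING);
        for (NodeReport node : nodes) {
            System.out.println(node.getNodeId() + "\t"
                    + node.getNumContainers() + " containers\t"
                    + node.getCapability());
        }

        yarn.stop();
    }
}
```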



HIVE
• an open source data warehouse system for querying and analyzing large datasets
stored in Hadoop files.

• Performs three main functions: data summarization, query, and analysis.

• Hive uses HiveQL (HQL), which is similar to SQL.

• Hive automatically translates the SQL-like HiveQL queries into MapReduce jobs which
execute on Hadoop.

• Main parts of Hive are:

• Metastore – stores the metadata.
• Driver – manages the lifecycle of a HiveQL statement.
• Query compiler – compiles HiveQL into a Directed Acyclic Graph (DAG).
• Hive server – provides a Thrift interface and a JDBC/ODBC server.
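
Since the Hive server exposes a JDBC interface, a HiveQL query can be issued from Java as in the sketch below; the HiveServer2 URL, the empty credentials and the `sales` table are assumptions for illustration.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // Assumed HiveServer2 endpoint; adjust to the actual cluster.
        String url = "jdbc:hive2://localhost:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "", "");
             Statement stmt = conn.createStatement();
             // The SQL-like HiveQL query is translated into jobs that run on Hadoop.
             ResultSet rs = stmt.executeQuery(
                     "SELECT category, COUNT(*) FROM sales GROUP BY category")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}
```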


PIG
• A high-level language platform for analyzing and querying huge datasets
that are stored in HDFS.

• Pig uses the Pig Latin language, which is very similar to SQL.

• It loads the data, applies the required filters and dumps the data in the
required format (see the sketch below).

• For program execution, Pig requires a Java runtime environment.
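
A hedged sketch of the load–filter–store flow described above, driven from Java through Pig's PigServer API; the input file, its field layout and the output directory are assumptions.

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigExample {
    public static void main(String[] args) throws Exception {
        // LOCAL mode for a quick test; MAPREDUCE mode would run on the cluster.
        PigServer pig = new PigServer(ExecType.LOCAL);

        // Load the data, apply the required filter, and store the result.
        pig.registerQuery("records = LOAD 'sales.csv' USING PigStorage(',') "
                + "AS (item:chararray, amount:int);");
        pig.registerQuery("big_sales = FILTER records BY amount > 100;");
        pig.store("big_sales", "big_sales_out");

        pig.shutdown();
    }
}
```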



HBASE
• HBase is a Hadoop ecosystem component: a distributed database designed to
store structured data in tables that can have billions of rows
and millions of columns.
• HBase is a scalable, distributed, NoSQL database that is built on top of HDFS.
• HBase provides real-time access to read or write data in HDFS.
• HBase Components:
• HBase Master
• It is not part of the actual data storage but negotiates load balancing across all
RegionServers.
• Maintains and monitors the Hadoop cluster.
• Performs administration (interface for creating, updating and deleting tables).
• Controls the failover.
• HMaster handles DDL operations.
• RegionServer
• It is the worker node which handles read, write, update and delete requests from clients.
The RegionServer process runs on every node in the Hadoop cluster; the RegionServer runs on
the HDFS DataNode.
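
A brief sketch of writing and reading a single cell with the HBase Java client API; the table name (`users`), column family (`info`) and row key are assumptions, and the connection settings are taken from hbase-site.xml on the classpath.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) {

            // Write one cell; RegionServers handle the actual read/write requests.
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);

            // Read the cell back.
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(value));
        }
    }
}
```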


MAHOUT
• Mahout is an open source framework for creating scalable machine
learning algorithms and a data mining library. Once data is stored in Hadoop
HDFS, Mahout provides the data science tools to automatically find
meaningful patterns in those big data sets.

• Algorithms of Mahout are:

• Clustering – takes items and organizes them into naturally occurring
groups, such that items belonging to the same group are similar to each other.
• Collaborative filtering – mines user behaviour and makes product
recommendations (e.g. Amazon recommendations).
• Classification – learns from existing categorizations and then assigns
unclassified items to the best category.
• Frequent pattern mining – analyzes items in a group (e.g. items in a shopping
cart or terms in a query session) and then identifies which items typically appear
together.



ZOOKEEPER
• Zookeeper is a centralized service and a Hadoop Ecosystem component for
maintaining configuration information, naming, providing distributed
synchronization, and providing group services.

• Zookeeper manages and coordinates a large cluster of machines.

• Zookeeper is fast with workloads where reads of data are more
common than writes; the ideal read/write ratio is around 10:1.

• Zookeeper maintains a record of all transactions.
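
A small sketch of the ZooKeeper Java client: connect to an ensemble, store a configuration value in a znode, and read it back. The server address, znode path and data are assumptions.

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZooKeeperExample {
    public static void main(String[] args) throws Exception {
        // Assumed ensemble address; 3-second session timeout, no watcher.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, null);

        // Store a piece of configuration information under a znode.
        zk.create("/demo-config", "replication=3".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // Reads like this are the common, fast path in ZooKeeper workloads.
        byte[] data = zk.getData("/demo-config", false, null);
        System.out.println(new String(data));

        zk.close();
    }
}
```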


OOZIE
• It is a workflow scheduler system for managing Apache Hadoop jobs.

• Oozie combines multiple jobs sequentially into one logical unit of work.

• In Oozie, users can create a Directed Acyclic Graph of workflow, which can
run in parallel and sequentially in Hadoop.

• Oozie is scalable and can manage the timely execution of thousands of
workflows in a Hadoop cluster. Oozie is also very flexible.

• There are two basic types of Oozie jobs:

• Oozie Workflow – stores and runs workflows composed of Hadoop jobs, e.g.
MapReduce, Pig, Hive.
• Oozie Coordinator – runs workflow jobs based on predefined schedules and the
availability of data.
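
A hedged sketch of submitting a workflow job from Java with the OozieClient API; the Oozie server URL, the HDFS path holding workflow.xml, and the NameNode/ResourceManager addresses are all assumptions.

```java
import java.util.Properties;
import org.apache.oozie.client.OozieClient;

public class OozieSubmitExample {
    public static void main(String[] args) throws Exception {
        // Assumed Oozie server endpoint.
        OozieClient oozie = new OozieClient("http://localhost:11000/oozie");

        Properties conf = oozie.createConfiguration();
        // Assumed HDFS path containing workflow.xml for the DAG of Hadoop jobs.
        conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode:9000/user/demo/workflow");
        conf.setProperty("nameNode", "hdfs://namenode:9000");
        conf.setProperty("jobTracker", "resourcemanager:8032");

        // Submit and start the workflow job.
        String jobId = oozie.run(conf);
        System.out.println("Workflow job submitted: " + jobId);
    }
}
```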



SQOOP

• Sqoop imports data from external sources into related Hadoop ecosystem
components like HDFS, HBase or Hive.

• It also exports data from Hadoop to other external sources.

• Sqoop works with relational databases such as Teradata, Netezza, Oracle and
MySQL.


FLUME
• Flume efficiently collects, aggregates and moves large amounts
of data from its origin and sends it to HDFS.

• It is a fault-tolerant and reliable mechanism. This Hadoop ecosystem
component allows the data to flow from the source into the Hadoop
environment.

• It uses a simple extensible data model that allows for online
analytic applications.

• Using Flume, we can get data from multiple servers
immediately into Hadoop.



AMBARI
• Ambari is a management platform for provisioning, managing,
monitoring and securing Apache Hadoop clusters.

• Hadoop management gets simpler as Ambari provides a consistent,
secure platform for operational control.

• Ambari easily and efficiently creates and manages clusters at
scale.

• Ambari reduces the complexity of administering and configuring
cluster security across the entire platform.

• Ambari ensures that the cluster is healthy and available with a
holistic approach to monitoring.


DRILL

• Remember that the main purpose of the Hadoop ecosystem components
is large-scale data processing, including structured
and semi-structured data.

• Drill is a low-latency distributed query engine that is designed to
scale to several thousands of nodes and query petabytes of data.

• Drill is the first distributed SQL query engine that has a schema-
free model.
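
Drill ships with a JDBC driver, so a schema-free query can be issued from Java as in the hedged sketch below; the connection string (an embedded Drillbit via `zk=local`) and the sample JSON file path are assumptions.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DrillQueryExample {
    public static void main(String[] args) throws Exception {
        // "zk=local" runs an embedded Drillbit; a cluster would use its ZooKeeper quorum.
        try (Connection conn = DriverManager.getConnection("jdbc:drill:zk=local");
             Statement stmt = conn.createStatement();
             // Schema-free: Drill infers the structure of the JSON file at query time.
             ResultSet rs = stmt.executeQuery(
                     "SELECT name, age FROM dfs.`/data/users.json` LIMIT 10")) {
            while (rs.next()) {
                System.out.println(rs.getString("name") + "\t" + rs.getLong("age"));
            }
        }
    }
}
```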



AVRO
• Avro is a part of the Hadoop ecosystem and a popular data serialization
system.

• Using the serialization service, programs can serialize data into files or messages. Avro
stores the data definition and the data together in one message or file, making it
easy for programs to dynamically understand the information stored in an Avro file
or message.

• Avro schema – Avro relies on schemas for serialization/deserialization and
requires the schema for data writes/reads. When Avro data is stored in a file,
its schema is stored with it, so that the file may be processed later by any
program.

• Dynamic typing – refers to serialization and deserialization without code
generation. It complements the code generation which is available in Avro
for statically typed languages as an optional optimization.
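
A small sketch of these two points with the Avro Java API: the schema is parsed at runtime, a record is built without any generated classes (dynamic typing), and the schema is written into the output file together with the data. The record layout and file name are assumptions.

```java
import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroWriteExample {
    public static void main(String[] args) throws Exception {
        // The schema is required for writing; it travels with the data in users.avro.
        Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
                + "{\"name\":\"name\",\"type\":\"string\"},"
                + "{\"name\":\"age\",\"type\":\"int\"}]}");

        // Dynamic typing: no generated classes are needed to build this record.
        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "Alice");
        user.put("age", 30);

        try (DataFileWriter<GenericRecord> writer =
                     new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.create(schema, new File("users.avro"));
            writer.append(user);
        }
    }
}
```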


THRIFT

• Thrift is a software framework for scalable cross-language services
development.

• Thrift is an interface definition language for RPC (Remote Procedure Call)
communication.



• Hadoop is downloadable from https://hadoop.apache.org/releases.html


Hadoop: Setting up a Single Node Cluster.

• Now you are ready to start your Hadoop cluster in one of the three
supported modes:

• Local (Standalone) Mode
• Pseudo-Distributed Mode
• Fully-Distributed Mode



• Standalone Operation
• By default, Hadoop is configured to run in a non-distributed mode, as a
single Java process. This is useful for debugging.

• Pseudo-Distributed Operation
• Hadoop can also be run on a single node in pseudo-distributed mode,
where each Hadoop daemon runs in a separate Java process.

• Hadoop Cluster Setup
• How to install and configure Hadoop clusters ranging from a few nodes to
extremely large clusters with thousands of nodes.
