
INTRODUCTION TO HADOOP &

HADOOP ARCHITECTURE
PREPARED BY: PARMANAND PATEL
AGENDA

• What is Hadoop?
• Why do we need Hadoop?
• What is HDFS?
• Hadoop Ecosystem
• Moving Data in and out of Hadoop
• Data Serialization
WHAT IS HADOOP AND ITS COMPONENTS?

 Hadoop is a framework that allows the distributed processing of large data sets across clusters of commodity computers using a simple programming model (MapReduce).
 It is an open-source data management framework with scale-out storage and distributed processing.
 It stores data in a flat file structure (data kept as plain files) and is built to run in a Linux environment.
Hadoop Architecture
WHY DO WE NEED HADOOP?
 Suppose there is 1 TB of data that needs to be processed by one machine having 4 I/O channels of 100 MB/s each.
 At a combined 400 MB/s, reading 1 TB (about 1,048,576 MB) takes roughly 2,600 seconds, so it would take about 43 minutes to process this data using a single machine.

 If 10 machines are used instead of one, the data gets stored in a distributed fashion and is processed in parallel on each machine having the same configuration.
 It would then take about 4.3 minutes to process it.
WHAT IS HDFS?
 HDFS stands for Hadoop Distributed File System.
 It is a single logical unit for storing big data across the cluster.
 HDFS has two components: NameNode (Master) & DataNode (Slave).
 NameNode
 It is known as the Master.
 It stores the metadata of HDFS: the directory tree of all files in the file system, and it tracks the files' blocks across the cluster.
 It does not store the actual data or the dataset. The data itself is stored in the DataNodes.
 It manages the file system namespace by executing operations such as opening, renaming and closing files.
 It is a single point of failure in a Hadoop cluster.
DataNode
 It is also known as the Slave.
 It is responsible for storing the actual data in HDFS.
 An HDFS cluster contains multiple DataNodes.
 Each DataNode holds multiple data blocks. These data blocks are used to store the data.
 It is the responsibility of the DataNode to serve read and write requests from the file system's clients.
 It performs block creation, deletion, and replication upon instruction from the NameNode; a small client-side sketch of this metadata/data split is shown below.
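A minimal Java sketch of this split, using the HDFS FileSystem client API (the path /data/sample.txt is hypothetical): metadata such as block placement comes from the NameNode, while the bytes themselves are streamed from the DataNodes.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsMetadataVsData {
      public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);        // client handle to HDFS
        Path file = new Path("/data/sample.txt");    // hypothetical file

        // Metadata comes from the NameNode: size, replication, block placement.
        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation b : blocks) {
          System.out.println("Block at offset " + b.getOffset()
              + " stored on DataNodes: " + String.join(",", b.getHosts()));
        }

        // The actual bytes are read from the DataNodes that hold the blocks.
        try (FSDataInputStream in = fs.open(file)) {
          System.out.println("First byte of data: " + in.read());
        }
      }
    }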
HDFS ARCHITECTURE
SECONDARY NAMENODE

 The Secondary NameNode works concurrently with the primary NameNode as a helper daemon.
 The Secondary NameNode constantly reads the file system state and metadata from the RAM of the NameNode and writes it to the hard disk or the file system.
 It is responsible for combining the EditLogs with the FsImage from the NameNode.
 It downloads the EditLogs from the NameNode at regular intervals and applies them to the FsImage.
PROBLEMS & SOLUTIONS

 Storing big data
 HDFS provides a distributed way to store the data; data is stored in blocks.
 We can specify the size of each block (see the sketch after this slide).
 Divide the data into blocks (e.g., four blocks); because commodity hardware is used, we can store the data easily.
 Storing a variety of data
 The data comes in structured, unstructured and semi-structured form.
 Once the data is written, it can be read many times later on.
 No schema validation is done while dumping the data.
 Moving processing logic to the DataNodes
 If all DataNodes sent their data to the master node for processing, the network would get congested; instead, the processing logic is sent to the DataNodes, which process their local data on the slave nodes themselves.
 Then small chunks of results can be sent back to the master node.
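A hedged sketch of specifying the block size at write time, using the FileSystem.create overload that takes replication and block size; the 64 MB value, replication factor and path are only examples:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class WriteWithBlockSize {
      public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        long blockSize = 64L * 1024 * 1024;   // example: 64 MB blocks instead of the cluster default
        short replication = 3;                // example replication factor
        int bufferSize = 4096;

        // create(path, overwrite, bufferSize, replication, blockSize)
        try (FSDataOutputStream out = fs.create(
            new Path("/data/big-input.txt"), true, bufferSize, replication, blockSize)) {
          out.writeUTF("each file is split into blocks of the requested size");
        }
      }
    }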
MapReduce
 It is a processing technique and a programming model for distributed computing based on Java.
 The MapReduce algorithm contains two important tasks: 1) Map 2) Reduce.
 Map takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs).
 The Reduce task takes the output from a map as its input and combines those data tuples into a smaller set of tuples.
 map() is used to perform actions like filtering, grouping and sorting data.
 The results of the map() function are aggregated in the reduce() phase.
 Various languages are supported by the MapReduce framework.
Example
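The original example slide is not reproduced here; as a hedged stand-in, a minimal word-count map and reduce pair in Java (the classic illustration of key/value tuples; job wiring and input paths are shown later in the YARN section):

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {

      // Map: break each input line into (word, 1) tuples.
      public static class TokenizerMapper
          extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
          }
        }
      }

      // Reduce: sum the counts for each word into a smaller set of tuples.
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          context.write(key, new IntWritable(sum));
        }
      }
    }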
YARN
 YARN (Yet Another Resource Negotiator) is the resource management layer in Hadoop.
 YARN allows various data processing engines, such as interactive processing, graph processing, batch processing, and stream processing, to run and process data stored in HDFS.
 The main components of YARN:
 Resource Manager
 It is the master daemon of YARN.
 It is responsible for resource assignment and for managing several other applications.
 It is used for job scheduling.
 When it receives a processing request, it forwards it to the appropriate Node Manager and allocates resources for the completion of the request accordingly.
 It has two components: Scheduler & Application Manager.
YARN Architecture
YARN
 Node Manager
 It monitors each container's resource usage and reports it to the Resource Manager.
 The health of the node on which YARN is running is tracked by the Node Manager.
 It takes care of each node in the cluster while managing the workflow, along with user jobs on a particular node.
 It keeps the data in the Resource Manager updated.
 The Node Manager can also destroy or kill a container as directed by the Resource Manager.
YARN
 Application Master
 An application is a single job submitted to the framework. Each such application has a unique Application Master associated with it, which is a framework-specific entity.
 It is the process that coordinates an application's execution in the cluster and also manages faults.
 Its task is to negotiate resources from the Resource Manager and work with the Node Manager to execute and monitor the component tasks.
 It is responsible for negotiating appropriate resource containers from the Resource Manager, tracking their status and monitoring progress.
 Once started, it periodically sends heartbeats to the Resource Manager to affirm its health and to update the record of its resource demands.
YARN
 Container
 It is a collection of physical resources such as RAM, CPU cores, and disks on a single node.
 YARN containers are managed by a Container Launch Context (CLC), the container life-cycle record. This record contains a map of environment variables, dependencies stored in remotely accessible storage, security tokens, the payload for Node Manager services and the command necessary to create the process.
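As a hedged sketch of how a client hands work to these components, the driver below submits the word-count classes from the earlier sketch as a MapReduce job: the Resource Manager schedules it, an Application Master coordinates it, and containers on Node Managers run the map and reduce tasks. The input and output paths are hypothetical.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");        // application submitted to YARN
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCount.TokenizerMapper.class);  // classes from the earlier sketch
        job.setCombinerClass(WordCount.IntSumReducer.class);
        job.setReducerClass(WordCount.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/data/input"));    // hypothetical paths
        FileOutputFormat.setOutputPath(job, new Path("/data/output"));
        // Blocks until the Application Master reports completion to the Resource Manager.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }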
HADOOP ECOSYSTEM
HADOOP ECOSYSTEM
 The Hadoop ecosystem is a framework comprising a group of tools used by various companies for different tasks.
 We need a variety of tools on top of HDFS to process big data.
 Hadoop is a batch processing system.
 Other tools are required to process the data if we need real-time insights from it.
 HDFS: the storage unit of Hadoop, formed from the NameNode and DataNodes.
 YARN: Yet Another Resource Negotiator; it allocates resources to run a particular task over the Hadoop cluster.
 Two main services: Resource Manager and Node Manager.
APACHE PIG

 The tool was developed by Yahoo. It has two main parts: Pig Latin (the language) and the Pig runtime (the execution environment).
 Pig Latin is easy for those who do not want to write huge MapReduce programs.
 1 line of Pig ≈ 100 lines of a MapReduce job.
 Pig first loads the data, then performs various functions like grouping, filtering, joining, sorting, etc.
 Pig internally has a runtime engine, which converts Pig Latin queries into MapReduce jobs.
 Pig can digest any kind of data.
APACHE HIVE

 Hive was developed at Facebook.
 It is a data warehousing component, which analyses datasets in a distributed environment using an SQL-like interface.
 HQL – Hive Query Language.
 Hive is also used for writing simple queries for processing data.
 The syntax of HQL is similar to SQL.
 Two basic components: the Hive command line and the JDBC/ODBC driver.
 The Hive command line interface is used to execute HQL commands.
 Java Database Connectivity (JDBC) and Open Database Connectivity (ODBC) drivers are used to establish a connection to the data storage (a JDBC sketch follows below).
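A hedged sketch of the JDBC route in Java; the HiveServer2 host, port, credentials and table are placeholders, and the query illustrates that HQL reads like SQL:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveJdbcExample {
      public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");   // HiveServer2 JDBC driver

        // Placeholder host/port/database; HQL syntax mirrors SQL.
        try (Connection con = DriverManager.getConnection(
                 "jdbc:hive2://localhost:10000/default", "user", "");
             Statement stmt = con.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT category, COUNT(*) FROM sales GROUP BY category")) {
          while (rs.next()) {
            System.out.println(rs.getString(1) + " -> " + rs.getLong(2));
          }
        }
      }
    }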
MAHOUT

 Mahout is a machine learning library written in Java.
 Mahout adds machine-learning capability to a system or application.
 It is used for classifying huge data sets into various groups and for creating recommendation engines.
 It provides various libraries or functionalities such as collaborative filtering, clustering, and classification.
 It has a predefined set of libraries containing inbuilt algorithms for different use cases, e.g., Market Basket Analysis.
SPARK

 Apache Spark is a framework for real-time data analytics in a distributed computing environment.
 It executes in-memory computations to increase the speed of data processing over MapReduce.
 Apache Spark is a leading tool in the Hadoop ecosystem; it performs real-time analytics on huge data sets.
 It can be up to 100 times faster than the MapReduce system.
 It does in-memory computation to increase the speed of data processing.
 It is a platform that handles resource-intensive tasks like batch processing, interactive or iterative real-time processing, graph conversions, and visualization, etc.
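As a hedged illustration, the same word count expressed with Spark's Java API (Spark 2.x or later assumed; the paths and local master setting are placeholders), showing the higher-level, in-memory-oriented programming model:

    import java.util.Arrays;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public class SparkWordCount {
      public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("spark-word-count").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
          JavaRDD<String> lines = sc.textFile("hdfs:///data/input");      // hypothetical path
          JavaPairRDD<String, Integer> counts = lines
              .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
              .mapToPair(word -> new Tuple2<>(word, 1))
              .reduceByKey(Integer::sum);                                  // aggregation happens in memory
          counts.saveAsTextFile("hdfs:///data/output");
        }
      }
    }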
HBASE

 HBase is an open-source, non-relational distributed database, or NoSQL database.
 It supports all types of data, which is why it is capable of handling anything and everything inside a Hadoop ecosystem.
 It provides the capabilities of Google's BigTable, and is thus able to work on big data sets effectively.
 HBase was designed to run on top of HDFS and provides BigTable-like capabilities.
 It gives us a fault-tolerant way of storing sparse data, which is common in most big data use cases.
 HBase is written in Java, whereas HBase applications can be accessed through REST, Avro and Thrift APIs.
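A hedged sketch using the HBase Java client; the table user_events, the column family cf and the row key are hypothetical and assumed to exist already:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseSparseWrite {
      public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("user_events"))) {

          // Sparse row: only the columns that exist for this row are stored.
          Put put = new Put(Bytes.toBytes("user#42"));
          put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("last_login"), Bytes.toBytes("2024-01-01"));
          table.put(put);

          Result result = table.get(new Get(Bytes.toBytes("user#42")));
          System.out.println(Bytes.toString(
              result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("last_login"))));
        }
      }
    }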
DRILL

 Apache Drill is used to drill into any kind of data.
 It is an open-source application which works in a distributed environment to analyze large data sets.
 It is a replica of Google Dremel.
 It supports different kinds of NoSQL databases and file systems, which is a powerful feature of Drill.
 For example: Azure Blob Storage, Google Cloud Storage, HBase, MongoDB, MapR-DB, HDFS, MapR-FS, Amazon S3, Swift, NAS and local files.
 The main strength of Apache Drill is its ability to combine several data stores with a single query.
ZOOKEEPER & OOZIE

 There was a huge issue of coordination and synchronization among the resources or components of Hadoop, which often resulted in inconsistency.
 ZooKeeper overcame these problems by performing synchronization, inter-component communication, grouping, and maintenance.

 Oozie simply performs the task of a scheduler, scheduling jobs and binding them together as a single unit.
 There are two kinds of jobs, i.e., Oozie workflow jobs and Oozie coordinator jobs.
 Oozie workflow jobs are those that need to be executed in a sequentially ordered manner, whereas Oozie coordinator jobs are those that are triggered when some data or external stimulus is given to them.
Moving data in and out of Hadoop
Moving data in and out of Hadoop

 Moving data in and out of Hadoop, known as data ingress and egress, is the process by which data is transported from an external system into an internal system, and vice versa. Hadoop supports ingress and egress at a low level in HDFS and MapReduce.
 Files can be moved in and out of HDFS, and data can be pulled from external data sources and pushed to external data sinks using MapReduce.
Hadoop data ingress and egress transport data between an external system and an internal one
Methods and tools for Data Ingestion

 In Hadoop, data ingestion (ingress) and data egress (egress) processes are critical for
efficiently moving data in and out of the Hadoop ecosystem.
 Data Ingestion (Ingress)
 Data ingestion in Hadoop refers to the process of importing data from various sources
into the Hadoop Distributed File System (HDFS) or other Hadoop ecosystem
components. This process is crucial for ensuring that the data required for analysis is
available in the Hadoop cluster.
 There are several methods and tools for data ingestion in Hadoop:
 Batch Ingestion:
 HDFS Command Line: Using HDFS shell commands like hadoop fs -put or hadoop fs -
copyFromLocal to load data into HDFS.
 Apache Sqoop: A tool designed for efficiently transferring bulk data between
Hadoop and structured data stores like relational databases.
 Flume: Used for collecting, aggregating, and moving large amounts of log data to
HDFS.
 Apache Nifi: A data integration tool that supports a wide range of data sources and
can ingest data into HDFS.
 Real-time Ingestion:
 Apache Kafka: A distributed streaming platform that can be used to ingest real-time data streams into Hadoop (a producer sketch follows after this list).
 Apache Flume: Can also be configured for real-time data ingestion.
 Apache Storm: A real-time computation system that can process streams of data and
feed the results into Hadoop.
 File Formats:
 Data can be ingested in various formats such as plain text, CSV, JSON, Avro,
Parquet, ORC, etc.
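A hedged sketch of the real-time ingestion path using the Kafka Java producer (the broker address, topic name and message are placeholders); a Flume agent, a Kafka Connect HDFS sink, or a consumer job would then land these records in HDFS:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class LogEventProducer {
      public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder broker
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
          // Each log event becomes a record on the "web-logs" topic.
          producer.send(new ProducerRecord<>("web-logs", "host-1", "GET /index.html 200"));
          producer.flush();
        }
      }
    }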

 Data Egress (Egress)


 Data egress in Hadoop refers to the process of exporting data from Hadoop to external
systems or users. This process is necessary for sharing processed data, integrating with
other systems, or for further analysis outside the Hadoop environment.
 Common methods for data egress include:
 Batch Export:
 HDFS Command Line: Using commands like hadoop fs -get or hadoop fs -copyToLocal to move data from HDFS to the local filesystem (a Java equivalent is sketched after this list).
 Apache Sqoop: Also supports exporting data from Hadoop to relational databases.
 Apache Nifi: Can be used to move data from HDFS to various destinations.
 Real-time Export:
 Apache Kafka: Can be used to stream data from Hadoop to other systems in real-time.
 Apache Flume: Can be configured to move data from HDFS to other systems in real-time.
 File Formats:
 Data can be exported in the same formats it was ingested or transformed into other formats as
needed.
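For the batch-export path referenced above, a hedged Java equivalent of hadoop fs -copyToLocal (both paths are hypothetical):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ExportFromHdfs {
      public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Equivalent to: hadoop fs -copyToLocal /data/output/part-r-00000 /tmp/part-r-00000
        fs.copyToLocalFile(new Path("/data/output/part-r-00000"),
                           new Path("file:///tmp/part-r-00000"));
      }
    }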
Data Serialization in Big Data
What is Data Serialization?

 Data serialization in big data refers to the process of converting complex data
structures, such as objects or records, into a format that can be easily stored,
transmitted, and reconstructed later.

 It is essential for efficiently managing and processing large volumes of data in big
data environments.

 Serialization helps in data exchange between different components of a big data


system, such as data producers, consumers, storage systems, and processing
frameworks.
Architecture of Data serialization
Example of Data serialization
 The architecture of data serialization in the context of big data involves several key
components and processes that work together to convert data into a format that can be
efficiently stored, transmitted, and reconstructed. The architecture can be broadly
divided into the following stages:
1. Data Sources and Producers
 Data can originate from various sources, such as databases, sensors, log files, user
interactions, and external APIs. These sources are often referred to as data producers.
2. Serialization Libraries/Frameworks
 Serialization libraries or frameworks provide the tools and APIs required to serialize and
deserialize data. These libraries handle the conversion of data structures into serialized
formats and vice versa.
 Common serialization libraries include:
 Apache Avro
 Apache Parquet
 Apache ORC
 Google Protocol Buffers
 Apache Thrift
 JSON Libraries (e.g., Jackson, Gson)
 XML Libraries (e.g., JAXB)
3. Schema Definition and Management
 Some serialization formats, such as Avro, Parquet, and Protocol Buffers, use schemas to define the
structure of the data. The schema provides a blueprint for how data should be serialized and
deserialized. Schemas are essential for ensuring data compatibility and enabling schema evolution.
 Schema Registry: A centralized repository that stores and manages schemas, ensuring that all
components in the data pipeline use consistent and compatible schemas.
4. Data Serialization Process
 The process of serialization involves converting data from its native format into a serialized format.
This typically involves the following steps:
 Schema Validation: Ensuring that the data conforms to the defined schema.
 Serialization: Converting the data into the specified serialized format (binary or text-based).
 Compression: Optionally compressing the serialized data to reduce its size.
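A hedged sketch of these three steps with Apache Avro's Java API; the User schema and field values are illustrative only. Fields are checked against the schema as the record is built, the record is written in Avro's binary container format, and Snappy compression is applied as the optional compression step.

    import java.io.File;
    import java.io.IOException;
    import org.apache.avro.Schema;
    import org.apache.avro.file.CodecFactory;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.DatumWriter;

    public class AvroSerializeExample {
      private static final String SCHEMA_JSON =
          "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
          + "{\"name\":\"name\",\"type\":\"string\"},"
          + "{\"name\":\"age\",\"type\":\"int\"}]}";

      public static void main(String[] args) throws IOException {
        Schema schema = new Schema.Parser().parse(SCHEMA_JSON);

        GenericRecord user = new GenericData.Record(schema);  // record bound to the schema
        user.put("name", "Alice");
        user.put("age", 30);

        DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<>(schema);
        try (DataFileWriter<GenericRecord> fileWriter = new DataFileWriter<>(datumWriter)) {
          fileWriter.setCodec(CodecFactory.snappyCodec());    // optional compression step
          fileWriter.create(schema, new File("users.avro"));  // schema is embedded in the file
          fileWriter.append(user);                            // binary serialization
        }
      }
    }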
5. Data Transmission
 Serialized data can be transmitted over networks to various destinations, such as storage
systems, processing frameworks, or other services. Efficient serialization ensures that data
transmission is fast and reliable.
6. Data Storage
 Serialized data is stored in storage systems such as HDFS (Hadoop Distributed File System),
cloud storage (e.g., Amazon S3, Google Cloud Storage), or traditional databases. The choice
of storage depends on factors like data volume, access patterns, and performance
requirements.
7. Data Processing Frameworks
 Big data processing frameworks (e.g., Apache Hadoop, Apache Spark, Apache Flink) read
and process serialized data. These frameworks use serialization libraries to deserialize the
data, perform computations, and serialize the results if needed.
8. Data Consumers
 Data consumers are applications or services that consume serialized data for various
purposes, such as analytics, machine learning, reporting, and visualization. They deserialize
the data to reconstruct the original data structures for further use.
9. Data Deserialization Process
 Deserialization is the reverse process of serialization, where serialized data is converted back
into its original or usable format. This involves:
 Schema Lookup: Retrieving the appropriate schema for deserialization.
 Deserialization: Converting the serialized data back into its native format.
 Validation: Ensuring the deserialized data conforms to the expected structure and content.
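A hedged counterpart to the earlier Avro sketch, deserializing the records; the writer's schema is embedded in users.avro, so no separate schema lookup is needed here:

    import java.io.File;
    import java.io.IOException;
    import org.apache.avro.file.DataFileReader;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.DatumReader;

    public class AvroDeserializeExample {
      public static void main(String[] args) throws IOException {
        DatumReader<GenericRecord> datumReader = new GenericDatumReader<>();
        try (DataFileReader<GenericRecord> fileReader =
                 new DataFileReader<>(new File("users.avro"), datumReader)) {
          while (fileReader.hasNext()) {
            GenericRecord user = fileReader.next();   // back to a usable record
            System.out.println(user.get("name") + " is " + user.get("age"));
          }
        }
      }
    }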
Importance of Data Serialization in Big Data

 Efficient Storage: Serialized data can be stored in a compact and efficient manner,
reducing the storage footprint and improving access speeds.
 Data Transmission: Serialization enables data to be easily transmitted over a
network between different systems or components in a big data architecture.
 Interoperability: It allows data to be shared between different programming
languages and platforms, facilitating integration and collaboration.
 Schema Evolution: Some serialization formats support schema evolution, allowing
changes to the data structure without breaking existing applications.
Applications
 Persisting data onto files
 Storing data into Databases
 Transferring data through the network
 Remote Method Invocation
 Sharing data in a Distributed Object Model
Thank You
