Unit-2 - Introduction To Hadoop and Hadoop Architecture
HADOOP ARCHITECTURE
PREPARED BY: PARMANAND PATEL
AGENDA
• What is Hadoop?
• Need of using Hadoop?
• What is HDFS?
• Hadoop Ecosystem
• Moving Data in and out of Hadoop
• Data Serialization
WHAT IS HADOOP AND ITS COMPONENTS?
Hadoop is an open-source Apache framework for storing and processing very large data sets across clusters of commodity hardware. Its core components are HDFS for distributed storage, YARN for resource management, and MapReduce for distributed processing, surrounded by an ecosystem of tools such as Pig, Hive, Spark, HBase, Zookeeper, Oozie, Sqoop, and Flume.
APACHE PIG
Apache Pig is a tool developed by Yahoo. It has two main parts: Pig Latin, the scripting language, and the Pig runtime, the execution environment.
Pig Latin is convenient for those who do not want to write lengthy MapReduce programs; one line of Pig Latin is roughly equivalent to about 100 lines of MapReduce code.
Pig first loads the data and then applies operations such as grouping, filtering, joining, and sorting.
Internally, Pig's runtime engine converts Pig Latin queries into MapReduce jobs.
Pig can process any kind of data, whether structured, semi-structured, or unstructured.
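A minimal sketch of this load-filter-group flow is given below. It assumes a local Pig installation on the PATH and a hypothetical comma-separated input file access_log.csv; the field names and output directory are placeholders for illustration only. The Pig Latin script is written from Python and run in local mode via the pig command line.

import subprocess

# Hypothetical Pig Latin script: load, filter, group, and count records.
pig_script = """
logs    = LOAD 'access_log.csv' USING PigStorage(',') AS (user:chararray, url:chararray, status:int);
ok      = FILTER logs BY status == 200;
by_user = GROUP ok BY user;
counts  = FOREACH by_user GENERATE group AS user, COUNT(ok) AS hits;
STORE counts INTO 'hits_per_user';
"""

with open("hits.pig", "w") as f:
    f.write(pig_script)

# Run Pig in local mode (-x local); on a cluster, Pig's runtime engine would
# translate the same script into MapReduce jobs.
subprocess.run(["pig", "-x", "local", "hits.pig"], check=True)

In local mode this runs against the local file system; against a cluster the same script reads from and writes to HDFS.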
APACHE HIVE
Apache Hive is a data warehousing tool built on top of Hadoop. It lets users query data stored in HDFS with HiveQL, a SQL-like language, which Hive translates into MapReduce jobs for execution.
APACHE SPARK
Apache Spark is a framework for real-time data analytics in a distributed computing environment.
It executes in-memory computations to increase the speed of data processing over MapReduce, and for such workloads it can be up to 100 times faster than the MapReduce system.
Apache Spark is one of the leading tools in the Hadoop ecosystem, performing real-time analytics on huge data sets.
It is a platform that handles processing-intensive tasks such as batch processing, interactive or iterative real-time processing, graph processing, and visualization.
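As an illustrative sketch of the in-memory behaviour described above (an assumption-laden example, not taken from the slides), the PySpark snippet below assumes the pyspark package with a local Spark installation; it caches an RDD in memory and reuses it for two computations.

from pyspark.sql import SparkSession

# Start a local Spark session; on a real cluster the master URL would differ.
spark = SparkSession.builder.appName("InMemoryDemo").master("local[*]").getOrCreate()
sc = spark.sparkContext

# A small data set built in memory; in practice it would be read from HDFS,
# e.g. sc.textFile("hdfs:///data/input.txt").
numbers = sc.parallelize(range(1, 1000001))

# cache() keeps the transformed RDD in memory, so the second action reuses it
# instead of recomputing it from scratch.
squares = numbers.map(lambda x: x * x).cache()

print("count:", squares.count())  # first action materialises and caches the RDD
print("sum:", squares.sum())      # second action is served from memory

spark.stop()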
HBASE
Apache HBase is a distributed, column-oriented NoSQL database that runs on top of HDFS and provides real-time, random read/write access to large data sets.
APACHE ZOOKEEPER
There was a huge issue with coordination and synchronization among the resources and components of Hadoop, which often resulted in inconsistency.
Zookeeper overcame these problems by providing synchronization, inter-component communication, grouping, and maintenance.
APACHE OOZIE
Oozie performs the task of a scheduler: it schedules jobs and binds them together as a single logical unit of work.
There are two kinds of jobs, i.e., Oozie workflow jobs and Oozie coordinator jobs.
Oozie workflow jobs need to be executed in a sequentially ordered manner, whereas Oozie coordinator jobs are triggered when some data or an external stimulus becomes available.
Moving data in and out of Hadoop
Moving data in and out of Hadoop, known as data ingress and egress, is the process by which data is transported from an external system into an internal system, and vice versa. Hadoop
supports ingress and egress at a low level in HDFS and MapReduce.
Files can be moved in and out of HDFS, and data can be pulled from external data sources
and pushed to external data sinks using MapReduce.
Hadoop data ingress and egress transport data between an external system and an internal one.
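As a minimal sketch of file-level ingress and egress, the snippet below simply shells out to the standard hadoop fs commands; the HDFS directory /data/raw and the local file names are hypothetical.

import subprocess

def hdfs(*args):
    # Run a hadoop fs sub-command and raise if it fails.
    subprocess.run(["hadoop", "fs", *args], check=True)

# Ingress: copy a local file into HDFS.
hdfs("-mkdir", "-p", "/data/raw")
hdfs("-put", "sales.csv", "/data/raw/sales.csv")

# Egress: copy a file back out of HDFS to the local file system.
hdfs("-get", "/data/raw/sales.csv", "sales_copy.csv")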
Methods and tools for Data Ingestion
In Hadoop, data ingestion (ingress) and data egress (egress) processes are critical for
efficiently moving data in and out of the Hadoop ecosystem.
Data Ingestion (Ingress)
Data ingestion in Hadoop refers to the process of importing data from various sources
into the Hadoop Distributed File System (HDFS) or other Hadoop ecosystem
components. This process is crucial for ensuring that the data required for analysis is
available in the Hadoop cluster.
There are several methods and tools for data ingestion in Hadoop:
Batch Ingestion:
HDFS Command Line: Using HDFS shell commands like hadoop fs -put or hadoop fs -copyFromLocal to load data into HDFS.
Apache Sqoop: A tool designed for efficiently transferring bulk data between Hadoop and structured data stores such as relational databases (see the sketch after this list).
Flume: Used for collecting, aggregating, and moving large amounts of log data to
HDFS.
Apache NiFi: A data integration tool that supports a wide range of data sources and can ingest data into HDFS.
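As a hedged sketch of batch ingestion with Sqoop (referenced from the Sqoop item above), the snippet below shells out to the sqoop import command; the JDBC connection string, credentials, table name, and target directory are all hypothetical placeholders.

import subprocess

# Hypothetical connection details; replace with a real host, database, and credentials.
sqoop_import = [
    "sqoop", "import",
    "--connect", "jdbc:mysql://db.example.com/sales",
    "--username", "etl_user",
    "--password-file", "/user/etl/.db_password",  # keeps the secret off the command line
    "--table", "orders",
    "--target-dir", "/data/raw/orders",
    "--num-mappers", "4",                          # parallel map tasks used for the import
]

subprocess.run(sqoop_import, check=True)

Sqoop can also run in the opposite direction (sqoop export) to push HDFS data back into a relational database, which covers the egress side.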
Real-time Ingestion:
Apache Kafka: A distributed streaming platform that can be used to ingest real-time data streams into Hadoop (a small producer sketch follows this list).
Apache Flume: Can also be configured for real-time data ingestion.
Apache Storm: A real-time computation system that can process streams of data and
feed the results into Hadoop.
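As a minimal sketch of real-time ingestion via Kafka (referenced from the Kafka item above), assuming the third-party kafka-python client, a broker at localhost:9092, and a hypothetical topic named clickstream:

import json
from kafka import KafkaProducer  # assumes the kafka-python package is installed

# Connect to an assumed local broker and serialise each event dict as JSON bytes.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Send a few hypothetical click events to the 'clickstream' topic.
for i in range(3):
    producer.send("clickstream", {"user": "user-%d" % i, "page": "/home"})

producer.flush()  # block until all buffered records have been delivered
producer.close()

From Kafka, the stream would typically be landed in HDFS by a consumer such as a Flume agent, Kafka Connect, or a Spark Streaming job.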
File Formats:
Data can be ingested in various formats such as plain text, CSV, JSON, Avro,
Parquet, ORC, etc.
Data Serialization
Data serialization in big data refers to the process of converting complex data structures, such as objects or records, into a format that can be easily stored, transmitted, and reconstructed later.
It is essential for efficiently managing and processing large volumes of data in big data environments. Its key benefits include:
Efficient Storage: Serialized data can be stored in a compact and efficient manner,
reducing the storage footprint and improving access speeds.
Data Transmission: Serialization enables data to be easily transmitted over a
network between different systems or components in a big data architecture.
Interoperability: It allows data to be shared between different programming
languages and platforms, facilitating integration and collaboration.
Schema Evolution: Some serialization formats, such as Avro, support schema evolution, allowing changes to the data structure without breaking existing applications (see the Avro sketch below).
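As a minimal sketch of serialization with Apache Avro (one of the formats named above), assuming the avro Python package and a hypothetical two-field User schema:

import json
import avro.schema
from avro.datafile import DataFileReader, DataFileWriter
from avro.io import DatumReader, DatumWriter

# Hypothetical schema for illustration; real schemas usually live in .avsc files.
schema = avro.schema.parse(json.dumps({
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "int"},
    ],
}))

# Serialize: write records into a compact, self-describing Avro container file.
writer = DataFileWriter(open("users.avro", "wb"), DatumWriter(), schema)
writer.append({"name": "Alice", "age": 30})
writer.append({"name": "Bob", "age": 25})
writer.close()

# Deserialize: the schema embedded in the file lets any reader rebuild the records.
reader = DataFileReader(open("users.avro", "rb"), DatumReader())
for user in reader:
    print(user)
reader.close()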
Applications
Persisting data to files
Storing data into Databases
Transferring data through the network
Remote Method Invocation
Sharing data in a Distributed Object Model
Thank You