MODULE-2
INTRODUCTION TO HADOOP
Text book-1
• Hadoop is an open-source software framework
for storing data and running applications on
clusters of commodity hardware.
• Apache Hadoop is a collection of open-source
software utilities that facilitates using a network
of many computers to solve problems involving
massive amounts of data and computation.
• It provides massive storage for any kind of data,
enormous processing power and the ability to
handle virtually limitless concurrent tasks or jobs.
• Hadoop Distributed File System (HDFS) means a system of storing files (sets of data records, key-value pairs, hash key-value pairs or application data) at distributed computing nodes according to the Hadoop architecture, with data blocks accessed after finding references to their racks and cluster.
• Ecosystem refers to a system made up of
multiple computing components, which work
together.
• That is similar to a biological ecosystem, a
complex system of living organisms, their
physical environment and all their inter-
relationships in a particular unit of space.
HADOOP AND ITS ECOSYSTEM
• Apache initiated the project for developing
storage and processing framework for Big Data
storage and processing.
• Doug Cutting and Michael J. Cafarella, the creators, named the framework Hadoop. Cutting's son was fascinated by a stuffed toy elephant named Hadoop, and this is how the name Hadoop was derived.
• The project consisted of two components: one for storing data in blocks across the clusters, and the other for running computations at each individual cluster in parallel with the others.
• Hadoop components are written in Java with part
of native code in C. The command line utilities
are written in shell scripts.
• Hadoop is a computing environment in which input data is stored and processed, and the results are then stored.
• The environment consists of clusters, which are distributed over the cloud or a set of servers.
• Each cluster consists of a string of data files constituting data blocks. Just as the toy Hadoop was a stuffed elephant, the Hadoop system cluster "stuffs" files into data blocks.
• The hardware scales up from a single server to
thousands of machines that store the clusters.
Each cluster stores a large number of data blocks
in racks.
• The default data block size is 64 MB. IBM BigInsights, built on Hadoop, deploys a default block size of 128 MB.
• The Hadoop framework provides a distributed, flexible, scalable and fault-tolerant computing system with high computing power.
• Hadoop system is an efficient platform for the
distributed storage and processing of a large
amount of data.
• Hadoop enables Big Data storage and cluster
computing.
• The Hadoop system manages both, large-sized
structured and unstructured data in different
formats, such as XML, JSON and text with
efficiency and effectiveness.
• The Hadoop system performs better with
clusters of many servers when the focus is on
horizontal scalability.
• The system provides faster results from Big
Data and from unstructured data as well.
• Yahoo has more than 100000 CPUs in over
40000 servers running Hadoop, with its biggest
Hadoop cluster running 4500 nodes as of March
2017, according to the Apache Hadoop website.
• Facebook has two major clusters: an 1100-machine cluster with 8800 cores and about 12 PB of raw storage, and a 300-machine cluster with 2400 cores and about 3 PB of raw storage (1 PB = 10^15 B, nearly 2^50 B).
• Each (commodity) node has 8 cores and 12 TB of storage (1 TB = 10^12 B, nearly 2^40 B = 1024 GB).
Hadoop Core Components
• Figure 2.1 shows the core components of the Apache Software Foundation's Hadoop framework.
• The Hadoop core components of the framework are:
1.Hadoop Common-The common module contains
the libraries and utilities that are required by the
other modules of Hadoop. For example, Hadoop
common provides various components and
interfaces for distributed file system and general
input/ output. This includes serialization, Java
RPC (Remote Procedure Call) and file-based data
structures.
2. Hadoop Distributed File System (HDFS) - A
Java-based distributed file system which can
store all kinds of data on the disks at the
clusters.
3. MapReduce v1 - Software programming model
in Hadoop 1 using Mapper and Reducer. The v1
processes large sets of data in parallel and in
batches.
4. YARN - Software for managing computing resources. The user application tasks or sub-tasks run in parallel on Hadoop; YARN performs scheduling and handles the requests for resources in the distributed running of the tasks.
5. MapReduce v2 - Hadoop 2 YARN-based system for parallel processing of large datasets and distributed processing of the application tasks (a minimal WordCount sketch in Java follows this list).
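As a concrete illustration of the Mapper/Reducer model named in items 3 and 5, the following is a minimal, hedged WordCount sketch using the Hadoop MapReduce Java API. The input and output paths passed on the command line are illustrative assumptions.

// Minimal WordCount sketch: mapper emits (word, 1), reducer sums the counts.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every token in an input line.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sums the counts for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory (illustrative)
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory (illustrative)
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The map tasks run in parallel on the nodes holding the input blocks, and the reduce tasks aggregate the intermediate (word, count) pairs in batch.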
Spark
• Spark is an open-source cluster-computing
framework of Apache Software Foundation.
Hadoop processes data stored on disks, whereas Spark provides for in-memory analytics.
• Therefore, Spark also enables OLAP and real-time processing, and it processes Big Data faster.
• Spark has been adopted by large organizations,
such as Amazon, eBay and Yahoo. Several
organizations run Spark on clusters with thousands
of nodes.
• Spark is becoming increasingly popular.
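To make the in-memory point above concrete, here is a minimal, hedged sketch using the Spark Java API; the application name and HDFS file path are illustrative assumptions, and the master URL is assumed to be supplied by spark-submit. The cached dataset is reused by two actions without re-reading the file from disk.

// Minimal sketch of Spark in-memory processing (paths and names are illustrative).
import org.apache.spark.api.java.function.FilterFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.SparkSession;

public class SparkInMemoryDemo {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("in-memory-demo")   // illustrative application name
        .getOrCreate();              // master URL is provided by spark-submit

    // Read a text file (e.g. from HDFS) and cache it in memory,
    // so the second action does not re-read it from disk.
    Dataset<String> lines = spark.read().textFile("hdfs:///data/logs.txt").cache();

    long total = lines.count();
    long errors = lines.filter((FilterFunction<String>) l -> l.contains("ERROR")).count();

    System.out.println("total lines = " + total + ", error lines = " + errors);
    spark.stop();
  }
}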
Features of Hadoop
1. Fault-efficient, scalable, flexible and modular design which uses a simple and modular programming model.
• The system provides high scalability; it is scaled by adding new nodes to handle larger data.
• Hadoop proves very helpful in storing, managing, processing and analyzing Big Data. Modular functions make the system flexible. One can add or replace components with ease. Modularity allows replacing a component with a different software tool.
2. Robust design of HDFS: Execution of Big Data applications continues even when an individual server or cluster fails. This is because Hadoop provides for backup (each data block is replicated at least three times) and a data-recovery mechanism.
• HDFS thus has high reliability.
Text Book-2
HDFS Design Features
Important aspects of HDFS include:
• Write-once/read-many design is intended to
facilitate streaming reads.
• Files may be appended, but random seeks are not
permitted. There is no caching of data.
• Data storage and processing happen on the same
server nodes.
• “Moving computation is cheaper than moving data”.
• A reliable file system maintains multiple copies of data across the cluster. Failure of a single node, or even a rack in a large cluster, will not bring down the file system.
• A specialized file system is used, which is not designed for general use.
HDFS Components
• There are two types of nodes: a NameNode and multiple DataNodes.
• The NameNode manages all the metadata (data about data / a description of all data) needed to store and retrieve the actual data from the DataNodes.
• No data is actually stored on the NameNode.
• The design is Master/slave architecture in which the
master (NameNode) manages the file system
namespace and regulates access to files by clients.
• Filesystem namespace operations including opening,
closing and renaming files and directories are
managed by the NameNode.
• NameNode also determines mapping of blocks to
DataNodes and handles DataNode failures.
• The slaves (DataNodes) are responsible for serving read
and write requests from the file system to the clients.
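The following is a minimal, hedged sketch of a client using the HDFS Java FileSystem API: the NameNode resolves the file metadata while the DataNodes serve the actual block writes and reads. The NameNode URI and file path are illustrative assumptions.

// Minimal sketch of writing and reading an HDFS file via the Java FileSystem API.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWriteDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Illustrative NameNode URI; a real cluster sets fs.defaultFS in core-site.xml.
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);

    Path file = new Path("/user/demo/sample.txt");  // illustrative path

    // Write once: the NameNode records the metadata, the DataNodes store the blocks.
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
    }

    // Read many: the blocks are streamed back from the DataNodes.
    try (FSDataInputStream in = fs.open(file);
         BufferedReader reader =
             new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
      System.out.println(reader.readLine());
    }

    fs.close();
  }
}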
Secondary NameNode
• It performs periodic checkpoints that evaluate the status of the NameNode.
• It periodically downloads the fsimage and edits files, merges them into a new fsimage, and uploads the new fsimage back to the NameNode.
Summary of various roles in HDFS
• HDFS uses a master/slave model designed for large-file reading/streaming.
Text Book-2
Hadoop & Hadoop Ecosystem
• Hadoop is an open-source software framework for storing data
and running applications on clusters of commodity hardware. It
provides massive storage for any kind of data, enormous
processing power and the ability to handle virtually limitless
concurrent tasks or jobs.
Channel
• A channel is a transient store in a Flume agent which receives events from the source and buffers them till they are consumed by sinks. It acts as a bridge between the sources and the sinks.
• These channels are fully transactional and they can work with any number of sources and sinks.
• Example − JDBC channel, File system channel, Memory channel, etc.
• Sink
• A sink stores the data into centralized stores
like HBase and HDFS. It consumes the data
(events) from the channels and delivers it to
the destination. The destination of the sink
might be another agent or the central stores.
• Example − HDFS sink
Setting multi-agent flow
• In order to make data flow across multiple agents or hops, the sink of the previous agent and the source of the current hop need to be of Avro type, with the sink pointing to the hostname (or IP address) and port of the source.
• Within Flume, there can be multiple agents
and before reaching the final destination, an
event may travel through more than one
agent. This is known as multi-hop flow.
Flume Consolidation
• A very common scenario in log collection is a large number of log-producing clients sending data to a few consumer agents that are attached to the storage subsystem. For example, logs collected from hundreds of web servers are sent to a dozen agents that write to the HDFS cluster.
• This can be achieved in Flume by configuring a number of first-tier agents with an Avro sink, all pointing to an Avro source of a single agent (again, the Thrift sources/sinks/clients could be used in such a scenario). This source on the second-tier agent consolidates the received events into a single channel, which is consumed by a sink to its final destination.
Consolidation
Apache Hive
• The Apache Hive data warehouse software facilitates
reading, writing, and managing large datasets
residing in distributed storage using SQL. Structure
can be projected onto data already in storage. A
command line tool and JDBC driver are provided to
connect users to Hive.
HDFS or HBase - The Hadoop Distributed File System or HBase are the data storage techniques used to store the data into the file system.
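Because a JDBC driver is provided, a query can be submitted to Hive from Java much like to any other SQL database. The following is a minimal, hedged sketch; the HiveServer2 URL, user name, table name and query are illustrative assumptions.

// Minimal sketch of querying Hive through its JDBC driver (HiveServer2).
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcDemo {
  public static void main(String[] args) throws Exception {
    // Standard Hive JDBC driver class; with JDBC 4 this is usually auto-loaded.
    Class.forName("org.apache.hive.jdbc.HiveDriver");

    // Illustrative connection URL; a real deployment supplies its own host/port/database.
    String url = "jdbc:hive2://hive-server:10000/default";

    try (Connection conn = DriverManager.getConnection(url, "hiveuser", "");
         Statement stmt = conn.createStatement()) {

      // Project structure onto data already in HDFS, then query it with SQL.
      stmt.execute("CREATE TABLE IF NOT EXISTS web_logs (ip STRING, url STRING) "
          + "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','");

      try (ResultSet rs = stmt.executeQuery(
          "SELECT url, COUNT(*) AS hits FROM web_logs GROUP BY url")) {
        while (rs.next()) {
          System.out.println(rs.getString(1) + " -> " + rs.getLong(2));
        }
      }
    }
  }
}

Internally, such a query passes through the driver, compiler, Metastore and execution engine as described in the steps below.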
Working of Hive
Step 1 - Execute Query: The Hive interface, such as the Command Line or Web UI, sends the query to the Driver (any database driver such as JDBC, ODBC, etc.) to execute.
Step 2 - Get Plan: The driver takes the help of the query compiler, which parses the query to check the syntax and the query plan or the requirement of the query.
Step 3 - Get Metadata: The compiler sends a metadata request to the Metastore (any database).
Step 4 - Send Metadata: The Metastore sends the metadata as a response to the compiler.
Step 5 - Send Plan: The compiler checks the requirement and resends the plan to the driver. Up to here, the parsing and compiling of the query is complete.
Step 6 - Execute Plan: The driver sends the execute plan to the execution engine.
Step 7 - Execute Job: Internally, the execution job is a MapReduce job. The execution engine sends the job to the JobTracker, which is in the Name node, and it assigns this job to the TaskTracker, which is in the Data node. Here, the query executes the MapReduce job.
Step 7.1 - Metadata Ops: Meanwhile, during execution, the execution engine can perform metadata operations with the Metastore.
Step 8 - Fetch Result: The execution engine receives the results from the Data nodes.
Step 9 - Send Results: The execution engine sends those resultant values to the driver.
Step 10 - Send Results: The driver sends the results to the Hive Interfaces.
Hive - Data Types
All the data types in Hive are classified into four
types, given as follows:
• Column Types
• Literals
• Null Values
• Complex Types
Apache Oozie
• Apache Oozie is a workflow scheduler for
Hadoop.
• It is a system which runs the workflow of
dependent jobs.
• Here, users are permitted to create Directed
Acyclic Graphs of workflows, which can be run
in parallel and sequentially in Hadoop.
Apache Oozie
It consists of three parts:
• Workflow engine: Responsibility of a workflow
engine is to store and run workflows composed
of Hadoop jobs e.g., MapReduce, Pig, Hive.
• Coordinator engine: It runs workflow jobs based
on predefined schedules and availability of data.
• Bundle: Higher level abstraction that will batch a
set of coordinator jobs
Apache Oozie
• Oozie is scalable and can manage the timely
execution of thousands of workflows (each
consisting of dozens of jobs) in a Hadoop cluster.
• Oozie is also very flexible. One can easily start, stop, suspend and rerun jobs. Oozie makes it very easy to rerun failed workflows. One can easily understand how difficult it can be to catch up on missed or failed jobs due to downtime or failure. It is even possible to skip a specific failed node.
How does OOZIE work?
• Oozie runs as a service in the cluster and clients submit workflow
definitions for immediate or later processing.
• Oozie workflow consists of action nodes and control-flow nodes.
• A control-flow node controls the workflow execution between actions by
allowing constructs like conditional logic wherein different branches may
be followed depending on the result of earlier action node.
• Start Node, End Node, and Error Node fall under this category of nodes.
• Start Node, designates the start of the workflow job.
• End Node, signals end of the job.
• Error Node designates the occurrence of an error and corresponding error
message to be printed.
• An action node represents a workflow task, e.g., moving files into HDFS, running a MapReduce, Pig or Hive job, importing data using Sqoop, or running a shell script of a program written in Java (see the sketch after this list).
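As a sketch of how a client might submit a workflow definition to the Oozie service, the Oozie Java client API can be used roughly as below. The Oozie server URL, HDFS application path and the "queueName" parameter are illustrative assumptions; only the standard OozieClient calls are relied upon.

// Minimal, hedged sketch of submitting a workflow with the Oozie Java client.
import java.util.Properties;

import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

public class OozieSubmitDemo {
  public static void main(String[] args) throws Exception {
    // Illustrative Oozie server URL.
    OozieClient client = new OozieClient("http://oozie-host:11000/oozie");

    // Job properties: point Oozie at a workflow.xml already stored in HDFS.
    Properties conf = client.createConfiguration();
    conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode:9000/user/demo/workflows/app");
    conf.setProperty("queueName", "default");  // illustrative workflow parameter

    // Submit and start the workflow; Oozie returns a job id for tracking.
    String jobId = client.run(conf);
    System.out.println("Submitted workflow: " + jobId);

    // Poll the workflow status until it leaves the RUNNING state.
    while (client.getJobInfo(jobId).getStatus() == WorkflowJob.Status.RUNNING) {
      Thread.sleep(10_000);
    }
    System.out.println("Final status: " + client.getJobInfo(jobId).getStatus());
  }
}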
Example Workflow Diagram
HBase
● HBase is a data model that is similar to Google's Bigtable, designed to provide quick random access to huge amounts of structured data.
● Since the 1970s, RDBMS has been the solution for data storage and maintenance related problems. After the advent of big data, companies realized the benefit of processing big data and started opting for solutions like Hadoop.
Limitations of Hadoop
● Hadoop can perform only batch processing,
and data will be accessed only in a sequential
manner. That means one has to search the
entire dataset even for the simplest of jobs.
● A huge dataset when processed results in
another huge data set, which should also be
processed sequentially. At this point, a new
solution is needed to access any point of data
in a single unit of time (random access).
What is HBase?
● HBase is a distributed column-oriented
database built on top of the Hadoop file
system. It is an open-source project and is
horizontally scalable.
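The random access described above can be illustrated with a minimal, hedged sketch of the HBase Java client API; the table name, column family, qualifier and row key are illustrative assumptions, and the table is assumed to already exist.

// Minimal sketch of random read/write access using the HBase Java client API.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseRandomAccessDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml from the classpath

    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table table = connection.getTable(TableName.valueOf("web_logs"))) {

      // Write a single cell: row key -> column family "cf", qualifier "url".
      Put put = new Put(Bytes.toBytes("row-2017-03-01-0001"));
      put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("url"), Bytes.toBytes("/index.html"));
      table.put(put);

      // Random read of that single row, without scanning the whole dataset.
      Result result = table.get(new Get(Bytes.toBytes("row-2017-03-01-0001")));
      byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("url"));
      System.out.println("url = " + Bytes.toString(value));
    }
  }
}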