BDA 18CS72 Mod-2


BIG DATA AND ANALYTICS

MODULE-2

INTRODUCTION TO HADOOP

Text book-1
• Hadoop is an open-source software framework
for storing data and running applications on
clusters of commodity hardware.
• Apache Hadoop is a collection of open-source
software utilities that facilitates using a network
of many computers to solve problems involving
massive amounts of data and computation.
• It provides massive storage for any kind of data,
enormous processing power and the ability to
handle virtually limitless concurrent tasks or jobs.
• Hadoop Distributed File System (HDFS) means a
system for storing files (sets of data records,
key-value pairs, hash key-value pairs or application
data) at distributed computing nodes according
to the Hadoop architecture, with data blocks
accessible after finding the reference to their
racks and cluster.
• Ecosystem refers to a system made up of
multiple computing components, which work
together.
• This is similar to a biological ecosystem: a
complex system of living organisms, their
physical environment and all their inter-
relationships in a particular unit of space.
HADOOP AND ITS ECOSYSTEM
• Apache initiated the project to develop a
framework for Big Data storage and processing.
• The creators, Doug Cutting and Michael J. Cafarella,
named the framework Hadoop. Cutting's son was
fascinated by a stuffed toy elephant named Hadoop,
and this is how the name Hadoop was derived.
• The project consisted of two components: one
stores data in blocks across the clusters, and the
other performs the computations at each individual
cluster in parallel with the others.
• Hadoop components are written in Java with part
of native code in C. The command line utilities
are written in shell scripts.
• Hadoop is a computing environment in which
input data is stored, processed, and the results
are then stored.
• The environment consists of clusters, which are
distributed at the cloud or over a set of servers.
• Each cluster consists of a string of data files
constituting data blocks.
• Just as the toy named Hadoop was a stuffed
elephant, the Hadoop system cluster stuffs files
into data blocks.
• The hardware scales up from a single server to
thousands of machines that store the clusters.
Each cluster stores a large number of data blocks
in racks.
• Default data block size is 64 MB. IBM BigInsights,
built on Hadoop, deploys a default 128 MB block
size.
• Hadoop framework provides the computing
features of a system of distributed, flexible,
scalable, fault tolerant computing with high
computing power.
• Hadoop system is an efficient platform for the
distributed storage and processing of a large
amount of data.
• Hadoop enables Big Data storage and cluster
computing.
• The Hadoop system manages both, large-sized
structured and unstructured data in different
formats, such as XML, JSON and text with
efficiency and effectiveness.
• The Hadoop system performs better with
clusters of many servers when the focus is on
horizontal scalability.
• The system provides faster results from Big
Data and from unstructured data as well.
• Yahoo has more than 100,000 CPUs in over
40,000 servers running Hadoop, with its biggest
Hadoop cluster running 4500 nodes as of March
2017, according to the Apache Hadoop website.
• Facebook has two major clusters: one cluster of
1100 machines with 8800 cores and about 12 PB
of raw storage, and a 300-machine cluster with
2400 cores and about 3 PB (1 PB = 10^15 B,
nearly 2^50 B) of raw storage.
• Each (commodity) node has 8 cores and 12 TB
(1 TB = 10^12 B, nearly 2^40 B = 1024 GB) of storage.
Hadoop Core Components
• Figure 2.1 shows the core components of the
Apache Software Foundation's Hadoop
framework
• The Hadoop core components of the framework
are:
1.Hadoop Common-The common module contains
the libraries and utilities that are required by the
other modules of Hadoop. For example, Hadoop
common provides various components and
interfaces for distributed file system and general
input/ output. This includes serialization, Java
RPC (Remote Procedure Call) and file-based data
structures.
2. Hadoop Distributed File System (HDFS) - A
Java-based distributed file system which can
store all kinds of data on the disks at the
clusters.
3. MapReduce v1 - Software programming model
in Hadoop 1 using Mapper and Reducer. The v1
processes large sets of data in parallel and in
batches.
4. YARN - Software for managing the computing
resources. The user application tasks or sub-
tasks run in parallel at Hadoop; YARN uses
scheduling and handles the requests for the
resources in the distributed running of the tasks.
5. MapReduce v2 - Hadoop 2 YARN-based
system for parallel processing of large datasets
and distributed processing of the application
tasks.
Spark
• Spark is an open-source cluster-computing
framework of Apache Software Foundation.
Hadoop deploys data at the disks. Spark provisions
for in-memory analytics.
• Therefore, it also enables OLAP and real-time
processing. Spark does faster processing of Big
Data.
• Spark has been adopted by large organizations,
such as Amazon, eBay and Yahoo. Several
organizations run Spark on clusters with thousands
of nodes.
• Spark is now increasingly becoming popular.
Features of Hadoop
1. Fault-efficient, scalable, flexible and modular
design which uses a simple and modular
programming model.
• The system provides servers at high scalability.
The system is scalable by adding new nodes to
handle larger data.
• Hadoop proves very helpful in storing,
managing, processing and analyzing Big Data.
Modular functions make the system flexible.
One can add or replace components with ease.
Modularity allows replacing a component with
a different software tool.
2. Robust design of HDFS: Execution of Big Data
applications continues even when an individual
server or cluster fails. This is because of
Hadoop's provisions for backup (due to
replication of each data block at least three
times) and a data-recovery mechanism.
• HDFS thus has high reliability.

3. Store and process Big Data: Processes Big Data
of 3V characteristics.
4. Distributed clusters computing model with
data locality: Processes Big Data at high
speed as the application tasks and sub-tasks
submit to the Data Nodes.
• One can achieve more computing power by
increasing the number of computing nodes.
The processing splits across multiple DataNodes
(servers), and thus gives fast processing and
aggregated results.
5. Hardware fault-tolerant: A fault does not affect
data and application processing. If a node goes
down, the other nodes take care of the residue.
This is due to multiple copies of all data blocks
which replicate automatically. Default is three
copies of data blocks.
6. Open-source framework: Open source access and
cloud services enable large data store. Hadoop
uses a cluster of multiple inexpensive servers or
the cloud.
7. Java and Linux based: Hadoop uses Java
interfaces. The Hadoop base is Linux, but it has
its own set of shell command support.
Hadoop Ecosystem Components
• Hadoop ecosystem refers to a combination of
technologies.
• Hadoop ecosystem consists of own family of
applications which tie up together with the
Hadoop.
• The system components support the storage,
processing, access, analysis, governance,
security and operations for Big Data.
• The system enables the applications which run
Big Data and deploy HDFS.
• The data store system consists of clusters,
racks, DataNodes and blocks.
• Hadoop deploys application programming
model, such as MapReduce and HBase.
• YARN manages resources and schedules sub-
tasks of the application.
• HBase uses columnar databases and does
OLAP.
• Figure 2.2 shows Hadoop core components
HDFS, MapReduce and YARN along with the
ecosystem.
• Figure 2.2 also shows Hadoop ecosystem. The
system includes the application support layer
and application layer components- AVRO,
ZooKeeper, Pig, Hive, Sqoop, Ambari, Chukwa,
Mahout, Spark, Flink and Flume.
• The figure also shows the components and
their usages
• The four layers in Figure 2.2 are as follows:
(i) Distributed storage layer
(ii)Resource-manager layer for job or application
sub-tasks scheduling and execution
(iii)Processing-framework layer, consisting of
Mapper and Reducer for the MapReduce process-
flow
(iv) APIs at the application support layer (applications
such as Hive and Pig). The codes communicate
and run using MapReduce or YARN at the processing-
framework layer. Reducer output communicates to
the APIs.
• AVRO enables data serialization between the
layers. Zookeeper enables coordination among
layer components
• The holistic view of Hadoop architecture
provides an idea of implementation of
Hadoop components of the ecosystem. Client
hosts run applications using Hadoop
ecosystem projects, such as Pig, Hive and
Mahout.
• Most commonly, Hadoop uses Java
programming. Such Hadoop programs run on
any platform with the Java virtual-machine
deployment model.
• HDFS is a Java-based distributed file system
that can store various kinds of data on the
computers .
HADOOP DISTRIBUTED FILE SYSTEM
• Big Data analytics applications are software
applications that leverage large-scale data.
• The applications analyze Big Data using
massive parallel processing frameworks.
• HDFS is a core component of Hadoop.
• HDFS is designed to run on a cluster of
computers and servers at cloud-based utility
services
• HDFS stores Big Data which may range from
GBs (1 GB = 2^30 B) to PBs (1 PB = 10^15 B,
nearly 2^50 B).
• HDFS stores the data in a distributed manner
in order to compute fast.
• The distributed data store in HDFS stores data
in any format regardless of schema.
• HDFS provides high throughput access to data-
centric applications that require large-scale
data processing workloads.
HDFS Data Storage
• Hadoop data store concept implies storing the
data at a number of clusters. Each cluster has
a number of data stores, called racks.
• Each rack stores a number of DataNodes. Each
DataNode has a large number of data blocks.
The racks distribute across a cluster.
• The nodes have processing and storage
capabilities. The nodes have the data in data
blocks to run the application tasks.
• The data blocks are replicated, by default, at
least on three DataNodes in the same or remote nodes.
• Data at the stores enable running the
distributed applications including analytics,
data mining, OLAP using the clusters.
• A file containing the data is divided into data
blocks. The default data block size is 64 MB
(the HDFS division-of-files concept is similar to
the Linux or virtual-memory page in Intel x86 and
Pentium processors, where the block size is
fixed at 4 KB).
• Hadoop HDFS features are as follows:
(i) Create, append, delete, rename and attribute
modification functions
(ii) Content of individual file cannot be modified
or replaced but appended with new data at the
end of the file
(iii) Write once but read many times during
usages and processing
(iv) Average file size can be more than 500 MB.
The following is an example of how files are stored
at a Hadoop cluster.
Hadoop Physical Organization
• The conventional file system uses directories.
A directory consists of folders, and a folder
consists of files.
• When data is processed, the data sources are
identified by pointers to the resources. A data
dictionary stores the resource pointers.
Master tables of the dictionary are stored at a
central location.
• The centrally stored tables make
administration easier when the data sources
change during processing.
• Similarly, the files, DataNodes and blocks need
identification during processing at HDFS.
• HDFS uses NameNodes and DataNodes.
• A NameNode stores the file's metadata. Metadata
gives information about the file of the user
application, but does not participate in the
computations.
• The DataNode stores the actual data files in
the data blocks.
• Few nodes in a Hadoop cluster act as
NameNodes. These nodes are termed as
MasterNodes or simply masters.
• The masters have a different configuration
supporting high DRAM and processing power. The
masters have much less local storage.
• Majority of the nodes in Hadoop cluster act as
DataNodes and Task Trackers. These nodes are
referred to as slave nodes or slaves.
• The slaves have lots of disk storage and moderate
amounts of processing capabilities and DRAM.
• Slaves are responsible to store the data and
process the computation tasks submitted by the
clients
• Figure 2.4 shows the client, master
NameNode, primary and secondary
MasterNodes and slave nodes in the Hadoop
physical architecture.
• Clients as the users run the application with the
help of Hadoop ecosystem projects. For example,
Hive, Mahout and Pig are the ecosystem's
projects. They are not required to be present at
the Hadoop cluster.
• A single MasterNode provides HDFS, MapReduce
and HBase using threads in small to medium sized
clusters. When the cluster size is large, multiple
servers are used, for example to balance the load.
• The secondary NameNode provides NameNode
management services, and ZooKeeper is used by
HBase for metadata storage.
• The MasterNode fundamentally plays the role
of a coordinator. The MasterNode receives
client connections, maintains the description
of the global file system namespace, and the
allocation of file blocks.
• It also monitors the state of the system in
order to detect any failure.
• The masters consist of three components:
NameNode, Secondary NameNode and
JobTracker.
• The NameNode stores all the file system
related information, such as:
 which part of the cluster each file section is
stored in
 last access time for the files
 user permissions, i.e., which user has access to
the file.
• Secondary NameNode is an alternate for
NameNode. Secondary node keeps a copy of
NameNode meta data.
• Thus, stored meta data can be rebuilt easily, in
case of NameNode failure. The JobTracker
coordinates the parallel processing of data.
• Masters, slaves and the Hadoop client (node)
load the data into the cluster, submit the
processing job and then retrieve the data to
see the response after the job completes.
HDFS Commands
• Figure 2.1 showed Hadoop common module,
which contains the libraries and utilities. They
are common to other modules of Hadoop.
• The HDFS shell is not compliant with POSIX.
Thus, the shell cannot be interacted with in the
same way as a Unix or Linux shell. Commands for
interacting with the files in HDFS require
/bin/hdfs dfs <args>, where <args> stands for the
command arguments.
• The full set of Hadoop shell commands can be
found at the Apache Software Foundation
website.
• -copyToLocal is the command for copying a file
at HDFS to the local file system.
• -cat is the command for copying a file's content
to standard output (stdout).
• All Hadoop commands are invoked by the
bin/hadoop script, for example:
% hadoop fsck / -files -blocks
• Table 2.1 gives the examples of command
usages.
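• As an illustrative sketch of how such commands are
typically invoked (the HDFS path /user/hdfs/file.txt and
the local path ./file.txt are hypothetical placeholders):
$ hdfs dfs -ls /                                        # list the HDFS root directory
$ hdfs dfs -cat /user/hdfs/file.txt                     # copy a file's content to stdout
$ hdfs dfs -copyToLocal /user/hdfs/file.txt ./file.txt  # copy an HDFS file to the local file system
$ hdfs fsck / -files -blocks                            # report files and blocks for the whole file system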
MAPREDUCE FRAMEWORK AND
PROGRAMMING MODEL
• MapReduce is a programming model for
distributed computing.
• Mapper means software for doing the
assigned task after organizing the data blocks
imported using the keys.
• A key is specified in a command line of the Mapper.
The command maps the key to the data,
which an application uses.
• Reducer means software for reducing the
mapped data by using the aggregation, query or
user-specified function. The reducer provides a
concise cohesive response for the application.
• Aggregation function means the function that
groups the values of multiple rows together to
result in a single value of more significant meaning
or measurement; for example, functions such as
count, sum, maximum, minimum, deviation and
standard deviation.
• Querying function means a function that finds
the desired values; for example, a function for
finding the best student of a class, who has shown
the best performance in the examination.
• MapReduce allows writing applications to
process reliably the huge amounts of data, in
parallel, on large clusters of servers.
• The cluster size does not, as such, limit the
parallel processing.
• The parallel programs of MapReduce are
useful for performing large scale data analysis
using multiple machines in the cluster.
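• As an illustrative sketch only, a packaged MapReduce
program can be submitted to the cluster from the command
line; here the stock word-count example that ships with
Hadoop is assumed, and the jar path and the /input and
/output HDFS directories are placeholders:
$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount /input /output
$ hdfs dfs -cat /output/part-r-00000    # inspect the reduced (aggregated) word counts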
Features of MapReduce framework are as follows:
1. Provides automatic parallelization and distribution of
computation based on several processors
2. Processes data stored on distributed clusters of
DataNodes and racks
3. Allows processing large amount of data in parallel
4. Provides scalability for usages of large number of
servers
5. Provides MapReduce batch-oriented programming
model in Hadoop version 1
6. Provides additional processing modes in the Hadoop 2
YARN-based system and enables the required parallel
processing, for example for queries, graph databases,
streaming data, messages, real-time OLAP and ad hoc
analytics with Big Data 3V characteristics.
HADOOP YARN
• YARN is a resource management platform. It
manages computer resources. The platform is
responsible for providing the computational
resources, such as CPUs, memory, network I/O
which are needed when an application
executes.
• An application task has a number of sub-tasks.
YARN manages the schedules for running of the
sub-tasks. Each sub-task uses the resources in
allotted time intervals.
• YARN separates the resource management
and processing components.
• YARN stands for Yet Another Resource
Negotiator. An application consists of a
number of tasks. Each task can consist of a
number of sub-tasks (threads), which run in
parallel at the nodes in the cluster.
• YARN enables running of multi-threaded
applications.
• YARN manages and allocates the resources for
the application sub-tasks and submits the
resources for them at the Hadoop system.
Hadoop 2 Execution Model
• Figure 2.5 shows the YARN-based execution
model. The figure illustrates the YARN
components, namely Client, Resource Manager (RM),
Node Manager (NM), Application Master (AM)
and Containers.
• List of actions of YARN resource allocation and
scheduling functions is as follows:
• A MasterNode has two components: (i) Job
History Server and (ii) Resource Manager(RM).
• A Client Node submits the request of an
application to the RM.
• The RM is the master. One RM exists per cluster.
The RM keeps information of all the slave NMs.
• Information is about the location (Rack
Awareness) and the number of resources (data
blocks and servers) they have. The RM also
renders the Resource Scheduler service that
decides how to assign the resources. It, therefore,
performs resource management as well as
scheduling.
• Multiple NMs are at a cluster. An NM creates an
AM instance (AMI) and starts up. The AMI
initializes itself and registers with the RM.
Multiple AMIs can be created in an AM.
• The AMI performs role of an Application
Manager (ApplM),that estimates the resources
requirement for running an application
program or sub-task.
• The ApplMs send their requests for the
necessary resources to the RM. Each NM
includes several containers for use by the
sub-tasks of the application.
• NM is a slave of the infrastructure. It signals
whenever it initializes. All active NMs send the
controlling signal periodically to the RM signaling
their presence.
• Each NM assigns a container(s) for each AMI. The
container(s) assigned at an instance may be at same
NM or another NM. ApplM uses just a fraction of
the resources available. The ApplM at an instance
uses the assigned container(s) for running the
application sub-task.
• RM allots the resources to AM, and thus to ApplMs
for using assigned containers on the same or other
NM for running the application subtasks in parallel.
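• As a sketch, the RM's view of the cluster and its running
applications can be inspected from the command line (output
varies by installation; <application_id> is a placeholder):
$ yarn node -list                              # NodeManagers registered with the RM
$ yarn application -list                       # applications currently managed by the RM
$ yarn application -status <application_id>    # resources and progress of one application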
Hadoop Distributed File System (HDFS) Basics

Text Book-2
HDFS Design Features
Important aspects of HDFS include:
• Write-once/read-many design is intended to
facilitate streaming reads.
• Files may be appended, but random seeks are not
permitted. There is no caching of data.
• Data storage and processing happen on the same
server nodes.
• “Moving computation is cheaper than moving data”.
• A reliable file system maintains multiple copies of
data across the cluster, so failure of a single node or
even a rack in a large cluster will not bring down the
file system.
• A specialized file system is used, which is not
designed for general use.
HDFS Components
• Two types of nodes: a NameNode and multiple DataNodes.
• The NameNode manages all the metadata (data about
data, a description of all data) needed to store and
retrieve the actual data from the DataNodes.
• No data is actually stored on the NameNode.
• The design is Master/slave architecture in which the
master (NameNode) manages the file system
namespace and regulates access to files by clients.
• Filesystem namespace operations including opening,
closing and renaming files and directories are
managed by the NameNode.
• NameNode also determines mapping of blocks to
DataNodes and handles DataNode failures.
• The slaves (DataNodes) are responsible for serving read
and write requests from the file system to the clients.

• The NameNode also manages block creation, deletion
and replication.
• During a write operation:
– The client first communicates with the NameNode and puts a
request to create a file.
– The NameNode determines the number of blocks needed and
provides the client with the DataNodes that will store the data.
– As part of the storage process, the data blocks are replicated
once they are written to the assigned node.
– Depending on the number of nodes in the cluster, the
NameNode will attempt to write replicas of the data
blocks on nodes that are in other, separate racks (if
possible).
– If there is only one rack, then the replicated blocks
are written to other servers in the same rack.
– The operation is complete once the DataNodes
acknowledge that the file block replication is
complete.
• The NameNode has two disk files that track changes to
the metadata:
– fsimage_* : an image of the file system state
– edits_* : a series of modifications done to the file system
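• As a sketch (the directory below is a placeholder; the actual
location is set by dfs.namenode.name.dir), these files can be
listed and inspected with the offline viewers shipped with HDFS:
$ ls /data/hadoop/dfs/name/current/                    # shows the fsimage_* and edits_* files
$ hdfs oiv -p XML -i fsimage_<txid> -o fsimage.xml     # offline image viewer: dump a checkpoint as XML
$ hdfs oev -i edits_<txid> -o edits.xml                # offline edits viewer: dump an edits segment as XML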
• The NameNode does not write any data directly to the
DataNodes. It does, however, give the client a limited
amount of time to complete the operation. If the
operation does not complete in that time period, it is
canceled.
• Reading operation also takes place in similar fashion.
– The client requests a file from the NameNode, which
returns the best DataNodes from which to read the data.
– The client then accesses the data directly from the
DataNodes.
• The major work of NameNode is to deliver the
metadata.
• NameNode monitors the DataNodes by listening for
heartbeats sent from DataNodes during data
transfer process.
• Lack of a heartbeat indicates node failure.
• The block reports are sent every 10 heartbeats.
• The reports enable the NameNode to keep an up-to-date
account of all data blocks in the cluster.
• There is a SecondaryNameNode in almost all Hadoop
deployments.

Secondary NameNode
• It performs periodic checkpoints that evaluate
the status of the NameNode.
• It downloads the fsimage and edits files periodically
and joins them into a new fsimage, in order to
upload it to the NameNode.
Summary of various roles in HDFS
• HDFS uses a master/slave model designed for large file
reading/streaming.
• The NameNode is a metadata server or "data traffic cop".
• HDFS provides a single namespace that is managed by the
NameNode.
• No data is stored on the NameNode.
• The SecondaryNameNode performs checkpoints of the
NameNode file system's state.
HDFS Block Replication
• The amount of replication is based on the value
of dfs.replication in the hdfs-site.xml file.
• For Hadoop clusters containing more than eight
DataNodes, the replication value is usually set to 3.
• For a Hadoop cluster of eight or fewer DataNodes,
but more than one, a replication value of 2 is used.
• For a single machine, the replication factor is 1.
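• As a sketch, the configured default can be checked and
the replication of an existing path changed from the
command line (the path /user/hdfs/stuff is a placeholder):
$ hdfs getconf -confKey dfs.replication       # show the configured default replication factor
$ hdfs dfs -setrep -w 2 /user/hdfs/stuff      # set replication of a path to 2 and wait for completion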
HDFS Safe Mode
• NameNode enters a read-only safe mode when
it starts, where blocks cannot be replicated or
deleted.
• Safe mode enables two important processes to be
performed by the NameNode:
– The previous file system state is reconstructed by loading
the fsimage file into memory and replaying the edit log.
– The mapping between blocks and DataNodes is created
by waiting for the DataNodes to register, so that at least
one copy of the data is available.
• HDFS can also enter this safe mode for maintenance
using the hdfs dfsadmin -safemode command.
• It is addressed by the administrator when
there is a file system issue
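• For example, the administrator can query and toggle safe mode as follows:
$ hdfs dfsadmin -safemode get      # report whether the NameNode is in safe mode
$ hdfs dfsadmin -safemode enter    # enter safe mode for maintenance
$ hdfs dfsadmin -safemode leave    # return to normal read/write operation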
Rack Awareness
• It deals with data locality.
• A typical Hadoop cluster will exhibit three levels of data
locality:
– Data resides on the local machine (best).
 Best performance, but suffers from a single point of failure.
– Data resides in the same rack (better).
 Better performance, but suffers from a single point of failure.
– Data resides in a different rack (good).
 Good performance.
• When the YARN (Yet Another Resource Negotiator)
scheduler is assigning MapReduce containers to work
as mappers, it tries to place the container first on the
local machine, then on the same rack, and finally on
another rack.
• A default Hadoop installation assumes that all the
nodes belong to the same (large) rack.
Name Node High Availability(HA)
• In order to provide true failover service for the
NameNode, NameNode HA was implemented as a
solution.
• An HA Hadoop cluster has two (or more) separate
NameNode machines.
• Each machine is configured with exactly the same
software.
• One is in the active state and the other is in the standby state.
• The active NameNode is responsible for all client HDFS
operations in the cluster.
• The standby NameNode maintains enough state to
provide a fast failover (if required).
• Both NameNodes receive block reports from the
DataNodes in order to guarantee that the file system
state is preserved.
• The active node also sends file system edits to a
quorum of JournalNodes.
• The standby node continuously reads the edits
from the JournalNodes in order to ensure its
namespace is synchronized with that of the
active node.
• When the active NameNode fails, the standby node
reads all remaining edits from the JournalNodes
before promoting itself to the active state.
• A SecondaryNameNode is not required in the HA
configuration.
• Apache ZooKeeper monitors the NameNode health.
ZooKeeper is a centralized service for maintaining
configuration information, naming, providing distributed
synchronization, and providing group services.
• ZooKeeper is a highly available service for:
– Maintaining small amounts of coordination data
– Notifying clients of changes in that data
– Monitoring clients for failures
• HDFS failover relies on ZooKeeper for failure detection
and for the standby-to-active NameNode election.
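• As a sketch (the service IDs nn1 and nn2 are placeholders
defined in the cluster's HA configuration), the state of each
NameNode can be queried, and a manual failover triggered,
with the haadmin tool:
$ hdfs haadmin -getServiceState nn1    # reports active or standby
$ hdfs haadmin -getServiceState nn2
$ hdfs haadmin -failover nn1 nn2       # manually fail over from nn1 to nn2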
HDFS NameNode Federation
• Older versions of HDFS provided a single namespace
for the entire cluster, managed by a single NameNode.
• The resources of that single NameNode therefore
determined the size of the whole namespace.
• Federation addresses this limitation by adding
support for multiple NameNodes/namespaces to
the HDFS file system.
• Key benefits are:
– Namespace scalability
– Better performance
– System isolation
HDFS Checkpoints and Backups
• The HDFS BackupNode maintains an up-to-date
copy of the file system namespace both in
memory and on disk.
• Since it has an up-to-date namespace state in
memory, it does not need to download the fsimage
and edits files from the active NameNode.
• A NameNode supports one backup node at a
time.
• No checkpoint nodes may be registered if a
backup node is in use.
HDFS Snapshots
• Snapshots are created with the hdfs dfs -createSnapshot
command (an administrator must first allow snapshots on
the directory with hdfs dfsadmin -allowSnapshot).
• They are read-only point-in-time copies of the file
system.
• They offer the following features:
– Snapshots can be taken of a sub-tree of the file system
or entire file system
– It can be used for data backup, protection against user
errors, and disaster recovery
– Snapshot creation is instantaneous
– Snapshot files record the block list and the file size
– Snapshots do not adversely affect regular HDFS
operations
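• A minimal sketch (the directory /user/hdfs/stuff and the
snapshot name snap1 are placeholders):
$ hdfs dfsadmin -allowSnapshot /user/hdfs/stuff     # administrator allows snapshots on a directory
$ hdfs dfs -createSnapshot /user/hdfs/stuff snap1   # create a read-only point-in-time snapshot
$ hdfs dfs -ls /user/hdfs/stuff/.snapshot/snap1     # snapshot contents appear under .snapshot
$ hdfs dfs -deleteSnapshot /user/hdfs/stuff snap1   # remove the snapshot when no longer needed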
HDFS NFS Gateway
• It supports NFSv3 and enables HDFS to be
mounted as part of the client's local file system.
• The feature offers users the following
capabilities:
– Users can easily download/upload files from/to the
HDFS file system to/from their local file system
– Users can stream data directly to HDFS through the
mount point
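• A minimal sketch of mounting HDFS on a Linux client through
the gateway (the gateway host name and the mount point are
placeholders, and the gateway services must already be running):
$ sudo mkdir -p /mnt/hdfs
$ sudo mount -t nfs -o vers=3,proto=tcp,nolock nfs-gateway-host:/ /mnt/hdfs
$ ls /mnt/hdfs     # HDFS is now browsable as part of the local file system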
HDFS User Commands
• Examples given here are simple use-cases
• There are alternative options for each command
Brief HDFS command reference
• The hdfs command is used to interact with HDFS in Hadoop
version 2.
Usage: hdfs [--config confdir] COMMAND
• where COMMAND is one of: dfs, namenode, journalnode,
datanode, version, crypto, etc.
https://www.edureka.co/blog/hdfs-commands-hadoop-shell-command
General hdfs commands
• The version of HDFS can be found from
the version option.
$ hdfs version
• Command to check the health of the Hadoop file system:
$ hdfs fsck /
Status: HEALTHY
Total size: 0 B
Total dirs: 1
Total files: 0
Total symlinks: 0
Total blocks (validated): 0
Minimally replicated blocks: 0
Over-replicated blocks: 0
Under-replicated blocks: 0
Mis-replicated blocks: 0
Default replication factor: 1
Average block replication: 0.0
Corrupt blocks: 0
Missing replicas: 0
Number of data-nodes: 1
Number of racks: 1
FSCK ended at Tue Jan 08 23:38:05 PST 2019 in 7 milliseconds
The filesystem under path '/' is HEALTHY
List files in hdfs
• To list the files in the root HDFS directory, enter
the following command:
$ hdfs dfs -ls /
• To list the files in your HOME directory, enter the
following command:
$ hdfs dfs -ls

• The same result can be obtained by issuing the following command:
$ hdfs dfs -ls /user/hdfs
• Make a directory in HDFS
$ hdfs dfs -mkdir stuff

• Copy files to HDFS
To copy a file from your current local directory into HDFS:
$ hdfs dfs -put test stuff
(the file test is placed in the directory stuff, which was created previously)
The file transfer can be confirmed by using -ls:
$ hdfs dfs -ls stuff
• Copy files from HDFS
The file we copied into HDFS, test, will be copied back to the
current local directory with the name test-local:
$ hdfs dfs -get stuff/test test-local
• Copy files within HDFS
$ hdfs dfs -cp stuff/test test.hdfs
• Delete a file within HDFS
$ hdfs dfs -rm test.hdfs
• To delete a file and bypass the user's .Trash directory:
$ hdfs dfs -rm -skipTrash stuff/test
Copy files to HDFS
To copy a file from the current local directory to HDFS, the
following command is used. If the full path is not specified,
the home directory is used.
$ hdfs dfs -put comm /countwords
where comm is a file that has already been created, and it is
copied into the directory countwords.
The transfer can be confirmed using the -ls command:
$ hdfs dfs -ls countwords
Found 1 items
-rw-r--r--   1 bharani supergroup   26 2019-01-09 03:55 countwords/comm
Copy files from HDFS
Files can be copied back to your local file system:
$ hdfs dfs -get countwords/comm sample-local
where comm is a file in the countwords directory in HDFS and
sample-local is the name of the file that is being created on the
local file system.
Copy files within HDFS
$ hdfs dfs -cp countwords/comm countwords/sample
where comm is copied to sample within HDFS.
• Delete a directory in HDFS
$ hdfs dfs -rm -r -skipTrash stuff
• Get an HDFS status report
$ hdfs dfsadmin -report
• HDFS Web GUI
– HDFS provides an informational web interface
– HDFS must be started and running on the cluster
before the GUI can be used
• Using HDFS in programs
– HDFS Java Application Example
– HDFS C Application Example
Get an HDFS status report
$ hdfs dfsadmin -report
Configured Capacity: 50596724736 (47.12 GB)
Present Capacity: 37902954496 (35.30 GB)
DFS Remaining: 37902901248 (35.30 GB)
DFS Used: 53248 (52 KB)
DFS Used%: 10.83%
Under replicated blocks: 5
Blocks with corrupt replicas: 0
Missing blocks: 0
Missing blocks (with replication factor 1): 0
Pending deletion blocks: 0
Module 2

Essential Hadoop Tools

Text Book-2
Hadoop & Hadoop Ecosystem
• Hadoop is an open-source software framework for storing data
and running applications on clusters of commodity hardware. It
provides massive storage for any kind of data, enormous
processing power and the ability to handle virtually limitless
concurrent tasks or jobs.
• The Hadoop ecosystem comprises the various tools that are
required to perform different tasks in Hadoop. These tools
provide a number of Hadoop services which can help you handle
Big Data more efficiently.
Hadoop ecosystem
Hadoop ecosystem Components
 HDFS(Hadoop distributed file system)
 YARN
 MapReduce
 APACHE SPARK
 Hive
 H Base
 H Catalogue
 Apache Pig
 Apache Sqoop
 Oozie
 Avro
 Apache Drill
 Apache Zookeeper
 Apache Flume
 Apache Ambari
Hadoop ecosystem
 Sqoop: It is used to import and export data between
HDFS and RDBMS.
 Pig: It is a procedural language platform used to develop a
script for MapReduce operations.
 Hbase: HBase is a distributed column-oriented database
built on top of the Hadoop file system.
 Hive: It is a platform used to develop SQL type scripts to do
MapReduce operations.
 Flume: Used to handle streaming data on the top of
Hadoop.
 Oozie: Apache Oozie is a workflow scheduler for
Hadoop.
Essential Hadoop Tools
Introduction to Pig
● Pig raises the level of abstraction for
processing large amount of datasets.
● It is a fundamental platform for analyzing large
amount of data sets which consists of a high
level language for expressing data analysis
programs.
● It is an open source platform developed by
yahoo.
What is PIG?
• Pig is a high-level programming language useful for
analyzing large data sets. Pig was a result of a
development effort at Yahoo!
• In a MapReduce framework, programs need to be
translated into a series of Map and Reduce stages.
• However, this is not a programming model which
data analysts are familiar with. So, in order to bridge
this gap, an abstraction called Pig was built on top of
Hadoop.
• Apache Pig enables people to focus more
on analyzing bulk data sets and to spend less time
writing Map-Reduce programs. Similar to Pigs, who
eat anything, the Pig programming language is
designed to work upon any kind of data. That's why
the name, Pig!
Advantages of Pig
● Reusing the code
● Faster development
● Less number of lines of code
● Schema and type checking etc
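● As a sketch, a Pig Latin script (here a hypothetical
wordcount.pig) can be run either locally for testing or on
the Hadoop cluster:
$ pig -x local wordcount.pig        # run against the local file system, useful while developing
$ pig -x mapreduce wordcount.pig    # run on the cluster; the script compiles into MapReduce jobs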
Apache Sqoop
• Sqoop is a tool designed to transfer data between
Hadoop and relational databases. you can use
Sqoop to import data from a relational database
management system (RDBMS) into the Hadoop
Distributed File System (HDFS), transform the
data in Hadoop, and then export the data back
into an RDBMS.

Sqoop − “SQL to Hadoop and Hadoop to SQL”


Apache Sqoop
• The traditional application management system, that
is, the interaction of applications with relational
database using RDBMS, is one of the sources that
generate Big Data. Such Big Data, generated by
RDBMS, is stored in Relational Database Servers in
the relational database structure.

• When Big Data storages and analyzers such as


MapReduce, Hive, HBase, Cassandra, Pig, etc. of the
Hadoop ecosystem came into picture, they required
a tool to interact with the relational database servers
for importing and exporting the Big Data residing in
them.
Apache Sqoop
• Sqoop can be used with any Java Database
Connectivity (JDBC)-compliant database and has
been tested on Microsoft SQL Server, PostgreSQL,
MySQL, and Oracle.
• In version 1 of Sqoop, data were accessed using
connectors written for specific databases.
• Version 2 (in beta) does not support connectors,
or version 1 data transfer from an RDBMS directly
to Hive or HBase, or data transfer from Hive or
HBase to your RDBMS.
How Sqoop Works?
How Sqoop Works?
• Sqoop Import
– The import tool imports individual tables from RDBMS to
HDFS. Each row in a table is treated as a record in HDFS.
– All records are stored as text data in text files or as binary
data in Avro and Sequence files.
• Sqoop Export
– The export tool exports a set of files from HDFS back to an
RDBMS.
– The files given as input to Sqoop contain records, which
are called rows in the table.
– Those are read and parsed into a set of records and
delimited with a user-specified delimiter.
Apache Sqoop Import and Export
Methods
• The figure describes the Sqoop data import (to HDFS)
process. The data import is done in two steps.
• In the first step, shown in the figure, Sqoop examines
the database to gather the necessary metadata for the
data to be imported.
• The second step is a map-only (no reduce step)
Hadoop job that Sqoop submits to the cluster.
• Note that each node doing the import must have
access to the database.
• The imported data are saved in an HDFS
directory. Sqoop will use the database name
for the directory, or the user can specify any
alternative directory where the files should
be populated. By default, these files contain
comma-delimited fields, with new lines
separating different records.
• You can easily override the format in which
data are copied over by explicitly specifying
the field separator and record terminator
characters. Once placed in HDFS, the data are
ready for processing.
Fig. Two-step Apache sqoop data import method
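• A minimal sketch of such an import (the MySQL host,
database, table, user and target directory are placeholders):
$ sqoop import \
    --connect jdbc:mysql://dbhost/sales \
    --username dbuser -P \
    --table customers \
    --target-dir /user/hdfs/customers \
    -m 4        # run the map-only import with 4 parallel map tasks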
• Data export from the cluster works in a similar
fashion. The export is done in two steps, as
shown in the figure.
• As in the import process, the first step is to
examine the database for metadata. The export
step again uses a map-only Hadoop job to write
the data to the database.
• Sqoop divides the input data set into splits, then
uses individual map tasks to push the splits to the
database. Again, this process assumes the map
tasks have access to the database.
Fig. Two-step Apache sqoop data export method
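• A minimal sketch of such an export (the connection details
and table are placeholders; the HDFS directory holds the
files to be pushed to the RDBMS):
$ sqoop export \
    --connect jdbc:mysql://dbhost/sales \
    --username dbuser -P \
    --table customer_summary \
    --export-dir /user/hdfs/customer_summary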
Apache Sqoop Version Changes
Feature: Connectors for all major RDBMSs
– Sqoop Version 1: Supported.
– Sqoop Version 2: Not supported. Use the generic JDBC connector.

Feature: Kerberos security integration
– Sqoop Version 1: Supported.
– Sqoop Version 2: Not supported.

Feature: Data transfer from RDBMS to Hive or HBase
– Sqoop Version 1: Supported.
– Sqoop Version 2: Not supported. First import data from the RDBMS
into HDFS, then load the data into Hive or HBase manually.

Feature: Data transfer from Hive or HBase to RDBMS
– Sqoop Version 1: Not supported. First export the data from Hive or
HBase into HDFS, and then use Sqoop for the export.
– Sqoop Version 2: Not supported. First export the data from Hive or
HBase into HDFS, then use Sqoop for the export.
What is Flume?
• Apache Flume is a tool/service/data ingestion
mechanism for collecting, aggregating and
transporting large amounts of streaming data,
such as log files and events, from various
sources to a centralized data store.
• Flume is a highly reliable, distributed, and
configurable tool. It is principally designed to
copy streaming data (log data) from various
web servers to HDFS.
Applications of Flume
• Assume an e-commerce web application
wants to analyze the customer behavior from
a particular region. To do so, they would need
to move the available log data in to Hadoop
for analysis. Here, Apache Flume comes to our
rescue.
• Flume is used to move the log data generated
by application servers into HDFS at a higher
speed.
Advantages of Flume
• Using Apache Flume we can store the data into any of the
centralized stores (HBase, HDFS).
• When the rate of incoming data exceeds the rate at which
data can be written to the destination, Flume acts as a
mediator between the data producers and the centralized stores
and provides a steady flow of data between them.
• The transactions in Flume are channel-based, where two
transactions (one sender and one receiver) are maintained
for each message. This guarantees reliable message delivery.
• Flume is reliable, fault tolerant, scalable, manageable, and
customizable.
Apache Flume - Architecture
• The following illustration depicts the basic
architecture of Flume. As shown in the
illustration, data generators (such as
Facebook, Twitter) generate data which gets
collected by individual Flume agents running
on them. Thereafter, a data collector (which is
also an agent) collects the data from the
agents which is aggregated and pushed into a
centralized store such as HDFS or HBase.
What is Flume?
Fig. Flume agent with source, channel and sink
Apache Flume - Architecture
Flume Agent
• An agent is an independent daemon process
(JVM) in Flume. It receives the data (events)
from clients or other agents and forwards it
to its next destination (sink or agent). Flume
may have more than one agent. Following
diagram represents
a Flume Agent
Source
• A source is the component of an agent which receives data from the data
generators and transfers it to one or more channels in the form of Flume events.
• Apache Flume supports several types of sources and each source receives
events from a specified data generator.
• Example − Facebook, Avro source, Thrift source, Twitter 1% source, etc.

Channel
• A channel is a transient store which receives the events from the source and
buffers them till they are consumed by sinks. It acts as a bridge between the
sources and the sinks.
• These channels are fully transactional and they can work with any number of
sources and sinks.
• Example − JDBC channel, File system channel, Memory channel, etc.
• Sink
• A sink stores the data into centralized stores
like HBase and HDFS. It consumes the data
(events) from the channels and delivers it to
the destination. The destination of the sink
might be another agent or the central stores.
• Example − HDFS sink
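• As a sketch, an agent is started from the command line by
pointing flume-ng at a properties file (here a hypothetical
example.conf) that wires a named source, channel and sink together:
$ flume-ng agent --conf conf --conf-file example.conf --name a1 -Dflume.root.logger=INFO,console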
Setting multi-agent flow
• In order for the data to flow across multiple
agents or hops, the sink of the previous agent
and the source of the current hop need to be of
Avro type, with the sink pointing to the hostname
(or IP address) and port of the source.
• Within Flume, there can be multiple agents
and before reaching the final destination, an
event may travel through more than one
agent. This is known as multi-hop flow.
Setting multi-agent flow
Flume Consolidation
• A very common scenario in log collection is a large
number of log producing clients sending data to a few
consumer agents that are attached to the storage
subsystem. For example, logs collected from hundreds
of web servers sent to a dozen of agents that write to
HDFS cluster.
• This can be achieved in Flume by configuring a number
of first tier agents with an avro sink, all pointing to an
avro source of single agent (Again you could use the
thrift sources/sinks/clients in such a scenario). This
source on the second tier agent consolidates the
received events into a single channel which is consumed
by a sink to its final destination.
Consolidation
Apache Hive
• The Apache Hive data warehouse software facilitates
reading, writing, and managing large datasets
residing in distributed storage using SQL. Structure
can be projected onto data already in storage. A
command line tool and JDBC driver are provided to
connect users to Hive.

• Hive is a data warehouse infrastructure tool to
process structured data in Hadoop. It resides on
top of Hadoop to summarize Big Data, and makes
querying and analyzing easy.
• Initially Hive was developed by Facebook, later
the Apache Software Foundation took it up
and developed it further as an open source
under the name Apache Hive. It is used by
different companies. For example, Amazon
uses it in Amazon Elastic MapReduce.
Features of Hive
• It stores schema in a database and processed
data into HDFS.
• It provides SQL type language for querying
called HiveQL or HQL.
• It is familiar, fast, scalable, and extensible.
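• As a sketch (the table and query are placeholders), HiveQL
statements can be submitted from the command line either
with -e (inline) or -f (script file):
$ hive -e "SELECT dept, COUNT(*) FROM employees GROUP BY dept;"    # run a single HiveQL statement
$ hive -f report.hql                                               # run the statements in a script file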
Architecture of Hive
Characteristics of Hive
1. Databases and tables are built before loading the data.
2. Hive as a data warehouse is built to manage and query only
structured data, which resides in tables.
3. While handling structured data, MapReduce lacks optimization
and usability functions such as UDFs, whereas the Hive framework
has optimization and usability.
4. Programming in Hadoop deals directly with the files, so Hive can
partition the data with directory structures to improve performance on
certain queries.
5. Hive is compatible with various file formats such as TEXTFILE,
SEQUENCEFILE, ORC, RCFILE, etc.
6. Hive uses the Derby database for single-user metadata storage, and
MySQL for multiple-user or shared metadata.
Architecture of Hive
User Interface: Hive is a data warehouse infrastructure software that can create
interaction between the user and HDFS. The user interfaces that Hive supports are
Hive Web UI, Hive command line, and Hive HD Insight (on Windows server).
Meta Store: Hive chooses respective database servers to store the schema or
metadata of tables, databases, columns in a table, their data types, and the
HDFS mapping.
HiveQL Process Engine: HiveQL is similar to SQL for querying on schema
information in the Metastore. It is one of the replacements of the traditional
approach for a MapReduce program. Instead of writing a MapReduce program in
Java, we can write a query for the MapReduce job and process it.
Execution Engine: The conjunction part of the HiveQL Process Engine and
MapReduce is the Hive Execution Engine. The execution engine processes the
query and generates results the same as MapReduce results. It uses the flavor
of MapReduce.
HDFS or HBASE: Hadoop Distributed File System or HBase are the data storage
techniques used to store data into the file system.
Working of Hive
Working of Hive
Step No. and Operation:
1 Execute Query The Hive interface such as Command Line or Web UI sends
query to Driver (any database driver such as JDBC, ODBC, etc.) to execute.
2 Get Plan The driver takes the help of query compiler that parses the query to
check the syntax and query plan or the requirement of query.
3 Get Metadata The compiler sends metadata request to Metastore (any
database).
4 Send Metadata Metastore sends metadata as a response to the compiler.

5 Send Plan The compiler checks the requirement and resends the plan to the
driver. Up to here, the parsing and compiling of a query is complete.
6 Execute Plan The driver sends the execute plan to the execution engine.
Working of Hive
7 Execute Job Internally, the process of execution job is a
MapReduce job. The execution engine sends the job to
JobTracker, which is in Name node and it assigns this job to
TaskTracker, which is in Data node. Here, the query executes
MapReduce job.
7.1 Metadata Ops Meanwhile in execution, the execution engine can
execute metadata operations with Metastore.
8 Fetch Result The execution engine receives the results from Data
nodes.
9 Send Results The execution engine sends those resultant values
to the driver.
10 Send Results The driver sends the results to Hive Interfaces.
Hive - Data Types
All the data types in Hive are classified into four
types, given as follows:
• Column Types
• Literals
• Null Values
• Complex Types
Apache Oozie
• Apache Oozie is a workflow scheduler for
Hadoop.
• It is a system which runs the workflow of
dependent jobs.
• Here, users are permitted to create Directed
Acyclic Graphs of workflows, which can be run
in parallel and sequentially in Hadoop.
Apache Oozie
It consists of Three parts:
• Workflow engine: Responsibility of a workflow
engine is to store and run workflows composed
of Hadoop jobs e.g., MapReduce, Pig, Hive.
• Coordinator engine: It runs workflow jobs based
on predefined schedules and availability of data.
• Bundle: Higher level abstraction that will batch a
set of coordinator jobs
Apache Oozie
• Oozie is scalable and can manage the timely
execution of thousands of workflows (each
consisting of dozens of jobs) in a Hadoop cluster.
• Oozie is very much flexible, as well. One can
easily start, stop, suspend and rerun jobs. Oozie
makes it very easy to rerun failed workflows. One
can easily understand how difficult it can be to
catch up missed or failed jobs due to downtime
or failure. It is even possible to skip a specific
failed node.
How does OOZIE work?
• Oozie runs as a service in the cluster and clients submit workflow
definitions for immediate or later processing.
• Oozie workflow consists of action nodes and control-flow nodes.
• A control-flow node controls the workflow execution between actions by
allowing constructs like conditional logic wherein different branches may
be followed depending on the result of earlier action node.
• Start Node, End Node, and Error Node fall under this category of nodes.
• Start Node, designates the start of the workflow job.
• End Node, signals end of the job.
• Error Node designates the occurrence of an error and corresponding error
message to be printed.
• An action node represents a workflow task, e.g., moving files into HDFS,
running MapReduce, Pig or Hive jobs, importing data using Sqoop, or
running a shell script of a program written in Java.
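• As a sketch (the Oozie server URL and the job.properties file are
placeholders), a workflow is submitted and monitored from the command line:
$ oozie job -oozie http://oozie-host:11000/oozie -config job.properties -run
$ oozie job -oozie http://oozie-host:11000/oozie -info <job_id>    # check the status of the workflow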
Example Workflow Diagram
HBase
● HBase is a data model that is similar to
Google’s big table designed to provide quick
random access to huge amounts of structured
data.
● Since 1970, RDBMS is the solution for data
storage and maintenance related problems.
After the advent of big data, companies
realized the benefit of processing big data and
started opting for solutions like Hadoop.
Limitations of Hadoop
● Hadoop can perform only batch processing,
and data will be accessed only in a sequential
manner. That means one has to search the
entire dataset even for the simplest of jobs.
● A huge dataset when processed results in
another huge data set, which should also be
processed sequentially. At this point, a new
solution is needed to access any point of data
in a single unit of time (random access).
What is HBase?
● HBase is a distributed column-oriented database built on top
of the Hadoop file system. It is an open-source project and is
horizontally scalable.
● It leverages the fault tolerance provided by the Hadoop
Distributed File System (HDFS).
HBase and HDFS
HDFS: HDFS is a distributed file system suitable for storing large files.
HBase: HBase is a database built on top of the HDFS.

HDFS: HDFS does not support fast individual record lookups.
HBase: HBase provides fast lookups for larger tables.

HDFS: It provides high latency batch processing; no concept of batch
processing.
HBase: It provides low latency access to single rows from billions of
records (random access).

HDFS: It provides only sequential access of data.
HBase: HBase internally uses hash tables and provides random access,
and it stores the data in indexed HDFS files for faster lookups.
What is HBase?
Storage Mechanism in HBase
• HBase is a column-oriented database and the tables in it are sorted by
row.
• The table schema defines only column families, which are the key-value
pairs.
• A table can have multiple column families and each column family can
have any number of columns.
• Subsequent column values are stored contiguously on the disk. Each
cell value of the table has a timestamp.
In short, in an HBase:

● Table is a collection of rows.

● Row is a collection of column families.

● Column family is a collection of columns.

● Column is a collection of key value pairs.
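● A minimal sketch using the HBase shell (the table name users,
column family info and row key row1 are placeholders):
$ hbase shell
hbase> create 'users', 'info'                       # table 'users' with one column family 'info'
hbase> put 'users', 'row1', 'info:name', 'Asha'     # write a cell value
hbase> get 'users', 'row1'                          # random read of a single row
hbase> scan 'users'                                 # scan the whole table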


Storage Mechanism in HBase
Where to Use HBase
● Apache HBase is used to have random, real-time read/write
access to Big Data.
● It hosts very large tables on top of clusters of commodity
hardware.
● Apache HBase is a non-relational database modeled after
Google's Bigtable. Bigtable acts upon the Google File System;
likewise, Apache HBase works on top of Hadoop and HDFS.
HBase Architecture
