UNIT-II
DISTRIBUTED FILE SYSTEMS LEADING TO HADOOP FILE SYSTEM
Big Data:
'Big Data' is a term used to describe a collection of data that is huge in size and yet growing exponentially with time.
Normally we work with data of MB size (Word documents, Excel sheets) or at most GB size (movies, code), but data of petabyte size, i.e. 10^15 bytes, is called Big Data.
It is stated that almost 90% of today's data has been generated in the past 3 years.
In short, such data is so large and complex that none of the traditional data management tools can store or process it efficiently.
The New York Stock Exchange generates about one terabyte of new trade data per day.
Statistics show that more than 500 terabytes of new data are ingested into the databases of social media sites such as Facebook, Google, and LinkedIn every day. This data is mainly generated through photo and video uploads, message exchanges, comments, etc.
A single jet engine can generate more than 10 terabytes of data in 30 minutes of flight time. With many thousands of flights per day, data generation reaches many petabytes.
E-commerce sites: Sites like Amazon, Flipkart, and Alibaba generate huge volumes of logs from which users' buying trends can be traced.
Weather stations: Weather stations and satellites produce very large volumes of data, which are stored and processed to forecast the weather.
Telecom companies: Telecom giants like Airtel and Vodafone study user trends and publish their plans accordingly; for this, they store the data of millions of users.
Big Data can be found in three forms:
1. Structured
2. Unstructured
3. Semi-structured
Structured
Any data that can be stored, accessed, and processed in a fixed format is termed 'structured' data.
Do you know? 10^21 bytes equal 1 zettabyte, i.e. one billion terabytes form a zettabyte.
Unstructured
Any data with an unknown form or structure is classified as unstructured data.
Semi-structured
Semi-structured data contains elements of both structured and unstructured data, for example personal records stored in an XML file:
<rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
<rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
<rec><name>Subrato Roy</name><sex>Male</sex><age>26</age></rec>
<rec><name>Jeremiah J.</name><sex>Male</sex><age>35</age></rec>
Volume
'Volume' is one characteristic which needs to be considered while dealing with Big Data. The size of data plays a very crucial role in determining the value that can be derived from it.
Variety
Variety refers to heterogeneous sources and the nature of data, both structured and
unstructured.
In earlier days, spreadsheets and databases were the only sources of data considered by most applications.
Nowadays, data in the form of emails, photos, videos, monitoring devices, PDFs, audio, etc. is also considered in analysis applications.
Velocity
Big Data Velocity deals with the speed at which data flows in from sources like business
processes, application logs, networks and social media sites, sensors, Mobile devices, etc.
The flow of data is massive and continuous.
Issues
A huge amount of unstructured data needs to be stored, processed, and analyzed.
Solution:
Apache Hadoop is the most important framework for working with Big Data. The biggest
strength of Hadoop is scalability.
Background of Hadoop
• With an increase in internet penetration and usage, the data captured by Google increased exponentially year on year.
• Just to give you an estimate of this number, in 2007 Google collected on average 270 PB of data every month.
• The same number increased to 20,000 PB every day in 2009.
• Obviously, Google needed a better platform to process such enormous data.
• Google implemented a programming model called MapReduce, which could process this 20,000 PB per day. Google ran these MapReduce operations on a special file system called the Google File System (GFS). Sadly, GFS is not open source.
Doug Cutting and Yahoo! reverse-engineered the GFS model and built a parallel file system, the Hadoop Distributed File System (HDFS).
The software or framework that supports HDFS and MapReduce is known as Hadoop.
What is Hadoop
Apache Hadoop is an open-source framework for distributed storage and processing of very large datasets on clusters of commodity hardware.
It is used to store, process, and analyze data that is very huge in volume.
It is being used by Facebook, Yahoo, Google, Twitter, LinkedIn and many more.
Apache Hadoop is not only a storage system but a platform for data storage as well as processing.
Commodity hardware is low-end, inexpensive hardware. Because Hadoop runs on such devices, it is very economical.
The idea of Apache Hadoop was actually born out of a Google project called MapReduce, a framework for breaking an application down into smaller chunks that can then be processed at a much smaller, more granular level. Each of the smaller chunks is operated on individually by nodes that are connected to the main cluster.
Hadoop works in a master-slave fashion. There is a master node and there are n slave nodes, where n can be in the thousands.
Slaves are the actual worker nodes. The master should be deployed on good-configuration hardware, not just commodity hardware, as it is the centerpiece of the Hadoop cluster.
Master stores the metadata (data about data) while slaves are the nodes which store the
data.
Hadoop components
1) HDFS - storage
Hadoop uses HDFS (Hadoop Distributed File System), which uses commodity hardware to form clusters and stores data in a distributed fashion.
2) MapReduce - processing
The MapReduce paradigm is applied to the data distributed over the network to find the required output.
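As a quick illustration of the two components working together, here is a minimal sketch (the directory names and the exact jar path are assumptions; adjust them to your installation) that stores input in HDFS and processes it with the word-count MapReduce example that ships with Hadoop:

# Upload a local text file into HDFS (storage handled by HDFS)
hadoop fs -mkdir -p /user/student/input
hadoop fs -put /home/student/sample.txt /user/student/input
# Run the bundled word-count job (processing handled by MapReduce on the cluster)
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
    wordcount /user/student/input /user/student/output
# Read back the result written by the reducers
hadoop fs -cat /user/student/output/part-r-00000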
HDFS architecture:
Hadoop comes with a distributed file system called HDFS (Hadoop Distributed File System). Hadoop-based applications make use of HDFS.
HDFS is designed for storing very large data files, running on clusters of commodity
hardware.
Hadoop HDFS has a master/slave architecture in which the master is the NameNode and the slaves are the DataNodes.
The HDFS architecture consists of a single NameNode; all the other nodes are DataNodes.
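To see this master/slave layout on a live cluster, the following administrative command (a sketch; it assumes you run it as the HDFS superuser on a configured client) asks the NameNode to report every DataNode it knows about, together with each node's capacity and usage:

hdfs dfsadmin -report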
HDFS NameNode
The HDFS NameNode stores metadata, i.e. the number of data blocks, replicas, and other details. This metadata is kept in memory on the master for faster retrieval of data. The NameNode maintains and manages the slave nodes and assigns tasks to them. It should be deployed on reliable hardware as it is the centerpiece of HDFS.
Tasks of the NameNode include managing the file system namespace, regulating clients' access to files, and recording every change made to the file system metadata. For this it maintains two persistent files: FsImage and EditLogs.
FsImage –
It is an "Image file". The FsImage contains the entire filesystem namespace and is stored as a file in the NameNode's local file system.
It also contains a serialized form of all the directories and file inodes in the filesystem.
EditLogs –
It contains all the recent modifications made to the file system since the most recent FsImage.
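Both files can be inspected offline with tools that ship with Hadoop. The sketch below is illustrative: the fsimage and edits file names depend on your cluster's transaction IDs and are found under the directory configured by dfs.namenode.name.dir.

# Dump a checkpoint image to readable XML (Offline Image Viewer)
hdfs oiv -p XML -i fsimage_0000000000000000042 -o fsimage.xml
# Dump an edit log segment to readable XML (Offline Edits Viewer)
hdfs oev -p xml -i edits_0000000000000000043-0000000000000000050 -o edits.xml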
HDFS DataNode
The DataNode performs read and write operations as per the request of the client.
Tasks of the DataNode include storing the actual data blocks, serving read and write requests from clients, and periodically sending heartbeats and block reports to the NameNode.
Secondary NameNode: The Secondary NameNode downloads the FsImage and EditLogs from the NameNode and periodically merges the EditLogs into the FsImage, so that the NameNode can restart from a compact, up-to-date image.
Checkpoint Node
The Checkpoint node is a node which periodically creates checkpoints of the namespace.
The Checkpoint node in Hadoop first downloads the FsImage and edits from the active NameNode. Then it merges them (FsImage and edits) locally, and at last it uploads the new image back to the active NameNode.
Backup Node
In Hadoop, Backup node keeps an in-memory, up-to-date copy of the file system
namespace.
The Backup node checkpoint process is more efficient as it only needs to save the
namespace into the local FsImage file and reset edits.
Blocks
HDFS in Apache Hadoop split huge files into small chunks known as Blocks. These are the
smallest unit of data in a filesystem. We (client and admin) do not have any control on the
block like block location. NameNode decides all such things.
The default size of an HDFS block is 64 MB (128 MB in Hadoop 2.x), which we can configure as per need.
All blocks of a file are of the same size except the last block, which can be the same size or smaller.
The major advantage of storing data in blocks of this size is that it saves disk seek time.
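To see how a particular file has actually been split into blocks, HDFS's file system checker can list every block and the DataNodes holding it. A minimal sketch (the path is illustrative):

hdfs fsck /user/student/input/sample.txt -files -blocks -locations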
Replication Management
Block replication provides fault tolerance. If one copy is inaccessible or corrupted, the data can be read from another copy.
The number of copies (replicas) of each block of a file is the replication factor. The default replication factor is 3, which is again configurable. So, each block is replicated three times and stored on different DataNodes.
If we are storing a file of 128 MB in HDFS using the default configuration, we will end up
occupying a space of 384 MB (3*128 MB).
The NameNode periodically receives a block report from each DataNode in order to maintain the replication factor.
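The replication factor of existing data can also be changed from the shell. A small sketch (paths are illustrative):

# Set the replication factor of a file to 2 and wait (-w) until re-replication finishes
hadoop fs -setrep -w 2 /user/student/input/sample.txt
# The second column of the listing shows each file's current replication factor
hadoop fs -ls /user/student/input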
Rack Awareness
In a large Hadoop cluster, in order to reduce network traffic while reading or writing an HDFS file, the NameNode chooses a DataNode on the same rack or a nearby rack to serve the read/write request. The NameNode obtains rack information by maintaining the rack IDs of each DataNode. Rack Awareness in Hadoop is the concept of choosing DataNodes based on this rack information.
In the HDFS architecture, the NameNode makes sure that all the replicas are not stored on the same (single) rack. It follows the Rack Awareness algorithm to reduce latency as well as improve fault tolerance. We know that the default replication factor is 3.
According to the Rack Awareness algorithm, the first replica of a block is stored on the local rack, the next replica is stored on another DataNode within the same rack, and the third replica is stored on a different rack.
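The rack assignment the NameNode is currently using can be checked with the following administrative command (run as the HDFS superuser); DataNodes for which no rack information has been configured all appear under the default rack:

hdfs dfsadmin -printTopology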
The cluster is the set of host machines (nodes). Nodes may be partitioned into racks. This is the hardware part of the infrastructure.
YARN Infrastructure (Yet Another Resource Negotiator) is the framework
responsible for providing the computational resources (e.g., CPUs, memory, etc.)
needed for application executions.
The Resource Manager (one per cluster) is the master. It knows where the slaves are located (Rack Awareness) and how many resources they have. It runs several services, the most important being the Resource Scheduler, which decides how to assign the resources.
The Node Manager (many per cluster) is the slave of the infrastructure. When it starts, it announces itself to the Resource Manager.
o Its resource capacity is the amount of memory and the number of vcores.
o At run time, the Resource Scheduler decides how to use this capacity: a Container is a fraction of the NM (Node Manager) capacity, and it is used by the client for running a program.
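On a running cluster, the resources each Node Manager has announced to the Resource Manager, and the applications currently consuming containers, can be listed with the YARN shell. A minimal sketch:

# List all Node Managers with their state and usage
yarn node -list -all
# List the applications currently known to the Resource Manager
yarn application -list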
HDFS Federation is the framework responsible for providing permanent, reliable, and distributed storage. It is typically used for storing inputs and outputs (but not intermediate data).
There are other alternative storage solutions; for instance, Amazon uses the Simple Storage Service (S3).
The YARN infrastructure and the HDFS federation are completely decoupled and independent:
the former provides resources for running an application, while the latter provides storage.
The MapReduce framework is only one of many possible frameworks that can run on top of YARN (although currently it is the only one implemented).
The Application Master is responsible for the execution of a single application. It asks the Resource Scheduler (Resource Manager) for containers and executes specific programs (e.g., the main of a Java class) on the containers it obtains.
The Application Master knows the application logic and is thus framework-specific.
By using Application Masters, YARN spreads the metadata related to running applications over the cluster.
This reduces the load on the Resource Manager and makes it quickly recoverable.
A data read request is served by HDFS, the NameNode, and the DataNodes. Let's call the reader a 'client'. The diagram below depicts the file read operation in Hadoop.
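From the client's point of view a read is just a shell (or API) call; the NameNode supplies the block locations and the DataNodes stream the actual bytes. A small sketch (the file name is illustrative):

# Stream the file contents to the console
hadoop fs -cat /user/student/input/sample.txt
# Or copy the file back to the local file system
hadoop fs -get /user/student/input/sample.txt /tmp/sample.txt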
In this section, we will understand how data is written into HDFS through files.
Hadoop HDFS commands are discussed below along with their usage, description, and examples.
Hadoop file system shell commands are used to perform various Hadoop HDFS operations and to manage the files present on HDFS clusters.
All the Hadoop file system shell commands are invoked by the bin/hdfs script.
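For example, invoking the script with the -help option prints the full list of supported file system commands (the relative bin/ path assumes you are inside the Hadoop installation directory):

bin/hdfs dfs -help
# Equivalent, if Hadoop's bin directory is already on the PATH
hadoop fs -help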
Upload:
hadoop fs -put:
Copies a single source file, or multiple source files, from the local file system to the Hadoop distributed file system.
Syntax: hadoop fs -put <localsrc> ... <HDFS_dest_Path>
Example: hadoop fs -put /home/saurzcode/Samplefile.txt /user/saurzcode/dir3/
Download:
hadoop fs -get:
Copies files from the Hadoop distributed file system to the local file system.
Syntax: hadoop fs -get <HDFS_src> ... <localdst>
Example: hadoop fs -get /user/saurzcode/dir3/Samplefile.txt /home/saurzcode/
This command allows multiple sources as well, in which case the destination must be a directory.
copyFromLocal
Similar to the put command, except that the source is restricted to a local file reference.
copyToLocal
Similar to the get command, except that the destination is restricted to a local file reference. A short usage sketch of both is shown below.
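Usage sketch for both commands (the paths follow the earlier examples and are purely illustrative):

# Upload: the source must be on the local file system
hadoop fs -copyFromLocal /home/saurzcode/Samplefile.txt /user/saurzcode/dir3/
# Download: the destination must be on the local file system
hadoop fs -copyToLocal /user/saurzcode/dir3/Samplefile.txt /home/saurzcode/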
Hadoop Daemons
Daemons are processes that run in the background. There are mainly four daemons which run for Hadoop: the NameNode, DataNode, ResourceManager, and NodeManager.
These four daemons must run for Hadoop to be functional. Apart from these, there can be a Secondary NameNode, Standby NameNode, Job HistoryServer, etc.
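A quick way to check which Hadoop daemons are actually running on a node is the jps command that ships with the JDK. The output below is only illustrative; process IDs and the set of daemons depend on the node's role:

jps
# 2134 NameNode
# 2301 DataNode
# 2567 ResourceManager
# 2712 NodeManager
# 2890 Jps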
Hadoop Flavors
Hadoop is available in several flavors (distributions), such as Apache Hadoop, Cloudera, Hortonworks, and MapR.
All the major databases provide native connectivity with Hadoop for fast data transfer, because to transfer data from a database such as Oracle to Hadoop, you need a connector.
All flavors are almost the same, and if you know one, you can easily work with the other flavors as well.
Hadoop Ecosystem
In this section, we will cover the Hadoop ecosystem components. Let us see which components form the Hadoop ecosystem: