computers using a single programming model. Large datasets here means that terabytes or petabytes of data are stored. Commodity computers means that no high-end machines are required, and the single programming model used is MapReduce. In MapReduce the application is divided into many small fragments of work, each of which can execute or re-execute on any node in the cluster. Hadoop also provides a distributed file system (DFS) that stores the data on the compute nodes, providing very high aggregate bandwidth across the cluster. The DFS is designed so that node failures are handled automatically by the framework. Hadoop streaming is a utility that ships with the Hadoop distribution and allows map/reduce code to be written in any language. This is a key feature in making Hadoop more acceptable and attractive to use.
Companies and organisations using Hadoop include Yahoo, Google, Facebook, Amazon, IBM and the Indian Aadhaar card system.
3. HDFS ARCHITECTURE
In this architecture, there are two main components of Hadoop, namely (i) the Hadoop Distributed File System (HDFS), which is used for storage, and (ii) the execution engine, i.e. MapReduce, which is used for processing.
There are five main building blocks of Hadoop [7]: (i) NameNode, (ii) DataNode, (iii) Secondary NameNode, (iv) JobTracker and (v) TaskTracker. The NameNode is the master node, on which the JobTracker runs, and the DataNode is the slave node, on which the TaskTracker runs. The JobTracker and TaskTracker are daemons, i.e. processes or services running in the background.
3.1 NameNode
It is the master of HDFS, which directs the slave DataNodes to perform the low-level I/O tasks. It keeps track of how files are broken down into blocks and which nodes store those blocks. The metadata is maintained in the main memory of the NameNode to ensure fast access for clients on read/write requests. The negative aspect of this central role is that the NameNode is a single point of failure of the Hadoop cluster.
3.2 DataNode
It is the slave node that performs the grunt work of the distributed file system, such as reading and writing HDFS blocks to actual files on the local file system. When a client wants to read or write an HDFS file, the file is broken into blocks and the NameNode, on request from the client, tells the client on which DataNode each block resides. The default block size is 64 MB. The client then communicates directly with the DataNodes to process the local files corresponding to the blocks. A DataNode may also communicate with other DataNodes to replicate its data blocks for redundancy using a pipeline process.
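To make this read path concrete, the following minimal sketch uses the HDFS Java API; the file path is hypothetical and the NameNode address is assumed to be supplied by the cluster's core-site.xml. The open() call obtains the block locations from the NameNode, and the returned stream then fetches the data directly from the DataNodes.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        // fs.defaultFS (the NameNode address) is read from core-site.xml on the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical file; open() asks the NameNode for the block locations,
        // and the returned stream reads each block directly from a DataNode.
        Path file = new Path("/user/demo/input.txt");
        try (FSDataInputStream in = fs.open(file)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
    }
}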
3.3 Secondary NameNode (SNN)
It is an assistant daemon that monitors the state of the NameNode and performs periodic checkpoints of its metadata. In case of a NameNode failure, the NameNode can be restarted using the latest checkpoint. Like the NameNode, each cluster has one SNN, which resides on its own machine. The Secondary NameNode is not a substitute for the NameNode.
3.4 JobTracker
The JobTracker is the daemon that runs on the master node, i.e. the NameNode. Once code is submitted to the cluster, the JobTracker determines which files to process, assigns nodes to different tasks, and monitors all tasks as they are running. If a task fails, the JobTracker automatically re-launches it, possibly on a different node, up to a predefined limit of retries. There is only one JobTracker daemon per Hadoop cluster, and it runs on a server acting as the master node of the cluster.
3.5 TaskTracker
The TaskTracker is the daemon that runs on the slave nodes, i.e. the DataNodes. It is responsible for executing the individual tasks that the JobTracker assigns. Although there is a single TaskTracker per slave node, each TaskTracker can spawn multiple Java Virtual Machines to handle many map or reduce tasks in parallel. The TaskTracker constantly communicates with the JobTracker. If the JobTracker does not receive a heartbeat from a TaskTracker within a specified amount of time, it assumes that the TaskTracker has crashed and resubmits the corresponding tasks to other nodes in the cluster.
4. HDFS FEATURES
The following are the features of Hadoop Distributed File System:
i. Big data: Big data is basically defined in terms of the 'three V's': volume of data (size), velocity of data (speed), and variety of data (type). It is essentially a large volume of unstructured, live data, and Apache Hadoop is an analytics tool that has been labelled 'big data'.
ii. Large: An HDFS instance may consist of thousands of server machines, each storing part of the file system's data.
iii. Replication: Each data block is replicated multiple times (the default replication factor is 3); see the sketch after this list.
iv. Failure: Failure is the norm rather than the exception.
v. Fault tolerance: Detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS. The NameNode constantly monitors the DataNodes.
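As an illustration of the replication feature above, the following sketch changes the replication factor of a single file through the HDFS Java API. The file path and the value of 5 are only illustrative assumptions; the cluster-wide default normally comes from the dfs.replication property.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        // The NameNode address is taken from core-site.xml on the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical file; setReplication asks the NameNode to raise or lower
        // the number of block replicas kept on the DataNodes for this file.
        Path file = new Path("/user/demo/important.dat");
        boolean accepted = fs.setReplication(file, (short) 5);
        System.out.println("Replication change accepted: " + accepted);
    }
}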
There are also certain disadvantages of HAR files. They are as follows:
i. Reading through files in a HAR is slower than reading files stored directly in HDFS, because each HAR file access requires two index file reads in addition to the data file read (see the access sketch after this list).
ii. Although HAR files can be used as input to MapReduce, there is no special mechanism that allows maps to operate over all the files in the HAR that reside on a single HDFS block.
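For illustration, the following is a rough sketch of reading one file out of a HAR through the har:// scheme of the Hadoop Java API; the archive path and file name are hypothetical. The extra cost mentioned in point (i) arises because the har filesystem consults the archive's _masterindex and _index files before reading the data itself.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HarReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Hypothetical archive and file name; the layout inside a HAR mirrors
        // the directory tree that was archived.
        Path fileInHar = new Path("har:///user/demo/files.har/part-0001.txt");

        // getFileSystem() resolves the har:// scheme to the archive filesystem, which
        // first reads the archive's index files to locate the entry and only then
        // reads the data from the underlying HDFS part file.
        FileSystem harFs = fileInHar.getFileSystem(conf);
        try (FSDataInputStream in = harFs.open(fileInHar)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
    }
}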
6.3 Consolidator[10]
It takes the records containing files belonging to the same logical file and merges those files together into larger files. Merging all the files into one large file is possible but not practical, since the result could be terabytes in size and would take a very long time to process. Hence the Consolidator has a "desired file size" parameter, with which the user can define the maximum size of a merged file. In this way the Consolidator balances its speed against the desired size of the output files. The "desired file size" can be set to a small multiple of the HDFS block size so that the input splits are larger, which optimizes for locality.
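The following is a simplified sketch of this idea (not the actual Consolidator of [10]): it concatenates the files of a directory into larger output files and rolls over to a new output file once an assumed "desired file size" is reached. The directory paths and the 256 MB target are hypothetical, and record boundaries are ignored for brevity.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class SimpleConsolidator {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path inputDir  = new Path("/data/small-files");   // hypothetical directory of small files
        Path outputDir = new Path("/data/merged");        // hypothetical output directory
        long desiredFileSize = 256L * 1024 * 1024;        // assumed "desired file size": 256 MB

        fs.mkdirs(outputDir);
        int part = 0;
        long written = 0;
        FSDataOutputStream out = fs.create(new Path(outputDir, "merged-" + part));

        for (FileStatus status : fs.listStatus(inputDir)) {
            if (!status.isFile()) {
                continue;                                  // skip sub-directories
            }
            // Roll over to a new merged file once the desired size has been reached
            if (written >= desiredFileSize) {
                out.close();
                part++;
                written = 0;
                out = fs.create(new Path(outputDir, "merged-" + part));
            }
            try (FSDataInputStream in = fs.open(status.getPath())) {
                IOUtils.copyBytes(in, out, 4096, false);   // append the small file's bytes
            }
            written += status.getLen();
        }
        out.close();
    }
}

Because each merged file is a small multiple of the block size, a MapReduce job over the merged data reads a few large, mostly local splits instead of thousands of tiny ones.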
6.4 HBase[10]
48
www.iiste.org
If you have lots of small files then, depending on the access pattern, a different type of storage might be more appropriate. HBase stores data in MapFiles (indexed SequenceFiles) and is a good choice if you need to do MapReduce-style streaming analyses with the occasional random lookup. If we are storing lots of small files, HBase provides a better interface for faster lookup of files.
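As an illustration, the following sketch stores and retrieves a small file as a single HBase cell, using the file name as the row key. The table name small_files and column family f are assumptions chosen for the example, not fixed by HBase.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class SmallFileStore {
    private static final byte[] FAMILY    = Bytes.toBytes("f");     // assumed column family
    private static final byte[] QUALIFIER = Bytes.toBytes("data");

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("small_files"))) { // assumed table

            // Write: the file name is the row key, the file content is a single cell
            byte[] content = "hello small file".getBytes();
            Put put = new Put(Bytes.toBytes("logs/2014/01/part-0001"));
            put.addColumn(FAMILY, QUALIFIER, content);
            table.put(put);

            // Read: a point lookup by file name, with no NameNode metadata involved
            Result result = table.get(new Get(Bytes.toBytes("logs/2014/01/part-0001")));
            byte[] fetched = result.getValue(FAMILY, QUALIFIER);
            System.out.println(new String(fetched));
        }
    }
}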
7. CONCLUSION
HDFS is designed to store large files and suffers a performance penalty when it has to store and analyse a large number of small files: too many small files increase the number of mappers, and the huge number of files causes high memory usage on the NameNode. The approaches given above address this small file problem; each of them is applicable in a different context and improves the efficiency of access to small files in HDFS.
References
[1]
[2]
[3]
[4]
[5]
[6]
[7]
[8]
[9]
[10]
[11]
[12]
[13]
[14]