Hadoop and Big Data Unit 2
FILE SYSTEM
A file system is used to control how data is stored and retrieved. Without a file system, information placed in a
storage area would be one large body of data with no way to tell where one piece of information stops and the
next begins. By separating the data into pieces and giving each piece a name, the information is easily isolated
and identified. Taking its name from the way paper-based information systems are named, each group of data
is called a "file". The structure and logic rules used to manage the groups of information is called a "file
system".
The file system manages access to both the content of files and the metadata about those files. It is responsible
for arranging storage space; reliability, efficiency, and tuning with regard to the physical storage medium are
important design considerations.
DISTRIBUTED FILE SYSTEM (DFS)
A distributed file system is a client/server-based application that allows clients to access and process data stored
on the server as if it were on their own computer. When a user accesses a file on the server, the server sends
the user a copy of the file, which is cached on the user's computer while the data is being processed and is then
returned to the server.
DFS is a mechanism for sharing files. Distributed file system (DFS) is used to make files distributed across
multiple servers appear to users as if they reside in one place on the network.
BENEFITS OF DFS
– Resource management and accessibility (users access all resources through a single point)
– Accessibility (users do not need to know the physical location of a shared folder)
– Fault tolerance (shares can be replicated)
– Workload management (administrators can distribute shared folders and workloads across several servers for more efficient use of network and server resources), among others.
2.1 GOOGLE FILE SYSTEM
GFS is a proprietary distributed file system developed by Google for its own use. It is designed to provide
efficient, reliable access to data using large clusters of commodity hardware. Commodity hardware, in an IT
context, is a device or device component that is relatively inexpensive, widely available and more or less
interchangeable with other hardware of its type.
A new version of the Google File System, codenamed Colossus, was released in 2010.
GFS was implemented specifically to meet the rapidly growing demands of Google’s data processing needs.
2.1.1 TRADITIONAL DESIGN ISSUES OF GFS
– Performance
– Scalability
– Reliability
– Availability
2.1.2 DIFFERENT POINTS IN GFS DESIGN
Most files are mutated by appending new data rather than overwriting existing data. As a result, appending becomes the focus of performance optimization and atomicity guarantees, while caching data blocks in the client loses its appeal.
2.1.3 GFS DESIGN
GFS is tailored to Google's core data storage and usage needs (primarily the search engine, which generates enormous amounts of data that must be retained).
2.1.4 INTERFACE
GFS provides a familiar file system interface, though it does not implement a standard API such as POSIX.
Files are organized hierarchically in directories and identified by pathnames. GFS supports the usual operations
to create, delete, open, close, read, and write files. Moreover, GFS has snapshot and record append operations.
Snapshot creates a copy of a file or a directory tree at low cost. Record append allows multiple clients to
append data to the same file concurrently while guaranteeing the atomicity of each individual client’s append.
It is useful for implementing multi-way merge results and producer-consumer queues that many clients can append to simultaneously without additional locking. Such files have proved invaluable in building large distributed applications.
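GFS itself is proprietary, so its snapshot and record-append calls cannot be demonstrated directly. As a loose parallel only, HDFS (discussed later in this unit) exposes a single-writer append from the command line; the paths below are purely illustrative:
$ hdfs dfs -appendToFile locallog.txt /logs/shared.log    # append local data to an existing HDFS file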
2.1.5 ARCHITECTURE
A GFS cluster consists of a single master and multiple chunkservers and is accessed by multiple clients, as
shown in Figure 1. Each of these is typically a commodity Linux machine running a user-level server process.
It is easy to run both a chunkserver and a client on the same machine, as long as machine resources permit and
the lower reliability caused by running possibly flaky application code is acceptable.
Files are divided into fixed-size chunks. Each chunk is identified by an immutable and globally unique 64 bit
chunk handle assigned by the master at the time of chunk creation. Chunk servers store chunks on local disks
as Linux files and read or write chunk data specified by a chunk handle and byte range. For reliability, each
chunk is replicated on multiple chunk servers. By default, three replicas will be stored, though users can
designate different replication levels for different regions of the file namespace. The master maintains all file
system metadata. This includes the namespace, access control information, the mapping from files to chunks,
and the current locations of chunks. It also controls system-wide activities such as chunk lease management,
garbage collection of orphaned chunks, and chunk migration between chunk servers.
The master periodically communicates with each chunk server in HeartBeat messages to give it instructions
and collect its state. GFS client code linked into each application implements the file system API and
communicates with the master and chunk servers to read or write data on behalf of the application. Clients
interact with the master for metadata operations, but all data-bearing communication goes directly to the
chunk servers. Neither the client nor the chunk server caches file data. Client caches offer little benefit because
most applications stream through huge files or have working sets too large to be cached. Not having them
simplifies the client and the overall system by eliminating cache coherence issues. (Clients do cache metadata,
however.) Chunk servers need not cache file data because chunks are stored as local files and so Linux’s
buffer cache already keeps frequently accessed data in memory.
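Because chunks have a fixed size, a client can translate a byte offset within a file into a chunk index (which it sends to the master) and an offset inside that chunk using simple arithmetic. A rough shell sketch, assuming the 64 MB chunk size GFS uses by default (the exact size is not stated above):
CHUNK_SIZE=$((64 * 1024 * 1024))   # assumed 64 MB fixed chunk size
OFFSET=200000000                   # example byte offset within a file
echo "chunk index: $((OFFSET / CHUNK_SIZE))"      # which chunk the master is asked about
echo "offset in chunk: $((OFFSET % CHUNK_SIZE))"  # where to read within that chunk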
NameNode
Hadoop employs a master/slave architecture for both distributed storage and distributed computation. The
distributed storage system is called the Hadoop Distributed File System, or HDFS. The NameNode is the master of
HDFS that directs the slave DataNode daemons to perform the low-level I/O tasks. The NameNode is the
bookkeeper of HDFS; it keeps track of how your files are broken down into file blocks, which nodes store
those blocks, and the overall health of the distributed filesystem.
The NameNode's work is memory and I/O intensive. To lower the workload on the machine, the server hosting the NameNode typically doesn't store any user data or perform any computations for a MapReduce program.
There is unfortunately a negative aspect to the importance of the NameNode—it’s a single point of failure of
your Hadoop cluster. For any of the other daemons, if their host nodes fail for software or hardware reasons,
the Hadoop cluster will likely continue to function smoothly or you can quickly restart it. Not so for the
NameNode.
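On a running cluster (such as the pseudo-distributed setup described later), the NameNode's view of the filesystem can be queried from the command line. A minimal sketch, assuming the hdfs binary is on the PATH:
$ hdfs dfsadmin -report    # capacity, live/dead DataNodes, and replication status as seen by the NameNode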
DataNode
Each slave machine in your cluster will host a DataNode daemon to perform the grunt work of the
distributed filesystem—reading and writing HDFS blocks to actual files on the local filesystem. When you
want to read or write an HDFS file, the file is broken into blocks and the NameNode will tell your client which
DataNodes each block resides on. Your client communicates directly with the DataNode daemons to process the
local files corresponding to the blocks. Furthermore, a DataNode may communicate with other DataNodes to
replicate its data blocks for redundancy.
The following figure illustrates the roles of the NameNode and DataNodes. In this figure, we show two data
files, one at /user/chuck/data1 and another at /user/james/data2. The data1 file takes up three blocks, which we
denote 1, 2, and 3, and the data2 file consists of blocks 4 and 5. The content of the files are distributed among
the DataNodes. In this illustration, each block has three replicas. For example, block 1 (used for data1) is
replicated over the three rightmost DataNodes. This ensures that if any one DataNode crashes or becomes
inaccessible over the network, you’ll still be able to read the files.
DataNodes are constantly reporting to the NameNode. Upon initialization, each of the DataNodes informs the
NameNode of the blocks it’s currently storing. After this mapping is complete, the DataNodes continually poll
the NameNode to provide information regarding local changes as well as receive instructions to create, move,
or delete blocks from the local disk.
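To see the block-to-DataNode mapping that the NameNode maintains for a particular file, the fsck tool can be used. A sketch using the hypothetical /user/chuck/data1 path from the figure above:
$ hdfs fsck /user/chuck/data1 -files -blocks -locations    # lists each block and the DataNodes holding its replicas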
Secondary NameNode
The Secondary NameNode (SNN) is an assistant daemon for monitoring the state of HDFS in the cluster. Like
the NameNode, each cluster has one SNN, and it typically resides on its own machine as well. No other
DataNode or TaskTracker daemons run on the same server. The SNN differs from the NameNode in that this
process doesn’t receive or record any real-time changes to HDFS. Instead, it communicates with the
NameNode to take snapshots of the HDFS metadata at intervals defined by the cluster configuration.
The NameNode is a single point of failure for a Hadoop cluster, and the SNN snapshots help minimize the
downtime and loss of data. Nevertheless, a NameNode failure requires human intervention to reconfigure the
cluster to use the SNN as the primary NameNode.
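The checkpoint interval mentioned above is set in the cluster configuration; the property name varies across Hadoop versions (dfs.namenode.checkpoint.period in Hadoop 2.x, fs.checkpoint.period in older releases). As a hedged example, the effective value can be inspected with:
$ hdfs getconf -confKey dfs.namenode.checkpoint.period    # checkpoint interval in seconds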
JobTracker
The JobTracker daemon is the liaison between your application and Hadoop. Once you submit your code to
your cluster, the JobTracker determines the execution plan by determining which files to process, assigns
nodes to different tasks, and monitors all tasks as they’re running. Should a task fail, the JobTracker will
automatically relaunch the task, possibly on a different node, up to a predefined limit of retries. There is only
one JobTracker daemon per Hadoop cluster. It’s typically run on a server as a master node of the cluster.
TaskTracker
The JobTracker is the master for the overall execution of a MapReduce job, and the TaskTrackers manage the
execution of individual tasks on each slave node. The following figure shows this interaction.
Each TaskTracker is responsible for executing the individual tasks that the JobTracker assigns. Although there
is a single TaskTracker per slave node, each TaskTracker can spawn multiple JVMs to handle many map or
reduce tasks in parallel. One responsibility of the TaskTracker is to constantly communicate with the
JobTracker. If the JobTracker fails to receive a heartbeat from a TaskTracker within a specified amount of
time, it will assume the TaskTracker has crashed and will resubmit the corresponding tasks to other nodes in
the cluster.
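Once the daemons are up, the JDK's jps tool lists them, which is a quick way to confirm that the master and slave processes described above are running. Sample output (process IDs are illustrative; a classic Hadoop 1.x cluster would show JobTracker and TaskTracker, while the YARN configuration used later replaces them with ResourceManager and NodeManager):
$ jps
2481 NameNode
2615 DataNode
2769 SecondaryNameNode
2901 ResourceManager
3034 NodeManager
3187 Jps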
Pseudo-distributed mode
The pseudo-distributed mode is running Hadoop in a “cluster of one” with all daemons running on a single
machine. This mode complements the standalone mode for debugging your code, allowing you to examine
memory usage, HDFS input/output issues, and other daemon interactions. To work in pseudo-distributed mode, in addition to the steps described above, the four XML configuration files (core-site.xml, hdfs-site.xml, mapred-site.xml and yarn-site.xml) must be updated as shown below.
core-site.xml
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
</configuration>
hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
mapred-site.xml
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
</property>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
yarn-site.xml
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
In core-site.xml and mapred-site.xml we specify the hostname and port of the NameNode and the JobTracker,
respectively. In hdfs-site.xml we specify the default replication factor for HDFS, which should only be one
because we’re running on only one node. The yarn-site.xml file configures YARN, which manages resources and job execution in Hadoop 2.
The hadoop-env.sh file contains other variables for defining your Hadoop environment. To set the environment variables, add the following lines at the end of hadoop-env.sh:
export JAVA_HOME=/home/user/Downloads/jdk1.8.0_77
export HADOOP_HOME=/home/user/Downloads/hadoop
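Before running jobs in pseudo-distributed mode, the NameNode must be formatted (only once, when the cluster is first set up) and the daemons started. A minimal sketch, assuming Hadoop's bin and sbin directories are on the PATH:
$ hdfs namenode -format    # initialize the HDFS metadata (first-time setup only)
$ start-dfs.sh             # start NameNode, DataNode and Secondary NameNode
$ start-yarn.sh            # start ResourceManager and NodeManager
$ jps                      # verify that the daemons are running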
Running a simple example in pseudo-distributed and fully distributed mode
1. Create a new directory on HDFS
$ hdfs dfs -mkdir /in
2. Copy the files from the local directory to the HDFS directory.
$ hdfs dfs -put /home/user/in/file /in/file
3. To execute a program (here, the bundled wordcount example).
$ hadoop jar /home/user/Downloads/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-
2.7.2.jar wordcount /in /out
4. To copy the output file from the HDFS directory to the local file system.
$ hadoop fs -get /out /home/user/out
5. To view the output, use the following command.
$ cat /home/user/out/*
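The output can also be inspected directly on HDFS without copying it back; the reducers typically write part files under the output directory:
$ hdfs dfs -cat /out/part-r-00000    # or: hdfs dfs -cat /out/*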