Hadoop & BigData (UNIT - 2)

UNIT – 2 : WORKING WITH BIG DATA
Working with Big Data: Google File System, Hadoop Distributed File System (HDFS) –
Building blocks of Hadoop (Namenode, Datanode, Secondary Namenode, JobTracker,
TaskTracker), Introducing and Configuring Hadoop cluster (Local, Pseudo-distributed mode,
Fully Distributed mode), Configuring XML files.
INTRODUCTION
We live in the data age. It‘s not easy to measure the total volume of data stored electronically,
but an IDC estimate put the size of the ―digital universe‖ at 0.18 zettabytes in 2006, and is
forecasting a tenfold growth by 2011 to 1.8 zettabytes. A zettabyte is 10^21 bytes, or
equivalently one thousand exabytes, one million petabytes, or one billion terabytes. That‘s
roughly the same order of magnitude as one disk drive for every person in the world. This flood
of data is coming from many sources.
Consider the following:
• The New York Stock Exchange generates about one terabyte of new trade data per day.
• Facebook hosts approximately 10 billion photos, taking up one petabyte of storage.
The volume of data being made publicly available increases every year, too. Organizations no
longer have to merely manage their own data: success in the future will be dictated to a large
extent by their ability to extract value from other organizations‘ data. Initiatives such as Public
Data Sets on Amazon Web Services, Infochimps.org, and theinfo.org exist to foster the
―information commons,‖ where data can be freely (or in the case of AWS, for a modest price)
shared for anyone to download and analyze.
The problem is simple: while the storage capacities of hard drives have increased massively over
the years, access speeds—the rate at which data can be read from drives—have not kept up. One
typical drive from 1990 could store 1,370 MB of data and had a transfer speed of 4.4 MB/s,4 so
you could read all the data from a full drive in around five minutes. Over 20 years later, one
terabyte drives are the norm, but the transfer speed is around 100 MB/s, so it takes more than two
and a half hours to read all the data off the disk.
This is a long time to read all data on a single drive—and writing is even slower. The obvious
way to reduce the time is to read from multiple disks at once. Imagine if we had 100 drives, each
holding one hundredth of the data. Working in parallel, we could read the data in under two
minutes.
Problems of Data Storage:
The first problem to solve is hardware failure: as soon as you start using many pieces of
hardware, the chance that one will fail is fairly high. A common way of avoiding data loss is
through replication: redundant copies of the data are kept by the system so that in the event of
failure, there is another copy available. This is how RAID works, for instance, although
Hadoop‘s filesystem, the Hadoop Distributed Filesystem (HDFS), takes a slightly different
approach.
The second problem is that most analysis tasks need to be able to combine the data in some way;
data read from one disk may need to be combined with the data from any of the other 99 disks.
Various distributed systems allow data to be combined from multiple sources, but doing this
correctly is notoriously challenging. MapReduce provides a programming model that abstracts
the problem from disk reads and writes, transforming it into a computation over sets of keys and
values.
This, in a nutshell, is what Hadoop provides: a reliable shared storage and analysis system. The
storage is provided by HDFS and analysis by MapReduce. There are other parts to Hadoop, but
these capabilities are its kernel.
Big Data
Big Data means really a big data, it is a collection of large data sets that cannot be processed
using traditional computing techniques. Big data is not merely a data, rather it has become a
complete subject, which involves various tools, techniques and frameworks. Big data involves
the data produced by different devices and applications. Given below are some of the fields that
come under the umbrella of Big Data.
 Black Box Data : It is a component of helicopter, airplanes, and jets, etc. It captures
voices of the flight crew, recordings of microphones and earphones, and the performance
information of the aircraft.
 Social Media Data : Social media such as Facebook and Twitter hold information and
the views posted by millions of people across the globe.
 Stock Exchange Data : The stock exchange data holds information about the ‗buy‘ and
‗sell‘ decisions made on a share of different companies made by the customers.
 Power Grid Data : The power grid data holds information consumed by a particular
node with respect to a base station.
 Transport Data : Transport data includes model, capacity, distance and availability of a
vehicle.
 Search Engine Data : Search engines retrieve lots of data from different databases.
Thus Big Data includes huge volume, high velocity, and extensible variety of data. The data in
it will be of three types.
 Structured data : Relational data.

 Semi Structured data : XML data.
 Unstructured data : Word, PDF, Text, Media Logs.
Benefits of Big Data
Big data is really critical to our life and its emerging as one of the most important technologies in
modern world. Follow are just few benefits which are very much known to all of us:
 Using the information kept in the social network like Facebook, the marketing agencies
are learning about the response for their campaigns, promotions, and other advertising
mediums.
 Using the information in the social media like preferences and product perception of their
consumers, product companies and retail organizations are planning their production.
 Using the data regarding the previous medical history of patients, hospitals are providing
better and quick service.
Big Data Technologies
 Big data technologies are important in providing more accurate analysis, which may lead
to more concrete decision-making resulting in greater operational efficiencies, cost
reductions, and reduced risks for the business.
 To harness the power of big data, you would require an infrastructure that can manage
and process huge volumes of structured and unstructured data in realtime and can protect
data privacy and security.
 There are various technologies in the market from different vendors including Amazon,
IBM, Microsoft, etc., to handle big data.
Big Data Challenges
The major challenges associated with big data are as follows:
 Capturing data
 Curation
 Storage
 Searching
 Sharing
 Transfer
 Analysis
 Presentation
To fulfill the above challenges, organizations normally take the help of enterprise servers.
Traditional Approach
In this approach, an enterprise will have a computer to store and process big data. Here data will
be stored in an RDBMS like Oracle Database, MS SQL Server or DB2 and sophisticated
softwares can be written to interact with the database, process the required data and present it to
the users for analysis purpose.
Limitation
This approach works well where we have less volume of data that can be accommodated by
standard database servers, or up to the limit of the processor which is processing the data. But
when it comes to dealing with huge amounts of data, it is really a tedious task to process such
data through a traditional database server.
Google’s Solution
The above diagram shows various commodity hardwares which could be single CPU machines
or servers with higher capacity.
Google solved this problem using an algorithm called MapReduce. This algorithm divides the
task into small parts and assigns those parts to many computers connected over the network, and
collects the results to form the final result dataset.
Hadoop
Doug Cutting, Mike Cafarella and team took the solution provided by Google and started an
Open Source Project called HADOOP in 2005 and Doug named it after his son's toy elephant.
Now Apache Hadoop is a registered trademark of the Apache Software Foundation.
Hadoop runs applications using the MapReduce algorithm, where the data is processed in
parallel on different CPU nodes. In short, Hadoop framework is capabale enough to develop
applications capable of running on clusters of computers and they could perform complete
statistical analysis for a huge amounts of data.
Hadoop is an Apache open source framework written in java that allows distributed processing
of large datasets across clusters of computers using simple programming models. A Hadoop
frame-worked application works in an environment that provides distributed storage and
computation across clusters of computers. Hadoop is designed to scale up from single server to
thousands of machines, each offering local computation and storage.
Today, we‘re surrounded by data. People upload videos, take pictures on their cell phones, text
friends, update their Facebook status, leave comments around the web, click on ads, and so forth.
Machines, too, are generating and keeping more and more data. You may even be reading this
book as digital data on your computer screen, and certainly your purchase of this book is
recorded as data with some retailer. The exponential growth of data first presented challenges to
cutting-edge businesses such as Google, Yahoo, Amazon, and Microsoft. They needed to go
through terabytes and petabytes of data to figure out which websites were popular, what books
were in demand, and what kinds of ads appealed to people. Existing tools were becoming
inadequate to process such large data sets.
Google was the first to publicize MapReduce—a system they had used to scale their data
processing needs. This system aroused a lot of interest because many other businesses were
facing similar scaling challenges, and it wasn‘t feasible for everyone to reinvent their own
proprietary tool.
Doug Cutting saw an opportunity and led the charge to develop an open source version of this
MapReduce system called Hadoop . Soon after, Yahoo and others rallied around to support this
effort. Today, Hadoop is a core part of the computing infrastructure for many web companies,
such as Yahoo , Facebook , LinkedIn , and Twitter . Many more traditional businesses, such as
media and telecom, are beginning to adopt this system too.
Hadoop, and large-scale distributed data processing in general, is rapidly becoming an important
skill set for many programmers. An effective programmer, today, must have knowledge of
relational databases, networking, and security, all of which were considered optional skills a
couple decades ago. Similarly, basic understanding of distributed data processing will soon
become an essential part of every programmer‘s toolbox. Leading universities, such as Stanford
and CMU, have already started introducing Hadoop into their computer science curriculum.
Formally speaking, Hadoop is an open source framework for writing and running distributed
applications that process large amounts of data. Distributed computing is a wide and varied field,
but the key distinctions of Hadoop are that it is:
■ Accessible—Hadoop runs on large clusters of commodity machines or on cloud computing
services such as Amazon‘s Elastic Compute Cloud (EC2 ).
■ Robust—Because it is intended to run on commodity hardware, Hadoop is architected
with the assumption of frequent hardware malfunctions. It can gracefully handle most such
failures.
■ Scalable—Hadoop scales linearly to handle larger data by adding more nodes to
the cluster.
■ Simple—Hadoop allows users to quickly write efficient parallel code.
Hadoop Architecture
Hadoop framework includes following four modules:
 Hadoop Common: These are Java libraries and utilities required by other Hadoop
modules. These libraries provides filesystem and OS level abstractions and contains the
necessary Java files and scripts required to start Hadoop.
 Hadoop YARN: This is a framework for job scheduling and cluster resource
management.
 Hadoop Distributed File System (HDFS™): A distributed file system that provides
high-throughput access to application data.
 Hadoop MapReduce: This is YARN-based system for parallel processing of large data
sets.
We can use following diagram to depict these four components available in Hadoop framework.
Since 2012, the term "Hadoop" often refers not just to the base modules mentioned above but
also to the collection of additional software packages that can be installed on top of or alongside
Hadoop, such as Apache Pig, Apache Hive, Apache HBase, Apache Spark etc.
Topic – 1: Google File System (GFS)
Google File System (GFS) is a scalable distributed file system (DFS) created by Google Inc. and
developed to accommodate Google‘s expanding data processing requirements. GFS provides
fault tolerance, reliability, scalability, availability and performance to large networks and
connected nodes. GFS is made up of several storage systems built from low-cost commodity
hardware components. It is optimized to accommodate Google's different data use and storage
needs, such as its search engine, which generates huge amounts of data that must be stored.
The Google File System capitalized on the strength of off-the-shelf servers while minimizing
hardware weaknesses.GFS is also known as GoogleFS.
GFS is enhanced for Google's core data storage and usage needs (primarily the search engine),
which can generate enormous amounts of data that needs to be retained; Google File System
grew out of an earlier Google effort, "Big Files", developed by Larry Page and Sergey Brin in the
early days of Google, while it was still located in Stanford. Files are divided into fixed-size
chunks of 64 megabytes, similar to clusters or sectors in regular file systems, which are only
extremely rarely overwritten, or shrunk; files are usually appended to or read. It is also designed
and optimized to run on Google's computing clusters, dense nodes which consist of cheap
"commodity" computers, which means precautions must be taken against the high failure rate of
individual nodes and the subsequent data loss.
A GFS cluster consists of multiple nodes. These nodes are divided into two types: one Master
node and a large number of Chunk servers. Each file is divided into fixed-size chunks. Chunk
servers store these chunks. Each chunk is assigned a unique 64-bit label by the master node at the
time of creation, and logical mappings of files to constituent chunks are maintained. Each chunk
is replicated several times throughout the network, with the minimum being three, but even more
for files that have high end-in demand or need more redundancy.
The Master server does not usually store the actual chunks, but rather all the metadata associated
with the chunks, such as the tables mapping the 64-bit labels to chunk locations and the files they
make up, the locations of the copies of the chunks, what processes are reading or writing to a
particular chunk, or taking a "snapshot" of the chunk pursuant to replicate it (usually at the
instigation of the Master server, when, due to node failures, the number of copies of a chunk has
fallen beneath the set number). All this metadata is kept current by the Master server periodically
receiving updates from each chunk server ("Heart-beat messages").
Permissions for modifications are handled by a system of time-limited, expiring "leases", where
the Master server grants permission to a process for a finite period of time during which no other
process will be granted permission by the Master server to modify the chunk. The modifying
chunk server, which is always the primary chunk holder, then propagates the changes to the
chunk servers with the backup copies. The changes are not saved until all chunk servers
acknowledge, thus guaranteeing the completion and atomicity of the operation.
Programs access the chunks by first querying the Master server for the locations of the desired
chunks; if the chunks are not being operated on (i.e. no outstanding leases exist), the Master
replies with the locations, and the program then contacts and receives the data from the chunk
server directly (similar to Kazaa and its supernodes).
Unlike most other file systems, GFS is not implemented in the kernel of an operating system, but
is instead provided as a user space library.
The GFS node cluster is a single master with multiple chunk servers that are continuously
accessed by different client systems. Chunk servers store data as Linux files on local disks.
Stored data is divided into large chunks (64 MB), which are replicated in the network a
minimum of three times. The large chunk size reduces network overhead.
GFS is designed to accommodate Google‘s large cluster requirements without burdening
applications. Files are stored in hierarchical directories identified by path names. Metadata - such
as namespace, access control data, and mapping information - is controlled by the master, which
interacts with and monitors the status updates of each chunk server through timed heartbeat
messages.
GFS features include:
 Fault tolerance
 Critical data replication
 Automatic and efficient data recovery
 High aggregate throughput
 Reduced client and master interaction because of large chunk server size
 Namespace management and locking
 High availability
The largest GFS clusters have more than 1,000 nodes with 300 TB disk storage capacity. This
can be accessed by hundreds of clients on a continuous basis.
Topic – 2: Hadoop Distributed File System (HDFS)

When a dataset outgrows the storage capacity of a single physical machine, it becomes necessary
to partition it across a number of separate machines. Filesystems that manage the storage across a
network of machines are called distributed filesystems. Since they are network-based, all the
complications of network programming kick in, thus making distributed filesystems more
complex than regular disk filesystems. For example, one of the biggest challenges is making the
filesystem tolerate node failure without suffering data loss. Hadoop comes with a distributed
filesystem called HDFS, which stands for Hadoop Distributed Filesystem.
Hadoop can work directly with any mountable distributed file system such as Local FS, HFTP
FS, S3 FS, and others, but the most common file system used by Hadoop is the Hadoop
Distributed File System (HDFS).
The Hadoop Distributed File System (HDFS) is based on the Google File System (GFS) and
provides a distributed file system that is designed to run on large clusters (thousands of
computers) of small computer machines in a reliable, fault-tolerant manner.
HDFS uses a master/slave architecture where master consists of a single NameNode that
manages the file system metadata and one or more slave DataNodes that store the actual data.
A file in an HDFS namespace is split into several blocks and those blocks are stored in a set of
DataNodes. The NameNode determines the mapping of blocks to the DataNodes. The
DataNodes takes care of read and write operation with the file system. They also take care of
block creation, deletion and replication based on instruction given by NameNode.
HDFS provides a shell like any other file system and a list of commands are available to interact
with the file system.
Advantages of HDFS
 Hadoop framework allows the user to quickly write and test distributed systems. It is
efficient, and it automatic distributes the data and work across the machines and in turn,
utilizes the underlying parallelism of the CPU cores.
 Hadoop does not rely on hardware to provide fault-tolerance and high availability
(FTHA), rather Hadoop library itself has been designed to detect and handle failures at
the application layer.
 Servers can be added or removed from the cluster dynamically and Hadoop continues to
operate without interruption.
 Another big advantage of Hadoop is that apart from being open source, it is compatible
on all the platforms since it is Java based.
Hadoop File System was developed using distributed file system design. It is run on commodity
hardware. Unlike other distributed systems, HDFS is highly faulttolerant and designed using
low-cost hardware.
HDFS holds very large amount of data and provides easier access. To store such huge data, the
files are stored across multiple machines. These files are stored in redundant fashion to rescue
the system from possible data losses in case of failure. HDFS also makes applications available
to parallel processing.
Features of HDFS
 It is suitable for the distributed storage and processing.

 Hadoop provides a command interface to interact with HDFS.
 The built-in servers of NameNode and DataNode help users to easily check the status of
cluster.
 Streaming access to file system data.
 HDFS provides file permissions and authentication.
Goals of HDFS
 Fault detection and recovery : Since HDFS includes a large number of commodity
hardware, failure of components is frequent. Therefore HDFS should have mechanisms
for quick and automatic fault detection and recovery.
 Huge datasets : HDFS should have hundreds of nodes per cluster to manage the
applications having huge datasets.
 Hardware at data : A requested task can be done efficiently, when the computation
takes place near the data. Especially where huge datasets are involved, it reduces the
network traffic and increases the throughput.
HDFS Architecture:
HDFS is responsible for storing data on the cluster in Hadoop. Files in HDFS are split into
blocks before they are stored on cluster of size 64MB or 128MB. On a fully configured cluster,
―running Hadoop‖ means running a set of daemons, or resident programs, on the different
servers in your network. Programs which reside permanently in memory are called ―Resident
Programs‖. Daemon is a thread in Java, which runs in background and mostly created by JVM
for performing background task like Garbage collection. Each daemon runs separately in its own
JVM. These daemons have specific roles; some exist only on one server, some exist across
multiple servers. The daemons include :
 NameNode
 DataNode
 Secondary NameNode
 JobTracker
 TaskTracker
The above daemons are called as ―Building Blocks of Hadoop‖.
1. NameNode:
Let‘s begin with arguably the most vital of the Hadoop daemons—the NameNode . Hadoop
employs a master/slave architecture for both distributed storage and distributed computation. The
distributed storage system is called the Hadoop File System , or HDFS. The NameNode is the
master of HDFS that directs the slave DataNode daemons to perform the low-level I/O tasks. The
NameNode is the bookkeeper of HDFS; it keeps track of how your files are broken down into
file blocks, which nodes store those blocks, and the overall health of the distributed filesystem.
The function of the NameNode is memory and I/O intensive. As such, the server hosting the
NameNode typically doesn‘t store any user data or perform any computations for a MapReduce
program to lower the workload on the machine. This means that the NameNode server doesn‘t
double as a DataNode or a TaskTracker.
There is unfortunately a negative aspect to the importance of the NameNode—it‘s a single point
of failure of your Hadoop cluster. For any of the other daemons, if their host nodes fail for
software or hardware reasons, the Hadoop cluster will likely continue to function smoothly or
you can quickly restart it. Not so for the NameNode.
2. DataNode:
Each slave machine in your cluster will host a DataNode daemon to perform the grunt (thankless
and menial) work of the distributed filesystem—reading and writing HDFS blocks to actual files
on the local filesystem. When you want to read or write a HDFS file, the file is broken into
blocks and the NameNode will tell your client which DataNode each block resides in. Your
client communicates directly with the DataNode daemons to process the local files
corresponding to the blocks. Furthermore, a DataNode may communicate with other DataNodes
to replicate its data blocks for redundancy.
NameNode
File metadata:
/user/chuck/data1 -> 1,2,3
/user/james/data2 -> 4,5
DataNode
3 3 5 3 1 4
5 4 5 2
2 1 4 1 2
Figure : NameNode /DataNode interaction in HDFS. The NameNode keeps track of the file metadata—which files are
in the system and how each file is broken down into blocks. The DataNodes provide backup store of the blocks
and constantly report to the NameNode to keep the metadata current.
The above figure illustrates the roles of the NameNode and DataNodes. In this figure, we show
two data files, one at /user/chuck/data1 and another at /user/james/data2. The data1 file takes up
three blocks, which we denote 1, 2, and 3, and the data2 file consists of blocks 4 and 5. The
content of the files are distributed among the DataNodes.
In this illustration, each block has three replicas. For example, block 1 (used for data1) is
replicated over the three rightmost DataNodes. This ensures that if any one DataNode crashes or
becomes inaccessible over the network, you‘ll still be able to read the files. DataNodes are
constantly reporting to the NameNode. Upon initialization, each of the DataNodes informs the
NameNode of the blocks it‘s currently storing. After this mapping is complete, the DataNodes
continually poll the NameNode to provide information regarding local changes as well as receive
instructions to create, move, or delete blocks from the local disk.
3. Secondary NameNode:
The Secondary NameNode (SNN) is an assistant daemon for monitoring the state of the cluster
HDFS. Like the NameNode, each cluster has one SNN, and it typically resides on its own
machine as well. No other DataNode or TaskTracker daemons run on the same server. The SNN
differs from the NameNode in that this process doesn‘t receive or record any real-time changes
to HDFS. Instead, it communicates with the NameNode to take snapshots of the HDFS metadata
at intervals defined by the cluster configuration.
As mentioned earlier, the NameNode is a single point of failure for a Hadoop cluster, and the
SNN snapshots help minimize the downtime and loss of data. Nevertheless, a NameNode failure
requires human intervention to reconfigure the cluster to use the SNN as the primary NameNode.
4. JobTracker:
The JobTracker daemon is the liaison (communication/cooperation which facilitates a close
working) between your application and Hadoop. Once you submit your code to your cluster, the
JobTracker determines the execution plan by determining which files to process, assigns nodes to
different tasks, and monitors all tasks as they‘re running. Should a task fail, the JobTracker will
automatically relaunch the task, possibly on a different node, up to a predefined limit of retries.
There is only one JobTracker daemon per Hadoop cluster. It‘s typically run on a server as a
master node of the cluster.
5. TaskTracker:
As with the storage daemons, the computing daemons also follow a master/slave architecture:
the JobTracker is the master overseeing the overall execution of a MapReduce job and the
TaskTracker manage the execution of individual tasks on each slave node.
Client
JobTracker
TaskTracker TaskTracker TaskTracker TaskTracker
Map Map Map Map
Reduce Reduce Reduce Reduce
Figure: JobTracker and TaskTracker interaction. After a client calls the JobTracker to begin a data processing job,
the JobTracker partitions the work and assigns different map and reduce tasks to each TaskTracker in the cluster.
The above figure illustrates this interaction. Each TaskTracker is responsible for executing the
individual tasks that the JobTracker assigns. Although there is a single TaskTracker per slave
node, each TaskTracker can spawn multiple JVMs to handle many map or reduce tasks in
parallel. One responsibility of the TaskTracker is to constantly communicate with the
JobTracker. If the JobTracker fails to receive a heartbeat from a TaskTracker within a specified
amount of time, it will assume the TaskTracker has crashed and will resubmit the corresponding
tasks to other nodes in the cluster.
Having covered each of the Hadoop daemons, we depict the topology of one typical Hadoop
cluster in the following figure:
Figure
This topology features a master node running the NameNode and JobTracker daemons and a
standalone node with the SNN in case the master node fails. For small clusters, the SNN can
reside on one of the slave nodes. On the other hand, for large clusters, separate the NameNode
and JobTracker on two machines. The slave machines each host a DataNode and TaskTracker,
for running tasks on the same node where their data is stored.
Secondary NameNode
DataNode TaskTracker
NameNode DataNode TaskTracker

JobTracker
Figure : Topology of a typical Hadoop cluster . It’s a master/slave architecture

in which the NameNode and JobTracker are masters and the DataNodes and
TaskTrackers are slaves.
Topic – 3: Introducing and Configuring Hadoop Cluster

When setting up a Hadoop cluster , you‘ll need to designate one specific node as the master
node. As shown in figure 2.3, this server will typically host the NameNode and JobTracker
daemons. It‘ll also serve as the base station contacting and activating the DataNode and
TaskTracker daemons on all of the slave nodes. As such, we need to define a means for the
master node to remotely access every node in your cluster.
Hadoop uses passphrase less SSH for this purpose. SSH utilizes standard public key
cryptography to create a pair of keys for user verification—one public, one private. The public
key is stored locally on every node in the cluster, and the master node sends the private key when
attempting to access a remote machine. With both pieces of information, the target machine can
validate the login attempt.
Define a common account:
We‘ve been speaking in general terms of one node accessing another; more precisely this access
is from a user account on one node to another user account on the target machine. For Hadoop,
the accounts should have the same username on all of the nodes (we use hadoop-user in this
book), and for security purpose we recommend it being a user-level account. This account is
only for managing your Hadoop cluster. Once the cluster daemons are up and running, you‘ll be
able to run your actual MapReduce jobs from other accounts.
Verify SSH installation
The first step is to check whether SSH is installed on your nodes. We can easily do this by use of
the "which" UNIX command:
[hadoop-user@master]$ which ssh
/usr/bin/ssh
[hadoop-user@master]$ which sshd
/usr/bin/sshd
[hadoop-user@master]$ which ssh-keygen
/usr/bin/ssh-keygen
If you instead receive an error message such as this,
/usr/bin/which: no ssh in (/usr/bin:/bin:/usr/sbin...
Install OpenSSH (www.openssh.com) via a Linux package manager or by downloading the

source directly.
Operational modes of Hadoop:
 Local (Standalone) mode :
The standalone mode is the default mode for Hadoop. When you first uncompress the Hadoop
source package, it‘s ignorant of your hardware setup. Hadoop chooses to be conservative and
assumes a minimal configuration. All three XML files (or hadoopsite.xml before version 0.20)
are empty under this default mode:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
</configuration>
With empty configuration files, Hadoop will run completely on the local machine. Because
there‘s no need to communicate with other nodes, the standalone mode doesn‘t use HDFS, nor
will it launch any of the Hadoop daemons. Its primary use is for developing and debugging the
application logic of a MapReduce program without the additional complexity of interacting with
the daemons.
Properties:
(i) Default mode of Hadoop.
(ii) HDFS is not utilized in this mode.
(iii) Local File System is used for input and output.
(iv) Used for debugging purpose.
(v) Standalone is much faster than Pseudo-Distributed mode.
 Pseudo-Distributed mode :
The pseudo-distributed mode is running Hadoop in a ―cluster of one‖ with all daemons running
on a single machine. This mode complements the standalone mode for debugging your code,
allowing you to examine memory usage, HDFS input/output issues, and other daemon
interactions. The below listing provides simple XML files to configure a single server in this
mode.
core-site.xml
. <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation.
</description>
</property>
</configuration>
mapred-site.xml
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
<description>The host and port that the MapReduce job tracker runs
at.</description>
</property>
</configuration>
hdfs-site.xml

<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
<description>The actual number of replications can be specified when
the file is created.</description>
</property>
</configuration>
Properties:
(i) Configuration is required in given three files for this node.
(ii) Here one node is used as Master/Data/Job Tracker/Task Tracker node.
(iii) This is used for real code to test in HDFS.
(iv) This is a cluster, where all daemons are running on one node itself.
Fully-Distributed mode :
After continually emphasizing the benefits of distributed storage and distributed computation,
it‘s time for us to set up a full cluster. In the discussion below we‘ll use the following server
names:
■ master—The master node of the cluster and host of the NameNode and Job- Tracker daemons
■ backup—The server that hosts the Secondary NameNode daemon
■ hadoop1, hadoop2, hadoop3, ...—The slave boxes of the cluster running both DataNode and
TaskTracker daemons
Using the preceding naming convention, the below listing is a modified version of the pseudo-
distributed configuration files that can be used as a skeleton for your cluster‘s setup.
core-site.xml
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://master:9000</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation.
</description>
</property>
</configuration>
mapred-site.xml
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>master:9001</value>
<description>The host and port that the MapReduce job tracker runs
at.</description>
</property>
</configuration>
hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>3</value>
<description>The actual number of replications can be specified when
the file is created.</description>
</property>
</configuration>
The key differences are
■ We explicitly stated the hostname for location of the NameNode and JobTracker daemons.
■ We increased the HDFS replication factor to take advantage of distributed storage. Recall that
data is replicated across HDFS to increase availability and reliability.
Properties:
(i) This is a production phase.

(ii) Data are used and distributed across many nodes.
(iii) Different nodes will be used as Master node/Data node..etc
Data Measurement Chart:
Data Measurement Data Size
Bit 1 or 0 (Single Binary Digit)
Byte 8 Bits
Kilobyte (KB) 1024 Bytes
Megabyte (MB) 1024 KB
Gigabyte (GB) 1024 MB
Terabyte (TB) 1024 GB
Petabyte (PB) 1024 TB
Exabyte (EB) 1024 PB

Hadoop & BigData (UNIT - 2)

Uploaded by

Document Informationclick to expand document information

Document Informationclick to expand document information

Copyright:

Available Formats

Hadoop & BigData (UNIT - 2)

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Hadoop & BigData (UNIT - 2)

Uploaded by

Copyright:

Available Formats

UNIT – 2 : WORKING WITH BIG DATA

 Structured data : Relational data.

Benefits of Big Data

Big Data Technologies

Big Data Challenges

The major challenges associated with big data are as follows:

Topic – 2: Hadoop Distributed File System (HDFS)

 It is suitable for the distributed storage and processing.

TaskTracker TaskTracker TaskTracker TaskTracker

Map Map Map Map

Reduce Reduce Reduce Reduce

NameNode DataNode TaskTracker

Figure : Topology of a typical Hadoop cluster . It’s a master/slave architecture

Topic – 3: Introducing and Configuring Hadoop Cluster

[hadoop-user@master]$ which sshd

[hadoop-user@master]$ which ssh-keygen

If you instead receive an error message such as this,

/usr/bin/which: no ssh in (/usr/bin:/bin:/usr/sbin...

Install OpenSSH (www.openssh.com) via a Linux package manager or by downloading the

 Local (Standalone) mode :

(i) Default mode of Hadoop.

(ii) HDFS is not utilized in this mode.

(iii) Local File System is used for input and output.

(iv) Used for debugging purpose.

(v) Standalone is much faster than Pseudo-Distributed mode.

(i) Configuration is required in given three files for this node.

(ii) Here one node is used as Master/Data/Job Tracker/Task Tracker node.

(iii) This is used for real code to test in HDFS.

■ backup—The server that hosts the Secondary NameNode daemon

The key differences are

(i) This is a production phase.

Data Measurement Chart:

Data Measurement Data Size

Bit 1 or 0 (Single Binary Digit)

Kilobyte (KB) 1024 Bytes

Megabyte (MB) 1024 KB

Gigabyte (GB) 1024 MB

Terabyte (TB) 1024 GB

Petabyte (PB) 1024 TB

Exabyte (EB) 1024 PB

You might also like