
Hadoop

Problems with Traditional Approach

In the traditional approach, the main issue was handling the heterogeneity of data, i.e.
structured, semi-structured and unstructured data. RDBMS technology focuses mostly on
structured data such as banking transactions and operational data, whereas Hadoop
specializes in semi-structured and unstructured data such as text, videos, audio, Facebook
posts, logs, etc. RDBMS technology is a proven, highly consistent and mature system
supported by many companies. Hadoop, on the other hand, is in demand due to Big Data,
which mostly consists of unstructured data in different formats.

Now let us understand the major problems associated with Big Data, so that, moving ahead,
we can see how Hadoop emerged as a solution.

The first problem is storing the colossal amount of data.

Storing this huge amount of data in a traditional system is not possible. The reason is obvious:
the storage is limited to a single system, while the data is increasing at a tremendous rate.

The second problem is storing heterogeneous data.

Now, we know that storing is a problem, but it is only a part of the problem. As we discussed,
the data is not only huge, it is also present in various formats: unstructured, semi-structured
and structured. So, you need to make sure that you have a system to store all these varieties
of data generated from various sources.

The third problem is accessing and processing speed.

Hard disk capacity is increasing, but the disk transfer speed (the access speed) is not
increasing at a similar rate. Let me explain this with an example: if you have only one
100 MB/s I/O channel and you are processing 1 TB of data, it will take around 2.91 hours.
Now, if you have four machines, each with one such I/O channel, the same amount of data
can be read in roughly 43 minutes. Thus, accessing and processing speed is a bigger
problem than storing Big Data.
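
As a quick back-of-the-envelope check (a sketch of my own, not part of the original tutorial), the arithmetic behind those numbers looks like this:

```java
// Rough transfer-time arithmetic for the example above: 1 TB read through a single
// 100 MB/s I/O channel versus four machines each reading their own share in parallel.
public class TransferTime {
  public static void main(String[] args) {
    double dataMb = 1024.0 * 1024.0;   // 1 TB expressed in MB
    double channelMbPerSec = 100.0;    // one 100 MB/s I/O channel

    double oneMachineHours  = dataMb / channelMbPerSec / 3600.0;
    double fourMachinesMins = dataMb / (4 * channelMbPerSec) / 60.0;

    // Prints roughly 2.91 hours and 43.7 minutes, matching the figures in the text.
    System.out.printf("One machine  : %.2f hours%n", oneMachineHours);
    System.out.printf("Four machines: %.1f minutes%n", fourMachinesMins);
  }
}
```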

Before looking at how Hadoop evolved over time, let us first understand what Hadoop is.
What is Hadoop?

Hadoop is a framework that allows you to first store Big Data in a distributed environment,
so that you can process it in parallel. There are basically two core components in Hadoop:
HDFS for storage and YARN for processing, which we will cover in detail later.

Hadoop is an open-source software framework used for storing and processing Big Data in a
distributed manner on large clusters of commodity hardware. Hadoop is licensed under the
Apache v2 license.

Hadoop was developed based on the paper written by Google on the MapReduce system,
and it applies concepts of functional programming. Hadoop is written in the Java
programming language and ranks among the top-level Apache projects. Hadoop was
developed by Doug Cutting and Michael J. Cafarella.

Evolution of Hadoop

In 2003, Doug Cutting launched the Nutch project to handle billions of searches and index
millions of web pages. Later, in October 2003, Google released its paper on GFS (the Google
File System). In December 2004, Google released its paper on MapReduce. In 2005, Nutch
used GFS and MapReduce to perform its operations. In 2006, Yahoo created Hadoop, based
on GFS and MapReduce, with Doug Cutting and his team. You might be surprised to learn
that in 2007 Yahoo started using Hadoop on a 1000-node cluster.

Later, in January 2008, Yahoo released Hadoop as an open-source project to the Apache
Software Foundation. In July 2008, Apache successfully tested a 4000-node cluster with
Hadoop. In 2009, Hadoop successfully sorted a petabyte of data in less than 17 hours,
handling billions of searches and indexing millions of web pages. Moving ahead, in
December 2011 Apache released Hadoop version 1.0, and later, in August 2013,
version 2.0.6 became available.

Hadoop-as-a-Solution

Let’s understand how Hadoop provides a solution to the Big Data problems that we have
discussed so far.

• The first problem is storing a huge amount of data.

HDFS provides a distributed way to store Big Data. Your data is stored in blocks across
DataNodes, and you specify the size of each block. Suppose you have 512 MB of data and
you have configured HDFS to create 128 MB data blocks. HDFS will then divide the data into
4 blocks (512/128 = 4) and store them across different DataNodes. While storing these data
blocks in the DataNodes, the blocks are replicated on different DataNodes to provide fault
tolerance.
Hadoop follows horizontal scaling instead of vertical scaling. In horizontal scaling, you can
add new nodes to the HDFS cluster on the fly as per requirement, instead of increasing the
hardware capacity of the nodes you already have.
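
As a small aside (my own illustration; in a real cluster these values normally live in hdfs-site.xml rather than in code), the block arithmetic and the two standard HDFS properties that control it look like this:

```java
import org.apache.hadoop.conf.Configuration;

public class BlockSplit {
  public static void main(String[] args) {
    long fileSize  = 512L * 1024 * 1024;  // 512 MB of data
    long blockSize = 128L * 1024 * 1024;  // 128 MB HDFS block size

    // 512 / 128 = 4 blocks, each stored on a DataNode and replicated to others.
    long blocks = (fileSize + blockSize - 1) / blockSize;
    System.out.println("Blocks created: " + blocks);

    // The same knobs expressed in code; hdfs-site.xml is the usual place for them.
    Configuration conf = new Configuration();
    conf.setLong("dfs.blocksize", blockSize); // block size in bytes
    conf.setInt("dfs.replication", 3);        // copies of each block for fault tolerance
  }
}
```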

• The next problem was storing a variety of data.

In HDFS you can store all kinds of data, whether structured, semi-structured or unstructured.
There is no schema validation before the data is dumped into HDFS. HDFS also follows a
write-once, read-many model: you write any kind of data once and read it multiple times
for finding insights.
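
As a short sketch (my own illustration, assuming a running cluster and hypothetical local file names), the same FileSystem call loads structured, semi-structured and unstructured files alike, with no schema check at write time:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LoadMixedData {
  public static void main(String[] args) throws Exception {
    // Picks up core-site.xml from the classpath to find the cluster.
    FileSystem fs = FileSystem.get(new Configuration());

    // Structured, semi-structured and unstructured data all land in the same directory.
    fs.copyFromLocalFile(new Path("transactions.csv"), new Path("/data/transactions.csv"));
    fs.copyFromLocalFile(new Path("server.log"),       new Path("/data/server.log"));
    fs.copyFromLocalFile(new Path("lecture.mp4"),      new Path("/data/lecture.mp4"));

    fs.close();
  }
}
```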

• The third challenge was about processing the data faster.

In order to solve this, we move the processing unit to the data instead of moving the data to
the processing unit.

So, what does it mean to move the computation unit to the data?

It means that instead of moving data from different nodes to a single master node for
processing, the processing logic is sent to the nodes where the data is stored, so that each
node can process a part of the data in parallel. Finally, all of the intermediate output
produced by each node is merged together and the final response is sent back to the client.
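
To make this concrete, here is a minimal sketch of the classic WordCount job written against the Hadoop MapReduce API (adapted from the well-known Apache example; the class and path names are illustrative). The map tasks run on the nodes that hold the input blocks and emit small (word, 1) pairs; only this intermediate output travels over the network to be merged by the reducer.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Runs on the DataNodes that hold the input splits: emits a (word, 1) pair per token.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Merges the intermediate output produced by all the nodes into the final counts.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // pre-aggregates on the map side
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```
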
Features of Hadoop

Fig: Hadoop Tutorial – Hadoop Features

Reliability

The machines in a Hadoop cluster work together as a single unit: if one of the machines
fails, another machine takes over its responsibility, so the system keeps working in a reliable
and fault-tolerant fashion. The Hadoop infrastructure has inbuilt fault-tolerance features and
hence Hadoop is highly reliable.

Economical

Hadoop uses commodity hardware (like your PC or laptop). For example, in a small Hadoop
cluster, all your DataNodes can have normal configurations like 8-16 GB of RAM, a 5-10 TB
hard disk and Xeon processors.

But if I had used hardware-based RAID with Oracle for the same purpose, I would have
ended up spending at least five times more. So, the cost of ownership of a Hadoop-based
project is minimized. It is easier to maintain a Hadoop environment, and it is economical as
well. Also, Hadoop is open-source software and hence there is no licensing cost.

Scalability

Hadoop has the inbuilt capability of integrating seamlessly with cloud-based services. So, if
you are installing Hadoop on a cloud, you don't need to worry about scalability, because you
can go ahead, procure more hardware and expand your setup within minutes whenever
required.
Flexibility

Hadoop is very flexible in terms of the ability to deal with all kinds of data. We
discussed “Variety” in our previous blog on Big Data Tutorial, where data can be of any kind
and Hadoop can store and process them all, whether it is structured, semi-structured or
unstructured data.

Hadoop Core Components

While setting up a Hadoop cluster, you have the option of choosing a lot of services as part
of your Hadoop platform, but two services are always mandatory for setting up Hadoop.
One is HDFS (storage) and the other is YARN (processing). HDFS stands for Hadoop
Distributed File System, which is the scalable storage unit of Hadoop, whereas YARN is used
to process the data that is stored in HDFS in a distributed and parallel fashion.

HDFS

Let us go ahead with HDFS first. The main components of HDFS are the NameNode and
the DataNode. Let us talk about the roles of these two components in detail; a small
client-side sketch after the DataNode section shows them working together.

Fig: Hadoop Tutorial – HDFS

NameNode

• It is the master daemon that maintains and manages the DataNodes (slave nodes)
• It records the metadata of all the blocks stored in the cluster, e.g. the location of blocks,
the size of the files, permissions, hierarchy, etc.
• It records each and every change that takes place to the file system metadata
• If a file is deleted in HDFS, the NameNode will immediately record this in the EditLog
• It regularly receives a Heartbeat and a block report from all the DataNodes in the
cluster to ensure that the DataNodes are alive
• It keeps a record of all the blocks in HDFS and the DataNodes on which they are stored
• It has high availability and federation features, which I will discuss
in the HDFS architecture in detail
DataNode

• It is the slave daemon which runs on each slave machine
• The actual data is stored on the DataNodes
• It is responsible for serving read and write requests from the clients
• It is also responsible for creating blocks, deleting blocks and replicating them
based on the decisions taken by the NameNode
• It sends heartbeats to the NameNode periodically to report the overall health of
HDFS; by default, this frequency is set to 3 seconds
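
To see this division of labour from the client's side, here is a small sketch using the standard FileSystem API (my own illustration; it assumes a reachable cluster and the hypothetical path /data/transactions.csv). The file metadata and block list come from the NameNode, while the hosts listed for each block are the DataNodes that actually hold the bytes:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());

    // Metadata (size, permissions, block list) is answered by the NameNode.
    FileStatus status = fs.getFileStatus(new Path("/data/transactions.csv"));
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

    // Each block is served by the DataNodes listed in its location entry.
    for (BlockLocation block : blocks) {
      System.out.printf("offset=%d length=%d hosts=%s%n",
          block.getOffset(), block.getLength(), String.join(",", block.getHosts()));
    }
    fs.close();
  }
}
```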

So, this was all about HDFS in a nutshell. Now, let us move ahead to the second fundamental
unit of Hadoop, i.e. YARN.

YARN

YARN comprises two major components: the ResourceManager and the NodeManager. A small client sketch at the end of this section shows how to query them.

Fig: Hadoop Tutorial – YARN

ResourceManager

• It is a cluster-level component (one per cluster) and runs on the master machine
• It manages resources and schedules applications running on top of YARN
• It has two components: the Scheduler and the ApplicationManager
• The Scheduler is responsible for allocating resources to the various running
applications
• The ApplicationManager is responsible for accepting job submissions and negotiating
the first container for executing the application
• It keeps track of the heartbeats from the NodeManagers
NodeManager

• It is a node-level component (one on each node) and runs on each slave machine
• It is responsible for managing containers and monitoring resource utilization in each
container
• It also keeps track of node health and log management
• It continuously communicates with the ResourceManager to remain up-to-date
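
As a small illustration (my own sketch, assuming a running cluster whose configuration is on the classpath), the YarnClient API can be used to ask the ResourceManager for the node reports it builds from NodeManager heartbeats:

```java
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ListYarnNodes {
  public static void main(String[] args) throws Exception {
    // Picks up yarn-site.xml from the classpath to locate the ResourceManager.
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(new YarnConfiguration());
    yarnClient.start();

    // The ResourceManager aggregates NodeManager heartbeats, so a single call
    // returns the state, capacity and current usage of every slave node.
    for (NodeReport node : yarnClient.getNodeReports(NodeState.RUNNING)) {
      System.out.printf("%s  state=%s  capacity=%s  used=%s%n",
          node.getNodeId(), node.getNodeState(),
          node.getCapability(), node.getUsed());
    }
    yarnClient.stop();
  }
}
```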
