Unit V Programming Model
Introduction to Hadoop Framework - MapReduce, input splitting, map and reduce functions,
specifying input and output parameters, configuring and running a job - Developing
MapReduce Applications - Design of the Hadoop file system - Setting up a Hadoop Cluster -
Aneka: Cloud Application Platform, Thread Programming, Task Programming and MapReduce
Programming in Aneka.
Even though the file chunks are replicated and distributed across several
machines, they form a single namespace, so their contents are universally
accessible.
MAPREDUCE in Hadoop
Hadoop will not run just any program and distribute it across a cluster.
Programs must be written to conform to a particular programming model,
named "MapReduce."
HDFS has a master/slave architecture: a single NameNode, the master, manages the
file system namespace and regulates access to files by clients. In addition, there are
a number of DataNodes, usually one per node in the cluster, which manage storage
attached to the nodes that they run on.
HDFS exposes a file system namespace and allows user data to be stored
in files. Internally, a file is split into one or more blocks and these blocks are
stored in a set of DataNodes.
The DataNodes are responsible for serving read and write requests from
the file systemʼs clients. The DataNodes also perform block creation,
deletion, and replication upon instruction from the NameNode.
The NameNode and DataNode are pieces of software designed to run on
commodity machines.
HDFS is built using the Java language; any machine that supports
Java can run the NameNode or the DataNode software.
Data Format
InputFormat: How the input files are split up and read is defined by the
InputFormat. An InputFormat is a class that provides the following functionality:
Selects the files or other objects that should be used for input
Defines the InputSplits that break a file into tasks
Provides a factory for RecordReader objects that read the file
OutputFormat: The (key, value) pairs that the reduce function provides to the
OutputCollector are written to output files in a manner governed by the
OutputFormat.
Hadoop can process many different types of data formats, from flat text
files to databases.
For example, consider National Climatic Data Center (NCDC) data: the format
supports a rich set of meteorological elements, many of which are optional or
have variable data lengths.
Data files are organized by date and weather station.
To take advantage of the parallel processing that Hadoop provides, express the
query as a MapReduce job.
MapReduce works by breaking the processing into two phases: the map phase
and the reduce phase. Each phase has key-value pairs as input and output, the
types of which may be chosen by the programmer. The programmer also
specifies two functions: the map function and the reduce function.
The input to the map phase is the raw NCDC data. We choose a text input format
that gives us each line in the dataset as a text value. The key is the offset of the
beginning of the line from the beginning of the file.
To visualize the way the map works, consider the following sample lines of input
data:
These lines are presented to the map function as the key-value pairs:
The keys are the line offsets within the file, which we ignore in our map function.
The map function merely extracts the year and the air temperature from each
record, and emits them as its output (the temperature values have been
interpreted as integers):
(1950, 0)
(1950, 22)
(1950, −11)
(1949, 111)
(1949, 78)
The output from the map function is processed by the MapReduce framework
before being sent to the reduce function. This processing sorts and groups the
key-value pairs by key. So, continuing the example, our reduce function sees the
following input:
(1949, [111, 78])
(1950, [0, 22, −11])
Each year appears with a list of all its air temperature readings. All the reduce
function has to do now is iterate through the list and pick up the maximum
reading:
(1949, 111)
(1950, 22)
This is the final output: the maximum global temperature recorded in each year.
The whole data flow is illustrated in the following figure.
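A minimal Java sketch of what the map and reduce functions for this example could look like (the class names and the NCDC field offsets are assumptions for illustration; real code would also filter out missing or malformed readings):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxTemperatureMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        String year = line.substring(15, 19);                    // year field (offset assumed)
        int airTemperature =
                Integer.parseInt(line.substring(87, 92).trim()); // temperature field (offset assumed)
        context.write(new Text(year), new IntWritable(airTemperature));
    }
}

class MaxTemperatureReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int maxValue = Integer.MIN_VALUE;               // running maximum for this year
        for (IntWritable value : values) {
            maxValue = Math.max(maxValue, value.get());
        }
        context.write(key, new IntWritable(maxValue));  // e.g. (1949, 111)
    }
}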
Combiner Functions
A combiner function runs on the output of each map task, and its output becomes
the input to the reduce function; it cuts down the amount of data transferred
between the map and reduce tasks. This is best illustrated with an example.
Suppose that for the maximum temperature example, readings for the year 1950
were processed by two maps (because they were in different splits). Imagine the
first map produced the output:
(1950, 0)
(1950, 20)
(1950, 10)
and the second produced:
(1950, 25)
(1950, 15)
The reduce function would be called with a list of all the values:
(1950, [0, 20, 10, 25, 15])
with output:
(1950, 25)
since 25 is the maximum value in the list. We could use a combiner function that,
just like the reduce function, finds the maximum temperature for each map
output.
With the combiner, the reduce function would be called with just the two map
maxima, (1950, [20, 25]), and would produce the same output, since
max(0, 20, 10, 25, 15) = max(max(0, 20, 10), max(25, 15)) = max(20, 25) = 25
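In Hadoop's Java API, the combiner is specified on the job, alongside the mapper and reducer. A minimal driver sketch, reusing the reducer sketched earlier as the combiner (valid here because taking a maximum is commutative and associative); the input and output paths are taken from the command line:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperatureDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "Max temperature");
        job.setJarByClass(MaxTemperatureDriver.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory (must not already exist)
        job.setMapperClass(MaxTemperatureMapper.class);
        job.setCombinerClass(MaxTemperatureReducer.class);       // combiner runs on each map task's output
        job.setReducerClass(MaxTemperatureReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}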
Design of HDFS
HDFS is a file system designed for storing very large files with streaming data
access patterns, running on clusters of commodity hardware.
“Very large”
Files that are hundreds of megabytes, gigabytes, or terabytes in size. There are
Hadoop clusters running today that store petabytes of data.
Streaming data access
HDFS is built around the idea that the most efficient data processing pattern is a
write-once, read-many-times pattern.
Commodity hardware
Hadoop doesn't require expensive, highly reliable hardware to run on. It's
designed to run on clusters of commodity hardware (commonly available
hardware from multiple vendors).
Low-latency data access
Applications that require low-latency access to data, in the tens of milliseconds
range, will not work well with HDFS. Remember, HDFS is optimized for
delivering a high throughput of data, and this may be at the expense of latency.
HDFS Concepts
Blocks
Name Node
Data Nodes
HDFS Blocks
HDFS is a block structured file system. Each HDFS file is broken into
blocks of fixed size usually 128 MB which are stored across various data
nodes on the cluster. Each of these blocks is stored as a separate file on
local file system on data nodes (Commodity machines on cluster).
So, any HDFS client trying to read an HDFS file first gets the block information
from the Name Node and then, based on the block IDs and locations, reads the
data from the corresponding data nodes in the cluster.
HDFS's fsck command is useful for getting the file and block details of the file
system.
Example: the following command lists the blocks that make up each file in the
file system.
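A typical invocation, checking the whole file system from the root path:
Command: hdfs fsck / -files -blocks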
Advantages of Blocks
1. By default, the HDFS block size is 128 MB, which is much larger than the block
size of most other file systems. In HDFS, a large block size is maintained to reduce
the seek time for block access.
2. Another benefit of this block structure is that there is no need to store all blocks
of a file on the same disk or node. So, a file's size can be larger than the size of a
disk or node.
3. How Fault Tolerance is achieved with HDFS Blocks:
The HDFS block structure combines well with replication to provide fault tolerance
and availability.
Name Node
The Name Node is the single point of contact for accessing files in HDFS, and it
determines the block IDs and locations for data access. So, the Name Node
plays the master role in the master/slave architecture, whereas the Data Nodes
act as slaves. File system metadata is stored on the Name Node.
Data Node
Data Nodes are the slave part of the master/slave architecture; the actual HDFS
file data is stored on them in the form of fixed-size chunks of data called blocks.
Data Nodes serve clients' read and write requests on HDFS files and also
perform block creation, replication and deletion.
The Command-Line Interface
There are many other interfaces to HDFS, but the command line is one of the
simplest and, to many developers, the most familiar. It provides a command line
interface called FS shell that lets a user interact with the data in HDFS. The
syntax of this command set is similar to other shells (e.g. bash, csh) that users
are already familiar with. Here are some sample action/command pairs:
Action Command
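A few representative pairs (the paths are illustrative):
List the contents of a directory : hadoop fs -ls /user/hadoop
Create a directory : hadoop fs -mkdir /user/hadoop/dir1
Copy a local file into HDFS : hadoop fs -put localfile.txt /user/hadoop/
Display the contents of a file : hadoop fs -cat /user/hadoop/file.txt
Remove a file : hadoop fs -rm /user/hadoop/file.txt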
Data in HDFS can also be read programmatically. One of the simplest ways is
through a java.net.URL stream (the hdfs:// scheme must first be registered with
Java via Hadoop's FsUrlStreamHandlerFactory):
InputStream in = null;
try {
    in = new URL("hdfs://host/path").openStream();
    // process in
} finally {
    IOUtils.closeStream(in);
}
To write data, create a file with the create() method on the FileSystem instance,
which returns an FSDataOutputStream.
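A minimal sketch of this call, assuming the default configuration points at the cluster (the path and the written text are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();         // reads core-site.xml etc. from the classpath
        FileSystem fs = FileSystem.get(conf);
        FSDataOutputStream out =
                fs.create(new Path("/user/hadoop/demo.txt"));  // illustrative path
        out.writeUTF("hello HDFS");                        // FSDataOutputStream extends DataOutputStream
        out.close();
    }
}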
All the metadata is held by the NameNode, while the actual data is stored on
the DataNodes.
The figure below gives an idea of how data flows when a client interacts with
HDFS, i.e., with the NameNode and the DataNodes.
The following steps are involved in reading the file from HDFS:
Let's suppose a client (an HDFS client) wants to read a file from HDFS.
Step 1: First the Client will open the file by giving a call to open() method on
FileSystem object, which is an instance of DistributedFileSystem class.
Step 2: DistributedFileSystem calls the Namenode, using RPC (Remote
Procedure Call), to determine the locations of the blocks for the file. For each
block, the NameNode returns the addresses of all the DataNodes that have a
copy of that block. The client then interacts with the respective DataNodes to read
the file. The NameNode also provides a token to the client, which the client shows
to the DataNode for authentication.
Step 3: The client then calls read() on the stream. DFSInputStream, which has
stored the DataNode addresses for the first few blocks in the file, then connects
to the first closest DataNode for the first block in the file.
Step 4: Data is streamed from the DataNode back to the client, which calls read()
repeatedly on the stream.
Step 5: When the end of the block is reached, DFSInputStream will close the
connection to the DataNode and then find the best DataNode for the next block.
Step 6: Blocks are read in order, with the DFSInputStream opening new
connections to datanodes as the client reads through the stream. It will also call
the namenode to retrieve the datanode locations for the next batch of blocks as
needed. When the client has finished reading, it calls close() on the
FSDataInputStream.
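The read path described above can be exercised with a short client program; a minimal sketch, assuming fs.defaultFS in the configuration points at the cluster and the file path is passed on the command line:

import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);      // a DistributedFileSystem when fs.defaultFS is hdfs://
        InputStream in = null;
        try {
            in = fs.open(new Path(args[0]));       // open() returns an FSDataInputStream
            IOUtils.copyBytes(in, System.out, 4096, false);  // stream the file contents to stdout
        } finally {
            IOUtils.closeStream(in);
        }
    }
}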
Writing a file to HDFS follows a similar interaction between the client, the
NameNode and the DataNodes: the client first creates the file with create() on the
FileSystem, which contacts the NameNode to record the new file and returns an
output stream backed by DFSOutputStream (steps 1 and 2 of the write path).
Step 3: As the client writes data, DFSOutputStream splits it into packets, which
it writes to an internal queue, called the data queue. The data queue is
consumed by the DataStreamer, which is responsible for asking the namenode
to allocate new blocks by picking a list of suitable datanodes to store the
replicas. The list of datanodes forms a pipeline, and here weʼll assume the
replication level is three, so there are three nodes in the pipeline.
The DataStreamer streams the packets to the first datanode in the pipeline,
which stores each packet and forwards it to the second datanode in the pipeline.
Step 4: Similarly, the second datanode stores the packet and forwards it to the
third (and last) datanode in the pipeline.
Step 6: When the client has finished writing data, it calls close() on the stream.
This action flushes all the remaining packets to the datanode pipeline and waits
for acknowledgments before contacting the namenode to signal that the file is
complete.
Apache Hadoop
Apache Hadoop is an open-source, Java-based programming framework that
supports the processing of large data sets in a distributed computing
environment.
1. Hadoop common – collection of common utilities and libraries that support other
Hadoop modules.
2. Hadoop Distributed File System (HDFS) – the primary distributed storage system
used by Hadoop applications to hold large volumes of data. HDFS is scalable and
fault-tolerant and works closely with a wide variety of concurrent data access
applications.
3. Hadoop YARN (Yet Another Resource Negotiator) – Apache Hadoop YARN is the
resource management and job scheduling technology in the open source Hadoop
distributed processing framework. YARN is responsible for allocating system
resources to the various applications running in a Hadoop cluster and scheduling
tasks to be executed on different cluster nodes.
There are two ways to install Hadoop, i.e. Single node and Multi node.
A single-node cluster has only one DataNode; the NameNode, DataNode,
ResourceManager and NodeManager are all set up on a single machine. This is
used for study and testing purposes.
In a multi-node cluster there is more than one DataNode, and each DataNode
runs on a different machine. Multi-node clusters are what organizations use in
practice for analyzing Big Data.
Step 5: Add the Hadoop and Java paths in the bash file (.bashrc).
Open the .bashrc file and add the Hadoop and Java paths as shown below.
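For example (the install locations are illustrative and depend on where Java and Hadoop were extracted):
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=$HOME/hadoop-2.9.0
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin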
To apply these changes to the current terminal, execute the source command.
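Command: source .bashrc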
To make sure that Java and Hadoop have been properly installed on the
system and can be accessed through the Terminal, execute the java -version
and hadoop version commands.
Command: cd hadoop-2.9.0/etc/hadoop/
Command: ls
Step 7: Open core-site.xml and edit the property mentioned below inside the
configuration tag:
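A typical single-node core-site.xml entry (the NameNode host and port are illustrative):
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
 <property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:9000</value>
 </property>
</configuration>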
Step 9: Edit the mapred-site.xml file and edit the property mentioned below
inside the configuration tag:
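A typical mapred-site.xml, telling MapReduce to run on YARN, looks like this:
<?xml version="1.0"?>
<configuration>
 <property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
 </property>
</configuration>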
Step 10: Edit yarn-site.xml and edit the property mentioned below inside the
configuration tag:
<?xml version="1.0"?>
<configuration>
 <property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
 </property>
 <property>
  <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
 </property>
</configuration>
Step 11: Edit hadoop-env.sh and add the Java Path as mentioned below:
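For example (the JDK location is illustrative):
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64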
Command: cd
Command: cd hadoop-2.9.0
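Step 12: Format the NameNode (a standard form of the command, run from the Hadoop home directory):
Command: bin/hadoop namenode -format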
This command formats HDFS via the NameNode and is executed only the first
time a cluster is set up. Formatting the file system means initializing the directory
specified by the dfs.name.dir variable.
Step 13: Start all the Hadoop daemons from the sbin directory.
Command: ./start-all.sh
Step 14: To check that all the Hadoop services are up and running, run the
below command.
Command: jps
a) Eucalyptus
Eucalyptus is an open-source software platform for building private and hybrid
IaaS clouds. It is compatible with Amazon Web Services (AWS) and the Simple
Storage Service (S3).
Eucalyptus Architecture
Cloud Controller
The Cloud Controller (CLC) is the entry-point into the cloud for
administrators, developers, project managers, and end-users.
Walrus
Cluster Controller
The Cluster Controller (CC) gathers information about a set of NCs and schedules
virtual machine (VM) execution on specific NCs. The CC also manages the virtual
machine networks. All NCs associated with a single CC must be in the same subnet.
Storage Controller
Node Controller
The Node Controller (NC) executes on any machine that hosts VM
instances. The NC controls VM activities, including the execution, inspection,
and termination of VM instances.
VM Image Management
A VM image is uploaded into a user-defined bucket within Walrus, and can be
retrieved at any time from any availability zone.
b)OpenNebula
The following figure shows the OpenNebula architecture and its main
components.
Here, the core is a centralized component that manages the full VM life
cycle, including dynamically setting up networks for groups of VMs and
managing their storage requirements.
The last main components are the access drivers. They provide access to the
basic functionality of the monitoring, storage, and virtualization services
available in the cluster.
To this end, OpenNebula implements the libvirt API, an open interface for
VM management, as well as a command-line interface (CLI).
c) OpenStack
OpenStack was introduced by Rackspace and NASA in July 2010.
OpenStack is a set of software tools for building and managing cloud
computing platforms for public and private clouds.
It focuses on two aspects of cloud computing, compute and storage, which are
addressed by the OpenStack Compute and OpenStack Storage solutions.
“OpenStack Compute for creating and managing large groups of virtual
private servers”
“OpenStack Object Storage software for creating redundant, scalable
object storage using clusters of commodity servers to store terabytes or
even petabytes of data.”
OpenStack is an open-source cloud computing platform for all types of clouds,
which aims to be simple to implement, massively scalable, and feature rich.
OpenStack provides an Infrastructure-as-a-Service (IaaS) solution through a
set of interrelated services. Each service offers an application programming
interface (API) that facilitates integration between the services.
OpenStack Compute
OpenStack develops a cloud computing fabric controller, a component of an
IaaS system, known as Nova.
Nova is an OpenStack project designed to provide massively scalable, on
demand, self service access to compute resources.
The architecture of Nova is built on the concepts of shared-nothing and
messaging-based information exchange.
Hence, most communication in Nova is facilitated by message queues.
To prevent blocking components while waiting for a response from others,
deferred objects are introduced. Such objects include callbacks that get
triggered when a response is received.
To achieve the shared-nothing paradigm, the overall system state is kept
in a distributed data system.
State updates are made consistent through atomic transactions.
Nova is implemented in Python while utilizing a number of externally
supported libraries and components. This includes boto, an Amazon API
provided in Python, and Tornado, a fast HTTP server used to implement
the S3 capabilities in OpenStack.
The Figure shows the main architecture of Open Stack Compute. In this
architecture, the API Server receives HTTP requests from boto, converts
the commands to and from the API format, and forwards the requests to
the cloud controller.
OpenStack Storage
Proxy servers
Proxy servers are the public face of Object Storage and handle all of the
incoming API requests. Once a proxy server receives a request, it determines
the storage node based on the objectʼs URL.
Rings
A ring represents a mapping between the names of entities stored on disks and
their physical locations. There are separate rings for accounts, containers, and
objects. When other components need to perform any operation on an object,
container, or account, they need to interact with the appropriate ring to
determine their location in the cluster.
The ring maintains this mapping using zones, devices, partitions, and replicas.
Each partition in the ring is replicated, by default, three times across the cluster,
and partition locations are stored in the mapping maintained by the ring. The
ring is also responsible for determining which devices are used for handoff in
failure scenarios.
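The core idea of the ring can be illustrated with a small sketch (this illustrates only the hashing-to-partition idea, not OpenStack's actual implementation; the class name, partition power and path are assumptions):

import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class RingSketch {
    // Map an entity name to one of 2^partitionPower partitions by hashing it.
    static int partitionFor(String path, int partitionPower) throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        byte[] digest = md5.digest(path.getBytes(StandardCharsets.UTF_8));
        // Treat the first four bytes of the digest as an unsigned 32-bit number
        // and keep only its top partitionPower bits.
        long top = new BigInteger(1,
                new byte[]{digest[0], digest[1], digest[2], digest[3]}).longValue();
        return (int) (top >>> (32 - partitionPower));
    }

    public static void main(String[] args) throws Exception {
        // The partition number would then index a table that lists the devices
        // (and their replicas) responsible for that partition.
        System.out.println(partitionFor("/account/container/object", 16));
    }
}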
Partitions
A partition is simply a directory of data on a storage node; the ring maps each
account, container, and object to a partition, and replication and data movement
are carried out at the partition level.
d) Nimbus
Nimbus is a set of open source tools that together provide an IaaS cloud
computing solution.
The following figure shows the architecture of Nimbus, which allows a
client to lease remote resources by deploying VMs on those resources
and configuring them to represent the environment desired by the user.
To this end, Nimbus provides a special web interface known as Nimbus
Web. Its aim is to provide administrative and user functions in a friendly
interface.
Nimbus Web is centered around a Python Django web application that is
intended to be deployed completely separately from the Nimbus service.
As shown in Figure, a storage cloud implementation called Cumulus has
been tightly integrated with the other central services, although it can also
be used stand-alone.
Cumulus is compatible with the Amazon S3 REST API, but extends its
capabilities by including features such as quota management. Therefore,
clients such as boto and s3cmd, which work against the S3 REST API, also
work with Cumulus.
On the other hand, the Nimbus cloud client uses the Java Jets3t library to
interact with Cumulus. Nimbus supports two resource management
strategies.
The first is the default “resource pool” mode. In this mode, the service
has direct control of a pool of VM manager nodes and it assumes it can
start VMs.
The other supported mode is called “pilot.” Here, the service makes
requests to a clusterʼs Local Resource Management System (LRMS) to
get a VM manager available to deploy VMs.
Nimbus also provides an implementation of Amazonʼs EC2 interface that
allows users to use clients developed for the real EC2 system against
Nimbus-based clouds.