
UNIT IV PROGRAMMING MODEL

Introduction to Hadoop Framework - Mapreduce, Input splitting, map and reduce functions,
specifying input and output parameters, configuring and running a job –Developing Map
Reduce Applications - Design of Hadoop file system –Setting up Hadoop Cluster - Aneka:
Cloud Application Platform, Thread Programming, Task Programming and Map-Reduce
Programming in Aneka.

Introduction to Hadoop Framework

 In a Hadoop cluster, data is distributed to all the nodes of the cluster as it is being loaded in. The Hadoop Distributed File System (HDFS) will split large data files into chunks which are managed by different nodes in the cluster.

 In addition to this, each chunk is replicated across several machines, so that a single machine failure does not result in any data being unavailable.

 Even though the file chunks are replicated and distributed across several machines, they form a single namespace, so their contents are universally accessible.
MAPREDUCE in Hadoop

 Hadoop will not run just any program and distribute it across a cluster. Programs must be written to conform to a particular programming model, named "MapReduce."

 In MapReduce, records are processed by tasks called Mappers. The output from the Mappers is then brought together into a second set of tasks called Reducers, where results from different mappers can be merged together.
Hadoop Architecture:

 HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients.

 In addition, there are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on.

 HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks and these blocks are stored in a set of DataNodes.

 The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes.

 The DataNodes are responsible for serving read and write requests from the file systemʼs clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.

 The NameNode and DataNode are pieces of software designed to run on commodity machines.

 HDFS is built using the Java language; any machine that supports Java can run the NameNode or the DataNode software.
Data Format

InputFormat: How the input files are split up and read is defined by the
InputFormat. An InputFormat is a class that provides the following functionality:

 Selects the files or other objects that should be used for input

 Defines the InputSplits that break a file into tasks

 Provides a factory for RecordReader objects that read the file

OutputFormat: The (key, value) pairs emitted by the Reducers through the OutputCollector are written to output files as defined by the OutputFormat.

 Hadoop can process many different types of data formats, from flat text files to databases.

 If it is a flat file, the data is stored using a line-oriented ASCII format, in which each line is a record.

 For example, in NCDC (National Climatic Data Center) weather data, the format supports a rich set of meteorological elements, many of which are optional or have variable data lengths.

Data files are organized by date and weather station.
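Putting these together, the input and output parameters of a job are specified through the InputFormat and OutputFormat classes and the input/output paths when configuring and running a job. The following is a minimal sketch using the Hadoop Java API; the mapper and reducer class names (MaxTemperatureMapper, MaxTemperatureReducer) refer to the example sketched in the next section and are placeholders here.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class MaxTemperatureDriver {
    public static void main(String[] args) throws Exception {
        // Job configuration: which classes to run and where the data lives
        Job job = Job.getInstance(new Configuration(), "Max temperature");
        job.setJarByClass(MaxTemperatureDriver.class);

        // Input parameters: line-oriented text; key = byte offset, value = line contents
        job.setInputFormatClass(TextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));

        // Output parameters: plain-text key-value pairs written to the given directory
        job.setOutputFormatClass(TextOutputFormat.class);
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Map and reduce classes (sketched in the next section)
        job.setMapperClass(MaxTemperatureMapper.class);
        job.setReducerClass(MaxTemperatureReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Submit the job and wait for it to finish
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}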

Analyzing the Data with Hadoop

To take advantage of the parallel processing that Hadoop provides, express the
query as a MapReduce job.

Map and Reduce:

MapReduce works by breaking the processing into two phases: the map phase
and the reduce phase. Each phase has key-value pairs as input and output, the
types of which may be chosen by the programmer. The programmer also
specifies two functions: the map function and the reduce function.

The input to the map phase is the raw NCDC data. We choose a text input format
that gives us each line in the dataset as a text value. The key is the offset of the
beginning of the line from the beginning of the file.

To visualize the way the map works, consider a few sample lines of raw NCDC input data; each line is a record that contains, among other fields, a year and an air temperature reading.

These lines are presented to the map function as key-value pairs. The keys are the line offsets within the file, which we ignore in our map function. The map function merely extracts the year and the air temperature and emits them as its output (the temperature values have been interpreted as integers):

(1950, 0)

(1950, 22)

(1950, −11)

(1949, 111)

(1949, 78)

The output from the map function is processed by the MapReduce framework
before being sent to the reduce function. This processing sorts and groups the
key-value pairs by key. So, continuing the example, our reduce function sees
the following input:

(1949, [111, 78])

(1950, [0, 22, −11])

Each year appears with a list of all its air temperature readings. All the reduce
function has to do now is iterate through the list and pick up the maximum
reading:

(1949, 111)

(1950, 22)

This is the final output: the maximum global temperature recorded in each year.
The whole data flow is illustrated in the following figure.
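The map and reduce functions for this example can be sketched with the Hadoop Java API as shown below. This is a simplified sketch: the substring offsets used to pull out the year and temperature fields are assumptions about the NCDC record layout, and quality-code filtering is omitted.

// MaxTemperatureMapper.java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MaxTemperatureMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final int MISSING = 9999; // sentinel used when no reading is present

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        String year = line.substring(15, 19);       // year field (assumed offsets)
        int airTemperature;                          // temperature field, with explicit sign
        if (line.charAt(87) == '+') {
            airTemperature = Integer.parseInt(line.substring(88, 92));
        } else {
            airTemperature = Integer.parseInt(line.substring(87, 92));
        }
        if (airTemperature != MISSING) {
            context.write(new Text(year), new IntWritable(airTemperature));
        }
    }
}

// MaxTemperatureReducer.java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxTemperatureReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int maxValue = Integer.MIN_VALUE;
        for (IntWritable value : values) {
            maxValue = Math.max(maxValue, value.get());
        }
        context.write(key, new IntWritable(maxValue)); // (year, maximum temperature)
    }
}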

Combiner Functions

 Hadoop allows the user to specify a combiner function to be run on the map output—the combiner functionʼs output forms the input to the reduce function.

 Since the combiner function is an optimization, Hadoop does not provide a guarantee of how many times it will call it for a particular map output record, if at all.

 This is best illustrated with an example. Suppose that for the maximum
temperature example, readings for the year 1950 were processed by two
maps (because they were in different splits). Imagine the first map
produced the output:
(1950, 0)

(1950, 20)

(1950, 10)

And the second produced:

(1950, 25)

(1950, 15)

The reduce function would be called with a list of all the values:

(1950, [0, 20, 10, 25, 15])

with output:

(1950, 25)

since 25 is the maximum value in the list. We could use a combiner function that,
just like the reduce function, finds the maximum temperature for each map
output.

The reduce would then be called with:

(1950, [20, 25])

and the reduce would produce the same output as before.

In other words, the calculation can be expressed as:

max(0, 20, 10, 25, 15) = max(max(0, 20, 10), max(25, 15)) = max(20, 25) = 25
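In the Java API the combiner is set on the job. Because taking a maximum is commutative and associative, the reducer class itself can be reused as the combiner; a minimal sketch, continuing the hypothetical driver example shown earlier:

job.setMapperClass(MaxTemperatureMapper.class);
job.setCombinerClass(MaxTemperatureReducer.class); // runs on each map's output before the shuffle
job.setReducerClass(MaxTemperatureReducer.class);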

Design of HDFS

HDFS is a file system designed for storing very large files with streaming data
access patterns, running on clusters of commodity hardware.

 “Very large”

Files that are hundreds of megabytes, gigabytes, or terabytes in size. There are
Hadoop clusters running today that store petabytes of data.

 Streaming data access

HDFS is built around the idea that the most efficient data processing pattern is a write-once, read-many-times pattern.

 Commodity hardware

Hadoop doesnʼt require expensive, highly reliable hardware to run on. Itʼs designed to run on clusters of commodity hardware (commonly available hardware from multiple vendors).

 Low-latency data access

Applications that require low-latency access to data, in the tens of milliseconds range, will not work well with HDFS. Remember, HDFS is optimized for delivering a high throughput of data, and this may be at the expense of latency.

HDFS Concepts

The following diagram illustrates the Hadoop concepts


Important components in HDFS Architecture are:

 Blocks

 Name Node

 Data Nodes

HDFS Blocks

 HDFS is a block-structured file system. Each HDFS file is broken into blocks of a fixed size, usually 128 MB, which are stored across various data nodes on the cluster. Each of these blocks is stored as a separate file on the local file system of the data nodes (commodity machines in the cluster).

 Thus, to access a file on HDFS, multiple data nodes need to be referenced, and the list of data nodes which need to be accessed is determined by the file system metadata stored on the Name Node.

 So, any HDFS client trying to access/read an HDFS file will get block information from the Name Node first, and then, based on the block IDs and locations, data will be read from the corresponding data nodes/computer machines on the cluster.

 HDFSʼs fsck command is useful for getting details of the files and blocks in the file system.

 Example: The following command lists the blocks that make up each file in the file system.

$ hadoop fsck / -files -blocks

Advantages of Blocks

1. Quick Seek Time:

By default, the HDFS block size is 128 MB, which is much larger than in most other file systems. In HDFS, a large block size is maintained to keep the seek overhead small relative to the time spent transferring a block.

2. Ability to Store Large Files:

Another benefit of this block structure is that there is no need to store all blocks of a file on the same disk or node. So, a fileʼs size can be larger than the size of a single disk or node.
3. How Fault Tolerance is achieved with HDFS Blocks:

HDFS blocks feature suits well with the replication for providing fault tolerance
and availability.

By default, each block is replicated to three separate machines. This replication insures blocks against corruption and against disk or machine failure. If a block becomes unavailable, a copy can be read from another machine. A block that is no longer available due to corruption or machine failure can be replicated from its remaining copies to other live machines to bring the replication factor back to the normal level (3 by default).

Name Node

 Name Node is the single point of contact for accessing files in HDFS, and it determines the block IDs and locations for data access. So, the Name Node plays the Master role in the Master/Slaves architecture, whereas the Data Nodes act as slaves. File system metadata is stored on the Name Node.

 File system metadata contains file names, file permissions and the locations of each block of the files. Thus, the metadata is relatively small in size and fits into the main memory of a computer machine. So, it is stored in the main memory of the Name Node to allow fast access.

Data Node
 Data Nodes are the slave part of the Master/Slaves architecture, on which actual HDFS files are stored in the form of fixed-size chunks of data called blocks.

 Data Nodes serve read and write requests of clients on HDFS files and also perform block creation, replication and deletion.
The Command-Line Interface

There are many other interfaces to HDFS, but the command line is one of the
simplest and, to many developers, the most familiar. It provides a command line
interface called FS shell that lets a user interact with the data in HDFS. The
syntax of this command set is similar to other shells (e.g. bash, csh) that users
are already familiar with. Here are some sample action/command pairs:

Action: Create a directory named /foodir
Command: bin/hadoop dfs -mkdir /foodir

Action: View the contents of a file named /foodir/myfile.txt
Command: bin/hadoop dfs -cat /foodir/myfile.txt
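A few other commonly used FS shell commands follow the same pattern (the paths here are only illustrative examples):

Command: bin/hadoop dfs -ls /foodir

Command: bin/hadoop dfs -put localfile.txt /foodir

Command: bin/hadoop dfs -get /foodir/myfile.txt localfile.txt

Command: bin/hadoop dfs -rm /foodir/myfile.txt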
Hadoop Filesystem

Hadoop has an abstract notion of filesystem, of which HDFS is just one implementation.

The Java abstract class org.apache.hadoop.fs.FileSystem represents a filesystem in Hadoop, and there are several concrete implementations, such as the local filesystem (LocalFileSystem) and HDFS itself (DistributedFileSystem).
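A concrete FileSystem instance is normally obtained from the configuration rather than constructed directly. A minimal sketch follows; the URI used here is an assumption and error handling is omitted.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class ShowFileSystem {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();      // picks up core-site.xml, hdfs-site.xml, etc.
        FileSystem fs = FileSystem.get(conf);          // the default filesystem (HDFS if so configured)
        FileSystem local = FileSystem.getLocal(conf);  // the local filesystem
        FileSystem byUri = FileSystem.get(URI.create("hdfs://localhost:9000/"), conf); // explicit URI

        System.out.println(fs.getClass().getName());     // e.g. DistributedFileSystem
        System.out.println(local.getClass().getName());  // e.g. LocalFileSystem
        System.out.println(byUri.getClass().getName());
    }
}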
JAVA INTERFACE

Reading Data from a Hadoop URL


One of the simplest ways to read a file from a Hadoop filesystem is by using a java.net.URL object to open a stream to read the data from.

The general syntax is:

InputStream in = null;
try {
    in = new URL("hdfs://host/path").openStream();
    // process in
} finally {
    IOUtils.closeStream(in);
}
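For this to work, Java must be told how to handle the hdfs:// URL scheme, which is done by registering Hadoopʼs FsUrlStreamHandlerFactory once per JVM. A complete sketch is given below; the host, port and path passed on the command line are placeholders.

import java.io.InputStream;
import java.net.URL;
import org.apache.hadoop.fs.FsUrlStreamHandlerFactory;
import org.apache.hadoop.io.IOUtils;

public class URLCat {
    static {
        // May only be called once per JVM; makes java.net.URL understand the hdfs:// scheme
        URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
    }

    public static void main(String[] args) throws Exception {
        InputStream in = null;
        try {
            in = new URL(args[0]).openStream();             // e.g. hdfs://localhost:9000/user/data.txt
            IOUtils.copyBytes(in, System.out, 4096, false); // copy the stream to standard output
        } finally {
            IOUtils.closeStream(in);
        }
    }
}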

Writing HDFS Files Through FileSystem API:

To write a file in HDFS,

 First we need to get an instance of FileSystem.

 Create a file with the create() method on the file system instance, which will return an FSDataOutputStream.

 We can copy bytes from any other stream to the output stream using IOUtils.copyBytes(), or write directly with write() (or any of its variants) on the FSDataOutputStream object.
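A minimal sketch of these steps follows; the local and HDFS file paths are placeholders and error handling is kept minimal.

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsFileWriter {
    public static void main(String[] args) throws Exception {
        // 1. Get a FileSystem instance from the configuration
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // 2. Create the destination file; create() returns an FSDataOutputStream
        FSDataOutputStream out = fs.create(new Path("/user/hadoop/output.txt"));

        // 3. Copy bytes from a local input stream to the HDFS output stream
        InputStream in = new BufferedInputStream(new FileInputStream("localfile.txt"));
        try {
            IOUtils.copyBytes(in, out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
            IOUtils.closeStream(out);
        }
    }
}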
Data Flow

Anatomy of File Read

 HDFS has a master and slave kind of architecture. The Namenode acts as master and the Datanodes as workers.

 All the metadata information is with the namenode, and the original data is stored on the datanodes.

 The figure below gives an idea of how data flows between the client interacting with HDFS, i.e. the Namenode and the Datanodes.
The following steps are involved in reading a file from HDFS:
Letʼs suppose a client (an HDFS client) wants to read a file from HDFS.

Step 1: First the client will open the file by giving a call to the open() method on a FileSystem object, which is an instance of the DistributedFileSystem class.
Step 2: DistributedFileSystem calls the Namenode, using RPC (Remote Procedure Call), to determine the locations of the blocks for the file. For each block, the NameNode returns the addresses of all the DataNodes that have a copy of that block. The client will interact with the respective DataNodes to read the file. The NameNode also provides a token to the client, which it shows to the DataNode for authentication.

The DistributedFileSystem returns an FSDataInputStream (an input stream that supports file seeks) to the client for it to read data from.

Step 3: The client then calls read() on the stream. DFSInputStream, which has stored the DataNode addresses for the first few blocks in the file, then connects to the closest DataNode for the first block in the file.

Step 4: Data is streamed from the DataNode back to the client, which calls read()
repeatedly on the stream.

Step 5: When the end of the block is reached, DFSInputStream will close the connection to the DataNode, then find the best DataNode for the next block.

Step 6: Blocks are read in order, with the DFSInputStream opening new
connections to datanodes as the client reads through the stream. It will also call
the namenode to retrieve the datanode locations for the next batch of blocks as
needed. When the client has finished reading, it calls close() on the
FSDataInputStream.
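From the clientʼs point of view, these steps map onto just a few API calls. A minimal sketch is shown below; the file path is a placeholder.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsFileReader {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);        // DistributedFileSystem when HDFS is the default

        // Steps 1-2: open() contacts the NameNode for the block locations
        FSDataInputStream in = fs.open(new Path("/user/hadoop/input.txt"));
        try {
            // Steps 3-6: read() streams the data from the DataNodes
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            in.close();                              // close() ends the read
        }
    }
}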

Anatomy of File Write

Step 1: The client creates the file by calling the create() method on DistributedFileSystem.

Step 2: DistributedFileSystem makes an RPC call to the namenode to create a new file in the filesystemʼs namespace, with no blocks associated with it. The namenode performs various checks to make sure the file doesnʼt already exist and that the client has the right permissions to create the file.

Step 3: As the client writes data, DFSOutputStream splits it into packets, which
it writes to an internal queue, called the data queue. The data queue is
consumed by the DataStreamer, which is responsible for asking the namenode
to allocate new blocks by picking a list of suitable datanodes to store the
replicas. The list of datanodes forms a pipeline, and here weʼll assume the
replication level is three, so there are three nodes in the pipeline.
The DataStreamer streams the packets to the first datanode in the pipeline, which stores the packet and forwards it to the second datanode in the pipeline.

Step 4: Similarly, the second datanode stores the packet and forwards it to the
third (and last) datanode in the pipeline.

Step 5: DFSOutputStream also maintains an internal queue of packets that are waiting to be acknowledged by datanodes, called the ack queue. A packet is removed from the ack queue only when it has been acknowledged by all the datanodes in the pipeline.

Step 6: When the client has finished writing data, it calls close() on the stream.
This action flushes all the remaining packets to the datanode pipeline and waits
for acknowledgments before contacting the namenode to signal that the file is
complete.

Apache Hadoop
Apache Hadoop is an open source, Java-based programming framework that supports the processing of large data sets in a distributed computing environment.

Apache Hadoop framework is composed of following modules:

1. Hadoop common – collection of common utilities and libraries that support other
Hadoop modules.

2. Hadoop Distributed File System (HDFS) – Primary distributed storage system used by Hadoop applications to hold large volumes of data. HDFS is scalable and fault-tolerant and works closely with a wide variety of concurrent data access applications.

3. Hadoop YARN (Yet Another Resource Negotiator) – Apache Hadoop YARN is the
resource management and job scheduling technology in the open source Hadoop
distributed processing framework. YARN is responsible for allocating system
resources to the various applications running in a Hadoop cluster and scheduling
tasks to be executed on different cluster nodes.

4. Hadoop MapReduce – MapReduce is a framework using which we can write applications to process huge amounts of data, in parallel, on large clusters of commodity hardware in a reliable manner.
Setting up a Hadoop Cluster

In general, a computer cluster is a collection of various computers that work collectively as a single system.

“A Hadoop cluster is a collection of independent components connected through a dedicated network to work as a single centralized data processing resource.”

Advantages of a Hadoop Cluster Setup

 As big data grows exponentially, the parallel processing capabilities of a Hadoop cluster help in increasing the speed of the analysis process.

 A Hadoop cluster setup is inexpensive because it uses cheap commodity hardware. Any organization can set up a powerful Hadoop cluster without having to spend on expensive server hardware.

 Hadoop clusters are resilient to failure, meaning whenever data is sent to a particular node for analysis, it is also replicated to other nodes on the Hadoop cluster. If the node fails, then the replicated copy of the data present on the other node in the cluster can be used for analysis.

There are two ways to install Hadoop, i.e. Single node and Multi node.

A single node cluster means only one DataNode is running, with the NameNode, DataNode, ResourceManager and NodeManager all set up on a single machine. This is used for studying and testing purposes.
In a multi node cluster, there is more than one DataNode running, and each DataNode runs on a different machine. The multi node cluster is practically used in organizations for analyzing Big Data.

Steps to set up a Hadoop Cluster

Step 1: Download the Java package. Save this file in your home directory.

Step 2: Extract the Java Tar File.

Step 3: Download the Hadoop 2.9.0 Package.

Step 4: Extract the Hadoop tar File.

Command: tar -xvf hadoop-2.9.0.tar.gz

Step 5: Add the Hadoop and Java paths in the bash file (.bashrc).

Open the .bashrc file and add the Hadoop and Java paths as shown below.

Command: gedit .bashrc


Fig: Hadoop Installation – Setting Environment Variable
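The exact lines depend on where the Java and Hadoop archives were extracted; as an illustrative sketch (the JDK folder name below is an assumption), the entries added to .bashrc typically look like this:

export JAVA_HOME=$HOME/jdk1.8.0_101
export HADOOP_HOME=$HOME/hadoop-2.9.0
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin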

Then, save the bash file and close it.

For applying all these changes to the current Terminal, execute the source
command.

Command: source .bashrc

To make sure that Java and Hadoop have been properly installed on the
system and can be accessed through the Terminal, execute the java -version
and hadoop version commands.

Command: java -version


Fig: Hadoop Installation – Checking Java Version

Command: hadoop version

Fig: Hadoop Installation – Checking Hadoop Version

Step 6: Edit the Hadoop Configuration files.

Command: cd hadoop-2.9.0/etc/hadoop/

Command: ls

All the Hadoop configuration files are located in the hadoop-2.9.0/etc/hadoop directory, as seen in the snapshot below:
Fig: Hadoop Installation – Hadoop Configuration Files

Step 7: Open core-site.xml and edit the property mentioned below inside
configuration tag:

core-site.xml informs the Hadoop daemons where the NameNode runs in the cluster. It contains configuration settings of the Hadoop core, such as I/O settings that are common to HDFS and MapReduce.

Command: gedit core-site.xml

<?xml version="1.0" encoding="UTF-8"?>


<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
Step 8: Edit hdfs-site.xml and edit the property mentioned below inside
configuration tag:

hdfs-site.xml contains configuration settings of the HDFS daemons (i.e. NameNode, DataNode, Secondary NameNode). It also includes the replication factor and block size of HDFS.

Command: gedit hdfs-site.xml

<?xml version="1.0" encoding="UTF-8"?>


<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.permission</name>
<value>false</value>
</property>
</configuration>

Step 9: Edit the mapred-site.xml file and edit the property mentioned below
inside configuration tag:

mapred-site.xml contains configuration settings for MapReduce applications, such as the number of JVMs that can run in parallel, the size of the mapper and reducer processes, the CPU cores available for a process, etc.
In some cases, the mapred-site.xml file is not available. So, we have to create the mapred-site.xml file from the mapred-site.xml.template file.

Command: cp mapred-site.xml.template mapred-site.xml

Command: gedit mapred-site.xml

<?xml version="1.0" encoding="UTF-8"?>


<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>

</property>
</configuration>

Step 10: Edit yarn-site.xml and edit the property mentioned below inside
configuration tag:

yarn-site.xml contains configuration settings of the ResourceManager and NodeManager, such as application memory management sizes, auxiliary services, etc.

Command: gedit yarn-site.xml

<?xml version="1.0">
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.auxservices.mapreduce.shuffle.class</name
>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>

Step 11: Edit hadoop-env.sh and add the Java Path as mentioned below:

hadoop-env.sh contains the environment variables that are used in the scripts to run Hadoop, such as the Java home path.

Command: gedit hadoop-env.sh

Fig: Hadoop Installation – Configuring hadoop-env.sh
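Typically only the JAVA_HOME line needs to be set explicitly in hadoop-env.sh; the path below is an assumption matching the earlier .bashrc example:

export JAVA_HOME=$HOME/jdk1.8.0_101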

Step 12: Go to Hadoop home directory and format the NameNode.

Command: cd

Command: cd hadoop-2.9.0

Command: bin/hadoop namenode -format

This formats the HDFS via NameNode. This command is only executed for
the first time. Formatting the file system means initializing the directory
specified by the dfs.name.dir variable.

Step 13: Once the NameNode is formatted, go to the hadoop-2.9.0/sbin directory and start all the daemons.
Command: cd hadoop-2.9.0/sbin

Either start all daemons with a single command or do it individually.

Command: ./start-all.sh

The above command is a combination of start-dfs.sh, start-yarn.sh and mr-jobhistory-daemon.sh.

Step 14: To check that all the Hadoop services are up and running, run the
below command.

Command: jps

Fig: Hadoop Installation – Checking Daemons

Step 15: Now open the Mozilla browser and go to localhost:50070/dfshealth.html to check the NameNode interface.
Fig: Hadoop Installation – Starting WebUI

Cloud Software Environments - Eucalyptus, OpenNebula, OpenStack, Nimbus

a) Eucalyptus

 Eucalyptus is an open source software platform for implementing Infrastructure as a Service (IaaS) in a private or hybrid cloud computing environment.

 It combines existing virtualized infrastructure to create cloud resources for infrastructure as a service, network as a service and storage as a service.

 The name Eucalyptus is an acronym for Elastic Utility Computing Architecture for Linking Your Programs To Useful Systems.

 Eucalyptus was founded out of a research project in the Computer Science Department at the University of California, Santa Barbara.

 Eucalyptus Systems announced a formal agreement with Amazon Web Services (AWS) in March 2012, allowing administrators to move instances between a Eucalyptus private cloud and the Amazon Elastic Compute Cloud (EC2) to create a hybrid cloud.

 The partnership also allows Eucalyptus to work with Amazonʼs product teams to develop unique AWS-compatible features.

Eucalyptus features include:

 Supports both Linux and Windows virtual machines (VMs).

 Application programming interface (API) compatible with the Amazon EC2 platform.

 Compatible with Amazon Web Services (AWS) and Simple Storage Service
(S3).

 Works with multiple hypervisors including VMware, Xen and KVM.

 Internal process communications are secured through SOAP and WS-Security.
 Multiple clusters can be virtualized as a single cloud.

Eucalyptus Architecture

 The Eucalyptus system is an open software environment.

 The following figure shows the architecture based on the need to manage VM images.

 The system supports cloud programmers in VM image management. Essentially, the system has been extended to support the development of both the compute cloud and the storage cloud.

The Eucalyptus architecture for VM image management.

Eucalyptus comprises several components: Cloud Controller, Walrus, Cluster Controller, Storage Controller, and Node Controller. Each component is a stand-alone web service.

Cloud Controller

The Cloud Controller (CLC) is the entry-point into the cloud for
administrators, developers, project managers, and end-users.

As the interface to the management platform, the CLC is responsible for exposing and managing the underlying virtualized resources (servers, network, and storage).

Walrus

Walrus allows users to store persistent data, organized as buckets and objects. We can use Walrus to create, delete, and list buckets, or to put, get, and delete objects, or to set access control policies.

Walrus is interface-compatible with Amazonʼs Simple Storage Service (S3). It provides a mechanism for storing and accessing virtual machine images and user data.

Cluster Controller

The Cluster Controller (CC) gathers information about a set of NCs and schedules virtual machine (VM) execution on specific NCs. The CC also manages the virtual machine networks. All NCs associated with a single CC must be in the same subnet.

Storage Controller

The Storage Controller (SC) provides functionality similar to the Amazon Elastic Block Store (Amazon EBS). The SC is capable of interfacing with various storage systems.

Node Controller
The Node Controller (NC) executes on any machine that hosts VM
instances. The NC controls VM activities, including the execution, inspection,
and termination of VM instances.

VM Image Management

 Eucalyptus stores images in Walrus, the block storage system that is analogous to the Amazon S3 service.

 An image is uploaded into a user-defined bucket within Walrus, and can be retrieved anytime from any availability zone.

 The Eucalyptus system is available in a commercial proprietary version as well as the open source version.

b) OpenNebula

 OpenNebula is an open source toolkit which allows users to transform existing infrastructure into an IaaS cloud with cloud-like interfaces.

 The following figure shows the OpenNebula architecture and its main components.

 The architecture of OpenNebula has been designed to be flexible and modular to allow integration with different storage and network infrastructure configurations, and hypervisor technologies.
OpenNebula architecture and its main components

 Here, the core is a centralized component that manages the VM full life
cycle, including setting of networks dynamically for groups of VMs and
managing their storage requirements.

 Another important component is the capacity manager or scheduler.

 The default capacity scheduler is a requirement/rank matchmaker. However, it is also possible to develop more complex scheduling policies, through a lease model and advance reservations.

 The last main components are the access drivers. They provide basic
functionality of the monitoring, storage, and virtualization services
available in the cluster.

 OpenNebula is not tied to any specific environment and can provide a uniform management layer regardless of the virtualization platform.
 OpenNebula offers management interfaces to integrate the core's
functionality within other data-center management tools, such as
accounting or monitoring frameworks.

 To this end, OpenNebula implements the libvirt API, an open interface for VM management, as well as a command-line interface (CLI).

 A subset of this functionality is exposed to external users through a cloud interface.

 When the local resources are insufficient, OpenNebula can support a hybrid cloud model by using cloud drivers to interface with external clouds.

 OpenNebula currently includes an EC2 driver, which can submit requests to Amazon EC2 and Eucalyptus, as well as an ElasticHosts driver.

c) OpenStack
 OpenStack was introduced by Rackspace and NASA in July 2010.
 OpenStack is a set of software tools for building and managing cloud
computing platforms for public and private clouds.
 It focuses on the development of two aspects of cloud computing to
address compute and storage aspects with the OpenStack Compute and
OpenStack Storage solutions.
 “OpenStack Compute for creating and managing large groups of virtual
private servers”
 “OpenStack Object Storage software for creating redundant, scalable
object storage using clusters of commodity servers to store terabytes or
even petabytes of data.”

 OpenStack is an open source cloud computing platform for all types of clouds, which aims to be simple to implement, massively scalable, and feature rich.
 OpenStack provides an Infrastructure-as-a-Service (IaaS) solution
through a set of interrelated services. Each service offers an application
programming interface (API) that facilitates this integration.

OpenStack Compute
 OpenStack is developing a cloud computing fabric controller, a component of an IaaS system, known as Nova.
 Nova is an OpenStack project designed to provide massively scalable, on-demand, self-service access to compute resources.
 The architecture of Nova is built on the concepts of shared-nothing and
messaging-based information exchange.
 Hence, most communication in Nova is facilitated by message queues.
 To prevent blocking components while waiting for a response from others,
deferred objects are introduced. Such objects include callbacks that get
triggered when a response is received.
 To achieve the shared-nothing paradigm, the overall system state is kept
in a distributed data system.
 State updates are made consistent through atomic transactions.
 Nova is implemented in Python while utilizing a number of externally
supported libraries and components. This includes boto, an Amazon API
provided in Python, and Tornado, a fast HTTP server used to implement
the S3 capabilities in OpenStack.
 The Figure shows the main architecture of Open Stack Compute. In this
architecture, the API Server receives HTTP requests from boto, converts
the commands to and from the API format, and forwards the requests to
the cloud controller.

OpenStack Nova system architecture


 The cloud controller maintains the global state of the system, ensures
authorization while interacting with the User Manager via Lightweight
Directory Access Protocol (LDAP), interacts with the S3 service, and
manages nodes, as well as storage workers through a queue.
 Additionally, Nova integrates networking components to manage private
networks, public IP addressing, virtual private network (VPN) connectivity,
and firewall rules.
 It includes the following types:

• NetworkController manages address and virtual LAN (VLAN) allocations


• RoutingNode governs the NAT (network address translation)
conversion of public IPs to private IPs, and enforces firewall rules

• AddressingNode runs Dynamic Host Configuration Protocol (DHCP) services for private networks

• TunnelingNode provides VPN connectivity

OpenStack Storage

OpenStack Object Storage uses the following components to deliver high availability, high durability, and high concurrency:

 Proxy servers - Handle all of the incoming API requests.
 Rings - Map logical names of data to locations on particular disks.
 Zones - Isolate data from other zones. A failure in one zone does not
impact the rest of the cluster as data replicates across zones.
 Accounts and containers - Each account and container are individual
databases that are distributed across the cluster. An account database
contains the list of containers in that account. A container database
contains the list of objects in that container.
 Objects - The data itself.
 Partitions - A partition stores objects, account databases, and container
databases and helps manage locations where data lives in the cluster.
Object Storage building blocks

Proxy servers

Proxy servers are the public face of Object Storage and handle all of the
incoming API requests. Once a proxy server receives a request, it determines
the storage node based on the objectʼs URL.

Proxy servers use a shared-nothing architecture and can be scaled as needed based on projected workloads. A minimum of two proxy servers should be deployed for redundancy. If one proxy server fails, the others take over.

Rings

A ring represents a mapping between the names of entities stored on disks and
their physical locations. There are separate rings for accounts, containers, and
objects. When other components need to perform any operation on an object,
container, or account, they need to interact with the appropriate ring to
determine their location in the cluster.
The ring maintains this mapping using zones, devices, partitions, and replicas.
Each partition in the ring is replicated, by default, three times across the cluster,
and partition locations are stored in the mapping maintained by the ring. The
ring is also responsible for determining which devices are used for handoff in
failure scenarios.

Accounts and containers

Each account and container is an individual SQLite database that is distributed across the cluster. An account database contains the list of containers in that account. A container database contains the list of objects in that container.

Partitions

A partition is a collection of stored data. This includes account databases, container databases, and objects. Partitions are core to the replication system.
d) Nimbus

 Nimbus is a set of open source tools that together provide an IaaS cloud
computing solution.
 The following figure shows the architecture of Nimbus, which allows a
client to lease remote resources by deploying VMs on those resources
and configuring them to represent the environment desired by the user.
 To this end, Nimbus provides a special web interface known as Nimbus
Web. Its aim is to provide administrative and user functions in a friendly
interface.
 Nimbus Web is centered around a Python Django web application that is
intended to be deployable completely separate from the Nimbus service.
 As shown in Figure, a storage cloud implementation called Cumulus has
been tightly integrated with the other central services, although it can also
be used stand-alone.
 Cumulus is compatible with the Amazon S3 REST API, but extends its capabilities by including features such as quota management. Therefore, clients such as boto and s3cmd, which work against the S3 REST API, work with Cumulus.
 On the other hand, the Nimbus cloud client uses the Java Jets3t library to
interact with Cumulus. Nimbus supports two resource management
strategies.
 The first is the default “resource pool” mode. In this mode, the service
has direct control of a pool of VM manager nodes and it assumes it can
start VMs.
 The other supported mode is called “pilot.” Here, the service makes
requests to a clusterʼs Local Resource Management System (LRMS) to
get a VM manager available to deploy VMs.
 Nimbus also provides an implementation of Amazonʼs EC2 interface that
allows users to use clients developed for the real EC2 system against
Nimbus-based clouds.

Figure: Nimbus Cloud Infrastructure

Aneka in Cloud Computing

 Aneka includes an extensible set of APIs associated with programming models like MapReduce.
 These APIs support different cloud models like a private, public, or hybrid cloud.
 Manjrasoft focuses on creating innovative software technologies to simplify the development and deployment of private or public cloud applications. Its product, Aneka, plays the role of an application platform as a service for cloud computing.
 Multiple Structures:
 Aneka is a software platform for developing cloud computing applications.
 In Aneka, cloud applications are executed.
 Aneka is a pure PaaS solution for cloud computing.
 Aneka is a cloud middleware product.
 Aneka can be deployed over a network of computers, a multicore server, a data center, a virtual cloud infrastructure, or a combination thereof.
 Aneka is a pure PaaS solution for cloud computing. Aneka is a cloud
middleware product that can be deployed on a heterogeneous set of
resources: a network of computers, a multicore server, datacenters, virtual
cloud infrastructures, or a mixture of these. The framework provides both
middleware for managing and scaling distributed applications and an
extensible set of APIs for developing them.
 Figure provides a complete overview of the components of the Aneka
framework. The core infrastructure of the system provides a uniform layer
that allows the framework to be deployed over different platforms and
operating systems. The physical and virtual resources representing the
bare metal of the cloud are managed by the Aneka container, which is
installed on each node and constitutes the basic building block of the
middleware. A collection of interconnected containers constitutes the
Aneka Cloud: a single domain in which services are made available to
users and developers. The container features three different classes of
services: Fabric Services, Foundation Services, and Execution Services.
 These take care of infrastructure management, supporting services for
the Aneka Cloud, and application management and execution,
respectively. These services are made available to developers and
administrators by means of the application management and
development layer, which includes interfaces and APIs for developing
cloud applications and the management tools and interfaces for
controlling Aneka Clouds.
 Aneka framework overview.
 Anatomy of the Aneka container
 The Aneka container constitutes the building blocks of Aneka Clouds and
represents the runtime machinery available to services and applications.
The container, the unit of deployment in Aneka Clouds, is a lightweight
software layer designed to host services and interact with the underlying
operating system and hardware. The main role of the container is to
provide a lightweight environment in which to deploy services and some
basic capabilities such as communication channels through which it
interacts with other nodes in the Aneka Cloud. Almost all operations
performed within Aneka are carried out by the services managed by the
container. The services installed in the Aneka container can be classified
into three major categories:
• Fabric Services
• Foundation Services
• Application Services
 The services stack resides on top of the Platform Abstraction Layer (PAL),
representing the interface to the underlying operating system and
hardware. It provides a uniform view of the software and hardware
environment in which the container is running. Persistence and security
traverse all the services stack to provide a secure and reliable
infrastructure. In the following sections we discuss the components of
these layers in more detail.
 Fast and Simple: Task Programming Model:
 Task Programming Model provides developers with the ability of
expressing applications as a collection of independent tasks. Each task
can perform different operations, or the same operation on different data,
and can be executed in any order by the runtime environment. This is a
scenario in which many scientific applications fit in and a very popular
model for Grid Computing. Also, Task programming allows the
parallelization of legacy applications on the Cloud.
 Concurrent Applications: Thread Programming Model
 Thread Programming Model offers developers the capability of running
multithreaded applications on the Aneka Cloud. The main abstraction of
this model is the concept of thread which mimics the semantics of the
common local thread but is executed remotely in a distributed
environment. This model offers finer control on the execution of the
individual components (threads) of an application but requires more
management when compared to Task Programming, which is based on a
“submit and forget” pattern.
 The Aneka Thread supports almost all of the operations available for traditional local threads. More specifically, an Aneka thread has been designed to mirror the interface of the System.Threading.Thread .NET class, so that developers can easily move existing multi-threaded applications to the Aneka platform with minimal changes. Ideally,
applications can be transparently ported to Aneka just by substituting
local threads with Aneka Threads and introducing minimal changes to the
code. This model covers all the application scenarios of the Task
Programming and solves the additional challenges of providing a
distributed runtime environment for local multi-threaded applications.
 Data Intensive Applications: MapReduce Programming Model
 The MapReduce Programming Model is an implementation of the MapReduce model proposed by Google, in .NET on the Aneka platform. MapReduce has been designed to process huge quantities of data by using simple operations that extract useful information from a dataset (the map function) and aggregate this information together (the reduce function) to produce the final results. Developers provide the logic for these two operations and the dataset, and Aneka will do the rest, making the results accessible when the application is completed.
