Shortnotes For Cloud


Q1)HADOOP

a) Big Data
- Big Data refers to a collection of large datasets that cannot be processed using
traditional computing techniques.
Big Data Challenges
1. Volume – Today, data sizes have grown to terabytes in the form of records or transactions.
2. Variety – There is a huge variety of data from internal, external, behavioral, and/or social
sources. Data can be structured, semi-structured, or unstructured.
3. Velocity – Data arrives in huge volumes and must be assimilated in near real time.

b) Hadoop Definition :(distributed storage and processing)


- Hadoop is an open-source software framework for storing data and running
applications on clusters of commodity hardware.
- Hadoop is an Apache open source framework written in java that allows distributed
processing of large datasets across clusters of computers using simple programming models.

- It provides a software framework for distributed storage and distributed computing.


- Input data is broken into blocks (minimum block size 64 MB in Hadoop; 128 MB/256 MB in
distributions such as Cloudera), and the blocks are distributed to different nodes.
- It provides massive storage for any kind of data, enormous processing power and
the ability to handle virtually limitless concurrent tasks or jobs.

- Hadoop is a framework that allows you to first store Big Data in a distributed
environment, so that you can process it in parallel.

- It provides an efficient framework for running jobs on multiple nodes of clusters.

- Cluster means a group of systems connected via LAN.

- It uses commodity hardware for data storage and analysis.

- A Hadoop framework application works in an environment that provides distributed
storage and computation across clusters of computers.

- Cluster creation is easy using Commodity hardware.


- Commodity hardware = compatible hardware that can function on a plug-and-play
basis (normal PCs).
Hadoop block size:
1) 128 MB (Hadoop 2.x) 2) 64 MB (Hadoop 1.x)
- Data is stored across multiple hard drives.

c)HADOOP versions

d)Hadoop Ecosystem components - it is a platform or framework which solves big
data problems
• HDFS -> Hadoop Distributed File System
• YARN -> Yet Another Resource Negotiator
• MapReduce -> Data processing using programming
• Spark -> In-memory Data Processing
• PIG, HIVE-> Data Processing Services using Query (SQL-like)
• HBase -> NoSQL Database
• Mahout, Spark MLlib -> Machine Learning
• Apache Drill -> SQL on Hadoop
• Zookeeper -> Managing Cluster
• Oozie -> Job Scheduling
• Flume, Sqoop -> Data Ingesting Services
• Solr & Lucene -> Searching & Indexing
• Ambari -> Provision, Monitor and Maintain cluster

d)HADOOP architecture/component

e)HADOOP Components
1.HDFS for storage (Hadoop Distributed File System)
- HDFS allows you to store data of various formats across a cluster.
- HDFS splits a file into fixed-size blocks (e.g., 64 MB) and stores them on worker
DataNodes.
- HDFS is highly fault tolerant and designed using low-cost hardware.
- HDFS holds very large amounts of data and provides easier access.
- Similar to virtualization, you can see HDFS logically as a single unit for storing Big
Data, but actually your data is stored across multiple nodes in a distributed
fashion.
- HDFS follows a master-slave architecture.
HDFS Features:
1) Fault tolerance – HDFS is highly fault tolerant and designed using low-cost hardware.
2) Large data storage – HDFS holds very large amounts of data and provides easier access.

HDFS Operations
1. Read a file 2. Write a file
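
A minimal sketch of these two operations using the Hadoop Java FileSystem API is given below; the HDFS path /user/demo/hello.txt is only an assumed example, and the configuration is expected to come from the cluster's core-site.xml / hdfs-site.xml.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();          // loads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/user/demo/hello.txt");      // assumed example path

        // Write a file: the NameNode allocates blocks, DataNodes store them
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.writeBytes("hello hdfs\n");
        }

        // Read the file back: the client reads block data from the DataNodes
        try (FSDataInputStream in = fs.open(path);
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            System.out.println(reader.readLine());
        }
        fs.close();
    }
}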

HDFS Architecture

1. Name Node
2. Data Node
3. HDFS Client
4. HDFS Blocks

a) Master Node (NameNode)
- The NameNode stores the directory tree of all files in the file system.
- The NameNode manages the file system namespace.
- It regulates clients' access to files.
- It also executes file system operations such as renaming, closing, and opening
files and directories.

b) Slave Node (DataNode)
- It stores data in the Hadoop file system.
- DataNodes perform read-write operations on the file system, as per client requests.
- They also perform operations such as block creation, deletion, and replication according to the
instructions of the NameNode.
c)Block
Generally, user data is stored in the files of HDFS. A file in the file system is
divided into one or more segments and/or stored on individual DataNodes.
These file segments are called blocks.
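
As a small illustrative calculation (assuming a 1 GB file and the 128 MB default block size of Hadoop 2.x), the number of blocks a file occupies can be worked out as follows:

// Illustrative only: how many HDFS blocks a file of a given size occupies
public class BlockCount {
    public static void main(String[] args) {
        long fileSize  = 1024L * 1024 * 1024;                    // 1 GB file (assumed)
        long blockSize = 128L  * 1024 * 1024;                    // 128 MB block (Hadoop 2.x default)
        long blocks    = (fileSize + blockSize - 1) / blockSize; // ceiling division
        System.out.println(blocks + " blocks");                  // prints "8 blocks"
    }
}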

2. YARN (Yet Another Resource Negotiator)


- YARN handles resource management in Hadoop. It allows parallel processing over the
data stored in HDFS.
- YARN performs all your processing activities by allocating resources and
scheduling tasks.
- YARN is called the operating system of Hadoop as it is responsible for
managing and monitoring workloads.
- It allows multiple data processing engines, such as real-time streaming and
batch processing, to handle data stored on a single platform.
3) MapReduce (programming model, engine, distributed data processing model)
➢ MapReduce is the computation engine running on top of HDFS, which acts as its data storage manager.

- MapReduce is the processing engine of Hadoop. It works on the principle of distributed
processing: it divides the task submitted by the user into a number of independent
subtasks, and these subtasks execute in parallel, thereby increasing throughput.
- MapReduce processes and computes large volumes of data.
- The general idea of the MapReduce algorithm is to process the data in parallel on your
distributed cluster and subsequently combine it into the desired result or output.
- A MapReduce job comprises a number of map tasks and reduce tasks.

Features of MapReduce
1) Synchronization
2) Data locality
3) Error handling
4) Scheduling
MAP-REDUCE WORKING MODEL (distributed data processing model for large data sets)
• Hadoop MapReduce is a software framework for easily writing applications which
process big amounts of data in parallel on large clusters (thousands of nodes) of
commodity hardware in a reliable, fault-tolerant manner.
• MapReduce is a programming model designed to process big volumes of data.
This is done in parallel by dividing the work into a set of independent tasks.
• The MapReduce framework consists of a single master JobTracker and one
slave TaskTracker per cluster node.
• The master JobTracker is responsible for resource management, tracking resource
consumption/availability, and scheduling the job's component tasks on the slaves, monitoring
them and re-executing the failed tasks.
• The slave TaskTrackers execute the tasks as directed by the master and provide task-status
information to the master periodically.
• The JobTracker is a single point of failure for the Hadoop MapReduce service, which means that if
the JobTracker goes down, all running jobs are halted.

MAP-REDUCE TASK steps

1. Map task
2. Reduce task

1. The Map Task: This is the first task, which takes input data and converts it into a set of
data where individual elements are broken down into tuples (key/value pairs).
2. The Reduce Task: This task takes the output from a map task as input and combines those
data tuples into a smaller set of tuples. The reduce task is always performed after the map
task.
Typically both the input and the output are stored in a file system. The framework takes care
of scheduling tasks, monitoring them, and re-executing the failed tasks.

Functions
a) Map() - The Map Task: this is the first task, which takes input data and converts it into a set of
data where individual elements are broken down into tuples (key/value pairs).

Syntax:
public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter
reporter) throws IOException

b) Reduce() - The Reduce Task takes the output from a map task as input and combines those
data tuples into a smaller set of tuples. The reduce task is always performed after the map task.

Syntax:
public void reduce(Text t_key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable>
output, Reporter reporter) throws IOException
{
}

Overall map-reduce functions

Syntax for map-reduce functions:


EXAMPLE: word count program using the HADOOP framework
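
A compact, minimal sketch of the classic word count job, written against the older org.apache.hadoop.mapred API to match the map()/reduce() signatures shown above; the class name WordCount and the input/output paths taken from the command-line arguments are illustrative assumptions:

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WordCount {

    // Mapper: emits (word, 1) for every word in the input line
    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                output.collect(word, one);
            }
        }
    }

    // Reducer: sums the counts emitted for each word
    public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    // Job driver: configures the job and submits it to the cluster
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(Map.class);
        conf.setReducerClass(Reduce.class);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));   // HDFS input directory (assumed)
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));  // HDFS output directory (assumed)
        JobClient.runJob(conf);
    }
}

The mapper emits (word, 1) pairs and the reducer sums the counts for each word, mirroring the map and reduce task descriptions given earlier.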

Hadoop Processing Steps

Stage 1(application can submit a job to the Hadoop )

A user/application can submit a job to the Hadoop (a hadoop job client) for required process by
specifying the following items:

The location of the input and output files in the distributed file system.

The java classes in the form of jar file containing the implementation of map and reduce functions.

The job configuration by setting different parameters specific to the job.

Stage 2- responsibility of distributing the software

The Hadoop job client then submits the job (jar/executable etc) and configuration to the JobTracker
which then assumes the responsibility of distributing the software/configuration to the slaves,
scheduling tasks and monitoring them, providing status and diagnostic information to the job-client.

Stage 3 different nodes execute the task

The TaskTrackers on different nodes execute the task as per MapReduce implementation and output
of the reduce function is stored into the output files on the file system.
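
For example, once classes such as the word count program above are packaged into a jar, the job client can typically submit them with a command of the form hadoop jar wordcount.jar WordCount <input path> <output path>, where the input path already exists in HDFS and the output path does not.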
Advantages of Hadoop
• Open Source
• Distributed Processing
• Fault Tolerance
• Reliability
• High Availability
• Scalability
• Economic
• Easy to use
• Data Locality
Other advantages
Hadoop framework allows the user to quickly write and test distributed systems.

It is efficient, and it automatically distributes the data and work across the machines and, in turn, utilizes
the underlying parallelism of the CPU cores.

Hadoop does not rely on hardware to provide fault-tolerance and high availability (FTHA), rather
Hadoop library itself has been designed to detect and handle failures at the application layer.

Servers can be added or removed from the cluster dynamically and Hadoop continues to operate
without interruption.

Another big advantage of Hadoop is that, apart from being open source, it is compatible with all
platforms since it is Java based.
VIRTUAL BOX / VIRTUAL MACHINE MONITOR (VMM) / HYPERVISOR

- x86 virtualization software package.
- It is cross-platform virtualization software.
- It lets users extend their existing computer to run multiple operating systems at the same time.
- VirtualBox runs on Microsoft Windows, Mac OS, Linux, and Solaris systems.
- It is a Type II (hosted) hypervisor that can be installed on an existing host operating system
as an application.
- Definitions:
1) Guest OS (virtual OS) - has no direct communication with the CPU hardware.
2) Host OS - communicates directly with the CPU hardware.
3) VM (virtual machine) - a simulation of an actual operating system sharing the CPU; a virtual
machine (VM) is a digital version of a physical computer.
- It supports both fully virtualized and paravirtualized environments.
- VirtualBox also supports VMware disk images (.vmdk) and Virtual Hard Disk (.vhd) images.
Google App Engine(GAE)
Google App Engine (GAE) is a PaaS (Platform-as-a-Service) cloud computing model that supports many
programming languages.

GAE is a scalable runtime environment mostly devoted to executing web applications.

Google App Engine is Google's platform to build web applications on the cloud.

Dynamic Web server with full support for common web technologies.

By using Google’s App Engine, there are no servers to maintain and no administrators needed.

In fact, it allows developers to integrate third-party frameworks and libraries with the infrastructure
still being managed by Google.

It allows developers to use readymade platform to develop and deploy web applications using
development tools, runtime engine, databases and middleware solutions.

It supports languages like Java, Python, .NET, PHP, Ruby, Node.js and Go in which developers can
write their code and deploy it on available google infrastructure with the help of Software
Development Kit (SDK).

In GAE, SDKs are required to set up your computer for developing, deploying, and managing your
apps in App Engine.

GAE enables users to run their applications on a large number of data centers associated with Google’s
search engine operations.

Presently, Google App Engine is a fully managed, serverless platform that allows developers to choose from
several popular languages, libraries, and frameworks to develop user applications; App Engine then takes
care of provisioning servers and scaling application instances based on demand.

Sandbox – isolates your application in its own secure, reliable environment that is independent of the
hardware, operating system, and physical location of the web server.

GWT (Google Web Toolkit) is available for Java web application developers.

GAE Runtime environment:

1. Sandboxing – provides an isolated and protected execution environment.

2. Supported runtimes – Java, Python, and Go.
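
A minimal sketch of a request handler for the Java runtime is shown below; the class name is hypothetical, and in the classic App Engine standard environment such a servlet would be mapped to a URL via web.xml in a standard WAR layout.

import java.io.IOException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Minimal request handler running inside the App Engine sandbox
public class HelloAppEngineServlet extends HttpServlet {
    @Override
    public void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        resp.setContentType("text/plain");
        resp.getWriter().println("Hello from Google App Engine");
    }
}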

Storage
1) Static file servers

2) Datastore – the Datastore is a service that allows developers to store semi-structured data.

Bigtable – a redundant, distributed, and semi-structured data store that organizes data in the
form of tables.
Developers define their data in terms of entities and properties.

• A data object is called an "Entity"; it has a kind (~ table name) and a set of properties
(~ column names), as sketched below.
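
A small sketch using the classic App Engine low-level Datastore API for Java; the kind "Employee", the helper class name, and the property values are assumptions for illustration, and the helper is meant to be called from a request handler such as the servlet above.

import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.Key;

// Stores one entity of kind "Employee" with two properties
public class EmployeeStore {
    public static Key saveEmployee(String name, String department) {
        DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();
        Entity employee = new Entity("Employee");        // kind ~ table name (assumed)
        employee.setProperty("name", name);              // properties ~ column names
        employee.setProperty("department", department);
        return datastore.put(employee);                  // persists the entity and returns its key
    }
}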

Application services
1) UrlFetch 2) MemCache 3) Mail and instant messaging 4) Account management
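
As an example of one of these services, a minimal sketch of the App Engine Memcache API for Java; the key, value, and class name are illustrative assumptions, and the helper is meant to be called from a request handler.

import com.google.appengine.api.memcache.MemcacheService;
import com.google.appengine.api.memcache.MemcacheServiceFactory;

// Caches a greeting string and reads it back through the Memcache service
public class GreetingCache {
    public static String getGreeting() {
        MemcacheService cache = MemcacheServiceFactory.getMemcacheService();
        String cached = (String) cache.get("greeting");   // null if absent or evicted
        if (cached == null) {
            cached = "Hello from Memcache";                // assumed value for illustration
            cache.put("greeting", cached);                 // store it for later requests
        }
        return cached;
    }
}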

Compute services
1) Task queues 2) Cron jobs

Cost model
1. Billable quotas 2. Fixed quotas 3. Per-minute quotas

Why App Engine?

Google App Engine (GAE)

•Lower total cost of ownership

•Rich set of APIs

•Fully featured SDK for Local development

•Ease of Deployment

Application Development Life Cycle of GAE


architecture for GAE

Infrastructure for GAE is composed of four main components


1)Google File System (GFS),

2)MapReduce,

3) BigTable,

4) Chubby

Functional modules of GAE (GAE platform )


1) Application runtime environment
2) Software Development Kit (SDK)
3) Datastore
4) Admin console
5) GAE web service

• Application runtime environment offers a platform that has a built-in execution engine for scalable web
programming and execution.
• Software Development Kit (SDK) for local application development and deployment over the Google
cloud platform. The SDK is used for local application development and allows users to execute test
runs of local applications and upload application code.

• Datastore to provision object-oriented, distributed, structured data storage for applications and
data. It also provides secure data management operations based on BigTable techniques.

• Admin console used for easy management of user application development and resource
management. The administration console is used for easy management of user application
development cycles, rather than for physical resource management.

• GAE web service for providing APIs and interfaces. The GAE web service infrastructure provides
special interfaces to guarantee flexible use and management of storage and network
resources by GAE.

Programming Environment for Google App Engine


- GAE programming model for two supported languages: Java and Python.
- GWT( Google Web Toolkit) is available for Java web application developers
- Python is often used with frameworks such as Django and CherryPy.
- The data store is a NOSQL data management system for entities.
- Java offers Java Data Objects (JDO) and Java Persistence API (JPA) interfaces implemented by
the open source DataNucleus Access Platform.
- Google added the Blobstore, which is suitable for large files, as its size limit is 2 GB.
- The Google SDC (Secure Data Connection) can tunnel through the Internet and link your
intranet to an external GAE application.
- Your application can perform tasks on a schedule that you configure, such as on a daily
or hourly basis, using "cron jobs" handled by the Cron service.

1) Google File System (GFS) - a distributed file processing system providing high availability, high
performance, reliability, and scalability. GFS provides a file system interface and different APIs.

The GFS provides the following features: • Large-scale data processing and storage support • Normal
treatment for components that stop responding

Google File System is a proprietary distributed file system developed by Google to provide efficient,
reliable access to data using large clusters of commodity hardware.

- It is the fundamental storage service for Google's search engine.

- Google needed a distributed file system to redundantly store massive amounts of data on cheap and
unreliable computers.

- GFS typically holds a large number of huge files, each 100 MB or larger, with files that are multiple
GB in size quite common.
Thus, Google chose its file data block size to be 64 MB instead of the 4 KB used in typical traditional
file systems.

- GFS was designed for high fault tolerance and adopted some methods to achieve this goal.

- GFS will anticipate any commodity hardware outages caused by both software and hardware
faults.

- GFS accepts a modest number of large files.

- GFS has well-defined semantics for multiple clients with minimal synchronization overhead.

- Consistently high sustained throughput for file storage.

- The GFS was designed to meet many of the same goals as pre-existing distributed file
systems, including scalability, performance, reliability, and robustness.

2) MapReduce – a distributed programming model for large data sets.

3) BigTable - Google's BigTable is a distributed storage system that allows storing huge volumes of
structured as well as unstructured data on storage media.

BigTable building blocks:

1. GFS: stores persistent state 2. Scheduler: schedules jobs involved in BigTable serving

3. Lock service: master election, location bootstrapping

4. MapReduce: often used to read/write BigTable data

• The BigTable system is scalable, which means the system has thousands of servers, terabytes
of in-memory data, petabytes of disk-based data, millions of reads/writes per second, and
efficient scans.

• BigTable is a self-managing system (i.e., servers can be added/removed dynamically and it
features automatic load balancing).

-BigTable is used in many projects, including Google Search, Orkut, and Google Maps/Google Earth

4) Chubby - a distributed locking service that is used for synchronizing distributed activities.

Chubby is used extensively inside Google in various systems such as GFS and BigTable. The primary goal
is to provide a reliable lock service.

Chubby also provides reliable storage with consistent availability.

It is designed for use with loosely coupled distributed systems that are connected over a high-speed
network and contain several small-sized machines.
GAE Applications
- Best-known GAE applications include the Google Search Engine, Google Docs,
Google Earth and Gmail. These applications can support large numbers of users
simultaneously.
- Users can interact with Google applications via the web interface provided by each
application.
-The applications are all run in the Google data centers.

GOOGLE APIs
Google developed a set of Application Programming Interfaces (APIs) that can be used to
communicate with Google services. This set of APIs is referred to as the Google APIs; they also allow
integration with other services.

Services by GAE

OPENSTACK
OpenStack is an open-source cloud operating system.

OpenStack provides Infrastructure-as-a-Service (IaaS) to its users to enable them to manage virtual
private servers in their data centers.

OpenStack is a cloud OS that is used to control the large pools of computing, storage, and networking
resources within a data center. OpenStack is an open-source and free software platform.

The main objective of OpenStack is to provide a cloud computing platform that is :

• Global • Open-source • Freely available • Easy to use • Highly and easily scalable • Easy to implement
• Interoperable

Features and Benefits of OpenStack

Compatibility : OpenStack supports both private and public clouds and is very easy to deploy and
manage. OpenStack APIs are supported in Amazon Web Services. The compatibility eliminates the
need for rewriting applications for AWS, thus enabling easy portability between public and private
clouds.

• Security : OpenStack addresses the security concerns, which are the top- most concerns for most
organisations, by providing robust and reliable security systems.

• Real-time Visibility : OpenStack provides real-time client visibility to administrators, including
visibility of resources and instances, thus enabling administrators and providers to track what clients
are requesting.

• Live Upgrades : This feature allows upgrading services without any downtime. Earlier, upgrades
required shutting down complete systems, which resulted in loss of performance. Now, OpenStack
enables upgrading systems while they are running, requiring only individual components to shut down.
Components of OpenStack

Nova : This is one of the primary services of OpenStack, which provides numerous tools for the
deployment and management of a large number of virtual machines. Nova is the compute service of
OpenStack.

• Swift : Swift provides storage services for storing files and objects. Swift can be equated with
Amazon’s Simple Storage System (S3).

• Cinder : This component provides block storage to Nova Virtual Machines. Its working is similar to a
traditional computer storage system where the computer is able to access specific locations on a disk
drive. Cinder is analogous to AWS’s EBS.

• Glance : Glance is OpenStack's image service component that provides virtual templates (images) of
hard disks. These templates can be used for new VMs. Glance may use either Swift or flat files to store
these templates.

• Neutron (formerly known as Quantum) : This component of OpenStack provides Networking-as-a-Service,
Load-Balancer-as-a-Service, and Firewall-as-a-Service. It also ensures communication
between the other components.

• Heat : It is the orchestration component of OpenStack. It allows users to manage the infrastructural
needs of applications by allowing the storage of requirements in files.

• Keystone : This component provides identity management in OpenStack.
• Horizon : This is the dashboard of OpenStack, which provides a graphical interface.

• Ceilometer : This component of OpenStack provisions meters and billing models for users of the
cloud services. It also keeps an account of the resources used by each individual user of the OpenStack
cloud. Let us also discuss some of the non-core components of OpenStack and their offerings.

• Trove : Trove is a component of OpenStack that provides Database-as-a-Service. It provisions
relational databases and big data engines.

• Sahara : This component provisions Hadoop to enable the management of data processors.

• Zaqar : This component allows messaging between distributed application components.

• Ironic : Ironic provisions bare-metal servers, which can be used as a substitute for VMs.


Federation in the Cloud :
Federated cloud is created by connecting the cloud environment of different cloud providers using a
common standard.

Cloud Federation, also known as Federated Cloud is the deployment and management of several
external and internal cloud computing services to match business needs.

It is a multi-national cloud system that integrates private, community, and public clouds into scalable
computing platforms.

Cloud federation manages consistency and access controls when two or more independent,
geographically distinct clouds share authentication, files, computing resources, command
and control, or access to storage resources.

The architecture of Federated Cloud:

1. Cloud Exchange 2. Cloud Coordinator 3.Cloud Broker


4 Types/Four Levels of Federation:
1. Permissive federation
2. Verified Federation
3. Encrypted Federation
4. Trusted Federation

Levels of Federation

• Permissive federation
Permissive federation allows the interconnection of the cloud environments of
two service providers without verifying the identity of the peer cloud
using DNS lookups. This raises the chances of domain spoofing.

• Verified Federation
Verified federation allows interconnection of the cloud environments of
two service providers only after the peer cloud is identified using the
information obtained from DNS. Though the identity verification
prevents spoofing, the connection is still not encrypted and there are
chances of a DNS attack.

• Encrypted Federation
Encrypted federation allows interconnection of the cloud environments of
two service providers only if the peer cloud supports transport layer
security (TLS). The peer cloud interested in the federation must provide a
digital certificate, which still provides mutual authentication; however,
because the certificate is not validated by a certification authority,
encrypted federation results in weak identity verification.

• Trusted Federation
Trusted federation allows two clouds from different providers to
connect only under the provision that the peer cloud supports TLS and,
along with that, provides a digital certificate authorized by
a certification authority (CA) that is trusted by the authenticating
cloud.

Advantages of Federated Cloud


1. Federated cloud allows scaling up of resources.
2. Federated cloud increases reliability.
3. Federated cloud has increased collaboration of cloud resources.
4. Connects multiple cloud service providers globally to let providers buy
and sell their services on demand.
5. Dynamic scalability reduces the cost and time of providers.
