
CCS335 - Cloud Computing

Unit IV
CLOUD DEPLOYMENT ENVIRONMENT
Topic: Programming Environment for GAE

By
Dr. M.Gomathy Nayagam
Associate Professor/CSBS
Ramco Institute of Technology
Rajapalayam
Introduction
 The figure summarizes some key features of the GAE programming model for the two supported languages: Java and Python.
 A client environment that includes an Eclipse plug-in for Java allows you to debug your GAE application on your local machine.
 Also, the Google Web Toolkit (GWT) is available for Java web application developers.
 Developers can use Java, or any other language with a JVM-based interpreter or compiler, such as JavaScript or Ruby.
 Python is often used with frameworks such as Django and CherryPy, but Google also supplies a built-in webapp Python environment.
 There are several powerful constructs for storing and
accessing data.
 The data store is a NoSQL data management system for entities that can be, at most, 1 MB in size and are labeled by a set of schema-less properties.
 Queries can retrieve entities of a given kind, filtered and sorted by the values of their properties.
 Java offers the Java Data Objects (JDO) and Java Persistence API (JPA) interfaces, implemented by the open source DataNucleus Access Platform, while Python has a SQL-like query language called GQL (see the sketch below).
 The data store is strongly consistent and uses
optimistic concurrency control.
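As a concrete illustration, here is a minimal sketch of a datastore entity and a GQL query using the classic GAE Python db API; the Greeting kind and its properties are invented for this example.

    from google.appengine.ext import db

    class Greeting(db.Model):
        # An entity kind; properties label the otherwise schema-less entity
        author = db.StringProperty()
        content = db.TextProperty()
        date = db.DateTimeProperty(auto_now_add=True)

    # GQL: retrieve entities of kind Greeting, filtered and sorted by
    # property values; fetch at most 10 results
    results = db.GqlQuery(
        "SELECT * FROM Greeting WHERE author = :1 ORDER BY date DESC",
        "alice").fetch(10)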
 An update of an entity occurs in a transaction that is retried a
fixed number of times if other processes are trying to update
the same entity simultaneously.
 Your application can execute multiple data store operations in a single transaction, which either all succeed or all fail together.
 The data store implements transactions across its distributed
network using “entity groups.”
 A transaction manipulates entities within a single group.
 Entities of the same group are stored together for efficient
execution of transactions.
 Your GAE application can assign entities to groups when the
entities are created.
 The performance of the data store can be enhanced by in-
memory caching using the memcache, which can also be used
independently of the data store.
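A minimal sketch of a transactional update plus independent memcache use, again with the classic Python db API; the Counter kind and key name are hypothetical:

    from google.appengine.ext import db
    from google.appengine.api import memcache

    class Counter(db.Model):
        count = db.IntegerProperty(default=0)

    def increment(key):
        # Runs inside a transaction; the data store retries it a fixed
        # number of times if another process updates the entity concurrently
        counter = db.get(key)
        counter.count += 1
        counter.put()

    key = Counter(key_name="page-hits").put()
    db.run_in_transaction(increment, key)

    # memcache can also be used independently of the data store
    memcache.set("page-hits-cached", 1, time=3600)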
 Recently, Google added the blobstore, which is suitable for large files, as its size limit is 2 GB.
 There are several mechanisms for incorporating external resources.
 The Google Secure Data Connector (SDC) can tunnel through the Internet and link your intranet to an external GAE application.
 The URL Fetch operation provides the ability for
applications to fetch resources and communicate
with other hosts over the Internet using HTTP and
HTTPS requests.
 There is a specialized mail mechanism to send e-
mail from your GAE application.
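For example, a handler might use the URL Fetch and Mail services as follows; the URLs and addresses are placeholders:

    from google.appengine.api import urlfetch, mail

    # Fetch an external resource over HTTPS via the URL Fetch service
    result = urlfetch.fetch("https://example.com/report.json")
    if result.status_code == 200:
        data = result.content

    # Send e-mail from the application via the Mail service
    mail.send_mail(sender="admin@your-app-id.appspotmail.com",
                   to="user@example.com",
                   subject="Report ready",
                   body="Your nightly report has been generated.")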
Applications can access resources on the
Internet, such as web services or other
data, using GAE’s URL fetch service.
The URL fetch service retrieves web resources using the same high-speed Google infrastructure that retrieves web pages for many other Google products.
There are dozens of Google “corporate”
facilities including maps, sites, groups,
calendar, docs, and YouTube, among
others.
These support the Google Data API which
can be used inside GAE.
 An application can use Google Accounts for user
authentication.
 Google Accounts handles user account creation and
sign-in, and a user that already has a Google account
(such as a Gmail account) can use that account with
your app.
 GAE provides the ability to manipulate image data
using a dedicated Images service which can resize,
rotate, flip, crop, and enhance images.
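A short sketch of both services using the Python SDK; the file name and thumbnail dimensions are placeholders:

    from google.appengine.api import users, images

    # Google Accounts: identify the signed-in user, or build a sign-in URL
    user = users.get_current_user()
    if user is None:
        login_url = users.create_login_url("/")

    image_data = open("photo.png", "rb").read()   # placeholder image bytes
    # Resize with the Images service; it can also rotate, flip, crop,
    # and enhance images
    thumbnail = images.resize(image_data, width=120, height=120)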
 An application can perform tasks outside of responding
to web requests.
 Your application can perform these tasks on a schedule
that you configure, such as on a daily or hourly basis
using “cron jobs,” handled by the Cron service.
Alternatively, the application can perform
tasks added to a queue by the application
itself, such as a background task created
while handling a request.
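For instance, a request handler can enqueue such a background task with the Task Queue API; the /worker URL is a hypothetical handler in the same application:

    from google.appengine.api import taskqueue

    # Enqueue work while handling a request; a handler mapped to /worker
    # receives this as an HTTP POST later, outside the user request
    taskqueue.add(url="/worker", params={"entity_key": "some-key"})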
A GAE application is configured to consume
resources up to certain limits or quotas.
With quotas, GAE ensures that your
application won’t exceed your budget, and
that other applications running on GAE
won’t impact the performance of your app.
In particular, GAE use is free up to certain
quotas.
Google File System (GFS)
 GFS was built primarily as the fundamental storage service
for Google’s search engine.
 As the size of the web data that was crawled and saved
was quite substantial, Google needed a distributed file
system to redundantly store massive amounts of data on
cheap and unreliable computers.
 No traditional distributed file system could provide such functions and hold such large amounts of data.
 In addition, GFS was designed for Google applications,
and Google applications were built for GFS.
 In traditional file system design, such a philosophy is not
attractive, as there should be a clear interface between
applications and the file system, such as a POSIX
interface.
There are several assumptions concerning GFS.
 One is related to the characteristic of the cloud
computing hardware infrastructure (i.e., the
high component failure rate).
As servers are composed of inexpensive commodity components, concurrent failures are the norm rather than the exception.
Another concerns the file size in GFS.
GFS typically holds a large number of huge files, each 100 MB or larger, with multi-GB files quite common.
Thus, Google chose its file data block (chunk) size to be 64 MB, instead of the 4 KB used in typical traditional file systems.
The I/O pattern in the Google application is
also special.
 Files are typically written once, and the write operations often append data blocks to the end of files.
 Multiple append operations might be concurrent.
 There will be many large streaming reads and only a little random access.
 Thus, Google made some special decisions
regarding the design of GFS.
 As noted earlier, a 64 MB block size was chosen.
 Reliability is achieved through replication: each chunk or data block of a file is replicated across three or more chunk servers.
 A single master coordinates access and keeps the metadata.
 This decision simplified the design and
management of the whole cluster.
 Developers do not need to consider many difficult
issues in distributed systems, such as distributed
consensus.
There is no data cache in GFS, because large streaming reads and writes exhibit neither temporal nor spatial locality.
GFS provides a file system interface similar to, but not identical with, POSIX.
 The distinct difference is that the application can see the physical location of file blocks.
Such a scheme can improve performance for the upper-layer applications.
The customized API simplifies the problem and lets the design focus on Google's applications.
 Figure shows the GFS architecture.
 It is quite obvious that there is a
single master in the whole cluster.
 Other nodes act as the chunk servers
for storing data, while the single
master stores the metadata.
 The file system namespace and
locking facilities are managed by the
master.
 The master periodically communicates with the chunk servers to collect management information and to give the chunk servers instructions for work such as load balancing or failure recovery.
 The master has enough information to keep the whole cluster
in a healthy state.
 With a single master, many complicated distributed algorithms
can be avoided and the design of the system can be simplified.
 However, this design does have a potential weakness, as the
single GFS master could be the performance bottleneck and
the single point of failure.
 To mitigate this, Google uses a shadow master to replicate all
the data on the master, and the design guarantees that all the
data operations are performed directly between the client and
the chunk server.
 The control messages are transferred between the master and the clients, and they can be cached for future use.
 With the current quality of commodity servers, the single
master can handle a cluster of more than 1,000 nodes.
Data Mutation Sequence in GFS
Figure shows the data mutation (write and append operations) in GFS.
Data blocks must be created for all replicas.
The goal is to minimize involvement of the master.
 The mutation takes the following steps:
 1. The client asks the master which chunk server holds the current lease for the chunk and the locations of the other replicas. If no one has a lease, the master grants one to a replica it chooses (not shown).
 2. The master replies with the identity of the primary and the
locations of the other (secondary) replicas. The client caches
this data for future mutations. It needs to contact the master
again only when the primary becomes unreachable or replies
that it no longer holds a lease.
 3. The client pushes the data to all the replicas. A client can do
so in any order. Each chunk server will store the data in an
internal LRU buffer cache until the data is used or aged out. By
decoupling the data flow from the control flow, we can improve
performance by scheduling the expensive data flow based on
the network topology regardless of which chunk server is the
primary.
 4. Once all the replicas have acknowledged receiving the data, the
client sends a write request to the primary. The request identifies the
data pushed earlier to all the replicas. The primary assigns
consecutive serial numbers to all the mutations it receives, possibly
from multiple clients, which provides the necessary serialization. It
applies the mutation to its own local state in serial order.
 5. The primary forwards the write request to all secondary replicas.
Each secondary replica applies mutations in the same serial number
order assigned by the primary.
 6. The secondaries all reply to the primary indicating that they have
completed the operation.
 7. The primary replies to the client. Any errors encountered at any of the replicas are reported to the client. In case of errors, the write may have succeeded at the primary and an arbitrary subset of the secondary replicas. The client request is considered to have failed, and the modified region is left in an inconsistent state. The client code handles such errors by retrying the failed mutation. It will make a few attempts at steps 3 through 7 before falling back to a retry from the beginning of the write.
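The control flow above can be condensed into a short sketch. All names here (the master and replica objects and their methods) are illustrative stubs, not a real GFS client API:

    def gfs_write(client, chunk_handle, data):
        # Steps 1-2: ask the master for the lease holder (primary) and the
        # secondary replica locations; the reply is cached for later writes
        primary, secondaries = client.master.get_lease(chunk_handle)

        # Step 3: push data to every replica, in any order; each replica
        # keeps it in an internal LRU buffer cache until used or aged out
        for replica in [primary] + secondaries:
            replica.push_data(chunk_handle, data)

        # Step 4: send the write request to the primary, which assigns a
        # serial number and applies the mutation to its local state.
        # Steps 5-6: the primary forwards the request to the secondaries,
        # which apply it in the same serial order and acknowledge.
        # Step 7: the primary reports success or any replica errors.
        ok = primary.commit_write(chunk_handle)
        if not ok:
            raise IOError("retry steps 3-7, then restart the write")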
 Thus, besides the write operation, GFS provides special append operations for adding data blocks to the end of files.
 The reason for providing such operations is that some Google applications need a large number of appends.
 For example, while crawlers are gathering data from the web, the contents of web pages are appended to page files.
 Thus, the append operation is provided and optimized.
 The client specifies the data to be appended, and GFS appends it to the file atomically at least once.
 GFS picks the offset; the clients cannot decide the offset of the data position.
 The append operation works for concurrent writers.
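In a client library, an append might look like the following sketch; gfs_client and record_append are hypothetical names, not a published GFS API:

    def store_crawled_page(gfs_client, page_bytes):
        # The client supplies only the data; GFS chooses the offset and
        # appends the record atomically at least once
        offset = gfs_client.record_append("/crawl/pages-0001", page_bytes)
        # "At least once" permits duplicates under retries, so readers
        # typically deduplicate records using unique IDs or checksums
        return offset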
 GFS was designed for high fault tolerance and adopted some
methods to achieve this goal.
 Master and chunk servers can be restarted in a few seconds, and
with such a fast recovery capability, the window of time in which
the data is unavailable can be greatly reduced.
 As we mentioned before, each chunk is replicated in at least three places, so the system can tolerate at least two simultaneous failures for any single chunk of data.
 The shadow master handles the failure of the GFS master.
 For data integrity, GFS makes checksums on every 64 KB block in
each chunk.
 GFS can achieve the goals of high availability (HA), high
performance, and large scale.
 GFS demonstrates how to support large-scale processing workloads on commodity hardware: the design tolerates frequent component failures and is optimized for huge files that are mostly appended to and then read.
BigTable, Google's NoSQL System
 BigTable was designed to provide a service for storing and retrieving structured and semi-structured data.
 BigTable applications include storage of web pages, per-user data, and geographic locations.
 Here we use web pages to represent URLs and their associated data, such as contents, crawled metadata, links, anchors, and PageRank values.
 Per-user data has information for a specific user and
includes such data as user preference settings, recent
queries/search results, and the user’s e-mails.
 Geographic locations are used in Google’s well-known
Google Earth software.
 Geographic locations include physical entities (shops,
restaurants, etc.), roads, satellite image data, and user
annotations.
The scale of such data is incredibly large.
There will be billions of URLs, and each URL
can have many versions, with an average
page size of about 20 KB per version.
The user scale is also huge.
There are hundreds of millions of users and
there will be thousands of queries per
second.
The same scale occurs in the geographic
data, which might consume more than 100
TB of disk space.
It is not feasible to manage structured or semi-structured data at such a scale with a commercial database system.
This is one reason to rebuild the data
management system; the resultant system
can be applied across many projects for a
low incremental cost.
The other motivation for rebuilding the data
management system is performance.
Low-level storage optimizations help
increase performance significantly, which is
much harder to do when running on top of
a traditional database layer.
 The design and implementation of the BigTable system
has the following goals.
 The applications want asynchronous processes to be
continuously updating different pieces of data and want
access to the most current data at all times.
 The database needs to support very high read/write
rates and the scale might be millions of operations per
second.
 Also, the database needs to support efficient scans over
all or interesting subsets of data, as well as efficient joins
of large one-to-one and one-to-many data sets.
 The application may need to examine data changes over
time (e.g., contents of a web page over multiple crawls).
 Thus, BigTable can be viewed as a distributed multilevel map.
 It provides a fault-tolerant and persistent database as in a
storage service.
 The BigTable system is scalable, which means the system has
thousands of servers, terabytes of in-memory data, petabytes of
disk-based data, millions of reads/writes per second, and
efficient scans.
 Also, BigTable is a self-managing system (i.e., servers can be
added/removed dynamically and it features automatic load
balancing).
 Design/initial implementation of BigTable began at the beginning
of 2004.
 BigTable is used in many projects, including Google Search,
Orkut, and Google Maps/Google Earth, among others.
 One of the largest BigTable cells manages ~200 TB of data spread over several thousand machines.
The BigTable system is built on top of an
existing Google cloud infrastructure.
BigTable uses the following building blocks:
1. GFS: stores persistent state
2. Scheduler: schedules jobs involved in
BigTable serving
3. Lock service: master election, location
bootstrapping
4. MapReduce: often used to read/write
BigTable data
 BigTable provides a simplified data
model compared to traditional database
systems.
 Figure shows the data model of a
sample table, Web Table.
 Web Table stores the data about a web
page.
 Each web page can be accessed by the
URL.
 The URL is considered the row index.
 The columns provide different data related to the corresponding URL, for example, different versions of the contents and the anchors appearing in the web page.
 In this sense, BigTable is a distributed, multidimensional, sorted, sparse map.
 The map is indexed by row key, column key, and timestamp; that is, (row: string, column: string, time: int64) maps to a string (the cell contents).
 Rows are ordered lexicographically by row key.
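The (row, column, timestamp)-to-value map can be pictured as a nested dictionary; the row and cell values below are illustrative, in the spirit of the Web Table example:

    # row key -> { "family:qualifier" -> { timestamp -> cell contents } }
    webtable = {
        "com.cnn.www": {                      # row key (reversed URL)
            "contents:": {
                1094939200: "<html>version 2 ...</html>",
                1094930000: "<html>version 1 ...</html>",
            },
            "anchor:cnnsi.com": {
                1094939200: "CNN",            # anchor text of the link
            },
        },
    }

    # Lookup: (row, column, time) -> string
    cell = webtable["com.cnn.www"]["anchor:cnnsi.com"][1094939200]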
The row range for a table is dynamically partitioned, and each row range is called a "tablet."
Syntax for columns is shown as a
(family:qualifier) pair.
Cells can store multiple versions of data.
Such a data model is a good match for
most of Google’s (and other organizations’)
applications.
For rows, the row name is an arbitrary string, and access to data in a row is atomic.
This is different from the traditional
relational database which provides
abundant atomic operations (transactions).
 Row creation is implicit upon storing data.
Rows are ordered lexicographically, so rows whose keys are close together lexicographically usually reside on one or a small number of machines.
 Large tables are broken into tablets at row boundaries.
 A tablet holds a contiguous range of rows.
 Clients can often choose row keys to achieve locality.
 The system aims for about 100 MB to 200 MB of data per tablet.
 Each serving machine is responsible for about 100
tablets.
 This can achieve faster recovery times as 100 machines
each pick up one tablet from the failed machine.
 This also results in fine-grained load balancing, that is,
migrating tablets away from the overloaded machine.
 Similar to the design in GFS, a master machine in
BigTable makes load-balancing decisions.
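A back-of-the-envelope calculation based on the numbers above:

    tablet_size_mb = 150          # target is roughly 100-200 MB per tablet
    tablets_per_server = 100      # each serving machine holds ~100 tablets

    data_per_server_gb = tablet_size_mb * tablets_per_server / 1024.0
    # ~14.6 GB of tablet data per machine; if one machine fails, 100
    # other machines can each pick up a single tablet in parallel,
    # which keeps recovery fast and load balancing fine-grained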
Figure shows the BigTable
system structure.
A BigTable master manages and
stores the metadata of the
BigTable system.
BigTable clients use the BigTable client programming library to communicate with the BigTable master and tablet servers.
BigTable relies on a highly available and persistent distributed lock service called Chubby.
Tablet Location Hierarchy
 Figure shows how to locate the BigTable data starting from the file stored in Chubby.
 The first level is a file stored in Chubby that contains the location of the root tablet.
 The root tablet contains the location of all tablets in a special METADATA table.
 Each METADATA tablet contains the location of a set of user tablets.
 The root tablet is just the first tablet in the METADATA table, but it is treated specially: it is never split, to ensure that the tablet location hierarchy has no more than three levels.
 The METADATA table stores the location of a tablet
under a row key that is an encoding of the tablet’s
table identifier and its end row.
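The three-level lookup can be sketched as follows; the objects and method names are illustrative, not the real BigTable client internals:

    def metadata_key(table_id, row_key):
        # METADATA row key: an encoding of the tablet's table
        # identifier and its end row
        return "%s:%s" % (table_id, row_key)

    def locate_user_tablet(chubby, table_id, row_key):
        # Level 1: a file in Chubby holds the root tablet's location
        root_tablet = chubby.read("/bigtable/root-tablet-location")
        # Level 2: the root tablet maps encoded keys to METADATA tablets;
        # it is never split, so the hierarchy stays at three levels
        meta_tablet = root_tablet.lookup(metadata_key("METADATA", row_key))
        # Level 3: a METADATA tablet maps the key to the user tablet
        return meta_tablet.lookup(metadata_key(table_id, row_key))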
 BigTable includes many optimizations and fault-
tolerant features.
 Chubby can guarantee the availability of the file for
finding the root tablet.
 The BigTable master can quickly scan the tablet
servers to determine the status of all nodes.
 Tablet servers use compaction to store data efficiently.
 Shared logs are used for logging the operations of
multiple tablets so as to reduce the log space as well
as keep the system consistent.
Chubby, Google’s Distributed Lock
Service
 Chubby is intended to provide a
coarse-grained locking service.
 Clients can store small files inside Chubby storage, which provides a simple namespace organized as a file system tree.
 The files stored in Chubby are
quite small compared to the huge
files in GFS.
 Based on the Paxos agreement
protocol, the Chubby system can
be quite reliable despite the failure
of any member node.
 Figure shows the overall
architecture of the Chubby system.
Each Chubby cell has five servers inside.
Each server in the cell has the same file system
namespace.
Clients use the Chubby library to talk to the servers
in the cell.
Client applications can perform various file
operations on any server in the Chubby cell.
Servers run the Paxos protocol to make the whole
file system reliable and consistent.
 Chubby has become Google’s primary internal
name service.
GFS and BigTable use Chubby to elect a primary
from redundant replicas.
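A primary-election sketch in the style of Chubby usage; the file path and client calls are hypothetical, not Chubby's actual client library:

    def elect_primary(chubby, my_address):
        # Each replica tries to acquire an exclusive lock on a well-known
        # file; the winner becomes primary and writes its address into
        # the file so that clients and other replicas can find it
        handle = chubby.open("/ls/cell1/bigtable-master", mode="write")
        if handle.try_acquire_lock():
            handle.set_contents(my_address)
            return True       # act as primary
        return False          # remain a standby replica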
