Cloud Applications
UNIT-I
A Distributed System is a collection of autonomous computer systems that are physically separated but connected by a computer network and equipped with distributed system software. The autonomous computers communicate with one another by sharing resources and files and by performing the tasks assigned to them.
Example of a Distributed System:
Any social media platform can have its centralized computer network as its headquarters, and the computer systems that can be accessed by any user to consume its services are the autonomous systems in the distributed system architecture.
When input comes from a client to the main computer, the master node divides the task into simple jobs and sends them to the slave nodes. When the jobs are completed by the slave nodes, the results are sent back to the master node, which then returns the final result to the client.
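To illustrate the master/worker pattern described above, the following is a minimal Python sketch that uses the multiprocessing module as a stand-in for cluster nodes; the job list and the square() work function are made up for illustration only.

from multiprocessing import Pool

def square(job):
    # A "simple job" executed by a worker (slave) node.
    return job * job

if __name__ == "__main__":
    jobs = [1, 2, 3, 4, 5]               # task divided by the master into simple jobs
    with Pool(processes=3) as pool:      # three worker processes stand in for slave nodes
        results = pool.map(square, jobs) # master distributes jobs and collects the results
    print(results)                       # master returns the combined result to the client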
Advantages:
High Performance
Easy to manage
Scalable
Expandability
Availability
Flexibility
Disadvantages:
High cost
Fault finding is difficult
More space is needed
Applications of Cluster Computing:
Many web application functionalities, such as security, search engines, database servers, web servers, proxies, and email.
It is flexible to allocate work as small data tasks for processing.
Assists in solving complex computational problems.
Cluster computing can be used in weather modeling.
Earthquake, nuclear, and tornado simulation and forecasting.
Grid computing: In grid computing, the subgroup consists of distributed systems, which are often set up as a network of computer systems; each system can belong to a different administrative domain and can differ greatly in terms of hardware, software, and network technology.
Different departments may have different computers with different operating systems; a control node is present that helps these computers communicate with each other and exchange messages and work.
Advantages:
Can solve bigger and more complex problems in a shorter time frame.
Easier collaboration with other organizations.
Better use of existing equipment.
Disadvantages:
Grid software and standards continue to evolve.
There is a learning curve to get started.
Non-interactive job submission.
A fast connection between computing resources may be needed.
Licensing across many servers can be prohibitive for some applications.
Applications of Grid Computing
Organizations that develop grid standards and practices for guidelines.
Works as a middleware solution for connecting different businesses.
It provides a solution that can meet computing, data, and network needs.
UNIT-2
What is Cloud Computing?
Cloud computing is a technology to store, manage, process, and access data over the internet instead of a local server or computer hard drive. Here, the term cloud is taken from the cloud symbol used to represent the internet in flowcharts. Remote servers are used in cloud computing to store the data, which can be accessed from anywhere using the internet.
Types of Cloud
1. Public cloud
2. Private cloud
3. Hybrid cloud
4. Community cloud
Public Cloud
Public clouds are managed by third parties that provide cloud services over the internet to the public; these services are available under pay-as-you-go billing models.
They offer solutions for minimizing IT infrastructure costs and become a good option for
handling peak loads on the local infrastructure.
Public clouds are the go-to option for small enterprises, which can start their businesses without
large upfront investments by completely relying on public infrastructure for their IT needs.
The fundamental characteristic of public clouds is multitenancy. A public cloud is meant to serve multiple users, not a single customer. Each user requires a virtual computing environment that is separated, and most likely isolated, from other users.
Private cloud
Private clouds are distributed systems that work on private infrastructure and provide users with dynamic provisioning of computing resources. Instead of a pay-as-you-go model, private clouds may use other schemes that manage cloud usage and bill the different departments or sections of an enterprise proportionally. Private cloud providers include HP Data Centers, Ubuntu, Elastic-Private cloud, Microsoft, etc.
Hybrid cloud:
A hybrid cloud is a heterogeneous distributed system formed by combining facilities of the
public cloud and private cloud. For this reason, they are also called heterogeneous clouds.
A major drawback of private deployments is the inability to scale on demand and efficiently address peak loads, which is where public clouds are needed. Hence, a hybrid cloud takes advantage of both public and private clouds.
Community cloud:
Community clouds are distributed systems created by integrating the services of different
clouds to address the specific needs of an industry, a community, or a business sector. But
sharing responsibilities among the organizations is difficult.
In the community cloud, the infrastructure is shared between organizations that have shared
concerns or tasks. The cloud may be managed by an organization or a third party.
Characteristics of IaaS
Example: DigitalOcean, Linode, Amazon Web Services (AWS), Microsoft Azure, Google
Compute Engine (GCE), Rackspace, and Cisco Metacloud.
Platform as a Service (PaaS)
The PaaS cloud computing platform is created for programmers to develop, test, run, and manage applications.
Characteristics of PaaS
Example: AWS Elastic Beanstalk, Windows Azure, Heroku, Force.com, Google App Engine,
Apache Stratos, Magento Commerce Cloud, and OpenShift.
Software as a Service (SaaS)
SaaS is also known as "on-demand software". It is software in which the applications are hosted by a cloud service provider. Users can access these applications with the help of an internet connection and a web browser.
Characteristics of SaaS
The table below shows the difference between IaaS, PaaS, and SaaS:
IaaS – It provides a virtual data center to store data and create platforms.
PaaS – It provides virtual platforms and tools to create, test, and deploy applications.
SaaS – It provides web software and apps to complete business tasks.
Cloud computing provides IT services through the internet. These services are placed in different
remote places. The services can be divided into three main categories:
1. Software-as-a-Service (SaaS)
2. Platform-as-a-Service (PaaS)
3. Infrastructure-as-a-Service (IaaS)
Of the above three services, Salesforce provides two, SaaS and PaaS, to its users.
SAAS(Salesforce.com)
SaaS services can be directly accessed using the internet instead of installing each application on the local drive or system.
Salesforce.com is the SaaS service provider that provides various online applications for CRM. There is no need to install any software or server on a local machine; instead, we can start the business on it just by signing up.
PAAS(Force.com)
Platform-as-a-Service, or PaaS, is a type of cloud computing service where a service provider such as Salesforce.com provides a platform for its clients to work on. On such platforms, the users or clients can run, develop, test, or manage any business applications without any IT infrastructure.
It lies between the SaaS and IaaS services, and provides a building block by which we can create
our solutions.
Google App Engine is one of the great examples of PaaS services. Currently, it provides online
Python and Java Runtime platforms to develop web applications without any need for
complicated software & hardware.
The Force.com platform also offers PaaS services. It uses its own proprietary language.
Infrastructure-as-a-Service (IaaS)
IaaS
is a type of cloud computing service that offers rented computing infrastructure. The cloud provider provides various infrastructure services such as servers, virtual machines, network, storage, etc.
Features of Cloud Computing
Cloud computing is becoming popular day by day. Continuous business expansion and growth
requires huge computational power and large-scale data storage systems. Cloud computing can
help organizations expand and securely move data from physical locations to the 'cloud' that can
be accessed anywhere.
Cloud computing has many features that make it one of the fastest growing industries at present.
1. Resources Pooling
Resource pooling is one of the essential features of cloud computing. Resource pooling means that a cloud service provider can share resources among multiple clients, providing each with a different set of services according to their needs. It is a multi-client strategy that can be applied to data storage, processing, and bandwidth-delivered services. The real-time process of allocating resources does not conflict with the client's experience.
2. On-Demand Self-Service
It is one of the important and essential features of cloud computing. It enables the client to continuously monitor server uptime, capabilities, and allocated network storage. This is a fundamental feature of cloud computing, and a customer can also control the computing capabilities according to their needs.
3. Easy Maintenance
This is one of the best cloud features. Servers are easily maintained, and downtime is minimal or sometimes zero. Cloud computing powered resources often undergo several updates to optimize their capabilities and potential. Updates are more compatible with devices and perform faster than previous versions.
4. Economical
This cloud feature helps in reducing the IT expenditure of organizations. In cloud computing, clients need to pay the administration only for the space used by them. There are no hidden or additional charges to be paid. Administration is economical, and more often than not, some space is allocated for free.
5. Security
Data security is one of the best features of cloud computing. Cloud services make a copy of the
stored data to prevent any kind of data loss. If one server loses data by any chance, the copied
version is restored from the other server. This feature comes in handy when multiple users are
working on a particular file in real-time, and one file suddenly gets corrupted.
6. Automation
Automation is an essential feature of cloud computing. The ability of cloud computing to
automatically install, configure and maintain a cloud service is known as automation in cloud
computing. In simple words, it is the process of making the most of the technology and
minimizing the manual effort. However, achieving automation in a cloud ecosystem is not that
easy. This requires the installation and deployment of virtual machines, servers, and large
storage. On successful deployment, these resources also require constant maintenance.
7. Resilience
Resilience in cloud computing means the ability of a service to quickly recover from any
disruption. The resilience of a cloud is measured by how fast its servers, databases and network
systems restart and recover from any loss or damage. Availability is another key feature of cloud
computing. Since cloud services can be accessed remotely, there are no geographic restrictions or
limits on the use of cloud resources.
The services can be scaled up and down as per the client requirements.
Benefits of Cloud Computing
Data intensive computing has some characteristics which are different from other forms of
computing. They are:
In order to achieve high performance in data intensive computing, it is necessary to minimize the
movement of data. This reduces system overhead and increases performance by allowing the
algorithms to execute on the node where the data resides.
The data intensive computing system utilizes a machine independent approach where the run
time system controls the scheduling, execution, load balancing, communications and the
movement of programs.
Data intensive computing hugely focuses on reliability and availability of data. Traditional large
scale systems may be susceptible to hardware failures, communication errors and software bugs,
and data intensive computing is designed to overcome these challenges.
Data intensive computing is designed for scalability so it can accommodate any amount of data
and so it can meet the time critical requirements. Scalability of the hardware as well as the
software architecture is one of the biggest advantages of data intensive computing.
UNIT-3
Virtualization In Cloud Computing and Types
Virtualization is a technique of how to separate a service from the underlying physical delivery
of that service. It is the process of creating a virtual version of something like computer
hardware. It was initially developed during the mainframe era. It involves using specialized
software to create a virtual or software-created version of a computing resource rather than the
actual version of the same resource. With the help of virtualization, multiple operating systems and applications can run on the same machine and the same hardware at the same time, increasing the utilization and flexibility of the hardware.
In other words, virtualization is one of the main cost-effective, hardware-reducing, and energy-saving techniques used by cloud providers. Virtualization allows a single physical instance of a resource or an application to be shared among multiple customers and organizations at one time. It does this by assigning a logical name to a physical resource and providing a pointer to
that physical resource on demand. The term virtualization is often synonymous with hardware
virtualization, which plays a fundamental role in efficiently delivering Infrastructure-as-a-
Service (IaaS) solutions for cloud computing. Moreover, virtualization technologies provide a
virtual environment for not only executing applications but also for storage, memory, and
networking.
The machine on which the virtual machine is going to be built is known as the Host Machine, and that virtual machine is referred to as the Guest Machine.
BENEFITS OF VIRTUALIZATION
1. More flexible and efficient allocation of resources.
2. Enhance development productivity.
3. It lowers the cost of IT infrastructure.
4. Remote access and rapid scalability.
5. High availability and disaster recovery.
6. Pay-per-use of the IT infrastructure on demand.
7. Enables running multiple operating systems.
Types of Virtualization:
1. Application Virtualization.
2. Network Virtualization.
3. Desktop Virtualization.
4. Storage Virtualization.
5. Server Virtualization.
6. Data virtualization.
1. Application Virtualization:
Application virtualization helps a user to have remote access to an application from a server. The server stores all personal information and other characteristics of the application, but it can still run on a local workstation through the internet. An example would be a user who needs to run two different versions of the same software. Technologies that use application virtualization are hosted applications and packaged applications.
2. Network Virtualization:
The ability to run multiple virtual networks, each with a separate control plane and data plane. They co-exist together on top of one physical network and can be managed by individual parties that are potentially kept confidential from each other.
Network virtualization provides a facility to create and provision virtual networks—logical switches, routers, firewalls, load balancers, Virtual Private Networks (VPN), and workload security—within days or even weeks.
3. Desktop Virtualization:
Desktop virtualization allows the users’ OS to be remotely stored on a server in the data centre.
It allows the user to access their desktop virtually, from any location by a different machine.
Users who want specific operating systems other than Windows Server will need to have a
virtual desktop. Main benefits of desktop virtualization are user mobility, portability, easy
management of software installation, updates, and patches.
4. Storage Virtualization:
Storage virtualization is an array of servers that are managed by a virtual storage system. The servers aren’t aware of exactly where their data is stored and instead function more like worker bees in a hive. It allows storage from multiple sources to be managed and utilized as a single repository. Storage virtualization software maintains smooth operations, consistent performance, and a continuous suite of advanced functions despite changes, breakdowns, and differences in the underlying equipment.
5. Server Virtualization:
This is a kind of virtualization in which masking of server resources takes place. Here, the central server (physical server) is divided into multiple different virtual servers by changing the identity number and processors, so each system can run its own operating system in an isolated manner, while each sub-server knows the identity of the central server. It increases performance and reduces operating cost by splitting the main server's resources into sub-server resources. It is beneficial for virtual migration, reducing energy consumption, reducing infrastructure cost, etc.
6. Data virtualization:
This is the kind of virtualization in which data is collected from various sources and managed in a single place, without needing to know the technical details of how the data is collected, stored, and formatted. The data is then arranged logically so that its virtual view can be accessed remotely by interested people, stakeholders, and users through various cloud services. Many large companies provide such services, for example Oracle, IBM, AtScale, CData, etc.
It can be used to perform various kinds of tasks, such as:
Data-integration
Business-integration
Service-oriented architecture data-services
Searching organizational data
Network Virtualization is a process of logically grouping physical networks and making them
operate as single or multiple independent networks called Virtual Networks.
Types of Virtualization
b. Server Virtualization
In server virtualization in cloud computing, the software is installed directly on the server system, and a single physical server can be divided into many servers on demand to balance the load.
It can also be stated that server virtualization is the masking of server resources, which consist of a number and identity. With the help of software, the server administrator divides one physical server into multiple servers.
c. Hardware Virtualization
Hardware virtualization in cloud computing is used on server platforms, as it is more flexible to use virtual machines rather than physical machines. In hardware virtualization, virtual machine software is installed in the hardware system; this is then known as hardware virtualization.
It consists of a hypervisor, which is used to control and monitor the processor, memory, and other hardware resources. After the hardware virtualization process is complete, the user can install a different operating system on it, and with this platform different applications can be used.
d. Storage Virtualization
In storage virtualization in cloud computing, physical storage from multiple network storage devices is grouped so that it looks like a single storage device.
It can be implemented with the help of software applications, and storage virtualization is done for the backup and recovery process. It is a sharing of physical storage from multiple storage devices.
e. MEMORY VIRTUALIZATION
1. A technique that gives an application program the impression that it has its own contiguous
logical memory independent of available physical memory.
2. Paging saves inactive memory pages onto the disk and brings them back to physical memory when required.
3. The space used by the VMM (Virtual Machine Monitor) on the disk is known as a “Swap File”.
4. Swap is a portion of the local storage environment that is designated as memory to the host system.
5. The hosts see the local swap as additional addressable memory locations and do not delineate between RAM and swap.
1. Higher memory utilization by sharing contents and consolidating more virtual machines
on a physical host.
2. Ensuring some memory space exists before halting services until memory frees up.
It introduces a way to decouple memory from the server to provide a shared, distributed or
networked function.
It enhances performance by providing greater memory capacity without any addition to the main
memory. That’s why a portion of the disk drive serves as an extension of the main memory.
Implementations –
Application-level integration – Applications running on connected computers directly connect
to the memory pool through an API or the file system.
Operating System-Level Integration – The operating system first connects to the memory pool
and makes that pooled memory available to applications.
Benefits of Virtualization
Virtualizations in Cloud Computing has numerous benefits, let’s discuss them one by one:
i. Security
During the process of virtualization, security is one of the important concerns. Security can be provided with the help of firewalls, which help prevent unauthorized access and keep the data confidential.
Moreover, with the help of firewalls and security measures, the data can be protected from harmful viruses, malware, and other cyber threats. Encryption also takes place, with protocols that protect the data from other threats.
So, the customer can virtualize all the data, store it, and create a backup on a server where the data can be kept.
ii. Flexible operations
With the help of a virtual network, the work of IT professionals is becoming more efficient and agile. The network switches implemented today are very easy to use, flexible, and save time.
With the help of virtualization in cloud computing, technical problems in physical systems can be solved. It eliminates the problem of recovering data from crashed or corrupted devices and hence saves time.
iii. Economical
Virtualization in cloud computing saves the cost of physical systems such as hardware and servers. It stores all the data on virtual servers, which is quite economical.
It reduces wastage and decreases electricity bills along with the maintenance cost. Because of this, a business can run multiple operating systems and apps on a particular server.
iv. Eliminates the risk of system failure
While performing some task, there are chances that the system might crash at the wrong time. This failure can cause damage to the company, but virtualization helps you perform the same task on multiple devices at the same time.
The data can be stored in the cloud and retrieved anytime with the help of any device. Moreover, there are two servers working side by side, which makes the data accessible at all times. Even if one server crashes, the customer can access the data with the help of the second server.
v. Flexible transfer of data
The data can be transferred to the virtual server and retrieved anytime. Customers or the cloud provider do not have to waste time searching through hard drives to find data. With the help of virtualization, it is very easy to locate the required data and transfer it to the allotted authorities.
This transfer of data has no limit and can happen over a long distance with minimal charge. Additional storage can also be provided, and the cost will be as low as possible.
Hotspot mitigation first requires detecting the hotspot, which itself is not trivial; there are several methods for detecting hotspots. Having detected the hotspots, a new allocation strategy is needed to deal with the overloaded virtual machines, and this may sometimes require virtual machine migration; all of this is contained in the hotspot mitigation algorithms. The hotspot mitigation algorithm therefore determines which physical servers have sufficient resources for the over-provisioned virtual machines, so that those virtual machines can be migrated in order to mitigate the hotspots.
Determining a new mapping of virtual machines to physical machines that avoids the threshold violations specified in the service level agreement is an NP-hard problem. The NP-complete multidimensional bin packing problem can be reduced to the hotspot mitigation problem just described: each server is a bin with multiple dimensions corresponding to its resource constraints, and each virtual machine is an object that needs to be packed, with size equal to its resource requirements. Even the problem of determining whether a valid multidimensional bin packing exists is itself a hard problem.
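Because the exact mapping is NP-hard, practical systems rely on greedy heuristics. The sketch below shows one such heuristic, first-fit decreasing, in Python; it is an illustration only (the text does not prescribe this algorithm), and the VM requirements and server capacities, given as (CPU, memory) pairs, are hypothetical.

# Illustrative first-fit-decreasing placement of overloaded VMs onto servers.
# Each VM requirement and each server's free capacity is a (cpu, mem) tuple; values are hypothetical.
vms = {"vm1": (4, 8), "vm2": (2, 4), "vm3": (1, 2)}
servers = {"s1": (4, 8), "s2": (8, 16)}

def fits(req, free):
    return all(r <= f for r, f in zip(req, free))

def first_fit_decreasing(vms, servers):
    placement = {}
    # Place the largest VMs first to reduce fragmentation.
    for vm, req in sorted(vms.items(), key=lambda kv: sum(kv[1]), reverse=True):
        for srv, free in servers.items():
            if fits(req, free):
                placement[vm] = srv
                servers[srv] = tuple(f - r for f, r in zip(free, req))
                break
        else:
            placement[vm] = None   # no server can host this VM without violating thresholds
    return placement

print(first_fit_decreasing(vms, servers))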
What is SDN?
Server virtualization
Server virtualization is a method of running multiple independent virtual operating systems
on a single physical computer. Server virtualization allows optimal use of physical hardware
and dynamic scalability where virtual servers can be created or deleted much like files.
Server Virtualization Definition
Server virtualization is the process of dividing a physical server into multiple unique and isolated
virtual servers by means of a software application. Each virtual server can run its own operating
systems independently.
Server virtualization is a cost-effective way to provide web hosting services and effectively
utilize existing resources in IT infrastructure. Without server virtualization, servers only use a
small part of their processing power. This results in servers sitting idle because the workload is
distributed to only a portion of the network’s servers. Data centers become overcrowded with
underutilized servers, causing a waste of resources and power.
By having each physical server divided into multiple virtual servers, server virtualization allows
each virtual server to act as a unique physical device. Each virtual server can run its own
applications and operating system. This process increases the utilization of resources by making
each virtual server act as a physical server and increases the capacity of each physical machine.
UNIT-4
Until the arrival of event streaming systems like Apache Kafka and Google Cloud Pub/Sub, data
processing has typically been handled with periodic batch jobs, where raw data is first stored and
then later processed at arbitrary time intervals. For example, a telecom company might wait until
the end of the day, week, or month to analyze the millions of call records and calculate
accumulated charges.
One of the limitations of batch processing is that it’s not real time. Increasingly, organizations
want to analyze data in real time in order to make timely business decisions and take action
when interesting things happen. For example, the same telecom company mentioned above
might benefit from keeping customers apprised of charges in real time as a way to enhance the
overall customer experience.
This is where event streaming comes in. Event streaming is the process of continuously
processing infinite streams of events, as they are created, in order to capture the time-value of
data as well as create push-based applications that take action whenever something interesting
happens. Examples of event streaming include continuously analyzing log files generated by
customer-facing web applications, monitoring and responding to customer behavior as users
browse e-commerce websites, keeping a continuous pulse on customer sentiment by analyzing
changes in clickstream data generated by social networks, or collecting and responding to
telemetry data generated by Internet of Things (IoT) devices.
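As a toy illustration of the contrast with batch processing (not taken from the text itself), the Python sketch below keeps a running total of charges per customer as each call record arrives, instead of waiting for an end-of-day batch job; the record format and the per-minute rate are assumptions.

# Toy event-stream processor: update per-customer charges as each call record arrives.
from collections import defaultdict

RATE_PER_MINUTE = 0.05   # hypothetical charge per call minute

def call_record_stream():
    # Stand-in for an unbounded stream (e.g., a Kafka topic or Pub/Sub subscription).
    events = [("alice", 12), ("bob", 3), ("alice", 7), ("bob", 20)]
    for event in events:
        yield event

charges = defaultdict(float)
for customer, minutes in call_record_stream():
    charges[customer] += minutes * RATE_PER_MINUTE        # processed as soon as the event arrives
    print(f"{customer} owes {charges[customer]:.2f} so far")  # push-based, real-time update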
What is MapReduce in cloud computing?
MapReduce is a programming paradigm that enables massive scalability across
hundreds or thousands of servers in a Hadoop cluster. As the processing component,
MapReduce is the heart of Apache Hadoop. The term "MapReduce" refers to two separate
and distinct tasks that Hadoop programs perform.
Introduction To MapReduce
MapReduce is a Hadoop framework used for writing applications that can process large amounts of data on clusters. It can also be described as a programming model with which we can process huge datasets across computer clusters. This framework allows data to be stored in a distributed form. It works on huge volumes of data and a massive scale of computing.
Data Warehouse: We can use MapReduce to analyze large data volumes in data warehouses while implementing specific business logic for data insights.
Fraud Detection: Hadoop and MapReduce are used in financial enterprises, including organizations like banks, insurance providers, and payment companies, for fraud detection, pattern identification, and business metrics through transaction analysis.
How does MapReduce Work?
The MapReduce algorithm contains two important tasks, namely Map and Reduce.
The Map task takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key-value pairs).
The Reduce task takes the output from the Map as an input and combines those data tuples (key-value pairs) into a smaller set of tuples.
The Reduce task is always performed after the Map job.
Input Phase − Here we have a Record Reader that translates each record in an input file and
sends the parsed data to the mapper in the form of key-value pairs.
Map − Map is a user-defined function, which takes a series of key-value pairs and processes each
one of them to generate zero or more key-value pairs.
Intermediate Keys − The key-value pairs generated by the mapper are known as intermediate keys.
Combiner − A combiner is a type of local Reducer that groups similar data from the map phase
into identifiable sets. It takes the intermediate keys from the mapper as input and applies a user-
defined code to aggregate the values in a small scope of one mapper. It is not a part of the main
MapReduce algorithm; it is optional.
Shuffle and Sort − The Reducer task starts with the Shuffle and Sort step. It downloads the
grouped key-value pairs onto the local machine, where the Reducer is running. The individual
key-value pairs are sorted by key into a larger data list.
The data list groups the equivalent keys together so that their values can be iterated easily in the
Reducer task.
Reducer − The Reducer takes the grouped key-value paired data as input and runs a Reducer
function on each one of them. Here, the data can be aggregated, filtered, and combined in a
number of ways, and it requires a wide range of processing.
Once the execution is over, the Reducer gives zero or more key-value pairs to the final step.
Output Phase − In the output phase, we have an output formatter that translates the final key-value pairs from the Reducer function and writes them onto a file using a record writer.
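A minimal pure-Python sketch of these phases, using the classic word-count example, is shown below; it illustrates the programming model only and is not Hadoop code, and the input records are hypothetical.

from itertools import groupby
from operator import itemgetter

lines = ["cloud computing cloud", "map reduce map"]   # hypothetical input records

# Map phase: emit (word, 1) for every word in every record.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle and sort: group the intermediate key-value pairs by key.
mapped.sort(key=itemgetter(0))
grouped = {key: [v for _, v in pairs] for key, pairs in groupby(mapped, key=itemgetter(0))}

# Reduce phase: aggregate the values for each key.
counts = {word: sum(values) for word, values in grouped.items()}
print(counts)   # {'cloud': 2, 'computing': 1, 'map': 2, 'reduce': 1}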
Advantage of MapReduce
Fault tolerance: It can handle failures without downtime.
Speed: It splits, shuffles, and reduces the unstructured data in a short time.
Cost-effective: Hadoop MapReduce has a scale-out feature that enables users to process or store data in a cost-effective manner.
Scalability: It provides a highly scalable framework. MapReduce allows users to run applications
from many nodes.
Parallel Processing: Here, multiple job parts of the same dataset can be processed in a parallel manner. This can reduce the time taken to complete a task.
Limitations of MapReduce
MapReduce cannot cache the intermediate data in memory for further use, which diminishes the performance of Hadoop.
It is only suitable for batch processing of huge amounts of data.
Spark is a cluster computing framework, somewhat similar to MapReduce, but with many more capabilities and features, higher speed, and APIs for developers in many languages like Scala, Python, Java, and R. It is also friendly for database developers, as it provides Spark SQL, which supports most of the ANSI SQL functionality. Spark also has out-of-the-box support for
Machine learning and Graph processing using components called MLlib and GraphX
respectively. Spark also has support for streaming data using Spark Streaming.
Spark is developed in Scala programming language. Though the majority of use cases of
Spark uses HDFS as the underlying data file storage layer, it is not mandatory to use HDFS. It
does work with a variety of other Data sources like Cassandra, MySQL, AWS S3
etc. Apache Spark also comes with its default resource manager which might be good enough
for the development environment and small size cluster, but it also integrates very well with
YARN and Mesos. Most of the production-grade and large clusters use YARN and Mesos as
the resource manager.
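For concreteness, here is a minimal PySpark word-count sketch; it assumes the pyspark package is installed and that an input file named input.txt exists, both of which are assumptions rather than details from the text.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WordCount").getOrCreate()
sc = spark.sparkContext

# Transformations are lazy: nothing runs until an action (collect) is called.
counts = (sc.textFile("input.txt")
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))

print(counts.collect())   # action: triggers execution of the DAG
spark.stop()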
Features of Spark
1. Speed: According to Apache, Spark can run applications on Hadoop cluster up to 100
times faster in memory and up to 10 times faster on disk. Spark is able to achieve such
a speed by overcoming the drawback of MapReduce which always writes to disk for
all intermediate results. Spark does not need to write
intermediate results to disk and can work in memory using DAG, lazy evaluation,
RDDs and caching. Spark has a highly optimized execution engine which makes it so
fast.
2. Fault Tolerance: Spark’s optimized execution engine not only makes it fast but is also
fault tolerant. It achieves this using abstraction layer called RDD (Resilient Distributed
Datasets) in combination with DAG, which is built to handle failures of tasks or even
node failures.
3. Lazy Evaluation: Spark works on a lazy evaluation technique. This means that the processing (transformations) on Spark RDDs/Datasets is evaluated in a lazy manner, i.e. the output RDDs/Datasets are not materialized right after a transformation but only when needed, i.e. when an action is performed. The transformations are just part of the DAG, which gets executed when an action is called.
4. Multiple Language Support: Spark provides support for multiple programming
languages like Scala, Java, Python, R and also Spark SQL which is very similar to SQL.
5. Reusability: Spark code written for batch processing jobs can also be reused for stream processing, and it can be used to join historical batch data and stream data on the fly.
6. Machine Learning: MLlib is the Machine Learning library of Spark, which is available out of the box for creating ML pipelines for data analysis and predictive analytics.
7. Graph Processing: Apache Spark also has graph processing logic. Using the GraphX APIs, which are again provided out of the box, one can write graph processing and do graph-parallel computation.
8. Stream Processing and Structured Streaming: Spark can be used for batch processing and also has the capability to cater to stream processing use cases with micro-batches. Spark Streaming comes with Spark, and one does not need to use any other streaming tools or APIs. Spark Streaming also supports Structured Streaming, and it has in-built connectors for Apache Kafka, which come in very handy while developing streaming applications.
9. Spark SQL: Spark has excellent SQL support and an in-built SQL optimizer. Spark SQL features are used heavily in warehouses to build ETL pipelines.
Spark is used in more than 1000 organizations that have built huge clusters for batch processing, stream processing, building warehouses, building data analytics engines, and building predictive analytics platforms using many of the above features of Spark. Let's look at some of the use cases in a few of these organizations.
Streaming Data:
Streaming data is basically unstructured data produced by different types of data sources. The data sources could be anything, like log files generated while customers use mobile apps or web applications, social media content like tweets and Facebook posts, or telemetry from connected devices or instrumentation in data centres. The streaming data is usually unbounded and is processed as it is received from the data source.
Then there is Structured Streaming, which works on the principle of polling data at intervals; this interval data is then processed and appended or updated to an unbounded result table.
Apache Spark has a framework for both, i.e. Spark Streaming to handle streaming using micro-batches and DStreams, and Structured Streaming using Datasets and DataFrames.
The order management system pushes the order status to a queue (which could be Kafka), from where the streaming process reads every minute and picks up all the orders with their status. The Spark engine then processes these and emits the output status counts. The Spark streaming process runs like a daemon until it is killed or an error is encountered.
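A minimal Structured Streaming sketch of such a pipeline is shown below; the Kafka broker address, the topic name "orders", and the message layout (the order status carried as the message value) are assumptions made for illustration, and running it also requires the Spark-Kafka connector package.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("OrderStatusCount").getOrCreate()

# Read order-status events from a Kafka topic (broker address and topic name are assumed).
orders = (spark.readStream
               .format("kafka")
               .option("kafka.bootstrap.servers", "localhost:9092")
               .option("subscribe", "orders")
               .load())

# Treat each Kafka message value as an order status and count orders per status.
status_counts = (orders.selectExpr("CAST(value AS STRING) AS status")
                       .groupBy("status")
                       .count())

# Emit the updated counts every minute until the process is stopped.
query = (status_counts.writeStream
                      .outputMode("complete")
                      .format("console")
                      .trigger(processingTime="1 minute")
                      .start())
query.awaitTermination()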
Machine Learning:
As defined by Arthur Samuel in 1959, “Machine Learning is the field of study that gives computers the ability to learn without being explicitly programmed”. In 1997, Tom Mitchell
gave a definition which is more specifically from an engineering perspective, “A computer
program is said to learn from experience E with respect to some task T and some performance
measure P, if its performance on T, as measured by P, improves with experience E.”. ML
solves complex problems that could not be solved with just mathematical numerical methods
or means. ML is not supposed to make perfect guesses. In ML’s domain, there is no such
thing. Its goal is to make a prediction or make guesses which are good enough to be useful.
MLlib is Apache Spark’s scalable machine learning library. MLlib has multiple algorithms for supervised and unsupervised ML, which can scale out on a cluster for classification, regression, clustering, and collaborative filtering. MLlib interoperates with Python’s math/numerical analysis library NumPy and also with R’s libraries.
Some of these algorithms are also applicable to streaming data. MLlib
helps Spark provide sentiment analysis, customer segmentation and predictive intelligence.
A very common use case of ML is text classification, say for categorising emails. An ML pipeline can be trained to classify emails by reading an inbox. ML is a subject in itself, so it is not possible to deep dive here.
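A minimal sketch of such a text-classification pipeline with Spark MLlib is shown below; it is an illustration rather than the text's own example, and the tiny training set, column names, and feature size are all assumptions.

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("EmailClassifier").getOrCreate()

# Tiny hypothetical training set: email text and a spam (1.0) / not-spam (0.0) label.
train = spark.createDataFrame(
    [("win a free prize now", 1.0), ("meeting agenda for monday", 0.0)],
    ["text", "label"])

# Pipeline: tokenize the text, hash the tokens into feature vectors, fit a classifier.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashing_tf = HashingTF(inputCol="words", outputCol="features", numFeatures=1000)
lr = LogisticRegression(maxIter=10)
model = Pipeline(stages=[tokenizer, hashing_tf, lr]).fit(train)

# Classify a new email.
test = spark.createDataFrame([("free prize inside",)], ["text"])
model.transform(test).select("prediction").show()
spark.stop()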
Fog Computing:
Fog Computing is another use case of Apache Spark. To understand Fog computing we need
to understand IoT first. IoT basically connects all our devices so that they can communicate
with each other and provide solutions to the users of those devices. This would mean huge
amounts of data and current cloud computing may not be sufficient to cater to so much data
transfer, data processing and online demand of customer’s request.
Fog computing can be ideal here, as it takes the work of processing to the devices on the edge of the network. This needs very low latency, parallel processing of ML, and complex graph analytical algorithms, all of which are readily available in Apache Spark out of the box and can be picked and chosen as per the requirements of the processing. So it is expected that as IoT gains momentum, Apache Spark will be the leader in Fog computing.
Event Detection: Apache Spark is increasingly used in event detection, such as credit card fraud detection, money laundering activities, etc. Apache Spark Streaming, along with MLlib and Apache Kafka, forms the backbone of fraudulent financial transaction detection. Credit card transactions of a cardholder can be captured over a period of time to categorize the user's spending habits. Models can be developed and trained to predict any anomaly in the card transactions, working along with Spark Streaming and Kafka in real time.
Interactive Analysis: One of Spark's most popular features is its ability to provide users with interactive analytics. MapReduce does provide tools like Pig and Hive for interactive analysis, but they are too slow in most cases. Spark, however, is very fast, which is why it has gained so much ground in interactive analysis. Spark interfaces with programming languages like R, Python, SQL, and Scala, which caters to a bigger set of developers and users for interactive analysis. Spark also came up with Structured Streaming in version 2.0, which can be used for interactive analysis with live data, as well as to join the live data with batch data output to get more insight into the data. Structured Streaming in the future has the potential to boost web analytics by allowing users to query live web sessions. Even machine learning can be applied to live session data for more insights.
Data Warehousing: Data warehousing is another area where Apache Spark is getting tremendous traction. Due to the increasing volume of data day by day, traditional ETL tools like Informatica along with RDBMSs are not able to meet the SLAs, as they cannot scale horizontally. Spark, along with Spark SQL, is being used by many companies to migrate to a Big Data based warehouse, which can scale horizontally as the load increases.
With Spark, even the processing can be scaled horizontally by adding machines to the Spark engine cluster. These migrated applications embed the Spark engine and offer a web UI to allow users to create, run, test, and deploy jobs interactively. Jobs are primarily written in native Spark SQL or other flavours of SQL. These Spark clusters have been able to scale to process many terabytes of data every day, and the clusters can have hundreds to thousands of nodes.
Companies Using Apache Spark
Alibaba is one of the world’s biggest e-commerce players. Alibaba’s online shopping
platform generates Petabytes of data as it has millions of users every day doing searches,
shopping and placing orders. These user interactions are represented as complex graphs. The
processing of these data points is done using Spark’s Machine learning component MLlib and
then used to provide better user shopping experience by suggesting products based on choice,
trending products, reviews etc.
MyFitnessPal is one of the largest health and fitness lifestyle portals. It has over 80 million
active users. The portal helps its users follow and achieve a healthy lifestyle by following a
proper diet and fitness regime. The portal uses the data added by users about their food,
exercise and lifestyles to identify the best quality food and effective exercise. Using Spark the
portal is able to scan through the huge amount of structured and unstructured data and pull out
best suggestions for its users.
TripAdvisor has a huge user base and generates a mammoth amount of data every day. It is
one of the biggest names in the Travel and Tourism industry. It helps users plan their personal
and official trips around the world. It uses Apache Spark to process petabytes of data from user
interactions and destination details and gives recommendations on planning a perfect trip based
on users' choices and preferences. It helps users identify the best airlines, the best prices on hotels and airlines, the best places to eat, basically everything needed to plan any trip. It also ranks these places, hotels, airlines, and restaurants based on user feedback and reviews. All this processing is done using Apache Spark.
Yahoo is known to have one of the biggest Hadoop clusters, and everyone is aware of Yahoo's contribution to the development of Big Data systems. Yahoo is also heavily using Apache Spark's machine learning capabilities to identify topics and news that users are interested in. This is similar to trending tweets or hashtags on Twitter or Facebook. Earlier, these machine learning algorithms were developed in C/C++ with thousands of lines of code, while today, with Spark and Scala/Python, these algorithms can be implemented in a few hundred lines of code. This is a big leap in turnaround time as well as in code understanding and maintenance, and it has been made possible to a great extent due to Spark.