
MIDNAPORE CITY COLLEGE

Department of Pure & Applied Science


M.Sc. in Computer Science
Semester: I
Paper Title: Distributed System
Paper Code: COS-102: MII

Syllabus: Introduction to Clouds, Virtualization and Virtual Machine


Network Virtualization and Geo-distributed Clouds

Leader Election in Cloud, Distributed Systems, and Industry Systems

Classical Distributed Algorithms and the Industry Systems

Consensus, Paxos and Recovery in Clouds

Cloud Storage: Key-value stores/NoSQL

Cloud Applications
UNIT-I

A Distributed System is a collection of autonomous computer systems that are physically separated but connected by a centralized computer network equipped with distributed system software. The autonomous computers communicate with one another by sharing resources and files and by performing the tasks assigned to them.
Example of a Distributed System:
Any Social Media platform can have its Centralized Computer Network as its Headquarters, while the computer systems that users access to use its services act as the Autonomous Systems in the Distributed System Architecture.

 Distributed System Software: This software enables computers to coordinate their activities and to share resources such as Hardware, Software, Data, etc.
 Database: It is used to store the data processed by each Node/System of the Distributed System connected to the Centralized network.
 As we can see, each Autonomous System runs a common Application and can have its own data, which is shared through the Centralized Database System.
 To transfer data to the Autonomous Systems, the Centralized System should have a Middleware Service and should be connected to a Network.
 Middleware Services enable services that are not present by default in the local systems or the centralized system, by acting as an interface between the Centralized System and the local systems. Using the components of Middleware Services, systems communicate and manage data.
 The data transferred through the database is divided into segments or modules and shared with the Autonomous Systems for processing.
 The data is processed and then transferred back to the Centralized System through the network, where it is stored in the database.
Characteristics of Distributed System:
 Resource Sharing: It is the ability to use any Hardware, Software, or Data anywhere in the
System.
 Openness: It is concerned with Extensions and improvements in the system (i.e., How
openly the software is developed and shared with others)
 Concurrency: It is naturally present in Distributed Systems, where the same activity or functionality can be performed by separate users in remote locations. Every local system has its own independent Operating System and Resources.
 Scalability: The scale of the system can be increased as the number of processors grows, accommodating more users and improving the responsiveness of the system.
 Fault tolerance: It concerns the reliability of the system; if there is a failure in Hardware or Software, the system continues to operate properly without degrading its performance.
 Transparency: It hides the complexity of the Distributed System from the Users and Application programs, as there should be privacy in every system.
Types of Distributed System
A Distributed System is a Network of Machines that can exchange information with each other
through Message-passing. It can be very useful as it helps in resource sharing.
 Client/Server Systems: The client requests a resource or a task from the server; the server allocates the resource or performs the task and sends the result back as a response to the client's request (a minimal sketch follows this list).
 Peer-to-Peer Systems: Nodes are an important part of the system. Here, each node performs its own task using its local memory and shares data through a supporting medium; such a node can work as a server or as a client for the system.
 Middleware: It works as a base for interoperability between different applications running on different operating systems. Data can be transferred between applications by using this service.
 Three-tier: In this model, the client's data is stored in the middle tier rather than on the client system or on the server, which makes development easier. This is mostly used in web or online applications.
 N-tier: Used when an application or server needs to forward a request to another application to perform a task or to provide a service.
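To make the Client/Server model concrete, below is a minimal sketch in Python using the standard socket module. The host, port, and the upper-casing "task" are illustrative assumptions for the demo, not part of any particular system.

import socket
import threading

HOST, PORT = "127.0.0.1", 5000  # assumed address for this demo

# Server side: accept one request, perform the task, send the response.
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind((HOST, PORT))
srv.listen()

def serve_one():
    conn, _ = srv.accept()
    with conn:
        request = conn.recv(1024)      # the client's request
        conn.sendall(request.upper())  # the "task": upper-case the payload

t = threading.Thread(target=serve_one)
t.start()

# Client side: send a request and read the server's response.
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as cli:
    cli.connect((HOST, PORT))
    cli.sendall(b"hello distributed world")
    print(cli.recv(1024))              # b'HELLO DISTRIBUTED WORLD'

t.join()
srv.close()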
Types of Distributed Systems
A distributed system, a notion that also covers distributed computing and distributed databases, consists of independent components running on different machines that exchange messages to achieve common goals. As such, the distributed system appears to the end user as a single interface or computer. Together, the system can maximize resources and information while keeping component failures from affecting service availability.

1. Distributed Computing System:

This type of distributed system is used for computations that require high computing power.
 Cluster Computing: A collection of connected computers that work together as a unit to perform operations, functioning as a single system. Clusters are generally connected via fast local area networks, and each node runs the same operating system.

When input comes from a client to the main computer, the master node divides the task into simple jobs and sends them to the slave nodes. When the slave nodes finish their jobs, they send the results back to the master node, which then shows the result to the main computer (a single-machine sketch of this pattern follows).
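The following is a small sketch of this master/worker pattern in Python, using a process pool on one machine as a stand-in for a real cluster; the job splitting and the squaring task are illustrative assumptions.

# The "master" splits the task into simple jobs, the "workers" compute
# them in parallel, and the master combines the results.
from concurrent.futures import ProcessPoolExecutor

def simple_job(chunk):
    # The work each worker node performs on its piece of the input.
    return sum(x * x for x in chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    # Master divides the task into simple jobs (chunks of the input).
    chunks = [data[i:i + 250_000] for i in range(0, len(data), 250_000)]
    with ProcessPoolExecutor(max_workers=4) as workers:
        partials = list(workers.map(simple_job, chunks))  # send to workers
    # Master combines the workers' results into the final answer.
    print(sum(partials))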
Advantages:
 High Performance
 Easy to manage
 Scalable
 Expandability
 Availability
 Flexibility
Disadvantages:
 High cost
 Fault finding is difficult
 More space is needed
Applications of Cluster Computing:
 In many web applications, for functionalities such as Security, Search Engines, Database servers, Web servers, Proxy servers, and Email.
 It is flexible in allocating work as small data tasks for processing.
 It assists in solving complex computational problems.
 Cluster computing can be used in weather modeling.
 Earthquake, nuclear simulation, and tornado forecasting.

 Grid computing: In grid computing, a subgroup of distributed systems is set up as a network of computer systems, where each system can belong to a different administrative domain and can differ greatly in terms of hardware, software, and network technology.
Different departments may have different computers with different operating systems; a control node is present that helps these computers with different OSes communicate with each other and exchange messages to get work done.
Advantages:
 Can solve bigger and more complex problems in a shorter time frame. Easier collaboration
with other organizations and better use of existing equipment
Disadvantages:
 Grid software and standards continue to evolve
 Getting started learning curve
 Non-interactive job submission
 You may need a fast connection between computer resources.
 Licensing on many servers can be prohibitive for some applications.
Applications of Grid Computing
 Organizations that develop grid standards and practices as guidelines.
 Works as a middleware solution for connecting different businesses.
 It is a solution-based approach that can meet computing, data, and network needs.

2. Distributed Information System:


 Distributed transaction processing: It works across different servers using multiple communication models. Transactions have four characteristics:
 Atomic: The transaction must be indivisible to the outside world; it either completes fully or has no effect.
 Consistent: The system must be left in a consistent state after the transaction has completed.
 Isolated: A transaction must not interfere with other concurrent transactions.
 Durable: Once a transaction has committed, its changes are permanent. Transactions are often constructed as several sub-transactions, jointly forming a nested transaction.
Each database can perform its own individual query, and data retrieved from two different databases can be combined into one single result.
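As a small illustration of atomicity, the sketch below uses Python's built-in sqlite3 module; the accounts table and the transfer amount are hypothetical.

# Either both updates commit, or neither does.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
con.executemany("INSERT INTO accounts VALUES (?, ?)",
                [("alice", 100), ("bob", 0)])
con.commit()

try:
    with con:  # opens a transaction; commits on success, rolls back on error
        con.execute("UPDATE accounts SET balance = balance - 150 "
                    "WHERE name = 'alice'")
        (bal,) = con.execute("SELECT balance FROM accounts "
                             "WHERE name = 'alice'").fetchone()
        if bal < 0:                       # consistency check
            raise ValueError("insufficient funds")
        con.execute("UPDATE accounts SET balance = balance + 150 "
                    "WHERE name = 'bob'")
except ValueError:
    pass  # the whole transfer was rolled back as one indivisible unit

print(con.execute("SELECT * FROM accounts ORDER BY name").fetchall())
# [('alice', 100), ('bob', 0)] -- balances unchanged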
In a company's middleware systems, the component that manages distributed (or nested) transactions forms the core of application integration at the server or database level. This is referred to as the Transaction Processing Monitor (TP Monitor). Its main task is to allow an application to access multiple servers/databases by providing a transactional programming model. Many requests are sent to the databases; ensuring that each request is successfully executed and that a result is delivered for each request is the work handled by the TP Monitor.

 Enterprise application integration: Enterprise Application Integration (EAI) is the process of bringing different business applications together. The databases and workflows associated with business applications ensure that the business uses information consistently and that changes made to data by one business application are reflected correctly in the others. Many organizations collect different data from different platforms in their internal systems and then use those data in trading systems or physical media.
 RPC: With Remote Procedure Calls (RPC), one software element sends a request to another software element by calling what looks like a local procedure and retrieving the data; the object-oriented variant is known as Remote Method Invocation (RMI). An app can have different databases for managing different data, and these can communicate with each other across different platforms. For example, if you log in on your Android device and watch a video on YouTube, then go to your laptop and open YouTube, you can see the same video in your watch history. RPC and RMI have the disadvantage that the sender and receiver must both be running at the time of communication.
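Below is a minimal RPC sketch using Python's standard xmlrpc modules; the port number and the add function are illustrative assumptions. Note that, as stated above, the server must be running when the client calls.

# The client invokes what looks like a local method, but it runs on the server.
import threading
import xmlrpc.client
from xmlrpc.server import SimpleXMLRPCServer

def add(a, b):          # the remote procedure
    return a + b

server = SimpleXMLRPCServer(("127.0.0.1", 8000), logRequests=False)
server.register_function(add, "add")
threading.Thread(target=server.serve_forever, daemon=True).start()

proxy = xmlrpc.client.ServerProxy("http://127.0.0.1:8000/")
print(proxy.add(2, 3))  # 5 -- looks local, executes remotely
server.shutdown()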

UNIT-2
What is Cloud-Computing?

Cloud computing is a technology for storing, managing, processing, and accessing data over the internet instead of on a local server or computer hard drive. Here, the term cloud is taken from the symbol used to represent the internet in flowcharts. Remote servers are used in cloud computing to store the data, which can be accessed from anywhere using the internet.

With the help of cloud computing, an organization can save a lot of the cost of local data storage, maintenance of data, etc. The information on the cloud can be accessed by anyone, anywhere, and at any time, with the help of the internet.
Using cloud computing instead of traditional storage helps users with lots of benefits such as
speed, cost-effectiveness, security, global access, etc.

Cloud computing involves two main concepts:


o Vendors who provide the software apps on the clouds.
o Clients who access the software apps via cloud.
Types of Cloud Computing

Types of Cloud
1. Public cloud
2. Private cloud
3. Hybrid cloud
4. Community cloud
Public Cloud
Public clouds are managed by third parties which provide cloud services over the internet to the public; these services are available under pay-as-you-go billing models.
They offer solutions for minimizing IT infrastructure costs and are a good option for handling peak loads on the local infrastructure.
Public clouds are the go-to option for small enterprises, which can start their businesses without large upfront investments by completely relying on public infrastructure for their IT needs.
The fundamental characteristic of public clouds is multitenancy. A public cloud is meant to serve multiple users, not a single customer. Each user requires a virtual computing environment that is separated, and most likely isolated, from other users.
Public cloud
Private cloud
Private clouds are distributed systems that work on private infrastructure and provide the users with dynamic provisioning of computing resources. Instead of a pay-as-you-go model, private clouds may use other schemes that manage the usage of the cloud and proportionally bill the different departments or sections of an enterprise. Private cloud providers are HP Data Centers, Ubuntu, Elastic-Private cloud, Microsoft, etc.

Hybrid cloud:
A hybrid cloud is a heterogeneous distributed system formed by combining facilities of the
public cloud and private cloud. For this reason, they are also called heterogeneous clouds.
A major drawback of private deployments is the inability to scale on demand and to efficiently address peak loads. This is where public clouds are needed. Hence, a hybrid cloud takes advantage of both public and private clouds.

Community cloud:
Community clouds are distributed systems created by integrating the services of different
clouds to address the specific needs of an industry, a community, or a business sector. But
sharing responsibilities among the organizations is difficult.
In the community cloud, the infrastructure is shared between organizations that have shared
concerns or tasks. The cloud may be managed by an organization or a third party.

Community Cloud

Sectors that use community clouds are:


1. Media industry: Media companies are looking for quick, simple, low-cost ways for
increasing the efficiency of content generation. Most media productions involve an extended
ecosystem of partners. In particular, the creation of digital content is the outcome of a
collaborative process that includes the movement of large data, massive compute-intensive
rendering tasks, and complex workflow executions.
2. Healthcare industry: In the healthcare industry, community clouds are used to share information and knowledge globally, while keeping sensitive data in the private infrastructure.
3. Energy and core industry: In these sectors, the community cloud is used to cluster a set of solutions which collectively address the management, deployment, and orchestration of services and operations.
4. Scientific research: Here, organizations with common interests in science share a large distributed infrastructure for scientific computing.
Types of Cloud Services

Infrastructure as a Service (IaaS)

IaaS is also known as Hardware as a Service (HaaS). It is a computing infrastructure managed over the internet. The main advantage of using IaaS is that it helps users avoid the cost and complexity of purchasing and managing physical servers.

Characteristics of IaaS

There are the following characteristics of IaaS -


o Resources are available as a service
o Services are highly scalable
o Dynamic and flexible
o GUI and API-based access
o Automated administrative tasks

Example: DigitalOcean, Linode, Amazon Web Services (AWS), Microsoft Azure, Google
Compute Engine (GCE), Rackspace, and Cisco Metacloud.
Platform as a Service (PaaS)

PaaS cloud computing platform is created for the programmer to develop, test, run, and manage
the applications.

Characteristics of PaaS

There are the following characteristics of PaaS -


o Accessible to various users via the same development application.
o Integrates with web services and databases.
o Builds on virtualization technology, so resources can easily be scaled up or down as per
the organization's need.
o Support multiple languages and frameworks.
o Provides an ability to "Auto-scale".

Example: AWS Elastic Beanstalk, Windows Azure, Heroku, Force.com, Google App Engine,
Apache Stratos, Magento Commerce Cloud, and OpenShift.
Software as a Service (SaaS)

SaaS is also known as "on-demand software". It is software in which the applications are hosted by a cloud service provider. Users can access these applications with the help of an internet connection and a web browser.

Characteristics of SaaS

There are the following characteristics of SaaS -


o Managed from a central location
o Hosted on a remote server
o Accessible over the internet
o Users are not responsible for hardware and software updates. Updates are applied
automatically.
o The services are purchased on the pay-as-per-use basis

Example: BigCommerce, Google Apps, Salesforce, Dropbox, ZenDesk, Cisco WebEx, Slack, and GoToMeeting.
Difference between IaaS, PaaS, and SaaS

The below table shows the difference between IaaS, PaaS, and SaaS -

IaaS: It provides a virtual data center to store information and create platforms for app development, testing, and deployment.
PaaS: It provides virtual platforms and tools to create, test, and deploy apps.
SaaS: It provides web software and apps to complete business tasks.

IaaS: It provides access to resources such as virtual machines, virtual storage, etc.
PaaS: It provides runtime environments and deployment tools for applications.
SaaS: It provides software as a service to the end-users.

IaaS: It is used by network architects.
PaaS: It is used by developers.
SaaS: It is used by end users.

IaaS: It provides only Infrastructure.
PaaS: It provides Infrastructure + Platform.
SaaS: It provides Infrastructure + Platform + Software.

Cloud computing provides IT services through the internet. These services are placed in different
remote places. The services can be divided into three main categories:

1. Software-as-a-Service (SaaS)
2. Platform-as-a-Service (PaaS)
3. Infrastructure-as-a-Service (IaaS)
Of the above three services, Salesforce provides two services to its users: SAAS and PAAS.

SAAS(Salesforce.com)

Software-as-a-Service is a way of providing applications as a service over the internet. SaaS services can be directly accessed using the internet instead of installing each application on the local drive or system.

Salesforce.com is a SAAS service provider that provides various online applications for CRM. There is no need to install any software or server on a local machine; instead, we can start the business on it just by signing up.

PAAS(Force.com)

Platform-as-a-Service, or PaaS, is a type of cloud computing service where a service provider such as Salesforce.com provides a platform for its clients to work on. On such platforms, the users or clients can run, develop, test, or manage business applications without any IT infrastructure of their own.

It lies between the SaaS and IaaS services, and provides a building block by which we can create
our solutions.
Google App Engine is one of the great examples of PaaS services. Currently, it provides online
Python and Java Runtime platforms to develop web applications without any need for
complicated software & hardware.

The Force.com platform also offers PaaS services. It uses its own proprietary language.

Infrastructure-as-a-Service (IaaS)

IaaS is a type of cloud computing service that offers rental computing infrastructure. The cloud provider offers various infrastructure services such as servers, virtual machines, network storage, etc.
Features of Cloud Computing

Cloud computing is becoming popular day by day. Continuous business expansion and growth
requires huge computational power and large-scale data storage systems. Cloud computing can
help organizations expand and securely move data from physical locations to the 'cloud' that can
be accessed anywhere.

Cloud computing has many features that make it one of the fastest growing industries at present.

1. Resources Pooling
Resource pooling is one of the essential features of cloud computing. Resource pooling means that a cloud service provider can share resources among multiple clients, providing each of them with a different set of services according to their needs. It is a multi-client strategy that can be applied to data storage, processing, and bandwidth-delivered services. The administrative process of allocating resources in real time does not conflict with the client's experience.

2. On-Demand Self-Service
It is one of the important and essential features of cloud computing. This enables the client to
continuously monitor server uptime, capabilities and allocated network storage. This is a
fundamental feature of cloud computing, and a customer can also control the computing
capabilities according to their needs.
3. Easy Maintenance
This is one of the best cloud features. Servers are easily maintained, and downtime is minimal or sometimes zero. Cloud-computing-powered resources often undergo several updates to optimize their capabilities and potential. The updates are more compatible with devices and perform faster than previous versions.

4. Scalability And Rapid Elasticity


A key feature and advantage of cloud computing is its rapid scalability. This cloud feature
enables cost-effective handling of workloads that require a large number of servers but only for a
short period. Many customers have workloads that can be run very cost-effectively due to the
rapid scalability of cloud computing.

5. Economical
This cloud feature helps in reducing the IT expenditure of organizations. In cloud computing, clients need to pay the provider only for the space used by them. There are no hidden or additional charges to be paid. The service is economical, and more often than not, some space is allocated for free.

6. Measured And Reporting Service


Reporting Services is one of the many cloud features that make it the best choice for
organizations. The measurement and reporting service is helpful for both cloud providers and
their customers. This enables both the provider and the customer to monitor and report which
services have been used and for what purposes. It helps in monitoring billing and ensuring
optimum utilization of resources.

7. Security
Data security is one of the best features of cloud computing. Cloud services make a copy of the
stored data to prevent any kind of data loss. If one server loses data by any chance, the copied
version is restored from the other server. This feature comes in handy when multiple users are
working on a particular file in real-time, and one file suddenly gets corrupted.

8. Automation
Automation is an essential feature of cloud computing. The ability of cloud computing to
automatically install, configure and maintain a cloud service is known as automation in cloud
computing. In simple words, it is the process of making the most of the technology and
minimizing the manual effort. However, achieving automation in a cloud ecosystem is not that
easy. This requires the installation and deployment of virtual machines, servers, and large
storage. On successful deployment, these resources also require constant maintenance.

9. Resilience
Resilience in cloud computing means the ability of a service to quickly recover from any
disruption. The resilience of a cloud is measured by how fast its servers, databases and network
systems restart and recover from any loss or damage. Availability is another key feature of cloud
computing. Since cloud services can be accessed remotely, there are no geographic restrictions or
limits on the use of cloud resources.

10. Large Network Access


A big part of the cloud's characteristics is its ubiquity. The client can access cloud data or
transfer data to the cloud from any location with a device and internet connection. These
capabilities are available everywhere in the organization and are achieved with the help of
internet. Cloud providers deliver that large network access by monitoring and guaranteeing
measurements that reflect how clients access cloud resources and data: latency, access times,
data throughput, and more.

The services can be scaled up and down as per the client requirements.
Benefits of Cloud-computing

1. Cost-Effective: The cloud computing platform is very cost-effective, as there is no requirement to save data on local drives or maintain any hardware setup.
2. 24*7 Availability: One of the most significant advantages of cloud computing is that the
data or any service available in the cloud can be accessed any time from anywhere.
3. High-Security: Data stored on local hard drives may be lost, and if the data is highly confidential, this can seriously affect an organization. But with cloud platforms, the data is highly secured in the cloud, so the risk of data loss is reduced with cloud computing.
4. Easy Access: Cloud applications can be accessed from anywhere and anytime.
5. Fast Implementation: To implement any new application, it may take a long time. But
with cloud applications, this time can be reduced a lot. With most cloud applications, we
just need to sign-up, and we can start working on it.
6. Instant Scalability: Cloud-based applications enable an organization to easily increase or decrease the number of users as per its requirements. Hence, we don't need to worry about availability or running out of capacity.
7. Automated updates: Upgrading, maintaining, or testing an application can take many days. But with cloud applications, such work is not necessary because cloud applications have automated update software and can be updated automatically.
8. Collaboration: Cloud-computing enhances collaboration. It means that various groups of
an organization can connect virtually and share useful information and data on the cloud-
platforms. It improves the customer services and product development process in any
organization.

Data Intensive Computing


Data Intensive Computing is a class of parallel computing which uses data parallelism in order to process large volumes of data. The size of this data is typically in terabytes or petabytes. This large amount of data is generated each day, and it is referred to as Big Data.

Data intensive computing has some characteristics which are different from other forms of
computing. They are:
 In order to achieve high performance in data intensive computing, it is necessary to minimize the
movement of data. This reduces system overhead and increases performance by allowing the
algorithms to execute on the node where the data resides.
 The data intensive computing system utilizes a machine independent approach where the run
time system controls the scheduling, execution, load balancing, communications and the
movement of programs.
 Data intensive computing hugely focuses on reliability and availability of data. Traditional large
scale systems may be susceptible to hardware failures, communication errors and software bugs,
and data intensive computing is designed to overcome these challenges.
 Data intensive computing is designed for scalability so it can accommodate any amount of data
and so it can meet the time critical requirements. Scalability of the hardware as well as the
software architecture is one of the biggest advantages of data intensive computing.

UNIT-2
Virtualization In Cloud Computing and Types
Virtualization is a technique for separating a service from the underlying physical delivery of that service. It is the process of creating a virtual version of something, such as computer hardware. It was initially developed during the mainframe era. It involves using specialized software to create a virtual, software-created version of a computing resource rather than the actual version of the same resource. With the help of virtualization, multiple operating systems and applications can run on the same machine and the same hardware at the same time, increasing the utilization and flexibility of the hardware.
In other words, virtualization is one of the main cost-saving, hardware-reducing, and energy-saving techniques used by cloud providers. Virtualization allows a single physical instance of a resource or an application to be shared among multiple customers and organizations at the same time. It does this by assigning a logical name to a physical resource and providing a pointer to that physical resource on demand. The term virtualization is often synonymous with hardware virtualization, which plays a fundamental role in efficiently delivering Infrastructure-as-a-Service (IaaS) solutions for cloud computing. Moreover, virtualization technologies provide a virtual environment not only for executing applications but also for storage, memory, and networking.
The machine on which the virtual machine is going to be built is known as the Host Machine, and that virtual machine is referred to as the Guest Machine.

BENEFITS OF VIRTUALIZATION
1. More flexible and efficient allocation of resources.
2. Enhance development productivity.
3. It lowers the cost of IT infrastructure.
4. Remote access and rapid scalability.
5. High availability and disaster recovery.
6. Pay-per-use of the IT infrastructure on demand.
7. Enables running multiple operating systems.
Types of Virtualization:
1. Application Virtualization.
2. Network Virtualization.
3. Desktop Virtualization.
4. Storage Virtualization.
5. Server Virtualization.
6. Data virtualization.
1. Application Virtualization:
Application virtualization helps a user to have remote access to an application from a server. The server stores all personal information and other characteristics of the application, but the application can still run on a local workstation through the internet. An example of this would be a user who needs to run two different versions of the same software. Technologies that use application virtualization are hosted applications and packaged applications.
2. Network Virtualization:
The ability to run multiple virtual networks, each with a separate control and data plane. They co-exist on top of one physical network and can be managed by individual parties that are potentially not trusted by each other.
Network virtualization provides the facility to create and provision virtual networks (logical switches, routers, firewalls, load balancers, Virtual Private Networks (VPNs), and workload security) within days or even weeks.
3. Desktop Virtualization:
Desktop virtualization allows the user's OS to be remotely stored on a server in the data centre. It allows the user to access their desktop virtually, from any location, on a different machine. Users who want specific operating systems other than Windows Server will need a virtual desktop. The main benefits of desktop virtualization are user mobility, portability, and easy management of software installation, updates, and patches.
4. Storage Virtualization:
Storage virtualization is an array of servers that are managed by a virtual storage system. The servers aren't aware of exactly where their data is stored and instead function more like worker bees in a hive. It makes it possible for storage from multiple sources to be managed and utilized as a single repository. Storage virtualization software maintains smooth operations, consistent performance, and a continuous suite of advanced functions despite changes, breakdowns, and differences in the underlying equipment.
5. Server Virtualization:
This is a kind of virtualization in which masking of server resources takes place. Here, the central server (physical server) is divided into multiple different virtual servers by changing the identity numbers and processors, so each system can operate its own operating system in an isolated manner, while each sub-server knows the identity of the central server. It increases performance and reduces operating cost through the deployment of main server resources into sub-server resources. It is beneficial for virtual migration, reduced energy consumption, reduced infrastructure cost, etc.
6. Data virtualization:
This is the kind of virtualization in which data is collected from various sources and managed in a single place, without needing to know the technical details of how the data is collected, stored, and formatted. The data is then arranged logically so that its virtual view can be accessed remotely by interested people, stakeholders, and users through various cloud services. Many big companies provide such services, such as Oracle, IBM, AtScale, CData, etc.
It can be used to perform various kinds of tasks such as:
 Data-integration
 Business-integration
 Service-oriented architecture data-services
 Searching organizational data

Full Virtualization and Para virtualization


1. Full Virtualization: Full virtualization was introduced by IBM in the year 1966. It was the first software solution for server virtualization and uses binary translation and direct-approach techniques. In full virtualization, the guest OS is completely isolated by the virtual machine from the virtualization layer and hardware. Microsoft and Parallels systems are examples of full virtualization.

2. Paravirtualization: Paravirtualization is the category of CPU virtualization which uses hypercalls for operations, handling instructions at compile time. In paravirtualization, the guest OS is not completely isolated, but partially isolated, by the virtual machine from the virtualization layer and hardware. VMware and Xen are some examples of paravirtualization.

The differences between Full Virtualization and Paravirtualization are as follows:

1. Full Virtualization: Virtual machines permit the execution of instructions, with an unmodified OS running, in an entirely isolated way.
   Paravirtualization: A virtual machine does not implement full isolation of the OS but rather provides a different API, which is utilized when the OS is subjected to alteration.

2. Full Virtualization: It is less secure.
   Paravirtualization: It is more secure than full virtualization.

3. Full Virtualization: It uses binary translation and a direct approach as techniques for operations.
   Paravirtualization: It uses hypercalls at compile time for operations.

4. Full Virtualization: It is slower than paravirtualization in operation.
   Paravirtualization: It is faster in operation compared to full virtualization.

5. Full Virtualization: It is more portable and compatible.
   Paravirtualization: It is less portable and compatible.

6. Full Virtualization: Examples are Microsoft and Parallels systems.
   Paravirtualization: Examples are Microsoft Hyper-V, Citrix Xen, etc.

7. Full Virtualization: It supports all guest operating systems without modification.
   Paravirtualization: The guest operating system has to be modified, and only a few operating systems support it.

8. Full Virtualization: The guest operating system issues hardware calls.
   Paravirtualization: Using drivers, the guest operating system communicates directly with the hypervisor.

9. Full Virtualization: It is less streamlined compared to paravirtualization.
   Paravirtualization: It is more streamlined.

10. Full Virtualization: It provides the best isolation.
    Paravirtualization: It provides less isolation compared to full virtualization.
UNIT-3
Network Virtualization in Cloud Computing

Network Virtualization is a process of logically grouping physical networks and making them
operate as single or multiple independent networks called Virtual Networks.

General Architecture Of Network Virtualization

Tools for Network Virtualization :


1. Physical switch OS –
It is the physical switch whose OS must have the functionality of network virtualization.
2. Hypervisor –
It uses third-party software or built-in networking functionality to provide network virtualization.
The basic functionality of the OS is to provide the application or executing process with a simple set of instructions. System calls generated by the OS and executed through the libc library are comparable to the service primitives provided at the interface between the application and the network through the SAP (Service Access Point).
The hypervisor is used to create a virtual switch and configure virtual networks on it. Third-party software can be installed onto the hypervisor to replace its native networking functionality. A hypervisor allows us to have various VMs all working optimally on a single piece of computer hardware.
Functions of Network Virtualization :
 It enables the functional grouping of nodes in a virtual network.
 It enables the virtual network to share network resources.
 It allows communication between nodes in a virtual network without routing of
frames.
 It restricts management traffic.
 It enforces routing for communication between virtual networks.
Network Virtualization in Virtual Data Center :
1. Physical Network
 Physical components: Network adapters, switches, bridges, repeaters, routers and hubs.
 Grants connectivity among physical servers running a hypervisor, between physical
servers and storage systems and between physical servers and clients.
2. VM Network
 Consists of virtual switches.
 Provides connectivity to hypervisor kernel.
 Connects to the physical network.
 Resides inside the physical server.

Network Virtualization In VDC

Advantages of Network Virtualization :


Improves manageability –
 Grouping and regrouping of nodes are eased.
 Configuration of VM is allowed from a centralized management workstation using
management software.
Reduces CAPEX –
 The requirement to set up separate physical networks for different node groups is
reduced.
Improves utilization –
 Multiple VMs are enabled to share the same physical network which enhances the
utilization of network resource.
Enhances performance –
 Network broadcast is restricted and VM performance is improved.
Enhances security –
 Sensitive data is isolated from one VM to another VM.
 Access to nodes is restricted in a VM from another VM.
Disadvantages of Network Virtualization :
 IT needs to be managed in the abstract.
 It needs to coexist with physical devices in a cloud-integrated hybrid environment.
 Increased complexity.
 Upfront cost.
 Possible learning curve.
Examples of Network Virtualization :
Virtual LAN (VLAN) –
 The performance and speed of busy networks can be improved by VLAN.
 VLAN can simplify additions or any changes to the network.
Network Overlays –
 A framework is provided by an encapsulation protocol called VXLAN for overlaying
virtualized layer 2 networks over layer 3 networks.
 The Generic Network Virtualization Encapsulation protocol (GENEVE) provides a new approach to encapsulation, designed to provide control-plane independence between the endpoints of the tunnel.
Network Virtualization Platform: VMware NSX –
 VMware NSX Data Center delivers networking and security components such as switching, firewalling, and routing that are defined and consumed in software.
 It brings the operational model of a virtual machine (VM) to the network.
Applications of Network Virtualization :
 Network virtualization may be used in the development of application testing to mimic
real-world hardware and system software.
 It helps us to integrate several physical networks into a single network, or to separate a single physical network into multiple logical networks.
 In the field of application performance engineering, network virtualization allows the simulation of connections between applications, services, dependencies, and end-users for software testing.
 It helps us to deploy applications in a quicker time frame, thereby supporting a faster go-to-
market.
 Network virtualization helps the software testing teams to derive actual results with
expected instances and congestion issues in a networked environment.

Types of Virtualization

 Operating System Virtualization


 Hardware Virtualization
 Server Virtualization
 Storage Virtualization

a. Operating System Virtualization

In operating system virtualization in Cloud Computing, the virtual machine software is installed in the operating system of the host rather than directly on the hardware.
The most important use of operating system virtualization is for testing applications on different platforms or operating systems. Here, the software is present on the hardware, which allows different applications to run.

b. Server Virtualization
In server virtualization in Cloud Computing, the software is installed directly on the server system, and a single physical server can be divided into many servers on demand to balance the load.
It can also be stated that server virtualization is the masking of server resources, which consist of their number and identity. With the help of software, the server administrator divides one physical server into multiple servers.

c. Hardware Virtualization
Hardware virtualization in Cloud Computing is used on server platforms, as it is more flexible to use virtual machines rather than physical machines. In hardware virtualization, the virtual machine software is installed directly on the hardware system; this is then known as hardware virtualization.
It includes a hypervisor which is used to control and monitor the processor, memory, and other hardware resources. After the hardware virtualization process is complete, the user can install different operating systems on it, and different applications can run on that platform.

d. Storage Virtualization
In storage virtualization in Cloud Computing, physical storage from multiple network storage devices is grouped so that it looks like a single storage device.
It can be implemented with the help of software applications, and storage virtualization is often done for backup and recovery purposes. It is a sharing of physical storage from multiple storage devices.
e. Memory Virtualization

1. A technique that gives an application program the impression that it has its own contiguous
logical memory independent of available physical memory.

2. Memory virtualization is a generalization of the concept of virtual memory.


3. Virtual memory makes application programming easier by hiding the fragmentation of physical memory.

4. In a virtual memory implementation, a memory address space is divided into contiguous blocks of fixed-size pages.

5. Paging saves inactive memory pages onto the disk and brings them back to physical
memory when required.

6. The space used on disk by the VMM (Virtual Machine Monitor) is known as a “Swap File”.

7. Swap is a portion of the local storage environment that is designated as memory for the host system.

8. The hosts see the local swap as additional addressable memory locations and do not delineate between RAM and Swap.

9. High-bandwidth, low-latency environments are making use of memory virtualization as well.

Benefits of using memory virtualization:

1. Higher memory utilization by sharing contents and consolidating more virtual machines
on a physical host.

2. Ensuring that some memory space exists, rather than halting services until memory frees up.

3. Access to more memory than the chassis can physically allow.

4. Advanced server virtualization functions, like live migrations.

It introduces a way to decouple memory from the server to provide a shared, distributed or
networked function.
It enhances performance by providing greater memory capacity without any addition to the main memory; to this end, a portion of the disk drive serves as an extension of the main memory.
Implementations –
 Application-level integration – Applications running on connected computers directly connect
to the memory pool through an API or the file system.

 Operating System-Level Integration – The operating system first connects to the memory pool
and makes that pooled memory available to applications.

Benefits of Virtualization
Virtualization in Cloud Computing has numerous benefits; let's discuss them one by one:

i. Security
Security is one of the important concerns during the process of virtualization. Security can be provided with the help of firewalls, which help to prevent unauthorized access and keep the data confidential.
Moreover, with the help of firewalls and other security measures, the data can be protected from harmful viruses, malware, and other cyber threats. Encryption also takes place, with protocols that protect the data from other threats.
So, the customer can virtualize all the stored data and create a backup on a server on which the data can be stored.
ii. Flexible operations
With the help of a virtual network, the work of IT professionals is becoming more efficient and agile. The network switches implemented today are very easy to use, flexible, and time-saving.
With the help of virtualization in Cloud Computing, technical problems in physical systems can be solved. It eliminates the problem of recovering data from crashed or corrupted devices and hence saves time.
iii. Economical
Virtualization in Cloud Computing saves the cost of a physical system, such as hardware and servers. It stores all the data on virtual servers, which are quite economical.
It reduces wastage and decreases electricity bills along with maintenance costs. Because of this, a business can run multiple operating systems and apps on a single server.
iv. Eliminates the risk of system failure
While performing some task, there are chances that the system might crash at the wrong time. Such a failure can cause damage to the company, but virtualization helps you to perform the same task on multiple devices at the same time.
The data can be stored in the cloud and retrieved anytime with the help of any device. Moreover, two servers work side by side, which makes the data accessible at all times; even if one server crashes, the customer can access the data with the help of the second server.
v. Flexible transfer of data
Data can be transferred to the virtual server and retrieved anytime. Customers or the cloud provider don't have to waste time searching hard drives for data. With the help of virtualization, it is very easy to locate the required data and transfer it to the allotted authorities.
This transfer of data has no limit and can cover long distances with the minimum charge possible. Additional storage can also be provided at a cost as low as possible.

So, in the hotspot mitigation problem, we first have to detect the hotspots, which is itself not trivial; we will see some of the methods for detecting hotspots. Having detected the hotspots, we then have to invoke a new allocation strategy to deal with the overloaded virtual machines, and sometimes virtual machine migration is required; all of this is contained in the hotspot mitigation algorithms. The hotspot mitigation algorithm determines which physical servers have sufficient resources for the over-provisioned virtual machines, so that those virtual machines can be migrated there in order to mitigate the hotspots.
Determining a new mapping of virtual machines to physical machines that avoids the threshold violations specified in the service level agreement is an NP-hard problem. There exists an NP-complete problem, the multidimensional bin packing problem, which can be reduced to the hotspot mitigation problem that we have just described. In this reduction, each server is a bin with multiple dimensions corresponding to its resource constraints, and each virtual machine is an object that needs to be packed, with size equal to its resource requirements. Even the problem of determining whether a valid packing of multidimensional bins exists is itself a hard problem.
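Since exact multidimensional bin packing is intractable, practical placement falls back on heuristics. Below is a sketch of a first-fit-decreasing heuristic in Python for placing VMs, sized by CPU and memory demand, onto servers with fixed capacities; the sizes and the ordering key are illustrative assumptions, not a specific published algorithm.

# Each VM is an object sized by its resource requirements; each server is a
# bin whose dimensions are its remaining CPU and memory capacity.
def first_fit_decreasing(vms, server_capacity):
    # vms: list of (cpu, mem) demands; server_capacity: (cpu, mem) per server.
    # Sort largest-first by total demand -- the "decreasing" step.
    vms = sorted(vms, key=lambda v: v[0] + v[1], reverse=True)
    servers = []  # each entry: [remaining_cpu, remaining_mem, placed_vms]
    cap_cpu, cap_mem = server_capacity
    for cpu, mem in vms:
        for srv in servers:        # first server with room in BOTH dimensions
            if srv[0] >= cpu and srv[1] >= mem:
                srv[0] -= cpu
                srv[1] -= mem
                srv[2].append((cpu, mem))
                break
        else:                      # no existing server fits: open a new one
            servers.append([cap_cpu - cpu, cap_mem - mem, [(cpu, mem)]])
    return [srv[2] for srv in servers]

# Hypothetical demands: (cpu_cores, mem_gb) per VM; each server has (8, 32).
placement = first_fit_decreasing([(4, 16), (2, 8), (6, 24), (1, 4), (3, 8)],
                                 (8, 32))
for i, vms_on_server in enumerate(placement):
    print(f"server {i}: {vms_on_server}")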

What is geo distribution in cloud computing?


For capacity-intensive workloads, users need fast, local access to data. Cloudian's global
data fabric makes this easy with storage nodes that can be deployed anywhere. Locate storage
nodes near data users or data sources to minimize latency and network traffic while maximizing
throughput.
What is a Geo-distributed application?
A geo-distributed app is an app that spans multiple geographic locations for high availability, resiliency, compliance, and performance; its architecture must therefore support all of these properties.
Geo-Distribution
Replicate data to other points in the fabric, employing the consistency level you choose.
Whether for data collection, distribution, or data protection, Cloudian makes it simple to put
storage where you need it.

What is SDN?
Server virtualization
Server virtualization is a method of running multiple independent virtual operating systems
on a single physical computer. Server virtualization allows optimal use of physical hardware
and dynamic scalability where virtual servers can be created or deleted much like files.
Server Virtualization Definition
Server virtualization is the process of dividing a physical server into multiple unique and isolated
virtual servers by means of a software application. Each virtual server can run its own operating
systems independently.

Key Benefits of Server Virtualization:


 Higher server availability
 Cheaper operating costs
 Eliminate server complexity
 Increased application performance
 Deploy workload quicker

Three Kinds of Server Virtualization:


 Full Virtualization: Full virtualization uses a hypervisor, a type of software
that directly communicates with a physical server's disk space and CPU. The
hypervisor monitors the physical server's resources and keeps each virtual
server independent and unaware of the other virtual
servers. It also relays resources from the physical server to the correct virtual
server as it runs applications. The biggest limitation of using full virtualization is
that a hypervisor has its own processing needs. This can slow down applications
and impact server performance.
 Para-Virtualization: Unlike full virtualization, para-virtualization involves the
entire network working together as a cohesive unit. Since each operating system on
the virtual servers is aware of one another in para-virtualization, the hypervisor does
not need to use as much processing power to manage the operating systems.
 OS-Level Virtualization: Unlike full and para-virtualization, OS-level virtualization does not use a hypervisor. Instead, the virtualization capability, which is part of the physical server's operating system, performs all the tasks of a hypervisor. However, all the virtual servers must run the same operating system in this server virtualization method.

Why Server Virtualization?

Server virtualization is a cost-effective way to provide web hosting services and effectively
utilize existing resources in IT infrastructure. Without server virtualization, servers only use a
small part of their processing power. This results in servers sitting idle because the workload is
distributed to only a portion of the network’s servers. Data centers become overcrowded with
underutilized servers, causing a waste of resources and power.

By having each physical server divided into multiple virtual servers, server virtualization allows
each virtual server to act as a unique physical device. Each virtual server can run its own
applications and operating system. This process increases the utilization of resources by making
each virtual server act as a physical server and increases the capacity of each physical machine.

UNIT-4

What is Apache Kafka?


Apache Kafka is a popular event streaming platform used to collect, process, and store streaming event data, or data that has no discrete beginning or end. Kafka makes possible a new generation of distributed applications capable of scaling to handle billions of streamed events per minute.

Until the arrival of event streaming systems like Apache Kafka and Google Cloud Pub/Sub, data
processing has typically been handled with periodic batch jobs, where raw data is first stored and
then later processed at arbitrary time intervals. For example, a telecom company might wait until
the end of the day, week, or month to analyze the millions of call records and calculate
accumulated charges.

One of the limitations of batch processing is that it’s not real time. Increasingly, organizations
want to analyze data in real time in order to make timely business decisions and take action
when interesting things happen. For example, the same telecom company mentioned above
might benefit from keeping customers apprised of charges in real time as a way to enhance the
overall customer experience.

This is where event streaming comes in. Event streaming is the process of continuously processing infinite streams of events, as they are created, in order to capture the time-value of data as well as to create push-based applications that take action whenever something interesting happens. Examples of event streaming include continuously analyzing log files generated by customer-facing web applications, monitoring and responding to customer behavior as users browse e-commerce websites, keeping a continuous pulse on customer sentiment by analyzing changes in clickstream data generated by social networks, or collecting and responding to telemetry data generated by Internet of Things (IoT) devices.
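As an illustration of this push-based model, here is a sketch using the third-party kafka-python client; the broker address, the topic name, the tariff, and the assumption that a Kafka broker is running locally are all illustrative.

# Event-streaming sketch with kafka-python (pip install kafka-python).
# Assumes a broker at localhost:9092 and a topic named "call-records".
import json
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"))

# Producer side: emit call-record events as they happen.
producer.send("call-records", {"caller": "+15551234", "seconds": 42})
producer.flush()

# Consumer side: process events continuously as they arrive, e.g. keeping a
# customer's accumulated charges up to date in real time.
consumer = KafkaConsumer(
    "call-records",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")))

for event in consumer:                       # blocks, one event at a time
    charge = event.value["seconds"] * 0.002  # hypothetical per-second tariff
    print(f"charge for {event.value['caller']}: ${charge:.4f}")
    break                                    # demo: stop after one event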
What is MapReduce in cloud computing?
MapReduce is a programming paradigm that enables massive scalability across
hundreds or thousands of servers in a Hadoop cluster. As the processing component,
MapReduce is the heart of Apache Hadoop. The term "MapReduce" refers to two separate
and distinct tasks that Hadoop programs perform.

Introduction To MapReduce
MapReduce is a Hadoop framework used for writing applications that can process large amounts of data on clusters. It can also be described as a programming model with which we can process huge datasets across computer clusters. This framework allows data to be stored in a distributed form. It works on huge volumes of data and an enormous scale of computing.

MapReduce consists of two phases: Map and Reduce. Map generally deals with the splitting and mapping of data, while Reduce tasks shuffle and reduce the data.
Hadoop is fully capable of running MapReduce programs written in various languages: Python, Java, and C++. This is very useful for performing large-scale data analysis using multiple machines in a cluster.
Applications Of MapReduce
Entertainment: To discover the most popular movies, based on what you like and what you have watched, Hadoop MapReduce can help you out. It mainly focuses on users' logs and clicks.
E-commerce: Numerous e-commerce providers, like Amazon, Walmart, and eBay, use the MapReduce programming model to identify favorite items based on customers' preferences or purchasing behavior.
This includes building item-recommendation mechanisms for e-commerce inventories and examining website records, purchase histories, user interaction logs, etc.

Data Warehouse: We can utilize MapReduce to analyze large data volumes in data warehouses
while implementing specific business logic for data insights.
Fraud Detection: Hadoop and MapReduce are used in financial enterprises, including organizations like banks, insurance providers, and payment platforms, for fraud detection, pattern identification, and business metrics through transaction analysis.
How does MapReduce Work?
The MapReduce algorithm contains two important tasks, namely Map and Reduce.

 The Map task takes a set of data and converts it into another set of data, where
individual elements are broken down into tuples (key-value pairs).
 The Reduce task takes the output from the Map as an input and combines those data
tuples (key-value pairs) into a smaller set of tuples.
The Reduce task is always performed after the Map job.

Input Phase − Here we have a Record Reader that translates each record in an input file and
sends the parsed data to the mapper in the form of key-value pairs.
Map − Map is a user-defined function, which takes a series of key-value pairs and processes each
one of them to generate zero or more key-value pairs.
Intermediate Keys − The key-value pairs generated by the mapper are known as intermediate keys.
Combiner − A combiner is a type of local Reducer that groups similar data from the map phase
into identifiable sets. It takes the intermediate keys from the mapper as input and applies a user-
defined code to aggregate the values in a small scope of one mapper. It is not a part of the main
MapReduce algorithm; it is optional.
Shuffle and Sort − The Reducer task starts with the Shuffle and Sort step. It downloads the
grouped key-value pairs onto the local machine, where the Reducer is running. The individual
key-value pairs are sorted by key into a larger data list.
The data list groups the equivalent keys together so that their values can be iterated easily in the
Reducer task.
Reducer − The Reducer takes the grouped key-value paired data as input and runs a Reducer
function on each group. Here, the data can be aggregated, filtered, and combined in a
number of ways, which may require a wide range of processing.
Once the execution is over, it gives zero or more key-value pairs to the final step.
Output Phase − In the output phase, an output formatter translates the final key-value pairs
from the Reducer function and writes them onto a file using a record writer.
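To make this flow concrete, here is a small framework-free Python sketch that mimics the
phases described above for a word count; the sample records are made up for illustration.

# Simulates Map -> Shuffle and Sort -> Reduce for a word count
from itertools import groupby

records = ["deer bear river", "car car river", "deer car bear"]

# Map: each record yields zero or more intermediate (key, value) pairs
intermediate = [(word, 1) for record in records for word in record.split()]

# Shuffle and Sort: sort by key so equal keys become adjacent, then group
intermediate.sort(key=lambda kv: kv[0])

# Reduce: combine each group's values into a smaller set of tuples
for key, pairs in groupby(intermediate, key=lambda kv: kv[0]):
    print(key, sum(value for _, value in pairs))
# prints: bear 2, car 3, deer 2, river 2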
Advantages of MapReduce
Fault tolerance: It can handle failures without downtime.
Speed: It splits, shuffles, and reduces unstructured data in a short time.
Cost-effectiveness: Hadoop MapReduce has a scale-out feature that enables users to process or
store data in a cost-effective manner.
Scalability: It provides a highly scalable framework; MapReduce allows users to run
applications across many nodes.
Parallel Processing: Multiple job-parts of the same dataset can be processed in parallel,
which can reduce the time taken to complete a task.
Limitations Of MapReduce
 MapReduce cannot cache intermediate data in memory for reuse, which
diminishes the performance of Hadoop.
 It is only suitable for batch processing of huge amounts of data.

What is Apache Spark?


Apache Spark is a unified analytics engine for large-scale data processing with built-in modules
for SQL, streaming, machine learning, and graph processing. Spark can run on Apache Hadoop,
Apache Mesos, Kubernetes, standalone, or in the cloud, and it can access diverse data sources.
Apache Spark was developed by a team at UC Berkeley in 2009. Since then, Apache Spark has
seen a very high adoption rate among top-notch technology companies like Google, Facebook,
Apple, and Netflix, and demand has kept increasing. According to a
marketanalysis.com survey, the Apache Spark market worldwide will grow at a CAGR of 67%
between 2019 and 2022; Spark market revenue may grow to $4.2 billion by 2022, with a
cumulative market valued at $9.2 billion (2019 - 2022).

As per Apache, “Apache Spark is a unified analytics engine for large-scale data processing”.

Spark is a cluster computing framework, somewhat similar to MapReduce but with many more
capabilities and features, much higher speed, and APIs for developers in many languages such
as Scala, Python, Java and R. It is also friendly for database developers, as it provides
Spark SQL, which supports most of the ANSI SQL functionality. Spark also has out-of-the-box
support for machine learning and graph processing through components called MLlib and
GraphX respectively, and it supports streaming data using Spark Streaming.

Spark is developed in the Scala programming language. Though the majority of Spark use cases
rely on HDFS as the underlying data storage layer, HDFS is not mandatory; Spark works with a
variety of other data sources like Cassandra, MySQL, AWS S3,
etc. Apache Spark also comes with its own default resource manager, which might be good enough
for a development environment and small clusters, but it also integrates very well with
YARN and Mesos. Most production-grade and large clusters use YARN or Mesos as
the resource manager.

Features of Spark

1. Speed: According to Apache, Spark can run applications on a Hadoop cluster up to 100
times faster in memory and up to 10 times faster on disk. Spark achieves such
speed by overcoming the drawback of MapReduce, which always writes all
intermediate results to disk. Spark does not need to write intermediate results to disk
and can work in memory using a DAG, lazy evaluation, RDDs and caching. Spark has a
highly optimized execution engine, which makes it so fast.
2. Fault Tolerance: Spark’s optimized execution engine not only makes it fast but also
fault tolerant. It achieves this using an abstraction layer called RDDs (Resilient
Distributed Datasets) in combination with the DAG, which is built to handle failures of
tasks or even node failures.
3. Lazy Evaluation: Spark works on a lazy evaluation technique. This means that
transformations on Spark RDDs/Datasets are evaluated lazily: the output RDDs/Datasets
are not computed when a transformation is declared but only when needed, i.e. when an
action is performed. The transformations are just part of the DAG, which gets executed
when an action is called (see the sketch after this list).
4. Multiple Language Support: Spark provides support for multiple programming
languages like Scala, Java, Python, R and also Spark SQL which is very similar to SQL.
5. Reusability: Spark code written once for batch processing jobs can also be reused for
stream processing, and it can be used to join historical batch data with streaming data
on the fly.
6. Machine Learning: MLlib is the machine learning library of Spark, available out of
the box for creating ML pipelines for data analysis and predictive analytics.
7. Graph Processing: Apache Spark also supports graph processing. Using the GraphX APIs,
which are again provided out of the box, one can write graph processing logic and do
graph-parallel computation.
8. Stream Processing and Structured Streaming: Spark can be used for batch
processing and can also cater to stream processing use cases with micro-batches.
Spark Streaming comes with Spark, so one does not need any other streaming tools
or APIs. Spark Streaming also supports Structured Streaming, and it has an in-built
connector for Apache Kafka, which comes in very handy when developing streaming
applications.
9. Spark SQL: Spark has excellent SQL support and an in-built SQL optimizer.
Spark SQL features are used heavily in warehouses to build ETL pipelines.
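To illustrate lazy evaluation and the RDD API, here is a minimal PySpark sketch. It assumes
a local Spark installation, and the input file data.txt is hypothetical.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("LazyEvalDemo").master("local[*]").getOrCreate()
sc = spark.sparkContext

lines = sc.textFile("data.txt")                  # transformation: nothing is read yet
words = lines.flatMap(lambda l: l.split())       # transformation: still nothing runs
pairs = words.map(lambda w: (w, 1))              # transformation: the DAG keeps growing
counts = pairs.reduceByKey(lambda a, b: a + b)   # transformation

# Only this action triggers execution of the whole DAG:
for word, count in counts.take(10):
    print(word, count)

spark.stop()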
Spark is being used in more than 1000 organizations that have built huge clusters for batch
processing, stream processing, building warehouses, building data analytics engines and also
predictive analytics platforms, using many of the above features of Spark. Let’s look at some
of the use cases in a few of these organizations.

What are the Different Apache Spark Applications?

Streaming Data:

Streaming data is basically unstructured data produced by different types of data sources. The
data sources could be anything: log files generated while customers use mobile apps or web
applications, social media content like tweets and Facebook posts, telemetry from connected
devices, or instrumentation in data centres. Streaming data is usually unbounded and is
processed as it is received from the data source.

Then there is Structured Streaming, which works on the principle of polling data at intervals;
this interval data is then processed and appended or updated to an unbounded result
table.

Apache Spark has a framework for both, i.e. Spark Streaming to handle streaming using micro-
batches and DStreams, and Structured Streaming using Datasets and DataFrames.

Let us try to understand Spark Streaming from an example.


Suppose a big retail chain company wants to get a real-time dashboard to keep a close eye on
its inventory and operations. Using this dashboard the management should be able to track how
many products are being purchased, shipped and delivered to customers.

Spark Streaming can be an ideal fit here.

The order management system pushes order statuses to a queue (which could be Kafka), from
where a streaming process reads every minute and picks up all the orders with their statuses.
The Spark engine then processes these and emits the output status counts. The Spark streaming
process runs like a daemon until it is killed or an error is encountered. A hedged sketch of
such a job follows.
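The following PySpark Structured Streaming sketch shows the idea. The Kafka broker address,
the topic name "orders" and the message schema are assumptions, and running it requires the
spark-sql-kafka package on the classpath.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StringType

spark = SparkSession.builder.appName("OrderStatusDashboard").getOrCreate()

# Assumed message layout: {"order_id": "...", "status": "shipped"}
schema = StructType().add("order_id", StringType()).add("status", StringType())

orders = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")  # assumed broker
          .option("subscribe", "orders")                        # assumed topic
          .load()
          .select(from_json(col("value").cast("string"), schema).alias("o"))
          .select("o.*"))

# Running count of orders per status (purchased / shipped / delivered)
status_counts = orders.groupBy("status").count()

query = (status_counts.writeStream
         .outputMode("complete")
         .format("console")                   # a real dashboard would use another sink
         .trigger(processingTime="1 minute")  # poll roughly every minute
         .start())
query.awaitTermination()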

Machine Learning:

As defined by Arthur Samuel in 1959, “Machine Learning is the field of study that gives
computers the ability to learn without being explicitly programmed”. In 1997, Tom Mitchell
gave a definition which is more specifically from an engineering perspective, “A computer
program is said to learn from experience E with respect to some task T and some performance
measure P, if its performance on T, as measured by P, improves with experience E.”. ML
solves complex problems that could not be solved with just mathematical numerical methods
or means. ML is not supposed to make perfect guesses. In ML’s domain, there is no such
thing. Its goal is to make a prediction or make guesses which are good enough to be useful.

MLlib is Apache Spark’s scalable machine learning library. MLlib has multiple
algorithms for supervised and unsupervised ML, which can scale out on a cluster, for
classification, regression, clustering and collaborative filtering. MLlib interoperates with
Python’s math/numerical analysis library NumPy and also with R’s libraries.
Some of these algorithms are also applicable to streaming data. MLlib
helps Spark provide sentiment analysis, customer segmentation and predictive intelligence.

A very common use case of ML is text classification, say for categorising emails. An ML
pipeline can be trained to classify emails by reading an inbox; a typical pipeline chains a
tokenizer, a feature extractor and a classifier, as sketched below. ML is a subject in itself,
so it is not possible to deep dive here.
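Here is a hedged sketch of such a pipeline using Spark's MLlib (spark.ml) API; the tiny
training set is made up purely for illustration.

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("EmailClassifier").getOrCreate()

# Toy training data: label 1.0 = spam, 0.0 = not spam (illustrative only)
training = spark.createDataFrame(
    [("win money now", 1.0), ("meeting at noon", 0.0), ("free prize inside", 1.0)],
    ["text", "label"])

# The pipeline chains a tokenizer, a feature extractor and a classifier
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashing_tf = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(labelCol="label", featuresCol="features")
pipeline = Pipeline(stages=[tokenizer, hashing_tf, lr])

model = pipeline.fit(training)
model.transform(training).select("text", "prediction").show()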

Fog Computing:

Fog computing is another use case of Apache Spark. To understand fog computing we first need
to understand IoT. IoT basically connects all our devices so that they can communicate
with each other and provide solutions to their users. This means huge
amounts of data, and current cloud computing may not be sufficient to cater to so much data
transfer, data processing and online demand from customers’ requests.

Fog computing can be ideal here, as it moves the work of processing to the devices at the edge
of the network. This needs very low latency, parallel processing of ML, and complex
graph analytical algorithms, all of which are readily available in Apache Spark out of the box
and can be picked and chosen as per the requirements of the processing. So it is expected that
as IoT gains momentum, Apache Spark will be the leader in fog computing.

 Event Detection: Apache Spark is increasingly used in event detection, such as credit
card fraud detection and the detection of money laundering activities. Apache Spark
Streaming, along with MLlib and Apache Kafka, forms the backbone of financial fraud
transaction detection. Credit card transactions of a cardholder can be captured over a
period of time to categorize the user’s spending habits. Models can then be developed
and trained to detect anomalies in card transactions and, together with Spark Streaming
and Kafka, flag them in real time.
 Interactive Analysis: One of Spark’s most popular features is its ability to provide
users with interactive analytics. MapReduce does provide tools like Pig and Hive for
interactive analysis, but they are too slow in most cases. Spark, in contrast, is very
fast, which is why it has gained so much ground in interactive analysis.
Spark interfaces with programming languages like R, Python, SQL and Scala, which
caters to a bigger set of developers and users for interactive analysis. Spark also
introduced Structured Streaming in version 2.0, which can be used for interactive
analysis of live data, as well as to join live data with batch output to get more
insight into the data. In the future, Structured Streaming has the potential to boost
web analytics by allowing users to query live web sessions. Even machine learning can
be applied to live session data for more insights.
 Data Warehousing: Data warehousing is another function where Apache Spark is
getting tremendous traction. Due to the day-by-day increase in data volumes,
traditional ETL tools like Informatica along with RDBMSs are not able to meet SLAs,
as they cannot scale horizontally. Spark, along with Spark SQL, is being used by
many companies to migrate to Big-Data-based warehouses, which can scale horizontally
as the load increases.
With Spark, even the processing can be scaled horizontally by adding machines to the
Spark engine cluster. These migrated applications embed the Spark engine and offer a
web UI that allows users to create, run, test and deploy jobs interactively. Jobs are
primarily written in native Spark SQL or other flavours of SQL. These Spark clusters
have been able to scale to process many terabytes of data every day, and the clusters
can have hundreds to thousands of nodes. A minimal Spark SQL sketch of such an ETL
step follows this list.
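The sketch below shows the kind of Spark SQL ETL step such migrated warehouse jobs run; the
paths, table and column names are made up for illustration.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WarehouseETL").getOrCreate()

# Extract: read raw sales data (the path is an assumption)
sales = spark.read.parquet("/warehouse/raw/sales")
sales.createOrReplaceTempView("sales")

# Transform: business logic expressed in plain SQL
daily_revenue = spark.sql("""
    SELECT order_date, SUM(amount) AS revenue
    FROM sales
    GROUP BY order_date
""")

# Load: write the aggregate back for downstream queries
daily_revenue.write.mode("overwrite").parquet("/warehouse/marts/daily_revenue")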
Companies Using Apache Spark

Apache Spark at Alibaba:

Alibaba is one of the world’s biggest e-commerce players. Alibaba’s online shopping
platform generates petabytes of data, as it has millions of users every day doing searches,
shopping and placing orders. These user interactions are represented as complex graphs. These
data points are processed using Spark’s machine learning component MLlib and then used to
provide a better shopping experience by suggesting products based on choices, trending
products, reviews, etc.

Apache Spark at MyFitnessPal:

MyFitnessPal is one of the largest health and fitness lifestyle portals, with over 80 million
active users. The portal helps its users follow and achieve a healthy lifestyle through a
proper diet and fitness regime. It uses the data added by users about their food,
exercise and lifestyle to identify the best-quality food and the most effective exercise.
Using Spark, the portal is able to scan through a huge amount of structured and unstructured
data and pull out the best suggestions for its users.

Apache Spark at TripAdvisor:

TripAdvisor has a huge user base and generates a mammoth amount of data every day. It is
one of the biggest names in the travel and tourism industry. It helps users plan their personal
and official trips around the world. It uses Apache Spark to process petabytes of data from user
interactions and destination details, and gives recommendations on planning a perfect trip based
on users’ choices and preferences. It helps users identify the best airlines, the best prices
on hotels and flights, the best places to eat: basically everything needed to plan any trip. It
also ranks these places, hotels, airlines and restaurants based on user feedback and reviews.
All this processing is done using Apache Spark.

Apache Spark at Yahoo:

Yahoo is known to have one of the biggest Hadoop clusters, and everyone is aware of Yahoo’s
contribution to the development of Big Data systems. Yahoo also heavily uses Apache Spark’s
machine learning capabilities to identify topics and news that users are interested in, similar
to trending tweets or hashtags on Twitter or Facebook. Earlier, these machine learning
algorithms were developed in C/C++ with thousands of lines of code, while today, with Spark
and Scala/Python, the same algorithms can be implemented in a few hundred lines of code. This
is a big leap in turnaround time as well as in code understanding and maintenance, and it has
been made possible to a great extent by Spark.
