
UNIT – 1

DISTRIBUTED DATABASES
Distributed Systems – Introduction – Architecture – Distributed Database Concepts –
Distributed Data Storage – Distributed Transactions – Commit Protocols –
Concurrency Control – Distributed Query Processing
Distributed Systems – Introduction
What is meant by distributed systems?
A distributed system is a computing environment in which various components are
spread across multiple computers (or other computing devices) on a network.
Distributed systems reduce the risks involved with having a single point of failure,
bolstering reliability and fault tolerance.

A distributed system contains multiple nodes that are physically separate but linked together
using the network. All the nodes in this system communicate with each other and handle
processes in tandem. Each of these nodes contains a small part of the distributed operating
system software.

Types of Distributed Systems


The nodes in a distributed system can be arranged in the form of client/server systems or
peer-to-peer systems. Details about these are as follows −
Client/Server Systems
In client server systems, the client requests a resource and the server provides that resource.
A server may serve multiple clients at the same time while a client is in contact with only one
server. Both the client and server usually communicate via a computer network and so they
are a part of distributed systems.
Peer to Peer Systems
Peer-to-peer systems contain nodes that are equal participants in data sharing. All the
tasks are equally divided between the nodes, which interact with each other as
required to share resources. This is done with the help of a network.
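The client/server interaction described above can be sketched with plain sockets. This is a minimal illustration, not any particular system's protocol; the resource name and single-client server loop are invented for the example:

```python
import socket
import threading

# The server socket is created first so the client cannot connect before it exists.
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))          # port 0: let the OS pick a free port
srv.listen()
port = srv.getsockname()[1]

def serve_one_client():
    """The server waits for a request and provides the requested resource."""
    conn, _ = srv.accept()
    with conn:
        request = conn.recv(1024).decode()
        conn.sendall(f"resource granted: {request}".encode())

threading.Thread(target=serve_one_client, daemon=True).start()

# The client requests a resource; it only needs to know the server's address.
with socket.create_connection(("127.0.0.1", port)) as client:
    client.sendall(b"printer")
    reply = client.recv(1024).decode()

print(reply)   # resource granted: printer
```

Because both processes communicate only through the network connection, the same client code works whether the server is on the same machine or a remote node.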
Advantages of Distributed Systems
Some advantages of Distributed Systems are as follows −
 All the nodes in the distributed system are connected to each other. So nodes can
easily share data with other nodes.
 More nodes can easily be added to the distributed system i.e. it can be scaled as
required.
 Failure of one node does not lead to the failure of the entire distributed system. Other
nodes can still communicate with each other.
 Resources like printers can be shared with multiple nodes rather than being restricted
to just one.
Disadvantages of Distributed Systems
Some disadvantages of Distributed Systems are as follows −
 It is difficult to provide adequate security in distributed systems because the nodes as
well as the connections need to be secured.
 Some messages and data can be lost in the network while moving from one node to
another.
 The database connected to a distributed system is quite complicated and difficult to
handle compared to that of a single-user system.
 Overloading may occur in the network if all the nodes of the distributed system try to
send data at once.

Architecture
In a distributed architecture, components are hosted on different platforms, and several
components can cooperate with one another over a communication network in order to
achieve a specific objective or goal.
 In this architecture, information processing is not confined to a single machine rather
it is distributed over several independent computers.
 A distributed system can be demonstrated by the client-server architecture which
forms the base for multi-tier architectures; alternatives are the broker architecture
such as CORBA, and the Service-Oriented Architecture (SOA).
 There are several technology frameworks to support distributed architectures,
including .NET, J2EE, CORBA, .NET Web services, AXIS Java Web services, and
Globus Grid services.
 Middleware is an infrastructure that appropriately supports the development and
execution of distributed applications. It provides a buffer between the applications
and the network.
 It sits in the middle of the system and manages or supports the different components of a
distributed system. Examples are transaction processing monitors, data converters,
communication controllers, etc.
Middleware as an infrastructure for distributed system
The basis of a distributed architecture is its transparency, reliability, and availability.
The following table lists the different forms of transparency in a distributed system −
Sr.No. Transparency & Description

1 Access
Hides the way in which resources are accessed and the differences in data platform.

2 Location
Hides where resources are located.

3 Technology
Hides different technologies such as programming language and OS from the user.

4 Migration / Relocation
Hides that resources in use may be moved to another location.

5 Replication
Hides that resources may be copied at several locations.

6 Concurrency
Hides that resources may be shared with other users.

7 Failure
Hides failure and recovery of resources from the user.

8 Persistence
Hides whether a resource (software) is in memory or on disk.
Advantages
 Resource sharing − Sharing of hardware and software resources.
 Openness − Flexibility of using hardware and software of different vendors.
 Concurrency − Concurrent processing to enhance performance.
 Scalability − Increased throughput by adding new resources.
 Fault tolerance − The ability to continue in operation after a fault has occurred.
Disadvantages
 Complexity − They are more complex than centralized systems.
 Security − More susceptible to external attack.
 Manageability − More effort required for system management.
 Unpredictability − Unpredictable responses depending on the system organization
and network load.
Centralized System vs. Distributed System
Criteria Centralized system Distributed System

Economics Low High

Availability Low High

Complexity Low High

Consistency Simple High

Scalability Poor Good

Technology Homogeneous Heterogeneous

Security High Low


Client-Server Architecture
The client-server architecture is the most common distributed system architecture which
decomposes the system into two major subsystems or logical processes −
 Client − This is the first process that issues a request to the second process i.e. the
server.
 Server − This is the second process that receives the request, carries it out, and sends
a reply to the client.
In this architecture, the application is modelled as a set of services that are provided by
servers and a set of clients that use these services. The servers need not know about clients,
but the clients must know the identity of the servers, and the mapping of processors to
processes is not necessarily 1 : 1.
Client-server Architecture can be classified into two models based on the functionality of the
client −
Thin-client model
In the thin-client model, all the application processing and data management is carried out
by the server. The client is simply responsible for running the presentation software.
 Used when legacy systems are migrated to client-server architectures, in which the
legacy system acts as a server in its own right, with a graphical interface implemented
on the client
 A major disadvantage is that it places a heavy processing load on both the server and
the network.
Thick/Fat-client model
In the thick-client model, the server is only in charge of data management. The software on
the client implements the application logic and the interactions with the system user.
 Most appropriate for new C/S systems where the capabilities of the client system are
known in advance
 More complex than a thin client model especially for management. New versions of
the application have to be installed on all clients.
Advantages
 Separation of responsibilities such as user interface presentation and business logic
processing.
 Reusability of server components and potential for concurrency
 Simplifies the design and the development of distributed applications
 It makes it easy to migrate or integrate existing applications into a distributed
environment.
 It also makes effective use of resources when a large number of clients are accessing
a high-performance server.
Disadvantages
 Lack of heterogeneous infrastructure to deal with the requirement changes.
 Security complications.
 Limited server availability and reliability.
 Limited testability and scalability.
 Fat clients with presentation and business logic together.
Multi-Tier Architecture (n-tier Architecture)
Multi-tier architecture is a client–server architecture in which the functions such as
presentation, application processing, and data management are physically separated. By
separating an application into tiers, developers obtain the option of changing or adding a
specific layer, instead of reworking the entire application. It provides a model by which
developers can create flexible and reusable applications.
The most general use of multi-tier architecture is the three-tier architecture. A three-tier
architecture is typically composed of a presentation tier, an application tier, and a data
storage tier, each of which may execute on a separate processor.
Presentation Tier
The presentation layer is the topmost level of the application, which users access directly,
such as a webpage or an operating system GUI (Graphical User Interface). The primary
function of this layer is to translate tasks and results into something the user can
understand. It communicates with the other tiers so that results can be placed in the
browser/client tier and the other tiers in the network.
Application Tier (Business Logic, Logic Tier, or Middle Tier)
The application tier coordinates the application, processes commands, makes logical
decisions and evaluations, and performs calculations. It controls an application’s
functionality by performing detailed processing. It also moves and processes data between
the two surrounding layers.
Data Tier
In this layer, information is stored and retrieved from the database or file system. The
information is then passed back for processing and then back to the user. It includes the data
persistence mechanisms (database servers, file shares, etc.) and provides API (Application
Programming Interface) to the application tier which provides methods of managing the
stored data.
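The division of labour among the three tiers can be sketched in a few lines. The bank-account domain, class names, and in-memory dict are invented for illustration; in a real system the data tier would be a database server and each tier could run on its own machine:

```python
# Data tier: persistence mechanism plus an API for managing the stored data.
# A dict stands in for a database server here.
class DataTier:
    def __init__(self):
        self._rows = {1: {"name": "Ankit", "balance": 120.0}}

    def get(self, user_id):
        return self._rows[user_id]

    def put(self, user_id, row):
        self._rows[user_id] = row

# Application tier: business logic; it moves and processes data between the
# presentation and data tiers.
class ApplicationTier:
    def __init__(self, data_tier):
        self.data = data_tier

    def deposit(self, user_id, amount):
        row = self.data.get(user_id)
        row["balance"] += amount            # the "detailed processing" step
        self.data.put(user_id, row)
        return row

# Presentation tier: translates the result into something the user understands.
def present(row):
    return f"{row['name']}'s balance is now {row['balance']:.2f}"

app = ApplicationTier(DataTier())
message = present(app.deposit(1, 30.0))
print(message)   # Ankit's balance is now 150.00
```

Because each tier talks only to its neighbours, any one of them can be replaced (a new GUI, new business rules, a different database) without reworking the others.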

Advantages
 Better performance than a thin-client approach and is simpler to manage than a
thick-client approach.
 Enhances the reusability and scalability − as demands increase, extra servers can be
added.
 Provides multi-threading support and also reduces network traffic.
 Provides maintainability and flexibility
Disadvantages
 Unsatisfactory Testability due to lack of testing tools.
 More critical server reliability and availability.

Distributed DBMS - Concepts

For proper functioning of any organization, there’s a need for a well-maintained database. In
the recent past, databases used to be centralized in nature. However, with the increase in
globalization, organizations tend to be diversified across the globe. They may choose to
distribute data over local servers instead of a central database. Thus, arrived the concept
of Distributed Databases.
This chapter gives an overview of databases and Database Management Systems (DBMS).
A database is an ordered collection of related data, and a DBMS is a software package to
work upon a database. In this chapter, we revise the main concepts so that the study of
DDBMS can be done with ease. The three topics covered are database schemas, types of
databases and operations on databases.
Database and Database Management System
A database is an ordered collection of related data that is built for a specific purpose. A
database may be organized as a collection of multiple tables, where a table represents a real
world element or entity. Each table has several different fields that represent the
characteristic features of the entity.
For example, a company database may include tables for projects, employees, departments,
products and financial records. The fields in the Employee table may be Name,
Company_Id, Date_of_Joining, and so forth.
A database management system is a collection of programs that enables creation and
maintenance of a database. DBMS is available as a software package that facilitates
definition, construction, manipulation and sharing of data in a database. Definition of a
database includes description of the structure of a database. Construction of a database
involves actual storing of the data in any storage medium. Manipulation refers to the
retrieving information from the database, updating the database and generating reports.
Sharing of data facilitates data to be accessed by different users or programs.
Examples of DBMS Application Areas
 Automatic Teller Machines
 Train Reservation System
 Employee Management System
 Student Information System
Examples of DBMS Packages
 MySQL
 Oracle
 SQL Server
 dBASE
 FoxPro
 PostgreSQL, etc.
Database Schemas
A database schema is a description of the database which is specified during database design
and subject to infrequent alterations. It defines the organization of the data, the relationships
among them, and the constraints associated with them.
Databases are often represented through the three-schema or ANSI/SPARC
architecture. The goal of this architecture is to separate the user applications from the
physical database. The three levels are −
 Internal Level having Internal Schema − It describes the physical structure, details
of internal storage and access paths for the database.
 Conceptual Level having Conceptual Schema − It describes the structure of the
whole database while hiding the details of physical storage of data. This illustrates
the entities, attributes with their data types and constraints, user operations and
relationships.
 External or View Level having External Schemas or Views − It describes the
portion of a database relevant to a particular user or a group of users while hiding the
rest of database.
Types of DBMS
There are four types of DBMS.
Hierarchical DBMS
In hierarchical DBMS, the relationships among data in the database are established so that
one data element exists as a subordinate of another. The data elements have parent-child
relationships and are modelled using the “tree” data structure. These are very fast and
simple.

Network DBMS
A network DBMS is one where the relationships among data in the database are of type
many-to-many, in the form of a network. The structure is generally complicated due to the
existence of numerous many-to-many relationships. A network DBMS is modelled using
the “graph” data structure.

Relational DBMS
In relational databases, the database is represented in the form of relations. Each relation
models an entity and is represented as a table of values. In the relation or table, a row is
called a tuple and denotes a single record. A column is called a field or an attribute and
denotes a characteristic property of the entity. RDBMS is the most popular database
management system.
For example − A Student Relation −

Object Oriented DBMS


An object-oriented DBMS is derived from the model of the object-oriented programming
paradigm. It is helpful in representing both persistent data, as stored in databases, and
transient data, as found in executing programs. It uses small, reusable elements
called objects. Each object contains a data part and a set of operations that work upon the
data. Objects and their attributes are accessed through pointers instead of being stored in
relational table models.
For example − A simplified Bank Account object-oriented database −

Distributed DBMS
A distributed database is a set of interconnected databases that is distributed over the
computer network or internet. A Distributed Database Management System (DDBMS)
manages the distributed database and provides mechanisms so as to make the databases
transparent to the users. In these systems, data is intentionally distributed among multiple
nodes so that all computing resources of the organization can be optimally used.
Operations on DBMS
The four basic operations on a database are Create, Retrieve, Update and Delete.
 CREATE database structure and populate it with data − Creation of a database
relation involves specifying the data structures, data types and the constraints of the
data to be stored.
Example − SQL command to create a student table −
CREATE TABLE STUDENT (
ROLL INTEGER PRIMARY KEY,
NAME VARCHAR2(25),
YEAR INTEGER,
STREAM VARCHAR2(10)
);
 Once the data format is defined, the actual data is stored in accordance with the
format in some storage medium.
Example SQL command to insert a single tuple into the student table −
INSERT INTO STUDENT ( ROLL, NAME, YEAR, STREAM)
VALUES ( 1, 'ANKIT JHA', 1, 'COMPUTER SCIENCE');
 RETRIEVE information from the database – Retrieving information generally
involves selecting a subset of a table or displaying data from the table after some
computations have been done. It is done by querying upon the table.
Example − To retrieve the names of all students of the Computer Science stream, the
following SQL query needs to be executed −
SELECT NAME FROM STUDENT
WHERE STREAM = 'COMPUTER SCIENCE';
 UPDATE information stored and modify database structure – Updating a table
involves replacing old values in the existing rows with new values.
Example − SQL command to change stream from Electronics to Electronics and
Communications −
UPDATE STUDENT
SET STREAM = 'ELECTRONICS AND COMMUNICATIONS'
WHERE STREAM = 'ELECTRONICS';
 Modifying database means to change the structure of the table. However,
modification of the table is subject to a number of restrictions.
Example − To add a new field or column, say address to the Student table, we use
the following SQL command −
ALTER TABLE STUDENT
ADD ( ADDRESS VARCHAR2(50) );
 DELETE information stored or delete a table as a whole – Deletion of specific
information involves removal of selected rows from the table that satisfy certain
conditions.
Example − To delete all students who are currently in 4th year, as they are passing
out, we use the SQL command −
DELETE FROM STUDENT
WHERE YEAR = 4;
 Alternatively, the whole table may be removed from the database.
Example − To remove the student table completely, the SQL command used is −
DROP TABLE STUDENT;
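The four operations above can be exercised end-to-end. The sketch below uses Python's built-in sqlite3 as a stand-in database; note that SQLite uses TEXT where the notes' Oracle-style SQL uses VARCHAR2, and the second student row is invented so the DELETE has something to remove:

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # a throwaway in-memory database
cur = conn.cursor()

# CREATE: define the structure, then populate it with data.
cur.execute("CREATE TABLE STUDENT "
            "(ROLL INTEGER PRIMARY KEY, NAME TEXT, YEAR INTEGER, STREAM TEXT)")
cur.execute("INSERT INTO STUDENT VALUES (1, 'ANKIT JHA', 1, 'COMPUTER SCIENCE')")
cur.execute("INSERT INTO STUDENT VALUES (2, 'RIYA SEN', 4, 'ELECTRONICS')")

# RETRIEVE: select a subset of the table by querying upon it.
names = cur.execute(
    "SELECT NAME FROM STUDENT WHERE STREAM = 'COMPUTER SCIENCE'").fetchall()

# UPDATE: replace old values in existing rows with new values.
cur.execute("UPDATE STUDENT SET STREAM = 'ELECTRONICS AND COMMUNICATIONS' "
            "WHERE STREAM = 'ELECTRONICS'")

# DELETE: remove rows that satisfy a condition.
cur.execute("DELETE FROM STUDENT WHERE YEAR = 4")
remaining = cur.execute("SELECT COUNT(*) FROM STUDENT").fetchone()[0]

print(names, remaining)   # [('ANKIT JHA',)] 1
```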

DISTRIBUTED DATA STORAGE

In this article, we’ll learn what distributed data storage is, why we need it, and how to use it
effectively. This article is intended to help you develop applications, and so we will only
cover what application developers need to know. This includes the essential foundations,
the common pitfalls that developers run into, and the differences between different
distributed data stores.

This article does not require any distributed systems knowledge! Programming and
database experience will help, but you can also just look up topics as we come to them.
Let’s start!

What is a Distributed Data Store?

A distributed data store is a system that stores and processes data on multiple machines.

As a developer, you can think of a distributed data store as how you store and retrieve
application data, metrics, logs, etc. Some popular distributed data stores you might be
familiar with are MongoDB, Amazon Web Service’s S3, and Google Cloud Platform’s
Spanner.

In practice, there are many kinds of distributed data stores. They commonly come as
services managed by cloud providers or products that you deploy yourself. You can also
build your own, either from scratch or on top of other data stores.

Why do we need it?

Why not just use single-machine data stores? To really understand, we first need to realize
the scale and ubiquity of data today. Let’s see some concrete numbers:

 Steam had a peak of 18.5 million concurrent users, deployed servers with 2.7 petabytes
of SSD, and delivered 15 exabytes to users in 2018¹.
 Nasdaq in 2020 ingested a peak of 113 billion records in a single day, scaling up from
an average of 30 billion just two years earlier².
 Kellogg’s, the cereal company, processed 16 terabytes per week from just simulated
promotional activities in 2014³.

It’s honestly incredible how much data we use. Each of those bits is carefully stored and
processed somewhere. That somewhere is our distributed data stores.

Single-machine data stores simply cannot support these demands. So instead, we use
distributed data stores, which offer key advantages in performance, scalability, and
reliability. Let’s understand what these advantages really mean in practice.


Performance, Scalability, and Reliability

Performance is how well a machine can do work.

Performance is critical. There are countless studies that quantify and show the business
impacts of delays as short as 100ms⁴. Slow response times don’t just frustrate people — they
cost traffic, sales, and ultimately revenue⁵.

Fortunately, we do have control of our application’s performance. In the case of
single-machine data stores, simply upgrading to a faster machine is oftentimes enough. If it
isn’t enough or you rely on a distributed data store, then other forms of scalability come
into play.

Scalability is the ability to increase or decrease infrastructure resources.

Applications today often experience rapid growth and cyclical usage patterns. To meet
these load requirements, we “scale” our distributed data stores. This means that we
provision more or fewer resources on demand as needed. Scalability comes in two forms.

 Horizontal scaling means to add or remove computers (also known as machines or
nodes).
 Vertical scaling means to change a machine’s CPU, RAM, storage capacity, or other
hardware.

Horizontal scaling is why distributed data stores can out-perform single-machine data
stores. By spreading work over hundreds of computers, the aggregate system has higher
performance and reliability. While distributed data stores rely primarily on horizontal
scaling, vertical scaling is used in conjunction to optimize the overall performance and
cost⁶.

Scaling exists on a spectrum from manual to fully-managed. Some products have manual
scaling where you provision extra capacity yourself. Others autoscale based on metrics like
remaining storage capacity. Lastly, some services handle all scaling without the developer
even thinking about it, such as Amazon Web Service’s S3.

Regardless of the approach, all services have some limits that cannot be increased, such as
a maximum object size. You can check the quotas in the documentation to see these hard
limits. You can check online benchmarks to see what performance is achievable in practice.

Reliability is the probability of being failure-free⁷.

Some applications are so critical to our lives that even seconds of failure are unacceptable.
These applications cannot use single-machine data stores because of the unavoidable
hardware and network failures that could compromise the entire service. Instead, we use
distributed data stores because they can accommodate individual computers or network
paths failing.

To be highly reliable, a system must be both available⁸ and fault-tolerant⁹.

 Availability is the percent of time that a service is reachable and responding to
requests normally.
 Fault-tolerance is the ability to tolerate hardware and software faults. Total fault
tolerance is impossible¹⁰.

Although availability and fault-tolerance may appear similar at first, they are actually quite
different. Let’s see what happens if you have one but not the other.

 Available but not fault-tolerant: Consider a system that fails every minute but
recovers within milliseconds. Users can access the service, yet long-running jobs never
have enough time to finish.
 Fault-tolerant but not available: Consider a system where half the nodes are
perpetually restarting and the others are stable. If the capacity of the stable nodes is
insufficient, then some requests will have to be rejected.
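The first scenario can be made concrete with the standard availability ratio, uptime divided by total time. The formula is textbook-standard rather than from this article, and the repair times below are invented to match the example:

```python
def availability(mean_uptime_s, mean_repair_s):
    # Fraction of time the service is reachable: uptime / (uptime + downtime).
    return mean_uptime_s / (mean_uptime_s + mean_repair_s)

# Fails every minute but recovers within ~50 milliseconds:
flaky = availability(60.0, 0.05)

# Fails once a month but takes 4 hours to repair:
monthly = availability(30 * 24 * 3600.0, 4 * 3600.0)

# Frequent-but-brief outages can still yield higher availability
# than rare-but-long ones, even though fault-tolerance is worse.
print(flaky > monthly)   # True
```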


Takeaway

For an application developer, the key point is that distributed data stores can scale
performance and reliability far beyond single machines. The catch is that they have caveats
in how they work that can limit their potential.

How does it work?

Let’s cover what application developers need to know about how distributed data stores
work — partitioning, query routing, and replication. These basics will give you insight into
the behavior and characteristics of distributed data stores. It’ll help you understand the
caveats, tradeoffs, and why we don’t have a distributed data store that excels at everything.
Partitioning

Our data sets are often too large to be stored on a single machine. To overcome this,
we partition our data into smaller subsets that individual machines can store and process.
There are many ways to partition data, each with their own tradeoffs. The two main
approaches are vertical and horizontal partitioning.

Vertical partitioning means to split up data by related fields¹¹. Fields can be related for
many reasons. They might be properties of some common object. They might be fields that
are commonly accessed together by queries. They might even be fields that are accessed at
similar frequencies or by users with similar permissions. The exact way you vertically
partition data across machines ultimately depends on the properties of your data store and
the usage patterns you are optimizing for.

Horizontal partitioning (also known as sharding) is when we split up data into subsets all
with the same schema¹¹. For example, we can horizontally partition a relational database
table by grouping rows into shards to be stored on separate machines. We shard data when
a single machine cannot handle either the amount of data or the query load for that data.
Sharding strategies fall into two categories, Algorithmic and Dynamic, but hybrids exist¹⁰.

Algorithmic sharding determines which shard to allocate data to based on a function of
the data’s key. For example, when storing key-value data mapping URLs to HTML, we can
range-partition our data by splitting up key-values according to the first letter of the URL.
For instance, all URLs starting with “A” would go on the first machine, “B” on the second
machine, and so on. There are innumerable strategies, all with different tradeoffs.
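A minimal sketch of that first-letter scheme. The shard count, helper name, and sample pages are illustrative assumptions, not a recommended strategy:

```python
def shard_for(url, num_shards=4):
    """Algorithmic sharding: the shard is a pure function of the key.
    Here we range-partition on the first letter after the URL scheme."""
    host = url.removeprefix("https://").removeprefix("http://")
    first = host[0].upper()
    if not "A" <= first <= "Z":
        return 0                     # arbitrary bucket for non-letter keys
    return (ord(first) - ord("A")) * num_shards // 26

shards = {i: {} for i in range(4)}   # shard id -> key-value store
pages = {"https://apple.com": "<html>a</html>",
         "https://zebra.org": "<html>z</html>"}
for url, html in pages.items():
    # No lookup table is needed: any client can recompute the shard anytime.
    shards[shard_for(url)][url] = html

print(shard_for("https://apple.com"), shard_for("https://zebra.org"))   # 0 3
```

Notice the tradeoff: the placement function is cheap and stateless, but if URLs cluster around a few letters, the shards end up unevenly sized.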

Dynamic sharding explicitly chooses the location of data and stores that location in a
lookup table. To access data, we consult the service with the lookup table or check a local
cache. Lookup tables can be quite large, and thus they may have lookup tables pointing to
sub-lookup tables, like a B+-Tree¹². Dynamic sharding is more flexible than algorithmic
sharding¹³.
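By contrast, a dynamic-sharding sketch keeps an explicit lookup table. The node names and in-process dicts are invented for illustration; in a real system the table lives in a metadata service:

```python
# The lookup table maps each key to a node; in practice this lives in a
# metadata service and may itself be hierarchical, like a B+-Tree.
lookup = {}
nodes = {"node-a": {}, "node-b": {}}

def place(key, node):
    lookup[key] = node                   # an explicit decision, not a function of the key

def write(key, value):
    nodes[lookup[key]][key] = value      # consult the table, then go to that node

def read(key):
    return nodes[lookup[key]][key]

place("user:42", "node-b")
write("user:42", {"name": "Ada"})

# Rebalancing is just a data move plus a table update; this flexibility is
# what algorithmic sharding lacks.
nodes["node-a"]["user:42"] = nodes["node-b"].pop("user:42")
place("user:42", "node-a")

print(read("user:42"))   # {'name': 'Ada'}
```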

Partitioning, in practice, is quite tricky and can create many problems that you need to be
aware of. Fortunately, some distributed data stores will handle all this complexity for you.
Others handle some or none.

 Shards may have uneven data sizes. This is common in algorithmic sharding where
the function is difficult to get right. We mitigate this by tailoring the sharding strategy
around the data.
 Shards may have hotspots where certain data are queried magnitudes more frequently
than others. For example, consider how much more frequently you’ll query for
celebrities than for ordinary people in a social network. Careful schema design, caches,
and replicas can help here.
 Redistributing data to handle adding or removing nodes from the system is difficult
while maintaining high availability.
 Indexes may need to be partitioned as well. An index may cover only the shard it is
stored on (a local index), or it may cover the entire data set and itself be partitioned (a
global index). Each comes with tradeoffs.
 Transactions across partitions may work, or they may be disabled, slow, or
inconsistent in confusing ways. This is especially difficult when building your own
distributed data store from single-machine data stores.


Query Routing

Partitioning the data is only part of the story. We still need to route queries from the client
to the correct backend machine. Query routing can happen at different levels of the
software stack. Let’s see the three basic cases.

 Client-side partitioning is when the client holds the decision logic for which backend
node to query. The advantage is the conceptual simplicity, and the disadvantage is that
each client must implement query routing logic.
 Proxy-based partitioning is when the client sends all queries to a proxy. This proxy
then determines which backend node to query. This can help reduce the number of
concurrent connections on your backend servers and separate application logic from
routing logic.
 Server-based partitioning is when the client connects to any backend node, and the
node will either handle, redirect, or forward the request.

In practice, query routing is handled by most distributed data stores. Typically, you
configure a client, and then query using the client. However, if you are building your own
distributed data store or using products like Redis that don’t handle it, you’ll need to take
this into consideration¹⁴.
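The first case, client-side partitioning, can be sketched as a client class that carries the routing logic itself. The dicts standing in for backend servers and the hash-modulo rule are illustrative assumptions:

```python
class RoutingClient:
    """Client-side partitioning: the client decides which backend node to query."""

    def __init__(self, backends):
        self.backends = backends                     # one store per backend node

    def _route(self, key):
        # Every client must carry this logic; with a proxy- or server-based
        # scheme, this method would live elsewhere in the stack.
        return self.backends[hash(key) % len(self.backends)]

    def put(self, key, value):
        self._route(key)[key] = value

    def get(self, key):
        return self._route(key).get(key)

client = RoutingClient([{}, {}, {}])
client.put("session:9", "alive")
print(client.get("session:9"))   # alive
```

The drawback shows up immediately: changing the number of backends changes `_route`'s answers, so every client must be updated in lockstep.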
Replication

The last concept we’ll cover is replication. Replication means to store multiple copies of
the same data. This has many benefits.

 Data redundancy: When hardware inevitably fails, the data is not lost because there is
another copy.
 Data accessibility: Clients can access the data from any replica. This increases
resiliency against data center outages and network partitions.
 Increased read throughput: There are more machines that can serve the data, and so
the overall capacity is higher.
 Decreased network latency: Clients can access the replica closest to them, decreasing
network latency.

Implementing replication requires mind-bending consensus protocols and exhaustive
analysis of failure scenarios. Fortunately, application developers typically only need to
know where and when data is replicated.

Where data is replicated ranges from within a data center to across zones, regions, or even
continents. By replicating data close together, we minimize the network latency when
updating data between machines. However, by replicating data further apart, we protect
against data center failures and network partitions, and potentially decrease network
latency for reads.

When data is replicated can be synchronous or asynchronous.

 Synchronous replication means data is copied to all replicas before responding to the
request. This has the advantage of ensuring identical data across replicas at the cost of
higher write latency.
 Asynchronous replication means data is stored on only one replica before responding
to the request. This has the advantage of faster writes with the disadvantages of weaker
data consistency and possible data loss.


Takeaway

Partitioning, query routing, and replication are the building blocks of a distributed data store.

The different implementations emerge as different features and properties that you make
tradeoffs between.

What are the differences?

Distributed data stores are all special snowflakes each with their unique set of features. We

will compare them by grouping their differences into categories and covering the basics of

each. This will help you know what questions to ask and what to read further into in the future.
Data Model

The first difference to consider is the data model. The data model is the type of data and
how you query it. Common types include:

 Document: Nested collections of JSON documents. Query with keys or filters.
 Key-value: Key-value pairs. Query with a key.
 Relational: Tables of rows with an explicit schema. Query with SQL.
 Binary object: Arbitrary binary blobs. Query with a key.
 File system: Directories of files. Query with a file path.
 Graph: Nodes with edges. Query with a graph query language.
 Message: Groups of key-value pairs, like a JSON object or Python dict. Query from a
queue, topic, or sender.
 Time-series: Data ordered by timestamp. Query with SQL or another query language.
 Text: Free-form text or logs. Query with a query language.

Different data models are meant for different situations. While you could just store
everything as a binary object, this would be inconvenient when querying for data and
developing your application. Instead, use the data model that best fits your type of queries.
For example, if you need fast, simple lookups for small bits of data, use key-values. We’ll
provide more detail on intended usages in a chart below.

Note that some data stores are multi-model, meaning they can efficiently operate on
multiple data models.
Guarantees

Different data stores provide different “guarantees” on behavior. While you don’t technically

need guarantees to develop robust applications, strong guarantees dramatically simplify design

and implementation. The common guarantees you’ll come across are the following:
 Consistency is whether the data looks the same to all readers and is up-to-date. Note that

the term “consistency” is ironically severely overloaded — be sure what type of

consistency is being referred to¹⁵.

 Availability is whether you can access your data.

 Durability is whether stored data remains safe and uncorrupted.

Some service providers will even contractually guarantee a level of service, such as 99.99%

availability, through a service-level agreement (SLA). In the event they fail to uphold the

agreement, you typically receive some compensation.


Ecosystem

The ecosystem (integrations, tools, supporting software, etc.) is critical to your success with a

distributed data store. Simple questions like what SDKs are available and what types of testing
are supported need to be checked. If you need features like database connectors, mobile

synchronization, ORMs, protocol buffers, geospatial libraries, etc., you need to confirm that

they are supported. Documentation and blogs will have this information.
Security

Security responsibilities are shared between you and your product/service provider. Your

responsibilities will correlate with how much of the stack you manage.

If you use a distributed data store as a service, you may only need to configure some identity

and access policies, auditing, and application security. However if you build and deploy it all,

you will need to handle everything including infrastructure security, network security,

encryption at rest/in-transit, key management, patching, etc. Check the “shared responsibility

model” for your data store to figure this out.


Compliance

Compliance can be a critical differentiator. Many applications need to comply with laws and

regulations regarding how data is handled. If you need to comply with security policies such

as FEDRAMP, PCI-DSS, HIPAA, or any others, your distributed data store needs to as well.

Further, if you have made promises to customers regarding data retention, data residency, or

data isolation, you may want a distributed data store that comes with built-in features for this.
Price
Different data stores are priced differently. Some data stores charge solely based on storage

volume, while others account for servers and license fees. The documentation will typically

have a pricing calculator that you can use to estimate the bill. Note that while some data stores

may appear to cost more at first, they may well make up for it in engineering and operational

time-savings.
Takeaway

Distributed data stores are all unique. We can understand and compare their different features

through documentation, blogs, benchmarks, pricing calculators, or by talking to professional

support and building prototypes.

What are the options?


We now know a ton about distributed data stores in the abstract. Let’s tie it together and see

the real tools!

There are seemingly-infinite options, and unfortunately there is no best one. Each distributed

data store is meant for a different purpose and needs to fit your particular use case. To

understand the different types, check out the following table. Take your time and focus on the

general types and use cases.


Finally, note that real applications and companies have a wide variety of jobs to be done, and

so they rely on multiple distributed data stores. These systems work together to serve end

users as well as developers and analysts.

Closing

Data is here to stay, and distributed data stores are what enable that.
In this article, we learned that the performance and reliability of distributed data stores can

scale far beyond single-machine data stores. Distributed data stores rely on architectures with

many machines to partition and replicate data. Application developers don’t need to know all

the specifics — they only need to know enough to understand the problems that come up in

practice, like data hotspots, transaction support, the price of data replication, etc.

Like everything else, distributed data stores have a huge variety of features. At first, it can be

difficult to conceptualize it, but hopefully our breakdown helps orient your thought process

and guide your future learning.

Hopefully now you know what the big picture is and what to look more into. Consider

checking out the references. Happy to hear any comments you have! :)
DISTRIBUTED TRANSACTIONS

What is a Distributed Transaction?


A distributed transaction is a set of operations on data that is performed across two or more
data repositories (especially databases). It is typically coordinated across separate nodes
connected by a network, but may also span multiple databases on a single server.
There are two possible outcomes: 1) all operations successfully complete, or 2) none of the
operations are performed at all due to a failure somewhere in the system. In the latter case, if
some work was completed prior to the failure, that work will be reversed to ensure no net
work was done. This type of operation is in compliance with the “ACID” (atomicity-
consistency-isolation-durability) principles of databases that ensure data integrity. ACID is
most commonly associated with transactions on a single database server, but distributed
transactions extend that guarantee across multiple databases.
The operation known as a “two-phase commit” (2PC) is a form of a distributed transaction.
“XA transactions” are transactions using the XA protocol, which is one implementation of a
two-phase commit operation.

A distributed transaction spans multiple databases and guarantees data integrity.


How Do Distributed Transactions Work?
Distributed transactions have the same processing completion requirements as regular
database transactions, but they must be managed across multiple resources, making them
more challenging to implement for database developers. The multiple resources add more
points of failure, such as the separate software systems that run the resources (e.g., the
database software), the extra hardware servers, and network failures. This makes distributed
transactions susceptible to failures, which is why safeguards must be put in place to retain
data integrity.
For a distributed transaction to occur, transaction managers coordinate the resources (either
multiple databases or multiple nodes of a single database). The transaction manager can be
one of the data repositories that will be updated as part of the transaction, or it can be a
completely independent separate resource that is only responsible for coordination. The
transaction manager decides whether to commit a successful transaction or rollback an
unsuccessful transaction, the latter of which leaves the database unchanged.
First, an application requests the distributed transaction to the transaction manager. The
transaction manager then branches to each resource, which will have its own “resource
manager” to help it participate in distributed transactions. Distributed transactions are often
done in two phases to safeguard against partial updates that might occur when a failure is
encountered. The first phase involves acknowledging an intent to commit, or a “prepare-to-
commit” phase. After all resources acknowledge, they are then asked to run a final commit,
and then the transaction is completed.
We can examine a basic example of what happens when a failure occurs during a distributed
transaction. Let’s say one or more of the resources become unavailable during the prepare-to-
commit phase. When the request times out, the transaction manager tells each resource to
delete the prepare-to-commit status, and all data will be reset to its original state. If instead,
any of the resources become unavailable during the commit phase, then the transaction
manager will tell the other resources that successfully committed their portion of the
transaction to undo or “rollback” that transaction, and once again, the data is back to its
original state. It is then up to the application to retry the transaction to make sure it gets
completed.
Why Do You Need Distributed Transactions?
Distributed transactions are necessary when you need to quickly update related data that is
spread across multiple databases. For example, if you have multiple systems that track
customer information and you need to make a universal update (like updating the mailing
address) across all records, a distributed transaction will ensure that all records get updated.
And if a failure occurs, the data is reset to its original state, and it is up to the originating
application to resubmit the transaction.
Distributed Transactions for Streaming Data
Distributed transactions are especially critical today in data streaming environments because
of the volume of incoming data. Even a short-term failure in one of the resources can
represent a potentially large amount of lost data. Sophisticated stream processing engines
support “exactly-once” processing in which a distributed transaction covers the reading of
data from a data source, the processing, and the writing of data to a target destination (the
“data sink”). The “exactly-once” term refers to the fact that every data point is processed, and
there is no loss and no duplication. (Contrast this to “at-most-once” which allows data loss,
and “at-least-once” which allows duplication.) In an exactly-once streaming architecture, the
repositories for the data source and the data sink must have capabilities to support the
exactly-once guarantee. In other words, there must be functionality in those repositories that
lets the stream processing engine fully recover from failure, which does not necessarily
have to be a true transaction manager, but delivers a similar end result.

Hazelcast Jet is an example of a stream processing engine that will enable exactly-once
processing with sources and sinks that do not inherently have the capability to support
distributed transactions. This is done by managing the entire state of each data point and to
re-read, re-process, and/or re-write it if the data point in question encounters a failure. This
built-in logic allows more types of data repositories (such as Apache Kafka and JMS) to be
used as either sources or sinks in business-critical streaming applications.
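The recovery idea described above can be illustrated with a small sketch: at-least-once delivery combined with deduplication by message ID yields an exactly-once effect. This is an illustrative model only, not Hazelcast Jet's actual API:

```python
def process_stream(messages, sink, seen_ids):
    # Messages may be redelivered after a failure (at-least-once).
    # Skipping ids we have already processed gives an exactly-once effect.
    for msg_id, value in messages:
        if msg_id in seen_ids:
            continue                 # duplicate from a retry -- skip it
        sink.append(value * 2)       # the "processing" step
        seen_ids.add(msg_id)         # record progress with the write

sink, seen = [], set()
# Message 2 is delivered twice, as can happen after a failure and retry.
process_stream([(1, 10), (2, 20), (2, 20), (3, 30)], sink, seen)
# Each result appears exactly once: sink == [20, 40, 60]
```

In a real engine, recording `seen_ids` and writing to the sink must happen atomically (or the sink must be idempotent) for the guarantee to hold across crashes.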
When Distributed Transactions Are Not Needed
In some environments, distributed transactions are not necessary, and instead, extra auditing
activities are put in place to ensure data integrity when the speed of the transaction is not an
issue. The transfer of money across banks is a good example. Each bank that participates in
the money transfer tracks the status of the transaction, and when a failure is detected, the
partial state is corrected. This process works well without distributed transactions because the
transfer does not have to happen in (near) real time. Distributed transactions are typically
critical in situations where the complete update must be done immediately.

Commit Protocols
In a local database system, for committing a transaction, the transaction manager has to only
convey the decision to commit to the recovery manager. However, in a distributed system,
the transaction manager should convey the decision to commit to all the servers in the
various sites where the transaction is being executed and uniformly enforce the decision.
When processing is complete at each site, it reaches the partially committed transaction state
and waits for all other transactions to reach their partially committed states. When it receives
the message that all the sites are ready to commit, it starts to commit. In a distributed
system, either all sites commit or none of them does.
The different distributed commit protocols are −
 One-phase commit
 Two-phase commit
 Three-phase commit
Distributed One-phase Commit
Distributed one-phase commit is the simplest commit protocol. Let us consider that there is a
controlling site and a number of slave sites where the transaction is being executed. The
steps in distributed commit are −
 After each slave has locally completed its transaction, it sends a “DONE” message to
the controlling site.
 The slaves wait for “Commit” or “Abort” message from the controlling site. This
waiting time is called window of vulnerability.
 When the controlling site receives “DONE” message from each slave, it makes a
decision to commit or abort. This is called the commit point. Then, it sends this
message to all the slaves.
 On receiving this message, a slave either commits or aborts and then sends an
acknowledgement message to the controlling site.
Distributed Two-phase Commit
Distributed two-phase commit reduces the vulnerability of one-phase commit protocols. The
steps performed in the two phases are as follows −
Phase 1: Prepare Phase
 After each slave has locally completed its transaction, it sends a “DONE” message to
the controlling site. When the controlling site has received “DONE” message from
all slaves, it sends a “Prepare” message to the slaves.
 The slaves vote on whether they still want to commit or not. If a slave wants to
commit, it sends a “Ready” message.
 A slave that does not want to commit sends a “Not Ready” message. This may
happen when the slave has conflicting concurrent transactions or there is a timeout.
Phase 2: Commit/Abort Phase
 After the controlling site has received “Ready” message from all the slaves −
o The controlling site sends a “Global Commit” message to the slaves.
o The slaves apply the transaction and send a “Commit ACK” message to the
controlling site.
o When the controlling site receives “Commit ACK” message from all the
slaves, it considers the transaction as committed.
 After the controlling site has received the first “Not Ready” message from any slave

o The controlling site sends a “Global Abort” message to the slaves.
o The slaves abort the transaction and send a “Abort ACK” message to the
controlling site.
o When the controlling site receives “Abort ACK” message from all the slaves,
it considers the transaction as aborted.
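The two phases above can be sketched as a single-process simulation. This is illustrative only; a real implementation adds timeouts, write-ahead logging, and recovery:

```python
class Slave:
    """A participant site with its own vote in the prepare phase."""
    def __init__(self, can_commit=True):
        self.can_commit = can_commit
        self.state = "active"

    def prepare(self):
        return "Ready" if self.can_commit else "Not Ready"

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"

def two_phase_commit(log, slaves):
    # Phase 1 (Prepare): collect a vote from every slave.
    votes = [s.prepare() for s in slaves]
    if all(v == "Ready" for v in votes):
        # Phase 2 (Commit): every slave voted Ready -> Global Commit.
        for s in slaves:
            s.commit()
        log.append("committed")
    else:
        # Any "Not Ready" vote forces a Global Abort.
        for s in slaves:
            s.abort()
        log.append("aborted")

log = []
ok_slaves = [Slave(), Slave()]
two_phase_commit(log, ok_slaves)                 # all Ready -> committed
bad_slaves = [Slave(), Slave(can_commit=False)]
two_phase_commit(log, bad_slaves)                # one Not Ready -> aborted
# log == ["committed", "aborted"]
```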
Distributed Three-phase Commit
The steps in distributed three-phase commit are as follows −
Phase 1: Prepare Phase
The steps are same as in distributed two-phase commit.
Phase 2: Prepare to Commit Phase
 The controlling site issues an “Enter Prepared State” broadcast message.
 The slave sites vote “OK” in response.
Phase 3: Commit / Abort Phase
The steps are same as two-phase commit except that “Commit ACK”/”Abort ACK” message
is not required.
CONCURRENCY CONTROL
What is Concurrency control?
 Concurrency control manages simultaneous transactions without letting them interfere
with one another.
 The main objective of concurrency control is to allow many users to perform different
operations at the same time.
 Executing more than one transaction concurrently improves the performance of the
system.
 If operations cannot be performed concurrently, there can be serious problems such as
loss of data integrity and consistency.
 Concurrency control increases throughput by handling multiple transactions
simultaneously.
 It reduces the waiting time of transactions.
Example:

Consider two transactions T1 and T2.


T1 : Deposits Rs. 1000 to both accounts X and Y.
T2 : Doubles the balance of accounts X and Y.
T1
Read (X)
X ← X + 1000
Write (X)
Read (Y)
Y ← Y + 1000
Write (Y)
T2
Read (X)
X←X*2
Write (X)
Read (Y)
Y←Y*2
Write (Y)

The above two transactions can be executed concurrently as below

Schedule C1

The above concurrent schedule executes in the following manner:

Step 1: Part of transaction T1 is executed, which updates X to Rs. 2000 (assuming an initial
balance of Rs. 1000 in each account).

Step 2: The processor switches to transaction T2, which reads X as Rs. 2000 and updates it
to Rs. 4000. The processor then switches back to T1, whose remaining part updates Y to
Rs. 2000.

Step 3: Finally, the remaining part of T2 reads Y as Rs. 2000 and updates it to Rs. 4000 by
doubling it.

This concurrent schedule maintains the consistency of the database: the final state
(X = Rs. 4000, Y = Rs. 4000) is exactly what serial execution of T1 followed by T2 would
produce.

Therefore, the above schedule is equivalent to a serial schedule and hence
it is a consistent schedule.
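The interleaved schedule above can be verified in a few lines (assuming, for concreteness, initial balances of Rs. 1000 in both accounts):

```python
# Schedule C1 interleaving, with assumed initial balances X = Y = 1000.
X, Y = 1000, 1000

X = X + 1000   # T1: deposit to X
X = X * 2      # T2: double X
Y = Y + 1000   # T1: deposit to Y
Y = Y * 2      # T2: double Y

# Serial execution T1 then T2, for comparison:
sx, sy = 1000, 1000
sx, sy = sx + 1000, sy + 1000   # T1 on both accounts
sx, sy = sx * 2, sy * 2         # T2 on both accounts

# The interleaved schedule ends in the same state as the serial one,
# so it is serializable (hence consistent).
assert (X, Y) == (sx, sy) == (4000, 4000)
```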
Concurrency control can be divided into two protocols
1. Lock-Based Protocol
2. Timestamp Based Protocol
1. Lock-Based Protocol
 Lock is a mechanism which is important in a concurrent control.
 It controls concurrent access to a data item.
 It assures that one process should not retrieve or update a record which another process is
updating.

For example, in traffic, there are signals which indicate stop and go. When one signal
allows traffic to pass, the other signals are locked. Similarly, in a database, only one
transaction may update a data item at a time; other transactions must wait until the lock is
released.
 If the locking is not done properly, then it will display the inconsistent and corrupt data.
 It manages the order between the conflicting pairs among transactions at the time of
execution.
There are two lock modes,

1. Shared Lock
2. Exclusive Lock

A shared lock is represented by S. A data item locked in shared mode can only be read; it
cannot be modified. An S-lock is requested using the lock-S instruction.

An exclusive lock is represented by X. A data item locked in exclusive mode can be read
as well as written. An X-lock is requested using the lock-X instruction.

Lock Compatibility Matrix


 Lock Compatibility Matrix controls whether multiple transactions can acquire locks on the
same resource at the same time.
            Shared     Exclusive
Shared      True       False
Exclusive   False      False

 If a resource is already locked by another transaction, then a new lock request can be granted
only if the mode of the requested lock is compatible with the mode of the existing lock.
 Any number of transactions can hold shared locks on an item, but if any transaction holds an
exclusive lock on item, no other transaction may hold any lock on the item.
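The compatibility rule can be expressed directly as a lookup table (a small sketch; S = shared, X = exclusive):

```python
# Lock compatibility matrix as a lookup table: (held mode, requested mode).
COMPATIBLE = {
    ("S", "S"): True,
    ("S", "X"): False,
    ("X", "S"): False,
    ("X", "X"): False,
}

def can_grant(requested, held_modes):
    # A new lock is granted only if it is compatible with every lock
    # currently held on the item (vacuously true if no locks are held).
    return all(COMPATIBLE[(held, requested)] for held in held_modes)

assert can_grant("S", ["S", "S"])   # any number of readers may share an item
assert not can_grant("X", ["S"])    # a writer must wait for readers
assert not can_grant("S", ["X"])    # readers must wait for a writer
```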
2. Timestamp Based Protocol
 Timestamp Based Protocol helps DBMS to identify the transactions.
 It is a unique identifier. Each transaction is issued a timestamp when it enters into the
system.
 Timestamp protocol determines the serializability order.
 It is the most commonly used concurrency protocol.
 It uses either system time or logical counter as a timestamp.
 It starts working as soon as a transaction is created.
Timestamp Ordering Protocol
 The TO protocol ensures serializability among transactions in their conflicting read and
write operations.
 The timestamp of transaction T is denoted as TS(T).
 The read timestamp of data item X is denoted by R-timestamp(X).
 The write timestamp of data item X is denoted by W-timestamp(X).
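Using these timestamps, the TO protocol rejects any operation that would violate timestamp order. A minimal sketch for a single data item, following the classic textbook formulation of the rules:

```python
class Item:
    def __init__(self):
        self.r_ts = 0   # R-timestamp(X): largest TS of any successful read
        self.w_ts = 0   # W-timestamp(X): TS of the last successful write

def read(item, ts):
    # A transaction may not read a value written by a "later" transaction.
    if ts < item.w_ts:
        return "rollback"
    item.r_ts = max(item.r_ts, ts)
    return "ok"

def write(item, ts):
    # A transaction may not overwrite a value that a later transaction
    # has already read or written.
    if ts < item.r_ts or ts < item.w_ts:
        return "rollback"
    item.w_ts = ts
    return "ok"

x = Item()
assert write(x, 5) == "ok"
assert read(x, 7) == "ok"         # R-timestamp(X) becomes 7
assert write(x, 6) == "rollback"  # too old: transaction 7 already read X
```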

CONCURRENCY CONTROL (CONTINUED)

What is Concurrency Control?


Concurrency Control in Database Management System is a procedure of managing
simultaneous operations without conflicting with each other. It ensures that Database
transactions are performed concurrently and accurately to produce correct results without
violating data integrity of the respective Database.
Concurrent access is quite easy if all users are just reading data. There is no way they can
interfere with one another. However, any practical database will have a mix of READ
and WRITE operations, and hence concurrency is a challenge.
DBMS Concurrency Control is used to address such conflicts, which mostly occur with a
multi-user system. Therefore, Concurrency Control is the most important element for proper
functioning of a Database Management System where two or more database transactions are
executed simultaneously, which require access to the same data.

Why use Concurrency method?


Reasons for using the concurrency control method in DBMS:
 To apply Isolation through mutual exclusion between conflicting transactions
 To resolve read-write and write-write conflict issues
 To preserve database consistency throughout the execution of transactions
 The system needs to control the interaction among the concurrent transactions. This
control is achieved using concurrent-control schemes.
 Concurrency control helps to ensure serializability
Example
Assume that two people who go to electronic kiosks at the same time to buy a movie ticket
for the same movie and the same show time.
However, there is only one seat left for the movie show in that particular theatre. Without
concurrency control in DBMS, it is possible that both moviegoers will end up purchasing a
ticket. However, concurrency control method does not allow this to happen. Both moviegoers
can still access information written in the movie seating database. But concurrency control
only provides a ticket to the buyer who has completed the transaction process first.

Concurrency Control Protocols


Different concurrency control protocols offer different tradeoffs between the amount of
concurrency they allow and the amount of overhead that they impose. Following are the
Concurrency Control techniques in DBMS:
 Lock-Based Protocols
 Two Phase Locking Protocol
 Timestamp-Based Protocols
 Validation-Based Protocols

Lock-based Protocols
Lock Based Protocols in DBMS is a mechanism in which a transaction cannot Read or
Write the data until it acquires an appropriate lock. Lock based protocols help to eliminate the
concurrency problem in DBMS for simultaneous transactions by locking or isolating a
particular transaction to a single user.
A lock is a data variable which is associated with a data item. The lock signifies which
operations can be performed on the data item. Locks in DBMS help synchronize access
to the database items by concurrent transactions.
All lock requests are made to the concurrency-control manager. Transactions proceed only
once the lock request is granted.

Binary Locks: A binary lock on a data item can be in either a locked or an unlocked state.
Shared/exclusive: This type of locking mechanism separates the locks in DBMS based on
their uses. If a lock is acquired on a data item to perform a write operation, it is called an
exclusive lock.
1. Shared Lock (S):
A shared lock is also called a Read-only lock. With the shared lock, the data item can be
shared between transactions. This is because you will never have permission to update data
on the data item.
For example, consider a case where two transactions are reading the account balance of a
person. The database will let them read by placing a shared lock. However, if another
transaction wants to update that account’s balance, the shared lock prevents it until the reading
process is over.
2. Exclusive Lock (X):
With the Exclusive Lock, a data item can be read as well as written. This is exclusive and
can’t be held concurrently on the same data item. X-lock is requested using lock-x
instruction. Transactions may unlock the data item after finishing the ‘write’ operation.
For example, when a transaction needs to update the account balance of a person, you can
allow this transaction by placing an X lock on the item. Therefore, when a second transaction
wants to read or write, the exclusive lock prevents that operation.
3. Simplistic Lock Protocol
This type of lock-based protocol allows transactions to obtain a lock on every object before
beginning operation. Transactions may unlock the data item after finishing the ‘write’
operation.
4. Pre-claiming Locking
Pre-claiming lock protocol evaluates operations and creates a list of the data items required
to initiate the execution process. When all locks are granted, the transaction executes; all
locks are released when all of its operations are over.
Starvation
Starvation is the situation when a transaction needs to wait for an indefinite period to acquire
a lock.
Following are the reasons for Starvation:
 When waiting scheme for locked items is not properly managed
 In the case of resource leak
 The same transaction is selected as a victim repeatedly
Deadlock
Deadlock refers to a specific situation where two or more processes are waiting for each other
to release a resource or more than two processes are waiting for the resource in a circular
chain.

Two Phase Locking Protocol


Two Phase Locking Protocol also known as 2PL protocol is a method of concurrency
control in DBMS that ensures serializability by applying a lock to the transaction data which
blocks other transactions to access the same data simultaneously. Two Phase Locking
protocol helps to eliminate the concurrency problem in DBMS.
This locking protocol divides the execution phase of a transaction into three different parts.
 In the first phase, when the transaction begins to execute, it requires permission for
the locks it needs.
 The second part is where the transaction obtains all the locks. When a transaction
releases its first lock, the third phase starts.
 In this third phase, the transaction cannot demand any new locks. Instead, it only
releases the acquired locks.
The Two-Phase Locking protocol allows each transaction to make a lock or unlock request in
two steps:
 Growing Phase: In this phase transaction may obtain locks but may not release any
locks.
 Shrinking Phase: In this phase, a transaction may release locks but not obtain any
new lock
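The growing/shrinking rule can be enforced with a tiny bookkeeping class (an illustrative sketch, not a full lock manager):

```python
class Transaction2PL:
    """Tracks one transaction's locks and enforces the two-phase rule."""
    def __init__(self):
        self.locks = set()
        self.shrinking = False

    def acquire(self, item):
        if self.shrinking:
            raise RuntimeError("2PL violation: lock requested after an unlock")
        self.locks.add(item)

    def release(self, item):
        self.shrinking = True       # the first unlock ends the growing phase
        self.locks.discard(item)

t = Transaction2PL()
t.acquire("A")
t.acquire("B")      # growing phase: locks may still be obtained
t.release("A")      # shrinking phase begins
try:
    t.acquire("C")  # forbidden under 2PL
    violated = False
except RuntimeError:
    violated = True
# violated is True
```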
It is true that the 2PL protocol offers serializability. However, it does not ensure that
deadlocks do not happen.
In a distributed setting, local and global deadlock detectors search for deadlocks and resolve
them by restoring the affected transactions to their initial states.
Strict Two-Phase Locking Method
Strict two-phase locking is almost identical to 2PL. The only difference is that Strict-2PL
does not release a lock immediately after use; it holds all the locks until the commit point
and releases them all at one go when the process is over.
Centralized 2PL
In Centralized 2 PL, a single site is responsible for lock management process. It has only one
lock manager for the entire DBMS.
Primary copy 2PL
In the primary copy 2PL mechanism, many lock managers are distributed to different sites.
A particular lock manager is then responsible for managing the locks for a set of data items.
When the primary copy has been updated, the change is propagated to the slaves.
Distributed 2PL
In this kind of two-phase locking mechanism, Lock managers are distributed to all sites. They
are responsible for managing locks for data at that site. If no data is replicated, it is equivalent
to primary copy 2PL. The communication costs of distributed 2PL are considerably higher
than those of primary copy 2PL.

Timestamp-based Protocols
Timestamp based Protocol in DBMS is an algorithm which uses the System Time or
Logical Counter as a timestamp to serialize the execution of concurrent transactions. The
Timestamp-based protocol ensures that every conflicting read and write operations are
executed in a timestamp order.
The older transaction is always given priority in this method. It uses system time to determine
the time stamp of the transaction. This is the most commonly used concurrency protocol.
Lock-based protocols manage the order between conflicting transactions at the time they
execute, whereas timestamp-based protocols resolve conflicts as soon as an operation is
created.
Example:
Suppose there are three transactions T1, T2, and T3.
T1 has entered the system at time 0010
T2 has entered the system at 0020
T3 has entered the system at 0030
Priority will be given to transaction T1, then transaction T2 and lastly Transaction T3.
Advantages:
 Schedules are serializable just like 2PL protocols
 No waiting for the transaction, which eliminates the possibility of deadlocks!
Disadvantages:
Starvation is possible if the same transaction is restarted and continually aborted

Validation Based Protocol


Validation based Protocol in DBMS, also known as the Optimistic Concurrency Control
Technique, is a method that avoids locking during transaction execution. In this protocol, the
local copies of the transaction data are updated rather than the data itself, which results in
less interference while the transaction executes.
The Validation based Protocol is performed in the following three phases:
1. Read Phase
2. Validation Phase
3. Write Phase
Read Phase
In the Read Phase, the data values from the database can be read by a transaction but the
write operation or updates are only applied to the local data copies, not the actual database.
Validation Phase
In Validation Phase, the data is checked to ensure that there is no violation of serializability
while applying the transaction updates to the database.
Write Phase
In the Write Phase, the updates are applied to the database if the validation is successful;
otherwise, the updates are not applied and the transaction is rolled back.
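The three phases can be sketched for a single transaction as follows. This is illustrative; the `conflicts` callback stands in for a real validation test against concurrent transactions:

```python
def run_optimistic(db, updates, conflicts):
    # Read phase: take a snapshot and apply writes only to the local copy.
    local = dict(db)
    local.update(updates)
    # Validation phase: "conflicts" stands in for the serializability check.
    if conflicts(db, updates):
        return "rolled back"        # the database is left untouched
    # Write phase: validation passed, apply the local copy to the database.
    db.update(local)
    return "committed"

db = {"balance": 100}
status = run_optimistic(db, {"balance": 150}, lambda d, u: False)
# status == "committed" and db["balance"] == 150
```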

Characteristics of Good Concurrency Protocol


An ideal concurrency control DBMS mechanism has the following objectives:
 Must be resilient to site and communication failures.
 It allows the parallel execution of transactions to achieve maximum concurrency.
 Its storage mechanisms and computational methods should be modest to minimize
overhead.
 It must enforce some constraints on the structure of atomic actions of transactions.
DISTRIBUTED QUERY PROCESSING
Query Processing in DBMS
Query Processing is the activity performed in extracting data from the database. Query
processing takes several steps to fetch the data from the database. The steps involved
are:
1. Parsing and translation
2. Optimization
3. Evaluation
The query processing works in the following way:
Parsing and Translation
A user query is written in a high-level database language such as SQL. SQL is well suited
for humans, but it is not suitable as the internal representation of the query within the
system; relational algebra is well suited for that purpose. The system therefore translates
the query into an internal relational-algebra expression that can be used at the physical
level of the file system, and a variety of query-optimizing transformations are applied to
this expression before the query is actually evaluated. When a user executes a query, the
parser checks the syntax of the query and verifies the names of the relations, tuples, and
attributes referenced in it. The parser then builds a tree representation of the query, known
as the parse tree, and translates it into a relational-algebra expression; during this
translation, any views used in the query are replaced by their definitions.
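The parsing-and-translation step described above can be sketched in miniature. The following is an illustrative toy, not a real DBMS parser: it assumes a grammar limited to single-table SELECT queries with one comparison predicate, and it represents the relational-algebra tree as nested Python tuples.

```python
import re

def parse_select(sql: str):
    """Toy parser: check the syntax of a single-table SELECT and
    translate it into a relational-algebra-style parse tree.
    Assumed grammar: SELECT col FROM table WHERE col op value
    """
    pattern = re.compile(
        r"select\s+(\w+)\s+from\s+(\w+)\s+where\s+(\w+)\s*(>=|<=|>|<|=)\s*(\w+)\s*;?",
        re.IGNORECASE,
    )
    m = pattern.fullmatch(sql.strip())
    if m is None:
        raise SyntaxError("query does not match the toy grammar")
    col, table, pred_col, op, value = m.groups()
    # Tree shape: projection over a selection over a base relation
    return ("project", col, ("select", (pred_col, op, value), ("relation", table)))

tree = parse_select("select emp_name from Employee where salary > 10000;")
print(tree)
# ('project', 'emp_name', ('select', ('salary', '>', '10000'), ('relation', 'Employee')))
```

A real parser additionally consults the catalog to verify that the relation and attributes exist, and expands views; both are omitted here for brevity.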
The working of query processing can be summarized as the following pipeline (the original
diagram is not reproduced here):
query → parser and translator → relational-algebra expression → optimizer → evaluation
plan → evaluation engine → query result
Suppose a user executes a query. As we have learned, there are various methods of
extracting the data from the database. Suppose that, in SQL, a user wants to fetch the
salaries of employees earning more than 10000. The following query is issued:
select salary from Employee where salary > 10000;
To make the system understand the user query, it must be translated into relational
algebra. This query can be expressed in either of two equivalent relational-algebra forms:
o σ_salary>10000 (π_salary (Employee))
o π_salary (σ_salary>10000 (Employee))
After the query has been translated, each relational-algebra operation can be executed
using one of several different algorithms. This is how query processing begins.
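The equivalence of the two relational-algebra forms above can be checked directly. In this sketch, a relation is modeled as a list of Python dicts (the `Employee` data is made up for illustration):

```python
# Each relation is a list of dicts (tuples); the data is illustrative only.
Employee = [
    {"emp_name": "Asha",  "salary": 9000},
    {"emp_name": "Ravi",  "salary": 12000},
    {"emp_name": "Meena", "salary": 15000},
]

def select(pred, rows):      # sigma: keep rows that satisfy the predicate
    return [r for r in rows if pred(r)]

def project(attrs, rows):    # pi: keep only the listed attributes
    return [{a: r[a] for a in attrs} for r in rows]

# Form 1: sigma_{salary>10000}(pi_{salary}(Employee))
plan1 = select(lambda r: r["salary"] > 10000, project(["salary"], Employee))
# Form 2: pi_{salary}(sigma_{salary>10000}(Employee))
plan2 = project(["salary"], select(lambda r: r["salary"] > 10000, Employee))

print(plan1)            # [{'salary': 12000}, {'salary': 15000}]
print(plan1 == plan2)   # True
```

Both orderings produce the same answer; which one is cheaper to execute is precisely the question the optimizer answers later.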
Evaluation
In addition to translating the query into relational algebra, the system must annotate the
translated relational-algebra expression with instructions specifying how each operation is
to be evaluated. After translating the user query in this way, the system executes a query
evaluation plan.
Query Evaluation Plan
o In order to fully evaluate a query, the system needs to construct a query evaluation
plan.
o The annotations in the evaluation plan may specify the algorithm to be used for a
particular operation, or the particular index or indexes to use.
o Relational algebra annotated in this way is referred to as evaluation primitives. The
evaluation primitives carry the instructions needed for evaluating each operation.
o Thus, a query evaluation plan defines a sequence of primitive operations used for
evaluating a query. The query evaluation plan is also referred to as the query
execution plan.
o The query execution engine is responsible for generating the output of the given
query. It takes the query execution plan, executes it, and produces the output for the
user query.
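One common way an execution engine composes evaluation primitives is the iterator (or "Volcano") model, where each primitive pulls tuples from its child on demand. The sketch below uses Python generators; the operator names and data are illustrative, not a real engine's API:

```python
# Iterator-style evaluation primitives as Python generators.
def seq_scan(rows):                 # primitive: full scan of a base relation
    for r in rows:
        yield r

def filter_op(pred, child):         # primitive: selection over a child operator
    for r in child:
        if pred(r):
            yield r

def project_op(attrs, child):       # primitive: projection over a child operator
    for r in child:
        yield {a: r[a] for a in attrs}

Employee = [
    {"emp_name": "Asha", "salary": 9000},
    {"emp_name": "Ravi", "salary": 12000},
]

# Evaluation plan: project(filter(scan(Employee)))
plan = project_op(["emp_name"],
                  filter_op(lambda r: r["salary"] > 10000,
                            seq_scan(Employee)))

print(list(plan))   # [{'emp_name': 'Ravi'}]
```

Because each operator is lazy, tuples flow through the plan one at a time and no intermediate result has to be fully materialized.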
Optimization
o The cost of query evaluation can vary across the different evaluation plans for the
same query. Although the system is responsible for constructing the evaluation plan,
the user need not write the query efficiently.
o Usually, a database system generates an efficient query evaluation plan that
minimizes this cost. This task, performed by the database system, is known as
query optimization.
o To optimize a query, the query optimizer needs an estimated cost for each
operation, because the overall cost of a plan depends on factors such as the memory
allocated to individual operations, their execution costs, and so on.
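A cost-based optimizer can be caricatured as follows. This sketch assumes a made-up cost model (cost = tuples read plus tuples produced) and an invented selectivity estimate; real optimizers derive such estimates from catalog statistics:

```python
# Toy cost model for choosing between the two equivalent plans
# from the earlier example. All numbers are illustrative assumptions.
TABLE_ROWS = 10_000
SELECTIVITY = {"salary>10000": 0.2}   # assumed fraction of qualifying rows

def cost_filter_then_project():
    # Apply sigma first: scan all rows, then project only qualifying ones.
    qualifying = TABLE_ROWS * SELECTIVITY["salary>10000"]
    return TABLE_ROWS + qualifying        # scan cost + projection cost

def cost_project_then_filter():
    # Apply pi first: project every row, then filter the projected rows.
    return TABLE_ROWS + TABLE_ROWS        # projection cost + filter cost

plans = {
    "filter_then_project": cost_filter_then_project(),
    "project_then_filter": cost_project_then_filter(),
}
best = min(plans, key=plans.get)
print(plans)   # {'filter_then_project': 12000.0, 'project_then_filter': 20000}
print(best)    # filter_then_project
```

Even this crude model captures the key idea: the optimizer enumerates equivalent plans, estimates each one's cost, and picks the cheapest.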