Unit-1 Distributed Databases
DISTRIBUTED DATABASES
Distributed Systems – Introduction – Architecture – Distributed Database Concepts – Distributed Data Storage – Distributed Transactions – Commit Protocols – Concurrency Control – Distributed Query Processing
Distributed Systems – Introduction
What is meant by distributed systems?
A distributed system is a computing environment in which various components are
spread across multiple computers (or other computing devices) on a network.
Distributed systems reduce the risks involved with having a single point of failure, bolstering
reliability and fault tolerance.
A distributed system contains multiple nodes that are physically separate but linked together
using the network. All the nodes in this system communicate with each other and handle
processes in tandem. Each of these nodes contains a small part of the distributed operating
system software.
Architecture
In a distributed architecture, components reside on different platforms, and several
components can cooperate with one another over a communication network in order to
achieve a specific objective or goal.
In this architecture, information processing is not confined to a single machine; rather,
it is distributed over several independent computers.
A distributed system can be demonstrated by the client-server architecture which
forms the base for multi-tier architectures; alternatives are the broker architecture
such as CORBA, and the Service-Oriented Architecture (SOA).
There are several technology frameworks to support distributed architectures,
including .NET, J2EE, CORBA, .NET Web services, AXIS Java Web services, and
Globus Grid services.
Middleware is an infrastructure that appropriately supports the development and
execution of distributed applications. It provides a buffer between the applications
and the network.
It sits in the middle of the system and manages or supports the different components of a
distributed system. Examples are transaction processing monitors, data converters,
and communication controllers.
Middleware as an infrastructure for distributed system
The basis of a distributed architecture is its transparency, reliability, and availability.
The following are the different forms of transparency in a distributed system −
1. Access − Hides the way in which resources are accessed and the differences in data platform.
2. Location − Hides where resources are located.
3. Technology − Hides different technologies, such as the programming language and OS, from the user.
4. Migration / Relocation − Hides that a resource may be moved to another location while in use.
5. Replication − Hides that a resource may be copied at several locations.
6. Concurrency − Hides that a resource may be shared with other users.
7. Failure − Hides the failure and recovery of resources from the user.
8. Persistence − Hides whether a resource (software) is in memory or on disk.
Advantages
Resource sharing − Sharing of hardware and software resources.
Openness − Flexibility of using hardware and software of different vendors.
Concurrency − Concurrent processing to enhance performance.
Scalability − Increased throughput by adding new resources.
Fault tolerance − The ability to continue in operation after a fault has occurred.
Disadvantages
Complexity − They are more complex than centralized systems.
Security − More susceptible to external attack.
Manageability − More effort required for system management.
Unpredictability − Unpredictable responses depending on the system organization
and network load.
Client-Server Architecture
Advantages
Better performance than a thin-client approach, and simpler to manage than a thick-client approach.
Enhances reusability and scalability − as demands increase, extra servers can be added.
Provides multi-threading support and also reduces network traffic.
Provides maintainability and flexibility.
Disadvantages
Unsatisfactory testability due to lack of testing tools.
More critical server reliability and availability.
Distributed DBMS - Concepts
For proper functioning of any organization, there’s a need for a well-maintained database. In
the recent past, databases used to be centralized in nature. However, with the increase in
globalization, organizations tend to be diversified across the globe. They may choose to
distribute data over local servers instead of a central database. Thus arrived the concept
of Distributed Databases.
This chapter gives an overview of databases and Database Management Systems (DBMS).
A database is an ordered collection of related data. A DBMS is a software package that works
upon a database. In this chapter, we revise the main concepts so that the study of DDBMS can
be done with ease. The three topics covered are database schemas, types of databases, and
operations on databases.
Database and Database Management System
A database is an ordered collection of related data that is built for a specific purpose. A
database may be organized as a collection of multiple tables, where a table represents a real
world element or entity. Each table has several different fields that represent the
characteristic features of the entity.
For example, a company database may include tables for projects, employees, departments,
products and financial records. The fields in the Employee table may be Name,
Company_Id, Date_of_Joining, and so forth.
A database management system is a collection of programs that enables creation and
maintenance of a database. DBMS is available as a software package that facilitates
definition, construction, manipulation and sharing of data in a database. Definition of a
database includes description of the structure of a database. Construction of a database
involves actual storing of the data in any storage medium. Manipulation refers to the
retrieving information from the database, updating the database and generating reports.
Sharing of data facilitates data to be accessed by different users or programs.
Examples of DBMS Application Areas
Automatic Teller Machines
Train Reservation System
Employee Management System
Student Information System
Examples of DBMS Packages
MySQL
Oracle
SQL Server
dBASE
FoxPro
PostgreSQL, etc.
Database Schemas
A database schema is a description of the database which is specified during database design
and subject to infrequent alterations. It defines the organization of the data, the relationships
among them, and the constraints associated with them.
Databases are often represented through the three-schema or ANSI-SPARC
architecture. The goal of this architecture is to separate the user application from the
physical database. The three levels are −
Internal Level having Internal Schema − It describes the physical structure, details
of internal storage and access paths for the database.
Conceptual Level having Conceptual Schema − It describes the structure of the
whole database while hiding the details of physical storage of data. This illustrates
the entities, attributes with their data types and constraints, user operations and
relationships.
External or View Level having External Schemas or Views − It describes the
portion of a database relevant to a particular user or a group of users while hiding the
rest of database.
Types of DBMS
There are four types of DBMS.
Hierarchical DBMS
In hierarchical DBMS, the relationships among data in the database are established so that
one data element exists as a subordinate of another. The data elements have parent-child
relationships and are modelled using the “tree” data structure. These are very fast and
simple.
Network DBMS
Network DBMS is one where the relationships among data in the database are of type
many-to-many in the form of a network. The structure is generally complicated due to the
existence of numerous many-to-many relationships. Network DBMS is modelled using
“graph” data structure.
Relational DBMS
In relational databases, the database is represented in the form of relations. Each relation
models an entity and is represented as a table of values. In the relation or table, a row is
called a tuple and denotes a single record. A column is called a field or an attribute and
denotes a characteristic property of the entity. RDBMS is the most popular database
management system.
For example − A Student Relation −

ROLL   NAME        YEAR   STREAM
1      ANKIT JHA   1      COMPUTER SCIENCE
Distributed DBMS
A distributed database is a set of interconnected databases that is distributed over the
computer network or internet. A Distributed Database Management System (DDBMS)
manages the distributed database and provides mechanisms so as to make the databases
transparent to the users. In these systems, data is intentionally distributed among multiple
nodes so that all computing resources of the organization can be optimally used.
Operations on DBMS
The four basic operations on a database are Create, Retrieve, Update and Delete.
CREATE database structure and populate it with data − Creation of a database
relation involves specifying the data structures, data types and the constraints of the
data to be stored.
Example − SQL command to create a student table −
CREATE TABLE STUDENT (
ROLL INTEGER PRIMARY KEY,
NAME VARCHAR2(25),
YEAR INTEGER,
STREAM VARCHAR2(10)
);
Once the data format is defined, the actual data is stored in accordance with the
format in some storage medium.
Example SQL command to insert a single tuple into the student table −
INSERT INTO STUDENT ( ROLL, NAME, YEAR, STREAM)
VALUES ( 1, 'ANKIT JHA', 1, 'COMPUTER SCIENCE');
RETRIEVE information from the database – Retrieving information generally
involves selecting a subset of a table or displaying data from the table after some
computations have been done. It is done by querying the table.
Example − To retrieve the names of all students of the Computer Science stream, the
following SQL query needs to be executed −
SELECT NAME FROM STUDENT
WHERE STREAM = 'COMPUTER SCIENCE';
UPDATE information stored and modify database structure – Updating a table
involves changing old values in the existing table’s rows with new values.
Example − SQL command to change stream from Electronics to Electronics and
Communications −
UPDATE STUDENT
SET STREAM = 'ELECTRONICS AND COMMUNICATIONS'
WHERE STREAM = 'ELECTRONICS';
Modifying database means to change the structure of the table. However,
modification of the table is subject to a number of restrictions.
Example − To add a new field or column, say address to the Student table, we use
the following SQL command −
ALTER TABLE STUDENT
ADD ( ADDRESS VARCHAR2(50) );
DELETE information stored or delete a table as a whole – Deletion of specific
information involves removal of selected rows from the table that satisfy certain
conditions.
Example − To delete all students who are currently in 4th year when they are passing
out, we use the SQL command −
DELETE FROM STUDENT
WHERE YEAR = 4;
Alternatively, the whole table may be removed from the database.
Example − To remove the student table completely, the SQL command used is −
DROP TABLE STUDENT;
DISTRIBUTED DATA STORAGE
In this article, we’ll learn what distributed data storage is, why we need it, and how to use it
effectively. This article is intended to help you develop applications, and so we will only cover
what application developers need to know. This includes the essential foundations, the
common pitfalls that developers run into, and the differences between different distributed
data stores.
This article does not require any distributed systems knowledge! Programming and database
experience will help, but you can also just look up topics as we come to them. Let’s start!
A distributed data store is a system that stores and processes data on multiple machines.
As a developer, you can think of a distributed data store as how you store and retrieve
application data, metrics, logs, etc. Some popular distributed data stores you might be familiar
with are MongoDB, Amazon Web Service’s S3, and Google Cloud Platform’s Spanner.
In practice, there are many kinds of distributed data stores. They commonly come as services
managed by cloud providers or products that you deploy yourself. You can also build your own.
Why not just use single-machine data stores? To really understand, we first need to realize the
scale and ubiquity of data today. Let’s see some concrete numbers:
Steam had a peak of 18.5 million concurrent users and deployed servers with 2.7 petabytes of storage.
Nasdaq in 2020 ingested a peak of 113 billion records in a single day.
Kellogg’s, the cereal company, processed 16 terabytes per week from just simulated data.
It’s honestly incredible how much data we use. Each of those bits is carefully stored and processed.
Single-machine data stores simply cannot support these demands. So instead, we use
distributed data stores which offer key advantages in performance, scalability, and reliability.
Performance is critical. There are countless studies that quantify and show the business
impacts of delays as short as 100ms⁴. Slow response times don’t just frustrate people; they
hurt businesses. For single-machine data stores, simply upgrading to a faster machine is
oftentimes enough. If it isn’t enough or you rely on a distributed data store, then other
forms of scalability come into play.
Scalability is the ability to increase or decrease infrastructure resources.
Applications today often experience rapid growth and cyclical usage patterns. To meet these
load requirements, we “scale” our distributed data stores. This means that we provision more
(or fewer) resources and machines (nodes).
Vertical scaling means to change the machine’s CPU, RAM, storage capacity, or other
hardware.
Horizontal scaling means to change the number of machines (nodes) in the system.
Horizontal scaling is why distributed data stores can out-perform single-machine data stores.
By spreading work over hundreds of computers, the aggregate system has higher performance
and reliability. While distributed data stores rely primarily on horizontal scaling, vertical
scaling is still used to adjust the capacity of individual nodes.
Scaling exists on a spectrum from manual to fully-managed. Some products have manual
scaling where you provision extra capacity yourself. Others autoscale based on metrics like
remaining storage capacity. Lastly, some services handle all scaling without any developer
involvement.
Regardless of the approach, all services have some limits that cannot be increased, such as a
maximum object size. You can check the quotas in the documentation to see these hard limits.
You can check online benchmarks to see what performance is achievable in practice.
Some applications are so critical to our lives that even seconds of failure are unacceptable.
These applications cannot use single-machine data stores because of the unavoidable hardware
and network failures that could compromise the entire service. Instead, we use distributed data
stores because they can accommodate for individual computers or network paths failing.
Availability is the percent of time that a service is reachable and responding to requests
normally.
Fault-tolerance is the ability to tolerate hardware and software faults. Total fault
tolerance is impossible¹⁰.
Although availability and fault-tolerance may appear similar at first, they are actually quite
different. Let’s see what happens if you have one but not the other.
Available but not fault-tolerant: Consider a system that fails every minute but recovers
within milliseconds. Users can access the service, yet long-running jobs never have a
chance to complete.
Fault-tolerant but not available: Consider a system where half the nodes are
perpetually restarting and the others are stable. If the capacity of the stable nodes is
insufficient for the load, many requests go unserved even though the faults are tolerated.
For an application developer, the key point is that distributed data stores can scale
performance and reliability far beyond single machines. The catch is that they have caveats
in their behavior that you need to understand.
Let’s cover what application developers need to know about how distributed data stores work
— partitioning, query routing, and replication. These basics will give you insight into the
behavior and characteristics of distributed data stores. It’ll help you understand the caveats,
tradeoffs, and why we don’t have a distributed data store that excels at everything.
Partitioning
Our data sets are often too large to be stored on a single machine. To overcome this,
we partition our data into smaller subsets that individual machines can store and process.
There are many ways to partition data, each with their own tradeoffs. The two main
approaches are vertical and horizontal partitioning.
Vertical partitioning means to split up data by related fields¹¹. Fields can be related for many
reasons. They might be properties of some common object. They might be fields that are
commonly accessed together by queries. They might even be fields that are accessed at similar
frequencies or by users with similar permissions. The exact way you vertically partition data
across machines ultimately depends on the properties of your data store and the usage patterns
of your application.
Horizontal partitioning (also known as sharding) is when we split up data into subsets all
with the same schema¹¹. For example, we can horizontally partition a relational database table
by grouping rows into shards to be stored on separate machines. We shard data when a single
machine cannot handle either the amount of data or the query load for that data. Sharding
strategies fall into two categories, Algorithmic and Dynamic, but hybrids exist¹⁰.
Algorithmic sharding determines which shard to allocate data to based on a function of the
data’s key. For example, when storing key-value data mapping URLs to HTML, we can range
partition our data by splitting up key-values according to the first letter of the URL. For
instance, all URLs starting with “A” would go on the first machine, “B” on the second
machine, and so on. There are innumerable strategies all with different tradeoffs.
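The first-letter range partitioning described above can be sketched in a few lines of Python. This is a hedged illustration only: the function name, the 26-shard assumption, and the fallback rule for non-alphabetic keys are our own choices, not any product's API.

```python
# A minimal sketch of algorithmic (range) sharding, assuming 26 shards,
# one per starting letter. The fallback rule is an illustrative choice.

def range_shard(key: str, num_shards: int = 26) -> int:
    """Assign a key to a shard by its first letter: 'a' -> 0, 'b' -> 1, ...
    Keys that do not start with an ASCII letter fall back to shard 0."""
    if not key or not key[0].isalpha():
        return 0
    return (ord(key[0].lower()) - ord("a")) % num_shards

# Every URL starting with the same letter lands on the same shard.
print(range_shard("apple.com"))   # -> 0
print(range_shard("bbc.co.uk"))   # -> 1
print(range_shard("zebra.org"))   # -> 25
```

Note how this scheme makes shard sizes depend on the key distribution; real URLs are not uniformly spread over first letters, which is exactly the uneven-shard problem discussed below.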
Dynamic sharding explicitly chooses the location of data and stores that location in a lookup
table. To access data, we consult the service with the lookup table or check a local cache.
Lookup tables can be quite large, and thus they may have lookup tables pointing to sub-lookup
tables, like a B+-Tree¹². Dynamic sharding is more flexible than algorithmic sharding¹³.
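The lookup-table idea behind dynamic sharding can be sketched as follows. This is a toy model under stated assumptions: the node names ("node-a" etc.) and round-robin placement are hypothetical, and a real system would keep the table in a replicated metadata service rather than a local dict.

```python
# A toy sketch of dynamic sharding: an explicit lookup table records
# exactly where each key lives. Placement policy here is round-robin.

lookup_table = {}                        # key -> node
nodes = ["node-a", "node-b", "node-c"]
next_slot = 0

def place(key: str) -> str:
    """Explicitly choose a node for a new key and record the choice."""
    global next_slot
    node = nodes[next_slot % len(nodes)]
    next_slot += 1
    lookup_table[key] = node
    return node

def locate(key: str) -> str:
    """Consult the lookup table to route a query for an existing key."""
    return lookup_table[key]

place("user:1")
place("user:2")
print(locate("user:1"))   # -> node-a
print(locate("user:2"))   # -> node-b
```

Because the placement is explicit rather than computed from the key, data can be moved to any node at any time by simply updating the table entry, which is the flexibility advantage mentioned above.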
Partitioning, in practice, is quite tricky and can create many problems that you need to be
aware of. Fortunately, some distributed data stores will handle all this complexity for you.
Shards may have uneven data sizes. This is common in algorithmic sharding where the
function is difficult to get right. We mitigate this by tailoring the sharding strategy around
the data.
Shards may have hotspots where certain data are queried magnitudes more frequently
than others. For example, consider how much more frequently you’ll query for celebrities
than ordinary people in a social network. Careful schema design, caches, and replicas can
help here.
Redistributing data to handle adding or removing nodes from the system is difficult when
maintaining high-availability.
Indexes may need to be partitioned as well. An index may cover only the shard it is stored on
(local index), or it may cover the entire data set and itself be partitioned (global index). Each
comes with its own tradeoffs.
Transactions across partitions may work, or they may be disabled, slow, or inconsistent
in confusing ways. This is especially difficult when building your own distributed data
store.
Partitioning the data is only part of the story. We still need to route queries from the client to
the correct backend machine. Query routing can happen at different levels of the software
stack:
Client-side partitioning is when the client holds the decision logic for which backend
node to query. The advantage is the conceptual simplicity, and the disadvantage is that the
routing logic is coupled into every client.
Proxy-based partitioning is when the client sends all queries to a proxy. This proxy then
determines which backend node to query. This can help reduce the number of concurrent
connections on your backend servers and separate application logic from routing logic.
Server-based partitioning is when the client connects to any backend node, and the node
forwards the query to the node that holds the data.
In practice, query routing is handled by most distributed data stores. Typically, you configure
a client, and then query using the client. However, if you are building your own distributed
data store or using products like Redis that don’t handle it, you’ll need to take this into
consideration¹⁴.
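Client-side partitioning, the first of the three routing levels above, can be sketched like this. The node addresses are hypothetical, and plain hash-modulo is the simplest possible policy; real deployments usually prefer consistent hashing so that adding a node does not remap most keys.

```python
# A sketch of client-side partitioning: the client hashes the key and
# picks a backend node itself. Every client must use the same logic.

import hashlib

NODES = ["db-0.internal", "db-1.internal", "db-2.internal"]

def route(key: str) -> str:
    """Deterministically map a key to a node address."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]

# The same key always routes to the same node, from any client.
assert route("session:42") == route("session:42")
print(route("session:42") in NODES)   # -> True
```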
Replication
The last concept we’ll cover is replication. Replication means to store multiple copies of the
same data. This has many benefits.
Data redundancy: When hardware inevitably fails, the data is not lost because there is
another copy.
Data accessibility: Clients can access the data from any replica. This increases resiliency
to failures.
Increased read throughput: There are more machines that can serve the data, and so the
overall read throughput increases.
Decreased network latency: Clients can access the replica closest to them, decreasing
network latency.
Implementing replication requires mind-bending consensus protocols and exhaustive analysis
of failure scenarios. Fortunately, application developers typically only need to know where
and how data is replicated.
Where data is replicated ranges from within a data center to across zones, regions, or even
continents. By replicating data close together, we minimize the network latency when
updating data between machines. However, by replicating data further apart, we protect
against data center failures, network partitions, and potentially decrease network latency for
reads.
Synchronous replication means data is copied to all replicas before responding to the
request. This has the advantage of ensuring identical data across replicas at the cost of
slower writes.
Asynchronous replication means data is stored on only one replica before responding to
the request. This has the advantage of faster writes with the disadvantage of weaker data
consistency and durability.
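The difference between the two modes can be made concrete with a toy model. Here replicas are plain dicts in one process; in a real system each would live on a separate machine and the copies would travel over the network.

```python
# Toy model contrasting synchronous and asynchronous replication.

replicas = [dict(), dict(), dict()]

def write_sync(key, value):
    """Synchronous: copy to every replica before acknowledging."""
    for r in replicas:
        r[key] = value
    return "ack"          # all replicas are now identical

def write_async(key, value):
    """Asynchronous: write one replica, acknowledge, copy later."""
    replicas[0][key] = value
    return "ack"          # the other replicas are temporarily stale

write_sync("x", 1)
print(all(r.get("x") == 1 for r in replicas))    # -> True
write_async("y", 2)
print([r.get("y") for r in replicas])            # -> [2, None, None]
```

The stale `None` values after the asynchronous write are exactly the weaker-consistency window the text describes; a background process would eventually copy "y" to the remaining replicas.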
Partitioning, query routing, and replication are the building blocks of a distributed data store.
The different implementations emerge as different features and properties that you make
tradeoffs between.
Distributed data stores are all special snowflakes each with their unique set of features. We
will compare them by grouping their differences into categories and covering the basics of
each. This will help you know what questions to ask and what to read further into in the future.
Data Model
The first difference to consider is the data model. The data model is the type of data and how
it is organized and queried. Examples include:
Message: Groups of key-value pairs, like a JSON or Python dict. Query from a queue,
topic, or sender.
Time-series: Data ordered by timestamp. Query with SQL or another query language.
Different data models are meant for different situations. While you could just store everything
as a binary object, this would be inconvenient when querying for data and developing your
application. Instead, use the data model that best fits your type of queries. For example, if you
need fast, simple lookups for small bits of data, use key-values.
Note that some data stores are multi-model, meaning they can efficiently operate on multiple
data models.
Guarantees
Different data stores provide different “guarantees” on behavior. While you don’t technically
need guarantees to develop robust applications, strong guarantees dramatically simplify design
and implementation. The common guarantees you’ll come across are the following:
Consistency is whether the data looks the same to all readers and is up-to-date. Note that
this use of “consistency” differs from the “C” in ACID.
Some service providers will even contractually guarantee a level of service, such as 99.99%
availability, through a service-level agreement (SLA). In the event they fail to uphold the
agreement, they typically compensate customers, often with service credits.
Ecosystem
The ecosystem (integrations, tools, supporting software, etc.) is critical to your success with a
distributed data store. Simple questions like what SDKs are available and what types of testing
are supported need to be checked. If you need features like database connectors, mobile
synchronization, ORMs, protocol buffers, geospatial libraries, etc., you need to confirm that
they are supported. Documentation and blogs will have this information.
Security
Security responsibilities are shared between you and your product/service provider. Your
responsibilities will correlate with how much of the stack you manage.
If you use a distributed data store as a service, you may only need to configure some identity
and access policies, auditing, and application security. However if you build and deploy it all,
you will need to handle everything including infrastructure security, network security,
encryption at rest/in-transit, key management, patching, etc. Check the “shared responsibility
model” of your provider to see exactly who handles what.
Compliance
Compliance can be a critical differentiator. Many applications need to comply with laws and
regulations regarding how data is handled. If you need to comply with security policies such
as FEDRAMP, PCI-DSS, HIPAA, or any others, your distributed data store needs to as well.
Further, if you have made promises to customers regarding data retention, data residency, or
data isolation, you may want a distributed data store that comes with built-in features for this.
Price
Different data stores are priced differently. Some data stores charge solely based on storage
volume, while others account for servers and license fees. The documentation will typically
have a pricing calculator that you can use to estimate the bill. Note that while some data stores
may appear to cost more at first, they may well make up for it in engineering and operational
time-savings.
Takeaway
Distributed data stores are all unique. We can understand and compare their different features
by grouping them into categories: data model, guarantees, ecosystem, security, and price.
There are seemingly-infinite options, and unfortunately there is no best one. Each distributed
data store is meant for a different purpose and needs to fit your particular use case. To
understand the different types, check out the following table. Take your time and focus on the
tradeoffs.
We now know a ton about distributed data stores in the abstract. Let’s tie it together and see
how it all fits in practice.
Finally, note that real applications and companies have a wide variety of jobs to be done, and
so they rely on multiple distributed data stores. These systems work together to serve end
users.
Closing
Data is here to stay, and distributed data stores are what enable that.
In this article, we learned that the performance and reliability of distributed data stores can
scale far beyond single-machine data stores. Distributed data stores rely on architectures with
many machines to partition and replicate data. Application developers don’t need to know all
the specifics — they only need to know enough to understand the problems that come up in
practice, like data hotspots, transaction support, the price of data replication, etc.
Like everything else, distributed data stores have a huge variety of features. At first, they can
be difficult to conceptualize, but hopefully our breakdown helps orient your thought process.
Hopefully now you know what the big picture is and what to look more into. Consider
checking out the references. Happy to hear any comments you have! :)
DISTRIBUTED TRANSACTIONS
A distributed transaction is a transaction that spans multiple sites or data stores, all of which
must either commit or abort together.
Hazelcast Jet is an example of a stream processing engine that enables exactly-once
processing with sources and sinks that do not inherently have the capability to support
distributed transactions. It does this by managing the entire state of each data point and
re-reading, re-processing, and/or re-writing it if the data point in question encounters a
failure. This built-in logic allows more types of data repositories (such as Apache Kafka and
JMS) to be used as either sources or sinks in business-critical streaming applications.
When Distributed Transactions Are Not Needed
In some environments, distributed transactions are not necessary, and instead, extra auditing
activities are put in place to ensure data integrity when the speed of the transaction is not an
issue. The transfer of money across banks is a good example. Each bank that participates in
the money transfer tracks the status of the transaction, and when a failure is detected, the
partial state is corrected. This process works well without distributed transactions because the
transfer does not have to happen in (near) real time. Distributed transactions are typically
critical in situations where the complete update must be done immediately.
Commit Protocols
In a local database system, for committing a transaction, the transaction manager only has to
convey the decision to commit to the recovery manager. However, in a distributed system,
the transaction manager should convey the decision to commit to all the servers in the
various sites where the transaction is being executed and uniformly enforce the decision.
When processing is complete at each site, it reaches the partially committed transaction state
and waits for all other sites to reach their partially committed states. When the controlling
site receives the message that all the sites are ready to commit, it starts to commit. In a
distributed system, either all sites commit or none of them does.
The different distributed commit protocols are −
One-phase commit
Two-phase commit
Three-phase commit
Distributed One-phase Commit
Distributed one-phase commit is the simplest commit protocol. Let us consider that there is a
controlling site and a number of slave sites where the transaction is being executed. The
steps in distributed commit are −
After each slave has locally completed its transaction, it sends a “DONE” message to
the controlling site.
The slaves wait for “Commit” or “Abort” message from the controlling site. This
waiting time is called window of vulnerability.
When the controlling site receives “DONE” message from each slave, it makes a
decision to commit or abort. This is called the commit point. Then, it sends this
message to all the slaves.
On receiving this message, a slave either commits or aborts and then sends an
acknowledgement message to the controlling site.
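The message flow above can be modelled in a few lines of Python. This is a toy simulation under stated assumptions: the boolean inputs stand in for "DONE" messages, and the "WAIT"/"COMMIT"/"ACK" values are our own simplification of the protocol's messages.

```python
# A toy model of distributed one-phase commit: the controlling site
# decides only after hearing "DONE" from every slave.

def one_phase_commit(done_messages):
    """done_messages: True for each slave that reported "DONE"."""
    if not all(done_messages):
        return "WAIT"                 # slaves sit in the window of vulnerability
    decision = "COMMIT"               # the commit point at the controlling site
    acks = ["ACK" for _ in done_messages]   # each slave applies and acknowledges
    return decision, acks

print(one_phase_commit([True, True, True]))   # -> ('COMMIT', ['ACK', 'ACK', 'ACK'])
print(one_phase_commit([True, False, True]))  # -> 'WAIT'
```

The second call illustrates the protocol's weakness: until every "DONE" arrives, the slaves that have finished can do nothing but wait.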
Distributed Two-phase Commit
Distributed two-phase commit reduces the vulnerability of one-phase commit protocols. The
steps performed in the two phases are as follows −
Phase 1: Prepare Phase
After each slave has locally completed its transaction, it sends a “DONE” message to
the controlling site. When the controlling site has received “DONE” message from
all slaves, it sends a “Prepare” message to the slaves.
The slaves vote on whether they still want to commit or not. If a slave wants to
commit, it sends a “Ready” message.
A slave that does not want to commit sends a “Not Ready” message. This may
happen when the slave has conflicting concurrent transactions or there is a timeout.
Phase 2: Commit/Abort Phase
After the controlling site has received “Ready” message from all the slaves −
o The controlling site sends a “Global Commit” message to the slaves.
o The slaves apply the transaction and send a “Commit ACK” message to the
controlling site.
o When the controlling site receives “Commit ACK” message from all the
slaves, it considers the transaction as committed.
After the controlling site has received the first “Not Ready” message from any slave
−
o The controlling site sends a “Global Abort” message to the slaves.
o The slaves abort the transaction and send an “Abort ACK” message to the
controlling site.
o When the controlling site receives “Abort ACK” message from all the slaves,
it considers the transaction as aborted.
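The voting rule of the two phases can be condensed into a short sketch. This is an illustration of the decision logic only, using the message names from the text; timeouts, retransmissions, and the ACK round are omitted.

```python
# A toy model of two-phase commit voting: one "Not Ready" vote from
# any slave forces a global abort.

def two_phase_commit(votes):
    """votes: each slave's reply to the "Prepare" message."""
    # Phase 1: the controlling site collects every slave's vote.
    if all(v == "Ready" for v in votes):
        return "Global Commit"   # Phase 2: slaves apply and send "Commit ACK"
    return "Global Abort"        # Phase 2: slaves abort and send "Abort ACK"

print(two_phase_commit(["Ready", "Ready", "Ready"]))      # -> Global Commit
print(two_phase_commit(["Ready", "Not Ready", "Ready"]))  # -> Global Abort
```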
Distributed Three-phase Commit
The steps in distributed three-phase commit are as follows −
Phase 1: Prepare Phase
The steps are same as in distributed two-phase commit.
Phase 2: Prepare to Commit Phase
The controlling site issues an “Enter Prepared State” broadcast message.
The slave sites vote “OK” in response.
Phase 3: Commit / Abort Phase
The steps are the same as in two-phase commit, except that the “Commit ACK”/“Abort ACK”
messages are not required.
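The extra prepare-to-commit phase can be sketched the same way, under the same illustrative conventions (the message strings are invented for the example):

```python
# Sketch of the three-phase commit decision logic.
def three_phase_commit(votes, oks):
    # Phase 1: prepare -- collect "Ready"/"Not Ready" votes.
    if not all(v == "Ready" for v in votes.values()):
        return "Global Abort"
    # Phase 2: prepare to commit -- every slave must answer "OK"
    # to the "Enter Prepared State" broadcast.
    if not all(o == "OK" for o in oks.values()):
        return "Global Abort"
    # Phase 3: commit -- no Commit ACK round is needed, because every
    # slave is already known to be in the prepared state.
    return "Global Commit"

print(three_phase_commit({"S1": "Ready"}, {"S1": "OK"}))   # Global Commit
```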
CONCURRENCY CONTROL
What is Concurrency control?
Concurrency control manages simultaneous transactions without letting them interfere
with one another.
The main objective of concurrency control is to allow many users to perform different
operations at the same time.
Running more than one transaction concurrently improves the performance of the system.
If operations cannot be performed concurrently, serious problems such as loss of data
integrity and consistency can arise.
Concurrency control increases throughput because multiple transactions are handled
simultaneously, and it reduces the waiting time of transactions.
Example:
Schedule C1
Step 2: The processor switches to transaction T2, which executes and updates X to Rs.
4000. The processor then switches back to transaction T1, whose remaining part, which
updates Y to Rs. 2000, is executed.
Step 3: Finally, the remaining part of T2 reads Y as Rs. 2000 and updates it to Rs. 4000
by multiplying the value of Y.
Since the above schedule can be converted to an equivalent serial schedule, it is a
consistent schedule.
Concurrency control can be divided into two protocols
1. Lock-Based Protocol
2. Timestamp Based Protocol
1. Lock-Based Protocol
A lock is a mechanism that is central to concurrency control.
It controls concurrent access to a data item.
It ensures that one transaction does not read or update a record that another transaction
is updating.
For example, traffic signals indicate stop and go: while one lane is allowed to pass, the
others are held. Similarly, in a database only one transaction operates on a data item at
a time while the other transactions are locked out.
If locking is not done properly, inconsistent and corrupt data may be exposed.
Locking manages the order between conflicting pairs of operations among transactions at
execution time.
There are two lock modes,
1. Shared Lock
2. Exclusive Lock
A Shared Lock is represented by S. Under a shared lock, a data item can only be read; it
cannot be modified. An S-lock is requested using the lock-S instruction.
An Exclusive Lock is represented by X. Under an exclusive lock, a data item can be read
as well as written. An X-lock is requested using the lock-X instruction.
If a resource is already locked by another transaction, then a new lock request can be granted
only if the mode of the requested lock is compatible with the mode of the existing lock.
Any number of transactions can hold shared locks on an item, but if any transaction holds an
exclusive lock on item, no other transaction may hold any lock on the item.
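The compatibility rule above can be written as a small matrix. This is an illustrative sketch, not a real lock manager:

```python
# Lock-mode compatibility for shared (S) and exclusive (X) locks.
COMPATIBLE = {
    ("S", "S"): True,   # any number of readers may coexist
    ("S", "X"): False,  # a writer excludes readers
    ("X", "S"): False,  # readers exclude a writer
    ("X", "X"): False,  # only one writer at a time
}

def can_grant(requested, held_modes):
    """A new lock is granted only if its mode is compatible with
    every lock currently held on the item."""
    return all(COMPATIBLE[(held, requested)] for held in held_modes)

print(can_grant("S", ["S", "S"]))  # True  -- shared locks coexist
print(can_grant("X", ["S"]))       # False -- must wait for the reader
```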
2. Timestamp Based Protocol
A timestamp is a unique identifier that the DBMS assigns to each transaction when it
enters the system.
The timestamp protocol determines the serializability order of transactions.
It is one of the most commonly used concurrency protocols.
It uses either the system time or a logical counter as the timestamp.
It starts working as soon as a transaction is created.
Timestamp Ordering Protocol
The TO Protocol ensures serializability among transactions in their conflicting read and
write operations.
The timestamp of a transaction T is denoted TS(T).
The read timestamp of a data item X is denoted R-timestamp(X).
The write timestamp of a data item X is denoted W-timestamp(X).
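Using this notation, the standard timestamp-ordering read/write rules can be sketched as follows. The dict-based representation of a data item is an illustration chosen for the example:

```python
# Basic timestamp-ordering rules. ts_t is TS(T); item["rts"] and
# item["wts"] stand for R-timestamp(X) and W-timestamp(X).

def to_read(ts_t, item):
    # A read is rejected (the transaction is rolled back) if a younger
    # transaction has already written the item.
    if ts_t < item["wts"]:
        return "rollback"
    item["rts"] = max(item["rts"], ts_t)
    return "ok"

def to_write(ts_t, item):
    # A write is rejected if a younger transaction has already read
    # or written the item.
    if ts_t < item["rts"] or ts_t < item["wts"]:
        return "rollback"
    item["wts"] = ts_t
    return "ok"

x = {"rts": 0, "wts": 0}
print(to_read(5, x))   # ok -- R-timestamp(X) becomes 5
print(to_write(3, x))  # rollback -- a younger transaction already read X
```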
Lock-based Protocols
Lock Based Protocols in DBMS is a mechanism in which a transaction cannot Read or
Write the data until it acquires an appropriate lock. Lock based protocols help to eliminate the
concurrency problem in DBMS for simultaneous transactions by locking or isolating a
particular transaction to a single user.
A lock is a data variable associated with a data item. The lock signifies which
operations can be performed on the data item. Locks in DBMS help synchronize access
to database items by concurrent transactions.
All lock requests are made to the concurrency-control manager. Transactions proceed only
once the lock request is granted.
Binary Locks: A binary lock on a data item has exactly two states: locked or unlocked.
Shared/exclusive: This type of locking mechanism separates the locks in DBMS based on
their uses. If a lock is acquired on a data item to perform a write operation, it is called an
exclusive lock.
1. Shared Lock (S):
A shared lock is also called a read-only lock. With a shared lock, the data item can be
shared between transactions, because no holder of the lock has permission to update the
data item.
For example, consider a case where two transactions are reading the account balance of a
person. The database lets them both read by placing a shared lock. However, if another
transaction wants to update that account’s balance, the shared lock prevents it until the
reading is over.
2. Exclusive Lock (X):
With an exclusive lock, a data item can be read as well as written. The lock is exclusive
and cannot be held concurrently with any other lock on the same data item. An X-lock is
requested using the lock-X instruction. A transaction may unlock the data item after
finishing the ‘write’ operation.
For example, when a transaction needs to update the account balance of a person, the
database allows it by placing an X-lock on the item. While the X-lock is held, any second
transaction that wants to read or write the item is blocked.
3. Simplistic Lock Protocol
This type of lock-based protocol allows a transaction to obtain a lock on every object
before beginning its operations. A transaction may unlock a data item after finishing the
‘write’ operation.
4. Pre-claiming Locking
The pre-claiming lock protocol evaluates a transaction's operations and creates a list of
the data items required before execution begins. The transaction executes only when all
of these locks are granted, and it releases all its locks once its operations are over.
Starvation
Starvation is the situation when a transaction needs to wait for an indefinite period to acquire
a lock.
Following are the reasons for Starvation:
When waiting scheme for locked items is not properly managed
In the case of resource leak
The same transaction is selected as a victim repeatedly
Deadlock
Deadlock refers to a specific situation where two or more processes are waiting for each other
to release a resource or more than two processes are waiting for the resource in a circular
chain.
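The circular-wait condition can be detected by searching for a cycle in a wait-for graph, where an edge T1 → T2 means T1 is waiting for a lock held by T2. A minimal sketch:

```python
# Deadlock detection by depth-first search for a cycle in the
# wait-for graph, given as {transaction: [transactions it waits for]}.

def has_deadlock(wait_for):
    visited, on_stack = set(), set()

    def dfs(node):
        visited.add(node)
        on_stack.add(node)
        for nxt in wait_for.get(node, []):
            if nxt in on_stack:          # back edge -> cycle -> deadlock
                return True
            if nxt not in visited and dfs(nxt):
                return True
        on_stack.discard(node)
        return False

    return any(dfs(n) for n in wait_for if n not in visited)

print(has_deadlock({"T1": ["T2"], "T2": ["T1"]}))  # True  (circular wait)
print(has_deadlock({"T1": ["T2"], "T2": []}))      # False
```

When a cycle is found, a real system breaks the deadlock by choosing one transaction in the cycle as a victim and aborting it.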
Timestamp-based Protocols
Timestamp based Protocol in DBMS is an algorithm which uses the System Time or
Logical Counter as a timestamp to serialize the execution of concurrent transactions. The
Timestamp-based protocol ensures that all conflicting read and write operations are
executed in timestamp order.
The older transaction is always given priority in this method. It uses system time to determine
the time stamp of the transaction. This is the most commonly used concurrency protocol.
Lock-based protocols help you to manage the order between the conflicting transactions
when they will execute. Timestamp-based protocols manage conflicts as soon as an operation
is created.
Example:
Suppose there are three transactions T1, T2, and T3.
T1 has entered the system at time 0010
T2 has entered the system at 0020
T3 has entered the system at 0030
Priority will be given to transaction T1, then transaction T2 and lastly Transaction T3.
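The ordering in this example can be expressed directly: the transaction with the smallest timestamp is the oldest and wins. Here the integers 10, 20, 30 stand in for the arrival times 0010, 0020, 0030 above:

```python
# Priority under timestamp ordering: oldest (smallest timestamp) first.
arrivals = {"T1": 10, "T2": 20, "T3": 30}
priority = sorted(arrivals, key=arrivals.get)
print(priority)  # ['T1', 'T2', 'T3'] -- T1 is served first
```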
Advantages:
Schedules are serializable, just as with 2PL protocols.
Transactions do not wait for locks, which eliminates the possibility of deadlocks.
Disadvantages:
Starvation is possible if the same transaction is restarted and continually aborted
DISTRIBUTED QUERY PROCESSING
Suppose a user executes a query. As we have seen, there are various methods of
extracting data from the database. Suppose the user wants to fetch the records of the
employees whose salary is greater than 10000. The following SQL query is issued:
select emp_name from Employee where salary > 10000;
Thus, to make the system understand the user query, it needs to be translated into
relational algebra. The query can be written in equivalent relational algebra forms, for example:
o πemp_name (σsalary>10000 (Employee))
o πemp_name (σsalary>10000 (πemp_name, salary (Employee)))
After translating the given query, we can execute each relational algebra operation by using
different algorithms. So, in this way, a query processing begins its working.
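The selection and projection above can be simulated on a small in-memory table. The sample rows are invented for illustration:

```python
# Simulating sigma (selection) and pi (projection) on a list of dicts.
Employee = [
    {"emp_name": "A", "salary": 9000},
    {"emp_name": "B", "salary": 12000},
]

def select(rows, pred):           # sigma: keep rows satisfying the predicate
    return [r for r in rows if pred(r)]

def project(rows, cols):          # pi: keep only the named columns
    return [{c: r[c] for c in cols} for r in rows]

result = project(select(Employee, lambda r: r["salary"] > 10000), ["emp_name"])
print(result)  # [{'emp_name': 'B'}]
```

Applying the projection last, as here, corresponds to the first relational-algebra form; an optimizer is free to pick whichever equivalent form is cheapest.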
Evaluation
For this, with addition to the relational algebra translation, it is required to annotate the
translated relational algebra expression with the instructions used for specifying and
evaluating each operation. Thus, after translating the user query, the system executes a query
evaluation plan.
Query Evaluation Plan
o In order to fully evaluate a query, the system needs to construct a query evaluation
plan.
o The annotations in the evaluation plan may refer to the algorithms to be used for the
particular index or the specific operations.
o Such relational algebra with annotations is referred to as Evaluation Primitives. The
evaluation primitives carry the instructions needed for the evaluation of the operation.
o Thus, a query evaluation plan defines a sequence of primitive operations used for
evaluating a query. The query evaluation plan is also referred to as the query
execution plan.
o A query execution engine is responsible for generating the output of the given query.
It takes the query execution plan, executes it, and finally makes the output for the user
query.
Optimization
o The cost of query evaluation can vary for different types of queries. Since the
system is responsible for constructing the evaluation plan, the user need not
write the query in its most efficient form.
o Usually, a database system generates an efficient query evaluation plan that
minimizes cost. This task, performed by the database system, is known
as Query Optimization.
o For optimizing a query, the query optimizer should have an estimated cost analysis of
each operation. It is because the overall operation cost depends on the memory
allocations to several operations, execution costs, and so on.
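A toy example of why cost estimates matter: comparing a full table scan against an index lookup for the same selection. The cost formulas below are simplified assumptions for illustration, not those of any real optimizer:

```python
# Toy I/O-cost estimates for two alternative plans for one selection.
def cost_full_scan(num_blocks):
    return num_blocks                      # read every block of the table

def cost_index_lookup(tree_height, matching_blocks):
    return tree_height + matching_blocks   # traverse index, fetch matches

# Assumed statistics: 10000-block table, index of height 3, 50 matching blocks.
print(cost_full_scan(10000))       # 10000
print(cost_index_lookup(3, 50))    # 53 -- the optimizer would pick this plan
```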