Distributed Transactions - Database Systems
Introduction
For much of this semester, we assumed that transactions ran on databases where all of the data
existed on one node (machine). This is often the case for databases with lighter workloads, but as
demand increases, databases scale out to improve performance by using a Shared Nothing
architecture. Each node receives a partition of the data set that is distributed based on a range
or hash key and is connected to other nodes through a network. Distributed Transactions are
needed for executing queries in distributed databases, as a single transaction may need to
perform reads and writes on data that lives on different nodes.
Distributed Locking
Since every node contains data that is independent of any other node’s data, every node can
maintain its own local lock table. This works for locks on data that fits entirely within a single
node, such as pages or tuples. For coarser-grained locks on data that spans multiple nodes, such
as a table or database, the lock can either be given to all nodes containing a partition of that
data or be centralized at a predetermined master node. This design keeps locking simple:
two-phase locking is performed at every node using local locks in order to guarantee
serializability between different transactions.
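As a rough illustration, each node’s local lock table can be modeled as a dictionary, with a coarse table-level lock acquired at every node that holds a partition of the table. The class and function names below are invented for this sketch; a real lock manager would also track lock modes and wait queues.

```python
# Hypothetical sketch: each node keeps a local lock table for its own
# partition; a coarse table lock is acquired at every node that stores
# a partition of that table. All names are made up for illustration.

class Node:
    def __init__(self, name):
        self.name = name
        self.lock_table = {}        # resource -> owning transaction

    def acquire(self, txn, resource):
        owner = self.lock_table.get(resource)
        if owner is not None and owner != txn:
            return False            # conflict: txn must wait
        self.lock_table[resource] = txn
        return True

def acquire_table_lock(txn, table, nodes):
    """Coarse lock: grab the table lock on every node holding a partition."""
    return all(node.acquire(txn, table) for node in nodes)

nodes = [Node("n1"), Node("n2")]
print(acquire_table_lock("T1", "orders", nodes))   # True
print(nodes[0].acquire("T2", "orders"))            # False: T1 holds it
```

The alternative design from the text, centralizing the coarse lock at a master node, would simply call `acquire` on that one predetermined node instead of on all of them.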
When dealing with locking, deadlock is always a possibility. To determine whether deadlock has
occurred in a distributed database, the waits-for graphs of all nodes must be unioned to find
cycles, since a transaction can be blocked by transactions executing on different nodes. Do this
by drawing all of the waits-for graphs on top of each other. If a cycle is detected in the unioned
graph, then there is a deadlock within the distributed system.
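The union-and-check procedure can be sketched in a few lines. This is a minimal sketch assuming each node reports its local waits-for graph as an adjacency mapping; the transaction and node names are invented for illustration.

```python
# Hypothetical sketch: detect distributed deadlock by unioning the
# per-node waits-for graphs and checking the result for a cycle.

def union_graphs(graphs):
    """Overlay every node's local waits-for graph into one global graph."""
    merged = {}
    for graph in graphs:
        for txn, waits_for in graph.items():
            merged.setdefault(txn, set()).update(waits_for)
    return merged

def has_cycle(graph):
    """Standard DFS cycle detection on the unioned waits-for graph."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {txn: WHITE for txn in graph}

    def visit(txn):
        color[txn] = GRAY
        for nxt in graph.get(txn, ()):
            if color.get(nxt, WHITE) == GRAY:      # back edge -> cycle
                return True
            if color.get(nxt, WHITE) == WHITE and nxt in graph and visit(nxt):
                return True
        color[txn] = BLACK
        return False

    return any(visit(t) for t in graph if color[t] == WHITE)

# Node 1 sees T1 waiting on T2; node 2 sees T2 waiting on T1.
# Neither local graph has a cycle, but the union does -> deadlock.
node1 = {"T1": {"T2"}}
node2 = {"T2": {"T1"}}
print(has_cycle(union_graphs([node1, node2])))  # True
```

The example shows why the union matters: each local graph alone is acyclic, so no single node can detect this deadlock by itself.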
https://cs186berkeley.net/notes/note16/ 1/10
6/7/24, 9:26 AM Distributed Transactions | Database Systems
Two Phase Commit
Two Phase Commit (2PC) is a protocol that gives distributed transactions the
property that all nodes maintain the same view of the data. It provides this guarantee by
ensuring that a distributed transaction either commits or aborts on all nodes involved. If
consensus is not enforced, some nodes may commit the transaction while others abort, causing
nodes to have views of data at different points in time.
Every distributed transaction is assigned a coordinator node that is responsible for maintaining
consensus among all participant nodes involved in the transaction. When the transaction is ready
to commit, the coordinator initiates Two Phase Commit.
Phase 1:
1 Coordinator sends a prepare message to the participants to tell them to either prepare for
commit or abort
2 Participants generate a prepare record (if they can commit) or an abort record and flush it to
disk
3 Participants send a yes vote to the coordinator if a prepare record is flushed or a no vote if an
abort record is flushed
4 Coordinator generates a commit record if every vote was yes (an abort record otherwise) and
flushes it to disk
Phase 2:
1 Coordinator broadcasts (sends a message to every participant) the result of the commit/abort
vote based on the flushed record
2 Participants generate a commit or abort record based on the received vote message, flush the
record to disk, and send an ACK to the coordinator
3 Coordinator generates an end record once all ACKs are received and flushes the record
sometime in the future
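The steps above can be sketched as a toy simulation. The function name and log representation are invented for illustration: in-memory lists stand in for flushed log records, and message passing is collapsed into function arguments.

```python
# Hypothetical sketch of the 2PC message flow. A "flush" is modeled as
# an append to an in-memory list; all names are made up.

def two_phase_commit(coordinator_log, participant_logs, votes):
    """Run one round of 2PC. votes[i] is participant i's phase 1 vote."""
    # Phase 1: each participant flushes a prepare or abort record and
    # votes; the coordinator flushes the resulting decision.
    for log, vote in zip(participant_logs, votes):
        log.append("prepare" if vote else "abort")
    decision = "commit" if all(votes) else "abort"
    coordinator_log.append(decision)

    # Phase 2: broadcast the decision; each participant flushes it and
    # ACKs; once all ACKs arrive the coordinator flushes an end record.
    acks = 0
    for log in participant_logs:
        if log[-1] == "prepare":    # a no-voter already logged its abort
            log.append(decision)
        acks += 1
    if acks == len(participant_logs):
        coordinator_log.append("end")
    return decision

coord, parts = [], [[], []]
print(two_phase_commit(coord, parts, votes=[True, True]))  # commit
print(coord)  # ['commit', 'end']
```

A single no vote flips the decision to abort for everyone, which is the unanimity property the recovery discussion below relies on.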
The asterisk * in the diagram above means that the node must wait for that log record to flush to
disk before sending the next message.
How do we accomplish this? Luckily, the 2PC protocol tells us to log the prepare, commit, and
abort records. This means that between looking at our own log and talking to the coordinator
node, we will have all the information needed to recover correctly (for most failure
scenarios).
An important assumption we will make for now is that failures are temporary and that all nodes
eventually recover. Many of 2PC’s invariants break down without this assumption, but that is
beyond the scope of this class.
Let’s look at the specifics. We will analyze what happens for a failure at each possible point in the
protocol, for either the participant or the coordinator.
Note that in some cases, we can determine what recovery decisions to make just by looking at
our own log. In other cases, however, we might also have to talk to the coordinator; we wrap this
logic in a separate process called the recovery process.
No prepare, commit, or abort record (participant recovers):
• This probably means that the participant has not even started 2PC yet – and if it has, it
hasn’t yet sent out any vote messages (since votes are sent after flushing the log record to
disk).
• Since it has not sent out any vote messages, it aborts the transaction locally. No
messages need to be sent out (the participant has no knowledge of the coordinator ID).
Prepare record, but no commit or abort record (participant recovers):
• This situation is trickier. Looking at the diagram above, a lot of things could have
happened between logging the prepare record and crashing – for instance, we don’t even
know if we managed to send out our YES vote!
• Specifically, we don’t know whether or not the coordinator made a commit decision. So
the participant node’s recovery process must ask the coordinator whether a commit
happened ("Did the coordinator log a commit?"). The coordinator can be determined from
the coordinator ID stored in the prepare log record.
• The coordinator will respond with the commit/abort decision, and the participant
resumes 2PC from phase 2.
No commit or end record (coordinator recovers):
• The coordinator crashed at some point before receiving the votes of all participants and
logging a commit decision.
• The coordinator will abort the transaction locally. No messages need to be sent out (the
coordinator has no knowledge of the participant IDs involved in the transaction since it
does not log its own prepare record).
• If the coordinator receives an inquiry from a participant about the status of the
transaction, respond that the transaction aborted.
Commit record, but no end record (coordinator recovers):
• We’d like to commit, but we don’t know if we managed to tell the participants.
• So, rerun phase 2 (send out commit messages to participants). The participants can be
determined from the participant IDs stored in the commit log record.
Commit record (participant recovers):
• We did all our work for this commit, but the coordinator might still be waiting for our ACK,
so send ACK to coordinator. (The coordinator can be determined from the coordinator
ID stored in the commit log record.)
End record (coordinator recovers):
• This means that everybody already finished the transaction and there is no recovery to do.
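The commit-record recovery cases above can be condensed into a small decision table. This is an illustrative sketch, not a real recovery implementation; the function name, record names, and action strings are assumptions made for the example.

```python
# Hypothetical sketch: map (node role, last flushed 2PC record) to a
# recovery action, following the rules in the text. Presumed abort is
# NOT applied here.

def recover(role, last_record):
    """last_record is the last 2PC record in the node's log, or None."""
    if role == "participant":
        if last_record is None:                 # never voted: safe to abort
            return "abort locally"
        if last_record == "prepare":            # decision unknown: ask
            return "ask coordinator for decision, then resume phase 2"
        return "finish locally and send ACK to coordinator"
    else:  # coordinator
        if last_record is None:                 # no decision was logged
            return "abort locally; answer inquiries with 'abort'"
        if last_record in ("commit", "abort"):  # decision logged, not ended
            return "rerun phase 2"
        return "nothing to do"                  # end record

print(recover("participant", "prepare"))
```

Note how the only case that requires talking to another node is the participant with a lone prepare record, which matches the "recovery process" discussion above.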
The list above only discusses commit records. What about abort records?
We could handle them the same way as commit records (e.g. tell participants, send acks). This
works fine, but is it really necessary? What if nodes just didn’t bother recovering aborted
transactions?
The protocol with the presumed abort optimization looks like this:
The asterisk * in the diagram above means that the node must wait for that log record to flush to
disk before sending the next message. Notice that in the presumed abort optimization, abort
records no longer need to be flushed to disk.
And this is how we would recover from failures in aborted transactions, with and without
presumed abort:
No prepare, commit, or abort record (participant recovers):
• Without presumed abort: This probably means that the participant has not even started
2PC yet – and if it has, it hasn’t yet sent out any vote messages (since votes are sent after
flushing the log record to disk).
• With presumed abort: It is possible that the participant decided to abort and sent a "no"
vote to the coordinator before the crash.
• With or without presumed abort, the participant aborts the transaction locally. No
messages need to be sent out (the participant has no knowledge of the coordinator ID).
Abort record logged in phase 1 (participant recovers):
• Without presumed abort: Abort the transaction locally and send “no” vote to the
coordinator. (The coordinator can be determined from the coordinator ID stored in the
abort log record.)
• With presumed abort: Abort the transaction locally. No messages need to be sent out!
(The coordinator will timeout after not hearing from the participant and presume abort.)
No commit or abort record (coordinator recovers):
• Without presumed abort: The coordinator crashed at some point before reaching a
commit/abort decision.
• With presumed abort: It is possible that the coordinator decided to abort and sent out
abort messages to the participants before the crash.
• With or without presumed abort, the coordinator will abort the transaction locally. No
messages need to be sent out (the coordinator has no knowledge of the participant IDs
involved in the transaction).
• If the coordinator receives an inquiry from a participant about the status of the
transaction, respond that the transaction aborted.
Abort record, but no end record (coordinator recovers):
• Without presumed abort: Rerun phase 2 (sending out abort messages to participants). The
participants can be determined from the participant IDs in the abort log record.
• With presumed abort: Abort the transaction locally. No messages need to be sent out!
(Participants who don’t know the decision will ask the coordinator later.)
Abort record logged in phase 2 (participant recovers):
• Without presumed abort: Abort the transaction locally, and send back ACK to coordinator.
(The coordinator can be determined from the coordinator ID stored in the abort log
record.)
• With presumed abort: Abort the transaction locally. No messages need to be sent out!
(ACKs only need to be sent back on commit.)
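The presumed-abort cases above can be condensed the same way. Again a sketch with invented names: every abort path now ends with "send nothing", because a node that hears nothing about a transaction simply presumes it aborted.

```python
# Hypothetical sketch of the recovery table WITH presumed abort.
# Abort paths send no messages; commit paths are unchanged.

def recover_presumed_abort(role, last_record):
    """last_record is the last 2PC record in the node's log, or None."""
    if role == "participant":
        if last_record in (None, "abort"):
            return "abort locally; send nothing (coordinator presumes abort)"
        if last_record == "prepare":
            return "ask coordinator; no record of the txn means abort"
        return "finish locally and send ACK (commit path is unchanged)"
    else:  # coordinator
        if last_record in (None, "abort"):
            return "abort locally; answer inquiries with 'abort'"
        if last_record == "commit":
            return "rerun phase 2 (commit path is unchanged)"
        return "nothing to do"                  # end record

print(recover_presumed_abort("participant", "abort"))
```

Comparing this table to the one without presumed abort makes the optimization concrete: the abort cases collapse into "do nothing beyond local cleanup".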
Note that the recovery process for commit records is the same in Two-Phase Commit and
Two-Phase Commit with Presumed Abort. For instance, the scenarios where the participant
recovers in Two-Phase Commit with Presumed Abort and sees a prepare record or commit
record are handled in the same way as in regular Two-Phase Commit. The same goes for when a
coordinator recovers and sees a commit record.
To wrap everything up, here are some subtleties that you should be explicitly aware of:
• The 2PC recovery decision is commit if and only if the coordinator has logged a commit
record.
• Since 2PC requires unanimous agreement, it will only make progress if all nodes are alive. This
is true for the recovery protocol as well – for recovery to finish, all failed nodes must
eventually come back alive. If the coordinator believes a participant is dead, it can respawn
the participant on a new node based on the log of the original participant, and ignore the
original participant if it does come back online. However, 2PC struggles to handle scenarios
where the coordinator is dead. For example, consider a scenario where all participants vote
yes in Phase 1, but the coordinator crashes before sending out a commit decision. The
participants will keep pinging the dead coordinator for the status of the transaction, and the
system is blocked from making progress.
Practice Questions
1 Suppose that there are some transactions happening concurrently. If no two concurrent
transactions ever operate on rows that are being stored in the same node, can a deadlock still
happen?
2 Suppose a participant receives a prepare message for a transaction. Assuming the participant
node remains online and does not fail, why might it decide to log an abort record and vote
no?
3 Suppose a participant receives a prepare message for a transaction and replies VOTE-YES .
Suppose that they were running a wound-wait deadlock avoidance policy, and a transaction
comes in with higher priority. Will the new transaction abort the prepared transaction?
4 True or false - if we, as a participant node, are recovering and see a PREPARE record, it means
we must have sent out a YES vote.
5 True or false - if the coordinator is recovering and sees a COMMIT record, we can locally commit
and end the process there.
6 How many messages and log flushes does the presumed abort optimization skip if the
transaction commits? What about if it aborts due to the participants aborting?
Solutions
1 You may be inclined to say no, since a transaction does not seem like it will ever wait on the
locks of another transaction in this scenario. But remember that in addition to each node
maintaining its own local lock table, coarser grained locks, such as locks for tables or the
entire database, can still be shared between nodes, and the transactions may still conflict
via these coarser locks.
2 The participant may decide to abort the transaction and vote no to avoid deadlock. For
example, if we are using deadlock detection and we discover that the union of the waits-for
graphs for each node contains a cycle the transaction is involved in, we may decide to abort
the transaction. If we are using a deadlock avoidance policy, we will abort the transaction if
acquiring the locks necessary to perform the transaction’s operations would require us to wait
on a lower priority transaction (for wait-die) or higher priority transaction (for wound-wait).
3 No - transactions that are prepared must be ready to commit if the coordinator tells them to,
so they cannot be aborted by anyone other than the coordinator.
4 False! We could have crashed between logging PREPARE and sending VOTE-YES .
5 False! There is no guarantee that all participants received the COMMIT message, so we must
rerun phase 2 first.
6 If the transaction commits, no messages or log flushes are skipped - the messages sent and
records logged are the same as the protocol without presumed abort.
If the participants abort, they do not flush any log records (skipping up to two flushes) and
only have to send one message (they don’t have to send the ACK). The coordinator also does
not have to flush any log records (skipping one flush), but does not skip any messages (it still
sends the messages it would without presumed abort).
1 In reality, there is no difference between a phase 1 and phase 2 abort record. Therefore, for
the without presumed abort case, a participant can simply respond with NO to the
coordinator upon seeing an abort record. The coordinator will then treat that message as
either a phase 1 vote or a phase 2 ACK accordingly. ↩