Unit 5
Concurrency Control
Lock-Based Protocols
Timestamp-Based Protocols
Validation-Based Protocols
Multiple Granularity
Multiversion Schemes
Deadlock Handling
Insert and Delete Operations and Predicate Reads
Weak Levels of Consistency
Concurrency in Index Structure
Lock-Based Protocols
A lock is a mechanism to control concurrent access to a data item. Data items can be locked in two modes:
1. exclusive (X) mode. The data item can be both read and written. An X-lock is requested using the lock-X instruction.
2. shared (S) mode. The data item can only be read. An S-lock is requested using the lock-S instruction.
Lock requests are made to the concurrency-control manager. A transaction can proceed only after the request is granted.
Lock-compatibility matrix
        S     X
  S     yes   no
  X     no    no
A transaction may be granted a lock on an item if the requested lock is compatible with locks already held on the item by other transactions.
Any number of transactions can hold shared locks on an item, but if any transaction holds an exclusive lock on the item, no other transaction may hold any lock on the item.
If a lock cannot be granted, the requesting transaction is made to wait till all incompatible locks
held by other transactions have been released. The lock is then granted.
Example of a transaction performing locking:
T2: lock-S(A);
read (A);
unlock(A);
lock-S(B);
read (B);
unlock(B);
display(A+B)
Locking as above is not sufficient to guarantee serializability: if A and B get updated in between the read of A and the read of B, the displayed sum would be wrong.
A locking protocol is a set of rules followed by all transactions while requesting and releasing locks.
Locking protocols restrict the set of possible schedules.
Pitfalls of Lock-Based Protocols
The potential for deadlock exists in most locking protocols. Starvation is also possible if the concurrency-control manager is badly designed; for example, a transaction may be waiting for an X-lock on an item while a sequence of other transactions request and are granted S-locks on the same item.
Lock Conversions
Two-phase locking can be refined to allow lock conversions: in the growing phase a transaction may acquire locks and may upgrade a lock-S to a lock-X; in the shrinking phase it may release locks and may downgrade a lock-X to a lock-S. This protocol still assures serializability.
Automatic Acquisition of Locks
A transaction Ti issues the standard read/write instruction, without explicit locking calls. The operation read(D) is processed as:
if Ti has a lock on D
  then
    read(D)
  else begin
    if necessary wait until no other transaction has a lock-X on D
    grant Ti a lock-S on D;
    read(D)
  end
write(D) is processed as:
if Ti has a lock-X on D
  then
    write(D)
  else begin
    if necessary wait until no other transaction has any lock on D
    if Ti has a lock-S on D
      then
        upgrade lock on D to lock-X
      else
        grant Ti a lock-X on D
    write(D)
  end;
All locks are released after commit or abort.
A lock manager can be implemented as a separate process to which transactions send lock and unlock requests.
The lock manager replies to a lock request by sending a lock-grant message (or a message asking the transaction to roll back, in case of a deadlock).
The requesting transaction waits until its request is answered.
The lock manager maintains a data structure called a lock table to record granted locks and pending requests.
The lock table is usually implemented as an in-memory hash table indexed on the name of the data item being locked.
Lock tables
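The sketch below is a minimal Python illustration of how such a lock table could be organized: a hash table from data-item name to a list of granted and waiting requests, with a request granted only if it is compatible with every lock already granted on the item. The LockTable class and its method names are assumptions made for illustration, not the interface of any real lock manager.

from collections import defaultdict

COMPATIBLE = {("S", "S"): True, ("S", "X"): False,
              ("X", "S"): False, ("X", "X"): False}

class LockTable:
    """In-memory hash table: data-item name -> list of (txn, mode, granted)."""
    def __init__(self):
        self.table = defaultdict(list)

    def request(self, txn, item, mode):
        entries = self.table[item]
        # Grant only if compatible with every lock already granted on the item.
        if all(COMPATIBLE[(mode, m)] for _, m, granted in entries if granted):
            entries.append((txn, mode, True))
            return "granted"
        entries.append((txn, mode, False))   # queue the request; txn must wait
        return "wait"

    def release(self, txn, item):
        entries = self.table[item]
        entries[:] = [e for e in entries if e[0] != txn]
        # Try to grant any pending requests that are now compatible.
        for i, (t, m, granted) in enumerate(entries):
            if not granted and all(COMPATIBLE[(m, m2)] for _, m2, g2 in entries if g2):
                entries[i] = (t, m, True)

# Example: two shared locks coexist, an exclusive request must wait.
lt = LockTable()
print(lt.request("T1", "A", "S"))   # granted
print(lt.request("T2", "A", "S"))   # granted
print(lt.request("T3", "A", "X"))   # wait
lt.release("T1", "A"); lt.release("T2", "A")   # T3's request can now be granted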
Tree Protocol
The tree protocol is a graph-based protocol in which the data items are organized as a tree and only exclusive locks are allowed. The first lock by Ti may be on any data item; subsequently, a data item Q can be locked by Ti only if the parent of Q is currently locked by Ti. Data items may be unlocked at any time, but a data item that has been locked and unlocked by Ti cannot subsequently be relocked by Ti. The tree protocol ensures conflict serializability as well as freedom from deadlock.
Timestamp-Based Protocols
Each transaction is issued a timestamp when it enters the system. If an old transaction Ti has time-
stamp TS(Ti), a new transaction Tj is assigned time-stamp TS(Tj) such that TS(Ti) <TS(Tj).
The protocol manages concurrent execution such that the time- stamps determine the
serializability order.
The protocol maintains for each data item Q two timestamp values: W-timestamp(Q), the largest timestamp of any transaction that executed write(Q) successfully, and R-timestamp(Q), the largest timestamp of any transaction that executed read(Q) successfully.
Suppose a transaction Ti issues a read(Q):
1. If TS(Ti) < W-timestamp(Q), then Ti needs to read a value of Q that was already overwritten. Hence, the read operation is rejected, and Ti is rolled back.
2. If TS(Ti) ≥ W-timestamp(Q), then the read operation is executed, and R-timestamp(Q) is set to the maximum of R-timestamp(Q) and TS(Ti).
Suppose a transaction Ti issues a write(Q):
1. If TS(Ti) < R-timestamp(Q), then the value of Q that Ti is producing was needed previously, and the system assumed that that value would never be produced. Hence, the write operation is rejected, and Ti is rolled back.
2. If TS(Ti) < W-timestamp(Q), then Ti is attempting to write an obsolete value of Q. Hence, this write operation is rejected, and Ti is rolled back.
3. Otherwise, the write operation is executed, and W-timestamp(Q) is set to TS(Ti).
The timestamp-ordering protocol guarantees serializability since all the arcs in the precedence graph are of the form: transaction with smaller timestamp → transaction with larger timestamp.
Thus, there will be no cycles in the precedence graph.
The timestamp-ordering protocol can, however, generate schedules that are not recoverable or cascade-free. Solution:
A transaction is structured such that its writes are all performed at the end of its processing
All writes of a transaction form an atomic action; no transaction may execute while a transaction is
being written
A transaction that aborts is restarted with a new timestamp
Thomas' Write Rule
A modified version of the timestamp-ordering protocol in which obsolete write operations may be ignored under certain circumstances.
When Ti attempts to write data item Q, if TS(Ti) < W-timestamp(Q), then Ti is attempting to write an obsolete value of Q. Hence, rather than rolling back Ti as the timestamp-ordering protocol would have done, this write operation can be ignored.
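The following Python sketch illustrates the timestamp-ordering checks described above, including Thomas' write rule. The Item class and the string return values are illustrative assumptions, not part of any real DBMS API.

class Item:
    def __init__(self):
        self.value = None
        self.r_ts = 0   # largest timestamp of a successful read(Q)
        self.w_ts = 0   # largest timestamp of a successful write(Q)

def read(item, ts):
    if ts < item.w_ts:
        return "rollback"            # Ti would read an already-overwritten value
    item.r_ts = max(item.r_ts, ts)   # R-timestamp(Q) := max(R-timestamp(Q), TS(Ti))
    return item.value

def write(item, ts, value, thomas=True):
    if ts < item.r_ts:
        return "rollback"            # a later reader already needed the old value
    if ts < item.w_ts:
        # Thomas' write rule: the write is obsolete and can simply be ignored.
        return "ignored" if thomas else "rollback"
    item.value, item.w_ts = value, ts
    return "ok"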
Validation-Based Protocol
Execution of transaction Ti is done in three phases:
1. Read and execution phase: Transaction Ti writes only to temporary local variables.
2. Validation phase: Transaction Ti performs a ``validation test'' to determine if the local variables can be written without violating serializability.
3. Write phase: If Ti is validated, the updates are applied to the database; otherwise, Ti is rolled
back.
The three phases of concurrently executing transactions can be interleaved, but each transaction
must go through the three phases in that order.
Also called as optimistic concurrency control since transaction executes fully in the hope that all
will go well during validation
Each transaction Ti has three timestamps:
− Start(Ti): the time when Ti started its execution
− Validation(Ti): the time when Ti entered its validation phase
− Finish(Ti): the time when Ti finished its write phase
Validation test for transaction Tj: for each Ti with TS(Ti) < TS(Tj), either
1. finish(Ti) < start(Tj), so there is no overlapped execution, or
2. the set of data items written by Ti does not intersect with the set of data items read by Tj, and Ti completes its write phase before Tj starts its validation phase (start(Tj) < finish(Ti) < validation(Tj)). In this case the writes of Ti do not affect the reads of Tj, since Tj does not read any item written by Ti.
If one of these conditions holds for every such Ti, validation succeeds and Tj can be committed; otherwise validation fails and Tj is aborted.
Example of a schedule produced using validation, where T15 transfers 50 from B to A and T14 displays the sum A + B:
T14                      T15
read(B)
                         read(B)
                         B := B - 50
                         read(A)
                         A := A + 50
read(A)
(validate)
display(A+B)
                         (validate)
                         write(B)
                         write(A)
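A small Python sketch of the validation test described above, applied to a pair Ti, Tj with TS(Ti) < TS(Tj). The way the three timestamps and the read/write sets are represented is an assumption made for illustration.

from types import SimpleNamespace

def validate(ti, tj):
    # Test Tj against an earlier transaction Ti (TS(Ti) < TS(Tj)).
    if ti.finish < tj.start:
        return True                      # no overlapped execution at all
    if ti.finish < tj.validation and not (ti.write_set & tj.read_set):
        return True                      # Ti's writes cannot affect Tj's reads
    return False                         # otherwise Tj must be aborted

ti = SimpleNamespace(finish=5, write_set={"B"})
tj = SimpleNamespace(start=3, validation=7, read_set={"A", "B"})
print(validate(ti, tj))   # False: Ti wrote B, which Tj read while they overlapped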
Multiple Granularity
Allow data items to be of various sizes and define a hierarchy of data granularities, where the small
granularities are nested within larger ones
Can be represented graphically as a tree (but don't confuse with tree-locking protocol)
When a transaction locks a node in the tree explicitly, it implicitly locks all the node's descendants in the same mode.
In addition to S and X lock modes, there are three additional lock modes with multiple
granularity:
intention-shared (IS): indicates explicit locking at a lower level of the tree but only with shared
locks.
intention-exclusive (IX): indicates explicit locking at a lower level with exclusive or shared locks
shared and intention-exclusive (SIX): the subtree rooted by that node is locked explicitly in
shared mode and explicit locking is being done at a lower level with exclusive-mode locks.
Intention locks allow a higher-level node to be locked in S or X mode without having to check all descendant nodes.
The compatibility matrix for all lock modes is:
        IS    IX    S     SIX   X
  IS    yes   yes   yes   yes   no
  IX    yes   yes   no    no    no
  S     yes   no    yes   no    no
  SIX   yes   no    no    no    no
  X     no    no    no    no    no
Transaction Ti can lock a node Q using the following rules:
1. The lock compatibility matrix must be observed.
2. The root of the tree must be locked first, and may be locked in any mode.
3. A node Q can be locked by Ti in S or IS mode only if the parent of Q is currently locked by Ti in either IX or IS mode.
4. A node Q can be locked by Ti in X, SIX, or IX mode only if the parent of Q is currently locked by Ti
in either IX or SIX mode.
5. Ti can lock a node only if it has not previously unlocked any node (that is, Ti is two-phase).
6. Ti can unlock a node Q only if none of the children of Q are currently locked by Ti.
Observe that locks are acquired in root-to-leaf order,whereas they are released in leaf-to-root
order.
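The parent-mode requirement of rules 3 and 4 can be expressed compactly. The following Python sketch is illustrative only; the node names and the may_lock function are assumptions, not the protocol implementation of any system.

# Which modes the parent must be held in before a child may be locked in `mode`.
REQUIRED_PARENT_MODES = {
    "IS": {"IX", "IS"}, "S": {"IX", "IS"},        # rule 3
    "IX": {"IX", "SIX"}, "SIX": {"IX", "SIX"},    # rule 4
    "X": {"IX", "SIX"},                           # rule 4
}

def may_lock(node, mode, held_parent_mode):
    if node == "root":
        return True                               # rule 2: root may be locked in any mode
    return held_parent_mode in REQUIRED_PARENT_MODES[mode]

print(may_lock("file_A", "X", "IX"))    # True
print(may_lock("file_A", "S", "SIX"))   # False: parent must be held in IS or IX for an S lock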
Multiversion Schemes
Multiversion schemes keep old versions of each data item to increase concurrency. In multiversion timestamp ordering, each data item Q has a sequence of versions; each version Qk contains the value of the version, W-timestamp(Qk) (the timestamp of the transaction that created the version), and R-timestamp(Qk) (the largest timestamp of any transaction that successfully read the version).
Suppose that transaction Ti issues a read(Q) or write(Q) operation. Let Qk denote the version of Q whose write timestamp is the largest write timestamp less than or equal to TS(Ti).
1. If transaction Ti issues a read(Q), then the value returned is the content of version Qk.
2. If transaction Ti issues a write(Q), and if TS(Ti) < R-timestamp(Qk), then transaction Ti is rolled
back. Otherwise, if TS(Ti) = W-timestamp(Qk), the contents of Qk are overwritten, otherwise a new
version of Q is created.
Reads always succeed; a write by Ti is rejected if some other transaction Tj that (in the serialization
order defined by the timestamp values) should read Ti's write, has already read a version created
by a transaction older than Ti.
Multiversion Two-Phase Locking
Multiversion two-phase locking differentiates between read-only transactions and update transactions. Read-only transactions are assigned a timestamp by reading the current value of ts-counter before they start execution; they follow the multiversion timestamp-ordering protocol for performing reads.
When an update transaction wants to read a data item, it obtains a shared lock on it, and reads the
latest version.
When it wants to write an item, it obtains an X lock on it; it then creates a new version of the item and sets this version's timestamp to ∞.
When update transaction Ti completes, commit processing occurs:
◦ Ti sets the timestamp on the versions it has created to ts-counter + 1
◦ Ti increments ts-counter by 1
Read-only transactions that start after Ti increments ts-counter will see the values updated by Ti.
Read-only transactions that start before Ti increments the ts-counter will see the value before the updates by Ti. Only serializable schedules are produced.
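The following Python sketch illustrates the multiversion timestamp-ordering rules above. The Version class and the helper function are illustrative assumptions.

class Version:
    def __init__(self, value, w_ts):
        # A new version's R-timestamp and W-timestamp are both initialized to TS(Ti).
        self.value, self.w_ts, self.r_ts = value, w_ts, w_ts

def latest_version_before(versions, ts):
    """Qk: the version whose write timestamp is the largest one <= TS(Ti)."""
    return max((v for v in versions if v.w_ts <= ts), key=lambda v: v.w_ts)

def mv_read(versions, ts):
    qk = latest_version_before(versions, ts)
    qk.r_ts = max(qk.r_ts, ts)
    return qk.value

def mv_write(versions, ts, value):
    qk = latest_version_before(versions, ts)
    if ts < qk.r_ts:
        return "rollback"                     # a newer transaction already read Qk
    if ts == qk.w_ts:
        qk.value = value                      # overwrite Ti's own version
    else:
        versions.append(Version(value, ts))   # create a new version of Q
    return "ok"

versions = [Version(100, 0)]                  # initial version with W-timestamp 0
print(mv_read(versions, ts=5))                # 100; R-timestamp of the version becomes 5
print(mv_write(versions, ts=3, value=90))     # rollback: a transaction with timestamp 5 already read it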
Deadlock Handling
Consider the following two transactions:
T1: write(X)        T2: write(Y)
    write(Y)            write(X)
Schedule with deadlock:
T1                          T2
lock-X on X
write(X)
                            lock-X on Y
                            write(Y)
                            wait for lock-X on X
wait for lock-X on Y
System is deadlocked if there is a set of transactions such that every transaction in the set is waiting
for another transaction in the set.
Deadlock prevention protocols ensure that the system will never enter into a deadlock state.
Some prevention strategies :
Require that each transaction locks all its data items before it begins execution (predeclaration).
Impose partial ordering of all data items and require that a transaction can lock data items
only in the order specified by the partial order (graph-based protocol).
The following schemes use transaction timestamps for the sake of deadlock prevention alone.
wait-die scheme (non-preemptive)
◦ an older transaction may wait for a younger one to release a data item. Younger transactions never wait for older ones; they are rolled back instead.
◦ a transaction may die several times before acquiring a needed data item
wound-wait scheme — preemptive
◦ older transaction wounds (forces rollback) of younger transaction instead of waiting for it.
Younger transactions may wait for older ones.
may be fewer rollbacks than wait-die scheme.
In both the wait-die and wound-wait schemes, a rolled-back transaction is restarted with its original timestamp. Older transactions thus have precedence over newer ones, and starvation is hence avoided.
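A tiny Python sketch of the two decisions, where a smaller timestamp means an older transaction; the function names and return strings are illustrative.

def wait_die(requester_ts, holder_ts):
    # Non-preemptive: an older requester may wait; a younger requester dies.
    return "wait" if requester_ts < holder_ts else "die (rollback requester)"

def wound_wait(requester_ts, holder_ts):
    # Preemptive: an older requester wounds (rolls back) the younger holder;
    # a younger requester waits for the older holder.
    return "wound (rollback holder)" if requester_ts < holder_ts else "wait"

print(wait_die(5, 10))    # older transaction asks younger holder: wait
print(wound_wait(10, 5))  # younger transaction asks older holder: wait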
Timeout-Based Schemes :
◦ a transaction waits for a lock only for a specified amount of time.
After that, the wait times out and the transaction is rolled back.
Deadlock Recovery
When a deadlock is detected, some transaction will have to be rolled back (made a victim) to break the deadlock. Select as victim the transaction that will incur the minimum cost.
Rollback: determine how far to roll back the transaction.
◦ Total rollback: abort the transaction and then restart it.
◦ It is more effective to roll back the transaction only as far as necessary to break the deadlock.
Starvation can occur if the same transaction is always chosen as victim; including the number of rollbacks in the cost factor avoids starvation.
Insert and Delete Operations
• delete(Q) deletes data item Q from the database.
• insert(Q) inserts a new data item Q into the database and assigns Q an initial value.
A delete operation may be performed only if the transaction deleting the tuple has an exclusive
lock on the tuple to be deleted.
A transaction that inserts a new tuple into the database is given an X-mode lock on the tuple
Insertions and deletions can lead to the phantom phenomenon.
A transaction that scans a relation (e.g., find all accounts in Perryridge) and a transaction that inserts a tuple in the relation (e.g., insert a new account at Perryridge) may conflict in spite of not accessing any tuple in common.
If only tuple locks are used, non-serializable schedules can result: the scan transaction may
not see the new account, yet may be serialized before the insert transaction.
The transaction scanning the relation is reading information that indicates what tuples the relation
contains, while a transaction inserting a tuple updates the same information.
The information should be locked.
One solution:
Associate a data item with the relation, to represent the information about what tuples the relation
contains.
Transactions scanning the relation acquire a shared lock on the data item.
Transactions inserting or deleting a tuple acquire an exclusive lock on the data item.
The above protocol provides very low concurrency for insertions/deletions.
Index locking protocols provide higher concurrency while preventing the phantom
phenomenon, by requiring locks on certain index buckets.
Deletion
Let Ii and Ij be instructions of Ti and Tj, respectively, that appear in schedule S in consecutive order. Let Ii = delete(Q). We consider several instructions Ij.
• Ij = read(Q). Ii and Ij conflict. If Ii comes before Ij, Tj will have a logical error. If Ij comes before Ii, Tj can execute the read operation successfully.
• Ij = write(Q). Ii and Ij conflict. If Ii comes before Ij, Tj will have a logical error. If Ij comes before Ii, Tj can execute the write operation successfully.
• Ij = delete(Q). Ii and Ij conflict. If Ii comes before Ij, Tj will have a logical error. If Ij comes before Ii, Ti will have a logical error.
• Under the two-phase locking protocol, an exclusive lock is required on a data item before that item can
be deleted.
• Under the timestamp-ordering protocol, a test similar to that for a write must be performed. Suppose
that transaction Ti issues delete(Q).
◦ If TS(Ti ) < R-timestamp(Q), then the value of Q that Ti was to delete has already been read by a
transaction Tj with TS(Tj ) > TS(Ti ). Hence, the delete operation is rejected, and Ti is rolled back.
◦ If TS(Ti) < W-timestamp(Q), then a transaction Tj with TS(Tj) > TS(Ti) has written Q. Hence, this delete operation is rejected, and Ti is rolled back.
◦ Otherwise, the delete operation is executed.
Insertion
Since an insert(Q) assigns a value to data item Q, an insert is treated similarly to a write for concurrency-
control purposes:
• Under the two-phase locking protocol, if Ti performs an insert(Q) operation, Ti is given an exclusive lock on the newly created data item Q.
• Under the timestamp-ordering protocol, if Ti performs an insert(Q) operation, R-timestamp(Q) and W-timestamp(Q) are set to TS(Ti).
Consider transaction T30 that executes the following SQL query on the university database:
select count(*)
from instructor
where dept_name = 'Physics';
Transaction T30 requires access to all tuples of the instructor relation pertaining to the Physics department. Let T31 be a transaction that inserts a new tuple for a Physics instructor into the instructor relation.
Let S be a schedule involving T30 and T31. We expect there to be a potential for conflict for the following reasons:
• If T30 uses the tuple newly inserted by T31 in computing count(*), then T30 reads a value written by T31.
Thus, in a serial schedule equivalent to S, T31 must come before T30.
• If T30 does not use the tuple newly inserted by T31 in computing count(*), then in a serial schedule
equivalent to S, T30 must come before T31.
The second of these two cases is curious. T30 and T31 do not access any tuple in common, yet they conflict
with each other! In effect, T30 and T31 conflict on a phantom tuple. If concurrency control is performed at
the tuple granularity, this conflict would go undetected. As a result, the system could fail to prevent a
nonserializable schedule. This problem is called the phantom phenomenon.
Index Locking Protocol
Every relation must have at least one index. Access to a relation must be made only through one of the indices on the relation.
A transaction Ti that performs a lookup must lock all the index buckets that it accesses, in S-mode.
A transaction Ti may not insert a tuple ti into a relation r without updating all indices to r.
Ti must perform a lookup on every index to find all index buckets that could have possibly
contained a pointer to tuple ti, had it existed already, and obtain locks in X-mode on all these
index buckets. Ti must also obtain locks in X-mode on all index buckets
that it modifies.
Weak Levels of Consistency
Degree-two consistency: differs from two-phase locking in that S-locks may be released at any time, and locks may be acquired at any time.
X-locks must be held till end of transaction
Serializability is not guaranteed, programmer must ensure that no erroneous database state will
occur.
Cursor stability:
Cursor stability is a form of degree-two consistency designed for programs that iterate over tuples of a
relation by using cursors. Instead of locking the entire relation, cursor stability ensures that:
• The tuple that is currently being processed by the iteration is locked in shared mode.
• Any modified tuples are locked in exclusive mode until the transaction commits.
Concurrency-control protocols usually consider transactions that do not involve user interaction. Transactions that wait for user input may remain active for a long time, so holding locks for their entire duration is impractical. A common alternative is optimistic validation using version numbers: a version number is stored with each tuple, the transaction notes the version number when it first reads a tuple, and updates are applied at commit time as follows.
• For each updated tuple, the transaction checks if the current version number is the same as the version
number of the tuple when it was first read by the transaction.
1. If the version numbers match, the update is performed on the tuple in the database, and its version
number is incremented by 1.
2. If the version numbers do not match, the transaction is aborted, rolling back all the updates it
performed.
Transactions that involve user interaction are called conversations in Hibernate to differentiate them from regular transactions; validation using version numbers is very useful for such transactions.
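A minimal Python sketch of this version-number check, using an in-memory dictionary in place of the database; a real system would issue the equivalent UPDATE ... WHERE version = ... statement. The table layout and function name are illustrative assumptions.

table = {"acct_1": {"balance": 100, "version": 7}}

def commit_update(key, new_balance, version_read):
    row = table[key]
    if row["version"] != version_read:
        return "abort"                     # someone else updated the tuple meanwhile
    row["balance"] = new_balance
    row["version"] += 1                    # increment the version on a successful update
    return "commit"

v = table["acct_1"]["version"]             # remembered when the tuple was first read
print(commit_update("acct_1", 150, v))     # commit
print(commit_update("acct_1", 175, v))     # abort: the version has moved on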
SNAPSHOT ISOLATION
Snapshot isolation is a particular type of concurrency-control scheme that has gained wide acceptance in commercial and open-source systems, including Oracle, PostgreSQL, and SQL Server. Under snapshot isolation, each transaction is given its own snapshot of the database, taken at the time it begins execution; it reads from that snapshot in isolation from concurrent transactions, and its updates become visible to other transactions only when it commits.
Deciding whether or not to allow an update transaction to commit requires some care. Potentially,
two transactions running concurrently might both update the same data item. Since these two transactions
operate in isolation using their own private snapshots, neither transaction sees the update made by the
other. If both transactions are allowed to write to the database, the first update written will be overwritten
by the second. The result is a lost update. Clearly, this must be prevented. There are two variants of
snapshot isolation, both of which prevent lost updates. They are called first committer wins and first updater wins. Both approaches are based on testing the transaction against concurrent transactions. A transaction is said to be concurrent with T if it was active or partially committed at any point from the start of T up to and including the time when this test is being performed.
Under first committer wins, when a transaction T enters the partially committed state, the
following actions are taken in an atomic action:
A test is made to see if any transaction that was concurrent with T has already written an update to the database for some data item that T intends to write.
• If some such transaction is found, then T aborts.
• If no such transaction is found, then T commits and its updates are written to the database.
This approach is called “first committer wins” because if transactions conflict, the first one to be tested
using the above rule succeeds in writing its updates, while the subsequent ones are forced to abort.
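A minimal Python sketch of the first-committer-wins test; the committed list and the timestamp bookkeeping are illustrative assumptions, not how any particular system records this information.

committed = []   # list of (commit_ts, write_set) for already-committed transactions

def try_commit(start_ts, write_set, next_ts):
    # Abort if any transaction concurrent with T already committed a write
    # to a data item that T also wants to write (a lost update would occur).
    for commit_ts, other_writes in committed:
        if commit_ts > start_ts and write_set & other_writes:
            return "abort"
    committed.append((next_ts, set(write_set)))  # T commits; its writes become visible
    return "commit"

print(try_commit(start_ts=1, write_set={"A"}, next_ts=3))   # commit
print(try_commit(start_ts=2, write_set={"A"}, next_ts=4))   # abort: concurrent writer of A committed first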
Under first updater wins the system uses a locking mechanism that applies only to updates.
When a transaction Ti attempts to update a data item, it requests a write lock on that data item. If the lock
is not held by a concurrent transaction, the following steps are taken after the lock is acquired:
• If the item has been updated by any concurrent transaction, then Ti aborts.
• Otherwise Ti may proceed with its execution, including possibly committing.
If, however, some other concurrent transaction Tj already holds a write lock on that data item, then Ti cannot proceed and the following rules are followed: Ti waits until Tj aborts or commits.
◦ If Tj aborts, then the lock is released and Ti can obtain the lock. After the lock is acquired, the check for an update by a concurrent transaction is performed as described earlier: Ti aborts if a concurrent transaction had updated the data item, and proceeds with its execution otherwise.
◦ If Tj commits, then Ti must abort.
This approach is called “first updater wins” because if transactions conflict, the first one to obtain the lock is
the one that is permitted to commit and perform its update. Those that attempt the update later abort
unless the first updater subsequently aborts for some other reason.
Serializability Issues
Snapshot isolation is attractive in practice because the overhead is low and no aborts occur unless two
concurrent transactions update the same data item.
There is, however, one serious problem with the snapshot isolation scheme as we have presented it, and as it is implemented in practice: snapshot isolation does not ensure serializability.
1. Suppose that we have two concurrent transactions Ti and Tj and two data items A and B. Suppose that Ti reads A and B, then updates B, while Tj reads A and B, then updates A. For simplicity, we assume there are no other concurrent transactions. Since Ti and Tj are concurrent, neither transaction sees the update by the other in its snapshot. But, since they update different data items, both are allowed to commit regardless of whether the system uses the first-updater-wins policy or the first-committer-wins policy.
However, the precedence graph has a cycle. There is an edge in the precedence graph from Ti to Tj because
Ti reads the value of A that existed before Tj writes A. There is also an edge in the precedence graph from Tj
to Ti because Tj reads the value of B that existed before Ti writes B. Since there is a cycle in the precedence
graph, the result is a nonserializable schedule.
This situation, where each of a pair of transactions has read data that is written by the other, but there is
no data written by both transactions, is referred to as write skew.
Chapter 4.3 : Recovery System
Failure Classification
Storage Structure
Recovery and Atomicity
Log-Based Recovery
Shadow Paging
Recovery With Concurrent Transactions
Buffer Management
Failure with Loss of Nonvolatile Storage
Early Lock Release and Logical Undo Operations
Failure Classification
Transaction failure :
Logical errors: transaction cannot complete due to some internal error condition
System errors: the database system must terminate an active transaction due to an error
condition (e.g., deadlock)
System crash: a power failure or other hardware or software failure causes the system to crash.
Fail-stop assumption: non-volatile storage contents are assumed to not be corrupted by a system crash. Database systems have numerous integrity checks to prevent corruption of disk data.
Disk failure: a head crash or similar disk failure destroys all or part of disk storage
Destruction is assumed to be detectable: disk drives use checksums to detect failures
Storage
Volatile storage: does not survive system crashes; examples: main memory, cache memory.
Nonvolatile storage: survives system crashes; examples: disk, tape, flash memory, non-volatile (battery-backed-up) RAM.
Stable storage: a mythical form of storage that survives all failures; approximated by maintaining multiple copies on distinct nonvolatile media.
Stable-Storage Implementation
Maintain multiple copies of each block on separate disks; copies can be at remote sites to protect against disasters such as fire or flooding. Failure during data transfer can still result in inconsistent copies. To protect storage media from failure during data transfer, execute an output operation as follows (assuming two copies of each block):
1. Write the information onto the first physical block.
2. When the first write successfully completes, write the same information onto the second physical block.
3. The output is completed only after the second write successfully completes.
Copies of a block may differ due to a failure during an output operation. To recover from such a failure:
1. First find inconsistent blocks. An expensive solution is to compare the two copies of every disk block; a better solution is to record in-progress disk writes on non-volatile storage, and during recovery compare only the copies of blocks that may be inconsistent.
2. If either copy of an inconsistent block is detected to have an error (bad checksum), overwrite it by the other copy. If both copies have no error but are different, overwrite the second block by the first block.
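A Python sketch of the two-copy output procedure, using two directories of ordinary files to stand in for two independent disks; checksums and the list of in-progress writes are omitted, and the file names and function names are illustrative assumptions.

import os

def stable_write(block_id, data, dir1="disk1", dir2="disk2"):
    os.makedirs(dir1, exist_ok=True); os.makedirs(dir2, exist_ok=True)
    with open(f"{dir1}/{block_id}", "wb") as f:   # 1. write the first physical copy
        f.write(data); f.flush(); os.fsync(f.fileno())
    with open(f"{dir2}/{block_id}", "wb") as f:   # 2. only then write the second copy
        f.write(data); f.flush(); os.fsync(f.fileno())
    # 3. the output is complete only after both writes have succeeded

def recover_block(block_id, dir1="disk1", dir2="disk2"):
    with open(f"{dir1}/{block_id}", "rb") as f:
        c1 = f.read()
    with open(f"{dir2}/{block_id}", "rb") as f:
        c2 = f.read()
    if c1 != c2:                                  # copies differ: failure during output
        with open(f"{dir2}/{block_id}", "wb") as f:
            f.write(c1)                           # overwrite the second copy with the first
    return c1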
Data Access
We assume, for simplicity, that each data item fits in, and is stored inside, a single block.
Transaction transfers data items between system buffer blocks and its private work-area using the
following operations :
read(X) assigns the value of data item X to the local variable xi.
write(X) assigns the value of local variable xi to data item X in the buffer block.
both these commands may necessitate the issue of an input(BX) instruction before the
assignment, if the block BX in which X resides is not already in memory.
Transactions
◦ Perform read(X) while accessing X for the first time;
◦ All subsequent accesses are to the local copy.
◦ After last access, transaction executes write(X).
output(BX) need not immediately follow write(X). System can perform the output operation when
it deems fit.
Recovery and Atomicity
Modifying the database without ensuring that the transaction will commit may leave the database in an inconsistent state.
Consider transaction Ti that transfers $50 from account A to account B; goal is either to perform all
database modifications made by Ti or none at all.
Several output operations may be required for Ti (to output A and B). A failure may occur after one
of these modifications have been made but before all of them are made.
To ensure atomicity despite failures, we first output information describing the modifications to
stable storage without modifying the database itself.
We study two approaches:
◦ log-based recovery, and
◦ shadow-paging
We assume (initially) that transactions run serially, that is, one after the other.
Log-Based Recovery
A log is a sequence of log records kept on stable storage; it records information about all update activities on the database. When transaction Ti starts, it registers itself by writing a <Ti start> log record; before Ti executes write(X), a log record <Ti, X, V1, V2> is written, where V1 is the value of X before the write and V2 is the value to be written to X; when Ti finishes its last statement, the log record <Ti commit> is written.
The deferred database modification scheme records all modifications to the log, but defers all the writes to after partial commit.
The write is not performed on X at this time, but is deferred. When Ti partially commits, <Ti commit> is written to the log. Finally, the log records are read and used to actually execute the previously deferred writes.
During recovery after a crash, a transaction needs to be redone if and only if both <Ti start> and <Ti commit> are there in the log.
Redoing a transaction Ti ( redoTi) sets the value of all data items updated by the transaction to the
new values.
Crashes can occur while the transaction is executing the original updates, or while the recovery action is being taken. Example: transactions T0 and T1 (T0 executes before T1):
T0: read(A)
    A := A - 50
    write(A)
    read(B)
    B := B + 50
    write(B)
T1: read(C)
    C := C - 100
    write(C)
Example
Under the immediate database modification scheme, updates of an uncommitted transaction may already be in the database, so both undo and redo may be needed during recovery. Depending on where in the log the crash occurs (initially A = 1000, B = 2000, C = 700), the recovery actions are:
(a) undo (T0): B is restored to 2000 and A to 1000.
(b) undo (T1) and redo (T0): C is restored to 700, and then A and B are set to 950 and 2050 respectively.
(c) redo (T0) and redo (T1): A and B are set to 950 and 2050 respectively. Then C is set to 600.
Checkpoints
Redoing or undoing all transactions recorded in the log can be very slow, so the system periodically performs checkpoints:
1. Output all log records currently residing in main memory onto stable storage.
2. Output all modified buffer blocks to the disk.
3. Write a log record <checkpoint> onto stable storage.
During recovery we need to consider only the most recent transaction Ti that started before the checkpoint, and transactions that started after Ti.
1. Scan backwards from the end of the log to find the most recent <checkpoint> record.
2. Continue scanning backwards till a record <Ti start> is found.
3. Need only consider the part of the log following the above start record. The earlier part of the log can be ignored during recovery, and can be erased whenever desired.
4. For all transactions (starting from Ti or later) with no <Ti commit>, execute undo(Ti). (Done only
in case of immediate modification.)
5. Scanning forward in the log, for all transactions starting from Ti or later with a <Ti commit>,
execute redo(Ti).
Recovery With Concurrent Transactions
We modify the log-based recovery schemes to allow multiple transactions to execute concurrently.
All transactions share a single disk buffer and a single log
A buffer block can have data items updated by one or more transactions
We assume concurrency control using strict two-phase locking;
◦ i.e. the updates of uncommitted transactions should not be visible to other
transactions
We assume no updates are in progress while the checkpoint is carried out (will relax this later).
Checkpoints are performed as before, except that the checkpoint log record is now of the form <checkpoint L>, where L is the list of transactions active at the time of the checkpoint.
When the system recovers from a crash, it first does the following:
1. Initialize undo-list and redo-list to empty
2. Scan the log backwards from the end, stopping when the first <checkpoint L> record is found. For each record found during the backward scan:
◦ if the record is <Ti commit>, add Ti to redo-list;
◦ if the record is <Ti start>, then if Ti is not in redo-list, add Ti to undo-list.
3. For every Ti in L, if Ti is not in redo-list, add Ti to undo-list.
At this point undo-list consists of incomplete transactions which must be undone, and redo-list
consists of finished transactions that must be redone.
Recovery now continues as follows:
1. Scan the log backwards from the most recent record, stopping when <Ti start> records have been encountered for every Ti in undo-list.
During the scan, perform undo for each log record that belongs to a transaction in undo-list.
2. Locate the most recent <checkpoint L> record.
3. Scan log forwards from the <checkpoint L> record till the end of the log.
During the scan, perform redo for each log record that belongs
to a transaction on redo-list
Example of Recovery
<T0 start>
<T0, A, 0, 10>
<T0 commit>
<T1 start>
<T1, B, 0, 10>
<T2 start>
<T2, C, 0, 10>
<T3 start>
<T3, D, 0, 10>
<T3 commit>
Recovery Algorithms
Recovery algorithms have two parts:
1. Actions taken during normal transaction processing to ensure that enough information exists to recover from failures
2. Actions taken after a failure to recover the database contents to a state that ensures atomicity, consistency and durability
The recovery algorithm described in this section requires that a data item that has been updated by
an uncommitted transaction cannot be modified by any other transaction, until the first transaction
has either committed or aborted.
Transaction Rollback
First consider transaction rollback during normal operation (that is, not during recovery from a system
crash).Rollback of a transaction Ti is performed as follows:
1. The log is scanned backward, and for each log record of Ti of the form <Ti, Xj, V1, V2> that is found:
a. The value V1 is written to data item Xj.
b. A special redo-only log record <Ti, Xj, V1> is written to the log, where V1 is the value being restored to data item Xj during the rollback. These log records are sometimes called compensation log records.
Such records do not need undo information, since we never need to undo such an undo operation.
2. Once the log record <Ti start> is found, the backward scan is stopped, and a log record <Ti abort> is written to the log. Observe that every update action performed by the transaction or on behalf of the transaction, including actions taken to restore data items to their old value, has now been recorded in the log.
Recovery actions, when the database system is restarted after a crash, take place in two phases:
1. In the redo phase, the system replays updates of all transactions by scanning the log forward from the
last checkpoint. The log records that are replayed include log records for transactions that were rolled back
before system crash, and those that had not committed when the system crash occurred.
This phase also determines all transactions that were incomplete at the time of the crash, and must therefore be rolled back. Such incomplete transactions would either have been active at the time of the checkpoint, and thus would appear in the transaction list in the checkpoint record, or would have started later; further, such transactions have neither a <Ti abort> nor a <Ti commit> record in the log.
The specific steps taken while scanning the log are as follows:
a. The list of transactions to be rolled back, undo-list, is initially set to the list L in the <checkpoint L> log
record.
b. Whenever a normal log record of the form <Ti , Xj , V1, V2>, or a redo-only log record of the form <Ti , Xj ,
V2> is encountered, the operation is redone; that is, the value V2 is written to data item Xj .
c. Whenever a log record of the form <Ti start> is found, Ti is added to undo-list.
d. Whenever a log record of the form <Ti abort> or <Ti commit> is found, Ti is removed from undo-list.
At the end of the redo phase, undo-list contains the list of all transactions that are incomplete, that is, they
neither committed nor completed rollback before the crash.
2. In the undo phase, the system rolls back all transactions in the undo-list. It performs rollback by scanning
the log backward from the end.
a. Whenever it finds a log record belonging to a transaction in the undolist, it performs undo actions just as
if the log record had been found during the rollback of a failed transaction.
b. When the system finds a <Ti start> log record for a transaction Ti in undo-list, it writes a <Ti abort> log
record to the log, and removes Ti from undo-list.
c. The undo phase terminates once undo-list becomes empty, that is, the system has found <Ti start> log
records for all transactions that were initially in undo-list.
After the undo phase of recovery terminates, normal transaction processing can resume.
Observe that the redo phase replays every log record since the most recent checkpoint record.
The actions include actions of incomplete transactions and the actions carried out to rollback failed
transactions.
The actions are repeated in the same order in which they were originally carried out; hence, this process is
called repeating history.
Figure: shows an example of actions logged during normal operation, and actions performed during failure recovery. In the log shown in the figure, transaction T1 had committed, and transaction T0 had been completely rolled back, before the system crashed. Observe how the value of data item B is restored during the rollback of T0.
When recovering from a crash, in the redo phase, the system performs a redo of all operations after the last checkpoint record. In this phase, the list undo-list initially contains T0 and T1; T1 is removed first when its commit log record is found, while T2 is added when its start log record is found. Transaction T0 is removed from undo-list when its abort log record is found, leaving only T2 in undo-list. The undo phase scans the log backwards from the end, and when it finds a log record of T2 updating A, the old value of A is restored, and a redo-only log record is written to the log. When the start record for T2 is
found, an abort record is added for T2. Since undo-list contains no more transactions, the undo phase
terminates, completing recovery.
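A compact Python sketch of the redo phase followed by the undo phase, on a log represented as a list of tuples. The record formats are illustrative assumptions, and for brevity the redo phase replays the whole log rather than starting at the last checkpoint.

def recover(log, db):
    # Redo phase: repeat history by replaying updates in forward log order,
    # while building undo-list from start/commit/abort/checkpoint records.
    undo_list = set()
    for rec in log:
        if rec[0] == "checkpoint":
            undo_list = set(rec[1])           # ("checkpoint", [active transactions])
        elif rec[0] == "start":
            undo_list.add(rec[1])
        elif rec[0] in ("commit", "abort"):
            undo_list.discard(rec[1])
        elif rec[0] == "update":              # ("update", Ti, X, V1, V2)
            db[rec[2]] = rec[4]
        elif rec[0] == "redo_only":           # ("redo_only", Ti, X, V1)
            db[rec[2]] = rec[3]
    # Undo phase: roll back incomplete transactions, scanning the log backward.
    for rec in reversed(list(log)):
        if not undo_list:
            break
        if rec[0] == "update" and rec[1] in undo_list:
            db[rec[2]] = rec[3]               # restore old value V1
            log.append(("redo_only", rec[1], rec[2], rec[3]))
        elif rec[0] == "start" and rec[1] in undo_list:
            log.append(("abort", rec[1]))
            undo_list.discard(rec[1])
    return db

log = [("start", "T2"), ("update", "T2", "A", 500, 400)]
print(recover(log, {"A": 500}))               # {'A': 500}: T2's update is redone, then undone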
BUFFER MANAGEMENT
Log record buffering: log records are buffered in main memory, instead of being output directly to stable storage.
Log records are output to stable storage when a block of log records in the buffer is full, or a log
force operation is executed.
Log force is performed to commit a transaction by forcing all its log records (including the commit
record) to stable storage.
Several log records can thus be output using a single output operation, reducing the I/O cost.
Database Buffering
Database buffer can be implemented either in an area of real main-memory reserved for the
database, or in virtual memory
Implementing buffer in reserved main-memory has drawbacks:
Memory is partitioned before-hand between database buffer and applications, limiting flexibility.
Needs may change, and although operating system knows best how memory should be divided up
at any time, it cannot change the partitioning of memory.
Database buffers are generally implemented in virtual memory in spite of some drawbacks:
When operating system needs to evict a page that has been modified, to make space for another
page, the page is written to swap space on disk.
When database decides to write buffer page to disk, buffer page may be in swap space, and may
have to be read from swap space on disk and output to the database on disk, resulting in extra I/O!
" Known as dual paging problem.
Ideally when swapping out a database buffer page, operating system should pass control to
database, which in turn outputs pageto database instead of to swap space (making sure to output
log records first) " Dual paging can thus be avoided, but common operating systems do not support
such functionality.
Fuzzy Checkpointing
To avoid long interruption of normal processing while a checkpoint is in progress, allow updates to continue (fuzzy checkpointing):
1. Temporarily stop all updates by transactions.
2. Write a <checkpoint L> log record and force the log to stable storage.
3. Note the list M of modified buffer blocks.
4. Now permit transactions to proceed with their actions.
5. Output to disk all modified buffer blocks in list M. Blocks should not be updated while being output. Follow WAL: all log records pertaining to a block must be output before the block is output.
6. Store a pointer to the checkpoint record in a fixed position last_checkpoint on disk.
When recovering using a fuzzy checkpoint, start the scan from the checkpoint record pointed to by last_checkpoint.
Log records before last_checkpoint have their updates reflected in database on disk, and need not
be redone.
Incomplete checkpoints, where system had crashed while performing checkpoint, are handled
safely
Early Lock Release and Logical Undo Operations
The recovery scheme must support high-concurrency locking techniques, such as those used for B+-tree concurrency control.
Operations like B+-tree insertions and deletions release locks early.
They cannot be undone by restoring old values (physical undo), since once a lock is released, other
transactions may have updated the B+-tree.
Instead, insertions (resp. deletions) are undone by executing a deletion (resp. insertion) operation
(known as logical undo).
For such operations, undo log records should contain the undo operation to be executed; this is called logical undo logging, in contrast to physical undo logging.
Redo information is logged physically (that is, new value for each write) even for such operations
Logical redo is very complicated since database state on disk may not be “operation consistent”
Failure with Loss of Nonvolatile Storage
To deal with loss of nonvolatile storage, periodically dump the entire content of the database to stable storage. To recover from a disk failure: restore the database from the most recent dump, then consult the log and redo all transactions that committed after the dump.
This can be extended to allow transactions to be active during the dump; this is known as a fuzzy dump or online dump.
A dump of the database contents is also referred to as an archival dump, since we can archive the
dumps and use them later to examine old states of the database. Dumps of a database and
checkpointing of buffers are similar.
Most database systems also support an SQL dump,which writes out SQL DDL statements and SQL
insert statements to a file, which can then be reexecuted to re-create the database.
Fuzzy dump schemes have been developed that allow transactions to be active while the dump is in
progress. They are similar to fuzzy-checkpointing schemes; see the bibliographical notes for more details.
As a result of early lock release, it is possible that a value in a B+-tree node is updated by one transaction T1, which inserts an entry (V1, R1), and subsequently by another transaction T2, which inserts an entry (V2, R2) in the same node, moving the entry (V1, R1) even before T1 completes execution. At this point, we
cannot undo transaction T1 by replacing the contents of the node with the old value prior to T1 performing
its insert, since that would also undo the insert performed by T2; transaction T2 may still commit (or may
have already committed). In this example, the only way to undo the effect of insertion of (V1, R1) is to
execute a corresponding delete operation.
Logical Operations
The insertion and deletion operations are examples of a class of operations that require logical undo
operations since they release locks early; we call such operations logical operations
Operations acquire lower-level lockswhile they execute, but release them when they complete; the
corresponding transaction must however retain a higher-level lock in a two-phase manner to prevent
concurrent transactions from executing conflicting actions.
Once the lower-level lock is released, the operation cannot be undone by using the old values of updated
data items, and must instead be undone by executing a compensating operation; such an operation is
called a logical undo operation.
1. When the operation starts, a log record <Ti, Oj, operation-begin> is written, where Oj is a unique identifier of the operation instance.
2. While the operation is executing, normal log records with physical redo and physical undo information are logged.
3. When the operation completes, <Ti, Oj, operation-end, U> is logged, where U contains the information needed to perform a logical undo of the operation.
To perform logical redo or undo, the database state on disk must be operation consistent, that is, it should
not have partial effects of any operation.
An operation is said to be idempotent if executing it several times in a row gives the same result as
executing it once.
Rollback of transaction Ti with logical undo is done by scanning the log backwards:
1. If a log record <Ti, X, V1, V2> is found, perform the undo and log a special redo-only log record <Ti, X, V1>.
2. If a <Ti, Oj, operation-end, U> record is found:
– Roll back the operation logically, using the undo information U.
– Updates performed during the rollback are logged just like during normal operation execution.
– At the end of the operation rollback, instead of logging an operation-end record, generate a record <Ti, Oj, operation-abort>.
– Skip all preceding log records for Ti until the record <Ti, Oj, operation-begin> is found.
It is important that the lower-level locks acquired during an operation are sufficient to perform a
subsequent logical undo of the operation; otherwise concurrent operations that execute during normal
processing may cause problems in the undo-phase.
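A small Python sketch of rollback with logical undo: when an operation-end record is met, the logical undo action it carries is executed and the operation's physical records are skipped back to its operation-begin record. The log record format and the btree dictionary are illustrative assumptions.

btree = {"K1": "R1", "K2": "R2"}          # stands in for a B+-tree after T1's insert of K2

log = [
    ("operation-begin", "T1", "O1"),
    ("physical", "T1", "page7", "old-bytes", "new-bytes"),
    ("operation-end", "T1", "O1", ("delete", "K2")),   # U: the logical undo information
]

def rollback(txn, log):
    i = len(log) - 1
    while i >= 0:
        rec = log[i]
        if rec[1] == txn and rec[0] == "operation-end":
            action, key = rec[3]
            if action == "delete":
                btree.pop(key, None)       # logical undo of the completed insert
            # skip back to the matching operation-begin record
            while log[i][:3] != ("operation-begin", txn, rec[2]):
                i -= 1
        elif rec[1] == txn and rec[0] == "physical":
            pass                           # a physical undo would restore old-bytes here
        i -= 1

rollback("T1", log)
print(btree)                               # {'K1': 'R1'}: the inserted entry is gone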