Tech DBMS
Tech DBMS
Buffer Management
Reliability Control
Physical Access Structures
Query Optimization
Physical Database Design
DB Technology -- 1
DB Technology -- 2
1
Buffer Management
A buffer is a section of main memory pre-allocated
to a DBMS for disk I/O.
The buffer is organized in pages, of a size equal or
multiple of the disk input/output blocks used by the
operating system;
The size of a page can range from a few Kbytes to
about a hundred Kbytes.
Access times for main memory are generally six
orders of magnitude faster than access times for
secondary memory
Sometimes a buffer can store the entire database
(main memory database)
DB Technology -- 3
DB Technology -- 4
2
Buffer Manager Organization
DB Technology -- 5
3
fix
Searches the buffer for requested page, if found
and unused, its address is returned; else the page
is read from disk and its address is returned.
If the requested page is in the disk, a buffer page is
selected for loading; if the buffer page is free, just
load; else buffer page is sent to disk (victim) by
invoking flush, then reads reads the requested
page from disk and returns the page address.
DB Technology -- 7
DB Technology -- 8
4
Buffer Managers and File Systems
The file system is responsible for managing the data
structures stored on disk, also the current state of disk
use. It must identify which disk blocks are free and which
are allocated to files
A DBMS uses a file system for the following functions:
9 The creation (create) and removal (delete) of a file;
9 The opening (open) and closing (close) of a file;
9 read(fileid,block,buffer) for the direct access
to a block of a file which is copied to the buffer page;
9 read_seq(fileid,f-block,count,f-buffer)
for sequential access to a fixed number (count) of
blocks of a file;
9 The dual primitives write and write_seq.
DB Technology -- 9
DB Technology -- 10
5
Reliability
Control
System
Architecture
DB Technology -- 11
Stable Memory
Stable memory is a memory system that is failure-
resistant.
There is no such thing in practice; however,
replication and robust writing protocols can bring the
probability close to zero.
A failure of stable memory is assumed as
catastrophic and impossible, at least in this context.
Stable memories usually come in different forms:
9 a tape unit -- very reliable, but slow;
9 a pair of devices of different kind (e.g., a tape and
a disk) -- better alternative;
9 two mirrored disk units.
DB Technology -- 12
6
Log Organization
The actions carried out by the various transactions are
recorded in the log in chronological order (written
sequentially to the top block).
There are two types of log records:
9 Transaction records -- keep track of the activities
of the activities of a transaction by recording all
operations B(T), I(T,O,AS), D(T,O,BS),
U(T,O,BS,AS), C(T), A(T);
9 System records -- keep track of system-wide
operations dump (rare) and checkpoint (more
frequent).
DB Technology -- 13
Log Content
DB Technology -- 14
7
Undo and Redo
Undo of an action on an object O:
9 update, delete: copy BS into the object O;
9 insert: delete the object O;
Redo of an action on an object O:
9 insert, update: copy AS into the object O;
9 delete: re-insert the object O.
Idempotence of undo and redo: an arbitrary
number of undos and redos of the same action is
equivalent to the carrying out of such actions only
once:
9undo(undo(A)) = undo(A)
9redo(redo(A)) = redo(A)
DB Technology -- 15
Checkpoint
A checkpoint is carried out periodically, recording
active transactions and updating secondary memory
relative to all completed transactions
9 After having initiated a checkpoint, no commit
operations are accepted by the active transactions
9 The checkpoint ends by synchronously writing
(forcing) a checkpoint record CK(T1,T2,..Tn),
which contains the identifiers of all active
transactions;
9 In this way, the effects of all committed transactions
are permanently recorded in the database.
DB Technology -- 16
8
Dump
A dump is a complete copy of the database, which is
normally created when the system is not operative.
The copy is stored in the stable memory, typically on
tape, and is called backup.
A dump record DUMP in the log signals the presence
of a backup made at a given time and identifies the
file or device where the dump took place.
DB Technology -- 17
Transactional Rules
The reliability control system must follow two rules:
WAL rule (write-ahead log): before-state parts of the
log records must be written in the log before carrying
out the corresponding operation on the database.
Commit-Precedence rule: after-state parts of the
log records must be written in the log before carrying
out the commit.
DB Technology -- 18
9
Transaction Outcome
The atomic outcome of a transaction is established
at the time when it writes the commit record in the
log synchronously, using the force primitive
9 Before this event, a failure is followed by the undo
of the actions, thereby reconstructing the original
state of the database;
9 After this event, a failure is followed by the redo of
the actions carried out to reconstruct the final
state of the transaction;
abort records can be simply written asynchronously
into the top block of the log.
DB Technology -- 19
DB Technology -- 20
10
Protocols for Writing Log (II)
DB Technology -- 21
11
Failures in Database Systems
A system failure is caused by a software bug, (e.g.,
of the operating system), or an interruption of the
functioning of hardware (e.g., due to loss of power.)
/ Causes loss of the contents of main memory (and
thus of all buffers); but no loss of the contents of
secondary memory ( ).-
A device failure is a failure of a secondary memory
device (e.g., a disk head crash).
/ Causes loss of disk contents, but no loss of stable
storage (i.e., the log)
There are two types of restart protocols:
9 Warm restart used after system crashes;
9 Cold restart used after device crashes.
DB Technology -- 23
DB Technology -- 24
12
Restart Process
When the system restarts after a failure, it has to decide
what to do with the transactions that were executing at the
time of the failure.
Transactions are classified as:
9
Completed -- when their actions were recorded in
stable storage before the failure;
9
Committed but possibly not completed -- whose
actions must be redone, e.g., because of data lost in
the buffer;
9
Not committed -- whose actions have to be undone
because the transaction did not commit.
DB Technology -- 25
Warm Restart
Consists of four sequential steps:
Trace back the log until the most recent checkpoint
record.
Construct the UNDO set (i.e.,transactions to be
undone) and the REDO set (transactions to be
redone).
Trace back the log until the first action of the ‘oldest’
transaction in the two sets, UNDO and REDO, is found,
and undo all the actions of the transactions in the
UNDO set.
Trace forward the log and redo all the actions of the
transactions in the REDO set.
DB Technology -- 26
13
Warm Restart Protocol
The protocol guarantees:
Atomicity in the sense that all the transactions in
progress at the time of failure leave the database
either in the initial state or in the final one
Durability in the sense that all pages of transactions
in progress are written to secondary memory.
DB Technology -- 27
Cold Restart
Now parts of the database may have been damaged.
Consists of three steps:
During the first phase, the du mp operation is
executed and damaged parts are selectively copied
from the database. The most recent du mp record in
the log is then accessed.
The log is traced forward. The actions on the
database and the commit or abort actions are
applied as appropriate to the damaged parts of the
database. The situation preceding the failure is thus
restored.
Finally, a warm restart is conducted.
DB Technology -- 28
14
Physical Access Structures
Used for efficient storage and manipulation of data.
For each access structure there are access
methods, i.e. software modules providing data
access and manipulation primitives for each physical
access structure.
Each DBMS has a limited number of access methods
available.
Sometimes, access methods can be used directly
from an application without going through a DBMS.
We will consider sequential, hash-based, and index-
based access structures and methods.
DB Technology -- 29
Architecture
of the
Access
Manager
DB Technology -- 30
15
Tuples within Pages
Each access method has its own page organization.
Each page includes:
9 An initial part and a final part with control
information for the file system and the access
method;
9 A page dictionary, containing pointers to
elementary data contained in the page;
9 A useful part, which contains data; page dictionary
and useful data grow as opposing stacks within the
page;
9 A checksum, to verify that page data are valid.
Tree structures have a different page organization.
DB Technology -- 31
DB Technology -- 32
16
Page Manager Primitives
Insert or update a tuple -- may require a reorganization
of the page if there is sufficient space to manage the
extra bytes introduced).
Delete a tuple -- often carried out by marking the tuple
as ‘invalid’.
Access a tuple field -- first identify tuple by means of
its key or its offset, then locate field by its offset and
length.
Note that some page managers do not allow the
separation of a tuple in different pages.
Also, when all the tuples have the same size, the page
dictionary is simplified.
DB Technology -- 33
DB Technology -- 34
17
Entry-Sequenced Sequential
Structure
This is optimal for carrying out sequential reading
and writing operations.
Uses all the blocks available in a file and all spaces
within each block.
Primitives:
9 Access with sequential scan operation;
9 Data loading and insertion happen at the end of
the file and in sequence;
9 Deletes are normally implemented by leaving
space unused;
9 Updates can cause problems because they
increase the file size.
DB Technology -- 35
18
Ordered Sequential Structure
Each tuple has a position based on its key value.
Historically, ordered sequential structures were used
on sequential devices (tapes) by batch processes;
data were located into the main file, modifications
were collected in differential files, and files were
periodically merged (this practice no longer applies.)
Key problem: insertions or updates that increase
physical space require reordering the whole file.
Options to avoid global reorderings:
9 Leave some free slots at the time of first loading.
This is followed by ‘local reordering’ operations;
9 Use an overflow file, where new tuples are
inserted into blocks linked to form an overflow
chain.
DB Technology -- 37
Hash-Based Structures
Ensure efficient associative access to data, based on
the value of a key field, composed of an arbitrary
number of attributes of a given table.
A hash-based structure has B blocks (often adjacent.)
The access method makes use of a hash algorithm,
which, once applied to the key, returns a value
between zero and B-1.
This value is interpreted as the position of the block in
the file, and used both for reading and writing tuples
to the file
This is the most efficient technique for queries with
equality predicates, but inefficient for queries with
interval predicates. (...why?)
DB Technology -- 38
19
Details on Hash-Based Structures
Primitive interface: hash(fileId,key):BlockId
The implementation of hash consists of two parts
9 folding, transforms the key values into positive
integers, uniformly distributed over a large range;
9 hashing, transforms the positive integer into a
number between 0 and B - 1.
This technique works best if the file is made larger
than necessary. Let:
9 T the number of tuples expected for the file;
9 F the average number of tuples in each page;
then a good choice for B is T/(0.8 x F), using only
80% of the available space in the file.
DB Technology -- 39
Collisions
Occur when the same block number is returned by
the algorithm to several different keys.
Collisions are addressed by adding an overflow chain
to each page. There are extra costs in scanning the
chain.
The following table computes the length of the
overflow chain:
1 2 3 5 10 (F)
.5 0.5 0.177 0.087 0.031 0.005
.6 0.75 0.293 0.158 0.066 0.015
.7 1.167 0.494 0.286 0.136 0.042
.8 2.0 0.903 0.554 0.289 0.110
.9 4.495 2.146 1.377 0.777 0.345
T/(F xB)
DB Technology -- 40
20
Tree Structures (B Trees)
These are the most frequently used physical storage
structures for relational DBMSs.
Tree structures give associative access (based on a
value of a key, consisting of one or more attributes)
without constraints on the physical location of the
tuples.
Note that the primary key of a relation and the key of
a tree structure storing this relation need not be the
same.
DB Technology -- 41
DB Technology -- 42
21
Tree Structure Organization
DB Technology -- 43
Node Contents
Each intermediate node contains F keys (in
lexicographic order) and F + 1 pointers.
Each key Kj , 1 ≤ j ≤ F, is followed by a pointer Pj; K1 is
preceded by a pointer P0
Each pointer addresses a sub-tree:
9 P0 addresses the sub-tree with keys less than K1;
9 PF addresses the sub-tree with keys greater than or
equal to KF ;
9 Pj, 0 < j < F, addresses the sub-tree with keys
included in the interval Kj ≤ K < Kj+1 .
The value F + 1 is called the fan-out of the tree.
DB Technology -- 44
22
Search Technique
Assume that we are looking for the tuple with key V.
At each intermediate node, if V < K follow the pointer P ,
1 0
else if V KF follow the pointer PF, otherwise, follow the
pointer Pj such that Kj £ V < Kj+1
The leaf nodes of the tree can be organized in two ways:
9 In key-sequenced trees the tuples are contained in the
leaves of the tree;
9 In indirect trees leaf nodes contain pointers to the
tuples, that can be allocated by means of any other
‘primary’ mechanism (e.g.,, entry-sequenced or hash.)
Sometimes the index structure is sparse; you may then
locate a key value close to the value being sought, then
carry out a sequential search.
DB Technology -- 45
DB Technology -- 46
23
Tuple Deletion Operations
A deletion operation can always be carried out.
When the key of a to-be-deleted tuple is in an
intermediate node, it is best to recover the next
key and substitute for it.
When a deletion operation leaves two adjacent
pages underused, they are conbined into a
single page through a merge operation.
A merge causes a decrement in the number of
pointers on the next (higher) level in the tree and
may cause further merges higher up in the tree.
DB Technology -- 47
DB Technology -- 48
24
Split
and
Merge
Operations
DB Technology -- 49
B and B+ Trees
B+ trees are B trees, but their leaf nodes are linked into
a chain, which connects them in the order imposed by
the key.
B+-trees support interval queries efficiently, and are
widely used to implement relational DBMSs.
On the other hand, B trees don’t support sequential
access method of leaf nodes.
For B-trees, Intermediate nodes use two pointers for
each key value Ki:
9 One points directly to the block that contains the
tuple corresponding to Ki;
9 The other points to a subtree with keys greater
than Ki and less than Ki+1.
DB Technology -- 50
25
Example of B+ Tree
DB Technology -- 51
Example
of B Tree
DB Technology -- 52
26
Query Optimization
A query optimizer is an important component of a
DBMS architecture. An optimizer takes as input an
SQL query and produces an access plan in ‘object’
or ‘internal’ format.
Optimization steps:
9 Lexical, syntactic and semantic analysis, using the
data dictionary;
9 Translation into an internal, algebraic form;
9 Algebraic optimization through transformations;
9 Cost-based optimization;
9 Code generation using the physical data access
methods provided by the DBMS.
DB Technology -- 53
DB Technology -- 54
27
Relation Profiles
A profile contains quantitative information about a table
and is stored in the data dictionary:
9 the cardinality (number of tuples) of each table T;
9 the size in bytes of each tuple of T;
9 the size in bytes of each attribute A in T;
9 the number of distinct values of each attribute A in T;
j
DB Technology -- 55
DB Technology -- 56
28
Scan Operation
Performs a sequential access of all the tuples in a
table, and performs on each tuple various operations
of an algebraic or other nature:
9 Projection of a set of attributes;
9 Selection on a predicate (e.g., Ai = v);
9 Sort (ordering);
9 Insertions, deletions, and updates of the tuples that
are being accessed during the scan.
Primitives:
open, next, read, modify,insert, delete, close
DB Technology -- 57
Sort operation
Various methods for ordering the data contained in
the main memory, typically represented by means of
a record array.
DBMSs typically cannot load all data in the buffer;
thus, they separately order and then merge data
sets, using available buffer space.
DB Technology -- 58
29
Indexed Access
Indexes are created by the database administrator to
favor queries when they include:
9 simple predicates (of the type Ai = v)
9 interval predicates (of the type v1 ≤ Ai ≤ v2)
These predicates are supported by the index.
For a conjunction of predicates the DBMS chooses
the most selective supported predicate for primary
access, and evaluates the other predicates in main
memory.
For a disjunction of predicates, a scan is needed if
any of them are not supported; if they are all
supported, indexes can be used only with duplicate
elimination.
DB Technology -- 59
Join Methods
Joins are the most costly operation for a DBMS.
There are various methods for join evaluation,
among them we note nested-loop, merge-scan and
hashed join operations.
DB Technology -- 60
30
Nested-Loop Join
Join attribute
DB Technology -- 61
Merge-Scan Join
Both table must be ordered according to the join
attribute
DB Technology -- 62
31
Hashed Join
A hashing function h on the join attributes is used to
store both tables
DB Technology -- 63
Cost-Based Optimization
This is an optimization problem, whose decisions are:
9 The data access operations to execute, e.g., scan
vs index access;
9 The order of operations, e.g., the order of join
operations;
9 Options to be assigned to each operation, e.g.,,
choosing a join method;
9 Use of parallelism and pipelining to improve
performance.
Further options are available in selecting a plan for
distributed database query processing.
DB Technology -- 64
32
Approaches to Query Optimization
The approach often used includes the following:
9 Make use of approximate cost formulas;
9 Construct a decision tree, where each node
corresponds to a choice; each leaf node
corresponds to a specific execution plan ;
9 Assign to each plan a cost:
Ctotal = CI/Ox nI/O + Ccpu x ncpu
9 Choose the one with the lowest cost, based on
operations research (branch and bound.)
The optimizers should obtain good solutions, where
cost is near that of an optimal solution.
DB Technology -- 65
Example
of
Decision
Tree
DB Technology -- 66
33
Physical Database Design
The final phase of the database design process.
This phase takes as input the logical schema of a
database along with predictions for the application
load.
The phase produces as output a physical schema,
made up of the definitions of the relations and of the
physical access structures used, along with related
parameters.
Depends on the features that are supported by the
underlying DBMS.
DB Technology -- 67
Physical Design
for Relational DBMSs
Most relational DBMSs support index and tuple
clustering.
Physical design can be reduced to the activity of
identifying indices for each relation.
The key of a relation is usually involved in selection or
join operations; for this reason, each relation normally
supports a unique index on the primary key.
Other indices are added to support frequent queries.
If the performance is unsatisfactory, we can tune the
system by adding or dropping indices.
It is useful to check how indices are used by queries, by
using the show plan command.
DB Technology -- 68
34
Indices in SQL
Commands for creating or dropping an index are not
part of standard SQL, but their syntax is rather
similar in all DBMSs.
Syntax of the commands for the creation and
dropping of an index:
9 create [unique] index IndexName on
TableName(AttributeList)
9 drop index IndexName
DB Technology -- 69
35