Data Files and Database Options: Unit One
Data within a database is stored on disk in one or more data files. Prior to
being written to the data file(s), every transaction is written to a
transaction log file. In this lesson, you learn how to design the data files
underneath a database, group the files into filegroups to link physical
storage into a database, and manage the transaction log. You also
learn how to configure the tempdb database for optimal performance.
Although storing all your data in memory would provide extremely fast
access, you would lose everything after the machine was shut down. To
protect your data, it has to be persisted to disk. Underneath each
database is one or more files for persisting your data.
All objects that contain data (tables, indexes, and indexed views) have an
ON clause that you can use at creation time to specify the filegroup
where SQL Server stores the object. As data is written to the objects,
SQL Server uses the filegroup definition to determine on which file(s) it
should store the data.
At the time that a file is added to a database, you specify the initial size of
the file. You can also specify a maximum size for the file, as well as whether
SQL Server automatically increases the size of the file when it is full of
data. If you specify automatic growth, you can specify whether the file size
increases based on a percentage of the current size or by a fixed amount.
Unless a filegroup has only a single file, you do not know in which file a specific row of data is
stored. When writing to files, SQL Server uses a proportional fill algorithm. The proportional
fill algorithm is designed to ensure that all files within a filegroup reach the maximum
defined capacity at the same time. For example, if you had a data file that was 10 gigabytes (GB) and a
data file that was 1 GB, SQL Server writes ten rows to the 10 GB file for every one row that is written
to the 1 GB file.
The proportional fill algorithm is designed to allow a resize operation to occur at a filegroup
level. In other words, all files within a filegroup expand at the same time.
All data manipulation within SQL Server occurs in memory within a set of buffers. If
you are adding new data to a database, the new data is first written to a memory buffer, then written
to the transaction log, and finally persisted to a data file via a background process called
checkpointing. When you modify or delete an existing row, if the row does not already exist in memory,
SQL Server first reads the data off disk before making the modification. Similarly, if you are reading
data that has not yet been loaded into a memory buffer, SQL Server must read it out of the data files
on disk.
If you could always ensure that the machine hosting your databases had enough
memory to hold all the data within your databases, SQL Server could simply read all the data
off disk into memory buffers upon startup to improve performance. However, databases
are almost always much larger than the memory capacity on any machine, so SQL Server retrieves
data from disk only on an as-needed basis. If SQL Server does not have enough room in memory for the
data being read in, the least recently used buffers are evicted to make room for the newly requested
data.
Because accessing a disk drive is much slower than accessing memory, the data file design
underneath a database can have an impact on performance.
The first layer of design is within the disk subsystem. As the number of disk drives within
a volume increases, the read and write throughput for SQL Server increases. However, there is
an upper limit on the disk input/output (I/O), which is based upon:
the capacity of the redundant array of independent disks (RAID) controller,
host bus adapter (HBA),
and disk bus.
So you cannot fix a disk I/O bottleneck by continually adding more disk drives. The
process of designing the disk subsystem is reduced to ensuring that you have enough disks
along with appropriately sized controllers and disk caches to deliver the read/write throughput
required by your database.
If it were simply a matter of the number of disks, there would be far fewer disk
I/O bottlenecks in systems. But there is a second layer of data file design:
determining how many data files you need and
the location of each data file.
SQL Server creates a thread for each file underneath a database. As you increase the number
of files underneath a database, SQL Server creates more threads that can be used to read and write
data.
Designing the data layer of a database begins with the database creation. When you create a database,
it should have three files and two filegroups. You should have a file with an .mdf extension within a
filegroup named PRIMARY, a file with an .ndf extension in a filegroup with any name that you choose,
and the transaction log with an .ldf extension.
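A minimal sketch of such a database definition; the database name, file names, paths, and sizes are placeholders, not values from this lesson:

CREATE DATABASE Sales
ON PRIMARY
    (NAME = Sales_system, FILENAME = 'D:\SQLData\Sales_system.mdf', SIZE = 100MB, FILEGROWTH = 50MB),
FILEGROUP UserData
    (NAME = Sales_data, FILENAME = 'E:\SQLData\Sales_data.ndf', SIZE = 1GB, FILEGROWTH = 500MB)
LOG ON
    (NAME = Sales_log, FILENAME = 'F:\SQLLogs\Sales_log.ldf', SIZE = 500MB, FILEGROWTH = 100MB);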
Besides being the logical definition for one or more files that defines the storage boundary for an
object, filegroups have a property called DEFAULT. The purpose of the DEFAULT property is to
define the filegroup where SQL Server places objects if you do not specify the ON
clause during object creation.
When the database is created, the primary filegroup is marked as the default filegroup. After
you create the database, you should mark the second filegroup as the default filegroup. By changing
the default filegroup, you ensure that any objects you create are not accidentally placed on the
primary filegroup and that only the system objects for the database reside on the primary
filegroup. You change the default filegroup by using the following command:
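For example, a minimal sketch assuming the placeholder database and filegroup names used earlier:

ALTER DATABASE Sales
MODIFY FILEGROUP UserData DEFAULT;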
The main reason not to place any of your objects on the primary filegroup is to provide as
much isolation in the I/O as possible. Following the initial creation of the database, you add
filegroups as needed to separate the storage of objects within the database. You also add files
to filegroups to increase the disk I/O available to the objects stored on the filegroup,
thereby reducing disk bottlenecks.
Transaction Logs
When SQL Server acknowledges that a transaction has been committed, SQL Server must ensure that
the change is hardened to persistent storage. Although all writes occur through memory
buffers, persistence is guaranteed by requiring that all changes are written to the transaction log
prior to a commit being issued. In addition, the writes to the transaction log must occur directly to
disk.
Because every change made to a database must be written directly to disk, the disk
storage architecture underneath your transaction log is the most important decision affecting
the maximum transaction throughput that you can achieve.
SQL Server writes sequentially to the transaction log but does not read from the log except during a
restart recovery. Because SQL Server randomly reads and writes to the data files underneath a
database, by isolating the transaction log to a dedicated set of disks you ensure that the disk heads do
not have to move all over the disk and move in a mostly linear manner.
The maximum transaction throughput for any database is bound by the amount of data per second
that SQL Server can write to the transaction log.
FILESTREAM data
Although the volume of data within organizations has been exploding, leading the way
in this data explosion is unstructured data. To tackle the problem of storing, managing,
and combining large volumes of unstructured data with the structured data in
your databases, SQL Server 2008 introduced FILESTREAM.
The FILESTREAM feature allows you to associate files with a database. The files are stored in a folder
on the operating system, but are linked directly into a database where the files can be backed
up, restored, full-text-indexed, and combined with other structured data.
Prior to SQL Server 2008, you had to extract the contents of a file to store it in a VARBINARY(MAX),
VARCHAR(MAX), or NVARCHAR(MAX) data type. However, you were limited to storing only 2 GB of
data within a large data type. To work around this restriction, many organizations stored the filename
within SQL Server and maintained the file on the operating system. The main issue with storing the
file outside the database is that it was very easy to move, delete, or rename a file without
making a corresponding update to the database.
SQL Server 2008 introduces a new property for a column called FILESTREAM. FILESTREAM combines
the best of both worlds. Binary large objects (BLOBs) stored in a FILESTREAM column are controlled
and maintained by SQL Server; however, the data resides in a file on the operating system. By storing
the data on the file system outside of the database, you are no longer restricted to the 2-GB limit on
BLOBs. In addition, when you back up the database, all the files are backed up at the same
time, ensuring that the state of each file remains synchronized with the database.
You apply the FILESTREAM property to columns with a VARBINARY(MAX) data type.
The column within the table maintains a 16-byte identifier for the file. SQL Server manages the access
to the files stored on the operating system.
A filegroup designated for FILESTREAM storage is offline and inaccessible within a Database
Snapshot. In addition, you cannot implement Database Mirroring against a database containing data
stored with FILESTREAM.
To store FILESTREAM data within a database, you need to specify where the data will be stored. You
define the location for FILESTREAM data in a database by designating a filegroup within the database
to be used for storage with the CONTAINS FILESTREAM property. The FILENAME property defined
for a FILESTREAM filegroup specifies the path to a folder. The initial part of the folder path
definition must exist; however, the last folder in the path defined cannot exist and is created
automatically. After the FILESTREAM folder has been created, a filestream.hdr file is created in the
folder, which is a system file used to manage the files subsequently written to the folder.
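A minimal sketch of designating a FILESTREAM filegroup; the database, filegroup, file name, and folder path are placeholders (only the last folder in the path must not already exist):

ALTER DATABASE Sales
ADD FILEGROUP DocumentGroup CONTAINS FILESTREAM;

ALTER DATABASE Sales
ADD FILE (NAME = Sales_docs, FILENAME = 'E:\SQLData\FileStreamData')
TO FILEGROUP DocumentGroup;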
tempdb Database
Because the tempdb database is much more heavily used than in previous versions, special care
needs to be taken in how you design the storage underneath tempdb. In addition to temporary
objects, SQL Server uses tempdb for worktables used in grouping/sorting operations, worktables
to support cursors, the version store supporting snapshot isolation level, and overflow for table
variables. You can also cause index build operations to use space in tempdb.
Due to the potential for heavy write activity, you should move tempdb to a set of disks separated
from your databases and any backup files. To spread out the disk I/O, you might
consider adding additional files to tempdb.
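A minimal sketch of adding a file to tempdb; the file name, path, and sizes are placeholders, and equally sized files let the proportional fill algorithm spread I/O evenly:

ALTER DATABASE tempdb
ADD FILE (NAME = tempdev2, FILENAME = 'T:\TempDB\tempdb2.ndf', SIZE = 2GB, FILEGROWTH = 512MB);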
Database Options
A database has numerous options that control a variety of behaviors. These options are broken down
into several categories, including the following:
Recovery
Auto options
Change tracking
Access
Parameterization
Recovery Options
The recovery options determine the behavior of the transaction log and how
damaged pages are handled.
Recovery Models
Every database within a SQL Server instance has a property setting called the
recovery model. The recovery model determines the types of backups you can perform against
a database.
The recovery models available in SQL Server 2008 are:
Full
Bulk-logged
Simple
When a database is in the Full recovery model, all changes made, using both
data manipulation language (DML) and data definition language (DDL), are logged to
the transaction log. Because all changes are recorded in the transaction log, it is possible to recover a
database in the Full recovery model to a given point in time so that data loss can be minimized or
eliminated if you should need to recover from a disaster. Changes are retained in the transaction log
indefinitely and are removed only by executing a transaction log backup.
Certain operations are designed to manipulate large amounts of data. However, the overhead
of logging to the transaction log can have a detrimental impact on performance. The Bulk-
logged recovery model allows certain operations to be executed with minimal logging. When a
minimally logged operation is performed, SQL Server does not log every row changed but instead logs
only the extents, thereby reducing the overhead and improving performance. The operations
that are performed in a minimally logged manner with the database set in the Bulk-logged recovery
model are:
BCP
BULK INSERT
SELECT...INTO
CREATE INDEX
ALTER INDEX...REBUILD
Because the Bulk-logged recovery model does not log every change to the transaction log, you cannot
recover a database to a point in time that falls within the interval during which a minimally
logged transaction executed while the Bulk-logged recovery model was enabled.
The third recovery model is Simple. A database in the Simple recovery model logs operations to the
transaction log exactly as the Full recovery model does. However, each time the database checkpoint
process executes, the committed portion of the transaction log is discarded. A database in the Simple
recovery model cannot be recovered to a point in time because it is not possible to issue a transaction
log backup for a database in the Simple recovery model.
Because the recovery model is a property of a database, you set the recovery model by using
the ALTER DATABASE command as follows:
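For example, a minimal sketch with a placeholder database name:

ALTER DATABASE Sales SET RECOVERY FULL;
-- or SET RECOVERY BULK_LOGGED; or SET RECOVERY SIMPLE;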
The backup types available for each recovery model are summarized as follows: the Full and
Bulk-logged recovery models allow full, differential, and transaction log backups, whereas the Simple
recovery model allows only full and differential backups.
Auto Options
There are five options for a database that enable certain actions to occur automatically:
AUTO_CLOSE
AUTO_SHRINK
AUTO_CREATE_STATISTICS
AUTO_UPDATE_STATISTICS
AUTO_UPDATE_STATISTICS_ASYNC
Each database within an instance requires a variety of resources, the most significant of which is a set
of memory buffers. Each open database requires several bytes of memory, and any queries against the
database populate the data and query caches. If the AUTO_CLOSE option is enabled, when the
last connection to a database is closed, SQL Server shuts down the database and releases all
resources related to the database. When a new connection is made to the database, SQL Server
starts up the database and begins allocating resources.
By default, AUTO_CLOSE is disabled. Unless you have severe memory pressure, you should not enable
a database for AUTO_CLOSE. In addition, a database that is frequently accessed should not be set to
AUTO_CLOSE because it would cause a severe degradation in performance. This is because you would
never be able to use the data and query caches adequately.
Data files can be set to grow automatically when additional space is needed. Although
most operations to increase space affect the database on a long-term basis, some space increases are
needed only on a temporary basis. If the AUTO_SHRINK option is enabled, SQL Server
periodically checks the space utilization of data and transaction log files. If the space-checking
algorithm finds a data file that has more than 25 percent free space, the file automatically shrinks to
reclaim disk space.
Expanding a database file is a very expensive operation. Shrinking a database file is also an expensive
operation. Because the growth of database files is so expensive, it is recommended to leave
the AUTO_SHRINK option disabled and manually shrink files only when necessary.
Statistics allow the Query Optimizer to build more efficient query plans. If the
AUTO_CREATE_STATISTICS option is enabled, SQL Server automatically creates statistics that are
missing during the optimization phase of query processing. Although the creation of statistics
incurs some overhead, the benefit to query performance is worth the overhead cost of having
SQL Server create statistics automatically when necessary.
Databases have two options that allow SQL Server to update out-of-date statistics automatically. The
AUTO_UPDATE_STATISTICS option updates out-of-date statistics during query optimization. If you
choose to enable AUTO_UPDATE_STATISTICS, a second option, AUTO_UPDATE_STATISTICS_ASYNC,
controls whether statistics are updated during query optimization or whether query optimization
continues while the statistics are updated asynchronously.
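A minimal sketch of setting these options with ALTER DATABASE; the database name and the values shown are illustrative:

ALTER DATABASE Sales SET AUTO_CLOSE OFF;
ALTER DATABASE Sales SET AUTO_SHRINK OFF;
ALTER DATABASE Sales SET AUTO_CREATE_STATISTICS ON;
ALTER DATABASE Sales SET AUTO_UPDATE_STATISTICS ON;
ALTER DATABASE Sales SET AUTO_UPDATE_STATISTICS_ASYNC ON;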
Access
Access to a database can be controlled through several options.
The status of a database can be explicitly set to ONLINE, OFFLINE, or EMERGENCY. When a database
is in an ONLINE state, you can perform all operations that would otherwise be possible. A database
that is in an OFFLINE state is inaccessible. A database in an EMERGENCY state can be accessed only by
a member of the db_owner role, and the only command allowed to be executed is SELECT.
You can control the ability to modify data for an online database by setting the database to either
READ_ONLY or READ_WRITE. A database in READ_ONLY mode cannot be written to. In addition,
when a database is placed in READ_ONLY mode, SQL Server removes any transaction log file that is
specified for the database. Changing a database from READ_ONLY to READ_WRITE causes SQL
Server to re-create the transaction log file.
If users are connected to the database when you change the mode to SINGLE_USER, or if connected
users conflict with the allowed set for RESTRICTED_USER, the ALTER DATABASE command is blocked
until all the non-allowed users disconnect.
Instead of waiting for users to complete operations and disconnect from the database, you can specify
a ROLLBACK action to terminate connections forcibly. The ROLLBACK IMMEDIATE option
forcibly rolls back any open transactions, along with disconnecting any non-allowed users.
ALTER DATABASE <database name> SET RESTRICTED_USER ROLLBACK IMMEDIATE
You can allow users to complete transactions and exit the database by using the ROLLBACK AFTER
<number of seconds> option, which waits for the specified number of seconds before rolling back
transactions and disconnecting users. The normal operational mode for most databases is
ONLINE, READ_WRITE, and MULTI_USER.
Parameterization
One of the “hot button” topics in application development is whether to parameterize calls to
the database. When a database call is parameterized, the values are passed as variables. Applications
gain a significant benefit when database calls are parameterized.
SQL Server caches the query plan for every query that is executed. Unless there is pressure on the
query cache that forces a query plan from the cache, every query executed since the instance started is
in the query cache. When a query is executed, SQL Server parses and compiles the query. The query is
then compared to the query cache using a string-matching algorithm. If a match is found, SQL Server
retrieves the plan that has already been generated and executes the query.
A query that is parameterized has a much higher probability of being matched because the
query string does not change even when the values being used vary. Therefore, parameterized
queries can reuse cached query plans more frequently and avoid the time required to build a query
plan.
Because not all applications parameterize calls to the database, you can force SQL Server to
parameterize every query for a given database by setting the PARAMETERIZATION FORCED database
option. You could see a performance boost by forcing parameterization.
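A minimal sketch with a placeholder database name:

ALTER DATABASE Sales SET PARAMETERIZATION FORCED;
-- revert to the default behavior:
ALTER DATABASE Sales SET PARAMETERIZATION SIMPLE;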
Collation Sequences
SQL Server has the capability to store character data that spans every possible written
language. However, not every language follows the same rules for sorting or data comparisons. SQL
Server allows you to define the rules for comparison, sorting, case sensitivity, and accent sensitivity
through the specification of a collation sequence.
When you install SQL Server, you specify a default collation sequence that is used for all databases,
tables, and columns. You can override the default collation sequence at each level. The collation
sequence for an instance can be overridden at a database level by specifying the COLLATE clause in
either the CREATE DATABASE or ALTER DATABASE command.
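A minimal sketch; the database name and the collation shown are placeholders:

CREATE DATABASE Sales COLLATE SQL_Latin1_General_CP1_CI_AS;
-- or change an existing database:
ALTER DATABASE Sales COLLATE SQL_Latin1_General_CP1_CI_AS;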
In a perfect world, everything that you save to disk storage would always
write correctly, read correctly, and never have any problems. Unfortunately, your SQL Server
databases live in an imperfect world where things do go wrong. Although this occurs very
rarely, data within your database can become corrupted if there is a failure in the disk storage
system as SQL Server is writing to a page. Data pages are 8 kilobytes (KB) in size, but SQL
Server divides a page into 16 blocks of 512 bytes apiece when performing write operations.
If SQL Server begins writing blocks on a page and the disk system fails in the middle of the
write process, only a portion of the page is written successfully, producing a problem called a torn
page.
Databases have an option called PAGE_VERIFY. The page verification can be set to
either TORN_PAGE_DETECTION or CHECKSUM. The PAGE_VERIFY TORN_PAGE_DETECTION option
exists for backward compatibility and should not be used.
When the PAGE_VERIFY CHECKSUM option is enabled, SQL Server calculates a checksum for the page
prior to the write. Each time a page is read off disk, a checksum is recalculated and compared
to the checksum written to the page. If the checksums do not match, the page has been
corrupted. When SQL Server encounters a corrupt page, an error is thrown, the command
attempting to access the corrupt page is aborted, and an entry is written into the suspect_pages table
in the msdb database.
You should enable the PAGE_VERIFY CHECKSUM option on every production database.
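A minimal sketch with a placeholder database name:

ALTER DATABASE Sales SET PAGE_VERIFY CHECKSUM;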
Although page verification can detect and log corrupted pages, the page must be read off disk
to trigger the verification check. Data is normally read off disk when users and applications access
data, but instead of having a user receive an error message, it is much better for you to proactively
find corruption and fix the problem by using a backup before the user has a process aborted.
You can force SQL Server to read every page from disk and check the integrity by executing the DBCC
CHECKDB command. The syntax is:
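A minimal form, with a placeholder database name:

DBCC CHECKDB ('Sales') WITH NO_INFOMSGS;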
When DBCC CHECKDB is executed, SQL Server performs all the following actions:
Checks page allocation within the database
Checks the structural integrity of all tables and indexed views
Calculates a checksum for every data and index page to compare against the stored checksum
Validates the contents of every indexed view
Checks the database catalog
Validates Service Broker data within the database
To accomplish these checks, DBCC CHECKDB executes the following commands:
DBCC CHECKALLOC, to check the page allocation of the database
DBCC CHECKCATALOG, to check the database catalog
DBCC CHECKTABLE, for each table and view in the database to check the structural integrity
Any errors encountered are output so that you can fix the problems. If an integrity error is found in an
index, you should drop and re-create the index. If an integrity error is found in a table, you need to use
your most recent backups to repair the damaged pages.
If the database is participating in Database Mirroring, SQL Server attempts to retrieve a copy of the
page from the mirror. If the page can be retrieved from the mirror and has the correct page contents,
the page is replaced automatically on the principal without requiring any intervention. When
SQL Server replaces a corrupt page from the mirror, an entry is written into
the sys.dm_db_mirroring_auto_page_repair view.
UNIT-TWO
Transaction Processing
What is a transaction?
A transaction is a mechanism for applying the desired modifications/operations to a
database. It is evident in real life that the final database instance after a successful
manipulation of the content of the database is the most up-to-date copy of the database.
A transaction is an action, or series of actions, carried out by a single user or application program,
which accesses or changes the contents of a database (i.e., a logical unit of work on the database).
A transaction could be a whole program, part/module of a program, or a single command.
Changes made in real time to a database are called transactions. Examples include ATM
transactions, credit card approvals, flight reservations, hotel check-in, phone calls,
supermarket scanning, academic registration, and billing.
A transaction could be composed of one or more database and non-database operations.
It transforms the database from one consistent state to another, although consistency may be
violated during the transaction.
A database transaction is a unit of interaction with a database management system or similar
system that is treated in a coherent and reliable way independent of other transactions.
Transaction processing system
A system that manages transactions and controls their access to a DBMS is called a TP
monitor. A transaction processing system (TPS) generally consists of a TP monitor, one or
more DBMSs, and a set of application programs containing transactions.
In the database field, a transaction is a group of logical operations that must all succeed or fail as a
group. Systems dedicated to supporting such operations are known as transaction
processing systems.
In comparison with a database transaction, an application program is a series of transactions with
non-database processing in between.
What we are interested in is the online transaction, which is the interaction between the users
of a database system and the shared data stored in the database. This transaction program
contains the steps involved in the business transaction.
Transactions can be started, attempted, then committed or aborted via data manipulation
commands of SQL.
A single transaction might require several queries, each reading and/or writing information in the
database. When this happens it is usually important to be sure that the database is not left with
only some of the queries carried out. For example, when doing a money transfer, if the money was
debited from one account, it is important that it also be credited to the depositing account. Also,
transactions should not interfere with each other.
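A minimal sketch of the money-transfer example as one transaction in T-SQL; the Accounts table, its columns, and the account IDs are hypothetical:

BEGIN TRY
    BEGIN TRANSACTION;
        UPDATE Accounts SET Balance = Balance - 100 WHERE AccountID = 1;  -- debit source account
        UPDATE Accounts SET Balance = Balance + 100 WHERE AccountID = 2;  -- credit destination account
    COMMIT TRANSACTION;
END TRY
BEGIN CATCH
    ROLLBACK TRANSACTION;  -- neither update is kept if either fails
END CATCH;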
A transaction has four key properties: Atomicity, Consistency, Isolation, and Durability. Together
these are referred to as the ACID properties of a transaction. Without the ACID properties, the
integrity of the database cannot be guaranteed.
Atomicity
It is the "all or none" property. Every transaction should be considered as an atomic process which
cannot be subdivided into smaller tasks. Due to this property, just like an atom which either exists or
does not exist, a transaction has only two states: Done or Never Started.
Done - a transaction must complete successfully and its effect should be visible in
the database.
Never Started - If a transaction fails during execution then all its modifications must
be undone to bring back the database to the last consistent state, i.e., remove the
effect of failed transaction.
Consistency
If the transaction code is correct, then a transaction, at the end of its execution, must
leave the database consistent. A transaction should transform a database from
one consistent state to another consistent state.
Isolation
A transaction must execute without interference from other concurrent transactions and its
intermediate or partial modifications to data must not be visible to other
transactions.
Durability
The effect of a completed transaction must persist in the database, i.e., its updates
must be available to other transactions immediately after the end of its execution,
and it should not be affected by failures occurring after the completion of the transaction.
State of a Transaction
A transaction is an atomic operation from the users’ perspective. But it has a collection of
operations and it can have a number of states during its execution.
(Figure: transaction state diagram — a transaction starts, modifies the database, and commits once it is OK to commit and no error has occurred.)
Most SQL statements seem to be very short and easy to execute, but the reverse is true if you consider one as
a one-command transaction. Actually, a database system interprets a transaction not as an
application program but as a logical sequence of low-level read and write operations (referred to as
primitives).
(a) Serially: In a serial execution, transactions are executed strictly one after another. Transaction Ti
completes and writes its results to the database; only then is the next transaction Tj scheduled for
execution. This means that at any one time there is only one transaction being executed in the
system, and data is not shared between transactions at one specific time.
In serial transaction execution, one transaction being executed does not interfere with the execution of any other
transaction.
Execution is correct, i.e., if the input is correct then the output will be correct.
Execution is fast, since all the resources are available to the active transaction. The worst thing about serial execution is
very inefficient resource utilization.
Example of a serial execution: suppose data items X = 10, Y = 6, and N = 1, and T1 and T2 are transactions.
T1              T2
X := X+N        X := X+N
read (Y)
Y := Y+N
write (Y)
Time   T1                      T2
       X := X+N   {X = 11}
       read (Y)   {Y = 6}
       Y := Y+N   {Y = 7}
       write (Y)  {Y = 7}
                               X := X+N   {X = 12}
                               write (X)
Final values of X, Y at the end of T1 and T2: X = 12 and Y = 7.
Thus we can witness that in serial execution of transactions, if we have two transactions Ti and Ti+1, then
Ti+1 will only be executed after the completion of Ti.
(b) Concurrently: Concurrent execution is the reverse of serial execution; in this scheme the individual
operations of transactions, i.e., reads and writes, are interleaved in some order.
Time   T1                      T2
       X := X+N   {X = 11}
                               X := X+N   {X = 11}
                               write (X)
       read (Y)   {Y = 6}
       Y := Y+N   {Y = 7}
       write (Y)  {Y = 7}
Final values at the end of T1 and T2: X = 11 and Y = 7. This improves resource
utilization but unfortunately gives an incorrect result.
In serial execution T2 read the value of X written by T1 (i.e., 11) but in concurrent
execution T2 read the same value of X (i.e., 10) as T1 did and the update made by T1 was
overwritten by T2’s update.
This is the reason the final value of X is one less than what is produced by serial execution.
Problems Associated with Concurrent Transaction Processing
Although two transactions may be correct in themselves, interleaving of operations
may produce an incorrect result, which is why access to data needs to be controlled.
Every transaction may be correct by itself, but that does not guarantee that
the interleaving of these transactions will produce a correct result.
This is the lost update problem: after the successful completion of the operations in this schedule,
the final value of A will be 200, which overrides the update made by the first transaction that changed
the value from 100 to 90.
The uncommitted dependency (dirty read) problem occurs when one transaction can see intermediate
results of another transaction before it is committed.
E.g.
T2 increases 100, making it 200, but then aborts the transaction before it
is committed. T1 gets 200, subtracts 10, and makes it 190. But the actual balance
should be 90.
The inconsistent analysis problem occurs when a transaction reads several values but a second
transaction updates some of them during the execution and before the completion of the first.
E.g.
T2 would like to add the values of A=10, B=20, and C=30. After the values are
read by T2 and before its completion, T1 updates the value of B to be 50. At the
end of the execution of the two transactions, T2 will come up with the sum of
60 while it should be 90, since B is updated to 50.
UNIT – THREE
•Locking methods
•Time stamping
•Optimistic
Locking and timestamping are pessimistic approaches since they delay
transactions. Optimistic methods assume that conflicts are rare and only check for
conflicts at commit time.
Locking Method
Exclusive lock: A write operation changes the value of the data item. Hence two
write operations from two different transactions, or a write from T1 and a read from
T2, are not allowed. A data item can be modified only under an Exclusive lock. To modify a
data item, T1 will do: Exclusive lock (X), then Write (X), and finally Unlock (X).
When these locks are applied, a transaction must behave in a special way.
This special behavior of a transaction is referred to as being well-formed.
Well-formed: A transaction is well-formed if it does not lock a locked data item and it
does not try to unlock an unlocked data item.
If transaction has shared lock on item, can read but not update item.
If transaction has exclusive lock on item, can both read and update item.
Reads cannot conflict, so more than one transaction can hold shared locks
simultaneously on same item.
Exclusive lock gives transaction exclusive access to that item.
Some systems allow transaction to upgrade a shared lock to an exclusive lock, or
vice-versa.
Examples: T1 and T2 are two transactions. They are executed under locking as follows. T1
locks A in exclusive mode. When T2 wants to lock A, it finds it locked by T1, so T2 waits for the
Unlock on A by T1. When A is released, T2 locks A and begins execution.
Suppose a lock on a data item is applied, the data item is processed and it is unlocked
immediately after reading/writing is completed as follows.
Serial execution of T1 and then T2:
  T1: A := A + 100
  T1: B := B + 10
  T2: B := B * 5
  T2: A := A + 20

Concurrent execution of T1 and T2:
  T1: A := A + 100
  T2: B := B * 5
  T1: B := B + 10
  T2: A := A + 20
A transaction follows 2PL protocol if all locking operations precede the first
unlock operation in the transaction.
The 2PL protocol demands that the locking and unlocking operations of a transaction occur in two
phases.
Growing phase - acquires all locks but cannot release any locks.
Shrinking phase - releases locks but cannot acquire any new locks.
Deadlock: A deadlock may result when two (or more) transactions are
each waiting for locks held by the other to be released.
Only one way to break deadlock: abort one or more of the transactions in the
deadlock.
Timeout
The deadlock detection could be done using the technique of TIMEOUT. Every
transaction will be given a time to wait in case of deadlock. If a transaction waits
for the predefined period of time in idle mode, the DBMS will assume that
deadlock occurred and it will abort and restart the transaction.
Time-stamping Method
Again each data item will have a timestamp for Read and Write.
Read/write proceeds only if last update on that data item was carried out by an
older transaction. Otherwise, transaction requesting read/write is restarted and
given a new timestamp.
The timestamp ordering protocol ensures that any conflicting read and
write operations are executed in timestamp order.
Suppose that Transaction Ti issues Read(A):
If TS(Ti) < WTS(A): this implies that Ti needs to read a value of A which
was already overwritten. Hence the read operation must be rejected and Ti
is rolled back.
If TS(Ti) >= WTS(A): then the read is executed and RTS(A) is set to the
maximum of RTS(A) and TS(Ti).
Suppose that Transaction Ti issues Write(A):
If TS(Ti) < RTS(A): then this implies that the value of A that Ti is
producing was previously needed and it was assumed that it would never
be produced. Hence, the Write operation must be rejected and Ti is rolled
back.
If TS(Ti) < WTS(A): then this implies that Ti is attempting to write an
obsolete value of A. Hence, this write operation can be ignored.
Otherwise, the Write operation is executed and WTS(A) is set to the
maximum of WTS(A) and TS(Ti).
A transaction that is rolled back due to conflict will be restarted and be
given a new timestamp.
Cascading Rollback
Whenever some transaction T tries to issue a Read_Item(X) or a
Write_Item(X) operation, the basic timestamp ordering algorithm compares the
timestamp of T with the read timestamp and the write timestamp of X to
ensure that the timestamp order of execution of the transactions is not violated.
If the timestamp order is violated by the operation, then transaction T will violate
the equivalent serial schedule, so T is aborted. Then T is resubmitted to the
system as a new transaction with new timestamp. If T is aborted and rolled
back, any transaction Ti that may have used a value written by T must also be
rolled back. Similarly, any transaction Tj that may have used a value written by Ti
must also be rolled back, and so on. This effect is known as cascading rollback.
Optimistic Technique
Granularity is the size of the data items chosen as the unit of protection by a
concurrency control protocol.
It could be:
Transaction Subsystem
Transaction manager
Scheduler
Recovery manager
Buffer manager
The Transaction Manager in a DBMS coordinates transactions on behalf of
application programs. It communicates with the scheduler (lock manager),
which is responsible for implementing a strategy for concurrency control.
The Scheduler in the DBMS ensures that the individual steps of different
transactions preserve consistency.
Recovery Manager: ensures that the database is restored to the original state
in case a failure occurs.
Buffer Manager: responsible for the transfer of data between disk storage and main
memory.
UNIT FOUR
TCP Endpoints
Endpoints control the capability to connect to an instance of SQL
Server as well as dictate the communications methods that are
acceptable. Acting very similar to firewalls on the network, endpoints
are a layer of security at the border between applications and
your SQL Server instance.
By combining an endpoint transport and payload, SQL Server can filter acceptable
traffic before a command even reaches the SQL Server instance. For example, suppose you
have an endpoint defined as TCP with a payload of TSQL. If any application attempted to
send HTTP, SERVICE_BROKER, or DATABASE_MIRRORING traffic through the endpoint, the
connection would be denied without needing to authenticate the request.
Endpoint Access
Even if traffic going to the endpoint matches the correct transport and payload,
a connection is still not allowed unless access has been granted on the
endpoint. Endpoint access has two layers. The first layer of access security is
determined by the endpoint state. An endpoint can have one of three states:
STARTED, STOPPED, and DISABLED. The three states of an endpoint react as follows:
STARTED The endpoint is actively listening for connections and will reply to an
application.
STOPPED The endpoint is actively listening but returns a connection error to
an application.
DISABLED The endpoint does not listen and does not respond to any connection that is
attempted.
You might be wondering about all the effort involved just to create a connection to
an instance of SQL Server before the user is even authenticated. In prior versions of
SQL Server, any application could connect to a server running SQL Server and transmit any
type of request. No attempt was made to ensure that applications had to transmit validly
formed requests, so hacking into a server running SQL Server was much easier to
accomplish.
SQL Server 2008 ensures that only valid requests can be submitted by a valid user before
a request is scheduled within the engine. Administrators also have a master switch
to shut off access immediately if they feel someone is attempting to compromise
their server running SQL Server, by setting the state of the endpoint being used to
DISABLED.
You can create Transmission Control Protocol (TCP) endpoints with three different
payloads: TSQL, DATABASE_MIRRORING, and SERVICE_BROKER.
TCP Protocol Arguments
You can configure TCP endpoints to listen on specific Internet Protocol (IP) addresses
and port numbers. The two arguments you can specify that are universal for all TCP
endpoints are the following:
LISTENER_PORT
LISTENER_IP
LISTENER_PORT is required. The TCP endpoint for TSQL that is created for
each instance during installation is already configured for TCP port 1433 or the alternate
port number for the instance.
SQL Server can encrypt all communications between endpoints, and you can specify
which encryption algorithm to use for the communications. The default algorithm is RC4,
but you can specify the much stronger Advanced Encryption Standard (AES) algorithm.
The following T-SQL example shows how to create a Database Mirroring endpoint:
CREATE ENDPOINT [Mirroring]
    AS TCP (LISTENER_PORT = 5022)
    FOR DATABASE_MIRRORING (ROLE = PARTNER, ENCRYPTION = REQUIRED);
ALTER ENDPOINT [Mirroring] STATE = STARTED;
You can run the following to gather information about endpoints configured
in your system:
SELECT * FROM sys.endpoints
SELECT * FROM sys.tcp_endpoints
SELECT * FROM sys.http_endpoints
SELECT * FROM sys.database_mirroring_endpoints
SELECT * FROM sys.service_broker_endpoints
Given enough time, anyone can eventually beat any security implementation. The
purpose of security is to provide enough barriers such that the effort
required to break into a system exceeds the benefit received. In this lesson,
you learn how to configure your instances to expose the minimum number of
attack points possible by minimizing the feature set that is enabled within SQL
Server.
One of the most frequent ways that attackers gain access to a system is through
features that have been enabled but are rarely used. SQL Server now disables
every feature not required for the operation of the database engine.
At the time of installation, you can decide to force users to authenticate to an instance
using only Windows credentials. If the authentication mode for your instance is set to
Windows only, you have disabled users’ ability to use SQL Server native logins.
The biggest potential risk to an instance is through the use of features that expose
an external interface or ad hoc execution capability. The two features with the greatest
risk are OPENROWSET/OPENDATASOURCE and OLE Automation procedures.
You enable and disable SQL Server features by using sp_configure. The features that
you should have disabled unless you need the specific functionality are the following:
Ad Hoc Distributed Queries
CLR Enabled
Cross Database Ownership Chaining (CDOC)
Database Mail
External Key Management
Filestream Access Level
OLE Automation Procedures
Remote Admin Connections
SQL Mail extended stored procedures (XPs)
xp_cmdshell
Execute the following code to turn on the ability to view all the configuration options for
an instance:
EXEC sp_configure 'show advanced options', 1
GO
RECONFIGURE WITH OVERRIDE
GO
EXEC sp_configure
GO
Execute the following code to turn off ad hoc distributed queries, CDOC, CLR,
OLE Automation procedures, SQL Mail XPs, and xp_cmdshell:
EXEC sp_configure 'Ad Hoc Distributed Queries', 0
EXEC sp_configure 'clr enabled', 0
EXEC sp_configure 'cross db ownership chaining', 0
EXEC sp_configure 'Ole Automation Procedures', 0
EXEC sp_configure 'SQL Mail XPs', 0
EXEC sp_configure 'xp_cmdshell', 0
GO
RECONFIGURE WITH OVERRIDE
GO
Creating Principals
Principals are the means by which you authenticate and are identified within an instance
or database. Principals are broken down into two major categories: logins/users and
groups that exist at both an instance and database level.
Logins
To gain access to an instance, a user has to authenticate by supplying credentials
for SQL Server to validate. You create logins for an instance to allow a user to
authenticate. Logins within SQL Server can be of five different types: a standard SQL Server login,
a Windows login, a Windows group, a login mapped to a certificate, and a login mapped to an
asymmetric key.
A SQL Server login can also be mapped to either a Windows login or a Windows group.
When adding a Windows login or Windows group, SQL Server stores the name of the
login or group along with the corresponding Windows SID.
A login mapped to a certificate or asymmetric key does not provide a means to authenticate
to the instance. Logins mapped to certificates and asymmetric keys are used internally as
a security container.
When the CHECK_POLICY option (the default and recommended setting) is enabled,
SQL Server 2008 enforces the Windows password policy settings when you create a
SQL Server login.
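A minimal sketch of creating logins; the domain, group, login names, and password are placeholders:

CREATE LOGIN [CONTOSO\Bob] FROM WINDOWS;
CREATE LOGIN [CONTOSO\SalesUsers] FROM WINDOWS;
CREATE LOGIN SalesApp WITH PASSWORD = 'Str0ng!Passw0rd', CHECK_POLICY = ON;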
Groups
Using Windows groups provides the greatest flexibility for managing security access. You
simply add or remove accounts from the group to control access to a SQL Server instance.
A DBA is also isolated from the details of people joining and leaving companies or
moving to different groups within an organization. The DBA can then focus on
defining groups based on permission profiles and leave the mechanics of adding and
removing user accounts to standard business processes within your company.
When you are performing maintenance on a database, you need to ensure that
users are not accessing the database in the meantime.
You can prevent access while keeping the permissions for a login intact by disabling the
login. You can disable the login by executing the following code:
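For example, with a placeholder login name:

ALTER LOGIN SalesApp DISABLE;
-- re-enable later with:
ALTER LOGIN SalesApp ENABLE;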
Roles in SQL Server provide the same functionality as groups within Windows. Roles
provide a convenient way to group multiple users with the same permissions. Permissions
are assigned to the role, instead of individual users. Users then gain the required set of
permissions by having their account added to the appropriate role.
SQL Server ships with a set of instance-level roles. The instance-level roles are referred to
as fixed server roles, because you cannot modify the permissions on the role. You
also cannot create additional roles at an instance level.
Database Users
SQL Server security works on the principle of “no access by default.” If you haven’t
explicitly been granted permission, you cannot perform an action. You grant access to a
database by adding a login to the database as a user by executing the CREATE
USER command.
Just because a user can access a database, that does not mean that any objects within the
database can be accessed, since the user still needs permissions granted
to database object(s).
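A minimal sketch of adding a login to a database as a user; the names are placeholders:

USE Sales;
GO
CREATE USER SalesApp FOR LOGIN SalesApp;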
Loginless Users
It is possible to create a user in the database that is not associated with a login, referred to as
a loginless user. Prior to SQL Server 2005, if you wanted to allow users to access a
database only when a specific application was being used, you used an application role.
You created the application role with a password, and assigned permissions to the
application role.
Users would then specify the password for the application role to gain access to
the database under the application role's security context. Unfortunately, when a user was
connected with the application role, SQL Server no longer knew the user
issuing commands, which created a problem for auditing activity.
Loginless users were added to replace application roles. Users still authenticate to
the instance using their own credentials. The user’s login needs access to the database.
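A minimal sketch of creating a loginless user; the user name is a placeholder:

CREATE USER ReportProxy WITHOUT LOGIN;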
User Database Roles
Instead of managing permissions for each account, all modern operating systems allow
you to define groups of users that all have the same permissions. All system
administrators need to do is to manage the members of a group instead of the
potentially hundreds or thousands of individual permissions.
A database role is a principal within the database that contains one or more database
users. Permissions are assigned to the database role. Although you can assign
permissions directly to a user, it is recommended that you create database roles, add users
to a role, and then grant permissions to the role.
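A minimal sketch of that recommended pattern; the role, user, and schema names are placeholders:

CREATE ROLE SalesReaders;
EXEC sp_addrolemember 'SalesReaders', 'SalesApp';
GRANT SELECT ON SCHEMA::dbo TO SalesReaders;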
Managing Permissions
SQL Server denies access by default. Therefore, to access any object or perform any action,
you must be granted permission.
Securables
Permissions would be rather uninteresting unless you had something to apply the
permissions to. In addition, there would be no need to have permissions if no one existed
to use the permission. Permissions work in concert with securables and principals.
You GRANT/REVOKE/DENY <permissions> ON <securables> TO <principals>.
Securables are the objects on which you grant permissions. Every object within SQL
Server, including the entire instance, is a securable. Securables also can be nested
inside other securables. For example, an instance contains databases; databases contain
schemas; and schemas contain tables, views, procedures, functions, and so on.
Schemas
Every object created within a database must have an owner, because objects cannot
spontaneously come into existence; rather, they must be created by someone.
When you need to remove a user from a database, it is not possible to drop the user
unless you reassign the objects to a different owner, and reassigning an object to a different
owner would change the name of the object.
Schemas provide the containers that own all objects within a database and in turn,
a schema is owned by a database user. By introducing a schema between users and
objects, you can drop a user from the database without affecting the name of an
object or applications that use the object.
Permissions
Permissions provide the authority for principals to perform actions within an instance or
database. Some permissions apply to a statement such as INSERT, UPDATE, and SELECT,
and other permissions apply to an action such as ALTER TRACE, and still others
encompass a broad scope of authority such as CONTROL.
You add permissions to an object with the GRANT statement. Access to an object is
prevented with the DENY statement. To access an object, permission must be
granted explicitly. A DENY overrides any other permission; in particular, a DENY overrides a GRANT.
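A minimal sketch of the GRANT/DENY/REVOKE statements; the table, role, and user names are placeholders:

GRANT SELECT, INSERT ON OBJECT::dbo.Orders TO SalesReaders;
DENY DELETE ON OBJECT::dbo.Orders TO SalesApp;
REVOKE INSERT ON OBJECT::dbo.Orders FROM SalesReaders;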
Permission Scope
Because you grant permissions on asecurable, you can assign permissions to a securable at
any scope. Granting permission on a database causes the permission to be granted
implicitly to all schemas within the database and thereby to all objects within all schemas.
Granting permission on a schema causes the permission to be granted implicitly to
all objects within the schema.
The generic syntax for creating a DDL trigger is as follows:
CREATE TRIGGER trigger_name
ON { ALL SERVER | DATABASE }
[ WITH <ddl_trigger_option> [ ,...n ] ]
{ FOR | AFTER } { event_type | event_group } [ ,...n ]
AS { sql_statement [ ; ] [ ,...n ] | EXTERNAL NAME <method specifier> [ ; ] }
You use the ON clause to scope a trigger as either instance-level (ON ALL SERVER) or
database-level (ON DATABASE). You specify the DDL event or event group that the trigger fires
upon within the FOR clause.
DDL triggers fire within the context of the DDL statement being executed. In addition to
obtaining information about the command that was executed, DDL triggers allow you to
prevent many DDL actions.
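A minimal sketch of a database-scoped DDL trigger that prevents an action; the trigger name and message are placeholders:

CREATE TRIGGER tr_PreventDropTable
ON DATABASE
FOR DROP_TABLE
AS
BEGIN
    RAISERROR('Tables cannot be dropped in this database.', 16, 1);
    ROLLBACK;  -- undoes the DROP TABLE statement that fired the trigger
END;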
Audit Specifications
Prior to SQL Server 2008, you had to use multiple features to perform the full array
of auditing for an instance. DDL triggers would audit DDL changes; data
manipulation language (DML) triggers would audit data changes at the cost
of increasing transaction times; SQL Trace would audit SELECT statements.
SQL Server 2008 combines all the auditing capabilities into an audit specification. Audit
specifications begin with a server-level audit object that defines the logging location for
the audit trail. You then create server and database audit specifications tied to the audit
object.
C2 Auditing
C2 auditing has been superseded by the Common Criteria specification developed by the
European Union. Whether you are complying with C2 or Common Criteria, with respect
to SQL Server, the audit result is essentially the same. You need to audit every successful
and unsuccessful attempt to access a database object.
You must be very careful when implementing C2 auditing. Be sure to check that a lower
level of auditing does not meet your requirements. When you enable C2 auditing, you have
made the decision that the audit is more important than a transaction. If the
system becomes too busy, SQL Server aborts a user transaction to write audit information.
Encrypting Data
Data that must remain confidential, even from a user that has SELECT permission
on a table, should be encrypted.
Data Encryption
Data that needs to remain confidential within the database (such as credit
card numbers) should be encrypted. After it's encrypted, the data cannot be read without
having the proper credentials. In addition, encrypted columns cannot be used as search
arguments or as columns within an index, because each action would defeat the purpose of
encrypting the data.
Columns can be encrypted using a hash, symmetric key, asymmetric key, or a certificate.
Symmetric keys are commonly used since a symmetric key provides the best
balance between securing data and performance. Asymmetric keys and certificates
provide the strongest encryption and decryption method.
Hash Algorithms
SQL Server uses an MD5 hash to handle passwords. When a password is specified for an
object, it applies an MD5 hash and stores the hash value. When you specify a password to
access an object, the password is hashed using the same MD5 hash, the hashed password
is transmitted in a secure channel, and the hash value transmitted is compared to
the hash value stored. Even an administrator who is running a trace cannot access
the password.
SQL Server allows you to specify five different hash algorithms: SHA, SHA1, MD2, MD4,
and MD5. MD5 is the algorithm of choice because it provides stronger encryption than the
other algorithms. Hash algorithms are also platform-independent. You could hash a value
within PHP on a Linux system and receive the same value as if you hashed the same value
within SQL Server, so long as you used the same algorithm.
Hash algorithms are vulnerable to brute force attacks. If the range of values that you are
seeking to encrypt is small, an attacker can easily generate all the possible hashes for the
range of possible values. After generating these hashes, the attacker needs to compare the
hash values to find a match and thus reverse-engineer your data. For example, birthdates,
salaries, and credit card numbers would not be good choices to encrypt using a
hash algorithm.
Symmetric Keys
Symmetric keys utilize a single key for both encryption and decryption. Because
only a single key is needed to encrypt and decrypt data, symmetric key encryption is not
as strong as asymmetric key or certificate-based encryption. However, symmetric keys
provide the best possible performance for routine use of encrypted data.
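A minimal sketch of the usual symmetric key setup and use; all object names, the password, and the sample value are placeholders, with the key protected by a certificate that is in turn protected by the database master key:

CREATE MASTER KEY ENCRYPTION BY PASSWORD = 'Str0ng!MasterKeyPass';
CREATE CERTIFICATE SalesCert WITH SUBJECT = 'Key protection certificate';
CREATE SYMMETRIC KEY SalesKey
    WITH ALGORITHM = AES_256
    ENCRYPTION BY CERTIFICATE SalesCert;

OPEN SYMMETRIC KEY SalesKey DECRYPTION BY CERTIFICATE SalesCert;
SELECT EncryptByKey(Key_GUID('SalesKey'), '4111-1111-1111-1111') AS EncryptedValue;
CLOSE SYMMETRIC KEY SalesKey;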
Certificates and asymmetric keys are based on the X.509 standard and are
essentially equivalent in their application. Asymmetric keys are generated by a key server
within an organization and cannot be backed up or moved from one system to
another. Certificates can be backed up to and restored from a file, allowing you to move
databases that are encrypted while being able to re-create the certificate to access your
data.
You can execute the following code and compare the results for each hash algorithm:
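A minimal sketch using the HASHBYTES function with a placeholder input value:

SELECT HASHBYTES('SHA',  'P@ssw0rd') AS SHA_Hash,
       HASHBYTES('SHA1', 'P@ssw0rd') AS SHA1_Hash,
       HASHBYTES('MD2',  'P@ssw0rd') AS MD2_Hash,
       HASHBYTES('MD4',  'P@ssw0rd') AS MD4_Hash,
       HASHBYTES('MD5',  'P@ssw0rd') AS MD5_Hash;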
UNIT FIVE
Index Architecture
Indexes are designed so that you can find the information you
are looking for within a vast volume of data by needing to perform
only a very small number of read operations. In this lesson, you
learn about the internal structure of an index, as well as how
SQL Server builds and manipulates indexes. Armed with this
structural information, you can make better decisions on the
number, type, and definition of the indexes that you choose to
create.
Index Structure
A B-tree is constructed of a root node that contains a single page of data, one or
more optional intermediate-level pages, and one or more optional leaf-level pages.
The core concept of a B-tree can be found in the first word of the name: balanced. A B-tree
is always symmetrical, with the same number of pages on both the left and right halves at
each level.
A query scans the root page until it finds a page that contains the value being searched on.
It then uses the page pointer to hop to the next level and scan the rows in that page until it
finds a page that contains the data being searched for. It then repeats the process
with subsequent levels until it reaches the leaf level of the index. At this point, the
query has located the required data.
Index Levels
The number of levels in an index and the number of pages within each level of an index
are determined by a simple calculation.
Designing Indexes
Indexes enable you to effectively query large amounts of data within a database. In this
lesson, you learn how to create clustered and nonclustered indexes, as well as why each
type of index is useful. Finally, you learn how to create XML and spatial indexes to improve
search capabilities for XML documents and spatial applications.
Clustered Indexes
You can define an index by using one or more columns in the table, called the
index key, with the following restrictions:
You can define an index with a maximum of 16 columns.
The maximum size of the index key is 900 bytes.
The column(s) defined for the clustered index are referred to as the clustering key.
A clustered index is special because it causes SQL Server to arrange the data in the
table according to the clustering key. Because a table cannot be sorted more than one way,
you can define only one clustered index on a table.
Clustered indexes provide a sort order for the storage of data within a table. However, a
clustered index does not provide a physical sort order.
Although a primary key is a constraint, SQL Server physically implements a primary key as
an index. Because the default option for a primary key is clustered unless you
specify otherwise, SQL Server creates a clustered index for a primary key.
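A minimal sketch; the table and column names are hypothetical, and the PRIMARY KEY shown is implemented as a clustered index by default:

CREATE TABLE dbo.Orders
(
    OrderID   INT          NOT NULL PRIMARY KEY CLUSTERED,
    OrderDate DATETIME     NOT NULL,
    Customer  NVARCHAR(50) NOT NULL
);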
Nonclustered Indexes
The other type of relational index that you can create is a nonclustered index.
Nonclustered indexes do not impose a sort order on the table, so you can create multiple
nonclustered indexes on a table. Nonclustered indexes have the same restrictions as a
clustered index: they can have a maximum of 900 bytes in the index key and a maximum
of 16 columns. In addition, a table is limited to a maximum of 999 nonclustered indexes.
The leaf level of a nonclustered index contains a pointer to the data you require. If a
clustered index exists on the table, the leaf level of the nonclustered index points at the
clustering key. If a clustered index does not exist on the table, the leaf level of the
nonclustered index points at the row of data in the table.
Both clustered and nonclustered indexes can be designated as unique. After an index has
been specified as unique, you cannot place duplicate entries within it. If you attempt to insert
duplicate values, you receive an error and the insert is rejected. By default, a multi-row
insert in which even one row produces a duplicate has the entire transaction rolled back.
If you want to reject only the rows that cause duplicates in a multi-row insert operation and
allow all other rows to be inserted, you can specify the IGNORE_DUP_KEY option. With the
IGNORE_DUP_KEY option enabled, rows that produce duplicate values generate a warning
message, and only those rows are rejected.
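For example, a unique nonclustered index that rejects only the offending rows of a multi-row insert might be created as follows (names are illustrative):

CREATE UNIQUE NONCLUSTERED INDEX ux_Customers_Email
ON dbo.Customers (EmailAddress)
WITH (IGNORE_DUP_KEY = ON);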
XML Indexes
An XML data type can contain up to 2 gigabytes (GB) of data in a single column.
Although the XML data has a structure that can be queried, SQL Server needs to scan the
data structure to locate data within an XML document. To improve the performance of
queries against XML data, you can create a special type of index called an XML index.
There are two different types of XML indexes: primary and secondary.
A primary XML index is built against all the nodes within the XML column. The primary
XML index is also tied to the table by maintaining a link to the corresponding row in the
clustered index. Therefore, a clustered primary key is required before you can create a
primary XML index.
After a primary XML index has been created, you can create additional secondary indexes.
Secondary indexes can be created on PATH, VALUE, or PROPERTY. A primary XML index
is required first, because secondary XML indexes are built against the data contained
within the primary XML index.
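For example, assuming a hypothetical table with an xml column and a clustered primary key, a primary XML index and a PATH secondary index could be created as follows:

CREATE TABLE dbo.ProductCatalog
(
    ProductID  int NOT NULL PRIMARY KEY,   -- clustered primary key required
    CatalogXml xml NOT NULL
);

CREATE PRIMARY XML INDEX pxml_ProductCatalog
ON dbo.ProductCatalog (CatalogXml);

CREATE XML INDEX sxml_ProductCatalog_Path
ON dbo.ProductCatalog (CatalogXml)
USING XML INDEX pxml_ProductCatalog FOR PATH;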
Maintaining Indexes
Over time, data changes cause indexes to become fragmented. To keep query operations
as efficient as possible, you need to keep fragmentation to a minimum.
Because the data within an index is stored in sorted order, values can move around within
the index over time due to either page splits or changes in the values. To manage the
fragmentation of an index over time, you need to perform periodic maintenance.
To control the rate of fragmentation of an index, you can use an index option called the
fill factor. You can also use the ALTER INDEX statement to remove the fragmentation.
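For example, an index could be rebuilt with a fill factor that leaves 20 percent of each page empty, or reorganized in place (index and table names are illustrative):

ALTER INDEX icx_Orders_OrderID ON dbo.Orders
REBUILD WITH (FILLFACTOR = 80);

-- Lighter-weight alternative that defragments the leaf level in place
ALTER INDEX icx_Orders_OrderID ON dbo.Orders
REORGANIZE;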
Disabling an Index
An index can be disabled by using the ALTER INDEX statement:
ALTER INDEX { index_name | ALL }
ON <object>
DISABLE
When an index is disabled, the definition remains in the system catalog but is no longer
used. SQL Server does not maintain the index as data in the table changes, and the index
cannot be used to satisfy queries. If a clustered index is disabled, the entire table becomes
inaccessible.
To enable an index, it must be rebuilt to regenerate and populate the B-tree structure. You
can accomplish this by using the following command:
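The general form mirrors the DISABLE syntax shown earlier:

ALTER INDEX { index_name | ALL }
ON <object>
REBUILD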
UNIT SIX
Backing up Databases
Database backups form the backbone upon which every disaster recovery plan is built.
Backup strategies are designed in the opposite order of the material presented in this unit.
You start with the recovery requirements and procedures and then figure out which types
of backups best meet your recovery needs. Unless you are devising recovery-oriented
backup strategies, it is unlikely that you will ever meet your disaster recovery requirements.
Backup Types
SQL Server 2008 allows you to create four different types of backups:
Full
Differential
Transaction log
Filegroup
Full Backups
A full backup captures all pages within a database that contain data. Pages that do not
contain data are not included in the backup. Therefore, a backup is never larger, and in
most cases is smaller, than the database for which it is created. A full backup is the basis
for recovering a database and must exist before you can use a differential or transaction
log backup.
The only operations that are not allowed during a full backup are:
Adding or removing a database file
Shrinking a database
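For example, a minimal full backup statement (the database name and backup path are illustrative) looks like this:

BACKUP DATABASE SalesDB
TO DISK = 'D:\Backups\SalesDB_Full.bak'
WITH INIT;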
Every change made to a database has an entry made to the transaction log. Each row is
assigned a unique number internally called the Log Sequence Number (LSN). The LSN is an
integer value that starts at 1 when the database is created and increments to infinity. An
LSN is never reused for a database and always increments. Essentially, an LSN provides a
sequence number for every change made to a database.
The contents of a transaction log are broken down into two basic parts: active and
inactive.
The inactive portion of the transaction log contains all the changes that have
been committed to the database.
The active portion of the log contains all the changes that have not yet been
committed.
When a transaction log backup is executed, SQL Server starts with the lowest LSN
in the transaction log and starts writing each successive transaction log record into the
backup. As soon as SQL Server reaches the first LSN that has not yet been committed (that
is, the oldest open transaction), the transaction log backup completes. The portion of
the transaction log that has been backed up is then removed, allowing the space to be
reused.
Based on the sequence number, it is possible to restore one transaction log backup
after another to recover a database to any point in time.
The restrictions on transaction log backups depend on having the entire sequence
of LSNs intact. Any action that creates a gap in the LSN sequence prevents any
subsequent transaction log backup from being executed. If an LSN gap is introduced, you
must create a full backup before you can start backing up the transaction log.
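For example, a transaction log backup of the same illustrative database could be taken as follows:

BACKUP LOG SalesDB
TO DISK = 'D:\Backups\SalesDB_Log.trn';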
Differential Backups
A differential backup captures all extents that have changed since the last full
backup. The primary purpose of a differential backup is to reduce the number of
transaction log backups that need to be restored. A differential backup has to be applied
to a full backup and can’t exist until a full backup has been created.
SQL Server tracks each extent that has been changed following a full backup using a
special page in the header of a database called the Differential Change Map (DCM). A full
backup zeroes out the contents of the DCM.
As changes are made to extents within the database, SQL Server sets the bit corresponding
to the extent to 1. When a differential backup is executed, SQL Server reads the contents of
the DCM to find all the extents that have changed since the last full backup. A differential
backup is not the same as an incremental backup.
A transaction log backup is an incremental backup because it captures any changes that
have occurred since the last transaction log backup. A differential backup contains all
pages changed since the last full backup.
For example, if you were to take a full backup at midnight and a differential backup every
four hours, both the 4 A.M. backup and the 8 A.M. backup would contain all the changes
made to the database since midnight.
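For example, a differential backup of the same illustrative database is simply a full backup statement with the DIFFERENTIAL option added:

BACKUP DATABASE SalesDB
TO DISK = 'D:\Backups\SalesDB_Diff.bak'
WITH DIFFERENTIAL;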
Filegroup Backups
Although full backups capture all the used pages across the entire database, a full backup
of a large database can consume a significant amount of space and time. If you need to
reduce the footprint of a backup, you can rely on file or filegroup backups instead. As the
name implies, a file/filegroup backup allows you to target a portion of a database to be
backed up.
Filegroup backups can be used in conjunction with differential and transaction log backups
to recover a portion of the database in the event of a failure. In addition, as long as you do
not need to restore the primary filegroup and you are running SQL Server 2008 Enterprise,
the database can remain online and accessible to applications during the restore operation.
Only the portion of the database being restored is offline.
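For example, a single filegroup of the illustrative database could be backed up as follows (the filegroup name is hypothetical):

BACKUP DATABASE SalesDB
FILEGROUP = 'FG_Archive'
TO DISK = 'D:\Backups\SalesDB_FGArchive.bak';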
Partial Backups
Filegroups can be marked as read-only. A read-only filegroup cannot have any changes to
the objects that are stored on the filegroup. Because the purpose of backups is to capture
changes so that you can reconstruct a database to the most current state during a recovery
operation, backing up filegroups that cannot change unnecessarily consumes space within
the backup.
To reduce the size of a backup to only the filegroups that can change, you can perform a
partial backup. Partial backups are performed by specifying the READ_WRITE_FILEGROUPS
option, as shown in the sketch below. When a partial backup is executed, SQL Server backs
up the primary filegroup, all read/write filegroups, and any explicitly specified read-only
filegroups.
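A minimal sketch of such a statement, using the same illustrative database and path:

BACKUP DATABASE SalesDB
READ_WRITE_FILEGROUPS
TO DISK = 'D:\Backups\SalesDB_Partial.bak';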
Restoring Databases
Backups are the insurance policy for your data. You hope that you never need to use your
backups, but in the event of a disaster, backups allow you to recover your data
and continue business operations.
A transaction log is one or more files associated with a database that tracks every
modification made to either data or objects within the database.
The transaction log is also required to store enough information to allow SQL Server to
recover a database when the instance is restarted.
The key piece of information within a transaction log is the LSN. The LSN starts at 0 when
the database is created and increments to infinity. The LSN always moves forward, never
repeats, and cannot be reset to a previous value. Each operation that affects the state of
the database increments the LSN.
When a SQL Server instance is started, every database undergoes a process called restart
recovery. Restart recovery runs in two phases: REDO and UNDO. During the REDO phase,
all committed transactions in the transaction log are flushed to disk.
At the completion of the REDO phase, the UNDO phase starts. The UNDO phase moves
through the transaction log and invalidates any transaction in the log that is still open,
ensuring that an uncommitted transaction cannot be written out to disk.
Database Restores
All restore sequences begin with either a full backup or filegroup backup. When restoring
backups, you have the option to terminate the restore process at any point and make the
database available for transactions. After the database or filegroup being restored has
been brought online, you can’t apply any additional differential or transaction log backups
to the database.
When a RESTORE command is issued, if the database does not already exist within
the instance, SQL Server creates the database along with all files underneath the
database. During this process, each file is created and sized to match the file sizes at the
time the backup was created. After it creates the files, SQL Server begins restoring each
database page from the backup.
The REPLACE option is used to force the restore over the top of an existing database.
If you want the database to be online and accessible for transactions after the RESTORE
operation has completed, you need to specify the RECOVERY option. When a RESTORE is
issued with the NORECOVERY option, the restore completes, but the database is left in a
RESTORING state such that subsequent differential and/or transaction log backups can
be applied.
The STANDBY option can be used to allow you to issue SELECT statements against
the database while still issuing additional differential and/or transaction log restores.
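For example, a typical restore sequence (names and paths are illustrative) leaves the database inaccessible until the final backup is restored with RECOVERY:

RESTORE DATABASE SalesDB
FROM DISK = 'D:\Backups\SalesDB_Full.bak'
WITH NORECOVERY;

RESTORE DATABASE SalesDB
FROM DISK = 'D:\Backups\SalesDB_Diff.bak'
WITH NORECOVERY;

RESTORE LOG SalesDB
FROM DISK = 'D:\Backups\SalesDB_Log.trn'
WITH RECOVERY;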
If the differential backup is the most recent backup taken, you can restore without losing
any data by restoring only the full backup and then the differential backup, because the
differential also contains all the changes captured by each of the intervening transaction
log backups.
When restoring a transaction log, you can have SQL Server replay only a portion of
a transaction log by issuing what is referred to as a point-in-time restore. The
STOPAT command allows you to specify a date and time to which SQL Server
restores. The STOPATMARK and STOPBEFOREMARK options allow you to specify
either an LSN or a transaction log MARK to use for the stopping point in the restore
operation.
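For example, the following sketch replays the illustrative log backup only up to a specific time:

RESTORE LOG SalesDB
FROM DISK = 'D:\Backups\SalesDB_Log.trn'
WITH STOPAT = '2008-06-01 14:30:00', RECOVERY;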
Online Restores
A database has a state that governs whether it can be accessed and what operations can be
performed. For example, a database in an ONLINE state allows transactions and any other
operations to be executed, but a database in an EMERGENCY state allows only
SELECT operations to be executed by a member of the sysadmin fixed server role.
Each filegroup within a database can have a state. While one filegroup can be in a
RESTORING state and not be accessible, another filegroup can be in an ONLINE state and
accept transactions. The state of the database equals the state of the filegroup designated
as PRIMARY.
SQL Server 2008 Enterprise allows you to perform restore operations while the database
is still online and accessible. However, because restoring a full backup affects the entire
database, and the state of the database equals the state of the primary filegroup, a database
restoring a full backup is not accessible. To perform an online restore operation, you must
perform a file
or filegroup restore. In addition, you cannot be restoring the primary filegroup or a
file within the primary filegroup. A filegroup restore that affects only a portion of the
database is referred to as a partial restore.
UNIT SEVEN
Overview
In recent years, demand has grown for ways to deal with more complex data
types. Consider, for example, addresses. While an entire address could be viewed
as an atomic data item of type string, this view would hide details such as the street
address, city, state, and postal code, which could be of interest to queries.
With complex type systems we can represent E-R model concepts, such as
composite attributes, multivalued attributes, generalization, and specialization
directly, without a complex translation to the relational model.
We can now use these types to create composite attributes in a relation, by simply
declaring an attribute to be of one of these types. For example, we could create a
table person as follows:
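For instance, assuming composite types Name and Address defined roughly as below (the exact attribute lists are illustrative), the table person can declare attributes of those types:

create type Name as
    (firstname varchar(20),
     lastname  varchar(20))
    final;

create type Address as
    (street  varchar(20),
     city    varchar(20),
     zipcode varchar(9))
    not final;

create table person
    (name        Name,
     address     Address,
     dateOfBirth date);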
We can also create a table whose rows are of a user-defined type. For example, we could
define a type PersonType and create the table person as follows:
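A sketch of this, reusing the illustrative Name and Address types:

create type PersonType as
    (name        Name,
     address     Address,
     dateOfBirth date)
    not final;

create table person of PersonType;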
A structured type can have methods defined on it. We declare methods as part of the type
definition of a structured type:
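For instance, the PersonType sketch above could declare a hypothetical ageOnDate method and then define its body separately:

create type PersonType as
    (name        Name,
     address     Address,
     dateOfBirth date)
    not final
    method ageOnDate(onDate date)
        returns interval year;

create instance method ageOnDate(onDate date)
    returns interval year
    for PersonType
begin
    return onDate - self.dateOfBirth;
end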
Note that the for clause indicates which type this method is for, while the keyword
instance indicates that this method executes on an instance of PersonType. The variable
self refers to the PersonType instance on which the method is invoked.
Methods can be invoked on instances of a type. If we had created a table person of type
PersonType, we could invoke the method ageOnDate() as illustrated below, to find the age
of each person.
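One plausible form of such a query (the correlation name p is illustrative):

select p.name.lastname, p.ageOnDate(current_date)
from person as p;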
Type Inheritance
We may want to store extra information in the database about people who are students,
and about people who are teachers. Since students and teachers are also people, we can
use inheritance to define the student and teacher types in SQL:
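A sketch of the subtype definitions (the additional attributes are illustrative):

create type Student under Person
    (degree     varchar(20),
     department varchar(20));

create type Teacher under Person
    (salary     integer,
     department varchar(20));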
Both Student and Teacher inherit the attributes of Person—namely, name and
address. Student and Teacher are said to be subtypes of Person, and Person is a supertype
of Student, as well as of Teacher.
Methods of a structured type are inherited by its subtypes, just as attributes are. However,
a subtype can redefine the effect of a method by declaring the method again, using
overriding method in place of method in the method declaration.
The SQL standard requires an extra field at the end of the type definition, whose value is
either final or not final. The keyword final says that subtypes may not be created from the
given type, while not final says that subtypes may be created.
Note that the SQL standard does not support multiple inheritance, although future
versions of the SQL standard may support it.
Table Inheritance
We can then define tables students and teachers as subtables of people, as follows:
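A sketch of the declarations, including the parent table for completeness:

create table people of Person;

create table students of Student
    under people;

create table teachers of Teacher
    under people;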
The types of the subtables (Student and Teacher in the above example) are subtypes of the
type of the parent table people (namely, Person). As a result, every attribute present in the
table people is also present in the subtables students and teachers.
Further, when we declare students and teachers as subtables of people, every tuple
present in students or teachers becomes implicitly present in people. Thus, if a query uses
the table people, it will find not only tuples directly inserted into that table, but also tuples
inserted into its subtables, namely students and teachers.
SQL permits us to find tuples that are in people but not in its subtables by using
“only people” in place of people in a query. The only keyword can also be used in delete
and update statements. Without the only keyword, a delete statement on a supertable,
such as people, also deletes tuples that were originally inserted in subtables (such as
students); for example, the statement
delete from people where P;
(where P stands for some predicate) would delete all tuples from the table people, as well
as its subtables students and teachers, that satisfy P.
The following example illustrates how these array and multiset-valued attributes can be
defined in SQL:
create type Publisher as (name varchar(20), branch varchar(20));
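The example presumably continues with a type that uses an array and a multiset; a plausible completion is:

create type Book as
    (title        varchar(20),
     author_array varchar(20) array [10],
     pub_date     date,
     publisher    Publisher,
     keyword_set  varchar(20) multiset);

create table books of Book;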
Note that we used an array, instead of a multiset, to store the names of authors, since the
ordering of authors generally has some significance, whereas we believe that the
ordering of keywords associated with a book is not significant.
If we want to insert the preceding tuple into the relation books, we could execute
the statement:
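A sketch of such a statement, with purely illustrative values:

insert into books
values ('Compilers',
        array['Smith', 'Jones'],
        date '2001-06-01',
        new Publisher('McGraw-Hill', 'New York'),
        multiset['parsing', 'analysis']);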
Object-oriented languages provide the ability to refer to objects. An attribute of a type can
be a reference to an object of a specified type. For example, in SQL we can define a
type Department with a field name and a field head that is a reference to the type
Person, and a table departments of type Department, as follows:
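A sketch of these declarations:

create type Department as
    (name varchar(20),
     head ref(Person) scope people);

create table departments of Department;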
Here, the reference is restricted to tuples of the table people. The restriction of the scope
of a reference to tuples of a table is mandatory in SQL, and it makes references
behave like foreign keys.
We can omit the declaration scope people from the type declaration and instead make an
addition to the create table statement:
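For example:

create type Department as
    (name varchar(20),
     head ref(Person));

create table departments of Department
    (head with options scope people);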
XML
The Extensible Markup Language (XML) was not designed for database applications. In
fact, like the Hyper-Text Markup Language (HTML) on which the World Wide Web is
based, XML has its roots in document management, and is derived from a language for
structuring large documents known as the Standard Generalized Markup Language
(SGML). However, unlike SGML and HTML, XML is designed to represent data.
The term markup refers to anything in a document that is not intended to be part of the
printed output. In electronic document processing, a markup language is a formal
description of what part of the document is content, what part is markup, and what the
markup means.
The title of a document might be marked up as follows:
<title>Database System Concepts</title>
Unlike HTML, XML does not prescribe the set of tags allowed, and the set may be chosen
as needed by each application. This feature is the key to XML’s major role
in data representation and exchange, whereas HTML is used primarily for document
formatting.
<university>
<department>
<dept name> Comp. Sci. </dept name>
<building> Taylor </building>
<budget> 100000 </budget>
</department>
<department>
<dept name> Biology </dept name>
<building> Watson </building>
<budget> 90000 </budget>
</department>
<course>
<course id> CS-101 </course id>
<title> Intro. to Computer Science </title>
<dept name> Comp. Sci </dept name>
<credits> 4 </credits>
</course>
<course>
<course id> BIO-301 </course id>
<title> Genetics </title>
<dept name> Biology </dept name>
<credits> 4 </credits>
</course>
<instructor>
<IID> 10101 </IID>
<name> Srinivasan </name>
<dept name> Comp. Sci. </dept name>
<salary> 65000 </salary>
</instructor>
<instructor>
<IID> 83821 </IID>
<name> Brandt </name>
<dept name> Comp. Sci. </dept name>
<salary> 92000 </salary>
</instructor>
<instructor>
<IID> 76766 </IID>
<name> Crick </name>
<dept name> Biology </dept name>
<salary> 72000 </salary>
</instructor>
<teaches>
<IID> 10101 </IID>
<course id> CS-101 </course id>
</teaches>
<teaches>
<IID> 83821 </IID>
<course id> CS-101 </course id>
</teaches>
<teaches>
<IID> 76766 </IID>
<course id> BIO-301 </course id>
</teaches>
</university>
The document type definition (DTD) is an optional part of an XML document. The
main purpose of a DTD is much like that of a schema: to constrain and type
the information present in the document. However, the DTD does not in fact
constrain types in the sense of basic types like integer or string. Instead, it
constrains only the appearance of subelements and attributes within an element.
The DTD is primarily a list of rules for what pattern of subelements may appear
within an element.
<!DOCTYPE university [
<!ELEMENT university ( (department|course|instructor|teaches)+)>
<!ELEMENT department ( dept name, building, budget)>
<!ELEMENT course ( course id, title, dept name, credits)>
<!ELEMENT instructor (IID, name, dept name, salary)>
<!ELEMENT teaches (IID, course id)>
<!ELEMENT dept name ( #PCDATA )>
<!ELEMENT building ( #PCDATA )>
<!ELEMENT budget ( #PCDATA )>
<!ELEMENT course id ( #PCDATA )>
<!ELEMENT title ( #PCDATA )>
<!ELEMENT credits ( #PCDATA )>
<!ELEMENT IID ( #PCDATA )>
<!ELEMENT name ( #PCDATA )>
<!ELEMENT salary ( #PCDATA )>
]>
The elements course id, title, dept name, credits, building, budget, IID, name, and
salary are all declared to be of type #PCDATA. The keyword #PCDATA indicates
text data; it derives its name, historically, from “parsed character data.”
Two other special type declarations are empty, which says that the element
has no contents, and any, which says that there is no constraint on the subelements
of the element; that is, any elements, even those not mentioned in the DTD, can
occur as subelements of the element. The absence of a declaration for an element is
equivalent to explicitly declaring the type as any.
XML Schema
An effort to redress the deficiencies of the DTD mechanism resulted in the
development of a more sophisticated schema language, XML Schema. We provide a
brief overview of XML Schema, and then we list some areas in which it improves
DTDs.
XML Schema defines a number of built-in types such as string, integer, decimal, date,
and boolean. In addition, it allows user-defined types; these may be simple types
with added restrictions, or complex types constructed using constructors such as
complexType and sequence.
The first thing to note is that schema definitions in XML Schema are themselves specified in
XML syntax, using a variety of tags defined by XML Schema. To avoid conflicts with user-
defined tags, we prefix the XML Schema tag with the namespace prefix “xs:”; this prefix is
associated with the XML Schema namespace by the xmlns:xs specification in the
root element:
<xs:schema xmlns:xs=“http://www.w3.org/2001/XMLSchema”>
Note that any namespace prefix could be used in place of xs; thus we could
replace all occurrences of “xs:” in the schema definition with “xsd:” without changing the
meaning of the schema definition. All types defined by XML Schema must be
prefixed by this namespace prefix.
<xs:schema xmlns:xs=“http://www.w3.org/2001/XMLSchema”>
<xs:element name=“university” type=“universityType” />
<xs:element name=“department”>
<xs:complexType>
<xs:sequence>
<xs:element name=“dept name” type=“xs:string”/>
<xs:element name=“building” type=“xs:string”/>
<xs:element name=“budget” type=“xs:decimal”/>
</xs:sequence>
</xs:complexType>
</xs:element>
<xs:element name=“course”>
<xs:complexType>
<xs:sequence>
<xs:element name=“course id” type=“xs:string”/>
<xs:element name=“title” type=“xs:string”/>
<xs:element name=“dept name” type=“xs:string”/>
<xs:element name=“credits” type=“xs:decimal”/>
</xs:sequence>
</xs:complexType>
</xs:element>
<xs:element name=“instructor”>
<xs:complexType>
<xs:sequence>
<xs:element name=“IID” type=“xs:string”/>
<xs:element name=“name” type=“xs:string”/>
<xs:element name=“dept name” type=“xs:string”/>
<xs:element name=“salary” type=“xs:decimal”/>
</xs:sequence>
</xs:complexType>
</xs:element>
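The definition of universityType that the schema refers to, together with a teaches element and the closing tag, would plausibly look like this:

<xs:element name=“teaches”>
<xs:complexType>
<xs:sequence>
<xs:element name=“IID” type=“xs:string”/>
<xs:element name=“course id” type=“xs:string”/>
</xs:sequence>
</xs:complexType>
</xs:element>
<xs:complexType name=“universityType”>
<xs:sequence>
<xs:element ref=“department” minOccurs=“0” maxOccurs=“unbounded”/>
<xs:element ref=“course” minOccurs=“0” maxOccurs=“unbounded”/>
<xs:element ref=“instructor” minOccurs=“0” maxOccurs=“unbounded”/>
<xs:element ref=“teaches” minOccurs=“0” maxOccurs=“unbounded”/>
</xs:sequence>
</xs:complexType>
</xs:schema>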
XML Schema version of DTD above
The type of department is defined to be a complex type, which is further specified
to consist of a sequence of elements dept name, building, and budget. Any type that has
either attributes or nested subelements must be specified to be a complex type.
Attributes are specified using the xs:attribute tag. For example, we could have defined dept
name as an attribute by adding:
<xs:attribute name = “dept name”/>
In the context of XML, we need to specify a scope within which values are unique and form
a key. The selector is a path expression that defines the scope for the constraint, and field
declarations specify the elements or attributes that form the key. To specify that dept name
forms a key for department elements under the root university element, we add the
following constraint specification to the schema definition:
<xs:key name = “deptKey”>
<xs:selector xpath = “/university/department”/>
<xs:field xpath = “dept name”/>
</xs:key>
Correspondingly a foreign-key constraint from course to department may be defined
as follows:
<xs:keyref name = “courseDeptFKey” refer=“deptKey”>
<xs:selector xpath = “/university/course”/>
<xs:field xpath = “dept name”/>
</xs:keyref>
Note that the refer attribute specifies the name of the key declaration that is being
referenced, while the field specification identifies the referring attributes. XML
Schema offers several benefits over DTDs, and is widely used today.
Among the benefits that we have seen in the examples above are these:
• It allows the text that appears in elements to be constrained to specific types, such
as numeric types in specific formats or complex types such as sequences of elements of
other types.
• It allows user-defined types to be created.
• It allows uniqueness and foreign-key constraints.
• It is integrated with namespaces to allow different parts of a document to conform to
different schemas.
In addition to the features we have seen, XML Schema supports several other features
that DTDs do not, such as these:
• It allows types to be restricted to create specialized types, for instance by specifying
minimum and maximum values.
• It allows complex types to be extended by using a form of inheritance.
XPath
XPath addresses parts of an XML document by means of path expressions. The language
can be viewed as an extension of the simple path expressions in object-oriented and
object-relational databases. The current version of the XPath standard is XPath 2.0, and
our description is based on this version.
A path expression in XPath is a sequence of location steps separated by “/” (instead of
the “.” operator that separates location steps in SQL).
The result of a path expression is a set of nodes. For instance, on the XML document
above, the XPath expression:
/university-3/instructor/name
returns these elements:
<name>Srinivasan</name>
<name>Brandt</name>
The expression:
/university-3/instructor/name/text()
returns the same names, but without the enclosing tags.
Path expressions are evaluated from left to right. Like a directory hierarchy, the initial ’/’
indicates the root of the document. Note that this is an abstract root “above”
<university-3>, which is the document tag.
As a path expression is evaluated, the result of the path at any point consists
of an ordered set of nodes from the document.
Selection predicates may follow any step in a path, and are contained in
square brackets. For example,
/university-3/course[credits >= 4]
returns course elements with a credits value greater than or equal to 4,
while:
/university-3/course[credits >= 4]/@course id
returns the course identifiers of those courses.
The path expression:
/university-2/instructor[count(./teaches/course) > 2]
returns instructors who teach more than two courses. The Boolean connectives and and or
can be used in predicates, while the function not(...) can be used for negation.
XQuery
The World Wide Web Consortium (W3C) has developed XQuery as the
standard query language for XML.
XQuery queries are modeled after SQL queries, but differ significantly from SQL. They are
organized into five sections: for, let, where, order by, and return. They are referred to as
“FLWOR” (pronounced “flower”) expressions, with the letters in FLWOR denoting the five
sections.
A simple FLWOR expression that returns course identifiers of courses with greater than 3
credits, shown below, is based on the XML document shown earlier.
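A sketch of such an expression, following the element-name notation used in the earlier XPath examples:

for $x in /university-3/course
let $courseId := $x/course id
where $x/credits > 3
return <course id> { $courseId } </course id>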
The for clause is like the from clause of SQL, and specifies variables that range over the
results of XPath expressions. When more than one variable is specified, the results include
the Cartesian product of the possible values the variables can take, just as the SQL from
clause does.
The let clause simply allows the results of XPath expressions to be assigned to variable
names for simplicity of representation. The where clause, like the SQL where
clause, performs additional tests on the joined tuples from the for clause.
The order by clause, like the SQL order by clause, allows sorting of the output. Finally, the
return clause allows the construction of results in XML. A FLWOR query need not contain
all the clauses; for example, it may contain only the for and return clauses.
XML Applications
We now outline several applications of XML for storing and communicating (exchanging)
data and for accessing Web services (information resources).
Many applications need to store data that are structured, but are not easily modeled as
relations. XML-based representations are now widely used for storing documents,
spreadsheet data, and other data that are part of office application packages. The Open
Document Format
(ODF), supported by the Open Office software suite as well as other office suites, and the
Office Open XML (OOXML) format, supported by the Microsoft Office suite, are document
representation standards based on XML.
XML is also used to represent data with complex structure that must be exchanged
between different parts of an application. For example, a database system may represent a
query execution plan (a relational-algebra expression with extra information on how
to execute operations) by using XML.
XML-based standards for representation of data have been developed for a variety
of specialized applications, ranging from business applications such as banking and
shipping to scientific applications such as chemistry and molecular biology.
Web Services
Applications often require data from outside of the organization, or from
another department in the same organization that uses a different database. In
many such situations, the outside organization or department is not willing to allow direct
access to its database using SQL, but is willing to provide limited forms of
information through predefined interfaces.
In effect, the provider of the information defines procedures whose input and output are
both in XML format. The HTTP protocol is used to communicate the input and
output information, since it is widely used and can go through firewalls that institutions
use to keep out unwanted traffic from the Internet.
The Simple Object Access Protocol (SOAP) defines a standard for invoking
procedures, using XML for representing the procedure input and output. SOAP defines a
standard XML schema for representing the name of the procedure, and result status
indicators such as failure/error indicators. The procedure parameters and results are
application-dependent XML data embedded within the SOAP XML headers.
The SOAP standard is widely used today. For example, Amazon and Google provide SOAP-
based procedures to carry out search and other activities. These procedures can be
invoked by other applications that provide higher-level services to users.
The SOAP standard is independent of the underlying programming language, and it is
possible for a site running one language, such as C#, to invoke a service that runs in a
different language, such as Java.
A site providing such a collection of SOAP procedures is called a Web service.
Several standards have been defined to support Web services. The Web Services
Description Language (WSDL) is a language used to describe a Web service’s
capabilities. WSDL provides facilities that interface definitions (or function definitions)
provide in a traditional programming language, specifying what functions are available and
their input and output types.
In addition, WSDL allows specification of the URL and network port number to be used to
invoke the Web service. There is also a standard called Universal Description, Discovery,
and Integration (UDDI) that defines how a directory of available Web services may
be created and how a program may search in the directory to find a Web service satisfying
its requirements.
Data Mediation
Once the basic tools are available to extract information from each source, a
mediator application is used to combine the extracted information under a single schema.
This may require further transformation of the XML data from each site, since different
sites may structure the same information differently. They may also use different names for the
same information (for instance, acct number and account id), or may even use the same
name for different information.
The mediator must decide on a single schema that represents all required information, and
must provide code to transform data between different representations. XML query
languages such as XSLT and XQuery play an important role in the task of transformation
between different XML representations.