Introduction To Databases
Databases - Notes
Contents
RELATIONAL MODEL
RELATIONAL DATABASE
RELATIONAL OPERATORS
LOGICAL MODELLING
NORMALISATION
SEQUENCES
OPERATORS
JOINING TABLES
TRANSACTION MANAGEMENT
SUBQUERIES
SQL FUNCTIONS
VIRTUAL TABLES
DATABASE CONNECTIVITY
Relational Model
The relational model was introduced in 1970 and laid down the fundamentals of a relational
DBMS’s basic structure. A Database Management System (DBMS) is the software used to store a
collection of data and define how that data is organised.
• A domain is a set of indivisible values, described by properties such as a name, data type or
data format. It provides the restrictions on size for each data type.
In a tabular representation, the relation heading forms the column headings and the relation body
forms the entries or rows.
For a relation to exist, there must be no duplicate tuples, which means there must be no two
tuples that contain the exact set of all values.
• Tuples are unordered within a relation, and there is no ordering of attributes within a
tuple.
• Tuple values are atomic, meaning they cannot be divided into further elements.
• When comparing, rows and columns are ordered and no tuples are deleted.
A candidate key K of relation R is an attribute or set of attributes which exhibits the following
properties:
• No two tuples of R have the same value for K (uniqueness).
• No proper subset of K has the uniqueness property (minimality). This means there are no
unnecessary attributes chosen for K.
One candidate key is chosen to be the primary key of the relation, but a relation may have
multiple candidate keys. The remaining keys are termed alternate keys. A superkey is an
attribute or combination of attributes which exhibits the uniqueness property but need not be
minimal.
A primary key must be chosen considering the data that may be added to the table in the future.
It has a number of desirable characteristics:
• Non-intelligent - must not have any semantic meaning behind it (i.e. a string of numbers is
preferred)
• No change over time - the primary key is fixed from the moment of tuple creation
When writing relations, the following format is used, with the primary key underlined:
Relational Database
A foreign key is an attribute (or set of attributes) in a table that exists as a primary key in the
same, or another, table. Each value must either match a primary key value in the referenced
table or be NULL. This pairing between primary keys (PK) and foreign keys (FK) creates a
relationship between tables.
To ensure data integrity, PK values must be unique and not be NULL. The values of the FK must
either match a value of the PK in the related relation or be NULL. All values in the column must
come from the same domain (same data type and range).
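These integrity rules can be tried out with a minimal sketch. It assumes Python's built-in
sqlite3 module as a stand-in RDBMS, and the department and employee tables are hypothetical
names chosen only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when asked
conn.execute("CREATE TABLE department (dept_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE employee ("
    " emp_id INTEGER PRIMARY KEY,"
    " dept_id INTEGER REFERENCES department(dept_id))"  # FK, may be NULL
)
conn.execute("INSERT INTO department VALUES (1, 'Sales')")
conn.execute("INSERT INTO employee VALUES (10, 1)")     # matches an existing PK
conn.execute("INSERT INTO employee VALUES (11, NULL)")  # a NULL FK is allowed

try:
    conn.execute("INSERT INTO employee VALUES (12, 99)")  # no department 99
    fk_rejected = False
except sqlite3.IntegrityError:
    fk_rejected = True

employee_count = conn.execute("SELECT COUNT(*) FROM employee").fetchone()[0]
```

Only the two valid rows are stored; the row that points at a non-existent department is refused.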
Relational Operators
Relational operators work similarly to mathematical operators and apply to at most two relations
at a time. They are procedural and can be combined to perform a series of tasks. They are:
• Select
• Project
• Product
• Join
• Union
• Intersection
• Difference
• Division
The project operator (π) returns only the chosen columns of a relation. Given one or more
attributes, it displays the corresponding values of those attributes from all tuples in the relation.
The select operator (σ) returns every tuple that satisfies a given condition, together with all of its
attributes. It is the main operator used in RDBMS systems.
The join operator combines data from two or more relations, based on a common attribute or
attributes:
• The theta join (θ) uses one of the standard comparison operators ( <, ≤, =, ≥, > ) to
connect and display two relations. The condition acts as a boolean test; for each pair of tuples
it is either met or not.
• A natural join (⋈) compares all columns of two tables which have the same column name
and then joins them together and links attributes of the same name (a primary key ID for
example).
• An outer join returns a set of records (or rows) that include what a normal (or inner) join
would include, but it also includes other rows for which no corresponding match is found
in the other table. This can be under three different types:
Example:
Suppose we have the following 4 relations:
This projects the name and address of each tuple found in the Hotel relation. The π
operator is used as the project symbol.
When implementing multiple relations in a single query, the relation name is often used,
followed by .attribute. Here, it selects all tuples that have the appropriate type and price (∩
is the intersection symbol) from the natural join of Hotel and Room (⋈). In this case, a
natural join connects the Hotel-No from Hotel with Room.
Similar to question 1.
4) List the price and type of all rooms at the Grosvenor Hotel
Here we project all rooms, with both the price and type entities, from the selection of all
hotels with the name ‘Grosvenor’. This is selected from the natural join of the two
relations, similar to question 2. Another approach that can be used saves program time if
there are very few tuples with the same hotel, but has no major effect if there are multiple.
5) List all names and addresses of guests currently staying at the Grosvenor Hotel (assume that
if the guest has a tuple in the BOOKING relation, then they are currently staying in the hotel)
The attribute TODAY receives the current date and compares it to the booking date start
and end times. Once again, a natural join connects the relation Hotel with Booking with
the attribute, Hotel-No.
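The project, select and natural join operators used in these answers can be sketched in SQL.
The example below is an illustration only: it assumes Python's built-in sqlite3 module, and the
hotel and room schemas and their sample rows are hypothetical, since the original relation
definitions are not reproduced here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schemas and data, invented for this sketch.
conn.execute("CREATE TABLE hotel (hotel_no INTEGER PRIMARY KEY, name TEXT, address TEXT)")
conn.execute("CREATE TABLE room (room_no INTEGER, hotel_no INTEGER, type TEXT, price REAL)")
conn.executemany("INSERT INTO hotel VALUES (?, ?, ?)",
                 [(1, "Grosvenor", "London"), (2, "Plaza", "Leeds")])
conn.executemany("INSERT INTO room VALUES (?, ?, ?, ?)",
                 [(101, 1, "double", 80.0), (102, 1, "single", 50.0),
                  (201, 2, "double", 60.0)])

# Project: keep only the name and address columns of every hotel tuple.
names = conn.execute("SELECT name, address FROM hotel ORDER BY hotel_no").fetchall()

# Natural join on the shared hotel_no column, select the Grosvenor tuples,
# then project the price and type attributes.
grosvenor_rooms = conn.execute(
    "SELECT price, type FROM hotel NATURAL JOIN room "
    "WHERE name = 'Grosvenor' ORDER BY room_no"
).fetchall()
```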
When designing a database, there is a sequence of steps that is usually followed before the
product is completed. They are:
The requirements definition identifies and analyses user views. A user view may be a report to
be produced or a particular type of transaction that should be supported. The output is a
statement of specifications which describes the user views’ particular requirements and
constraints.
During the conceptual design step, the data model of the database is designed. There are
various methodologies that can be employed, but the most common one is the ER diagram.
‣ Entity
‣ Attribute
‣ Relationship
• They can be used to display both the keys and/or attributes of each table and how they
are connected by relationships.
The logical design develops a data model which targets a particular database model (e.g.
relational, hierarchical, network, etc.) and is independent of any particular DBMS implementation
package. Normalisation techniques are used to test the correctness of the logical design.
The physical design process develops a strategy for the physical implementation of the logical
data model. It is dependent upon the DBMS environment chosen to be used.
Entity Relationship Diagrams
An ER diagram (ERD) connects relations and attributes in a visual way. It uses the keys defined
for each relation to create a relationship and is often employed during the conceptual or logical
stages of the database design process.
A connection links two or more relations together and is often connected with a key. There are
multiple types of connections in an ERD:
• A one-to-one connection states that a single unique primary key is connected to another
relation with the same unique primary key (often foreign). For example,
• A one-to-many connection will have a unique key in one relation that is connected to an
attribute in another table with the same value. For example,
An entity is an object in the system that is modelled and about which information is stored. In the
physical design of a database, an entity is represented by a table. Entities can also have different
properties:
• A strong entity has a key which may be defined without reference to other entities. For
example an EMPLOYEE entity does not require other entities to make sense.
• A weak entity has a key which requires the existence of one or more other entities. For
example, a FAMILY entity must include the key from EMPLOYEE to identify a
family. It is dependent on another entity and its primary key is partially, or totally, derived
from its parent entity.
• A non-identifying relationship is when the primary key attributes of the parent must not
become primary key attributes of the child. Another example is an OWNER can own a
BOOK, but the BOOK can exist without the OWNER. It is shown with a dotted line:
For each attribute, the domain specifies the set of all possible values. There are also many types
of attributes, each with a unique purpose:
• Derived attributes can be derived with an algorithm and do not need to be stored:
Logical Modelling
Cardinality describes the uniqueness of the data and often refers to the relationships between
two entities. In ER diagrams, there is a set of relationship connectors drawn using Crow’s Foot
symbols.
• A circle at the terminal of the relationship line indicates that that particular entity is not
required for the relationship of a particular tuple to exist.
At different levels, there are different terminologies for terms like relationships and entities.
Conceptual Logical Physical
Relationship - -
• Each entry is atomic; this means each cell can only contain one entry
Mapping is the process of transferring conceptual data, in the form of entities and attributes, into
logical models.
Mapping a composite attribute into a relation requires only its simple components to be included.
This improves data accessibility and helps maintain data quality. For example, ADDRESS could
be mapped into STREET, CITY, STATE and ZIP, as four separate attributes.
• When the regular entity type contains a multivalued attribute, two new relations are
created.
• The first relation contains all the attributes of the entity type except the multivalued
attribute itself.
• The second relation contains two new attributes that form the primary key. One of the
attributes is the PK from the first relation, which becomes the foreign key (FK) in the
second relation and the other is the multivalued attribute.
To map a weak entity, create a new relation and include all the simple attributes in the relation.
The PK of the identifying relation is included as a FK in the weak relation. For example:
To map a binary relationship, such as a one-to-many (1:M) relationship, first create the relation for
each of the two entity types participating in the relationship. Then include the PK attribute(s) of the
entity on the one side of the relationship as a FK in the relation on the many side.
A many-to-many (M:N) relationship cannot be mapped with a single FK; instead, a third relation is
created whose primary key combines the PKs of the two participating relations, each acting as a FK.
One-to-one (1:1) relationships mean that every single tuple in one relation is connected to, at most,
one tuple in another relation and vice versa.
• The primary key on the mandatory side of the relationship becomes the foreign key on the
optional side of the relationship.
• Where both sides are optional, place the FK on the side which causes the fewest NULL
values.
• If both sides of the relationship are mandatory, then it is likely the two entities can be
merged into one relation.
Unary relationships are ones where a relationship exists between two tuples within the same
relation. For example, an EMPLOYEE could supervise multiple EMPLOYEES. If this is the case:
• Add a FK within the same relation that references the PK of the relation
• A recursive foreign key is a FK in a relation that references the PK values of the same
relation.
Ternary relationships (those that exist between three entities) must be turned into three separate
binary relationships between the three relations. Often, another relation can be created to help
facilitate this. For example, a ternary relationship PATIENT TREATMENT between the three
entities PATIENT, PHYSICIAN and TREATMENT could become the following logical
relation:
Normalisation
Normalisation is a process that assigns attributes to entities so that data redundancies are
reduced or eliminated. It corrects table structure to reduce the likelihood of data anomalies
occurring. In databases, this occurs in different normal forms, namely 1NF, 2NF and 3NF (there are
others, but they become increasingly complicated and require more and more joins).
Normalisation operates on the logical level.
Denormalisation is the process of reducing the normal form of the database to account for
performance requirements, whilst still producing the desired output. For example, 3NF could be
converted to 2NF if needed.
• An update anomaly exists when one or more instances of duplicated data is updated, but
not all. For example, in a table with duplicate entries of the same person, if the home
address is changed on one person, it must also be changed on all the other entries with
the same person.
• An insert anomaly occurs when certain attributes cannot be inserted into the database
without the presence of other attributes.
• A delete anomaly occurs when the deletion of one attribute causes a loss of data
from other attributes.
The object of normalisation is to produce a set of relations and data that conform to the following
properties:
• Each table represents a single subject - for example, COURSE will only contain
information about the courses, not the students doing them.
• No data item will be unnecessarily stored in more than one table - this stops update
anomalies from occurring.
• All nonprime attributes in a table are dependent on the primary key - the entire primary key
and only the primary key. This ensures data is able to be uniquely identified.
‣ A prime attribute is a key attribute, usually associated with the primary key.
• Each table is void of insertion, update and delete anomalies - this ensures the integrity of
the database is maintained and data is consistent.
• The primary key is the candidate key (the minimal, irreducible, superkey) selected to
identify the rows of each table.
Dependency is a property that describes the extent to which the value of one attribute determines
the value of another attribute:
• A transitive dependency exists when there are functional dependencies such that X → Y,
Y → Z and X is the primary key. This means that X determines Z via Y. The transitive
dependency in this case is X → Z. It is a condition in which an attribute is dependent on
another attribute that is not part of the primary key.
• There are no repeating groups in the table (each row/column intersection contains one and
only one value, not a set of values).
• All attributes are dependent on the primary key.
1NF holds all data in a single table. As such, all relationship tables must satisfy the 1NF
requirements. To convert to 1NF the following steps must be taken:
‣ A repeating group is a group of multiple entries of the same type that can exist for
any single key attribute occurrence. For example, a car can have multiple colours for
its top, interior, bottom, trim, etc.
‣ Start by representing the data into a tabular format, where each cell contains a single
value and there are no repeating groups.
‣ Eliminate nulls by making sure each group has a suitable data value.
‣ An adequate candidate key must be chosen to be the primary key that can uniquely
identify all tuples in a table.
‣ A dependency diagram maps out all data dependencies (primary key, partial and
transitive) that occur within a table structure.
However, the problem with 1NF is that it can still contain partial dependencies, which are based
on only part of the primary key. This means it is still subject to data anomalies.
• Is in 1NF.
Conversion to 2NF occurs only when the 1NF has a composite primary key. If the 1NF has a
single-attribute primary key, then the table is already in 2NF. If the primary key is composite, the
following steps are taken:
‣ For each component of the primary key that acts as a determinant in a partial
dependency, create a new table with a copy of that component as the primary key.
‣ These components must be copied to the new table, but must still exist in the old
table, where they will become foreign keys.
‣ The attributes that are dependent in a partial dependency are removed from the
original table and placed in the new table with the dependency’s determinant.
‣ All attributes that are not involved in the partial dependency are left in the old
table.
At this point, most anomalies are removed, as duplicate items no longer exist. However, it is still
possible for transitive dependencies to exist in 2NF through various table joins. This means the
primary key may rely on one or more nonprime attributes to functionally determine other nonprime
attributes, as indicated by a functional dependence among the nonprime attributes.
• Is in 2NF.
‣ For every transitive dependency, write a copy of its determinant as a primary key for a
new table.
‣ A determinant is any attribute whose value determines other values within a row. For
example, if you have three different transitive dependencies, you have three different
determinants.
‣ The determinants are still required to be within the original table as a foreign key.
‣ Place the dependent attributes in the new tables with their determinants and remove
them from their original tables.
Similarly to the creation of 2NF, when there is a dependency (whether partial or transitive), the
solution is to create a new table with the dependent attributes and to connect it with the original
via a primary-foreign key pair. It is important that 2NF is achieved before 3NF can be achieved.
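The conversion from 2NF to 3NF can be illustrated with a small sketch in plain Python. The
fd_holds helper and the employee data are hypothetical and only meant to show the idea: a
functional dependency between nonprime attributes is detected, and its determinant becomes
the key of a new table while staying behind as a foreign key.

```python
def fd_holds(rows, lhs, rhs):
    """Return True if every combination of lhs values maps to one rhs value."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        value = tuple(row[a] for a in rhs)
        if seen.setdefault(key, value) != value:
            return False
    return True

# Hypothetical 2NF table: emp_id -> dept_id and dept_id -> dept_name,
# so emp_id -> dept_name is a transitive dependency.
employees = [
    {"emp_id": 1, "dept_id": 10, "dept_name": "Sales"},
    {"emp_id": 2, "dept_id": 10, "dept_name": "Sales"},
    {"emp_id": 3, "dept_id": 20, "dept_name": "IT"},
]
transitive = fd_holds(employees, ["dept_id"], ["dept_name"])

# 3NF decomposition: the determinant dept_id becomes the PK of a new table
# and remains in the original table as a foreign key.
department = {row["dept_id"]: row["dept_name"] for row in employees}
employee = [{"emp_id": r["emp_id"], "dept_id": r["dept_id"]} for r in employees]
```

After the split, each department name is stored once, so changing it cannot cause an update anomaly.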
Structured Query Language (SQL) is both a data definition language (DDL) for creating databases,
tables, indexes and views, and a data manipulation language (DML) for inserting, updating,
deleting and retrieving data in the database. It has a basic command set of under 100 unique
commands. In SQL, a query covers both a question asked of the database and an action done to it;
this could be creating a table or retrieving a set of cells.
Using a standard RDBMS, you must be authenticated before tables can be created.
Authentication is the process the DBMS uses to verify only registered users can access the data.
This usually is encompassed by a username/password login.
A schema is a logical group of database objects, such as tables and indexes, that are related to
each other. Usually, a schema belongs to a single user or application. A single database can hold
multiple schemas that belong to different users. They allow the database to group tables by
owner or function and enforce a level of security by allowing each user to only see the tables that
belong to their particular schema. The following is the SQL code for creating a schema.
Data types are required to be selected for each column, and these are strict; all cells in each
column must adhere to the correct data type. For names and text, varchars are used, while for
numeric values, integers or decimals could be used. It depends on the use of the variable, and
this does not change after the creation of the table. Below are some examples of different data
types supported by SQL.
Some other types not included in the above table include TIME, TIMESTAMP, DOUBLE,
CURRENCY and LOGICAL.
To create a table using SQL, the CREATE command is used, and each column must be specified
with both a name and its appropriate data type.
column2 INTEGER,
);
In the example above, the table is created with four columns with the following properties:
• The NOT NULL specifies that a data entry must be made for that particular field. Validation
in the program end can be added such that a value must be entered.
• The UNIQUE specification creates a unique index on that attribute. It prevents
duplicated values in that specific column.
• The primary key attributes contain both a not null and a unique specification.
• The REFERENCES key word connects a column with a column of the same name in
another table.
• ON UPDATE CASCADE allows the table to be updated correctly if the value in the
connected table of the foreign key is changed. Some RDBMS programs do not support
this command.
• The entire column list is enclosed in parentheses and the column definitions are separated by
commas.
• The entire command sequence ends with a semicolon (usually, but depends on the
RDBMS program being used).
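The column specifications described above can be seen together in a short sketch. It assumes
Python's built-in sqlite3 module rather than the RDBMS used in these notes, and the tables
T_Other and T_Table and their columns are hypothetical; SQLite additionally needs a PRAGMA
before it enforces foreign keys.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # needed for SQLite to enforce FKs
conn.execute("CREATE TABLE T_Other (column4 INTEGER PRIMARY KEY)")
conn.execute("""
    CREATE TABLE T_Table (
        column1 INTEGER PRIMARY KEY,          -- PK: unique and never NULL
        column2 INTEGER,
        column3 VARCHAR(20) NOT NULL UNIQUE,
        column4 INTEGER REFERENCES T_Other(column4) ON UPDATE CASCADE
    );
""")
conn.execute("INSERT INTO T_Other VALUES (1)")
conn.execute("INSERT INTO T_Table VALUES (1, 5, 'abc', 1)")

# Changing the referenced key cascades into the referencing table.
conn.execute("UPDATE T_Other SET column4 = 2 WHERE column4 = 1")
cascaded = conn.execute("SELECT column4 FROM T_Table").fetchone()[0]

try:
    conn.execute("INSERT INTO T_Table VALUES (2, 6, 'abc', 2)")  # duplicate column3
    unique_enforced = False
except sqlite3.IntegrityError:
    unique_enforced = True
```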
In table names and column names, reserved keywords may not be used. Reserved keywords are
words that are used by SQL to perform specific functions, such as UPDATE or SUM.
Constraints are a set of rules that help protect the integrity of the database, which is crucial. The
foreign key is constrained with the ON UPDATE specification and determined by the table it is
referencing. A change in one table must be reflected automatically in a connected table. Besides
the PK and FK constraints, the ANSI SQL standard defines the following constraints to exist as
well:
• The NOT NULL constraint ensures that a column does not accept null values.
• The UNIQUE constraint ensures that all values in a column are unique.
• The DEFAULT constraint assigns a value to an attribute when a new row is added to a
table. The end user may, of course, enter a value other than the default value.
• The CHECK constraint is used to validate data when an attribute is entered, such as
checking for a minimum value or maximum date. The data for a check constraint is only
accepted if it meets the appropriate condition.
For example, the following SQL could be made for the creation of a table:
column2 INTEGER,
);
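A minimal sketch of the DEFAULT, NOT NULL and CHECK constraints follows. It is not the
original example from these notes: it assumes Python's built-in sqlite3 module, and the column
definitions and the rejected helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE T_Table (
        column1 INTEGER PRIMARY KEY,
        column2 INTEGER DEFAULT 0,                    -- used when no value is given
        column3 INTEGER NOT NULL CHECK (column3 >= 1)
    )
""")
conn.execute("INSERT INTO T_Table (column1, column3) VALUES (1, 5)")
default_value = conn.execute("SELECT column2 FROM T_Table").fetchone()[0]

def rejected(sql):
    """Return True if the statement violates a constraint."""
    try:
        conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True

check_rejected = rejected("INSERT INTO T_Table (column1, column3) VALUES (2, 0)")
null_rejected = rejected("INSERT INTO T_Table (column1, column3) VALUES (3, NULL)")
```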
To insert a row into a particular table, SQL uses the INSERT command. Below is an example of
how to add a new row, with N number of columns:
In some cases, assuming the column does not have a NOT NULL constraint, it may be needed to
not enter a particular value into a column. Here, the NULL keyword can be added:
If only some of the values are required to be inserted, and the others are to be left as empty, then
only a selected amount of columns can have their values inserted into. In the example below, the
only columns that have a value inserted are column1 and column3. The rest are left null.
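The three forms of INSERT described above can be sketched as follows, assuming Python's
built-in sqlite3 module and a hypothetical three-column T_Table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE T_Table (column1 INTEGER, column2 TEXT, column3 TEXT)")

# Full row: one value per column, in column order.
conn.execute("INSERT INTO T_Table VALUES (1, 'AAA', 'x')")
# Explicit NULL for a column that has no NOT NULL constraint.
conn.execute("INSERT INTO T_Table VALUES (2, NULL, 'y')")
# Only column1 and column3 are named; column2 is left NULL.
conn.execute("INSERT INTO T_Table (column1, column3) VALUES (3, 'z')")

rows = conn.execute("SELECT * FROM T_Table ORDER BY column1").fetchall()
```

Python's sqlite3 module reports SQL NULL values as None.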
In many RDBMSs, any changes made to a table’s contents are not permanently saved to disk until
they have been committed. The COMMIT command saves all work:
COMMIT;
Commit commands also update the integrity of all data inserted, updated or deleted.
To produce a query, the SELECT command organises data from one or more tables and displays it
in a view. For example, to list the entire contents from a particular table, the following SQL
command will be executed:
The (*) is a wildcard character and means ‘all’. So the above command will select all rows from
the table and output them in a view. In contrast, to only view a couple of columns from a table, the
following command could be executed:
The FROM clause of the query specifies which table or tables the data is to be retrieved from.
Once data is inserted, the UPDATE command can be used to modify data:
In the example above, the query will check every row in the table where the value of column3 has
a value of ‘3’ in the particular row, and in those rows only it will update the value of column2 to
‘AAA’. Similarly, updates can be made to multiple columns at once:
UPDATE T_Table SET column2 = 'AAA', column4 = 'BBB' WHERE column3 = '3';
If the new data has not been committed yet, the ROLLBACK command undoes any changes
made to the database and returns the state back to the last commit.
ROLLBACK;
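The interplay of COMMIT and ROLLBACK can be sketched with Python's built-in sqlite3
module (an assumption; other RDBMSs expose the same commands directly in SQL). The
commit() and rollback() methods issue the corresponding SQL commands, and T_Table is a
hypothetical two-column table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE T_Table (column2 TEXT, column3 TEXT)")
conn.execute("INSERT INTO T_Table VALUES ('old', '3')")
conn.commit()    # COMMIT: the inserted row is now permanent

conn.execute("UPDATE T_Table SET column2 = 'AAA' WHERE column3 = '3'")
during = conn.execute("SELECT column2 FROM T_Table").fetchone()[0]

conn.rollback()  # ROLLBACK: undo everything since the last commit
after = conn.execute("SELECT column2 FROM T_Table").fetchone()[0]
```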
The DELETE statement can be used to delete a row from a table. For the example below, every
row which contains a ‘3’ in column3 will be removed:
Alternatively, to delete every row from a table, the where clause is not needed:
Note, this command does NOT delete the entire table, just the contents of it.
Data can be inserted into a table from the result of a select query, known as a subquery. A subquery
is embedded (or nested) inside another query, and is also known as a nested query or an inner query. For
example, all data in a particular column from a table can be added to another table with the
command:
The subquery can be as complicated as needed and the insert line will still work appropriately:
INSERT INTO T_TableNEW SELECT column1 FROM T_Table WHERE column3 = '3';
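A runnable sketch of this insert, assuming Python's built-in sqlite3 module and hypothetical
sample rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE T_Table (column1 TEXT, column3 TEXT)")
conn.execute("CREATE TABLE T_TableNEW (column1 TEXT)")
conn.executemany("INSERT INTO T_Table VALUES (?, ?)",
                 [("a", "3"), ("b", "4"), ("c", "3")])

# The SELECT produces the rows; the INSERT stores them in the new table.
conn.execute("INSERT INTO T_TableNEW SELECT column1 FROM T_Table WHERE column3 = '3'")
copied = conn.execute("SELECT column1 FROM T_TableNEW ORDER BY column1").fetchall()
```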
Any changes to the table’s structure can be made using the ALTER TABLE command, followed
by a clause of the respective changes. There are three main alterations that can be made; add,
modify and drop.
The ADD command adds one or more new columns to the table.
The column will be added to the table and, unless a default clause is given, its value will
be NULL for all existing rows. For this reason a NOT NULL clause should not be given to the new
column without a default, as the existing rows would violate it and an error message is given.
Other column properties can be added to the new column, and multiple columns can be added too:
ALTER TABLE T_Table ADD (column6 CHAR(1) DEFAULT 'A', column7 CHAR(2));
The MODIFY command can change the properties of a particular column in a table. For example,
to change the datatype of a column, the following command could be executed:
In most cases, this is only allowed if the column being changed is already empty. If it has data in
it, only adjustments to the length of data can be made. For example, the amount of decimal digits
displayed in a field can be updated:
The DROP command removes a column from a table. Columns that are involved in a foreign key
relationship cannot be dropped, nor can a column that is the only one in its table.
This command can also be used to remove an entire table from the database. A table can only
be dropped if it is not on the ‘one’ side of a relationship, as foreign keys in other tables would
otherwise still reference it:
Sometimes it is necessary to break a table up into smaller tables, and SQL provides a way that
avoids manual copying of data. This is known as a partial table.
However, when creating a new table based on another table, the foreign and primary keys are not
added. To define a primary key, use:
Similarly, foreign keys can also be added and reference other tables.
Sequences
Oracle does not support AutoNumber data types for creating an auto generated primary key. A
sequence can be used to assign values to a column on a table. A basic sequence can be created
by:
The .nextval command retrieves the next value in the sequence specified and then saves the new
one to the value for the next insert to use. Once the sequence value is used, it cannot be used
again, even if the previous sequenced row has been deleted from the table. You can drop a
sequence from a database using:
Dropping a sequence does not remove values from a table that previously used the sequence
numbers.
Operators
A partial table can be created by restricting what has been selected from one or more other
tables. The WHERE clause is used to add conditional restrictions. If no rows match the
conditions, the output table will be empty. For example:
SQL compares text values character by character, so the comparison operators can also be applied
to text and characters. For example, selecting all values where a column < ‘C’ would select all text
that begins with A or B. The string characters are compared from left to right, meaning that the
word Be would have a greater value than the word Adjudication. Hence, if numbers are stored in a
text field, the value 5 will be interpreted as being greater than 44.
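The left-to-right comparison of text can be checked with a short sketch, assuming Python's
built-in sqlite3 module (whose default text comparison is byte by byte) and a hypothetical
word column:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE T_Table (word TEXT)")
conn.executemany("INSERT INTO T_Table VALUES (?)",
                 [("Adjudication",), ("Be",), ("Cat",), ("5",), ("44",)])

# Text is compared character by character from the left.
before_c = conn.execute(
    "SELECT word FROM T_Table WHERE word < 'C' AND word >= 'A' ORDER BY word"
).fetchall()

# As text, '5' sorts after '44' because '5' > '4' at the first character.
text_compare = conn.execute("SELECT '5' > '44'").fetchone()[0]
number_compare = conn.execute("SELECT 5 > 44").fetchone()[0]
```

SQLite reports the result of a comparison as 1 (true) or 0 (false).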
Comparison operators can also be applied to dates, written here in the dd/mm/yyyy format. For example:
If columnDate is a date formatted field, then the above three SQL queries will output the same
result.
Additional columns can be created from expressions. For example, if you wanted to multiply the
value of one column by another column in a table, the following could be done:
The output of this example would produce two columns; column1 values for each row and an
additional column with the new expression associated with each row. These are known as
computed columns or alias. An alias is an alternative name for a column or table in an SQL
statement. By default, this computed column will have a name such as ‘column2 * column3’, but
that can be manually changed using the AS keyword followed by the new name of the column:
The rules of precedence are the rules that establish the order at which computations are
calculated. The operations are computed in the following order:
• Parentheses ( )
• Power Operations
For multiple conditions, the logical operators are used to combine conditions into one larger
statement. The OR operator will return a row if at least one condition is met. For example:
In this query, the rows outputted will either have a column1 value greater than 2 or column2
having a value of 45 or both. There is no requirement of both conditions to be met, and it is not
exclusive - meaning both conditions are allowed to be met.
The AND operator requires all of the conditions to be met for the row to be outputted. For
example:
SELECT * FROM T_Table WHERE column1 > '2' AND column2 = '45' AND column3 = 'A';
In this example, all three conditions must be satisfied to be used in the query output. The AND
and OR conditions can also be combined:
SELECT * FROM T_Table WHERE column1 > '2' AND column2 = '45' OR column3 = 'A';
By default, AND is evaluated before OR, so in this example, the column1 and column2
conditions are grouped. To group the column2 and column3 conditions, parentheses can be used.
SELECT * FROM T_Table WHERE column1 > '2' AND (column2 = '45' OR column3 = 'A');
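The effect of the parentheses can be verified with a small sketch. It assumes Python's built-in
sqlite3 module, numeric columns (so the values are written without quotes) and hypothetical
sample rows; the count helper is only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE T_Table (column1 INTEGER, column2 INTEGER, column3 TEXT)")
conn.executemany("INSERT INTO T_Table VALUES (?, ?, ?)",
                 [(1, 0, "A"), (3, 45, "B"), (3, 0, "A"), (3, 0, "B")])

def count(where):
    """Count the rows of T_Table that satisfy the given WHERE condition."""
    return conn.execute(f"SELECT COUNT(*) FROM T_Table WHERE {where}").fetchone()[0]

# AND binds tighter than OR: (column1 > 2 AND column2 = 45) OR column3 = 'A'.
ungrouped = count("column1 > 2 AND column2 = 45 OR column3 = 'A'")
# Parentheses force the OR to be evaluated first.
grouped = count("column1 > 2 AND (column2 = 45 OR column3 = 'A')")
```

The row (1, 0, 'A') is matched only by the ungrouped form, which is why the two counts differ.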
The NOT operator negates a condition. For example, with a condition such as NOT (column1 = '3'),
all rows where column1 has a value excluding ‘3’ will be selected.
The BETWEEN operator can check if an attribute has a value within a range of two values. For
example:
Some databases do not support the BETWEEN operator. In this case, the following query, which
assumes a range of 50 to 100 with both end points included, is equivalent:
SELECT * FROM T_Table WHERE column1 >= 50 AND column1 <= 100;
To check for a null attribute value, the IS NULL keyword can be used. For example, the following
query can check all rows for a null value and update it with an actual value:
It is important that checking to see if a value is equal to ‘NULL’ is not used, as ‘NULL’ is not a
specific value, but rather a property of the cell.
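The difference between IS NULL and an equality test against NULL can be sketched as follows,
assuming Python's built-in sqlite3 module and hypothetical sample rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE T_Table (column1 INTEGER, column2 TEXT)")
conn.executemany("INSERT INTO T_Table VALUES (?, ?)",
                 [(1, "x"), (2, None), (3, None)])

# '= NULL' never evaluates to true, so it matches no rows.
equals_null = conn.execute(
    "SELECT COUNT(*) FROM T_Table WHERE column2 = NULL").fetchone()[0]
# IS NULL tests for the absence of a value.
is_null = conn.execute(
    "SELECT COUNT(*) FROM T_Table WHERE column2 IS NULL").fetchone()[0]

# Replace every missing value with an actual one.
updated = conn.execute(
    "UPDATE T_Table SET column2 = 'unknown' WHERE column2 IS NULL").rowcount
```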
The LIKE operator is used in conjunction with wildcards to find patterns within string attributes.
SQL allows the wildcards % (any sequence of characters) and _ (any single character) to be used
with the LIKE operator; some products, such as MS Access, use * and ? instead.
Keep in mind, string comparisons are case sensitive in many DBMSs, so ‘J%’ will not yield the
value ‘jim’. To fix this, the UPPER (or alternatively LOWER) functions can transform the
characters in the string to upper (or lower) case characters.
In the expression above, a value of ‘jim’ in the column1 field will return a true condition. The
conditional operations (NOT, OR and AND) can also be used in conjunction with the LIKE syntax.
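A short sketch of LIKE with the % and _ wildcards follows, assuming Python's built-in sqlite3
module and hypothetical names. Note that SQLite's LIKE already ignores case for ASCII letters
by default, so the UPPER call is redundant there; it is kept because case-sensitive systems
need it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE T_Table (column1 TEXT)")
conn.executemany("INSERT INTO T_Table VALUES (?)",
                 [("jim",), ("Jane",), ("Tim",), ("Bob",)])

def names(where):
    """Return the column1 values that satisfy the given WHERE condition."""
    sql = f"SELECT column1 FROM T_Table WHERE {where} ORDER BY column1"
    return [row[0] for row in conn.execute(sql)]

# % matches any run of characters; UPPER makes the test independent of case.
starts_with_j = names("UPPER(column1) LIKE 'J%'")
# _ matches exactly one character.
three_letter_im = names("UPPER(column1) LIKE '_IM'")
# LIKE can be combined with NOT, AND and OR.
not_j = names("NOT (UPPER(column1) LIKE 'J%')")
```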
Many queries that require multiple OR operators to check if a value is in a set of values can be
replaced with the IN operator. This operator will return true if a value exists in a set of fixed values.
SELECT * FROM T_Table WHERE column1 IN ('2', '3', '5', '7', '11');
Only if the value of column1 is equal to one of these elements will the condition yield true. The IN
operator is particularly useful to check if a row exists in a subquery created. For example:
The EXISTS operator can be used to check if a set of rows exist in a subquery. For example:
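Both operators can be sketched against a subquery as follows. The example assumes Python's
built-in sqlite3 module; T_Other and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE T_Table (column1 INTEGER, column2 TEXT)")
conn.execute("CREATE TABLE T_Other (column1 INTEGER)")
conn.executemany("INSERT INTO T_Table VALUES (?, ?)", [(2, "a"), (4, "b"), (7, "c")])
conn.executemany("INSERT INTO T_Other VALUES (?)", [(7,), (9,)])

# IN with a subquery: keep rows whose column1 appears in T_Other.
in_rows = conn.execute(
    "SELECT column2 FROM T_Table WHERE column1 IN (SELECT column1 FROM T_Other)"
).fetchall()

# EXISTS: true when the correlated subquery returns at least one row.
exists_rows = conn.execute(
    "SELECT column2 FROM T_Table t WHERE EXISTS "
    "(SELECT 1 FROM T_Other o WHERE o.column1 = t.column1)"
).fetchall()
```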
SQL provides useful functions that can count, find the minimum or maximum values, calculate
averages and so on. The ORDER BY clause is useful when the listing order is important. Although
both ascending and descending order can be selected, ascending order is used by
default.
To produce a list in descending order, the DESC keyword can be used (ASC for ascending but is
usually unnecessary).
A cascading order sequence is a nested ordering sequence for a set of rows, such as a list in
which all last names are alphabetically ordered and, within each last name, the first names are
ordered. For this example:
The order in which the column names are entered in the ORDER BY clause is the nested order
that the rows will be arranged.
To select all distinct values in a table that exist, the DISTINCT clause can be used. For example, if
two entries had a value of ‘Jim’ in column1, only one Jim will be outputted:
The aggregate functions perform calculations on a set of rows. The COUNT function
creates a tally of the number of non-null values that a query outputs. For example:
This query will output just the tally of unique column1 values. It will not include the actual
values in the column. A field does not have to be given as a parameter: COUNT(*) tallies all rows,
which matches counting the primary key field, as that field has no null values.
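The difference between DISTINCT, COUNT(*), COUNT(column) and COUNT(DISTINCT column) can be sketched as follows. This assumes SQLite (through the sqlite3 module of Python) as a stand-in DBMS; the id column and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE T_Table (id INTEGER PRIMARY KEY, column1 TEXT);
    INSERT INTO T_Table (column1) VALUES ('Jim'), ('Jim'), ('Ann'), (NULL);
""")

# DISTINCT collapses the two 'Jim' entries into one.
distinct_rows = conn.execute(
    "SELECT DISTINCT column1 FROM T_Table "
    "WHERE column1 IS NOT NULL ORDER BY column1"
).fetchall()

# COUNT(*) tallies rows; COUNT(column1) skips NULLs;
# COUNT(DISTINCT column1) tallies unique non-null values.
counts = conn.execute(
    "SELECT COUNT(*), COUNT(column1), COUNT(DISTINCT column1) FROM T_Table"
).fetchone()

print(distinct_rows)
print(counts)
```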
There are other functions that can do similar calculations, as seen in the table below:
The MIN and MAX functions return the lowest (or highest) value of a specified column.
As with the COUNT function, the table values are not included, just the output of the function.
In the same example, if it is required to select the entire row (rather than just the minimum value)
of the entry that has the minimum value for a particular field, a nested query that computes the
minimum must be used in the WHERE condition.
SELECT * FROM T_Table WHERE column1 = ( SELECT MIN( column1 ) FROM T_Table );
The SUM function returns the total value of a particular column across all rows queried. It can
be combined with the AS clause to name the output. For example:
The AVG function is used in the same way and finds the mean value of a specific column.
Rows can be grouped into smaller collections using the GROUP BY
clause. It is generally used when columns are combined with aggregate functions in the
SELECT statement. For example, to determine the minimum value of all rows with distinct values
in a particular column:
This will output the minimum value for each set of distinct values. It is important to note that
every selected column that is not inside an aggregate function must appear in the GROUP BY clause.
A particularly useful clause that accompanies the GROUP BY clause is the HAVING clause. It is
applied to the output of the GROUP BY operation and restricts which groups are selected.
For example:
SELECT column1, AVG( column2 ) FROM T_Table GROUP BY column1 HAVING AVG( column2 ) < 10;
In this case, only the groups whose average column2 value is less than 10 will be
selected.
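A runnable sketch of GROUP BY with and without HAVING, assuming SQLite (through the sqlite3 module of Python) as a stand-in DBMS. The sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE T_Table (column1 TEXT, column2 INTEGER);
    INSERT INTO T_Table VALUES ('a', 4), ('a', 8), ('b', 10), ('b', 20);
""")

# One output row per distinct column1 value, with aggregates per group.
grouped = conn.execute("""
    SELECT column1, MIN(column2), AVG(column2)
    FROM T_Table GROUP BY column1 ORDER BY column1
""").fetchall()

# HAVING filters the groups after aggregation (WHERE filters rows before it).
filtered = conn.execute("""
    SELECT column1, AVG(column2)
    FROM T_Table GROUP BY column1 HAVING AVG(column2) < 10
""").fetchall()

print(grouped)
print(filtered)
```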
Joining Tables
The most important distinction between a relational database and other types of database is the
ability to combine or join tables on common attributes. A join is performed when data is retrieved
from more than one table at a time. To join tables, simply list the tables in the FROM clause of the
SELECT statement and match the common columns of the different tables in the WHERE clause.
Without such a condition, every row of one table is paired with every row of the other.
In this example, the query joins the two tables on the values of column3. When using
joins, it is important to specify which table each column comes from (TABLE.COLUMN). When
joining three or more tables in SQL, it is important to specify a join condition for each pair of
tables, so n tables need n - 1 conditions. For example:
Circular joins should be avoided: in the above example, a join condition must not also be made
between T_Table2 and T_Table3, as it is already implied by the code above.
An alias may be used to identify the source table from which data is taken. For example, a
shorthand of T_Table can be created.
This shortens the query and minimises the chance of spelling mistakes.
A recursive query is a nested query that joins a table to itself. For example, in an employee table
that also records each employee’s manager, a recursive query can join a manager to an employee
within the same table. In this example:
ORDER BY E.manager;
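A sketch of such a self-join, assuming SQLite (through the sqlite3 module of Python) as a stand-in DBMS. The T_Employee table, its columns and the alias M are hypothetical; only the alias E and the ORDER BY column come from the notes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE T_Employee (id INTEGER PRIMARY KEY, name TEXT, manager INTEGER);
    INSERT INTO T_Employee VALUES
        (1, 'Ann', NULL), (2, 'Bob', 1), (3, 'Cy', 1), (4, 'Dee', 2);
""")

# The same table appears twice under two aliases:
# E is the employee row, M is the row of that employee's manager.
pairs = conn.execute("""
    SELECT E.name, M.name
    FROM T_Employee E, T_Employee M
    WHERE E.manager = M.id
    ORDER BY E.manager, E.name
""").fetchall()

print(pairs)  # (employee, manager) pairs; Ann has no manager and is omitted
```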
The relational join operator merges rows from two tables and returns the rows with the following
conditions:
• Have common values in common columns or have no matching values (outer join)
The joining syntax used above to connect two tables together is the old syntax and is no longer
in common use. The more common way is to use the JOIN clause.
Join operations can be classified as inner joins or outer joins. An inner join is a join operation
in which only rows that meet a given criterion are selected. The join criterion can be an equality
condition (natural join or equijoin) or an inequality condition (theta join). It is the most commonly
used type of join.
The outer join is an operation that produces a table in which all unmatched pairs are retained;
unmatched values in the related table are left null. Below is a table with different join types.
A cross join performs a relational product (also known as the cartesian product) of two tables.
If there are 8 rows in T_Table and 12 rows in T_Table2, then the cross join will output 96 rows.
Each row from one table will be mapped with each row from the other table. It is also equivalent
to the old style join syntax:
SELECT * FROM T_Table, T_Table2;
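The 8 x 12 = 96 row count can be checked with a short sketch. It assumes SQLite (through the sqlite3 module of Python) as a stand-in DBMS; the single column n in each table is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE T_Table (n INTEGER)")
conn.execute("CREATE TABLE T_Table2 (n INTEGER)")
conn.executemany("INSERT INTO T_Table VALUES (?)", [(i,) for i in range(8)])
conn.executemany("INSERT INTO T_Table2 VALUES (?)", [(i,) for i in range(12)])

# Old-style comma syntax with no join condition.
old_style = conn.execute("SELECT COUNT(*) FROM T_Table, T_Table2").fetchone()[0]

# Explicit CROSS JOIN syntax.
cross = conn.execute(
    "SELECT COUNT(*) FROM T_Table CROSS JOIN T_Table2"
).fetchone()[0]

print(old_style, cross)
```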
A natural join returns all rows with matching values in the matching columns and eliminates
duplicate columns. This style of query is used when the tables share one or more common
attributes with common names. It will perform the following tasks:
• If there are no common attributes, return the cross join of the two tables
It is important that the matching attributes have the same name for the natural join to work
correctly.
Multiple tables can also be joined (assuming a column that matches in all three exists):
Another way to express a join is using the USING keyword. This query returns only the rows with
matching values in the column indicated following the USING clause and it is required that this
column exists in both tables with the same name.
In this case, even if there are multiple matching columns, only the rows with matching column1
values will be selected.
Another way to express a join, used when the tables have no common attribute names, is the JOIN
ON clause. The query will return only the rows that meet the indicated join condition. This
way, the columns do not need to share the same name, but must have comparable data types.
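The three inner join forms described above (NATURAL JOIN, JOIN USING and JOIN ON) can be compared side by side. This is a sketch assuming SQLite (through the sqlite3 module of Python) as a stand-in DBMS; the columns label, price, ref and note, and all rows, are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE T_Table  (column1 INTEGER, label TEXT);
    CREATE TABLE T_Table2 (column1 INTEGER, price INTEGER);
    CREATE TABLE T_Table3 (ref INTEGER, note TEXT);
    INSERT INTO T_Table  VALUES (1, 'x'), (2, 'y'), (3, 'z');
    INSERT INTO T_Table2 VALUES (1, 10), (3, 30), (4, 40);
    INSERT INTO T_Table3 VALUES (3, 'n');
""")

# NATURAL JOIN matches on every column name the tables share (here column1)
# and shows the shared column only once.
natural = conn.execute(
    "SELECT * FROM T_Table NATURAL JOIN T_Table2 ORDER BY column1"
).fetchall()

# USING names the shared column explicitly.
using = conn.execute(
    "SELECT column1, label, price FROM T_Table "
    "JOIN T_Table2 USING (column1) ORDER BY column1"
).fetchall()

# ON allows the join columns to have different names.
on = conn.execute(
    "SELECT T.label, T3.note FROM T_Table T "
    "JOIN T_Table3 T3 ON T.column1 = T3.ref"
).fetchall()

print(natural, using, on)
```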
An outer join returns not only the rows matching the join condition, but also the rows with
unmatched values. There are three types: left, right and full. The left and right types reflect the
order in which the join operations are processed. The left table is the first table named, and the
right table is the second one.
• The left outer join returns all the rows from the first table and all the matching rows from
the second table:
• The full join will return the rows from both tables including those that meet the condition
and those that don’t:
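A sketch of a left outer join, assuming SQLite (through the sqlite3 module of Python) as a stand-in DBMS; the tables and rows are hypothetical. Only LEFT JOIN is used, because older SQLite versions lack RIGHT and FULL joins; swapping the table order gives the effect of a right outer join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE T_Table  (column1 INTEGER, label TEXT);
    CREATE TABLE T_Table2 (column1 INTEGER, price INTEGER);
    INSERT INTO T_Table  VALUES (1, 'x'), (2, 'y'), (3, 'z');
    INSERT INTO T_Table2 VALUES (1, 10), (3, 30), (4, 40);
""")

# Every row of the first (left) table is kept; unmatched prices become NULL.
left = conn.execute("""
    SELECT T.column1, T2.price
    FROM T_Table T LEFT JOIN T_Table2 T2 ON T.column1 = T2.column1
    ORDER BY T.column1
""").fetchall()

# Swapping the tables keeps every row of T_Table2 instead.
swapped = conn.execute("""
    SELECT T.label, T2.column1
    FROM T_Table2 T2 LEFT JOIN T_Table T ON T.column1 = T2.column1
    ORDER BY T2.column1
""").fetchall()

print(left)
print(swapped)
```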
Transaction Management
A transaction is a logical unit of work that must be entirely completed or aborted; no intermediate
states are allowed. Transactions must display certain properties:
• Atomicity requires all operations (SQL requests) of the transaction to be completed. The
transaction is treated as a single, indivisible, logical unit of work.
• Isolation means that the data used during the execution of a transaction cannot be used
by a second transaction until the first has been completed. This property is useful in
multiuser databases and protects data integrity.
• Durability ensures that once transaction changes are done and committed, they cannot
be undone or lost (even in the event of system failure).
• Serialisability exists if the results of running transactions simultaneously are the same
results as running a transaction sequence one after another. There is no mixing of
transactions.
All of the SQL statements of a transaction must run successfully, otherwise the entire transaction
must be rolled back to the previous state. If the transaction is successful, a commit is made to
the database. The most common data integrity and consistency problem is the lost update, which
occurs when two transactions try to update the same column or the same row at the same
time, and only one update is recorded.
A consistent database state is one in which all the integrity constraints are satisfied. Most real world
transactions are formed by two or more database requests. A database request is the equivalent
of a single SQL statement in an application program. For example, if a transaction includes two
UPDATE statements and one INSERT statement, then three database requests have been performed.
If a transaction is initiated by the user or application, the sequence must continue through the SQL
statements until one of the following events is encountered:
• A COMMIT statement is reached, which ends the transaction and permanently records all
changes in the database.
• A ROLLBACK statement is reached, which automatically aborts the process and returns
the database to the last consistent state.
• The end of a program is reached successfully, in which all changes are permanently
recorded to the database - equivalent to a commit.
• The program is abnormally terminated due to a program crash or other issues. The
database changes must also be aborted and the state must return to the latest safe state -
equivalent to a rollback.
BEGIN TRANSACTION;
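The commit and rollback behaviour can be sketched as below. This assumes SQLite (through the sqlite3 module of Python) as a stand-in DBMS; the T_Account table, the CHECK constraint and the transfer function are hypothetical illustrations.

```python
import sqlite3

# isolation_level=None puts the connection in autocommit mode, so the explicit
# BEGIN / COMMIT / ROLLBACK statements below control the transaction.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute(
    "CREATE TABLE T_Account (id INTEGER PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.execute("INSERT INTO T_Account VALUES (1, 100), (2, 50)")


def transfer(amount):
    """Move amount from account 1 to account 2 as one unit of work."""
    try:
        conn.execute("BEGIN TRANSACTION")
        conn.execute(
            "UPDATE T_Account SET balance = balance + ? WHERE id = 2", (amount,)
        )
        conn.execute(
            "UPDATE T_Account SET balance = balance - ? WHERE id = 1", (amount,)
        )
        conn.execute("COMMIT")
        return True
    except sqlite3.IntegrityError:
        # The second UPDATE broke the CHECK constraint: undo the first one too.
        conn.execute("ROLLBACK")
        return False


ok = transfer(30)       # both updates succeed and are committed
failed = transfer(500)  # would overdraw account 1, so it is rolled back
balances = conn.execute(
    "SELECT id, balance FROM T_Account ORDER BY id"
).fetchall()
print(ok, failed, balances)
```

After the failed transfer, the credit to account 2 is undone as well, which is the atomicity property described earlier.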
A database also uses a transaction log to keep track of all transactions that update the
database. The DBMS uses the information stored in this log for any form of recovery triggered by
a ROLLBACK statement. It stores the following:
‣ Pointers to previous and next transaction log entries for the same transaction
A log increases the overhead of the database, but it is required to ensure that a corrupted
database can return to the previous saved state. It is important for recovery purposes.
A soft crash is a loss of volatile storage, but no damage to disks is made. A restart facility is
required to assist with this issue. A hard crash occurs when the disk becomes unreadable, and the
database must be recovered from a previously saved state.
Restart Process - Once the cause of the soft crash has been rectified, and the database is being
restarted:
• The last checkpoint before the crash in the log file is identified. It is then read forward, and
two lists are constructed
The database is then rolled forward, using REDO logic and the after-images, and rolled back, using
UNDO logic and the before-images.
Recovery Process - A hard crash involves physical damage to the disk, rendering it unreadable.
This may occur in a number of ways:
• Head-crash. The read/write head, which normally “flies” a few microns off the disk surface,
for some reason actually contacts the disk surface, and damages it.
• Accidental impact damage, vandalism or fire, all of which can cause the disk drive and
disk to be damaged.
After a hard crash, the disk unit and disk must be replaced, reformatted, and then re-loaded with
the database.
A backup is a copy of the database stored on a different device to the database, and therefore
less likely to be subjected to the same catastrophe that damages the database. Ideally, two
copies of each backup are held: an on-site copy, and an off-site copy to cater for severe
catastrophes, such as building destruction.
There are two ways transactions from two users can run: serial and interleaved. Serial transactions
occur when user1 alters the database and, only once those changes are completed and committed,
user2 accesses the database. Interleaved transactions have both users accessing the database
between commits. In the diagram below, assume T0 is user1 and T1 is user2.
Without caution, interleaved transactions can lead to lost updates and invalid data entries.
A schedule is a list of all operations performed on the database by a set of transactions.
Usually, an r refers to a read operation, a w refers to a write operation and a c refers to a
commit operation. The table below is an example with two users and is ordered in increasing
time.
Alternatively, the same example can be written as:
where the number following each operation refers to the transaction number (user).
A given interleaved execution of some set of transactions is said to be serializable if and only if it
produces the same result as some serial execution of those same transactions. For interleaved
schedules, we must determine whether the schedules are serializable by creating a precedence
graph.
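A minimal sketch of the precedence graph test, written in Python. It assumes the usual conflict rule: two operations conflict when they come from different transactions, touch the same data item, and at least one is a write. The tuple encoding of a schedule and the function names are hypothetical. A schedule whose graph has no cycle is conflict serializable.

```python
def precedence_edges(schedule):
    """Return edges (Ti, Tj) where an operation of Ti precedes a conflicting one of Tj."""
    edges = set()
    for i, (op_i, txn_i, item_i) in enumerate(schedule):
        for op_j, txn_j, item_j in schedule[i + 1:]:
            if txn_i != txn_j and item_i == item_j and "w" in (op_i, op_j):
                edges.add((txn_i, txn_j))
    return edges


def has_cycle(edges):
    """Depth-first search for a cycle in the directed graph given by edges."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, set()).add(dst)
        graph.setdefault(dst, set())
    visiting, done = set(), set()

    def visit(node):
        if node in done:
            return False
        if node in visiting:
            return True  # reached a node already on the current path
        visiting.add(node)
        found = any(visit(nxt) for nxt in graph[node])
        visiting.discard(node)
        done.add(node)
        return found

    return any(visit(node) for node in graph)


# r1(A) r2(A) w1(A) w2(A): the classic lost update interleaving.
interleaved = [("r", 1, "A"), ("r", 2, "A"), ("w", 1, "A"), ("w", 2, "A")]
# r1(A) w1(A) r2(A) w2(A): transaction 1 finishes before transaction 2 starts.
serial = [("r", 1, "A"), ("w", 1, "A"), ("r", 2, "A"), ("w", 2, "A")]

interleaved_ok = not has_cycle(precedence_edges(interleaved))
serial_ok = not has_cycle(precedence_edges(serial))
print(interleaved_ok, serial_ok)
```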
Locks are required to prevent one transaction from reading inconsistent data written by another,
and to prevent corruption and invalidation of data when multiple users try to write to the
database. Any single user can only modify those database records to which they have applied a
lock that gives them exclusive access to the record (until the lock has been released). A
transaction must acquire a lock prior to accessing a data item and locks are released when a
transaction is completed. They are usually controlled by the lock manager of the DBMS.
The granularity of locking refers to the size of the units that are, or can be, locked. It can be done
at the following levels:
• Database
• Table
• Record - allows concurrent transactions to access different rows of the same table, even
if the rows are located on the same page
• Attribute - allows concurrent transactions to access the same row, as long as they require
the use of different attributes within that row
• A shared lock can be held simultaneously by multiple processes, allowing them to read
without updating.
‣ If T1 and T2 only wished to read P1 with no subsequent update they could both apply
an SLock on P1 and continue
• A process that needs to update a record must obtain an exclusive lock. Its application for
a lock will not proceed until all current locks are released.
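The shared and exclusive lock rules can be sketched as a tiny lock table in Python. The class and method names are hypothetical, and the sketch only answers whether a request can be granted immediately; a real lock manager would also queue the waiting transactions.

```python
class LockTable:
    """Tracks, per data item, the lock mode ('S' or 'X') and its holders."""

    def __init__(self):
        self.locks = {}  # item -> (mode, set of transaction names)

    def request(self, txn, item, mode):
        """Return True if the lock is granted, False if txn would have to wait."""
        held = self.locks.get(item)
        if held is None:
            self.locks[item] = (mode, {txn})
            return True
        held_mode, holders = held
        if holders == {txn}:
            # Sole holder: keep the lock, upgrading to exclusive if requested.
            if mode == "X":
                self.locks[item] = ("X", holders)
            return True
        if mode == "S" and held_mode == "S":
            holders.add(txn)  # shared locks are compatible with each other
            return True
        return False  # any combination involving an exclusive lock conflicts

    def release_all(self, txn):
        """Release every lock held by txn, as happens when it completes."""
        for item in list(self.locks):
            _, holders = self.locks[item]
            holders.discard(txn)
            if not holders:
                del self.locks[item]


table = LockTable()
t1_read = table.request("T1", "P1", "S")   # granted
t2_read = table.request("T2", "P1", "S")   # granted: both only read
t3_write = table.request("T3", "P1", "X")  # refused while readers hold P1
table.release_all("T1")
table.release_all("T2")
t3_retry = table.request("T3", "P1", "X")  # granted once P1 is free
print(t1_read, t2_read, t3_write, t3_retry)
```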
A wait-for graph (WFG) can be created, which describes the steps taken by each transaction process.
Below is an example of a WFG for three transactions. An S operation stands for shared access,
where others can still access the data at the same time, and an X stands for exclusive access,
which prevents two transactions from accessing the data concurrently.
A problem that may occur is a deadlock, also known as a deadly embrace. A scenario that may
occur could be:
• Transaction 1 has an exclusive lock on data item A, and requests a lock on data item B.
• Transaction 2 has an exclusive lock on data item B, and requests a lock on data item A.
Without committing data before the second transaction begins, the result is a deadlock where
neither transaction can proceed, as each waits for the other to complete. To prevent deadlocks, a
transaction must acquire all necessary locks before it updates any records; if it cannot obtain
one, it releases all of its locks and tries again later.
Subqueries
The use of joins in a database allows you to get information from two or more tables. It is often
necessary to process data based on other processed data. A subquery can generate this
information and then use this new set of data to perform an action (insert, update, etc.). It has
some basic characteristics:
• The first query in the SQL statement is known as the outer query
• The query inside the SQL statement is known as the inner query
• The output of the inner query is used as the input to the outer query
• One single value - For example an average price can be calculated and used to update a
value in another table
• A list of values - This type of subquery is used when a list of values is expected, such as
using an IN clause.
• A virtual table - Can be used when a table is expected, such as using a FROM clause
If the subquery returns no values at all, it returns a NULL and depending on the outer query, this
may cause an error or another NULL value.
The most common type of subquery uses an inner SELECT subquery on the right side of a
WHERE comparison expression. For example, to find a list of all items that have a price greater
than the average price of all the items, you could use the following SQL:
SELECT * FROM T_Table WHERE price > ( SELECT AVG( price ) FROM T_Table );
This type of expression (using comparison operators) requires the inner query to return
a single value. If it returns more than one value, the DBMS will produce an error.
Another common subquery type uses the IN clause to check if a value exists in another table. For
example, to find a list of all customers who have purchased an ‘apple’, the following SQL could
be executed:
For these expressions, the inner query can output a set of values (in this case a column).
Just as WHERE subqueries exist, a subquery can be used with a HAVING clause to restrict the
output of a GROUP BY query by applying additional criteria to the new grouped rows. For
example to list all products with a total quantity sold greater than the average quantity sold:
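A sketch of a subquery inside a HAVING clause, assuming SQLite (through the sqlite3 module of Python) as a stand-in DBMS. The T_Sale table, its columns and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE T_Sale (product TEXT, quantity INTEGER);
    INSERT INTO T_Sale VALUES ('apple', 2), ('apple', 9), ('pear', 1), ('fig', 4);
""")

# The inner query returns one value (the average quantity, 16 / 4 = 4);
# HAVING then keeps only the groups whose total exceeds it.
rows = conn.execute("""
    SELECT product, SUM(quantity)
    FROM T_Sale
    GROUP BY product
    HAVING SUM(quantity) > (SELECT AVG(quantity) FROM T_Sale)
""").fetchall()

print(rows)
```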
The IN clause allows subqueries to check if a value exists within a list of values. However, this
does not work for inequality expressions ( < or > ). The ALL operator allows you to compare a
single value with every value in a list returned by a subquery using a comparison operator (other
than equals). For example, to select the products that are more expensive than all ‘apple’ products
that exist:
WHERE id IN ( SELECT id FROM T_Purchase WHERE item_name = 'apple' ) );
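A similar operator is the ANY operator, which is true if the comparison holds for at least one value
in the list. Used with the equals operator ( = ANY ), it performs the same function as the IN clause.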
If the subquery produces a table of values, it can be placed in the FROM clause and queried like
any other table. For this to work, the output must be a virtual
table. For example, if you wanted to know all customers who have bought the products ‘apple’
and ‘banana’:
A correlated subquery is a subquery that executes once for each row in the outer query. This
process is similar to a nested loop in a programming language.
WHERE PS.units > ( SELECT AVG( units ) FROM T_Product PA WHERE PA.id = PS.id );
In these subqueries, the inner query must make use of an attribute from the outer query. These
should be handled carefully, as the inner query is re-executed for every row of the outer query,
which increases the computation time.
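A runnable sketch of a correlated subquery, assuming SQLite (through the sqlite3 module of Python) as a stand-in DBMS. The WHERE clause follows the fragment in the notes; the SELECT list, the layout of T_Product and the rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE T_Product (id INTEGER, units INTEGER);
    INSERT INTO T_Product VALUES (1, 10), (1, 30), (2, 5), (2, 5), (2, 20);
""")

# For each outer row PS, the inner query recomputes the average units
# of the rows that share PS.id (20 for id 1, 10 for id 2).
rows = conn.execute("""
    SELECT PS.id, PS.units
    FROM T_Product PS
    WHERE PS.units > ( SELECT AVG( units ) FROM T_Product PA WHERE PA.id = PS.id )
    ORDER BY PS.id
""").fetchall()

print(rows)
```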
SQL Functions
Functions in SQL are very similar to those in other languages. A function takes a number of
parameters (or none) and returns an output. It can be called from any location in the SQL code
where an attribute value could be used.
There are a range of date and time functions that can be used in SQL Server:
• Similarly, the MONTH and DAY functions take a date parameter and return the numerical
value for the month and day from the date.
SELECT SYSDATE;
• To add a specified number of date-parts to a given date, use the DATEADD function. For
example, to add 90 days to a date from a table attribute:
• To find the difference between two dates, use the DATEDIFF function. Again, the first
parameter (day in this case) specifies the unit of the output. Month and year can also be
used, as can hours, minutes or seconds.
• To convert a date into another format (such as a varchar), the CONVERT function can be
used. This is for SQL Server. It has three parameters: the target data type, the date and
the format (style) code.
In this case, the format used is format 1. This corresponds to MM/DD/YY. Other formats
include:
‣ 101: MM/DD/YYYY
‣ 2: YY.MM.DD
‣ 102: YYYY.MM.DD
‣ 3: DD/MM/YY
‣ 103: DD/MM/YYYY
For Oracle and SQL Developer, the TO_CHAR function returns the character
representation of a date or set of date parts, given a set format.
Numeric functions can be grouped into algebraic, trigonometric and logarithmic functions:
• The ABS function returns the absolute value of the passed parameter (if negative, make it
positive)
• The ROUND function rounds a value to a specified precision (number of digits). For
rounding to the nearest integer value, use the precision value of 0.
• The CEILING and FLOOR functions output the nearest integer above or below the given value.
The ceiling function returns the smallest integer greater than or equal to the value, while the
floor function returns the largest integer less than or equal to it; an input that is already an
integer is returned unchanged. So, CEILING( 10.5 ) = 11, FLOOR( 2.99 ) = 2,
CEILING ( 2.000 ) = 2.
String manipulations are among the most useful SQL functions and can convert strings to
uppercase, concatenate them, etc.:
• To concatenate two strings, use the || operator. This joins the two strings on either
side of the operator together. Multiple concatenations can be chained, as given in the
example below:
• To convert the entire string to upper or lower case, use the UPPER and LOWER functions
respectively.
• The SUBSTRING function returns a portion of the string given as a parameter. The first
parameter is the input string, the second is the starting index and the third is the number of
characters to extract.
• To get the number of characters in a particular string, the LENGTH function is used
(for Oracle systems). In other database systems, it may be LEN instead. Similar differences
exist for other functions.
Another useful function is the NVL function, which takes a parameter and returns a specified
replacement value if the parameter is NULL. This way, two fields can be added and, even if one
field is null, the result will not be affected.
In the example above, even if number2 is null (assuming number1 always has a value), the
query will still output a value (treating number2 as 0). If number2 is not null, then its
actual value is used in the calculation.
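The effect of replacing a NULL before adding can be sketched as below. This assumes SQLite (through the sqlite3 module of Python) as a stand-in DBMS. SQLite has no NVL, so the sketch swaps in COALESCE, which gives the same result for two arguments; the id column and the rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE T_Table (id INTEGER PRIMARY KEY, number1 INTEGER, number2 INTEGER);
    INSERT INTO T_Table (number1, number2) VALUES (5, NULL), (5, 3);
""")

# Plain addition yields NULL as soon as one operand is NULL;
# COALESCE(number2, 0) substitutes 0 first (Oracle: NVL(number2, 0)).
rows = conn.execute("""
    SELECT number1 + number2, number1 + COALESCE(number2, 0)
    FROM T_Table ORDER BY id
""").fetchall()

print(rows)
```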
Relational Set Operators
Most SQL commands are set oriented. This means that they deal with groups of things and
specific sets of data - they operate over entire rows, or columns at once. Union, intersection and
difference relational operators can be used to select part of the data set, but the relations must be
union compatible. This means that two or more tables must share the same number of columns
and have columns with the same data types corresponding with each other (the actual column
name is not important).
The UNION statement combines rows from two or more queries without including duplicate rows.
The syntax for such a command is query UNION query. If one table has 5 fields, while the second
table has an additional 6th field, each query must select only the fields that are present in both
tables so that the results stay union compatible. An example SQL code would be:
It can be used with more than one query, and can combine the output of multiple queries into one
larger output.
The UNION ALL clause combines two queries into one single result (provided they are union
compatible) and retains all duplicate rows.
Similarly, the INTERSECT clause returns the rows that are common to
the two queries. Only the rows that appear in both sets of data will be shown (not duplicated in
the output).
The MINUS (or EXCEPT in some systems) clause will output all rows that appear in the first set
(from the first query) but not in the second set (in the second query).
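The four set operators can be compared on two small union compatible tables. This is a sketch assuming SQLite (through the sqlite3 module of Python) as a stand-in DBMS, which uses EXCEPT rather than MINUS; the tables T_A and T_B and their rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE T_A (name TEXT);
    CREATE TABLE T_B (name TEXT);
    INSERT INTO T_A VALUES ('Ann'), ('Bob'), ('Cy');
    INSERT INTO T_B VALUES ('Bob'), ('Cy'), ('Dee');
""")


def names(sql):
    """Run a query and return its single column as a plain list."""
    return [row[0] for row in conn.execute(sql)]


union = names("SELECT name FROM T_A UNION SELECT name FROM T_B ORDER BY name")
union_all = names("SELECT name FROM T_A UNION ALL SELECT name FROM T_B ORDER BY name")
intersect = names("SELECT name FROM T_A INTERSECT SELECT name FROM T_B ORDER BY name")
difference = names("SELECT name FROM T_A EXCEPT SELECT name FROM T_B ORDER BY name")

print(union)       # duplicates removed
print(union_all)   # duplicates kept
print(intersect)   # rows in both
print(difference)  # rows only in the first query
```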
Virtual Tables
The output of a relational operator, such as SELECT, is another relation (or table). If the output of
such a query needs to be saved, a relational view can be formed. A view is a virtual table
based on a SELECT query. It is saved as an object in the database and can contain columns,
computed columns, aliases and aggregate functions from one or more tables. The tables on which
the view is based are called the base tables.
To create a view, the CREATE VIEW command is used, followed by the view name. Before the
select query, the AS keyword is used:
A view can be used anywhere a table name is expected in an SQL statement. Views are
also dynamically updated - this means they are recreated each time they are used. So if a set of
data is updated, the view will reflect the update the next time it is used (the old data becomes
redundant). Views can provide security, as a company can give each department a specific
view from which to run its data analysis, rather than access to the whole database.
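A sketch of creating a view and seeing it follow changes to its base table, assuming SQLite (through the sqlite3 module of Python) as a stand-in DBMS. The T_Item table, the V_Cheap view and the rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE T_Item (name TEXT, price INTEGER);
    INSERT INTO T_Item VALUES ('pen', 2), ('book', 15);

    -- CREATE VIEW, then the view name, then AS before the SELECT query.
    CREATE VIEW V_Cheap AS
        SELECT name, price FROM T_Item WHERE price < 10;
""")

# The view is queried exactly like a table.
before = conn.execute("SELECT name, price FROM V_Cheap ORDER BY name").fetchall()

# A change to the base table shows up the next time the view is used.
conn.execute("INSERT INTO T_Item VALUES ('cup', 5)")
after = conn.execute("SELECT name, price FROM V_Cheap ORDER BY name").fetchall()

print(before)
print(after)
```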
A batch update routine is a routine that pools transactions into a single group to update a master
table in a single operation. It updates data from two or more tables in a single operation. In normal
cases, the database system will produce an error when attempting to update data from a JOIN
table clause. A solution to this is to create an updatable view, which is a view that can update
attributes in the base tables used in the view. An updatable view has no unique syntax, but has a
range of restrictions in place:
• Operators such as UNION, INTERSECT or EXCEPT cannot be used
• The base tables being updated must be key-preserved, meaning that the values in the
primary keys of the tables must be unique and remain that way.
Once the view has been created, simply update the values in the view and all base tables will be
updated:
Database Connectivity
Database connectivity refers to the mechanisms through which application programs connect and
communicate with data repositories. The DBMS is the intermediary structure between the stored
data and the user’s applications. For online databases, a client/server approach must be used.
It can be broken down into three fundamental layers:
• The middle layer that manages connectivity and data transformation issues. It is
responsible for translating the language into code that the database can respond to.
Native SQL connectivity refers to the connection interface that is provided by the database
vendor and is unique to that vendor. For an Oracle database, the Oracle SQL*Net interface must
be installed on the client’s device to access the database and interface of the online server. To
ensure that all DBMS programs act in a standard way, the Call Level Interface (CLI) was developed by
the SQL Access Group, which provides the database access standard that all major vendors must abide by.
Developed in the early ‘90s, Open Database Connectivity (ODBC) is Microsoft’s implementation
of the SQL standard for database access. It allows any Windows application to access relational
data sources, using SQL via a standard API. However, over time, ODBC did not provide significant
functionality beyond the ability to execute SQL queries, so other data interfaces were developed.
There are many APIs adopted by databases to provide specific functions. One of these is data
access objects (DAO), which is an object oriented API used to access desktop databases such
as MS Access and provides optimised interfaces for such programs. Another major API is
remote data objects (RDO), which is a higher level, object oriented API used to access remote
database servers. It is used to deal with server based databases.
ODBC executes on the Windows operating system through dynamic-link libraries (DLLs),
which are stored as files with a .dll extension. Running as a DLL, the code speeds up load and run
times. ODBC architecture has three main components:
• A high level ODBC API, through which applications access the ODBC functionality
Java Database Connectivity (JDBC) is an application programming interface that allows a Java
program to interact with a wide range of data sources, including relational databases, tabular data
sources, spreadsheets, and text files. JDBC allows a Java program to establish a connection with
a data source, prepare and send the SQL code to the database server, and process the result set.
One advantage of JDBC over other middleware is that it requires no configuration on the client
side. JDBC also provides a way to connect to the database through an ODBC driver.
Providers are objects that manage the connection with a data source and provide data to the
consumers. They can be broken into two categories:
• Data providers provide data to other processes and are used to create the functionality of
the underlying data source.
• Service providers provide additional functionality to users and are located between the
data provider and the consumer. The service provider requests the data from the data
provider, transforms the data and then sends the transformed data to the end user. These
include transaction management services and querying services (such as SQL).
A session is a connection period between the two providers and a command is used to
manipulate the interaction between the two to create objects.
A script is written in a programming language that is not compiled, but interpreted and executed
at run time. A connection object is used to set up and establish a connection with a data source
of any type. A recordset object contains data generated by the execution of a command. It will
also contain any new data to be written to the data source. The recordset can be disconnected
from the data source. The DataSet is a disconnected, memory-resident representation of the
database and stores the data that has been read by the data provider, usually stored in XML
format.
Internet database connectivity allows for a range of services. The benefits of internet
technologies include, but are not limited to:
• Hardware and software independence - savings on equipment and no need for multiple
platform development.
• Common and simple user interface - reduced training time and cost and reduced end-user
support cost.
• Location independence - global access through Internet infrastructure and mobile smart
devices, and reduced requirements and cost for dedicated connections.
In general, a web server is the main hub through which all Internet services are accessed. When a
user dynamically queries a database, the client requests a webpage from a web server. When the
web server receives the request, it finds the page on the hard disk and sends it back to the client.
Dynamic webpages encompass most modern websites now, in which the web server generates
the contents of the page before sending it to the client. However, to gather data from a database
in a dynamic webpage, neither the client nor the web server knows how to connect to the
database and retrieve the data required.
A server-side extension is required to allow this to occur. It is a program that interacts directly
with the server process to handle specific types of requests. Server-side extensions add
significant functionality to web servers and intranets. A database server-side extension is also
known as web-to-database middleware, which retrieves data from the database and gives it
to the web server to send to the client. This can be done in the following process:
2. The web server receives and passes the request to the web-to-database middleware
for processing
3. The requested page might contain some kind of scripting which the web server passes
to the middleware.
4. The middleware reads the data, validates it and then executes the script. It
connects to the database and passes the query using the database connectivity layer.
5. The database server executes the query and passes the result back to the middleware.
6. The middleware then compiles the result and dynamically generates an HTML formatted
page that includes the data retrieved from the database and sends it to the web server.
7. The web server returns the HTML page, including the query result, back to the client.
A web server interface defines a standard way to exchange messages with the external programs.
There are two well-defined web server interfaces:
The Common Gateway Interface (CGI) uses script files that perform specific functions based on
a client’s parameters that are passed to the web server. The script file is a small program
containing the commands written in a programming language, usually Perl, C++ or Basic. The
script will convert the retrieved data to an HTML format. The main disadvantage of CGI is that
the script is an external program that runs separately from the web server process, which affects
performance.
A newer web server interface is the Application Programming Interface (API), which is more
efficient and faster than a CGI script. APIs are implemented as shared code or dynamic-link
libraries (DLLs), meaning the API is treated as part of the web server program and is invoked when
needed. Their code resides in memory, so there is no need to run an external program, as CGI
does. Another advantage is that an API can use a shared connection to the database instead of
creating a new one every time, as in the case of CGI scripts. However, as the API shares the
same memory space as the web server, an error in the API can bring down the whole web server.
APIs are also specific to the web server and operating system.
The web browser is software that allows end users to navigate the web from the client computer.
Each time the end user clicks a hyperlink, the browser generates an HTTP GET page request that
is sent to the designated web server using the TCP/IP protocol. The browser’s job is to interpret
the HTML code and display it visually on the screen.
The web is a stateless system, meaning at any given time a web server does not know the status
of any of the clients communicating with it. The web does not reserve memory to maintain an
open communications state between the client and the server. The server does not know what the
client does with the webpage sent, as the page is stored in the client’s cache.
• JavaScript: A scripting language that allows web authors to design interactive sites. It is
embedded in the web page code that the client’s device can access and will be executed
by the browser and not on the web-server.
• ActiveX: This is more specific for Microsoft clients and allows programs to be written
inside webpages, similar to JavaScript. It adds controls such as drop down windows and
calendars to webpages.
• VBScript: Another Microsoft product that is used to extend browser functionality, derived
from Visual Basic. It is similar to JavaScript and the code is stored inside the HTML
document and executed by the web browser.
A web application server is a middleware application that expands the functionality of web
servers by linking them to a wide range of services such as databases, directory systems and
search engines. It is used to perform some of the following tasks:
• Public cloud: This infrastructure is built by third party organisations to sell cloud services
to the general public, such as Amazon Web Services (AWS), the Google Engine and
Microsoft Azure. In this model, cloud consumers share resources with other consumers
transparently.
• Private cloud: This is an internal infrastructure built by an organisation for the sole
purpose of servicing its own needs. It can be managed by the IT staff of the organisation
or by an external third party.
• Community cloud: This type of cloud is built by and for a specific group of organisations
that share a common trade, such as agencies of the federal government, the military or
higher education.