Database
A database is an organized collection of data. The data are typically organized to model relevant aspects of reality
(for example, the availability of rooms in hotels), in a way that supports processes requiring this information (for
example, finding a hotel with vacancies).
The term database is correctly applied to the data and their supporting data structures, and not to the database
management system (DBMS). The combination of a database with its DBMS is called a database system.
The term database system implies that the data are managed to some level of quality (measured in terms of accuracy,
availability, usability, and resilience) and this in turn often implies the use of a general-purpose database
management system (DBMS).[1] A general-purpose DBMS is typically a complex software system that meets many
usage requirements to properly maintain its databases which are often large and complex. The utilization of
databases is now so widespread that virtually every technology and product relies on databases and DBMSs for its
development and commercialization, or even may have DBMS software embedded in it. Also, organizations and
companies, from small to large, depend heavily on databases for their operations.
Well known DBMSs include FoxPro, IBM DB2, Linter, Microsoft Access, Microsoft SQL Server, MySQL, Oracle,
PostgreSQL and SQLite. A database is not generally portable across different DBMSs, but different DBMSs can
interoperate to some degree by using standards like SQL and ODBC together to support a single application built
over more than one database. A DBMS also needs to provide effective run-time execution to properly support (e.g.,
in terms of performance, availability, and security) as many database end-users as needed.
A way to classify databases involves the type of their contents, for example: bibliographic, document-text, statistical,
or multimedia objects. Another way is by their application area, for example: accounting, music compositions,
movies, banking, manufacturing, or insurance.
The term database may be narrowed to specify particular aspects of an organized collection of data, and may refer to
the logical database, to the physical database as data content in computer data storage, or to many other database
sub-definitions.
History
Database concept
The database concept has evolved since the 1960s to ease increasing difficulties in designing, building, and
maintaining complex information systems (typically with many concurrent end-users, and with a large amount of
diverse data). It has evolved together with database management systems which enable the effective handling of
databases. Though the terms database and DBMS define different entities, they are inseparable: a database's
properties are determined by its supporting DBMS. The Oxford English Dictionary cites a 1962 technical report as
the first to use the term "data-base." With the progress in technology in the areas of processors, computer memory,
computer storage and computer networks, the sizes, capabilities, and performance of databases and their respective
DBMSs have grown by orders of magnitude. For decades it has been unlikely that a complex information system
could be built effectively without a proper database supported by a DBMS. As noted above, databases and DBMSs
have become essential to the development and commercialization of virtually every technology and product, and
organizations and companies, from small to large, depend heavily on them for their operations.
No widely accepted exact definition exists for DBMS. However, a system needs to provide considerable
functionality to qualify as a DBMS. Accordingly, its supported data collection needs to meet respective usability
requirements (broadly defined by the requirements below) to qualify as a database. Thus, a database and its
supporting DBMS are defined here by a set of general requirements listed below. Virtually all existing mature
DBMS products meet these requirements to a great extent, while less mature ones either meet them or converge to
meet them.
General-purpose DBMS
A DBMS has evolved into a complex software system and its development typically requires thousands of
person-years of development effort. Some general-purpose DBMSs, like Oracle, Microsoft SQL Server, FoxPro, and
IBM DB2, have been undergoing upgrades for thirty years or more. General-purpose DBMSs aim to satisfy as many
applications as possible, which typically makes them even more complex than special-purpose databases. However,
the fact that they can be used "off the shelf", as well as their amortized cost over many applications and instances,
makes them an attractive alternative (vs. one-time development) whenever they meet an application's requirements.
Though attractive in many cases, a general-purpose DBMS is not always the optimal solution: when certain
applications are pervasive, with many operating instances, each with many users, a general-purpose DBMS may
introduce unnecessary overhead and too large a "footprint" (a large amount of unneeded, unutilized software
code). Such applications usually justify dedicated development. Typical examples are email systems, which need
certain DBMS properties: email systems are built in a way that optimizes the handling and managing of email
messages, and they do not need significant portions of general-purpose DBMS functionality.
Types of people involved
Three types of people are involved with a general-purpose DBMS:
1. DBMS developers - These are the people who design and build the DBMS product, and the only ones who touch
its code. They are typically the employees of a DBMS vendor (e.g., Oracle, IBM, Microsoft, Sybase), or, in the
case of open source DBMSs (e.g., MySQL), volunteers or people supported by interested companies and
organizations. They are typically skilled systems programmers. DBMS development is a complicated task, and
some of the popular DBMSs have been under development and enhancement (also to follow progress in
technology) for decades.
2. Application developers and database administrators - These are the people who design and build a
database-based application that uses the DBMS. Database administrators design the needed database and
maintain it; application developers write the application programs that comprise the application. Both groups are
well acquainted with the DBMS product and use its user interfaces (as well as, usually, other tools) for their
work. Sometimes the application itself is packaged and sold as a separate product, which may include the DBMS
inside (see embedded database; subject to proper DBMS licensing), or is sold separately as an add-on to the DBMS.
3. Application's end-users (e.g., accountants, insurance people, medical doctors, etc.) - These people know the
application and its end-user interfaces, but need not know or understand the underlying DBMS. Thus, though
they are the intended and main beneficiaries of a DBMS, they are only indirectly involved with it.
Database machines and appliances
In the 1970s and 1980s attempts were made to build database systems with integrated hardware and software. The
underlying philosophy was that such integration would provide higher performance at lower cost. Examples were
IBM System/38, the early offering of Teradata, and the Britton Lee, Inc. database machine. Another approach to
hardware support for database management was ICL's CAFS accelerator, a hardware disk controller with
programmable search capabilities. In the long term these efforts were generally unsuccessful because specialized
database machines could not keep pace with the rapid development and progress of general-purpose computers. Thus
most database systems nowadays are software systems running on general-purpose hardware, using general-purpose
computer data storage. However, this idea is still pursued for certain applications by companies such as Netezza
and Oracle (Exadata).
Database research
Database research has been an active and diverse area, with many specializations, carried out since the early days of
dealing with the database concept in the 1960s. It has strong ties with database technology and DBMS products.
Database research has taken place at research and development groups of companies (e.g., notably at IBM Research,
which has contributed technologies and ideas to virtually every DBMS existing today), research institutes, and
academia. Research has been done both through theory and prototypes. The interaction between research and
database-related product development has been very productive for the database area, and many related key concepts
and technologies emerged from it. Notable are the Relational and the Entity-relationship models, the atomic
transaction concept and related Concurrency control techniques, Query languages and Query optimization methods,
RAID, and more. Research has provided deep insight into virtually all aspects of databases, though it has not always
been pragmatic or effective (and it cannot and should not always be: research is exploratory in nature, and does not
always lead to accepted or useful ideas). Ultimately market forces and real needs determine the selection of problem
solutions and related technologies, also among those proposed by research. However, occasionally it is not the best
and most elegant solution that wins (e.g., SQL). Throughout their history, DBMSs and their respective databases
have to a great extent been the outcome of such research, while real product requirements and challenges have
triggered database research directions and sub-areas.
The database research area has several notable dedicated academic journals (e.g., ACM Transactions on Database
Systems-TODS, Data and Knowledge Engineering-DKE, and more) and annual conferences (e.g., ACM SIGMOD,
ACM PODS, VLDB, IEEE ICDE, and more), as well as an active and quite heterogeneous (subject-wise) research
community all over the world.
Data warehouse
Data warehouses archive data from operational databases (and often from external sources) so as to make them
available for further use.
Operations in a data warehouse are typically concerned with bulk data manipulation, and as such, it is
unusual and inefficient to target individual rows for update, insert or delete. Bulk native loaders for
input data and bulk SQL passes for aggregation are the norm.
Distributed database
The definition of a distributed database is broad, and may be used with different meanings. In general it
typically refers to a modular DBMS architecture that allows distinct DBMS instances to cooperate as a
single DBMS over processes, computers, and sites, while managing a single database that is itself distributed
over multiple computers and different sites.
Examples are databases of local work-groups and departments at regional offices, branch offices,
manufacturing plants and other work sites. These databases can include both segments shared by
multiple sites, and segments specific to one site and used only locally in that site.
Document-oriented database
A document-oriented database is a computer program designed for storing, retrieving, and managing
document-oriented, or semi-structured, information. Document-oriented databases are one of the
main categories of so-called NoSQL databases, and the popularity of the term "document-oriented
database" (or "document store") has grown with the use of the term NoSQL itself.
Such databases are used to conveniently store, manage, edit and retrieve documents.
Embedded database
An embedded database system is a DBMS which is tightly integrated with application software that
requires access to stored data, in such a way that the DBMS is hidden from the application's end-users and
requires little or no ongoing maintenance. It is actually a broad technology category that includes
DBMSs with differing properties and target markets. The term "embedded database" can be confusing
because only a small subset of embedded database products is used in real-time embedded systems such
as telecommunications switches and consumer electronics devices.[3]
End-user database
These databases consist of data developed by individual end-users. Examples of these are collections of
documents, spreadsheets, presentations, multimedia, and other files. Several products exist to support
such databases. Some of them are much simpler than full fledged DBMSs, with more elementary DBMS
functionality (e.g., not supporting multiple concurrent end-users on a same database), with basic
programming interfaces, and a relatively small "foot-print" (not much code to run as in "regular"
general-purpose databases). However, also available general-purpose DBMSs can often be used for such
purpose, if they provide basic user-interfaces for straightforward database applications (limited query
and data display; no real programming needed), while still enjoying the database qualities and
protections that these DBMSs can provide.
Federated database and multi-database
A federated database is an integrated database that comprises several distinct databases, each with its
own DBMS. It is handled as a single database by a federated database management system (FDBMS),
which transparently integrates multiple autonomous DBMSs, possibly of different types (which makes it
a heterogeneous database), and provides them with an integrated conceptual view. The constituent
databases are interconnected via computer network, and may be geographically decentralized.
Sometimes the term multi-database is used as a synonym for federated database, though it may refer to a
less integrated group of databases (e.g., without an FDBMS and a managed integrated schema) that
cooperate in a single application. In this case middleware for distribution is used, which
typically includes an atomic commit protocol (ACP), e.g., the two-phase commit protocol, to allow
distributed (global) transactions (vs. local transactions confined to a single DBMS) across the
participating databases.
Graph database
A graph database is a kind of NoSQL database that uses graph structures with nodes, edges, and
properties to represent and store information. General graph databases that can store any graph are
distinct from specialized graph databases such as triplestores and network databases.
Hypermedia databases
The World Wide Web can be thought of as a database, albeit one spread across millions of independent
computing systems. Web browsers "process" these data one page at a time, while web crawlers and
other software provide the equivalent of database indexes to support search and other activities.
Hypertext database
In a Hypertext database, any word or a piece of text representing an object, e.g., another piece of text, an
article, a picture, or a film, can be linked to that object. Hypertext databases are particularly useful for
organizing large amounts of disparate information. For example, they are useful for organizing online
encyclopedias, where users can conveniently jump between texts, in a controlled way, using hyperlinks.
In-memory database
An in-memory database (IMDB; also main memory database or MMDB) is a database that primarily
resides in main memory, but is typically backed up by non-volatile computer data storage. Main memory
databases are faster than disk databases. Accessing data in memory reduces the I/O reading activity
when, for example, querying the data. In applications where response time is critical, such as
telecommunications network equipment, main memory databases are often used.[4]
Knowledge base
A knowledge base (abbreviated KB or kb)[5][6] is a special kind of database for knowledge
management, providing the means for the computerized collection, organization, and retrieval of
knowledge. It may also be a collection of data representing problems with their solutions and related experiences.
Operational database
These databases store detailed data about the operations of an organization. They are typically organized
by subject matter and process relatively high volumes of updates using transactions. Essentially every major
organization on earth uses such databases. Examples include customer databases that record contact,
credit, and demographic information about a business' customers; personnel databases that hold
information such as salary, benefits, and skills data about employees; Enterprise resource planning systems that
record details about product components and parts inventory; and financial databases that keep track of the
organization's money, accounting and financial dealings.
Parallel database
A parallel database, run by a parallel DBMS, seeks to improve performance through parallelization for
tasks such as loading data, building indexes and evaluating queries. Parallel databases improve
processing and input/output speeds by using multiple central processing units (CPUs) (including
multi-core processors) and storage in parallel. In parallel processing, many operations are performed
simultaneously, as opposed to serial, sequential processing, where operations are performed with no
time overlap.
The major parallel DBMS architectures (which are induced by the underlying hardware architecture) are:
Shared-memory architecture, where multiple processors share the main memory space, as well as other
data storage.
Shared-disk architecture, where each processing unit (typically consisting of multiple processors) has its
own main memory, but all units share the other storage.
Shared-nothing architecture, where each processing unit has its own main memory and other storage.
Real-time database
A DBMS can be regarded as a real-time database if it responds to users' requests within a given time period.
Spatial database
A spatial database can store data with multidimensional features. Queries on such data include location-based
queries, such as "Where is the closest hotel in my area?".
Temporal database
A temporal database is a database with built-in time aspects, for example a temporal data model and a temporal
version of Structured Query Language (SQL). More specifically the temporal aspects usually include valid-time and
transaction-time.
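As an illustrative sketch, valid-time is often represented in plain SQL by a pair of period columns (the table and
column names here are invented for the example; dedicated temporal SQL extensions offer more direct support):

-- Each row records the period during which the fact was true in reality.
CREATE TABLE employee_salary (
    employee_id INTEGER,
    salary      DECIMAL(10, 2),
    valid_from  DATE,
    valid_to    DATE
);

-- Retrieve the salary that was valid on a given date.
SELECT salary
FROM employee_salary
WHERE employee_id = 7
  AND DATE '2001-06-30' >= valid_from
  AND DATE '2001-06-30' < valid_to;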
Unstructured-data database
An unstructured-data database is intended to store in a manageable and protected way diverse objects
that do not fit naturally and conveniently in common databases. It may include email messages,
documents, journals, multimedia objects etc. The name may be misleading since some objects can be
highly structured. However, the entire possible object collection does not fit into a predefined structured
framework. Most established DBMSs now support unstructured data in various ways, and new dedicated
DBMSs are emerging.
Functional requirements
Certain general functional requirements need to be met in conjunction with a database. They describe what needs to
be defined in a database for any specific application.
The database's structure must be defined. The database needs to be based on a data model that is sufficiently rich to
describe in the database all the needed respective application's aspects. Data definition languages exist to describe
the databases within the data model. Such languages are typically data model specific.
A database's data model needs the support of a sufficiently rich data manipulation language to allow database
manipulations, and for information to be generated from the data. Such a language is typically data model specific.
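For the relational data model, for example, SQL combines both kinds of language. A minimal sketch (the
hotel_room table and its columns are invented for this example) shows definition, manipulation, and querying
side by side:

-- Data definition: describe the database's structure.
CREATE TABLE hotel_room (
    room_number INTEGER PRIMARY KEY,
    rate        DECIMAL(8, 2),
    vacant      CHAR(1)  -- 'Y' or 'N'
);

-- Data manipulation: insert and update content.
INSERT INTO hotel_room (room_number, rate, vacant) VALUES (101, 90.00, 'Y');
UPDATE hotel_room SET vacant = 'N' WHERE room_number = 101;

-- Information generation: find a room with a vacancy.
SELECT room_number FROM hotel_room WHERE vacant = 'Y';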
A database needs built-in security means to protect its content (and users) from dangers of unauthorized users (either
humans or programs). Protection is also provided from types of unintentional breach. Security types and levels
should be defined by the database owners.
Manipulating database data often involves processes of several interdependent steps, at different times (e.g., when
different people's interactions are involved; e.g., generating an insurance policy). Data manipulation languages are
typically intended to describe what is needed in a single such step. Dealing with multiple steps typically requires
writing quite complex programs. Most applications are programmed using common programming languages and
software development tools. However the area of process description has evolved in the frameworks of workflow and
business processes with supporting languages and software packages which considerably simplify the tasks.
Traditionally these frameworks have been out of the scope of common DBMSs, but utilization of them has become
common-place, and often they are provided as add-ons to DBMSs.
Operational requirements
Operational requirements need to be met by a database in order to effectively support an application when it is
operational. Though it may be expected that operational requirements are automatically met by a DBMS, in fact
this is not so in most cases: meeting them typically requires substantial design and tuning work by database
administrators. This is typically done through specific instructions/operations via special database user interfaces
and tools, and thus may be viewed as a set of secondary functional requirements (which are no less important than
the primary ones).
Availability
A DB should maintain needed levels of availability, i.e., the DB needs to be available in such a way that a user's
action does not need to wait beyond a certain time range before it starts executing upon the DB. Availability also
relates to failure and recovery (see Recovery from failure and disaster below): upon failure and during recovery,
normal availability changes, and special measures are needed to satisfy availability requirements.
Performance
Users' actions upon the DB should be executed within needed time ranges.
Isolation between users
When multiple users access the database concurrently, the actions of a user should be uninterrupted and unaffected
by the actions of other users. These concurrent actions should maintain the DB's consistency (i.e., keep the DB from
being corrupted).
Recovery from failure and disaster
All computer systems, including DBMSs, are prone to failures for many reasons (both software and hardware
related). Failures typically corrupt the DB, often to the extent that it is impossible to repair it without special
measures. The DBMS should provide automatic failure-recovery procedures that repair the DB and return it to
a well-defined state.
Backup and restore
Sometimes it is desired to bring a database back to a previous state (for many reasons, e.g., cases when the database
is found corrupted due to a software error, or if it has been updated with erroneous data). To achieve this a backup
operation is done occasionally or continuously, where each desired database state (i.e., the values of its data and their
embedding in database's data structures) is kept within dedicated backup files (many techniques exist to do this
effectively). When this state is needed, i.e., when it is decided by a database administrator to bring the database back
to this state (e.g., by specifying this state by a desired point in time when the database was in this state), these files
are utilized to restore that state.
Data independence
Data independence pertains to a database's life cycle (see Database building, maintaining, and tuning below). It
strongly impacts the convenience and cost of maintaining an application and its database, and has been the major
motivation for the emergence and success of the Relational model, as well as the convergence to a common database
architecture. In general the term "data independence" means that changes in the database's structure do not require
changes in its application's computer programs, and that changes in the database at a certain architectural level (see
below) do not affect the database's levels above. Data independence is achieved to a great extent in contemporary
DBMSs, but it is not completely attainable, and is achieved to different degrees for different types of database
structural changes.
Data models
A data model is an abstract structure that provides the means to effectively describe specific data structures needed
to model an application. As such a data model needs sufficient expressive power to capture the needed aspects of
applications. These applications are often typical to commercial companies and other organizations (like
manufacturing, human-resources, stock, banking, etc.). For effective utilization and handling it is desired that a data
model is relatively simple and intuitive. This may be in conflict with high expressive power needed to deal with
certain complex applications. Thus any popular general-purpose data model usually balances being intuitive and
relatively simple against having the high expressive power needed for complex applications. The application's
semantics is usually not explicitly expressed in the model, but rather is implicit (and detailed by documentation
external to the model) and hinted at by the names of data item types (e.g., "part-number") and their connections (as
expressed by the generic data structure types provided by each specific model).
Early data models
These models were popular in the 1960s and 1970s, but nowadays can be found primarily in old legacy systems. They
are characterized primarily by being navigational with strong connections between their logical and physical
representations, and deficiencies in data independence.
Hierarchical model
In the Hierarchical model different record types (representing real-world entities) are embedded in a predefined
hierarchical (tree-like) structure. This hierarchy is used as the physical order of records in storage. Record access is
done by navigating through the data structure using pointers combined with sequential accessing.
This model has been supported primarily by the IBM IMS DBMS, one of the earliest DBMSs. Various limitations of
the model have been compensated at later IMS versions by additional logical hierarchies imposed on the base
physical hierarchy.
Network model
In this model a hierarchical relationship between two record types (representing real-world entities) is established by
the set construct. A set consists of circular linked lists where one record type, the set owner or parent, appears once
in each circle, and a second record type, the subordinate or child, may appear multiple times in each circle. In this
way a hierarchy may be established between any two record types, e.g., type A is the owner of B. At the same time
another set may be defined where B is the owner of A. Thus all the sets comprise a general directed graph
(ownership defines a direction), or network construct. Access to records is either sequential (usually in each record
type) or by navigation in the circular linked lists.
This model is more general and powerful than the hierarchical one, and was the most popular model before being
replaced by the Relational model. It has been standardized by CODASYL. Popular DBMS products that utilized it
were Cincom Systems' Total and Cullinet's IDMS. IDMS gained a considerable customer base and is still supported
today. In the 1980s it adopted the Relational model and SQL in addition to its original tools and languages.
Inverted file model
An inverted file or inverted index of a first file, by a field in this file (the inversion field), is a second file in which
this field is the key. A record in the second file includes a key and pointers to records in the first file where the
inversion field has the value of the key. This is also the logical structure of contemporary database indexes. The
related Inverted file data model utilizes inverted files of primary database files to efficiently directly access needed
records in these files.
Notable for using this data model is the ADABAS DBMS of Software AG, introduced in 1970. ADABAS has
gained a considerable customer base and is still supported today. In the 1980s it adopted the Relational
model and SQL in addition to its original tools and languages.
Relational model
The relational model is a simple model that provides flexibility. It organizes data into two-dimensional arrays
known as relations, or tables as related to databases. These relations consist of a heading and a set of zero or more
tuples in arbitrary order. The heading is an unordered set of zero or more attributes, or columns of the table. Each
tuple is a set of unique attributes mapped to values, i.e., a row of data in the table. Data can be associated across
multiple tables with a key. A key is an attribute, or a set of attributes, that is common to both tables. The most
common language associated with the relational model is the Structured Query Language (SQL), though it differs
from the theoretical model in some places.
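As a small illustration (table and column names invented for this example), the following SQL defines two relations
associated through a common key, and a query that associates data across them:

-- Two relations (tables); department_id is the key common to both.
CREATE TABLE department (
    department_id INTEGER PRIMARY KEY,
    name          VARCHAR(50) NOT NULL
);

CREATE TABLE employee (
    employee_id   INTEGER PRIMARY KEY,
    last_name     VARCHAR(50) NOT NULL,
    department_id INTEGER REFERENCES department (department_id)
);

-- Rows (tuples) of the two tables are associated through the key.
SELECT e.last_name, d.name
FROM employee e
JOIN department d ON e.department_id = d.department_id;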
Object model
In recent years, the object-oriented paradigm has been applied in areas such as engineering and spatial databases,
telecommunications and in various scientific domains. The conglomeration of object oriented programming and
database technology led to this new kind of database. These databases attempt to bring the database world and the
application-programming world closer together, in particular by ensuring that the database uses the same type system
as the application program. This aims to avoid the overhead (sometimes referred to as the impedance mismatch) of
converting information between its representation in the database (for example as rows in tables) and its
representation in the application program (typically as objects). At the same time, object databases attempt to
introduce key ideas of object programming, such as encapsulation and polymorphism, into the world of databases.
A variety of ways have been tried for storing objects in a database. Some products have approached the
problem from the application-programming side, by making the objects manipulated by the program persistent. This
also typically requires the addition of some kind of query language, since conventional programming languages do
not provide language-level functionality for finding objects based on their information content. Others have attacked
the problem from the database end, by defining an object-oriented data model for the database, and defining a
database programming language that allows full programming capabilities as well as traditional query facilities.
Database languages
Database languages are dedicated programming languages, tailored and utilized to
define a database (i.e., its specific data types and the relationships among them),
manipulate its content (e.g., insert new data occurrences, and update or delete existing ones), and
query it (i.e., request information: compute and retrieve any information based on its data).
Database languages are data-model-specific, i.e., each language assumes and is based on a certain structure of the
data (which typically differs among different data models). They typically have commands to instruct execution of
the desired operations in the database. Each such command is equivalent to a complex expression (program) in a
regular programming language, and thus programming in dedicated (database) languages simplifies the task of
handling databases considerably. An expression in a database language is automatically transformed (by a compiler
or interpreter, as with regular programming languages) into a proper computer program that runs while accessing the
database and provides the needed results. The following are notable examples:
SQL for the Relational model
A major Relational model language supported by all the relational DBMSs and a standard.
SQL was one of the first commercial languages for the relational model. Despite not adhering to the relational model
as described by Codd, it has become the most widely used database language.[10][11] Though often described as a
declarative language, and to a great extent it is one, SQL also includes procedural elements. SQL became a standard
of the American National Standards Institute (ANSI) in 1986, and of the International Organization for
Standardization (ISO) in 1987. Since then the standard has been enhanced several times with added features.
However, issues of SQL code portability between major RDBMS products still exist due to lack of full compliance
with, or different interpretations of, the standard. Among the reasons mentioned are the large size and incomplete
specification of the standard, as well as vendor lock-in.
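For instance, the following standard SQL query (names invented for illustration) is declarative: it states what
information is wanted, leaving it to the DBMS to decide how to compute it:

SELECT last_name, salary
FROM employee
WHERE salary > 50000
ORDER BY last_name;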
OQL for the Object model
An object model language standard (by the Object Data Management Group) that has influenced the design of some
of the newer query languages like JDOQL and EJB QL, though they cannot be considered different flavors of OQL.
XQuery for the XML model
XQuery is an XML based database language (also named XQL). SQL/XML combines XQuery and XML with
SQL.[12]
Database architecture
Database architecture (to be distinguished from DBMS architecture; see below) may be viewed, to some extent, as
an extension of data modeling. It is used to conveniently answer the requirements of different end-users of a same
database, as well as for other benefits. For example, a financial department of a company needs the payment details
of all employees as part of the company's expenses, but not the many other details about employees that are of
interest to the human resources department. Thus different departments need different views of the company's
database, which all include the employees' payments, possibly at different levels of detail (and presented in different
visual forms). To meet such requirements effectively, database architecture consists of three levels: external,
conceptual and internal. Clearly separating the three levels was a major feature of the relational database model
implementations that dominate 21st century databases.[13]
The external level defines how each end-user type understands the organization of its respective relevant data in
the database, i.e., the different needed end-user views. A single database can have any number of views at the
external level.
The conceptual level unifies the various external views into a coherent whole, global view.[13] It provides the
common denominator of all the external views. It comprises all the generic data needed by end-users, i.e., all the
data from which any view may be derived/computed, provided in the simplest possible way, and comprises the
backbone of the database. It is outside the scope of the various database end-users; it serves database application
developers, and is defined by the database administrators who build the database.
The internal level (or physical level) is, as a matter of fact, part of the database implementation inside a DBMS
(see Implementation section below). It is concerned with cost, performance, scalability and other operational
matters. It deals with the storage layout of the conceptual level, provides supporting storage structures like indexes
to enhance performance, and occasionally stores data of individual views (materialized views), computed from
generic data, if performance justification exists for such redundancy. It balances all the external views'
performance requirements, possibly conflicting, in an attempt to optimize the overall database usage by all its
end-users according to the database goals and priorities.
All the three levels are maintained and updated according to changing needs by database administrators who often
also participate in the database design.
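As a sketch of how the levels can look in a relational DBMS (all names invented for this example), an external view
can be defined over a conceptual-level table, exposing only the data a particular end-user type needs:

-- Conceptual level: a generic employee table.
CREATE TABLE employee (
    employee_id INTEGER PRIMARY KEY,
    last_name   VARCHAR(50),
    salary      DECIMAL(10, 2),
    department  VARCHAR(50)
);

-- External level: the financial department's view exposes payment details only.
CREATE VIEW employee_payments AS
SELECT employee_id, salary
FROM employee;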
The above three-level database architecture also relates to, and is motivated by, the concept of data independence,
which has long been described as a desired database property and was one of the major initial driving forces of the
Relational model. In the context of this architecture it means that changes made at a certain level do not affect
definitions and software developed with higher-level interfaces, and are incorporated at the higher levels
automatically. For example, changes in the internal level do not affect application programs written using
conceptual-level interfaces, which saves substantial change work that would otherwise be needed.
In summary, the conceptual level is a level of indirection between internal and external. On one hand it provides a
common view of the database, independent of the different external view structures, and on the other hand it is
uncomplicated by details of how the data are stored or managed (the internal level). In principle every level, and even
every external view, can be presented by a different data model. In practice a given DBMS usually uses the same
data model for both the external and the conceptual levels (e.g., relational model). The internal level, which is hidden
inside the DBMS and depends on its implementation (see Implementation section below), requires a different level
of detail and uses its own data structure types, typically different in nature from the structures of the external and
conceptual levels which are exposed to DBMS users (e.g., the data models above): While the external and
conceptual levels are focused on and serve DBMS users, the concern of the internal level is effective implementation
details.
Database security
Database security deals with the various aspects of protecting the database content, its owners, and its users. It
ranges from protection against intentional unauthorized database uses to unintentional database accesses by
unauthorized entities (e.g., a person or a computer program).
The following are major areas of database security (among many others).
Access control
Database access control deals with controlling who (a person or a certain computer program) is allowed to access
what information in the database. The information may comprise specific database objects (e.g., record types,
specific records, data structures), certain computations over certain objects (e.g., query types, or specific queries), or
utilizing specific access paths to the former (e.g., using specific indexes or other data structures to access
information).
Database access controls are set by special personnel, authorized by the database owner, who use dedicated
protected security DBMS interfaces.
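In SQL-based DBMSs, for example, such controls are commonly expressed with GRANT and REVOKE statements
(the user and table names below are invented for illustration, and the details of user management vary between
products):

-- Allow a user to read and insert into a table, but nothing else.
GRANT SELECT, INSERT ON employee TO clerk_user;

-- Withdraw a previously granted privilege.
REVOKE INSERT ON employee FROM clerk_user;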
Data security
The definition of data security varies and may overlap with other database security aspects. Broadly it deals with
protecting specific chunks of data, both physically (i.e., from corruption, destruction, or removal; e.g., see
Physical security) and against the interpretation of them, or parts of them, into meaningful information (e.g., by
looking at the strings of bits that they comprise and concluding specific valid credit-card numbers; e.g., see Data
encryption).
Database audit
Database audit primarily involves monitoring that no security breach, in any aspect, has happened. If a security
breach is discovered, then all possible corrective actions are taken.
Database design
Database design is done before building a database, to meet the needs of end-users within a given
application/information system that the database is intended to support. The database design defines the needed data
and data structures that such a database comprises. A design is typically carried out according to the common three
architectural levels of a database (see Database architecture above). First, the conceptual level is designed, which
defines the overall picture/view of the database and reflects all the real-world elements (entities) the database
intends to model, as well as the relationships among them. On top of it, the external level, the various views of the
database, are designed according to the (possibly completely different) needs of specific end-user types. More
external views can be added later. External view requirements may modify the design of the conceptual level (i.e.,
add/remove entities and relationships), but usually a well-designed conceptual level for an application supports most
of the needed external views. The conceptual view also determines the internal level (which primarily deals with
data layout in storage) to a great extent. External view requirements may add supporting storage structures, like
materialized views and indexes, for enhanced performance. Typically the internal layer is optimized for top
performance, in an averaged way that takes into account the performance requirements (possibly conflicting) of
different external views according to
their relative importance. While the conceptual and external levels design can usually be done independently of any
DBMS (DBMS-independent design software packages exist, possibly with interfaces to some specific popular
DBMSs), the internal level design highly relies on the capabilities and internal data structure of the specific DBMS
utilized (see the Implementation section below).
A common way to carry out conceptual-level design is to use the entity-relationship model (ERM) (both the basic
model and the enhancements that it has undergone), since it provides a straightforward, intuitive perception of
an application's elements and semantics. An alternative approach, which preceded the ERM, is to use the Relational
model and dependencies (mathematical relationships) among data to normalize the database, i.e., to define the
("optimal") relations (data record or tuple types) in the database. Though a large body of research exists for this
method, it is more complex, less intuitive, and not more effective than the ERM method. Thus normalization is less
utilized in practice than the ERM method.
The ERM may be less subtle than normalization in several aspects, but it captures the main needed dependencies
which are induced by keys/identifiers of entities and relationships. Also the ERM inherently includes the important
inclusion dependencies (i.e., an entity instance that does not exist (has not been explicitly inserted) cannot appear in
a relationship with other entities), which usually have been ignored in normalization.[14] In addition the ERM allows
entity type generalization (the Is-a relationship) and implied property (attribute) inheritance (similar to that
found in the object model).
Another aspect of database design is its security. It involves both defining access control to database objects (e.g.,
Entities, Views) as well as defining security levels and methods for the data themselves (See Database security
above).
Entities and relationships
The most common database design methods are based on the entity relationship model (ERM, or ER model). This
model views the world in a simplistic but very powerful way: It consists of "Entities" and the "Relationships" among
them. Accordingly a database consists of entity and relationship types, each with defined attributes (field types) that
model concrete entities and relationships. Modeling a database in this way typically yields an effective one with
desired properties (as in some normal forms; see normalization below). Such models can be translated to any other
data model required by any specific DBMS for building an effective database.
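As a sketch of such a translation into the relational model (all names invented for this example), entity types become
tables, and a many-to-many relationship type becomes a table whose key combines the keys of the entities it relates:

CREATE TABLE student (
    student_id INTEGER PRIMARY KEY,
    name       VARCHAR(50)
);

CREATE TABLE course (
    course_id INTEGER PRIMARY KEY,
    title     VARCHAR(50)
);

-- The "Enrolled-in" relationship between the two entity types.
CREATE TABLE enrollment (
    student_id INTEGER REFERENCES student (student_id),
    course_id  INTEGER REFERENCES course (course_id),
    PRIMARY KEY (student_id, course_id)
);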
Database normalization
In the design of a relational database, the process of organizing database relations to minimize redundancy is called
normalization. The goal is to produce well-structured relations so that additions, deletions, and modifications of a
field can be made in just one relation (table) without worrying about appearance and update of the same field in
other relations. The process is algorithmic and based on dependencies (mathematical relations) that exist among
relations' field types. The process result is bringing the database relations into a certain "normal form". Several
normal forms exist with different properties.
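As a small illustration (schema invented for this example), a relation that stores a department name with every
employee record repeats that name redundantly; normalization moves the dependency of department_name on
department_id into its own relation, so the name is stored, and can be updated, in just one place:

-- Unnormalized: renaming a department requires updating many employee rows.
CREATE TABLE employee_unnormalized (
    employee_id     INTEGER PRIMARY KEY,
    employee_name   VARCHAR(50),
    department_id   INTEGER,
    department_name VARCHAR(50)
);

-- Normalized: the department name is stored once.
CREATE TABLE department (
    department_id   INTEGER PRIMARY KEY,
    department_name VARCHAR(50)
);

CREATE TABLE employee (
    employee_id   INTEGER PRIMARY KEY,
    employee_name VARCHAR(50),
    department_id INTEGER REFERENCES department (department_id)
);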
Database building, maintaining, and tuning
Typically the database becomes operational while empty of application data; data are accumulated during its
operation.
After the database is built and made operational, the database maintenance stage arrives: various database
parameters may need changes and tuning for better performance, the application's data structures may be changed or
added to, new related application programs may be written to add to the application's functionality, etc.
Miscellaneous areas
Database migration between DBMSs
See also Database migration in Data migration
A database built with one DBMS is not portable to another DBMS (i.e., the other DBMS cannot run it). However, in
some situations it is desirable to move, or migrate, a database from one DBMS to another. The reasons are primarily
economical (different DBMSs may have different total costs of ownership, or TCO), functional, and operational
(different DBMSs may have different capabilities). The migration involves the database's transformation from one
DBMS type to another. The transformation should maintain (if possible) the database-related application (i.e., all
related application programs) intact. Thus, the database's conceptual and external architectural levels should be
maintained in the transformation. It may be desired that some aspects of the internal architectural level are
maintained as well. A complex or large database migration may be a complicated and costly (one-time) project by
itself, which should be factored into the decision to migrate. This is in spite of the fact that tools may exist to help
migration between specific DBMSs. Typically a DBMS vendor provides tools to help import databases from other
popular DBMSs.
Database storage
Database storage is the container of the physical materialization of a database. It comprises the Internal (physical)
level in the database architecture. It also contains all the information needed (e.g., metadata, "data about the data",
and internal data structures) to reconstruct the Conceptual level and External level from the Internal level when
needed. It is not part of the DBMS but rather manipulated by the DBMS (by its Storage engine; see above) to
manage the database that resides in it. Though typically accessed by a DBMS through the underlying Operating
system (and often utilizing the operating system's file systems as intermediaries for storage layout), storage
properties and configuration settings are extremely important for the efficient operation of the DBMS, and thus are
closely maintained by database administrators. A DBMS, while in operation, always has its database residing in
several types of storage (e.g., memory and external storage). The database data and the additional needed
information, possibly in very large amounts, are coded into bits. Data typically reside in the storage in structures that
look completely different from the way the data look at the conceptual and external levels, but in ways that attempt
to optimize (as best as possible) these levels' reconstruction when needed by users and programs, as well as the
computing of additional types of needed information from the data (e.g., when querying the database).
In principle the database storage can be viewed as a linear address space, where every bit of data has its unique
address in this address space. Practically only a very small percentage of addresses is kept as initial reference points
(which also requires storage), and most of the database data are accessed by indirection using displacement
calculations (distance in bits from the reference points) and data structures which define access paths (using pointers)
to all needed data in an effective manner, optimized for the needed data access operations.
Data
Coding the data and Error-correcting codes
Data are encoded by assigning a bit pattern to each language alphabet character, digit, other numerical patterns,
and multimedia object. Many standards exist for encoding (e.g., ASCII, JPEG, MPEG-4).
By adding bits to each encoded unit, redundancy allows both the detection of errors in coded data and their
correction based on mathematical algorithms. Errors occur regularly, with low probabilities, due to random
bit-value flipping, "physical bit fatigue" (loss of the physical bit in storage of its ability to maintain a
distinguishable value, 0 or 1), or due to errors in inter- or intra-computer communication. A random bit flip (e.g.,
due to random radiation) is
typically corrected upon detection. A bit, or a group of malfunctioning physical bits (not always the specific
defective bit is known; group definition depends on specific storage device) is typically automatically fenced-out,
taken out of use by the device, and replaced with another functioning equivalent group in the device, where the
corrected bit values are restored (if possible). The Cyclic redundancy check (CRC) method is typically used in
storage for error detection.
Data compression
Data compression methods allow, in many cases, a string of bits to be represented by a shorter bit string ("compress")
and the original string to be reconstructed ("decompress") when needed. This allows substantially less storage (by
tens of percent) to be utilized for many types of data, at the cost of more computation (compressing and
decompressing when needed). Analysis of the trade-off between storage cost savings and the costs of related
computations and possible delays in data availability is done before deciding whether to keep certain data in a
database compressed or not.
Data compression is typically controlled through the DBMS's data definition interface, but in some cases may be a
default and automatic.
Data encryption
For security reasons certain types of data (e.g., credit-card information) may be kept encrypted in storage to prevent
the possibility of unauthorized information reconstruction from chunks of storage snapshots (taken either via
unforeseen vulnerabilities in a DBMS, or more likely, by bypassing it).
Data encryption is typically controlled through the DBMS's data definition interface, but in some cases may be a
default and automatic.
Data storage types
This collection of bits describes both the contained database data and their related metadata (i.e., data that describe
the contained data and allow computer programs to manipulate the database data correctly). The size of a database
can nowadays be tens of terabytes, where a byte is eight bits. The physical materialization of a bit can employ
various existing technologies, while new and improved technologies are constantly under development. Common
examples are:
Magnetic medium (e.g., in Magnetic disk) - Orientation of magnetic field in magnetic regions on a surface of
material (two orientation directions, for 0 and 1).
Dynamic random-access memory (DRAM) - State of a miniature electronic circuit consisting of few transistors
(among millions nowadays) in an integrated circuit (two states for 0 and 1).
These two examples are respectively for two major storage types:
Nonvolatile storage can maintain its bit states (0s and 1s) without electrical power supply, or when power supply
is interrupted;
Volatile storage loses its bit values when power supply is interrupted (i.e., its content is erased).
Sophisticated storage units, which can in fact be effective dedicated parallel computers that support a large amount
of nonvolatile storage, must typically also include components with volatile storage. Some such units employ
batteries that can provide power for several hours in case of external power interruption (e.g., see the EMC
Symmetrix) and thus maintain the content of the volatile storage parts intact. Just before such a device's batteries lose
their power the device typically automatically backs-up its volatile content portion (into nonvolatile) and shuts off to
protect its data.
Databases are usually too expensive (in terms of importance and the needed investment in resources, e.g., time and
money, to build them) to be lost by a power interruption. Thus at any point in time most of their content resides in
nonvolatile storage. Even if, for operational reasons, very large portions of them reside in volatile storage (e.g., tens
of gigabytes in volatile memory, for in-memory databases), most of this is backed up in nonvolatile storage. A
relatively small portion of this, which temporarily may not have nonvolatile backup, can be reconstructed by proper
automatic database recovery procedures after volatile storage content loss.
More examples of storage types:
Volatile storage can be found in processors, computer memory (e.g., DRAM), etc.
Non-volatile storage types include ROM, EPROM, Hard disk drives, Flash memory and drives, Storage arrays,
etc.
Storage metrics
Databases always use several types of storage when operational (and, by implication, several when idle). Different types may
significantly differ in their properties, and the optimal mix of storage types is determined by the types and quantities
of operations that each storage type needs to perform, as well as considerations like physical space and energy
consumption and dissipation (which may become critical for a large database). Storage types can be categorized by
the following attributes:
Volatile/Nonvolatile.
Cost of the medium (e.g., per Megabyte), Cost to operate (cost of energy consumed per unit time).
Access speed (e.g., bytes per second).
Granularity from fine to coarse (e.g., size in bytes of access operation).
Reliability (the probability of spontaneous bit value change under various conditions).
Maximal possible number of writes (of any specific bit or specific group of bits; could be constrained by the
technology used (e.g., "write once" or "write twice"), or due to "physical bit fatigue," loss of ability to distinguish
between the 0, 1 states due to many state changes (e.g., in Flash memory)).
Power needed to operate (Energy per time; energy per byte accessed), Energy efficiency, Heat to dissipate.
Packaging density (e.g., realistic number of bytes per volume unit)
Protecting storage device content: Device mirroring (replication) and RAID
See also Disk storage replication
While the malfunction of a group of bits may be resolved by error detection and correction mechanisms (see above),
a storage device malfunction requires different solutions. The following solutions are commonly used and valid for
most storage devices:
Device mirroring (replication) - A common solution to the problem is constantly maintaining an identical copy
of device content on another device (typically of a same type). The downside is that this doubles the storage, and
both devices (copies) need to be updated simultaneously with some overhead and possibly some delays. The
upside is possible concurrent read of a same data group by two independent processes, which increases
performance. When one of the replicated devices is detected to be defective, the other copy is still operational,
and is utilized to generate a new copy on another device (usually one available and operational in a pool of stand-by
devices for this purpose).
Redundant array of independent disks (RAID) - This method generalizes the device mirroring above by
allowing one device in a group of N devices to fail and be replaced with content restored (Device mirroring is
RAID with N=2). RAID groups of N=5 or N=6 are common. N>2 saves storage, when comparing with N=2, at
the cost of more processing during both regular operation (with often reduced performance) and defective device
replacement.
Device mirroring and typical RAID are designed to handle a single device failure in the RAID group of devices.
However, if a second failure occurs before the RAID group is completely repaired from the first failure, then data
can be lost. The probability of a single failure is typically small. Thus the probability of two failures in a same RAID
group in time proximity is much smaller (approximately the probability squared, i.e., multiplied by itself). If a
database cannot tolerate even such smaller probability of data loss, then the RAID group itself is replicated
(mirrored). In many cases such mirroring is done geographically remotely, in a different storage array, to also handle
recovery from disasters (see disaster recovery above).
Database storage layout
Database bits are laid out in storage in data structures and groupings that can take advantage both of known effective
algorithms to retrieve and manipulate them and of the storage's own properties. Typically the storage itself is
designed to meet the requirements of various areas that extensively utilize storage, including databases. A DBMS in
operation
always simultaneously utilizes several storage types (e.g., memory, and external storage), with respective layout
methods.
Database storage hierarchy
A database, while in operation, resides simultaneously in several types of storage. By the nature of contemporary
computers, most of the database part inside a computer that hosts the DBMS resides (partially replicated) in volatile
storage. Data (pieces of the database) that are being processed/manipulated reside inside a processor, possibly in
the processor's caches. These data are read from/written to memory, typically through a computer bus (so far
typically volatile storage components). Computer memory exchanges data with external
storage, typically through standard storage interfaces or networks (e.g., fibre channel, iSCSI). A storage array, a
common external storage unit, typically has a storage hierarchy of its own, from a fast cache, typically consisting of
(volatile and fast) DRAM, which is connected (again via standard interfaces) to drives, possibly of different
speeds, such as flash drives and magnetic disk drives (non-volatile). The drives may be connected to magnetic tapes, on
which the least active parts of a large database, or database backup generations, typically reside.
A correlation typically exists between storage speed and price, and the faster storage is typically volatile.
Data structures
A data structure is an abstract construct that embeds data in a well defined manner. An efficient data structure allows
the data to be manipulated in efficient ways. The data manipulation may include data insertion, deletion, updating and
retrieval in various modes. A certain data structure type may be very effective for some operations and very
ineffective for others. A data structure type is selected during DBMS development to best meet the operations needed
for the types of data it contains. The type of data structure selected for a certain task typically also takes into
consideration the type of storage it resides in (e.g., speed of access, minimal size of storage chunk accessed, etc.). In
some DBMSs database administrators have the flexibility to select among data structure options to contain user
data, for performance reasons. Sometimes the data structures have selectable parameters to tune the database
performance.
Databases may store data in many data structure types.[15] Common examples include heaps.
Application data and DBMS data
A typical DBMS cannot store the data of the application it serves in isolation. To handle the application data, the
DBMS needs to store these data in data structures of its own. In addition, the DBMS
needs its own data structures and many types of bookkeeping data such as indexes and logs. The DBMS data are an
integral part of the database and may comprise a substantial portion of it.
Database indexing
Indexing is a technique for improving database performance. The many types of indexes share the common property
that they reduce the need to examine every entry when running a query. In large databases, this can reduce query
time/cost by orders of magnitude. The simplest form of index is a sorted list of values that can be searched using a
binary search with an adjacent reference to the location of the entry, analogous to the index in the back of a book.
The same data can have multiple indexes (an employee database could be indexed by last name and hire date).
Indexes affect performance, but not results. Database designers can add or remove indexes without changing
application logic, reducing maintenance costs as the database grows and database usage evolves.
Given a particular query, the DBMS' query optimizer is responsible for devising the most efficient strategy for
finding matching data.
Indexes can speed up data access, but they consume space in the database, and must be updated each time the data
are altered. Indexes therefore can speed data access but slow data maintenance. These two properties determine
whether a given index is worth the cost.
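As an illustrative sketch (the table, column, and index names are hypothetical, and exact syntax varies slightly among SQL DBMSs), the employee example above might be declared as follows:

    -- A table indexed two ways, matching the example above.
    CREATE TABLE employee (
        id        INTEGER PRIMARY KEY,
        last_name VARCHAR(100),
        hire_date DATE
    );

    -- Each index speeds up queries that filter on its column...
    CREATE INDEX employee_last_name_idx ON employee (last_name);
    CREATE INDEX employee_hire_date_idx ON employee (hire_date);

    -- ...without changing the results of any query, e.g.:
    SELECT id, last_name FROM employee WHERE last_name = 'Doe';

Dropping either index would leave every query's results unchanged; only the running time of queries and the cost of maintaining the data would change.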
Database data clustering
In many cases substantial performance improvement is gained if different types of database objects that are usually
utilized together are laid out in storage in proximity, i.e., clustered. This usually allows the needed related
objects to be retrieved from storage in a minimal number of input operations (each of which can be substantially time consuming). Even
for in-memory databases, clustering provides a performance advantage, due to the common utilization of large caches for
input-output operations in memory, with similar resulting behavior.
For example, it may be beneficial to cluster a record of an item in stock with all its respective order records. The
decision of whether to cluster certain objects or not depends on the objects' utilization statistics, object sizes, cache
sizes, storage types, etc. In a relational database, clustering the two respective relations "Items" and "Orders" saves
the expensive execution of a join operation between the two relations whenever such a join is needed in a
query (the join result is effectively ready in storage thanks to the clustering, available to be utilized).
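A sketch of the join in question (relation and column names are hypothetical):

    -- The join whose physical cost clustering reduces:
    SELECT i.item_id, i.description, o.order_id, o.quantity
    FROM items i
    JOIN orders o ON o.item_id = i.item_id;

With the two relations clustered, the rows combined by this query are already adjacent in storage, so the DBMS can answer it with far fewer input operations.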
Database materialized views
Often storage redundancy is employed to increase performance. A common example is storing materialized views,
which consist of frequently needed external views or query results. Storing such views saves the expensive
computation of them each time they are needed. The downsides of materialized views are the overhead incurred when
updating them to keep them synchronized with their original updated database data, and the cost of storage
redundancy.
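A minimal sketch using PostgreSQL-style syntax (support and exact syntax for materialized views vary by DBMS; the view and table names are hypothetical):

    -- Precompute and store a frequently needed query result:
    CREATE MATERIALIZED VIEW item_sales_totals AS
        SELECT item_id, SUM(quantity) AS total_quantity
        FROM orders
        GROUP BY item_id;

    -- The stored result must periodically be resynchronized
    -- with its base data:
    REFRESH MATERIALIZED VIEW item_sales_totals;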
Database and database object replication
See also Replication below
Occasionally a database employs storage redundancy by replicating database objects (with one or more copies) to
increase data availability (both to improve performance of simultaneous multiple end-user accesses to the same
database object, and to provide resiliency in the case of partial failure of a distributed database). Updates of a
replicated object need to be synchronized across the object copies. In many cases the entire database is replicated.
Database transactions
As with every software system, a DBMS that operates in a faulty computing environment is prone to failures of
many kinds. A failure can corrupt the respective database unless special measures are taken to prevent this. A DBMS
achieves certain levels of fault tolerance by encapsulating operations within transactions. The concept of a database
transaction (or atomic transaction) has evolved in order to enable both a well understood database system behavior
in a faulty environment, where crashes can happen at any time, and recovery from a crash to a well understood database
state. A database transaction is a unit of work, typically encapsulating a number of operations over a database (e.g.,
reading a database object, writing, acquiring a lock, etc.), an abstraction supported in databases and other systems as well.
Each transaction has well defined boundaries in terms of which program/code executions are included in that
transaction (determined by the transaction's programmer via special transaction commands).
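In SQL these boundaries are typically marked with explicit commands, as in this sketch of a funds transfer (table and column names are hypothetical; BEGIN is the PostgreSQL-style spelling of the standard START TRANSACTION):

    BEGIN;  -- start of the transaction
    UPDATE accounts SET balance = balance - 100 WHERE id = 1;
    UPDATE accounts SET balance = balance + 100 WHERE id = 2;
    COMMIT; -- make both effects permanent, together

Had an error occurred before the COMMIT, issuing ROLLBACK instead would have left the database with no trace of either update.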
ACID rules
Every database transaction obeys the following rules:
Atomicity - Either the effects of all or none of its operations remain ("all or nothing" semantics) when a
transaction is completed (committed or aborted respectively). In other words, to the outside world a committed
transaction appears (by its effects on the database) to be indivisible (atomic), and an aborted transaction does not
leave effects on the database at all, as if it never existed.
Consistency - Every transaction must leave the database in a consistent (correct) state, i.e., maintain the
predetermined integrity rules of the database (constraints upon and among the database's objects). A transaction
must transform a database from one consistent state to another consistent state (however, it is the responsibility of
the transaction's programmer to make sure that the transaction itself is correct, i.e., performs correctly what it
intends to perform (from the application's point of view) while the predefined integrity rules are enforced by the
DBMS). Thus since a database can be normally changed only by transactions, all the database's states are
consistent. An aborted transaction does not change the database state it has started from, as if it never existed
(atomicity above).
Isolation - Transactions cannot interfere with each other (as an end result of their executions). Moreover, usually
(depending on concurrency control method) the effects of an incomplete transaction are not even visible to
another transaction. Providing isolation is the main goal of concurrency control.
Durability - Effects of successful (committed) transactions must persist through crashes (typically by recording
the transaction's effects and its commit event in non-volatile memory).
Isolation, concurrency control, and locking
Isolation provides the ability for multiple users to operate on the database at the same time without corrupting the
data.
Concurrency control comprises the underlying mechanisms in a DBMS which handle isolation and guarantee
related correctness. It is heavily utilized by the database and storage engines (see above) both to guarantee the
correct execution of concurrent transactions, and (via different mechanisms) the correctness of other DBMS
processes. The transaction-related mechanisms typically constrain the timing of database data access operations
(transaction schedules) to certain orders characterized as the serializability and recoverability schedule
properties. Constraining database access operation execution typically means reduced performance (rates of
execution), and thus concurrency control mechanisms are typically designed to provide the best performance
possible under the constraints. Often, when possible without harming correctness, the serializability property is
compromised for better performance. However, recoverability cannot be compromised, since compromising it typically
results in a quick database integrity violation.
Locking is the most common transaction concurrency control method in DBMSs, used to provide both
serializability and recoverability for correctness. In order to access a database object a transaction first needs to
acquire a lock for this object. Depending on the access operation type (e.g., reading or writing an object) and on
the lock type, acquiring the lock may be blocked and postponed, if another transaction is holding a lock for that
object.
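Many SQL DBMSs also let a transaction request a lock explicitly. One common, though not universal, form is SELECT ... FOR UPDATE (table and column names hypothetical):

    BEGIN;
    -- Acquire a write lock on the matching row; a concurrent
    -- transaction requesting the same row is blocked until this
    -- transaction commits or aborts.
    SELECT balance FROM accounts WHERE id = 1 FOR UPDATE;
    UPDATE accounts SET balance = balance - 100 WHERE id = 1;
    COMMIT;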
Query optimization
A query is a request for information from a database. It can be as simple as "find the address of the person with SS#
123-45-6789," or more complex, like "find the average salary of all employed married men in California
between the ages of 30 and 39 who earn less than their wives." Query results are generated by accessing the relevant
database data and manipulating them in a way that yields the requested information. Since database structures are
complex, in most cases, and especially for not-very-simple queries, the data needed for a query can be collected from
a database by accessing it in different ways, through different data structures, and in different orders. Each different
way typically requires different processing time. Processing times of the same query may have large variance, from a
fraction of a second to hours, depending on the way selected. The purpose of query optimization, which is an
automated process, is to find the way to process a given query in minimum time. The large possible variance in time
justifies performing query optimization, though finding the exact optimal way to execute a query, among all
possibilities, is typically very complex, time consuming by itself, may be too costly, and is often practically impossible.
Thus query optimization typically tries to approximate the optimum by comparing several common-sense
alternatives, to provide in a reasonable time a "good enough" plan which typically does not deviate much from the
best possible result.
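Many SQL DBMSs expose the plan their optimizer selects through an EXPLAIN statement (spelling and output vary by system); a simplified sketch of the salary example above, with hypothetical table and column names:

    -- Ask the optimizer how it would process the query, without running it:
    EXPLAIN
    SELECT AVG(salary)
    FROM employee
    WHERE state = 'CA' AND age BETWEEN 30 AND 39;
    -- Typical output names the chosen access paths (e.g., index scan
    -- versus full table scan), the join order, and estimated costs.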
DBMS support for the development and maintenance of a database and its application
A DBMS typically intends to provide a convenient environment in which to develop and later maintain an application built
around its respective database type. A DBMS either provides such tools or allows integration with such external
tools. Examples of such tools relate to database design, application programming, application program maintenance,
database performance analysis and monitoring, database configuration monitoring, DBMS hardware configuration (a
DBMS and its related database may span computers, networks, and storage units), related database mapping
(especially for a distributed DBMS), storage allocation and database layout monitoring, storage migration, etc.
References
[1] Jeffrey Ullman and Jennifer Widom 1997: A First Course in Database Systems, Prentice-Hall Inc., Simon & Schuster, Page 1, ISBN 0-13-861337-0.
[2] C. W. Bachman, The Programmer as Navigator
[3] Graves, Steve. "COTS Databases For Embedded Systems" (http://www.embedded-computing.com/articles/id/?2020), Embedded Computing Design magazine, January 2007. Retrieved on August 13, 2008.
[4] "TeleCommunication Systems Signs up as a Reseller of TimesTen; Mobile Operators and Carriers Gain Real-Time Platform for Location-Based Services" (http://findarticles.com/p/articles/mi_m0EIN/is_2002_June_24/ai_87694370). Business Wire. 2002-06-24.
[5] Argumentation in Artificial Intelligence by Iyad Rahwan, Guillermo R. Simari
[6] "OWL DL Semantics" (http://www.obitko.com/tutorials/ontologies-semantic-web/owl-dl-semantics.html). Retrieved 10 December 2010.
[7] Introducing databases by Stephen Chu, in Conrick, M. (2006) Health informatics: transforming healthcare with technology, Thomson, ISBN 0-17-012731-1, p. 69.
[8] Date, C. J. (June 1, 1999). "When's an extension not an extension?" (http://intelligent-enterprise.informationweek.com/db_area/archives/1999/990106/online1.jhtml;jsessionid=Y2UNK1QFKXMBTQE1GHRSKH4ATMY32JVN). Intelligent Enterprise 2 (8).
[9] Zhuge, H. (2008). The Web Resource Space Model. Web Information Systems Engineering and Internet Technologies Book Series. 4. Springer. ISBN 978-0-387-72771-4.
[10] Chapple, Mike. "SQL Fundamentals" (http://databases.about.com/od/sql/a/sqlfundamentals.htm). Databases. About.com. Retrieved 2009-01-28.
[11] "Structured Query Language (SQL)" (http://publib.boulder.ibm.com/infocenter/db2luw/v9/index.jsp?topic=com.ibm.db2.udb.admin.doc/doc/c0004100.htm). International Business Machines. October 27, 2006. Retrieved 2007-06-10.
[12] Wagner, Michael (2010), "1. Auflage", SQL/XML:2006 - Evaluierung der Standardkonformität ausgewählter Datenbanksysteme, Diplomica Verlag, ISBN 3-8366-9609-6
[13] Date 1990, pp. 31–32
[14] Johann A. Makowsky, Victor M. Markowitz and Nimrod Rotics, 1986: "Entity-relationship consistency for relational schemas" (http://www.springerlink.com/content/p67756164r127m18/), Proceedings of the 1986 Conference on Database Theory (ICDT '86), Lecture Notes in Computer Science, 1986, Volume 243/1986, pp. 306-322, Springer, doi:10.1007/3-540-17187-8_43
[15] Lightstone, Teorey & Nadeau 2007
Further reading
Ling Liu and Tamer M. Özsu (Eds.) (2009). Encyclopedia of Database Systems (http://www.springer.com/computer/database+management+&+information+retrieval/book/978-0-387-49616-0), 4100 p., 60 illus. ISBN 978-0-387-49616-0.
Beynon-Davies, P. (2004). Database Systems. 3rd Edition. Palgrave, Houndmills, Basingstoke.
Connolly, Thomas and Carolyn Begg. Database Systems. New York: Harlow, 2002.
Date, C. J. (2003). An Introduction to Database Systems, Fifth Edition. Addison Wesley. ISBN 0-201-51381-1.
Gray, J. and Reuter, A. Transaction Processing: Concepts and Techniques, 1st edition, Morgan Kaufmann Publishers, 1992.
Kroenke, David M. and David J. Auer. Database Concepts. 3rd ed. New York: Prentice, 2007.
Lightstone, S.; Teorey, T.; Nadeau, T. (2007). Physical Database Design: the database professional's guide to exploiting indexes, views, storage, and more. Morgan Kaufmann Press. ISBN 0-12-369389-6.
Teorey, T.; Lightstone, S. and Nadeau, T. Database Modeling & Design: Logical Design, 4th edition, Morgan Kaufmann Press, 2005. ISBN 0-12-685352-5
External links
Database (http://www.dmoz.org/Computers/Data_Formats/Database/) at the Open Directory Project
Entity–relationship model
In software engineering, an Entity–Relationship model (ER model for short) is an abstract way to describe a
database. It usually starts with a relational database, which stores data in tables. Some of the data in these tables
point to data in other tables - for instance, your entry in the database could point to several entries for each of
the phone numbers that are yours. The ER model would say that you are an entity, each phone number is an
entity, and the relationship between you and the phone numbers is 'has a phone number'. Diagrams created to
design these entities and relationships are called entity–relationship diagrams or ER diagrams.
[Figure: A sample entity–relationship diagram using Chen's notation.]
Overview
Using the three schema approach to software engineering, there are three levels of ER models that may be
developed. The conceptual data model is the highest level ER model in that it contains the least granular detail but
establishes the overall scope of what is to be included within the model set. The conceptual ER model normally
defines master reference data entities that are commonly used by the organization. Developing an enterprise-wide
conceptual ER model is useful to support documenting the data architecture for an organization.
A conceptual ER model may be used as the foundation for one or more logical data models. The purpose of the
conceptual ER model is then to establish structural metadata commonality for the master data entities between the set
of logical ER models. The conceptual data model may be used to form commonality relationships between ER
models as a basis for data model integration.
A logical ER model does not require a conceptual ER model especially if the scope of the logical ER model is to
develop a single disparate information system. The logical ER model contains more detail than the conceptual ER
model. In addition to master data entities, operational and transactional data entities are now defined. The details of
each data entity are developed and the entity relationships between these data entities are established. The logical ER
model is however developed independent of technology into which it will be implemented.
One or more physical ER models may be developed from each logical ER model. The physical ER model is
normally developed to be instantiated as a database. Therefore, each physical ER model must contain enough detail to
produce a database, and each physical ER model is technology dependent, since each database management system is
somewhat different.
The physical model is normally forward engineered to instantiate the structural metadata into a database
management system as relational database objects such as database tables, database indexes such as unique key
indexes, and database constraints such as a foreign key constraint or a commonality constraint. The ER model is also
normally used to design modifications to the relational database objects and to maintain the structural metadata of
the database.
The first stage of information system design uses these models during the requirements analysis to describe
information needs or the type of information that is to be stored in a database. The data modeling technique can be
used to describe any ontology (i.e. an overview and classifications of used terms and their relationships) for a certain
area of interest. In the case of the design of an information system that is based on a database, the conceptual data
model is, at a later stage (usually called logical design), mapped to a logical data model, such as the relational model;
this in turn is mapped to a physical model during physical design. Note that sometimes, both of these phases are
referred to as "physical design".
Examples: an owns relationship between a company and a computer, a supervises relationship between an employee
and a department, a performs relationship between an artist and a song, a proved relationship between a
mathematician and a theorem.
The model's linguistic aspect described above is utilized in the declarative database query language ERROL, which
mimics natural language constructs. ERROL's semantics and implementation are based on Reshaped relational
algebra (RRA), a relational algebra which is adapted to the entityrelationship model and captures its linguistic
aspect.
Entities and relationships can both have attributes. Examples: an employee entity might have a Social Security
Number (SSN) attribute; the proved relationship may have a date attribute.
Every entity (unless it is a weak entity) must have a minimal set of uniquely identifying attributes, which is called
the entity's primary key.
Entity–relationship diagrams don't show single entities or single instances of relations. Rather, they show entity sets
and relationship sets. Example: a particular song is an entity. The collection of all songs in a database is an entity set.
The eaten relationship between a child and her lunch is a single relationship. The set of all such child-lunch
relationships in a database is a relationship set. In other words, a relationship set corresponds to a relation in
mathematics, while a relationship corresponds to a member of the relation.
Certain cardinality constraints on relationship sets may be indicated as well.
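To make the example concrete, here is a hedged sketch of how the Artist and Song entity sets and the performs relationship set might be realized as relational tables (all names hypothetical):

    CREATE TABLE artist (
        artist_id INTEGER PRIMARY KEY,   -- entity set: artists
        name      VARCHAR(100)
    );
    CREATE TABLE song (
        song_id   INTEGER PRIMARY KEY,   -- entity set: songs
        title     VARCHAR(200)
    );
    -- Relationship set "performs": each row is one artist-song relationship.
    CREATE TABLE performs (
        artist_id INTEGER REFERENCES artist,
        song_id   INTEGER REFERENCES song,
        PRIMARY KEY (artist_id, song_id)
    );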
Relationship names
A relationship expressed with a single verb implying direction makes it impossible to discuss the model in
proper English. For example:
the song and the artist are related by a 'performs'
the husband and wife are related by an 'is-married-to'.
Expressing the relationships with a noun resolves this:
the song and the artist are related by a 'performance'
the husband and wife are related by a 'marriage'.
Traditionally, the relationships are expressed twice, (using present continuous verb phrases), once in each direction.
This gives two English statements per relationship. For example:
the song is performed by the artist
the artist performs the song
Role naming
It has also become prevalent to name roles with phrases e.g. is-the-owner-of and is-owned-by etc. Correct nouns in
this case are "owner" and "possession". Thus "person plays the role of owner" and "car plays the role of possession"
rather than "person plays the role of is-the-owner-of" etc.
The use of nouns has direct benefit when generating physical implementations from semantic models. When a
person has two relationships with car then it is possible to very simply generate names such as "owner_person" and
"driver_person" which are immediately meaningful.
Cardinalities
Modifications to the original specification can be beneficial. Chen described look-across cardinalities. As an aside,
the Barker–Ellis notation, used in Oracle Designer, uses same-side for minimum cardinality (analogous to
optionality) and role, but look-across for maximum cardinality (the crow's foot).
In Merise,[5] Elmasri & Navathe[6] and others[7] there is a preference for same-side for roles and both minimum and
maximum cardinalities. Recent researchers (Feinerer,[8] Dullea et al.[9]) have shown that this is more coherent when
applied to n-ary relationships of order > 2.
In Dullea et al. one reads "A 'look across' notation such as used in the UML does not effectively represent the
semantics of participation constraints imposed on relationships where the degree is higher than binary."
In Feinerer it says "Problems arise if we operate under the look-across semantics as used for UML associations.
Hartmann[10] investigates this situation and shows how and why different transformations fail." (Although the
"reduction" mentioned is spurious as the two diagrams 3.4 and 3.5 are in fact the same) and also "As we will see on
the next few pages, the look-across interpretation introduces several difficulties which prevent the extension of
simple mechanisms from binary to n-ary associations."
Semantic modelling
The father of ER modelling said in his seminal paper: "The entity-relationship model adopts the more natural view
that the real world consists of entities and relationships. It incorporates some of the important semantic information
about the real world." [1] He is here in accord with philosophic and theoretical traditions from the time of the Ancient
Greek philosophers: Socrates, Plato and Aristotle (428 BC) through to modern epistemology, semiotics and logic of
Peirce, Frege and Russell. Plato himself associates knowledge with the apprehension of unchanging Forms (The
forms, according to Socrates, are roughly speaking archetypes or abstract representations of the many types of
things, and properties) and their relationships to one another. In his original 1976 article Chen explicitly contrasts
entity–relationship diagrams with record modelling techniques: "The data structure diagram is a representation of the
organisation of records and is not an exact representation of entities and relationships." Several other authors also
support his program:
Kent in "Data and Reality" [11] : "One thing we ought to have clear in our minds at the outset of a modelling
endeavour is whether we are intent on describing a portion of "reality" (some human enterprise) or a data processing
activity."
Abrial in "Data Semantics" : "... the so called "logical" definition and manipulation of data are still influenced
(sometimes unconsciously) by the "physical" storage and retrieval mechanisms currently available on computer
systems."
Stamper: "They pretend to describe entity types, but the vocabulary is from data processing: fields, data items,
values. Naming rules don't reflect the conventions we use for naming people and things; they reflect instead
techniques for locating records in files."
In Jackson's words: "The developer begins by creating a model of the reality with which the system is concerned, the
reality which furnishes its [the system's] subject matter ..."
Elmasri, Navathe: "The ER model concepts are designed to be closer to the user's perception of data and are not
meant to describe the way in which data will be stored in the computer."
A semantic model is a model of concepts; it is sometimes called a "platform independent model". It is an intensional
model. At the latest since Carnap, it is well known that:[12] "...the full meaning of a concept is constituted by two
aspects, its intension and its extension. The first part comprises the embedding of a concept in the world of concepts
as a whole, i.e. the totality of all relations to other concepts. The second part establishes the referential meaning of
the concept, i.e. its counterpart in the real or in a possible world". An extensional model is one that maps to the
elements of a particular methodology or technology, and is thus a "platform specific model". The UML specification
explicitly states that associations in class models are extensional, and this is in fact self-evident by considering the
extensive array of additional "adornments" provided by the specification over and above those provided by any of
the prior candidate "semantic modelling languages" ("UML as a Data Modeling Notation, Part 2" [13]).
Diagramming conventions
Chen's notation for entity–relationship modeling uses rectangles to represent entities, and diamonds to represent
relationships appropriate for first-class objects: they can have attributes and relationships of their own. Entity sets
are drawn as rectangles, relationship sets as diamonds. If an entity set participates in a relationship set, they are
connected with a line.
Attributes are drawn as ovals and are connected with a line to exactly one entity or relationship set.
Cardinality constraints are expressed as follows:
a double line indicates a participation constraint, totality
or surjectivity: all entities in the entity set must
participate in at least one relationship in the relationship
set;
an arrow from entity set to relationship set indicates a
key constraint, i.e. injectivity: each entity of the entity
set can participate in at most one relationship in the
relationship set;
a thick line indicates both, i.e. bijectivity: each entity in
the entity set is involved in exactly one relationship.
[Figure: Two related entities shown using Crow's Foot notation. In this example, an optional relationship is shown
between Artist and Song; the symbols closest to the Song entity represent "zero, one, or many", whereas a Song has
"one and only one" Artist. The former is therefore read as: an Artist (can) perform(s) "zero, one, or many" song(s).]
Other diagramming conventions include:
Bachman notation
Barker's Notation
EXPRESS
IDEF1X[14]
Martin notation
(min, max)-notation of Jean-Raymond Abrial in 1974
UML class diagrams
Merise
Object-Role Modeling
ER diagramming tools
There are many ER diagramming tools. Free software ER diagramming tools that can interpret and generate ER
models and SQL and do database analysis are MySQL Workbench (formerly DBDesigner), and Open ModelSphere
(open-source). A freeware ER tool that can generate database and application layer code (webservices) is the RISE
Editor.
Proprietary ER diagramming tools are Avolution, dbForge Studio for MySQL, ER/Studio, ERwin, MagicDraw,
MEGA International, ModelRight, Navicat Data Modeler, OmniGraffle, Oracle Designer, PowerDesigner, Rational
Rose, Sparx Enterprise Architect, SQLyog, System Architect, Toad Data Modeler, and Visual Paradigm.
Free software diagram tools just draw the shapes without having any knowledge of what they mean, nor do they
generate SQL. These include Creately, yEd, LucidChart, Kivio, and Dia.
Limitations
ER models assume information content that can readily be represented in a relational database. They describe only a
relational structure for this information.
Hence, they are inadequate for systems in which the information cannot readily be represented in relational form,
such as with semi-structured data.
Furthermore, for many systems, the possible changes to the information contained are nontrivial and important
enough to warrant explicit specification. Some authors have extended ER modeling with constructs to represent
change, an approach supported by the original author;[15] an example is Anchor Modeling.
An alternative is to model change separately, using a process modeling technique.
Additional techniques can be used for other aspects of systems. For instance, ER models roughly correspond to just 1
of the 14 different modeling techniques offered by UML.
Another limitation: ER modeling is aimed at specifying information from scratch. This suits the design of new,
standalone information systems, but is of less help in integrating pre-existing information sources that already define
their own data representations in detail.
Even where it is suitable in principle, ER modeling is rarely used as a separate activity. One reason for this is today's
abundance of tools to support diagramming and other design support directly on relational database management
systems. These tools can readily extract database diagrams that are very close to ER diagrams from existing
databases, and they provide alternative views on the information contained in such diagrams.
In a survey, Brodie and Liu[16] could not find a single instance of entity–relationship modeling inside a sample of ten
Fortune 100 companies. Badia and Lemire[17] blame this lack of use on the lack of guidance but also on the lack of
benefits, such as lack of support for data integration.
Also, the enhanced entity–relationship model (EER modeling) introduces several concepts which are not present in
ER modeling.
References
[1] "The Entity Relationship Model: Toward a Unified View of Data" (http:/ / citeseerx. ist. psu. edu/ viewdoc/ summary?doi=10. 1. 1. 123.
1085) for entityrelationship modeling.
[2] A.P.G. Brown, "Modelling a Real-World System and Designing a Schema to Represent It", in Douque and Nijssen (eds.), Data Base
Description, North-Holland, 1975, ISBN 0-7204-2833-5.
[3] Designing a Logical Database: Supertypes and Subtypes (http:/ / technet. microsoft. com/ en-us/ library/ cc505839. aspx)
[4] Paul Beynon-Davies (2004). Database Systems. Houndmills, Basingstoke, UK: Palgrave
[5] Hubert Tardieu, Arnold Rochfeld and Ren Colletti La methode MERISE: Principes et outils (Paperback - 1983)
[6] Elmasri, Ramez, B. Shamkant, Navathe, Fundamentals of Database Systems, third ed., Addison-Wesley, Menlo Park, CA, USA, 2000.
[7] ER 2004 : 23rd International Conference on Conceptual Modeling, Shanghai, China, November 8-12, 2004 (http:/ / books. google. com/
books?id=odZK99osY1EC& pg=PA52& img=1& pgis=1& dq=genova& sig=ACfU3U3tDC_q8WOMqUJW4EZCa5YQywoYLw& edge=0)
[8] A Formal Treatment of UML Class Diagrams as an Efficient Method for Configuration Management 2007 (http:/ / publik. tuwien. ac. at/
files/ pub-inf_4582. pdf)
[9] James Dullea, Il-Yeol Song, Ioanna Lamprou - An analysis of structural validity in entity-relationship modeling 2002 (http:/ / www. ischool.
drexel. edu/ faculty/ song/ publications/ p_DKE_03_Validity. pdf)
[10] "Reasoning about participation constraints and Chen's constraints" S Hartmann - 2003 (http:/ / www. acs. org. au/ documents/ public/ crpit/
CRPITV17Hartmann. pdf)
[11] http:/ / www. bkent. net/ Doc/ darxrp. htm
[12] http:/ / wenku. baidu. com/ view/ 8048e7bb1a37f111f1855b22. html
[13] http:/ / www. tdan. com/ view-articles/ 8589
[14] IDEF1X (https:/ / idbms. navo. navy. mil/ DataModel/ IDEF1X. html)
[15] P. Chen. Suggested research directions for a new frontier: Active conceptual modeling (http:/ / www. springerlink. com/ content/
5160x2634402663r/ ). ER 2006, volume 4215 of Lecture Notes in Computer Science, pages 14. Springer Berlin / Heidelberg, 2006.
[16] M. L. Brodie and J. T. Liu. The power and limits of relational technology in the age of information ecosystems (http:/ / www.
michaelbrodie. com/ documents/ The Power and Limits of Relational Technology In the Age of Information Ecosystems V2. pdf). On The
Move Federated Conferences, 2010.
30
Entityrelationship model
[17] A. Badia and D. Lemire. A call to arms: revisiting database design (http:/ / dl. acm. org/ citation. cfm?id=2070750). SIGMOD Record 40, 3
(November 2011), 61-69.
Further reading
Richard Barker (1990). CASE Method: Tasks and Deliverables. Wokingham, England: Addison-Wesley.
Paul Beynon-Davies (2004). Database Systems. Houndmills, Basingstoke, UK: Palgrave
Peter Chen (March 1976). "The Entity-Relationship Model - Toward a Unified View of Data" (http://csc.lsu.edu/news/erd.pdf). ACM Transactions on Database Systems 1 (1): 9–36. ACM Press. ISSN 0362-5915. doi:10.1145/320434.320440.
External links
Entity Relationship Modeling (http://www.devarticles.com/c/a/Development-Cycles/
Entity-Relationship-Modeling/) - Article from Development Cycles
Entity Relationship Modelling (http://www.databasedesign.co.uk/bookdatabasesafirstcourse/chap3/chap3.
htm)
An Entity Relationship Diagram Example (http://rapidapplicationdevelopment.blogspot.com/2007/06/
entity-relationship-diagram-example.html). Demonstrates the crow's feet notation by way of an example.
"Entity-Relationship Modeling: Historical Events, Future Trends, and Lessons Learned" (http://bit.csc.lsu.edu/
~chen/pdf/Chen_Pioneers.pdf) by Peter Chen.
"English, Chinese and ER diagrams" (http://bit.csc.lsu.edu/~chen/pdf/ER_C.pdf) by Peter Chen.
Case study: E-R diagram for Acme Fashion Supplies (http://www.cilco.co.uk/briefing-studies/
acme-fashion-supplies-feasibility-study/slides/logical-data-structure.html) by Mark H. Ridley.
Logical Data Structures (LDSs) - Getting started (http://www.cems.uwe.ac.uk/~tdrewry/lds.htm) by Tony
Drewry.
Introduction to Data Modeling (http://www.utexas.edu/its/archive/windows/database/datamodeling/index.
html)
Lecture by Prof. Dr. Muhittin Gökmen (http://www3.itu.edu.tr/~gokmen/SE-lecture-5.pdf), Department of
Computer Engineering, Istanbul Technical University.
ER-Diagram Convention (http://www.scribd.com/doc/3053988/ER-Diagram-convention)
Crow's Foot Notation (http://www2.cs.uregina.ca/~bernatja/crowsfoot.html)
"Articulated Entity Relationship (AER) Diagram for Complete Automation of Relational Database
Normalization" (http://airccse.org/journal/ijdms/papers/0510ijdms06.pdf) P. S. Dhabe, Dr. M. S.
Patwardhan, Asavari A. Deshpande.
Database design
Database design is the process of producing a detailed data model of a database. This logical data model contains all
the logical and physical design choices and physical storage parameters needed to generate a design in a Data
Definition Language, which can then be used to create a database. A fully attributed data model contains detailed
attributes for each entity.
The term database design can be used to describe many different parts of the design of an overall database system.
Principally, and most correctly, it can be thought of as the logical design of the base data structures used to store the
data. In the relational model these are the tables and views. In an object database the entities and relationships map
directly to object classes and named relationships. However, the term database design could also be used to apply to
the overall process of designing, not just the base data structures, but also the forms and queries used as part of the
overall database application within the database management system (DBMS).[1]
The process of doing database design generally consists of a number of steps which will be carried out by the
database designer. Usually, the designer must:
Determine the relationships between the different data elements.
Superimpose a logical structure upon the data on the basis of these relationships.[2]
Normalization
In the field of relational database design, normalization is a systematic way of ensuring that a database structure is
suitable for general-purpose querying and free of certain undesirable characteristics (insertion, update, and deletion
anomalies) that could lead to a loss of data integrity.
A standard piece of database design guidance is that the designer should create a fully normalized design; selective
denormalization can subsequently be performed, but only for performance reasons. However, some modeling
disciplines, such as the dimensional modeling approach to data warehouse design, explicitly recommend
non-normalized designs, i.e. designs that in large part do not adhere to 3NF. The normal forms include 1NF, 2NF,
3NF, Boyce–Codd NF (3.5NF), 4NF and 5NF.
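A minimal sketch of the idea (hypothetical schema): an unnormalized orders table that repeats customer details invites update anomalies, which normalization removes by splitting the table.

    -- Unnormalized: customer_name is repeated in every order row, so a
    -- customer's change of name must be applied to many rows:
    --   orders(order_id, customer_id, customer_name, item, quantity)

    -- Normalized: each fact is stored exactly once.
    CREATE TABLE customer (
        customer_id   INTEGER PRIMARY KEY,
        customer_name VARCHAR(100)
    );
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customer,
        item        VARCHAR(100),
        quantity    INTEGER
    );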
Physical design
The physical design of the database specifies the physical configuration of the database on the storage media. This
includes detailed specification of data elements, data types, indexing options and other parameters residing in the
DBMS data dictionary. It is the detailed design of a system that includes modules and the database's hardware and
software specifications.
References
[1] Gehani, N. (2006). The Database Book: Principles and practice using MySQL. 1st ed., Summit, NJ.: Silicon Press
[2] Teorey, T.J., Lightstone, S.S., et al., (2009). Database Design: Know it all.1st ed. Burlington, MA.: Morgan Kaufmann Publishers
[3] Database design basics. (n.d.). Database design basics. Retrieved May 1, 2010, from http://office.microsoft.com/en-us/access/HA012242471033.aspx
[4] Teorey, T.; Lightstone, S. and Nadeau, T.(2005) Database Modeling & Design: Logical Design, 4th edition, Morgan Kaufmann Press. ISBN
0-12-685352-5
Further reading
S. Lightstone, T. Teorey, T. Nadeau, Physical Database Design: the database professional's guide to exploiting
indexes, views, storage, and more, Morgan Kaufmann Press, 2007. ISBN 0-12-369389-6
External links
(http://www.sqlteam.com/article/database-design-and-modeling-fundamentals)
(http://office.microsoft.com/en-us/access/HA012242471033.aspx)
Database Normalization Basics (http://databases.about.com/od/specificproducts/a/normalization.htm) by
Mike Chapple (About.com)
Database Normalization Intro (http://www.databasejournal.com/sqletc/article.php/1428511), Part 2 (http://
www.databasejournal.com/sqletc/article.php/26861_1474411_1)
"An Introduction to Database Normalization" (http://web.archive.org/web/20110606025027/http://dev.
mysql.com/tech-resources/articles/intro-to-normalization.html). Archived from the original (http://dev.
mysql.com/tech-resources/articles/intro-to-normalization.html) on 2011-06-06. Retrieved 2012-02-25.
"Normalization" (http://web.archive.org/web/20100106115112/http://www.utexas.edu/its/archive/
windows/database/datamodeling/rm/rm7.html). Archived from the original (http://www.utexas.edu/its/
windows/database/datamodeling/rm/rm7.html) on 2010-01-06. Retrieved 2012-02-25.
Efficient Database Design (http://www.sum-it.nl/cursus/enindex.php3#dbdesign)
Data Modelers Community (http://www.datamodelers.com/)
Relational database design tutorial (http://en.tekstenuitleg.net/articles/software/database-design-tutorial/intro.
html)
Database design (http://www.dmoz.org/Computers/Data_Formats/Database/) at the Open Directory Project
Relational database
A relational database is a collection of data items organised as a set of formally described tables from which data
can be accessed easily. A relational database is created using the relational model. The software used in a relational
database is called a relational database management system (RDBMS). A relational database is the predominant
choice in storing data, over other models like the hierarchical database model or the network model.
The relational database was first defined in 1970 by Edgar Codd, of IBM's San Jose Research Laboratory.[1]
Terminology
Relational database theory uses a set of mathematical terms, which are roughly equivalent to SQL database
terminology. The table below summarizes some of the most important relational database terms and their SQL
database equivalents.
Relational term    SQL equivalent
tuple              row
attribute          column
Relations or Tables
A relation is defined as a set of tuples that have the same attributes. A tuple usually represents an object and
information about that object. Objects are typically physical objects or concepts. A relation is usually described as a
table, which is organized into rows and columns. All the data referenced by an attribute are in the same domain and
conform to the same constraints. The relational model specifies that the tuples of a relation have no specific order
and that the tuples, in turn, impose no order on the attributes. Applications access data by specifying queries, which
use operations such as select to identify tuples, project to identify attributes, and join to combine relations. Relations
can be modified using the insert, delete, and update operators. New tuples can supply explicit values or be derived
from a query. Similarly, queries identify tuples for updating or deleting. It is necessary for each tuple of a relation to
be uniquely identifiable by some combination (one or more) of its attribute values. This combination is referred to as
the primary key.
Domain
A domain describes the set of possible values for a given attribute, and can be considered a constraint on the value of
the attribute. Mathematically, attaching a domain to an attribute means that any value for the attribute must be an
element of the specified set. The character data value 'ABC', for instance, is not in the integer domain. The integer
value 123 satisfies the domain constraint.
Constraints
Constraints make it possible to further restrict the domain of an attribute. For instance, a constraint can restrict a
given integer attribute to values between 1 and 10. Constraints provide one method of implementing business rules
in the database. SQL implements constraint functionality in the form of check constraints. Constraints restrict the
data that can be stored in relations. These are usually defined using expressions that result in a boolean value,
indicating whether or not the data satisfies the constraint. Constraints can apply to single attributes, to a tuple
(restricting combinations of attributes) or to an entire relation. Since every attribute has an associated domain, there
are constraints (domain constraints). The two principal rules for the relational model are known as entity integrity
and referential integrity.
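The 1-to-10 restriction mentioned above might be declared as a SQL check constraint, for example (table and column names hypothetical):

    CREATE TABLE rating (
        item_id INTEGER,
        score   INTEGER CHECK (score BETWEEN 1 AND 10)  -- domain restricted to 1..10
    );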
Primary keys
A primary key uniquely identifies a tuple within a table. In order for an attribute to be a good primary key it
must not repeat. While natural attributes are sometimes good primary keys, surrogate keys are often used instead. A
surrogate key is an artificial attribute assigned to an object which uniquely identifies it (for instance, in a table of
information about students at a school they might all be assigned a student ID in order to differentiate them). The
surrogate key has no intrinsic (inherent) meaning, but rather is useful through its ability to uniquely identify a tuple.
Another common occurrence, especially in regard to N:M cardinality, is the composite key. A composite key is a
key made up of two or more attributes within a table that (together) uniquely identify a record. (For example, in a
database relating students, teachers, and classes. Classes could be uniquely identified by a composite key of their
room number and time slot, since no other class could have exactly the same combination of attributes. In fact, use of
a composite key such as this can be a form of data verification, albeit a weak one.)
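A sketch of the classes example (names hypothetical); the composite key is declared over the two attributes together:

    CREATE TABLE class (
        room_number INTEGER,
        time_slot   VARCHAR(20),
        teacher     VARCHAR(100),
        -- Composite key: no two classes may share both room and time slot.
        PRIMARY KEY (room_number, time_slot)
    );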
Foreign key
A foreign key is a field in a relational table that matches the primary key column of another table. The foreign key
can be used to cross-reference tables. Foreign keys need not have unique values in the referencing relation. Foreign
keys effectively use the values of attributes in the referenced relation to restrict the domain of one or more attributes
in the referencing relation. A foreign key could be described formally as: "For all tuples in the referencing relation
projected over the referencing attributes, there must exist a tuple in the referenced relation projected over those same
attributes such that the values in each of the referencing attributes match the corresponding values in the referenced
attributes."
Stored procedures
A stored procedure is executable code that is associated with, and generally stored in, the database. Stored
procedures usually collect and customize common operations, like inserting a tuple into a relation, gathering
statistical information about usage patterns, or encapsulating complex business logic and calculations. Frequently
they are used as an application programming interface (API) for security or simplicity. Implementations of stored
procedures on SQL RDBMSs often allow developers to take advantage of procedural extensions (often
vendor-specific) to the standard declarative SQL syntax. Stored procedures are not part of the relational database
model, but all commercial implementations include them.
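Stored procedure syntax is highly vendor-specific; as one hedged illustration, a PostgreSQL-style function that encapsulates a common insert (names hypothetical):

    -- PL/pgSQL sketch: hide a routine insert behind a callable unit.
    CREATE FUNCTION add_student(p_name VARCHAR) RETURNS VOID AS $$
    BEGIN
        INSERT INTO student (name) VALUES (p_name);
    END;
    $$ LANGUAGE plpgsql;

    SELECT add_student('Ada');  -- invoked like any other function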
Index
An index is one way of providing quicker access to data. Indices can be created on any combination of attributes on
a relation. Queries that filter using those attributes can find matching tuples randomly using the index, without
having to check each tuple in turn. This is analogous to using the index of a book to go directly to the page on which
the information you are looking for is found, that is you do not have to read the entire book to find what you are
looking for. Relational databases typically supply multiple indexing techniques, each of which is optimal for some
combination of data distribution, relation size, and typical access pattern. Indices are usually implemented via B+
trees, R-trees, and bitmaps. Indices are usually not considered part of the database, as they are considered an
implementation detail, though indices are usually maintained by the same group that maintains the other parts of the
database. Use of efficient indexes on both primary and foreign keys can dramatically improve
query performance, because B-tree indexes result in query times proportional to log(n), where n is the number
of rows in a table, and hash indexes result in constant-time queries (no size dependency, as long as the relevant part of
the index fits into memory).
Relational operations
Queries made against the relational database, and the derived relvars in the database are expressed in a relational
calculus or a relational algebra. In his original relational algebra, Codd introduced eight relational operators in two
groups of four operators each. The first four operators were based on the traditional mathematical set operations:
The union operator combines the tuples of two relations and removes all duplicate tuples from the result. The
relational union operator is equivalent to the SQL UNION operator.
The intersection operator produces the set of tuples that two relations share in common. Intersection is
implemented in SQL in the form of the INTERSECT operator.
The difference operator acts on two relations and produces the set of tuples from the first relation that do not exist
in the second relation. Difference is implemented in SQL in the form of the EXCEPT or MINUS operator.
The cartesian product of two relations is a join that is not restricted by any criteria, resulting in every tuple of the
first relation being matched with every tuple of the second relation. The cartesian product is implemented in SQL
as the CROSS JOIN operator.
The remaining operators proposed by Codd involve special operations specific to relational databases:
The selection, or restriction, operation retrieves tuples from a relation, limiting the results to only those that meet
a specific criterion, i.e. a subset in terms of set theory. The SQL equivalent of selection is the SELECT query
statement with a WHERE clause.
The projection operation extracts only the specified attributes from a tuple or set of tuples.
The join operation defined for relational databases is often referred to as a natural join. In this type of join, two
relations are connected by their common attributes. SQL's approximation of a natural join is the INNER JOIN
operator.
The relational division operation is a slightly more complex operation, which involves essentially using the tuples
of one relation (the dividend) to partition a second relation (the divisor). The relational division operator is
effectively the opposite of the cartesian product operator (hence the name).
Other operators have been introduced or proposed since Codd's introduction of the original eight including relational
comparison operators and extensions that offer support for nesting and hierarchical data, among others.
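SQL counterparts of several of the operators named above, in one hedged sketch (relation names hypothetical; as noted, EXCEPT is spelled MINUS in some systems):

    SELECT name FROM staff UNION SELECT name FROM faculty;      -- union
    SELECT name FROM staff INTERSECT SELECT name FROM faculty;  -- intersection
    SELECT name FROM staff EXCEPT SELECT name FROM faculty;     -- difference
    SELECT * FROM staff CROSS JOIN faculty;                     -- cartesian product
    SELECT * FROM staff WHERE age > 40;                         -- selection (restriction)
    SELECT name, age FROM staff;                                -- projection
    SELECT *                                                    -- natural-join approximation
    FROM staff INNER JOIN department
      ON staff.dept_id = department.dept_id;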
Normalization
Normalization was first proposed by Codd as an integral part of the relational model. It encompasses a set of
procedures designed to eliminate nonsimple domains (non-atomic values) and the redundancy (duplication) of data,
which in turn prevents data manipulation anomalies and loss of data integrity. The most common forms of
normalization applied to databases are called the normal forms.
References
[1] Codd, E.F. (1970). "A Relational Model of Data for Large Shared Data Banks". Communications of the ACM 13 (6): 377–387. doi:10.1145/362384.362685.
[2] Gartner Says Worldwide Relational Database Market Increased 14 Percent in 2006 (http://www.gartner.com/it/page.jsp?id=507466), includes revenue estimates for leading database companies
Relational model
The relational model for database management is a database model based on first-order predicate logic, first
formulated and proposed in 1969 by Edgar F. Codd.[1][2] In the relational model of a database, all data is represented
in terms of tuples, grouped into relations. A database organized in terms of the relational model is a relational
database.
The purpose of the relational model is to provide a declarative method for specifying data and queries: users
directly state what information the database contains and what information they want from it, and let the database
management system software take care of describing data structures for storing the data and retrieval procedures
for answering queries.
Most implementations of the relational model use the SQL data definition and query language. A table in an SQL
database schema corresponds to a predicate variable; the contents of a table to a relation; key constraints, other
constraints, and SQL queries correspond to predicates. However, SQL databases, including DB2, deviate from the
relational model in many details; Codd fiercely argued against deviations that compromise the original principles.[4]
Overview
The relational model's central idea is to describe a database as a collection of predicates over a finite set of predicate
variables, describing constraints on the possible values and combinations of values. The content of the database at
any given time is a finite (logical) model of the database, i.e. a set of relations, one per predicate variable, such that
all predicates are satisfied. A request for information from the database (a database query) is also a predicate.
[Figure: In the relational model, related records are linked together with a "key".]
A recent development is the Object-Relation type-Object model, which is based on the assumption that any fact can
be expressed in the form of one or more binary relationships. The model is used in Object Role Modeling (ORM),
RDF/Notation 3 (N3).
The relational model was the first database model to be described in formal mathematical terms. Hierarchical and
network databases existed before relational databases, but their specifications were relatively informal. After the
relational model was defined, there were many attempts to compare and contrast the different models, and this led to
the emergence of more rigorous descriptions of the earlier models; though the procedural nature of the data
manipulation interfaces for hierarchical and network databases limited the scope for formalization.
Implementation
There have been several attempts to produce a true implementation of the relational database model as originally
defined by Codd and explained by Date, Darwen and others, but none have been popular successes so far. Rel is one
of the more recent attempts to do this.
History
The relational model was invented by E.F. (Ted) Codd as a general model of data, and subsequently maintained and
developed by Chris Date and Hugh Darwen among others. In The Third Manifesto (first published in 1995) Date and
Darwen show how the relational model can accommodate certain desired object-oriented features.
Controversies
Codd himself, some years after publication of his 1970 model, proposed a three-valued logic (True, False, Missing
or NULL) version of it to deal with missing information, and in his The Relational Model for Database Management
Version 2 (1990) he went a step further with a four-valued logic (True, False, Missing but Applicable, Missing but
Inapplicable) version. But these have never been implemented, presumably because of the attendant complexity. SQL's
NULL construct was intended to be part of a three-valued logic system, but fell short of that due to logical errors in
the standard and in its implementations.
A consequence of this distinguishing feature is that in the relational model the Cartesian product becomes
commutative.
A table is an accepted visual representation of a relation; a tuple is similar to the concept of row, but note that in the
database language SQL the columns and the rows of a table are ordered.
A relvar is a named variable of some specific relation type, to which at all times some relation of that type is
assigned, though the relation may contain zero tuples.
The basic principle of the relational model is the Information Principle: all information is represented by data values
in relations. In accordance with this Principle, a relational database is a set of relvars and the result of every query is
presented as a relation.
The consistency of a relational database is enforced, not by rules built into the applications that use it, but rather by
constraints, declared as part of the logical schema and enforced by the DBMS for all applications. In general,
constraints are expressed using relational comparison operators, of which just one, "is subset of" (⊆), is theoretically
sufficient. In practice, several useful shorthands are expected to be available, of which the most important are
candidate key (really, superkey) and foreign key constraints.
Interpretation
To fully appreciate the relational model of data it is essential to understand the intended interpretation of a relation.
The body of a relation is sometimes called its extension. This is because it is to be interpreted as a representation of
the extension of some predicate, this being the set of true propositions that can be formed by replacing each free
variable in that predicate by a name (a term that designates something).
There is a one-to-one correspondence between the free variables of the predicate and the attribute names of the
relation heading. Each tuple of the relation body provides attribute values to instantiate the predicate by substituting
each of its free variables. The result is a proposition that is deemed, on account of the appearance of the tuple in the
relation body, to be true. Contrariwise, every tuple whose heading conforms to that of the relation but which does not
appear in the body is deemed to be false. This assumption is known as the closed world assumption: it is often
violated in practical databases, where the absence of a tuple might mean that the truth of the corresponding
proposition is unknown. For example, the absence of the tuple ('John', 'Spanish') from a table of language skills
cannot necessarily be taken as evidence that John does not speak Spanish.
For a formal exposition of these ideas, see the section Set-theoretic Formulation, below.
Application to databases
A data type as used in a typical relational database might be the set of integers, the set of character strings, the set of
dates, or the two boolean values true and false, and so on. The corresponding type names for these types might be
the strings "int", "char", "date", "boolean", etc. It is important to understand, though, that relational theory does not
dictate what types are to be supported; indeed, nowadays provisions are expected to be available for user-defined
types in addition to the built-in ones provided by the system.
Attribute is the term used in the theory for what is commonly referred to as a column. Similarly, table is commonly
used in place of the theoretical term relation (though in SQL the term is by no means synonymous with relation). A
table data structure is specified as a list of column definitions, each of which specifies a unique column name and the
type of the values that are permitted for that column. An attribute value is the entry in a specific column and row,
such as "John Doe" or "35".
A tuple is basically the same thing as a row, except in an SQL DBMS, where the column values in a row are
ordered. (Tuples are not ordered; instead, each attribute value is identified solely by the attribute name and never
by its ordinal position within the tuple.) An attribute name might be "name" or "age".
A relation is a table structure definition (a set of column definitions) along with the data appearing in that structure.
The structure definition is the heading and the data appearing in it is the body, a set of rows. A database relvar
(relation variable) is commonly known as a base table. The heading of its assigned value at any time is as specified
in the table declaration and its body is that most recently assigned to it by invoking some update operator
(typically, INSERT, UPDATE, or DELETE). The heading and body of the table resulting from evaluation of some
query are determined by the definitions of the operators used in the expression of that query. (Note that in SQL the
heading is not always a set of column definitions as described above, because it is possible for a column to have no
name and also for two or more columns to have the same name. Also, the body is not always a set of rows because in
SQL it is possible for the same row to appear more than once in the same body.)
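For illustration, a base table with its list of column definitions might be declared in SQL roughly as follows (a
sketch only; the table and column names here are invented for the example):

    CREATE TABLE Person (
        name CHAR(30),   -- column "name", permitting character strings
        age  INTEGER     -- column "age", permitting integers
    );

Each column definition pairs a unique column name with the type of the values permitted in that column.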
The comparison of NULL with itself does not yield true but instead yields the third truth value, unknown; similarly,
the comparison of NULL with something other than itself does not yield false but instead yields unknown. It is
because of this behaviour in comparisons that NULL is described as a mark rather than a value. The relational
model depends on the law of excluded middle under which anything that is not true is false and anything that
is not false is true; it also requires every tuple in a relation body to have a value for every attribute of that
relation. This particular deviation is disputed by some if only because E.F. Codd himself eventually advocated
the use of special marks and a 4-valued logic, but this was based on his observation that there are two distinct
reasons why one might want to use a special mark in place of a value, which led opponents of the use of such
logics to discover more distinct reasons and at least as many as 19 have been noted, which would require a
21-valued logic. SQL itself uses NULL for several purposes other than to represent "value unknown". For
example, the sum of the empty set is NULL, meaning zero, the average of the empty set is NULL, meaning
undefined, and NULL appearing in the result of a LEFT JOIN can mean "no value because there is no
matching row in the right-hand operand".
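These behaviours can be observed directly in SQL (a sketch, assuming a hypothetical table T with a single numeric
column x):

    SELECT * FROM T WHERE x = NULL;    -- returns no rows: the comparison yields unknown, not true
    SELECT * FROM T WHERE x IS NULL;   -- the IS NULL predicate must be used to test for the mark
    SELECT SUM(x) FROM T WHERE 1 = 0;  -- the sum of the empty set: returns NULL
    SELECT AVG(x) FROM T WHERE 1 = 0;  -- the average of the empty set: returns NULL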
Relational operations
Users (or programs) request data from a relational database by sending it a query that is written in a special language,
usually a dialect of SQL. Although SQL was originally intended for end-users, it is much more common for SQL
queries to be embedded into software that provides an easier user interface. Many web sites, such as Wikipedia,
perform SQL queries when generating pages.
In response to a query, the database returns a result set, which is just a list of rows containing the answers. The
simplest query is just to return all the rows from a table, but more often, the rows are filtered in some way to return
just the answer wanted.
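In SQL, for a hypothetical table named Booking, those two cases might be written as follows (a sketch; the names
are illustrative):

    SELECT * FROM Booking;                                 -- return all rows
    SELECT * FROM Booking WHERE Customer_ID = 1234567890;  -- return only the rows wanted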
Often, data from multiple tables are combined into one, by doing a join. Conceptually, this is done by taking all
possible combinations of rows (the Cartesian product), and then filtering out everything except the answer. In
practice, relational database management systems rewrite ("optimize") queries to perform faster, using a variety of
techniques.
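A join could be expressed in SQL as follows (again a sketch with illustrative names):

    SELECT Customer.Name, Invoice.Status
    FROM Customer
    JOIN Invoice ON Invoice.Customer_ID = Customer.Customer_ID;

Conceptually this is the Cartesian product of the two tables filtered by the ON condition, although a real DBMS will
normally choose a much more efficient evaluation strategy.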
There are a number of relational operations in addition to join. These include project (the process of eliminating
some of the columns), restrict (the process of eliminating some of the rows), union (a way of combining two tables
with similar structures), difference (which lists the rows in one table that are not found in the other), intersect (which
lists the rows found in both tables), and product (mentioned above, which combines each row of one table with each
row of the other). Depending on which other sources you consult, there are a number of other operators, many of
which can be defined in terms of those listed above. These include semi-join, outer operators such as outer join and
outer union, and various forms of division. Then there are operators to rename columns, and summarizing or
aggregating operators, and if you permit relation values as attributes (RVA, relation-valued attribute), then
operators such as group and ungroup. The SELECT statement in SQL serves to handle all of these except for the
group and ungroup operators.
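In SQL, these operations correspond roughly to the following constructs (a sketch over two hypothetical tables A and
B that both have columns x and y):

    SELECT x FROM A;                                  -- project (SQL keeps duplicate rows unless DISTINCT is added)
    SELECT x, y FROM A WHERE x > 0;                   -- restrict
    SELECT x, y FROM A UNION SELECT x, y FROM B;      -- union
    SELECT x, y FROM A EXCEPT SELECT x, y FROM B;     -- difference
    SELECT x, y FROM A INTERSECT SELECT x, y FROM B;  -- intersect
    SELECT * FROM A CROSS JOIN B;                     -- product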
The flexibility of relational databases allows programmers to write queries that were not anticipated by the database
designers. As a result, relational databases can be used by multiple applications in ways the original designers did
not foresee, which is especially important for databases that might be used for a long time (perhaps several decades).
This has made the idea and implementation of relational databases very popular with businesses.
Database normalization
Relations are classified based upon the types of anomalies to which they are vulnerable. A database that is in the first
normal form is vulnerable to all types of anomalies, while a database that is in the domain/key normal form has no
modification anomalies. Normal forms are hierarchical in nature. That is, the lowest level is the first normal form,
and the database cannot meet the requirements for higher level normal forms without first having met all the
requirements of the lesser normal forms.[6]
Examples
Database
An idealized, very simple example of a description of some relvars (relation variables) and their attributes:
Customer (Customer ID, Tax ID, Name, Address, City, State, Zip, Phone, Email)
Order (Order No, Customer ID, Invoice No, Date Placed, Date Promised, Terms, Status)
Order Line (Order No, Order Line No, Product Code, Qty)
Invoice (Invoice No, Customer ID, Order No, Date, Status)
Invoice Line (Invoice No, Invoice Line No, Product Code, Qty Shipped)
Customer relation

Customer ID   Tax ID        Name         Address        [More fields....]
1234567890    555-5512222   Munmun       323 Broadway   ...
2223344556    555-5523232   Wile E.                     ...
3334445563    555-5533323   Ekta                        ...
4232342432    555-5325523   E. F. Codd   123 It Way     ...
If we attempted to insert a new customer with the ID 1234567890, this would violate the design of the relvar since
Customer ID is a primary key and we already have a customer 1234567890. The DBMS must reject a transaction
such as this that would render the database inconsistent by a violation of an integrity constraint.
Foreign keys are integrity constraints enforcing that the value of the attribute set is drawn from a candidate key in
another relation. For example, in the Order relation the attribute Customer ID is a foreign key. A join is the
operation that draws on information from several relations at once. By joining relvars from the example above we
could query the database for all of the Customers, Orders, and Invoices. If we only wanted the tuples for a specific
customer, we would specify this using a restriction condition.
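In SQL, such a foreign key might be declared as follows (a sketch; the relvar is named Orders here because ORDER
is a reserved word in SQL, and the types are illustrative):

    CREATE TABLE Customer (
        Customer_ID INTEGER PRIMARY KEY,
        Name        VARCHAR(100)
    );

    CREATE TABLE Orders (
        Order_No    INTEGER PRIMARY KEY,
        Customer_ID INTEGER NOT NULL REFERENCES Customer (Customer_ID)
    );

The REFERENCES clause makes the DBMS reject any Orders row whose Customer ID does not occur in Customer.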
If we wanted to retrieve all of the Orders for Customer 1234567890, we could query the database to return every row
in the Order table with Customer ID 1234567890 and join the Order table to the Order Line table based on Order No.
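Expressed in SQL, that query might read (a sketch using the illustrative names Orders and Order_Line from above):

    SELECT Orders.*, Order_Line.*
    FROM Orders
    JOIN Order_Line ON Order_Line.Order_No = Orders.Order_No
    WHERE Orders.Customer_ID = 1234567890;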
There is a flaw in our database design above. The Invoice relvar contains an Order No attribute. So, each tuple in the
Invoice relvar will have one Order No, which implies that there is precisely one Order for each Invoice. But in
reality an invoice can be created against many orders, or indeed for no particular order. Additionally the Order relvar
contains an Invoice No attribute, implying that each Order has a corresponding Invoice. But again this is not always
true in the real world. An order is sometimes paid through several invoices, and sometimes paid without an invoice.
In other words there can be many Invoices per Order and many Orders per Invoice. This is a many-to-many
relationship between Order and Invoice (also called a non-specific relationship). To represent this relationship in the
database a new relvar should be introduced whose role is to specify the correspondence between Orders and
Invoices:
OrderInvoice (Order No, Invoice No)
Now, the Order relvar has a one-to-many relationship to the OrderInvoice table, as does the Invoice relvar. If we
want to retrieve every Invoice for a particular Order, we can query for all orders where Order No in the Order
relation equals the Order No in OrderInvoice, and where Invoice No in OrderInvoice equals the Invoice No in
Invoice.
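In SQL this navigation might look like the following (a sketch; the order number 12345 is purely illustrative):

    SELECT Invoice.*
    FROM Orders
    JOIN OrderInvoice ON OrderInvoice.Order_No = Orders.Order_No
    JOIN Invoice ON Invoice.Invoice_No = OrderInvoice.Invoice_No
    WHERE Orders.Order_No = 12345;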
Set-theoretic formulation
Basic notions in the relational model are relation names and attribute names. We will represent these as strings such
as "Person" and "name" and we will usually use the variables r, s, t, … and a, b, c, … to range over them. Another
basic notion is the set of atomic values that contains values such as numbers and strings.
Our first definition concerns the notion of tuple, which formalizes the notion of row or record in a table:
Tuple
A tuple is a partial function from attribute names to atomic values.
Header
A header is a finite set of attribute names.
Projection
The projection of a tuple t on a finite set of attributes A is t[A] = { (a, v) : (a, v) ∈ t, a ∈ A }.
The next definition defines relation, which formalizes the contents of a table as it is defined in the relational model.
Relation
A relation is a tuple (H, B) with H, the header, and B, the body, a set of tuples that all have the header H.
Such a relation closely corresponds to what is usually called the extension of a predicate in first-order logic except
that here we identify the places in the predicate with attribute names. Usually in the relational model a database
schema is said to consist of a set of relation names, the headers that are associated with these names and the
constraints that should hold for every instance of the database schema.
Relation universe
A relation universe U over a header H is a non-empty set of relations with header H.
Relation schema
A relation schema (H, C) consists of a header H and a predicate C(R) that is defined for all relations R
with header H. A relation satisfies a relation schema (H, C) if it has header H and satisfies C.
Key constraints and functional dependencies
Superkey
A superkey is written as a finite set of attribute names. A superkey K holds in a relation (H, B) if:
K ⊆ H and there exist no two distinct tuples t₁, t₂ ∈ B such that t₁[K] = t₂[K].
A superkey holds in a relation universe U if it holds in all relations in U.
Theorem: A superkey K holds in a relation universe U over H if and only if K ⊆ H and K → H holds in U.
Candidate key
A superkey K holds as a candidate key for a relation universe U if it holds as a superkey for U and there is
no proper subset of K that also holds as a superkey for U.
Functional dependency
A functional dependency (FD for short) is written as X → Y for X, Y finite sets of attribute names.
A functional dependency X → Y holds in a relation (H, B) if:
X, Y ⊆ H and for all tuples t₁, t₂ ∈ B: if t₁[X] = t₂[X] then t₁[Y] = t₂[Y].
A functional dependency X → Y holds in a relation universe U if it holds in all relations in U.
A functional dependency is trivial under a header H if it holds in all relation universes over H.
Theorem: An FD X → Y is trivial under a header H if and only if Y ⊆ X ⊆ H.
Closure
Armstrong's axioms: The closure of a set of FDs S under a header H, written as S⁺, is the smallest
superset of S such that:
if Y ⊆ X ⊆ H then (X → Y) ∈ S⁺ (reflexivity),
if (X → Y) ∈ S⁺ and (Y → Z) ∈ S⁺ then (X → Z) ∈ S⁺ (transitivity), and
if (X → Y) ∈ S⁺ and Z ⊆ H then (X ∪ Z → Y ∪ Z) ∈ S⁺ (augmentation).
Theorem: Armstrong's axioms are sound and complete; given a header H and a set S of FDs that
contain only subsets of H, (X → Y) ∈ S⁺ if and only if X → Y holds in all relation universes over H in
which all FDs in S hold.
Completion
The completion of a finite set of attributes X under a finite set of FDs S, written as X⁺, is the smallest
superset of X such that: if (Y → Z) ∈ S and Y ⊆ X⁺ then Z ⊆ X⁺.
The completion of an attribute set can be used to compute if a certain dependency is in the closure of a set of
FDs.
Theorem: Given a set S of FDs, (X → Y) ∈ S⁺ if and only if Y ⊆ X⁺.
Irreducible cover
An irreducible cover of a set S of FDs is a set T of FDs such that: S⁺ = T⁺; there exists no U ⊂ T
such that S⁺ = U⁺; the right-hand side Y of every FD (X → Y) ∈ T is a singleton set; and the left-hand side of
every FD in T is irreducible, i.e., no attribute can be removed from it without changing the closure.
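For example, let H = {a, b, c} and S = {{a} → {b}, {b} → {c}}. The completion of X = {a} is {a}⁺ = {a, b, c}:
b is added because ({a} → {b}) ∈ S and {a} ⊆ {a}⁺, and c is then added because ({b} → {c}) ∈ S and {b} ⊆ {a}⁺.
Since {c} ⊆ {a}⁺, the last theorem above shows that ({a} → {c}) ∈ S⁺, which is the same conclusion that
transitivity gives directly from Armstrong's axioms.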
References
[1] "Derivability, Redundancy, and Consistency of Relations Stored in Large Data Banks", E.F. Codd, IBM Research Report, 1969
[2] "A Relational Model of Data for Large Shared Data Banks", in Communications of the ACM, 1970.
[3] Data Integration Glossary (http:/ / knowledge. fhwa. dot. gov/ tam/ aashto. nsf/ All+ Documents/ 4825476B2B5C687285256B1F00544258/
$FILE/ DIGloss. pdf), U.S. Department of Transportation, August 2001.
[4] E. F. Codd, The Relational Model for Database Management, Addison-Wesley Publishing Company, 1990, ISBN 0-201-14192-2
[5] Codd, E.F. (1970). "A Relational Model of Data for Large Shared Data Banks" (http:/ / www. acm. org/ classics/ nov95/ toc. html).
Communications of the ACM 13 (6): 377387. doi:10.1145/362384.362685. .
[6] David M. Kroenke, Database Processing: Fundamentals, Design, and Implementation (1997), Prentice-Hall, Inc., pages 130144
Further reading
Date, C. J., Darwen, H. (2000). Foundation for Future Database Systems: The Third Manifesto, 2nd edition,
Addison-Wesley Professional. ISBN 0-201-70928-7.
Date, C. J. (2003). Introduction to Database Systems. 8th edition, Addison-Wesley. ISBN 0-321-19784-4.
External links
Feasibility of a set-theoretic data structure : a general structure based on a reconstituted definition of relation
(Childs' 1968 research cited by Codd's 1970 paper) (http://hdl.handle.net/2027.42/4164)
The Third Manifesto (TTM) (http://www.thethirdmanifesto.com/)
Relational Databases (http://www.dmoz.org/Computers/Software/Databases/Relational/) at the Open
Directory Project
Relational Model (http://c2.com/cgi/wiki?RelationalModel)
Binary relations and tuples compared with respect to the semantic web (http://blogs.sun.com/bblfish/entry/why_binary_relations_beat_tuples)
Binary relation
In mathematics, a binary relation on a set A is a collection of ordered pairs of elements of A. In other words, it is a
subset of the Cartesian product A² = A × A. More generally, a binary relation between two sets A and B is a subset
of A × B. The terms dyadic relation and 2-place relation are synonyms for binary relations.
An example is the "divides" relation between the set of prime numbers P and the set of integers Z, in which every
prime p is associated with every integer z that is a multiple of p (and not with any integer that is not a multiple of p).
In this relation, for instance, the prime 2 is associated with numbers that include −4, 0, 6, 10, but not 1 or 9; and the
prime 3 is associated with numbers that include 0, 6, and 9, but not 4 or 13.
Binary relations are used in many branches of mathematics to model concepts like "is greater than", "is equal to",
and "divides" in arithmetic, "is congruent to" in geometry, "is adjacent to" in graph theory, "is orthogonal to" in
linear algebra and many more. The concept of function is defined as a special kind of binary relation. Binary
relations are also heavily used in computer science.
A binary relation is the special case n = 2 of an n-ary relation R ⊆ A₁ × … × Aₙ, that is, a set of n-tuples where the
jth component of each n-tuple is taken from the jth domain Aⱼ of the relation.
In some systems of axiomatic set theory, relations are extended to classes, which are generalizations of sets. This
extension is needed for, among other things, modeling the concepts of "is an element of" or "is a subset of" in set
theory, without running into logical inconsistencies such as Russell's paradox.
Formal definition
A binary relation R is usually defined as an ordered triple (X, Y, G) where X and Y are arbitrary sets (or classes), and
G is a subset of the Cartesian product X × Y. The sets X and Y are called the domain (or the set of departure) and
codomain (or the set of destination), respectively, of the relation, and G is called its graph.
The statement (x, y) ∈ R is read "x is R-related to y", and is denoted by xRy or R(x, y). The latter notation corresponds
to viewing R as the characteristic function on X × Y for the set of pairs of G.
The order of the elements in each pair of G is important: if a ≠ b, then aRb and bRa can be true or false,
independently of each other.
A relation as defined by the triple (X, Y, G) is sometimes referred to as a correspondence instead.[1] In this case the
relation from X to Y is the subset G of X × Y, and "from X to Y" must always be either specified or implied by the
context when referring to the relation. In practice correspondence and relation tend to be used interchangeably.
A relation is therefore not determined by its graph alone. For example, if the graph G is held fixed, then (X, Y, G),
(X′, Y, G), and (X, Y′, G) are three distinct relations whenever X ≠ X′ and Y ≠ Y′.
Some mathematicians, especially in set theory, do not consider the sets X and Y to be part of the relation, and
therefore define a binary relation as being a subset of X × Y, that is, just the graph G.
A special case of this difference in points of view applies to the notion of function. Many authors insist on
distinguishing between a function's codomain and its range. Thus, a single "rule," like mapping every real number x
to x², can lead to distinct functions f: ℝ → ℝ and f: ℝ → ℝ⁺, depending on whether the images under that
rule are understood to be reals or, more restrictively, non-negative reals. But others view functions as simply sets of
ordered pairs with unique first components. This difference in perspectives does raise some nontrivial issues. As an
example, the former camp considers surjectivity (or being onto) as a property of functions, while the latter sees it
as a relationship that functions may bear to sets.
Either approach is adequate for most uses, provided that one attends to the necessary changes in language, notation,
and the definitions of concepts like restrictions, composition, inverse relation, and so on. The choice between the two
definitions usually matters only in very formal contexts, like category theory.
Example
Example: Suppose there are four objects {ball, car, doll, gun} and four persons {John, Mary, Ian, Venus}. Suppose
that John owns the ball, Mary owns the doll, and Venus owns the car. Nobody owns the gun and Ian owns nothing.
Then the binary relation "is owned by" is given as
R=({ball, car, doll, gun}, {John, Mary, Ian, Venus}, {(ball, John), (doll, Mary), (car, Venus)}).
Thus the first element of R is the set of objects, the second is the set of people, and the last element is a set of
ordered pairs of the form (object, owner).
The pair (ball, John), denoted by ball R John, means that the ball is owned by John.
Two different relations could have the same graph. For example: the relation
({ball, car, doll, gun}, {John, Mary, Venus}, {(ball,John), (doll, Mary), (car, Venus)})
is different from the previous one as everyone is an owner. But the graphs of the two relations are the same.
Nevertheless, R is usually identified or even defined as G(R) and "an ordered pair (x, y) ∈ G(R)" is usually denoted
as "(x, y) ∈ R".
left-total[2]: for all x in X there exists a y in Y such that xRy (this property, although sometimes also referred to as
total, is different from the definition of total in the next section).
surjective (also called right-total[2]): for all y in Y there exists an x in X such that xRy.
Uniqueness and totality properties:
A function: a relation that is functional and left-total.
A bijection: a one-to-one correspondence; such a relation is a function and is said to be bijective.
Complement
If R is a binary relation over X and Y, then the following is a binary relation over X and Y as well:
The complement S is defined as x S y if not x R y.
The complement of the inverse is the inverse of the complement.
If X = Y the complement has the following properties:
If a relation is symmetric, the complement is too.
The complement of a reflexive relation is irreflexive and vice versa.
The complement of a strict weak order is a total preorder and vice versa.
The complement of the inverse has these same properties.
Restriction
The restriction of a binary relation on a set X to a subset S is the set of all pairs (x, y) in the relation for which x and y
are in S.
If a relation is reflexive, irreflexive, symmetric, antisymmetric, asymmetric, transitive, total, trichotomous, a partial
order, total order, strict weak order, total preorder (weak order), or an equivalence relation, its restrictions are too.
However, the transitive closure of a restriction is a subset of the restriction of the transitive closure, i.e., in general
not equal.
Also, the various concepts of completeness (not to be confused with being "total") do not carry over to restrictions.
For example, on the set of real numbers a property of the relation "≤" is that every non-empty subset S of R with an
upper bound in R has a least upper bound (also called supremum) in R. However, for a set of rational numbers this
supremum is not necessarily rational, so the same property does not hold on the restriction of the relation "≤" to the
set of rational numbers.
The left-restriction (right-restriction, respectively) of a binary relation between X and Y to a subset S of its domain
(codomain) is the set of all pairs (x, y) in the relation for which x (y) is an element of S.
The number of binary relations of various types on an n-element set:

n      all      transitive   reflexive   preorder   partial order   total preorder   total order   equivalence relation
0      1        1            1           1          1               1                1             1
1      2        2            1           1          1               1                1             1
2      16       13           4           4          3               3                2             2
3      512      171          64          29         19              13               6             5
4      65536    3994         4096        355        219             75               24            15
OEIS   A002416  A006905      A053763     A000798    A001035         A000670          A000142       A000110
Notes:
The binary relations can be grouped into pairs (relation, complement), except that for n = 0 the relation is its own
complement. The non-symmetric ones can be grouped into quadruples (relation, complement, inverse, inverse
complement).
Examples of common binary relations:
order relations, including strict orders:
greater than
greater than or equal to
less than
less than or equal to
divides (evenly)
is a subset of
equivalence relations:
equality
is parallel to (for affine spaces)
is in bijection with
isomorphy
dependency relation, a finite, symmetric, reflexive relation.
independency relation, a symmetric, irreflexive relation which is the complement of some dependency relation.
Binary relations by property:

                       reflexive   symmetric   transitive   symbol           example
directed graph                                              →
undirected graph       No          Yes
tournament             No          No                                        pecking order
dependency             Yes         Yes
weak order                                     Yes          ≤
preorder               Yes                     Yes          ≤                preference
partial order          Yes         No          Yes          ≤                subset
partial equivalence    No          Yes         Yes
equivalence relation   Yes         Yes         Yes          ∼, ≅, ≈, ≡      equality
strict partial order   No          No          Yes          <                proper subset
Notes
[1] Encyclopedic Dictionary of Mathematics (http://books.google.co.uk/books?id=azS2ktxrz3EC&pg=PA1331&hl=en&sa=X&ei=glo6T_PmC9Ow8QPvwYmFCw&ved=0CGIQ6AEwBg#v=onepage&f=false). MIT Press. 2000. pp. 1330–1331. ISBN 0-262-59020-4.
[2] Kilp, Knauer and Mikhalev: p. 3.
[3] Yao, Y.Y.; Wong, S.K.M. (1995). "Generalization of rough sets using relationships between attribute values" (http://www2.cs.uregina.ca/~yyao/PAPERS/relation.pdf). Proceedings of the 2nd Annual Joint Conference on Information Sciences: 30–33.
[4] Joseph G. Rosenstein, Linear Orderings, Academic Press, 1982, ISBN 012597680, p. 4.
[5] Tarski, Alfred; Givant, Steven (1987). A Formalization of Set Theory without Variables. American Mathematical Society. p. 3. ISBN 0-8218-1041-3.
References
M. Kilp, U. Knauer, A.V. Mikhalev, Monoids, Acts and Categories: with Applications to Wreath Products and
Graphs, De Gruyter Expositions in Mathematics vol. 29, Walter de Gruyter, 2000, ISBN 3-11-015248-7.
Gunther Schmidt, 2010. Relational Mathematics. Cambridge University Press, ISBN 978-0-521-76268-7.
External links
Hazewinkel, Michiel, ed. (2001), "Binary relation" (http://www.encyclopediaofmath.org/index.php?title=p/b016380), Encyclopedia of Mathematics, Springer, ISBN 978-1-55608-010-4
Database normalization
Database normalization is the process of organizing the fields and tables of a relational database to minimize
redundancy and dependency. Normalization usually involves dividing large tables into smaller (and less redundant)
tables and defining relationships between them. The objective is to isolate data so that additions, deletions, and
modifications of a field can be made in just one table and then propagated through the rest of the database via the
defined relationships.
Edgar F. Codd, the inventor of the relational model, introduced the concept of normalization and what we now know
as the First Normal Form (1NF) in 1970.[1] Codd went on to define the Second Normal Form (2NF) and Third
Normal Form (3NF) in 1971,[2] and Codd and Raymond F. Boyce defined the Boyce-Codd Normal Form (BCNF) in
1974.[3] Informally, a relational database table is often described as "normalized" if it is in the Third Normal Form.[4]
Most 3NF tables are free of insertion, update, and deletion anomalies.
A standard piece of database design guidance is that the designer should create a fully normalized design; selective
denormalization can subsequently be performed for performance reasons.[5]
Objectives of normalization
A basic objective of the first normal form defined by Codd in 1970 was to permit data to be queried and manipulated
using a "universal data sub-language" grounded in first-order logic.[6] (SQL is an example of such a data
sub-language, albeit one that Codd regarded as seriously flawed.)[7]
The objectives of normalization beyond 1NF (First Normal Form) were stated as follows by Codd:
1. To free the collection of relations from undesirable insertion, update and deletion dependencies;
2. To reduce the need for restructuring the collection of relations, as new types of data are introduced,
and thus increase the life span of application programs;
3. To make the relational model more informative to users;
Database normalization
56
4. To make the collection of relations neutral to the query statistics, where these statistics are liable to
change as time goes by.
E.F. Codd, "Further Normalization of the Data Base Relational Model"[8]
The sections below give details of each of these objectives.
For example, in a table that records faculty members together with the courses they teach, removing the
last of the records on which a given faculty member appears effectively also deletes the faculty member. This
phenomenon is known as a deletion anomaly.
Example
Querying and manipulating the data within an unnormalized data structure, such as the following non-1NF
representation of customers' credit card transactions, involves more complexity than is really necessary:
Customer   Transactions
           Tr. ID   Date          Amount
Jones      12890    14-Oct-2003   −87
           12904    15-Oct-2003   −50
Wilkins    12898    14-Oct-2003   −21
Stevens    12907    15-Oct-2003   −18
           14920    20-Nov-2003   −70
           15003    27-Nov-2003   −60
To each customer corresponds a repeating group of transactions. The automated evaluation of any query relating to
customers' transactions therefore would broadly involve two stages:
1. Unpacking one or more customers' groups of transactions allowing the individual transactions in a group to be
examined, and
2. Deriving a query result based on the results of the first stage
For example, in order to find out the monetary sum of all transactions that occurred in October 2003 for all
customers, the system would have to know that it must first unpack the Transactions group of each customer, then
sum the Amounts of all transactions thus obtained where the Date of the transaction falls in October 2003.
One of Codd's important insights was that this structural complexity could always be removed completely, leading to
much greater power and flexibility in the way queries could be formulated (by users and applications) and evaluated
(by the DBMS). The normalized equivalent of the structure above would look like this:
Customer   Tr. ID   Date          Amount
Jones      12890    14-Oct-2003   −87
Jones      12904    15-Oct-2003   −50
Wilkins    12898    14-Oct-2003   −21
Stevens    12907    15-Oct-2003   −18
Stevens    14920    20-Nov-2003   −70
Stevens    15003    27-Nov-2003   −60
Now each row represents an individual credit card transaction, and the DBMS can obtain the answer of interest,
simply by finding all rows with a Date falling in October, and summing their Amounts. The data structure places all
of the values on an equal footing, exposing each to the DBMS directly, so each can potentially participate directly in
queries; whereas in the previous situation some values were embedded in lower-level structures that had to be
handled specially. Accordingly, the normalized design lends itself to general-purpose query processing, whereas the
unnormalized design does not.
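With the normalized design, the query in question becomes a single direct SQL statement (a sketch, assuming the
table is named Transactions and the date column is named Tr_Date, since DATE is a poor choice for a column name):

    SELECT SUM(Amount)
    FROM Transactions
    WHERE Tr_Date >= DATE '2003-10-01'
      AND Tr_Date <  DATE '2003-11-01';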
Full functional dependency
An attribute is fully functionally dependent on a set of attributes X if it is:
functionally dependent on X, and
not functionally dependent on any proper subset of X. {Employee Address} has a functional dependency on
{Employee ID, Skill}, but not a full functional dependency, because it is also dependent on {Employee ID}.
Transitive dependency
A transitive dependency is an indirect functional dependency, one in which X → Z only by virtue of X → Y and
Y → Z.
Multivalued dependency
A multivalued dependency is a constraint according to which the presence of certain rows in a table implies
the presence of certain other rows.
Join dependency
A table T is subject to a join dependency if T can always be recreated by joining multiple tables each having a
subset of the attributes of T.
Superkey
A superkey is a combination of attributes that can be used to uniquely identify a database record. A table
might have many superkeys.
Candidate key
A candidate key is a special subset of superkeys that do not have any extraneous information in them: it is a
minimal superkey.
Examples: Imagine a table with the fields <Name>, <Age>, <SSN> and <Phone Extension>. This table has many
possible superkeys. Three of these are <SSN>, <Phone Extension, Name> and <SSN, Name>. Of those listed, only
<SSN> is a candidate key, as the others contain information not necessary to uniquely identify records ('SSN' here
refers to Social Security Number, which is unique to each person).
Non-prime attribute
A non-prime attribute is an attribute that does not occur in any candidate key. Employee Address would be a
non-prime attribute in the "Employees' Skills" table.
Prime attribute
A prime attribute, conversely, is an attribute that does occur in some candidate key.
Primary key
Most DBMSs require a table to be defined as having a single unique key, rather than a number of possible
unique keys. A primary key is a key which the database designer has designated for this purpose.
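In SQL terms, the designated primary key and a further superkey might be declared like this (a sketch for the table
described above; the types are illustrative):

    CREATE TABLE Person (
        Name            VARCHAR(60),
        Age             INTEGER,
        SSN             CHAR(11) PRIMARY KEY,   -- the candidate key designated as primary key
        Phone_Extension CHAR(4),
        UNIQUE (Phone_Extension, Name)          -- another superkey (not minimal, so not a candidate key)
    );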
Normal forms
The normal forms (abbrev. NF) of relational database theory provide criteria for determining a table's degree of
vulnerability to logical inconsistencies and anomalies. The higher the normal form applicable to a table, the less
vulnerable it is. Each table has a "highest normal form" (HNF): by definition, a table always meets the
requirements of its HNF and of all normal forms lower than its HNF; also by definition, a table fails to meet the
requirements of any normal form higher than its HNF.
The normal forms are applicable to individual tables; to say that an entire database is in normal form n is to say that
all of its tables are in normal form n.
Newcomers to database design sometimes suppose that normalization proceeds in an iterative fashion, i.e. a 1NF
design is first normalized to 2NF, then to 3NF, and so on. This is not an accurate description of how normalization
typically works. A sensibly designed table is likely to be in 3NF on the first attempt; furthermore, if it is 3NF, it is
overwhelmingly likely to have an HNF of 5NF. Achieving the "higher" normal forms (above 3NF) does not usually
require an extra expenditure of effort on the part of the designer, because 3NF tables usually need no modification to
meet the requirements of these higher normal forms.
The main normal forms are summarized below.
1NF (First normal form; E.F. Codd, 1970,[1] and a stricter definition, 2003[9]): Table faithfully represents a relation
and has no repeating groups.
2NF (Second normal form; E.F. Codd, 1971[2]): No non-prime attribute in the table is functionally dependent on a
proper subset of any candidate key.
3NF (Third normal form; E.F. Codd, 1971,[2] and C. Zaniolo, 1982[10]): Every non-trivial functional dependency in
the table is either the dependency of an elementary key attribute or a dependency on a superkey.
BCNF (Boyce–Codd normal form; Raymond F. Boyce and E.F. Codd, 1974[11]): Every non-trivial functional
dependency in the table is a dependency on a superkey.
4NF (Fourth normal form; Ronald Fagin, 1977): Every non-trivial multivalued dependency in the table is a
dependency on a superkey.
5NF (Fifth normal form; Ronald Fagin, 1979[13]): Every non-trivial join dependency in the table is implied by the
superkeys of the table.
DKNF (Domain/key normal form; Ronald Fagin, 1981[14]): Every constraint on the table is a logical consequence
of the table's domain constraints and key constraints.
6NF (Sixth normal form; 2002[15]): Table features no non-trivial join dependencies at all (with reference to
generalized join operator).
Denormalization
Databases intended for online transaction processing (OLTP) are typically more normalized than databases intended
for online analytical processing (OLAP). OLTP applications are characterized by a high volume of small
transactions such as updating a sales record at a supermarket checkout counter. The expectation is that each
transaction will leave the database in a consistent state. By contrast, databases intended for OLAP operations are
primarily "read mostly" databases. OLAP applications tend to extract historical data that has accumulated over a
long period of time. For such databases, redundant or "denormalized" data may facilitate business intelligence
applications. Specifically, dimensional tables in a star schema often contain denormalized data. The denormalized or
redundant data must be carefully controlled during extract, transform, load (ETL) processing, and users should not
be permitted to see the data until it is in a consistent state. The normalized alternative to the star schema is the
snowflake schema. In many cases, the need for denormalization has waned as computers and RDBMS software have
become more powerful, but since data volumes have generally increased along with hardware and software
performance, OLAP databases often still use denormalized schemas.
Denormalization is also used to improve performance on smaller computers as in computerized cash-registers and
mobile devices, since these may use the data for look-up only (e.g. price lookups). Denormalization may also be
used when no RDBMS exists for a platform (such as Palm), or no changes are to be made to the data and a swift
response is crucial.
Person   Favourite Colour
Bob      blue
Bob      red
Jane     green
Jane     yellow
Jane     red
Assume a person has several favourite colours, so that favourite colours form a set of colours modeled by the given
table. To transform a 1NF table into an NF² (non-first normal form) table, a "nest" operator is required which extends
the relational algebra of the higher normal forms. Applying the "nest" operator to the 1NF table yields the following
NF² table:

Person   Favourite Colours
Bob      {blue, red}
Jane     {green, yellow, red}

To transform this NF² table back into a 1NF table, an "unnest" operator is required which extends the relational
algebra of the higher normal forms. The unnest, in this case, would make "colours" into its own table.
Although "unnest" is the mathematical inverse to "nest", the operator "nest" is not always the mathematical inverse
of "unnest". Another constraint required is for the operators to be bijective, which is covered by the Partitioned
Normal Form (PNF).
The nest and unnest operators are studied in the paper "Non First Normal Form Relations" by G. Jaeschke and
H.-J. Schek (IBM Heidelberg Scientific Center).
Further reading
Litt's Tips: Normalization (http://www.troubleshooters.com/littstip/ltnorm.html)
Date, C. J. (1999), An Introduction to Database Systems (http://www.aw-bc.com/catalog/academic/product/
0,1144,0321197844,00.html) (8th ed.). Addison-Wesley Longman. ISBN 0-321-19784-4.
Kent, W. (1983) A Simple Guide to Five Normal Forms in Relational Database Theory (http://www.bkent.net/
Doc/simple5.htm), Communications of the ACM, vol. 26, pp. 120–125
Date, C.J., & Darwen, H., & Pascal, F. Database Debunkings (http://www.dbdebunk.com)
H.-J. Schek, P. Pistor, Data Structures for an Integrated Data Base Management and Information Retrieval System
External links
Database Normalization Basics (http://databases.about.com/od/specificproducts/a/normalization.htm) by
Mike Chapple (About.com)
Database Normalization Intro (http://www.databasejournal.com/sqletc/article.php/1428511), Part 2 (http://
www.databasejournal.com/sqletc/article.php/26861_1474411_1)
An Introduction to Database Normalization (http://mikehillyer.com/articles/
an-introduction-to-database-normalization/) by Mike Hillyer.
A tutorial on the first 3 normal forms (http://phlonx.com/resources/nf3/) by Fred Coulson
DB Normalization Examples (http://www.dbnormalization.com/)
Description of the database normalization basics (http://support.microsoft.com/kb/283878) by Microsoft
Database Normalization and Design Techniques (http://www.barrywise.com/2008/01/
database-normalization-and-design-techniques/) by Barry Wise, recommended reading for the Harvard MIS.
A Simple Guide to Five Normal Forms in Relational Database Theory (http://www.bkent.net/Doc/simple5.
htm)
Examples
The following scenario illustrates how a database design might violate first normal form.
Customer

Customer ID   First Name   Surname     Telephone Number
123           Robert       Ingram      555-861-2025
456           Jane         Wright      555-403-1659
789           Maria        Fernandez   555-808-9633
The designer then becomes aware of a requirement to record multiple telephone numbers for some customers. He
reasons that the simplest way of doing this is to allow the "Telephone Number" field in any given record to contain
more than one value:
Customer

Customer ID   First Name   Surname     Telephone Number
123           Robert       Ingram      555-861-2025
456           Jane         Wright      555-403-1659
                                       555-776-4100
789           Maria        Fernandez   555-808-9633
Assuming, however, that the Telephone Number column is defined on some telephone number-like domain, such as
the domain of 12-character strings, the representation above is not in first normal form. It is in violation of first
normal form as a single field has been allowed to contain multiple values. A typical relational database management
system will not allow fields in a table to contain multiple values in this way.
Customer Name

Customer ID   First Name   Surname
123           Robert       Ingram
456           Jane         Wright
789           Maria        Fernandez

Customer Telephone Number

Customer ID   Telephone Number
123           555-861-2025
456           555-403-1659
456           555-776-4100
789           555-808-9633
Repeating groups of telephone numbers do not occur in this design. Instead, each Customer-to-Telephone Number
link appears on its own record. With Customer ID as the key field, a "parent-child" or one-to-many relationship exists
between the two tables. A record in the "parent" table, Customer Name, can have many telephone number records in
the "child" table, Customer Telephone Number, but each telephone number belongs to one, and only one customer. It
is worth noting that this design meets the additional requirements for second and third normal form.
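A sketch of the two tables in SQL (names and types are illustrative):

    CREATE TABLE Customer_Name (
        Customer_ID INTEGER PRIMARY KEY,
        First_Name  VARCHAR(40),
        Surname     VARCHAR(40)
    );

    CREATE TABLE Customer_Telephone_Number (
        Customer_ID      INTEGER REFERENCES Customer_Name (Customer_ID),
        Telephone_Number VARCHAR(12),
        PRIMARY KEY (Customer_ID, Telephone_Number)
    );

The foreign key gives the one-to-many "parent-child" relationship, while the composite primary key allows many
numbers per customer but ties each number row to exactly one customer.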
Atomicity
Edgar F. Codd's definition of 1NF makes reference to the concept of 'atomicity'. Codd states that the "values in the
domains on which each relation is defined are required to be atomic with respect to the DBMS."[3] Codd defines an
atomic value as one that "cannot be decomposed into smaller pieces by the DBMS (excluding certain special
functions)."[4] Meaning a field should not be divided into parts with more than one kind of data in it such that what
one part means to the DBMS depends on another part of the same field.
Hugh Darwen and Chris Date have suggested that Codd's concept of an "atomic value" is ambiguous, and that this
ambiguity has led to widespread confusion about how 1NF should be understood.[5][6] In particular, the notion of an
atomic value leaves it unclear whether values such as character strings or dates should count as atomic. Date instead
suggests that a table is in first normal form if and only if it is "isomorphic to some relation", which he argues requires
each of the following conditions to hold:[7]
1. There is no top-to-bottom ordering to the rows.
2. There is no left-to-right ordering to the columns.
3. There are no duplicate rows.
4. Every row-and-column intersection contains exactly one value from the applicable domain (and nothing else).
5. All columns are regular, i.e., rows have no hidden components such as row IDs, object IDs, or hidden timestamps.
Violation of any of these conditions would mean that the table is not strictly relational, and therefore that it is not in
first normal form.
Examples of tables (or views) that would not meet this definition of first normal form are:
A table that lacks a unique key. Such a table would be able to accommodate duplicate rows, in violation of
condition 3.
A view whose definition mandates that results be returned in a particular order, so that the row-ordering is an
intrinsic and meaningful aspect of the view.[10] This violates condition 1. The tuples in true relations are not
ordered with respect to each other.
A table with at least one nullable attribute. A nullable attribute would be in violation of condition 4, which
requires every field to contain exactly one value from its column's domain. It should be noted, however, that this
aspect of condition 4 is controversial. It marks an important departure from Codd's later vision of the relational
model,[11] which made explicit provision for nulls.[12]
References
[1] Elmasri, Ramez and Navathe, Shamkant B. (July 2003). Fundamentals of Database Systems, Fourth Edition. Pearson. p. 315. ISBN 0321204484. "It states that the domain of an attribute must include only atomic (simple, indivisible) values and that the value of any attribute in a tuple must be a single value from the domain of that attribute."
[2] E. F. Codd (Oct 1972). "Further normalization of the database relational model". Data Base Systems. Courant Institute: Prentice-Hall. ISBN 013196741X. "A relation is in first normal form if it has the property that none of its domains has elements which are themselves sets."
[3] Codd, E. F. The Relational Model for Database Management Version 2 (Addison-Wesley, 1990).
[4] Codd, E. F. The Relational Model for Database Management Version 2 (Addison-Wesley, 1990), p. 6.
[5] Darwen, Hugh. "Relation-Valued Attributes; or, Will the Real First Normal Form Please Stand Up?", in C. J. Date and Hugh Darwen, Relational Database Writings 1989-1991 (Addison-Wesley, 1992).
[6] "[F]or many years," writes Date, "I was as confused as anyone else. What's worse, I did my best (worst?) to spread that confusion through my writings, seminars, and other presentations." Date, C. J. "What First Normal Form Really Means" (http://www.dbdebunk.com/page/page/629796.htm) in Date on Database: Writings 2000-2006 (Springer-Verlag, 2006), p. 108.
[7] Date, C. J. "What First Normal Form Really Means" (http://www.dbdebunk.com/page/page/629796.htm) p. 112.
[8] Date, C. J. "What First Normal Form Really Means" (http://www.dbdebunk.com/page/page/629796.htm) pp. 121–126.
[9] Date, C. J. "What First Normal Form Really Means" (http://www.dbdebunk.com/page/page/629796.htm) pp. 127–128.
[10] Such views cannot be created using SQL that conforms to the SQL:2003 standard.
[11] "Codd first defined the relational model in 1969 and didn't introduce nulls until 1979" Date, C. J. SQL and Relational Theory (O'Reilly, 2009), Appendix A.2.
[12] The third of Codd's 12 rules states that "Null values ... [must be] supported in a fully relational DBMS for representing missing information and inapplicable information in a systematic way, independent of data type." Codd, E. F. "Is Your DBMS Really Relational?" Computerworld, October 14, 1985.
Further reading
Litt's Tips: Normalization (http://www.troubleshooters.com/littstip/ltnorm.html)
Date, C. J., & Lorentzos, N., & Darwen, H. (2002). Temporal Data & the Relational Model (http://www.
elsevier.com/wps/product/cws_home/680662) (1st ed.). Morgan Kaufmann. ISBN 1-55860-855-9.
Date, C. J. (1999), An Introduction to Database Systems (http://www.aw-bc.com/catalog/academic/product/
0,1144,0321197844,00.html) (8th ed.). Addison-Wesley Longman. ISBN 0-321-19784-4.
Kent, W. (1983) A Simple Guide to Five Normal Forms in Relational Database Theory (http://www.bkent.net/
Doc/simple5.htm), Communications of the ACM, vol. 26, pp. 120–125
Date, C. J., & Darwen, H., & Pascal, F. Database Debunkings (http://www.dbdebunk.com)
Example
Consider a table describing employees' skills:
Employees' Skills

Employee   Skill            Current Work Location
Jones      Typing           114 Main Street
Jones      Shorthand        114 Main Street
Jones      Whittling        114 Main Street
Bravo      Light Cleaning   73 Industrial Way
Ellis      Alchemy          73 Industrial Way
Ellis      Flying           73 Industrial Way
Harrison   Light Cleaning   73 Industrial Way
Neither {Employee} nor {Skill} is a candidate key for the table. This is because a given Employee might need to
appear more than once (he might have multiple Skills), and a given Skill might need to appear more than once (it
might be possessed by multiple Employees). Only the composite key {Employee, Skill} qualifies as a candidate key
for the table.
The remaining attribute, Current Work Location, is dependent on only part of the candidate key, namely Employee.
Therefore the table is not in 2NF. Note the redundancy in the way Current Work Locations are represented: we are
told three times that Jones works at 114 Main Street, and twice that Ellis works at 73 Industrial Way. This
redundancy makes the table vulnerable to update anomalies: it is, for example, possible to update Jones' work
location on his "Typing" and "Shorthand" records and not update his "Whittling" record. The resulting data would
imply contradictory answers to the question "What is Jones' current work location?"
A 2NF alternative to this design would represent the same information in two tables: an "Employees" table with
candidate key {Employee}, and an "Employees' Skills" table with candidate key {Employee, Skill}:
Employees

Employee   Current Work Location
Jones      114 Main Street
Bravo      73 Industrial Way
Ellis      73 Industrial Way
Harrison   73 Industrial Way
Employees' Skills

Employee   Skill
Jones      Typing
Jones      Shorthand
Jones      Whittling
Bravo      Light Cleaning
Ellis      Alchemy
Ellis      Flying
Harrison   Light Cleaning
Tournament Winners

Tournament             Year   Winner           Winner Date of Birth
Indiana Invitational   1998   Al Fredrickson   21 July 1975
Cleveland Open         1999   Bob Albertson    28 September 1968
Des Moines Masters     1999   Al Fredrickson   21 July 1975
Even though Winner and Winner Date of Birth are determined by the whole key {Tournament, Year} and not part of
it, particular Winner / Winner Date of Birth combinations are shown redundantly on multiple records. This leads to
an update anomaly: if updates are not carried out consistently, a particular winner could be shown as having two
different dates of birth.
The underlying problem is the transitive dependency to which the Winner Date of Birth attribute is subject. Winner
Date of Birth actually depends on Winner, which in turn depends on the key Tournament / Year.
This problem is addressed by third normal form (3NF).
Electric Toothbrush Models

Manufacturer   Model         Model Full Name        Manufacturer Country
Forte          X-Prime       Forte X-Prime          Italy
Forte          Ultraclean    Forte Ultraclean       Italy
Dent-o-Fresh   EZbrush       Dent-o-Fresh EZbrush   USA
Kobayashi      ST-60         Kobayashi ST-60        Japan
Hoch           Toothmaster   Hoch Toothmaster       Germany
Hoch           X-Prime       Hoch X-Prime           Germany
Even if the designer has specified the primary key as {Model Full Name}, the table is not in 2NF. {Manufacturer,
Model} is also a candidate key, and Manufacturer Country is dependent on a proper subset of it: Manufacturer. To
make the design conform to 2NF, it is necessary to have two tables:
Manufacturers

Manufacturer   Manufacturer Country
Forte          Italy
Dent-o-Fresh   USA
Kobayashi      Japan
Hoch           Germany

Models

Manufacturer   Model              Model Full Name
Forte          X-Prime            Forte X-Prime
Forte          Ultraclean         Forte Ultraclean
Dent-o-Fresh   EZbrush            Dent-o-Fresh EZbrush
Dent-o-Fresh   BananaBrush-2000   Dent-o-Fresh BananaBrush-2000
Kobayashi      ST-60              Kobayashi ST-60
Hoch           Toothmaster        Hoch Toothmaster
Hoch           X-Prime            Hoch X-Prime
Further reading
Litt's Tips: Normalization (http://www.troubleshooters.com/littstip/ltnorm.html)
Date, C. J., & Lorentzos, N., & Darwen, H. (2002). Temporal Data & the Relational Model (http://www.
elsevier.com/wps/product/cws_home/680662) (1st ed.). Morgan Kaufmann. ISBN 1-55860-855-9.
C. J. Date (2004). Introduction to Database Systems (8th ed.). Boston: Addison-Wesley. ISBN 978-0-321-19784-9.
Kent, W. (1983) A Simple Guide to Five Normal Forms in Relational Database Theory (http://www.bkent.net/
Doc/simple5.htm), Communications of the ACM, vol. 26, pp. 120–125
Date, C.J., & Darwen, H., & Pascal, F. Database Debunkings (http://www.dbdebunk.com)
External links
Database Normalization Basics (http://databases.about.com/od/specificproducts/a/normalization.htm) by
Mike Chapple (About.com)
An Introduction to Database Normalization (http://mikehillyer.com/articles/
an-introduction-to-database-normalization/) by Mike Hillyer.
A tutorial on the first 3 normal forms (http://phlonx.com/resources/nf3/) by Fred Coulson
Description of the database normalization basics (http://support.microsoft.com/kb/283878) by Microsoft
Tournament Winners

Tournament             Year   Winner           Winner Date of Birth
Indiana Invitational   1998   Al Fredrickson   21 July 1975
Cleveland Open         1999   Bob Albertson    28 September 1968
Des Moines Masters     1999   Al Fredrickson   21 July 1975
Because each row in the table needs to tell us who won a particular Tournament in a particular Year, the composite
key {Tournament, Year} is a minimal set of attributes guaranteed to uniquely identify a row. That is, {Tournament,
Year} is a candidate key for the table.
The breach of 3NF occurs because the non-prime attribute Winner Date of Birth is transitively dependent on the
candidate key {Tournament, Year} via the non-prime attribute Winner. The fact that Winner Date of Birth is
functionally dependent on Winner makes the table vulnerable to logical inconsistencies, as there is nothing to stop
the same person from being shown with different dates of birth on different records.
In order to express the same facts without violating 3NF, it is necessary to split the table into two:
Tournament Winners

Tournament             Year   Winner
Indiana Invitational   1998   Al Fredrickson
Cleveland Open         1999   Bob Albertson
Des Moines Masters     1999   Al Fredrickson

Winner Dates of Birth

Winner           Date of Birth
Al Fredrickson   21 July 1975
Bob Albertson    28 September 1968
Update anomalies cannot occur in these tables, which are both in 3NF.
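A sketch of the split in SQL (table and column names and types are illustrative):

    CREATE TABLE Winner (
        Winner        VARCHAR(60) PRIMARY KEY,
        Date_Of_Birth DATE
    );

    CREATE TABLE Tournament_Winners (
        Tournament VARCHAR(60),
        Tourn_Year INTEGER,
        Winner     VARCHAR(60) REFERENCES Winner (Winner),
        PRIMARY KEY (Tournament, Tourn_Year)
    );

Each winner's date of birth is now recorded exactly once, so contradictory dates of birth can no longer arise.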
Further reading
Date, C. J. (1999), An Introduction to Database Systems (http://www.aw-bc.com/catalog/academic/product/
0,1144,0321197844,00.html) (8th ed.). Addison-Wesley Longman. ISBN 0-321-19784-4.
Kent, W. (1983) A Simple Guide to Five Normal Forms in Relational Database Theory (http://www.bkent.net/
Doc/simple5.htm), Communications of the ACM, vol. 26, pp. 120–125
External links
Litt's Tips: Normalization (http://www.troubleshooters.com/littstip/ltnorm.html)
Database Normalization Basics (http://databases.about.com/od/specificproducts/a/normalization.htm) by
Mike Chapple (About.com)
Today's Court Bookings

Court   Start Time   End Time   Rate Type
1       09:30        10:30      SAVER
1       11:00        12:00      SAVER
1       14:00        15:30      STANDARD
2       10:00        11:30      PREMIUM-B
2       11:30        13:30      PREMIUM-B
2       15:00        16:30      PREMIUM-A
Each row in the table represents a court booking at a tennis club that has one hard court (Court 1) and one grass
court (Court 2)
A booking is defined by its Court and the period for which the Court is reserved
Additionally, each booking has a Rate Type associated with it. There are four distinct rate types:
SAVER, for Court 1 bookings made by members
STANDARD, for Court 1 bookings made by non-members
PREMIUM-A, for Court 2 bookings made by members
PREMIUM-B, for Court 2 bookings made by non-members
Note that Court 1 (normal quality) is less expensive than Court 2 (high quality).
The table's superkeys are:
S1 = {Court, Start Time}
S2 = {Court, End Time}
S3 = {Rate Type, Start Time}
S4 = {Rate Type, End Time}
S5 = {Court, Start Time, End Time}
S6 = {Rate Type, Start Time, End Time}
S7 = {Court, Rate Type, Start Time}
S8 = {Court, Rate Type, End Time}
ST = {Court, Rate Type, Start Time, End Time}, the trivial superkey
Note that even though in the above table Start Time and End Time attributes have no duplicate values for each of
them, we still have to admit that in some other days two different bookings on court 1 and court 2 could start at the
same time or end at the same time. This is the reason why {Start Time} and {End Time} cannot be considered as the
table's superkeys.
However, only S1, S2, S3 and S4 are candidate keys (that is, minimal superkeys for that relation) because e.g. S1 ⊂ S5,
so S5 cannot be a candidate key.
Recall that 2NF prohibits partial functional dependencies of non-prime attributes (i.e. an attribute that does not occur
in ANY candidate key) on candidate keys, and that 3NF prohibits transitive functional dependencies of non-prime
attributes on candidate keys.
In Today's Court Bookings table, there are no non-prime attributes: that is, all attributes belong to some candidate
key. Therefore the table adheres to both 2NF and 3NF.
The table does not adhere to BCNF. This is because of the dependency Rate Type → Court, in which the
determining attribute (Rate Type) is neither a candidate key nor a superset of a candidate key.
Dependency Rate Type → Court is respected as a Rate Type should only ever apply to a single Court.
The design can be amended so that it meets BCNF:
Rate Types

Rate Type   Court   Member Flag
SAVER       1       Yes
STANDARD    1       No
PREMIUM-A   2       Yes
PREMIUM-B   2       No
Today's Bookings

Rate Type   Start Time   End Time
SAVER       09:30        10:30
SAVER       11:00        12:00
STANDARD    14:00        15:30
PREMIUM-B   10:00        11:30
PREMIUM-B   11:30        13:30
PREMIUM-A   15:00        16:30
The candidate keys for the Rate Types table are {Rate Type} and {Court, Member Flag}; the candidate keys for the
Today's Bookings table are {Rate Type, Start Time} and {Rate Type, End Time}. Both tables are in BCNF. Having
one Rate Type associated with two different Courts is now impossible, so the anomaly affecting the original table
has been eliminated.
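A sketch of the amended design in SQL (names and types are illustrative):

    CREATE TABLE Rate_Types (
        Rate_Type   VARCHAR(10) PRIMARY KEY,   -- candidate key {Rate Type}
        Court       INTEGER NOT NULL,
        Member_Flag CHAR(3) NOT NULL,
        UNIQUE (Court, Member_Flag)            -- candidate key {Court, Member Flag}
    );

    CREATE TABLE Todays_Bookings (
        Rate_Type  VARCHAR(10) REFERENCES Rate_Types (Rate_Type),
        Start_Time TIME NOT NULL,
        End_Time   TIME NOT NULL,
        PRIMARY KEY (Rate_Type, Start_Time),   -- candidate key {Rate Type, Start Time}
        UNIQUE (Rate_Type, End_Time)           -- candidate key {Rate Type, End Time}
    );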
Achievability of BCNF
In some cases, a non-BCNF table cannot be decomposed into tables that satisfy BCNF and preserve the
dependencies that held in the original table. Beeri and Bernstein showed in 1979 that, for example, a set of functional
dependencies {AB → C, C → B} cannot be represented by a BCNF schema.[6] Thus, unlike the first three normal
forms, BCNF is not always achievable.
Consider the following non-BCNF table whose functional dependencies follow the {AB → C, C → B} pattern:
Nearest Shops

Person     Shop Type     Nearest Shop
Davidson   Optician      Eagle Eye
Wright     Bookshop      Merlin Books
Fuller     Bakery        Doughy's
Fuller     Hairdresser   Sweeney Todd's
Fuller     Optician      Eagle Eye
For each Person / Shop Type combination, the table tells us which shop of this type is geographically nearest to the
person's home. We assume for simplicity that a single shop cannot be of more than one type.
The candidate keys of the table are:
{Person, Shop Type}
{Person, Nearest Shop}
Because all three attributes are prime attributes (i.e. belong to candidate keys), the table is in 3NF. The table is not in
BCNF, however, as the Shop Type attribute is functionally dependent on a non-superkey: Nearest Shop.
The violation of BCNF means that the table is subject to anomalies. For example, Eagle Eye might have its Shop
Type changed to "Optometrist" on its "Fuller" record while retaining the Shop Type "Optician" on its "Davidson"
record. This would imply contradictory answers to the question: "What is Eagle Eye's Shop Type?" Holding each
shop's Shop Type only once would seem preferable, as doing so would prevent such anomalies from occurring:
Shop Near Person

Person   | Shop
Davidson | Eagle Eye
Davidson | Snippets
Wright   | Merlin Books
Fuller   | Doughy's
Fuller   | Sweeney Todd's
Fuller   | Eagle Eye

Shop

Shop           | Shop Type
Eagle Eye      | Optician
Snippets       | Hairdresser
Merlin Books   | Bookshop
Doughy's       | Bakery
Sweeney Todd's | Hairdresser
In this revised design, the "Shop Near Person" table has a candidate key of {Person, Shop}, and the "Shop" table has
a candidate key of {Shop}. Unfortunately, although this design adheres to BCNF, it is unacceptable on different
grounds: it allows us to record multiple shops of the same type against the same person. In other words, its candidate
keys do not guarantee that the functional dependency {Person, Shop Type} → {Shop} will be respected.
A design that eliminates all of these anomalies (but does not conform to BCNF) is possible.[7] This design consists of
the original "Nearest Shops" table supplemented by the "Shop" table described above.
Nearest Shops

Person   | Shop Type   | Nearest Shop
Davidson | Optician    | Eagle Eye
Davidson | Hairdresser | Snippets
Wright   | Bookshop    | Merlin Books
Fuller   | Bakery      | Doughy's
Fuller   | Hairdresser | Sweeney Todd's
Fuller   | Optician    | Eagle Eye
Shop

Shop           | Shop Type
Eagle Eye      | Optician
Snippets       | Hairdresser
Merlin Books   | Bookshop
Doughy's       | Bakery
Sweeney Todd's | Hairdresser
If a referential integrity constraint is defined to the effect that {Shop Type, Nearest Shop} from the first table must
refer to a {Shop Type, Shop} from the second table, then the data anomalies described previously are prevented.
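In SQL terms, such a constraint can be declared as a composite foreign key. A minimal sketch with illustrative names follows; note that most dialects require the referenced column pair to carry a UNIQUE constraint:

CREATE TABLE shop (
    shop      VARCHAR(32) PRIMARY KEY,
    shop_type VARCHAR(32) NOT NULL,
    UNIQUE (shop_type, shop)             -- target for the composite reference below
);

CREATE TABLE nearest_shops (
    person       VARCHAR(32) NOT NULL,
    shop_type    VARCHAR(32) NOT NULL,
    nearest_shop VARCHAR(32) NOT NULL,
    PRIMARY KEY (person, shop_type),
    -- {Shop Type, Nearest Shop} must match an existing {Shop Type, Shop} pair:
    FOREIGN KEY (shop_type, nearest_shop) REFERENCES shop (shop_type, shop)
);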
References
[1] Codd, E. F. "Recent Investigations into Relational Data Base Systems." IBM Research Report RJ1385 (April 23, 1974). Republished in Proc. 1974 Congress (Stockholm, Sweden, 1974). New York, N.Y.: North-Holland (1974).
[2] Heath, I. "Unacceptable File Operations in a Relational Database." Proc. 1971 ACM SIGFIDET Workshop on Data Description, Access, and Control, San Diego, Calif. (November 11–12, 1971).
[3] Date, C. J. Database in Depth: Relational Theory for Practitioners. O'Reilly (2005), p. 142.
[4] Silberschatz, Abraham (2006). Database System Concepts (6th ed.). McGraw-Hill. p. 333. ISBN 978-0-07-352332-3.
[5] Vincent, M. W. and B. Srinivasan. "A Note on Relation Schemes Which Are in 3NF But Not in BCNF." Information Processing Letters 48(6), 1993, pp. 281–283.
[6] Beeri, Catriel and Bernstein, Philip A. "Computational problems related to the design of normal form relational schemas." ACM Transactions on Database Systems 4(1), March 1979, p. 50.
[7] Zaniolo, Carlo. "A New Normal Form for the Design of Relational Database Schemata." ACM Transactions on Database Systems 7(3), September 1982, p. 493.
Bibliography
Date, C. J. (1999). An Introduction to Database Systems (8th ed.). Addison-Wesley Longman. ISBN
0-321-19784-4.
External links
Rules Of Data Normalization (http://web.archive.org/web/20080805014412/http://www.datamodel.org/
NormalizationRules.html)
Advanced Normalization (http://web.archive.org/web/20080423014733/http://www.utexas.edu/its/
archive/windows/database/datamodeling/rm/rm8.html) by ITS, University of Texas.
Multivalued dependencies
If the column headings in a relational database table are divided into three disjoint groupings X, Y, and Z, then, in the
context of a particular row, we can refer to the data beneath each group of headings as x, y, and z respectively. A
multivalued dependency X ↠ Y signifies that if we choose any x actually occurring in the table (call this choice xc), and compile a list of all the xcyz combinations that occur in the table, we will find that xc is associated with the same y entries regardless of z.
A trivial multivalued dependency X ↠ Y is one in which either Y is a subset of X, or X and Y together form the whole set of attributes of the relation.
Example
Consider the following example:
Restaurant       | Pizza Variety | Delivery Area
A1 Pizza         | Thick Crust   | Springfield
A1 Pizza         | Thick Crust   | Shelbyville
A1 Pizza         | Thick Crust   | Capital City
A1 Pizza         | Stuffed Crust | Springfield
A1 Pizza         | Stuffed Crust | Shelbyville
A1 Pizza         | Stuffed Crust | Capital City
Elite Pizza      | Thin Crust    | Capital City
Elite Pizza      | Stuffed Crust | Capital City
Vincenzo's Pizza | Thick Crust   | Springfield
Vincenzo's Pizza | Thick Crust   | Shelbyville
Vincenzo's Pizza | Thin Crust    | Springfield
Vincenzo's Pizza | Thin Crust    | Shelbyville
Each row indicates that a given restaurant can deliver a given variety of pizza to a given area.
The table has no non-key attributes because its only key is {Restaurant, Pizza Variety, Delivery Area}. Therefore it
meets all normal forms up to BCNF. If we assume, however, that pizza varieties offered by a restaurant are not
affected by delivery area, then it does not meet 4NF. The problem is that the table features two non-trivial
multivalued dependencies on the {Restaurant} attribute (which is not a superkey). The dependencies are:
{Restaurant} ↠ {Pizza Variety}
{Restaurant} ↠ {Delivery Area}
These non-trivial multivalued dependencies on a non-superkey reflect the fact that the varieties of pizza a restaurant
offers are independent from the areas to which the restaurant delivers. This state of affairs leads to redundancy in the
table: for example, we are told three times that A1 Pizza offers Stuffed Crust, and if A1 Pizza starts producing
Cheese Crust pizzas then we will need to add multiple rows, one for each of A1 Pizza's delivery areas. There is,
moreover, nothing to prevent us from doing this incorrectly: we might add Cheese Crust rows for all but one of A1
Pizza's delivery areas, thereby failing to respect the multivalued dependency {Restaurant} ↠ {Pizza Variety}.
To eliminate the possibility of these anomalies, we must place the facts about varieties offered into a different table
from the facts about delivery areas, yielding two tables that are both in 4NF:
Varieties By Restaurant

Restaurant       | Pizza Variety
A1 Pizza         | Thick Crust
A1 Pizza         | Stuffed Crust
Elite Pizza      | Thin Crust
Elite Pizza      | Stuffed Crust
Vincenzo's Pizza | Thick Crust
Vincenzo's Pizza | Thin Crust

Delivery Areas By Restaurant

Restaurant       | Delivery Area
A1 Pizza         | Springfield
A1 Pizza         | Shelbyville
A1 Pizza         | Capital City
Elite Pizza      | Capital City
Vincenzo's Pizza | Springfield
Vincenzo's Pizza | Shelbyville
In contrast, if the pizza varieties offered by a restaurant sometimes did legitimately vary from one delivery area to
another, the original three-column table would satisfy 4NF.
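A rough SQL rendering of the 4NF decomposition above (names and types are illustrative). The final query shows that the original three-column table is recoverable as the join of the two projections on Restaurant:

CREATE TABLE varieties_by_restaurant (
    restaurant    VARCHAR(32) NOT NULL,
    pizza_variety VARCHAR(32) NOT NULL,
    PRIMARY KEY (restaurant, pizza_variety)
);

CREATE TABLE delivery_areas_by_restaurant (
    restaurant    VARCHAR(32) NOT NULL,
    delivery_area VARCHAR(32) NOT NULL,
    PRIMARY KEY (restaurant, delivery_area)
);

-- Reconstruct the original permutations table:
SELECT v.restaurant, v.pizza_variety, d.delivery_area
FROM varieties_by_restaurant v
JOIN delivery_areas_by_restaurant d ON d.restaurant = v.restaurant;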
Ronald Fagin demonstrated that it is always possible to achieve 4NF.[2] Rissanen's theorem is also applicable to multivalued dependencies.
4NF in practice
A 1992 paper by Margaret S. Wu notes that the teaching of database normalization typically stops short of 4NF,
perhaps because of a belief that tables violating 4NF (but meeting all lower normal forms) are rarely encountered in
business applications. This belief may not be accurate, however. Wu reports that in a study of forty organizational
databases, over 20% contained one or more tables that violated 4NF while meeting all lower normal forms.[3]
References
[1] "A relation schema R* is in fourth normal form (4NF) if, whenever a nontrivial multivalued dependency X
Y holds for R*, then so does
the functional dependency X A for every column name A of R*. Intuitively all dependencies are the result of keys." Fagin, Ronald
(September 1977). "Multivalued Dependencies and a New Normal Form for Relational Databases" (http:/ / www. almaden. ibm. com/ cs/
people/ fagin/ tods77. pdf). ACM Transactions on Database Systems 2 (1): 267. doi:10.1145/320557.320571. .
[2] Fagin, p. 268
[3] Wu, Margaret S. (March 1992). "The Practical Need for Fourth Normal Form". ACM SIGCSE Bulletin 24 (1): 1923.
doi:10.1145/135250.134515.
Further reading
Date, C. J. (1999), An Introduction to Database Systems (http://www.aw-bc.com/catalog/academic/product/
0,1144,0321197844,00.html) (8th ed.). Addison-Wesley Longman. ISBN 0-321-19784-4.
Kent, W. (1983) A Simple Guide to Five Normal Forms in Relational Database Theory (http://www.bkent.net/Doc/simple5.htm), Communications of the ACM, vol. 26, pp. 120–125
Date, C.J., & Darwen, H., & Pascal, F. Database Debunkings (http://www.dbdebunk.com)
Advanced Normalization (http://www.utexas.edu/its/windows/database/datamodeling/rm/rm8.html) by
ITS, University of Texas.
Fifth normal form

Example
Consider the following example:
Travelling Salesman | Brand   | Product Type
Jack Schneider      | Acme    | Vacuum Cleaner
Jack Schneider      | Acme    | Breadbox
Willy Loman         | Robusto | Pruning Shears
Willy Loman         | Robusto | Vacuum Cleaner
Willy Loman         | Robusto | Breadbox
Willy Loman         | Robusto | Umbrella Stand
Louis Ferguson      | Robusto | Vacuum Cleaner
Louis Ferguson      | Robusto | Telescope
Louis Ferguson      | Acme    | Vacuum Cleaner
Louis Ferguson      | Acme    | Lava Lamp
Louis Ferguson      | Nimbus  | Tie Rack
The table's predicate is: Products of the type designated by Product Type, made by the brand designated by Brand,
are available from the travelling salesman designated by Travelling Salesman.
In the absence of any rules restricting the valid possible combinations of Travelling Salesman, Brand, and Product
Type, the three-attribute table above is necessary in order to model the situation correctly.
Suppose, however, that the following rule applies: A Travelling Salesman has certain Brands and certain Product
Types in his repertoire. If Brand B is in his repertoire, and Product Type P is in his repertoire, then (assuming
Brand B makes Product Type P), the Travelling Salesman must offer only the products of Product Type P made by
Brand B.
In that case, it is possible to split the table into three:
Product Types By Travelling Salesman

Travelling Salesman | Product Type
Jack Schneider      | Vacuum Cleaner
Jack Schneider      | Breadbox
Willy Loman         | Pruning Shears
Willy Loman         | Vacuum Cleaner
Willy Loman         | Breadbox
Willy Loman         | Umbrella Stand
Louis Ferguson      | Telescope
Louis Ferguson      | Vacuum Cleaner
Louis Ferguson      | Lava Lamp
Louis Ferguson      | Tie Rack
Brands By Travelling Salesman

Travelling Salesman | Brand
Jack Schneider      | Acme
Willy Loman         | Robusto
Louis Ferguson      | Robusto
Louis Ferguson      | Acme
Louis Ferguson      | Nimbus
Product Types By Brand

Brand   | Product Type
Acme    | Vacuum Cleaner
Acme    | Breadbox
Acme    | Lava Lamp
Robusto | Pruning Shears
Robusto | Vacuum Cleaner
Robusto | Breadbox
Robusto | Umbrella Stand
Robusto | Telescope
Nimbus  | Tie Rack
Note how this setup helps to remove redundancy. Suppose that Jack Schneider starts selling Robusto's products. In
the previous setup we would have to add two new entries since Jack Schneider is able to sell two Product Types
covered by Robusto: Breadboxes and Vacuum Cleaners. With the new setup we need only add a single entry (in
Brands By Travelling Salesman).
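The three tables can be sketched in SQL as follows (illustrative names and types). Under the stated rule, the original table is exactly the three-way join of these projections, which is what the closing query computes:

CREATE TABLE product_types_by_salesman (
    salesman     VARCHAR(32) NOT NULL,
    product_type VARCHAR(32) NOT NULL,
    PRIMARY KEY (salesman, product_type)
);

CREATE TABLE brands_by_salesman (
    salesman VARCHAR(32) NOT NULL,
    brand    VARCHAR(32) NOT NULL,
    PRIMARY KEY (salesman, brand)
);

CREATE TABLE product_types_by_brand (
    brand        VARCHAR(32) NOT NULL,
    product_type VARCHAR(32) NOT NULL,
    PRIMARY KEY (brand, product_type)
);

-- Reconstruct the original table as the join of the three projections:
SELECT pts.salesman, bs.brand, ptb.product_type
FROM product_types_by_salesman pts
JOIN brands_by_salesman bs
  ON bs.salesman = pts.salesman
JOIN product_types_by_brand ptb
  ON ptb.brand = bs.brand
 AND ptb.product_type = pts.product_type;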
Usage
Only in rare situations does a 4NF table not conform to 5NF. These are situations in which a complex real-world
constraint governing the valid combinations of attribute values in the 4NF table is not implicit in the structure of that
table. If such a table is not normalized to 5NF, the burden of maintaining the logical consistency of the data within
the table must be carried partly by the application responsible for insertions, deletions, and updates to it; and there is
a heightened risk that the data within the table will become inconsistent. In contrast, the 5NF design excludes the
possibility of such inconsistencies.
Further reading
Kent, W. (1983) A Simple Guide to Five Normal Forms in Relational Database Theory (http://www.bkent.net/Doc/simple5.htm), Communications of the ACM, vol. 26, pp. 120–125
Date, C.J., & Darwen, H., & Pascal, F. Database Debunkings (http://www.dbdebunk.com)
Domain/key normal form

Example
A violation of DKNF occurs in the following table:
Wealthy Person

Wealthy Person | Wealthy Person Type   | Net Worth in Dollars
Steve          | Eccentric Millionaire | 124,543,621
Roderick       | Evil Billionaire      | 6,553,228,893
Katrina        | Eccentric Billionaire | 8,829,462,998
Gary           | Evil Millionaire      | 495,565,211
(Assume that the domain for Wealthy Person consists of the names of all wealthy people in a pre-defined sample of
wealthy people; the domain for Wealthy Person Type consists of the values 'Eccentric Millionaire', 'Eccentric
Billionaire', 'Evil Millionaire', and 'Evil Billionaire'; and the domain for Net Worth in Dollars consists of all integers
greater than or equal to 1,000,000.)
There is a constraint linking Wealthy Person Type to Net Worth in Dollars, even though we cannot deduce one from
the other. The constraint dictates that an Eccentric Millionaire or Evil Millionaire will have a net worth of 1,000,000
to 999,999,999 inclusive, while an Eccentric Billionaire or Evil Billionaire will have a net worth of 1,000,000,000 or
higher. This constraint is neither a domain constraint nor a key constraint; therefore we cannot rely on domain
constraints and key constraints to guarantee that an inconsistent Wealthy Person Type / Net Worth in Dollars
combination does not make its way into the database.
The DKNF violation could be eliminated by altering the Wealthy Person Type domain to make it consist of just two
values, 'Evil' and 'Eccentric' (the wealthy person's status as a millionaire or billionaire is implicit in their Net Worth
in Dollars, so no useful information is lost).
Wealthy Person

Wealthy Person | Wealthy Person Type | Net Worth in Dollars
Steve          | Eccentric           | 124,543,621
Roderick       | Evil                | 6,553,228,893
Katrina        | Eccentric           | 8,829,462,998
Gary           | Evil                | 495,565,211

Wealthiness Status

Status      | Minimum       | Maximum
Millionaire | 1,000,000     | 999,999,999
Billionaire | 1,000,000,000 | 999,999,999,999
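The point of the revision is that domain and key constraints alone now suffice. A minimal SQL sketch with hypothetical names, in which the CHECK clauses play the role of the domain constraints:

CREATE TABLE wealthiness_status (
    status  VARCHAR(16) PRIMARY KEY,    -- key constraint
    minimum BIGINT NOT NULL,
    maximum BIGINT NOT NULL
);

CREATE TABLE wealthy_person (
    wealthy_person       VARCHAR(32) PRIMARY KEY,              -- key constraint
    wealthy_person_type  VARCHAR(16) NOT NULL
        CHECK (wealthy_person_type IN ('Eccentric', 'Evil')),  -- domain constraint
    net_worth_in_dollars BIGINT NOT NULL
        CHECK (net_worth_in_dollars >= 1000000)                -- domain constraint
);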
External links
Database Normalization Basics (http://databases.about.com/od/specificproducts/a/normalization.htm) by
Mike Chapple (About.com)
An Introduction to Database Normalization (http://dev.mysql.com/tech-resources/articles/
intro-to-normalization.html) by Mike Hillyer.
Normalization (http://www.utexas.edu/its-archive/windows/database/datamodeling/rm/rm7.html) by ITS,
University of Texas.
A tutorial on the first 3 normal forms (http://phlonx.com/resources/nf3/) by Fred Coulson
Description of the database normalization basics (http://support.microsoft.com/kb/283878) by Microsoft
Sixth normal form

DKNF
Some authors use the term sixth normal form differently, namely, as a synonym for Domain/key normal form
(DKNF). This usage predates Date et al.'s work.[6]
Usage
The sixth normal form is currently being used in some data warehouses where the benefits outweigh the
drawbacks,[7] for example using Anchor Modeling. Although using 6NF leads to an explosion of tables, modern
databases can prune the tables from select queries (using a process called 'table elimination') where they are not
required and thus speed up queries that only access several attributes.
References
[6] See www.dbdebunk.com for a discussion of this topic (http://www.dbdebunk.com/page/page/621935.htm)
[7] See the Anchor Modeling website (http://www.anchormodeling.com) for a data warehouse modelling method based on the sixth normal form
Further reading
Date, C. J. (2006). The Relational Database Dictionary: A Comprehensive Glossary of Relational Terms and Concepts, with Illustrative Examples. O'Reilly Series Pocket References. O'Reilly Media. p. 90. ISBN 978-0-596-52798-3.
Date, Chris J.; Hugh Darwen, Nikos A. Lorentzos (January 2003). Temporal Data and the Relational Model: A Detailed Investigation into the Application of Interval and Relation Theory to the Problem of Temporal Database Management. Oxford: Elsevier. ISBN 1-55860-855-9.
Zimányi, E. (June 2006). "Temporal Aggregates and Temporal Universal Quantification in Standard SQL" (http://www.sigmod.org/publications/sigmod-record/0606/sigmod-record.june2006.pdf) (PDF). ACM SIGMOD Record, volume 35, number 2, page 16. ACM.
Relation (database)
In a relational database, a relation is a set of tuples (d1, d2, ..., dn), where each element dj is a member of Dj, a data domain.[1] Each distinct domain used in the definition of a relation is called an attribute, and each attribute may be named.
In SQL, a query language for relational databases, relations are
represented by tables, where each row of a table represents a single
tuple, and where the values of each attribute form a column.
E. F. Codd originally used the term in its mathematical sense of a finitary relation, a set of tuples on some set of n sets S1, S2, ..., Sn.[2] In this sense, the term was used by Augustus De Morgan in 1858.[3]
Where all values of every attribute of a relation are atomic, that relation is said to be in first normal form.
Examples
Below is an example of a relation having three named attributes: 'ID' from the domain of integers, and 'Name' and
'Address' from the domain of strings:
ID (Integer) | Name (String)      | Address (String)
102          | Yonezawa Akinori   | Naha, Okinawa
202          | Murata Makoto      | Sendai, Miyagi
104          | Sakamura Ken       | Kumamoto, Kumamoto
152          | Matsumoto Yukihiro | Okinawa, Okinawa
The tuples are unordered: one cannot say "The tuple of 'Murata Makoto' is above the tuple of 'Matsumoto Yukihiro'", nor can one say "The tuple of 'Yonezawa Akinori' is the first tuple."
Data Definition Language (DDL) is used to define derived relation variables as well. In SQL, the CREATE VIEW syntax is used to define a derived relation variable. The following is an example.
CREATE VIEW List_of_Okinawa_people AS (
SELECT ID, Name, Address
FROM List_of_people
WHERE Address LIKE '%, Okinawa'
)
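For completeness, here is one possible definition of the base relation variable the view ranges over; only the attribute names come from the example above, and the column types are assumptions:

CREATE TABLE List_of_people (
    ID      INTEGER      PRIMARY KEY,   -- attribute from the domain of integers
    Name    VARCHAR(64)  NOT NULL,      -- attribute from the domain of strings
    Address VARCHAR(128) NOT NULL       -- attribute from the domain of strings
);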
References
[1] E. F. Codd (Oct 1972). "Further normalization of the data base relational model". Data Base Systems. Courant Institute: Prentice-Hall. ISBN 0-13-196741-X. "R is a relation on these n sets if it is a set of elements of the form (d1, d2, ..., dn) where dj ∈ Dj for each j = 1, 2, ..., n."
[2] Codd, Edgar F (June 1970). "A Relational Model of Data for Large Shared Data Banks" (http://www.seas.upenn.edu/~zives/03f/cis550/codd.pdf). Communications of the ACM (Association for Computing Machinery) 13 (6): 377–387. doi:10.1145/362384.362685. "The term relation is used here in its accepted mathematical sense."
[3] Augustus De Morgan (1858). "On the Syllogism, No. III". Transactions of the Cambridge Philosophical Society 10: 208. "When two objects, qualities, classes, or attributes, viewed together by the mind, are seen under some connexion, that connexion is called a relation."
Functional dependency
A functional dependency is a constraint between two sets of attributes in a relation from a database.
Given a relation R, a set of attributes X in R is said to functionally determine another attribute Y, also in R, (written X → Y) if, and only if, each X value is associated with precisely one Y value. Customarily we call X the determinant set and Y the dependent attribute. Thus, given a tuple and the values of the attributes in X, one can determine the corresponding value of the Y attribute: if the X value is known, the Y value is certainly known. More generally, given that X and Y are sets of attributes in R, X → Y denotes that X functionally determines each of the members of Y; in this case Y is known as the dependent set. Thus, a candidate key is a minimal set of attributes that functionally determine all of the attributes in a relation. The concept of functional dependency captures the situation in which the value of one set of attributes uniquely determines the value of another.
(Note: the "function" being discussed in "functional dependency" is the function of identification.)
A functional dependency FD: X → Y is called trivial if Y is a subset of X.
The determination of functional dependencies is an important part of designing databases in the relational model, and
in database normalization and denormalization. The functional dependencies, along with the attribute domains, are
selected so as to generate constraints that would exclude as much data inappropriate to the user domain from the
system as possible.
For example, suppose one is designing a system to track vehicles and the capacity of their engines. Each vehicle has
a unique vehicle identification number (VIN). One would write VIN → EngineCapacity because it would be inappropriate for a vehicle's engine to have more than one capacity (assuming, in this case, that vehicles have only one engine). However, EngineCapacity → VIN is incorrect because there could be many vehicles with the same engine capacity.
This functional dependency may suggest that the attribute EngineCapacity be placed in a relation with candidate key VIN. However, that may not always be appropriate. For example, if that functional dependency occurs as a result of the transitive functional dependencies VIN → VehicleModel and VehicleModel → EngineCapacity, then it would be more appropriate to place EngineCapacity in a relation keyed on VehicleModel.
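A sketch of the decomposition this suggests, with hypothetical table names and types:

CREATE TABLE vehicle_model (
    vehicle_model   VARCHAR(32) PRIMARY KEY,
    engine_capacity DECIMAL(4, 1) NOT NULL   -- VehicleModel -> EngineCapacity
);

CREATE TABLE vehicle (
    vin           VARCHAR(17) PRIMARY KEY,   -- VIN -> VehicleModel
    vehicle_model VARCHAR(32) NOT NULL REFERENCES vehicle_model (vehicle_model)
);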
Example
This example illustrates the concept of functional dependency. The situation modeled is that of college students
visiting one or more lectures in each of which they are assigned a teaching assistant (TA). Let's further assume that
every student is in some semester and is identified by a unique integer ID.
StudentID | Semester | Lecture           | TA
1234      | 6        | Numerical Methods | John
2380      | 4        | Numerical Methods | Peter
1234      | 6        | Visual Computing  | Amina
1201      | 4        | Numerical Methods | Peter
1201      | 4        | Physics II        | Simone
We notice that whenever two rows in this table feature the same StudentID, they also necessarily have the same
Semester values. This basic fact can be expressed by a functional dependency:
StudentID → Semester.
Other nontrivial functional dependencies can be identified, for example:
{StudentID, Lecture} → TA
{StudentID, Lecture} → {TA, Semester}
The latter expresses the fact that the set {StudentID, Lecture} is a superkey of the relation.
External links
Gary Burt (summer, 1999). "CS 461 (Database Management Systems) lecture notes" [1]. University of Maryland
Baltimore County Department of Computer Science and Electrical Engineering.
Jeffrey D. Ullman. "CS345 Lecture Notes" [2] (PostScript). Stanford University.
Osmar Zaiane (June 9, 1998). "CMPT 354 (Database Systems I) lecture notes" [3]. Simon Fraser University
Department of Computing Science.
References
[1] http://www.cs.umbc.edu/courses/461/current/burt/lectures/lec14/
[2] http://www-db.stanford.edu/~ullman/cs345notes/slides01-1.ps
[3] http://www.cs.sfu.ca/CC/354/zaiane/material/notes/Chapter6/node10.html
Multivalued dependency
In database theory, multivalued dependency is a full constraint between two sets of attributes in a relation.
In contrast to the functional dependency, the multivalued dependency requires that certain tuples be present in a
relation. Therefore, a multivalued dependency is a special case of tuple-generating dependency. The multivalued
dependency plays a role in the 4NF database normalization.
Formal definition
The formal definition is given as follows.[1]
Let R be a relation schema and let X ⊆ R and Y ⊆ R be sets of attributes. The multivalued dependency X ↠ Y (read "X multidetermines Y") holds on R if, in any legal relation r(R), for all pairs of tuples t1 and t2 in r such that t1[X] = t2[X], there exist tuples t3 and t4 in r such that:
t3[X] = t4[X] = t1[X] = t2[X]
t3[Y] = t1[Y] and t4[Y] = t2[Y]
t3[R − Y] = t2[R − Y] and t4[R − Y] = t1[R − Y]
In more simple words the above condition can be expressed as follows: if we denote by (x, y, z) the tuple having values for X, Y, and R − X − Y collectively equal to x, y, z, then whenever the tuples (a, b, c) and (a, d, e) exist in r, the tuples (a, b, e) and (a, d, c) must exist in r as well.
Example
Consider this example of a database of teaching courses, the books recommended for the course, and the lecturers
who will be teaching the course:
Course | Book         | Lecturer
AHA    | Silberschatz | John D
AHA    | Nederpelt    | John D
AHA    | Silberschatz | William M
AHA    | Nederpelt    | William M
AHA    | Silberschatz | Christian G
AHA    | Nederpelt    | Christian G
OSO    | Silberschatz | John D
OSO    | Silberschatz | William M
Because the lecturers attached to the course and the books attached to the course are independent of each other, this
database design has a multivalued dependency; if we were to add a new book to the AHA course, we would have to
add one record for each of the lecturers on that course, and vice versa.
Put formally, there are two multivalued dependencies in this relation: {course} ↠ {book} and equivalently {course} ↠ {lecturer}.
Databases with multivalued dependencies thus exhibit redundancy. In database normalization, fourth normal form requires that either every multivalued dependency X ↠ Y is trivial or, for every nontrivial multivalued dependency X ↠ Y, X is a superkey.
Interesting properties
If X ↠ Y holds, then X ↠ R − X − Y holds (complementation).
If X ↠ Y and Z ⊆ W, then WX ↠ YZ (augmentation).
If X ↠ Y and Y ↠ Z, then X ↠ Z − Y (transitivity).
If the functional dependency X → Y holds, then the multivalued dependency X ↠ Y holds as well (replication).
Definitions
full constraint
A constraint which expresses something about all attributes in a database. (In contrast to an embedded constraint.) That a multivalued dependency is a full constraint follows from its definition, since it says something about the attributes R − Y as well as about Y.
tuple-generating dependency
A dependency which explicitly requires certain tuples to be present in the relation.
trivial multivalued dependency 1
A multivalued dependency which involves all the attributes of a relation, i.e. R = X ∪ Y. A trivial multivalued dependency implies, for tuples t1 and t2, tuples t3 and t4 that are equal to t1 and t2.
trivial multivalued dependency 2
A multivalued dependency for which Y ⊆ X.
References
[1] Silberschatz, Abraham; Korth, Henry F.; Sudarshan, S. (2006). Database System Concepts (5th ed.). McGraw-Hill. p. 295. ISBN 0-07-124476-X.
External links
Multivalued dependencies and a new Normal form for Relational Databases (http://www.almaden.ibm.com/cs/people/fagin/tods77.pdf) (PDF) - Ronald Fagin, IBM Research Lab
Join dependency
A join dependency is a constraint on the set of legal relations over a database scheme. A table T is subject to a join
dependency if T can always be recreated by joining multiple tables each having a subset of the attributes of T. If one
of the tables in the join has all the attributes of the table T, the join dependency is called trivial.
The join dependency plays an important role in the fifth normal form, also known as project-join normal form, because it can be proven that if a scheme R is decomposed in tables R1 to Rn, the decomposition will be a lossless-join decomposition if the legal relations on R are restricted to a join dependency on R called *(R1, R2, ..., Rn).
Another way to describe a join dependency is to say that the sets of relationships in the join dependency are independent of each other.
Formal definition
Let R be a relation schema and let R1, R2, ..., Rn be a decomposition of R. The relation r(R) satisfies the join dependency *(R1, R2, ..., Rn) if and only if r equals the natural join of its projections on each Ri, that is, r = πR1(r) ⋈ πR2(r) ⋈ ... ⋈ πRn(r).
Example
Consider a pizza chain that models purchases in the table Customer = {order-number, customer-name, pizza-name, courier}. The following dependencies can be derived:
customer-name depends on order-number
pizza-name depends on order-number
courier depends on order-number
Since the relationships are independent you can say there is a join dependency as follows: *((order-number,
customer-name), (order-number, pizza-name), (order-number,courier)).
If each customer has his own courier however, you could have a join-dependency like this: *((order-number,
customer-name), (order-number, courier), (customer-name, courier), (order-number,pizza-name)), but
*((order-number, customer-name, courier), (order-number,pizza-name)) would be valid as well. This makes it
obvious that just having a join dependency is not enough to normalize a database scheme.
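A possible SQL decomposition along the first join dependency above (illustrative names; the closing query losslessly reconstructs the original Customer table):

CREATE TABLE order_customer (
    order_number  INT PRIMARY KEY,
    customer_name VARCHAR(32) NOT NULL
);

CREATE TABLE order_pizza (
    order_number INT PRIMARY KEY REFERENCES order_customer (order_number),
    pizza_name   VARCHAR(32) NOT NULL
);

CREATE TABLE order_courier (
    order_number INT PRIMARY KEY REFERENCES order_customer (order_number),
    courier      VARCHAR(32) NOT NULL
);

-- Join the projections back together:
SELECT c.order_number, c.customer_name, p.pizza_name, k.courier
FROM order_customer c
JOIN order_pizza   p ON p.order_number = c.order_number
JOIN order_courier k ON k.order_number = c.order_number;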
Concurrency control
In information technology and computer science, especially in the fields of computer programming, operating
systems, multiprocessors, and databases, concurrency control ensures that correct results for concurrent operations
are generated, while getting those results as quickly as possible.
Computer systems, both software and hardware, consist of modules, or components. Each component is designed to
operate correctly, i.e., to obey or meet certain consistency rules. When components that operate concurrently
interact by messaging or by sharing accessed data (in memory or storage), a certain component's consistency may be
violated by another component. The general area of concurrency control provides rules, methods, design
methodologies, and theories to maintain the consistency of components operating concurrently while interacting, and
thus the consistency and correctness of the whole system. Introducing concurrency control into a system means
applying operation constraints which typically result in some performance reduction. Operation consistency and correctness should be achieved with the best possible efficiency, without reducing performance below reasonable levels.
For example, a failure in concurrency control can result in data corruption from torn read or write operations.
As database systems have become distributed, or needed to cooperate in distributed environments (e.g., federated databases in the early 1990s, and cloud computing currently), the effective distribution of concurrency control mechanisms has received special attention.
Concurrency control can prevent, among other problems, the following classic ones:
1. The lost update problem: a second transaction writes a second value of a data-item on top of a first value written by a first concurrent transaction, and the first value is lost to other transactions running concurrently which need, by their precedence, to read the first value; such transactions end with incorrect results.
2. The dirty read problem: transactions read a value written by a transaction that is later aborted. This value disappears from the database upon the abort, and should not have been read by any transaction ("dirty read"); the reading transactions end with incorrect results.
3. The incorrect summary problem: While one transaction takes a summary over the values of all the instances of a
repeated data-item, a second transaction updates some instances of that data-item. The resulting summary does
not reflect a correct result for any (usually needed for correctness) precedence order between the two transactions
(if one is executed before the other), but rather some random result, depending on the timing of the updates, and
whether certain update results have been included in the summary or not.
Most high-performance transactional systems need to run transactions concurrently to meet their performance
requirements. Thus, without concurrency control, such systems can neither provide correct results nor keep their databases consistent.
Other major concurrency control types that are utilized in conjunction with the methods above include:
Multiversion concurrency control (MVCC) - Increasing concurrency and performance by generating a new version of a database object each time the object is written, and allowing transactions to read one of the several last relevant versions of each object, depending on the scheduling method.
Index concurrency control - Synchronizing access operations to indexes, rather than to user data. Specialized
methods provide substantial performance gains.
Private workspace model (Deferred update) - Each transaction maintains a private workspace for its accessed
data, and its changed data become visible outside the transaction only upon its commit (e.g., Weikum and Vossen
2001). This model provides a different concurrency control behavior with benefits in many cases.
The most common mechanism type in database systems since their early days in the 1970s has been Strong strict
Two-phase locking (SS2PL; also called Rigorous scheduling or Rigorous 2PL) which is a special case (variant) of
both Two-phase locking (2PL) and Commitment ordering (CO). It is pessimistic. In spite of its long name (for
historical reasons) the idea of the SS2PL mechanism is simple: "Release all locks applied by a transaction only after
the transaction has ended." SS2PL (or Rigorousness) is also the name of the set of all schedules that can be generated by this mechanism, i.e., the schedules that have the SS2PL (or Rigorousness) property.
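In SQL, this behavior corresponds roughly to running transactions at the SERIALIZABLE isolation level on an engine that implements that level with strict two-phase locking. A generic sketch with hypothetical table and column names (syntax and actual locking behavior vary by dialect):

BEGIN;
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;  -- placement varies by dialect
UPDATE accounts SET balance = balance - 100 WHERE id = 1;  -- write lock acquired
UPDATE accounts SET balance = balance + 100 WHERE id = 2;  -- write lock acquired
COMMIT;  -- under SS2PL, every lock is released only at this point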
Recoverability
See Recoverability in Serializability
Comment: While in the general area of systems the term "recoverability" may refer to the ability of a system to
recover from failure or from an incorrect/forbidden state, within concurrency control of database systems this term
has received a specific meaning.
Concurrency control typically also ensures the Recoverability property of schedules for maintaining correctness in
cases of aborted transactions (which can always happen for many reasons). Recoverability (from abort) means that
no committed transaction in a schedule has read data written by an aborted transaction. Such data disappear from the
database (upon the abort) and are parts of an incorrect database state. Reading such data violates the consistency rule
of ACID. Unlike Serializability, Recoverability cannot be compromised or relaxed in any case, since any relaxation
results in quick database integrity violation upon aborts. The major methods listed above provide serializability
mechanisms. None of them in its general form automatically provides recoverability, and special considerations and
mechanism enhancements are needed to support recoverability. A commonly utilized special case of recoverability is
Strictness, which allows efficient database recovery from failure (but excludes optimistic implementations; e.g.,
Strict CO (SCO) cannot have an optimistic implementation, but has semi-optimistic ones).
Comment: Note that the Recoverability property is needed even if no database failure occurs and no database
recovery from failure is needed. It is rather needed to correctly automatically handle transaction aborts, which may
be unrelated to database failure and recovery from it.
Distribution
With the fast technological development of computing, the difference between local and distributed computing over low-latency networks or buses is blurring. Thus the quite effective utilization of local techniques in such distributed
environments is common, e.g., in computer clusters and multi-core processors. However the local techniques have
their limitations and use multi-processes (or threads) supported by multi-processors (or multi-cores) to scale. This
often turns transactions into distributed ones, if they themselves need to span multi-processes. In these cases most
local concurrency control techniques do not scale well.
Distributed serializability and Commitment ordering
See Distributed serializability in Serializability
As database systems have become distributed, or started to cooperate in distributed environments (e.g., Federated
databases in the early 1990s, and nowadays Grid computing, Cloud computing, and networks with smartphones),
some transactions have become distributed. A distributed transaction means that the transaction spans processes, and
may span computers and geographical sites. This generates a need for effective distributed concurrency control
mechanisms. Achieving the Serializability property of a distributed system's schedule (see Distributed serializability
and Global serializability (Modular serializability)) effectively poses special challenges typically not met by most of
the regular serializability mechanisms, originally designed to operate locally. This is especially due to the need for costly distribution of concurrency control information amid communication and computer latency. The only known
general effective technique for distribution is Commitment ordering, which was disclosed publicly in 1991 (after
being patented). Commitment ordering (Commit ordering, CO; Raz 1992) means that transactions' chronological
order of commit events is kept compatible with their respective precedence order. CO does not require the
distribution of concurrency control information and provides a general effective solution (reliable, high-performance,
and scalable) for both distributed and global serializability, also in a heterogeneous environment with database
systems (or other transactional objects) with different (any) concurrency control mechanisms.[1] CO is indifferent to
which mechanism is utilized, since it does not interfere with any transaction operation scheduling (which most
mechanisms control), and only determines the order of commit events. Thus, CO enables the efficient distribution of
all other mechanisms, and also the distribution of a mix of different (any) local mechanisms, for achieving
distributed and global serializability. The existence of such a solution has been considered "unlikely" until 1991, and
by many experts also later, due to misunderstanding of the CO solution (see Quotations in Global serializability). An
important side-benefit of CO is automatic distributed deadlock resolution. Contrary to CO, virtually all other
techniques (when not combined with CO) are prone to distributed deadlocks (also called global deadlocks) which
need special handling. CO is also the name of the resulting schedule property: A schedule has the CO property if the
chronological order of its transactions' commit events is compatible with the respective transactions' precedence
(partial) order.
SS2PL mentioned above is a variant (special case) of CO and thus also effective to achieve distributed and global
serializability. It also provides automatic distributed deadlock resolution (a fact overlooked in the research literature
even after CO's publication), as well as Strictness and thus Recoverability. Possessing these desired properties
together with known efficient locking based implementations explains SS2PL's popularity. SS2PL has been utilized
to efficiently achieve Distributed and Global serializability since the 1980s, and has become the de facto standard for
it. However, SS2PL is blocking and constraining (pessimistic), and with the proliferation of distribution and
utilization of systems different from traditional database systems (e.g., as in Cloud computing), less constraining
types of CO (e.g., Optimistic CO) may be needed for better performance.
Comments:
1. The Distributed conflict serializability property in its general form is difficult to achieve efficiently, but it is
achieved efficiently via its special case Distributed CO: Each local component (e.g., a local DBMS) needs both to
provide some form of CO, and enforce a special vote ordering strategy for the Two-phase commit protocol (2PC:
utilized to commit distributed transactions). Differently from the general Distributed CO, Distributed SS2PL
exists automatically when all local components are SS2PL based (in each component CO exists, implied, and the
vote ordering strategy is now met automatically). This fact has been known and utilized since the 1980s (i.e., that
SS2PL exists globally, without knowing about CO) for efficient Distributed SS2PL, which implies Distributed
serializability and strictness (e.g., see Raz 1992, page 293; it is also implied in Bernstein et al. 1987, page 78).
Less constrained Distributed serializability and strictness can be efficiently achieved by Distributed Strict CO
(SCO), or by a mix of SS2PL based and SCO based local components.
2. About the references and Commitment ordering: (Bernstein et al. 1987) was published before the discovery of
CO in 1990. The CO schedule property is called Dynamic atomicity in (Lynch et al. 1993, page 201). CO is
described in (Weikum and Vossen 2001, pages 102, 700), but the description is partial and misses CO's essence.
(Raz 1992) was the first refereed and accepted for publication article about CO algorithms (however, publications
about an equivalent Dynamic atomicity property can be traced to 1988). Other CO articles followed. (Bernstein
and Newcomer 2009)[1] note CO as one of the four major concurrency control methods, and CO's ability to
provide interoperability among other methods.
Distributed recoverability
Unlike Serializability, Distributed recoverability and Distributed strictness can be achieved efficiently in a
straightforward way, similarly to the way Distributed CO is achieved: In each database system they have to be
applied locally, and employ a vote ordering strategy for the Two-phase commit protocol (2PC; Raz 1992, page 307).
As has been mentioned above, Distributed SS2PL, including Distributed strictness (recoverability) and Distributed
commitment ordering (serializability), automatically employs the needed vote ordering strategy, and is achieved
(globally) when employed locally in each (local) database system (as has been known and utilized for many years; as
a matter of fact locality is defined by the boundary of a 2PC participant (Raz 1992) ).
Other major subjects of attention
The design of concurrency control mechanisms is often influenced by the following subjects:
Recovery
All systems are prone to failures, and handling recovery from failure is a must. The properties of the generated
schedules, which are dictated by the concurrency control mechanism, may have an impact on the effectiveness and
efficiency of recovery. For example, the Strictness property (mentioned in the section Recoverability above) is often
desirable for an efficient recovery.
Replication
For high availability, database objects are often replicated. Updates of replicas of the same database object need to be
kept synchronized. This may affect the way concurrency control is done (e.g., Gray et al. 1996[2]).
References
Philip A. Bernstein, Vassos Hadzilacos, Nathan Goodman (1987): Concurrency Control and Recovery in
Database Systems [3] (free PDF download), Addison Wesley Publishing Company, 1987, ISBN 0-201-10715-5
Gerhard Weikum, Gottfried Vossen (2001): Transactional Information Systems [4], Elsevier, ISBN
1-55860-508-8
Nancy Lynch, Michael Merritt, William Weihl, Alan Fekete (1993): Atomic Transactions in Concurrent and
Distributed Systems [5], Morgan Kauffman (Elsevier), August 1993, ISBN 978-1-55860-104-8, ISBN
1-55860-104-X
Yoav Raz (1992): "The Principle of Commitment Ordering, or Guaranteeing Serializability in a Heterogeneous
Environment of Multiple Autonomous Resource Managers Using Atomic Commitment." [6] (PDF [7]),
Proceedings of the Eighteenth International Conference on Very Large Data Bases (VLDB), pp. 292-312,
Vancouver, Canada, August 1992. (also DEC-TR 841, Digital Equipment Corporation, November 1990)
Footnotes
[1] Philip A. Bernstein, Eric Newcomer (2009): Principles of Transaction Processing, 2nd Edition (http://www.elsevierdirect.com/product.jsp?isbn=9781558606234), Morgan Kaufmann (Elsevier), June 2009, ISBN 978-1-55860-623-4 (page 145)
[2] Gray, J.; Helland, P.; O'Neil, P.; Shasha, D. (1996). "The dangers of replication and a solution" (ftp://ftp.research.microsoft.com/pub/tr/tr-96-17.pdf). Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data. pp. 173–182. doi:10.1145/233269.233330.
[3] http://research.microsoft.com/en-us/people/philbe/ccontrol.aspx
[4] http://www.elsevier.com/wps/find/bookdescription.cws_home/677937/description#description
[5] http://www.elsevier.com/wps/find/bookdescription.cws_home/680521/description#description
[6] http://www.informatik.uni-trier.de/~ley/db/conf/vldb/Raz92.html
[7] http://www.vldb.org/conf/1992/P292.PDF
License
Creative Commons Attribution-Share Alike 3.0 Unported
//creativecommons.org/licenses/by-sa/3.0/