This document provides an overview of relational database theory and normalization for developers. It defines key terms like relational databases, logical and physical data models, database schemas, and data normalization. It explains the first through sixth and Boyce-Codd normal forms and how to normalize data by removing redundant data through a multi-step process. The goal of normalization is to organize data to minimize duplication and ensure integrity. An example demonstrates normalizing a dog-owner database from first to fourth normal form.
Relational Theory for Budding Einsteins -- LonestarPHP 2016
2. Relational Theory for Budding Einsteins
Dave Stokes
David.Stokes@Oracle.com
@Stoker
Opensourcedba.wordpress.com
elephantdolphin.blogger.com
2
3. Session Description
Relational Databases have been around for decades but very few PHP developers have any
formal training in SQL, set theory, or databases. This session is a crash course in efficiently using
a relational database, thinking in sets (better known as avoiding the N+1 problem), how simple Venn
Diagrams can help you understand JOINing tables, how to normalize your data, smart query
design, and more. If you are a developer who wonders why your queries run poorly, wants a better
understanding of query optimization, or just wants to learn some of those 'dark arts', this tutorial is for you.
3
4. Syllabus, more or less
● efficiently using a relational database,
● thinking in sets (better known as avoiding the N+1 problem),
● how simple Venn Diagrams can help you understand JOINing tables,
● how to normalize your data,
● smart query design,
● and more
4
6. Relational Database Model
The relational model of data permits the database
designer to create a consistent, logical
representation of information. Consistency is
achieved by including declared constraints in the
database design, which is usually referred to as the
logical schema. The theory includes a process of
database normalization whereby a design with
certain desirable properties can be selected from a
set of logically equivalent alternatives. The access
plans and other implementation and operation
details are handled by the DBMS engine, and are
not reflected in the logical model.
https://en.wikipedia.org/wiki/Relational_model
6
7. Logical Data Models
Logical data models represent the abstract structure
of a domain of information. They are often
diagrammatic in nature and are most typically used
in business processes that seek to capture things of
importance to an organization and how they relate
to one another. Once validated and approved, the
logical data model can become the basis of a
physical data model and form the design of a
database.
https://en.wikipedia.org/wiki/Logical_data_model
7
8. Physical Data Model
A physical data model (or database design) is a
representation of a data design which takes into
account the facilities and constraints of a given
database management system. A complete physical
data model will include all the database artifacts
required to create relationships between tables or to
achieve performance goals, such as indexes,
constraint definitions, linking tables, partitioned
tables or clusters. Analysts can usually use a
physical data model to calculate storage estimates;
it may include specific storage allocation details for
a given database system.
https://en.wikipedia.org/wiki/Physical_data_model
8
9. Database Schema
A database schema (/ˈski.mə/ SKEE-mə) of a
database system is its structure described in a
formal language supported by the database
management system (DBMS). The term
"schema" refers to the organization of data as a
blueprint of how the database is constructed
(divided into database tables in the case of
relational databases).
https://en.wikipedia.org/wiki/Database_schema
9
11. Database Normalization
Database normalization (or normalisation) is the
process of organizing the columns (attributes) and
tables (relations) of a relational database to
minimize data redundancy.
Normalization involves decomposing a table into
less redundant (and smaller) tables without losing
information; defining foreign keys in the old table
referencing the primary keys of the new ones. The
objective is to isolate data so that additions,
deletions, and modifications of an attribute can be
made in just one table and then propagated through
the rest of the database using the defined foreign
keys.
https://en.wikipedia.org/wiki/Database_normalization
11
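The decomposition described above can be seen end to end in a few lines. This is a minimal sketch using Python's sqlite3 (the slides use MySQL; the `customer`/`orders` schema here is invented for illustration): customer facts live in one table, orders reference them through a foreign key, and a change to a customer is a single-row update that propagates everywhere.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Hypothetical normalized schema: the customer's city is stored once,
# and each order points at the customer through a foreign key.
cur.executescript("""
CREATE TABLE customer (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    city TEXT NOT NULL
);
CREATE TABLE orders (
    id          INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(id),
    item        TEXT NOT NULL
);
INSERT INTO customer VALUES (1, 'Alice', 'Austin');
INSERT INTO orders VALUES (1, 1, 'widget'), (2, 1, 'gadget');
UPDATE customer SET city = 'Dallas' WHERE id = 1;
""")

# Every order row sees the updated city through the foreign key.
rows = cur.execute("""
    SELECT orders.item, customer.city
    FROM orders JOIN customer ON orders.customer_id = customer.id
    ORDER BY orders.id
""").fetchall()
print(rows)  # [('widget', 'Dallas'), ('gadget', 'Dallas')]
```

Had the city been repeated on every order row, the same change would have required touching every order for that customer.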
12. 1NF
First normal form (1NF) is a property of a relation in a relational database. A relation is in first normal form if and
only if the domain of each attribute contains only atomic (indivisible) values, and the value of each attribute
contains only a single value from that domain. The first definition of the term, in a 1971 conference paper by Edgar
Codd, defined a relation to be in first normal form when none of its domains have any sets as elements.
First normal form is an essential property of a relation in a relational database. Database normalization is the
process of representing a database in terms of relations in standard normal forms, where first normal is a minimal
requirement.
First normal form enforces these criteria:
● Eliminate repeating groups in individual tables.
● Create a separate table for each set of related data.
● Identify each set of related data with a primary key
https://en.wikipedia.org/wiki/First_normal_form
12
13. 2NF
Second normal form (2NF) is a normal form used in database normalization. A table that is in first normal form
(1NF) must meet additional criteria if it is to qualify for second normal form. Specifically: a table is in 2NF if it is in
1NF and no non-prime attribute is dependent on any proper subset of any candidate key of the table. A
non-prime attribute of a table is an attribute that is not a part of any candidate key of the table.
Put simply, a table is in 2NF if it is in 1NF and every non-prime attribute of the table is dependent on the whole of
every candidate key.
https://en.wikipedia.org/wiki/Second_normal_form
13
14. 3NF
Third normal form is a normal form that is used in normalizing a database design to reduce the duplication of
data and ensure referential integrity by ensuring that (1) the entity is in second normal form, and (2) all the
attributes in a table are determined only by the candidate keys of that table and not by any non-prime attributes.
3NF was designed to improve database processing while minimizing storage costs. 3NF data modeling was ideal
for online transaction processing (OLTP) applications with heavy order entry type of needs.
https://en.wikipedia.org/wiki/Third_normal_form
14
15. 4NF
Fourth normal form (4NF) is a normal form used in database normalization.
Introduced by Ronald Fagin in 1977, 4NF is the next level of normalization
after Boyce–Codd normal form (BCNF). Whereas the second, third, and Boyce–Codd
normal forms are concerned with functional dependencies, 4NF is
concerned with a more general type of dependency known as a multivalued
dependency. A table is in 4NF if and only if, for every one of its non-trivial
multivalued dependencies X ->> Y, X is a superkey—that is, X is either a
candidate key or a superset thereof.
https://en.wikipedia.org/wiki/Fourth_normal_form
15
16. 5NF
Fifth normal form (5NF), also known as project-join normal form (PJ/NF), is a level of database normalization
designed to reduce redundancy in relational databases recording multi-valued facts by isolating semantically
related multiple relationships. A table is said to be in the 5NF if and only if every non-trivial join dependency in it is
implied by the candidate keys.
https://en.wikipedia.org/wiki/Fifth_normal_form
16
17. 6NF
Sixth normal form is intended to decompose relation variables to irreducible components. Though this may be
relatively unimportant for non-temporal relation variables, it can be important when dealing with temporal variables
or other interval data. For instance, if a relation comprises a supplier's name, status, and city, we may also want to
add temporal data, such as the time during which these values are, or were, valid (e.g., for historical data) but the
three values may vary independently of each other and at different rates. We may, for instance, wish to trace the
history of changes to Status.
https://en.wikipedia.org/wiki/Sixth_normal_form
17
18. Boyce-Codd Normal Form
Boyce–Codd normal form (or BCNF or 3.5NF) is a normal form used in database normalization. It is a slightly
stronger version of the third normal form (3NF).
If a relational schema is in BCNF then all redundancy based on functional dependency has been removed,
although other types of redundancy may still exist. A relational schema R is in Boyce–Codd normal form if and
only if for every one of its dependencies X → Y, at least one of the following conditions holds:
● X → Y is a trivial functional dependency (Y ⊆ X)
● X is a super key for schema R
https://en.wikipedia.org/wiki/Boyce%E2%80%93Codd_normal_form
18
19. The Good News
You very rarely need to go past third normal form or BCNF.
19
20. Okay, how do we get data into 3NF/BCNF??
Have to start with a look at your data!
20
21. What to do!
1NF Remove Repeating Groups - Make a separate table for each set of related attributes, and give each table a
primary key.
2NF Remove Redundant Data - If an attribute depends on only part of a multi-valued key, remove it to a separate
table.
3NF Remove Columns Not Dependent on a Key - If attributes do not contribute to a description of the key, remove
them to a separate table.
BCNF - If there are non-trivial dependencies between candidate key attributes, separate them out into distinct
tables.
21
22. Example - 1NF
Dogs & Owners
Owner  | Age (years) | Type            | Name
Mark   | 16          | White Lab       | Roman
Dave   | 8           | Beagle          | Jack
Carrie | .5          | Black Mouth Cur | Boo
Dave   | 6           | Beagle          | Lucy
22
23. Example 1 - 2NF
Owners and Dogs
Id | Name
1  | Mark
2  | Carrie
3  | Dave

Id | Name  | Age | Owner | Type
1  | Roman | 16  | 1     | White Lab
2  | Jack  | 8   | 3     | Beagle
3  | Boo   | .5  | 2     | Black Mouth Cur
4  | Lucy  | 6   | 3     | Beagle
Each of the owner records and each of the dog records has one entry.
23
24. Example 1 - 3NF
Owners and Dogs + Type
Id | Name
1  | Mark
2  | Carrie
3  | Dave

Id | Name  | Age | Owner | Type
1  | Roman | 16  | 1     | 1
2  | Jack  | 8   | 3     | 2
3  | Boo   | .5  | 2     | 3
4  | Lucy  | 6   | 3     | 2

Id | Type
1  | White Lab
2  | Beagle
3  | Black Mouth Cur

More redundant info removed.
24
25. Example 1 - 4NF - More than one owner per dog
Owners and Dogs + Type
Id | Name
1  | Mark
2  | Carrie
3  | Dave

Id | Name  | Age | Type
1  | Roman | 16  | 1
2  | Jack  | 8   | 2
3  | Boo   | .5  | 3
4  | Lucy  | 6   | 2

Id | Type
1  | White Lab
2  | Beagle
3  | Black Mouth Cur

Owner_ID | Pet_ID
1        | 1
3        | 2
2        | 3
3        | 4
2        | 3
25
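The 4NF layout above translates directly into a schema: three entity tables plus a junction table for the many-to-many dog/owner relationship. This is a runnable sketch in Python's sqlite3 (the slides use MySQL; the column types and table names here are assumptions, since the slides show no DDL):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE owner (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE type  (id INTEGER PRIMARY KEY, type TEXT);
CREATE TABLE dog   (id INTEGER PRIMARY KEY, name TEXT, age REAL,
                    type INTEGER REFERENCES type(id));
-- Junction table: one row per dog/owner relationship.
CREATE TABLE dog_owner (owner_id INTEGER REFERENCES owner(id),
                        pet_id   INTEGER REFERENCES dog(id));
INSERT INTO owner VALUES (1,'Mark'), (2,'Carrie'), (3,'Dave');
INSERT INTO type  VALUES (1,'White Lab'), (2,'Beagle'), (3,'Black Mouth Cur');
INSERT INTO dog   VALUES (1,'Roman',16,1), (2,'Jack',8,2),
                         (3,'Boo',0.5,3), (4,'Lucy',6,2);
INSERT INTO dog_owner VALUES (1,1), (3,2), (2,3), (3,4);
""")

# Which dogs does Dave own? Tie the tables together on their common columns.
daves = con.execute("""
    SELECT dog.name FROM dog
    JOIN dog_owner ON dog.id = dog_owner.pet_id
    JOIN owner     ON owner.id = dog_owner.owner_id
    WHERE owner.name = 'Dave'
    ORDER BY dog.name
""").fetchall()
print(daves)  # [('Jack',), ('Lucy',)]
```

Each table stands on its own, and adding a second owner to any dog is just one more row in `dog_owner`.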
26. Exercise and Break
● Break into a small group
● Pick one of the following two by recording
○ Persons in group and their favorite pizza (Crust, toppings, etc)
○ Persons in group and details on their car/truck
● Normalize Data as much as possible
Back in twenty minutes
26
28. Two Examples -- Give sales department 20% raise
foreach (sale_emp in sales_employees)
    $pay = $pay * 1.20;

UPDATE employees
SET pay_rate = pay_rate * 1.20
WHERE department = 'sales';
28
29. Two Examples -- Give sales department 20% raise
foreach (sale_emp in sales_employees)
    $pay = $pay * 1.20;

START TRANSACTION;
UPDATE employees
SET pay_rate = pay_rate * 1.20
WHERE department = 'sales';
COMMIT;
29
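The transactional version can be tried end to end. A sketch in Python's sqlite3 (the slides use MySQL; the schema and pay figures are invented), where the connection's context manager stands in for START TRANSACTION / COMMIT — it commits on success and rolls back if an exception escapes, so the raise is all-or-nothing:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE employees (name TEXT, department TEXT, pay_rate REAL)")
con.executemany("INSERT INTO employees VALUES (?, ?, ?)",
                [("Ann", "sales", 100.0), ("Bob", "sales", 200.0),
                 ("Cal", "ops", 150.0)])
con.commit()

# Either every sales employee gets the 20% raise or none of them do.
with con:
    con.execute("UPDATE employees SET pay_rate = pay_rate * 1.20 "
                "WHERE department = 'sales'")

rates = [round(r[0], 2) for r in
         con.execute("SELECT pay_rate FROM employees ORDER BY name")]
print(rates)  # [120.0, 240.0, 150.0]
```

A crash mid-loop in the application-side version leaves some employees raised and some not; the transaction cannot end up half-applied.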
30. Two Examples -- Give sales department 20% raise
foreach (sale_emp in sales_employees)
    $pay = $pay * 1.20;

START TRANSACTION;
UPDATE employees
SET pay_rate = pay_rate * 1.20
WHERE department = 'sales';
COMMIT;

Questions:
Which one is crash proof?
Which takes more clock time?
Which one is boss proof: 'Hey, before you start that process what
is the total effect on payroll?' Or 'Could you make it 18.77%?'
30
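The "boss proof" point is that SQL can answer the what-if question with an aggregate before any row is changed. A sketch in Python's sqlite3 (data invented for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE employees (name TEXT, department TEXT, pay_rate REAL)")
con.executemany("INSERT INTO employees VALUES (?, ?, ?)",
                [("Ann", "sales", 100.0), ("Bob", "sales", 200.0)])

# 'What is the total effect on payroll?' -- answered without touching a row.
before, after = con.execute(
    "SELECT SUM(pay_rate), SUM(pay_rate * 1.20) "
    "FROM employees WHERE department = 'sales'").fetchone()
print(round(after - before, 2))  # 60.0
```

Changing the percentage is a one-character edit to the query, not a rewrite of application code.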
31. Databases are good at heavy lifting
Make big or little changes to data as one transaction
Statistical functions usually excellent
Sorting / grouping (usually database servers have more memory, faster disk
than application server)
Learn to use Pentaho or BIRT to write ad hoc reports (can be set to run on a
schedule) **
31
33. Data Set
A data set (or dataset) is a collection of data.
Most commonly a data set corresponds to the
contents of a single database table, or a single
statistical data matrix, where every column of the
table represents a particular variable, and each row
corresponds to a given member of the data set in
question. The data set lists values for each of the
variables, such as height and weight of an object,
for each member of the data set. Each value is
known as a datum. The data set may comprise data
for one or more members, corresponding to the
number of rows.
https://en.wikipedia.org/wiki/Data_set
33
35. The Classic N+1 Problem
SELECT parent_record
For each child
    SELECT child_record
…
35
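The pseudocode above can be made concrete by counting round trips. A sketch in Python's sqlite3 (the parent/child schema is invented): the loop issues N+1 queries for N parents, while thinking in sets gets the same answer in one.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE parent (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE child  (id INTEGER PRIMARY KEY, parent_id INTEGER, name TEXT);
INSERT INTO parent VALUES (1, 'a'), (2, 'b'), (3, 'c');
INSERT INTO child  VALUES (1, 1, 'a1'), (2, 2, 'b1'), (3, 3, 'c1');
""")

# The N+1 shape: one query for the parents, then one more per parent.
queries = 0
parents = con.execute("SELECT id FROM parent").fetchall()
queries += 1
for (pid,) in parents:
    con.execute("SELECT name FROM child WHERE parent_id = ?", (pid,)).fetchall()
    queries += 1
print(queries)  # 4 round trips for 3 parents

# Thinking in sets: the same answer in a single query.
rows = con.execute("""
    SELECT parent.name, child.name
    FROM parent JOIN child ON child.parent_id = parent.id
""").fetchall()
print(len(rows))  # 3
```

With an in-memory database the difference is invisible; over a network, each of those N extra round trips pays full latency.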
36. Back to Example 1
Remember the 4NF tables of dogs and
owners?
We have information on the dogs, the owners,
types of dogs, and dog/owner relationships.
● Each stands individually
● Can be tied to each other easily
JOINing is the process of combining rows
from two or more tables using common fields
36
37. DOGS with Types
37
ID Type
1 White Lab
2 Beagle
3 Black Mouth
Cur
ID Name Age Type
1 Roman 16 1
2 Jack 8 2
3 Boo .5 3
4 Lucy 6 2
38. DOGS with Types
38
ID Type
1 White Lab
2 Beagle
3 Black Mouth
Cur
ID Name Age Type
1 Roman 16 1
2 Jack 8 2
3 Boo .5 3
4 Lucy 6 2
When matching dogs to their type, we need to use
common columns between the tables.
46. What Happens When You Send a Query to Server
1. Check connection permissions
2. Is SYNTAX of query correct
3. Does user have permissions:
a. Database
b. Table
c. Column
4. Develop Query Plan
5. Return Data
46
SELECT dog.name, type.type
FROM dog
JOIN type
ON dog.type = type.id
47. Query Plan
A query plan (or query execution plan) is an ordered set of steps used to access data in a SQL relational
database management system. This is a specific case of the relational model concept of access plans.
Since SQL is declarative, there are typically a large number of alternative ways to execute a given query, with
widely varying performance. When a query is submitted to the database, the query optimizer evaluates some of
the different, correct possible plans for executing the query and returns what it considers the best option. Because
query optimizers are imperfect, database users and administrators sometimes need to manually examine and tune
the plans produced by the optimizer to get better performance.
https://en.wikipedia.org/wiki/Query_plan
47
48. Rough Analog
Designated Driver: You have to pick up three friends for an evening out and
get them home.
What info do you need?
48
49. Query Optimizer is like a GPS
It has statistical information on how to get some place but may not know
about traffic jams (locks) or accidents (deletions, etc.). It also knows how long it
should take to get somewhere, but that may change with load or over time.
And the optimizer occasionally will make bad choices and send you to a dead
end.
49
50. Remember Example JOIN?
Optimizer realizes it has to get four columns
from two tables. That gives 4! (24) possible
combinations to consider.
50
SELECT dog.name, type.type
FROM dog
JOIN type
ON dog.type = type.id
51. Cost Based Query Optimization
Cost-based query optimizers evaluate the resource footprint of various query plans and use this as the basis for
plan selection. These assign an estimated "cost" to each possible query plan, and choose the plan with the
smallest cost. Costs are used to estimate the runtime cost of evaluating the query, in terms of the number of I/O
operations required, CPU path length, amount of disk buffer space, disk storage service time, and interconnect
usage between units of parallelism, and other factors determined from the data dictionary. The set of query plans
examined is formed by examining the possible access paths (e.g., primary index access, secondary index access,
full file scan) and various relational table join techniques (e.g., merge join, hash join, product join). The search
space can become quite large depending on the complexity of the SQL query. There are two types of optimization.
These consist of logical optimization—which generates a sequence of relational algebra to solve the query—and
physical optimization—which is used to determine the means of carrying out each operation.
https://en.wikipedia.org/wiki/Query_optimization
51
52. What is COST?
Cost until recently was the cost of doing disk
INPUT-OUTPUT. Note data on disk is 100,000
times SLOWER to read than from memory.
100,000 seconds is 27.77778 hours
100,000 milliseconds is 1.6666667 minutes
52
This is starting to change as vendors try to
accommodate newer technologies, storage
units of different speeds, etcetera -- and you
will be able in the future to assign costs to a
device.
53. Full Table Scan
Full table scan (also known as sequential scan) is a scan made on a database where each row of the table
under scan is read in a sequential (serial) order and the columns encountered are checked for the validity of a
condition. Full table scans are usually the slowest method of scanning a table due to the heavy amount of I/O
reads required from the disk which consists of multiple seeks as well as costly disk to memory transfers.
Sequential scan takes place usually when the column or group of columns of a table (the table may be on disk or
may be an intermediate table created by the join of two or more tables) needed for the scan do not contain an
index which can be used for the purpose.
https://en.wikipedia.org/wiki/Full_table_scan
53
54. Full Table Scan Analog
Pretend you are given a dictionary to look up the plural of the word ‘moose’.
But the dictionary is not alphabetized (random order) and you need to also
beware of homonyms, synonyms, and homographs.
54
55. Index
An index is an indirect shortcut derived from and pointing into a greater volume of values, data, information or
knowledge.
https://en.wikipedia.org/wiki/Index
In other words, we get the exact record(s) needed without having to search all the others
55
56. Primary Key
The primary key for a table represents the column or set of columns that you use in your most vital queries. It
has an associated index, for fast query performance. Query performance benefits from the NOT NULL
optimization, because it cannot include any NULL values.
http://dev.mysql.com/doc/refman/5.7/en/optimizing-primary-keys.html
56
57. NULL
Null is a special marker used in Structured Query Language (SQL) to indicate that a data value does not exist in
the database. Introduced by the creator of the relational database model, E. F. Codd, SQL Null serves to fulfill the
requirement that all true relational database management systems (RDBMS) support a representation of "missing
information and inapplicable information".
https://en.wikipedia.org/wiki/Null_(SQL)
57
59. Multi-Columns Indexes
MySQL can create composite indexes (that is, indexes on multiple columns). An index may consist of up to 16
columns. For certain data types, you can index a prefix of the column (see Section 8.3.4, “Column Indexes”).
MySQL can use multiple-column indexes for queries that test all the columns in the index, or queries that test
just the first column, the first two columns, the first three columns, and so on. If you specify the columns in the
right order in the index definition, a single composite index can speed up several kinds of queries on the same
table.
http://dev.mysql.com/doc/refman/5.7/en/multiple-column-indexes.html
A Y-M-D index works for Y-M-D, Y-M, or Y but not D or M-D.
Sometimes you can get information from the index without diving into the data. A
State-Zip-CityName index can get you all the cities within a State/Zip.
59
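The leftmost-prefix rule above can be observed directly. The slides quote the MySQL manual; this sketch uses SQLite's EXPLAIN QUERY PLAN instead (same idea, different server), with an invented Y-M-D style composite index: the index serves queries on Y and on Y+M, but a query on D alone falls back to a scan.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE log (y INTEGER, m INTEGER, d INTEGER, note TEXT);
CREATE INDEX ymd ON log (y, m, d);
""")

def plan(where):
    # EXPLAIN QUERY PLAN rows end with a human-readable detail string.
    row = con.execute(
        "EXPLAIN QUERY PLAN SELECT note FROM log WHERE " + where).fetchone()
    return row[3]

print(plan("y = 2016"))            # SEARCH ... USING INDEX ymd (leftmost prefix)
print(plan("y = 2016 AND m = 4"))  # SEARCH ... USING INDEX ymd
print(plan("d = 15"))              # SCAN of the table; the index cannot help
```

The exact wording of the plan strings varies by SQLite version, but the index name appears only when a usable prefix is supplied.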
60. Optimize Data Structure
● For unique IDs or other values that can be represented as either strings or numbers, prefer
numeric columns to string columns. Since large numeric values can be stored in fewer bytes
than the corresponding strings, it is faster and takes less memory to transfer and compare
them.
● When comparing values from different columns, declare those columns with the same
character set and collation wherever possible, to avoid string conversions while running the
query.
● If a table contains string columns such as name and address, but many queries do not retrieve
those columns, consider splitting the string columns into a separate table and using join
queries with a foreign key when necessary.
60
61. DDL & DML
● Data Definition Language
○ Describes the structure of the data
● Data Manipulation Language
○ A family of syntax elements used for selecting, inserting, deleting and updating data in a database.
61
62. DDL
CREATE TABLE lonestar ( id INTEGER NOT NULL,
name CHAR(10),
PRIMARY KEY (id));
Be very careful for Reserved Words when naming
columns and check on case sensitivity too!
62
63. DML
SELECT id, name FROM lonestar;
SELECT lonestar.id, lonestar.name FROM lonestar;
SELECT id AS 'Identification Nbr', name AS 'Full Name' FROM lonestar;
SELECT name FROM lonestar WHERE name = 'Smith';
SELECT name FROM lonestar WHERE id IN (10,20,30,40,50) OR name < 'Jones';
UPDATE lonestar SET name = 'None' WHERE id = 10 AND name = 'Jones';
DELETE FROM lonestar WHERE id > 99;
63
64. EXPLAIN
The EXPLAIN statement can be used to obtain information about how MySQL executes a statement:
Permitted explainable statements for EXPLAIN are SELECT, DELETE, INSERT, REPLACE, and UPDATE.
When EXPLAIN is used with an explainable statement, MySQL displays information from the optimizer about
the statement execution plan. That is, MySQL explains how it would process the statement, including information
about how tables are joined and in which order.
When EXPLAIN is used with FOR CONNECTION connection_id rather than an explainable statement, it displays
the execution plan for the statement executing in the named connection.
64
65. Explain -- Tabular Classic View
65
EXPLAIN select name from city where CountryCode = 'USA'
66. Explain -- Tabular Classic View
66
EXPLAIN select name from city where CountryCode = 'USA'
The query is using a key (index) named
CountryCode.
The optimizer estimates it will need to read 274
rows to satisfy the query.
67. Explain -- Tabular Classic View
67
EXPLAIN select name from city where CountryCode = 'USA'
‘Ref’ - reference -- join type
68. EXPLAIN JOIN Types from best to worst
68
System The table has only one row (= system table). This is a special case of the const join type.
Const The table has at most one matching row, which is read at the start of the query. Because there is only one
row, values from the column in this row can be regarded as constants by the rest of the optimizer. const tables
are very fast because they are read only once.
Eq_ref One row is read from this table for each combination of rows from the previous tables. It is used when
all parts of an index are used by the join and the index is a PRIMARY KEY or UNIQUE NOT NULL index.
Ref All rows with matching index values are read from this table for each combination of rows from the
previous tables. ref is used if the join uses only a leftmost prefix of the key or if the key is not a PRIMARY KEY or
UNIQUE index.
Fulltext The join is performed using a FULLTEXT index.
Ref_or_null This join type is like ref, but with the addition that MySQL does an extra search for rows that
contain NULL values. This join type optimization is used most often in resolving subqueries.
69. EXPLAIN JOIN Types from best to worst
69
Index_merge This join type indicates that the Index Merge optimization is used. In this case, the key column in
the output row contains a list of indexes used, and key_len contains a list of the longest key parts for the
indexes used. WHERE X = 20 or X = 30 AND Z = 5
Unique_subquery This type replaces eq_ref for some IN subqueries of the following form:
● value IN (SELECT primary_key FROM single_table WHERE some_expr)
unique_subquery is just an index lookup function that replaces the subquery completely for better efficiency.
Index_subquery This join type is similar to unique_subquery. It replaces IN subqueries, but it works for
nonunique indexes in subqueries
70. EXPLAIN JOIN Types from best to worst
70
Range Only rows that are in a given range are retrieved, using an index to select the rows. The key column in
the output row indicates which index is used. The key_len contains the longest key part that was used. The ref
column is NULL for this type.
Index The index join type is the same as ALL, except that the index tree is scanned. This occurs two ways:
○ If the index is a covering index for the queries and can be used to satisfy all data required
from the table, only the index tree is scanned. In this case, the Extra column says Using
index. An index-only scan usually is faster than ALL because the size of the index usually is
smaller than the table data.
○ A full table scan is performed using reads from the index to look up data rows in index
order. Using index does not appear in the Extra column.
MySQL can use this join type when the query uses only columns that are part of a single index.
ALL A full table scan is done for each combination of rows from the previous tables. MAY NOT BE BAD!**
71. Explain -- Tabular Classic View 2
71
EXPLAIN select City.Name, Country.name FROM City join
Country on (City.CountryCode = Country.Code) where
CountryCode = 'USA' AND City.Name LIKE 'New%'
The Extra column of EXPLAIN output contains additional information about how MySQL
resolves the query. The following list explains the values that can appear in this column.
If you want to make your queries as fast as possible, look out for Extra values of Using
filesort and Using temporary.
72. Explain -- Tabular Classic View
72
EXPLAIN select name from city where CountryCode = 'USA'
Mainly look at the rows column
75. Start to Wrap Up
75
Books:
SQL Antipatterns -- Bill Karwin
Effective MySQL Optimizing SQL Statements -- Ronald Bradford
SQL and Relational Theory -- CJ Date
76. Wrap Up
Work on most frequently run queries
Re-check as data grows
De-normalize at your own risk -- will eventually bite you in rear
Let database do the heavy lifting
Think in sets
76