Relational Theory for
Budding Einsteins
How To Write Database
Queries that Perform
Relational Theory for Budding Einsteins
Dave Stokes
David.Stokes@Oracle.com
@Stoker
Opensourcedba.wordpress.com
elephantdolphin.blogger.com
2
Session Description
Relational Databases have been around for decades, but very few PHP developers have any
formal training in SQL, set theory, or databases. This session is a crash course in efficiently using
a relational database, thinking in sets (better known as avoiding the N+1 problem), how simple Venn
Diagrams can help you understand JOINing tables, how to normalize your data, smart query
design, and more. If you are a developer who wonders why your queries run poorly, wants a better
understanding of query optimization, or just wants to learn some of those 'dark arts', this tutorial is for you.
3
Syllabus, more or less
● efficiently using a relational database,
● thinking in sets (better known as avoiding the N+1 problem),
● how simple Venn Diagrams can help you understand JOINing tables,
● how to normalize your data,
● smart query design,
● and more
4
So, some definitions
To give us common vocabulary
5
Relational Database Model
The relational model of data permits the database
designer to create a consistent, logical
representation of information. Consistency is
achieved by including declared constraints in the
database design, which is usually referred to as the
logical schema. The theory includes a process of
database normalization whereby a design with
certain desirable properties can be selected from a
set of logically equivalent alternatives. The access
plans and other implementation and operation
details are handled by the DBMS engine, and are
not reflected in the logical model.
https://en.wikipedia.org/wiki/Relational_model
6
Logical Data Models
Logical data models represent the abstract structure
of a domain of information. They are often
diagrammatic in nature and are most typically used
in business processes that seek to capture things of
importance to an organization and how they relate
to one another. Once validated and approved, the
logical data model can become the basis of a
physical data model and form the design of a
database.
https://en.wikipedia.org/wiki/Logical_data_model
7
Physical Data Model
A physical data model (or database design) is a
representation of a data design which takes into
account the facilities and constraints of a given
database management system. A complete physical
data model will include all the database artifacts
required to create relationships between tables or to
achieve performance goals, such as indexes,
constraint definitions, linking tables, partitioned
tables or clusters. Analysts can usually use a
physical data model to calculate storage estimates;
it may include specific storage allocation details for
a given database system.
https://en.wikipedia.org/wiki/Physical_data_model
8
Database Schema
A database schema (/ˈski.mə/ SKEE-mə) of a
database system is its structure described in a
formal language supported by the database
management system (DBMS). The term
"schema" refers to the organization of data as a
blueprint of how the database is constructed
(divided into database tables in the case of
relational databases).
https://en.wikipedia.org/wiki/Database_schema
9
Data Normalization
10
Database Normalization
Database normalization (or normalisation) is the
process of organizing the columns (attributes) and
tables (relations) of a relational database to
minimize data redundancy.
Normalization involves decomposing a table into
less redundant (and smaller) tables without losing
information; defining foreign keys in the old table
referencing the primary keys of the new ones. The
objective is to isolate data so that additions,
deletions, and modifications of an attribute can be
made in just one table and then propagated through
the rest of the database using the defined foreign
keys.
https://en.wikipedia.org/wiki/Database_normalization
11
1NF
First normal form (1NF) is a property of a relation in a relational database. A relation is in first normal form if and
only if the domain of each attribute contains only atomic (indivisible) values, and the value of each attribute
contains only a single value from that domain. The first definition of the term, in a 1971 conference paper by Edgar
Codd, defined a relation to be in first normal form when none of its domains have any sets as elements.
First normal form is an essential property of a relation in a relational database. Database normalization is the
process of representing a database in terms of relations in standard normal forms, where first normal is a minimal
requirement.
First normal form enforces these criteria:
● Eliminate repeating groups in individual tables.
● Create a separate table for each set of related data.
● Identify each set of related data with a primary key
https://en.wikipedia.org/wiki/First_normal_form
12
2NF
Second normal form (2NF) is a normal form used in database normalization. A table that is in first normal form
(1NF) must meet additional criteria if it is to qualify for second normal form. Specifically: a table is in 2NF if it is in
1NF and no non-prime attribute is dependent on any proper subset of any candidate key of the table. A non-
prime attribute of a table is an attribute that is not a part of any candidate key of the table.
Put simply, a table is in 2NF if it is in 1NF and every non-prime attribute of the table is dependent on the whole of
every candidate key.
https://en.wikipedia.org/wiki/Second_normal_form
13
3NF
Third normal form is a normal form that is used in normalizing a database design to reduce the duplication of
data and ensure referential integrity by ensuring that (1) the entity is in second normal form, and (2) all the
attributes in a table are determined only by the candidate keys of that table and not by any non-prime attributes.
3NF was designed to improve database processing while minimizing storage costs. 3NF data modeling was ideal
for online transaction processing (OLTP) applications with heavy order entry type of needs.
https://en.wikipedia.org/wiki/Third_normal_form
14
4NF
Fourth normal form (4NF) is a normal form used in database normalization.
Introduced by Ronald Fagin in 1977, 4NF is the next level of normalization
after Boyce–Codd normal form (BCNF). Whereas the second, third, and Boyce–
Codd normal forms are concerned with functional dependencies, 4NF is
concerned with a more general type of dependency known as a multivalued
dependency. A table is in 4NF if and only if, for every one of its non-trivial
multivalued dependencies X ->> Y, X is a superkey—that is, X is either a
candidate key or a superset thereof.
https://en.wikipedia.org/wiki/Fourth_normal_form
15
5NF
Fifth normal form (5NF), also known as project-join normal form (PJ/NF), is a level of database normalization
designed to reduce redundancy in relational databases recording multi-valued facts by isolating semantically
related multiple relationships. A table is said to be in the 5NF if and only if every non-trivial join dependency in it is
implied by the candidate keys.
https://en.wikipedia.org/wiki/Fifth_normal_form
16
6NF
Sixth normal form is intended to decompose relation variables to irreducible components. Though this may be
relatively unimportant for non-temporal relation variables, it can be important when dealing with temporal variables
or other interval data. For instance, if a relation comprises a supplier's name, status, and city, we may also want to
add temporal data, such as the time during which these values are, or were, valid (e.g., for historical data) but the
three values may vary independently of each other and at different rates. We may, for instance, wish to trace the
history of changes to Status.
https://en.wikipedia.org/wiki/Sixth_normal_form
17
Boyce-Codd Normal Form
Boyce–Codd normal form (or BCNF or 3.5NF) is a normal form used in database normalization. It is a slightly
stronger version of the third normal form (3NF).
If a relational schema is in BCNF then all redundancy based on functional dependency has been removed,
although other types of redundancy may still exist. A relational schema R is in Boyce–Codd normal form if and
only if for every one of its dependencies X → Y, at least one of the following conditions hold:
● X → Y is a trivial functional dependency (Y ⊆ X)
● X is a super key for schema R
https://en.wikipedia.org/wiki/Boyce%E2%80%93Codd_normal_form
18
The Good News
You very rarely need to go past third normal form or BCNF.
19
Okay, how do we get data into 3NF/BCNF??
Have to start with a look at your data!
20
What to do!
1NF Remove Repeating Groups - Make a separate table for each set of related attributes, and give each table a
primary key.
2NF Remove Redundant Data - If an attribute depends on only part of a multi-valued key, remove it to a separate
table.
3NF Remove Columns Not Dependent on a Key - If attributes do not contribute to a description of the key, remove
them to a separate table.
BCNF - If there are non-trivial dependencies between candidate key attributes, separate them out into distinct
tables.
21
Example - 1NF
Dogs & Owners
Owner  | Age (years) | Type            | Name
Mark   | 16          | White Lab       | Roman
Dave   | 8           | Beagle          | Jack
Carrie | .5          | Black Mouth Cur | Boo
Dave   | 6           | Beagle          | Lucy
22
Example 1 - 2NF
Owners and Dogs
Owners:

Id | Name
1  | Mark
2  | Carrie
3  | Dave

Dogs:

Id | Name  | Age | Owner | Type
1  | Roman | 16  | 1     | White Lab
2  | Jack  | 8   | 3     | Beagle
3  | Boo   | .5  | 2     | Black Mouth Cur
4  | Lucy  | 6   | 3     | Beagle

Each owner record and each dog record now has exactly one entry.
23
Example 1 - 3NF
Owners and Dogs + Type
Owners:

Id | Name
1  | Mark
2  | Carrie
3  | Dave

Dogs:

Id | Name  | Age | Owner | Type
1  | Roman | 16  | 1     | 1
2  | Jack  | 8   | 3     | 2
3  | Boo   | .5  | 2     | 3
4  | Lucy  | 6   | 3     | 2

Types:

Id | Type
1  | White Lab
2  | Beagle
3  | Black Mouth Cur

More redundant info removed.
24
Example 1 - 4NF - More than one owner per dog
Owners and Dogs + Type
Owners:

Id | Name
1  | Mark
2  | Carrie
3  | Dave

Dogs:

Id | Name  | Age | Type
1  | Roman | 16  | 1
2  | Jack  | 8   | 2
3  | Boo   | .5  | 3
4  | Lucy  | 6   | 2

Types:

Id | Type
1  | White Lab
2  | Beagle
3  | Black Mouth Cur

Owner/Dog relationships:

Owner_ID | Pet_ID
1        | 1
3        | 2
2        | 3
3        | 4
2        | 3
25
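As a sketch, the 4NF design above could be declared like this (the DDL, column types, and constraint choices are assumptions for illustration, not from the slides):

CREATE TABLE owner (
    id   INTEGER NOT NULL,
    name CHAR(20),
    PRIMARY KEY (id));

CREATE TABLE type (
    id   INTEGER NOT NULL,
    type CHAR(20),
    PRIMARY KEY (id));

CREATE TABLE dog (
    id   INTEGER NOT NULL,
    name CHAR(20),
    age  DECIMAL(4,1),          -- allows ages like .5
    type INTEGER NOT NULL,
    PRIMARY KEY (id),
    FOREIGN KEY (type) REFERENCES type (id));

-- The many-to-many dog/owner relationship lives in its own table
CREATE TABLE dog_owner (
    owner_id INTEGER NOT NULL,
    pet_id   INTEGER NOT NULL,
    PRIMARY KEY (owner_id, pet_id),
    FOREIGN KEY (owner_id) REFERENCES owner (id),
    FOREIGN KEY (pet_id)   REFERENCES dog (id));

As a bonus, the composite primary key on dog_owner keeps the same owner/dog pair from being recorded twice.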
Exercise and Break
● Break into a small group
● Pick one of the following two by recording
○ Persons in group and their favorite pizza (Crust, toppings, etc)
○ Persons in group and details on their car/truck
● Normalize Data as much as possible
Back in twenty minutes
26
Efficient Use of a Database
27
Two Examples -- Give sales department 20% raise
In application code, row by row:

foreach (sale_emp in sales_employees)
    $pay = $pay * 1.20;

As a single SQL statement:

UPDATE employees
SET pay_rate = pay_rate * 1.20
WHERE department = 'sales';
28
Two Examples -- Give sales department 20% raise
In application code, row by row:

foreach (sale_emp in sales_employees)
    $pay = $pay * 1.20;

As a single SQL transaction:

START TRANSACTION;
UPDATE employees
SET pay_rate = pay_rate * 1.20
WHERE department = 'sales';
COMMIT;
29
Two Examples -- Give sales department 20% raise
In application code, row by row:

foreach (sale_emp in sales_employees)
    $pay = $pay * 1.20;

As a single SQL transaction:

START TRANSACTION;
UPDATE employees
SET pay_rate = pay_rate * 1.20
WHERE department = 'sales';
COMMIT;

Questions:
Which one is crash proof?
Which takes more clock time?
Which one is boss proof, as in 'Hey, before you start that process, what is
the total effect on payroll?' or 'Could you make it 18.77%?'
30
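One way to be 'boss proof' is to let the server answer those questions before you change anything; a sketch against the same hypothetical employees table:

-- Total effect on payroll: read-only, safe to run first
SELECT SUM(pay_rate * 0.20) AS payroll_increase
FROM employees
WHERE department = 'sales';

-- And 'make it 18.77%' is a one-line change:
UPDATE employees
SET pay_rate = pay_rate * 1.1877
WHERE department = 'sales';

The application-code loop gives you neither answer without another pass over every row.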
Databases are good at heavy lifting
Make big or little changes to data as one transaction
Statistical functions are usually excellent
Sorting / grouping (database servers usually have more memory and faster disk
than application servers)
Learn to use Pentaho or BIRT to write ad hoc reports (can be set to run on a
schedule) **
31
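For example, a single aggregate query keeps the sorting and grouping on the server instead of shipping every row back to PHP (a sketch, reusing the hypothetical employees table):

SELECT department,
       COUNT(*)      AS headcount,
       AVG(pay_rate) AS average_pay
FROM employees
GROUP BY department
ORDER BY department;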
Thinking in Sets
32
Data Set
A data set (or dataset) is a collection of data.
Most commonly a data set corresponds to the
contents of a single database table, or a single
statistical data matrix, where every column of the
table represents a particular variable, and each row
corresponds to a given member of the data set in
question. The data set lists values for each of the
variables, such as height and weight of an object,
for each member of the data set. Each value is
known as a datum. The data set may comprise data
for one or more members, corresponding to the
number of rows.
https://en.wikipedia.org/wiki/Data_set
33
34
35
The Classic N+1 Problem
SELECT parent_record
For each child
SELECT child_record
…..
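Using the dog and owner tables from Example 1, the anti-pattern and its set-based fix look roughly like this (column names follow the 2NF version of the tables):

-- N+1: one query for the owners, then one more query per owner
SELECT id, name FROM owner;
SELECT name FROM dog WHERE owner = 1;
SELECT name FROM dog WHERE owner = 2;
SELECT name FROM dog WHERE owner = 3;

-- Set-based: one round trip returns everything
SELECT owner.name, dog.name
FROM owner
JOIN dog ON dog.owner = owner.id;

With three owners that is four queries versus one; with thousands of parent rows the difference is what kills performance.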
Back to Example 1
Remember the 4NF tables of dogs and
owners?
We have information on the dogs, the owners,
types of dogs, and dog/owner relationships.
● Each stands individually
● Can be tied to each other easily
JOINing is the process of combining rows
from two or more tables using common fields
36
DOGS with Types
37
Types:

ID | Type
1  | White Lab
2  | Beagle
3  | Black Mouth Cur

Dogs:

ID | Name  | Age | Type
1  | Roman | 16  | 1
2  | Jack  | 8   | 2
3  | Boo   | .5  | 3
4  | Lucy  | 6   | 2
DOGS with Types
38
Types:

ID | Type
1  | White Lab
2  | Beagle
3  | Black Mouth Cur

Dogs:

ID | Name  | Age | Type
1  | Roman | 16  | 1
2  | Jack  | 8   | 2
3  | Boo   | .5  | 3
4  | Lucy  | 6   | 2
When matching dogs to their type, we need to use
common columns between the tables.
JOIN
SELECT dog.name, type.type
FROM dog
JOIN type
ON dog.type = type.id
39
JOIN
SELECT dog.name, type.type
FROM dog
JOIN type
ON dog.type = type.id
40
What we want
JOIN
SELECT dog.name, type.type
FROM dog
JOIN type
ON dog.type = type.id
41
Where we will get it from
JOIN
SELECT dog.name, type.type
FROM dog
JOIN type
ON dog.type = type.id
42
How they link
together
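Run against the sample data, the query returns one row per dog with its type spelled out:

name  | type
Roman | White Lab
Jack  | Beagle
Boo   | Black Mouth Cur
Lucy  | Beagle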
SQL Join Venn Diagrams
Our JOIN
43
[Venn diagram: a Dog circle overlapping a Type circle; the overlap is what our JOIN returns]
Go to Google, search SQL Venn Diagrams, & GET!
44
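As a sketch of what those diagrams translate to in SQL (same dog and type tables; a dog with no matching type row is hypothetical here):

-- Inner join: only the overlap of the two circles
SELECT dog.name, type.type
FROM dog
JOIN type ON dog.type = type.id;

-- Left join: the whole dog circle; unmatched dogs show NULL for type
SELECT dog.name, type.type
FROM dog
LEFT JOIN type ON dog.type = type.id;

-- Left circle only: dogs with no matching type at all
SELECT dog.name
FROM dog
LEFT JOIN type ON dog.type = type.id
WHERE type.id IS NULL;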
SQL Optimizer
45
What Happens When You Send a Query to Server
1. Check connection permissions
2. Is SYNTAX of query correct
3. Does user have permissions:
a. Database
b. Table
c. Column
4. Develop Query Plan
5. Return Data
46
SELECT dog.name, type.type
FROM dog
JOIN type
ON dog.type = type.id
Query Plan
A query plan (or query execution plan) is an ordered set of steps used to access data in a SQL relational
database management system. This is a specific case of the relational model concept of access plans.
Since SQL is declarative, there are typically a large number of alternative ways to execute a given query, with
widely varying performance. When a query is submitted to the database, the query optimizer evaluates some of
the different, correct possible plans for executing the query and returns what it considers the best option. Because
query optimizers are imperfect, database users and administrators sometimes need to manually examine and tune
the plans produced by the optimizer to get better performance.
https://en.wikipedia.org/wiki/Query_plan
47
Rough Analog
Designated Driver: You have to pick up three friends for an evening out and
get them home.
What info do you need?
48
Query Optimizer is like a GPS
It has statistical information on how to get some place but may not know
about traffic jams (locks), accidents (deletions, etc.). It also knows how long it
should take to get somewhere, but that may change with load or over time.
And the optimizer occasionally will make bad choices and send you to a dead
end.
49
Remember Example JOIN?
The optimizer realizes it has to get four columns
from two tables, which gives it 4! (24) possible
combinations to consider.
50
SELECT dog.name, type.type
FROM dog
JOIN type
ON dog.type = type.id
Cost Based Query Optimization
Cost-based query optimizers evaluate the resource footprint of various query plans and use this as the basis for
plan selection. These assign an estimated "cost" to each possible query plan, and choose the plan with the
smallest cost. Costs are used to estimate the runtime cost of evaluating the query, in terms of the number of I/O
operations required, CPU path length, amount of disk buffer space, disk storage service time, and interconnect
usage between units of parallelism, and other factors determined from the data dictionary. The set of query plans
examined is formed by examining the possible access paths (e.g., primary index access, secondary index access,
full file scan) and various relational table join techniques (e.g., merge join, hash join, product join). The search
space can become quite large depending on the complexity of the SQL query. There are two types of optimization.
These consist of logical optimization—which generates a sequence of relational algebra to solve the query—and
physical optimization—which is used to determine the means of carrying out each operation.
https://en.wikipedia.org/wiki/Query_optimization
51
What is COST?
Cost until recently was the cost of doing disk
INPUT-OUTPUT. Note data on disk is 100,000
times SLOWER to read than from memory.
100,000 seconds is 27.77778 hours
100,000 milliseconds is 1.6666667 minutes
52
This is starting to change as vendors try to
accommodate newer technologies, storage
units of different speeds, etcetera -- and in
the future you will be able to assign costs to
a device.
Full Table Scan
Full table scan (also known as sequential scan) is a scan made on a database where each row of the table
under scan is read in a sequential (serial) order and the columns encountered are checked for the validity of a
condition. Full table scans are usually the slowest method of scanning a table due to the heavy amount of I/O
reads required from the disk which consists of multiple seeks as well as costly disk to memory transfers.
Sequential scan takes place usually when the column or group of columns of a table (the table may be on disk or
may be an intermediate table created by the join of two or more tables) needed for the scan do not contain an
index which can be used for the purpose.
https://en.wikipedia.org/wiki/Full_table_scan
53
Full Table Scan Analog
Pretend you are given a dictionary to look up the plural of the word ‘moose’.
But the dictionary is not alphabetized (random order) and you need to also
beware of homonyms, synonyms, and homographs.
54
Index
An index is an indirect shortcut derived from and pointing into a greater volume of values, data, information or
knowledge.
https://en.wikipedia.org/wiki/Index
In other words, we get the exact record(s) needed without having to search all the others
55
Primary Key
The primary key for a table represents the column or set of columns that you use in your most vital queries. It
has an associated index, for fast query performance. Query performance benefits from the NOT NULL
optimization, because it cannot include any NULL values.
http://dev.mysql.com/doc/refman/5.7/en/optimizing-primary-keys.html
56
NULL
Null is a special marker used in Structured Query Language (SQL) to indicate that a data value does not exist in
the database. Introduced by the creator of the relational database model, E. F. Codd, SQL Null serves to fulfill the
requirement that all true relational database management systems (RDBMS) support a representation of "missing
information and inapplicable information".
https://en.wikipedia.org/wiki/Null_(SQL)
57
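A practical consequence worth a sketch: NULL is never equal to anything, not even NULL, so you must test for it with IS NULL (lonestar table borrowed from the DDL example later in the deck):

SELECT name FROM lonestar WHERE name = NULL;   -- always returns zero rows
SELECT name FROM lonestar WHERE name IS NULL;  -- finds rows with a missing name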
UNIQUE KEY
There can be just one!!
58
Multi-Columns Indexes
MySQL can create composite indexes (that is, indexes on multiple columns). An index may consist of up to 16
columns. For certain data types, you can index a prefix of the column (see Section 8.3.4, “Column Indexes”).
MySQL can use multiple-column indexes for queries that test all the columns in the index, or queries that test
just the first column, the first two columns, the first three columns, and so on. If you specify the columns in the
right order in the index definition, a single composite index can speed up several kinds of queries on the same
table.
http://dev.mysql.com/doc/refman/5.7/en/multiple-column-indexes.html
A Y-M-D index works for Y-M-D, Y-M, or Y but not D, M-D.
Sometimes you can get information from the index without diving into the data. A
State-Zip-CityName index can get you all the cities within a State/Zip.
59
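A sketch of that leftmost-prefix rule, with a hypothetical events table:

CREATE INDEX idx_ymd ON events (y, m, d);

-- These can use the index (a leftmost prefix is present):
SELECT id FROM events WHERE y = 2016;
SELECT id FROM events WHERE y = 2016 AND m = 4;
SELECT id FROM events WHERE y = 2016 AND m = 4 AND d = 15;

-- These cannot (no leftmost prefix):
SELECT id FROM events WHERE d = 15;
SELECT id FROM events WHERE m = 4 AND d = 15;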
Optimize Data Structure
● For unique IDs or other values that can be represented as either strings or numbers, prefer
numeric columns to string columns. Since large numeric values can be stored in fewer bytes
than the corresponding strings, it is faster and takes less memory to transfer and compare
them.
● When comparing values from different columns, declare those columns with the same
character set and collation wherever possible, to avoid string conversions while running the
query.
● If a table contains string columns such as name and address, but many queries do not retrieve
those columns, consider splitting the string columns into a separate table and using join
queries with a foreign key when necessary (a sketch follows below).
60
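A sketch of that last split; the customer tables and columns are illustrative, not from the slides:

-- Hot, small columns stay in the table most queries hit
CREATE TABLE customer (
    id      INTEGER NOT NULL,
    balance DECIMAL(10,2),
    PRIMARY KEY (id));

-- Long strings move to a side table
CREATE TABLE customer_detail (
    customer_id INTEGER NOT NULL,
    name        VARCHAR(100),
    address     VARCHAR(255),
    PRIMARY KEY (customer_id),
    FOREIGN KEY (customer_id) REFERENCES customer (id));

-- JOIN only in the queries that actually need the strings:
SELECT customer.id, customer.balance, customer_detail.name
FROM customer
JOIN customer_detail ON customer_detail.customer_id = customer.id;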
DDL & DML
● Data Description Language
○ Describes the structure of the data
● Data Manipulation Language
○ A family of syntax elements used for selecting, inserting, deleting and updating data in a database.
61
DDL
CREATE TABLE lonestar (id INTEGER NOT NULL,
                       name CHAR(10),
                       PRIMARY KEY (id));

Be very careful of Reserved Words when naming
columns, and check on case sensitivity too!
62
DML
SELECT id, name FROM lonestar;
SELECT lonestar.id, lonestar.name FROM lonestar;
SELECT id AS 'Identification Nbr', name AS 'Full Name' FROM lonestar;
SELECT name FROM lonestar WHERE name = 'Smith';
SELECT name FROM lonestar WHERE id IN (10,20,30,40,50) OR name < 'Jones';
UPDATE lonestar SET name = 'None' WHERE id = 10 AND name = 'Jones';
DELETE FROM lonestar WHERE id > 99;
63
EXPLAIN
The EXPLAIN statement can be used to obtain information about how MySQL executes a statement:
Permitted explainable statements for EXPLAIN are SELECT, DELETE, INSERT, REPLACE, and UPDATE.
When EXPLAIN is used with an explainable statement, MySQL displays information from the optimizer about
the statement execution plan. That is, MySQL explains how it would process the statement, including information
about how tables are joined and in which order.
When EXPLAIN is used with FOR CONNECTION connection_id rather than an explainable statement, it displays
the execution plan for the statement executing in the named connection.
64
Explain -- Tabular Classic View
65
EXPLAIN select name from city where CountryCode = 'USA'
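The screenshot is not reproduced here, but the classic tabular output for this query looks roughly like this (the 274-row estimate comes from the next slide; the other values are typical for the world sample database and may differ on your server):

id | select_type | table | type | possible_keys | key         | key_len | ref   | rows | Extra
1  | SIMPLE      | city  | ref  | CountryCode   | CountryCode | 3       | const | 274  | NULL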
Explain -- Tabular Classic View
66
EXPLAIN select name from city where CountryCode = 'USA'
The query is using a key (index) named
CountryCode.
The optimizer estimates it will need to read 274
rows to satisfy the query.
Explain -- Tabular Classic View
67
EXPLAIN select name from city where CountryCode = 'USA'
‘Ref’ - reference -- join type
EXPLAIN JOIN Types from best to worst
68
System The table has only one row (= system table). This is a special case of the const join type.
Const The table has at most one matching row, which is read at the start of the query. Because there is only one
row, values from the column in this row can be regarded as constants by the rest of the optimizer. const tables
are very fast because they are read only once.
Eq_ref One row is read from this table for each combination of rows from the previous tables. It is used when
all parts of an index are used by the join and the index is a PRIMARY KEY or UNIQUE NOT NULL index.
Ref All rows with matching index values are read from this table for each combination of rows from the
previous tables. ref is used if the join uses only a leftmost prefix of the key or if the key is not a PRIMARY KEY or
UNIQUE index.
Fulltext The join is performed using a FULLTEXT index.
Ref_or_null This join type is like ref, but with the addition that MySQL does an extra search for rows that
contain NULL values. This join type optimization is used most often in resolving subqueries.
EXPLAIN JOIN Types from best to worst
69
Index_merge This join type indicates that the Index Merge optimization is used. In this case, the key column in
the output row contains a list of indexes used, and key_len contains a list of the longest key parts for the
indexes used. WHERE X = 20 or X = 30 AND Z = 5
Unique_subquery This type replaces eq_ref for some IN subqueries of the following form:
● value IN (SELECT primary_key FROM single_table WHERE some_expr)
unique_subquery is just an index lookup function that replaces the subquery completely for better efficiency.
Index_subquery This join type is similar to unique_subquery. It replaces IN subqueries, but it works for
nonunique indexes in subqueries
EXPLAIN JOIN Types from best to worst
70
Range Only rows that are in a given range are retrieved, using an index to select the rows. The key column in
the output row indicates which index is used. The key_len contains the longest key part that was used. The ref
column is NULL for this type.
Index The index join type is the same as ALL, except that the index tree is scanned. This occurs two ways:
○ If the index is a covering index for the queries and can be used to satisfy all data required
from the table, only the index tree is scanned. In this case, the Extra column says Using
index. An index-only scan usually is faster than ALL because the size of the index usually is
smaller than the table data.
○ A full table scan is performed using reads from the index to look up data rows in index
order. Uses index does not appear in the Extra column.
MySQL can use this join type when the query uses only columns that are part of a single index.
ALL A full table scan is done for each combination of rows from the previous tables. MAY NOT BE BAD!**
Explain -- Tabular Classic View 2
71
EXPLAIN select City.Name, Country.name FROM City join
Country on (City.CountryCode = Country.Code) where
CountryCode = 'USA' AND City.Name LIKE 'New%'
The Extra column of EXPLAIN output contains additional information about how MySQL
resolves the query. The following list explains the values that can appear in this column.
If you want to make your queries as fast as possible, look out for Extra values of Using
filesort and Using temporary.
Explain -- Tabular Classic View
72
EXPLAIN select name from city where CountryCode = 'USA'
Mainly look at the rows column
Visual Explain -- MySQL Workbench
73
Optimizer trace
74
Use MySQL Workbench -- too far in the weeds for 99% of developers
Start to Wrap Up
75
Books:
SQL Antipatterns -- Bill Karwin
Effective MySQL: Optimizing SQL Statements -- Ronald Bradford
SQL and Relational Theory -- C.J. Date
Wrap Up
Work on most frequently run queries
Re-check as data grows
De-normalize at your own risk -- it will eventually bite you in the rear
Let database do the heavy lifting
Think in sets
76
Q/A
david.stokes@oracle.com
@stoker
Opensourcedba.wordpress.com & http://elephantdolphin.blogspot.com/
77