Data Engineer - Course
Exercise:
Query information_schema with SELECT
information_schema is a meta-database that holds information about your current database.
information_schema has multiple tables you can query with the known SELECT * FROM syntax.
In this exercise, you'll only need information from the 'public' schema, which is specified as the
column table_schema of the tables and columns tables. The 'public' schema holds information
about user-defined tables and databases. The other types of table_schema hold system information
– for this course, you're only interested in user-defined stuff.
-- Query the right table in information_schema, limited to the 'public' schema
SELECT table_name
FROM information_schema.tables
WHERE table_schema = 'public';
Instructions 2/4
Now have a look at the columns in university_professors by selecting all entries in
information_schema.columns that correspond to that table.
SELECT *
FROM information_schema.columns
WHERE table_name = 'university_professors';

SELECT *
FROM university_professors
LIMIT 5;
Transcript
For example, you could create a "weather" table with three aptly named columns. After
each column name, you must specify the data type. There are many different types, and
you will discover some in the remainder of this course. For example, you could specify a
text column, a numeric column, and a column that requires fixed-length character
strings with 5 characters each. These data types will be explained in more detail in the
next chapter.
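As a minimal sketch, assuming the three columns are named clouds, temperature, and weather_station (the names are illustrative; only the data types come from the description above):

-- A text column, a numeric column, and a fixed-length character column of 5 characters
CREATE TABLE weather (
    clouds text,
    temperature numeric,
    weather_station char(5)
);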
Instructions 1/2
Create a table professors with two text columns: firstname and lastname.
Create a table universities with three text columns: university_shortname, university, and
university_city.
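A possible solution, using the column names given in the instructions:

-- Professors with two text columns
CREATE TABLE professors (
    firstname text,
    lastname text
);

-- Universities with three text columns
CREATE TABLE universities (
    university_shortname text,
    university text,
    university_city text
);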
Oops! We forgot to add the university_shortname column to the professors table. You've
probably already noticed:
To add columns you can use the following SQL query:
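-- One way to add the missing column (assuming it should be a text column, like the others)
ALTER TABLE professors
ADD COLUMN university_shortname text;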
00:00 - 00:33
So far you've learned how to set up a simple database that consists of multiple
tables. Apart from storing different entity types, such as professors, in different
tables, you haven't made much use of database features. In the end, the idea of a
database is to push data into a certain structure – a pre-defined model, where you
enforce data types, relationships, and other rules. Generally, these rules are
called integrity constraints, although different names exist.
2. Integrity constraints
00:33 - 01:20
Integrity constraints can roughly be divided into three types. The simplest
ones are probably the so-called attribute constraints. For example, a certain
attribute, represented through a database column, could have the integer data
type, allowing only for integers to be stored in this column. They'll be the subject
of this chapter. Secondly, there are so-called key constraints. Primary keys, for
example, uniquely identify each record, or row, of a database table. They'll be
discussed in the next chapter. Lastly, there are referential integrity constraints. In
short, they glue different database tables together. You'll learn about them in the
last chapter of this course.
3. Why constraints?
01:20 - 02:08
So why should you know about constraints? Well, they press the data into a
certain form. With good constraints in place, people who type in birthdates, for
example, always have to enter them in the same form. Data entered by humans is
often very tedious to pre-process. So constraints give you consistency, meaning
that a row in a certain table has exactly the same form as the next row, and so
forth. All in all, they help to solve a lot of data quality issues. While enforcing
constraints on human-entered data is difficult and tedious, database management
systems can be a great help. In the next chapters and exercises, you'll explore
how.
02:43 - 03:23
Data types also restrict possible SQL operations on the stored data. For example,
it is impossible to calculate a product from an integer *and* a text column, as
shown here in the example. The text column "wind_speed" may store numbers,
but PostgreSQL doesn't know how to use text in a calculation. The solution for
this is type casts, that is, on-the-fly type conversions. In this case, you can use
the "CAST" function, followed by the column name, the AS keyword, and the
desired data type, and PostgreSQL will turn "wind_speed" into an integer right
before the calculation.
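As a small sketch of this on-the-fly conversion (the table name and the temperature column are assumptions; only wind_speed and the CAST syntax come from the example described above):

-- Cast the text column wind_speed to integer right before the calculation
SELECT temperature * CAST(wind_speed AS integer) AS wind_chill
FROM weather;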
The most common data types
00:45 - 01:25
Here are the most common types in PostgreSQL. Note that these types are
specific to PostgreSQL but appear in many other database management systems
as well, and they mostly conform to the SQL standard. The "text" type allows
character strings of any length, while the "varchar" and "char" types specify a
maximum number of characters, or a character string of fixed length,
respectively. You'll use these two for your database. The "boolean" type allows
for two boolean values, for example, "true" and "false" or "1" and "0", and for a
third unknown value, expressed through "NULL".
01:25 - 01:45
Then there are various formats for date and time calculations, also with timezone
support. "numeric" is a general type for any sort of numbers with arbitrary
precision, while "integer" allows only whole numbers in a certain range. If that
range is not enough for your numbers, there's also "bigint" for larger numbers.
5. Specifying types upon table creation
01:45 - 02:30
Here's an example of how types are specified upon table creation. Let's say the
social security number, "ssn", should be stored as an integer as it only contains
whole numbers. The name may be a string with a maximum of 64 characters,
which might or might not be enough. The date of birth, "dob", is naturally stored
as a date, while the average grade is a numeric value with a precision of 3 and a
scale of 2, meaning that numbers with a total of three digits, two of them after
the decimal point, are allowed. Lastly, the information whether the tuition of the
student was paid is, of course, a boolean one, as it can be either true or false.
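Put together, the table described above could be created like this (the table and column names are assumptions based on the narration):

CREATE TABLE students (
    ssn integer,
    name varchar(64),
    dob date,
    average_grade numeric(3, 2),
    tuition_paid boolean
);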
02:30 - 03:16
Altering types after table creation is also straightforward, just use the shown
"ALTER TABLE ALTER COLUMN" statement. In this case, the maximum name
length is extended to 128 characters. Sometimes it may be necessary to truncate
column values or transform them in any other way, so they fit with the new data
type. Then you can use the "USING" keyword, and specify a transformation that
should happen before the type is altered. Let's say you'd want to turn the
"average_grade" column into an integer type. Normally, PostgreSQL would just
keep the part of the number before the fractional point. With "USING", you can tell
it to round the number to the nearest integer, for example.
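A sketch of both statements, reusing the hypothetical students table from above:

-- Extend the maximum name length to 128 characters
ALTER TABLE students
ALTER COLUMN name TYPE varchar(128);

-- Turn average_grade into an integer, rounding to the nearest integer instead of truncating
ALTER TABLE students
ALTER COLUMN average_grade TYPE integer
USING ROUND(average_grade);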
00:00 - 00:09
In the last part of this chapter, you'll get to know two special attribute constraints:
the not-null and unique constraints.
00:09 - 00:34
As the name already says, the not-null constraint disallows any "NULL" values on
a given column. This must hold true for the existing state of the database, but
also for any future state. Therefore, you can only specify a not-null constraint on
a column that doesn't hold any "NULL" values yet. And: It won't be possible to
insert "NULL" values in the future.
00:34 - 00:58
Before I go on explaining how to specify not-null constraints, I want you to think
about "NULL" values. What do they actually mean to you? There's no clear
definition. "NULL" can mean a couple of things, for example, that the value is
unknown, or does not exist at all. It can also be possible that a value does not
apply to the column. Let's look into an example.
00:58 - 01:49
Let's say we define a table "students". The first two columns for the social
security number and the last name cannot be "NULL", which makes sense: this
should be known and apply to every student. The "home_phone" and
"office_phone" columns though should allow for null values – which is the
default, by the way. Why? First of all, these numbers can be unknown, for any
reason, or simply not exist, because a student might not have a phone. Also,
some values just don't apply: Some students might not have an office, so they
don't have an office phone, and so forth. So, one important takeaway is that two
"NULL" values need not have the same meaning. This also means that comparing
"NULL" with "NULL" never evaluates to "TRUE": in PostgreSQL, the result of such a comparison is itself "NULL".
01:49 - 02:21
You've just seen how to add a not-null constraint to certain columns when
creating a table. Just add "not null" after the respective columns. But you can
also add and remove not-null constraints to and from existing tables. To add a
not-null constraint to an existing table, you can use the "ALTER COLUMN SET
NOT NULL" syntax as shown here. Similarly, to remove a not-null constraint, you
can use "ALTER COLUMN DROP NOT NULL".
6. The unique constraint
02:21 - 02:57
The unique constraint on a column makes sure that there are no duplicates in a
column. So any given value in a column can only exist once. This, for example,
makes sense for university short names, as storing universities more than once
leads to unnecessary redundancy. However, it doesn't make sense for university
cities, as two universities can co-exist in the same city. Just as with the not-null
constraint, you can only add a unique constraint if the column doesn't hold any
duplicates before you apply it.
7. Adding unique constraints
02:57 - 03:26
Here's how to create columns with unique constraints. Just add the "UNIQUE"
keyword after the respective table column. You can also add a unique constraint
to an existing table. For that, you have to use the "ADD CONSTRAINT" syntax.
This is different from adding a "NOT NULL" constraint. However, it's a pattern
that frequently occurs. You'll see plenty of other examples of "ADD
CONSTRAINT" in the remainder of this course.
1. Keys and superkeys
2. The current database model
00:10 - 00:30
Let's have a look at your current database model first. In the last chapter, you
specified attribute constraints, first and foremost data types. You also set not-null
and unique constraints on certain attributes. This didn't actually change the
structure of the model, so it still looks the same.
3. The database model with primary keys
00:30 - 00:56
By the end of this chapter, the database will look slightly different. You'll add so-
called primary keys to three different tables. You'll name them "id". In the entity-
relationship diagram, keys are denoted by underlined attribute names. Notice that
you'll add a whole new attribute to the "professors" table, and you'll modify
existing columns of the "organizations" and "universities" tables.
4. What is a key?
00:56 - 01:52
Before we go into the nitty-gritty of what a primary key actually is, let's look at
keys in general. Typically a database table has an attribute, or a combination of
multiple attributes, whose values are unique across the whole table. Such
attributes identify a record uniquely. Normally, a table, as a whole, only contains
unique records, meaning that the combination of all attributes is a key in itself.
However, it's not called a key, but a superkey, if attributes from that combination
can be removed, and the remaining attributes still uniquely identify records. If as many
attributes as possible have been removed and the records are still uniquely identifiable by the
remaining attributes, we speak of a minimal superkey. This is the actual key. So a
key is always minimal. Let's look at an example.
5. An example
01:52 - 02:17
Here's an example that I found in a textbook on database systems. Obviously, the
table shows six different cars, so the combination of all attributes is a superkey. If
we remove the "year" attribute from the superkey, the six records are still unique,
so it's still a superkey. Actually, there are a lot of possible superkeys in this
example.
6. An example (contd.)
02:17 - 03:06
However, there are only four minimal superkeys, and these are "license_no",
"serial_no", and "model", as well as the combination of "make" and "year".
Remember that superkeys are minimal if no attributes can be removed without
losing the uniqueness property. This is trivial for K1 to 3, as they only consist of a
single attribute. Also, if we remove "year" from K4, "make" would contain
duplicates and would therefore no longer be suited as a key. These four minimal
superkeys are also called candidate keys. Why candidate keys? In the end, there
can only be one key for the table, which has to be chosen from the candidates.
More on that in the next video.
Primary keys
1. Primary keys
Okay, now it's time to look at an actual use case for superkeys, keys, and candidate
keys.
Primary keys are one of the most important concepts in database design. Almost every
database table should have a primary key – chosen by you from the set of candidate
keys. The main purpose, as already explained, is uniquely identifying records in a table.
This makes it easier to reference these records from other tables, for instance – a
concept you will go through in the next and last chapter. You might have already
guessed it, but primary keys need to be defined on columns that don't accept duplicate
or null values. Lastly, primary key constraints are time-invariant, meaning that they must
hold for the current data in the table – but also for any future data that the table might
hold. It is therefore wise to choose columns where values will always be unique and not
null.
3. Specifying primary keys
So these two tables accept exactly the same data; however, the latter has an explicit
primary key specified. As you can see, specifying primary keys upon table creation is
very easy. Primary keys can also be specified in a separate PRIMARY KEY clause, as shown
in the sketch below. This notation is necessary if you want to designate more than one
column as the primary key. Beware: that's still only one primary key; it is just formed by
the combination of two columns. Ideally, though, primary keys consist of as few columns as possible!
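A sketch of both notations, using hypothetical tables:

-- Primary key specified directly after the column
CREATE TABLE products (
    product_no integer PRIMARY KEY,
    name text,
    price numeric
);

-- Separate PRIMARY KEY clause; required when the key spans more than one column
CREATE TABLE example (
    a integer,
    b integer,
    c integer,
    PRIMARY KEY (a, c)
);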
Surrogate keys
1. Surrogate keys
Surrogate keys are sort of an artificial primary key. In other words, they are not based
on a native column in your data, but on a column that just exists for the sake of having a
primary key. Why would you need that?
2. Surrogate keys
There are several reasons for creating an artificial surrogate key. As mentioned before,
a primary key is ideally constructed from as few columns as possible. Secondly, the
primary key of a record should never change over time. If you define an artificial primary
key, ideally consisting of a unique number or string, you can be sure that this number
stays the same for each record. Other attributes might change, but the primary key
always has the same value for a given record.
3. An example
Let's look back at the example in the first video of this chapter. I altered it slightly and
added the "color" column. In this table, the "license_no" column would be suited as the
primary key – the license number is unlikely to change over time, unlike the color
column, for example, which might change if the car is repainted. So there's no need for
a surrogate key here. However, let's say there were only these three attributes in the
table. The only sensible primary key would be the combination of "make" and "model",
but that's two columns for the primary key.
7. Your database
In the exercises, you'll add a surrogate key to the "professors" table, because the
existing attributes are not really suited as primary key. Theoretically, there could be
more than one professor with the same name working for one university, resulting in
duplicates. With an auto-incrementing "id" column as the primary key, you make sure
that each professor can be uniquely referred to. This was not necessary for
organizations and universities, as their names can be assumed to be unique across
these tables. In other words: It is unlikely that two organizations with the same name
exist, if only for trademark reasons. The same goes for universities.
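One way to add such a surrogate key, sketched here with PostgreSQL's auto-incrementing serial type:

-- Add an auto-incrementing id column and make it the primary key
ALTER TABLE professors
ADD COLUMN id serial PRIMARY KEY;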
Chapter 4:
5. A query
00:12 - 00:56
So you've added a 1:N-relationship between professors and universities. Such
relationships have to be implemented with one foreign key in the table that has at most
one foreign entity associated. In this case, that's the "professors" table, as professors
cannot have more than one university associated. Now, what about affiliations? We
know that a professor can have more than one affiliation with organizations, for
instance, as a chairman of a bank and as a president of a golf club. On the other hand,
organizations can also have more than one professor connected to them. Let's look at
the entity-relationship diagram that models this.
3. The final database model
There are a couple of things that are new. First of all, a new relationship between
organizations and professors was added. This is an N:M relationship, not an 1:N
relationship as with professors and universities. This depicts the fact that a professor
can be affiliated with more than one organization and vice versa. Also, it has its own
attribute, the function. Remember that each affiliation comes with a function, for
instance, "chairman". The second thing you'll notice is that the affiliations entity type
disappeared altogether. For clarity, I still included it in the diagram, but it's no longer
needed. However, you'll still have four tables: Three for the entities "professors",
"universities" and "organizations", and one for the N:M-relationship between
"professors" and "organizations".
Such a relationship is implemented with an ordinary database table that contains two
foreign keys that point to both connected entities. In this case, that's a foreign key
pointing to the "professors.id" column, and one pointing to the "organizations.id"
column. Also, additional attributes, in this case "function", need to be included. If you
were to create that relationship table from scratch, you would define it as shown. Note
that "professor_id" is stored as "integer", as the primary key it refers to has the type
"serial", which is also an integer. On the other hand, "organization_id" has
"varchar(256)" as type, conforming to the primary key in the "organizations" table. One
last thing: Notice that no primary key is defined here because a professor can
theoretically have multiple functions in one organization. One could define the
combination of all three attributes as the primary key in order to have some form of
unique constraint in that table, but that would be a bit over the top.
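A sketch of that relationship table, following the column types mentioned above (the type of function is an assumption):

CREATE TABLE affiliations (
    professor_id integer REFERENCES professors (id),
    organization_id varchar(256) REFERENCES organizations (id),
    function varchar(256)
);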
Referential integrity
We'll now talk about one of the most important concepts in database systems: referential integrity.
It's a very simple concept…
2. Referential integrity
...that states that a record referencing another record in another table must always refer to an
existing record. In other words: A record in table A cannot point to a record in table B that does not
exist. Referential integrity is a constraint that always concerns two tables, and is enforced through
foreign keys, as you've seen in the previous lessons of this chapter. So if you define a foreign key in
the table "professors" referencing the table "universities", referential integrity is held from
"professors" to "universities".
However, throwing an error is not the only option. If you specify a foreign key on a column, you can
actually tell the database system what should happen if an entry in the referenced table is deleted.
By default, the "ON DELETE NO ACTION" keyword is automatically appended to a foreign key
definition, like in the example here. This means that if you try to delete a record in table B which is
referenced from table A, the system will throw an error. However, there are other options. For
example, there's the "CASCADE" option, which will first allow the deletion of the record in table B,
and then will automatically delete all referencing records in table A. So that deletion is cascaded.
There are even more options. The "RESTRICT" option is almost identical to the "NO ACTION"
option. The differences are technical and beyond the scope of this course. More interesting is the
"SET NULL" option. It will set the value of the foreign key for this record to "NULL". The "SET
DEFAULT" option only works if you have specified a default value for a column. It automatically
changes the referencing column to a certain default value if the referenced record is deleted. Setting
default values is also beyond the scope of this course, but this option is still good to know.
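A sketch of how these options appear in a foreign key definition (the table and column names are hypothetical):

-- Referenced table
CREATE TABLE b (
    id integer PRIMARY KEY
);

-- Default behaviour, spelled out explicitly: deleting a referenced record in b throws an error
CREATE TABLE a (
    id integer PRIMARY KEY,
    b_id integer REFERENCES b (id) ON DELETE NO ACTION
);

-- Alternatives for the foreign key line:
--   b_id integer REFERENCES b (id) ON DELETE CASCADE
--   b_id integer REFERENCES b (id) ON DELETE RESTRICT
--   b_id integer REFERENCES b (id) ON DELETE SET NULL
--   b_id integer REFERENCES b (id) ON DELETE SET DEFAULT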
6. Let's look at some examples!
Let's practice this a bit and change the referential integrity behavior of your database.
Roundup
1. Roundup
Congratulations, you're almost done. Let's quickly revise what you've done throughout
this course.
Course 6
Database Design
1. OLTP and OLAP
Hello! My name is Lis, I'm a Curriculum Manager here at DataCamp. In this course, we'll
be talking about database design. So, what does that entail exactly?
2. How should we organize and manage data?
To put it simply, in this course we're asking the question: How should we organize and
manage data? To answer this, we have to consider the different schemas, management
options, and objects that make up a database. Some examples are listed here, and they
are covered throughout the course. These topics all affect the way data is stored and
accessed. Some enable faster query speeds. Some take up less memory than others.
And notably, some cost more money than others.
7. Working together
OLAP and OLTP systems work together; in fact, they need each other. OLTP data is
usually stored in an operational database that is pulled and cleaned to create an OLAP
data warehouse. We'll get more into data warehouses and other storage solutions in the
next video. Without transactional data, no analyses can be done in the first place.
Analyses from OLAP systems are used to inform business practices and day-to-day
activity, thereby influencing the OLTP databases.
8. Takeaways
To wrap up, here's what you should take away from this video: Before implementing
anything, figure out your business requirements because there are many design
decisions you'll have to make. The way you set up your database now will affect how it
can be effectively used in the future. Start by figuring out if you need an OLAP or OLTP
approach, or perhaps both! You should now be comfortable with the differences
between both. These are the two most common approaches. However, they are not
exhaustive, but they are an excellent start to get you on the right path to designing your
database. In later videos, we'll learn more about the technical differences between both
approaches.
Storing data
2. Structuring data
00:03 - 00:55
Data can be structured at three different levels. The first is structured data, which is usually
defined by schemas. Data types and tables are not only defined, but relationships
between tables are also defined, using concepts like foreign keys. The second is
unstructured data, which is schemaless and data in its rawest form, meaning it's not
clean. Most data in the world is unstructured. Examples include media files and raw
text. The third is semi-structured data, which does not follow a larger schema, rather it
has an ad-hoc self-describing structure. Therefore, it has some structure. This is an
inherently vague definition as there can be a lot of variation between structured and
unstructured data. Examples include NoSQL, XML, and JSON, which is shown here on
the right.
3. Structuring data
00:55 - 01:11
Because it's clean and organized, structured data is easier to analyze. However, it's not
as flexible because it needs to follow a schema, which makes it less scalable. These
are trade-offs to consider as you move between structured and unstructured data.
5. Data warehouses
Data warehouses are optimized for read-only analytics. They combine data from
multiple sources and use massively parallel processing for faster queries. In their
database design, they typically use dimensional modeling and a denormalized schema.
We will walk through both of these terms later in the course. Amazon, Google, and
Microsoft all offer data warehouse solutions, known as Redshift, BigQuery, and Azure
SQL Data Warehouse, respectively. A data mart is a subset of a data warehouse
dedicated to a specific topic. Data marts allow departments to have easier access to the
data that matters to them.
6. Data lakes
02:30 - 03:45
Technically, traditional databases and warehouses can store unstructured data, but not
cost-effectively. Data Lake storage is cheaper because it uses object storage as
opposed to the traditional block or file storage. This allows massive amounts of data of
all types to be stored effectively, from streaming data to operational databases. Lakes
are massive because they store all the data that might be used. Data lakes are often
petabytes in size - that's 1,000 terabytes! Unstructured data is the most scalable, which
permits this size. Lakes are schema-on-read, meaning the schema is created as data is
read. Warehouses and traditional databases are classified as schema-on-write because
the schema is predefined. Data lakes have to be organized and cataloged well;
otherwise, it becomes an aptly named "data swamp." Data lakes aren't only limited to
storage. It's becoming popular to run analytics on data lakes. This is especially true for
tasks like deep learning and data discovery, which need a lot of data that doesn't need
to be that "clean." Again, the big three cloud providers all offer a data lake solution.
7. Extract, Transform, Load or Extract, Load, Transform
When we think about where to store data, we have to think about how data will get there
and in what form. Extract Transform Load and Extract Load Transform are two different
approaches for describing data flows. They concern the intricacies of building data
pipelines, which we won't go into here. ETL is the more traditional approach for
warehousing and smaller-scale analytics. But, ELT has become common with big data
projects. In ETL, data is transformed before loading into storage - usually to follow the
storage's schema, as is the case with warehouses. In ELT, the data is stored in its
native form in a storage solution like a data lake. Portions of data are transformed for
different purposes, from building a data warehouse to doing deep learning.
1. Database design
Now, let's learn more about what database design means.
Database design determines how data is logically stored. This is crucial because it affects how the
database will be queried, whether for reading data or updating data. There are two important
concepts to know when it comes to database design: Database models and schemas. Database
models are high-level specifications for database structure. The relational model, which is the most
popular, is the model used to make relational databases. It defines rows as records and columns as
attributes. It calls for rules such as each row having unique keys. There are other models that exist
that do not enforce the same rules. A schema is a database's blueprint. In other words, the
implementation of the database model. It lays out the logical structure more granularly by defining the
specific tables, fields, relationships, indexes, and views a database will have. Schemas must be
respected when inserting structured data into a relational database.
3. Data modeling
The first step to database design is data modeling. This is the abstract design phase, where we
define a data model for the data to be stored. There are three levels to a data model: A conceptual
data model describes what the database contains, such as its entities, relationships, and attributes.
A logical data model decides how these entities and relationships map to tables. A physical data
model looks at how data will be physically stored at the lowest level of abstraction. These three
levels of a data model ensure consistency and provide a plan for implementation and use.
4. An example
Here is a simplified example of where we want to store songs. In this case, the entities are songs,
albums, and artists with various pink attributes. Their relationships are denoted by blue rhombuses.
Here we have a conceptual idea of the data we want to store. Here is a corresponding schema using
the relational model. The fastest way to create a schema is to translate the entities into tables. But
just because it's the easiest doesn't mean it's the best. Let's look at some other ways this ER
diagram could be converted.
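As an illustration, a direct translation of those entities into relational tables might look like this (the attribute names and types are assumptions; the diagram's actual attributes aren't reproduced here):

CREATE TABLE artists (
    artist_id integer PRIMARY KEY,
    name varchar(256)
);

CREATE TABLE albums (
    album_id integer PRIMARY KEY,
    title varchar(256),
    artist_id integer REFERENCES artists (artist_id)
);

CREATE TABLE songs (
    song_id integer PRIMARY KEY,
    title varchar(256),
    album_id integer REFERENCES albums (album_id)
);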
From the prerequisites, you should be familiar with the relational model. Dimensional modeling is an
adaptation of the relational model specifically for data warehouses. It's optimized for OLAP type of
queries that aim to analyze rather than update. To do this, it uses the star schema. In the next
chapter, we'll delve into that more. As we will see in the next slide, the schema of a dimensional
model tends to be easy to interpret and extend. This is a big plus for analysts working on the
warehouse.
7. Elements of dimensional modeling
Dimensional models are made up of two types of tables: fact and dimension tables. What the fact
table holds is decided by the business use-case. It contains records of a key metric, and this metric
changes often. Fact tables also hold foreign keys to dimension tables. Dimension tables hold
descriptions of specific attributes and these do not change as often. So what does that mean? Let's
bring back our example, where we're analyzing songs. The turquoise table is a fact table called
songs. It contains foreign keys to purple dimension tables. These dimension tables expand on the
attributes of a fact table, such as the album it is in and the artist who made it. The records in fact
tables often change as new songs get inserted. Albums, labels, artists, and genres will be shared by
more than one song - hence records in dimension tables won't change as much. Summing it up, to
decide the fact table in a dimensional model, consider what is being analyzed and how often entities
change.
Congrats on finishing the first chapter! We're now going to jump in where we left off with
the star schema.
2. Star schema
The star schema is the simplest form of the dimensional model. Some use the terms
"star schema" and "dimensional model" interchangeably. Remember that the star
schema is made up of two tables: fact and dimension tables. Fact tables hold records of
metrics that are described further by dimension tables. Throughout this chapter, we are
going to use another bookstore example. However, this time, you work for a company
that sells books in bulk to bookstores across the US and Canada. You have a database
to keep track of book sales. Let's take a look at the star schema for this database.
3. Star schema example
00:40 - 01:17
Excluding primary and foreign keys, the fact table holds the sales amount and quantity
of books. It's connected to dimension tables with details on the books sold, the time the
sale took place, and the store buying the books. You may notice the lines connecting
these tables have a special pattern. These lines represent a one-to-many relationship.
For example, a store can be part of many book sales, but one sale can only belong to
one store. The star schema got its name because it tends to look like a star with its
different extension points.
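A sketch of such a star schema in SQL, with assumed table and column names and deliberately minimal dimension tables:

CREATE TABLE dim_book_star (
    book_id integer PRIMARY KEY,
    title varchar(256),
    author varchar(256)
);

CREATE TABLE dim_time_star (
    time_id integer PRIMARY KEY,
    year integer,
    quarter integer
);

CREATE TABLE dim_store_star (
    store_id integer PRIMARY KEY,
    city varchar(256)
);

-- Fact table: one record per sale, with foreign keys to the three dimensions
CREATE TABLE fact_booksales (
    sales_id integer PRIMARY KEY,
    book_id integer REFERENCES dim_book_star (book_id),
    time_id integer REFERENCES dim_time_star (time_id),
    store_id integer REFERENCES dim_store_star (store_id),
    sales_amount numeric,
    quantity integer
);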
Now that we have a good grasp of the star schema, let's look at the snowflake schema.
The snowflake schema is an extension of the star schema. Off the bat, we see that it
has more tables. You may not be able to see all the details in this slide, but don't worry:
it will be broken down in later slides. The information contained in this schema is the
same as in the star schema. In fact, the fact table is the same, but the way the dimension
tables are structured is different. We see that they extend further, hence the name.
5. Same fact table, different dimensions
01:52 - 02:03
The star schema extends one dimension, while the snowflake schema extends over
more than one dimension. This is because the dimension tables are normalized.
6. What is normalization?
02:03 - 02:16
The goal is to reduce redundancy and increase data integrity. So how does this
happen? There are several forms of normalization, which we'll delve into later. But the
basic idea is to identify repeating groups of data and create new tables for them. Let's
go back to our example to see how these tables were normalized.
Here's the book dimension in the star schema. What could be repeating here? Primary
keys are inherently unique. For book titles, although repeats are possible here, they are
not common. On the other hand, authors often publish more than one book, publishers
definitely publish many books, and a lot of books share genres. We can create new
tables for them, and it results in the following snowflake schema:
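The full diagram isn't reproduced here, but a sketch of the new dimension tables could look like this (names and types are assumptions; the genre table would be split out in the same way):

CREATE TABLE dim_author (
    author_id integer PRIMARY KEY,
    author varchar(256)
);

CREATE TABLE dim_publisher (
    publisher_id integer PRIMARY KEY,
    publisher varchar(256)
);

-- The book dimension now only keeps foreign keys to the new tables
CREATE TABLE dim_book_sf (
    book_id integer PRIMARY KEY,
    title varchar(256),
    author_id integer REFERENCES dim_author (author_id),
    publisher_id integer REFERENCES dim_publisher (publisher_id)
);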
Do you see how these repeating groups now have their own table?
10. Store dimension of the star schema
03:03 - 03:11
On to the store dimension! Cities, states, and countries can definitely have more than one
book store within them.
11. Store dimension of the snowflake schema
03:11 - 03:36
Here are the normalized dimension tables representing the book stores. Do you notice
that the way we structure these repeating groups is a bit different from the book
dimension? An author can have published in different genres and with various
publishers, which is why they were separate dimensions. However, a city stays in the
same state and country; thus, they extend each other over three dimensions.
The same is done for the time dimension. A day is part of a month that is part of a
quarter, and so on!
13. Snowflake schema
03:42 - 03:48
And here we put all the normalized dimensions together to get the snowflake schema.
14. Let's practice!
03:48 - 03:53
Welcome back! Now that we have a grasp on normalization, let's talk about why we
would want to normalize a database.
2. Back to our book store example
00:07 - 00:38
You should be familiar with these two schemas by now. They're both storing fictional
company data on the sales of books in bulk to stores across the US and Canada. On
the left, you have the star schema with denormalized dimension tables. On the right,
you have the snowflake schema with normalized dimension tables. The normalized
database looks way more complicated. And it is in some ways. For example, let's say
you wanted to get the quantity of all books by Octavia E. Butler sold in Vancouver in Q4
of 2018.
3. Denormalized query
00:38 - 00:50
Based on the denormalized schema, you can run the following query to accomplish this.
It's composed of 3 joins, which makes sense based on the three dimension tables in the
star schema.
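The query itself isn't shown here, but based on the description it could look roughly like this (the table and column names follow the hypothetical star-schema sketch from the previous chapter and are assumptions):

SELECT SUM(fact_booksales.quantity)
FROM fact_booksales
JOIN dim_book_star ON fact_booksales.book_id = dim_book_star.book_id
JOIN dim_store_star ON fact_booksales.store_id = dim_store_star.store_id
JOIN dim_time_star ON fact_booksales.time_id = dim_time_star.time_id
WHERE dim_book_star.author = 'Octavia E. Butler'
  AND dim_store_star.city = 'Vancouver'
  AND dim_time_star.quarter = 4
  AND dim_time_star.year = 2018;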
4. Normalized query
00:50 - 00:56
What would the query look like on the normalized schema? A lot longer. It doesn't even
fit on one slide!
5. Normalized query (continued)
00:56 - 01:13
There's a total of 8 inner joins. This makes sense based on the snowflake schema
diagram. The normalized snowflake schema has considerably more tables. This means
more joins, which means slower queries. So why would we want to normalize a
database?
6. Normalization saves space
01:13 - 01:36
Normalization saves space. This isn't intuitive, seeing as normalized databases have
more tables. Let's take a look at the store table in our denormalized database. Here we
see a lot of repeated information in bold - such as USA, California, New York, and
Brooklyn. This type of denormalized structure enables a lot of data redundancy.
7. Normalization saves space
01:36 - 02:03
If we normalize that previous schema, we get this: We see that although we are using
more tables, there is no data redundancy. The string "Brooklyn" is only stored once. And
the state records are stored separately because many cities share the same state and
country. We don't need to repeat that information; instead, we can have one record
holding the string California. Here we see how normalization eliminates data
redundancy.
8. Normalization ensures better data integrity
02:03 - 03:03
Normalization ensures better data integrity through its design. First, it enforces data
consistency. Data entry can get messy, and at times people will fill out fields differently.
For example, when referring to California, someone might enter the initials "CA". Since
the states are already entered in a table, we can ensure naming conventions through
referential integrity. Secondly, because duplicates are reduced, modification of any data
becomes safer and simpler. Say in the previous example, you wanted to update the
spelling of a state - you wouldn't have to find each record referring to the state, instead,
you could make that change in the states table by altering one record. From there, you
can be confident that the new spelling will be enacted for all stores in that state. Lastly,
since tables are smaller and organized more by object, it's easier to alter the database
schema. You can extend a smaller table without having to alter a larger table holding all
the vital data.
9. Database normalization
03:03 - 03:27
To recap, here are the pros and cons of normalization. Now normalization seems
appealing, especially for database maintenance. However, normalization requires a lot
more joins, making queries more complicated, which can make indexing and reading of
data slower. Deciding between normalization and denormalization comes down to how
read- or write-intensive your database is going to be.
10. Remember OLTP and OLAP?
03:27 - 04:05
Remember OLTP and OLAP? Can you guess which prefers normalization? Take a
pause and think about it. Did you get it right? OLTP is write-intensive, meaning we're
updating and writing often. Normalization makes sense because we want to add data
quickly and consistently. OLAP is read-intensive because we're running analytics on the
data. This means we want to prioritize quicker read queries. Remember how many
more joins the normalized query had compared to the denormalized query? OLAP should avoid
that.
11. Let's practice!
04:05 - 04:08