sql_patterns_v1.5
Contents

Copyright
Introduction
    Who am I
    Why I wrote this book
    Who this book is for
    What you'll learn in this book
    How this book is organized
Introduction
This is a book about SQL Patterns. Patterns describe problems that occur over
and over in our professional settings. A pattern is like a template. Once you
learn them you can apply them to solve problems faster and make your code
better. Learning and applying patterns is how you level up in your career. We can
illustrate this with an example.
In fiction writing, authors rarely write books from scratch. They use character
patterns like: “antihero”, “sidekick”, “mad scientist”, “girl next door” and plot
patterns like: “romantic comedy,” “melodrama”, “red herring”, “foreshadowing”,
“cliffhangers”, etc. This helps them write better books, movies and TV shows
faster.
Each pattern in the book is described using a consistent set of elements.
Who am I
I’ve been writing SQL for nearly 20 years. I’ve seen and written hundreds of
thousands of lines of code. Over time I noticed a set of patterns and best practices
I always came back to when writing queries. These patterns made my code more
efficient, easier to understand and a breeze to maintain.
This book is for anyone who is familiar with SQL and wants to take their skills to
the next level. I assume you're already familiar with basic SQL syntax and you know
how to join tables and do basic filtering and aggregation.
I’m a huge fan of project-based learning. You can learn anything if you can come
up with an interesting project to use what you’re learning. I used this exact
method to teach myself data science. I came up with a work-related project that
was both valuable to the company and used the things I was learning.
That’s why for this book I came up with an interesting and useful data project
to organize it around. I’ll explain each pattern as I walk you through the project.
This will ensure that you learn the material better and remember it the next time
you need to apply it.
In the previous edition of this book I used the StackOverflow dataset that’s
publicly available in BigQuery. Realizing that not everyone has access to this and
that it could disappear at any moment, I decided to make a few changes.
First of all, I made the tables available as parquet files on GitHub. Second, I decided
to use the freely available (and quite amazing) DuckDB. The instructions for
setting everything up are available on this repo: github.com/ergest/sql_patterns
I’ve also included all the chapter code listings in the repo so you can copy/paste
them and run them. I strongly encourage you to type the code yourself. You'll learn better that
way. Using this dataset we’re going to build a table which calculates reputation
metrics. You can use this same type of table to calculate a customer engagement
score or a customer 360 table.
As we go through the project, we’ll cover each pattern when it arises. That will
help you understand why we’re using that pattern at that exact moment. Each
chapter will cover a select group of patterns while building on the previous
chapters.
In Chapter 2 we cover Core Concepts and Patterns. In this chapter we’re going
to cover some of the core concepts of querying data and building tables for
analysis and data science. We’ll start with the most important but underrated
concept in SQL: granularity.
In Chapter 3 we cover Modularity Patterns. In this chapter we’ll learn some key
concepts that make SQL code easier to read, understand and maintain. We first
talk about the concept of modularity and explore some patterns there. Then
we'll cover the Single Responsibility Principle (SRP), Don't Repeat Yourself (DRY)
and a few others.
In Chapter 6 we wrap up our project and you get to see the entire query. By now
you should be able to understand it and know exactly how it was designed. I
recap the entire project so that you get another chance to review all the patterns.
The goal here is to allow you to see all the patterns together and give you ideas
on how to apply them in your day-to-day work.
In Chapter 7 we cover dbt Patterns. In this chapter we’re going to use all the
patterns we’ve seen to simplify our final query from the project using dbt. The
purpose of this chapter is to show how these patterns apply beyond just SQL.
With that out of the way, let’s dive into the database.
Chapter 1: Understanding the Database
In this chapter we get into the details of the StackOverflow database we’re going
to be using throughout the book. You can refer back to it at any point you feel
you don’t understand the underlying tables.
Before we dive into writing queries you should make sure you have
the proper development environment set up. I have posted a detailed
guide on how to set things up with dbt and DuckDB on this Github repo:
github.com/ergest/sql_patterns. This way I can update them as needed without
having to update the book.
StackOverflow is a popular website where users post questions about any
technical topic such as programming languages, databases, etc. and other
users can post answers to these questions, vote on the answers or comment on
them.
Based on the quality of the answers, users gain reputation and badges. These
badges act as social proof on StackOverflow and potentially on other websites.
This database is made available for free online in BigQuery but it’s really large
so I’ve extracted one month of data and packaged it with the Github repo as
parquet files.
In the first edition of this book I used BigQuery directly but I found that people
had some issues with it. Plus if the free plan was ever revoked or the dataset
deleted, I wanted to ensure the queries in the book could still be run locally.
For our project we want to build a table that calculates reputation metrics for
every user. This type of table is sometimes called a “feature table” and is very
common in data science and machine learning applications. It has one row per
entity (in our case a single user) and numerical attributes pertaining to that
entity.
This is the perfect project to illustrate many of the patterns covered in this book
because it’s a challenging task that requires multiple data transformation steps.
We will first see how to build it with a single query, then in Chapter 7 we build it
using dbt.
Let’s take a look at the schema. As you can see, we have our entity identifier (in
our case the user_id and user_name) and every other column represents some
type of score pertaining to that user:
| column_name | type |
|---------------------------|---------|
| user_id | INT64 |
| user_name | STRING |
| total_posts_created | NUMERIC |
| total_answers_created | NUMERIC |
| total_answers_edited | NUMERIC |
| total_questions_created | NUMERIC |
| total_upvotes | NUMERIC |
| total_comments_by_user | NUMERIC |
| total_questions_edited | NUMERIC |
| max_streak_in_days | NUMERIC |
| total_comments_on_post | NUMERIC |
| posts_per_day | NUMERIC |
| edits_per_day | NUMERIC |
| answers_per_day | NUMERIC |
| questions_per_day | NUMERIC |
| comments_by_user_per_day | NUMERIC |
| answers_per_post | NUMERIC |
| questions_per_post | NUMERIC |
| upvotes_per_post | NUMERIC |
| downvotes_per_post | NUMERIC |
| user_comments_per_post | NUMERIC |
| comments_on_post_per_post | NUMERIC |
Writing accurate and efficient SQL begins with understanding the underlying
data model. It often exists as an Entity-Relationship Diagram (ERD) that shows
you how the tables connect with each other. The ERD is usually a graphical
representation, though it may not always be available, so more often than not
you’ll have to learn it as you go.
You can find the original StackOverflow data model online here but the one
included with this book is slightly different, so I'll walk you through it step by step.
The original data model has a single posts table for all the post types, whereas ours splits each one into a separate table:
posts_questions and posts_answers. You can view them in our database using
the information_schema views in DuckDB like this:
--listing 1.1
SELECT table_name
FROM information_schema.tables
WHERE table_name like 'posts_%';
--sample output
table_name |
---------------+
posts_answers |
posts_questions|
Assuming you've set things up properly, here's the result of the query in DBeaver
(in text output mode). I'll only use this format henceforth, but your output might
look different in the GUI.
They both have the same schema which we can view using another
information_schema view:
--listing 1.2
SELECT column_name, data_type
FROM information_schema.columns
WHERE table_name = 'posts_answers';
--sample output
column_name |data_type|
------------------------+---------+
id |BIGINT |
title |VARCHAR |
body |VARCHAR |
accepted_answer_id |VARCHAR |
answer_count |VARCHAR |
comment_count |BIGINT |
community_owned_date |TIMESTAMP|
creation_date |TIMESTAMP|
favorite_count |VARCHAR |
last_activity_date |TIMESTAMP|
last_edit_date |TIMESTAMP|
last_editor_display_name|VARCHAR |
last_editor_user_id |BIGINT |
owner_display_name |VARCHAR |
owner_user_id |BIGINT |
parent_id |BIGINT |
post_type_id |BIGINT |
score |BIGINT |
tags |VARCHAR |
view_count |VARCHAR |
Both tables have an id column that identifies a single post, a creation_date
timestamp for when the post was created, and a few other attributes
like score (the net of upvotes and downvotes), view_count, tags, etc.
Note the parent_id column which signifies a hierarchical structure. The parent_id
is a one-to-many relationship modeled within the same table. It links up all the
answers to the corresponding question. A single question can have one or many
answers but an answer belongs to one and only one question. This is relation 1
in Figure 1.1 above.
Both post types have a one-to-many relationship to the post_history which means
that one entry in the posts tables corresponds to one or many entries in the
post_history table. These are relations 3 and 4 in the diagram above. The post
history contains a log of all the activities that can be performed on a post such
as initial creation, any subsequent edits, comments, etc.
--listing 1.3
SELECT column_name, data_type
FROM information_schema.columns
WHERE table_name = 'post_history';
--sample output
column_name |data_type|
--------------------+---------+
id |BIGINT |
creation_date |TIMESTAMP|
post_id |BIGINT |
post_history_type_id|BIGINT |
revision_guid |VARCHAR |
user_id |BIGINT |
text |VARCHAR |
comment |VARCHAR |
A single post can have many types of activities identified by the post_history_type_id
column. This id shows the different types of activities a user can perform on the
site. We’re only concerned with the first 6. You can see the rest of them here if
you’re curious.
The first 3 indicate when a post is first submitted and the next 3 when a post is
edited. The post_history table also connects to the users table via the user_id in
a one-to-many relationship shown in Figure 1.1 as number 6. A single user can
perform multiple activities on a post.
In database lingo this is known as a bridge table because it connects two tables
(users and posts) that have a many-to-many relationship which cannot be
modeled otherwise.
The users table has one row per user and contains user attributes such as name,
reputation, etc. We’ll use some of these attributes in our final table.
--listing 1.4
SELECT column_name, data_type
FROM information_schema.columns
WHERE table_name = 'users';
--sample output
column_name |data_type|
-----------------+---------+
id |BIGINT |
text |VARCHAR |
creation_date |TIMESTAMP|
post_id |BIGINT |
user_id |BIGINT |
user_display_name|VARCHAR |
score |BIGINT |
Finally the votes table represents the upvotes and downvotes on a post. We’ll
need this to compute the total vote count on a user’s post which will show how
good the question or the answer is. This table has a granularity of one row per vote: one row per post, per vote type, per timestamp.
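We can inspect its schema the same way as the other tables; a query in the same style as the earlier listings:

SELECT column_name, data_type
FROM information_schema.columns
WHERE table_name = 'votes';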
--sample output
column_name |data_type|
-------------+---------+
id |BIGINT |
creation_date|TIMESTAMP|
post_id |BIGINT |
vote_type_id |BIGINT |
Chapter 2: Core Concepts and Patterns
In this chapter we’re going to cover some of the core concepts of querying data
and building tables for analysis and data science. We’ll start with the most
important but underrated concept in SQL: granularity.
Concept 1: Granularity
Granularity (also known as the grain of the table) is a measure of the level of
detail that determines an individual row in a table or view. This is extremely
important when it comes to joins or aggregating data.
A finely grained table means a high level of detail like one row per transaction
at the millisecond level. A coarse-grained table means a low level of detail, like
a count of all transactions per day, week or month.
Granularity is usually expressed as the combination of columns (or column) that
makes up a unique row.
For example, the users table has one row per user, a level of detail specified by
the id column. This is also known as the primary key of the table. That is the
finest grain of it.
The post_history table, on the other hand, contains a log of all the activities a
user performs on a post on a given date and time. Therefore the finest granularity
is one row per user, per post, per timestamp.
The comments table contains a log of all the comments on a post by a user on a
given date so its granularity is also one row per user, per post, per timestamp.
The votes table contains a log of all the upvotes and downvotes on a post on a
given date. It has separate rows for upvotes and downvotes so its granularity is
one row per post, per vote type, per timestamp.
To find a table’s granularity you either read the documentation, or if that doesn’t
exist, you make an educated guess and check. How do you check? It’s easy.
For example for post_history I assume (or guess) that I can find a unique row by
combining creation_date, post_id, post_history_type_id and user_id.
To check we can run the following query:
--listing 2.1
SELECT
creation_date,
post_id,
post_history_type_id AS type_id,
user_id,
COUNT(*) AS total
FROM
post_history
GROUP BY
1,2,3,4
HAVING
COUNT(*) > 1;
--sample output
creation_date |post_id |type_id|user_id|total|
-----------------------+--------+-------+-------+-----+
2021-12-10 14:09:36.950|70276799| 5| | 2|
If my hunch is correct, we have found our granularity and I should get 0 rows from
this query. But we don’t! We get one row. This means we have to be careful when
joining with this table on post_id, user_id, creation_date, post_history_type_id.
We have to deal with the duplicate issue first otherwise we’ll get incorrect
results.
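One way to guard against it is to collapse exact duplicates before joining (a minimal sketch; the post_activity CTE in Chapter 3 achieves the same effect by grouping on every column it selects):

--collapse exact duplicates down to one row per combination
SELECT
creation_date,
post_id,
post_history_type_id,
user_id
FROM
post_history
GROUP BY
1,2,3,4;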
Our final table will have a grain of one row per user. Only the users table has that
same granularity. In order to build it we’ll have to manipulate the granularity of
the source tables so that’s what we focus on next.
Now that you have a grasp of the concept of granularity, the next thing to learn is
how to manipulate it. What I mean by manipulation is specifically going from a
fine grain to a coarser grain.
For example an e-commerce website might store each transaction it performs as
a single row on a table with the millisecond timestamp when it occurred. This
gives us a very fine-grained table (i.e. a very high level of detail). But if we wanted
to know how much revenue we got on a given day, we have to reduce that level
of detail to a single row per day. That’s exactly what aggregation does.
--listing 2.2.1
SELECT
ph.post_id,
ph.user_id,
ph.creation_date AS activity_date,
ph.post_history_type_id
FROM
post_history ph
WHERE
TRUE
AND ph.post_history_type_id BETWEEN 1 AND 6
AND ph.user_id > 0 --exclude automated processes
AND ph.user_id IS NOT NULL --exclude deleted accounts
AND ph.creation_date >= '2021-12-01'
AND ph.creation_date <= '2021-12-31'
AND ph.post_id = 70182248;
--sample output
post_id |user_id|activity_date |post_history_type_id|
--------+-------+-----------------------+--------------------+
70182248|2230216|2021-12-01 10:03:18.350| 2|
70182248|2230216|2021-12-01 10:03:18.350| 1|
70182248|2230216|2021-12-01 10:03:18.350| 3|
70182248|2230216|2021-12-01 11:04:12.603| 5|
70182248|2230216|2021-12-01 12:59:48.113| 5|
70182248|2230216|2021-12-01 13:07:56.327| 5|
70182248|2702894|2021-12-01 13:35:41.293| 6|
70182248|2230216|2021-12-01 18:41:18.033| 5|
70182248|2230216|2021-12-01 18:41:18.033| 6|
70182248|2230216|2021-12-02 07:46:22.630| 4|
Notice that there are three rows for post_history_type_id values 1, 2 and 3
which all have the same timestamp 2021-12-01 10:03:18.350 and two rows for
post_history_type_id values 5 and 6 for timestamp 2021-12-01 18:41:18.033. If you
recall the type ids from Chapter 1, values 1, 2 and 3 represent initial body, initial
title and initial tags while values 5 and 6 represent editing the body and tags.
Since we don’t really care about the specifics, we can group those ids into a
single value and then aggregate the rows in order to collapse the granularity via
a CASE statement as shown below:
--listing 2.2.2
SELECT
ph.post_id,
ph.user_id,
ph.creation_date AS activity_date,
CASE WHEN ph.post_history_type_id IN (1,2,3) THEN 'create'
WHEN ph.post_history_type_id IN (4,5,6) THEN 'edit'
END AS activity_type,
COUNT(*) AS total
FROM
post_history ph
WHERE
TRUE
AND ph.post_history_type_id BETWEEN 1 AND 6
AND ph.user_id > 0 --exclude automated processes
AND ph.user_id IS NOT NULL --exclude deleted accounts
AND ph.creation_date >= '2021-12-01'
AND ph.creation_date <= '2021-12-31'
AND ph.post_id = 70182248
GROUP BY
1,2,3,4;
--sample output
post_id |user_id|activity_date |activity_type|total|
--------+-------+-----------------------+-------------+-----+
70182248|2230216|2021-12-01 10:03:18.350|create | 3|
70182248|2230216|2021-12-01 11:04:12.603|edit | 1|
70182248|2230216|2021-12-01 12:59:48.113|edit | 1|
70182248|2230216|2021-12-01 13:07:56.327|edit | 1|
70182248|2702894|2021-12-01 13:35:41.293|edit | 1|
70182248|2230216|2021-12-01 18:41:18.033|edit | 2|
70182248|2230216|2021-12-02 07:46:22.630|edit | 1|
We have now effectively manipulated the granularity of the table by reducing the
overall number of rows but retaining most of the information. Notice however
that this action is both destructive in terms of information loss and irreversible.
What I mean is that if we were to store ONLY the above table in our database and
get rid of the detailed table, we’d lose information about the specific section of
the post that was edited or created. We’d no longer know that on 2021-12-01
18:41:18.033 it was only the body and tags that were edited but not the title.
That’s why it’s common practice in data warehouses to always store the finest
grain (aka highest level of detail available) and then aggregate information on
top of it. This way we can easily debug data issues when they arise.
The timestamp column creation_date is a rich field with both the date and time
information (hour, minute, second, microsecond, millisecond). Timestamp fields
are unique when it comes to aggregation because they have many levels of
granularities built in.
Given a single timestamp, we can construct granularities for seconds, minutes,
hours, days, weeks, months, quarters, years, decades, etc. We do that by
using one of the many date manipulation functions like CAST(), DATE_TRUNC(),
DATE_PART(), etc.
For example if I wanted to remove the time information, I could collapse all
activities on a given date to a single row using DATE_TRUNC() like this:
--listing 2.3
SELECT
ph.post_id,
ph.user_id,
DATE_TRUNC('day', ph.creation_date) AS activity_date,
CASE WHEN ph.post_history_type_id IN (1,2,3) THEN 'create'
WHEN ph.post_history_type_id IN (4,5,6) THEN 'edit'
END AS activity_type,
COUNT(*) AS total
FROM
post_history ph
WHERE
TRUE
AND ph.post_history_type_id BETWEEN 1 AND 6
AND ph.user_id > 0 --exclude automated processes
AND ph.user_id IS NOT NULL --exclude deleted accounts
AND ph.creation_date >= '2021-12-01'
AND ph.creation_date <= '2021-12-31'
AND ph.post_id = 70182248
GROUP BY
1,2,3,4;
--sample output
post_id |user_id|activity_date|activity_type|total|
--------+-------+-------------+-------------+-----+
70182248|2702894| 2021-12-01|edit | 1|
70182248|2230216| 2021-12-01|create | 3|
70182248|2230216| 2021-12-01|edit | 5|
70182248|2230216| 2021-12-02|edit | 1|
This is another form of granularity manipulation where you change the shape
of aggregated data by “pivoting” rows into columns. In the above dataset we
tried to collapse the overall granularity of the table to a single day, but we got
edit occurring twice on 2021-12-01. Could we reduce the granularity further?
That’s exactly what the code below does. By pivoting the rows into columns, we
can have multiple independent aggregations occurring on the same day show
up on the same row. We will use exactly this output for our final table putting
each metric we calculate on its own column. Again notice how the granularity
manipulation process is both destructive and irreversible.
This is the query that will take the above output and turn it into that form:
--listing 2.4
SELECT
ph.post_id,
ph.user_id,
DATE_TRUNC('day', ph.creation_date) AS activity_date,
SUM(CASE WHEN ph.post_history_type_id IN (1,2,3)
THEN 1 ELSE 0 END) AS total_created,
SUM(CASE WHEN ph.post_history_type_id IN (4,5,6)
THEN 1 ELSE 0 END) AS total_edited
FROM
post_history ph
WHERE
TRUE
AND ph.post_history_type_id BETWEEN 1 AND 6
AND ph.user_id > 0 --exclude automated processes
AND ph.user_id IS NOT NULL --exclude deleted accounts
AND ph.creation_date >= '2021-12-01'
AND ph.creation_date <= '2021-12-31'
AND ph.post_id = 70182248
GROUP BY
1,2,3;
--sample output
post_id |user_id|activity_date|total_created|total_edited|
--------+-------+-------------+-------------+------------+
70182248|2230216| 2021-12-02| 0| 1|
70182248|2702894| 2021-12-01| 0| 1|
70182248|2230216| 2021-12-01| 3| 5|
Granularity multiplication will happen if the tables you’re joining have different
levels of detail for the columns being joined on. This will cause the resulting
number of rows to multiply.
Joining tables is one of the most basic functions in SQL. Databases are designed
to minimize redundancy of information by a process known as normalization. A
normalized database splits information into many separate tables but provides
ways to join them together and re-assemble that information.
Let’s look at an example. The users table has a grain of one row per user:
--listing 2.5
SELECT
id,
display_name,
creation_date,
reputation
FROM users
WHERE id = 2702894;
--sample output
id |display_name |creation_date |reputation|
-------+--------------+-----------------------+----------+
2702894|Graham Ritchie|2013-08-21 09:07:23.133| 20218|
Whereas the post_history table has multiple rows for the same user:
--listing 2.6
SELECT
id,
creation_date,
post_id,
post_history_type_id AS type_id,
user_id
FROM
post_history ph
WHERE
TRUE
AND ph.user_id = 2702894
LIMIT 10;
--sample output
id |creation_date |post_id |type_id|user_id|
---------+-----------------------+--------+-------+-------+
260173419|2021-12-16 10:54:11.637|70377756| 2|2702894|
260541172|2021-12-22 07:51:17.123|70445771| 2|2702894|
260044378|2021-12-14 16:28:26.013|70352124| 6|2702894|
260548889|2021-12-22 10:04:40.227|70446634| 6|2702894|
259143984|2021-12-01 13:34:28.483|70185165| 2|2702894|
259145213|2021-12-01 13:50:18.883|70185401| 2|2702894|
259211259|2021-12-02 10:38:18.150|70197917| 2|2702894|
259212754|2021-12-02 10:59:39.880|70198204| 2|2702894|
259457154|2021-12-06 07:56:54.167|70242375| 2|2702894|
If we join them on user_id, the granularity of the final result will be multiplied:
it will have as many rows per user as the post_history table does:
--listing 2.7
SELECT
ph.post_id,
ph.user_id,
u.display_name AS user_name,
ph.creation_date AS activity_date,
post_history_type_id AS type_id
FROM
post_history ph
INNER JOIN users u
ON u.id = ph.user_id
WHERE
TRUE
AND ph.user_id = 2702894;
--sample output
post_id |user_id|user_name |activity_date |type_id|
--------+-------+--------------+-----------------------+-------+
70377756|2702894|Graham Ritchie|2021-12-16 10:54:11.637| 2|
70445771|2702894|Graham Ritchie|2021-12-22 07:51:17.123| 2|
70352124|2702894|Graham Ritchie|2021-12-14 16:28:26.013| 6|
70446634|2702894|Graham Ritchie|2021-12-22 10:04:40.227| 6|
70185165|2702894|Graham Ritchie|2021-12-01 13:34:28.483| 2|
70185401|2702894|Graham Ritchie|2021-12-01 13:50:18.883| 2|
70197917|2702894|Graham Ritchie|2021-12-02 10:38:18.150| 2|
70198204|2702894|Graham Ritchie|2021-12-02 10:59:39.880| 2|
70242375|2702894|Graham Ritchie|2021-12-06 07:56:54.167| 2|
Notice how the user_name repeats for each row. So if the history table has 10
entries for the same user and the users table has 1, the final result will contain 10
x 1 entries for the same user. If for some reason the users table contained 2 entries for
the same user (messy real world data), we’d see 10 x 2 = 20 entries for that user
in the final result and each row would repeat twice.
Did you know that SQL will ignore a LEFT JOIN clause and perform an INNER JOIN
instead if you make this one simple mistake? This is one of those SQL hidden
secrets which sometimes gets asked as a question in interviews.
Let’s take a look at the example query from above:
--listing 2.8
SELECT
ph.post_id,
ph.user_id,
u.display_name AS user_name,
ph.creation_date AS activity_date
FROM
post_history ph
INNER JOIN users u
ON u.id = ph.user_id
WHERE
TRUE
AND ph.post_id = 70286266
ORDER BY
activity_date;
--sample output
post_id |user_id |user_name |activity_date |
--------+--------+-----------------+-----------------------+
70286266|11693691|M.hussnain Gujjar|2021-12-09 07:45:41.700|
70286266|11693691|M.hussnain Gujjar|2021-12-09 07:45:41.700|
70286266|11693691|M.hussnain Gujjar|2021-12-09 07:45:41.700|
70286266|12221382|Aldin Bradaric |2021-12-09 14:06:00.677|
70286266|12410533|Andrew Halil |2021-12-13 09:02:26.593|
70286266|12410533|Andrew Halil |2021-12-13 09:02:26.593|
You’ll see 6 rows. Now let’s change the INNER JOIN to a LEFT JOIN and rerun the
query:
--listing 2.9
SELECT
ph.post_id,
ph.user_id,
u.display_name AS user_name,
ph.creation_date AS activity_date
FROM
post_history ph
LEFT JOIN users u
ON u.id = ph.user_id
WHERE
TRUE
AND ph.post_id = 70286266
ORDER BY
activity_date;
--sample output
post_id |user_id |user_name |activity_date |
--------+--------+-----------------+-----------------------+
70286266|11693691|M.hussnain Gujjar|2021-12-09 07:45:41.700|
70286266|11693691|M.hussnain Gujjar|2021-12-09 07:45:41.700|
70286266|11693691|M.hussnain Gujjar|2021-12-09 07:45:41.700|
70286266|12221382|Aldin Bradaric |2021-12-09 14:06:00.677|
70286266|NULL |NULL |2021-12-09 14:06:00.677|
70286266|NULL |NULL |2021-12-13 09:02:26.593|
70286266|12410533|Andrew Halil |2021-12-13 09:02:26.593|
70286266|12410533|Andrew Halil |2021-12-13 09:02:26.593|
Notice the extra rows with NULL user data: those are history entries whose user_id has no match in the users table. Now suppose we keep the LEFT JOIN but filter on a column from the users table in the WHERE clause, say to only count activity by users with a very high reputation:

--listing 2.10
SELECT
COUNT(*)
FROM
post_history ph
LEFT JOIN users u
ON u.id = ph.user_id
WHERE
TRUE
AND u.reputation >= 500000;
--sample output
count_star()|
------------+
7596|
We get 7,596 rows. Fine you might say, that looks right. But it's not! Adding filters
in the WHERE clause on columns from a left-joined table will ALWAYS effectively
perform an INNER JOIN, because the rows where those columns are NULL get filtered out.
If we wanted to filter rows in the users table and still do a LEFT JOIN we have to
add the filter in the join condition like so:
--listing 2.11
SELECT
COUNT(*)
FROM
post_history ph
LEFT JOIN users u
ON u.id = ph.user_id
AND u.reputation >= 500000
WHERE
TRUE;
--sample output
count_star()|
------------+
806608|
Finally, a LEFT JOIN combined with a NULL filter on the joined table's key is a handy way to find the rows that have no match at all:

--listing 2.12
SELECT
COUNT(*)
FROM
post_history ph
LEFT JOIN users u
ON u.id = ph.user_id
WHERE
TRUE
AND u.id IS NULL;
--sample output
count_star()|
------------+
15704|
Granularity addition will happen when you want to append the results of two or
more queries. Appending only occurs at the row level while the columns remain
the same.
There are two ways you can append the results of multiple queries: UNION ALL
and UNION. UNION ALL will append query results without checking if they have
the same exact row.
This might cause duplicates but it’s really fast. If you know for sure your results
don’t contain any rows in common this is the preferred way to append them.
Two result sets contain no rows in common if their intersection is empty, so if
you were to join them on all their columns, you'd get no results.
UNION (distinct) will append query results but remove all duplicates from the
final output thus ensuring unique rows. It is much slower than UNION ALL because
of the extra operations to find and remove duplicates. Use this only when you’re
sure the results contain rows in common and you HAVE to remove the duplicates
from the final output.
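A tiny illustration of the difference, using literal values rather than our tables:

--UNION ALL keeps duplicate rows: returns two rows
SELECT 1 AS x UNION ALL SELECT 1 AS x;

--UNION removes them: returns one row
SELECT 1 AS x UNION SELECT 1 AS x;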
In order to append two or more result sets, a couple of requirements have to be met:
1. The number of columns from all tables has to be the same
2. The data types of the columns from all the tables have to line up
You can achieve the first requirement by using SELECT to choose only the columns
that match across the tables, or by relying on the tables having the same exact
schema. Note that when you union tables with different schemas, you have to
line up all the columns in the right order. This also comes in handy when two tables
have the same column named differently.
For example:
SELECT
id AS post_id,
'question' AS post_type,
FROM
posts_questions
UNION ALL
SELECT
id AS post_id,
'answer' AS post_type,
FROM
posts_answers
As a rule of thumb, when you append tables, it’s a good idea to add a constant
column to indicate the source table or some kind of type. This is helpful when
appending say activity tables to create a long, time-series table and you want to
identify each activity type in the final result set.
Note that when appending results, the column names will be those of the first
table and all the names of the subsequent columns will be ignored.
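For instance, in this hypothetical append the result columns are named post_id and post_type, taken from the first SELECT; the aliases in the second SELECT are ignored:

SELECT 1 AS post_id, 'question' AS post_type
UNION ALL
SELECT 2 AS some_other_id, 'answer' AS some_other_type;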
With that out of the way, let's get into the patterns.
Chapter 3: Modularity Patterns
In this chapter we’ll learn some key concepts that make SQL code easier to read,
understand and maintain. We first talk about the concept of modularity and then
explore some of the patterns related to it, like SRP, DRY and a few others.
Concept 1: Modularity
Every complex system is made up of simple, self-contained elements that can be
designed, developed and tested independently. And that means you can take
very complex queries and systematically break them down into much simpler
elements.
Just about every modern system is modular. Your smartphone might seem like
a single piece of hardware but in reality all its components (the screen, CPU,
memory, battery, speaker, GPU, accelerometer, GPS chip, etc.) were designed
independently and then assembled into a singular device.
Definition:
A module is a unit whose elements are tightly connected to themselves but
weakly connected to other units.
• When the modules are simple and self-contained the code is infinitely more
readable, easy to understand, easy to debug and fix, easy to extend and
scale.
• When the modules are carefully thought out, logical and with clean
interfaces the code becomes much easier to write. Once written, all you
have to do is assemble them like “LEGO” bricks instead of writing the
entire long query from scratch.
• When a system is designed with modularity in mind, the modules can be
developed by other parties in parallel so they can be assembled later. It
also makes it easy to improve functionality later on by swapping out old
modules for new ones as long as the interface is the same.
In this chapter we’ll only cover the first two methods. The third method is more
advanced so we’ll cover it in its own chapter.
CTEs or Common Table Expressions are temporary views whose scope is limited
to the current query. They are not stored in the database; they only exist in
memory while the query is running and are only accessible inside that query.
They act like subqueries but are easier to understand and use.
CTEs allow you to break down complex queries into simpler, smaller self-
contained modules. By connecting them together we can solve any complex
query.
When you use CTEs you can read a query from top to bottom and easily
understand what’s going on. When you use sub-queries you have to find the
innermost subquery and work your way outwards while keeping track of
everything in your head. That’s much harder to do so your code becomes really
hard to read, understand and maintain.
Side Note: Even though CTEs have been part of the definition of the SQL
standard since 1999, it has taken many years for database vendors to
implement them. Some versions of older databases (like MySQL before 8.0,
PostgreSQL before 8.4, SQL Server before 2005) do not have support for
them. All the modern cloud warehouse vendors support them.
One of the best ways to visualize CTEs is to think of them as a DAG (aka Directed
Acyclic Graph) where each node handles a single processing step. Here are
some examples of how CTEs could be chained to solve a complex query.
In this example each CTE uses the results of the previous CTE to build upon its
result set and take it further.
-- Define CTE 1
WITH cte1_name AS (
SELECT col1
FROM table1_name
),
-- Define CTE 2 by referring to CTE 1
cte2_name AS (
SELECT col1
FROM cte1_name
),
-- Define CTE 3 by referring to CTE 2
cte3_name AS (
SELECT col1
FROM cte2_name
),
-- Define CTE 4 by referring to CTE 3
cte4_name AS (
SELECT col1
FROM cte3_name
)
-- Main query
SELECT *
FROM cte4_name
In this example, CTE 3 depends on CTE 1 and CTE 2 which are independent of
each other and CTE 4 depends on CTE 3.
-- Define CTE 1
WITH cte1_name AS (
SELECT col1
FROM table1_name
),
-- Define CTE 2
cte2_name AS (
SELECT col1
FROM table2_name
),
-- Define CTE 3 by referring to CTE 1 and 2
cte3_name AS (
SELECT *
FROM cte1_name AS cte1
JOIN cte2_name AS cte2
ON cte1.col1 = cte2.col1
),
-- Define CTE 4 by referring to CTE 3
cte4_name AS (
SELECT col1
FROM cte3_name
)
-- Main query
SELECT *
FROM cte4_name
-- Define CTE 1
WITH cte1_name AS (
SELECT col1
FROM table1_name
),
-- Define CTE 2 by referring to CTE 1
cte2_name AS (
SELECT col1
FROM cte1_name
),
-- Define CTE 3 by referring to CTE 1
cte3_name AS (
SELECT col1
FROM cte1_name
),
-- Define CTE 4 by referring to CTE 1
cte4_name AS (
SELECT col1
FROM cte1_name
),
-- Define CTE 5 by referring to CTE 4
cte5_name AS (
SELECT col1
FROM cte4_name
),
-- Define CTE 6 by referring to CTEs 2, 3 and 5
cte6_name AS (
SELECT *
FROM cte2_name cte2
JOIN cte3_name cte3 ON cte2.col1 = cte3.col1
JOIN cte5_name cte5 ON cte3.col1 = cte5.col1
)
-- Main query
SELECT *
FROM cte6_name
As you can see, there are endless ways in which you can chain or stack CTEs to
solve complex queries. Now that you’ve seen the basics of what CTEs are, let’s
apply them to our project.
Getting our user data from the current form to the final form of one row per user
is not something that can be done in a single step.
Well you probably could hack something together that works but that will not be
very easy to maintain. It's a complex query. So in order to solve it, we need to
decompose (break down) our complex query into smaller, easier to write pieces.
Here’s how to think about it:
We know that a user can perform any of the following activities on any given
date:
1. Post a question
2. Post an answer
3. Edit a question
4. Edit an answer
5. Comment on a post
6. Receive a comment on their post
7. Receive a vote (upvote or downvote) on their post
We have separate tables for these activities, so our first step is to aggregate the
data from each of the tables to the user_id and activity_date granularity and put
each one in its own CTE. We can break this down into several sub-problems and
map out a solution like this:
Sub-problem 1
Calculate user metrics for post types and post activity types.
To get there we first have to manipulate the granularity of the post_history table
so we have one row per user_id per post_id per activity_type per activity_date.
That would look like this:
--listing 3.1
WITH post_activity AS (
SELECT
ph.post_id,
ph.user_id,
u.display_name AS user_name,
ph.creation_date AS activity_date,
CASE WHEN ph.post_history_type_id IN (1,2,3) THEN 'create'
WHEN ph.post_history_type_id IN (4,5,6) THEN 'edit'
END AS activity_type
FROM
post_history ph
INNER JOIN users u
ON u.id = ph.user_id
WHERE
TRUE
AND ph.post_history_type_id BETWEEN 1 AND 6
AND user_id > 0 --exclude automated processes
AND user_id IS NOT NULL --exclude deleted accounts
GROUP BY
1,2,3,4,5
)
SELECT *
FROM post_activity
WHERE user_id = 4603670
ORDER BY activity_date
LIMIT 10;
--sample output:
post_id |user_id|user_name |activity_date |activity_type|
--------+-------+----------------+-----------------------+-------------+
70192540|4603670|Barmak Shemirani|2021-12-01 23:30:38.057|create |
70192540|4603670|Barmak Shemirani|2021-12-01 23:35:42.157|edit |
70193076|4603670|Barmak Shemirani|2021-12-02 01:06:08.973|edit |
70192540|4603670|Barmak Shemirani|2021-12-02 01:56:02.137|edit |
70199876|4603670|Barmak Shemirani|2021-12-02 12:54:40.230|create |
70199876|4603670|Barmak Shemirani|2021-12-02 13:21:05.200|edit |
70199876|4603670|Barmak Shemirani|2021-12-02 14:14:56.210|edit |
70208753|4603670|Barmak Shemirani|2021-12-03 02:18:58.930|create |
70208753|4603670|Barmak Shemirani|2021-12-03 02:40:51.667|edit |
70212702|4603670|Barmak Shemirani|2021-12-03 11:40:09.240|edit |
We then join this with the posts_questions and posts_answers tables on post_id. That
would look like this:
--listing 3.2
WITH post_activity AS (
SELECT
ph.post_id,
ph.user_id,
u.display_name AS user_name,
ph.creation_date AS activity_date,
CASE WHEN ph.post_history_type_id IN (1,2,3) THEN 'create'
WHEN ph.post_history_type_id IN (4,5,6) THEN 'edit'
END AS activity_type
FROM
post_history ph
INNER JOIN users u
ON u.id = ph.user_id
WHERE
TRUE
AND ph.post_history_type_id BETWEEN 1 AND 6
AND user_id > 0 --exclude automated processes
AND user_id IS NOT NULL --exclude deleted accounts
GROUP BY ALL
),
post_types AS (
SELECT
id AS post_id,
'question' AS post_type,
FROM
posts_questions
UNION ALL
SELECT
id AS post_id,
'answer' AS post_type,
FROM
posts_answers
)
--continued below
SELECT
pa.user_id,
CAST(pa.activity_date AS DATE) AS activity_date,
pa.activity_type,
pt.post_type
FROM
post_activity pa
JOIN post_types pt ON pa.post_id = pt.post_id
WHERE user_id = 4603670
LIMIT 10;
--sample output:
user_id|activity_date|activity_type|post_type|
-------+-------------+-------------+---------+
4603670| 2021-12-01|edit |answer |
4603670| 2021-12-01|create |answer |
4603670| 2021-12-02|edit |answer |
4603670| 2021-12-02|edit |answer |
4603670| 2021-12-02|create |answer |
4603670| 2021-12-02|edit |question |
4603670| 2021-12-02|edit |answer |
4603670| 2021-12-03|edit |answer |
4603670| 2021-12-03|create |answer |
4603670| 2021-12-03|edit |question |
What we really want is to pivot data from rows into columns using Pattern 3 from
Chapter 2:
--listing 3.3
WITH post_activity AS (
SELECT
ph.post_id,
ph.user_id,
u.display_name AS user_name,
ph.creation_date AS activity_date,
CASE WHEN ph.post_history_type_id IN (1,2,3) THEN 'create'
WHEN ph.post_history_type_id IN (4,5,6) THEN 'edit'
END AS activity_type
FROM
post_history ph
INNER JOIN users u
ON u.id = ph.user_id
WHERE
TRUE
AND ph.post_history_type_id BETWEEN 1 AND 6
AND user_id > 0 --exclude automated processes
AND user_id IS NOT NULL --exclude deleted accounts
GROUP BY
1,2,3,4,5
),
post_types AS (
SELECT
id AS post_id,
'question' AS post_type,
FROM
posts_questions
UNION ALL
SELECT
id AS post_id,
'answer' AS post_type,
FROM
posts_answers
)
--continued below
SELECT
user_id,
CAST(pa.activity_date AS DATE) AS activity_dt,
SUM(CASE WHEN activity_type = 'create'
AND post_type = 'question' THEN 1 ELSE 0 END) AS question_create,
SUM(CASE WHEN activity_type = 'create'
AND post_type = 'answer' THEN 1 ELSE 0 END) AS answer_create,
SUM(CASE WHEN activity_type = 'edit'
AND post_type = 'question' THEN 1 ELSE 0 END) AS question_edit,
SUM(CASE WHEN activity_type = 'edit'
AND post_type = 'answer' THEN 1 ELSE 0 END) AS answer_edit
FROM post_activity pa
JOIN post_types pt ON pt.post_id = pa.post_id
WHERE user_id = 4603670
GROUP BY 1,2
LIMIT 10;
--sample output
user_id|activity_dt|question_create|answer_create|question_edit|answer_edit|
-------+-----------+---------------+-------------+-------------+-----------+
4603670| 2021-12-01| 0| 1| 0| 1|
4603670| 2021-12-02| 0| 1| 1| 3|
4603670| 2021-12-03| 0| 3| 1| 5|
4603670| 2021-12-04| 0| 2| 0| 6|
4603670| 2021-12-05| 0| 2| 0| 3|
4603670| 2021-12-06| 0| 3| 2| 9|
4603670| 2021-12-07| 0| 2| 3| 2|
4603670| 2021-12-08| 0| 2| 2| 6|
4603670| 2021-12-09| 0| 0| 1| 0|
4603670| 2021-12-10| 0| 1| 1| 1|
Sub-problem 2
Calculate comment metrics: comments made by the user and comments received on the user's posts.
--listing 3.4
WITH post_activity AS (
SELECT
ph.post_id,
ph.user_id,
u.display_name AS user_name,
ph.creation_date AS activity_date,
CASE WHEN ph.post_history_type_id IN (1,2,3) THEN 'create'
WHEN ph.post_history_type_id IN (4,5,6) THEN 'edit'
END AS activity_type
FROM
post_history ph
INNER JOIN users u
ON u.id = ph.user_id
WHERE
TRUE
AND ph.post_history_type_id BETWEEN 1 AND 6
AND user_id > 0 --exclude automated processes
AND user_id IS NOT NULL --exclude deleted accounts
GROUP BY
1,2,3,4,5
)
, comments_on_user_post AS (
SELECT
pa.user_id,
CAST(c.creation_date AS DATE) AS activity_date,
COUNT(*) as total_comments
FROM
comments c
INNER JOIN post_activity pa ON pa.post_id = c.post_id
WHERE
TRUE
AND pa.activity_type = 'create'
GROUP BY
1,2
)
--continued below
, comments_by_user AS (
SELECT
user_id,
CAST(creation_date AS DATE) AS activity_date,
COUNT(*) as total_comments
FROM
comments
GROUP BY
1,2
)
SELECT
c1.user_id,
c1.activity_date,
c1.total_comments AS comments_by_user,
c2.total_comments AS comments_on_user_post
FROM comments_by_user c1
LEFT OUTER JOIN comments_on_user_post c2
ON c1.user_id = c2.user_id
AND c1.activity_date = c2.activity_date
WHERE
c1.user_id = 4603670
LIMIT 10;
--sample output
user_id|activity_date|comments_by_user|comments_on_user_post|
-------+-------------+----------------+---------------------+
4603670| 2021-12-03| 3| 7|
4603670| 2021-12-05| 7| 1|
4603670| 2021-12-06| 9| 6|
4603670| 2021-12-08| 6| 7|
4603670| 2021-12-10| 4| 2|
4603670| 2021-12-11| 3| 6|
4603670| 2021-12-12| 2| 4|
4603670| 2021-12-13| 1| 1|
4603670| 2021-12-26| 1| 3|
4603670| 2021-12-24| 3| 2|
Sub-problem 3
Calculate vote metrics: upvotes and downvotes received on the user's posts.
--listing 3.5
WITH post_activity AS (
SELECT
ph.post_id,
ph.user_id,
u.display_name AS user_name,
ph.creation_date AS activity_date,
CASE WHEN ph.post_history_type_id IN (1,2,3) THEN 'create'
WHEN ph.post_history_type_id IN (4,5,6) THEN 'edit'
END AS activity_type
FROM
post_history ph
INNER JOIN users u
ON u.id = ph.user_id
WHERE
TRUE
AND ph.post_history_type_id BETWEEN 1 AND 6
AND user_id > 0 --exclude automated processes
AND user_id IS NOT NULL --exclude deleted accounts
GROUP BY
1,2,3,4,5
)
, votes_on_user_post AS (
SELECT
pa.user_id,
CAST(v.creation_date AS DATE) AS activity_date,
SUM(CASE WHEN vote_type_id = 2 THEN 1 ELSE 0 END) AS total_upvotes,
SUM(CASE WHEN vote_type_id = 3 THEN 1 ELSE 0 END) AS total_downvotes,
FROM
votes v
INNER JOIN post_activity pa ON pa.post_id = v.post_id
WHERE
TRUE
AND pa.activity_type = 'create'
GROUP BY
1,2
)
--continued below
SELECT
v.user_id,
v.activity_date,
v.total_upvotes,
v.total_downvotes
FROM
votes_on_user_post v
WHERE
v.user_id = 4603670
LIMIT 10;
--sample output:
user_id|activity_date|total_upvotes|total_downvotes|
-------+-------------+-------------+---------------+
4603670| 2021-12-02| 0| 1|
4603670| 2021-12-03| 3| 0|
4603670| 2021-12-05| 2| 0|
4603670| 2021-12-06| 5| 0|
4603670| 2021-12-07| 2| 0|
4603670| 2021-12-08| 2| 0|
4603670| 2021-12-09| 1| 0|
4603670| 2021-12-10| 0| 0|
4603670| 2021-12-11| 2| 0|
4603670| 2021-12-12| 1| 0|
By now you should start to see very clearly how the final result is constructed.
All we have to do is take the 3 results from the sub-problems and join them
together on user_id and activity_date. This will allow us to have a single table
with a granularity of one row per user and all the metrics aggregated on the day
level like this:
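A minimal sketch of that join (assuming the day-level results of the three sub-problems are wrapped in CTEs named post_metrics, comment_metrics and vote_metrics; the full query appears later in the book):

--hypothetical sketch of joining the three sub-problem results
SELECT
pm.user_id,
pm.activity_dt AS activity_date,
pm.question_create,
pm.answer_create,
pm.question_edit,
pm.answer_edit,
cm.comments_by_user,
cm.comments_on_user_post,
vm.total_upvotes,
vm.total_downvotes
FROM post_metrics pm
LEFT JOIN comment_metrics cm
ON cm.user_id = pm.user_id
AND cm.activity_date = pm.activity_dt
LEFT JOIN vote_metrics vm
ON vm.user_id = pm.user_id
AND vm.activity_date = pm.activity_dt;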
When you find yourself copying and pasting CTEs across multiple queries it’s
time to turn them into views or UDFs. Views are database objects that can be
queried with SQL just like a table.
The difference between the two is that views typically don’t contain any data.
They store a query that gets executed every time the view is queried (just like a
CTE).
I say “typically” because there are certain types of views that do contain data
(known as materialized views but we won’t cover them here).
Creating a view is easy:
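A minimal sketch of the syntax, using a hypothetical view name (the concrete example for our project appears in listing 3.7 below):

--hypothetical example: creating a simple view
CREATE OR REPLACE VIEW v_user_names AS
SELECT
id AS user_id,
display_name AS user_name
FROM
users;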
This view is now stored in the database but it doesn’t take up any space (unless
it’s materialized). It only stores the query which is executed each time you select
from the view or join the view in a query.
Views can be put inside of CTEs or can themselves contain CTEs, thus creating
multiple layers of modularity. Here’s an example of what that would look like.
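A minimal sketch of that layering, reusing the hypothetical v_user_names view from above:

--a view queried inside a CTE, just like a table
WITH named_users AS (
SELECT user_id, user_name
FROM v_user_names
)
SELECT COUNT(*)
FROM named_users;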
Side Note: By combining views and CTEs, you’re nesting many queries
within others. Not only does this negatively impact performance but some
databases have limits to how many levels of nesting you can have.
Instead of copying and pasting the same CTE in multiple places, you can turn it into a view and store
it in the database. What could be made into a view in our specific query?
I think the post_types CTE would be a good candidate. That way whenever you
have to combine all the post types you don’t have to use that CTE everywhere.
--listing 3.7
CREATE OR REPLACE VIEW v_post_types AS
SELECT
id AS post_id,
'question' AS post_type,
FROM
posts_questions
UNION ALL
SELECT
id AS post_id,
'answer' AS post_type,
FROM
posts_answers;
Similar to views, you can also put commonly used logic into UDFs (user-defined
functions). Pretty much all databases allow you to create UDFs, but they each
use different programming languages to do so. DuckDB offers
Python for such functionality. You can read about it in the DuckDB documentation.
Functions allow for a lot more flexibility in data processing. While tables and
views use set based logic (set algebra) for operating on data, functions allow
you to work on a single row at a time, use conditional flow of logic (if-then-else),
variables and loops which makes it easy to implement complex logic.
They can return a single scalar value or a table. A single scalar value can be
used for example to parse JSON formatted strings via regular expressions. Table
valued functions return a table instead of a single value.
They behave exactly like views but the main difference is that they can take
input parameters and return different result sets based on that. This can be very
useful.
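If you want parameterized, view-like behavior without leaving SQL, DuckDB's SQL macros are one option (a sketch against our schema; this uses macros rather than the Python UDFs mentioned above):

--a table macro: takes a parameter and returns a result set
CREATE OR REPLACE MACRO history_for_user(uid) AS TABLE
SELECT post_id, creation_date, post_history_type_id
FROM post_history
WHERE user_id = uid;

SELECT *
FROM history_for_user(4603670)
LIMIT 10;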
The SRP principle dictates that your modules should be small, self-contained
and have a single responsibility or purpose. For example you don’t expect the
GPS chip on your phone to also handle WiFi connectivity. The main benefit
of SRP is that it makes modules more composable and facilitates code reuse.
By organizing your code into well thought out “LEGO” blocks, writing complex
queries becomes infinitely easier. dbt makes SRP infinitely better as we’ll see in
a later chapter.
When you’re designing a query and breaking it up into CTEs, there is one principle
to keep in mind. Whenever possible, construct CTEs to ensure that they can be
reused in the query later.
Let’s take a look at the example from earlier:
--listing 3.8
WITH post_activity AS (
SELECT
ph.post_id,
ph.user_id,
u.display_name AS user_name,
ph.creation_date AS activity_date,
CASE WHEN ph.post_history_type_id IN (1,2,3) THEN 'create'
WHEN ph.post_history_type_id IN (4,5,6) THEN 'edit'
END AS activity_type
FROM
post_history ph
INNER JOIN users u
ON u.id = ph.user_id
WHERE
TRUE
AND ph.post_history_type_id BETWEEN 1 AND 6
AND user_id > 0 --exclude automated processes
AND user_id IS NOT NULL --exclude deleted accounts
GROUP BY
1,2,3,4,5
)
SELECT *
FROM post_activity;
We'll reuse this CTE both to attach post types to user activity and to join the comments and votes tables to user-level data via the post_id.
This is at the heart of a well-designed CTE. Notice here that we’re being very
careful about granularity multiplication! If we simply joined with post_activity
on post_id without specifying the activity_type we’d get duplication. By filtering
to include only created posts, since a post can only be created once, we're pretty
safe in getting a single row per post.
The DRY principle dictates that a piece of code encapsulating some functionality
must appear only once in a codebase. So if you find yourself copying and pasting
the same chunk of code everywhere your code is not DRY. The main benefit of
DRY code is maintainability. If your code isn't DRY and you need to change your
logic later, you have to change all the places where the code repeats instead of
a single place.
In the previous section we saw how we can decompose a large complex query
into multiple smaller components. The main benefit for doing this is that it makes
the queries more readable. In that same vein, the DRY (Don’t Repeat Yourself)
principle ensures that your query is clean from unnecessary repetition.
The DRY principle states that if you find yourself copy-pasting the same chunk of
code in multiple locations, you should put that code in a CTE and reference that
CTE where it’s needed.
To illustrate, let's rewrite the query from the previous section so that it still
produces the same result but clearly shows repeating code:
--listing 3.11
WITH post_activity AS (
SELECT
ph.post_id,
ph.user_id,
u.display_name AS user_name,
ph.creation_date AS activity_date,
CASE WHEN ph.post_history_type_id IN (1,2,3) THEN 'create'
WHEN ph.post_history_type_id IN (4,5,6) THEN 'edit'
END AS activity_type
FROM
post_history ph
INNER JOIN users u on u.id = ph.user_id
WHERE
TRUE
AND ph.post_history_type_id BETWEEN 1 AND 6
AND user_id > 0 --exclude automated processes
AND user_id IS NOT NULL --exclude deleted accounts
GROUP BY
1,2,3,4,5
)
, questions AS (
SELECT
id AS post_id,
'question' AS post_type,
pa.user_id,
pa.user_name,
pa.activity_date,
pa.activity_type
FROM
posts_questions q
INNER JOIN post_activity pa ON q.id = pa.post_id
)
, answers AS (
SELECT
id AS post_id,
'answer' AS post_type,
pa.user_id,
pa.user_name,
pa.activity_date,
pa.activity_type
FROM
posts_answers q
INNER JOIN post_activity pa ON q.id = pa.post_id
)
--continued below
SELECT
user_id,
CAST(activity_date AS DATE) AS activity_dt,
SUM(CASE WHEN activity_type = 'create'
AND post_type = 'question' THEN 1 ELSE 0 END) AS question_create,
SUM(CASE WHEN activity_type = 'create'
AND post_type = 'answer' THEN 1 ELSE 0 END) AS answer_create,
SUM(CASE WHEN activity_type = 'edit'
AND post_type = 'question' THEN 1 ELSE 0 END) AS question_edit,
SUM(CASE WHEN activity_type = 'edit'
AND post_type = 'answer' THEN 1 ELSE 0 END) AS answer_edit
FROM
(SELECT * FROM questions
UNION ALL
SELECT * FROM answers) AS p
WHERE
user_id = 4603670
GROUP BY 1,2
LIMIT 10;
--sample output
user_id|activity_dt|question_create|answer_create|question_edit|answer_edit|
-------+-----------+---------------+-------------+-------------+-----------+
4603670| 2021-12-01| 0| 1| 0| 1|
4603670| 2021-12-02| 0| 1| 1| 3|
4603670| 2021-12-03| 0| 3| 1| 5|
4603670| 2021-12-04| 0| 2| 0| 6|
4603670| 2021-12-05| 0| 2| 0| 3|
4603670| 2021-12-06| 0| 3| 2| 9|
4603670| 2021-12-07| 0| 2| 3| 2|
4603670| 2021-12-08| 0| 2| 2| 6|
4603670| 2021-12-09| 0| 0| 1| 0|
4603670| 2021-12-10| 0| 1| 1| 1|
This query will get you the same results as listing 3.3 you saw earlier, but notice
that the questions and answers CTEs both have almost identical code. What if
we had 10 different post types? You’d be copying and pasting a lot of code thus
repeating yourself. Also, the subquery that handles the UNION is not ideal.
When you find yourself implementing very specific logic in a model that might
be used elsewhere, move that logic upstream closer to the source of data. In
the world of DAGs, upstream has a very precise meaning. It means to move
potentially common logic onto earlier nodes in the graph because you never
know which downstream models might use it.
(Models here refer to dbt models, which will be covered in a separate chapter.)
Figure 3.5 - SQL DAG
With that out of the way let’s now look at some performance patterns.
Chapter 4: Performance Patterns
In this chapter we’re going to talk about query performance, aka how to make
your queries run faster. Why do we care about making queries run faster? Faster
queries get you results faster, obviously, but they also consume fewer resources,
making them cheaper on modern data warehouses.
This chapter isn’t just about speed however. There are many clever hacks to make
your queries run really fast, but many of them will make your code unreadable
and unmaintainable. We want to strike a balance between performance and
maintainability.
So far we’ve learned that using modularity via CTEs and views is the best way to
tackle complex queries. We also learned to keep our modules small and single purpose.
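Suppose we take the post_activity CTE from Chapter 3 and apply a date-range filter outside the CTE, in the outer query. A sketch of such a query (the sample output below is what a query like this produces):

--filter applied outside the CTE
WITH post_activity AS (
SELECT
ph.post_id,
ph.user_id,
u.display_name AS user_name,
ph.creation_date AS activity_date,
CASE WHEN ph.post_history_type_id IN (1,2,3) THEN 'create'
WHEN ph.post_history_type_id IN (4,5,6) THEN 'edit'
END AS activity_type
FROM
post_history ph
INNER JOIN users u
ON u.id = ph.user_id
WHERE
TRUE
AND ph.post_history_type_id BETWEEN 1 AND 6
AND user_id > 0 --exclude automated processes
AND user_id IS NOT NULL --exclude deleted accounts
GROUP BY
1,2,3,4,5
)
SELECT *
FROM post_activity
WHERE activity_date BETWEEN '2021-12-14' AND '2021-12-21'
LIMIT 10;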
--sample output
post_id |user_id |user_name |activity_date |activity_type|
--------+--------+----------------+-----------------------+-------------+
70401248|13437718|BGE34 |2021-12-18 05:50:33.917|edit |
70380038|17501206|vtable |2021-12-16 21:47:01.913|edit |
70387919|17697814|user17697814 |2021-12-17 02:55:13.043|create |
70364800|17436438|user17436438 |2021-12-15 13:48:18.577|create |
70382506|12327190|TalGav |2021-12-16 16:31:44.240|create |
70401589| 5708566|windowsill |2021-12-18 07:05:07.927|create |
70401645| 8331542|Saad Abdul Majid|2021-12-18 07:17:10.987|create |
70418579| 4925718|msefer |2021-12-20 07:25:11.413|create |
70362252| 4925718|msefer |2021-12-15 13:35:49.967|edit |
70362983| 4925718|msefer |2021-12-20 07:13:06.500|edit |
This is a correct way to filter the results and it may even be performant in our
small database using the blazingly fast DuckDB engine, but it's better if we can
filter data inside the CTE rather than outside. Sometimes that's by design; for
example we might want a rolling window of just the current week's post activity:
--listing 4.2
WITH post_activity AS (
SELECT
ph.post_id,
ph.user_id,
u.display_name AS user_name,
ph.creation_date AS activity_date,
CASE WHEN ph.post_history_type_id IN (1,2,3) THEN 'create'
WHEN ph.post_history_type_id IN (4,5,6) THEN 'edit'
END AS activity_type
FROM
post_history ph
INNER JOIN users u
ON u.id = ph.user_id
WHERE
TRUE
AND ph.post_history_type_id BETWEEN 1 AND 6
AND user_id > 0 --exclude automated processes
AND user_id IS NOT NULL --exclude deleted accounts
AND activity_date BETWEEN '2021-12-14' AND '2021-12-21'
GROUP BY
1,2,3,4,5
)
SELECT *
FROM post_activity
LIMIT 10;
--sample output
post_id |user_id |user_name |activity_date |activity_type|
--------+--------+----------------+-----------------------+-------------+
70401248|13437718|BGE34 |2021-12-18 05:50:33.917|edit |
70380038|17501206|vtable |2021-12-16 21:47:01.913|edit |
70387919|17697814|user17697814 |2021-12-17 02:55:13.043|create |
70364800|17436438|user17436438 |2021-12-15 13:48:18.577|create |
70382506|12327190|TalGav |2021-12-16 16:31:44.240|create |
70401589| 5708566|windowsill |2021-12-18 07:05:07.927|create |
70401645| 8331542|Saad Abdul Majid|2021-12-18 07:17:10.987|create |
70418579| 4925718|msefer |2021-12-20 07:25:11.413|create |
70362252| 4925718|msefer |2021-12-15 13:35:49.967|edit |
70362983| 4925718|msefer |2021-12-20 07:13:06.500|edit |
Moving the WHERE clause filter inside the CTE is an example of filtering data as
early as possible. We might use that CTE several times and it will make our query
more performant if we do.
Almost every SQL book or course will tell you to start exploring a table by doing:
--listing 4.3
SELECT *
FROM posts_questions
LIMIT 10;
This may be ok in a traditional RDBMS, but with modern data warehouses things are different. Because they store data in columns instead of rows, SELECT * will scan the entire table and your query will be slower even if you limit it to 10 rows.
Here's an example you've seen before. In the post_types CTE we select only the id column, which is the only one we need to join with post_activity on. The post_type is a static value whose performance cost is negligible.
--code snippet will not run
--listing 4.4
,post_types AS (
SELECT
id AS post_id,
'question' AS post_type,
FROM
posts_questions
UNION ALL
SELECT
id AS post_id,
'answer' AS post_type,
FROM
posts_answers
)
Compared to:
--code snippet will not run
--listing 4.5
,post_types AS (
SELECT
pq.*,
id AS post_id,
'question' AS post_type,
FROM
posts_questions pq
UNION ALL
SELECT
pa.*,
id AS post_id,
'answer' AS post_type,
FROM
posts_answers pa
)
It may seem innocent at first, but if any of those tables contained 300 columns,
now you’ll be selecting all 300 of them every time you join on those CTEs. You
don’t have to know anything about databases to know that the query will be
much slower than if you selected a subset of columns.
As a rule of thumb you should AVOID any kind of sorting inside production level
queries. Sorting is a very expensive operation, especially for really large tables
and it will dramatically slow down your queries.
What’s worse, if you add an ORDER BY operation in your CTEs or views, anytime
you join with that CTE or view, the database engine will be forced to sort data
every time before joining. That will make your queries crawl!
Sorting is best left to reporting and BI tools if it’s not needed, or done at the very
end, if it is at all necessary. You can’t always avoid it though. Window functions
for example necessitate sorting in order to choose the top row. We’ll see an
example of this later.
For example, the following is unnecessary and slows down performance because the sorting is done inside a CTE. You don't need to sort your data yet.
--code snippet will not run
--listing 4.6
, votes_on_user_post AS (
SELECT
pa.user_id,
CAST(DATE_TRUNC('day', v.creation_date) AS DATE) AS activity_date,
SUM(CASE WHEN vote_type_id = 2 THEN 1 ELSE 0 END) AS total_upvotes,
SUM(CASE WHEN vote_type_id = 3 THEN 1 ELSE 0 END) AS total_downvotes,
FROM
votes v
INNER JOIN post_activity pa ON pa.post_id = v.post_id
WHERE
TRUE
AND pa.activity_type = 'create'
AND v.creation_date BETWEEN '2021-12-14' AND '2021-12-21'
GROUP BY
1,2
ORDER BY
activity_date
)
SELECT DISTINCT is a code smell for me. Whenever I see it, I suspect the programmer is trying to hide data problems without fixing them. It's such a common catchall fix that a meme I posted about it blew up on both Twitter/X and LinkedIn.
Now imagine if DISTINCT is coded inside of a view and that view gets used multiple
times downstream. Those operations will be performed every time you join on
that view.
If you must use it, make sure you materialize the query into a table that gets
refreshed regularly with a tool like dbt. That way your results are clean and the
DISTINCT operation is performed once.
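With dbt, for example, materializing such a deduplicated query as a table is a one-line config; a minimal sketch (the model name and column choices are illustrative):
--hypothetical dbt model, e.g. models/deduped_post_activity.sql
{{ config(materialized='table') }}

SELECT DISTINCT
    post_id,
    user_id
FROM
    {{ ref('post_history') }}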
The most insidious application of DISTINCT I have personally dealt with is when
combining multiple tables via the UNION operator. As discussed in Chapter 2
Pattern 4, the UNION operator will append data and ensure uniqueness of the
results.
In this case I had inadvertently used UNION instead of UNION ALL and when I
fixed it, query execution went from 15 minutes down to 1 minute while the result
was identical.
Here’s my original query:
--original code
with cte_union_source_data as (
select
column1,
column2,
count(*) as total
from source_table1
group by 1, 2
union
select
column1,
column2,
count(*) as total
from source_table2
group by 1, 2
union
select
column1,
column2,
count(*) as total
from source_table3
group by 1, 2
)
select
column1,
column2,
sum(total) as total
from
cte_union_source_data
group by 1, 2;
It’s pretty straightforward. I was aggregating the results from multiple tables
inside a CTE then summing everything up. By using UNION I was guaranteeing
uniqueness of the results before the final aggregation. This query was taking 15
minutes.
Once I realized my mistake, I changed it to this:
-- refactored code
with cte_union_source_data as (
select
column1,
column2
from
source_table1
union all
select
column1,
column2
from
source_table2
union all
select
column1,
column2
from
source_table3
)
select
column1,
column2,
count(*) as total
from
cte_union_source_data
group by 1, 2;
Now I'm simply appending all the results, including any duplicates, and then aggregating them. Apart from being 15x faster, because we're only doing one aggregation and avoiding the deduplication step that UNION performs, this query is simpler and more compact.
Here's an example with our database. Suppose I'm trying to get the total user activity (i.e. posts created, edited and commented on). My original query looked like this:
--listing 4.15
WITH cte_user_activity_by_type AS (
SELECT
user_id,
CASE WHEN post_history_type_id IN (1,2,3) THEN 'create'
WHEN post_history_type_id IN (4,5,6) THEN 'edit'
END AS activity_type,
COUNT(*) as total_activity
FROM
post_history
GROUP BY
1,2
UNION
SELECT
user_id,
'commented' AS activity_type,
COUNT(*) as total_activity
FROM
comments
GROUP BY
1,2
)
SELECT
user_id,
sum(total_activity) as total_activity
FROM
cte_user_activity_by_type
GROUP BY 1
LIMIT 10;
--sample output
user_id |total_activity|
--------+--------------+
3690518| 2|
3439894| 37|
5454021| 4|
14391494| 10|
7069126| 9|
433351| 4|
2186184| 6|
12579274| 11|
15821771| 22|
752843| 16|
Notice how I'm aggregating twice inside the CTE and appending the results using UNION instead of UNION ALL. While the final result is correct because I sum the total activity afterwards, the aggregation inside the CTE is unnecessary.
We could rewrite the query using UNION ALL while simultaneously avoiding the expensive aggregations, like this:
--listing 4.16
WITH cte_user_activity_by_type AS (
SELECT
user_id,
CASE WHEN post_history_type_id IN (1,2,3) THEN 'create'
WHEN post_history_type_id IN (4,5,6) THEN 'edit'
END AS activity_type
FROM
post_history
UNION ALL
SELECT
user_id,
'comment' AS activity_type
FROM
comments
)
SELECT
user_id,
COUNT(*) as total_activity
FROM
cte_user_activity_by_type
GROUP BY 1
LIMIT 10;
In case you didn’t know, you can put anything in the WHERE clause. You already
know about filtering on dates, numbers and strings of course but you can also
filter by calculations, functions, CASE statements, etc. WHERE clauses can get
quite complicated.
When you compare a column to a fixed value or to another column, the query optimizer can filter down to the relevant rows much faster. When you use a function or a complicated formula, the optimizer needs to scan the entire table and perform that calculation before doing the filtering.
This is negligible for small tables, but when dealing with millions of rows query performance will suffer. Let's see some examples. The tags column in both posts_questions and posts_answers stores the topics a post covers. Here's what it looks like:
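A simple query like the following (a sketch of what presumably produced the sample below) shows the column:
--sketch of the kind of query behind the sample output below
SELECT
    q.id AS post_id,
    q.creation_date,
    q.tags
FROM
    posts_questions q
LIMIT 10;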
--sample output
post_id |creation_date |tags |
--------+-----------------------+-------------------------------------+
70177589|2021-12-01 00:02:03.777|blockchain|nearprotocol|near|nearcore|
70177596|2021-12-01 00:02:52.657|google-oauth|google-workspace |
70177598|2021-12-01 00:03:16.373|python|graph|networkx |
70177601|2021-12-01 00:03:32.413|elasticsearch |
70177623|2021-12-01 00:06:16.950|python|tkinter |
70177624|2021-12-01 00:06:19.537|c# |
70177627|2021-12-01 00:07:50.607|flutter |
70177629|2021-12-01 00:08:02.943|python|python-3.x|pexpect |
70177630|2021-12-01 00:08:16.173|sql|sql-server|tsql |
70177633|2021-12-01 00:08:46.233|sql|sql-server|tsql |
The tags pertain to the list of topics or subjects that a post is about. One of the tricky things about storing tags like this is that you can't rely on the order in which they appear. There's no categorization system here; a tag can appear anywhere in the string.
Suppose we're looking for posts mentioning SQL. How would we do it? I'm pretty sure you're familiar with pattern matching strings in SQL using the keyword LIKE.
But since we don't know if the string is capitalized (i.e. it could be SQL, sql, Sql, etc.) and we want to match all of them, it's common to use a function like LOWER() to force the case before matching the pattern.
Here's an example of what NOT to do (unless you're doing ad-hoc querying):
--listing 4.8
SELECT
q.id AS post_id,
q.creation_date,
q.tags
FROM
posts_questions q
WHERE
TRUE
AND lower(tags) like '%sql%'
LIMIT 10;
Here’s how to get the same result without using functions in WHERE
--listing 4.9
SELECT
q.id AS post_id,
q.creation_date,
q.tags
FROM
posts_questions q
WHERE
TRUE
AND tags ilike '%sql%'
LIMIT 10;
In our small database this query will be quite fast; however, by using the function LOWER() in the WHERE clause, you're causing the database engine to scan the entire table, perform the lowercase operation and then perform the filtering. By using the keyword ILIKE (which makes the pattern match case-insensitive) we avoid using LOWER() altogether.
Alternatively, you can apply LOWER() beforehand in a CTE or view like this:
--listing 4.10
WITH cte_lowercase_tags AS (
SELECT
q.id AS post_id,
q.creation_date,
LOWER(q.tags) as tags
FROM
posts_questions q
)
SELECT *
FROM cte_lowercase_tags
WHERE tags LIKE '%sql%'
LIMIT 10;
--sample output
post_id |creation_date |tags |
--------+-----------------------+--------------------------+
70338059|2021-12-13 16:46:16.940|mysql|node.js|sequelize.js|
70276304|2021-12-08 14:02:39.313|sql-order-by|where-clause |
70341363|2021-12-13 21:50:42.510|php|mysql |
70218001|2021-12-03 16:54:34.417|windows|postgresql |
70287562|2021-12-09 09:35:49.333|database|psql |
70292467|2021-12-09 15:25:07.093|mysql |
70316036|2021-12-11 14:37:31.220|python|sqlalchemy |
70239290|2021-12-05 22:56:40.487|javascript|sqlite |
70274207|2021-12-08 11:26:41.477|sql|rest|td-engine |
70192916|2021-12-02 00:33:41.363|sql|spring|spring-boot |
I mentioned earlier that this is not advisable, but in this case, if you really need to lowercase the tags, it's another option. Ideally we can prepare data ahead of time so that production-level tables contain strings with a consistent case. You can do that with a tool like dbt, where you materialize the lowercase tags into a table to make downstream querying much easier.
Let's look at a few more examples. In this query we're trying to filter by performing a math operation in the WHERE clause. The same thing applies: the database performs a full table scan before filtering.
--listing 4.11
SELECT
q.id AS post_id,
q.creation_date,
q.answer_count + q.comment_count as total_activity
FROM
posts_questions q
WHERE
TRUE
AND answer_count + comment_count >= 10
LIMIT 10;
--sample output
post_id |creation_date |total_activity|
--------+-----------------------+--------------+
70270242|2021-12-08 05:09:48.113| 10|
70255288|2021-12-07 05:19:45.337| 12|
70256716|2021-12-07 08:04:30.497| 10|
70318632|2021-12-11 20:10:08.213| 12|
70334900|2021-12-13 12:45:37.097| 11|
70333905|2021-12-13 11:29:00.117| 14|
70237681|2021-12-05 19:13:40.890| 10|
70257087|2021-12-07 08:38:39.263| 10|
70281346|2021-12-08 20:29:31.357| 13|
70190971|2021-12-01 20:43:14.507| 12|
As before, we can move the calculation into a CTE and filter on the computed column:
--listing 4.12
WITH cte_total_activity AS (
SELECT
q.id AS post_id,
q.creation_date,
q.answer_count + q.comment_count as total_activity
FROM
posts_questions q
)
SELECT *
FROM cte_total_activity
WHERE total_activity >= 10
LIMIT 10;
--sample output
post_id |creation_date |total_activity|
--------+-----------------------+--------------+
70270242|2021-12-08 05:09:48.113| 10|
70255288|2021-12-07 05:19:45.337| 12|
70256716|2021-12-07 08:04:30.497| 10|
70318632|2021-12-11 20:10:08.213| 12|
70334900|2021-12-13 12:45:37.097| 11|
70333905|2021-12-13 11:29:00.117| 14|
70237681|2021-12-05 19:13:40.890| 10|
70257087|2021-12-07 08:38:39.263| 10|
70281346|2021-12-08 20:29:31.357| 13|
70190971|2021-12-01 20:43:14.507| 12|
Let's look at another common example with date functions. You often want to filter a date field by week, month, quarter, etc. It's quite common to see queries that apply a date part function in the WHERE clause so you can filter to the proper week, like below. Here we want only the questions posted in week 50.
--listing 4.13
SELECT
q.id AS post_id,
q.creation_date,
DATE_PART('week', creation_date) as week_of_year
FROM
posts_questions q
WHERE
DATE_PART('week', creation_date) = 50
LIMIT 10;
--sample output
post_id |creation_date |week_of_year|
--------+-----------------------+------------+
70337022|2021-12-13 15:25:08.903| 50|
70338059|2021-12-13 16:46:16.940| 50|
70348470|2021-12-14 11:56:02.373| 50|
70347796|2021-12-14 11:02:31.563| 50|
70347279|2021-12-14 10:24:40.953| 50|
70337072|2021-12-13 15:28:32.317| 50|
70328850|2021-12-13 00:35:38.387| 50|
70332341|2021-12-13 09:22:07.927| 50|
70333562|2021-12-13 11:00:05.760| 50|
70341363|2021-12-13 21:50:42.510| 50|
With dates we can be a little clever and avoid using DATE_PART() in the WHERE clause. We can dynamically calculate the start and end dates of week 50 and then filter directly on creation_date. Note that applying DATE_TRUNC() to a static value (like 2021-01-01) is really fast. The same applies if you use scalar functions that return a single value (e.g. CURRENT_DATE()).
--listing 4.14
SELECT
q.id AS post_id,
q.creation_date,
DATE_PART('week', creation_date) as week_of_year
FROM
posts_questions q
WHERE
creation_date >= DATE_TRUNC('week', '2021-01-01'::date + INTERVAL 50 WEEK)
AND creation_date < DATE_TRUNC('week', '2021-01-01'::date + INTERVAL 51 WEEK)
LIMIT 10;
--sample output
post_id |creation_date |week_of_year|
--------+-----------------------+------------+
70337022|2021-12-13 15:25:08.903| 50|
70338059|2021-12-13 16:46:16.940| 50|
70348470|2021-12-14 11:56:02.373| 50|
70347796|2021-12-14 11:02:31.563| 50|
70347279|2021-12-14 10:24:40.953| 50|
70337072|2021-12-13 15:28:32.317| 50|
70328850|2021-12-13 00:35:38.387| 50|
70332341|2021-12-13 09:22:07.927| 50|
70333562|2021-12-13 11:00:05.760| 50|
70341363|2021-12-13 21:50:42.510| 50|
Using OR in the WHERE clause can be quite natural based on the logic you're trying to implement, but I bet you didn't know there are hidden performance "gotchas" if you do. They're not very obvious either, so let me show you.
If you use OR to search for multiple values of the same column, there will be no
performance issues. In fact you already do this without realizing it.
Let’s see an example. This query will get all the created posts:
--listing 4.17
SELECT
post_id,
creation_date,
user_id
FROM
post_history
WHERE
post_history_type_id IN (1,2,3);
The IN operator is just shorthand for multiple OR conditions on the same column. The trouble starts when you use OR to combine conditions on different columns, especially across joined tables, like this:
--listing 4.20
SELECT
ph.post_id,
ph.creation_date,
u.display_name
FROM
post_history ph
INNER JOIN users u
ON u.id = ph.user_id
WHERE
ph.post_history_type_id = 1 OR u.up_votes >= 100;
When I see a query like this, I immediately know it will cause problems. It might
be fast in our tiny database with a fast engine like DuckDB but when you throw
millions of rows at it, you will see performance degradation.
What happens is that the database engine will most likely perform the two separate filtering operations and then combine the results via a join. But there's good news! You can rewrite the above query using UNION ALL and get the same result, often with a 10x - 100x performance improvement, as long as the second branch excludes the rows already returned by the first. Here it is:
--listing 4.21
SELECT
post_id,
ph.creation_date,
user_id
FROM
post_history ph
INNER JOIN users u
ON u.id = ph.user_id
WHERE
post_history_type_id = 1
UNION ALL
SELECT
post_id,
ph.creation_date,
user_id
FROM
post_history ph
INNER JOIN users u
ON u.id = ph.user_id
WHERE
u.up_votes >= 100
AND ph.post_history_type_id <> 1; --exclude rows already returned by the first branch
What we've done here is separate the two filtering conditions into their own queries and then combine the results. This often allows the database engine to parallelize the filtering operations and then simply append the results, which is a lot faster.
With that we wrap up our chapter on query performance. There’s a lot more
to learn about improving query performance but that’s not the purpose of this
book. In the next chapter we’ll cover how to make your queries robust against
unexpected changes in the underlying data.
Chapter 5: Robustness Patterns
In this chapter we’re going to talk about how to protect your queries against
most data problems you’ll encounter. Robustness means that your query will
not break if the underlying data changes unpredictably.
Spend enough time working with real world data and you’ll eventually get
burned by one of these. But when you know about them ahead of time you can
write defensive code.
Side Note: I don’t believe in the term “dirty data.” There’s no such thing. I
prefer the terms “fit for purpose” and “unfit for purpose.” Most real world
data is unfit for the purposes you want so you have to “retrofit” it to make
it suitable. Retrofitting is a much more suitable term for data preparation
because it avoids blame. Data that’s fit for purpose can be used as is.
We cannot know exactly how data will change, but we CAN foresee many of the patterns in how data changes and write our queries to protect against them.
Here are some of the patterns of data changes:
1. New columns are added that have NULL values for past data
2. Existing columns that didn’t have NULLs before now contain NULLs
3. Columns that contained numbers or dates stored as strings now contain
unexpected values
4. The formatting of dates or numbers gets messed up and type conversion
fails.
5. The denominator in a ratio calculation becomes zero and now we’re
dividing by zero
SQL supports three primitive data types: strings, numbers and dates. They allow for mathematical operations with numbers, calendar operations with dates and many types of string operations.
It’s quite common to see numbers and dates stored as strings, especially when
you’re loading flat text files like CSVs or TSVs. Some data loading tools will try
and guess the type and format it on the fly but they’re not always correct. So you
will often have to manually convert dates and numbers.
The standard function for converting data in SQL is CAST(). Some database implementations, like SQL Server, also have their own custom function called CONVERT() but support CAST() as well. We will use CAST() to convert both between types (like string to date) and within the same type (like a timestamp to a date).
Here’s an example of how type conversion works:
--listing 5.1
SELECT CAST('2021-12-01' as DATE);
CAST('2021-12-01' AS DATE)|
--------------------------+
2021-12-01|
That should work in most cases, but there are always exceptions. Suppose the month value is invalid:
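Here's a sketch of the failing conversion (the original listing 5.2 was presumably similar):
--presumably listing 5.2: an invalid month breaks the conversion
SELECT CAST('2021-13-01' as DATE);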
Obviously there’s no 13th month so we get an error. What if the date was fine but
the formatting was bad?
--listing 5.3
SELECT CAST('2021-12--01' as DATE);
The extra dash in this case messes up automatic conversion, but the date itself
was correct. What if you try to convert a string to a number and the data is not
numeric?
--listing 5.4
SELECT CAST('2o21' as INT);
So how do we deal with these issues? Let’s have a look at some patterns.
One of the easiest ways to deal with formatting issues when converting data is to
simply ignore bad formatting. What this means is we simply skip the malformed
rows when querying data.
This works great in cases when the error is unfixable or occurs very rarely. So if a
few rows out of 10 million are malformed and can’t be fixed we can skip them.
However the CAST() function will fail if it encounters an issue, thus breaking the
query, and we want our query to be robust. To deal with this problem some
databases introduce “safe” casting functions like SAFE_CAST() or TRY_CAST().
Note: Not all databases provide this function. PostgreSQL, for example, doesn't have built-in safe casting, but it can be built as a custom user-defined function (UDF).
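If you're on PostgreSQL, a minimal sketch of such a UDF (the function name is only an illustration) could look like this:
-- hypothetical PostgreSQL UDF mimicking TRY_CAST for dates
CREATE OR REPLACE FUNCTION try_cast_date(p_text TEXT)
RETURNS DATE AS
$$
BEGIN
    RETURN p_text::DATE;
EXCEPTION WHEN OTHERS THEN
    RETURN NULL; -- swallow the conversion error and return NULL instead
END;
$$ LANGUAGE plpgsql;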
SAFE_CAST() and TRY_CAST() are designed to return NULL if the conversion fails instead of breaking the query. We can then handle the NULL with COALESCE() to replace the bad values with a sensible default.
DuckDB uses TRY_CAST() so let’s see it in action:
--listing 5.5
SELECT TRY_CAST('2021-12--01' as DATE) AS dt;
dt|
------+
NULL |
If we want to skip the incorrect values we can leave it as is. If, however, we don't want to lose the bad rows, we can replace the values using COALESCE():
--listing 5.6
SELECT COALESCE(TRY_CAST('2o21' as INT), 0) AS year;
year|
----+
0|
While ignoring incorrect data is easy, you can't always get away with it. Sometimes you need to extract the actual data by finding patterns in how the formatting is broken and fixing them using string parsing functions. Let's see some examples.
Suppose that some of the rows of dates had extra dashes like this:
2021-12--01
2021-12--02
2021-12--03
2021-12--04
Since this is a recurring format, we can use string parsing functions to remove
the extra dash and then do the conversion like this:
--listing 5.7
WITH dates AS (
SELECT '2021-12--01' AS dt
UNION ALL
SELECT '2021-12--02' AS dt
UNION ALL
SELECT '2021-12--03' AS dt
UNION ALL
SELECT '2021-12--04' AS dt
UNION ALL
SELECT '2021-12--05' AS dt
)
SELECT TRY_CAST(SUBSTRING(dt, 1, 4) || '-' ||
SUBSTRING(dt, 6, 2) || '-' ||
SUBSTRING(dt, 10, 2) AS DATE) AS date_field
FROM dates;
date_field|
----------+
2021-12-01|
2021-12-02|
2021-12-03|
2021-12-04|
2021-12-05|
So as you can see in this example, we took advantage of the regularity of the incorrect formatting to extract the year, month and day from the rows and reconstruct the correct format by concatenating strings via the || operator.
What if you have different types of irregularities in your data? In some cases if
information is aggregated from multiple sources you might have to deal with
mixed formatting.
Let’s take a look at an example:
dt |
-----------+
2021-12--01|
2021-12--02|
2021-12--03|
12/04/2021 |
12/05/2021 |
Obviously we can’t force the same formatting for all the dates here so we’ll have
to split it up using the CASE statement like this:
--listing 5.8
WITH dates AS (
SELECT '2021-12--01' AS dt
UNION ALL
SELECT '2021-12--02' AS dt
UNION ALL
SELECT '2021-12--03' AS dt
UNION ALL
SELECT '12/04/2021' AS dt
UNION ALL
SELECT '12/05/2021' AS dt
)
SELECT TRY_CAST(CASE WHEN dt LIKE '%-%--%'
THEN SUBSTRING(dt, 1, 4) || '-' ||
SUBSTRING(dt, 6, 2) || '-' ||
SUBSTRING(dt, 10, 2)
WHEN dt LIKE '%/%/%'
THEN SUBSTRING(dt, 7, 4) || '-' ||
SUBSTRING(dt, 1, 2) || '-' ||
SUBSTRING(dt, 4, 2)
END AS DATE) AS date_field
FROM dates;
--sample output
date_field|
----------+
2021-12-01|
2021-12-02|
2021-12-03|
2021-12-04|
2021-12-05|
Notice how we’re separating rows with different formatting using the CASE and
LIKE operators to handle each of them differently. You can repeat this pattern as
many times as you want to handle each different format.
The same pattern works for numbers stored as strings with units attached, like these weights in pounds and kilograms:
--listing 5.9
WITH weights AS (
SELECT '32.5lb' AS wt
UNION ALL
SELECT '45.2lb' AS wt
UNION ALL
SELECT '53.1lb' AS wt
UNION ALL
SELECT '77kg' AS wt
UNION ALL
SELECT '68kg' AS wt
)
SELECT
TRY_CAST(CASE WHEN wt LIKE '%lb' THEN SUBSTRING(wt, 1, INSTR(wt, 'lb')-1)
WHEN wt LIKE '%kg' THEN SUBSTRING(wt, 1, INSTR(wt, 'kg')-1)
END AS DECIMAL) AS weight,
CASE WHEN wt LIKE '%lb' THEN 'LB'
WHEN wt LIKE '%kg' THEN 'KG'
END AS unit
FROM weights;
--sample output
weight|unit|
------+----+
32.500|LB |
45.200|LB |
53.100|LB |
77.000|KG |
68.000|KG |
I'm using the SUBSTRING() function again to extract parts of a string, and I use the INSTR() function, which searches for a string within another string and returns the position of its first occurrence (or 0 if not found), to tell SUBSTRING() how many characters to read.
NULLs in SQL represent unknown values. While the data may appear to be blank
or empty in the results, it’s not the same as an empty string or white space. The
reason we want to handle them is because they cause issues when it comes
to comparing fields or joining data. They might confuse users, so as a general
pattern you should replace NULLs with predetermined default values.
One of my favorite rules of thumb is to always use a LEFT JOIN when I’m not sure
if one table is a subset of the other.
For example, in the query below we use a left join with the static table post_history_type_mapping because we're not sure how the post_history_type_id values might change.
We might have new mappings being created that we haven't added to our lookup table yet, and we don't want to limit our final results unknowingly. By the way, this query is part of our dbt project and is explained in Chapter 7.
--listing 5.10
SELECT
id,
post_id,
post_history_type_id,
revision_guid,
user_id,
COALESCE(m.activity_type, 'unknown') AS activity_type,
COALESCE(m.grouped_activity_type, 'unknown') AS grouped_activity_type,
COALESCE(creation_date, '1900-01-01') AS creation_date,
COALESCE(text, 'unknown') AS text,
COALESCE(comment, 'unknown') AS comment
FROM
{{ ref('post_history') }} ph
LEFT JOIN {{ ref('post_history_type_mapping') }} m
ON ph.post_history_type_id = m.post_history_type_id
As a rule, you should always assume any column can be NULL at any point in
time so it’s a good idea to provide a default value for that column as part of your
SELECT. This way you make sure that even if your data becomes NULL your query
will not fail.
For strings you might use default values such as NA, Not Provided, Not Available,
etc. Dates and numbers are trickier. For a date field you might use a default
value such as 1900-01-01 and that’s a safe enough signal that the data is not
available.
Doing this however could mess up age calculations, especially if the age is later
averaged, so be careful where you use it. Same thing applies to using a default
value like 0, -1, or 9999 for numbers. It might make sense when the column
cannot be 0 or negative, but not always.
You do this by using COALESCE() as described earlier:
--listing 5.11
SELECT
id,
COALESCE(display_name, 'unknown') AS user_name,
COALESCE(about_me, 'unknown') AS about_me,
COALESCE(age, 'unknown') AS age,
COALESCE(creation_date, '1900-01-01') AS creation_date,
COALESCE(last_access_date, '1900-01-01') AS last_access_date,
COALESCE(location, 'unknown') AS location,
COALESCE(reputation, 0) AS reputation,
COALESCE(up_votes, 0) AS up_votes,
COALESCE(down_votes, 0) AS down_votes,
COALESCE(views, 0) AS views,
COALESCE(profile_image_url, 'unknown') AS profile_image_url,
COALESCE(website_url, 'unknown') AS website_url
FROM
users
LIMIT 10;
Since id is the primary key in this table it can't be NULL, so we're not handling it here, but we do handle every other column regardless of whether it currently contains NULLs or not.
When you calculate ratios you must always handle potential division by zero.
Your query might work when you first test it, but if the denominator ever becomes
zero it will fail.
The easiest way to handle this is by excluding zero values in the denominator.
This will work fine but it will also filter out rows which could be needed.
Here’s an example:
--listing 5.12
WITH cte_test_data AS (
SELECT 94 as comments_on_post, 38 as posts_created
UNION ALL
SELECT 62, 0
UNION ALL
SELECT 39, 20
UNION ALL
SELECT 34, 19
UNION ALL
SELECT 167, 120
UNION ALL
SELECT 189, 48
UNION ALL
SELECT 96, 17
UNION ALL
SELECT 15, 15
)
SELECT
ROUND(CAST(comments_on_post AS NUMERIC) /
CAST(posts_created AS NUMERIC), 1) AS comments_on_post_per_post
FROM
cte_test_data
WHERE
posts_created > 0;
--sample output
comments_on_post_per_post|
-------------------------+
2.5|
2.0|
1.8|
1.4|
3.9|
5.6|
1.0|
The best way to handle division by zero without filtering out rows is to use a
CASE statement. While this will work, there are other options. Cloud warehouses
like BigQuery offer a SAFE_DIVIDE() function which returns NULL in the case of
divide-by-zero error.
Then you simply deal with NULL values using COALESCE() like above. Snowflake
offers a similar function called DIV0() which automatically returns 0 if there’s a
division by zero error. DuckDB on the other hand seems to handle divide by zero
directly without throwing an error.
Here’s an example:
--listing 5.13
WITH cte_test_data AS (
SELECT 94 as comments_on_post, 38 as posts_created
UNION ALL
SELECT 62, 0
UNION ALL
SELECT 39, 20
UNION ALL
SELECT 34, 19
UNION ALL
SELECT 167, 120
UNION ALL
SELECT 189, 48
UNION ALL
SELECT 96, 17
UNION ALL
SELECT 15, 15
)
SELECT
CASE
WHEN posts_created > 0 THEN
ROUND(CAST(comments_on_post AS NUMERIC) /
CAST(posts_created AS NUMERIC), 1)
ELSE 0
END AS comments_on_post_per_post
FROM
cte_test_data;
I said earlier that strings are the easiest way to store any kind of data (numbers,
dates, strings) but strings also have their own issues, especially when you’re
trying to join on a string field.
Here are some issues you’ll undoubtedly run into with strings.
1. Inconsistent casing
2. Space padding
3. Unexpected characters
Many databases are case sensitive so if the same string is stored with different
cases it will not match when doing a join. Let’s see an example:
--listing 5.14
SELECT 'string' = 'String' AS test;
test |
-----+
false|
As you can see, the different case causes the test to return FALSE. The only way to deal with this problem when joining on strings or matching patterns is to convert all fields to upper or lower case.
--listing 5.15
SELECT LOWER('string') = LOWER('String') AS test;
test|
----+
true|
Space padding is the other common issue you'll deal with when working with strings.
--listing 5.16
SELECT 'string' = ' string' AS test;
test |
-----+
false|
You deal with this by using the TRIM() function which removes all the leading and
trailing spaces.
--listing 5.17
SELECT TRIM('string') = TRIM(' string') AS test;
test|
----+
true|
As for handling unexpected characters, you'll first need to figure out how they appear and then fix them using the REPLACE() function. This can vary a lot, but usually you'll want to replace the offending characters with an empty string.
Here’s an example:
--listing 5.18
SELECT REPLACE(TRIM(LOWER('String//')), '/', '') = TRIM(LOWER(' string')) AS test;
test|
----+
true|
Schema changes are one of the most common issues with source data. Whether
the changes came from your internal engineering team or an external party, you
should have ways to deal with them gracefully.
The data interface pattern states that you should have a single point of entry
between external data and your workflow. This means that all external tables
should have an internal table or view that “translates” their columns into
meaningful equivalents and all queries downstream depend on the internal
table or view.
Here's an example of such an internal view: it translates the raw post_history columns into meaningful names and applies the robustness patterns from earlier in this chapter:
--listing 5.20
SELECT
ph.id AS post_history_id,
ph.post_id,
ph.post_history_type_id,
ph.revision_guid,
ph.user_id,
COALESCE(m.activity_type, 'unknown') AS activity_type,
COALESCE(m.grouped_activity_type, 'unknown') AS grouped_activity_type,
COALESCE(ph.creation_date, '1900-01-01') AS post_creation_date,
COALESCE(ph.text, 'unknown') AS post_text,
COALESCE(ph.comment, 'unknown') AS post_comment
FROM
post_history ph
LEFT JOIN post_history_type_mapping m
ON ph.post_history_type_id = m.post_history_type_id
With that we wrap up our chapter on query robustness. In the next chapter we
get to see the entire query for user engagement. It’s also a great opportunity to
review what we’ve learned so far.
Chapter 6: Finishing the Project
In this chapter we wrap up our query and go over it one more time highlighting
the various patterns we’ve learned so far. This is a good opportunity to test
yourself and see what you’ve learned. Analyze the query and see what patterns
you recognize.
So here's the whole query:
-- listing 6.1
-- Get the user name and collapse the granularity of post_history
-- to the user_id, post_id, activity type and date
WITH post_activity AS (
SELECT
ph.post_id,
ph.user_id,
u.display_name AS user_name,
ph.creation_date AS activity_date,
CASE WHEN ph.post_history_type_id IN (1,2,3) THEN 'create'
WHEN ph.post_history_type_id IN (4,5,6) THEN 'edit'
END AS activity_type
FROM
post_history ph
INNER JOIN users u
ON u.id = ph.user_id
WHERE
TRUE
AND ph.post_history_type_id BETWEEN 1 AND 6
AND user_id > 0 --exclude automated processes
AND user_id IS NOT NULL --exclude deleted accounts
GROUP BY
1,2,3,4,5
)
-- Get the post types we care about (questions and answers only) and combine them in one CTE
,post_types AS (
SELECT
id AS post_id,
'question' AS post_type,
FROM
posts_questions
UNION ALL
SELECT
id AS post_id,
'answer' AS post_type,
FROM
posts_answers
)
-- Finally calculate the post metrics
, user_post_metrics AS (
SELECT
user_id,
user_name,
TRY_CAST(activity_date AS DATE) AS activity_date,
SUM(CASE WHEN activity_type = 'create' AND post_type = 'question'
THEN 1 ELSE 0 END) AS questions_created,
SUM(CASE WHEN activity_type = 'create' AND post_type = 'answer'
THEN 1 ELSE 0 END) AS answers_created,
SUM(CASE WHEN activity_type = 'edit' AND post_type = 'question'
THEN 1 ELSE 0 END) AS questions_edited,
SUM(CASE WHEN activity_type = 'edit' AND post_type = 'answer'
THEN 1 ELSE 0 END) AS answers_edited,
SUM(CASE WHEN activity_type = 'create'
THEN 1 ELSE 0 END) AS posts_created,
SUM(CASE WHEN activity_type = 'edit'
THEN 1 ELSE 0 END) AS posts_edited
FROM
post_types pt
JOIN post_activity pa ON pt.post_id = pa.post_id
GROUP BY 1,2,3
)
, comments_by_user AS (
SELECT
user_id,
TRY_CAST(creation_date AS DATE) AS activity_date,
COUNT(*) as total_comments
FROM
comments
WHERE
TRUE
GROUP BY
1,2
)
, comments_on_user_post AS (
SELECT
pa.user_id,
TRY_CAST(c.creation_date AS DATE) AS activity_date,
COUNT(*) as total_comments
FROM
comments c
INNER JOIN post_activity pa ON pa.post_id = c.post_id
WHERE
TRUE
AND pa.activity_type = 'create'
GROUP BY
1,2
)
, votes_on_user_post AS (
SELECT
pa.user_id,
TRY_CAST(v.creation_date AS DATE) AS activity_date,
SUM(CASE WHEN vote_type_id = 2 THEN 1 ELSE 0 END) AS total_upvotes,
SUM(CASE WHEN vote_type_id = 3 THEN 1 ELSE 0 END) AS total_downvotes,
FROM
votes v
INNER JOIN post_activity pa ON pa.post_id = v.post_id
WHERE
TRUE
AND pa.activity_type = 'create'
GROUP BY
1,2
)
, total_metrics_per_user AS (
SELECT
pm.user_id,
pm.user_name,
CAST(SUM(pm.posts_created) AS NUMERIC) AS posts_created,
CAST(SUM(pm.posts_edited) AS NUMERIC) AS posts_edited,
CAST(SUM(pm.answers_created) AS NUMERIC) AS answers_created,
CAST(SUM(pm.questions_created) AS NUMERIC) AS questions_created,
CAST(SUM(vu.total_upvotes) AS NUMERIC) AS total_upvotes,
CAST(SUM(vu.total_downvotes) AS NUMERIC) AS total_downvotes,
CAST(SUM(cu.total_comments) AS NUMERIC) AS comments_by_user,
CAST(SUM(cp.total_comments) AS NUMERIC) AS comments_on_post,
CAST(COUNT(DISTINCT pm.activity_date) AS NUMERIC) AS streak_in_days
FROM
user_post_metrics pm
JOIN votes_on_user_post vu
ON pm.activity_date = vu.activity_date
AND pm.user_id = vu.user_id
JOIN comments_on_user_post cp
ON pm.activity_date = cp.activity_date
AND pm.user_id = cp.user_id
JOIN comments_by_user cu
ON pm.activity_date = cu.activity_date
AND pm.user_id = cu.user_id
GROUP BY
1,2
)
------------------------------------------------
---- Main Query
SELECT
user_id,
user_name,
posts_created,
answers_created,
questions_created,
total_upvotes,
comments_by_user,
comments_on_post,
streak_in_days,
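    -- NOTE: the rest of this listing is a sketch; based on the Project Remarks the
    -- original presumably computes per-post ratios, protecting each division with
    -- the CASE pattern from Chapter 5 (the ratio names below are assumptions)
    ROUND(CASE WHEN posts_created > 0
          THEN questions_created / posts_created ELSE 0 END, 1) AS questions_per_post,
    ROUND(CASE WHEN posts_created > 0
          THEN answers_created / posts_created ELSE 0 END, 1) AS answers_per_post,
    ROUND(CASE WHEN posts_created > 0
          THEN comments_on_post / posts_created ELSE 0 END, 1) AS comments_on_post_per_post
FROM
    total_metrics_per_user;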
Project Remarks
There are a few things to mention before we move on to the next chapter.
Our query is very long and complex. While we did a pretty good job of decomposing it into clean modules, it's still 200+ lines long. Many of the CTEs can only be used inside this query. As discussed in Chapter 3, if we want to use them elsewhere in the database we need to create views. We'll see how to do this with dbt in the next chapter.
You'll notice that in the CTE named total_metrics_per_user, I cast all those integer values to the NUMERIC type. Why? The reason is that when many databases divide two integers they perform integer division and drop the decimal part.
By casting to NUMERIC we ensure we get decimal places. And since the number of decimals can be unpredictable, we use the ROUND() function to round all the values to 1 decimal place. A clever trick to do this without casting is to multiply each column by 1.0, which forces the database to do the type conversion implicitly.
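As a quick illustration, assuming an engine like PostgreSQL where dividing two integers truncates the result:
--illustration only: integer division vs. the two workarounds
SELECT
    7 / 2                                             AS int_division,   --3 (decimal part dropped)
    7 * 1.0 / 2                                       AS implicit_cast,  --3.5
    ROUND(CAST(7 AS NUMERIC) / CAST(2 AS NUMERIC), 1) AS explicit_cast;  --3.5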
Did you notice how many times the CASE statement was repeated? It makes
the query unnecessarily complicated and hard to maintain. Remember
the DRY Principle? Is there a way we can avoid having to repeat it? Not unless your database has a "safe divide" function, but there is a way to do this with a macro in a SQL compiler like dbt. We'll see that pattern in the next chapter.
Now that you have all these wonderful metrics you can sort the results
by any of them to see different types of users. For example you can sort
by questions_per_post to see everyone who posts mostly questions or
answers_by_post to see those who post mostly answers. You can also create
new metrics that indicate who your best users are.
Some of the best uses of this type of table are for customer segmentation or as a
feature table for data science. In fact this is exactly the type of table DS and ML engineers build when deploying machine learning systems.
That wraps up our final project, but we’re not done yet. In the next chapter we’ll
see how to apply many of the patterns with dbt.
Chapter 7: dbt Patterns
In this chapter we’re going to use all the patterns we’ve seen so far to simplify
our final query from the project we just saw using dbt.
dbt is a SQL compiler that uses a combination of SQL code with Jinja templates
to allow far greater flexibility in how you design data transformations than SQL
alone. These are patterns I use every day in my job and they have helped me
make my code not only easier to maintain and debug but also portable across
many platforms.
The power of dbt is its support for dependencies, which lets you decompose a data transformation into modular workflows that form DAGs. It also supports macros, which make your code more portable.
What we'll do in this chapter is take the query we completed in Chapter 6 and show you how to rewrite it with dbt. I won't go into too much depth on how dbt works, because I don't want to make this a dbt tutorial. You can learn more in the official dbt documentation at docs.getdbt.com.
dbt uses the concept of "models" for modularizing your code. All the models by default live in the models folder. In the GitHub repo for this book, under the models folder you will find 3 subfolders: bronze, silver and gold. They represent what is called "the medallion architecture." I won't get into details about that here; you can read about it at https://www.databricks.com/glossary/medallion-architecture
The first one (bronze) loads the StackOverflow tables from parquet files as is, without any modifications. We've used those exact tables throughout the book.
But the beauty of dbt is that it makes it really easy to create our own custom
models while applying the robustness patterns we learned in Chapter 5. We can
have our own foundational models rather than rely on raw data. The model
below uses COALESCE() on all the fields, ensuring that all downstream models
no longer have to worry about NULLs.
Have a look at this example in the models/clean subfolder:
--model post_activity_history_clean_original
SELECT
id,
post_id,
post_history_type_id,
revision_guid,
user_id,
CASE
WHEN post_history_type_id IN (1,2,3) THEN 'create'
WHEN post_history_type_id IN (4,5,6) THEN 'edit'
WHEN post_history_type_id IN (7,8,9) THEN 'rollback'
END AS grouped_activity_type,
CASE
WHEN post_history_type_id = 1 THEN 'create_title'
WHEN post_history_type_id = 2 THEN 'create_body'
WHEN post_history_type_id = 3 THEN 'create_tags'
WHEN post_history_type_id = 4 THEN 'edit_title'
WHEN post_history_type_id = 5 THEN 'edit_body'
WHEN post_history_type_id = 6 THEN 'edit_tags'
WHEN post_history_type_id = 10 THEN 'post_closed'
WHEN post_history_type_id = 11 THEN 'post_reopened'
WHEN post_history_type_id = 12 THEN 'post_deleted'
WHEN post_history_type_id = 13 THEN 'post_undeleted'
WHEN post_history_type_id = 14 THEN 'post_locked'
WHEN post_history_type_id = 15 THEN 'post_unlocked'
WHEN post_history_type_id = 16 THEN 'community_owned'
WHEN post_history_type_id = 17 THEN 'post_migrated'
WHEN post_history_type_id = 18 THEN 'question_merged'
WHEN post_history_type_id = 19 THEN 'question_protected'
WHEN post_history_type_id = 20 THEN 'question_unprotected'
WHEN post_history_type_id = 21 THEN 'post_disassociated'
WHEN post_history_type_id = 22 THEN 'question_unmerged'
WHEN post_history_type_id = 24 THEN 'suggested_edit_applied'
WHEN post_history_type_id = 25 THEN 'post_tweeted'
WHEN post_history_type_id = 31 THEN 'comment_discussion_moved_to_chat'
WHEN post_history_type_id = 33 THEN 'post_notice_added'
WHEN post_history_type_id = 34 THEN 'post_notice_removed'
WHEN post_history_type_id = 35 THEN 'post_migrated'
WHEN post_history_type_id = 36 THEN 'post_migrated'
WHEN post_history_type_id = 37 THEN 'post_merge_source'
WHEN post_history_type_id = 38 THEN 'post_merge_destination'
WHEN post_history_type_id = 50 THEN 'bumped_by_community_user'
WHEN post_history_type_id = 52 THEN 'question_became_hot_network'
WHEN post_history_type_id = 53 THEN 'question_removed_from_hot_network'
WHEN post_history_type_id = 66 THEN 'created_from_ask_wizard'
END AS activity_type,
COALESCE(creation_date, '1900-01-01') AS creation_date,
COALESCE(text, 'unknown') AS text,
COALESCE(comment, 'unknown') AS comment
FROM
{{ ref('post_history') }}
We also handle the mapping of the post_history_type_id here. There are a lot more types than we saw before because we didn't need them, but now we can put them all in one place so we only work with text later. Text descriptions make code more readable and maintainable than magic numbers.
This is fine, but do you notice how many times we had to copy-paste the same piece of code? Can we do better? With dbt we can. There's a concept in dbt called seed files, which is perfect for this type of mapping. This is basically a CSV file with three columns: post_history_type_id, activity_type and grouped_activity_type. The file makes it a lot easier to add or update mappings in the future.
This is what it looks like:
--seed file (partial listing)
post_history_type_id,activity_type,grouped_activity_type
1,create_title,create
2,create_body,create
3,create_tags,create
4,edit_title,edit
5,edit_body,edit
6,edit_tags,edit
7,rollback_title,rollback
8,rollback_body,rollback
9,rollback_tags,rollback
10,post_closed,post_closed
11,post_reopened,post_reopened
12,post_deleted,post_deleted
13,post_undeleted,post_undeleted
14,post_locked,post_locked
...
With the seed file loaded, the cleanup model becomes much simpler:
--model post_history_clean
SELECT
id,
post_id,
post_history_type_id,
revision_guid,
user_id,
COALESCE(m.activity_type, 'unknown') AS activity_type,
COALESCE(m.grouped_activity_type, 'unknown') AS grouped_activity_type,
COALESCE(creation_date, '1900-01-01') AS creation_date,
COALESCE(text, 'unknown') AS text,
COALESCE(comment, 'unknown') AS comment
FROM
{{ ref('post_history') }} ph
LEFT JOIN {{ ref('post_history_type_mapping') }} m
ON ph.post_history_type_id = m.post_history_type_id
Notice a couple of things. First of all, our code is a lot more compact and easier to read, understand and maintain. Second, we're using a LEFT JOIN as explained in Chapter 5 Pattern 3. Also notice how we assume activity_type and grouped_activity_type could be NULL and COALESCE() the values coming from the LEFT JOIN in order to protect ourselves.
While CTEs provide a great way to decompose a single query into readable and
maintainable modules, they don’t go far enough. If you wanted to reuse any of
them you’d have to manually create views. And when views no longer cut it, due
to performance issues, you’d have to materialize them into tables.
dbt makes both of those options easier while also allowing you to create linkages across models, forming a DAG as we saw in Chapter 3.
The post_types CTE from Chapter 6 only selects the post_id and post_type columns, but I think a full union of the post tables can be very useful in the future, so we create a more comprehensive model that unions all the columns in a single view. To save ourselves from writing boilerplate SQL
and cover future cases where new columns are added to the base tables we use
the union_relations() macro from dbt-utils:
--listing 7.2 all_post_types_combined
{{
dbt_utils.union_relations(
relations=[ref('posts_answers_clean'), ref('posts_questions_clean')]
)
}}
The macro will compile into the appropriate SQL before execution. If you want to see the compiled code (which I won't list here) simply run dbt compile -m all_post_types_combined. And if you want to see the beautiful DAG it creates, just run dbt docs generate && dbt docs serve.
Here's one of the CTEs from the model that computes per-user, per-day metrics (all_user_metrics_per_day, referenced later):
cte_all_posts_created_and_edited AS (
SELECT
pa.user_id,
TRY_CAST(pa.creation_date AS DATE) AS activity_date,
{{- sumif("pa.grouped_activity_type = 'create'
AND pt.post_type = 'question'", 1) }} AS
questions_created,
{{- sumif("pa.grouped_activity_type = 'create'
AND pt.post_type = 'answer'", 1) }} AS answers_created,
{{- sumif("pa.grouped_activity_type = 'edit'
AND pt.post_type = 'question'", 1) }} AS questions_edited,
{{- sumif("pa.grouped_activity_type = 'edit'
AND pt.post_type = 'answer'", 1) }} AS answers_edited,
{{- sumif("pa.grouped_activity_type = 'create'", 1) }} AS
posts_created,
{{- sumif("pa.grouped_activity_type = 'edit'", 1) }} AS
posts_edited
FROM
{{ ref('all_post_types_combined') }} pt
INNER JOIN {{ ref('post_activity_history_clean') }} pa
ON pt.post_id = pa.post_id
WHERE
true
AND pa.grouped_activity_type in ('create', 'edit')
AND pt.post_type in ('question', 'answer')
AND pa.user_id > 0 --exclude automated processes
AND pa.user_id IS NOT NULL --exclude deleted accounts
GROUP BY 1,2
)
We do a few very interesting things here. First, notice all that boilerplate SQL with SUM and CASE statements. This is where dbt really shines: we write a custom macro to hide that functionality behind. This is a VERY important pattern unique to dbt. Some might argue this makes the code unnecessarily complex, but I beg to differ. This one pattern has saved me hours of drudgery.
{% macro sumif(condition, column) %}
SUM(CASE WHEN {{condition}} THEN {{column}} ELSE 0 END)
{%- endmacro %}
At first the macro seems superfluous. Why bother, right? In this case it does seem like the macro is not adding any functionality; however, by using a macro we're centralizing the repeated SUM/CASE logic in one place, so if it ever needs to change we only have to change it once and every model that uses it picks up the fix.
Now we can apply that macro to our final model that gets us the same result as
the query in the last chapter.
WITH cte_metrics_per_user AS (
SELECT
user_id,
user_name,
SUM(posts_created) AS posts_created,
SUM(posts_edited) AS posts_edited,
SUM(answers_created) AS answers_created,
SUM(questions_created) AS questions_created,
SUM(total_upvotes) AS total_upvotes,
SUM(total_downvotes) AS total_downvotes,
SUM(comments_by_user) AS comments_by_user,
SUM(comments_on_post) AS comments_on_post,
COUNT(DISTINCT activity_date) AS streak_in_days
FROM
{{ ref('all_user_metrics_per_day') }}
GROUP BY
1,2
)
SELECT
user_id,
user_name,
posts_created,
answers_created,
questions_created,
total_upvotes,
comments_by_user,
comments_on_post,
streak_in_days,
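    -- NOTE: the rest of this model is a sketch based on the Project Remarks in
    -- Chapter 6: the original presumably computes the per-post ratios with a
    -- custom division macro built like sumif (the name safe_divide below is
    -- hypothetical), so the divide-by-zero protection lives in one place
    {{ safe_divide('questions_created', 'posts_created') }} AS questions_per_post,
    {{ safe_divide('answers_created', 'posts_created') }} AS answers_per_post,
    {{ safe_divide('comments_on_post', 'posts_created') }} AS comments_on_post_per_post
FROM
    cte_metrics_per_user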