sql_patterns_v1.5
Contents

Copyright
Introduction
    Who am I
    Why I wrote this book
    Who this book is for
    What you'll learn in this book
    How this book is organized
Introduction
This is a book about SQL Patterns. Patterns describe problems that occur over
and over in our professional settings. A pattern is like a template. Once you
learn them you can apply them to solve problems faster and make your code
better. Learning and applying patterns is how you level up in your career. We can
illustrate this with an example.
In fiction writing, authors rarely write books from scratch. They use character
patterns like: “antihero”, “sidekick”, “mad scientist”, “girl next door” and plot
patterns like: “romantic comedy,” “melodrama”, “red herring”, “foreshadowing”,
“cliffhangers”, etc. This helps them write better books, movies and TV shows
faster.
Each pattern in the book is described using a consistent set of elements.
Who am I
I’ve been writing SQL for nearly 20 years. I’ve seen and written hundreds of
thousands of lines of code. Over time I noticed a set of patterns and best practices
I always came back to when writing queries. These patterns made my code more
efficient, easier to understand and a breeze to maintain.
This book is for anyone who is familiar with SQL and wants to take their skills to
the next level. I assume you're already familiar with basic SQL syntax and you know
how to join tables and do basic filtering and aggregation.
I’m a huge fan of project-based learning. You can learn anything if you can come
up with an interesting project to use what you’re learning. I used this exact
method to teach myself data science. I came up with a work-related project that
was both valuable to the company and used the things I was learning.
That’s why for this book I came up with an interesting and useful data project
to organize it around. I’ll explain each pattern as I walk you through the project.
This will ensure that you learn the material better and remember it the next time
you need to apply it.
In the previous edition of this book I used the StackOverflow dataset that’s
publicly available in BigQuery. Realizing that not everyone has access to this and
that it could disappear at any moment, I decided to make a few changes.
First of all, I made the tables available as parquet files on GitHub. Second, I decided
to use the freely available (and quite amazing) DuckDB. The instructions for
setting everything up are available on this repo: github.com/ergest/sql_patterns
I’ve also included all the chapter code listings in the repo so you can copy/paste
them and run them. I strongly encourage you to type the code yourself. You'll learn better that
way. Using this dataset we’re going to build a table which calculates reputation
metrics. You can use this same type of table to calculate a customer engagement
score or a customer 360 table.
As we go through the project, we’ll cover each pattern when it arises. That will
help you understand why we’re using that pattern at that exact moment. Each
chapter will cover a select group of patterns while building on the previous
chapters.
In Chapter 2 we cover Core Concepts and Patterns. In this chapter we’re going
to cover some of the core concepts of querying data and building tables for
analysis and data science. We’ll start with the most important but underrated
concept in SQL: granularity.
In Chapter 3 we cover Modularity Patterns. In this chapter we’ll learn some key
concepts that make SQL code easier to read, understand and maintain. We first
talk about the concept of modularity and explore some patterns there. Then
we'll cover the Single Responsibility Principle (SRP), Don't Repeat Yourself (DRY)
and a few others.
In Chapter 6 we wrap up our project and you get to see the entire query. By now
you should be able to understand it and know exactly how it was designed. I
recap the entire project so that you get another chance to review all the patterns.
The goal here is to allow you to see all the patterns together and give you ideas
on how to apply them in your day-to-day work.
In Chapter 7 we cover dbt Patterns. In this chapter we’re going to use all the
patterns we’ve seen to simplify our final query from the project using dbt. The
purpose of this chapter is to show how these patterns apply beyond just SQL.
With that out of the way, let’s dive into the database.
Chapter 1: Understanding the Database
In this chapter we get into the details of the StackOverflow database we’re going
to be using throughout the book. You can refer back to it at any point you feel
you don’t understand the underlying tables.
Before we dive into writing queries you should make sure you have
the proper development environment set up. I have posted a detailed
guide on how to set things up with dbt and DuckDB on this Github repo:
github.com/ergest/sql_patterns. This way I can update them as needed without
having to update the book.
StackOverflow is a popular website where users post questions about any
technical topic such as programming languages, databases, etc. and other
users can post answers to these questions, vote on the answers or comment on
them.
Based on the quality of the answers, users gain reputation and badges. These
badges act as social proof on StackOverflow and potentially on other websites.
This database is made available for free online in BigQuery but it’s really large
so I’ve extracted one month of data and packaged it with the Github repo as
parquet files.
In the first edition of this book I used BigQuery directly but I found that people
had some issues with it. Plus if the free plan was ever revoked or the dataset
deleted, I wanted to ensure the queries in the book could still be run locally.
For our project we want to build a table that calculates reputation metrics for
every user. This type of table is sometimes called a “feature table” and is very
common in data science and machine learning applications. It has one row per
entity (in our case a single user) and numerical attributes pertaining to that
entity.
This is the perfect project to illustrate many of the patterns covered in this book
because it’s a challenging task that requires multiple data transformation steps.
We will first see how to build it with a single query, then in Chapter 7 we build it
using dbt.
Let’s take a look at the schema. As you can see, we have our entity identifier (in
our case the user_id and user_name) and every other column represents some
type of score pertaining to that user:
| column_name | type |
|---------------------------|---------|
| user_id | INT64 |
| user_name | STRING |
| total_posts_created | NUMERIC |
| total_answers_created | NUMERIC |
| total_answers_edited | NUMERIC |
| total_questions_created | NUMERIC |
| total_upvotes | NUMERIC |
| total_comments_by_user | NUMERIC |
| total_questions_edited | NUMERIC |
| max_streak_in_days | NUMERIC |
| total_comments_on_post | NUMERIC |
| posts_per_day | NUMERIC |
| edits_per_day | NUMERIC |
| answers_per_day | NUMERIC |
| questions_per_day | NUMERIC |
| comments_by_user_per_day | NUMERIC |
| answers_per_post | NUMERIC |
| questions_per_post | NUMERIC |
| upvotes_per_post | NUMERIC |
| downvotes_per_post | NUMERIC |
| user_comments_per_post | NUMERIC |
| comments_on_post_per_post | NUMERIC |
Writing accurate and efficient SQL begins with understanding the underlying
data model. It often exists as an Entity-Relationship Diagram (ERD) that shows
you how the tables connect with each other. The ERD is usually a graphical
representation, though it may not always be available, so more often than not
you’ll have to learn it as you go.
You can find the original StackOverflow data model online here but the one
included with this book is slightly different, so I'll walk you through it step by step.
The original data model has a single posts table for all the post types, whereas ours splits each one into a separate table:
posts_questions and posts_answers. You can view them in our database using
the information_schema views in DuckDB like this:
--listing 1.1
SELECT table_name
FROM information_schema.tables
WHERE table_name like 'posts_%';
--sample output
table_name |
---------------+
posts_answers |
posts_questions|
Assuming you've set things up properly, here's the result of the query in DBeaver
(in text output mode). I'll only use this format henceforth, but your output might
look different in the GUI.
They both have the same schema which we can view using another
information_schema view:
--listing 1.2
SELECT column_name, data_type
FROM information_schema.columns
WHERE table_name = 'posts_answers';
--sample output
column_name |data_type|
------------------------+---------+
id |BIGINT |
title |VARCHAR |
body |VARCHAR |
accepted_answer_id |VARCHAR |
answer_count |VARCHAR |
comment_count |BIGINT |
community_owned_date |TIMESTAMP|
creation_date |TIMESTAMP|
favorite_count |VARCHAR |
last_activity_date |TIMESTAMP|
last_edit_date |TIMESTAMP|
last_editor_display_name|VARCHAR |
last_editor_user_id |BIGINT |
owner_display_name |VARCHAR |
owner_user_id |BIGINT |
parent_id |BIGINT |
post_type_id |BIGINT |
score |BIGINT |
tags |VARCHAR |
view_count |VARCHAR |
Both tables have an id column that identifies a single post, a creation_date
timestamp for when the post was created, and a few other attributes
like score (the net of upvotes and downvotes), view_count, tags, etc.
Note the parent_id column which signifies a hierarchical structure. The parent_id
is a one-to-many relationship modeled within the same table. It links up all the
answers to the corresponding question. A single question can have one or many
answers but an answer belongs to one and only one question. This is relation 1
in Figure 1.1 above.
Both post types have a one-to-many relationship to the post_history which means
that one entry in the posts tables corresponds to one or many entries in the
post_history table. These are relations 3 and 4 in the diagram above. The post
history contains a log of all the activities that can be performed on a post such
as initial creation, any subsequent edits, comments, etc.
--listing 1.3
SELECT column_name, data_type
FROM information_schema.columns
WHERE table_name = 'post_history';
--sample output
column_name |data_type|
--------------------+---------+
id |BIGINT |
creation_date |TIMESTAMP|
post_id |BIGINT |
post_history_type_id|BIGINT |
revision_guid |VARCHAR |
user_id |BIGINT |
text |VARCHAR |
comment |VARCHAR |
A single post can have many types of activities identified by the post_history_type_id
column. This id shows the different types of activities a user can perform on the
site. We’re only concerned with the first 6. You can see the rest of them here if
you’re curious.
The first 3 indicate when a post is first submitted and the next 3 when a post is
edited. The post_history table also connects to the users table via the user_id in
a one-to-many relationship shown in Figure 1.1 as number 6. A single user can
perform multiple activities on a post.
In database lingo this is known as a bridge table because it connects two tables
(users and posts) that have a many-to-many relationship which cannot be
modeled otherwise.
The users table has one row per user and contains user attributes such as name,
reputation, etc. We’ll use some of these attributes in our final table.
--listing 1.4
SELECT column_name, data_type
FROM information_schema.columns
WHERE table_name = 'users';
--sample output
column_name |data_type|
-----------------+---------+
id |BIGINT |
text |VARCHAR |
creation_date |TIMESTAMP|
post_id |BIGINT |
user_id |BIGINT |
user_display_name|VARCHAR |
score |BIGINT |
Finally the votes table represents the upvotes and downvotes on a post. We’ll
need this to compute the total vote count on a user’s post which will show how
good the question or the answer is. This table has a granularity of one row per vote: one row per post, per vote type, per timestamp.
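We can inspect its schema the same way as the other tables; a query in the same style as the earlier listings:

SELECT column_name, data_type
FROM information_schema.columns
WHERE table_name = 'votes';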
--sample output
column_name |data_type|
-------------+---------+
id |BIGINT |
creation_date|TIMESTAMP|
post_id |BIGINT |
vote_type_id |BIGINT |
Chapter 2: Core Concepts and Patterns
In this chapter we’re going to cover some of the core concepts of querying data
and building tables for analysis and data science. We’ll start with the most
important but underrated concept in SQL: granularity.
Concept 1: Granularity
Granularity (also known as the grain of the table) is a measure of the level of
detail that determines an individual row in a table or view. This is extremely
important when it comes to joins or aggregating data.
A finely grained table means a high level of detail like one row per transaction
at the millisecond level. A coarse-grained table means a low level of detail, like
a count of all transactions per day, week or month.
Granularity is usually expressed as the combination of columns (or column) that
makes up a unique row.
For example, the users table has one row per user, a level of detail specified by
the id column. This is also known as the primary key of the table. That is the
finest grain of it.
The post_history table, on the other hand, contains a log of all the activities a
user performs on a post on a given date and time. Therefore the finest granularity
is one row per user, per post, per timestamp.
The comments table contains a log of all the comments on a post by a user on a
given date so its granularity is also one row per user, per post, per timestamp.
The votes table contains a log of all the upvotes and downvotes on a post on a
given date. It has separate rows for upvotes and downvotes so its granularity is
one row per post, per vote type, per timestamp.
To find a table’s granularity you either read the documentation, or if that doesn’t
exist, you make an educated guess and check. How do you check? It’s easy.
For example for post_history I assume (or guess) that I can find a unique row by
combining creation_date, post_id, post_history_type_id and user_id.
To check we can run the following query:
--listing 2.1
SELECT
creation_date,
post_id,
post_history_type_id AS type_id,
user_id,
COUNT(*) AS total
FROM
post_history
GROUP BY
1,2,3,4
HAVING
COUNT(*) > 1;
--sample output
creation_date |post_id |type_id|user_id|total|
-----------------------+--------+-------+-------+-----+
2021-12-10 14:09:36.950|70276799| 5| | 2|
If my hunch is correct, we have found our granularity and I should get 0 rows from
this query. But we don’t! We get one row. This means we have to be careful when
joining with this table on post_id, user_id, creation_date, post_history_type_id.
We have to deal with the duplicate issue first otherwise we’ll get incorrect
results.
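One way to guard against it is to collapse exact duplicates before joining (a minimal sketch; the post_activity CTE in Chapter 3 achieves the same effect by grouping on every column it selects):

--collapse exact duplicates down to one row per combination
SELECT
creation_date,
post_id,
post_history_type_id,
user_id
FROM
post_history
GROUP BY
1,2,3,4;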
Our final table will have a grain of one row per user. Only the users table has that
same granularity. In order to build it we’ll have to manipulate the granularity of
the source tables so that’s what we focus on next.
Now that you have a grasp of the concept of granularity, the next thing to learn is
how to manipulate it. What I mean by manipulation is specifically going from a
fine grain to a coarser grain.
For example an e-commerce website might store each transaction it performs as
a single row on a table with the millisecond timestamp when it occurred. This
gives us a very fine-grained table (i.e. a very high level of detail). But if we wanted
to know how much revenue we got on a given day, we have to reduce that level
of detail to a single row per day. That’s exactly what aggregation does.
--listing 2.2.1
SELECT
ph.post_id,
ph.user_id,
ph.creation_date AS activity_date,
ph.post_history_type_id
FROM
post_history ph
WHERE
TRUE
AND ph.post_history_type_id BETWEEN 1 AND 6
AND ph.user_id > 0 --exclude automated processes
AND ph.user_id IS NOT NULL --exclude deleted accounts
AND ph.creation_date >= '2021-12-01'
AND ph.creation_date <= '2021-12-31'
AND ph.post_id = 70182248;
--sample output
post_id |user_id|activity_date |post_history_type_id|
--------+-------+-----------------------+--------------------+
70182248|2230216|2021-12-01 10:03:18.350| 2|
70182248|2230216|2021-12-01 10:03:18.350| 1|
70182248|2230216|2021-12-01 10:03:18.350| 3|
70182248|2230216|2021-12-01 11:04:12.603| 5|
70182248|2230216|2021-12-01 12:59:48.113| 5|
70182248|2230216|2021-12-01 13:07:56.327| 5|
70182248|2702894|2021-12-01 13:35:41.293| 6|
70182248|2230216|2021-12-01 18:41:18.033| 5|
70182248|2230216|2021-12-01 18:41:18.033| 6|
70182248|2230216|2021-12-02 07:46:22.630| 4|
Notice that there are three rows for post_history_type_id values 1, 2 and 3
which all have the same timestamp 2021-12-01 10:03:18.350 and two rows for
post_history_type_id values 5 and 6 for timestamp 2021-12-01 18:41:18.033. If you
recall the type ids from Chapter 1, values 1, 2 and 3 represent initial body, initial
title and initial tags while values 5 and 6 represent editing the body and tags.
Since we don’t really care about the specifics, we can group those ids into a
single value and then aggregate the rows in order to collapse the granularity via
a CASE statement as shown below:
--listing 2.2.2
SELECT
ph.post_id,
ph.user_id,
ph.creation_date AS activity_date,
CASE WHEN ph.post_history_type_id IN (1,2,3) THEN 'create'
WHEN ph.post_history_type_id IN (4,5,6) THEN 'edit'
END AS activity_type,
COUNT(*) AS total
FROM
post_history ph
WHERE
TRUE
AND ph.post_history_type_id BETWEEN 1 AND 6
AND ph.user_id > 0 --exclude automated processes
AND ph.user_id IS NOT NULL --exclude deleted accounts
AND ph.creation_date >= '2021-12-01'
AND ph.creation_date <= '2021-12-31'
AND ph.post_id = 70182248
GROUP BY
1,2,3,4;
--sample output
post_id |user_id|activity_date |activity_type|total|
--------+-------+-----------------------+-------------+-----+
70182248|2230216|2021-12-01 10:03:18.350|create | 3|
70182248|2230216|2021-12-01 11:04:12.603|edit | 1|
70182248|2230216|2021-12-01 12:59:48.113|edit | 1|
70182248|2230216|2021-12-01 13:07:56.327|edit | 1|
70182248|2702894|2021-12-01 13:35:41.293|edit | 1|
70182248|2230216|2021-12-01 18:41:18.033|edit | 2|
70182248|2230216|2021-12-02 07:46:22.630|edit | 1|
We have now effectively manipulated the granularity of the table by reducing the
overall number of rows but retaining most of the information. Notice however
that this action is both destructive in terms of information loss and irreversible.
What I mean is that if we were to store ONLY the above table in our database and
get rid of the detailed table, we’d lose information about the specific section of
the post that was edited or created. We’d no longer know that on 2021-12-01
18:41:18.033 it was only the body and tags that were edited but not the title.
That’s why it’s common practice in data warehouses to always store the finest
grain (aka highest level of detail available) and then aggregate information on
top of it. This way we can easily debug data issues when they arise.
The timestamp column creation_date is a rich field with both the date and time
information (hour, minute, second, microsecond, millisecond). Timestamp fields
are unique when it comes to aggregation because they have many levels of
granularities built in.
Given a single timestamp, we can construct granularities for seconds, minutes,
hours, days, weeks, months, quarters, years, decades, etc. We do that by
using one of the many date manipulation functions like CAST(), DATE_TRUNC(),
DATE_PART(), etc.
For example if I wanted to remove the time information, I could collapse all
activities on a given date to a single row using DATE_TRUNC() like this:
--listing 2.3
SELECT
ph.post_id,
ph.user_id,
DATE_TRUNC('day', ph.creation_date) AS activity_date,
CASE WHEN ph.post_history_type_id IN (1,2,3) THEN 'create'
WHEN ph.post_history_type_id IN (4,5,6) THEN 'edit'
END AS activity_type,
COUNT(*) AS total
FROM
post_history ph
WHERE
TRUE
AND ph.post_history_type_id BETWEEN 1 AND 6
AND ph.user_id > 0 --exclude automated processes
AND ph.user_id IS NOT NULL --exclude deleted accounts
AND ph.creation_date >= '2021-12-01'
AND ph.creation_date <= '2021-12-31'
AND ph.post_id = 70182248
GROUP BY
1,2,3,4;
--sample output
post_id |user_id|activity_date|activity_type|total|
--------+-------+-------------+-------------+-----+
70182248|2702894| 2021-12-01|edit | 1|
70182248|2230216| 2021-12-01|create | 3|
70182248|2230216| 2021-12-01|edit | 5|
70182248|2230216| 2021-12-02|edit | 1|
This is another form of granularity manipulation where you change the shape
of aggregated data by “pivoting” rows into columns. In the above dataset we
tried to collapse the overall granularity of the table to a single day, but we got
edit occurring twice on 2021-12-01. Could we reduce the granularity further?
That’s exactly what the code below does. By pivoting the rows into columns, we
can have multiple independent aggregations occurring on the same day show
up on the same row. We will use exactly this output for our final table putting
each metric we calculate on its own column. Again notice how the granularity
manipulation process is both destructive and irreversible.
This is the query that will take the above output and turn it into that form:
--listing 2.4
SELECT
ph.post_id,
ph.user_id,
DATE_TRUNC('day', ph.creation_date) AS activity_date,
SUM(CASE WHEN ph.post_history_type_id IN (1,2,3)
THEN 1 ELSE 0 END) AS total_created,
SUM(CASE WHEN ph.post_history_type_id IN (4,5,6)
THEN 1 ELSE 0 END) AS total_edited
FROM
post_history ph
WHERE
TRUE
AND ph.post_history_type_id BETWEEN 1 AND 6
AND ph.user_id > 0 --exclude automated processes
AND ph.user_id IS NOT NULL --exclude deleted accounts
AND ph.creation_date >= '2021-12-01'
AND ph.creation_date <= '2021-12-31'
AND ph.post_id = 70182248
GROUP BY
1,2,3;
--sample output
post_id |user_id|activity_date|total_created|total_edited|
--------+-------+-------------+-------------+------------+
70182248|2230216| 2021-12-02| 0| 1|
70182248|2702894| 2021-12-01| 0| 1|
70182248|2230216| 2021-12-01| 3| 5|
Granularity multiplication will happen if the tables you’re joining have different
levels of detail for the columns being joined on. This will cause the resulting
number of rows to multiply.
Joining tables is one of the most basic functions in SQL. Databases are designed
to minimize redundancy of information by a process known as normalization. A
normalized database splits information into many separate tables but provides
ways to join them together and re-assemble that information.
Let’s look at an example. The users table has a grain of one row per user:
--listing 2.5
SELECT
id,
display_name,
creation_date,
reputation
FROM users
WHERE id = 2702894;
--sample output
id |display_name |creation_date |reputation|
-------+--------------+-----------------------+----------+
2702894|Graham Ritchie|2013-08-21 09:07:23.133| 20218|
Whereas the post_history table has multiple rows for the same user:
--listing 2.6
SELECT
id,
creation_date,
post_id,
post_history_type_id AS type_id,
user_id
FROM
post_history ph
WHERE
TRUE
AND ph.user_id = 2702894
LIMIT 10;
--sample output
id |creation_date |post_id |type_id|user_id|
---------+-----------------------+--------+-------+-------+
260173419|2021-12-16 10:54:11.637|70377756| 2|2702894|
260541172|2021-12-22 07:51:17.123|70445771| 2|2702894|
260044378|2021-12-14 16:28:26.013|70352124| 6|2702894|
260548889|2021-12-22 10:04:40.227|70446634| 6|2702894|
259143984|2021-12-01 13:34:28.483|70185165| 2|2702894|
259145213|2021-12-01 13:50:18.883|70185401| 2|2702894|
259211259|2021-12-02 10:38:18.150|70197917| 2|2702894|
259212754|2021-12-02 10:59:39.880|70198204| 2|2702894|
259457154|2021-12-06 07:56:54.167|70242375| 2|2702894|
If we join them on user_id, the granularity of the final result will be multiplied:
it will have as many rows per user as the post_history table does:
--listing 2.7
SELECT
ph.post_id,
ph.user_id,
u.display_name AS user_name,
ph.creation_date AS activity_date,
post_history_type_id AS type_id
FROM
post_history ph
INNER JOIN users u
ON u.id = ph.user_id
WHERE
TRUE
AND ph.user_id = 2702894;
--sample output
post_id |user_id|user_name |activity_date |type_id|
--------+-------+--------------+-----------------------+-------+
70377756|2702894|Graham Ritchie|2021-12-16 10:54:11.637| 2|
70445771|2702894|Graham Ritchie|2021-12-22 07:51:17.123| 2|
70352124|2702894|Graham Ritchie|2021-12-14 16:28:26.013| 6|
70446634|2702894|Graham Ritchie|2021-12-22 10:04:40.227| 6|
70185165|2702894|Graham Ritchie|2021-12-01 13:34:28.483| 2|
70185401|2702894|Graham Ritchie|2021-12-01 13:50:18.883| 2|
70197917|2702894|Graham Ritchie|2021-12-02 10:38:18.150| 2|
70198204|2702894|Graham Ritchie|2021-12-02 10:59:39.880| 2|
70242375|2702894|Graham Ritchie|2021-12-06 07:56:54.167| 2|
Notice how the user_name repeats for each row. So if the history table has 10
entries for the same user and the users table has 1, the final result will contain 10
x 1 entries for the same user. If for some reason the users table contained 2 entries for
the same user (messy real world data), we’d see 10 x 2 = 20 entries for that user
in the final result and each row would repeat twice.
Did you know that SQL will ignore a LEFT JOIN clause and perform an INNER JOIN
instead if you make this one simple mistake? This is one of those SQL hidden
secrets which sometimes gets asked as a question in interviews.
Let’s take a look at the example query from above:
--listing 2.8
SELECT
ph.post_id,
ph.user_id,
u.display_name AS user_name,
ph.creation_date AS activity_date
FROM
post_history ph
INNER JOIN users u
ON u.id = ph.user_id
WHERE
TRUE
AND ph.post_id = 70286266
ORDER BY
activity_date;
--sample output
post_id |user_id |user_name |activity_date |
--------+--------+-----------------+-----------------------+
70286266|11693691|M.hussnain Gujjar|2021-12-09 07:45:41.700|
70286266|11693691|M.hussnain Gujjar|2021-12-09 07:45:41.700|
70286266|11693691|M.hussnain Gujjar|2021-12-09 07:45:41.700|
70286266|12221382|Aldin Bradaric |2021-12-09 14:06:00.677|
70286266|12410533|Andrew Halil |2021-12-13 09:02:26.593|
70286266|12410533|Andrew Halil |2021-12-13 09:02:26.593|
You’ll see 6 rows. Now let’s change the INNER JOIN to a LEFT JOIN and rerun the
query:
--listing 2.9
SELECT
ph.post_id,
ph.user_id,
u.display_name AS user_name,
ph.creation_date AS activity_date
FROM
post_history ph
LEFT JOIN users u
ON u.id = ph.user_id
WHERE
TRUE
AND ph.post_id = 70286266
ORDER BY
activity_date;
--sample output
post_id |user_id |user_name |activity_date |
--------+--------+-----------------+-----------------------+
70286266|11693691|M.hussnain Gujjar|2021-12-09 07:45:41.700|
70286266|11693691|M.hussnain Gujjar|2021-12-09 07:45:41.700|
70286266|11693691|M.hussnain Gujjar|2021-12-09 07:45:41.700|
70286266|12221382|Aldin Bradaric |2021-12-09 14:06:00.677|
70286266|NULL |NULL |2021-12-09 14:06:00.677|
70286266|NULL |NULL |2021-12-13 09:02:26.593|
70286266|12410533|Andrew Halil |2021-12-13 09:02:26.593|
70286266|12410533|Andrew Halil |2021-12-13 09:02:26.593|
Notice the extra rows with NULL user data: those are history entries whose user_id has no match in the users table. Now suppose we keep the LEFT JOIN but filter on a column from the users table in the WHERE clause, say to only count activity by users with a very high reputation:

--listing 2.10
SELECT
COUNT(*)
FROM
post_history ph
LEFT JOIN users u
ON u.id = ph.user_id
WHERE
TRUE
AND u.reputation >= 500000;
--sample output
count_star()|
------------+
7596|
We get 7,596 rows. Fine you might say, that looks right. But it's not! Adding filters
in the WHERE clause on columns from a left-joined table will ALWAYS effectively
perform an INNER JOIN, because the rows where those columns are NULL get filtered out.
If we wanted to filter rows in the users table and still do a LEFT JOIN we have to
add the filter in the join condition like so:
--listing 2.11
SELECT
COUNT(*)
FROM
post_history ph
LEFT JOIN users u
ON u.id = ph.user_id
AND u.reputation >= 500000
WHERE
TRUE;
--sample output
count_star()|
------------+
806608|
Finally, a LEFT JOIN combined with a NULL filter on the joined table's key is a handy way to find the rows that have no match at all:

--listing 2.12
SELECT
COUNT(*)
FROM
post_history ph
LEFT JOIN users u
ON u.id = ph.user_id
WHERE
TRUE
AND u.id IS NULL;
--sample output
count_star()|
------------+
15704|
Granularity addition will happen when you want to append the results of two or
more queries. Appending only occurs at the row level while the columns remain
the same.
There are two ways you can append the results of multiple queries: UNION ALL
and UNION. UNION ALL will append query results without checking if they have
the same exact row.
This might cause duplicates but it’s really fast. If you know for sure your results
don’t contain any rows in common this is the preferred way to append them.
Two result sets contain no rows in common if their intersection is empty, so if
you were to join them on all their columns, you'd get no results.
UNION (distinct) will append query results but remove all duplicates from the
final output thus ensuring unique rows. It is much slower than UNION ALL because
of the extra operations to find and remove duplicates. Use this only when you’re
sure the results contain rows in common and you HAVE to remove the duplicates
from the final output.
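A tiny illustration of the difference, using literal values rather than our tables:

--UNION ALL keeps duplicate rows: returns two rows
SELECT 1 AS x UNION ALL SELECT 1 AS x;

--UNION removes them: returns one row
SELECT 1 AS x UNION SELECT 1 AS x;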
In order to append two or more result sets, a couple of requirements have to be met:
1. The number of columns from all tables has to be the same
2. The data types of the columns from all the tables have to line up
You can achieve the first requirement by using SELECT to choose only the columns
that match across the tables, or by relying on the tables having the same exact
schema. Note that when you union tables with different schemas, you have to
line up all the columns in the right order. This also comes in handy when two tables
have the same column named differently.
For example:
SELECT
id AS post_id,
'question' AS post_type,
FROM
posts_questions
UNION ALL
SELECT
id AS post_id,
'answer' AS post_type,
FROM
posts_answers
As a rule of thumb, when you append tables, it’s a good idea to add a constant
column to indicate the source table or some kind of type. This is helpful when
appending say activity tables to create a long, time-series table and you want to
identify each activity type in the final result set.
Note that when appending results, the column names will be those of the first
table and all the names of the subsequent columns will be ignored.
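For instance, in this hypothetical append the result columns are named post_id and post_type, taken from the first SELECT; the aliases in the second SELECT are ignored:

SELECT 1 AS post_id, 'question' AS post_type
UNION ALL
SELECT 2 AS some_other_id, 'answer' AS some_other_type;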
With that out of the way, let's get into the patterns.
Chapter 3: Modularity Patterns
In this chapter we’ll learn some key concepts that make SQL code easier to read,
understand and maintain. We first talk about the concept of modularity and then
explore some of the patterns related to it, like SRP, DRY and a few others.
Concept 1: Modularity
Every complex system is made up of simple, self-contained elements that can be
designed, developed and tested independently. And that means you can take
very complex queries and systematically break them down into much simpler
elements.
Just about every modern system is modular. Your smartphone might seem like
a single piece of hardware but in reality all its components (the screen, CPU,
memory, battery, speaker, GPU, accelerometer, GPS chip, etc.) were designed
independently and then assembled into a singular device.
Definition:
A module is a unit whose elements are tightly connected to themselves but
weakly connected to other units.
• When the modules are simple and self-contained the code is infinitely more
readable, easy to understand, easy to debug and fix, easy to extend and
scale.
• When the modules are carefully thought out, logical and with clean
interfaces the code becomes much easier to write. Once written, all you
have to do is assemble them like “LEGO” bricks instead of writing the
entire long query from scratch.
• When a system is designed with modularity in mind, the modules can be
developed by other parties in parallel so they can be assembled later. It
also makes it easy to improve functionality later on by swapping out old
modules for new ones as long as the interface is the same.
In this chapter we’ll only cover the first two methods. The third method is more
advanced so we’ll cover it in its own chapter.
CTEs or Common Table Expressions are temporary views whose scope is limited
to the current query. They are not stored in the database; they only exist in
memory while the query is running and are only accessible inside that query.
They act like subqueries but are easier to understand and use.
CTEs allow you to break down complex queries into simpler, smaller self-
contained modules. By connecting them together we can solve any complex
query.
When you use CTEs you can read a query from top to bottom and easily
understand what’s going on. When you use sub-queries you have to find the
innermost subquery and work your way outwards while keeping track of
everything in your head. That’s much harder to do so your code becomes really
hard to read, understand and maintain.
Side Note: Even though CTEs have been part of the definition of the SQL
standard since 1999, it has taken many years for database vendors to
implement them. Some versions of older databases (like MySQL before 8.0,
PostgreSQL before 8.4, SQL Server before 2005) do not have support for
them. All the modern cloud warehouse vendors support them.
One of the best ways to visualize CTEs is to think of them as a DAG (aka Directed
Acyclic Graph) where each node handles a single processing step. Here are
some examples of how CTEs could be chained to solve a complex query.
In this example each CTE uses the results of the previous CTE to build upon its
result set and take it further.
-- Define CTE 1
WITH cte1_name AS (
SELECT col1
FROM table1_name
),
-- Define CTE 2 by referring to CTE 1
cte2_name AS (
SELECT col1
FROM cte1_name
),
-- Define CTE 3 by referring to CTE 2
cte3_name AS (
SELECT col1
FROM cte2_name
),
-- Define CTE 4 by referring to CTE 3
cte4_name AS (
SELECT col1
FROM cte3_name
)
-- Main query
SELECT *
FROM cte4_name
In this example, CTE 3 depends on CTE 1 and CTE 2 which are independent of
each other and CTE 4 depends on CTE 3.
-- Define CTE 1
WITH cte1_name AS (
SELECT col1
FROM table1_name
),
-- Define CTE 2
cte2_name AS (
SELECT col1
FROM table2_name
),
-- Define CTE 3 by referring to CTE 1 and 2
cte3_name AS (
SELECT *
FROM cte1_name AS cte1
JOIN cte2_name AS cte2
ON cte1.col1 = cte2.col1
),
-- Define CTE 4 by referring to CTE 3
cte4_name AS (
SELECT col1
FROM cte3_name
)
-- Main query
SELECT *
FROM cte4_name
-- Define CTE 1
WITH cte1_name AS (
SELECT col1
FROM table1_name
),
-- Define CTE 2 by referring to CTE 1
cte2_name AS (
SELECT col1
FROM cte1_name
),
-- Define CTE 3 by referring to CTE 1
cte3_name AS (
SELECT col1
FROM cte1_name
),
-- Define CTE 4 by referring to CTE 1
cte4_name AS (
SELECT col1
FROM cte1_name
),
-- Define CTE 5 by referring to CTE 4
cte5_name AS (
SELECT col1
FROM cte4_name
),
-- Define CTE 6 by referring to CTEs 2, 3 and 5
cte6_name AS (
SELECT *
FROM cte2_name cte2
JOIN cte3_name cte3 ON cte2.col1 = cte3.col1
JOIN cte5_name cte5 ON cte3.col1 = cte5.col1
)
-- Main query
SELECT *
FROM cte6_name
As you can see, there are endless ways in which you can chain or stack CTEs to
solve complex queries. Now that you’ve seen the basics of what CTEs are, let’s
apply them to our project.
Getting our user data from the current form to the final form of one row per user
is not something that can be done in a single step.
Well you probably could hack something together that works but that will not be
very easy to maintain. It's a complex query. So in order to solve it, we need to
decompose (break down) our complex query into smaller, easier to write pieces.
Here’s how to think about it:
We know that a user can perform any of the following activities on any given
date:
1. Post a question
2. Post an answer
3. Edit a question
4. Edit an answer
5. Comment on a post
6. Receive a comment on their post
7. Receive a vote (upvote or downvote) on their post
We have separate tables for these activities, so our first step is to aggregate the
data from each of the tables to the user_id and activity_date granularity and put
each one in its own CTE. We can break this down into several sub-problems and
map out a solution like this:
Sub-problem 1
Calculate user metrics for post types and post activity types.
To get there we first have to manipulate the granularity of the post_history table
so we have one row per user_id per post_id per activity_type per activity_date.
That would look like this:
--listing 3.1
WITH post_activity AS (
SELECT
ph.post_id,
ph.user_id,
u.display_name AS user_name,
ph.creation_date AS activity_date,
CASE WHEN ph.post_history_type_id IN (1,2,3) THEN 'create'
WHEN ph.post_history_type_id IN (4,5,6) THEN 'edit'
END AS activity_type
FROM
post_history ph
INNER JOIN users u
ON u.id = ph.user_id
WHERE
TRUE
AND ph.post_history_type_id BETWEEN 1 AND 6
AND user_id > 0 --exclude automated processes
AND user_id IS NOT NULL --exclude deleted accounts
GROUP BY
1,2,3,4,5
)
SELECT *
FROM post_activity
WHERE user_id = 4603670
ORDER BY activity_date
LIMIT 10;
--sample output:
post_id |user_id|user_name |activity_date |activity_type|
--------+-------+----------------+-----------------------+-------------+
70192540|4603670|Barmak Shemirani|2021-12-01 23:30:38.057|create |
70192540|4603670|Barmak Shemirani|2021-12-01 23:35:42.157|edit |
70193076|4603670|Barmak Shemirani|2021-12-02 01:06:08.973|edit |
70192540|4603670|Barmak Shemirani|2021-12-02 01:56:02.137|edit |
70199876|4603670|Barmak Shemirani|2021-12-02 12:54:40.230|create |
70199876|4603670|Barmak Shemirani|2021-12-02 13:21:05.200|edit |
70199876|4603670|Barmak Shemirani|2021-12-02 14:14:56.210|edit |
70208753|4603670|Barmak Shemirani|2021-12-03 02:18:58.930|create |
70208753|4603670|Barmak Shemirani|2021-12-03 02:40:51.667|edit |
70212702|4603670|Barmak Shemirani|2021-12-03 11:40:09.240|edit |
We then join this with the posts_questions and posts_answers tables on post_id. That
would look like this:
--listing 3.2
WITH post_activity AS (
SELECT
ph.post_id,
ph.user_id,
u.display_name AS user_name,
ph.creation_date AS activity_date,
CASE WHEN ph.post_history_type_id IN (1,2,3) THEN 'create'
WHEN ph.post_history_type_id IN (4,5,6) THEN 'edit'
END AS activity_type
FROM
post_history ph
INNER JOIN users u
ON u.id = ph.user_id
WHERE
TRUE
AND ph.post_history_type_id BETWEEN 1 AND 6
AND user_id > 0 --exclude automated processes
AND user_id IS NOT NULL --exclude deleted accounts
GROUP BY ALL
),
post_types AS (
SELECT
id AS post_id,
'question' AS post_type,
FROM
posts_questions
UNION ALL
SELECT
id AS post_id,
'answer' AS post_type,
FROM
posts_answers
)
--continued below
SELECT
pa.user_id,
CAST(pa.activity_date AS DATE) AS activity_date,
pa.activity_type,
pt.post_type
FROM
post_activity pa
JOIN post_types pt ON pa.post_id = pt.post_id
WHERE user_id = 4603670
LIMIT 10;
--sample output:
user_id|activity_date|activity_type|post_type|
-------+-------------+-------------+---------+
4603670| 2021-12-01|edit |answer |
4603670| 2021-12-01|create |answer |
4603670| 2021-12-02|edit |answer |
4603670| 2021-12-02|edit |answer |
4603670| 2021-12-02|create |answer |
4603670| 2021-12-02|edit |question |
4603670| 2021-12-02|edit |answer |
4603670| 2021-12-03|edit |answer |
4603670| 2021-12-03|create |answer |
4603670| 2021-12-03|edit |question |
What we really want is to pivot data from rows into columns using Pattern 3 from
Chapter 2:
--listing 3.3
WITH post_activity AS (
SELECT
ph.post_id,
ph.user_id,
u.display_name AS user_name,
ph.creation_date AS activity_date,
CASE WHEN ph.post_history_type_id IN (1,2,3) THEN 'create'
WHEN ph.post_history_type_id IN (4,5,6) THEN 'edit'
END AS activity_type
FROM
post_history ph
INNER JOIN users u
ON u.id = ph.user_id
WHERE
TRUE
AND ph.post_history_type_id BETWEEN 1 AND 6
AND user_id > 0 --exclude automated processes
AND user_id IS NOT NULL --exclude deleted accounts
GROUP BY
1,2,3,4,5
),
post_types AS (
SELECT
id AS post_id,
'question' AS post_type,
FROM
posts_questions
UNION ALL
SELECT
id AS post_id,
'answer' AS post_type,
FROM
posts_answers
)
--continued below
SELECT
user_id,
CAST(pa.activity_date AS DATE) AS activity_dt,
SUM(CASE WHEN activity_type = 'create'
AND post_type = 'question' THEN 1 ELSE 0 END) AS question_create,
SUM(CASE WHEN activity_type = 'create'
AND post_type = 'answer' THEN 1 ELSE 0 END) AS answer_create,
SUM(CASE WHEN activity_type = 'edit'
AND post_type = 'question' THEN 1 ELSE 0 END) AS question_edit,
SUM(CASE WHEN activity_type = 'edit'
AND post_type = 'answer' THEN 1 ELSE 0 END) AS answer_edit
FROM post_activity pa
JOIN post_types pt ON pt.post_id = pa.post_id
WHERE user_id = 4603670
GROUP BY 1,2
LIMIT 10;
--sample output
user_id|activity_dt|question_create|answer_create|question_edit|answer_edit|
-------+-----------+---------------+-------------+-------------+-----------+
4603670| 2021-12-01| 0| 1| 0| 1|
4603670| 2021-12-02| 0| 1| 1| 3|
4603670| 2021-12-03| 0| 3| 1| 5|
4603670| 2021-12-04| 0| 2| 0| 6|
4603670| 2021-12-05| 0| 2| 0| 3|
4603670| 2021-12-06| 0| 3| 2| 9|
4603670| 2021-12-07| 0| 2| 3| 2|
4603670| 2021-12-08| 0| 2| 2| 6|
4603670| 2021-12-09| 0| 0| 1| 0|
4603670| 2021-12-10| 0| 1| 1| 1|
Sub-problem 2
Calculate comment metrics: comments made by the user and comments received on the user's posts.
--listing 3.4
WITH post_activity AS (
SELECT
ph.post_id,
ph.user_id,
u.display_name AS user_name,
ph.creation_date AS activity_date,
CASE WHEN ph.post_history_type_id IN (1,2,3) THEN 'create'
WHEN ph.post_history_type_id IN (4,5,6) THEN 'edit'
END AS activity_type
FROM
post_history ph
INNER JOIN users u
ON u.id = ph.user_id
WHERE
TRUE
AND ph.post_history_type_id BETWEEN 1 AND 6
AND user_id > 0 --exclude automated processes
AND user_id IS NOT NULL --exclude deleted accounts
GROUP BY
1,2,3,4,5
)
, comments_on_user_post AS (
SELECT
pa.user_id,
CAST(c.creation_date AS DATE) AS activity_date,
COUNT(*) as total_comments
FROM
comments c
INNER JOIN post_activity pa ON pa.post_id = c.post_id
WHERE
TRUE
AND pa.activity_type = 'create'
GROUP BY
1,2
)
--continued below
, comments_by_user AS (
SELECT
user_id,
CAST(creation_date AS DATE) AS activity_date,
COUNT(*) as total_comments
FROM
comments
GROUP BY
1,2
)
SELECT
c1.user_id,
c1.activity_date,
c1.total_comments AS comments_by_user,
c2.total_comments AS comments_on_user_post
FROM comments_by_user c1
LEFT OUTER JOIN comments_on_user_post c2
ON c1.user_id = c2.user_id
AND c1.activity_date = c2.activity_date
WHERE
c1.user_id = 4603670
LIMIT 10;
--sample output
user_id|activity_date|comments_by_user|comments_on_user_post|
-------+-------------+----------------+---------------------+
4603670| 2021-12-03| 3| 7|
4603670| 2021-12-05| 7| 1|
4603670| 2021-12-06| 9| 6|
4603670| 2021-12-08| 6| 7|
4603670| 2021-12-10| 4| 2|
4603670| 2021-12-11| 3| 6|
4603670| 2021-12-12| 2| 4|
4603670| 2021-12-13| 1| 1|
4603670| 2021-12-26| 1| 3|
4603670| 2021-12-24| 3| 2|
Sub-problem 3
Calculate vote metrics: upvotes and downvotes received on the user's posts.
--listing 3.5
WITH post_activity AS (
SELECT
ph.post_id,
ph.user_id,
u.display_name AS user_name,
ph.creation_date AS activity_date,
CASE WHEN ph.post_history_type_id IN (1,2,3) THEN 'create'
WHEN ph.post_history_type_id IN (4,5,6) THEN 'edit'
END AS activity_type
FROM
post_history ph
INNER JOIN users u
ON u.id = ph.user_id
WHERE
TRUE
AND ph.post_history_type_id BETWEEN 1 AND 6
AND user_id > 0 --exclude automated processes
AND user_id IS NOT NULL --exclude deleted accounts
GROUP BY
1,2,3,4,5
)
, votes_on_user_post AS (
SELECT
pa.user_id,
CAST(v.creation_date AS DATE) AS activity_date,
SUM(CASE WHEN vote_type_id = 2 THEN 1 ELSE 0 END) AS total_upvotes,
SUM(CASE WHEN vote_type_id = 3 THEN 1 ELSE 0 END) AS total_downvotes,
FROM
votes v
INNER JOIN post_activity pa ON pa.post_id = v.post_id
WHERE
TRUE
AND pa.activity_type = 'create'
GROUP BY
1,2
)
--continued below
SELECT
v.user_id,
v.activity_date,
v.total_upvotes,
v.total_downvotes
FROM
votes_on_user_post v
WHERE
v.user_id = 4603670
LIMIT 10;
--sample output:
user_id|activity_date|total_upvotes|total_downvotes|
-------+-------------+-------------+---------------+
4603670| 2021-12-02| 0| 1|
4603670| 2021-12-03| 3| 0|
4603670| 2021-12-05| 2| 0|
4603670| 2021-12-06| 5| 0|
4603670| 2021-12-07| 2| 0|
4603670| 2021-12-08| 2| 0|
4603670| 2021-12-09| 1| 0|
4603670| 2021-12-10| 0| 0|
4603670| 2021-12-11| 2| 0|
4603670| 2021-12-12| 1| 0|
By now you should start to see very clearly how the final result is constructed.
All we have to do is take the 3 results from the sub-problems and join them
together on user_id and activity_date. This will allow us to have a single table
with a granularity of one row per user and all the metrics aggregated on the day
level like this:
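A minimal sketch of that join (assuming the day-level results of the three sub-problems are wrapped in CTEs named post_metrics, comment_metrics and vote_metrics; the full query appears later in the book):

--hypothetical sketch of joining the three sub-problem results
SELECT
pm.user_id,
pm.activity_dt AS activity_date,
pm.question_create,
pm.answer_create,
pm.question_edit,
pm.answer_edit,
cm.comments_by_user,
cm.comments_on_user_post,
vm.total_upvotes,
vm.total_downvotes
FROM post_metrics pm
LEFT JOIN comment_metrics cm
ON cm.user_id = pm.user_id
AND cm.activity_date = pm.activity_dt
LEFT JOIN vote_metrics vm
ON vm.user_id = pm.user_id
AND vm.activity_date = pm.activity_dt;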
When you find yourself copying and pasting CTEs across multiple queries it’s
time to turn them into views or UDFs. Views are database objects that can be
queried with SQL just like a table.
The difference between the two is that views typically don’t contain any data.
They store a query that gets executed every time the view is queried (just like a
CTE).
I say “typically” because there are certain types of views that do contain data
(known as materialized views but we won’t cover them here).
Creating a view is easy:
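A minimal sketch of the syntax, using a hypothetical view name (the concrete example for our project appears in listing 3.7 below):

--hypothetical example: creating a simple view
CREATE OR REPLACE VIEW v_user_names AS
SELECT
id AS user_id,
display_name AS user_name
FROM
users;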
This view is now stored in the database but it doesn’t take up any space (unless
it’s materialized). It only stores the query which is executed each time you select
from the view or join the view in a query.
Views can be put inside of CTEs or can themselves contain CTEs, thus creating
multiple layers of modularity. Here’s an example of what that would look like.
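A minimal sketch of that layering, reusing the hypothetical v_user_names view from above:

--a view queried inside a CTE, just like a table
WITH named_users AS (
SELECT user_id, user_name
FROM v_user_names
)
SELECT COUNT(*)
FROM named_users;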
Side Note: By combining views and CTEs, you’re nesting many queries
within others. Not only does this negatively impact performance but some
databases have limits to how many levels of nesting you can have.
Instead of copying and pasting the same CTE in multiple places, you can turn it into a view and store
it in the database. What could be made into a view in our specific query?
I think the post_types CTE would be a good candidate. That way whenever you
have to combine all the post types you don’t have to use that CTE everywhere.
--listing 3.7
CREATE OR REPLACE VIEW v_post_types AS
SELECT
id AS post_id,
'question' AS post_type,
FROM
posts_questions
UNION ALL
SELECT
id AS post_id,
'answer' AS post_type,
FROM
posts_answers;
Similar to views, you can also put commonly used logic into UDFs (user-defined
functions). Pretty much all databases allow you to create UDFs, but they each
use different programming languages to do so. DuckDB offers
Python for such functionality. You can read about it in the DuckDB documentation.
Functions allow for a lot more flexibility in data processing. While tables and
views use set based logic (set algebra) for operating on data, functions allow
you to work on a single row at a time, use conditional flow of logic (if-then-else),
variables and loops which makes it easy to implement complex logic.
They can return a single scalar value or a table. A single scalar value can be
used for example to parse JSON formatted strings via regular expressions. Table
valued functions return a table instead of a single value.
They behave exactly like views but the main difference is that they can take
input parameters and return different result sets based on that. This can be very
useful.
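If you want parameterized, view-like behavior without leaving SQL, DuckDB's SQL macros are one option (a sketch against our schema; this uses macros rather than the Python UDFs mentioned above):

--a table macro: takes a parameter and returns a result set
CREATE OR REPLACE MACRO history_for_user(uid) AS TABLE
SELECT post_id, creation_date, post_history_type_id
FROM post_history
WHERE user_id = uid;

SELECT *
FROM history_for_user(4603670)
LIMIT 10;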
The SRP principle dictates that your modules should be small, self-contained
and have a single responsibility or purpose. For example you don’t expect the
GPS chip on your phone to also handle WiFi connectivity. The main benefit
of SRP is that it makes modules more composable and facilitates code reuse.
By organizing your code into well thought out “LEGO” blocks, writing complex
queries becomes infinitely easier. dbt makes SRP infinitely better as we’ll see in
a later chapter.
When you’re designing a query and breaking it up into CTEs, there is one principle
to keep in mind. Whenever possible, construct CTEs to ensure that they can be
reused in the query later.
Let’s take a look at the example from earlier:
--listing 3.8
WITH post_activity AS (
SELECT
ph.post_id,
ph.user_id,
u.display_name AS user_name,
ph.creation_date AS activity_date,
CASE WHEN ph.post_history_type_id IN (1,2,3) THEN 'create'
WHEN ph.post_history_type_id IN (4,5,6) THEN 'edit'
END AS activity_type
FROM
post_history ph
INNER JOIN users u
ON u.id = ph.user_id
WHERE
TRUE
AND ph.post_history_type_id BETWEEN 1 AND 6
AND user_id > 0 --exclude automated processes
AND user_id IS NOT NULL --exclude deleted accounts
GROUP BY
1,2,3,4,5
)
SELECT *
FROM post_activity;
We'll reuse this CTE both to attach post types to user activity and to join the comments and votes tables to user-level data via the post_id.
This is at the heart of a well-designed CTE. Notice here that we’re being very
careful about granularity multiplication! If we simply joined with post_activity
on post_id without specifying the activity_type we’d get duplication. By filtering
to include only created posts, since a post can only be created once, we're pretty
safe in getting a single row per post.
The DRY principle dictates that a piece of code encapsulating some functionality
must appear only once in a codebase. So if you find yourself copying and pasting
the same chunk of code everywhere your code is not DRY. The main benefit of
DRY code is maintainability. If your code isn't DRY and you need to change your
logic later, you have to change all the places where the code repeats instead of
a single place.
In the previous section we saw how we can decompose a large complex query
into multiple smaller components. The main benefit for doing this is that it makes
the queries more readable. In that same vein, the DRY (Don’t Repeat Yourself)
principle ensures that your query is clean from unnecessary repetition.
The DRY principle states that if you find yourself copy-pasting the same chunk of
code in multiple locations, you should put that code in a CTE and reference that
CTE where it’s needed.
To illustrate, let's rewrite the query from the previous section so that it still
produces the same result but clearly shows repeating code:
--listing 3.11
WITH post_activity AS (
SELECT
ph.post_id,
ph.user_id,
u.display_name AS user_name,
ph.creation_date AS activity_date,
CASE WHEN ph.post_history_type_id IN (1,2,3) THEN 'create'
WHEN ph.post_history_type_id IN (4,5,6) THEN 'edit'
END AS activity_type
FROM
post_history ph
INNER JOIN users u on u.id = ph.user_id
WHERE
TRUE
AND ph.post_history_type_id BETWEEN 1 AND 6
AND user_id > 0 --exclude automated processes
AND user_id IS NOT NULL --exclude deleted accounts
GROUP BY
1,2,3,4,5
)
, questions AS (
SELECT
id AS post_id,
'question' AS post_type,
pa.user_id,
pa.user_name,
pa.activity_date,
pa.activity_type
FROM
posts_questions q
INNER JOIN post_activity pa ON q.id = pa.post_id
)
, answers AS (
SELECT
id AS post_id,
'answer' AS post_type,
pa.user_id,
pa.user_name,
pa.activity_date,
pa.activity_type
FROM
posts_answers q
INNER JOIN post_activity pa ON q.id = pa.post_id
)
--continued below
SELECT
user_id,
CAST(activity_date AS DATE) AS activity_dt,
SUM(CASE WHEN activity_type = 'create'
AND post_type = 'question' THEN 1 ELSE 0 END) AS question_create,
SUM(CASE WHEN activity_type = 'create'
AND post_type = 'answer' THEN 1 ELSE 0 END) AS answer_create,
SUM(CASE WHEN activity_type = 'edit'
AND post_type = 'question' THEN 1 ELSE 0 END) AS question_edit,
SUM(CASE WHEN activity_type = 'edit'
AND post_type = 'answer' THEN 1 ELSE 0 END) AS answer_edit
FROM
(SELECT * FROM questions
UNION ALL
SELECT * FROM answers) AS p
WHERE
user_id = 4603670
GROUP BY 1,2
LIMIT 10;
--sample output
user_id|activity_dt|question_create|answer_create|question_edit|answer_edit|
-------+-----------+---------------+-------------+-------------+-----------+
4603670| 2021-12-01| 0| 1| 0| 1|
4603670| 2021-12-02| 0| 1| 1| 3|
4603670| 2021-12-03| 0| 3| 1| 5|
4603670| 2021-12-04| 0| 2| 0| 6|
4603670| 2021-12-05| 0| 2| 0| 3|
4603670| 2021-12-06| 0| 3| 2| 9|
4603670| 2021-12-07| 0| 2| 3| 2|
4603670| 2021-12-08| 0| 2| 2| 6|
4603670| 2021-12-09| 0| 0| 1| 0|
4603670| 2021-12-10| 0| 1| 1| 1|
This query will get you the same results as listing 3.3 you saw earlier, but notice
that the questions and answers CTEs both have almost identical code. What if
we had 10 different post types? You’d be copying and pasting a lot of code thus
repeating yourself. Also, the subquery that handles the UNION is not ideal.
When you find yourself implementing very specific logic in a model that might
be used elsewhere, move that logic upstream closer to the source of data. In
the world of DAGs, upstream has a very precise meaning. It means to move
potentially common logic onto earlier nodes in the graph because you never
know which downstream models might use it.
(Models here refer to dbt models, which will be covered in a separate chapter.)
Figure 3.5 - SQL DAG
With that out of the way let’s now look at some performance patterns.
Chapter 4: Performance Patterns
In this chapter we’re going to talk about query performance, aka how to make
your queries run faster. Why do we care about making queries run faster? Faster
queries get you results faster, obviously, but they also consume fewer resources,
making them cheaper on modern data warehouses.
This chapter isn’t just about speed however. There are many clever hacks to make
your queries run really fast, but many of them will make your code unreadable
and unmaintainable. We want to strike a balance between performance and
maintainability.
So far we’ve learned that using modularity via CTEs and views is the best way to
tackle complex queries. We also learned to keep our modules small and single purpose.
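Suppose we take the post_activity CTE from Chapter 3 and apply a date-range filter outside the CTE, in the outer query. A sketch of such a query (the sample output below is what a query like this produces):

--filter applied outside the CTE
WITH post_activity AS (
SELECT
ph.post_id,
ph.user_id,
u.display_name AS user_name,
ph.creation_date AS activity_date,
CASE WHEN ph.post_history_type_id IN (1,2,3) THEN 'create'
WHEN ph.post_history_type_id IN (4,5,6) THEN 'edit'
END AS activity_type
FROM
post_history ph
INNER JOIN users u
ON u.id = ph.user_id
WHERE
TRUE
AND ph.post_history_type_id BETWEEN 1 AND 6
AND user_id > 0 --exclude automated processes
AND user_id IS NOT NULL --exclude deleted accounts
GROUP BY
1,2,3,4,5
)
SELECT *
FROM post_activity
WHERE activity_date BETWEEN '2021-12-14' AND '2021-12-21'
LIMIT 10;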
--sample output
post_id |user_id |user_name |activity_date |activity_type|
--------+--------+----------------+-----------------------+-------------+
70401248|13437718|BGE34 |2021-12-18 05:50:33.917|edit |
70380038|17501206|vtable |2021-12-16 21:47:01.913|edit |
70387919|17697814|user17697814 |2021-12-17 02:55:13.043|create |
70364800|17436438|user17436438 |2021-12-15 13:48:18.577|create |
70382506|12327190|TalGav |2021-12-16 16:31:44.240|create |
70401589| 5708566|windowsill |2021-12-18 07:05:07.927|create |
70401645| 8331542|Saad Abdul Majid|2021-12-18 07:17:10.987|create |
70418579| 4925718|msefer |2021-12-20 07:25:11.413|create |
70362252| 4925718|msefer |2021-12-15 13:35:49.967|edit |
70362983| 4925718|msefer |2021-12-20 07:13:06.500|edit |
This is a correct way to filter the results and it may even be performant in our
small database using the blazingly fast DuckDB engine, but it's better if we can
filter data inside the CTE rather than outside. Sometimes that's by design; for
example we might want a rolling window of just the current week's post activity:
--listing 4.2
WITH post_activity AS (
SELECT
ph.post_id,
ph.user_id,
u.display_name AS user_name,
ph.creation_date AS activity_date,
CASE WHEN ph.post_history_type_id IN (1,2,3) THEN 'create'
WHEN ph.post_history_type_id IN (4,5,6) THEN 'edit'
END AS activity_type
FROM
post_history ph
INNER JOIN users u
ON u.id = ph.user_id
WHERE
TRUE
AND ph.post_history_type_id BETWEEN 1 AND 6
AND user_id > 0 --exclude automated processes
AND user_id IS NOT NULL --exclude deleted accounts
AND activity_date BETWEEN '2021-12-14' AND '2021-12-21'
GROUP BY
1,2,3,4,5
)
SELECT *
FROM post_activity
LIMIT 10;
--sample output
post_id |user_id |user_name |activity_date |activity_type|
--------+--------+----------------+-----------------------+-------------+
70401248|13437718|BGE34 |2021-12-18 05:50:33.917|edit |
70380038|17501206|vtable |2021-12-16 21:47:01.913|edit |
70387919|17697814|user17697814 |2021-12-17 02:55:13.043|create |
70364800|17436438|user17436438 |2021-12-15 13:48:18.577|create |
70382506|12327190|TalGav |2021-12-16 16:31:44.240|create |
70401589| 5708566|windowsill |2021-12-18 07:05:07.927|create |
70401645| 8331542|Saad Abdul Majid|2021-12-18 07:17:10.987|create |
70418579| 4925718|msefer |2021-12-20 07:25:11.413|create |
70362252| 4925718|msefer |2021-12-15 13:35:49.967|edit |
70362983| 4925718|msefer |2021-12-20 07:13:06.500|edit |
Moving the WHERE clause filter inside the CTE is an example of filtering data as
early as possible. We might use that CTE several times and it will make our query
more performant if we do.
Almost every SQL book or course will tell you to start exploring a table by doing:
--listing 4.3
SELECT *
FROM posts_questions
LIMIT 10;
This may be ok in a traditional RDBMS, but with modern data warehouses things are different. Because they store data in columns instead of rows, SELECT * will scan the entire table and your query will be slower even if you limit it to 10 rows.
Here's an example you've seen before. In the post_types CTE we select only the id column, which is the only one we need to join with post_activity on. The post_type is a static value whose performance cost is negligible.
--code snippet will not run
--listing 4.4
,post_types AS (
SELECT
id AS post_id,
'question' AS post_type,
FROM
posts_questions
UNION ALL
SELECT
id AS post_id,
'answer' AS post_type,
FROM
posts_answers
)
Compared to:
--code snippet will not run
--listing 4.5
,post_types AS (
SELECT
pq.*,
id AS post_id,
'question' AS post_type,
FROM
posts_questions pq
UNION ALL
SELECT
pa.*,
id AS post_id,
'answer' AS post_type,
FROM
posts_answers pa
)
It may seem innocent at first, but if any of those tables contained 300 columns,
now you’ll be selecting all 300 of them every time you join on those CTEs. You
don’t have to know anything about databases to know that the query will be
much slower than if you selected a subset of columns.
As a rule of thumb you should AVOID any kind of sorting inside production level
queries. Sorting is a very expensive operation, especially for really large tables
and it will dramatically slow down your queries.
What’s worse, if you add an ORDER BY operation in your CTEs or views, anytime
you join with that CTE or view, the database engine will be forced to sort data
every time before joining. That will make your queries crawl!
Sorting is best left to reporting and BI tools if it’s not needed, or done at the very
end, if it is at all necessary. You can’t always avoid it though. Window functions
for example necessitate sorting in order to choose the top row. We’ll see an
example of this later.
For example, the following is unnecessary and slows down performance because the sorting is done inside a CTE. You don't need to sort your data yet.
--code snippet will not run
--listing 4.6
, votes_on_user_post AS (
SELECT
pa.user_id,
CAST(DATE_TRUNC('day', v.creation_date) AS DATE) AS activity_date,
SUM(CASE WHEN vote_type_id = 2 THEN 1 ELSE 0 END) AS total_upvotes,
SUM(CASE WHEN vote_type_id = 3 THEN 1 ELSE 0 END) AS total_downvotes,
FROM
votes v
INNER JOIN post_activity pa ON pa.post_id = v.post_id
WHERE
TRUE
AND pa.activity_type = 'create'
AND v.creation_date BETWEEN '2021-12-14' AND '2021-12-21'
GROUP BY
1,2
ORDER BY
activity_date
)
SELECT DISTINCT is a code smell for me. Whenever I see it, I suspect the programmer is trying to hide data problems without fixing them. It's such a common catchall fix that a meme I posted about it blew up on both Twitter/X and LinkedIn.
Now imagine if DISTINCT is coded inside of a view and that view gets used multiple
times downstream. Those operations will be performed every time you join on
that view.
If you must use it, make sure you materialize the query into a table that gets
refreshed regularly with a tool like dbt. That way your results are clean and the
DISTINCT operation is performed once.
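With dbt, for example, materializing such a deduplicated query as a table is a one-line config; a minimal sketch (the model name and column choices are illustrative):
--hypothetical dbt model, e.g. models/deduped_post_activity.sql
{{ config(materialized='table') }}

SELECT DISTINCT
    post_id,
    user_id
FROM
    {{ ref('post_history') }}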
The most insidious application of DISTINCT I have personally dealt with is when
combining multiple tables via the UNION operator. As discussed in Chapter 2
Pattern 4, the UNION operator will append data and ensure uniqueness of the
results.
In this case I had inadvertently used UNION instead of UNION ALL and when I
fixed it, query execution went from 15 minutes down to 1 minute while the result
was identical.
Here’s my original query:
--original code
with cte_union_source_data as (
select
column1,
column2,
count(*) as total
from source_table1
group by 1, 2
union
select
column1,
column2,
count(*) as total
from source_table2
group by 1, 2
union
select
column1,
column2,
count(*) as total
from source_table3
group by 1, 2
)
select
column1,
column2,
sum(total) as total
from
cte_union_source_data
group by 1, 2;
It’s pretty straightforward. I was aggregating the results from multiple tables
inside a CTE then summing everything up. By using UNION I was guaranteeing
uniqueness of the results before the final aggregation. This query was taking 15
minutes.
Once I realized my mistake, I changed it to this:
-- refactored code
with cte_union_source_data as (
select
column1,
column2
from
source_table1
union all
select
column1,
column2
from
source_table2
union all
select
column1,
column2
from
source_table3
)
select
column1,
column2,
count(*) as total
from
cte_union_source_data
group by 1, 2;
Now I'm simply appending all the results, including any duplicates, and then aggregating them. Apart from being 15x faster, because we're only doing one aggregation and avoiding the deduplication step that UNION performs, this query is simpler and more compact.
Here's an example with our database. Suppose I'm trying to get the total user activity (i.e. posts created, edited and commented on). My original query looked like this:
--listing 4.15
WITH cte_user_activity_by_type AS (
SELECT
user_id,
CASE WHEN post_history_type_id IN (1,2,3) THEN 'create'
WHEN post_history_type_id IN (4,5,6) THEN 'edit'
END AS activity_type,
COUNT(*) as total_activity
FROM
post_history
GROUP BY
1,2
UNION
SELECT
user_id,
'commented' AS activity_type,
COUNT(*) as total_activity
FROM
comments
GROUP BY
1,2
)
SELECT
user_id,
sum(total_activity) as total_activity
FROM
cte_user_activity_by_type
GROUP BY 1
LIMIT 10;
--sample output
user_id |total_activity|
--------+--------------+
3690518| 2|
3439894| 37|
5454021| 4|
14391494| 10|
7069126| 9|
433351| 4|
2186184| 6|
12579274| 11|
15821771| 22|
752843| 16|
Notice how I'm aggregating twice inside the CTE and appending the results using UNION instead of UNION ALL. While the final result is correct because I sum the total activity afterwards, the aggregation inside the CTE is unnecessary.
We could rewrite the query using UNION ALL while simultaneously avoiding the expensive aggregations, like this:
--listing 4.16
WITH cte_user_activity_by_type AS (
SELECT
user_id,
CASE WHEN post_history_type_id IN (1,2,3) THEN 'create'
WHEN post_history_type_id IN (4,5,6) THEN 'edit'
END AS activity_type
FROM
post_history
UNION ALL
SELECT
user_id,
'comment' AS activity_type
FROM
comments
)
SELECT
user_id,
COUNT(*) as total_activity
FROM
cte_user_activity_by_type
GROUP BY 1
LIMIT 10;
In case you didn’t know, you can put anything in the WHERE clause. You already
know about filtering on dates, numbers and strings of course but you can also
filter by calculations, functions, CASE statements, etc. WHERE clauses can get
quite complicated.
When you compare a column to a fixed value or to another column, the query optimizer can filter down to the relevant rows much faster. When you use a function or a complicated formula, the optimizer needs to scan the entire table and perform that calculation before doing the filtering.
This is negligible for small tables, but when dealing with millions of rows query performance will suffer. Let's see some examples. The tags column in both posts_questions and posts_answers stores the topics a post covers. Here's what it looks like:
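A simple query like the following (a sketch of what presumably produced the sample below) shows the column:
--sketch of the kind of query behind the sample output below
SELECT
    q.id AS post_id,
    q.creation_date,
    q.tags
FROM
    posts_questions q
LIMIT 10;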
--sample output
post_id |creation_date |tags |
--------+-----------------------+-------------------------------------+
70177589|2021-12-01 00:02:03.777|blockchain|nearprotocol|near|nearcore|
70177596|2021-12-01 00:02:52.657|google-oauth|google-workspace |
70177598|2021-12-01 00:03:16.373|python|graph|networkx |
70177601|2021-12-01 00:03:32.413|elasticsearch |
70177623|2021-12-01 00:06:16.950|python|tkinter |
70177624|2021-12-01 00:06:19.537|c# |
70177627|2021-12-01 00:07:50.607|flutter |
70177629|2021-12-01 00:08:02.943|python|python-3.x|pexpect |
70177630|2021-12-01 00:08:16.173|sql|sql-server|tsql |
70177633|2021-12-01 00:08:46.233|sql|sql-server|tsql |
The tags pertain to the list of topics or subjects that a post is about. One of the tricky things about storing tags like this is that you can't rely on the order in which they appear. There's no categorization system here; a tag can appear anywhere in the string.
Suppose we're looking for posts mentioning SQL. How would we do it? I'm pretty sure you're familiar with pattern matching strings in SQL using the keyword LIKE.
But since we don't know if the string is capitalized (i.e. it could be SQL, sql, Sql, etc.) and we want to match all of them, it's common to use a function like LOWER() to force the case before matching the pattern.
Here's an example of what NOT to do (unless you're doing ad-hoc querying):
--listing 4.8
SELECT
q.id AS post_id,
q.creation_date,
q.tags
FROM
posts_questions q
WHERE
TRUE
AND lower(tags) like '%sql%'
LIMIT 10;
Here’s how to get the same result without using functions in WHERE
--listing 4.9
SELECT
q.id AS post_id,
q.creation_date,
q.tags
FROM
posts_questions q
WHERE
TRUE
AND tags ilike '%sql%'
LIMIT 10;
In our small database this query will be quite fast; however, by using the function LOWER() in the WHERE clause, you're causing the database engine to scan the entire table, perform the lowercase operation and then perform the filtering. By using the keyword ILIKE (which makes the pattern match case-insensitive) we avoid using LOWER() altogether.
Alternatively, you can apply LOWER() beforehand in a CTE or view like this:
--listing 4.10
WITH cte_lowercase_tags AS (
SELECT
q.id AS post_id,
q.creation_date,
LOWER(q.tags) as tags
FROM
posts_questions q
)
SELECT *
FROM cte_lowercase_tags
WHERE tags LIKE '%sql%'
LIMIT 10;
--sample output
post_id |creation_date |tags |
--------+-----------------------+--------------------------+
70338059|2021-12-13 16:46:16.940|mysql|node.js|sequelize.js|
70276304|2021-12-08 14:02:39.313|sql-order-by|where-clause |
70341363|2021-12-13 21:50:42.510|php|mysql |
70218001|2021-12-03 16:54:34.417|windows|postgresql |
70287562|2021-12-09 09:35:49.333|database|psql |
70292467|2021-12-09 15:25:07.093|mysql |
70316036|2021-12-11 14:37:31.220|python|sqlalchemy |
70239290|2021-12-05 22:56:40.487|javascript|sqlite |
70274207|2021-12-08 11:26:41.477|sql|rest|td-engine |
70192916|2021-12-02 00:33:41.363|sql|spring|spring-boot |
I mentioned earlier that this is not advisable, but in this case, if you really need to lowercase the tags, it's another option. Ideally we can prepare data ahead of time so that production-level tables contain strings with a consistent case. You can do that with a tool like dbt, where you materialize the lowercase tags into a table to make downstream querying much easier.
Let's look at a few more examples. In this query we're trying to filter by performing a math operation in the WHERE clause. The same thing applies: the database performs a full table scan before filtering.
--listing 4.11
SELECT
q.id AS post_id,
q.creation_date,
q.answer_count + q.comment_count as total_activity
FROM
posts_questions q
WHERE
TRUE
AND answer_count + comment_count >= 10
LIMIT 10;
--sample output
post_id |creation_date |total_activity|
--------+-----------------------+--------------+
70270242|2021-12-08 05:09:48.113| 10|
70255288|2021-12-07 05:19:45.337| 12|
70256716|2021-12-07 08:04:30.497| 10|
70318632|2021-12-11 20:10:08.213| 12|
70334900|2021-12-13 12:45:37.097| 11|
70333905|2021-12-13 11:29:00.117| 14|
70237681|2021-12-05 19:13:40.890| 10|
70257087|2021-12-07 08:38:39.263| 10|
70281346|2021-12-08 20:29:31.357| 13|
70190971|2021-12-01 20:43:14.507| 12|
As before, we can move the calculation into a CTE and filter on the computed column:
--listing 4.12
WITH cte_total_activity AS (
SELECT
q.id AS post_id,
q.creation_date,
q.answer_count + q.comment_count as total_activity
FROM
posts_questions q
)
SELECT *
FROM cte_total_activity
WHERE total_activity >= 10
LIMIT 10;
--sample output
post_id |creation_date |total_activity|
--------+-----------------------+--------------+
70270242|2021-12-08 05:09:48.113| 10|
70255288|2021-12-07 05:19:45.337| 12|
70256716|2021-12-07 08:04:30.497| 10|
70318632|2021-12-11 20:10:08.213| 12|
70334900|2021-12-13 12:45:37.097| 11|
70333905|2021-12-13 11:29:00.117| 14|
70237681|2021-12-05 19:13:40.890| 10|
70257087|2021-12-07 08:38:39.263| 10|
70281346|2021-12-08 20:29:31.357| 13|
70190971|2021-12-01 20:43:14.507| 12|
Let's look at another common example with date functions. You often want to filter a date field by week, month, quarter, etc. It's quite common to see queries that apply a date part function in the WHERE clause so you can filter to the proper week, like below. Here we want only the questions posted in week 50.
--listing 4.13
SELECT
q.id AS post_id,
q.creation_date,
DATE_PART('week', creation_date) as week_of_year
FROM
posts_questions q
WHERE
DATE_PART('week', creation_date) = 50
LIMIT 10;
--sample output
post_id |creation_date |week_of_year|
--------+-----------------------+------------+
70337022|2021-12-13 15:25:08.903| 50|
70338059|2021-12-13 16:46:16.940| 50|
70348470|2021-12-14 11:56:02.373| 50|
70347796|2021-12-14 11:02:31.563| 50|
70347279|2021-12-14 10:24:40.953| 50|
70337072|2021-12-13 15:28:32.317| 50|
70328850|2021-12-13 00:35:38.387| 50|
70332341|2021-12-13 09:22:07.927| 50|
70333562|2021-12-13 11:00:05.760| 50|
70341363|2021-12-13 21:50:42.510| 50|
With dates we can be a little clever and avoid using DATE_PART() in the WHERE clause. We can dynamically calculate the start and end dates of week 50 and then filter directly on creation_date. Note that applying DATE_TRUNC() to a static value (like 2021-01-01) is really fast. The same applies if you use scalar functions that return a single value (e.g. CURRENT_DATE()).
--listing 4.14
SELECT
q.id AS post_id,
q.creation_date,
DATE_PART('week', creation_date) as week_of_year
FROM
posts_questions q
WHERE
creation_date >= DATE_TRUNC('week', '2021-01-01'::date + INTERVAL 50 WEEK)
AND creation_date < DATE_TRUNC('week', '2021-01-01'::date + INTERVAL 51 WEEK)
LIMIT 10;
--sample output
post_id |creation_date |week_of_year|
--------+-----------------------+------------+
70337022|2021-12-13 15:25:08.903| 50|
70338059|2021-12-13 16:46:16.940| 50|
70348470|2021-12-14 11:56:02.373| 50|
70347796|2021-12-14 11:02:31.563| 50|
70347279|2021-12-14 10:24:40.953| 50|
70337072|2021-12-13 15:28:32.317| 50|
70328850|2021-12-13 00:35:38.387| 50|
70332341|2021-12-13 09:22:07.927| 50|
70333562|2021-12-13 11:00:05.760| 50|
70341363|2021-12-13 21:50:42.510| 50|
Using OR in the WHERE clause can be quite natural based on the logic you're trying to implement, but I bet you didn't know there are hidden performance "gotchas" if you do. They're not very obvious either, so let me show you.
If you use OR to search for multiple values of the same column, there will be no
performance issues. In fact you already do this without realizing it.
Let’s see an example. This query will get all the created posts:
--listing 4.17
SELECT
post_id,
creation_date,
user_id
FROM
post_history
WHERE
post_history_type_id IN (1,2,3);
The IN operator is just shorthand for multiple OR conditions on the same column. The trouble starts when you use OR to combine conditions on different columns, especially across joined tables, like this:
--listing 4.20
SELECT
ph.post_id,
ph.creation_date,
u.display_name
FROM
post_history ph
INNER JOIN users u
ON u.id = ph.user_id
WHERE
ph.post_history_type_id = 1 OR u.up_votes >= 100;
When I see a query like this, I immediately know it will cause problems. It might
be fast in our tiny database with a fast engine like DuckDB but when you throw
millions of rows at it, you will see performance degradation.
What happens is that the database engine will most likely perform the two separate filtering operations and then combine the results via a join. But there's good news! You can rewrite the above query using UNION ALL and get the same result, often with a 10x - 100x performance improvement, as long as the second branch excludes the rows already returned by the first. Here it is:
--listing 4.21
SELECT
post_id,
ph.creation_date,
user_id
FROM
post_history ph
INNER JOIN users u
ON u.id = ph.user_id
WHERE
post_history_type_id = 1
UNION ALL
SELECT
post_id,
ph.creation_date,
user_id
FROM
post_history ph
INNER JOIN users u
ON u.id = ph.user_id
WHERE
u.up_votes >= 100
AND ph.post_history_type_id <> 1; --exclude rows already returned by the first branch
What we've done here is separate the two filtering conditions into their own queries and then combine the results. This often allows the database engine to parallelize the filtering operations and then simply append the results, which is a lot faster.
With that we wrap up our chapter on query performance. There’s a lot more
to learn about improving query performance but that’s not the purpose of this
book. In the next chapter we’ll cover how to make your queries robust against
unexpected changes in the underlying data.
Chapter 5: Robustness Patterns
In this chapter we’re going to talk about how to protect your queries against
most data problems you’ll encounter. Robustness means that your query will
not break if the underlying data changes unpredictably.
Spend enough time working with real world data and you’ll eventually get
burned by one of these. But when you know about them ahead of time you can
write defensive code.
Side Note: I don’t believe in the term “dirty data.” There’s no such thing. I
prefer the terms “fit for purpose” and “unfit for purpose.” Most real world
data is unfit for the purposes you want so you have to “retrofit” it to make
it suitable. Retrofitting is a much more suitable term for data preparation
because it avoids blame. Data that’s fit for purpose can be used as is.
We cannot know exactly how data will change, but we CAN foresee many of the patterns in how data changes and write our queries to protect against them.
Here are some of the patterns of data changes:
1. New columns are added that have NULL values for past data
2. Existing columns that didn’t have NULLs before now contain NULLs
3. Columns that contained numbers or dates stored as strings now contain
unexpected values
4. The formatting of dates or numbers gets messed up and type conversion
fails.
5. The denominator in a ratio calculation becomes zero and now we’re
dividing by zero
SQL supports three primitive data types: strings, numbers and dates. They allow for mathematical operations with numbers, calendar operations with dates and many types of string operations.
It’s quite common to see numbers and dates stored as strings, especially when
you’re loading flat text files like CSVs or TSVs. Some data loading tools will try
and guess the type and format it on the fly but they’re not always correct. So you
will often have to manually convert dates and numbers.
The standard function for converting data in SQL is CAST(). Some database implementations, like SQL Server, also have their own custom function called CONVERT() but support CAST() as well. We will use CAST() to convert both between types (like string to date) and within the same type (like a timestamp to a date).
Here’s an example of how type conversion works:
--listing 5.1
SELECT CAST('2021-12-01' as DATE);
CAST('2021-12-01' AS DATE)|
--------------------------+
2021-12-01|
That should work in most cases, but there are always exceptions. Suppose the month value is invalid:
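Here's a sketch of the failing conversion (the original listing 5.2 was presumably similar):
--presumably listing 5.2: an invalid month breaks the conversion
SELECT CAST('2021-13-01' as DATE);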
Obviously there’s no 13th month so we get an error. What if the date was fine but
the formatting was bad?
--listing 5.3
SELECT CAST('2021-12--01' as DATE);
The extra dash in this case messes up automatic conversion, but the date itself
was correct. What if you try to convert a string to a number and the data is not
numeric?
--listing 5.4
SELECT CAST('2o21' as INT);
So how do we deal with these issues? Let’s have a look at some patterns.
One of the easiest ways to deal with formatting issues when converting data is to
simply ignore bad formatting. What this means is we simply skip the malformed
rows when querying data.
This works great in cases when the error is unfixable or occurs very rarely. So if a
few rows out of 10 million are malformed and can’t be fixed we can skip them.
However the CAST() function will fail if it encounters an issue, thus breaking the
query, and we want our query to be robust. To deal with this problem some
databases introduce “safe” casting functions like SAFE_CAST() or TRY_CAST().
Note: Not all databases provide this function. PostgreSQL, for example, doesn't have built-in safe casting, but it can be built as a custom user-defined function (UDF).
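If you're on PostgreSQL, a minimal sketch of such a UDF (the function name is only an illustration) could look like this:
-- hypothetical PostgreSQL UDF mimicking TRY_CAST for dates
CREATE OR REPLACE FUNCTION try_cast_date(p_text TEXT)
RETURNS DATE AS
$$
BEGIN
    RETURN p_text::DATE;
EXCEPTION WHEN OTHERS THEN
    RETURN NULL; -- swallow the conversion error and return NULL instead
END;
$$ LANGUAGE plpgsql;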
SAFE_CAST() and TRY_CAST() are designed to return NULL if the conversion fails instead of breaking the query. We can then handle the NULL with COALESCE() to replace the bad values with a sensible default.
DuckDB uses TRY_CAST() so let’s see it in action:
--listing 5.5
SELECT TRY_CAST('2021-12--01' as DATE) AS dt;
dt|
------+
NULL |
If we want to skip the incorrect values we can leave it as is. If, however, we don't want to lose the bad rows, we can replace the values using COALESCE():
--listing 5.6
SELECT COALESCE(TRY_CAST('2o21' as INT), 0) AS year;
year|
----+
0|
While ignoring incorrect data is easy, you can't always get away with it. Sometimes you need to extract the actual data by finding patterns in how the formatting is broken and fixing them using string parsing functions. Let's see some examples.
Suppose that some of the rows of dates had extra dashes like this:
2021-12--01
2021-12--02
2021-12--03
2021-12--04
Since this is a recurring format, we can use string parsing functions to remove
the extra dash and then do the conversion like this:
--listing 5.7
WITH dates AS (
SELECT '2021-12--01' AS dt
UNION ALL
SELECT '2021-12--02' AS dt
UNION ALL
SELECT '2021-12--03' AS dt
UNION ALL
SELECT '2021-12--04' AS dt
UNION ALL
SELECT '2021-12--05' AS dt
)
SELECT TRY_CAST(SUBSTRING(dt, 1, 4) || '-' ||
SUBSTRING(dt, 6, 2) || '-' ||
SUBSTRING(dt, 10, 2) AS DATE) AS date_field
FROM dates;
date_field|
----------+
2021-12-01|
2021-12-02|
2021-12-03|
2021-12-04|
2021-12-05|
So as you can see in this example, we took advantage of the regularity of the incorrect formatting to extract the year, month and day from the rows and reconstruct the correct format by concatenating strings via the || operator.
What if you have different types of irregularities in your data? In some cases if
information is aggregated from multiple sources you might have to deal with
mixed formatting.
Let’s take a look at an example:
dt |
-----------+
2021-12--01|
2021-12--02|
2021-12--03|
12/04/2021 |
12/05/2021 |
Obviously we can’t force the same formatting for all the dates here so we’ll have
to split it up using the CASE statement like this:
--listing 5.8
WITH dates AS (
SELECT '2021-12--01' AS dt
UNION ALL
SELECT '2021-12--02' AS dt
UNION ALL
SELECT '2021-12--03' AS dt
UNION ALL
SELECT '12/04/2021' AS dt
UNION ALL
SELECT '12/05/2021' AS dt
)
SELECT TRY_CAST(CASE WHEN dt LIKE '%-%--%'
THEN SUBSTRING(dt, 1, 4) || '-' ||
SUBSTRING(dt, 6, 2) || '-' ||
SUBSTRING(dt, 10, 2)
WHEN dt LIKE '%/%/%'
THEN SUBSTRING(dt, 7, 4) || '-' ||
SUBSTRING(dt, 1, 2) || '-' ||
SUBSTRING(dt, 4, 2)
END AS DATE) AS date_field
FROM dates;
--sample output
date_field|
----------+
2021-12-01|
2021-12-02|
2021-12-03|
2021-12-04|
2021-12-05|
Notice how we’re separating rows with different formatting using the CASE and
LIKE operators to handle each of them differently. You can repeat this pattern as
many times as you want to handle each different format.
The same pattern works for numbers stored as strings with units attached, like these weights in pounds and kilograms:
--listing 5.9
WITH weights AS (
SELECT '32.5lb' AS wt
UNION ALL
SELECT '45.2lb' AS wt
UNION ALL
SELECT '53.1lb' AS wt
UNION ALL
SELECT '77kg' AS wt
UNION ALL
SELECT '68kg' AS wt
)
SELECT
TRY_CAST(CASE WHEN wt LIKE '%lb' THEN SUBSTRING(wt, 1, INSTR(wt, 'lb')-1)
WHEN wt LIKE '%kg' THEN SUBSTRING(wt, 1, INSTR(wt, 'kg')-1)
END AS DECIMAL) AS weight,
CASE WHEN wt LIKE '%lb' THEN 'LB'
WHEN wt LIKE '%kg' THEN 'KG'
END AS unit
FROM weights;
--sample output
weight|unit|
------+----+
32.500|LB |
45.200|LB |
53.100|LB |
77.000|KG |
68.000|KG |
I'm using the SUBSTRING() function again to extract parts of a string, and I use the INSTR() function, which searches for a string within another string and returns the position of its first occurrence (or 0 if not found), to tell SUBSTRING() how many characters to read.
NULLs in SQL represent unknown values. While the data may appear to be blank
or empty in the results, it’s not the same as an empty string or white space. The
reason we want to handle them is because they cause issues when it comes
to comparing fields or joining data. They might confuse users, so as a general
pattern you should replace NULLs with predetermined default values.
One of my favorite rules of thumb is to always use a LEFT JOIN when I’m not sure
if one table is a subset of the other.
For example, in the query below we use a left join with the static table post_history_type_mapping because we're not sure how the post_history_type_id values might change.
We might have new mappings being created that we haven't added to our lookup table yet, and we don't want to limit our final results unknowingly. By the way, this query is part of our dbt project and is explained in Chapter 7.
--listing 5.10
SELECT
id,
post_id,
post_history_type_id,
revision_guid,
user_id,
COALESCE(m.activity_type, 'unknown') AS activity_type,
COALESCE(m.grouped_activity_type, 'unknown') AS grouped_activity_type,
COALESCE(creation_date, '1900-01-01') AS creation_date,
COALESCE(text, 'unknown') AS text,
COALESCE(comment, 'unknown') AS comment
FROM
{{ ref('post_history') }} ph
LEFT JOIN {{ ref('post_history_type_mapping') }} m
ON ph.post_history_type_id = m.post_history_type_id
As a rule, you should always assume any column can be NULL at any point in
time so it’s a good idea to provide a default value for that column as part of your
SELECT. This way you make sure that even if your data becomes NULL your query
will not fail.
For strings you might use default values such as NA, Not Provided, Not Available,
etc. Dates and numbers are trickier. For a date field you might use a default
value such as 1900-01-01 and that’s a safe enough signal that the data is not
available.
Doing this however could mess up age calculations, especially if the age is later
averaged, so be careful where you use it. Same thing applies to using a default
value like 0, -1, or 9999 for numbers. It might make sense when the column
cannot be 0 or negative, but not always.
You do this by using COALESCE() as described earlier:
--listing 5.11
SELECT
id,
COALESCE(display_name, 'unknown') AS user_name,
COALESCE(about_me, 'unknown') AS about_me,
COALESCE(age, 'unknown') AS age,
COALESCE(creation_date, '1900-01-01') AS creation_date,
COALESCE(last_access_date, '1900-01-01') AS last_access_date,
COALESCE(location, 'unknown') AS location,
COALESCE(reputation, 0) AS reputation,
COALESCE(up_votes, 0) AS up_votes,
COALESCE(down_votes, 0) AS down_votes,
COALESCE(views, 0) AS views,
COALESCE(profile_image_url, 'unknown') AS profile_image_url,
COALESCE(website_url, 'unknown') AS website_url
FROM
users
LIMIT 10;
Since id is the primary key in this table it can't be NULL, so we're not handling it here, but we do handle every other column regardless of whether it currently contains NULLs or not.
When you calculate ratios you must always handle potential division by zero.
Your query might work when you first test it, but if the denominator ever becomes
zero it will fail.
The easiest way to handle this is by excluding zero values in the denominator.
This will work fine but it will also filter out rows which could be needed.
Here’s an example:
--listing 5.12
WITH cte_test_data AS (
SELECT 94 as comments_on_post, 38 as posts_created
UNION ALL
SELECT 62, 0
UNION ALL
SELECT 39, 20
UNION ALL
SELECT 34, 19
UNION ALL
SELECT 167, 120
UNION ALL
SELECT 189, 48
UNION ALL
SELECT 96, 17
UNION ALL
SELECT 15, 15
)
SELECT
ROUND(CAST(comments_on_post AS NUMERIC) /
CAST(posts_created AS NUMERIC), 1) AS comments_on_post_per_post
FROM
cte_test_data
WHERE
posts_created > 0;
--sample output
comments_on_post_per_post|
-------------------------+
2.5|
2.0|
1.8|
1.4|
3.9|
5.6|
1.0|
The best way to handle division by zero without filtering out rows is to use a
CASE statement. While this will work, there are other options. Cloud warehouses
like BigQuery offer a SAFE_DIVIDE() function which returns NULL in the case of
divide-by-zero error.
Then you simply deal with NULL values using COALESCE() like above. Snowflake
offers a similar function called DIV0() which automatically returns 0 if there’s a
division by zero error. DuckDB on the other hand seems to handle divide by zero
directly without throwing an error.
Here’s an example:
--listing 5.13
WITH cte_test_data AS (
SELECT 94 as comments_on_post, 38 as posts_created
UNION ALL
SELECT 62, 0
UNION ALL
SELECT 39, 20
UNION ALL
SELECT 34, 19
UNION ALL
SELECT 167, 120
UNION ALL
SELECT 189, 48
UNION ALL
SELECT 96, 17
UNION ALL
SELECT 15, 15
)
SELECT
CASE
WHEN posts_created > 0 THEN
ROUND(CAST(comments_on_post AS NUMERIC) /
CAST(posts_created AS NUMERIC), 1)
ELSE 0
END AS comments_on_post_per_post
FROM
cte_test_data;
I said earlier that strings are the easiest way to store any kind of data (numbers,
dates, strings) but strings also have their own issues, especially when you’re
trying to join on a string field.
Here are some issues you’ll undoubtedly run into with strings.
1. Inconsistent casing
2. Space padding
3. Unexpected characters
Many databases are case sensitive so if the same string is stored with different
cases it will not match when doing a join. Let’s see an example:
--listing 5.14
SELECT 'string' = 'String' AS test;
test |
-----+
false|
As you can see, the different case causes the test to return FALSE. The only way to deal with this problem when joining on strings or matching patterns is to convert all fields to upper or lower case.
--listing 5.15
SELECT LOWER('string') = LOWER('String') AS test;
test|
----+
true|
Space padding is the other common issue you'll deal with when working with strings.
--listing 5.16
SELECT 'string' = ' string' AS test;
test |
-----+
false|
You deal with this by using the TRIM() function which removes all the leading and
trailing spaces.
--listing 5.17
SELECT TRIM('string') = TRIM(' string') AS test;
test|
----+
true|
As for handling unexpected characters, you'll first need to figure out how they appear and then fix them using the REPLACE() function. This can vary a lot, but usually you'll want to replace the offending characters with an empty string.
Here’s an example:
--listing 5.18
SELECT REPLACE(TRIM(LOWER('String//')), '/', '') = TRIM(LOWER(' string')) AS test;
test|
----+
true|
Schema changes are one of the most common issues with source data. Whether
the changes came from your internal engineering team or an external party, you
should have ways to deal with them gracefully.
The data interface pattern states that you should have a single point of entry
between external data and your workflow. This means that all external tables
should have an internal table or view that “translates” their columns into
meaningful equivalents and all queries downstream depend on the internal
table or view.
Here's an example of such an internal view: it translates the raw post_history columns into meaningful names and applies the robustness patterns from earlier in this chapter:
--listing 5.20
SELECT
ph.id AS post_history_id,
ph.post_id,
ph.post_history_type_id,
ph.revision_guid,
ph.user_id,
COALESCE(m.activity_type, 'unknown') AS activity_type,
COALESCE(m.grouped_activity_type, 'unknown') AS grouped_activity_type,
COALESCE(ph.creation_date, '1900-01-01') AS post_creation_date,
COALESCE(ph.text, 'unknown') AS post_text,
COALESCE(ph.comment, 'unknown') AS post_comment
FROM
post_history ph
LEFT JOIN post_history_type_mapping m
ON ph.post_history_type_id = m.post_history_type_id
With that we wrap up our chapter on query robustness. In the next chapter we
get to see the entire query for user engagement. It’s also a great opportunity to
review what we’ve learned so far.
Chapter 6: Finishing the Project
In this chapter we wrap up our query and go over it one more time highlighting
the various patterns we’ve learned so far. This is a good opportunity to test
yourself and see what you’ve learned. Analyze the query and see what patterns
you recognize.
So here's the whole query:
-- listing 6.1
-- Get the user name and collapse the granularity of post_history
-- to the user_id, post_id, activity type and date
WITH post_activity AS (
SELECT
ph.post_id,
ph.user_id,
u.display_name AS user_name,
ph.creation_date AS activity_date,
CASE WHEN ph.post_history_type_id IN (1,2,3) THEN 'create'
WHEN ph.post_history_type_id IN (4,5,6) THEN 'edit'
END AS activity_type
FROM
post_history ph
INNER JOIN users u
ON u.id = ph.user_id
WHERE
TRUE
AND ph.post_history_type_id BETWEEN 1 AND 6
AND user_id > 0 --exclude automated processes
AND user_id IS NOT NULL --exclude deleted accounts
GROUP BY
1,2,3,4,5
)
-- Get the post types we care about (questions and answers only) and combine them in one CTE
,post_types AS (
SELECT
id AS post_id,
'question' AS post_type,
FROM
posts_questions
UNION ALL
SELECT
id AS post_id,
'answer' AS post_type,
FROM
posts_answers
)
-- Finally calculate the post metrics
, user_post_metrics AS (
SELECT
user_id,
user_name,
TRY_CAST(activity_date AS DATE) AS activity_date,
SUM(CASE WHEN activity_type = 'create' AND post_type = 'question'
THEN 1 ELSE 0 END) AS questions_created,
SUM(CASE WHEN activity_type = 'create' AND post_type = 'answer'
THEN 1 ELSE 0 END) AS answers_created,
SUM(CASE WHEN activity_type = 'edit' AND post_type = 'question'
THEN 1 ELSE 0 END) AS questions_edited,
SUM(CASE WHEN activity_type = 'edit' AND post_type = 'answer'
THEN 1 ELSE 0 END) AS answers_edited,
SUM(CASE WHEN activity_type = 'create'
THEN 1 ELSE 0 END) AS posts_created,
SUM(CASE WHEN activity_type = 'edit'
THEN 1 ELSE 0 END) AS posts_edited
FROM
post_types pt
JOIN post_activity pa ON pt.post_id = pa.post_id
GROUP BY 1,2,3
)
, comments_by_user AS (
SELECT
user_id,
TRY_CAST(creation_date AS DATE) AS activity_date,
COUNT(*) as total_comments
FROM
comments
WHERE
TRUE
GROUP BY
1,2
)
, comments_on_user_post AS (
SELECT
pa.user_id,
TRY_CAST(c.creation_date AS DATE) AS activity_date,
COUNT(*) as total_comments
FROM
comments c
INNER JOIN post_activity pa ON pa.post_id = c.post_id
WHERE
TRUE
AND pa.activity_type = 'create'
GROUP BY
1,2
)
, votes_on_user_post AS (
SELECT
pa.user_id,
TRY_CAST(v.creation_date AS DATE) AS activity_date,
SUM(CASE WHEN vote_type_id = 2 THEN 1 ELSE 0 END) AS total_upvotes,
SUM(CASE WHEN vote_type_id = 3 THEN 1 ELSE 0 END) AS total_downvotes,
FROM
votes v
INNER JOIN post_activity pa ON pa.post_id = v.post_id
WHERE
TRUE
AND pa.activity_type = 'create'
GROUP BY
1,2
)
, total_metrics_per_user AS (
SELECT
pm.user_id,
pm.user_name,
CAST(SUM(pm.posts_created) AS NUMERIC) AS posts_created,
CAST(SUM(pm.posts_edited) AS NUMERIC) AS posts_edited,
CAST(SUM(pm.answers_created) AS NUMERIC) AS answers_created,
CAST(SUM(pm.questions_created) AS NUMERIC) AS questions_created,
CAST(SUM(vu.total_upvotes) AS NUMERIC) AS total_upvotes,
CAST(SUM(vu.total_downvotes) AS NUMERIC) AS total_downvotes,
CAST(SUM(cu.total_comments) AS NUMERIC) AS comments_by_user,
CAST(SUM(cp.total_comments) AS NUMERIC) AS comments_on_post,
CAST(COUNT(DISTINCT pm.activity_date) AS NUMERIC) AS streak_in_days
FROM
user_post_metrics pm
JOIN votes_on_user_post vu
ON pm.activity_date = vu.activity_date
AND pm.user_id = vu.user_id
JOIN comments_on_user_post cp
ON pm.activity_date = cp.activity_date
AND pm.user_id = cp.user_id
JOIN comments_by_user cu
ON pm.activity_date = cu.activity_date
AND pm.user_id = cu.user_id
GROUP BY
1,2
)
------------------------------------------------
---- Main Query
SELECT
user_id,
user_name,
posts_created,
answers_created,
questions_created,
total_upvotes,
comments_by_user,
comments_on_post,
streak_in_days,
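    -- NOTE: the rest of this listing is a sketch; based on the Project Remarks the
    -- original presumably computes per-post ratios, protecting each division with
    -- the CASE pattern from Chapter 5 (the ratio names below are assumptions)
    ROUND(CASE WHEN posts_created > 0
          THEN questions_created / posts_created ELSE 0 END, 1) AS questions_per_post,
    ROUND(CASE WHEN posts_created > 0
          THEN answers_created / posts_created ELSE 0 END, 1) AS answers_per_post,
    ROUND(CASE WHEN posts_created > 0
          THEN comments_on_post / posts_created ELSE 0 END, 1) AS comments_on_post_per_post
FROM
    total_metrics_per_user;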
Project Remarks
There are a few things to mention before we move on to the next chapter.
Our query is very long and complex. While we did a pretty good job of decomposing it into clean modules, it's still 200+ lines long. Many of the CTEs can only be used inside this query. As discussed in Chapter 3, if we want to use them elsewhere in the database we need to create views. We'll see how to do this with dbt in the next chapter.
You'll notice that in the CTE named total_metrics_per_user, I cast all those integer values to the NUMERIC type. Why? The reason is that when many databases divide two integers they perform integer division and drop the decimal part.
By casting to NUMERIC we ensure we get decimal places. And since the number of decimals can be unpredictable, we use the ROUND() function to round all the values to 1 decimal place. A clever trick to do this without casting is to multiply each column by 1.0, which forces the database to do the type conversion implicitly.
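As a quick illustration, assuming an engine like PostgreSQL where dividing two integers truncates the result:
--illustration only: integer division vs. the two workarounds
SELECT
    7 / 2                                             AS int_division,   --3 (decimal part dropped)
    7 * 1.0 / 2                                       AS implicit_cast,  --3.5
    ROUND(CAST(7 AS NUMERIC) / CAST(2 AS NUMERIC), 1) AS explicit_cast;  --3.5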
Did you notice how many times the CASE statement was repeated? It makes
the query unnecessarily complicated and hard to maintain. Remember
the DRY Principle? Is there a way we can avoid having to repeat it? Not unless your database has a "safe divide" function, but there is a way to do this with a macro in a SQL compiler like dbt. We'll see that pattern in the next chapter.
Now that you have all these wonderful metrics you can sort the results
by any of them to see different types of users. For example you can sort
by questions_per_post to see everyone who posts mostly questions or
answers_by_post to see those who post mostly answers. You can also create
new metrics that indicate who your best users are.
Some of the best uses of this type of table are for customer segmentation or as a
feature table for data science. In fact this is exactly the type of table DS and ML engineers build when deploying machine learning systems.
That wraps up our final project, but we’re not done yet. In the next chapter we’ll
see how to apply many of the patterns with dbt.
Chapter 7: dbt Patterns
In this chapter we’re going to use all the patterns we’ve seen so far to simplify
our final query from the project we just saw using dbt.
dbt is a SQL compiler that uses a combination of SQL code with Jinja templates
to allow far greater flexibility in how you design data transformations than SQL
alone. These are patterns I use every day in my job and they have helped me
make my code not only easier to maintain and debug but also portable across
many platforms.
The power of dbt is its support for dependencies, which lets you decompose a data transformation into modular workflows that form DAGs. It also supports macros, which make your code more portable.
What we'll do in this chapter is take the query we completed in Chapter 6 and show you how to rewrite it with dbt. I won't go into too much depth on how dbt works, because I don't want to make this a dbt tutorial. You can learn more in the official dbt documentation at docs.getdbt.com.
dbt uses the concept of "models" for modularizing your code. All the models by default live in the models folder. In the GitHub repo for this book, under the models folder you will find 3 subfolders: bronze, silver and gold. They represent what is called "the medallion architecture." I won't get into details about that here; you can read about it at https://www.databricks.com/glossary/medallion-architecture
The first one (bronze) loads the StackOverflow tables from parquet files as is, without any modifications. We've used those exact tables throughout the book.
But the beauty of dbt is that it makes it really easy to create our own custom
models while applying the robustness patterns we learned in Chapter 5. We can
have our own foundational models rather than rely on raw data. The model
below uses COALESCE() on all the fields, ensuring that all downstream models
no longer have to worry about NULLs.
Have a look at this example in the models/clean subfolder:
--model post_activity_history_clean_original
SELECT
id,
post_id,
post_history_type_id,
revision_guid,
user_id,
CASE
WHEN post_history_type_id IN (1,2,3) THEN 'create'
WHEN post_history_type_id IN (4,5,6) THEN 'edit'
WHEN post_history_type_id IN (7,8,9) THEN 'rollback'
END AS grouped_activity_type,
CASE
WHEN post_history_type_id = 1 THEN 'create_title'
WHEN post_history_type_id = 2 THEN 'create_body'
WHEN post_history_type_id = 3 THEN 'create_tags'
WHEN post_history_type_id = 4 THEN 'edit_title'
WHEN post_history_type_id = 5 THEN 'edit_body'
WHEN post_history_type_id = 6 THEN 'edit_tags'
WHEN post_history_type_id = 10 THEN 'post_closed'
WHEN post_history_type_id = 11 THEN 'post_reopened'
WHEN post_history_type_id = 12 THEN 'post_deleted'
WHEN post_history_type_id = 13 THEN 'post_undeleted'
WHEN post_history_type_id = 14 THEN 'post_locked'
WHEN post_history_type_id = 15 THEN 'post_unlocked'
WHEN post_history_type_id = 16 THEN 'community_owned'
WHEN post_history_type_id = 17 THEN 'post_migrated'
WHEN post_history_type_id = 18 THEN 'question_merged'
WHEN post_history_type_id = 19 THEN 'question_protected'
WHEN post_history_type_id = 20 THEN 'question_unprotected'
WHEN post_history_type_id = 21 THEN 'post_disassociated'
WHEN post_history_type_id = 22 THEN 'question_unmerged'
WHEN post_history_type_id = 24 THEN 'suggested_edit_applied'
WHEN post_history_type_id = 25 THEN 'post_tweeted'
WHEN post_history_type_id = 31 THEN 'comment_discussion_moved_to_chat'
WHEN post_history_type_id = 33 THEN 'post_notice_added'
WHEN post_history_type_id = 34 THEN 'post_notice_removed'
WHEN post_history_type_id = 35 THEN 'post_migrated'
WHEN post_history_type_id = 36 THEN 'post_migrated'
WHEN post_history_type_id = 37 THEN 'post_merge_source'
WHEN post_history_type_id = 38 THEN 'post_merge_destination'
WHEN post_history_type_id = 50 THEN 'bumped_by_community_user'
WHEN post_history_type_id = 52 THEN 'question_became_hot_network'
WHEN post_history_type_id = 53 THEN 'question_removed_from_hot_network'
WHEN post_history_type_id = 66 THEN 'created_from_ask_wizard'
END AS activity_type,
COALESCE(creation_date, '1900-01-01') AS creation_date,
COALESCE(text, 'unknown') AS text,
COALESCE(comment, 'unknown') AS comment
FROM
{{ ref('post_history') }}
We also handle the mapping of the post_history_type_id here. There are a lot more types than we saw before because we didn't need them, but now we can put them all in one place so we only work with text later. Text descriptions make code more readable and maintainable than magic numbers.
This is fine, but do you notice how many times we had to copy-paste the same piece of code? Can we do better? With dbt we can. There's a concept in dbt called seed files, which is perfect for this type of mapping. This is basically a CSV file with three columns: post_history_type_id, activity_type and grouped_activity_type. The file makes it a lot easier to add or update mappings in the future.
This is what it looks like:
--seed file (partial listing)
post_history_type_id,activity_type,grouped_activity_type
1,create_title,create
2,create_body,create
3,create_tags,create
4,edit_title,edit
5,edit_body,edit
6,edit_tags,edit
7,rollback_title,rollback
8,rollback_body,rollback
9,rollback_tags,rollback
10,post_closed,post_closed
11,post_reopened,post_reopened
12,post_deleted,post_deleted
13,post_undeleted,post_undeleted
14,post_locked,post_locked
...
With the seed file loaded, the cleanup model becomes much simpler:
--model post_history_clean
SELECT
id,
post_id,
post_history_type_id,
revision_guid,
user_id,
COALESCE(m.activity_type, 'unknown') AS activity_type,
COALESCE(m.grouped_activity_type, 'unknown') AS grouped_activity_type,
COALESCE(creation_date, '1900-01-01') AS creation_date,
COALESCE(text, 'unknown') AS text,
COALESCE(comment, 'unknown') AS comment
FROM
{{ ref('post_history') }} ph
LEFT JOIN {{ ref('post_history_type_mapping') }} m
ON ph.post_history_type_id = m.post_history_type_id
Notice a couple of things. First of all, our code is a lot more compact and easier to read, understand and maintain. Second, we're using a LEFT JOIN as explained in Chapter 5 Pattern 3. Also notice how we assume activity_type and grouped_activity_type could be NULL and COALESCE() the values coming from the LEFT JOIN in order to protect ourselves.
While CTEs provide a great way to decompose a single query into readable and
maintainable modules, they don’t go far enough. If you wanted to reuse any of
them you’d have to manually create views. And when views no longer cut it, due
to performance issues, you’d have to materialize them into tables.
dbt makes both of those options easier while also allowing you to create linkages across models, forming a DAG as we saw in Chapter 3.
The post_types CTE from Chapter 6 only selects the post_id and post_type columns, but I think a full union of the post tables can be very useful in the future, so we create a more comprehensive model that unions all the columns in a single view. To save ourselves from writing boilerplate SQL
and cover future cases where new columns are added to the base tables we use
the union_relations() macro from dbt-utils:
--listing 7.2 all_post_types_combined
{{
dbt_utils.union_relations(
relations=[ref('posts_answers_clean'), ref('posts_questions_clean')]
)
}}
The macro will compile into the appropriate SQL before execution. If you want to see the compiled code (which I won't list here) simply run dbt compile -m all_post_types_combined. And if you want to see the beautiful DAG it creates, just run dbt docs generate && dbt docs serve.
Here's one of the CTEs from the model that computes per-user, per-day metrics (all_user_metrics_per_day, referenced later):
cte_all_posts_created_and_edited AS (
SELECT
pa.user_id,
TRY_CAST(pa.creation_date AS DATE) AS activity_date,
{{- sumif("pa.grouped_activity_type = 'create'
AND pt.post_type = 'question'", 1) }} AS
questions_created,
{{- sumif("pa.grouped_activity_type = 'create'
AND pt.post_type = 'answer'", 1) }} AS answers_created,
{{- sumif("pa.grouped_activity_type = 'edit'
AND pt.post_type = 'question'", 1) }} AS questions_edited,
{{- sumif("pa.grouped_activity_type = 'edit'
AND pt.post_type = 'answer'", 1) }} AS answers_edited,
{{- sumif("pa.grouped_activity_type = 'create'", 1) }} AS
posts_created,
{{- sumif("pa.grouped_activity_type = 'edit'", 1) }} AS
posts_edited
FROM
{{ ref('all_post_types_combined') }} pt
INNER JOIN {{ ref('post_activity_history_clean') }} pa
ON pt.post_id = pa.post_id
WHERE
true
AND pa.grouped_activity_type in ('create', 'edit')
AND pt.post_type in ('question', 'answer')
AND pa.user_id > 0 --exclude automated processes
AND pa.user_id IS NOT NULL --exclude deleted accounts
GROUP BY 1,2
)
We do a few very interesting things here. First, notice all that boilerplate SQL with SUM and CASE statements. This is where dbt really shines: we write a custom macro to hide that functionality behind. This is a VERY important pattern unique to dbt. Some might argue this makes the code unnecessarily complex, but I beg to differ. This one pattern has saved me hours of drudgery.
{% macro sumif(condition, column) %}
SUM(CASE WHEN {{condition}} THEN {{column}} ELSE 0 END)
{%- endmacro %}
At first the macro seems superfluous. Why bother, right? In this case it does seem like the macro is not adding any functionality; however, by using a macro we're centralizing the repeated SUM/CASE logic in one place, so if it ever needs to change we only have to change it once and every model that uses it picks up the fix.
Now we can apply that macro to our final model that gets us the same result as
the query in the last chapter.
WITH cte_metrics_per_user AS (
SELECT
user_id,
user_name,
SUM(posts_created) AS posts_created,
SUM(posts_edited) AS posts_edited,
SUM(answers_created) AS answers_created,
SUM(questions_created) AS questions_created,
SUM(total_upvotes) AS total_upvotes,
SUM(total_downvotes) AS total_downvotes,
SUM(comments_by_user) AS comments_by_user,
SUM(comments_on_post) AS comments_on_post,
COUNT(DISTINCT activity_date) AS streak_in_days
FROM
{{ ref('all_user_metrics_per_day') }}
GROUP BY
1,2
)
SELECT
user_id,
user_name,
posts_created,
answers_created,
questions_created,
total_upvotes,
comments_by_user,
comments_on_post,
streak_in_days,
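    -- NOTE: the rest of this model is a sketch based on the Project Remarks in
    -- Chapter 6: the original presumably computes the per-post ratios with a
    -- custom division macro built like sumif (the name safe_divide below is
    -- hypothetical), so the divide-by-zero protection lives in one place
    {{ safe_divide('questions_created', 'posts_created') }} AS questions_per_post,
    {{ safe_divide('answers_created', 'posts_created') }} AS answers_per_post,
    {{ safe_divide('comments_on_post', 'posts_created') }} AS comments_on_post_per_post
FROM
    cte_metrics_per_user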