Lazy Analyst's Guide To Faster SQL
Table of Contents
About Periscope and Authors
Introduction
Pre-Aggregated Data
    Reducing Your Data Set
    Materialized Views
Approximations
    HyperLogLog
    Sampling
About Periscope
Data analysts all over the world use Periscope to get and share insights fast. Our focus
on the people actually running the queries means you get a full-featured SQL editor that
makes writing sophisticated queries quick and easy.
With our in-memory data caching system, customers realize an average of 150X
improvement in query speeds. Our intuitive sharing features allow you to quickly share
data throughout your organization. Just type SQL and get charts.
Periscope is built by a small team of hackers working out of a loft in San Francisco. We
love our customers, SQL, and that moment when a blip in the data makes you say, "wait
a minute."
The Authors
Jason Freidman
David Ganzhorn
Tom O'Neill
Harry Glaser
Introduction
We're All About SQL at Periscope
At Periscope we spend a lot of time working with SQL. Our product is built around
creating charts through SQL queries. Every day we help customers write and debug
SQL. We're on a mission to create the world's best SQL editor. Even our blog is all about
SQL!
Why We Made This Book
As part of all this SQL work, we are constantly thinking about how to make our queries
faster. One of the most common problems our customers encounter is slow SQL
queries. Since they come to us for help, we've built a lot of expertise around optimizing
SQL for faster queries.
Some of our most popular blog posts are about speeding up SQL, and they consistently
get positive responses from the community. We figured people must be eager to learn
more about it, so we made this ebook hoping it'd be a useful tool for SQL analysts.
What's in This Book
This book is divided into four sections: Pre-aggregated Data, Avoiding Joins, Avoiding
Table Scans, and Approximations. Each section has tips that we've either covered on our
blog or written exclusively for this ebook. The tactics here vary from beginner to
advanced: There's something for everyone!
Now go forth and make your queries faster!
Hugs and queries,
The Periscope Team
Pre-Aggregating Data
It's common when running reports to need to combine data from different tables in a
single query. Depending on where in the query you do this, you can end up with a
severely slow query. In its simplest form, an aggregate is a summary table that can be
derived by running a SQL query with a group by.
Aggregations are usually precomputed, partially summarized data, stored in new
aggregated tables. The best point to aggregate data for faster queries is as early in
the query as possible.
In this section we'll go over reducing a data set, grouping data, and materialized views.
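To make the idea of a summary table concrete, here is a minimal sketch. It wraps the SQL in Python's built-in sqlite3 module only so it can be run without a database server; the SQL is the point. The time_on_site_logs table and its dashboard_id and user_id columns come from this section, while the seconds column, the sample rows, and the dashboard_summary table name are assumptions made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical raw log table: one row per visit.
    create table time_on_site_logs (dashboard_id integer, user_id integer, seconds integer);
    insert into time_on_site_logs values
        (1, 10, 30), (1, 10, 45), (1, 11, 20),
        (2, 10, 15), (2, 12, 60);

    -- Pre-aggregate once: one row per dashboard instead of one row per log entry.
    create table dashboard_summary as
    select dashboard_id, count(1) as visits, sum(seconds) as total_seconds
    from time_on_site_logs
    group by dashboard_id;
""")

# Reports now read the small summary table instead of scanning the raw logs.
summary = conn.execute(
    "select dashboard_id, visits, total_seconds from dashboard_summary order by dashboard_id"
).fetchall()
print(summary)
```

The trade-off is freshness: the summary table only reflects the logs as of the moment it was built, so it has to be rebuilt or updated on some schedule.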
And now for the big reveal: This query takes 0.7 seconds!
That's a 28X speedup over the previous query, and a 68X
speedup over the original query.
Materialized views
As promised, our group-and-aggregate comes before the
join. And, as a bonus, we can take advantage of the index
on the time_on_site_logs table.
First, Reduce The Data Set
We can do better. By doing the group-and-aggregate over
the whole logs table, we made our database process a lot
of data unnecessarily. Count distinct builds a hash set for
each group (in this case, each dashboard_id) to keep
track of which values have been seen in which buckets.
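One way to avoid the per-group hash sets is to compute the distinct (dashboard_id, user_id) pairs first, count them per dashboard, and only then join to dashboards. The sketch below follows that shape. It runs on Python's built-in sqlite3 module so it works anywhere; the sample rows, the dashboards.id column, and the join condition are assumptions, and timings on a toy table say nothing about the speedups quoted in this section.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table dashboards (id integer primary key, name text);
    create table time_on_site_logs (dashboard_id integer, user_id integer);
    insert into dashboards values (1, 'Revenue'), (2, 'Signups');
    insert into time_on_site_logs values
        (1, 10), (1, 10), (1, 11), (2, 10), (2, 12), (2, 12), (2, 13);
""")

# Baseline: join first, then count distinct users per dashboard.
naive_query = """
    select dashboards.name, count(distinct time_on_site_logs.user_id) as ct
    from time_on_site_logs
    join dashboards on time_on_site_logs.dashboard_id = dashboards.id
    group by dashboards.name
    order by ct desc
"""

# Reduce first: distinct pairs, then a plain count per dashboard, then the join.
reduced_query = """
    select dashboards.name, log_counts.ct
    from dashboards
    join (
        select distinct_logs.dashboard_id, count(1) as ct
        from (
            select distinct dashboard_id, user_id
            from time_on_site_logs
        ) as distinct_logs
        group by distinct_logs.dashboard_id
    ) as log_counts on log_counts.dashboard_id = dashboards.id
    order by log_counts.ct desc
"""

naive_rows = conn.execute(naive_query).fetchall()
reduced_rows = conn.execute(reduced_query).fetchall()
print(reduced_rows)
```

Both queries return the same rows; the second simply hands the join a much smaller input.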
Avoiding Joins
Depending on how your database schema is structured, you're likely to have the data
required for common analysis queries in different tables. Joins are typically used to
combine this data. They're a powerful tool, yet they do have their downside.
Your database has to scan each table that's joined and figure out how each row matches up.
This makes joins expensive when it comes to query performance. You can mitigate this
through smart join usage, but your queries would be even faster if you could avoid joins
completely.
In this section, we'll go over avoiding joins by using generate_series and window
functions.
Generate Series
Calculating Lifetime Metrics
Lifetime metrics are a great way to get a long-term
sense of the health of your business. They're particularly
useful for seeing how healthy a segment is by comparing
it to others over the long term.
For example, here's a fictional graph of lifetime game
players by platform:
Window Functions
Every user started playing on a particular platform on
one day. Once they started, they count forever. So let's
start with the first gameplay for each user on each
platform:
No joins at all! And when we're sorting, it's only over the
relatively small daily_first_gameplays with-clause, and
the final aggregated result.
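Here is a minimal sketch of a join-free lifetime count built on a window function. It uses Python's built-in sqlite3 module (window functions need SQLite 3.25 or newer) so it can be run locally. Only the daily_first_gameplays name comes from the text; the gameplays table, its columns, the first_gameplays with-clause, and the sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical gameplay log: dates stored as ISO text so min() picks the earliest.
    create table gameplays (user_id integer, platform text, created_at text);
    insert into gameplays values
        (1, 'ios', '2014-01-01'), (1, 'ios', '2014-01-03'),
        (2, 'ios', '2014-01-02'), (3, 'ios', '2014-01-02'),
        (4, 'android', '2014-01-01'), (4, 'android', '2014-01-03'),
        (5, 'android', '2014-01-03');
""")

query = """
    with first_gameplays as (
        -- First gameplay for each user on each platform.
        select user_id, platform, min(created_at) as dt
        from gameplays
        group by user_id, platform
    ),
    daily_first_gameplays as (
        -- New players per platform per day.
        select dt, platform, count(1) as ct
        from first_gameplays
        group by dt, platform
    )
    select dt, platform,
           sum(ct) over (partition by platform order by dt) as lifetime_players
    from daily_first_gameplays
    order by platform, dt
"""
lifetime = conn.execute(query).fetchall()
for row in lifetime:
    print(row)
```

Note that a day with no new players produces no row in this sketch, so a chart built from it would have gaps on those days.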
Approximations
For truly large data sets, approximations can be one of the most powerful performance
tools in your toolset. In certain contexts, an answer that is accurate to +/- 5% is just as
useful for making a decision and multiple orders of magnitude faster to get.
More than most techniques, this one comes with a few caveats. The first is making sure
the consumers of your data understand the limitations. Error bars are a useful
visualization that is well understood by data consumers.
The second caveat is to understand the assumptions your technique makes about the
distribution of your data. For example, if your data is heavily skewed rather than
roughly normally distributed, a simple random sample can produce misleading results.
For more details about the various techniques and their uses, read on!
HyperLogLog
We'll optimize a very simple query, which calculates the
daily distinct sessions for 5,000,000 gameplays
(~150,000/day):
Hashing
Bucketing
We use ~(1 << 31) to clear the leftmost bit of the hashed
number. Postgres uses that bit to determine if the
number is positive or negative, and we only want to deal
with positive numbers when taking the logarithm.
The floor(log(2,...)) does the heavy lifting: the integer part
of the base-2 logarithm tells us the position (from the right)
of the most significant bit (MSB). Subtracting that from 31 gives us the
position of the MSB from the left, starting at 1.
With that line we've got the MSB position for each hash of the
session_id field!
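The bit arithmetic described above can be checked outside the database. The sketch below is a Python stand-in, not the SQL itself: the function name is made up, `& 0x7FFFFFFF` plays the role of `& ~(1 << 31)` on a 32-bit integer, and `bit_length() - 1` equals the floor of the base-2 logarithm for positive integers. How a hash of exactly zero should be handled is not stated in the text, so this sketch simply rejects it.

```python
def msb_position_from_left(hashed: int) -> int:
    """Position of the most significant set bit, counted from the left, starting at 1."""
    # Clear the sign bit so we only deal with positive numbers (31 usable bits).
    positive = hashed & 0x7FFFFFFF
    if positive == 0:
        # Assumption: the logarithm is undefined here, so we refuse the input.
        raise ValueError("no bits set after clearing the sign bit")
    # Same value as floor(log2(positive)): the MSB position counted from the right.
    position_from_right = positive.bit_length() - 1
    return 31 - position_from_right


print(msb_position_from_left(0x40000000))  # highest of the 31 bits is set
print(msb_position_from_left(1))           # only the lowest bit is set
print(msb_position_from_left(-1))          # negative input: sign bit is cleared first
```

A hash whose highest usable bit is set gives 1, and a hash with only the lowest bit set gives 31, which matches the "from the left, starting at 1" convention.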
Counting
It's time to put together the buckets and the MSBs. The
paper linked above has a lengthy discussion on the
derivation of this function, so we'll only recreate the
result here. The new variables are m (the number of
buckets, 512 in our case) and M (the list of buckets
indexed by j, the rows of SQL in our case). The denominator of this equation is the harmonic mean mentioned
earlier:
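For readers who want to see the estimate computed end to end, here is a minimal sketch of the standard raw HyperLogLog estimator, E = alpha_m * m^2 / sum over j of 2^(-M[j]), with m = 512 buckets. It is an illustration under assumptions rather than the SQL this section builds: it uses a SHA-256-derived 32-bit hash instead of the database's hash function, takes the low 9 bits as the bucket and the remaining 23 bits for the MSB position, and leaves out the small- and large-range corrections. All function and variable names are made up.

```python
import hashlib

M_BUCKETS = 512      # m in the text
BUCKET_BITS = 9      # 2**9 == 512
HASH_BITS = 32


def hash32(value: str) -> int:
    """Deterministic 32-bit hash (assumption: any well-mixed hash would do)."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def estimate_distinct(values) -> float:
    """Raw HyperLogLog estimate, without range corrections."""
    registers = [0] * M_BUCKETS              # M[j]: largest MSB position seen per bucket
    rest_bits = HASH_BITS - BUCKET_BITS
    for value in values:
        h = hash32(value)
        j = h & (M_BUCKETS - 1)              # low bits choose the bucket
        rest = h >> BUCKET_BITS              # remaining bits feed the MSB position
        msb_from_left = rest_bits - rest.bit_length() + 1
        registers[j] = max(registers[j], msb_from_left)
    alpha = 0.7213 / (1 + 1.079 / M_BUCKETS)   # bias constant for m >= 128
    harmonic_sum = sum(2.0 ** -r for r in registers)
    return alpha * M_BUCKETS ** 2 / harmonic_sum


sessions = [f"session-{i}" for i in range(50000)]
estimate = estimate_distinct(sessions)
print(round(estimate))
```

With 512 buckets the expected relative error is roughly 1.04 / sqrt(512), or about 5%, and feeding in the same values twice leaves the estimate unchanged because each bucket only keeps a maximum.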
Correcting
Bonus: Parallelizing
Sampling
Sampling is an incredibly powerful tool to speed up
analyses at scale. While it's not appropriate for all
datasets or all analyses, when it works, it really works. At
Periscope, we've realized several orders of magnitude in
speedups on large datasets with judicious use of
sampling.
However, when sampling from databases, it's easy to lose
all your speedups by using inefficient methods to select
the sample itself. In this section we'll show you how to
select random samples in fractions of a second.
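To make the contrast concrete, here is a minimal sketch, assuming a PostgreSQL-style random() function and a hypothetical users table (neither the table name nor the 1% rate comes from this guide). The first query is the kind of inefficient method described above; the second is one common cheaper alternative, not necessarily the approach the authors settle on.

```sql
-- Inefficient: assigns a random value to every row, then sorts the
-- whole table just to keep the first 1000 rows.
select *
from users            -- hypothetical table
order by random()
limit 1000;

-- Cheaper: a single pass with no sort. Each row is kept with
-- probability 0.01, so the result is roughly (not exactly) 1% of the table.
select *
from users            -- hypothetical table
where random() < 0.01;
```

The trade-off is that the second form gives an approximate sample size rather than an exact row count, which is usually acceptable when the sample only feeds an aggregate.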