Week 4. Advanced SQL
Keeyong Han
Table Of Contents
1. Recap of the 3rd Week
2. Overview of SQL
3. Basic SQL
4. Advanced SQL
5. Break
6. Methods for Checking Data Quality
7. More on Lab #1
8. Demo & Homework #3
9. Quiz #1
Recap of the 3rd Week
Key Concepts
● Columnar Storage vs. Row Storage
● Primary Key Uniqueness isn’t guaranteed
● Separation of Compute and Storage
● Evolution of Data Infra
● Bulk-update is preferred in Data Warehouse or Data Lake
● Database & schema: a way to organize your tables, acting as hierarchical
containers
Snowflake Characteristics (1)
● Storage layer and Compute layer are separated
● Compute layer is called “Virtual Warehouse”
○ Two types of Virtual Warehouse exist: Regular & Snowpark
○ A Virtual Warehouse has a size: from X-Small to 6X-Large
● Storage layer supports “Time Travel” and “Zero Copy Cloning”
● Stream processing is supported
● Runs on top of AWS, GCP and Azure
● For bulk-update, stage is used as a middle ground: internal vs. external
○ COPY INTO
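As a sketch of the stage-based bulk load above (the stage name, file name, and file format here are illustrative, not from the course material):

```sql
-- Create an internal stage as the middle ground for bulk loading
CREATE STAGE my_stage;

-- Upload a local file to the stage (PUT runs from SnowSQL, not a worksheet):
-- PUT file://sessions.csv @my_stage;

-- Bulk-load the staged file into a table with COPY INTO
COPY INTO raw_data.user_session_channel
FROM @my_stage/sessions.csv
FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);
```

An external stage works the same way, except it points at cloud storage (e.g. an S3 bucket) instead of Snowflake-managed storage.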
Snowflake Characteristics (2)
● Account Structure: Organization -> 1+ Accounts -> 1+ Databases
● Data Marketplace & Data Sharing (“Share, Don’t Move”)
● Snowflake offers 4 editions
○ Standard, Enterprise, Business Critical and Virtual Private Snowflake
● Pricing itself has 3 components
○ Compute Costs: Determined by credits. A credit costs
■ Standard: $2, Enterprise: $3, Business-critical: $4
■ Virtual Warehouse size will determine credit consumption
● X-Small: 1 credit / hour, …, 6X-Large: 512 credits / hour
○ Storage Costs: Calculated per terabyte (TB)
○ Network Costs: Calculated per TB for data transfers
Overview of SQL
History of SQL
{
"isbn": "123-456-222",
"author":
{
"lastname": "Doe",
"firstname": "Jane"
},
"title": "The Ultimate Database Study Guide",
"category": ["Non-Fiction", "Technology"]
}
An Example of Converting Semi-Structured Data to Structured Data
REGEX
Fields parsed out of a web server access log line: remote_ip, remote_user, time, method, uri, protocol, status_code, …
Basic SQL
DDL, DML, WHERE, GROUP BY, COUNT, DISTINCT
DDL
● CREATE TABLE
○ CREATE TABLE … AS SELECT: CTAS
● DROP TABLE
● ALTER TABLE
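A sketch of the three DDL statements above (the table and column names here are made up for illustration):

```sql
-- CREATE TABLE: define a table explicitly
CREATE TABLE adhoc.temperature (
    ts    TIMESTAMP,
    value INT
);

-- CTAS: create a table from the result of a SELECT
CREATE TABLE adhoc.positive_temperature AS
SELECT * FROM adhoc.temperature WHERE value > 0;

-- ALTER TABLE: change an existing table's schema
ALTER TABLE adhoc.temperature ADD COLUMN comment VARCHAR(256);

-- DROP TABLE: remove the tables
DROP TABLE IF EXISTS adhoc.positive_temperature;
DROP TABLE IF EXISTS adhoc.temperature;
```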
NULL
Table: adhoc.count_test (7 rows; value = NULL, 1, 1, 0, 0, 4, 3)

SELECT COUNT(1) FROM adhoc.count_test;              -- 7
SELECT COUNT(0) FROM adhoc.count_test;              -- 7
SELECT COUNT(NULL) FROM adhoc.count_test;           -- 0
SELECT COUNT(value) FROM adhoc.count_test;          -- 6
SELECT COUNT(DISTINCT value) FROM adhoc.count_test; -- 4
CASE WHEN
SELECT
value,
CASE
WHEN value > 0 THEN 'positive'
WHEN value = 0 THEN 'zero'
WHEN value < 0 THEN 'negative'
ELSE 'null'
END sign
FROM dev.adhoc.count_test;

Result (value, sign): (NULL, null), (1, positive), (1, positive), (0, zero), (0, zero), (4, positive), (3, positive)

Table: adhoc.count_test
What is NULL?
[Figure: GROUP BY color with COUNT, producing one count per color group]
Web Service: User ID & Session ID
[Diagram: a Production DB and a Log File feed a Data Pipeline (ETL); User ID and Session ID are connected via JOIN]
GROUP BY & Aggregate Examples
● Group the records of a table and calculate various information for each group.
● This process consists of two steps:
○ First, decide the field(s) to group by (can be one or more fields).
■ Specify these fields using GROUP BY (using field names or field ordinal numbers).
○ Next, decide what to calculate for each group.
■ Here, aggregate functions are used, such as COUNT, SUM, AVG, MIN, MAX, etc.
■ It is common to specify field names (using aliases).
● For example: COUNT(1) AS cnt
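The two steps above can be sketched against the session table that appears later in this deck (assuming analytics.session_summary carries channel and userid fields):

```sql
-- Step 1: group by channel; Step 2: aggregate per group
SELECT
    channel,                            -- grouping field
    COUNT(1) AS cnt,                    -- rows per group
    COUNT(DISTINCT userid) AS user_cnt  -- unique users per group
FROM analytics.session_summary
GROUP BY 1        -- 1 = the first field in the SELECT list (channel)
ORDER BY 2 DESC;  -- 2 = the second field (cnt)
```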
SQL Practice #1
● Course GitHub Repo: https://github.com/keeyong/sjsu-data226-SP25/
● DDL
● DML: INSERT, UPDATE, DELETE
● SELECT
○ CASE WHEN
○ COUNT
○ NULL
○ GROUP BY
What is JOIN?
● JOIN merges two or more tables using a common field ("join key")
○ Used to integrate information that was dispersed across multiple tables
● JOIN creates a new table containing fields from both sides
● Depending on the join method, two things differ:
○ Which records are selected?
○ How are the fields populated?
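For example, using the two raw_data tables that appear later in this deck, joining on the join key sessionid might look like this:

```sql
-- INNER JOIN: keep only sessions present in BOTH tables;
-- the result carries fields from both sides
SELECT A.ts, B.userid, B.channel
FROM raw_data.session_timestamp A
JOIN raw_data.user_session_channel B ON A.sessionid = B.sessionid;

-- LEFT JOIN: keep every row of A;
-- B's fields are populated with NULL when there is no match
SELECT A.ts, B.userid
FROM raw_data.session_timestamp A
LEFT JOIN raw_data.user_session_channel B ON A.sessionid = B.sessionid;
```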
[Diagram: a LEFT table and a RIGHT table combined by JOIN into a new table]
Types of JOIN
Source: https://theartofpostgresql.com/blog/2019-09-sql-joins/
JOIN Syntax
● FULL OUTER JOIN: returns all records from both the left table and the right table; fields from both tables are populated only when there's a match
● CROSS JOIN: returns all combinations of records from the left table and the right table
SELECT *
FROM raw_data.Vital v
CROSS JOIN raw_data.Alert a;
v.UserID v.VitalID v.Date v.Weight a.AlertID a.VitalID a.AlertType a.Date a.UserID
100 1 2020-01-01 75 1 4 WeightIncrease 2020-01-01 101
100 3 2020-01-02 78 1 4 WeightIncrease 2020-01-01 101
101 2 2020-01-01 90 1 4 WeightIncrease 2020-01-01 101
101 4 2020-01-02 95 1 4 WeightIncrease 2020-01-01 101
100 1 2020-01-01 75 2 NULL MissingVital 2020-01-04 100
100 3 2020-01-02 78 2 NULL MissingVital 2020-01-04 100
101 2 2020-01-01 90 2 NULL MissingVital 2020-01-04 100
101 4 2020-01-02 95 2 NULL MissingVital 2020-01-04 100
100 1 2020-01-01 75 3 NULL MissingVital 2020-01-04 101
100 3 2020-01-02 78 3 NULL MissingVital 2020-01-04 101
101 2 2020-01-01 90 3 NULL MissingVital 2020-01-04 101
101 4 2020-01-02 95 3 NULL MissingVital 2020-01-04 101
SELF JOIN
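A self join joins a table to itself under two different aliases. As an illustrative sketch using the Vital table from the CROSS JOIN example, pairing each of a user's vitals with that user's later vitals:

```sql
-- v1 and v2 are two aliases of the SAME table;
-- each vital is paired with the same user's later vitals
SELECT
    v1.UserID,
    v1.Date AS earlier_date,
    v2.Date AS later_date,
    v2.Weight - v1.Weight AS weight_change
FROM raw_data.Vital v1
JOIN raw_data.Vital v2
  ON v1.UserID = v2.UserID AND v1.Date < v2.Date;
```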
SELECT
LEFT(ts, 7) AS year_month,
COUNT(DISTINCT userid) AS mau
FROM analytics.session_summary
GROUP BY 1
ORDER BY 1 DESC;

MAU: Monthly Active User / DAU: Daily Active User / WAU: Weekly Active User
CTE (Common Table Expression)
WITH tmp AS (
SELECT B.*, A.ts
FROM raw_data.session_timestamp A
JOIN raw_data.user_session_channel B ON A.sessionid = B.sessionid
)
SELECT
LEFT(ts, 7) AS year_month,
COUNT(DISTINCT userid) AS mau
FROM tmp
GROUP BY 1
ORDER BY 1 DESC;
Subquery based MAU computation
SELECT
LEFT(ts, 7) AS year_month,
COUNT(DISTINCT userid) AS mau
FROM (
SELECT B.*, A.ts
FROM raw_data.session_timestamp A
JOIN raw_data.user_session_channel B ON A.sessionid = B.sessionid
)
GROUP BY 1
ORDER BY 1 DESC;
SQL Practice #3
● CTAS
● CTE & Subquery
Break
NULLIF
● NULLIF(a, b) returns NULL if the two arguments are equal; otherwise it returns the first argument (useful for avoiding division-by-zero errors)
ROW_NUMBER
● Assigns a unique sequential number to each row within a result set, defined by
ORDER BY and PARTITION BY
● What if you want to assign a sequential number per user based on the
timestamp?
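That per-user sequence can be computed with the ROW_NUMBER window function; a sketch against the session table used elsewhere in this deck:

```sql
-- nth session per user, numbered in order of session start time
SELECT
    userid,
    ts,
    ROW_NUMBER() OVER (PARTITION BY userid ORDER BY ts) AS seq
FROM analytics.session_summary;
```

PARTITION BY restarts the numbering for each user; ORDER BY decides which row within the partition gets number 1.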
LAG & LEAD
● When all sessions are sorted in ascending order by time for each user
○ What is the channel of the next session? (LEAD)
○ What is the channel of the previous session? (LAG)
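Those two questions map to the LEAD and LAG window functions; a sketch:

```sql
-- For each session, look one session ahead (LEAD) and one behind (LAG)
-- within the same user's timeline
SELECT
    userid,
    ts,
    channel,
    LEAD(channel) OVER (PARTITION BY userid ORDER BY ts) AS next_channel,
    LAG(channel)  OVER (PARTITION BY userid ORDER BY ts) AS prev_channel
FROM analytics.session_summary;
```

next_channel is NULL for each user's last session, and prev_channel is NULL for the first.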
Checking for Duplicate Records
● Compare the total row count with the distinct row count

SELECT COUNT(1)
FROM analytics.session_summary;

SELECT COUNT(1)
FROM (
SELECT DISTINCT *
FROM analytics.session_summary
);
Checking for the Presence of Recent Data (Freshness)
● Find timestamp fields, check the range (min & max) and see if it is
within your expectation
Checking Primary Key Uniqueness
● Group by the primary key and count. See if any count is bigger than 1
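These two checks can be sketched against session_summary, treating sessionId as the primary key:

```sql
-- Freshness: is the timestamp range what you expect?
SELECT MIN(ts), MAX(ts)
FROM analytics.session_summary;

-- Primary key uniqueness: does any sessionId appear more than once?
SELECT sessionId, COUNT(1) AS cnt
FROM analytics.session_summary
GROUP BY 1
ORDER BY 2 DESC
LIMIT 1;   -- the top count should be 1
```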
SELECT
COUNT(CASE WHEN sessionId IS NULL THEN 1 END) sessionid_null_count,
COUNT(CASE WHEN userId IS NULL THEN 1 END) userid_null_count,
COUNT(CASE WHEN ts IS NULL THEN 1 END) ts_null_count,
COUNT(CASE WHEN channel IS NULL THEN 1 END) channel_null_count
FROM analytics.session_summary;
dbt to the Rescue