Week 4. Advanced SQL
Keeyong Han
Table Of Contents
1. Recap of the 3rd Week
2. Overview of SQL
3. Basic SQL
4. Advanced SQL
5. Break
6. Methods for Checking Data Quality
7. More on Lab #1
8. Demo & Homework #3
9. Quiz #1
Recap of the 3rd Week
Key Concepts
● Columnar Storage vs. Row Storage
● Primary Key Uniqueness isn’t guaranteed
● Separation of Compute and Storage
● Evolution of Data Infra
● Bulk-update is preferred in Data Warehouse or Data Lake
● Database & schema: a way to organize your tables, acting as hierarchical
containers
Snowflake Characteristics (1)
● Storage layer and Compute layer are separated
● Compute layer is called “Virtual Warehouse”
○ Two types of Virtual Warehouse exist: Regular & Snowpark
○ A Virtual Warehouse has a size: from X-Small to 6X-Large
● Storage layer supports “Time Travel” and “Zero Copy Cloning”
● Stream processing is supported
● Runs on top of AWS, GCP and Azure
● For bulk-update, stage is used as a middle ground: internal vs. external
○ COPY INTO
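As a sketch of the stage-based bulk load above (the stage name, file name, and file format here are illustrative, not from the course material):

```sql
-- Create an internal stage as the middle ground for bulk loading
CREATE STAGE my_stage;

-- Upload a local file to the stage (PUT runs from SnowSQL, not a worksheet):
-- PUT file://sessions.csv @my_stage;

-- Bulk-load the staged file into a table with COPY INTO
COPY INTO raw_data.user_session_channel
FROM @my_stage/sessions.csv
FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);
```

An external stage works the same way, except it points at cloud storage (e.g. an S3 bucket) instead of Snowflake-managed storage.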
Snowflake Characteristics (2)
● Account Structure: Organization -> 1+ Accounts -> 1+ Databases
● Data Marketplace & Data Sharing (“Share, Don’t Move”)
● Snowflake offers 4 editions
○ Standard, Enterprise, Business Critical and Virtual Private Snowflake
● Pricing itself has 3 components
○ Compute Costs: Determined by credits. A credit costs
■ Standard: $2, Enterprise: $3, Business-critical: $4
■ Virtual Warehouse size will determine credit consumption
● X-Small: 1 credit / hour, …, 6X-Large: 512 credits / hour
○ Storage Costs: Calculated per terabyte (TB)
○ Network Costs: Calculated per TB for data transfers
Overview of SQL
History of SQL
{
"isbn": "123-456-222",
"author":
{
"lastname": "Doe",
"firstname": "Jane"
},
"title": "The Ultimate Database Study Guide",
"category": ["Non-Fiction", "Technology"]
}
An Example of Converting Semi-Structured Data to Structured Data
REGEX
Fields parsed out of a web server access log line: remote_ip, remote_user, time, method, uri, protocol, status_code, …
Basic SQL
DDL, DML, WHERE, GROUP BY, COUNT, DISTINCT
DDL
● CREATE TABLE
○ CREATE TABLE … AS SELECT: CTAS
● DROP TABLE
● ALTER TABLE
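A sketch of the three DDL statements above (the table and column names here are made up for illustration):

```sql
-- CREATE TABLE: define a table explicitly
CREATE TABLE adhoc.temperature (
    ts    TIMESTAMP,
    value INT
);

-- CTAS: create a table from the result of a SELECT
CREATE TABLE adhoc.positive_temperature AS
SELECT * FROM adhoc.temperature WHERE value > 0;

-- ALTER TABLE: change an existing table's schema
ALTER TABLE adhoc.temperature ADD COLUMN comment VARCHAR(256);

-- DROP TABLE: remove the tables
DROP TABLE IF EXISTS adhoc.positive_temperature;
DROP TABLE IF EXISTS adhoc.temperature;
```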
NULL
Table: adhoc.count_test (7 rows; value = NULL, 1, 1, 0, 0, 4, 3)

SELECT COUNT(1) FROM adhoc.count_test;              -- 7
SELECT COUNT(0) FROM adhoc.count_test;              -- 7
SELECT COUNT(NULL) FROM adhoc.count_test;           -- 0
SELECT COUNT(value) FROM adhoc.count_test;          -- 6
SELECT COUNT(DISTINCT value) FROM adhoc.count_test; -- 4
CASE WHEN
SELECT
value,
CASE
WHEN value > 0 THEN 'positive'
WHEN value = 0 THEN 'zero'
WHEN value < 0 THEN 'negative'
ELSE 'null'
END sign
FROM dev.adhoc.count_test;

Result (value, sign): (NULL, null), (1, positive), (1, positive), (0, zero), (0, zero), (4, positive), (3, positive)

Table: adhoc.count_test
What is NULL?
[Figure: GROUP BY color with COUNT, producing one count per color group]
Web Service: User ID & Session ID
[Diagram: a Production DB and a Log File feed a Data Pipeline (ETL); User ID and Session ID are connected via JOIN]
GROUP BY & Aggregate Examples
● Group the records of a table and calculate various information for each group.
● This process consists of two steps:
○ First, decide the field(s) to group by (can be one or more fields).
■ Specify these fields using GROUP BY (using field names or field ordinal numbers).
○ Next, decide what to calculate for each group.
■ Here, aggregate functions are used, such as COUNT, SUM, AVG, MIN, MAX, etc.
■ It is common to specify field names (using aliases).
● For example: COUNT(1) AS cnt
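The two steps above can be sketched against the session table that appears later in this deck (assuming analytics.session_summary carries channel and userid fields):

```sql
-- Step 1: group by channel; Step 2: aggregate per group
SELECT
    channel,                            -- grouping field
    COUNT(1) AS cnt,                    -- rows per group
    COUNT(DISTINCT userid) AS user_cnt  -- unique users per group
FROM analytics.session_summary
GROUP BY 1        -- 1 = the first field in the SELECT list (channel)
ORDER BY 2 DESC;  -- 2 = the second field (cnt)
```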
SQL Practice #1
● Course GitHub Repo: https://github.com/keeyong/sjsu-data226-SP25/
● DDL
● DML: INSERT, UPDATE, DELETE
● SELECT
○ CASE WHEN
○ COUNT
○ NULL
○ GROUP BY
What is JOIN?
● JOIN merges two or more tables using a common field ("join key")
○ Used to integrate information that was dispersed across multiple tables
● JOIN creates a new table containing fields from both sides
● Depending on the join method, two things differ:
○ Which records are selected?
○ How are the fields populated?
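For example, using the two raw_data tables that appear later in this deck, joining on the join key sessionid might look like this:

```sql
-- INNER JOIN: keep only sessions present in BOTH tables;
-- the result carries fields from both sides
SELECT A.ts, B.userid, B.channel
FROM raw_data.session_timestamp A
JOIN raw_data.user_session_channel B ON A.sessionid = B.sessionid;

-- LEFT JOIN: keep every row of A;
-- B's fields are populated with NULL when there is no match
SELECT A.ts, B.userid
FROM raw_data.session_timestamp A
LEFT JOIN raw_data.user_session_channel B ON A.sessionid = B.sessionid;
```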
[Diagram: a LEFT table and a RIGHT table combined by JOIN into a new table]
Types of JOIN
Source: https://theartofpostgresql.com/blog/2019-09-sql-joins/
JOIN Syntax
● FULL OUTER JOIN: returns all records from both the left table and the right table; fields from both tables are populated only when there's a match
● CROSS JOIN: returns all combinations of records from the left table and the right table
SELECT *
FROM raw_data.Vital v
CROSS JOIN raw_data.Alert a;
v.UserID v.VitalID v.Date v.Weight a.AlertID a.VitalID a.AlertType a.Date a.UserID
100 1 2020-01-01 75 1 4 WeightIncrease 2020-01-01 101
100 3 2020-01-02 78 1 4 WeightIncrease 2020-01-01 101
101 2 2020-01-01 90 1 4 WeightIncrease 2020-01-01 101
101 4 2020-01-02 95 1 4 WeightIncrease 2020-01-01 101
100 1 2020-01-01 75 2 NULL MissingVital 2020-01-04 100
100 3 2020-01-02 78 2 NULL MissingVital 2020-01-04 100
101 2 2020-01-01 90 2 NULL MissingVital 2020-01-04 100
101 4 2020-01-02 95 2 NULL MissingVital 2020-01-04 100
100 1 2020-01-01 75 3 NULL MissingVital 2020-01-04 101
100 3 2020-01-02 78 3 NULL MissingVital 2020-01-04 101
101 2 2020-01-01 90 3 NULL MissingVital 2020-01-04 101
101 4 2020-01-02 95 3 NULL MissingVital 2020-01-04 101
SELF JOIN
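A self join joins a table to itself under two different aliases. As an illustrative sketch using the Vital table from the CROSS JOIN example, pairing each of a user's vitals with that user's later vitals:

```sql
-- v1 and v2 are two aliases of the SAME table;
-- each vital is paired with the same user's later vitals
SELECT
    v1.UserID,
    v1.Date AS earlier_date,
    v2.Date AS later_date,
    v2.Weight - v1.Weight AS weight_change
FROM raw_data.Vital v1
JOIN raw_data.Vital v2
  ON v1.UserID = v2.UserID AND v1.Date < v2.Date;
```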
SELECT
LEFT(ts, 7) AS year_month,
COUNT(DISTINCT userid) AS mau
FROM analytics.session_summary
GROUP BY 1
ORDER BY 1 DESC;

MAU: Monthly Active User / DAU: Daily Active User / WAU: Weekly Active User
CTE (Common Table Expression)
WITH tmp AS (
SELECT B.*, A.ts
FROM raw_data.session_timestamp A
JOIN raw_data.user_session_channel B ON A.sessionid = B.sessionid
)
SELECT
LEFT(ts, 7) AS year_month,
COUNT(DISTINCT userid) AS mau
FROM tmp
GROUP BY 1
ORDER BY 1 DESC;
Subquery based MAU computation
SELECT
LEFT(ts, 7) AS year_month,
COUNT(DISTINCT userid) AS mau
FROM (
SELECT B.*, A.ts
FROM raw_data.session_timestamp A
JOIN raw_data.user_session_channel B ON A.sessionid = B.sessionid
)
GROUP BY 1
ORDER BY 1 DESC;
SQL Practice #3
● CTAS
● CTE & Subquery
Break
NULLIF
● NULLIF(a, b) returns NULL if the two arguments are equal; otherwise it returns the first argument (useful for avoiding division-by-zero errors)
ROW_NUMBER
● Assigns a unique sequential number to each row within a result set, defined by
ORDER BY and PARTITION BY
● What if you want to assign a sequential number per user based on the
timestamp?
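That per-user sequence can be computed with the ROW_NUMBER window function; a sketch against the session table used elsewhere in this deck:

```sql
-- nth session per user, numbered in order of session start time
SELECT
    userid,
    ts,
    ROW_NUMBER() OVER (PARTITION BY userid ORDER BY ts) AS seq
FROM analytics.session_summary;
```

PARTITION BY restarts the numbering for each user; ORDER BY decides which row within the partition gets number 1.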
LAG & LEAD
● When all sessions are sorted in ascending order by time for each user
○ What is the channel of the next session? (LEAD)
○ What is the channel of the previous session? (LAG)
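Those two questions map to the LEAD and LAG window functions; a sketch:

```sql
-- For each session, look one session ahead (LEAD) and one behind (LAG)
-- within the same user's timeline
SELECT
    userid,
    ts,
    channel,
    LEAD(channel) OVER (PARTITION BY userid ORDER BY ts) AS next_channel,
    LAG(channel)  OVER (PARTITION BY userid ORDER BY ts) AS prev_channel
FROM analytics.session_summary;
```

next_channel is NULL for each user's last session, and prev_channel is NULL for the first.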
Checking for Duplicate Records
● Compare the total row count with the distinct row count

SELECT COUNT(1)
FROM analytics.session_summary;

SELECT COUNT(1)
FROM (
SELECT DISTINCT *
FROM analytics.session_summary
);
Checking for the Presence of Recent Data (Freshness)
● Find timestamp fields, check the range (min & max) and see if it is
within your expectation
Checking Primary Key Uniqueness
● Group by the primary key and count. See if any count is bigger than 1
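These two checks can be sketched against session_summary, treating sessionId as the primary key:

```sql
-- Freshness: is the timestamp range what you expect?
SELECT MIN(ts), MAX(ts)
FROM analytics.session_summary;

-- Primary key uniqueness: does any sessionId appear more than once?
SELECT sessionId, COUNT(1) AS cnt
FROM analytics.session_summary
GROUP BY 1
ORDER BY 2 DESC
LIMIT 1;   -- the top count should be 1
```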
SELECT
COUNT(CASE WHEN sessionId IS NULL THEN 1 END) sessionid_null_count,
COUNT(CASE WHEN userId IS NULL THEN 1 END) userid_null_count,
COUNT(CASE WHEN ts IS NULL THEN 1 END) ts_null_count,
COUNT(CASE WHEN channel IS NULL THEN 1 END) channel_null_count
FROM analytics.session_summary;
dbt to the Rescue