SQL for Data Analysis
Advanced Techniques for Transforming
Data into Insights
Cathy Tanimura
Beijing • Boston • Farnham • Sebastopol • Tokyo
Table of Contents
Preface
1. Analysis with SQL 1
What Is Data Analysis? 1
Why SQL? 4
What Is SQL? 4
Benefits of SQL 7
SQL Versus R or Python 8
SQL as Part of the Data Analysis Workflow 9
Database Types and How to Work with Them 12
Row-Store Databases 13
Column-Store Databases 15
Other Types of Data Infrastructure 16
Conclusion 17
2. Preparing Data for Analysis 19
Types of Data 20
Database Data Types 20
Structured Versus Unstructured 22
Quantitative Versus Qualitative Data 22
First-, Second-, and Third-Party Data 23
Sparse Data 24
SQL Query Structure 25
Profiling: Distributions 27
Histograms and Frequencies 28
Binning 31
n-Tiles 33
Profiling: Data Quality
Detecting Duplicates
Deduplication with GROUP BY and DISTINCT
Preparing: Data Cleaning
Cleaning Data with CASE Transformations
Type Conversions and Casting
Dealing with Nulls: coalesce, nullif, nvl Functions
Missing Data
Preparing: Shaping Data
For Which Output: BI, Visualization, Statistics, ML
Pivoting with CASE Statements
Unpivoting with UNION Statements
pivot and unpivot Functions
Conclusion
3. Time Series Analysis
Date, Datetime, and Time Manipulations
Time Zone Conversions
Date and Timestamp Format Conversions
Date Math
Time Math
Joining Data from Different Sources
The Retail Sales Data Set
Trending the Data
Simple Trends
Comparing Components
Percent of Total Calculations
Indexing to See Percent Change over Time
Rolling Time Windows
Calculating Rolling Time Windows
Rolling Time Windows with Sparse Data
Calculating Cumulative Values
Analyzing with Seasonality
Period-over-Period Comparisons: YoY and MoM
Period-over-Period Comparisons: Same Month Versus Last Year
Comparing to Multiple Prior Periods
Conclusion
4. Cohort Analysis
Cohorts: A Useful Analysis Framework
The Legislators Data Set
Retention 127
SQL for a Basic Retention Curve 128
Adjusting Time Series to Increase Retention Accuracy 131
Cohorts Derived from the Time Series Itself 137
Defining the Cohort from a Separate Table 142
Dealing with Sparse Cohorts 146
Defining Cohorts from Dates Other Than the First Date 151
Related Cohort Analyses 153
Survivorship 154
Returnship, or Repeat Purchase Behavior 158
Cumulative Calculations 163
Cross-Section Analysis, Through a Cohort Lens 166
Conclusion 174
5. Text Analysis 175
Why Text Analysis with SQL? 175
What Is Text Analysis? 176
Why SQL Is a Good Choice for Text Analysis 176
When SQL Is Not a Good Choice 177
The UFO Sightings Data Set 178
Text Characteristics 179
Text Parsing 182
Text Transformations 187
Finding Elements Within Larger Blocks of Text 195
Wildcard Matches: LIKE, ILIKE 195
Exact Matches: IN, NOT IN 200
Regular Expressions 203
Constructing and Reshaping Text 218
Concatenation 218
Reshaping Text 222
Conclusion 226
6. Anomaly Detection
Capabilities and Limits of SQL for Anomaly Detection 228
The Data Set 229
Detecting Outliers 230
Sorting to Find Anomalies 231
Calculating Percentiles and Standard Deviations to Find Anomalies 234
Graphing to Find Anomalies Visually 241
Forms of Anomalies 250
Anomalous Values 250
Anomalous Counts or Frequencies
Anomalies from the Absence of Data
Handling Anomalies
Investigation
Removal
Replacement with Alternate Values
Rescaling
Conclusion
7. Experiment Analysis
Strengths and Limits of Experiment Analysis with SQL
The Data Set
Types of Experiments
Experiments with Binary Outcomes: The Chi-Squared Test
Experiments with Continuous Outcomes: The t-Test
Challenges with Experiments and Options for Rescuing Flawed Experiments
Variant Assignment
Outliers
Time Boxing
Repeated Exposure Experiments
When Controlled Experiments Aren't Possible: Alternative Analyses
Pre-/Post-Analysis
Natural Experiment Analysis
Analysis of Populations Around a Threshold
Conclusion
8. Creating Complex Data Sets for Analysis
When to Use SQL for Complex Data Sets
Advantages of Using SQL
When to Build into ETL Instead
When to Put Logic in Other Tools
Code Organization
Commenting
Capitalization, Indentation, Parentheses, and Other Formatting Tricks
Storing Code
Organizing Computations
Understanding Order of SQL Clause Evaluation
Subqueries
Temporary Tables
Common Table Expressions
grouping sets
Managing Data Set Size and Privacy Concerns
Sampling with %, mod
Reducing Dimensionality
PII and Data Privacy
Conclusion
9. Conclusion
Funnel Analysis
Churn, Lapse, and Other Definitions of Departure
Basket Analysis
Resources
Books and Blogs
Data Sets
Final Thoughts