
SQL For Data Analysis

Uploaded by

Saulo Antônio
Copyright
© All Rights Reserved
SQL for Data Analysis
Advanced Techniques for Transforming Data into Insights

Cathy Tanimura

Beijing • Boston • Farnham • Sebastopol • Tokyo

Table of Contents

Preface

1. Analysis with SQL
    What Is Data Analysis?
    Why SQL?
    What Is SQL?
    Benefits of SQL
    SQL Versus R or Python
    SQL as Part of the Data Analysis Workflow
    Database Types and How to Work with Them
    Row-Store Databases
    Column-Store Databases
    Other Types of Data Infrastructure
    Conclusion

2. Preparing Data for Analysis
    Types of Data
    Database Data Types
    Structured Versus Unstructured
    Quantitative Versus Qualitative Data
    First-, Second-, and Third-Party Data
    Sparse Data
    SQL Query Structure
    Profiling: Distributions
    Histograms and Frequencies
    Binning
    n-Tiles
    Profiling: Data Quality
    Detecting Duplicates
    Deduplication with GROUP BY and DISTINCT
    Preparing: Data Cleaning
    Cleaning Data with CASE Transformations
    Type Conversions and Casting
    Dealing with Nulls: coalesce, nullif, nvl Functions
    Missing Data
    Preparing: Shaping Data
    For Which Output: BI, Visualization, Statistics, ML
    Pivoting with CASE Statements
    Unpivoting with UNION Statements
    pivot and unpivot Functions
    Conclusion

3. Time Series Analysis
    Date, Datetime, and Time Manipulations
    Time Zone Conversions
    Date and Timestamp Format Conversions
    Date Math
    Time Math
    Joining Data from Different Sources
    The Retail Sales Data Set
    Trending the Data
    Simple Trends
    Comparing Components
    Percent of Total Calculations
    Indexing to See Percent Change over Time
    Rolling Time Windows
    Calculating Rolling Time Windows
    Rolling Time Windows with Sparse Data
    Calculating Cumulative Values
    Analyzing with Seasonality
    Period-over-Period Comparisons: YoY and MoM
    Period-over-Period Comparisons: Same Month Versus Last Year
    Comparing to Multiple Prior Periods
    Conclusion

4. Cohort Analysis
    Cohorts: A Useful Analysis Framework
    The Legislators Data Set
    Retention
    SQL for a Basic Retention Curve
    Adjusting Time Series to Increase Retention Accuracy
    Cohorts Derived from the Time Series Itself
    Defining the Cohort from a Separate Table
    Dealing with Sparse Cohorts
    Defining Cohorts from Dates Other Than the First Date
    Related Cohort Analyses
    Survivorship
    Returnship, or Repeat Purchase Behavior
    Cumulative Calculations
    Cross-Section Analysis, Through a Cohort Lens
    Conclusion

5. Text Analysis
    Why Text Analysis with SQL?
    What Is Text Analysis?
    Why SQL Is a Good Choice for Text Analysis
    When SQL Is Not a Good Choice
    The UFO Sightings Data Set
    Text Characteristics
    Text Parsing
    Text Transformations
    Finding Elements Within Larger Blocks of Text
    Wildcard Matches: LIKE, ILIKE
    Exact Matches: IN, NOT IN
    Regular Expressions
    Constructing and Reshaping Text
    Concatenation
    Reshaping Text
    Conclusion

6. Anomaly Detection
    Capabilities and Limits of SQL for Anomaly Detection
    The Data Set
    Detecting Outliers
    Sorting to Find Anomalies
    Calculating Percentiles and Standard Deviations to Find Anomalies
    Graphing to Find Anomalies Visually
    Forms of Anomalies
    Anomalous Values
    Anomalous Counts or Frequencies
    Anomalies from the Absence of Data
    Handling Anomalies
    Investigation
    Removal
    Replacement with Alternate Values
    Rescaling
    Conclusion

7. Experiment Analysis
    Strengths and Limits of Experiment Analysis with SQL
    The Data Set
    Types of Experiments
    Experiments with Binary Outcomes: The Chi-Squared Test
    Experiments with Continuous Outcomes: The t-Test
    Challenges with Experiments and Options for Rescuing Flawed Experiments
    Variant Assignment
    Outliers
    Time Boxing
    Repeated Exposure Experiments
    When Controlled Experiments Aren't Possible: Alternative Analyses
    Pre-/Post-Analysis
    Natural Experiment Analysis
    Analysis of Populations Around a Threshold
    Conclusion

8. Creating Complex Data Sets for Analysis
    When to Use SQL for Complex Data Sets
    Advantages of Using SQL
    When to Build into ETL Instead
    When to Put Logic in Other Tools
    Code Organization
    Commenting
    Capitalization, Indentation, Parentheses, and Other Formatting Tricks
    Storing Code
    Organizing Computations
    Understanding Order of SQL Clause Evaluation
    Subqueries
    Temporary Tables
    Common Table Expressions
    grouping sets
    Managing Data Set Size and Privacy Concerns
    Sampling with %, mod
    Reducing Dimensionality
    PII and Data Privacy
    Conclusion

9. Conclusion
    Funnel Analysis
    Churn, Lapse, and Other Definitions of Departure
    Basket Analysis
    Resources
    Books and Blogs
    Data Sets
    Final Thoughts
