SQL for Data Analysis: Advanced Techniques for Transforming Data into Insights
With the explosion of data, computing power, and cloud data warehouses, SQL has become an even more indispensable tool for the savvy analyst or data scientist. This practical book reveals new and hidden ways to improve your SQL skills, solve problems, and make the most of SQL as part of your workflow.
You’ll learn how to use both common and exotic SQL functions such as joins, window functions, subqueries, and regular expressions in new, innovative ways, as well as how to combine SQL techniques to accomplish your goals faster, with understandable code. If you work with SQL databases, this is a must-have reference.
- Learn the key steps for preparing your data for analysis
- Perform time series analysis using SQL’s date and time manipulations
- Use cohort analysis to investigate how groups change over time
- Use SQL’s powerful functions and operators for text analysis
- Detect outliers in your data and replace them with alternate values
- Establish causality using experiment analysis, also known as A/B testing
Preface: Conventions Used in This Book; Using Code Examples; O’Reilly Online Learning; How to Contact Us; Acknowledgments
1. Analysis with SQL: What Is Data Analysis?; Why SQL?; What Is SQL?; Benefits of SQL; SQL Versus R or Python; SQL as Part of the Data Analysis Workflow; Database Types and How to Work with Them; Row-Store Databases; Column-Store Databases; Other Types of Data Infrastructure; Conclusion
2. Preparing Data for Analysis: Types of Data; Database Data Types; Structured Versus Unstructured; Quantitative Versus Qualitative Data; First-, Second-, and Third-Party Data; Sparse Data; SQL Query Structure; Profiling: Distributions; Histograms and Frequencies; Binning; n-Tiles; Profiling: Data Quality; Detecting Duplicates; Deduplication with GROUP BY and DISTINCT; Preparing: Data Cleaning; Cleaning Data with CASE Transformations; Type Conversions and Casting; Dealing with Nulls: coalesce, nullif, nvl Functions; Missing Data; Preparing: Shaping Data; For Which Output: BI, Visualization, Statistics, ML; Pivoting with CASE Statements; Unpivoting with UNION Statements; pivot and unpivot Functions; Conclusion
3. Time Series Analysis: Date, Datetime, and Time Manipulations; Time Zone Conversions; Date and Timestamp Format Conversions; Date Math; Time Math; Joining Data from Different Sources; The Retail Sales Data Set; Trending the Data; Simple Trends; Comparing Components; Percent of Total Calculations; Indexing to See Percent Change over Time; Rolling Time Windows; Calculating Rolling Time Windows; Rolling Time Windows with Sparse Data; Calculating Cumulative Values; Analyzing with Seasonality; Period-over-Period Comparisons: YoY and MoM; Period-over-Period Comparisons: Same Month Versus Last Year; Comparing to Multiple Prior Periods; Conclusion
4. Cohort Analysis: Cohorts: A Useful Analysis Framework; The Legislators Data Set; Retention; SQL for a Basic Retention Curve; Adjusting Time Series to Increase Retention Accuracy; Cohorts Derived from the Time Series Itself; Defining the Cohort from a Separate Table; Dealing with Sparse Cohorts; Defining Cohorts from Dates Other Than the First Date; Related Cohort Analyses; Survivorship; Returnship, or Repeat Purchase Behavior; Cumulative Calculations; Cross-Section Analysis, Through a Cohort Lens; Conclusion
5. Text Analysis: Why Text Analysis with SQL?; What Is Text Analysis?; Why SQL Is a Good Choice for Text Analysis; When SQL Is Not a Good Choice; The UFO Sightings Data Set; Text Characteristics; Text Parsing; Text Transformations; Finding Elements Within Larger Blocks of Text; Wildcard Matches: LIKE, ILIKE; Exact Matches: IN, NOT IN; Regular Expressions; Constructing and Reshaping Text; Concatenation; Reshaping Text; Conclusion
6. Anomaly Detection: Capabilities and Limits of SQL for Anomaly Detection; The Data Set; Detecting Outliers; Sorting to Find Anomalies; Calculating Percentiles and Standard Deviations to Find Anomalies; Graphing to Find Anomalies Visually; Forms of Anomalies; Anomalous Values; Anomalous Counts or Frequencies; Anomalies from the Absence of Data; Handling Anomalies; Investigation; Removal; Replacement with Alternate Values; Rescaling; Conclusion
7. Experiment Analysis: Strengths and Limits of Experiment Analysis with SQL; The Data Set; Types of Experiments; Experiments with Binary Outcomes: The Chi-Squared Test; Experiments with Continuous Outcomes: The t-Test; Challenges with Experiments and Options for Rescuing Flawed Experiments; Variant Assignment; Outliers; Time Boxing; Repeated Exposure Experiments; When Controlled Experiments Aren’t Possible: Alternative Analyses; Pre-/Post-Analysis; Natural Experiment Analysis; Analysis of Populations Around a Threshold; Conclusion
8. Creating Complex Data Sets for Analysis: When to Use SQL for Complex Data Sets; Advantages of Using SQL; When to Build into ETL Instead; When to Put Logic in Other Tools; Code Organization; Commenting; Capitalization, Indentation, Parentheses, and Other Formatting Tricks; Storing Code; Organizing Computations; Understanding Order of SQL Clause Evaluation; Subqueries; Temporary Tables; Common Table Expressions; grouping sets; Managing Data Set Size and Privacy Concerns; Sampling with %, mod; Reducing Dimensionality; PII and Data Privacy; Conclusion
9. Conclusion: Funnel Analysis; Churn, Lapse, and Other Definitions of Departure; Basket Analysis; Resources; Books and Blogs; Data Sets; Final Thoughts
Index
How to download the source code
1. Go to:
2. Search for the book title:
SQL for Data Analysis: Advanced Techniques for Transforming Data into Insights. If the full title returns no results, search for the main title only.
3. Click the book title in the search results
4. In the Publisher resources section, click Download Example Code.
1. Disable the AdBlock plugin. Otherwise, you may not get any links.
2. Solve the CAPTCHA.
3. Click the download link.
4. You will be taken to the download server, where the download starts.