Statistics for Data Scientists: An Introduction to Probability, Statistics, and Data Analysis

by Edwin van den Heuvel, Maurits Kaptein

Length: 345 pages
Edition: 1
Language: English
Publisher: Springer
Publication Date: 2020-02-11
ISBN-10: 303010530X
ISBN-13: 9783030105303

This book provides an undergraduate introduction to analysing data for data science, computer science, and quantitative social science students. It uniquely combines a hands-on approach to data analysis – supported by numerous real data examples and reusable R code – with a rigorous treatment of probability and statistical principles.

Contemporary undergraduate textbooks in probability theory or statistics often lack applications and an introductory treatment of modern methods (bootstrapping, Bayes, etc.), while applied data analysis books often lack a rigorous theoretical treatment. This book combines the two viewpoints in an accessible but thorough introduction to data analysis using statistical methods. It also covers methods for dealing with large datasets and streaming data, and hence provides a single-course introduction to statistical methods for data science.

Preface
	For Whom is This Book for?
	What Makes This Book Different?
	Structure of the Book and Its Chapters
	Teaching Statistics to Data Scientists Using This Book
	Datasets Used Throughout This Book
	Assignments
	Reference
Contents
Notation and Code Conventions
1 A First Look at Data
	1.1 Overview and Learning Goals
	1.2 Getting Started with R
		1.2.1 Opening a Dataset: face-data.csv
		1.2.2 Some Useful Commands for Exploring a Dataset
		1.2.3 Scalars, Vectors, Matrices, Data.frames, Objects
	1.3 Measurement Levels
		1.3.1 Outliers and Unrealistic Values
	1.4 Describing Data
		1.4.1 Frequency
		1.4.2 Central Tendency
		1.4.3 Dispersion, Skewness, and Kurtosis
		1.4.4 A Note on Aggregated Data
	1.5 Visualizing Data
		1.5.1 Describing Nominal/ordinal Variables
		1.5.2 Describing Interval/ratio Variables
		1.5.3 Relations Between Variables
		1.5.4 Multi-panel Plots
		1.5.5 Plotting Mathematical Functions
		1.5.6 Frequently Used Arguments
	1.6 Other R Plotting Systems (And Installing Packages)
		1.6.1 Lattice
		1.6.2 GGplot2
	References
2 Sampling Plans and Estimates
	2.1 Introduction
	2.2 Definitions and Standard Terminology
	2.3 Non-representative Sampling
		2.3.1 Convenience Sampling
		2.3.2 Haphazard Sampling
		2.3.3 Purposive Sampling
	2.4 Representative Sampling
		2.4.1 Simple Random Sampling
		2.4.2 Systematic Sampling
		2.4.3 Stratified Sampling
		2.4.4 Cluster Sampling
	2.5 Evaluating Estimators Given Different Sampling Plans
		2.5.1 Generic Formulation of Sampling Plans
		2.5.2 Bias, Standard Error, and Mean Squared Error
		2.5.3 Illustration of a Comparison of Sampling Plans
		2.5.4 Comparing Sampling Plans Using R
	2.6 Estimation of the Population Mean
		2.6.1 Simple Random Sampling
		2.6.2 Systematic Sampling
		2.6.3 Stratified Sampling
		2.6.4 Cluster Sampling
	2.7 Estimation of the Population Proportion
	2.8 Estimation of the Population Variance
		2.8.1 Estimation of the MSE
	2.9 Conclusions
	References
3 Probability Theory
	3.1 Introduction
	3.2 Definitions of Probability
	3.3 Probability Axioms
		3.3.1 Example: Using the Probability Axioms
	3.4 Conditional Probability
		3.4.1 Example: Using Conditional Probabilities
		3.4.2 Computing Probabilities Using R
	3.5 Measures of Risk
		3.5.1 Risk Difference
		3.5.2 Relative Risk
		3.5.3 Odds Ratio
		3.5.4 Example: Using Risk Measures
	3.6 Sampling from Populations: Different Study Designs
		3.6.1 Cross-Sectional Study
		3.6.2 Cohort Study
		3.6.3 Case-Control Study
	3.7 Simpson's Paradox
	3.8 Conclusion
	References
4 Random Variables and Distributions
	4.1 Introduction
	4.2 Probability Density Functions
		4.2.1 Normal Density Function
		4.2.2 Lognormal Density Function
		4.2.3 Uniform Density Function
		4.2.4 Exponential Density Function
	4.3 Distribution Functions and Continuous Random Variables
	4.4 Expected Values of Continuous Random Variables
	4.5 Distributions of Discrete Random Variables
	4.6 Expected Values of Discrete Random Variables
	4.7 Well-Known Discrete Distributions
		4.7.1 Bernoulli Probability Mass Function
		4.7.2 Binomial Probability Mass Function
		4.7.3 Poisson Probability Mass Function
		4.7.4 Negative Binomial Probability Mass Function
		4.7.5 Overview of Moments for Well-Known Discrete Distributions
	4.8 Working with Distributions in R
		4.8.1 R Built-In Functions
		4.8.2 Using Monte-Carlo Methods
		4.8.3 Obtaining Draws from Distributions: Inverse Transform Sampling
	4.9 Relationships Between Distributions
		4.9.1 Binomial—Poisson
		4.9.2 Binomial—Normal
	4.10 Calculation Rules for Random Variables
		4.10.1 Rules for Single Random Variables
		4.10.2 Rules for Two Random Variables
	4.11 Conclusion
	References
5 Estimation
	5.1 Introduction
	5.2 From Population Characteristics to Sample Statistics
		5.2.1 Population Characteristics
		5.2.2 Sample Statistics Under Simple Random Sampling
	5.3 Distributions of Sample Statistic $T_n$
		5.3.1 Distribution of the Sample Maximum or Minimum
		5.3.2 Distribution of the Sample Average $\bar{X}$
		5.3.3 Distribution of the Sample Variance $S^2$
		5.3.4 The Central Limit Theorem
		5.3.5 Asymptotic Confidence Intervals
	5.4 Normally Distributed Populations
		5.4.1 Confidence Intervals for Normal Populations
		5.4.2 Lognormally Distributed Populations
	5.5 Methods of Estimation
		5.5.1 Method of Moments
		5.5.2 Maximum Likelihood Estimation
	Reference
6 Multiple Random Variables
	6.1 Introduction
	6.2 Multivariate Distributions
		6.2.1 Definition of Independence
		6.2.2 Discrete Random Variables
		6.2.3 Continuous Random Variables
	6.3 Constructing Bivariate Probability Distributions
		6.3.1 Using Sums of Random Variables
		6.3.2 Using the Farlie–Gumbel–Morgenstern Family of Distributions
		6.3.3 Using Mixtures of Probability Distributions
		6.3.4 Using the Fréchet Family of Distributions
	6.4 Properties of Multivariate Distributions
		6.4.1 Expectations
		6.4.2 Covariances
	6.5 Measures of Association
		6.5.1 Pearson's Correlation Coefficient
		6.5.2 Kendall's Tau Correlation
		6.5.3 Spearman's Rho Correlation
		6.5.4 Cohen's Kappa Statistic
	6.6 Estimators of Measures of Association
		6.6.1 Pearson's Correlation Coefficient
		6.6.2 Kendall's Tau Correlation Coefficient
		6.6.3 Spearman's Rho Correlation Coefficient
		6.6.4 Should We Use Pearson's Rho, Spearman's Rho or Kendall's Tau Correlation?
		6.6.5 Cohen's Kappa Statistic
		6.6.6 Risk Difference, Relative Risk, and Odds Ratio
	6.7 Other Sample Statistics for Association
		6.7.1 Nominal Association Statistics
		6.7.2 Ordinal Association Statistics
		6.7.3 Binary Association Statistics
	6.8 Exploring Multiple Variables Using R
		6.8.1 Associations Between Continuous Variables
		6.8.2 Association Between Binary Variables
		6.8.3 Association Between Categorical Variables
	6.9 Conclusions
	References
7 Making Decisions in Uncertainty
	7.1 Introduction
	7.2 Bootstrapping
		7.2.1 The Basic Idea Behind the Bootstrap
		7.2.2 Applying the Bootstrap: The Non-parametric Bootstrap
		7.2.3 Applying the Bootstrap: The Parametric Bootstrap
		7.2.4 Applying the Bootstrap: Bootstrapping Massive Datasets
		7.2.5 A Critical Discussion of the Bootstrap
	7.3 Hypothesis Testing
		7.3.1 The One-Sided z-Test for a Single Mean
		7.3.2 The Two-Sided z-Test for a Single Mean
		7.3.3 Confidence Intervals and Hypothesis Testing
		7.3.4 The t-Tests for Means
		7.3.5 Non-parametric Tests for Medians
		7.3.6 Tests for Equality of Variation from Two Independent Samples
		7.3.7 Tests for Independence Between Two Variables
		7.3.8 Tests for Normality
		7.3.9 Tests for Outliers
		7.3.10 Equivalence Testing
	7.4 Conclusions
	References
8 Bayesian Statistics
	8.1 Introduction
	8.2 Bayes' Theorem for Population Parameters
		8.2.1 Bayes' Law for Multiple Events
		8.2.2 Bayes' Law for Competing Hypotheses
		8.2.3 Bayes' Law for Statistical Models
		8.2.4 The Fundamentals of Bayesian Data Analysis
	8.3 Bayesian Data Analysis by Example
		8.3.1 Estimating the Parameter of a Bernoulli Population
		8.3.2 Estimating the Parameters of a Normal Population
		8.3.3 Bayesian Analysis for Normal Populations Based on Single Observation
		8.3.4 Bayesian Analysis for Normal Populations Based on Multiple Observations
		8.3.5 Bayesian Analysis for Normal Populations with Unknown Mean and Variance
	8.4 Bayesian Decision-Making in Uncertainty
		8.4.1 Providing Point Estimates of Parameters
		8.4.2 Providing Interval Estimates of the Parameters
		8.4.3 Testing Hypotheses
	8.5 Challenges Involved in the Bayesian Approach
		8.5.1 Choosing a Prior
		8.5.2 Bayesian Computation
	8.6 Software for Bayesian Analysis
		8.6.1 A Simple Bernoulli Model Using Stan
	8.7 Bayesian and Frequentist Thinking Compared
	8.8 Conclusion
	References