R in Action, Third Edition: Data analysis and graphics with R and Tidyverse

Length: 656 pages
Edition: 3
Language: English
Publisher: Manning
Publication Date: 2022-05-03
ISBN-10: 1617296058
ISBN-13: 9781617296055
Sales Rank: #436442

R is the most powerful tool you can use for statistical analysis. This definitive guide smooths R’s steep learning curve with practical solutions and real-world applications for commercial environments.

In R in Action, Third Edition you will learn how to:

Set up and install R and RStudio
Clean, manage, and analyze data with R
Use the ggplot2 package for graphs and visualizations
Solve data management problems using R functions
Fit and interpret regression models
Test hypotheses and estimate confidence
Simplify complex multivariate data with principal components and exploratory factor analysis
Make predictions using time series forecasting
Create dynamic reports and stunning visualizations
Debug programs and create packages

R in Action, Third Edition makes learning R quick and easy. That’s why thousands of data scientists have chosen this guide to help them master the powerful language. Far from being a dry academic tome, this book keeps every example relevant to scientific and business developers and helps you solve common data challenges. R expert Rob Kabacoff takes you on a crash course in statistics, from dealing with messy and incomplete data to creating stunning visualizations. This revised and expanded third edition contains fresh coverage of the new tidyverse approach to data analysis and R’s state-of-the-art graphing capabilities with the ggplot2 package.

Purchase of the print book includes a free eBook in PDF, Kindle, and ePub formats from Manning Publications.

About the technology

Used daily by data scientists, researchers, and quants of all types, R is the gold standard for statistical data analysis. This free and open source language includes packages for everything from advanced data visualization to deep learning. Instantly comfortable for mathematically minded users, R easily handles practical problems without forcing you to think like a software engineer.

About the book

R in Action, Third Edition teaches you how to do statistical analysis and data visualization using R and its popular tidyverse packages. In it, you’ll investigate real-world data challenges, including forecasting, data mining, and dynamic report writing. This revised third edition adds coverage of graphing with ggplot2, along with examples for machine learning topics like clustering, classification, and time series analysis.

What’s inside

Clean, manage, and analyze data
Use the ggplot2 package for graphs and visualizations
Techniques for debugging programs and creating packages
A complete learning resource for R and tidyverse

About the reader

Requires basic math and statistics. No prior experience with R needed.

About the author

Dr. Robert I. Kabacoff is a professor of quantitative analytics at Wesleyan University and a seasoned data scientist with more than 20 years of experience.

R in Action
Copyright
Praise for the previous edition of R in Action
brief contents
contents
Front matter
    preface
    acknowledgments
    about this book
        What's new in the third edition
        Who should read this book
        How this book is organized: A road map
        Advice for data miners
        About the code
        liveBook discussion forum
    about the author
    about the cover illustration
Part 1. Getting started
1 Introduction to R
    1.1 Why use R?
    1.2 Obtaining and installing R
    1.3 Working with R
        1.3.1 Getting started
        1.3.2 Using RStudio
        1.3.3 Getting help
        1.3.4 The workspace
        1.3.5 Projects
    1.4 Packages
        1.4.1 What are packages?
        1.4.2 Installing a package
        1.4.3 Loading a package
        1.4.4 Learning about a package
    1.5 Using output as input: Reusing results
    1.6 Working with large datasets
    1.7 Working through an example
    Summary
2 Creating a dataset
    2.1 Understanding datasets
    2.2 Data structures
        2.2.1 Vectors
        2.2.2 Matrices
        2.2.3 Arrays
        2.2.4 Data frames
        2.2.5 Factors
        2.2.6 Lists
        2.2.7 Tibbles
    2.3 Data input
        2.3.1 Entering data from the keyboard
        2.3.2 Importing data from a delimited text file
        2.3.3 Importing data from Excel
        2.3.4 Importing data from JSON
        2.3.5 Importing data from the web
        2.3.6 Importing data from SPSS
        2.3.7 Importing data from SAS
        2.3.8 Importing data from Stata
        2.3.9 Accessing database management systems
        2.3.10 Importing data via Stat/Transfer
    2.4 Annotating datasets
        2.4.1 Variable labels
        2.4.2 Value labels
    2.5 Useful functions for working with data objects
    Summary
3 Basic data management
    3.1 A working example
    3.2 Creating new variables
    3.3 Recoding variables
    3.4 Renaming variables
    3.5 Missing values
        3.5.1 Recoding values to missing
        3.5.2 Excluding missing values from analyses
    3.6 Date values
        3.6.1 Converting dates to character variables
        3.6.2 Going further
    3.7 Type conversions
    3.8 Sorting data
    3.9 Merging datasets
        3.9.1 Adding columns to a data frame
        3.9.2 Adding rows to a data frame
    3.10 Subsetting datasets
        3.10.1 Selecting variables
        3.10.2 Dropping variables
        3.10.3 Selecting observations
        3.10.4 The subset() function
        3.10.5 Random samples
    3.11 Using dplyr to manipulate data frames
        3.11.1 Basic dplyr functions
        3.11.2 Using pipe operators to chain statements
    3.12 Using SQL statements to manipulate data frames
    Summary
4 Getting started with graphs
    4.1 Creating a graph with ggplot2
        4.1.1 ggplot
        4.1.2 Geoms
        4.1.3 Grouping
        4.1.4 Scales
        4.1.5 Facets
        4.1.6 Labels
        4.1.7 Themes
    4.2 ggplot2 details
        4.2.1 Placing the data and mapping options
        4.2.2 Graphs as objects
        4.2.3 Saving graphs
        4.2.4 Common mistakes
    Summary
5 Advanced data management
    5.1 A data management challenge
    5.2 Numerical and character functions
        5.2.1 Mathematical functions
        5.2.2 Statistical functions
        5.2.3 Probability functions
        5.2.4 Character functions
        5.2.5 Other useful functions
        5.2.6 Applying functions to matrices and data frames
        5.2.7 A solution for the data management challenge
    5.3 Control flow
        5.3.1 Repetition and looping
        5.3.2 Conditional execution
    5.4 User-written functions
    5.5 Reshaping data
        5.5.1 Transposing
        5.5.2 Converting from wide to long dataset formats
    5.6 Aggregating data
    Summary
Part 2. Basic methods
6 Basic graphs
    6.1 Bar charts
        6.1.1 Simple bar charts
        6.1.2 Stacked, grouped, and filled bar charts
        6.1.3 Mean bar charts
        6.1.4 Tweaking bar charts
    6.2 Pie charts
    6.3 Tree maps
    6.4 Histograms
    6.5 Kernel density plots
    6.6 Box plots
        6.6.1 Using parallel box plots to compare groups
        6.6.2 Violin plots
    6.7 Dot plots
    Summary
7 Basic statistics
    7.1 Descriptive statistics
        7.1.1 A menagerie of methods
        7.1.2 Even more methods
        7.1.3 Descriptive statistics by group
        7.1.4 Summarizing data interactively with dplyr
        7.1.5 Visualizing results
    7.2 Frequency and contingency tables
        7.2.1 Generating frequency tables
        7.2.2 Tests of independence
        7.2.3 Measures of association
        7.2.4 Visualizing results
    7.3 Correlations
        7.3.1 Types of correlations
        7.3.2 Testing correlations for significance
        7.3.3 Visualizing correlations
    7.4 T-tests
        7.4.1 Independent t-test
        7.4.2 Dependent t-test
        7.4.3 When there are more than two groups
    7.5 Nonparametric tests of group differences
        7.5.1 Comparing two groups
        7.5.2 Comparing more than two groups
    7.6 Visualizing group differences
    Summary
Part 3. Intermediate methods
8 Regression
    8.1 The many faces of regression
        8.1.1 Scenarios for using OLS regression
        8.1.2 What you need to know
    8.2 OLS regression
        8.2.1 Fitting regression models with lm()
        8.2.2 Simple linear regression
        8.2.3 Polynomial regression
        8.2.4 Multiple linear regression
        8.2.5 Multiple linear regression with interactions
    8.3 Regression diagnostics
        8.3.1 A typical approach
        8.3.2 An enhanced approach
        8.3.3 Multicollinearity
    8.4 Unusual observations
        8.4.1 Outliers
        8.4.2 High-leverage points
        8.4.3 Influential observations
    8.5 Corrective measures
        8.5.1 Deleting observations
        8.5.2 Transforming variables
        8.5.3 Adding or deleting variables
        8.5.4 Trying a different approach
    8.6 Selecting the “best” regression model
        8.6.1 Comparing models
        8.6.2 Variable selection
    8.7 Taking the analysis further
        8.7.1 Cross-validation
        8.7.2 Relative importance
    Summary
9 Analysis of variance
    9.1 A crash course on terminology
    9.2 Fitting ANOVA models
        9.2.1 The aov() function
        9.2.2 The order of formula terms
    9.3 One-way ANOVA
        9.3.1 Multiple comparisons
        9.3.2 Assessing test assumptions
    9.4 One-way ANCOVA
        9.4.1 Assessing test assumptions
        9.4.2 Visualizing the results
    9.5 Two-way factorial ANOVA
    9.6 Repeated measures ANOVA
    9.7 Multivariate analysis of variance (MANOVA)
        9.7.1 Assessing test assumptions
        9.7.2 Robust MANOVA
    9.8 ANOVA as regression
    Summary
10 Power analysis
    10.1 A quick review of hypothesis testing
    10.2 Implementing power analysis with the pwr package
        10.2.1 T-tests
        10.2.2 ANOVA
        10.2.3 Correlations
        10.2.4 Linear models
        10.2.5 Tests of proportions
        10.2.6 Chi-square tests
        10.2.7 Choosing an appropriate effect size in novel situations
    10.3 Creating power analysis plots
    10.4 Other packages
    Summary
11 Intermediate graphs
    11.1 Scatter plots
        11.1.1 Scatter plot matrices
        11.1.2 High-density scatter plots
        11.1.3 3D scatter plots
        11.1.4 Spinning 3D scatter plots
        11.1.5 Bubble plots
    11.2 Line charts
    11.3 Corrgrams
    11.4 Mosaic plots
    Summary
12 Resampling statistics and bootstrapping
    12.1 Permutation tests
    12.2 Permutation tests with the coin package
        12.2.1 Independent two-sample and k-sample tests
        12.2.2 Independence in contingency tables
        12.2.3 Independence between numeric variables
        12.2.4 Dependent two-sample and k-sample tests
        12.2.5 Going further
    12.3 Permutation tests with the lmPerm package
        12.3.1 Simple and polynomial regression
        12.3.2 Multiple regression
        12.3.3 One-way ANOVA and ANCOVA
        12.3.4 Two-way ANOVA
    12.4 Additional comments on permutation tests
    12.5 Bootstrapping
    12.6 Bootstrapping with the boot package
        12.6.1 Bootstrapping a single statistic
        12.6.2 Bootstrapping several statistics
    Summary
Part 4. Advanced methods
13 Generalized linear models
    13.1 Generalized linear models and the glm() function
        13.1.1 The glm() function
        13.1.2 Supporting functions
        13.1.3 Model fit and regression diagnostics
    13.2 Logistic regression
        13.2.1 Interpreting the model parameters
        13.2.2 Assessing the impact of predictors on the probability of an outcome
        13.2.3 Overdispersion
        13.2.4 Extensions
    13.3 Poisson regression
        13.3.1 Interpreting the model parameters
        13.3.2 Overdispersion
        13.3.3 Extensions
    Summary
14 Principal components and factor analysis
    14.1 Principal components and factor analysis in R
    14.2 Principal components
        14.2.1 Selecting the number of components to extract
        14.2.2 Extracting principal components
        14.2.3 Rotating principal components
        14.2.4 Obtaining principal component scores
    14.3 Exploratory factor analysis
        14.3.1 Deciding how many common factors to extract
        14.3.2 Extracting common factors
        14.3.3 Rotating factors
        14.3.4 Factor scores
        14.3.5 Other EFA-related packages
    14.4 Other latent variable models
    Summary
15 Time series
    15.1 Creating a time-series object in R
    15.2 Smoothing and seasonal decomposition
        15.2.1 Smoothing with simple moving averages
        15.2.2 Seasonal decomposition
    15.3 Exponential forecasting models
        15.3.1 Simple exponential smoothing
        15.3.2 Holt and Holt–Winters exponential smoothing
        15.3.3 The ets() function and automated forecasting
    15.4 ARIMA forecasting models
        15.4.1 Prerequisite concepts
        15.4.2 ARMA and ARIMA models
        15.4.3 Automated ARIMA forecasting
    15.5 Going further
    Summary
16 Cluster analysis
    16.1 Common steps in cluster analysis
    16.2 Calculating distances
    16.3 Hierarchical cluster analysis
    16.4 Partitioning-cluster analysis
        16.4.1 K-means clustering
        16.4.2 Partitioning around medoids
    16.5 Avoiding nonexistent clusters
    16.6 Going further
    Summary
17 Classification
    17.1 Preparing the data
    17.2 Logistic regression
    17.3 Decision trees
        17.3.1 Classical decision trees
        17.3.2 Conditional inference trees
    17.4 Random forests
    17.5 Support vector machines
        17.5.1 Tuning an SVM
    17.6 Choosing a best predictive solution
    17.7 Understanding black box predictions
        17.7.1 Break-down plots
        17.7.2 Plotting Shapley values
    17.8 Going further
    Summary
18 Advanced methods for missing data
    18.1 Steps in dealing with missing data
    18.2 Identifying missing values
    18.3 Exploring missing-values patterns
        18.3.1 Visualizing missing values
        18.3.2 Using correlations to explore missing values
    18.4 Understanding the sources and impact of missing data
    18.5 Rational approaches for dealing with incomplete data
    18.6 Deleting missing data
        18.6.1 Complete-case analysis (listwise deletion)
        18.6.2 Available case analysis (pairwise deletion)
    18.7 Single imputation
        18.7.1 Simple imputation
        18.7.2 K-nearest neighbor imputation
        18.7.3 missForest
    18.8 Multiple imputation
    18.9 Other approaches to missing data
    Summary
Part 5. Expanding your skills
19 Advanced graphs
    19.1 Modifying scales
        19.1.1 Customizing axes
        19.1.2 Customizing colors
    19.2 Modifying themes
        19.2.1 Prepackaged themes
        19.2.2 Customizing fonts
        19.2.3 Customizing legends
        19.2.4 Customizing the plot area
    19.3 Adding annotations
    19.4 Combining graphs
    19.5 Making graphs interactive
    Summary
20 Advanced programming
    20.1 A review of the language
        20.1.1 Data types
        20.1.2 Control structures
        20.1.3 Creating functions
    20.2 Working with environments
    20.3 Non-standard evaluation
    20.4 Object-oriented programming
        20.4.1 Generic functions
        20.4.2 Limitations of the S3 model
    20.5 Writing efficient code
        20.5.1 Efficient data input
        20.5.2 Vectorization
        20.5.3 Correctly sizing objects
        20.5.4 Parallelization
    20.6 Debugging
        20.6.1 Common sources of errors
        20.6.2 Debugging tools
        20.6.3 Session options that support debugging
        20.6.4 Using RStudio’s visual debugger
    20.7 Going further
    Summary
21 Creating dynamic reports
    21.1 A template approach to reports
    21.2 Creating a report with R and R Markdown
    21.3 Creating a report with R and LaTeX
        21.3.1 Creating a parameterized report
    21.4 Avoiding common R Markdown problems
    21.5 Going further
    Summary
22 Creating a package
    22.1 The edatools package
    22.2 Creating a package
        22.2.1 Installing development tools
        22.2.2 Creating a package project
        22.2.3 Writing the package functions
        22.2.4 Adding function documentation
        22.2.5 Adding a general help file (optional)
        22.2.6 Adding sample data to the package (optional)
        22.2.7 Adding a vignette (optional)
        22.2.8 Editing the DESCRIPTION file
        22.2.9 Building and installing the package
    22.3 Sharing your package
        22.3.1 Distributing a source package file
        22.3.2 Submitting to CRAN
        22.3.3 Hosting on GitHub
        22.3.4 Creating a package website
    22.4 Going further
    Summary
Afterword. Into the rabbit hole
Appendix A. Graphical user interfaces
Appendix B. Customizing the startup environment
Appendix C. Exporting data from R
    C.1 Delimited text file
    C.2 Excel spreadsheet
    C.3 Statistical applications
Appendix D. Matrix algebra in R
Appendix E. Packages used in this book
Appendix F. Working with large datasets
    F.1 Efficient programming
    F.2 Storing data outside of RAM
    F.3 Analytic packages for out-of-memory data
    F.4 Comprehensive solutions for working with enormous datasets
Appendix G. Updating an R installation
    G.1 Automated installation (Windows only)
    G.2 Manual installation (Windows and macOS)
    G.3 Updating an R installation (Linux)
References
index