Data Analytics for the Social Sciences: Applications in R

Length: 672 pages
Edition: 1
Language: English
Publisher: Routledge
Publication Date: 2021-11-30
ISBN-10: 036762429X
ISBN-13: 9780367624293

Data Analytics for the Social Sciences is an introductory, graduate-level treatment of data analytics for social science. It features applications in the R language, arguably the fastest-growing and leading statistical tool for researchers.

The book starts with an ethics chapter on the uses and potential abuses of data analytics. Chapters 2 and 3 show how to implement a broad range of statistical procedures in R. Chapters 4 and 5 deal with regression and classification trees and with random forests. Chapter 6 deals with machine learning models and the “caret” package, which makes available to the researcher hundreds of models. Chapter 7 deals with neural network analysis and Chapter 8 with network analysis and visualization of network data. A final chapter treats text analysis, including web scraping, comparative word frequency tables, word clouds, word maps, sentiment analysis, topic analysis, and more. All empirical chapters have two “Quick Start” exercises designed to allow quick immersion in chapter topics, followed by “In Depth” coverage. Data are available for all examples and runnable R code is provided in a “Command Summary”. An appendix provides an extended tutorial on R and RStudio. Over 30 online supplements, covering all chapters, provide “books within the book” on a variety of topics, such as agent-based modelling.

Rather than focusing on equations, derivations and proofs, this book emphasises hands-on work: obtaining output for various social science models and interpreting that output. It is suitable for all advanced-level undergraduate and postgraduate students learning statistical data analysis.

Cover
Half Title
Title Page
Copyright Page
Dedication
Contents
Acknowledgments
Preface
1. Using and abusing data analytics in social science
	1.1. Introduction
	1.2. The promise of data analytics for social science
		1.2.1. Data analytics in public affairs and public policy
		1.2.2. Data analytics in the social sciences
		1.2.3. Data analytics in the humanities
	1.3. Research design issues in data analytics
		1.3.1. Beware the true believer
		1.3.2. Pseudo-objectivity in data analytics
		1.3.3. The bias of scholarship based on algorithms using big data
		1.3.4. The subjectivity of algorithms
		1.3.5. Big data and big noise
		1.3.6. Limitations of the leading data science dissemination models
	1.4. Social and ethical issues in data analytics
		1.4.1. Types of ethical issues in data analytics
		1.4.2. Bias toward the privileged
		1.4.3. Discrimination
		1.4.4. Diversity and data analytics
		1.4.5. Distortion of democratic processes
		1.4.6. Undermining of professional ethics
		1.4.7. Privacy, profiling, and surveillance issues
		1.4.8. The transparency issue
	1.5. Summary: Technology and power
	Endnotes
2. Statistical analytics with R, Part 1
	PART I: OVERVIEW OF STATISTICAL ANALYSIS WITH R
	2.1. Introduction
	2.2. Data and packages used in this chapter
		2.2.1. Example data
		2.2.2. R packages used
	PART II: QUICK START ON STATISTICAL ANALYSIS WITH R
	2.3. Descriptive statistics
	2.4. Linear multiple regression
	PART III: STATISTICAL ANALYSIS WITH R IN DETAIL
	2.5. Hypothesis testing
		2.5.1. One-sample test of means
		2.5.2. Means test for two independent samples
		2.5.3. Means test for two dependent samples
	2.6. Crosstabulation, significance, and association
	2.7. Loglinear analysis for categorical variables
	2.8. Correlation, correlograms, and scatterplots
	2.9. Factor analysis (exploratory)
	2.10. Multidimensional scaling
	2.11. Reliability analysis
		2.11.1. Cronbach’s alpha and Guttman’s lower bounds
		2.11.2. Guttman’s lower bounds and Cronbach’s alpha
		2.11.3. Krippendorff’s alpha and Cohen’s kappa
	2.12. Cluster analysis
		2.12.1. Hierarchical cluster analysis
		2.12.2. K-means clustering
		2.12.3. Nearest neighbor analysis
	2.13. Analysis of variance
		2.13.1. Data and packages used
		2.13.2. GLM univariate: ANOVA
		2.13.3. GLM univariate: ANCOVA
		2.13.4. GLM multivariate: MANOVA
		2.13.5. GLM multivariate: MANCOVA
	2.14. Logistic regression
		2.14.1. ROC and AUC analysis
		2.14.2. Confusion table and accuracy
	2.15. Mediation and moderation
	2.16. Chapter 2 command summary
	Endnotes
3. Statistical analytics with R, Part 2
	PART I: OVERVIEW OF STATISTICAL ANALYTICS WITH R
	3.1. Introduction
	3.2. Data and packages used in this chapter
		3.2.1. Example data
		3.2.2. R Packages used
	PART II: QUICK START ON STATISTICAL ANALYSIS PART 2
	3.3. Quick start: Linear regression as a generalized linear modeling (GZLM)
		3.3.1. Background to GZLM
		3.3.2. The linear model in glm()
		3.3.3. GZLM output
		3.3.4. Fitted value, residuals, and plots
		3.3.5. Noncanonical custom links
		3.3.6. Multiple comparison tests
		3.3.7. Estimated marginal means (EMM)
	3.4. Quick start: Testing if multilevel modeling is needed
	PART III: STATISTICAL ANALYSIS, PART 2, IN DETAIL
	3.5. Generalized linear models (GZLM)
		3.5.1. Introduction
		3.5.2. Setup for GZLM models in R
		3.5.3. Binary logistic regression example
		3.5.4. Gamma regression model
		3.5.5. Poisson regression model
		3.5.6. Negative binomial regression
	3.6. Multilevel modeling (MLM)
		3.6.1. Introduction
		3.6.2. Setup and data
		3.6.3. The random coefficients model
		3.6.4. Likelihood ratio test
	3.7. Panel data regression (PDR)
		3.7.1. Introduction
		3.7.2. Types of PDR model
		3.7.3. The Hausman test
		3.7.4. Setup and data
		3.7.5. PDR with the plm package
		3.7.6. PDR with the panelr package
	3.8. Structural equation modeling (SEM)
	3.9. Missing data analysis and data imputation
	3.10. Chapter 3 command summary
	Endnotes
4. Classification and regression trees in R
	PART I: OVERVIEW OF CLASSIFICATION AND REGRESSION TREES WITH R
	4.1. Introduction
	4.2. Advantages of decision tree analysis
	4.3. Limitations of decision tree analysis
	4.4. Decision tree terminology
	4.5. Steps in decision tree analysis
	4.6. Decision tree algorithms
	4.7. Random forests and ensemble methods
	4.8. Software
		4.8.1. R language
		4.8.2. Stata
		4.8.3. SAS
		4.8.4. SPSS
		4.8.5. Python language
	4.9. Data and packages used in this chapter
		4.9.1. Example data
		4.9.2. R packages used
	PART II: QUICK START - CLASSIFICATION AND REGRESSION TREES
	4.10. Classification tree example: Survival on the Titanic
	4.11. Regression tree example: Correlates of murder
	PART III: CLASSIFICATION AND REGRESSION TREES, IN DETAIL
	4.12. Overview
	4.13. The rpart() program
		4.13.1. Introduction
		4.13.2. Training and validation datasets
		4.13.3. Setup for rpart() trees
	4.14. Classification trees with the rpart package
		4.14.1. The basic rpart classification tree
		4.14.2. Printing tree rules
		4.14.3. Visualization with prp() and draw.tree()
		4.14.4. Visualization with fancyRpartPlot()
		4.14.5. Interpreting tree summaries
		4.14.6. Listing nodes by country and countries by node
		4.14.7. Node distribution plots
		4.14.8. Saving predictions and residuals
		4.14.9. Cross-validation and pruning
		4.14.10. The confusion matrix and model performance metrics
		4.14.11. The ROC curve and AUC
		4.14.12. Lift plots
		4.14.13. Gains plots
		4.14.14. Precision vs. recall plot
	4.15. Regression trees with the rpart package
		4.15.1. Setup
		4.15.2. Creating an rpart regression tree
		4.15.3. Printing tree rules
		4.15.4. Visualization with prp() and fancyRpartPlot()
		4.15.5. Interpreting tree summaries
		4.15.6. The CP table
		4.15.7. Listing nodes by country and countries by node
		4.15.8. Saving predictions and residuals
		4.15.9. Plotting residuals
		4.15.10. Cross-validation and pruning
		4.15.11. R-squared for regression trees
		4.15.12. MSE for regression trees
		4.15.13. The confusion matrix
		4.15.14. The ROC curve and AUC
		4.15.15. Gains plots
		4.15.16. Gains plot with OLS comparison
	4.16. The tree package
	4.17. The ctree() program for conditional decision trees
	4.18. More decision trees programs for R
	4.19. Chapter 4 command summary
	Endnotes
5. Random forests
	PART I: OVERVIEW OF RANDOM FORESTS IN R
	5.1. Introduction
		5.1.1. Social science examples of random forest models
		5.1.2. Advantages of random forests
		5.1.3. Limitations of random forests
		5.1.4. Data and packages
	PART II: QUICK START – RANDOM FORESTS
	5.2. Classification forest example: Searching for the causes of happiness
	5.3. Regression forest example: Why so much crime in my town?
	PART III: RANDOM FORESTS, IN DETAIL
	5.4. Classification forests with randomForest()
		5.4.1. Setup
		5.4.2. A basic classification model
		5.4.3. Output components of randomForest() objects for classification models
		5.4.4. Graphing a randomForest tree?
		5.4.5. Comparing randomForest() and rpart() performance
		5.4.6. Tuning the random forest model
		5.4.7. MDS cluster analysis of the RF classification model
	5.5. Regression forests with randomForest()
		5.5.1. Introduction
		5.5.2. Setup
		5.5.3. A basic regression model
		5.5.4. Output components for regression forest models
		5.5.5. Graphing a randomForest tree?
		5.5.6. MDS plots
		5.5.7. Quartile plots
		5.5.8. Comparing randomForest() and rpart() regression models
		5.5.9. Tuning the randomForest() regression model
		5.5.10. Outliers: Identifying and removing
	5.6. The randomForestExplainer package
		5.6.1. Setup for the randomForestExplainer package
		5.6.2. Minimal depth plots
		5.6.3. Multiway variable importance plots
		5.6.4. Multiway ranking of variable importance
		5.6.5. Comparing randomForest and OLS rankings of predictors
		5.6.6. Which importance criteria?
		5.6.7. Interaction analysis
		5.6.8. The explain_forest() function
	5.7. Summary
	5.8. Conditional inference forests
	5.9. MDS plots for random forests
	5.10. More random forest programs for R
	5.11. Command summary
	Endnotes
6. Modeling and machine learning
	PART I: OVERVIEW OF MODELING AND MACHINE LEARNING
	6.1. Introduction
		6.1.1. Social science examples of modeling and machine learning in R
		6.1.2. Advantages of modeling and machine learning in R
		6.1.3. Limitations of modeling and machine learning in R
		6.1.4. Data, packages, and default directory
	PART II: QUICK START – MODELING AND MACHINE LEARNING
	6.2. Example 1: Bayesian modeling of county-level poverty
		6.2.1. Introduction
		6.2.2. Setup
		6.2.3. Correlation plot
		6.2.4. The Bayes generalized linear model
	6.3. Example 2: Predicting diabetes among Pima Indians with mlr3
		6.3.1. Introduction
		6.3.2. Setup
		6.3.3. How mlr3 works
		6.3.4. The Pima Indian data
	PART III: MODELING AND MACHINE LEARNING IN DETAIL
	6.4. Illustrating modeling and machine learning with SVM in caret
		6.4.1. How SVM works
		6.4.2. SVM algorithms compared to logistic and OLS regression
		6.4.3. SVM kernels, types, and parameters
		6.4.4. Tuning SVM models
		6.4.5. SVM and longitudinal data
	6.5. SVM versus OLS regression
	6.6. SVM with the caret package: Predicting world literacy rates
		6.6.1. Setup
		6.6.2. Constructing the SVM regression model with caret
		6.6.3. Obtaining predicted values and residuals
		6.6.4. Model performance metrics
		6.6.5. Variable importance
		6.6.6. Other output elements
		6.6.7. SVM plots
	6.7. Tuning SVM models
		6.7.1. Tuning for the train() command from the caret package
		6.7.2. Tuning for the svm() command from the e1071 package
		6.7.3. Cross-validating SVM models
		6.7.4. Using e1071 in caret rather than the default kern package
	6.8. SVM classification models: Classifying U.S. Senators
		6.8.1. The “senate” example and setup
		6.8.2. SVM classification with alternative kernels: Senate example
		6.8.3. Tuning the SVM binary classification model
	6.9. Gradient boosting machines (GBM)
		6.9.1. Introduction
		6.9.2. Setup and example data
		6.9.3. Metrics for comparing models
		6.9.4. The caret control object
		6.9.5. Training the GBM model under caret
	6.10. Learning vector quantization (LVQ)
		6.10.1. Introduction
		6.10.2. Setup and example data
		6.10.3. Metrics for comparing models
		6.10.4. The caret control object
		6.10.5. Training the LVQ model under caret
	6.11. Comparing models
	6.12. Variable importance
		6.12.1. Leave-one-out modeling
		6.12.2. Recursive feature elimination (RFE) with caret
		6.12.3. Other approaches to variable importance
	6.13. SVM classification for a multinomial outcome
	6.14. Command summary
	Endnotes
7. Neural network models and deep learning
	PART I: OVERVIEW OF NEURAL NETWORK MODELS AND DEEP LEARNING
	7.1. Overview
	7.2. Data and packages
	7.3. Social science examples
	7.4. Pros and cons of neural networks
	7.5. Artificial neural network (ANN) concepts
		7.5.1. ANN terms
		7.5.2. R software programs for ANN
		7.5.3. Training methods for ANN
		7.5.4. Algorithms in neuralnet
		7.5.5. Algorithms in nnet
		7.5.6. Tuning ANN models
	PART II: QUICK START - MODELING AND MACHINE LEARNING
	7.6. Example 1: Analyzing NYC airline delays
		7.6.1. Introduction
		7.6.2. General setup
		7.6.3. Data preparation
		7.6.4. Modeling NYC airline delays
	7.7. Example 2: The classic iris classification example
		7.7.1. Setup
		7.7.2. Exploring separation with a violin plot
		7.7.3. Normalizing the data
		7.7.4. Training the model with nnet in caret
		7.7.5. Obtain model predictions
		7.7.6. Display the neural model
	PART III: NEURAL NETWORK MODELS IN DETAIL
	7.8. Analyzing Boston crime via the neuralnet package
		7.8.1. Setup
		7.8.2. The linear regression model for unscaled data
		7.8.3. The neuralnet model for unscaled data
		7.8.4. Scaling the data
		7.8.5. The linear regression model for scaled data
		7.8.6. The neuralnet model for scaled data
		7.8.7. Neuralnet results for the training data
		7.8.8. Model performance plots
		7.8.9. Visualizing the neuralnet model
		7.8.10. Variable importance for the neuralnet model
	7.9. Analyzing Boston crime via neuralnet under the caret package
	7.10. Analyzing Boston crime via nnet in caret
		7.10.1. Setup
		7.10.2. The nnet/caret model of Boston crime
		7.10.3. Variable importance for the nnet/caret model
		7.10.4. Further tuning the nnet model outside caret
	7.11. A classification model of marital status using nnet
		7.11.1. Setup
		7.11.2. The nnet classification model of marital status
	7.12. Neural network analysis using “mlr3keras”
	7.13. Command summary
	Endnotes
8. Network analysis
	PART I: OVERVIEW OF NETWORK ANALYSIS WITH R
	8.1. Introduction
	8.2. Data and packages used in this chapter
	8.3. Concepts in network analysis
	8.4. Getting data into network format
	PART II: QUICK START ON NETWORK ANALYSIS WITH R
	8.5. Quick start exercise 1: The Medici family network
	8.6. Quick start exercise 2: Marvel hero network communities
	PART III: NETWORK ANALYSIS WITH R IN DETAIL
	8.7. Interactive network analysis with visNetwork
		8.7.1. Undirected networks: Research team management
		8.7.2. Clustering by group: Research team grouped by gender
		8.7.3. A larger network with navigation and circle layout
		8.7.4. Visualizing classification and regression trees: National literacy
		8.7.5. A directed network (asymmetrical relationships in a research team)
	8.8. Network analysis with igraph
		8.8.1. Term adjacency networks: Gubernatorial websites and the covid pandemic
		8.8.2. Similarity/distance networks with igraph: Senate interest group ratings
		8.8.3. Communities, modularity, and centrality
		8.8.4. Similarity network analysis: All senators
	8.9. Using intergraph for network conversions
	8.10. Network-on-a-map with the diagram and maps packages
	8.11. Network analysis with the statnet and network packages
		8.11.1. Introduction
		8.11.2. Visualization
		8.11.3. Neighborhoods
		8.11.4. Cluster analysis
	8.12. Clique analysis with sna
		8.12.1. A simplified clique analysis
		8.12.2. A clique analysis of the DHHS formal network
		8.12.3. K-core analysis of the DHHS formal network
	8.13. Mapping international trade flow with statnet and Intergraph
	8.14. Correlation networks with corrr
	8.15. Network analysis with tidygraph
		8.15.1. Introduction
		8.15.2. A simple tidygraph example
		8.15.3. Network conversions with tidygraph
		8.15.4. Finding community clusters with tidygraph
	8.16. Simulating networks
		8.16.1. Agent-based network modeling with SchellingR
		8.16.2. Agent-based network modeling with RSiena
		8.16.3. Agent-based network modeling with NetLogoR
	8.17. Summary
	8.18. Command summary
	Endnotes
9. Text analytics
	PART I: OVERVIEW OF TEXT ANALYTICS WITH R
	9.1. Overview
	9.2. Data used in this chapter
	9.3. Packages used in this chapter
	9.4. What is a corpus?
	9.5. Text files
		9.5.1. Overview
		9.5.2. Archived texts
		9.5.3. Project Gutenberg archive
		9.5.4. Comma-separated values (.csv) files
		9.5.5. Text from Word .docx files with the textreadr package
		9.5.6. Text from other formats with the readtext package
		9.5.7. Text from raw text files
	PART II: QUICK START ON TEXT ANALYTICS WITH R
	9.6. Quick start exercise 1: Key word in context (kwic) indexing
	9.7. Quick start exercise 2: Word frequencies and histograms
	PART III: NETWORK ANALYSIS WITH R IN DETAIL
	9.8. Web scraping
		9.8.1. Overview
		9.8.2. Web scraping: The “htm2txt” package
		9.8.3. Web scraping: The “rvest” package
	9.9. Social media scraping
		9.9.1. Analysis of Twitter data: Trump and the New York Times
		9.9.2. Social media scraping with twitter
	9.10. Leading text formats in R
		9.10.1. Overview
		9.10.2. Formats related to the “tidytext” package
		9.10.3. Formats related to the “tm” package
		9.10.4. Formats related to the “quanteda” package
		9.10.5. Common text file conversions
	9.11. Tokenization
		9.11.1. Overview
		9.11.2. Word tokenization
	9.12. Character encoding
	9.13. Text cleaning and preparation
	9.14. Analysis: Multigroup word frequency comparisons
		9.14.1. Multigroup analysis in tidytext
		9.14.2. Multigroup analysis with quanteda’s textstat_keyness() command
		9.14.3. Multigroup analysis with textstat_frequency() in quanteda and ggplot2
	9.15. Analysis: Word clouds
	9.16. Analysis: Comparison clouds
	9.17. Analysis: Word maps and word correlations
		9.17.1. Working with the tdm format
		9.17.2. Working with the dtm format
		9.17.3. Word frequencies and word correlations
		9.17.4. Correlation plots of word and document associations
		9.17.5. Plotting word stem correlations for word pairs
		9.17.6. Word correlation maps
	9.18. Analysis: Sentiment analysis
		9.18.1. Overview
		9.18.2. Example: sentiment analysis of news articles
	9.19. Analysis: Topic modeling
		9.19.1. Overview
		9.19.2. Topic analysis example 1: Modeling topic frequency over time
		9.19.3. Topic analysis example 2: LDA analysis
	9.20. Analysis: Lexical dispersion plots
	9.21. Analysis: Bigrams and ngrams
	9.22. Command Summary
	Endnotes
Appendix 1: Introduction to R and RStudio
Appendix 2: Data used in this book
References
Index