How to Think about Data Science

Length: 276 pages
Edition: 1
Language: English
Publisher: Chapman and Hall/CRC
Publication Date: 2022-12-23
ISBN-10: 1032369639
ISBN-13: 9781032369631

This book is a timely and critical introduction for those interested in what data science is (and isn’t), and how it should be applied. The language is conversational and the content is accessible to readers without a quantitative or computational background; at the same time, it is a practical overview of the field for more technical readers. The overarching goal is to demystify the field and teach the reader how to develop an analytical mindset instead of following recipes. The book takes the scientist’s approach of focusing on asking the right question at every step, as this is the single most important factor contributing to the success of a data science project. Upon finishing this book, the reader should be asking more questions than I have answered. This book is, therefore, a practising scientist’s approach to explaining data science through questions and examples.

Cover
Half Title
Series Page
Title Page
Copyright Page
Dedication
Contents
Foreword
Preface
Acknowledgements
List of Figures
List of Tables
CHAPTER 1: A Bird's-Eye View and the Art of Asking Questions
1.1. WHO IS THIS BOOK FOR?
1.2. ASKING GOOD QUESTIONS IS THE ULTIMATE SUPERPOWER
1.2.1. The Data Scientific Method
1.2.2. What Makes a Good Question?
1.2.3. What Kinds of Questions Can Data Science Not Answer?
1.2.4. The Apollo 11 Mission: Raw Processing Power Is Not as Important as Asking the Right Questions
1.3. THE DATA LIFE CYCLE
1.4. DATA VOLUME AND PROCESSING REQUIREMENTS DETERMINE LARGE-SCALE COMPUTER ARCHITECTURES
1.4.1. The Evolution of Uber's Data Science Needs
1.5. FURTHER READING
1.6. CHAPTER REVIEW QUESTIONS
CHAPTER 2: Descriptive Analytics
2.1. DESCRIPTIVE AND PREDICTIVE ANALYTICS: WHAT IS THE DIFFERENCE?
2.2. DESCRIPTIVE STATISTICS
2.3. WHY IS DATA VISUALISATION ESSENTIAL?
2.4. THE BOX PLOT: EXTENDING VISUALISATION OF KEY STATISTICS TO MULTIPLE SAMPLES
2.4.1. Outliers: Keep or Drop Them?
2.5. FAMOUS HISTORICAL VISUALISATIONS
2.5.1. Florence Nightingale's Soldiers
2.5.2. John Snow and the Transmission of Cholera
2.5.3. Napoleon's Defeat or the Most Brutal Military Tactic in History
2.6. HOW CAN WE CHOOSE THE RIGHT VISUALISATION?
2.7. CLUSTER ANALYSIS: MOST DATA HAVE AN INTRINSIC STRUCTURE
2.7.1. Clustering in Practice: What's the Most Dangerous State in the USA?
2.7.2. Advantages and Limitations of K-Means Clustering
2.8. ASSOCIATION RULES
2.8.1. Quantifying the Relevance of Association Rules
2.8.2. Limitations of Association Rule Mining
2.8.3. Predictive Rule Learning
2.9. FURTHER READING
2.10. CHAPTER REVIEW QUESTIONS
CHAPTER 3: Predictive Analytics
3.1. WHAT IS PREDICTIVE ANALYTICS?
3.2. THE THREE MAIN LEARNING PARADIGMS
3.2.1. Pseudolabelling
3.3. AN OVERVIEW OF MACHINE LEARNING ALGORITHMS
3.4. LINEAR REGRESSION
3.4.1. How Good Is Our Linear Model?
3.4.2. Assumptions of Linearity and Independence, Multiple Regression and Non-Linearity
3.4.3. Linear Thinking or How We Can Be Easily Fooled
3.4.4. And a Short Historical Note
3.5. LOGISTIC REGRESSION
3.5.1. Interpreting Simple Logistic Regression
3.5.2. Multiple Logistic Regression
3.6. NAÏVE BAYES CLASSIFICATION
3.6.1. The Assumption of Predictor Variable Independence
3.7. TREES
3.7.1. Tree Pruning
3.8. ENSEMBLE LEARNING: INDIVIDUAL MODELS ARE WEAK, FAMILIES ARE NOT
3.8.1. Bootstrap Aggregating (Bagging)
3.8.2. Random Forests or the Wisdom of the Crowd
3.8.3. Boosting Machines
3.9. SUPPORT VECTOR MACHINES
3.9.1. The Kernel Trick
3.9.2. Multi-Category Separation by SVMs
3.9.3. Advantages and Disadvantages of SVMs
3.10. ARTIFICIAL NEURAL NETWORKS
3.10.1. The Perceptron
3.10.2. Beyond the Perceptron
3.10.3. How Do ANNs Learn?
3.10.4. Advantages and Disadvantages of ANNs
3.11. DEEP LEARNING
3.11.1. AlexNet and the ImageNet Competition
3.11.2. Transfer Learning
3.12. WHY THERE ARE SO MANY DIFFERENT LEARNING ALGORITHMS AND HOW TO CHOOSE
3.12.1. Why Making Assumptions Is Good and the No Free Lunch Theorem
3.12.2. Some Families of Algorithms Are Better than Others
3.12.3. How Can We Choose the Right Algorithm for Our Problem?
3.12.4. Interpretability vs Explainability
3.13. FEATURE SELECTION AND ENGINEERING IS WHERE A LOT OF MAGIC HAPPENS
3.13.1. The Curse of Dimensionality
3.13.2. Feature Selection Methods
3.13.3. What Is the Best Feature Selection Method and How Many Features Are Enough?
3.13.4. Dimensionality Reduction
3.13.5. Feature Engineering
3.14. HOW DOES AMAZON RECOMMEND BOOKS THAT YOU WILL ACTUALLY LIKE?
3.14.1. Item-to-Item Collaborative Filtering Is Faster and More Scalable
3.14.2. Are Recommender Systems Supervised or Unsupervised Learning?
3.14.3. Democracy of Choice, the Real Potential of Recommender Systems
3.15. BUILDING A MOVIE RECOMMENDER SYSTEM MANUALLY
3.15.1. Calculation of an Item-to-Item Similarity Matrix
3.15.2. Prediction of the Ratings for Items That Have Not Yet Been Rated
3.15.3. Making Movie Recommendations
3.16. THE NETFLIX GRAND PRIZE WAS WON BY AN ALGORITHM THAT WAS NEVER IMPLEMENTED
3.17. FURTHER READING
3.18. CHAPTER REVIEW QUESTIONS
CHAPTER 4: How Are Predictive Models Trained and Evaluated?
4.1. HOW DO PREDICTIVE METHODS LEARN?
4.2. UNDERFITTING AND OVERFITTING
4.3. OVERFITTING IN THE CONTEXT OF THE FUKUSHIMA NUCLEAR PLANT DISASTER
4.4. EVALUATING MODEL PERFORMANCE BY CROSS-VALIDATION
4.4.1. Beyond the Test Set: The Validation and Final Testing Sets
4.5. WAYS TO DESCRIBE THE PERFORMANCE OF A CLASSIFIER
4.6. ACCURACY, PRECISION AND RECALL IN PRACTICE
4.6.1. Image-Based Identification of Terrorists
4.6.2. The Precision-Recall Curve
4.6.3. The F1-Score
4.6.4. The Receiver Operating Characteristic (ROC) Curve
4.7. CLASSIFICATION ERRORS OF CONTINUOUS DATA PREDICTIONS
4.8. FURTHER READING
4.9. CHAPTER REVIEW QUESTIONS
CHAPTER 5: Are Our Algorithms Racist, Sexist and Discriminating?
5.1. HOW DO SMART ALGORITHMS MEDDLE IN OUR LIVES?
5.1.1. Outlook
5.2. RELEASE ON PAROLE: NOT IF YOU ARE BLACK
5.3. MEN ARE PROFESSIONALS, WOMEN ARE HOT
5.4. ‘FUCK THE ALGORITHM’
5.5. PREDICTIVE POLICING OR HOW TO REPRODUCE THE BIAS OF THE POLICE WITH AN ALGORITHM
5.6. WHOSE FAULT WAS IT? DIFFERENT TYPES OF BIAS
5.6.1. Bias on the Web
5.7. HOW CAN WE BUILD FAIRER ALGORITHMS?
5.7.1. What Is Actually ‘Fair’?
5.7.2. Is Being Fair Enough?
5.7.3. The Bias Impact Statement
5.7.4. Fair Machine Learning
5.7.5. The Key Is in the Design Team
5.8. THE FAILURE TO CONTROL THE COVID-19 PANDEMIC FROM BIASED DATA
5.9. FURTHER READING
5.10. CHAPTER REVIEW QUESTIONS
CHAPTER 6: Personal Data, Privacy and Cybersecurity
6.1. HOW MUCH IS YOUR DATA WORTH?
6.1.1. The Cost of a Data Breach
6.2. WHY IS PRIVACY IMPORTANT?
6.3. ARE COMPANIES KEEPING OUR DATA PRIVATE? THE ADA HEALTH CASE STUDY
6.3.1. Business Summary
6.3.1.1. Application Details
6.3.1.2. Technical
6.3.1.3. Market and Legal
6.3.1.4. Application Walk-Through
6.3.2. Data Audit
6.3.2.1. Description of Data Collected and Processed
6.3.2.2. Storage of Personal Data
6.3.2.3. Why Is This Data Held?
6.3.2.4. Legal Basis for Holding the Data?
6.3.2.5. How Long Is the User Data Kept For?
6.3.2.6. Who Has Access to the Data?
6.3.2.7. Security Controls in Place?
6.3.3. Privacy by Design
6.3.4. Experimental Privacy Analysis
6.3.5. End-to-End Security
6.3.5.1. Technical Controls
6.3.5.2. Procedural and Administrative Controls
6.3.5.3. Physical Security Controls
6.3.5.4. Governance and Legal/Compliance Controls
6.4. A FEW SIMPLE RULES FOR OWNING YOUR PRIVACY
6.5. FURTHER READING
6.6. CHAPTER REVIEW QUESTIONS
CHAPTER 7: What Are the Limits of Artificial Intelligence?
7.1. MACHINES OUTPERFORM HUMANS BUT ONLY AT VERY SPECIFIC TASKS
7.1.1. The Protein-Folding Problem
7.2. WHY ‘TORONTO’ AND OTHER EXAMPLES OF HOW MACHINES THINK DIFFERENTLY
7.2.1. Adversarial Examples
7.2.2. (Lack of) Common Sense
7.2.3. Catastrophic Forgetting and Continual Learning
7.2.4. Mathematical Reasoning
7.3. HOW CAN WE TELL IF A MACHINE IS BEHAVING INTELLIGENTLY?
7.3.1. The Turing Test
7.3.2. Computers Can Only Solve Problems That Have Clear-Cut Answers
7.4. IS THE TECHNOLOGICAL SINGULARITY THE REAL THREAT?
7.4.1. Consciousness and the Chinese Room Experiment
7.4.2. Being a Maverick Is a Very Human Trait
7.5. THE TROLLEY PROBLEM
7.5.1. Reinforcement Learning, the 4th Learning Paradigm
7.5.1.1. A Simple Example with an Autonomous Vehicle
7.5.1.2. Autonomous Driving in the Real World
7.5.2. Is the Trolley Problem a Real Problem?
7.5.2.1. Disentangling the Statistics Also Helps with Acceptance
7.5.2.2. Unanswered Ethical Questions
7.6. ROBOTS REFLECT OUR OWN HUMANITY
7.7. FURTHER READING
7.8. CHAPTER REVIEW QUESTIONS
Appendix—Answers to Chapter Review Questions
Chapter 1: A Bird's-Eye View and the Art of Asking Questions.
Chapter 2: Descriptive Analytics.
Chapter 3: Predictive Analytics.
Chapter 4: How Are Predictive Models Trained and Evaluated?
Chapter 5: Are Our Algorithms Racist, Sexist and Discriminating?
Chapter 6: Personal Data, Privacy and Cybersecurity.
Chapter 7: What Are the Limits of Artificial Intelligence?
Bibliography
Index