Introduction to Machine Learning, 4th edition
- Length: 712 pages
- Edition: 4
- Language: English
- Publisher: The MIT Press
- Publication Date: 2020-03-24
- ISBN-10: 0262043793
- ISBN-13: 9780262043793
- Sales Rank: #226533
A substantially revised fourth edition of a comprehensive textbook, including new coverage of recent advances in deep learning and neural networks.
The goal of machine learning is to program computers to use example data or past experience to solve a given problem. Machine learning underlies such exciting new technologies as self-driving cars, speech recognition, and translation applications. This substantially revised fourth edition of a comprehensive, widely used machine learning textbook offers new coverage of recent advances in the field in both theory and practice, including developments in deep learning and neural networks.
The book covers a broad array of topics not usually included in introductory machine learning texts, including supervised learning, Bayesian decision theory, parametric methods, semiparametric methods, nonparametric methods, multivariate analysis, hidden Markov models, reinforcement learning, kernel machines, graphical models, Bayesian estimation, and statistical testing. The fourth edition offers a new chapter on deep learning that discusses training, regularizing, and structuring deep neural networks, including convolutional and generative adversarial networks; new material in the chapter on reinforcement learning covering deep networks, policy gradient methods, and deep reinforcement learning; new material in the chapter on multilayer perceptrons on autoencoders and the word2vec network; and a discussion of t-SNE, a popular method of dimensionality reduction. New appendixes provide background material on linear algebra and optimization. End-of-chapter exercises help readers apply the concepts they have learned. Introduction to Machine Learning can be used in courses for advanced undergraduate and graduate students and as a reference for professionals.
Contents:
- Cover, Copyright, Contents, Preface, Notations
- 1 Introduction: 1.1 What Is Machine Learning?; 1.2 Examples of Machine Learning Applications (1.2.1 Association Rules; 1.2.2 Classification; 1.2.3 Regression; 1.2.4 Unsupervised Learning; 1.2.5 Reinforcement Learning); 1.3 History; 1.4 Related Topics (1.4.1 High-Performance Computing; 1.4.2 Data Privacy and Security; 1.4.3 Model Interpretability and Trust; 1.4.4 Data Science); 1.5 Exercises; 1.6 References
- 2 Supervised Learning: 2.1 Learning a Class from Examples; 2.2 Vapnik-Chervonenkis Dimension; 2.3 Probably Approximately Correct Learning; 2.4 Noise; 2.5 Learning Multiple Classes; 2.6 Regression; 2.7 Model Selection and Generalization; 2.8 Dimensions of a Supervised Machine Learning Algorithm; 2.9 Notes; 2.10 Exercises; 2.11 References
- 3 Bayesian Decision Theory: 3.1 Introduction; 3.2 Classification; 3.3 Losses and Risks; 3.4 Discriminant Functions; 3.5 Association Rules; 3.6 Notes; 3.7 Exercises; 3.8 References
- 4 Parametric Methods: 4.1 Introduction; 4.2 Maximum Likelihood Estimation (4.2.1 Bernoulli Density; 4.2.2 Multinomial Density; 4.2.3 Gaussian (Normal) Density); 4.3 Evaluating an Estimator: Bias and Variance; 4.4 The Bayes’ Estimator; 4.5 Parametric Classification; 4.6 Regression; 4.7 Tuning Model Complexity: Bias/Variance Dilemma; 4.8 Model Selection Procedures; 4.9 Notes; 4.10 Exercises; 4.11 References
- 5 Multivariate Methods: 5.1 Multivariate Data; 5.2 Parameter Estimation; 5.3 Estimation of Missing Values; 5.4 Multivariate Normal Distribution; 5.5 Multivariate Classification; 5.6 Tuning Complexity; 5.7 Discrete Features; 5.8 Multivariate Regression; 5.9 Notes; 5.10 Exercises; 5.11 References
- 6 Dimensionality Reduction: 6.1 Introduction; 6.2 Subset Selection; 6.3 Principal Component Analysis; 6.4 Feature Embedding; 6.5 Factor Analysis; 6.6 Singular Value Decomposition and Matrix Factorization; 6.7 Multidimensional Scaling; 6.8 Linear Discriminant Analysis; 6.9 Canonical Correlation Analysis; 6.10 Isomap; 6.11 Locally Linear Embedding; 6.12 Laplacian Eigenmaps; 6.13 t-Distributed Stochastic Neighbor Embedding; 6.14 Notes; 6.15 Exercises; 6.16 References
- 7 Clustering: 7.1 Introduction; 7.2 Mixture Densities; 7.3 k-Means Clustering; 7.4 Expectation-Maximization Algorithm; 7.5 Mixtures of Latent Variable Models; 7.6 Supervised Learning after Clustering; 7.7 Spectral Clustering; 7.8 Hierarchical Clustering; 7.9 Choosing the Number of Clusters; 7.10 Notes; 7.11 Exercises; 7.12 References
- 8 Nonparametric Methods: 8.1 Introduction; 8.2 Nonparametric Density Estimation (8.2.1 Histogram Estimator; 8.2.2 Kernel Estimator; 8.2.3 k-Nearest Neighbor Estimator); 8.3 Generalization to Multivariate Data; 8.4 Nonparametric Classification; 8.5 Condensed Nearest Neighbor; 8.6 Distance-Based Classification; 8.7 Outlier Detection; 8.8 Nonparametric Regression: Smoothing Models (8.8.1 Running Mean Smoother; 8.8.2 Kernel Smoother; 8.8.3 Running Line Smoother); 8.9 How to Choose the Smoothing Parameter; 8.10 Notes; 8.11 Exercises; 8.12 References
- 9 Decision Trees: 9.1 Introduction; 9.2 Univariate Trees (9.2.1 Classification Trees; 9.2.2 Regression Trees); 9.3 Pruning; 9.4 Rule Extraction from Trees; 9.5 Learning Rules from Data; 9.6 Multivariate Trees; 9.7 Notes; 9.8 Exercises; 9.9 References
- 10 Linear Discrimination: 10.1 Introduction; 10.2 Generalizing the Linear Model; 10.3 Geometry of the Linear Discriminant (10.3.1 Two Classes; 10.3.2 Multiple Classes); 10.4 Pairwise Separation; 10.5 Parametric Discrimination Revisited; 10.6 Gradient Descent; 10.7 Logistic Discrimination (10.7.1 Two Classes; 10.7.2 Multiple Classes; 10.7.3 Multiple Labels); 10.8 Learning to Rank; 10.9 Notes; 10.10 Exercises; 10.11 References
- 11 Multilayer Perceptrons: 11.1 Introduction (11.1.1 Understanding the Brain; 11.1.2 Neural Networks as a Paradigm for Parallel Processing); 11.2 The Perceptron; 11.3 Training a Perceptron; 11.4 Learning Boolean Functions; 11.5 Multilayer Perceptrons; 11.6 MLP as a Universal Approximator; 11.7 Backpropagation Algorithm (11.7.1 Nonlinear Regression; 11.7.2 Two-Class Discrimination; 11.7.3 Multiclass Discrimination; 11.7.4 Multilabel Discrimination); 11.8 Overtraining; 11.9 Learning Hidden Representations; 11.10 Autoencoders; 11.11 Word2vec Architecture; 11.12 Notes; 11.13 Exercises; 11.14 References
- 12 Deep Learning: 12.1 Introduction; 12.2 How to Train Multiple Hidden Layers (12.2.1 Rectified Linear Unit; 12.2.2 Initialization; 12.2.3 Generalizing Backpropagation to Multiple Hidden Layers); 12.3 Improving Training Convergence (12.3.1 Momentum; 12.3.2 Adaptive Learning Factor; 12.3.3 Batch Normalization); 12.4 Regularization (12.4.1 Hints; 12.4.2 Weight Decay; 12.4.3 Dropout); 12.5 Convolutional Layers (12.5.1 The Idea; 12.5.2 Formalization; 12.5.3 Examples: LeNet-5 and AlexNet; 12.5.4 Extensions; 12.5.5 Multimodal Deep Networks); 12.6 Tuning the Network Structure (12.6.1 Structure and Hyperparameter Search; 12.6.2 Skip Connections; 12.6.3 Gating Units); 12.7 Learning Sequences (12.7.1 Example Tasks; 12.7.2 Time-Delay Neural Networks; 12.7.3 Recurrent Networks; 12.7.4 Long Short-Term Memory Unit; 12.7.5 Gated Recurrent Unit); 12.8 Generative Adversarial Network; 12.9 Notes; 12.10 Exercises; 12.11 References
- 13 Local Models: 13.1 Introduction; 13.2 Competitive Learning (13.2.1 Online k-Means; 13.2.2 Adaptive Resonance Theory; 13.2.3 Self-Organizing Maps); 13.3 Radial Basis Functions; 13.4 Incorporating Rule-Based Knowledge; 13.5 Normalized Basis Functions; 13.6 Competitive Basis Functions; 13.7 Learning Vector Quantization; 13.8 The Mixture of Experts (13.8.1 Cooperative Experts; 13.8.2 Competitive Experts); 13.9 Hierarchical Mixture of Experts and Soft Decision Trees; 13.10 Notes; 13.11 Exercises; 13.12 References
- 14 Kernel Machines: 14.1 Introduction; 14.2 Optimal Separating Hyperplane; 14.3 The Nonseparable Case: Soft Margin Hyperplane; 14.4 ν-SVM; 14.5 Kernel Trick; 14.6 Vectorial Kernels; 14.7 Defining Kernels; 14.8 Multiple Kernel Learning; 14.9 Multiclass Kernel Machines; 14.10 Kernel Machines for Regression; 14.11 Kernel Machines for Ranking; 14.12 One-Class Kernel Machines; 14.13 Large Margin Nearest Neighbor Classifier; 14.14 Kernel Dimensionality Reduction; 14.15 Notes; 14.16 Exercises; 14.17 References
- 15 Graphical Models: 15.1 Introduction; 15.2 Canonical Cases for Conditional Independence; 15.3 Generative Models; 15.4 d-Separation; 15.5 Belief Propagation (15.5.1 Chains; 15.5.2 Trees; 15.5.3 Polytrees; 15.5.4 Junction Trees); 15.6 Undirected Graphs: Markov Random Fields; 15.7 Learning the Structure of a Graphical Model; 15.8 Influence Diagrams; 15.9 Notes; 15.10 Exercises; 15.11 References
- 16 Hidden Markov Models: 16.1 Introduction; 16.2 Discrete Markov Processes; 16.3 Hidden Markov Models; 16.4 Three Basic Problems of HMMs; 16.5 Evaluation Problem; 16.6 Finding the State Sequence; 16.7 Learning Model Parameters; 16.8 Continuous Observations; 16.9 The HMM as a Graphical Model; 16.10 Model Selection in HMMs; 16.11 Notes; 16.12 Exercises; 16.13 References
- 17 Bayesian Estimation: 17.1 Introduction; 17.2 Bayesian Estimation of the Parameters of a Discrete Distribution (17.2.1 K > 2 States: Dirichlet Distribution; 17.2.2 K = 2 States: Beta Distribution); 17.3 Bayesian Estimation of the Parameters of a Gaussian Distribution (17.3.1 Univariate Case: Unknown Mean, Known Variance; 17.3.2 Univariate Case: Unknown Mean, Unknown Variance; 17.3.3 Multivariate Case: Unknown Mean, Unknown Covariance); 17.4 Bayesian Estimation of the Parameters of a Function (17.4.1 Regression; 17.4.2 Regression with Prior on Noise Precision; 17.4.3 The Use of Basis/Kernel Functions; 17.4.4 Bayesian Classification); 17.5 Choosing a Prior; 17.6 Bayesian Model Comparison; 17.7 Bayesian Estimation of a Mixture Model; 17.8 Nonparametric Bayesian Modeling; 17.9 Gaussian Processes; 17.10 Dirichlet Processes and Chinese Restaurants; 17.11 Latent Dirichlet Allocation; 17.12 Beta Processes and Indian Buffets; 17.13 Notes; 17.14 Exercises; 17.15 References
- 18 Combining Multiple Learners: 18.1 Rationale; 18.2 Generating Diverse Learners; 18.3 Model Combination Schemes; 18.4 Voting; 18.5 Error-Correcting Output Codes; 18.6 Bagging; 18.7 Boosting; 18.8 The Mixture of Experts Revisited; 18.9 Stacked Generalization; 18.10 Fine-Tuning an Ensemble (18.10.1 Choosing a Subset of the Ensemble; 18.10.2 Constructing Metalearners); 18.11 Cascading; 18.12 Notes; 18.13 Exercises; 18.14 References
- 19 Reinforcement Learning: 19.1 Introduction; 19.2 Single State Case: K-Armed Bandit; 19.3 Elements of Reinforcement Learning; 19.4 Model-Based Learning (19.4.1 Value Iteration; 19.4.2 Policy Iteration); 19.5 Temporal Difference Learning (19.5.1 Exploration Strategies; 19.5.2 Deterministic Rewards and Actions; 19.5.3 Nondeterministic Rewards and Actions; 19.5.4 Eligibility Traces); 19.6 Generalization; 19.7 Partially Observable States (19.7.1 The Setting; 19.7.2 Example: The Tiger Problem); 19.8 Deep Q Learning; 19.9 Policy Gradients; 19.10 Learning to Play Backgammon and Go; 19.11 Notes; 19.12 Exercises; 19.13 References
- 20 Design and Analysis of Machine Learning Experiments: 20.1 Introduction; 20.2 Factors, Response, and Strategy of Experimentation; 20.3 Response Surface Design; 20.4 Randomization, Replication, and Blocking; 20.5 Guidelines for Machine Learning Experiments; 20.6 Cross-Validation and Resampling Methods (20.6.1 K-Fold Cross-Validation; 20.6.2 5 × 2 Cross-Validation; 20.6.3 Bootstrapping); 20.7 Measuring Classifier Performance; 20.8 Interval Estimation; 20.9 Hypothesis Testing; 20.10 Assessing a Classification Algorithm’s Performance (20.10.1 Binomial Test; 20.10.2 Approximate Normal Test; 20.10.3 t Test); 20.11 Comparing Two Classification Algorithms (20.11.1 McNemar’s Test; 20.11.2 K-Fold Cross-Validated Paired t Test; 20.11.3 5 × 2 cv Paired t Test; 20.11.4 5 × 2 cv Paired F Test); 20.12 Comparing Multiple Algorithms: Analysis of Variance; 20.13 Comparison over Multiple Datasets (20.13.1 Comparing Two Algorithms; 20.13.2 Multiple Algorithms); 20.14 Multivariate Tests (20.14.1 Comparing Two Algorithms; 20.14.2 Comparing Multiple Algorithms); 20.15 Notes; 20.16 Exercises; 20.17 References
- A Probability: A.1 Elements of Probability (A.1.1 Axioms of Probability; A.1.2 Conditional Probability); A.2 Random Variables (A.2.1 Probability Distribution and Density Functions; A.2.2 Joint Distribution and Density Functions; A.2.3 Conditional Distributions; A.2.4 Bayes’ Rule; A.2.5 Expectation; A.2.6 Variance; A.2.7 Weak Law of Large Numbers); A.3 Special Random Variables (A.3.1 Bernoulli Distribution; A.3.2 Binomial Distribution; A.3.3 Multinomial Distribution; A.3.4 Uniform Distribution; A.3.5 Normal (Gaussian) Distribution; A.3.6 Chi-Square Distribution; A.3.7 t Distribution; A.3.8 F Distribution); A.4 References
- B Linear Algebra: B.1 Vectors; B.2 Matrices; B.3 Similarity of Vectors; B.4 Square Matrices; B.5 Linear Dependence and Ranks; B.6 Inverses; B.7 Positive Definite Matrices; B.8 Trace and Determinant; B.9 Eigenvalues and Eigenvectors; B.10 Spectral Decomposition; B.11 Singular Value Decomposition; B.12 References
- C Optimization: C.1 Introduction; C.2 Linear Optimization; C.3 Convex Optimization; C.4 Duality; C.5 Local Optimization; C.6 References
- Index