Machine Learning – A Journey to Deep Learning: With Exercises and Answers
- Length: 624 pages
- Edition: 1
- Language: English
- Publisher: World Scientific Pub Co Inc
- Publication Date: 2021-02-08
- ISBN-10: 9811234051
- ISBN-13: 9789811234057
- Sales Rank: #7420053
This unique compendium discusses some core ideas for the development and implementation of machine learning from three different perspectives: the statistical perspective, the artificial neural network perspective, and the deep learning methodology. This useful reference text provides a solid foundation in machine learning and should prepare readers to apply and understand machine learning algorithms, as well as to invent new machine learning methods. It tells a story that leads from the perceptron to deep learning, illustrated with concrete examples and including exercises and answers for students.
Table of Contents:

Front matter: Cover Page, Title Page, Copyright Page, Dedication, Preface, Contents

1. Introduction
- 1.1 What is Machine Learning: 1.1.1 Symbolical Learning; 1.1.2 Statistical Machine Learning; 1.1.3 Supervised and Unsupervised Machine Learning
- 1.2 It all began with the Perceptron: 1.2.1 Artificial Neuron; 1.2.2 Perceptron; 1.2.3 XOR-Problem
- 1.3 Road to Deep Learning: 1.3.1 Backpropagation
- 1.4 Synopsis: 1.4.1 Content
- 1.5 Exercises and Answers

2. Probability and Information
- 2.1 Probability Theory: 2.1.1 Conditional probability; 2.1.2 Law of Total Probability; 2.1.3 Bayes's rule; 2.1.4 Expectation; 2.1.5 Covariance
- 2.2 Distribution: 2.2.1 Gaussian Distribution; 2.2.2 Laplace Distribution; 2.2.3 Bernoulli Distribution
- 2.3 Information Theory: 2.3.1 Surprise and Information; 2.3.2 Entropy; 2.3.3 Conditional Entropy; 2.3.4 Relative Entropy; 2.3.5 Mutual Information; 2.3.6 Relationship
- 2.4 Cross Entropy
- 2.5 Exercises and Answers

3. Linear Algebra and Optimization
- 3.1 Vectors: 3.1.1 Norm; 3.1.2 Distance function; 3.1.3 Scalar Product; 3.1.4 Linear Independent Vectors; 3.1.5 Matrix Operations; 3.1.6 Tensor Product; 3.1.7 Hadamard product; 3.1.8 Element-wise division
- 3.2 Matrix Calculus: 3.2.1 Gradient; 3.2.2 Jacobian; 3.2.3 Hessian Matrix
- 3.3 Gradient based Numerical Optimization: 3.3.1 Gradient descent; 3.3.2 Newton's Method; 3.3.3 Second and First Order Optimization
- 3.4 Dilemmas in Machine Learning: 3.4.1 The Curse of Dimensionality; 3.4.2 Numerical Computation
- 3.5 Exercises and Answers

4. Linear and Nonlinear Regression
- 4.1 Linear Regression: 4.1.1 Regression of a Line; 4.1.2 Multiple Linear Regression; 4.1.3 Design Matrix; 4.1.4 Squared-Error; 4.1.5 Closed-Form Solution; 4.1.6 Example; 4.1.7 Moore-Penrose Matrix
- 4.2 Linear Basis Function Models: 4.2.1 Example Logarithmic Curve; 4.2.2 Example Polynomial Regression
- 4.3 Model selection
- 4.4 Bayesian Regression: 4.4.1 Maximizing the Likelihood or the Posterior; 4.4.2 Bayesian Learning; 4.4.3 Maximizing a posteriori; 4.4.4 Relation between Regularized Least-Squares and MAP; 4.4.5 LASSO Regularizer
- 4.5 Linear Regression for classification
- 4.6 Exercises and Answers

5. Perceptron
- 5.1 Linear Regression and Linear Artificial Neuron: 5.1.1 Regularization; 5.1.2 Stochastic gradient descent
- 5.2 Continuous Differentiable Activation Functions: 5.2.1 Sigmoid Activation Functions; 5.2.2 Perceptron with sgn0; 5.2.3 Cross Entropy Loss Function; 5.2.4 Linear Unit versus Sigmoid Unit; 5.2.5 Logistic Regression
- 5.3 Multiclass Linear Discriminant: 5.3.1 Cross Entropy Loss Function for softmax; 5.3.2 Logistic Regression Algorithm
- 5.4 Multilayer Perceptron
- 5.5 Exercises and Answers

6. Multilayer Perceptron
- 6.1 Motivations
- 6.2 Networks with Hidden Nonlinear Layers: 6.2.1 Backpropagation; 6.2.2 Example; 6.2.3 Activation Function
- 6.3 Cross Entropy Error Function: 6.3.1 Backpropagation; 6.3.2 Comparison; 6.3.3 Computing Power; 6.3.4 Generalization
- 6.4 Training: 6.4.1 Overfitting; 6.4.2 Early-Stopping Rule; 6.4.3 Regularization
- 6.5 Deep Learning and Backpropagation
- 6.6 Exercises and Answers

7. Learning Theory
- 7.1 Supervised Classification Problem
- 7.2 Probability of a bad sample
- 7.3 Infinite hypotheses set
- 7.4 The VC Dimension
- 7.5 A Fundamental Trade-off
- 7.6 Computing VC Dimension: 7.6.1 The VC Dimension of a Perceptron; 7.6.2 A Heuristic way to measure hypotheses space complexity
- 7.7 The Regression Problem: 7.7.1 Example
- 7.8 Exercises and Answers

8. Model Selection
- 8.1 The confusion matrix: 8.1.1 Precision and Recall; 8.1.2 Several Classes
- 8.2 Validation Set and Test Set
- 8.3 Cross-Validation
- 8.4 Minimum-Description-Length: 8.4.1 Occam's razor; 8.4.2 Kolmogorov complexity theory; 8.4.3 Learning as Data Compression; 8.4.4 Two-part code MDL principle
- 8.5 Paradox of Deep Learning Complexity
- 8.6 Exercises and Answers

9. Clustering
- 9.1 Introduction
- 9.2 K-means Clustering: 9.2.1 Standard K-means; 9.2.2 Sequential K-means
- 9.3 Mixture of Gaussians: 9.3.1 EM for Gaussian Mixtures; 9.3.2 Algorithm: EM for Gaussian mixtures; 9.3.3 Example
- 9.4 EM and K-means Clustering
- 9.5 Exercises and Answers

10. Radial Basis Networks
- 10.1 Cover's theorem: 10.1.1 Cover's theorem on the separability (1965)
- 10.2 Interpolation Problem: 10.2.1 Micchelli's Theorem
- 10.3 Radial Basis Function Networks: 10.3.1 Modifications of Radial Basis Function Networks; 10.3.2 Interpretation of Hidden Units
- 10.4 Exercises and Answers

11. Support Vector Machines
- 11.1 Margin
- 11.2 Optimal Hyperplane for Linear Separable Patterns
- 11.3 Support Vectors
- 11.4 Quadratic Optimization for Finding the Optimal Hyperplane: 11.4.1 Dual Problem
- 11.5 Optimal Hyperplane for Non-separable Patterns: 11.5.1 Dual Problem
- 11.6 Support Vector Machine as a Kernel Machine: 11.6.1 Kernel Trick; 11.6.2 Dual Problem; 11.6.3 Classification
- 11.7 Constructing Kernels: 11.7.1 Gaussian Kernel; 11.7.2 Sigmoidal Kernel; 11.7.3 Generative Model Kernels
- 11.8 Conclusion: 11.8.1 SVMs, MLPs and RBFNs
- 11.9 Exercises and Answers

12. Deep Learning
- 12.1 Introduction: 12.1.1 Loss Function; 12.1.2 Mini-Batch
- 12.2 Why Deep Networks? 12.2.1 Hierarchical Organization; 12.2.2 Boolean Functions; 12.2.3 Curse of dimensionality; 12.2.4 Local Minima; 12.2.5 Can represent big training sets; 12.2.6 Efficient Model Selection; 12.2.7 Criticism of Deep Neural Networks
- 12.3 Vanishing Gradients Problem: 12.3.1 Rectified Linear Unit (ReLU); 12.3.2 Residual Learning; 12.3.3 Batch Normalization
- 12.4 Regularization by Dropout
- 12.5 Weight Initialization
- 12.6 Faster Optimizers: 12.6.1 Momentum; 12.6.2 Nesterov Momentum; 12.6.3 AdaGrad; 12.6.4 RMSProp; 12.6.5 Adam; 12.6.6 Notation
- 12.7 Transfer Learning
- 12.8 Conclusion
- 12.9 Exercises and Answers

13. Convolutional Networks
- 13.1 Hierarchical Networks: 13.1.1 Biological Vision; 13.1.2 Neocognitron; 13.1.3 Map transformation cascade
- 13.2 Convolutional Neural Networks: 13.2.1 CNNs and Kernels in Image Processing; 13.2.2 Data Augmentation; 13.2.3 Case Studies
- 13.3 Exercises and Answers

14. Recurrent Networks
- 14.1 Sequence Modelling
- 14.2 Recurrent Neural Networks: 14.2.1 Elman recurrent neural networks; 14.2.2 Jordan recurrent neural networks; 14.2.3 Single Output; 14.2.4 Backpropagation Through Time; 14.2.5 Deep Recurrent Networks
- 14.3 Long Short Term Memory
- 14.4 Process Sequences
- 14.5 Exercises and Answers

15. Autoencoders
- 15.1 Eigenvectors and Eigenvalues
- 15.2 The Karhunen-Loève transform: 15.2.1 Principal component analysis
- 15.3 Singular Value Decomposition: 15.3.1 Example; 15.3.2 Pseudoinverse; 15.3.3 SVD and PCA
- 15.4 Autoencoders
- 15.5 Undercomplete Autoencoders
- 15.6 Overcomplete Autoencoders: 15.6.1 Denoising Autoencoders
- 15.7 Exercises and Answers

16. Epilogue

Bibliography
Index