Deep Learning with PyTorch Step-by-Step: A Beginner’s Guide
- Length: 1187 pages
- Edition: 1
- Language: English
- Publisher: Leanpub
- Publication Date: 2021-05-18
If you’re looking for a book that’s easy and enjoyable to read, where you can learn about Deep Learning and PyTorch without spending hours deciphering cryptic text and code, this is it 🙂
The book covers everything from the basics of gradient descent all the way up to fine-tuning large NLP models (BERT and GPT-2) using HuggingFace. It is divided into four parts:
- Part I: Fundamentals (gradient descent, training linear and logistic regressions in PyTorch)
- Part II: Computer Vision (deeper models and activation functions, convolutions, transfer learning, initialization schemes)
- Part III: Sequences (RNN, GRU, LSTM, seq2seq models, attention, self-attention, transformers)
- Part IV: Natural Language Processing (tokenization, embeddings, contextual word embeddings, ELMo, BERT, GPT-2)
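To give you a taste of Part I, here is a minimal sketch (my own illustration, not code from the book) of the five gradient descent steps applied to a simple linear regression in PyTorch; the synthetic data and hyperparameters are illustrative:

```python
import torch

# Synthetic data: y = 1 + 2x + noise
torch.manual_seed(42)
x = torch.rand(100, 1)
y = 1 + 2 * x + 0.1 * torch.randn(100, 1)

# Step 0: random initialization of the parameters
b = torch.randn(1, requires_grad=True)
w = torch.randn(1, requires_grad=True)

lr = 0.1
for epoch in range(1000):
    # Step 1: compute the model's predictions
    yhat = b + w * x
    # Step 2: compute the loss (mean squared error)
    loss = ((yhat - y) ** 2).mean()
    # Step 3: compute the gradients via autograd
    loss.backward()
    # Step 4: update the parameters (outside the graph)
    with torch.no_grad():
        b -= lr * b.grad
        w -= lr * w.grad
    b.grad.zero_()
    w.grad.zero_()
    # Step 5: rinse and repeat!

print(b.item(), w.item())  # both should end up close to the true 1 and 2
```

Notice that Step 3 is a single call to `backward()`: autograd computes the gradients for you, which is one of the reasons PyTorch makes this loop so short.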
This is not a typical book: most tutorials start with some nice and pretty image classification problem to illustrate how to use PyTorch. It may seem cool, but I believe it distracts you from the main goal: how does PyTorch work? In this book, I present a structured, incremental, first-principles approach to learning PyTorch (and we get to the pretty image classification problem in due time).
Moreover, this is not a formal book in any way: I am writing this book as if I were having a conversation with you, the reader. I will ask you questions (and give you answers shortly afterward) and I will also make (silly) jokes.
My job here is to make you understand the topic, so I will avoid fancy mathematical notation as much as possible and spell it out in plain English.
In this book, I will guide you through the development of many models in PyTorch, showing you why PyTorch makes it much easier and more intuitive to build models in Python: autograd, dynamic computation graph, model classes and much, much more.
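For instance, here is a tiny sketch (my own, assuming nothing beyond plain PyTorch) of a model class, the dynamic computation graph, and autograd working together:

```python
import torch
import torch.nn as nn

# A model class: subclass nn.Module and its parameters are tracked automatically
class LinearModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(1, 1)

    def forward(self, x):
        return self.linear(x)

model = LinearModel()
x = torch.tensor([[1.0], [2.0]])
y = torch.tensor([[3.0], [5.0]])

# The computation graph is built dynamically, while the forward pass runs...
loss = ((model(x) - y) ** 2).mean()
# ...and autograd traverses it backward, filling in .grad for every parameter
loss.backward()
for name, p in model.named_parameters():
    print(name, p.grad)
```

Because the graph is rebuilt on every forward pass, you can use ordinary Python control flow (ifs, loops) inside `forward` and still get correct gradients.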
We will build, step-by-step, not only the models themselves but also your understanding as I show you both the reasoning behind the code and how to avoid some common pitfalls and errors along the way.
I wrote this book for beginners in general – not only PyTorch beginners. Every now and then I will spend some time explaining fundamental concepts that I believe are key to a proper understanding of what’s going on in the code.
Maybe you already know some of those concepts well: if that’s the case, you can simply skip them, since I’ve made those explanations as independent as possible from the rest of the content.
Table of Contents

- Preface
- Acknowledgements
- About the Author
- Frequently Asked Questions (FAQ): Why PyTorch? Why this book? Who should read this book? What do I need to know? How to read this book? What’s Next?
- Setup Guide: Official Repository; Environment: Google Colab, Binder, or Local Installation (Anaconda, Conda virtual environments, PyTorch using GPU/CUDA or CPU, TensorBoard, GraphViz and Torchviz (optional), Git, Jupyter); Moving On

Each chapter opens with Spoilers, Jupyter Notebook, and Imports sections and closes with a Recap (most also have a "Putting It All Together" section).

Part I: Fundamentals
- Chapter 0: Visualizing Gradient Descent. The model; synthetic data generation; train-validation-test split; the five steps of gradient descent (random initialization, computing the model's predictions, computing the loss and its surface, computing and visualizing the gradients, backpropagation, updating the parameters, rinse and repeat!); learning rates (small, big, and very big); "bad" features and scaling / standardizing / normalizing; the path of gradient descent
- Chapter 1: A Simple Regression Problem. The gradient descent steps revisited; linear regression in Numpy; PyTorch tensors; loading data, devices, and CUDA; creating parameters; autograd (backward, grad, zero_, no_grad); the dynamic computation graph; optimizers (step / zero_grad); loss; models (parameters, state_dict, device, forward pass, train, nested and sequential models, layers); data preparation, model configuration, and model training
- Chapter 2: Rethinking the Training Loop. The training step; Dataset, TensorDataset, and DataLoader; the mini-batch inner loop; random split; evaluation; plotting losses; TensorBoard, inside a notebook or separately (SummaryWriter, add_graph, add_scalars); saving and loading models; resuming training; deploying / making predictions; setting the model’s mode
- Chapter 2.1: Going Classy. The class: constructor, arguments, placeholders, variables, and functions; training, saving and loading, and visualization methods; the full code; the classy pipeline; making predictions; checkpointing and resuming training
- Chapter 3: A Simple Classification Problem. Logits, probabilities, odds ratio, and log odds ratio; from logits to probabilities with the sigmoid; logistic regression; BCELoss and BCEWithLogitsLoss; imbalanced datasets; decision boundaries and classification thresholds; confusion matrix; metrics (true and false positive rates, precision and recall, accuracy); trade-offs and curves (low and high thresholds, ROC and PR curves, the precision quirk, best and worst curves); comparing models; further reading

Part II: Computer Vision
- Chapter 4: Classifying Images. NCHW vs NHWC; Torchvision (datasets, models, transforms on images and tensors, Normalize, composing transforms); data preparation (dataset transforms, SubsetRandomSampler, data augmentation, WeightedRandomSampler, seeds and more seeds); pixels as features; shallow, deep-ish, and deep models, with the math and the code; weights as pixels; activation functions (sigmoid, hyperbolic tangent, ReLU, Leaky ReLU, PReLU)
- Bonus Chapter: Feature Space. Two-dimensional feature spaces and transformations; a two-dimensional model; decision boundaries, activation style; more functions, more layers, and more dimensions mean more boundaries
- Chapter 5: Convolutions. Filters / kernels; convolving, moving around, and shapes; convolving in PyTorch; striding and padding; a REAL filter; pooling; flattening; typical architectures (LeNet-5); a multiclass classification problem; softmax, LogSoftmax, negative log-likelihood, and cross-entropy losses (a classification losses showdown!); visualizing filters, feature maps, and classifier layers; hooks; accuracy; loader apply
- Chapter 6: Rock, Paper, Scissors. The Rock Paper Scissors dataset; ImageFolder; standardization; three-channel convolutions; a fancier model; dropout (including two-dimensional dropout) and its regularizing effect; finding a learning rate; adaptive learning rates (moving averages, EWMA, EWMA meets gradients, Adam); SGD flavors (momentum, Nesterov); learning rate schedulers (epoch, validation loss, and mini-batch schedulers; scheduler paths; adaptive vs cycling)
- Chapter 7: Transfer Learning. ImageNet and the ILSVRC (AlexNet, VGG, Inception, ResNet); comparing architectures; transfer learning in practice (pre-trained models, adaptive pooling, loading weights, model freezing, the top of the model); generating a dataset of features; auxiliary classifiers (side-heads); 1x1 convolutions; inception modules; batch normalization (running statistics, evaluation phase, momentum, BatchNorm2d, other normalizations); residual connections (learning the identity, the power of shortcuts, residual blocks); fine-tuning vs feature extraction
- Extra Chapter: Vanishing and Exploding Gradients. The ball dataset and block model; weights, activations, and gradients; initialization schemes; batch normalization; gradient clipping (value clipping, norm clipping / gradient scaling, clipping with hooks)

Part III: Sequences
- Chapter 8: Sequences. Recurrent neural networks (RNN cells and layers, shapes, stacked and bidirectional RNNs); gated recurrent units (GRUs); long short-term memory (LSTM); the Square Models; visualizing hidden states and their journeys; variable-length sequences (padding, packing, unpacking, collate functions); 1D convolutions (shapes, multiple features or channels, dilation); the "There Can Be Only ONE" model
- Chapter 9 - Part I: Sequence-To-Sequence. The encoder-decoder architecture; teacher forcing; attention ("values", "keys", and "queries"; computing the context vector; scoring methods; scaled dot product); source masks; visualizing predictions and attention; multi-headed attention
- Chapter 9 - Part II: Sequence-To-Sequence. Self-attention (encoder); cross-attention (decoder); subsequent inputs and teacher forcing; target masks for training and for evaluation/prediction; sequential no more: positional encoding (PE); model assembly (self-attention "layers", attention heads)
- Chapter 10: Transform and Roll Out. Narrow attention and chunking; multi-headed attention; stacking encoders and decoders; wrapping "sub-layers"; transformer encoders and decoders; layer normalization (batch vs layer); projections or embeddings; the Transformer; the PyTorch Transformer; the Vision Transformer (patches, rearranging, embeddings, the special classifier token)

Part IV: Natural Language Processing
- Chapter 11: Down the Yellow Brick Rabbit Hole. Building a dataset; sentence and word tokenization; vocabularies; HuggingFace’s Dataset and Tokenizer; before word embeddings: one-hot encoding (OHE), bag-of-words (BoW), language models, N-grams, continuous bag-of-words (CBoW); word embeddings (Word2Vec, GloVe, and what is an embedding anyway?); using word embeddings (vocabulary coverage, special tokens’ embeddings); Model I: GloVe + classifier; Model II: GloVe + Transformer; contextual word embeddings (ELMo, BERT); document embeddings; Model III: preprocessed embeddings; BERT (tokenization, input embeddings, pretraining tasks: masked language model and next sentence prediction); Model IV: classifying using BERT; fine-tuning with HuggingFace (sequence classification or regression, tokenized datasets, Trainer, predictions, pipelines); GPT-2; "packed" datasets; generating text
- Thank You!
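As a taste of what Parts III and IV build toward, the scaled dot-product attention mechanism at the heart of transformers can be sketched in a few lines of PyTorch (an illustrative sketch of my own, not the book’s code):

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    # Scores: similarity between queries and keys, scaled by sqrt(d_k)
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5
    # Softmax over the keys turns scores into attention weights
    alphas = F.softmax(scores, dim=-1)
    # Context vector: attention-weighted average of the values
    return alphas @ v

# One batch, sequence length 3, hidden dimension 4
q = torch.randn(1, 3, 4)
k = torch.randn(1, 3, 4)
v = torch.randn(1, 3, 4)
context = scaled_dot_product_attention(q, k, v)
print(context.shape)  # torch.Size([1, 3, 4])
```

In multi-headed attention, this same computation simply runs several times in parallel on chunks of the hidden dimension, with the results concatenated.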