Distributed Machine Learning with Python: Accelerating model training and serving with distributed systems

Length: 284 pages
Edition: 1
Language: English
Publisher: Packt Publishing
Publication Date: 2022-04-29
ISBN-10: 1801815690
ISBN-13: 9781801815697
Sales Rank: #171231

Build and deploy an efficient data processing pipeline for in-parallel machine learning model training in an elastic or multi-tenant cluster or cloud environment

Key Features

Accelerate model training and inference with order-of-magnitude time reduction
Learn state-of-the-art parallel schemes for both model training and serving
A detailed study of bottlenecks in the distributed model training and serving stages

Book Description

Reducing time cost in machine learning leads to a shorter waiting time for model training and a faster model updating cycle. Distributed machine learning enables machine learning practitioners to shorten model training and inference time by orders of magnitude. With the help of this practical guide, you’ll be able to put your Python development knowledge to work to get up and running with the implementation of distributed machine learning, including multi-node machine learning systems, in no time. You’ll begin by exploring how distributed systems work in the machine learning area and how distributed machine learning is applied to state-of-the-art deep learning models. As you advance, you’ll see how to use distributed systems to enhance machine learning model training and serving speed. You’ll also get to grips with applying data parallel and model parallel approaches before optimizing the in-parallel model training and serving pipeline in local clusters or cloud environments. By the end of this book, you’ll have gained the knowledge and skills needed to build and deploy an efficient data processing pipeline for machine learning model training and inference in a distributed manner.

What you will learn

Deploy distributed model training and serving pipelines
Get to grips with the advanced features in TensorFlow and PyTorch
Mitigate system bottlenecks during in-parallel model training and serving
Discover the latest techniques built on top of classical parallelism paradigms
Explore advanced features in Megatron-LM and Mesh-TensorFlow
Use state-of-the-art hardware such as NVLink, NVSwitch, and GPUs

Who this book is for

This book is for data scientists, machine learning engineers, and ML practitioners in both academia and industry. A fundamental understanding of machine learning concepts and a working knowledge of Python programming are assumed. Prior experience implementing ML/DL models with TensorFlow or PyTorch will be beneficial. You’ll find this book useful if you are interested in using distributed systems to boost machine learning model training and serving speed.

Distributed Machine Learning with Python
Contributors
About the author
About the reviewers
Preface
    Who this book is for
    What this book covers
    To get the most out of this book
    Download the example code files
    Download the color images
    Conventions used
    Get in touch
    Share Your Thoughts
Section 1 – Data Parallelism
Chapter 1: Splitting Input Data
    Single-node training is too slow
        The mismatch between data loading bandwidth and model training bandwidth
        Single-node training time on popular datasets
        Accelerating the training process with data parallelism
    Data parallelism – the high-level bits
        Stochastic gradient descent 
        Model synchronization 
    Hyperparameter tuning
        Global batch size
        Learning rate adjustment
        Model synchronization schemes
    Summary
Chapter 2: Parameter Server and All-Reduce
    Technical requirements
    Parameter server architecture
        Communication bottleneck in the parameter server architecture
        Sharding the model among parameter servers
    Implementing the parameter server
        Defining model layers
        Defining the parameter server
        Defining the worker
        Passing data between the parameter server and worker
    Issues with the parameter server 
        The parameter server architecture introduces a high coding complexity for practitioners
    All-Reduce architecture
        Reduce
        All-Reduce 
        Ring All-Reduce
    Collective communication 
        Broadcast
        Gather
        All-Gather
    Summary
Chapter 3: Building a Data Parallel Training and Serving Pipeline
    Technical requirements 
    The data parallel training pipeline in a nutshell
        Input pre-processing 
        Input data partition
        Data loading
        Training
        Model synchronization
        Model update
    Single-machine multi-GPUs and multi-machine multi-GPUs
        Single-machine multi-GPU
        Multi-machine multi-GPU
    Checkpointing and fault tolerance
        Model checkpointing
        Load model checkpoints
    Model evaluation and hyperparameter tuning
    Model serving in data parallelism
    Summary
Chapter 4: Bottlenecks and Solutions
    Communication bottlenecks in data parallel training
        Analyzing the communication workloads
        Parameter server architecture
        The All-Reduce architecture
        The inefficiency of state-of-the-art communication schemes
    Leveraging idle links and host resources
        Tree All-Reduce
        Hybrid data transfer over PCIe and NVLink
    On-device memory bottlenecks
    Recomputation and quantization
        Recomputation
        Quantization
    Summary
Section 2 – Model Parallelism
Chapter 5: Splitting the Model
    Technical requirements
    Single-node training error – out of memory
        Fine-tuning BERT on a single GPU
        Trying to pack a giant model inside one state-of-the-art GPU
    ELMo, BERT, and GPT
        Basic concepts
        RNN
        ELMo
        BERT
        GPT
    Pre-training and fine-tuning
    State-of-the-art hardware
        P100, V100, and DGX-1
        NVLink
        A100 and DGX-2
        NVSwitch
    Summary
Chapter 6: Pipeline Input and Layer Split
    Vanilla model parallelism is inefficient
        Forward propagation
        Backward propagation
        GPU idle time between forward and backward propagation
    Pipeline input
    Pros and cons of pipeline parallelism
        Advantages of pipeline parallelism
        Disadvantages of pipeline parallelism
    Layer split
    Notes on intra-layer model parallelism
    Summary
Chapter 7: Implementing Model Parallel Training and Serving Workflows
    Technical requirements
    Wrapping up the whole model parallelism pipeline
        A model parallel training overview
        Implementing a model parallel training pipeline
        Specifying communication protocol among GPUs
        Model parallel serving
    Fine-tuning transformers
    Hyperparameter tuning in model parallelism
        Balancing the workload among GPUs
        Enabling/disabling pipeline parallelism
    NLP model serving
    Summary
Chapter 8: Achieving Higher Throughput and Lower Latency
    Technical requirements
    Freezing layers
        Freezing layers during forward propagation
        Reducing computation cost during forward propagation
        Freezing layers during backward propagation
    Exploring memory and storage resources
    Understanding model decomposition and distillation
        Model decomposition
        Model distillation
    Reducing bits in hardware
    Summary
Section 3 – Advanced Parallelism Paradigms
Chapter 9: A Hybrid of Data and Model Parallelism
    Technical requirements
    Case study of Megatron-LM
        Layer split for model parallelism
        Row-wise trial-and-error approach
        Column-wise trial-and-error approach
        Cross-machine for data parallelism
    Implementation of Megatron-LM
    Case study of Mesh-TensorFlow
    Implementation of Mesh-TensorFlow
    Pros and cons of Megatron-LM and Mesh-TensorFlow
    Summary
Chapter 10: Federated Learning and Edge Devices
    Technical requirements
    Sharing knowledge without sharing data
        Recapping the traditional data parallel model training paradigm
        No input sharing among workers
        Communicating gradients for collaborative learning
    Case study: TensorFlow Federated
    Running edge devices with TinyML
    Case study: TensorFlow Lite
    Summary
Chapter 11: Elastic Model Training and Serving
    Technical requirements
    Introducing adaptive model training
        Traditional data parallel training 
        Adaptive model training in data parallelism
        Adaptive model training (AllReduce-based)
        Adaptive model training (parameter server-based)
        Traditional model-parallel model training paradigm
        Adaptive model training in model parallelism
    Implementing adaptive model training in the cloud
    Elasticity in model inference
        Serverless
    Summary
Chapter 12: Advanced Techniques for Further Speed-Ups
    Technical requirements
    Debugging and performance analytics
        General concepts in the profiling results
        Communication results analysis
        Computation results analysis
    Job migration and multiplexing
        Job migration
        Job multiplexing
    Model training in a heterogeneous environment
    Summary
    Why subscribe?
Other Books You May Enjoy
    Packt is searching for authors like you
    Share Your Thoughts

AI & Machine Learning, Data Modeling & Design, Data Warehousing, Database Storage & Design, Natural Language Processing, Python

How to download source code?

1. Go to: https://github.com/PacktPublishing

2. In the Find a repository… box, search for the book title: Distributed Machine Learning with Python: Accelerating model training and serving with distributed systems. If the full title returns no results, search for the main title only.