Scaling Machine Learning with Spark: Distributed ML with MLlib, TensorFlow, and PyTorch
- Length: 300 pages
- Edition: 1
- Language: English
- Publisher: O'Reilly Media
- Publication Date: 2023-04-18
- ISBN-10: 1098106822
- ISBN-13: 9781098106829
- Sales Rank: #1161120
Get up to speed on Apache Spark, the popular engine for large-scale data processing, including machine learning and analytics. If you’re looking to expand your skill set or advance your career in scalable machine learning with MLlib, distributed PyTorch, and distributed TensorFlow, this practical guide is for you. Using Spark as your main data processing platform, you’ll discover several open source technologies designed and built for enriching Spark’s ML capabilities.
Scaling Machine Learning with Spark examines various technologies for building end-to-end distributed ML workflows based on the Apache Spark ecosystem with Spark MLlib, MLflow, TensorFlow, PyTorch, and Petastorm. This book shows you when to use each technology and why. If you’re a data scientist working with machine learning, you’ll learn how to:
- Build practical distributed machine learning workflows, including feature engineering and data formats
- Extend deep learning functionalities beyond Spark by bridging into distributed TensorFlow and PyTorch
- Manage your machine learning experiment lifecycle with MLflow
- Use Petastorm as a storage layer for bridging data from Spark into TensorFlow and PyTorch
- Use machine learning terminology to understand distribution strategies
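To make the last item concrete: in synchronous data-parallel training, every worker holds a full model replica, computes gradients on its own data shard, and the gradients are averaged across workers (an all-reduce) so each replica applies the identical update. A minimal, stdlib-only Python sketch of this idea, using a toy one-parameter linear model (the function names and data are illustrative, not code from the book):

```python
import random

def local_gradient(w, shard):
    # Gradient of mean squared error for the model y = w * x on one worker's shard
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def allreduce_mean(grads):
    # Synchronous all-reduce: every worker ends up with the same averaged gradient
    return sum(grads) / len(grads)

def train_data_parallel(shards, lr=0.2, steps=200):
    w = 0.0  # every replica starts from the same initial weight
    for _ in range(steps):
        # In a real cluster each gradient is computed in parallel on its worker
        grads = [local_gradient(w, shard) for shard in shards]
        g = allreduce_mean(grads)
        w -= lr * g  # identical update on every replica keeps them in sync
    return w

if __name__ == "__main__":
    random.seed(0)
    data = [(x, 3.0 * x) for x in [random.uniform(-1, 1) for _ in range(400)]]
    shards = [data[i::4] for i in range(4)]  # 4 workers, one shard each
    w = train_data_parallel(shards)
    print(round(w, 2))  # converges to the true slope 3.0
```

Real frameworks (Horovod, `tf.distribute.MultiWorkerMirroredStrategy`, PyTorch DDP, all covered in the book) implement exactly this loop, with the all-reduce running over the network instead of a local list.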
Table of Contents

- Preface: Who Should Read This Book?; Do You Need Distributed Machine Learning?; Navigating This Book; What Is Not Covered; The Environment and Tools (The Tools, The Datasets); Conventions Used in This Book; Using Code Examples; O’Reilly Online Learning; How to Contact Us; Acknowledgments
- 1. Distributed Machine Learning Terminology and Concepts
  - The Stages of the Machine Learning Workflow
  - Tools and Technologies in the Machine Learning Pipeline
  - Distributed Computing Models: General-Purpose Models (MapReduce, MPI, Barrier, Shared memory); Dedicated Distributed Computing Models
  - Introduction to Distributed Systems Architecture: Centralized Versus Decentralized Systems; Interaction Models (Client/server, Peer-to-peer, Geo-distributed); Communication in a Distributed Setting (Asynchronous, Synchronous)
  - Introduction to Ensemble Methods: High Versus Low Bias; Types of Ensemble Methods; Distributed Training Topologies (Centralized ensemble learning, Decentralized decision trees, Centralized distributed training with parameter servers, Centralized distributed training in a P2P topology)
  - The Challenges of Distributed Machine Learning Systems: Performance (Data parallelism versus model parallelism, Combining data parallelism and model parallelism, Deep learning); Resource Management; Fault Tolerance; Privacy; Portability
  - Setting Up Your Local Environment: Chapters 2–6 Tutorials Environment; Chapters 7–10 Tutorials Environment
  - Summary
- 2. Introduction to Spark and PySpark
  - Apache Spark Architecture
  - Intro to PySpark
  - Apache Spark Basics: Software Architecture (Creating a custom schema, Key Spark data abstractions and APIs, DataFrames are immutable); PySpark and Functional Programming; Executing PySpark Code
  - pandas DataFrames Versus Spark DataFrames
  - Scikit-Learn Versus MLlib
  - Summary
- 3. Managing the Machine Learning Experiment Lifecycle with MLflow
  - Machine Learning Lifecycle Management Requirements
  - What Is MLflow?: Software Components of the MLflow Platform; Users of the MLflow Platform
  - MLflow Components: MLflow Tracking (Using MLflow Tracking to record runs, Logging your dataset path and version); MLflow Projects; MLflow Models; MLflow Model Registry (Registering models, Transitioning between model stages)
  - Using MLflow at Scale
  - Summary
- 4. Data Ingestion, Preprocessing, and Descriptive Statistics
  - Data Ingestion with Spark: Working with Images (Image format, Binary format); Working with Tabular Data
  - Preprocessing Data: Preprocessing Versus Processing; Why Preprocess the Data?; Data Structures; MLlib Data Types; Preprocessing with MLlib Transformers (Working with text data, From nominal categorical features to indices, Structuring continuous numerical data, Additional transformers); Preprocessing Image Data (Extracting labels, Transforming labels to indices, Extracting image size)
  - Save the Data and Avoid the Small Files Problem: Avoiding small files; Image compression and Parquet
  - Descriptive Statistics: Getting a Feel for the Data: Calculating Statistics; Descriptive Statistics with Spark Summarizer; Data Skewness; Correlation (Pearson correlation, Spearman correlation)
  - Summary
- 5. Feature Engineering
  - Features and Their Impact on Models
  - MLlib Featurization Tools: Extractors; Selectors; Example: Word2Vec
  - The Image Featurization Process: Understanding Image Manipulation (Grayscale, Defining image boundaries using image gradients); Extracting Features with Spark APIs (pyspark.sql.functions: pandas_udf and Python type hints; pyspark.sql.GroupedData: applyInPandas and mapInPandas)
  - The Text Featurization Process: Bag-of-Words; TF-IDF; N-Gram; Additional Techniques
  - Enriching the Dataset
  - Summary
- 6. Training Models with Spark MLlib
  - Algorithms
  - Supervised Machine Learning: Classification (MLlib classification algorithms, Implementing multilabel classification support, What about imbalanced class labels?); Regression; Recommendation systems (ALS for collaborative filtering)
  - Unsupervised Machine Learning: Frequent Pattern Mining; Clustering
  - Evaluating: Supervised Evaluators; Unsupervised Evaluators
  - Hyperparameters and Tuning Experiments: Building a Parameter Grid; Splitting the Data into Training and Test Sets; Cross-Validation: A Better Way to Test Your Models
  - Machine Learning Pipelines: Constructing a Pipeline; How Does Splitting Work with the Pipeline API?
  - Persistence
  - Summary
- 7. Bridging Spark and Deep Learning Frameworks
  - The Two Clusters Approach
  - Implementing a Dedicated Data Access Layer: Features of a DAL; Selecting a DAL
  - What Is Petastorm?: SparkDatasetConverter; Petastorm as a Parquet Store
  - Project Hydrogen: Barrier Execution Mode; Accelerator-Aware Scheduling
  - A Brief Introduction to the Horovod Estimator API
  - Summary
- 8. TensorFlow Distributed Machine Learning Approach
  - A Quick Overview of TensorFlow: What Is a Neural Network?; TensorFlow Cluster Process Roles and Responsibilities
  - Loading Parquet Data into a TensorFlow Dataset
  - An Inside Look at TensorFlow’s Distributed Machine Learning Strategies: ParameterServerStrategy; CentralStorageStrategy (One Machine, Multiple Processors); MirroredStrategy (One Machine, Multiple Processors, Local Copy); MultiWorkerMirroredStrategy (Multiple Machines, Synchronous); TPUStrategy; What Things Change When You Switch Strategies?
  - Training APIs: Keras API (MobileNetV2 transfer learning case study, Training the Keras MobileNetV2 algorithm from scratch); Custom Training Loop; Estimator API
  - Putting It All Together
  - Troubleshooting
  - Summary
- 9. PyTorch Distributed Machine Learning Approach
  - A Quick Overview of PyTorch Basics: Computation Graph; PyTorch Mechanics and Concepts
  - PyTorch Distributed Strategies for Training Models: Introduction to PyTorch’s Distributed Approach; Distributed Data-Parallel Training; RPC-Based Distributed Training (Remote execution, Remote references, Using RRefs to orchestrate distributed algorithms, Identifying objects by reference, Distributed autograd, The distributed optimizer); Communication Topologies in PyTorch (c10d) (Collective communication in PyTorch, Peer-to-peer communication in PyTorch); What Can We Do with PyTorch’s Low-Level APIs?
  - Loading Data with PyTorch and Petastorm
  - Troubleshooting Guidance for Working with Petastorm and Distributed PyTorch: The Enigma of Mismatched Data Types; The Mystery of Straggling Workers
  - How Does PyTorch Differ from TensorFlow?
  - Summary
- 10. Deployment Patterns for Machine Learning Models
  - Deployment Patterns: Pattern 1: Batch Prediction; Pattern 2: Model-in-Service; Pattern 3: Model-as-a-Service; Determining Which Pattern to Use
  - Production Software Requirements
  - Monitoring Machine Learning Models in Production: Data Drift; Model Drift, Concept Drift; Distributional Domain Shift (the Long Tail); What Metrics Should I Monitor in Production?; How Do I Measure Changes Using My Monitoring System? (Define a reference, Measure the reference against fresh metrics values, Algorithms to use for measuring); What It Looks Like in Production; The Production Feedback Loop
  - Deploying with MLlib: Production Machine Learning Pipelines with Structured Streaming
  - Deploying with MLflow: Defining an MLflow Wrapper; Deploying the Model as a Microservice; Loading the Model as a Spark UDF
  - How to Develop Your System Iteratively
  - Summary
- Index
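Chapter 10’s monitoring recipe (define a reference window of metric values, then measure fresh values against it) can be sketched in a few lines of plain Python. The function names, sample data, and the 0.5-standard-deviation threshold below are illustrative assumptions, not the book’s code:

```python
import statistics

def drift_score(reference, fresh):
    """Standardized mean shift between a reference window and fresh metric values."""
    ref_mean = statistics.fmean(reference)
    ref_std = statistics.pstdev(reference)
    if ref_std == 0:
        # A constant reference: any deviation at all counts as drift
        return 0.0 if statistics.fmean(fresh) == ref_mean else float("inf")
    return abs(statistics.fmean(fresh) - ref_mean) / ref_std

def has_drifted(reference, fresh, threshold=0.5):
    # Flag drift when the fresh window's mean moves more than `threshold`
    # reference standard deviations away from the reference mean
    return drift_score(reference, fresh) > threshold

if __name__ == "__main__":
    reference = [0.70, 0.72, 0.71, 0.69, 0.73, 0.70]  # e.g. daily accuracy at deploy time
    stable = [0.71, 0.70, 0.72]
    shifted = [0.55, 0.52, 0.50]
    print(has_drifted(reference, stable))   # False: within normal variation
    print(has_drifted(reference, shifted))  # True: accuracy has degraded
```

Production systems typically apply the same reference-versus-fresh comparison with more robust statistics (population stability index, KL divergence, Kolmogorov–Smirnov tests), but the control flow is the same: store a reference, compare each new window against it, alert past a threshold.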
How to download the source code
1. Go to https://www.oreilly.com/
2. Search for the book title: Scaling Machine Learning with Spark: Distributed ML with MLlib, TensorFlow, and PyTorch (if the full title returns no results, search for the main title only)
3. Click the book title in the search results
4. In the Publisher resources section, click Download Example Code
1. Disable any ad-blocking plugin; otherwise the download links may not appear.
2. Solve the CAPTCHA.
3. Click the download link.
4. You will be taken to the download server to complete the download.