Scaling Machine Learning with Spark: Distributed ML with MLlib, TensorFlow, and PyTorch
- Length: 300 pages
- Edition: 1
- Language: English
- Publisher: O'Reilly Media
- Publication Date: 2023-04-18
- ISBN-10: 1098106822
- ISBN-13: 9781098106829
- Sales Rank: #1161120
Get up to speed on Apache Spark, the popular engine for large-scale data processing, including machine learning and analytics. If you’re looking to expand your skill set or advance your career in scalable machine learning with MLlib, distributed PyTorch, and distributed TensorFlow, this practical guide is for you. Using Spark as your main data processing platform, you’ll discover several open source technologies designed and built for enriching Spark’s ML capabilities.
Scaling Machine Learning with Spark examines various technologies for building end-to-end distributed ML workflows based on the Apache Spark ecosystem with Spark MLlib, MLflow, TensorFlow, PyTorch, and Petastorm. This book shows you when to use each technology and why. If you’re a data scientist working with machine learning, you’ll learn how to:
- Build practical distributed machine learning workflows, including feature engineering and data formats
- Extend deep learning functionalities beyond Spark by bridging into distributed TensorFlow and PyTorch
- Manage your machine learning experiment lifecycle with MLflow
- Use Petastorm as a storage layer for bridging data from Spark into TensorFlow and PyTorch
- Use machine learning terminology to understand distribution strategies
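To make the last item concrete: in synchronous data-parallel training, every worker holds a full model replica, computes gradients on its own data shard, and the gradients are averaged across workers (an all-reduce) so each replica applies the identical update. A minimal, stdlib-only Python sketch of this idea, using a toy one-parameter linear model (the function names and data are illustrative, not code from the book):

```python
import random

def local_gradient(w, shard):
    # Gradient of mean squared error for the model y = w * x on one worker's shard
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def allreduce_mean(grads):
    # Synchronous all-reduce: every worker ends up with the same averaged gradient
    return sum(grads) / len(grads)

def train_data_parallel(shards, lr=0.2, steps=200):
    w = 0.0  # every replica starts from the same initial weight
    for _ in range(steps):
        # In a real cluster each gradient is computed in parallel on its worker
        grads = [local_gradient(w, shard) for shard in shards]
        g = allreduce_mean(grads)
        w -= lr * g  # identical update on every replica keeps them in sync
    return w

if __name__ == "__main__":
    random.seed(0)
    data = [(x, 3.0 * x) for x in [random.uniform(-1, 1) for _ in range(400)]]
    shards = [data[i::4] for i in range(4)]  # 4 workers, one shard each
    w = train_data_parallel(shards)
    print(round(w, 2))  # converges to the true slope 3.0
```

Real frameworks (Horovod, `tf.distribute.MultiWorkerMirroredStrategy`, PyTorch DDP, all covered in the book) implement exactly this loop, with the all-reduce running over the network instead of a local list.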
Table of Contents

- Preface: Who Should Read This Book?; Do You Need Distributed Machine Learning?; Navigating This Book; What Is Not Covered; The Environment and Tools (The Tools, The Datasets); Conventions Used in This Book; Using Code Examples; O’Reilly Online Learning; How to Contact Us; Acknowledgments
- 1. Distributed Machine Learning Terminology and Concepts
  - The Stages of the Machine Learning Workflow
  - Tools and Technologies in the Machine Learning Pipeline
  - Distributed Computing Models: General-Purpose Models (MapReduce, MPI, Barrier, Shared memory); Dedicated Distributed Computing Models
  - Introduction to Distributed Systems Architecture: Centralized Versus Decentralized Systems; Interaction Models (Client/server, Peer-to-peer, Geo-distributed); Communication in a Distributed Setting (Asynchronous, Synchronous)
  - Introduction to Ensemble Methods: High Versus Low Bias; Types of Ensemble Methods; Distributed Training Topologies (Centralized ensemble learning, Decentralized decision trees, Centralized distributed training with parameter servers, Centralized distributed training in a P2P topology)
  - The Challenges of Distributed Machine Learning Systems: Performance (Data parallelism versus model parallelism, Combining data parallelism and model parallelism, Deep learning); Resource Management; Fault Tolerance; Privacy; Portability
  - Setting Up Your Local Environment: Chapters 2–6 Tutorials Environment; Chapters 7–10 Tutorials Environment
  - Summary
- 2. Introduction to Spark and PySpark
  - Apache Spark Architecture
  - Intro to PySpark
  - Apache Spark Basics: Software Architecture (Creating a custom schema, Key Spark data abstractions and APIs, DataFrames are immutable); PySpark and Functional Programming; Executing PySpark Code
  - pandas DataFrames Versus Spark DataFrames
  - Scikit-Learn Versus MLlib
  - Summary
- 3. Managing the Machine Learning Experiment Lifecycle with MLflow
  - Machine Learning Lifecycle Management Requirements
  - What Is MLflow?: Software Components of the MLflow Platform; Users of the MLflow Platform
  - MLflow Components: MLflow Tracking (Using MLflow Tracking to record runs, Logging your dataset path and version); MLflow Projects; MLflow Models; MLflow Model Registry (Registering models, Transitioning between model stages)
  - Using MLflow at Scale
  - Summary
- 4. Data Ingestion, Preprocessing, and Descriptive Statistics
  - Data Ingestion with Spark: Working with Images (Image format, Binary format); Working with Tabular Data
  - Preprocessing Data: Preprocessing Versus Processing; Why Preprocess the Data?; Data Structures; MLlib Data Types; Preprocessing with MLlib Transformers (Working with text data, From nominal categorical features to indices, Structuring continuous numerical data, Additional transformers); Preprocessing Image Data (Extracting labels, Transforming labels to indices, Extracting image size)
  - Save the Data and Avoid the Small Files Problem: Avoiding small files; Image compression and Parquet
  - Descriptive Statistics: Getting a Feel for the Data: Calculating Statistics; Descriptive Statistics with Spark Summarizer; Data Skewness; Correlation (Pearson correlation, Spearman correlation)
  - Summary
- 5. Feature Engineering
  - Features and Their Impact on Models
  - MLlib Featurization Tools: Extractors; Selectors; Example: Word2Vec
  - The Image Featurization Process: Understanding Image Manipulation (Grayscale, Defining image boundaries using image gradients); Extracting Features with Spark APIs (pyspark.sql.functions: pandas_udf and Python type hints; pyspark.sql.GroupedData: applyInPandas and mapInPandas)
  - The Text Featurization Process: Bag-of-Words; TF-IDF; N-Gram; Additional Techniques
  - Enriching the Dataset
  - Summary
- 6. Training Models with Spark MLlib
  - Algorithms
  - Supervised Machine Learning: Classification (MLlib classification algorithms, Implementing multilabel classification support, What about imbalanced class labels?); Regression; Recommendation systems (ALS for collaborative filtering)
  - Unsupervised Machine Learning: Frequent Pattern Mining; Clustering
  - Evaluating: Supervised Evaluators; Unsupervised Evaluators
  - Hyperparameters and Tuning Experiments: Building a Parameter Grid; Splitting the Data into Training and Test Sets; Cross-Validation: A Better Way to Test Your Models
  - Machine Learning Pipelines: Constructing a Pipeline; How Does Splitting Work with the Pipeline API?
  - Persistence
  - Summary
- 7. Bridging Spark and Deep Learning Frameworks
  - The Two Clusters Approach
  - Implementing a Dedicated Data Access Layer: Features of a DAL; Selecting a DAL
  - What Is Petastorm?: SparkDatasetConverter; Petastorm as a Parquet Store
  - Project Hydrogen: Barrier Execution Mode; Accelerator-Aware Scheduling
  - A Brief Introduction to the Horovod Estimator API
  - Summary
- 8. TensorFlow Distributed Machine Learning Approach
  - A Quick Overview of TensorFlow: What Is a Neural Network?; TensorFlow Cluster Process Roles and Responsibilities
  - Loading Parquet Data into a TensorFlow Dataset
  - An Inside Look at TensorFlow’s Distributed Machine Learning Strategies: ParameterServerStrategy; CentralStorageStrategy (One Machine, Multiple Processors); MirroredStrategy (One Machine, Multiple Processors, Local Copy); MultiWorkerMirroredStrategy (Multiple Machines, Synchronous); TPUStrategy; What Things Change When You Switch Strategies?
  - Training APIs: Keras API (MobileNetV2 transfer learning case study, Training the Keras MobileNetV2 algorithm from scratch); Custom Training Loop; Estimator API
  - Putting It All Together
  - Troubleshooting
  - Summary
- 9. PyTorch Distributed Machine Learning Approach
  - A Quick Overview of PyTorch Basics: Computation Graph; PyTorch Mechanics and Concepts
  - PyTorch Distributed Strategies for Training Models: Introduction to PyTorch’s Distributed Approach; Distributed Data-Parallel Training; RPC-Based Distributed Training (Remote execution, Remote references, Using RRefs to orchestrate distributed algorithms, Identifying objects by reference, Distributed autograd, The distributed optimizer); Communication Topologies in PyTorch (c10d) (Collective communication in PyTorch, Peer-to-peer communication in PyTorch); What Can We Do with PyTorch’s Low-Level APIs?
  - Loading Data with PyTorch and Petastorm
  - Troubleshooting Guidance for Working with Petastorm and Distributed PyTorch: The Enigma of Mismatched Data Types; The Mystery of Straggling Workers
  - How Does PyTorch Differ from TensorFlow?
  - Summary
- 10. Deployment Patterns for Machine Learning Models
  - Deployment Patterns: Pattern 1: Batch Prediction; Pattern 2: Model-in-Service; Pattern 3: Model-as-a-Service; Determining Which Pattern to Use
  - Production Software Requirements
  - Monitoring Machine Learning Models in Production: Data Drift; Model Drift, Concept Drift; Distributional Domain Shift (the Long Tail); What Metrics Should I Monitor in Production?; How Do I Measure Changes Using My Monitoring System? (Define a reference, Measure the reference against fresh metrics values, Algorithms to use for measuring); What It Looks Like in Production; The Production Feedback Loop
  - Deploying with MLlib: Production Machine Learning Pipelines with Structured Streaming
  - Deploying with MLflow: Defining an MLflow Wrapper; Deploying the Model as a Microservice; Loading the Model as a Spark UDF
  - How to Develop Your System Iteratively
  - Summary
- Index
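Chapter 10’s monitoring recipe (define a reference window of metric values, then measure fresh values against it) can be sketched in a few lines of plain Python. The function names, sample data, and the 0.5-standard-deviation threshold below are illustrative assumptions, not the book’s code:

```python
import statistics

def drift_score(reference, fresh):
    """Standardized mean shift between a reference window and fresh metric values."""
    ref_mean = statistics.fmean(reference)
    ref_std = statistics.pstdev(reference)
    if ref_std == 0:
        # A constant reference: any deviation at all counts as drift
        return 0.0 if statistics.fmean(fresh) == ref_mean else float("inf")
    return abs(statistics.fmean(fresh) - ref_mean) / ref_std

def has_drifted(reference, fresh, threshold=0.5):
    # Flag drift when the fresh window's mean moves more than `threshold`
    # reference standard deviations away from the reference mean
    return drift_score(reference, fresh) > threshold

if __name__ == "__main__":
    reference = [0.70, 0.72, 0.71, 0.69, 0.73, 0.70]  # e.g. daily accuracy at deploy time
    stable = [0.71, 0.70, 0.72]
    shifted = [0.55, 0.52, 0.50]
    print(has_drifted(reference, stable))   # False: within normal variation
    print(has_drifted(reference, shifted))  # True: accuracy has degraded
```

Production systems typically apply the same reference-versus-fresh comparison with more robust statistics (population stability index, KL divergence, Kolmogorov–Smirnov tests), but the control flow is the same: store a reference, compare each new window against it, alert past a threshold.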
How to download the source code
1. Go to https://www.oreilly.com/
2. Search for the book title: Scaling Machine Learning with Spark: Distributed ML with MLlib, TensorFlow, and PyTorch (if the full title returns no results, search for the main title only)
3. Click the book title in the search results
4. In the Publisher resources section, click Download Example Code
1. Disable any ad-blocking plugin; otherwise the download links may not appear.
2. Solve the CAPTCHA.
3. Click the download link.
4. You will be taken to the download server to complete the download.