Reliable Machine Learning: Applying SRE Principles to ML in Production
- Length: 408 pages
- Edition: 1
- Language: English
- Publisher: O'Reilly Media
- Publication Date: 2022-11-01
- ISBN-10: 1098106229
- ISBN-13: 9781098106225
- Sales Rank: #373795
Whether you’re part of a small startup or a multinational corporation, this practical book shows data scientists, software and site reliability engineers, product managers, and business owners how to run ML reliably, effectively, and accountably within your organization. You’ll gain insight into everything from how to do model monitoring in production to how to run a well-tuned model development team in a product organization.
By applying an SRE mindset to machine learning, authors and engineering professionals Cathy Chen, Kranti Parisa, Niall Richard Murphy, D. Sculley, Todd Underwood, and featured guest authors show you how to run an efficient and reliable ML system. Whether you want to increase revenue, optimize decision making, solve problems, or understand and influence customer behavior, you’ll learn how to perform day-to-day ML tasks while keeping the bigger picture in mind.
You’ll examine:
- What ML is: how it functions and what it relies on
- Conceptual frameworks for understanding how ML “loops” work
- Effective “productionization,” and how ML systems can be made easily monitorable, deployable, and operable
- Why ML systems make production troubleshooting more difficult, and how to work around those difficulties
- How ML, product, and production teams can communicate effectively
Table of Contents
Foreword
Preface
- Why We Wrote This Book
- SRE as the Lens on ML
- Intended Audience
- How This Book Is Organized
  - Our Approach
  - Let’s Knit!
  - Navigating This Book
- About the Authors
- Conventions Used in This Book
- O’Reilly Online Learning
- How to Contact Us
- Acknowledgments
  - Cathy Chen
  - Niall Richard Murphy
  - Kranti Parisa
  - D. Sculley
  - Todd Underwood
1. Introduction
- The ML Lifecycle
  - Data Collection and Analysis
  - ML Training Pipelines
  - Build and Validate Applications
  - Quality and Performance Evaluation
  - Defining and Measuring SLOs
  - Launch
    - Models as code
    - Launch slowly
    - Release, not refactor
    - Isolate rollouts at the data layer
    - Measure SLOs during launch
    - Review the rollout
  - Monitoring and Feedback Loops
- Lessons from the Loop
2. Data Management Principles
- Data as Liability
- The Data Sensitivity of ML Pipelines
- Phases of Data
  - Creation
  - Ingestion
  - Processing
    - Validation
    - Cleaning and ensuring data consistency
    - Enriching and extending
  - Storage
  - Management
  - Analysis and Visualization
- Data Reliability
  - Durability
  - Consistency
  - Version Control
  - Performance
  - Availability
- Data Integrity
- Security
- Privacy
- Policy and Compliance
  - Jurisdictional rules
  - Reporting requirements
- Conclusion
3. Basic Introduction to Models
- What Is a Model?
- A Basic Model Creation Workflow
- Model Architecture Versus Model Definition Versus Trained Model
- Where Are the Vulnerabilities?
  - Training Data
    - Incomplete coverage
    - Spurious correlations
    - Cold start
    - Self-fulfilling prophecies and ML echo chambers
    - Changes in the world
  - Labels
    - Label noise
    - Wrong label objective
    - Fraud or malicious feedback
  - Training Methods
    - Overfitting
    - Lack of stability
    - Peculiarities of deep learning
  - Infrastructure and Pipelines
    - Platforms
    - Feature Generation
    - Upgrades and Fixes
- A Set of Useful Questions to Ask About Any Model
- An Example ML System
  - Yarn Product Click-Prediction Model
  - Features
  - Labels for Features
  - Model Updating
  - Model Serving
  - Common Failures
- Conclusion
4. Feature and Training Data
- Features
  - Feature Selection and Engineering
  - Lifecycle of a Feature
  - Feature Systems
    - Data ingestion system
    - Feature store
    - Feature quality evaluation system
- Labels
  - Human-Generated Labels
  - Annotation Workforces
  - Measuring Human Annotation Quality
  - An Annotation Platform
  - Active Learning and AI-Assisted Labeling
  - Documentation and Training for Labelers
- Metadata
  - Metadata Systems Overview
  - Dataset Metadata
  - Feature Metadata
  - Label Metadata
  - Pipeline Metadata
- Data Privacy and Fairness
  - Privacy
    - PII data and features
    - Private data and labeling
  - Fairness
- Conclusion
5. Evaluating Model Validity and Quality
- Evaluating Model Validity
- Evaluating Model Quality
  - Offline Evaluations
  - Evaluation Distributions
    - Held-out test data
    - Progressive validation
    - Golden sets
    - Stress-test distributions
    - Sliced analysis
    - Counterfactual testing
- A Few Useful Metrics
  - Canary metrics
  - Bias
  - Calibration
  - Classification metrics
    - Accuracy
    - Precision and recall
    - AUC ROC
    - Precision/recall curves
  - Regression metrics
    - Mean squared error and mean absolute error
    - Log loss
- Operationalizing Verification and Evaluation
- Conclusion
6. Fairness, Privacy, and Ethical ML Systems
- Fairness (a.k.a. Fighting Bias)
  - Definitions of Fairness
  - Reaching Fairness
  - Fairness as a Process Rather than an Endpoint
  - A Quick Legal Note
- Privacy
  - Methods to Preserve Privacy
    - Technical measures
    - Institutional measures
  - A Quick Legal Note
- Responsible AI
  - Explanation
  - Effectiveness
  - Social and Cultural Appropriateness
- Responsible AI Along the ML Pipeline
  - Use Case Brainstorming
  - Data Collection and Cleaning
  - Model Creation and Training
  - Model Validation and Quality Assessment
  - Model Deployment
  - Products for the Market
- Conclusion
7. Training Systems
- Requirements
- Basic Training System Implementation
  - Features
  - Feature Store
  - Model Management System
  - Orchestration
    - Job/process/resource scheduling system
    - ML framework
  - Quality Evaluation
  - Monitoring
- General Reliability Principles
  - Most Failures Will Not Be ML Failures
  - Models Will Be Retrained
  - Models Will Have Multiple Versions (at the Same Time!)
  - Good Models Will Become Bad
  - Data Will Be Unavailable
  - Models Should Be Improvable
  - Features Will Be Added and Changed
  - Models Can Train Too Fast
  - Resource Utilization Matters
  - Utilization != Efficiency
  - Outages Include Recovery
- Common Training Reliability Problems
  - Data Sensitivity
    - Example Data Problem at YarnIt
  - Reproducibility
    - Example Reproducibility Problem at YarnIt
  - Compute Resource Capacity
    - Example Capacity Problem at YarnIt
  - Structural Reliability
    - Organizational Challenges
- Ethics and Fairness Considerations
- Conclusion
8. Serving
- Key Questions for Model Serving
  - What Will Be the Load to Our Model?
  - What Are the Prediction Latency Needs of Our Model?
  - Where Does the Model Need to Live?
    - On a local machine
    - On servers owned or managed by our organization
    - In the cloud
    - On-device
  - What Are the Hardware Needs for Our Model?
  - How Will the Serving Model Be Stored, Loaded, Versioned, and Updated?
  - What Will Our Feature Pipeline for Serving Look Like?
- Model Serving Architectures
  - Offline Serving (Batch Inference)
    - Advantages
    - Disadvantages
  - Online Serving (Online Inference)
    - Advantages
    - Disadvantages
  - Model as a Service
    - Advantages
    - Disadvantages
  - Serving at the Edge
    - Advantages
    - Disadvantages
  - Choosing an Architecture
- Model API Design
- Testing
- Serving for Accuracy or Resilience?
- Scaling
  - Autoscaling
  - Caching
- Disaster Recovery
- Ethics and Fairness Considerations
- Conclusion
9. Monitoring and Observability for Models
- What Is Production Monitoring and Why Do It?
  - What Does It Look Like?
  - The Concerns That ML Brings to Monitoring
  - Reasons for Continual ML Observability—in Production
- Problems with ML Production Monitoring
  - Difficulties of Development Versus Serving
  - A Mindset Change Is Required
- Best Practices for ML Model Monitoring
  - Generic Pre-serving Model Recommendations
    - Explainability and monitoring
  - Training and Retraining
    - Concrete recommendations
  - Model Validation (Before Rollout)
    - Fallbacks in validation
    - Call to action
    - Concrete recommendations
  - Serving Model
    - Case 1: Real-time actuals
    - Case 2: Delayed actuals
    - Case 3: Biased actuals
    - Case 4: No/few actuals
    - Other approaches
    - Troubleshooting model performance metrics
  - Data Drift
    - Measuring drift
    - Troubleshooting drift
    - Data quality
    - Categorical data
    - Numerical data
    - Measuring data quality
  - Service
    - Optimizing performance of the model
    - Optimizing performance of the service
  - Other Things to Consider
    - SLOs in ML monitoring
    - Monitoring across services
    - Fairness in monitoring
    - Privacy in monitoring
    - Business impact
    - Dense data types (image, video, text documents, audio, and so on)
- High-Level Recommendations for Monitoring Strategy
- Conclusion
10. Continuous ML
- Anatomy of a Continuous ML System
  - Training Examples
  - Training Labels
  - Filtering Out Bad Data
  - Feature Stores and Data Management
  - Updating the Model
  - Pushing Updated Models to Serving
- Observations About Continuous ML Systems
  - External World Events May Influence Our Systems
  - Models Can Influence Their Own Training Data
  - Temporal Effects Can Arise at Several Timescales
  - Emergency Response Must Be Done in Real Time
    - Stop training
    - Fall back
    - Roll back
    - Remove bad data
    - Roll through
    - Choosing a response strategy
    - Organizational considerations
  - New Launches Require Staged Ramp-ups and Stable Baselines
  - Models Must Be Managed Rather Than Shipped
- Continuous Organizations
- Rethinking Noncontinuous ML Systems
- Conclusion
11. Incident Response
- Incident Management Basics
  - Life of an Incident
  - Incident Response Roles
- Anatomy of an ML-Centric Outage
  - Terminology Reminder: Model
  - Story Time
    - Story 1: Searching but Not Finding
    - Stages of ML incident response for story 1
    - Story 2: Suddenly Useless Partners
    - Stages of ML incident response for story 2
    - Story 3: Recommend You Find New Suppliers
    - Stages of ML incident response for story 3
- ML Incident Management Principles
  - Guiding Principles
  - Model Developer or Data Scientist
    - Preparation
    - Incident handling
    - Continuous improvement
  - Software Engineer
    - Preparation
    - Incident handling
    - Continuous improvement
  - ML SRE or Production Engineer
    - Preparation
    - Incident handling
    - Continuous improvement
  - Product Manager or Business Leader
    - Preparation
    - Incident handling
    - Continuous improvement
- Special Topics
  - Production Engineers and ML
  - Engineering Versus Modeling
  - The Ethical On-Call Engineer Manifesto
    - Impact
    - Cause
    - Troubleshooting
    - Solutions and a call to action
- Conclusion
12. How Product and ML Interact
- Different Types of Products
- Agile ML?
- ML Product Development Phases
  - Discovery and Definition
  - Business Goal Setting
  - MVP Construction and Validation
  - Model and Product Development
  - Deployment
  - Support and Maintenance
- Build Versus Buy
  - Models
    - Generic use cases
    - Company’s data initiatives
  - Data Processing Infrastructure
  - End-to-End Platforms
  - Scoring Approach for Making the Decision
  - Making the Decision
- Sample YarnIt Store Features Powered by ML
  - Showcasing Popular Yarns by Total Sales
  - Recommendations Based on Browsing History
  - Cross-selling and Upselling
  - Content-Based Filtering
  - Collaborative Filtering
- Conclusion
13. Integrating ML into Your Organization
- Chapter Assumptions
  - Leader-Based Viewpoint
  - Detail Matters
  - ML Needs to Know About the Business
  - The Most Important Assumption You Make
- The Value of ML
- Significant Organizational Risks
  - ML Is Not Magic
  - Mental (Way of Thinking) Model Inertia
  - Surfacing Risk Correctly in Different Cultures
  - Siloed Teams Don’t Solve All Problems
- Implementation Models
  - Remembering the Goal
  - Greenfield Versus Brownfield
  - ML Roles and Responsibilities
  - How to Hire ML Folks
- Organizational Design and Incentives
  - Strategy
  - Structure
  - Processes
  - Rewards
  - People
  - A Note on Sequencing
- Conclusion
14. Practical ML Org Implementation Examples
- Scenario 1: A New Centralized ML Team
  - Background and Organizational Description
  - Process
  - Rewards
  - People
  - Default Implementation
- Scenario 2: Decentralized ML Infrastructure and Expertise
  - Background and Organizational Description
  - Process
  - Rewards
  - People
  - Default Implementation
- Scenario 3: Hybrid with Centralized Infrastructure/Decentralized Modeling
  - Background and Organizational Description
  - Process
  - Rewards
  - People
  - Default Implementation
- Conclusion
15. Case Studies: MLOps in Practice
- 1. Accommodating Privacy and Data Retention Policies in ML Pipelines
  - Background
  - Problem and Resolution
    - Challenge 1: Which dialects?
    - Solution: Get rid of the concept of dialects!
    - Challenge 2: Racing the clock
    - Solutions (and new challenges!)
  - Takeaways
- 2. Continuous ML Model Impacting Traffic
  - Background
  - Problem and Resolution
  - Takeaways
- 3. Steel Inspection
  - Background
  - Problem and Resolution
  - Takeaways
- 4. NLP MLOps: Profiling and Staging Load Test
  - Background
  - Problem and Resolution
    - An improved process for benchmarking
  - Takeaways
- 5. Ad Click Prediction: Databases Versus Reality
  - Background
  - Problem and Resolution
  - Takeaways
- 6. Testing and Measuring Dependencies in ML Workflow
  - Background
  - Problem and Resolution
    - Building the regression-testing sandbox
    - Monitoring for regression
  - Takeaways
Index
How to download the source code
1. Go to https://www.oreilly.com/
2. Search for the book title: Reliable Machine Learning: Applying SRE Principles to ML in Production. If the search returns no results, search for the main title only.
3. Click the book title in the search results.
4. In the Publisher resources section, click Download Example Code.
To download files from this site:
1. Disable the AdBlock plugin; otherwise, the download links may not appear.
2. Solve the CAPTCHA.
3. Click the download link.
4. You will be taken to the download server to download the file.