Reliable Machine Learning: Applying SRE Principles to ML in Production
- Length: 408 pages
- Edition: 1
- Language: English
- Publisher: O'Reilly Media
- Publication Date: 2022-11-01
- ISBN-10: 1098106229
- ISBN-13: 9781098106225
- Sales Rank: #373795
Whether you’re part of a small startup or a multinational corporation, this practical book shows data scientists, software and site reliability engineers, product managers, and business owners how to run ML reliably, effectively, and accountably within your organization. You’ll gain insight into everything from how to do model monitoring in production to how to run a well-tuned model development team in a product organization.
By applying an SRE mindset to machine learning, authors and engineering professionals Cathy Chen, Kranti Parisa, Niall Richard Murphy, D. Sculley, Todd Underwood, and featured guest authors show you how to run an efficient and reliable ML system. Whether you want to increase revenue, optimize decision making, solve problems, or understand and influence customer behavior, you’ll learn how to perform day-to-day ML tasks while keeping the bigger picture in mind.
You’ll examine:
- What ML is: how it functions and what it relies on
- Conceptual frameworks for understanding how ML “loops” work
- Effective “productionization,” and how ML systems can be made easily monitorable, deployable, and operable
- Why ML systems make production troubleshooting more difficult, and how to work around those difficulties
- How ML, product, and production teams can communicate effectively
Table of Contents
Foreword
Preface
- Why We Wrote This Book
- SRE as the Lens on ML
- Intended Audience
- How This Book Is Organized
  - Our Approach
  - Let’s Knit!
  - Navigating This Book
- About the Authors
- Conventions Used in This Book
- O’Reilly Online Learning
- How to Contact Us
- Acknowledgments
  - Cathy Chen
  - Niall Richard Murphy
  - Kranti Parisa
  - D. Sculley
  - Todd Underwood
1. Introduction
- The ML Lifecycle
  - Data Collection and Analysis
  - ML Training Pipelines
  - Build and Validate Applications
  - Quality and Performance Evaluation
  - Defining and Measuring SLOs
  - Launch
    - Models as code
    - Launch slowly
    - Release, not refactor
    - Isolate rollouts at the data layer
    - Measure SLOs during launch
    - Review the rollout
  - Monitoring and Feedback Loops
- Lessons from the Loop
2. Data Management Principles
- Data as Liability
- The Data Sensitivity of ML Pipelines
- Phases of Data
  - Creation
  - Ingestion
  - Processing
    - Validation
    - Cleaning and ensuring data consistency
    - Enriching and extending
  - Storage
  - Management
  - Analysis and Visualization
- Data Reliability
  - Durability
  - Consistency
  - Version Control
  - Performance
  - Availability
- Data Integrity
- Security
- Privacy
- Policy and Compliance
  - Jurisdictional rules
  - Reporting requirements
- Conclusion
3. Basic Introduction to Models
- What Is a Model?
- A Basic Model Creation Workflow
- Model Architecture Versus Model Definition Versus Trained Model
- Where Are the Vulnerabilities?
  - Training Data
    - Incomplete coverage
    - Spurious correlations
    - Cold start
    - Self-fulfilling prophecies and ML echo chambers
    - Changes in the world
  - Labels
    - Label noise
    - Wrong label objective
    - Fraud or malicious feedback
  - Training Methods
    - Overfitting
    - Lack of stability
    - Peculiarities of deep learning
  - Infrastructure and Pipelines
    - Platforms
    - Feature Generation
    - Upgrades and Fixes
- A Set of Useful Questions to Ask About Any Model
- An Example ML System
  - Yarn Product Click-Prediction Model
  - Features
  - Labels for Features
  - Model Updating
  - Model Serving
  - Common Failures
- Conclusion
4. Feature and Training Data
- Features
  - Feature Selection and Engineering
  - Lifecycle of a Feature
  - Feature Systems
    - Data ingestion system
    - Feature store
    - Feature quality evaluation system
- Labels
  - Human-Generated Labels
  - Annotation Workforces
  - Measuring Human Annotation Quality
  - An Annotation Platform
  - Active Learning and AI-Assisted Labeling
  - Documentation and Training for Labelers
- Metadata
  - Metadata Systems Overview
  - Dataset Metadata
  - Feature Metadata
  - Label Metadata
  - Pipeline Metadata
- Data Privacy and Fairness
  - Privacy
    - PII data and features
    - Private data and labeling
  - Fairness
- Conclusion
5. Evaluating Model Validity and Quality
- Evaluating Model Validity
- Evaluating Model Quality
  - Offline Evaluations
  - Evaluation Distributions
    - Held-out test data
    - Progressive validation
    - Golden sets
    - Stress-test distributions
    - Sliced analysis
    - Counterfactual testing
- A Few Useful Metrics
  - Canary metrics
  - Bias
  - Calibration
  - Classification metrics
    - Accuracy
    - Precision and recall
    - AUC ROC
    - Precision/recall curves
  - Regression metrics
    - Mean squared error and mean absolute error
    - Log loss
- Operationalizing Verification and Evaluation
- Conclusion
6. Fairness, Privacy, and Ethical ML Systems
- Fairness (a.k.a. Fighting Bias)
  - Definitions of Fairness
  - Reaching Fairness
  - Fairness as a Process Rather than an Endpoint
  - A Quick Legal Note
- Privacy
  - Methods to Preserve Privacy
    - Technical measures
    - Institutional measures
  - A Quick Legal Note
- Responsible AI
  - Explanation
  - Effectiveness
  - Social and Cultural Appropriateness
- Responsible AI Along the ML Pipeline
  - Use Case Brainstorming
  - Data Collection and Cleaning
  - Model Creation and Training
  - Model Validation and Quality Assessment
  - Model Deployment
  - Products for the Market
- Conclusion
7. Training Systems
- Requirements
- Basic Training System Implementation
  - Features
  - Feature Store
  - Model Management System
  - Orchestration
    - Job/process/resource scheduling system
    - ML framework
  - Quality Evaluation
  - Monitoring
- General Reliability Principles
  - Most Failures Will Not Be ML Failures
  - Models Will Be Retrained
  - Models Will Have Multiple Versions (at the Same Time!)
  - Good Models Will Become Bad
  - Data Will Be Unavailable
  - Models Should Be Improvable
  - Features Will Be Added and Changed
  - Models Can Train Too Fast
  - Resource Utilization Matters
  - Utilization != Efficiency
  - Outages Include Recovery
- Common Training Reliability Problems
  - Data Sensitivity
    - Example Data Problem at YarnIt
  - Reproducibility
    - Example Reproducibility Problem at YarnIt
  - Compute Resource Capacity
    - Example Capacity Problem at YarnIt
  - Structural Reliability
    - Organizational Challenges
- Ethics and Fairness Considerations
- Conclusion
8. Serving
- Key Questions for Model Serving
  - What Will Be the Load to Our Model?
  - What Are the Prediction Latency Needs of Our Model?
  - Where Does the Model Need to Live?
    - On a local machine
    - On servers owned or managed by our organization
    - In the cloud
    - On-device
  - What Are the Hardware Needs for Our Model?
  - How Will the Serving Model Be Stored, Loaded, Versioned, and Updated?
  - What Will Our Feature Pipeline for Serving Look Like?
- Model Serving Architectures
  - Offline Serving (Batch Inference)
    - Advantages
    - Disadvantages
  - Online Serving (Online Inference)
    - Advantages
    - Disadvantages
  - Model as a Service
    - Advantages
    - Disadvantages
  - Serving at the Edge
    - Advantages
    - Disadvantages
  - Choosing an Architecture
- Model API Design
- Testing
- Serving for Accuracy or Resilience?
- Scaling
  - Autoscaling
  - Caching
- Disaster Recovery
- Ethics and Fairness Considerations
- Conclusion
9. Monitoring and Observability for Models
- What Is Production Monitoring and Why Do It?
  - What Does It Look Like?
  - The Concerns That ML Brings to Monitoring
  - Reasons for Continual ML Observability—in Production
- Problems with ML Production Monitoring
  - Difficulties of Development Versus Serving
  - A Mindset Change Is Required
- Best Practices for ML Model Monitoring
  - Generic Pre-serving Model Recommendations
    - Explainability and monitoring
  - Training and Retraining
    - Concrete recommendations
  - Model Validation (Before Rollout)
    - Fallbacks in validation
    - Call to action
    - Concrete recommendations
  - Serving Model
    - Case 1: Real-time actuals
    - Case 2: Delayed actuals
    - Case 3: Biased actuals
    - Case 4: No/few actuals
    - Other approaches
    - Troubleshooting model performance metrics
  - Data Drift
    - Measuring drift
    - Troubleshooting drift
    - Data quality
    - Categorical data
    - Numerical data
    - Measuring data quality
  - Service
    - Optimizing performance of the model
    - Optimizing performance of the service
  - Other Things to Consider
    - SLOs in ML monitoring
    - Monitoring across services
    - Fairness in monitoring
    - Privacy in monitoring
    - Business impact
    - Dense data types (image, video, text documents, audio, and so on)
- High-Level Recommendations for Monitoring Strategy
- Conclusion
10. Continuous ML
- Anatomy of a Continuous ML System
  - Training Examples
  - Training Labels
  - Filtering Out Bad Data
  - Feature Stores and Data Management
  - Updating the Model
  - Pushing Updated Models to Serving
- Observations About Continuous ML Systems
  - External World Events May Influence Our Systems
  - Models Can Influence Their Own Training Data
  - Temporal Effects Can Arise at Several Timescales
  - Emergency Response Must Be Done in Real Time
    - Stop training
    - Fall back
    - Roll back
    - Remove bad data
    - Roll through
    - Choosing a response strategy
    - Organizational considerations
  - New Launches Require Staged Ramp-ups and Stable Baselines
  - Models Must Be Managed Rather Than Shipped
- Continuous Organizations
- Rethinking Noncontinuous ML Systems
- Conclusion
11. Incident Response
- Incident Management Basics
  - Life of an Incident
  - Incident Response Roles
- Anatomy of an ML-Centric Outage
  - Terminology Reminder: Model
  - Story Time
    - Story 1: Searching but Not Finding
    - Stages of ML incident response for story 1
    - Story 2: Suddenly Useless Partners
    - Stages of ML incident response for story 2
    - Story 3: Recommend You Find New Suppliers
    - Stages of ML incident response for story 3
- ML Incident Management Principles
  - Guiding Principles
  - Model Developer or Data Scientist
    - Preparation
    - Incident handling
    - Continuous improvement
  - Software Engineer
    - Preparation
    - Incident handling
    - Continuous improvement
  - ML SRE or Production Engineer
    - Preparation
    - Incident handling
    - Continuous improvement
  - Product Manager or Business Leader
    - Preparation
    - Incident handling
    - Continuous improvement
- Special Topics
  - Production Engineers and ML
  - Engineering Versus Modeling
  - The Ethical On-Call Engineer Manifesto
    - Impact
    - Cause
    - Troubleshooting
    - Solutions and a call to action
- Conclusion
12. How Product and ML Interact
- Different Types of Products
- Agile ML?
- ML Product Development Phases
  - Discovery and Definition
  - Business Goal Setting
  - MVP Construction and Validation
  - Model and Product Development
  - Deployment
  - Support and Maintenance
- Build Versus Buy
  - Models
    - Generic use cases
    - Company’s data initiatives
  - Data Processing Infrastructure
  - End-to-End Platforms
  - Scoring Approach for Making the Decision
  - Making the Decision
- Sample YarnIt Store Features Powered by ML
  - Showcasing Popular Yarns by Total Sales
  - Recommendations Based on Browsing History
  - Cross-selling and Upselling
  - Content-Based Filtering
  - Collaborative Filtering
- Conclusion
13. Integrating ML into Your Organization
- Chapter Assumptions
  - Leader-Based Viewpoint
  - Detail Matters
  - ML Needs to Know About the Business
  - The Most Important Assumption You Make
- The Value of ML
- Significant Organizational Risks
  - ML Is Not Magic
  - Mental (Way of Thinking) Model Inertia
  - Surfacing Risk Correctly in Different Cultures
  - Siloed Teams Don’t Solve All Problems
- Implementation Models
  - Remembering the Goal
  - Greenfield Versus Brownfield
  - ML Roles and Responsibilities
  - How to Hire ML Folks
- Organizational Design and Incentives
  - Strategy
  - Structure
  - Processes
  - Rewards
  - People
  - A Note on Sequencing
- Conclusion
14. Practical ML Org Implementation Examples
- Scenario 1: A New Centralized ML Team
  - Background and Organizational Description
  - Process
  - Rewards
  - People
  - Default Implementation
- Scenario 2: Decentralized ML Infrastructure and Expertise
  - Background and Organizational Description
  - Process
  - Rewards
  - People
  - Default Implementation
- Scenario 3: Hybrid with Centralized Infrastructure/Decentralized Modeling
  - Background and Organizational Description
  - Process
  - Rewards
  - People
  - Default Implementation
- Conclusion
15. Case Studies: MLOps in Practice
- 1. Accommodating Privacy and Data Retention Policies in ML Pipelines
  - Background
  - Problem and Resolution
    - Challenge 1: Which dialects?
    - Solution: Get rid of the concept of dialects!
    - Challenge 2: Racing the clock
    - Solutions (and new challenges!)
  - Takeaways
- 2. Continuous ML Model Impacting Traffic
  - Background
  - Problem and Resolution
  - Takeaways
- 3. Steel Inspection
  - Background
  - Problem and Resolution
  - Takeaways
- 4. NLP MLOps: Profiling and Staging Load Test
  - Background
  - Problem and Resolution
    - An improved process for benchmarking
  - Takeaways
- 5. Ad Click Prediction: Databases Versus Reality
  - Background
  - Problem and Resolution
  - Takeaways
- 6. Testing and Measuring Dependencies in ML Workflow
  - Background
  - Problem and Resolution
    - Building the regression-testing sandbox
    - Monitoring for regression
  - Takeaways
Index
How to download the source code
1. Go to https://www.oreilly.com/
2. Search for the book title: Reliable Machine Learning: Applying SRE Principles to ML in Production. If the search returns no results, search for the main title only.
3. Click the book title in the search results.
4. In the Publisher resources section, click Download Example Code.
To download files from this site:
1. Disable the AdBlock plugin; otherwise, the download links may not appear.
2. Solve the CAPTCHA.
3. Click the download link.
4. You will be taken to the download server to download the file.