Establishing SRE Foundations: A Step-by-Step Guide to Introducing Site Reliability Engineering in Software Delivery Organizations
- Length: 560 pages
- Edition: 1
- Language: English
- Publisher: Addison-Wesley Professional
- Publication Date: 2022-10-08
- ISBN-10: 0137424604
- ISBN-13: 9780137424603
- Sales Rank: #673294
Improve Your Service Scalability and Reliability with SRE
“The techniques and principles of SRE are not only clearly defined here, but also the rationale behind them is explained in a way that will stick. This is not some dry definition, this is practical, usable understanding. . . . I can whole-heartedly recommend this book without any reservation. This is a very good book on an important topic that helps to move the game forward for our discipline!”
–From the Foreword by David Farley, Founder and CEO of Continuous Delivery Ltd.
Pioneered by Google to create more scalable and reliable large-scale systems, Site Reliability Engineering (SRE) has become one of today’s most valuable software innovation opportunities. Establishing SRE Foundations is a concise, practical guide that shows how to drive successful SRE adoption in your own organization. Dr. Vladyslav Ukis presents a step-by-step approach to establishing the right cultural, organizational, and technical process foundations, quickly achieving a “minimum viable SRE” and continually improving from there.
Dr. Ukis draws extensively on his own experiences leading an SRE transformation journey at a major healthcare company. Throughout, he answers specific questions that organizations ask about SRE, identifies pitfalls, and shows how to avoid or overcome them. Whatever your role in software development, engineering, or operations, this guide will help you apply SRE to improve what matters most: user and customer experience.
- Understand how SRE works, its role in software operations, and the challenges of SRE transformation
- Assess your organization’s current operations and readiness for SRE transformation
- Achieve organizational buy-in and initiate foundational activities, including SLO definitions, alerting, on-call rotations, incident response, and error budget-based decision-making
- Align organizational structures to support a full SRE transformation
- Measure the progress and success of your SRE initiative
- Sustain and advance your SRE transformation beyond the foundations
Register your book for convenient access to downloads, updates, and/or corrections as they become available. See inside book for details.
Cover Page
About This eBook
Halftitle Page
Title Page
Copyright Page
Pearson’s Commitment to Diversity, Equity, and Inclusion
Dedication Page
Contents
Foreword
Preface
Acknowledgments
About the Author
Part I: Foundations
Chapter 1. Introduction to SRE
1.1 Why SRE?
1.2 Alignment Using SRE
1.3 Why Does SRE Work?
1.4 Summary
Chapter 2. The Challenge
2.1 Misalignment
2.2 Collective Ownership
2.3 Ownership Using SRE
2.4 The Challenge Statement
2.5 Coaching
2.6 Summary
Chapter 3. SRE Basic Concepts
3.1 Service Level Indicators
3.2 Service Level Objectives
3.3 Error Budgets
3.4 Error Budget Policies
3.5 SRE Concept Pyramid
3.6 Alignment Using the SRE Concept Pyramid
3.7 Summary
Chapter 4. Assessing the Status Quo
4.1 Where Is the Organization?
4.2 Where Are the People?
4.3 Where Is the Tech?
4.4 Where Is the Culture?
4.5 Where Is the Process?
4.6 SRE Maturity Model
4.7 Posing Hypotheses
4.8 Summary
Part II: Running the Transformation
Chapter 5. Achieving Organizational Buy-In
5.1 Getting People Behind SRE
5.2 SRE Marketing Funnel
5.3 SRE Coaches
5.4 Top-Down Buy-In
5.5 Bottom-Up Buy-In
5.6 Lateral Buy-In
5.7 Buy-In Staggering
5.8 Team Coaching
5.9 Traversing the Organization
5.10 Organizational Coaching
5.11 Summary
Chapter 6. Laying Down the Foundations
6.1 Introductory Talks by Team
6.2 Conveying the Basics
6.3 SLI Standardization
6.4 Enabling Logging
6.5 Teaching the Log Query Language
6.6 Defining Initial SLOs
6.7 Default SLOs
6.8 Providing Basic Infrastructure
6.9 Engaging Champions
6.10 Dealing with Detractors
6.11 Creating Documentation
6.12 Broadcast Success
6.13 Summary
Chapter 7. Reacting to Alerts on SLO Breaches
7.1 Environment Selection
7.2 Responsibilities
7.3 Ways of Working
7.4 Setting Up On-Call Rotations
7.5 On-Call Management Tools
7.6 Out-of-Hours On-Call
7.7 Systematic Knowledge Sharing
7.8 Broadcast Success
7.9 Summary
Chapter 8. Implementing Alert Dispatching
8.1 Alert Escalation
8.2 Defining an Alert Escalation Policy
8.3 Defining Stakeholder Groups
8.4 Triggering Stakeholder Notifications
8.5 Defining Stakeholder Rings
8.6 Defining Effective Stakeholder Notifications
8.7 Getting the Stakeholders Subscribed
8.8 Broadcast Success
8.9 Summary
Chapter 9. Implementing Incident Response
9.1 Incident Response Foundations
9.2 Incident Priorities
9.3 Complex Incident Coordination
9.4 Incident Postmortems
9.5 Effective Postmortem Criteria
9.6 Mashing Up the Tools
9.7 Service Status Broadcast
9.8 Documenting the Incident Response Process
9.9 Broadcast Success
9.10 Summary
Chapter 10. Setting Up an Error Budget Policy
10.1 Motivation
10.2 Terminology
10.3 Error Budget Policy Structure
10.4 Error Budget Policy Conditions
10.5 Error Budget Policy Consequences
10.6 Error Budget Policy Governance
10.7 Extending the Error Budget Policy
10.8 Agreeing to the Error Budget Policy
10.9 Storing the Error Budget Policy
10.10 Enacting the Error Budget Policy
10.11 Reviewing the Error Budget Policy
10.12 Related Concepts
10.13 Summary
Chapter 11. Enabling Error Budget–Based Decision-Making
11.1 Reliability Decision-Making Taxonomy
11.2 Implementing SRE Indicators
11.3 Process Indicators, Not People KPIs
11.4 Decisions Versus Indicators
11.5 Decision-Making Workflows
11.6 Summary
Chapter 12. Implementing Organizational Structure
12.1 SRE Principles Versus Organizational Structure
12.2 Who Builds It, Who Runs It?
12.3 You Build It, You Run It
12.4 You Build It, You and SRE Run It
12.5 You Build It, SRE Run It
12.6 Cost Optimization
12.7 Team Topologies
12.8 Choosing a Model
12.9 A New Role: SRE
12.10 SRE Career Path
12.11 Communicating the Chosen Model
12.12 Introducing the Chosen Model
12.13 Summary
Part III: Measuring and Sustaining the Transformation
Chapter 13. Measuring the SRE Transformation
13.1 Testing Transformation Hypotheses
13.2 Outages Not Detected Internally
13.3 Services Exhausting Error Budgets Prematurely
13.4 Executives’ Perceptions
13.5 Reliability Perception by Users and Partners
13.6 Summary
Chapter 14. Sustaining the SRE Movement
14.1 Maturing the SRE CoP
14.2 SRE Minutes
14.3 Availability Newsletter
14.4 SRE Column in the Engineering Blog
14.5 Promote Long-Form SRE Wiki Articles
14.6 SRE Broadcasting
14.7 Combining SRE and CD Indicators
14.8 SRE Feedback Loops
14.9 New Hypotheses
14.10 Providing Learning Opportunities
14.11 Supporting SRE Coaches
14.12 Summary
Chapter 15. The Road Ahead
15.1 Service Catalog
15.2 SLAs
15.3 Regulatory Compliance
15.4 SRE Infrastructure
15.5 Game Days
Appendix: Topics for Quick Reference
SRE Wiki Content
Runbook Template Content
Incident Response Process Content
Postmortem Lifecycle
Operations Teams’ Responsibilities
SRE Online Communities
SRE Newsletters
SRE Conferences
SRE Indicators
Decision-Making Workflows
Index