97 Things Every SRE Should Know: Collective Wisdom from the Experts
- Length: 252 pages
- Edition: 1
- Language: English
- Publisher: O'Reilly Media
- Publication Date: 2020-12-15
- ISBN-10: 1492081493
- ISBN-13: 9781492081494
- Sales Rank: #205533
Site reliability engineering (SRE) is more relevant than ever. Knowing how to keep systems reliable has become a critical skill. With this practical book, newcomers and old hats alike will explore a broad range of conversations happening in SRE. You’ll get actionable advice on several topics, including how to adopt SRE, why SLOs matter, when you need to upgrade your incident response, and how monitoring and observability differ.
Editors Jaime Woo and Emil Stolarsky, co-founders of Incident Labs, have collected 97 concise and useful tips from across the industry, including trusted best practices and new approaches to knotty problems. You’ll grow and refine your SRE skills through sound advice and thought-provoking questions that drive the direction of the field.
Some of the 97 things you should know:
- “Test Your Disaster Plan”–Tanya Reilly
- “Integrating Empathy into SRE Tools”–Daniella Niyonkuru
- “The Best Advice I Can Give to Teams”–Nicole Forsgren
- “Where to SRE”–Fatema Boxwala
- “Facing That First Page”–Andrew Louis
- “I Have an Error Budget, Now What?”–Alex Hidalgo
- “Get Your Work Recognized: Write a Brag Document”–Julia Evans and Karla Burnett
Preface
- How We Structured the Book
- O’Reilly Online Learning
- How to Contact Us
- Acknowledgments
I. New to SRE
1. “Site Reliability Engineering in Six Words”–Alex Hidalgo
2. “Do We Know Why We Really Want Reliability?”–Niall Murphy
3. “Building Self-Regulating Processes”–Denise Yu
4. “Four Engineers of an SRE Seder”–Jacob Scott
5. “The Reliability Stack”–Alex Hidalgo
6. “Infrastructure: It’s Where the Power Is”–Charity Majors
7. “Thinking About Resilience”–Justin Li
8. “Observability in the Development Cycle”–Charity Majors and Liz Fong-Jones
9. “There Is No Magic”–Bouke van der Bijl
10. “How Wikipedia Is Served to You”–Effie Mouzeli
11. “Why You Should Understand (a Little) About TCP”–Julia Evans
12. “The Importance of a Management Interface”–Salim Virji
13. “When It Comes to Storage, Think Distributed”–Salim Virji
14. “The Role of Cardinality”–Charity Majors and Liz Fong-Jones
15. “Security Is like an Onion”–Lucas Fontes
16. “Use Your Words”–Tanya Reilly
17. “Where to SRE”–Fatema Boxwala
18. “Dear Future Team”–Frances Rees
19. “Sustainability and Burnout”–Denise Yu
20. “Don’t Take Advice from Graybeards”–John Looney
21. “Facing That First Page”–Andrew Louis
II. Zero to One
22. “SRE, at Any Size, Is Cultural”–Matthew Huxtable
23. “Everyone Is an SRE in a Small Organization”–Matthew Huxtable
24. “Auditing Your Environment for Improvements”–Joan O’Callaghan
25. “With Incident Response, Start Small”–Thai Wood
26. “Solo SRE: Effecting Large-Scale Change as a Single Individual”–Ashley Poole
27. “Design Goals for SLO Measurement”–Ben Sigelman
28. “I Have an Error Budget—Now What?”–Alex Hidalgo
29. “How to Change Things”–Joan O’Callaghan
30. “Methodological Debugging”–Avishai Ish-Shalom and Nati Cohen
31. “How Startups Can Build an SRE Mindset”–Tamara Miner
32. “Bootstrapping SRE in Enterprises”–Vanessa Yiu
33. “It’s Okay Not to Know, and It’s Okay to Be Wrong”–Todd Palino
34. “Storytelling Is a Superpower”–Anita Clarke
35. “Get Your Work Recognized: Write a Brag Document”–Julia Evans and Karla Burnett
III. One to Ten
36. “Making Work Visible”–Lorin Hochstein
37. “An Overlooked Engineering Skill”–Murali Suriar
38. “Unpacking the On-Call Divide”–Jason Hand
39. “The Maestros of Incident Response”–Andrew Louis
   - Stop the Bleeding
   - What’s Everyone Doing?
40. “Effortless Incident Management”–Suhail Patel, Miles Bryant, and Chris Evans
41. “If You’re Doing Runbooks, Do Them Well”–Spike Lindsey
42. “Why I Hate Our Playbooks”–Frances Rees
43. “What Machines Do Well”–Michelle Brush
44. “Integrating Empathy into SRE Tools”–Daniella Niyonkuru
45. “Using ChatOps to Implement Empathy”–Daniella Niyonkuru
46. “Move Fast to Unbreak Things”–Michelle Brush
47. “You Don’t Know for Sure Until It Runs in Production”–Ingrid Epure
48. “Sometimes the Fix Is the Problem”–Jake Pittis
49. “Legendary”–Elise Gale
50. “Metrics Are Not SLIs (The Measure Everything Trap)”–Brian Murphy
51. “When SLOs Attack: Pathological SLOs and How to Fix Them”–Narayan Desai
52. “Holistic Approach to Product Reliability”–Kristine Chen and Bart Ponurkiewicz
53. “In Search of the Lost Time”–Ingrid Epure
54. “Unexpected Lessons from Office Hours”–Tamara Miner
55. “Building Tools for Internal Customers that They Actually Want to Use”–Vinessa Wan
56. “It’s About the Individuals and Interactions”–Vinessa Wan
57. “The Human Baseline in SRE”–Effie Mouzeli
58. “Remotely Productive or Productively Remote”–Avleen Vig
59. “Of Margins and Individuals”–Kurt Andersen
60. “The Importance of Margins in Systems”–Kurt Andersen
61. “Fewer Spreadsheets, More Napkins”–Jacob Bednarz
62. “Sneaking in Your DevOps Deliciously”–Vinessa Wan
63. “Effecting SRE Cultural Changes in Enterprises”–Vanessa Yiu
64. “To All the SREs I’ve Loved”–Felix Glaser
65. “Complex: The Most Overloaded Word in Technology”–Laura Nolan
IV. Ten to Hundred
66. “The Best Advice I Can Give to Teams”–Nicole Forsgren
67. “Create Your Supporting Artifacts”–Daria Barteneva and Eva Parish
68. “The Order of Operations for Getting SLO Buy-In”–David K. Rensin
69. “Heroes Are Necessary, but Hero Culture Is Not”–Lei Lopez
70. “On-Call Rotations that People Want to Join”–Miles Bryant, Chris Evans, and Suhail Patel
71. “Study of Human Factors and Team Culture to Improve Pager Fatigue”–Daria Barteneva
72. “Optimize for MTTBTB (Mean Time to Back to Bed)”–Spike Lindsey
73. “Mitigating and Preventing Cascading Failures”–Rita Lu
74. “On-Call Health: The Metric You Could Be Measuring”–Caitie McCaffrey
75. “Helping Leaders Prioritize On-Call Health”–Caitie McCaffrey
   - Bring Quantitative Data
   - Link SLAs to On-Call Health
   - Treat On-Call Health like a Feature
   - Measure Attrition
76. “The SRE as a Diplomat”–Johnny Boursiquot
77. “The Forward-Deployed SRE”–Johnny Boursiquot
78. “Test Your Disaster Plan”–Tanya Reilly
79. “Why Training Matters to an SRE Practice and SRE Matters to Your Training Program”–Jennifer Petoff
80. “The Power of Uniformity”–Chris Evans, Suhail Patel, and Miles Bryant
81. “Bytes per User Value”–Arshia Mufti
82. “Make Your Engineering Blog a Priority”–Anita Clarke
83. “Don’t Let Anyone Run Code in Your Context”–John Looney
84. “Trading Places: SRE and Product”–Shubheksha Jalan
85. “You See Teams, I See Product”–Avleen Vig
86. “The Performance Emergency Fund”–Dawn Parzych
87. “Important but Not Urgent: Roadmaps for SREs”–Laura Nolan
V. The Future of SRE
88. “That 50% Thing”–Tanya Reilly
89. “Following the Path of Safety-Critical Systems”–Heidy Khlaaf
90. “Applicable and Achievable Static Analysis”–Heidy Khlaaf
91. “The Importance of Formal Specification”–Hillel Wayne
92. “Risk and Rot in Sociotechnical Systems”–Laura Nolan
93. “SRE in Crisis”–Niall Murphy
94. “Expected Risk Limitations”–Blake Bisset
95. “Beyond Local Risk: Accounting for Angry Birds”–Blake Bisset
96. “A Word from Software Safety Nerds”–J. Paul Reed
97. “Incidents: A Window into Gaps”–Lorin Hochstein
98. “The Third Age of SRE”–Björn “Beorn” Rabenstein
Contributors
Index
About the Editors
How to download the source code
1. Go to https://www.oreilly.com/
2. Search for the book title: 97 Things Every SRE Should Know: Collective Wisdom from the Experts. If the search returns no results, search for the main title only.
3. Click the book title in the search results.
4. In the Publisher resources section, click Download Example Code.