Hands-on Site Reliability Engineering: Build Capability to Design, Deploy, Monitor, and Sustain Enterprise Software Systems at Scale

by Shamayel Mohammed Farooqui, Vishnu Vardhan Chikoti

Length: 236 pages
Edition: 1
Language: English
Publisher: BPB Publications
Publication Date: 2021-07-06
ISBN-10: 9391030327
ISBN-13: 9789391030322
Sales Rank: #1301074

A comprehensive guide with basic to advanced SRE practices and hands-on examples.

Key Features

Demonstrates how to execute site reliability engineering along with fundamental concepts.
Illustrates real-world examples and successful techniques to put SRE into production.
Introduces you to DevOps, advanced techniques of SRE, and popular tools in use.

Description

Hands-on Site Reliability Engineering (SRE) is a guide to learning and practicing the activities that keep enterprise systems running smoothly, from the design and deployment of enterprise software through to operating it efficiently and reliably at scale.

The book explores the fundamentals of SRE and the related terms, concepts, and techniques used by SRE teams and experts. It discusses the essential elements of an IT system, including microservices, application architectures, types of software deployment, and concepts like load balancing. It explains techniques for delivering timely software releases using containerization and CI/CD pipelines. It also covers how to track and monitor application performance using Grafana, Prometheus, and Kibana, and how to extend monitoring by building full-stack observability into the system.

The book also covers chaos engineering, types of system failures, design for high availability, DevSecOps, and AIOps.

What you will learn

Learn the best techniques and practices for building and running reliable software.
Explore observability and popular methods for effective monitoring of applications.
Work with SLIs, SLOs, Error Budgets, and Error Budget Policies to manage failures.

Who this book is for

This book caters to experienced IT professionals, application developers, software engineers, and all those who are looking to develop SRE capabilities at the individual or team level.

About the Authors

Shamayel M. Farooqui is a technology leader who specializes in driving digital transformation for organizations and is the author of ‘Enterprise DevOps Framework – Transforming IT Operations’.
He has expertise in implementing IT security, cloud migrations, and IT automation, and a proven track record of building teams of skilled site reliability engineers focused on delivering solutions for optimizing and running hybrid, multi-cloud environments.

Blog links: http://www.shamayelfarooqui.com, https://www.xfgeek.com/home

LinkedIn Profile: https://www.linkedin.com/in/shamayel/

Vishnu Vardhan Chikoti has diverse experience in the areas of Application and Database design and development, Micro-services & Micro-frontends, DevOps, Site Reliability Engineering, and Machine Learning.
With the ability to conduct deep analysis, strong execution skills, and an innovative mindset, he has successfully led R&D teams to build engineering solutions to improve the reliability of applications. He is also an expert in building high-volume transaction processing applications for middle and back-office functions for Investment Banks using a variety of architectures.

LinkedIn Profile: https://www.linkedin.com/in/vishnu-vardhan-chikoti-3763262/

Cover Page
Title Page
Copyright Page
Foreword
Dedication Page
About the Authors
About the Reviewer
Acknowledgement
Preface
Errata
Table of Contents
1. Understanding the World of IT
    Structure
    Objective
    What is the role of IT in an organization?
        Hardware availability
        Core software services
        Compliance and security
        Application development and hosting
        Enterprise Architecture (EA)
        Software delivery
    Understanding the IT organization structure
    Role of infrastructure teams
        Data centers
        Virtualization
        Containerization
        On-premise infrastructure
        Cloud infrastructure
        Development and deployment platforms
    Role of application teams
        Cross-functional development teams
        DevOps teams
        Production support/operations teams
    IT security
        Change management team
    The TCP/IP protocol suite
    Domain Name System
    Conclusion
    Multiple choice questions
        Answers
2. Introduction to DevOps
    Structure
    Objective
    Introduction to DevOps
    DevOps principles and practices
        DevOps principles
        DevOps practices
    Benefits of DevOps
    Overview of DevOps tools
        Git
        Ansible
        Jenkins
    Conclusion
    Multiple choice questions
        Answers
3. Introduction to SRE
    Structure
    Objective
    DevOps and SRE
    Rise of internet companies
    SRE overview
    SRE terms
    SRE team responsibilities
    Skill set of SREs
    Conclusion
    Multiple choice questions
        Answers
4. Identify and Eliminate Toil
    Structure
    Objective
    Understanding toil
        Importance of eliminating toil
    Process optimization with automation
    Examples of toil with approaches to automate
        Purging and archiving of files
        Purging of database tables
        Installation/Patching
        Monitoring
        Checking log files
        Identity and Access Management
        Vulnerability scans
        Infrastructure provisioning/decommissioning
        Incident management
    Conclusion
    Multiple choice questions
        Answers
5. Release Management
    Structure
    Objective
    Understanding release management
        Release planning
        Build package
        Test for quality and security
        Deployment
    Release automation with CI/CD
        Using IaC for release management
    Blue-green deployments
    Canary deployments
    Conclusion
    Multiple Choice Questions
        Answers
6. Incident Management
    Structure
    Objective
    Understanding incident management
        Incident
        Incident lifecycle
    Blameless postmortems
    Incident example
        Incident detection/notification
        Incident triage
        Incident communication
        Incident resolution
        Incident retrospective/postmortem
    Incident knowledge base
    Role of development teams
    Conclusion
    Multiple choice questions
        Answers
7. IT Monitoring
    Structure
    Objective
    End to end monitoring strategy
    Infrastructure monitoring
        Server monitoring
        Network monitoring
        Storage monitoring
    Application monitoring
        Probes
        Checking logs
        Capturing processing time
        MQ monitoring
        Database monitoring
    End user monitoring
    DNS monitoring
    Monitoring Tools
        Agents
        Transport
        Collectors
        Data transformation
        Storage
        Alerting
        Dashboarding
        Prometheus
        Metricbeat
        Grafana
        ElastAlert
    Conclusion
    Multiple choice questions
        Answers
8. Observability
    Structure
    Objective
    Goals of observability
        Service reliability
        Operational efficiency
        Security and compliance
    Three pillars of observability
        Standardized libraries/APIs/SDKs
        Standardized trace context
        Tracers
        Cardinality attributes
    Open source libraries and tools
        Filebeat
        Logstash
        Fluentd
        OpenTelemetry
    Conclusion
    Multiple Choice Questions
        Answers
9. Key SRE KPIs: SLAs, SLOs, SLIs, and Error Budgets
    Structure
    Objective
    Key metrics for SRE
    Service level indicator (SLI)
    Service Level Objective (SLO)
    Service level agreement (SLA)
    Error budgets
        Error budget policy
    Conclusion
    Multiple choice questions
        Answers
10. Chaos Engineering
    Structure
    Objective
    Introducing chaos engineering
        Application/service unavailability
        Network delays
        Network failures
        Resource unavailability
        Configuration errors
        Database failures
    Chaos engineering process
        Define steady state
        Build a hypothesis
        Minimize blast radius
        Inject the failure condition
        Verify hypothesis
        Reverse failure condition
        Fix any issues
        Automate to run continuously
    Chaos GameDays
    Injecting failures
        Killing a process
        Network failures
        HTTP failures
        Injecting multiple failures
    Techniques for building resiliency
        Single points of failure
        Rate limiting/throttling
        Circuit breaker
        Handle retry storms
    Conclusion
    Multiple choice questions
        Answers
11. DevSecOps and AIOps
    Structure
    Objective
    Understanding DevSecOps
        Code scanning for security
        Secure releases using Infrastructure as Code
    Introduction to AIOps
    Use cases with AIOps
        Intelligent alerting
        Noise reduction
        Automated root cause analysis
        Automated remediation
        ChatOps
    ChatOps example with Rasa, Flask, and Telegram
    Conclusion
    Multiple choice questions
        Answers
12. Culture of Site Reliability Engineering
    Structure
    Objective
    Breaking silos in the organization
    Embracing risk
    Continuous improvement
        Intelligent automation
        Shift-left mindset
    Conclusion
    Multiple choice questions
        Answers
Index