Site Reliability Engineering

Overview

The overwhelming majority of a software system’s lifespan is spent in use, not in design or implementation. So, why does conventional wisdom insist that software engineers focus primarily on the design and development of large-scale computing systems?

In this course, you will learn how and why a DevOps commitment to the entire lifecycle enables Google to successfully build, deploy, monitor, and maintain some of the largest software systems in the world. You’ll learn the principles and practices that enable Google engineers to make systems more scalable, reliable, and efficient, all of which are directly applicable to your organization.

Who Should Take This Course

AUDIENCE

This course is suitable for DevOps engineers, system administrators, developers, and managers.

PREREQUISITES

Students should have an IT background.

Schedule

3-4 Days

Technical Requirements

Students will need a reasonably modern laptop with an unrestricted Internet connection and the Chrome browser. Laptops with overly restrictive VPNs or firewalls may not work properly. A cloud-based lab environment will be provided to students; no local installation is needed.

Why You Should Take This Course

In this course, you will gain familiarity with the following topics:

Introduction

Learn what site reliability engineering is

Principles

Examine the patterns, behaviors, and areas of concern for SRE

Practices

Understand the theory and practice of an SRE’s day-to-day work: building and operating large, distributed computing systems

Management

Explore Google’s best practices for training, communication, and meetings that your organization can use

Course Outline

Introduction
- The Sysadmin Approach to Service Management
- Google’s Approach to Service Management: Site Reliability Engineering
- Tenets of SRE
- Demand Forecasting and Capacity Planning
- Efficiency and Performance
The Production Environment at Google, from the Viewpoint of an SRE
- Hardware
- System Software That “Organizes” the Hardware
- Storage
- Networking
- Monitoring and Alerting
Principles
- Embracing Risk
- Managing Risk
- Motivation for Error Budgets
- Benefits
Service Level Objectives
- Service Level Terminology
- Indicators in Practice
- What Do You and Your Users Care About?
- Agreements in Practice
Eliminating Toil
- Toil Defined
- Why Less Toil Is Better
Monitoring Distributed Systems
- Why Monitor?
- Setting Reasonable Expectations for Monitoring
- Symptoms Versus Causes
- Black-Box Versus White-Box
- As Simple as Possible, No Simpler
- Bigtable SRE: A Tale of Over-Alerting
- Gmail: Predictable, Scriptable Responses from Humans
The Evolution of Automation at Google
- The Value of Automation
- A Platform
- Faster Repairs
- Faster Action
- Automate Yourself Out of a Job: Automate ALL the Things!
- Resolving Inconsistencies Idempotently
- Borg: Birth of the Warehouse-Scale Computer
Release Engineering
- The Role of a Release Engineer
- Philosophy
- Self-Service Model
- Testing
- Packaging
- Configuration Management
Simplicity
- System Stability Versus Agility
- The Virtue of Boring
- Minimal APIs
- Modularity
- Release Simplicity
Practical Alerting from Time-Series Data
- Instrumentation of Applications
- Collection of Exported Data
- Storage in the Time-Series Arena
- Labels and Vectors
- Alerting
Being On-Call
- Life of an On-Call Engineer
- Balanced On-Call
- Balance in Quantity
- Balance in Quality
Effective Troubleshooting
- Theory
- In Practice
- Problem Report
- Triage
Emergency Response
- What to Do When Systems Break
- Test-Induced Emergency
- Response
- Keep a History of Outages
Managing Incidents
- Unmanaged Incidents
- Poor Communication
- Freelancing
- Live Incident State Document
- Clear, Live Handoff
- A Managed Incident
Postmortem Culture: Learning from Failure
- Google’s Postmortem Philosophy
- Collaborate and Share Knowledge
Tracking Outages
- Escalator
- Outalator
- Aggregation
- Tagging
- Analysis
Testing for Reliability
- Types of Software Testing
- Traditional Tests
- Production Tests
- Testing at Scale
Software Engineering in SRE
- Why Is Software Engineering Within SRE Important?
- Traditional Capacity Planning
- Intent-Based Capacity Planning
- Fostering Software Engineering in SRE
Load Balancing at the Frontend
- Power Isn’t the Answer
- Load Balancing Using DNS
- Load Balancing at the Virtual IP Address
Load Balancing in the Datacenter
- The Ideal Case
- Identifying Bad Tasks: Flow Control and Lame Ducks
- A Simple Approach to Unhealthy Tasks: Flow Control
- A Robust Approach to Unhealthy Tasks: Lame Duck State
Handling Overload
- The Pitfalls of “Queries per Second”
- Per-Customer Limits
- Client-Side Throttling
- Criticality
- Handling Overload Errors
- Deciding to Retry
- Load from Connections
Addressing Cascading Failures
- Causes of Cascading Failures and Designing to Avoid Them
- Server Overload
- Resource Exhaustion
- Service Unavailability
- Preventing Server Overload
- Queue Management
- Planned Changes, Drains, or Turndowns
- Testing for Cascading Failures
Managing Critical State: Distributed Consensus for Reliability
- Motivating the Use of Consensus: Distributed Systems Coordination Failure
- Case Study 1: The Split-Brain Problem
- Case Study 2: Failover Requires Human Intervention
- Case Study 3: Faulty Group-Membership Algorithms
- How Distributed Consensus Works
- Paxos Overview: An Example Protocol
- System Architecture Patterns for Distributed Consensus
- Reliable Replicated State Machines
- Reliable Replicated Datastores and Configuration Stores
Distributed Periodic Scheduling with Cron
- Cron
- Reliability Perspective
- Cron Jobs and Idempotency
Data Processing Pipelines
- Origin of the Pipeline Design Pattern
- Initial Effect of Big Data on the Simple Pipeline Pattern
- Challenges with the Periodic Pipeline Pattern
Data Integrity: What You Read Is What You Wrote
- Data Integrity’s Strict Requirements
- Choosing a Strategy for Superior Data Integrity
- Data Integrity Is the Means; Data Availability Is the Goal
- The 24 Combinations of Data Integrity Failure Modes
- First Layer: Soft Deletion
- Second Layer: Backups and Their Related Recovery Methods
- Overarching Layer: Replication
- Third Layer: Early Detection
- Trust but Verify
- Hope Is Not a Strategy
Reliable Product Launches at Scale
- Launch Coordination Engineering
- The Role of the Launch Coordination Engineer
- Setting Up a Launch Process
- Capacity Planning
- Failure Modes
Accelerating SREs to On-Call and Beyond
- You’ve Hired Your Next SRE(s), Now What?
- Initial Learning Experiences: The Case for Structure Over Chaos
- Learning Paths That Are Cumulative and Orderly
Dealing with Interrupts
- Managing Operational Load
- Factors in Determining How Interrupts Are Handled
- Imperfect Machines
- Cognitive Flow State
Embedding an SRE to Recover from Operational Overload
- Phase 1: Learn the Service and Get Context
- Identify the Largest Sources of Stress
- Identify Kindling
- Phase 2: Sharing Context
- Write a Good Postmortem for the Team
- Sort Fires According to Type
- Phase 3: Driving Change
- Start with the Basics
Communication and Collaboration in SRE
- Communications: Production Meetings
- Collaboration within SRE
- Team Composition
The Evolving SRE Engagement Model
- SRE Engagement: What, How, and Why
- The PRR Model
- The SRE Engagement Model
- Alternative Support
- Production Readiness Reviews: Simple PRR Model
- Engagement
- Analysis
Lessons Learned from Other Industries
- Meet Our Industry Veterans
- Preparedness and Disaster Testing
- Relentless Organizational Focus on Safety
Conclusion

FAQs

Is there a discount available for current students?

UMBC students and alumni, as well as students who have previously taken a public training course with UMBC Training Centers, are eligible for a 10% discount, capped at $250. Please provide a copy of your UMBC student ID, an unofficial transcript, or the name of the UMBC Training Centers course you have completed. Asynchronous courses are excluded from this offer.

What is the cancellation and refund policy?

Students will receive a refund of paid registration fees only if UMBC Training Centers receives a notice of cancellation at least 10 business days prior to the class start date (for classes) or the exam date (for exams).
