Eric Harvieux, an SRE on Google's Customer Reliability Engineering (CRE) team, will talk to us about Site Reliability Engineering (SRE) in Practice, including a panel discussion with Fidelity, Home Depot, Sabre, and Google SRE practitioners. We hope to hear how real-life SRE differs from the books.
2. Proprietary + Confidential
Table of Contents
Introduction to SRE
SRE as a Role, Mindset, and Tools
SLOs and Error Budgets
Postmortems
Teams
Panel Discussion
Intro To SRE
Site Reliability Engineers develop solutions to design, build, and run
large-scale systems scalably, reliably, and efficiently. We treat
operations like a software engineering problem.
We guide system architecture by operating at the intersection of
software development and systems engineering, using data to guide
decision making.
We approach our work with a spirit of constructive pessimism: we
hope for the best, but plan for the worst.
SLOs and Error Budgets
A Service Level Objective (SLO) is simply a goal for how reliable one
aspect of your service should be, over some period of time.
But! They aren’t necessarily simple to define.
● How reliable do you actually need to be? Who says?
● If you have many Critical User Journeys, which do you
monitor? All of them?
● What if your dependencies don’t have SLOs defined?
An Error Budget is just the gap between 100% and your SLO target;
it’s room to make mistakes.
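The arithmetic behind an error budget can be sketched in a few lines. This is a minimal illustration, not a definitive implementation: the `SLO` class, the function names, and the 99.9% / 30-day figures are all hypothetical and chosen only to make the numbers concrete.

```python
from dataclasses import dataclass


@dataclass
class SLO:
    """Hypothetical availability-style SLO: a target fraction of good events."""
    target: float      # e.g. 0.999 means 99.9% of requests should succeed
    window_days: int   # length of the window the target applies to

    def __post_init__(self) -> None:
        # A target of exactly 100% leaves no budget at all, so reject it here.
        if not 0.0 < self.target < 1.0:
            raise ValueError("target must be strictly between 0 and 1")


def error_budget_fraction(slo: SLO) -> float:
    """The error budget is the gap between 100% and the SLO target."""
    return 1.0 - slo.target


def allowed_bad_events(slo: SLO, total_events: int) -> float:
    """How many events may fail in the window before the SLO is missed."""
    return error_budget_fraction(slo) * total_events


def budget_remaining(slo: SLO, total_events: int, bad_events: int) -> float:
    """Fraction of the budget still unspent; negative means overspent."""
    allowed = allowed_bad_events(slo, total_events)
    return (allowed - bad_events) / allowed


def allowed_downtime_minutes(slo: SLO) -> float:
    """Time-based view: minutes of full outage the window can absorb."""
    return error_budget_fraction(slo) * slo.window_days * 24 * 60


if __name__ == "__main__":
    slo = SLO(target=0.999, window_days=30)
    print(f"Allowed failures per 1M requests: {allowed_bad_events(slo, 1_000_000):.0f}")
    print(f"Budget left after 250 failures: {budget_remaining(slo, 1_000_000, 250):.0%}")
    print(f"Allowed full downtime: {allowed_downtime_minutes(slo):.1f} minutes")
```

Under these assumed numbers, a 99.9% target over 30 days leaves roughly 43 minutes of full downtime, or about 1,000 failed requests per million, as room to make mistakes.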
Postmortems
Primary Goals
Postmortems ensure an incident is documented, that all the
contributing root causes are understood, and that effective
preventative actions are put in place to reduce the likelihood
and/or severity of recurrence.
Postmortems are expected
After any significant undesirable event, this is the chance to
openly and honestly review weak points in our systems.
Being responsible for, or involved in, a postmortem is
not punishment.
Blamelessness
Blamelessness could mean a number of things, but the key result
should be: Solely by virtue of being involved in an incident,
or speaking factually about what occurred, I’m:
● not going to lose my job.
● not going to be rated lower in my performance review.
● not going to get condescending questions from
management.
● not going to be the butt of jokes.
That means you might have to adapt for your environment:
● Complete the postmortem review asynchronously to give
people time to collect data.
● Listing or not listing names of those involved should be an
agreed-upon policy.
Postmortems: Value vs. Effort
Postmortems aren’t fun, especially for the person who owns the
work of putting one together. So, like any reasonable human,
they might try to avoid it. Ways we get around postmortems include:
● not declaring an incident at all
● determining the incident was a repeat issue
● picking a definition of impact that avoids postmortem
requirements.
But we’re probably lying to ourselves.
This means it’s time to think about our incident review process and
make sure it’s efficient and effective.
Team Composition
Independent or Embedded? SREs can be positioned in a number of
ways within an organization to have the most impact:
● SREs who share responsibility for a number of services might
work well as an independent team
● A development team suffering from poor reliability might
benefit from an SRE sitting with them
● How many SREs do you need anyway?