SRE in Apiary

CZJUG 21.5.2018
Ladislav Prskavec
@abtris

"What happens when a software engineer is tasked with what used to
be called operations."
» Ben Treynor Sloss, Vice President, Google Engineering,
founder of Google SRE

SRE implements DevOps
— Google Cloud

Apiary in numbers
» Apiary users: 336,786
» Apiary API projects: 440,178
» Apiary engineers: 19
» Apiary platform engineers: 10 + 4
» Apiary SREs: 4
» App deploys: 15 (per week)
» Parsing service invocations [1 day]: ~200k
» CI build: ~19 min (8 parallel workers)

How we started with
SRE team

2014 - 2 people
software developer and ops guy

2015-2017 - 3 people
software developers

2018 - 4 people
2 seniors, 2 juniors

No team separation
» bounded context, but ...
» Shared ownership of platform - shared responsibility
» Shared tooling (debug, deploy, monitor)
» Shared codebase
» Brainstorm
» Motivation for good design (monitoring, future debugging)

Things break
» They do - better be ready
» Knowing when there's a problem (logs, metrics, alerting)
» Having someone there - being oncall
» Responding (mitigation, resolution)
» Learning from it (postmortems)

Measure everything
» No gut feeling when we have the data (app metrics, runtime
metrics)
» Both production and non-production systems (e.g. our CI test
time)
» Thresholds, automated alerting
» Visualize the data (oncall dashboard, happiness dashboard)
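
A minimal sketch of threshold-based alerting on a measured value, assuming a simple mean over recent samples. The metric name `ci_build_minutes` and the warn/critical limits are hypothetical, chosen only to illustrate the idea with a CI build time near the ~19 min mentioned earlier; they are not Apiary's real configuration.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List


@dataclass
class Threshold:
    metric: str       # hypothetical metric name
    warn: float       # mean at or above this value yields "warn"
    critical: float   # mean at or above this value yields "critical"


def evaluate(threshold: Threshold, samples: List[float]) -> str:
    """Classify the mean of recent samples against the threshold."""
    if not samples:
        return "no-data"  # missing data is a signal of its own
    value = mean(samples)
    if value >= threshold.critical:
        return "critical"
    if value >= threshold.warn:
        return "warn"
    return "ok"


# Hypothetical limits for CI build time in minutes.
ci_build = Threshold(metric="ci_build_minutes", warn=20.0, critical=30.0)
print(ci_build.metric, evaluate(ci_build, [18.5, 19.0, 19.5]))
```

Returning a separate "no-data" state keeps a silent metric from being mistaken for a healthy one.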

Gradual changes
» Delivery vs deploy
» Continuous Integration / Continuous Delivery (CI/CD)
» Automated testing within CI
» Testing environments (similar to production)
» Short iterations, fast rollbacks
» No-downtime & immutable deploys
» Rolling delivery
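
A rough sketch of rolling delivery with a fast rollback, assuming instances are updated one at a time and a failed health check reverts everything touched so far. The function `rolling_deploy`, the instance names, and the `healthy` callback are all hypothetical; a real deploy would call out to the actual platform tooling.

```python
from typing import Callable, Dict, List


def rolling_deploy(
    instances: List[str],
    versions: Dict[str, str],
    new_version: str,
    healthy: Callable[[str, str], bool],
) -> bool:
    """Update instances one by one; roll back all updated ones on failure."""
    previous = dict(versions)  # snapshot used for rollback
    updated: List[str] = []
    for name in instances:
        versions[name] = new_version
        updated.append(name)
        if not healthy(name, new_version):
            # Fast rollback: restore every instance changed so far.
            for done in updated:
                versions[done] = previous[done]
            return False
    return True


# Hypothetical fleet where the second instance fails its health check.
fleet = {"web-1": "v1", "web-2": "v1", "web-3": "v1"}
ok = rolling_deploy(list(fleet), fleet, "v2", lambda name, _: name != "web-2")
print(ok, fleet)
```

Because only one instance changes at a time, the remaining instances keep serving traffic during both the rollout and the rollback.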

Tooling & automation
» oncall logistics
  » schedules
  » escalations
  » alerting
  » conflicts
» documentation
  » runbooks
  » internal processes
  » domain dictionary
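
Oncall schedules and escalations can be sketched as a weekly rotation, assuming each person covers seven consecutive days and the next person in the rotation is the escalation target. The names, the start date, and the functions `oncall_for` and `escalation_chain` are hypothetical and stand in for whatever paging tool is actually used.

```python
from datetime import date
from typing import List


def oncall_for(day: date, rotation: List[str], start: date) -> str:
    """Return who is oncall on `day` in a weekly rotation beginning at `start`."""
    weeks = (day - start).days // 7
    return rotation[weeks % len(rotation)]


def escalation_chain(day: date, rotation: List[str], start: date) -> List[str]:
    """Primary is the current oncall; secondary is next in the rotation."""
    primary = oncall_for(day, rotation, start)
    idx = rotation.index(primary)
    secondary = rotation[(idx + 1) % len(rotation)]
    return [primary, secondary]


# Hypothetical four-person rotation starting on a Monday.
rotation = ["alice", "bob", "carol", "dave"]
start = date(2018, 5, 21)
print(escalation_chain(date(2018, 5, 30), rotation, start))
```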

Reason 1: Decreasing chances of errors
» Source and great post: http://www.devops.ch/2017/05/10/devops-explained/

Reason 2: Eliminating toil, work that is:
» Repetitive
» Automatable
» Doesn't provide enduring value
» Scales linearly with service growth
» Compounds significantly and surprisingly

Reason 3: Focusing on creative
engineering work that:
» Improves reliability
» Improves performance & stability of systems
» Ensures scalability
» Reduces toil
» Is fun: improves morale, speeds up progress, allows skill
development

Incidents
Types:
» Low-priority incident
» High-priority incident
» Security incident
Both production and non-production systems

Being oncall
» Shared among developers (roles, not individuals; increases bus factor)
» Responsible for the platform
» Safety net - you know who to call
» Runbooks - you know what to do
» Early alerting - proactively investigate

Incident response
If critical: Incident commander role
Separate roles, if necessary:
» outbound and inbound communication
» root cause analysis
» issue mitigation
Tracking time (incident ack expiration) and keeping track
Tooling (alerts, paging, postmortem reminders)
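
One possible reading of "incident ack expiration" is that an acknowledgement only holds for a limited time, after which an unresolved incident pages again. The sketch below assumes that interpretation; the `Incident` class, the `incident_state` function, and the 30-minute window are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Incident:
    title: str
    opened_at: datetime
    acked_at: Optional[datetime] = None
    resolved_at: Optional[datetime] = None


def incident_state(
    incident: Incident,
    now: datetime,
    ack_ttl: timedelta = timedelta(minutes=30),  # hypothetical window
) -> str:
    """Return 'resolved', 'acknowledged', or 'triggered' at time `now`."""
    if incident.resolved_at is not None:
        return "resolved"
    if incident.acked_at is None:
        return "triggered"
    if now - incident.acked_at >= ack_ttl:
        return "triggered"  # the ack expired, so the incident pages again
    return "acknowledged"


opened = datetime(2018, 5, 21, 9, 0)
incident = Incident("API latency high", opened, acked_at=opened + timedelta(minutes=5))
print(incident_state(incident, opened + timedelta(minutes=20)))
print(incident_state(incident, opened + timedelta(minutes=40)))
```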

Postmortems
» Root cause
» Lessons learned
» Actionable items
» Prevent future issues
» Create runbooks
» Blameless
» Generated reminders
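
Generated reminders for postmortem action items could look roughly like this, assuming each item has an owner and a due date. The `ActionItem` fields, the `reminders` function, and the three-day lead time are hypothetical, not the format of Apiary's actual tooling.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List


@dataclass
class ActionItem:
    description: str
    owner: str
    due: date
    done: bool = False


def reminders(items: List[ActionItem], today: date, lead_days: int = 3) -> List[str]:
    """Build reminder messages for open items that are overdue or due soon."""
    messages = []
    for item in items:
        if item.done:
            continue  # completed items need no reminder
        if item.due < today:
            prefix = "OVERDUE"
        elif item.due - today <= timedelta(days=lead_days):
            prefix = "DUE SOON"
        else:
            continue
        messages.append(
            f"{prefix}: {item.owner}: {item.description} (due {item.due.isoformat()})"
        )
    return messages


# Hypothetical action items from a postmortem.
items = [
    ActionItem("Add alert on queue depth", "alice", date(2018, 5, 18)),
    ActionItem("Write runbook for cache flush", "bob", date(2018, 5, 23)),
    ActionItem("Review retry policy", "carol", date(2018, 6, 30)),
]
for line in reminders(items, today=date(2018, 5, 21)):
    print(line)
```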

Incident reviews
» Weekly, team lead sync
» Reviewing past incidents - types, occurrence, actionability
» Discuss improvements
» Incident fatigue prevention

Summary
» Culture is more important than process!
» Start early and work on improvements!
» Product owner for SRE work is a useful role!

References
» SRE vs. DevOps: competing standards or close friends?
» SRE Weekly
» Awesome Site Reliability Engineering
Books
» SRE book
» Seeking SRE
