How to get started with Site Reliability Engineering

Women Who Code Toronto, April 2021

Breakdown of Engineering roles by gender
https://juliasilge.com/blog/salary-gender/

Scatter graph of roles by gender
So how much of a gender imbalance is there for SRE roles?
https://insights.stackoverflow.com/survey/2020#developer-profile-gender

Scatter graph of roles by gender
Around 30x...
https://insights.stackoverflow.com/survey/2020#developer-profile-gender

Job titles, oh so many job titles...
● System Administrator
● Cloud Architect
● Infrastructure Engineer
● Site Reliability Engineer
● DevOps Engineer
● Platform Engineer
(least coding to most coding… sort of.
Not really, it’s all made up)

So why is there such a gender imbalance?

Possibly because no-one can agree what “DevOps” is
“We interview a lot of engineers and hiring managers about
what they're looking for when they hire for pertinent roles.”
“We usually find a clear consensus on what the relevant
skills are.”
“When we did this for DevOps, we found no such
consensus.”
https://triplebyte.com/blog/no-one-agrees-on-what-devops-means-not-even-employers

Probably because no-one can agree what it is
“On one end of the spectrum, there are back-end developers
who focus on building infrastructure and automation tools.”
“On the other end of the spectrum, there are systems experts
who serve as the first line of defense against production
outages but rarely write code aside from the occasional shell
script”
https://triplebyte.com/blog/no-one-agrees-on-what-devops-means-not-even-employers

“DevOps is a philosophy
before it’s a job title”

Common assumptions
● Linux expert
● Networking wizard
● Learn every AWS product
● Run everything in Docker
● CI/CD* all the things
● Automate everything
● ...
* Continuous Integration/Continuous Delivery (run my tests then deploy automatically… continuously)

Truth is, we have (almost) no idea what we are doing

What are you really expected to know?
● Is your application running well?
● Advantages and limitations of your current processes
● Alternative ways of deploying, hosting and
architecting your current platform
● “Shared suffering makes a team a team”: try to learn
from people who have battle scars
https://psmag.com/books-and-culture/painful-experiences-solidarity-bonding-power-shared-suffering-90352

The how isn’t important,
but the why is

https://twitter.com/kelseyhightower/status/826528907381739520
https://www.protocol.com/enterprise/kelsey-hightower-google-cloud (if you don’t know who Kelsey Hightower is)

Technologies
● Don’t learn Kubernetes*, but containerisation
● Don’t just learn AWS, but cloud computing
● Nothing wrong with hacky scripts to get started
● Databases, queues and caches are your friends
and worst enemies...
* I mean, you will have to eventually, but no-one really knows how it works anyways

What could you learn?
Many, many things, but most fall into these categories...
● Repeatability - Can I do it the same way over and over?
● Observability - Can I tell if it’s working well or not?
● Efficiency - Can I make this happen faster/cheaper?
● Composure - Can I fix this without panicking?

Repeatability
● Don’t you mean automation? Shouldn’t you
automate all the things?
● Repeatability is a by-product of automation;
automation is the means, not the end result
● A wise co-worker once said to me: “if you have to do it
more than twice, write a script so you don’t mess it up
the third time”

Repeatability, deployment example

1. Drag my files via a GUI onto the server
2. Run a command to copy files onto the server
3. Write a bash script to copy files onto the server
4. Modify the bash script to back up the previous version
5. Write another bash script to roll back to the previous version…
6. …
7. …
8. ...

99. …
100. Finish building Kubernetes
https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/deployment/rollback.go
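Steps 2 to 5 above can be sketched in a few lines. This is only an illustration, written in Python rather than bash, and it assumes the "server" is a local directory. The function names `deploy` and `rollback` and the single-backup layout are hypothetical, not something from the talk.

```python
import shutil
from pathlib import Path


def deploy(source: Path, live: Path, backup: Path) -> None:
    """Copy a new release into place, keeping the previous one as a backup."""
    if live.exists():
        # Step 4: back up the previous version before replacing it.
        if backup.exists():
            shutil.rmtree(backup)
        shutil.move(str(live), str(backup))
    # Steps 2-3: copy the files onto the "server" (here, a local directory).
    shutil.copytree(source, live)


def rollback(live: Path, backup: Path) -> None:
    """Step 5: restore the previous version from the backup."""
    if not backup.exists():
        raise FileNotFoundError(f"no backup to roll back to at {backup}")
    if live.exists():
        shutil.rmtree(live)
    shutil.move(str(backup), str(live))
```

Only one previous version is kept, so a second rollback in a row fails loudly instead of silently doing nothing. Each gap like that is another step on the road to step 100.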

Repeatability, deployment
https://charity.wtf/about/
https://twitter.com/mipsytipsy/status/1270693381648203777

Observability
1. Is there anything weird in the logs?
2. How much CPU/etc. is my application using?
○ How much does the thing my application is running on have?
3. How fast is it responding?
○ How fast does it usually respond?
4. How many other applications/services* does my application
depend on?
○ Repeat 1-4 for each of those
* Remember that web servers, databases, caches and queues are just applications someone else wrote
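Question 3 is really a comparison between "now" and "usually". A minimal sketch of that comparison, assuming you already have a few earlier response times to hand; the function name `is_unusually_slow` and the factor of 2 are arbitrary choices for illustration.

```python
from statistics import median


def is_unusually_slow(current_ms: float, usual_ms: list[float], factor: float = 2.0) -> bool:
    """Compare the current response time with the typical (median) one."""
    if not usual_ms:
        raise ValueError("need at least one earlier measurement to compare against")
    # The median is less sensitive to the odd very slow request than the mean.
    return current_ms > factor * median(usual_ms)
```

For example, `is_unusually_slow(900, [100, 120, 110])` flags the request, while `is_unusually_slow(150, [100, 120, 110])` does not.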

Service Level Agreements
https://www.atlassian.com/incident-management/kpis/sla-vs-slo-vs-sli

Service Level Agreement (SLA) example
● SLA - My website must be available 99.9% of the time each
month (at most about 43 minutes of downtime) or my paying
customers get a (partial) refund
● Service Level Objective (SLO)
○ Back-end - API must not be down for more than 43 minutes each
month
○ Front-end - Front-end must not have a broken User Experience (UX)
for more than 43 minutes each month
https://en.wikipedia.org/wiki/High_availability
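The 43 minutes comes from simple arithmetic: 0.1% of a 30-day month. A small sketch of that calculation, assuming a 30-day month; the function name `downtime_budget_minutes` is made up for this example.

```python
def downtime_budget_minutes(availability_percent: float, days: int = 30) -> float:
    """Minutes of downtime allowed in a period for a given availability target."""
    total_minutes = days * 24 * 60  # 43,200 minutes in a 30-day month
    return total_minutes * (1 - availability_percent / 100)


# 99.9% over 30 days leaves roughly 43.2 minutes of allowed downtime.
print(round(downtime_budget_minutes(99.9), 1))
```

Each extra "nine" divides the budget by ten, so 99.99% leaves a little over 4 minutes a month.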

How do you define “down” or “broken”?
● Service Level Indicator (SLI)
○ Back-end - API must respond to requests in less than 5 seconds,
99.9% of the time
○ Front-end - Time To Interactive* must be less than 10 seconds,
99.9% of the time
* https://web.dev/interactive/
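As a rough sketch, the back-end indicator above can be measured as the percentage of requests that finish under the threshold. The 5 seconds and 99.9% come from the slide; the function names and the idea of feeding in a plain list of response times are assumptions for illustration.

```python
def sli_percent(response_times_s: list[float], threshold_s: float = 5.0) -> float:
    """Percentage of requests that responded faster than the threshold."""
    if not response_times_s:
        raise ValueError("no requests to measure")
    good = sum(1 for t in response_times_s if t < threshold_s)
    return 100 * good / len(response_times_s)


def meets_target(
    response_times_s: list[float],
    threshold_s: float = 5.0,
    target_percent: float = 99.9,
) -> bool:
    """True when enough requests were fast enough to meet the objective."""
    return sli_percent(response_times_s, threshold_s) >= target_percent
```

In practice the response times would come from your observability stack, not a hand-built list.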

Any decent observability stack can capture this information
https://www.g2.com/categories/enterprise-monitoring

Efficiency
Code problem, or a process problem?
● I know where the website is slow, but I don’t know
why the website is slow…
● It takes ages to release my feature, who or what
keeps holding it up?

Know your stack
Learn where your stack has strengths and weaknesses
● What API calls, scheduled tasks or page renders take longer
than average?
● Using profiling tools, what specifically is slow?
● Can I solve it with the language?
● Can I solve it at the database/cache?
● Can I solve it by re-architecting?
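The first question above ("what takes longer than average?") can be sketched as follows, assuming you have already collected timings per API call, task or page. The function name `slower_than_average` and the input format are hypothetical.

```python
from statistics import mean


def slower_than_average(durations_ms: dict[str, list[float]]) -> list[str]:
    """Names whose average duration is above the average across all names."""
    averages = {name: mean(samples) for name, samples in durations_ms.items() if samples}
    if not averages:
        return []
    overall = mean(averages.values())
    slow = [name for name, avg in averages.items() if avg > overall]
    # Slowest first, so the most promising place to profile is at the top.
    return sorted(slow, key=lambda name: averages[name], reverse=True)
```

This only tells you where to look. A profiler is still needed to find out what specifically is slow.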

DevOps Research and Assessment (DORA) metrics
Deployment Frequency (DF), Mean Lead Time
for changes (MLT), Mean Time To Recover
(MTTR) and Change Failure Rate (CFR)
● Ship it quicker
● Ship smaller changes more
often
● Fix bugs quicker
● Detect bugs earlier
...keep the site online more
https://cloud.google.com/blog/products/devops-sre/using-the-four-keys-to-measure-your-devops-performance
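A minimal sketch of how the four metrics could be calculated from a list of deployment records. The `Deployment` record and its field names are hypothetical; real data would come from your CI/CD and incident tooling, and teams differ on exactly where "lead time" starts and ends.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Deployment:
    committed_at: datetime  # when the change was committed
    deployed_at: datetime  # when it reached production
    failed: bool = False  # did it cause a failure in production?
    recovered_at: Optional[datetime] = None  # when service was restored


def dora_metrics(deployments: list[Deployment], period_days: int) -> dict:
    """Calculate DF, MLT, CFR and MTTR for deployments made in a period."""
    if not deployments:
        raise ValueError("no deployments in this period")
    lead_times = [d.deployed_at - d.committed_at for d in deployments]
    failures = [d for d in deployments if d.failed]
    recoveries = [d.recovered_at - d.deployed_at for d in failures if d.recovered_at]

    def average_seconds(deltas: list[timedelta]) -> float:
        return sum(deltas, timedelta()).total_seconds() / len(deltas)

    return {
        "deployment_frequency_per_day": len(deployments) / period_days,
        "mean_lead_time_hours": average_seconds(lead_times) / 3600,
        "change_failure_rate": len(failures) / len(deployments),
        # None when nothing failed (or nothing has recovered yet).
        "mean_time_to_recover_minutes": average_seconds(recoveries) / 60 if recoveries else None,
    }
```

Tracking how these numbers move over time is usually more useful than any single reading.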

Value Stream Mapping
https://visible.is/#value-stream-mapping

Composure
But… what if a deployment has just taken the site offline?

Composure
● Be comfortable on the command line*
● Figure out how to get quick access to
○ Logs**
○ CPU/memory/IO metrics***
● Know how to roll back
● Check if it’s safe to roll back
● Know when to ask for help
● Know who to ask for help
* https://www.learnenough.com/command-line-tutorial/basics
** https://phoenixnap.com/kb/how-to-view-read-linux-log-files
*** https://www.tecmint.com/command-line-tools-to-monitor-linux-performance/

DevOps interviews are tricky
● Coding tests are rare (sometimes a terminal
test) but whiteboarding is common
● Technology specifics will usually be based
on their in-house stack
● Conceptual answers should be valid*
● Admit when you don’t know how something
works (but…)
● Provide examples of alternative approaches
where possible
* Too many different technology stacks, so try to relate theirs to something more familiar to you

On-call
● Ask what a typical on-call shift
looks like
○ How many out-of-hours
pages do they get?
● Ask how many other people
and teams are also on-call
● Ask how incidents are
prioritised and what the expected
resolution times are

Should you consider it?
● You will get an opportunity to learn many new things, but your work will be
less visible to stakeholders
● Having root access means more risk, more danger
● Folks are very keen to teach, but it takes the right mindset to learn
● Be careful: lots of jobs want rebranded sysadmins and have no intention of
fixing their broken processes (don’t let the salary suck you in)

https://www.commitstrip.com/en/2012/11/29/nest-pas-root-qui-veut/

Summary
● Learn why to do something, not
how
● Start by analysing and optimising
what you are familiar with
● Pick the tools you find easiest to
use, but be aware of others
● Look at the human factors
● Measure everything, otherwise you
won’t know if it’s faster
● Be careful with job opportunities
Infrastructure Engineer at PartnerStack
(https://jobs.lever.co/partnerstack)
www.slideshare.net/secret/dYrg0kLRxzp3K
