Many companies are looking for "DevOps" in many forms, but what kind of skills or experiences are actually needed? I’ll debunk some of the myths that recruiters or internet lurkers might tell you, and help you find out whether you have an aptitude for Site Reliability or Infrastructure Engineering. If so, what might be good knowledge areas to get started with? And if learning leads to an interview, what might that look like?
4. Scatter graph of roles by gender
So how much of a gender imbalance is there for SRE roles?
https://insights.stackoverflow.com/survey/2020#developer-profile-gender
5. Scatter graph of roles by gender
Around 30x...
https://insights.stackoverflow.com/survey/2020#developer-profile-gender
6. Job titles, oh so many job titles...
● System Administrator
● Cloud Architect
● Infrastructure Engineer
● Site Reliability Engineer
● DevOps Engineer
● Platform Engineer
(least coding to most coding… sort of.
Not really, it’s all made up)
8. Possibly because no-one can agree what “DevOps” is
“We interview a lot of engineers and hiring managers about
what they're looking for when they hire for pertinent roles.”
“We usually find a clear consensus on what the relevant
skills are.”
“When we did this for DevOps, we found no such
consensus.”
https://triplebyte.com/blog/no-one-agrees-on-what-devops-means-not-even-employers
9. Probably because no-one can agree what it is
“On one end of the spectrum, there are back-end developers
who focus on building infrastructure and automation tools.”
“On the other end of the spectrum, there are systems experts
who serve as the first line of defense against production
outages but rarely write code aside from the occasional shell
script”
https://triplebyte.com/blog/no-one-agrees-on-what-devops-means-not-even-employers
10. “DevOps is a philosophy
before it’s a job title”
12. Common assumptions
● Linux expert
● Networking wizard
● Learn every AWS product
● Run everything in Docker
● CI/CD* all the things
● Automate everything
● ...
* Continuous Integration/Continuous Delivery (run my tests then deploy automatically… continuously)
13. Truth is, we have (almost) no idea what we are doing
14. What are you really expected to know?
● Is your application running well?
● Advantages and limitations of your current processes
● Alternative ways of deploying, hosting and
architecting your current platform
● “Shared suffering makes a team a team”: try to learn
from people who have battle scars
https://psmag.com/books-and-culture/painful-experiences-solidarity-bonding-power-shared-suffering-90352
17. Technologies
● Don’t learn Kubernetes*, but containerisation
● Don’t just learn AWS, but cloud computing
● Nothing wrong with hacky scripts to get started
● Databases, queues and caches are your friends
and worst enemies...
* I mean, you will have to eventually, but no-one really knows how it works anyways
19. What could you learn?
Many, many things. But most fall into these categories...
● Repeatability - Can I do it the same way over and over?
● Observability - Can I tell if it’s working well or not?
● Efficiency - Can I make this happen faster/cheaper?
● Composure - Can I fix this without panicking?
20. Repeatability
● Don’t you mean automation? Shouldn’t you
automate all the things?
● Repeatability is a by-product of automation;
automation is the means, not the end result
● A wise co-worker once said to me: “If you have to do it
more than twice, write a script so you don’t mess it up
the third time”
22. Repeatability, deployment example
1. Drag my files via a GUI onto the server
2. Run a command to copy files onto the server
3. Write a bash script to copy files onto the server
4. Modify the bash script to back up the previous version
5. Write another bash script to roll back to the previous version…
6. …
7. …
8. ...
23. Repeatability, deployment example
99. …
100. Finish building Kubernetes
https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/deployment/rollback.go
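Steps 2 to 5 of the deployment example can be sketched in a few lines. The slides talk about bash scripts; this sketch uses Python and local directories instead, purely for illustration. The names `deploy`, `rollback` and `demo`, and the `live`/`backup` directory layout, are assumptions made for this example and are not from the talk.

```python
import shutil
import tempfile
from pathlib import Path


def deploy(source: Path, live: Path, backup: Path) -> None:
    """Copy a new release into place, keeping the previous one as a backup."""
    if live.exists():
        # Step 4: back up the previous version before replacing it.
        if backup.exists():
            shutil.rmtree(backup)
        shutil.move(str(live), str(backup))
    # Steps 2-3: copy the new files into place.
    shutil.copytree(source, live)


def rollback(live: Path, backup: Path) -> None:
    """Step 5: restore the previous version from the backup."""
    if not backup.exists():
        raise RuntimeError("no backup to roll back to")
    if live.exists():
        shutil.rmtree(live)
    shutil.move(str(backup), str(live))


def demo() -> list[str]:
    """Deploy v1, deploy v2, then roll back, recording what is live each time."""
    seen = []
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        live, backup = root / "live", root / "backup"
        for version in ("v1", "v2"):
            release = root / version
            release.mkdir()
            (release / "index.html").write_text(version)
            deploy(release, live, backup)
            seen.append((live / "index.html").read_text())
        rollback(live, backup)
        seen.append((live / "index.html").read_text())
    return seen


if __name__ == "__main__":
    print(demo())
```

The point is repeatability rather than sophistication: the same two functions run the same way every time, which is what the GUI drag-and-drop in step 1 cannot promise.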
25. Observability
1. Is there anything weird in the logs?
2. How much CPU/etc. is my application using?
○ How much does the thing my application is running on have?
3. How fast is it responding?
○ How fast does it usually respond?
4. How many other applications/services* does my application
depend on?
○ Repeat 1-4 for each of those
* Remember that web servers, databases, caches and queues are just applications someone else wrote
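Question 3 pairs "how fast is it responding?" with "how fast does it usually respond?". One rough way to compare the two is to flag a response time that sits well above a recent baseline. This is a minimal sketch under assumptions: the function name `is_unusual`, the sample numbers and the three-standard-deviation tolerance are all illustrative choices, not a recommendation from the talk.

```python
from statistics import mean, stdev


def is_unusual(baseline: list[float], current: float, tolerance: float = 3.0) -> bool:
    """Return True if `current` is far above the usual response time.

    `baseline` holds recent response times in seconds (at least two values).
    "Far above" means more than `tolerance` standard deviations over the mean.
    """
    usual = mean(baseline)
    spread = stdev(baseline)
    return current > usual + tolerance * spread


if __name__ == "__main__":
    usual_times = [0.20, 0.22, 0.19, 0.21, 0.20]  # hypothetical samples
    print(is_unusual(usual_times, 0.21))
    print(is_unusual(usual_times, 0.90))
```

The same comparison applies to each dependency in question 4, since each of those is just another application with its own "usual".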
27. Service Level Agreement (SLA) example
● SLA - My website must be available 99.9% of the time each
month (at most about 43 minutes of downtime) or my paying customers get a (partial)
refund
● Service Level Objective (SLO)
○ Back-end - API must not be down for more than 43 minutes each
month
○ Front-end - Front-end must not have a broken User Experience (UX)
for more than 43 minutes each month
https://en.wikipedia.org/wiki/High_availability
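The 43-minute figure is just arithmetic: 0.1% of a 30-day month. A short sketch of that calculation, assuming a 30-day month (the function name `downtime_budget_minutes` is made up for this example):

```python
def downtime_budget_minutes(availability_percent: float, days: int = 30) -> float:
    """Minutes of allowed downtime for a given availability target."""
    total_minutes = days * 24 * 60  # 43,200 minutes in a 30-day month
    return total_minutes * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        print(f"{target}% -> {downtime_budget_minutes(target):.1f} minutes per month")
```

Each extra "nine" cuts the budget by a factor of ten, which is why the target should be agreed before anyone promises it to paying customers.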
28. Service Level Agreement (SLA) example
How do you define “down” or “broken”?
● Service Level Indicator (SLI)
○ Back-end - API must respond to requests in less than 5 seconds,
99.9% of the time
○ Front-end - Time To Interactive* must be less than 10 seconds,
99.9% of the time
* https://web.dev/interactive/
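An indicator like "responds in less than 5 seconds, 99.9% of the time" can be checked against a list of measured response times. The sketch below assumes the measurements already exist as a plain list of seconds; the function names and sample data are hypothetical.

```python
def fraction_within(latencies_seconds: list[float], threshold_seconds: float) -> float:
    """Fraction of requests that responded faster than the threshold."""
    if not latencies_seconds:
        raise ValueError("no measurements to evaluate")
    good = sum(1 for latency in latencies_seconds if latency < threshold_seconds)
    return good / len(latencies_seconds)


def meets_target(latencies_seconds: list[float], threshold_seconds: float,
                 target_fraction: float) -> bool:
    """True if enough requests were fast enough to satisfy the indicator."""
    return fraction_within(latencies_seconds, threshold_seconds) >= target_fraction


if __name__ == "__main__":
    # Hypothetical sample: 998 fast requests and 2 slow ones.
    measured = [0.3] * 998 + [7.0] * 2
    print(fraction_within(measured, 5.0))
    print(meets_target(measured, 5.0, 0.999))
```

Two slow requests in a thousand are already enough to miss a 99.9% target, which shows how little room that number leaves.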
29. Service Level Agreement (SLA) example
Any decent observability stack can capture this information
https://www.g2.com/categories/enterprise-monitoring
30. Efficiency
Code problem, or a process problem?
● I know where the website is slow, but I don’t know
why the website is slow…
● It takes ages to release my feature, who or what
keeps holding it up?
31. Know your stack
Learn where your stack has strengths and weaknesses
● What API calls, scheduled tasks or page renders take longer
than average?
● Using profiling tools, what specifically is slow?
● Can I solve it in the language?
● Can I solve it at the database/cache?
● Can I solve it by re-architecting?
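As one example of "using profiling tools", Python ships with `cProfile` and `pstats` in its standard library. The sketch below profiles a deliberately wasteful function; `slow_render` and `profile_report` are hypothetical names invented for this illustration, and other stacks have their own equivalent tools.

```python
import cProfile
import io
import pstats


def slow_render(n: int) -> int:
    """Hypothetical slow page render: deliberately wasteful work."""
    total = 0
    for i in range(n):
        total += sum(range(i % 100))
    return total


def profile_report(func, *args) -> str:
    """Run `func` under the profiler and return the top entries as text."""
    profiler = cProfile.Profile()
    profiler.enable()
    func(*args)
    profiler.disable()
    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    # Sort by cumulative time so the most expensive call paths come first.
    stats.sort_stats("cumulative").print_stats(5)
    return out.getvalue()


if __name__ == "__main__":
    print(profile_report(slow_render, 20000))
```

The report names the functions where the time went, which turns "the website is slow" into a specific place to start looking.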
32. DevOps Research and Assessment (DORA) metrics
Deployment Frequency (DF), Mean Lead Time
for changes (MLT), Mean Time To Recover
(MTTR) and Change Failure Rate (CFR)
● Ship it quicker
● Ship smaller changes more
often
● Fix bugs quicker
● Detect bugs earlier
...keep the site online more
https://cloud.google.com/blog/products/devops-sre/using-the-four-keys-to-measure-your-devops-performance
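The four metrics can be computed from a simple record of deployments. This is a rough sketch only: the `Deployment` record, its field names and the sample data are assumptions made for illustration, and real tooling would pull these timestamps from version control, CI and incident systems.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Deployment:
    committed_at: datetime
    deployed_at: datetime
    failed: bool = False
    recovered_at: Optional[datetime] = None


def four_keys(deployments: list[Deployment], period_days: int) -> dict[str, float]:
    """Compute DF, MLT, CFR and MTTR for a non-empty list of deployments."""
    if not deployments:
        raise ValueError("no deployments in this period")
    lead_hours = [(d.deployed_at - d.committed_at).total_seconds() / 3600
                  for d in deployments]
    failures = [d for d in deployments if d.failed]
    recover_minutes = [(d.recovered_at - d.deployed_at).total_seconds() / 60
                       for d in failures if d.recovered_at is not None]
    return {
        "deploys_per_day": len(deployments) / period_days,
        "mean_lead_time_hours": sum(lead_hours) / len(lead_hours),
        "change_failure_rate": len(failures) / len(deployments),
        # Reported as 0.0 when nothing failed and recovered in the period.
        "mean_time_to_recover_minutes":
            sum(recover_minutes) / len(recover_minutes) if recover_minutes else 0.0,
    }


def sample() -> list[Deployment]:
    """Hypothetical data: four deployments, one of which failed."""
    start = datetime(2024, 1, 1, 9, 0)
    return [
        Deployment(start, start + timedelta(hours=2)),
        Deployment(start, start + timedelta(hours=4)),
        Deployment(start, start + timedelta(hours=6), failed=True,
                   recovered_at=start + timedelta(hours=6, minutes=30)),
        Deployment(start, start + timedelta(hours=4)),
    ]


if __name__ == "__main__":
    print(four_keys(sample(), period_days=2))
```

Tracking these numbers over time matters more than any single value: the goal is to see whether changes to the process actually make shipping quicker and recovery faster.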
35. Composure
● Be comfortable on the command line*
● Figure out how to get quick access to
○ Logs**
○ CPU/memory/IO metrics***
● Know how to roll back
● Check if it’s safe to roll back
● Know when to ask for help
● Know who to ask for help
* https://www.learnenough.com/command-line-tutorial/basics
** https://phoenixnap.com/kb/how-to-view-read-linux-log-files
*** https://www.tecmint.com/command-line-tools-to-monitor-linux-performance/
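"Quick access to logs" can be as simple as a helper that pulls recent error lines out of a log file. This is a sketch under assumptions: the function name `recent_errors`, the `ERROR` marker and the log path in the usage example are all hypothetical, and real log formats vary.

```python
from collections import deque
from pathlib import Path


def recent_errors(log_path: Path, last_n: int = 100, marker: str = "ERROR") -> list[str]:
    """Return lines containing `marker` from the last `last_n` lines of a log."""
    with log_path.open(encoding="utf-8", errors="replace") as handle:
        # A bounded deque keeps only the tail, so large files stay cheap in memory.
        tail = deque(handle, maxlen=last_n)
    return [line.rstrip("\n") for line in tail if marker in line]


if __name__ == "__main__":
    path = Path("app.log")  # hypothetical log file
    if path.exists():
        for line in recent_errors(path):
            print(line)
```

Having something like this ready before an incident means one less thing to work out while the pager is going off.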
37. DevOps interviews are tricky
● Coding tests are rare (sometimes a terminal
test) but whiteboarding is common
● Technology specifics will usually be based
on their in-house stack
● Conceptual answers should be valid*
● Admit when you don’t know how something
works (but…)
● Provide examples of alternative approaches
where possible
* Too many different technology stacks, so try to relate theirs to something more familiar to you
38. On-call
● Ask what a typical on-call shift
looks like
○ How many out-of-hours
pages do they get?
● Ask how many other people
and teams are also on call
● Ask how incidents are
prioritised and what the expected
resolution times are
39. Should you consider it?
● You will get an opportunity to learn many new things, but your work will be
less visible to stakeholders
● Having root access means more risk, more danger
● Folks are very keen to teach, but it takes the right mindset to learn
● Be careful, as lots of jobs want rebranded sysadmins and have no intention of
fixing their broken processes (don’t let the salary suck you in)
41. Summary
● Learn why to do something, not
how
● Start by analysing and optimising
what you are familiar with
● Pick the tools you find easiest to
use, but be aware of others
● Look at the human factors
● Measure everything, otherwise you
won’t know if it’s faster
● Be careful with job opportunities
Infrastructure Engineer at PartnerStack
(https://jobs.lever.co/partnerstack)
www.slideshare.net/secret/dYrg0kLRxzp3K