Across industries, modern operations teams have noted the emergence of a new role: the Site Reliability Engineer (SRE), a new IT craftsperson who fuses software engineering and operations best practices to enable highly reliable software systems. Once the domain of web-scale businesses, this discipline is both applicable and important for any organization looking to differentiate itself in a world increasingly defined by software.
In this session, Todd Palino from LinkedIn explores SRE from organizational, team, and individual perspectives. He’ll describe how, through automation and problem solving, SRE can permeate a technical organization – not only ensuring a high-performing, always-available site, but also informing decision making in everything from system procurement to application design, builds, and deployment.
Todd will talk in depth about what constitutes the best in SRE in a DevOps world, using examples to examine the techniques needed to accelerate value and grow teams. Join Todd as he takes the lid off SRE at LinkedIn, describing how it started and continues to evolve, what goals are important, and how it’s instrumental in building the high-trust, inclusive team culture needed to drive continuous improvement and, importantly, have lots of fun doing it!
Redefine Operations in a DevOps World: The New Role for Site Reliability Engineering
Todd Palino
DO2T61S
DEVOPS: AGILE OPERATIONS
Senior Staff Engineer, Site Reliability
LinkedIn
Site reliability engineering is a phrase that was first coined by Ben Treynor at Google in 2003, and he’s described it as what happens when you treat operations like a software problem. SRE codifies the rules about how we run our infrastructure, develops tools that implement those rules, and monitors the entire system to make sure it’s working the way we expect. And when it doesn’t, we mitigate the problem and add more to our tools to make them work better the next time.
Many companies have SRE organizations now, with one of the largest being at LinkedIn. Of course, Facebook, Apple, and Netflix are no slouches either. We all do it a little bit differently – for example, at LinkedIn SRE and software engineering are different organizations, as opposed to at Google where SREs and SWEs are part of the same organization. There’s no one right way to SRE, and you don’t have to have a huge company either. This does make my job here today a little tricky, as I can’t give you a step by step guide on how to implement SRE at your company. What I can do is tell you what SRE is, and what kind of environment it thrives in. LinkedIn is an excellent example of this, and here’s why.
These are LinkedIn’s six core values: members first; relationships matter; be open, honest, and constructive; demand excellence; take intelligent risks; and act like an owner. They’re also the reason SRE works at LinkedIn, and you’re going to hear these themes echoed in everything I talk about. If we left out any of these, our site reliability organization would not be effective. Now, these are not your company’s values, unless you’re one of my colleagues, but you probably recognize all of them as being important.
This is a mantra you often find in DevOps organizations – move fast and break things. It’s not completely crazy – if you’re taking intelligent risks, sometimes things are going to break. But breaking things should not be the goal. Yes, we need to move quickly. But we should not treat breaking things as the holy grail. We have users, and the user experience matters. I want to enable my developers to move quickly in the safest way possible, and this is one of the things that differentiates SRE – our focus is first on site up. What does that mean? Site stability comes first – if there is a problem with the site that is impacting users, getting that mitigated at the very least is everyone’s primary focus. And this is built into our DNA across the entire company. Even the development team that I work with, when they develop their OKRs each quarter, has as the very first objective “Site Up”.
Distilling it down, beyond keeping the site up, SRE’s job is to automate all the easy things, and to make the hard things easy. Don’t worry, you’re not going to run out of things to do. Remember, you’re working with developers…
There are a lot of them, and they’re always creating new features and applications because they’re being driven by product teams…
Who may not always be as focused on stability as you are. So they need tools, controls, and data to be able to accomplish their goals, and do it in a way that maintains the reliability of the entire site. That’s where SRE comes in.
This is Ben, who is one of our SRE managers. He’s written a fantastic post series about SRE and operations titled “Every Day is Monday in Operations” – the URL is at the end of the presentation. This picture came about when he said during an all hands that you could call him any time when there’s a problem, and posted his phone number. But really, this isn’t SRE. We’re not superheroes. I’m more like…
Batman. Well…
That’s more like it. I don’t have super powers. I build wonderful toys to solve problems. LinkedIn SRE has a lot of tools available to us, which we’ve either built or improved, to make everyone’s job easier, both SRE and software engineer. We have build and deployment systems that automate getting code into production, in a manner consistent with our policies. We have monitoring and alerting systems that are common across the entire site. We have an auto-remediation system called Nurse that can respond to alerts and run through mitigation and recovery without waking us up. And when we find a new task, we write a new tool. This is because an SRE is also…
Professionally lazy. My job is not to respond to the alert and get the site back up. OK, it’s part of my job. But really, my job is to make sure that the problem never happens more than once, and that we spend as little time as possible finding and fixing the issues. My job is to automate myself out of a job. To the old school sysadmins, this sounds like a bad idea – you want to make yourself indispensable. You want to make sure that everyone appreciates you and realizes that the site would fall apart without you. If you do that, they can’t let you go. They also can’t let you go on vacation. As an SRE, I know there is always another challenge. The developers are always adding new features that I need to understand, monitor, and ensure are built to be scalable and easy to run. There’s always an improvement I can make to our monitoring, or a system that needs to be tuned a little better. I want the next challenge, not last week’s challenge. Besides being lazy, what kind of skills does a typical SRE need to demonstrate?
One of the biggest skills that you have to hire for is being able to work with code. Not only being able to write it, and write it well, but also being able to read and review it. SREs write a lot of tools, and because those tools are part of the infrastructure that keeps the entire site running, they are treated as first-class applications, just like the products that serve our members. We strive to write good applications, not just scripts that we’ve hacked together. And like a good engineering organization, we have code reviews. It doesn’t stop there, however.
My particular brand of SRE is called “embedded SRE”. This means that my team is embedded with a development team, and we work together on a product as a single team. In my case, this is Apache Kafka, which many of you may be familiar with as a high-performance streaming data infrastructure. The Streaming SRE team and the Streaming Engineering team sit together, we plan work together, and we handle issues together. Even though we have separate management chains, for all intents and purposes we are a single team. As such we both write tools to support running Kafka, such as monitoring applications, though the SRE team spends more of their time on this task. We also both debug deeper problems in Kafka, or discuss and plan new features. The software engineering team tends to spend more time on this side of the work. We each have our own expertise that we bring to the team – they go much deeper on the code, particularly the product code, than SRE does. But SRE knows much more…
How all the pieces fit together. SREs need to not only understand their own applications, and how the various bits interact, but also how the site infrastructure in general works. A site like LinkedIn has hundreds of applications that variously depend on each other. It’s difficult for the development team to understand what all of these applications do, which ones are upstream, and which ones are downstream. And then there are all of the infrastructure tools that help us deploy applications. How do we get hardware? How do we define where apps get deployed? How do you set up network access controls? Source code ACLs? Network port numbers? SRE is here to help take the application that has been developed, and both get it, and keep it, running to serve the members. We have to know what all the pieces are, who is responsible for them, and how to use them. Especially in a large organization, this is a lot of information, and we don’t want to be the irreplaceable ops guy, so…
A lot of an SRE’s job is knowing where to find the answers. I’m sure I’m not the only person here who looks like a genius to their friends and family because we know how to Google the answer to their questions. Seriously, we all know how screwed we would be if Google or StackOverflow went down. But as long as they’re up, and thank your deity of choice that they have SREs, I don’t need to know all the answers, I just need to know where to look for them. It’s not too hard to identify this type of person, because they’re the ones who are willing to answer a question asked of them with “I don’t know, but here’s how I’d go about looking for the answer.” They’re also the person who is constantly learning new things, which is another important attribute for an SRE.
We’ve talked about what SRE is, and what an SRE looks like. So how do you go about building an SRE organization that works? I was going to talk about hiring the right people here, but that’s not what comes first. Before you can hire SREs, you need to have a company that is willing to support them. All the technical expertise in the world will do you no good if you have an environment that ties their hands. So we have to start with a company that is willing to listen to the site reliability organization, and trust in their assessment and direction. Depending on how bad things are, this might include halting new features until the site can be stabilized. At LinkedIn, we call this a Code Yellow – the team that is in this state is declaring that they have to pare back everything they’re doing to stabilize their current problems. It’s not a failure – it’s a declaration that things are bad, it’s not acceptable, and they’re prioritizing fixing it. The first thing you need is data. If you don’t have it, it needs to be the first thing SRE works on.
What gets measured, gets fixed. This is a quote from David Henke, who led Engineering and Operations at LinkedIn for many years. He can be credited with the pivot in our technical organization that got us focused on fixing what was wrong with the infrastructure as we were in our hypergrowth phase. You cannot tackle the problem if you don’t know that you have a problem, or how bad it is. SRE loves data. Data cuts to the truth of the situation, without bias or equivocation. LinkedIn currently generates over 100 terabytes a day of application metrics, logs, tracking and measurement data. I know, because it’s one of several types of data that Kafka carries internally. This is what drives the decisions we make – what problem is impacting our members the most, what feature is the most worthwhile to pursue, which team has the most on-call pain? Once SRE can identify where the problems are, we can attack them. Which leads us to our first big culture point for SRE.
We are here to attack the problem, not the person. Another quote from David, which was spoken specifically about incident post-mortems, but applies overall to how we work. LinkedIn strives to maintain a blameless environment in engineering and operations. We all make mistakes – I, myself, have knocked out the entirety of one of our backend datacenters with a broadcast command that wiped out all of Zookeeper. Knowing who is to blame does nothing for fixing the situation that led to a problem. It only serves to make that person feel isolated. Breaking things happens – if you’re taking intelligent risks, some of them are going to fail. These are opportunities to learn, and figure out how to do better next time.
I saw one of the best examples of this just this past August. At the end of August we had our SREinCon – an internal 2-day conference for our SRE organization where we get together and share what we are doing and what we are learning. This year, one of my colleagues stood up on stage and told her story about the anatomy of a major incident that took place shortly after she started at the company, more than a year before the conference. In that incident, the actions she took to try and mitigate the problem ended up taking down the entire site. She spoke of how it all happened, and what was learned both organizationally and personally. That she was willing to stand in front of her peers and discuss that openly and honestly is the finest testament to our blameless culture. The conference itself is another cultural touchstone.
This is having a collaborative environment. We’ve all seen companies where internal political squabbles detract from the values and mission of the company. I hope this is not somewhere where any of you are right now. We know that this is toxic and will destroy a team, and a large part of that is because you cannot be open and honest. David also promoted the idea of the Four Agreements – these are the agreements that we need to have with our colleagues. Do your best. Be impeccable with your word. Don’t assume anything. And don’t take things personally. These four agreements are the foundation for collaboration, and a team that can trust each individual to be doing their job, and not working against the team. When I trust the rest of the team to handle their own applications well, it frees me to work on my own. This increases the efficiency of the entire company, because we are not duplicating effort.
It also means that when someone has feedback for me, I can accept it without being concerned about ulterior motives. And I want their feedback, because it’s the only way we can improve. Yes, Mr. Henke is a very smart man. The only bad feedback is no feedback, especially when it comes to tools and infrastructure. It either means that the application is not being used, or the people using it don’t care enough to improve it. If it’s that people don’t care, well, you have a culture problem that needs to be addressed. But if the problem is that the application is not being used, especially if it’s an infrastructure tool that it’s expected everyone is using, it means that there are problems that are being worked around, and for some reason there is a lack of trust that if given the feedback, it will be acted on.
Of course you’re going to have to compensate your SREs well. But we can get money anywhere. Especially if you’re located in a tech-dense area, such as the San Francisco Bay Area, good engineers are in high demand. If you want to hire and keep your SREs, you need to offer more.
The key is to provide the opportunity to learn, grow, and be recognized for it. When I was hired at LinkedIn, nearly four years ago, I knew what SRE was, though it wasn’t my role at the time. What I didn’t know was the first thing about big data. I had no idea what Apache Kafka was. I was brought in for my general skills as an engineer, my ability to mesh with the team, and my ability to learn. Since that time, I have essentially reinvented my career, and that is thanks in large part to the support I have received from LinkedIn, and my management chain, to build my own brand. Sure, I’m paid well. But I stay because I’m treated well, and I have the ability to make a real impact. Not only on LinkedIn, but also for our members. That impact is a direct function of my ability to be a technical leader within SRE.
Part of being an SRE is engaging with other teams. Our infrastructure consists of hundreds of interconnected applications, so it’s rare that you’ll run into an issue that isn’t shared by multiple teams. Even if it’s a problem with your own application, it nearly always impacts someone else, either upstream or downstream. This is one of the reasons that LinkedIn’s career progression path documents specifically call out an increasing amount of interaction with teams both inside and outside of the company. As an example, some of my responsibilities that are not directly related to the application I run include leading our Site Reliability Technology Leadership Group, developing new standards for incident management across the entire company, and taking advantage of wonderful opportunities like these to talk with my peers in the industry about both Apache Kafka and site reliability engineering.
But when you make it a priority for your SREs to constantly improve themselves, and you must, you also need to accept that sometimes you need to let them go. At LinkedIn we frequently use the phrase “next play”. For any sports fans, the origin of this phrase is with college basketball, and a coach who would constantly emphasize “next play” every time his team completed a sequence. For us, it’s the same – you completed that project? Great, what’s the next play? But it also refers to your personal next play. As individuals, we must be constantly aware of what we need in our career, what the next logical step is. Sometimes this is movement within the company to a new team, and we make this very easy with clear guidelines and open doors for discussion. Sometimes it’s at another company. Either way, everyone must feel comfortable with discussing this with their management without fear, and management must support their engineers with making the changes that are right for that person first.
The most important thing to remember, though, is that culture starts at the top. I wouldn’t work here if I didn’t trust our entire team, from Jeff Weiner on down. LinkedIn’s values are fully supported by Jeff, Kevin, and Mohak. Our site reliability organization has been built by David, Bruno Connelly, and the teams they have built. David Henke said something else that I’d like to leave you with: you are only as good as your lieutenants. Leadership builds the team that supports their vision, and needs to trust them to execute on it. If nothing else, walk out of this room understanding that you need to be the change you wish to see in your company.
If you have more questions about SRE at LinkedIn, I encourage you to check out Ben’s post series titled Every Day is Monday in Operations. He collaborated with David Henke on this, and you’ll find it published on LinkedIn, of course. You should also check out Site Reliability Engineering, published by O’Reilly Media, which was authored by several SREs at Google. Keep in mind, of course, that neither of these will tell you exactly how SRE is supposed to work. They will only tell you how it works at LinkedIn and Google. You’ll need to take that information and figure out how it should look for you.