This is a session given by Emily Dowdle at the Nordic APIs 2016 Platform Summit on October 26th in Stockholm, Sweden.
Description:
I’m convinced Humpty Dumpty is a story of DevOps gone wrong.
Humpty Dumpty sat on a wall,
Humpty Dumpty had a great fall.
All the king’s horses and all the king’s men
Couldn’t put Humpty together again.
First, who asks a horse to do surgery? Hoofs can’t hold scalpels. Second, either the king’s men are inept or they’re not communicating. Two kindergarteners with some Elmer’s could have done the job.
You see, Humpty is a deploy. He was fine in staging but shit the bed in production. Now the site’s down and your boss is threatening everyone’s jobs. IT is saying the code is broken. The developers are saying it’s a server issue.
Meanwhile, Humpty is bleeding out. And your customers are complaining on Twitter. Which means a customer service rep has entered the #incident channel to tell you the site’s down. Yeah, no shit, Tom.
Sound familiar?
DevOps is the new Agile. Everyone “does it” but few fully embrace it.
This talk will focus on common pitfalls and how to ensure your entire API team — ops, IT, sysadmins, SREs and developers — stop blaming each other and work together.
I’ll cover accelerating API development by empowering your engineers, reducing incidents by simplifying deploys and moving toward continuous deployment by utilizing agile API development.
Humpty Dumpty: A Story of API DevOps Gone Wrong (Emily Dowdle)
1. HUMPTY DUMPTY DEVOPS
HUMPTY DUMPTY | A Story Of DevOps Gone Wrong
Emily Dowdle // emilydowdle.com
2. Emily Dowdle
emilydowdle.com
@editingemily
3. Humpty Dumpty
Humpty Dumpty sat on a wall,
Humpty Dumpty had a great fall.
All the king’s horses and all the king’s men
Couldn’t put Humpty together again.
4. You see, Humpty is a deploy.
5. Humpty Dumpty sat on a wall,
Humpty Dumpty had a great fall.
All the king’s horses and all the king’s men
Couldn’t put Humpty together again.
The Important Bit (In case you missed it.)
6. (It’s ops.)
7. But it doesn’t have to be that way.
8. Change is hard. And a little scary.
9. A long history of conflict.
10. Conflict: We have different priorities.
12. Fundamentally unfair.
13. And they deserve to feel like this.
29. FAILURE | Practice Failing Together
Embrace Failure: It happens. Seriously.
30. Leave Your Ego At The Door
31. Don’t Point Fingers: Make failure fabulous.
32. Postmortems: They’re not optional.
33. Postmortem Questions
1. What happened?
2. What was impacted?
3. When did it happen?
4. Who was involved?
5. How was it discovered?
6. Why did it happen?
7. What’s the solution?
8. When will it be fixed?
34. Making your job awesome is your job.
35. Emily Dowdle
emilydowdle.com
@editingemily
Editor's Notes
Hello, I’m Emily Dowdle. I’m a software engineer at Wazee Digital in sunny Denver, Colorado. I’m excited to be here with you all in [ ____________ ].
I want to talk to you about DevOps.
You see, I’m convinced Humpty Dumpty is a story of DevOps gone wrong.
Humpty Dumpty sat on a wall,
Humpty Dumpty had a great fall.
All the king's horses and all the king's men
Couldn't put Humpty together again.
First, who asks a horse to do surgery? Hoofs can’t hold scalpels.
Second, either the king’s men are inept or they’re not communicating. Two kindergarteners with some Elmer’s could have done the job.
You see, Humpty is a deploy. He was fine in staging but shit the bed in production. Now the site’s down and your boss is threatening everyone’s jobs.
IT is saying the code is broken. The developers are saying it’s a server issue.
Meanwhile, Humpty is bleeding out. And your customers are complaining on Twitter. Which means a customer service rep has entered the #incident channel to tell you the site’s down. Yeah, no shit, Tom.
Sound familiar?
We’ve all been there. A deploy goes awry and the entire department is up in arms, defending themselves and blaming each other.
In other words, All the king’s horses and all the king’s men aren’t communicating.
Now I’m on the Dev side of DevOps, so I won’t say who are the horses in this situation.
On most teams, there is a tension that could be described as friction, attitude, or a general inability to tolerate each other without eye rolls and audible sighs.
What I like to call good ’ol Southern-style passive aggressiveness.
But it doesn’t have to be that way.
If we’re going to solve problems, real-world, critical problems — the kind that get our hearts pumping and make us want to leap out of bed in the morning — we’re going to have to change our attitudes.
But change is hard.
And it’s a lot of responsibility.
Sometimes I’d like my boss to walk in and tell me exactly what to do. To take all the responsibility and make the final call.
That is the easier path. But it’s not the right one.
I believe making our jobs awesome is our job.
So what’s next? Where do we start?
We have to start by taking a look back. And acknowledging the decisions and processes that got us in this mess to begin with.
Like it or not, developers are measured by the number of features they release. No CEO has ever cracked open code to review your thorough test suite or pondered the glorious variable name you picked out. (I appreciate it, though. So you have that going for you.)
If all of us decided to tackle our growing mound of tech debt this month instead of working on the latest and greatest idea the sales team came up with, you’d better believe we’d be hauled into someone’s office and chided.
But operations people are measured on an entirely different aspect of the business: site reliability and uptime. And you better believe keeping a site up 99.999% of the time is no easy feat.
I’ll spare you the math. That’s a little over 5 minutes of downtime per year. FIVE. MINUTES. PER. YEAR.
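For anyone who does want the math, the downtime budget falls straight out of the availability target. A quick sketch (my own illustration, not something from the talk):

```python
# Yearly downtime budget implied by an availability target.
MINUTES_PER_YEAR = 365.25 * 24 * 60  # 525,960 minutes

def downtime_minutes_per_year(availability: float) -> float:
    """Minutes of allowed downtime per year at the given availability."""
    return MINUTES_PER_YEAR * (1 - availability)

for availability in (0.99, 0.999, 0.9999, 0.99999):
    budget = downtime_minutes_per_year(availability)
    print(f"{availability:.5%} uptime -> {budget:8.2f} min/year")
```

Five nines works out to roughly 5.26 minutes per year, which is where "a little over 5 minutes" comes from.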
So, to break this down, developers must deploy new code to release new features. But deploys are the most frequent cause of downtime.
No wonder we’re natural enemies.
Our different priorities lead us to butt heads. And we quickly devolve to working around each other while mumbling about the other team’s ineptitude.
Which puts us in a vicious cycle.
Developers get excited about a feature and throw it over to ops to deploy and support.
And all the responsibility to keep the site up and customers happy is dumped on operations.
Which leaves ops people feeling a bit like grumpy cat.
Which is fundamentally unfair.
They deserve to feel like this absurdly happy dog.
Look at him. If I felt like this dog when I got into work every morning, I would NEVER LEAVE MY JOB.
I think we all deserve to feel that level of passion about our work.
And I think I know how we can get there.
If you’re in operations, you need to empower your developers.
And that starts with…
Trusting your team.
You’re on the same side. If a developer says the code works, trust them. They’re not lying to you. And they don’t want to make your life a living hell. They honestly believe the code works.
BRIDGE THE SKILLS GAP
Ben Treynor of Google describes site reliability engineering as what happens when you ask a software engineer to design an operations function.
I would bet 90% of our miscommunications stem from a lack of knowledge in a particular area.
Now this doesn’t mean the person you’re talking to is stupid. None of us are.
It means they haven’t had a chance to work in the tech stack or tooling you use every day.
So teach them. Take two hours a week and pair. Document your configurations. Make it easy for developers to find answers independently.
It’s worth the investment.
GIVE READ-ONLY ACCESS TO ALL DEVELOPERS
To what? To EVERYTHING. I’m not saying to hand out root access like candy.
But you are not the gatekeeper of information. Do you like being interrupted every 5 minutes so you can copy and paste an error message? I didn’t think so.
Developers are writing the code that runs on your systems. It’s not a reach to think they should be able to get some feedback about whether it works. After all, don’t expect developers to jump in and help when they don’t have access to your machines.
CREATE CONSISTENT PLATFORMS
Pay attention to the parity between environments. Staging and production should be identical. That means the same allocated resources and the same data. Otherwise deploying will always be a roll of the dice.
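One cheap way to catch parity drift is to diff the configuration keys of the two environments on every deploy. A minimal sketch, assuming dotenv-style KEY=VALUE config files (the file names and format are my assumptions, not anything prescribed in the talk):

```python
def parse_keys(text: str) -> set[str]:
    """Collect the KEY names from dotenv-style KEY=VALUE text, skipping comments."""
    keys = set()
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            keys.add(line.split("=", 1)[0].strip())
    return keys

def parity_report(staging: set[str], production: set[str]) -> dict[str, list[str]]:
    """Report keys present in one environment but missing from the other."""
    return {
        "missing_in_production": sorted(staging - production),
        "missing_in_staging": sorted(production - staging),
    }

# Example usage (hypothetical file names):
# report = parity_report(parse_keys(open("staging.env").read()),
#                        parse_keys(open("production.env").read()))
```

Comparing keys rather than values keeps secrets out of the report while still flagging the "it worked in staging" class of config drift.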
SHARE SOURCE CONTROL
Keep your configuration tools on GitHub with the rest of your company’s code. Code is code. It’ll be much easier for operations and developers to solve problems together if everyone knows how to locate the affected code.
ADD YOUR DEVS TO THE ON-CALL ROTATION
My friend likes to say, “You build it, you support it.”
No one likes to be woken up at 2:00 a.m. And if you’re tired of stumbling through the dark toward your computer in the middle of the night, share the pain.
There’s no reason developers shouldn’t be on rotation. Remember, they can access logs and view your configuration tools now. Awesome!
SIMPLIFY DEPLOYS
Pushing code to production should not be a production.
Unnecessary steps increase the opportunity for error and decrease the number of people who can deploy.
And one more thing. Stop preventing developers from deploying their code to the QA and staging environments. Seriously.
If I have to ask permission to test my shit anywhere other than dev, you deserve to put out the fire.
Developers: don’t be assholes. Cut the attitude. A little empathy goes a long way.
You can’t avoid talking to your operations team at all costs and then expect them to blindly support your deploys.
Which brings me to…
MAKE OPERATIONS PART OF THE PLANNING PROCESS
Thinking about a feature? Include operations.
Talk about what will change before you write a single line of code. Discuss why this feature is important, who will need to be involved and what the risks are.
You can’t deploy mystery code and then get irritated with your operations team when they start asking 100 questions.
MAKE SMALL CHANGES. DEPLOY. REPEAT.
If your feature requires you to change 30% of your app’s spaghetti code, break the feature into smaller pieces.
Not sure if your feature is too big? Apply the same rule you use for method naming. If the method needs an “and” it’s doing too much. Small deploys make it MUCH easier to determine what went wrong in case of failure.
COMMUNICATE
You know how you already included operations in your feature planning?
Notify the operations team when you deploy too. Whether you use Slack or HipChat, make sure all developers and operations people have a single place to communicate.
Many companies use an #incident channel. Find what works for you and then use it.
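Deploy announcements are easy to automate so nobody has to remember to post them. A sketch that sends a message to a Slack-style incoming webhook; the webhook URL, service name, and message wording here are placeholders of my own:

```python
import json
import urllib.request

def deploy_message(service: str, version: str, deployer: str) -> dict:
    """Build the JSON payload for a deploy announcement."""
    return {"text": f":rocket: {service} {version} deployed by {deployer}"}

def notify(webhook_url: str, payload: dict) -> None:
    """POST the payload to a Slack-style incoming webhook."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(request)

# Example usage (the URL is a placeholder, not a real webhook):
# notify("https://hooks.slack.com/services/T000/B000/XXXX",
#        deploy_message("api", "v1.4.2", "emily"))
```

Wiring a call like this into the deploy script means the notification happens on every deploy, not just the ones someone remembers to mention.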
YES, AND…
There’s a rule in improv that forces participants to say “yes, and…” rather than “yes, but…”
Try this next time you’re in a meeting and the results will likely surprise you.
The simple language change will make everyone feel heard, validated and a part of the team.
BE OPEN TO OTHER OPTIONS
If someone on operations says there’s going to be a problem, assume there is going to be a problem.
That means shutting your mouth and really hearing what they have to say. You’re an engineer, not God.
The core competency of operations is site reliability.
Let them help you. The solution you come to together will be much better than the one you thought of on your own.
HAVE SOME HUMILITY
If someone was woken up in the middle of the night because of something you released, say sorry. Buy some coffee. Help ’em out. Own your shit.
When you take responsibility for a mistake, your colleagues are much less likely to make a voodoo doll of you and keep it by their bed.
Failure is never a question of if, but when.
You will fail. A deploy will bring down the site. A typo in your configuration will bring users to Twitter fisticuffs.
It happens. We’re human. And until Skynet, we’re all stuck dealing with our occasional mistakes.
HAVE A HEALTHY ATTITUDE AROUND FAILURE
You need to 80/20 your failure preparedness procedures. It’s okay to spend 80% of your time trying to prevent failure, but devote at least 20% to practicing how you will handle failure when it happens.
We all half-ignore the safety talk given at the start of every flight, but I appreciate that oxygen falls from the ceiling in the event Bane decides to crash my plane.
LEAVE YOUR EGOS AT THE DOOR
When I started powerlifting seriously, I joined a small team of intimidating lifters.
The head of the group — a 60-year-old Juggernaut-like man whose traps rose to just under his ears — had one rule: leave your ego at the door.
It didn’t matter that we had to strip off 400 pounds every time it was my turn to squat. All that mattered was that I listened, learned and respected the team. We could all learn a lot from that.
STOP POINTING FINGERS
It never feels good to make a mistake. And when 20 people are required to rectify it, it feels even worse.
When I screw up, I’m embarrassed.
And if I feel attacked, I become defensive. I think most of you would probably say the same.
Let’s give everyone a little slack. It could have been your typo.
Postmortems aren’t optional. Don’t leave it up to the one enthusiastic junior on your team to throw a flag on the play every time something goes wrong.
Make it part of your everyday workflow. Something goes wrong? Site goes down? Hold a postmortem within 48 hours. It’s just that simple.
There are 8 questions to focus on during a postmortem.
They focus on what happened, who was involved and how the team can rectify the root cause of the incident.
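Those eight questions are easy to turn into a template so nobody has to reconstruct them during an incident. A sketch that renders a fill-in-the-blanks postmortem document (the markdown-style layout is my own choice):

```python
# The eight postmortem questions from the talk.
POSTMORTEM_QUESTIONS = [
    "What happened?",
    "What was impacted?",
    "When did it happen?",
    "Who was involved?",
    "How was it discovered?",
    "Why did it happen?",
    "What's the solution?",
    "When will it be fixed?",
]

def postmortem_template(incident: str) -> str:
    """Render a fill-in-the-blanks postmortem for the given incident."""
    lines = [f"# Postmortem: {incident}", ""]
    for number, question in enumerate(POSTMORTEM_QUESTIONS, start=1):
        lines += [f"## {number}. {question}", "", "_TODO_", ""]
    return "\n".join(lines)
```

Generating the skeleton automatically (say, from the #incident channel bot) lowers the activation energy for actually holding the postmortem within 48 hours.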
If someone in your organization is constantly going on about who needs to be fired, the answer is always your boss. And if your boss isn’t cool about that, it’s time to find a new boss.
Responsibility is ALWAYS distributed amongst the entire team. And blame should be shared.
Failure should be celebrated because it is how we learn.
Remember…
Making your job awesome IS YOUR JOB.
And I hope I’ve given you a few ways you and your team can do just that.
Again, I’m Emily Dowdle. You can find me at emilydowdle.com or on Twitter @editingemily.
THANK YOU!!!