Operating a massively scalable, constantly changing, distributed global service is a daunting task. We innovate at breakneck speed to attract new customers and stay ahead of the competition. Simultaneously improving service quality and enabling rapid, continuous change seems impossible on the surface.
At Netflix, Operations Engineering is a centralized organization whose charter is to accomplish just that by applying high-leverage software engineering practices like continuous delivery, real-time analytics, and automation to solve operational problems. It's well established that many traditional IT Operations teams struggle to bridge the gap with software engineering. Operations Engineering is no exception. And while DevOps as a construct seeks to address this gap, it doesn't go far enough. It does not explain how to bridge the gap or even why it's important to do so.
In this talk we’ll use Netflix Operations Engineering as a case study to address these questions. We'll explore common challenges faced by operational teams and strategies to overcome them.
5. Some said
• You’re overloading us
• Too many projects
• Poor targeting
Others said
• What took you so long?
• We’ve moved on
• Now we need to migrate
That’s great but…
We’re paying a high tax
6. • Expectations gap
– Division of labor
– Timing of solutions
– Leadership
• Affects
– Reputation
– Relationships
– Lost opportunities
Organizational Debt
21. DevOps is a software development method that
emphasizes the roles of both software developers and
other information-technology (IT) professionals with an
emphasis on IT Operations.
- Wikipedia
The Gap
24. Operational Excellence is the continuous improvement of
the management, design, and function of operational
environments to achieve greater quality, velocity, and
competitive advantage.
25. • Engineering Tools
• Insight & Real-time Analytics
• Performance & Reliability
Operations Engineering is the application of software
engineering practices to achieve and sustain operational
excellence.
26. Operations Engineering
• Service provider
• Operational excellence driver
• Cross-cutting solutions
• Undifferentiated heavy lifting
28. • You’re overloading us
• What took you so long?
Remember that feedback?
• We made assumptions
– Requirements – what & when
– Time for non-product work
29. • Move from assumptions to knowledge
• Effect change without imposing a tax?
• Achieve and sustain operational excellence?
How do we…
33. • What are your biggest operational pain points?
• How can we help?
• How well are we meeting your needs today?
• What would you like to see from us in the future?
Listen
Shower, rinse, repeat
Talk to your engineering customers
34. Grease the Squeaky Wheels
• low tolerance for tax
• more vocal than most
35. • High impact solutions
• Clarity on deliverables
• Lower operational tax
• Leadership, innovation, and partnership
What they wanted
36. • Deliver on solutions
• Better road map definition & communication
• A more aggressive stance on automation
• Deeper investment into leadership, innovation, planning
Our commitments
37. 2. Make an impact
• Apply what you’ve learned
• Deliver what matters
38. • global cloud console
• end to end delivery
• automation platform
• velocity with confidence
62. • Nearing completion
• Aggressive schedule
• Unexpected delays
• Commitment to June delivery
Spinnaker 1.0 – 1H 2015
63. • Built their own continuous delivery solution
• Not positioned for engineering-wide support
• Believes common solutions
Edge Engineering
64. Partnership in Action
• Strong relationship
• Open discussions about concerns
• Decision - leaned forward
• +2 engineers on Spinnaker
• Successful 1.0 launch
65. Moving Forward Together
• Containers?
• Achieving alignment
• Collaborative exploration
– Edge, Platform, Operations
– A new paved road?
66. • Paved Road adopted
– Adding new ones
• Production Ready ongoing
• Migrations easier
• Reputation improving
• Improved
– Service uptime
– Rate of change
Payoffs
67. Putting it to the test in 2016
• Streaming production & test - EC2 Classic to VPC
• Highly cross-functional
• Complex dependencies
• Zero downtime
Stay tuned…
68. Five Strategies
1. Reach out
2. Make an impact
3. Make it easy to do the right thing
4. Reduce the cost of change
5. Develop partnerships
Java 6 – needed to move forward on Java but struggled to drive adoption
Perforce – many teams moving to Git – no story for supporting Perforce in the cloud
Jenkins – long queues & build times
Ant – long build times, inefficient dependency management
CentOS – slow delivery of new kernel and userland binaries
Asgard served us well as a deployment & cloud management tool
Mimir gave a great prototype and we learned a lot
Tech debt kept us from doing our jobs well
Does this sound familiar? Have any of you been on one side or the other of this situation?
To move forward we defined the concept of the paved road
The paved road promises a well-supported, integrated developer experience.
Java 7 – just to move forward – Java 8 already on the horizon
Git – organically adopted by many teams
Gradle – build time reduced due to efficient dependency management
Ubuntu – more frequent, well-vetted userland binaries & kernels
Jenkins shards to fix long build times
Started building our next generation cloud console & continuous delivery platform Spinnaker
We staffed up and went for it – big bang
Read to the audience:
He that can earn ten shillings a day by his labour, and goes abroad, or sits idle one half of that day, tho' he spends but sixpence during his diversion or idleness, ought not to reckon that the only expense; he has really spent or rather thrown away five shillings besides.
- Advice to a Young Tradesman
Please raise your hand if you know which puritanical workaholic wrote this.
In addition to the obvious intent behind this there is a more profound message.
Time spent working is related to the money you make but time is also in and of itself a form of currency.
It’s the exchange or giving of time that drives the economics of an engineering organization
Netflix has a freedom & responsibility culture. "You build it, you run it" aligns perfectly with our values around autonomy & ownership.
This leads to a high-pressure situation that creates a shortage of time.
Read definition out loud
Out of curiosity – who agrees with this definition? Who disagrees?
Not only is there disagreement but the general construct isn’t really that helpful
It doesn’t address how to bridge the gap or why it matters to do so.
What are the strategies for success?
It’s the practices, tools, culture
Motivations: the reason for doing DevOps is to achieve operational excellence.
We do the undifferentiated heavy lifting for our customers.
This means we take on the operationally oriented common engineering work across teams so that each team can focus on their core charter.
Going back to our Ben Franklin quote – time is a form of currency.
In our engineering world time really is currency. We don’t pay each other to do work.
We commit time to projects. In other words we have a time-based economy.
Audience – can anyone name one of the strategies?
Stop spamming us!
Audience – can anyone name one of the strategies?
A free chaos monkey for good ones
There are several approaches that you might take to solve for this problem. I’ll explore each one.
And once you’ve proven that you can deliver you have some money in the bank. You have earned a seat at the table.
Now you’re ready to build strong partnerships.