This document outlines Chef's Operations Maturity Model, a roadmap for organizations to improve their operational maturity and move from being like horses to being like unicorns (highly automated and resilient). It covers key areas of operational maturity: hardware management, operating system management, incident management, and postmortem analysis. Each area is broken down into levels of maturity, from ad-hoc processes to fully automated operations. The goal is to help organizations understand where they stand and identify steps to improve measures such as mean time to recovery and time to production. Ultimately, it stresses that culture matters more than any specific technology.
9. The Map is not the Territory
• Comparative study of Operational Maturity Models
• On one end: ad-hoc, slow to respond, “traditional” approach
• At the other: very fast, fully automated, and disaster indifferent
• Figure out what is most important to your Organization
https://www.chimacumtack.com/images/measurehorse.jpg
10. Fitting the Model
• Varying degrees of adoption
• Operational trends often correlated and relational, but not definitive
• Roadmap for improving time to deployment and lowering time to recovery
• Understand the challenges, set realistic expectations for progress
http://www.web3dservice.com/3d_models/images/unicorn_3d_model_03.jpg
13. Every Server is Sacred!
• HA Support expected across the entire stack
• Dependence on vendor/on-site SE for replacement/maintenance
• “This is the best hardware money can buy!”
• Architecture Review and Request Forms for all changes
• “Tier 1” data centers
• Every project is a special snowflake
15. Maybe not ALL servers are sacred…
• Start using some farms of standardized machines
• Fewer support contracts, less dependence on vendor/on-site support
• Architecture Reviews for new services with some implementation standardization
• HA support across most of the stack
• Probably still using “Tier 1” data centers with excess redundancy
17. Most of these servers aren’t sacred?
• Limited support on ALL systems
• On-site support used sparingly, lower-skill onsite staff for “normal” failures
• Architecture Reviews only manage exceptions. Automated requests may be exposed via emerging APIs
• Wide adoption of virtualization: server instances are commoditized
• Hardware becoming standardized and easy to replace
• Smaller, more efficient data centers.
• Limited redundancy with hot/hot/hot N+1/N HA strategies
19. None of the servers are sacred
• Infrastructure as a Service
• Hardware (if any) is fully commoditized
• Hardware is completely standardized; special cases are regarded as a risk to the business
• Redundant Array of Inexpensive Data centers
23. Operating Systems Management
• Many OS flavors and versions. Manual, irregular patching
• Limited flavors and versions, planned upgrades. “Patch Tuesday!”
• Standard versions using JEOS with regular upgrades. Automated patching.
• Internally maintained versions, constant upgrades
25. Incident Threshold: Recovery Time
• Which teams have regular on call responsibilities?
• What is expected of someone on call?
• How are people notified & engaged on an incident?
26. Incident Threshold: Recovery Time
• "Something is wrong!" 12+ hours
• "Something is wrong with the…!" 1-12 hours
• "Something went wrong with your deployment!" <60 minutes
• "The core infrastructure fabric is down!" seconds - 10 minutes
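The four recovery-time bands above can be read as a simple lookup. The sketch below is one way to encode them in Python. The function name `recovery_maturity_level`, the numbering of levels from 1 to 4, and the choice of 10 minutes as the boundary between the last two bands (which overlap on the slide) are assumptions made for illustration, not part of the model itself.

```python
from datetime import timedelta


def recovery_maturity_level(recovery_time: timedelta) -> int:
    """Map a typical recovery time to a level from 1 (slowest) to 4 (fastest).

    The level numbers are hypothetical labels; the thresholds follow the
    four bands listed on the slide.
    """
    if recovery_time >= timedelta(hours=12):
        return 1  # "Something is wrong!" (12+ hours)
    if recovery_time >= timedelta(hours=1):
        return 2  # "Something is wrong with the…!" (1-12 hours)
    if recovery_time > timedelta(minutes=10):
        return 3  # deployment-level alert (<60 minutes)
    # Assumption: 10 minutes or less falls in the fastest band.
    return 4  # infrastructure-level alert (seconds - 10 minutes)


if __name__ == "__main__":
    print(recovery_maturity_level(timedelta(hours=3)))
    print(recovery_maturity_level(timedelta(minutes=5)))
```

Feeding in a team's typical recovery time gives a rough position on the scale; the more useful signal is likely how that position changes over time.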
28. Postmortems
• Postmortem Focus
• Root Cause Orientation
• Root Cause Mitigation/Resolution
• Root Cause Elimination Rate
http://img3.wikia.nocookie.net/__cb20111008164412/mlpfanart/images/thumb/b/b2/Twilight_Sparkle_Angry_by_Ivan-Chan.png/597px-Twilight_Sparkle_Angry_by_Ivan-Chan.png
29. Postmortems: Ad Hoc
• "Human Error": blame finding & punishment
• "Triggering Event": blaming specific operator error or specific hardware failures
• Cycle between protecting heroes and then firing them
• <10% - Mostly break fix detection
30. Postmortems: Formal
• Focus on "Triggering Event" or "Human Error", but blaming process and/or infrastructure
• "Let's implement more process and overhead"
• 10% within 3 months - mostly simple fixes
• Tracking but little progress against goals vs. other priorities, frequent recurrence
31. Postmortems: Officially "Blame Free"
• Primary focus on underlying technical root causes, systemic fixes
• Improved tooling, programmatic checks, operator tools for special cases. Some focus on building resiliency
• 20% - Easily fixable issues eliminated within 3 months, programs to eliminate larger issues over time
32. Postmortems: “5 Whys”
• Including business and cultural issues
• Primary focus on insights and opportunities from lessons learned
• Increased resiliency and appropriate operator tools, focus on self-healing fixes
• Recurrence becomes infrequent and is a big deal
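The percentages on the postmortem slides appear to track the Root Cause Elimination Rate named on the overview slide. Below is a minimal sketch of how such a rate could be computed, assuming each root cause is recorded with the date it was identified and, if it was fixed, the date it was eliminated. The function name, the record shape, and the use of 90 days to approximate "3 months" are all hypothetical.

```python
from datetime import date, timedelta
from typing import List, Optional, Tuple

# Hypothetical record shape: (date identified, date eliminated or None).
RootCause = Tuple[date, Optional[date]]


def elimination_rate(root_causes: List[RootCause], window_days: int = 90) -> float:
    """Fraction of root causes eliminated within `window_days` of being identified."""
    if not root_causes:
        return 0.0
    window = timedelta(days=window_days)
    eliminated = sum(
        1
        for identified, fixed in root_causes
        if fixed is not None and fixed - identified <= window
    )
    return eliminated / len(root_causes)


if __name__ == "__main__":
    causes = [
        (date(2024, 1, 1), date(2024, 2, 1)),  # fixed within the window
        (date(2024, 1, 1), date(2024, 6, 1)),  # fixed, but too late to count
        (date(2024, 1, 1), None),              # still open
        (date(2024, 1, 1), None),
        (date(2024, 1, 1), None),
    ]
    print(f"{elimination_rate(causes):.0%}")
```

Counting only fixes that land inside the window keeps the number honest: a root cause that lingers for a year and is then closed does not raise the rate.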
33. Navigating the Change
• Many more mile markers
• Roadmap to improve your Mean Time to Production and Mean Time to Recovery
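Both measures named here are averages over durations, so they can be tracked with the same small helper. The sketch below assumes incidents are logged as (detected, recovered) timestamp pairs and changes as (committed, deployed) pairs; the helper name `mean_duration` and both record shapes are illustrative assumptions, not something the model prescribes.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

Interval = Tuple[datetime, datetime]  # hypothetical (start, end) record


def mean_duration(intervals: List[Interval]) -> timedelta:
    """Average of (end - start) across all intervals."""
    if not intervals:
        raise ValueError("at least one interval is required")
    total = sum((end - start for start, end in intervals), timedelta())
    return total / len(intervals)


if __name__ == "__main__":
    incidents = [
        (datetime(2024, 3, 1, 9, 0), datetime(2024, 3, 1, 9, 30)),
        (datetime(2024, 3, 8, 14, 0), datetime(2024, 3, 8, 15, 30)),
    ]
    # Mean Time to Recovery over the two example incidents.
    print(mean_duration(incidents))
```

Passing (committed, deployed) pairs to the same function yields a Mean Time to Production figure, which makes it easy to watch both numbers move together.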
34. Becoming a Unicorn is Possible
• Approach the challenges with realistic expectations for your organization
• Always room for improvement
• Culture trumps everything
http://webecoist.momtastic.com/wp-content/uploads/2010/09/unicorns_3x.jpg