2. About me
• International speaker and writer
• Graduate degrees in Math, CS, Psychology
• Technology communicator
• Former university professor, tech journalist
• Cat owner and distance runner
• peter@petervarhol.com
3. Gerie Owen
• QA Evangelist, test manager
• Subject matter expert on testing for
TechTarget’s SearchSoftwareQuality.com
• International and domestic conference
presenter
• Marathon runner & running coach
gowen@qualitestgroup.com
4. Agenda
• DevOps and disaster
• Preparing for disaster
• Principles of chaos
• Monitoring for disaster
• Getting back on your feet
• Conclusions
5. What is DevOps?
• Containerized development, rapid iteration with real-time
performance insights, intelligent feedback, diagnostic services, an
integrated DevOps pipeline, and deployment to the cloud
• Bozhe moi! (My goodness!)
• In layman’s terms:
• We automatically integrate and build every time there is a valid check-in
• We run automated tests at all stages, including production
• We send the app to production when it has been integrated and tested
• Automation makes it all work like a Swiss watch
6. What is a Disaster?
• A serious disruption, occurring over a relatively short time, that
causes losses and impacts exceeding the ability of the affected
community or society to cope using its own resources.
• Disruption
• Short timeframe
• Exceeds the ability to cope
7. What is a Disaster?
• Consistency becomes uncertain
• Automated workflow breaks down
• Build fails; smoke tests are blocked
• Server farm goes offline
• Application won’t start again
• Showstopper bug in production
• Anything that disrupts consistency
8. Preparing for Disaster
• We don’t react well when things go wrong
• Disbelief
• Uncertainty
• Panic
• How can we prepare for the unknown?
9. We Can Learn from Aircrews
• US Airways Flight 1549
• Sullenberger and Skiles had never met before that day
• Yet worked from established procedures
• Practiced for hundreds of hours
• Immediately turned to checklists
• About three and a half minutes after the bird strike, they were in the Hudson
• You have to practice this
10. We Can Learn from Aircrews
• Indecision and panic are killers
• Checklists drive decision-making by focusing on essentials
• Courses of action are defined fast
• Practice makes disasters just another day in the office
• Clear and structured communication is essential
11. We Can Also Practice Disaster
• Chaos engineering
• Failure scenarios
• Application health monitoring
12. Chaos Engineering
• Distributed systems at scale
• Experiments to uncover systemic weaknesses
• Defining normal behavior
• Set your null and alternative hypotheses
• Introduce variables that reflect real world events
• servers crash
• hard drives malfunction
• network connections drop
• Try to disprove the null hypothesis
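The experiment loop on this slide can be sketched as a small simulation. This is only an illustration under stated assumptions: the `Cluster` class, the single-retry routing rule, the 0.99 steady-state threshold, and the `run_experiment` helper are all hypothetical and not taken from any real chaos tool. The null hypothesis is "the success rate stays at or above the threshold after servers crash", and the experiment tries to disprove it.

```python
import random


class Cluster:
    """Toy stand-in for a distributed service (hypothetical, not a real API)."""

    def __init__(self, replicas):
        self.healthy = [True] * replicas

    def crash(self, index):
        # Injected real-world event: one server goes down.
        self.healthy[index] = False

    def handle_request(self, rng):
        # Assumed routing rule: pick a random replica, retry once on another.
        first = rng.randrange(len(self.healthy))
        if self.healthy[first]:
            return True
        others = [i for i in range(len(self.healthy)) if i != first]
        if not others:
            return False
        return self.healthy[rng.choice(others)]


def success_rate(cluster, requests, rng):
    ok = sum(cluster.handle_request(rng) for _ in range(requests))
    return ok / requests


def run_experiment(replicas, crashed, threshold=0.99, requests=10_000, seed=1):
    """Measure normal behavior, inject crashes, then test the null hypothesis."""
    rng = random.Random(seed)
    cluster = Cluster(replicas)
    baseline = success_rate(cluster, requests, rng)  # defines "normal"
    for i in range(crashed):
        cluster.crash(i)
    observed = success_rate(cluster, requests, rng)
    return {
        "baseline": baseline,
        "observed": observed,
        # True means the experiment failed to disprove the null hypothesis.
        "null_holds": observed >= threshold,
    }


if __name__ == "__main__":
    print(run_experiment(replicas=3, crashed=1))
    print(run_experiment(replicas=3, crashed=2))
```

In this toy model, losing one of three replicas is absorbed by the retry, while losing two drops the success rate well below the threshold, which is the kind of systemic weakness the experiment is meant to expose.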
13. Chaos Engineering
• Practice in production
• Vary real world events
• Yes, there could be customer impact
• It is incumbent upon the chaos engineer to minimize customer impact
• But that is the point of the experiment
14. Chaos Monkey
• Now part of the Simian Army
• Developed by Netflix
• Causes breakdowns in their production environment
• Now consists of a variety of tools
• It’s all about resiliency
• Can our application survive?
15. Practice Failure Scenarios
• Each team member contributes one or more scenarios
• The more unlikely, the better
• Write up the scenarios
• Only the team leader sees them beforehand
• They can be real failures experienced or thought exercises
16. Practice Failure Scenarios
• Describe the scenarios to the team
• “Load is remaining constant but performance is gradually
deteriorating. We’re starting to get 404 and related errors. The server
farm seems to be operating correctly; it’s an application issue. Pings
are slowing down, but not drastically.”
• How do we diagnose and address?
17. We Don’t Need Another Hero
• Heroes use superhuman efforts to fix a disaster
• In doing so, they break with team conventions
• Teams function better together
• If a team has a hero:
• the team may not try as hard in the future
• the hero is not replicable
• the hero can’t solve every problem
18. Monitor Application Health in Production
• Ping just doesn’t cut it any more
• Availability and performance data
• Synthetic testing
• Health over time
• Track trends of performance, page painting, database calls
• Whatever might give you health trends
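One way to track health over time is to fit a trend line to a series of measurements and flag a sustained upward slope. The sketch below assumes the samples are page load times in milliseconds collected by a synthetic check at regular intervals; the function names and the slope limit are hypothetical choices for illustration.

```python
from statistics import mean


def trend_slope(samples):
    """Least-squares slope of the samples against their index (units per sample)."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    xs = range(n)
    x_bar = mean(xs)
    y_bar = mean(samples)
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, samples))
    den = sum((x - x_bar) ** 2 for x in xs)
    return num / den


def is_degrading(samples, max_slope):
    """True when the measurements rise faster than the allowed slope."""
    return trend_slope(samples) > max_slope


if __name__ == "__main__":
    load_times_ms = [100, 110, 120, 130]  # made-up synthetic-check results
    print(trend_slope(load_times_ms))
    print(is_degrading(load_times_ms, max_slope=5))
```

A slope catches gradual deterioration that a simple up/down ping would miss, because every individual sample can still look acceptable.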
19. Directions for Monitoring
• Watermarks for action
• E.g., 25 percent of pages take longer than 2 seconds to load
• AI for prediction
• Based on similar results in the past, the application is likely to fail in six hours
• Analytics for trends
• A combination of six measures indicates unhealthy trends
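The watermark example on this slide (25 percent of pages taking longer than 2 seconds) can be expressed as a simple check. This is a minimal sketch: the function names are hypothetical, and treating exactly 25 percent as a breach is an assumption.

```python
def slow_page_fraction(load_times_s, slow_after_s=2.0):
    """Fraction of page loads slower than slow_after_s seconds."""
    if not load_times_s:
        return 0.0
    slow = sum(1 for t in load_times_s if t > slow_after_s)
    return slow / len(load_times_s)


def watermark_breached(load_times_s, slow_after_s=2.0, max_slow_fraction=0.25):
    """True when the share of slow pages reaches the watermark (assumed inclusive)."""
    return slow_page_fraction(load_times_s, slow_after_s) >= max_slow_fraction


if __name__ == "__main__":
    recent = [0.5, 1.0, 2.5, 1.5]  # made-up load times in seconds
    print(slow_page_fraction(recent), watermark_breached(recent))
```

A breached watermark is a trigger for action, for example opening the relevant checklist, rather than a diagnosis in itself.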
20. The Power of Checklists
• Checklists are part of our daily lives
• They
• relieve the cognitive load of remembering to-dos
• organize complicated decision-making
• reduce risk in complicated activities by ensuring that critical tasks are not
overlooked.
22. Using Checklists in DevOps
• Checklists can be used to:
• Replace Test Cases
• Supplement Test Cases
• Verify Entry and Exit Criteria
• Sanity Testing
• Ambiguity Reviews
• Dev Estimates
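A checklist used to verify entry or exit criteria can be treated as an ordered list of named checks that must all pass. The sketch below is one possible shape, not a prescribed format: `ChecklistItem`, `run_checklist`, and the example criteria are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ChecklistItem:
    description: str
    check: Callable[[], bool]  # returns True when the item is satisfied


def run_checklist(items: List[ChecklistItem]) -> List[str]:
    """Run every item in order and return the descriptions of those that failed."""
    failed = []
    for item in items:
        if not item.check():
            failed.append(item.description)
    return failed


if __name__ == "__main__":
    # Made-up exit criteria; real checks would query the build or test system.
    exit_criteria = [
        ChecklistItem("Build succeeded", lambda: True),
        ChecklistItem("Smoke tests passed", lambda: True),
        ChecklistItem("No open showstopper bugs", lambda: False),
    ]
    print(run_checklist(exit_criteria))
```

Running every item, rather than stopping at the first failure, gives the team the full list of unmet criteria in one pass.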
23. Types of Checklists
• Project Set Up
• Application Specific Regression
• Process type specific
• Website Graphics
• Browser Dependencies
• Usability checks
24. What Does Thinking Of Failure Accomplish?
• Failure doesn’t come as a surprise
• Without preparation, it does so all too often
• We have procedures to deal with failure
• We have practice dealing with failure
• Failure is just another day at the office
26. Conclusions
• Things will go wrong
• Don’t yell or panic
• Practice non-conforming situations regularly
• Make up unlikely scenarios; chances are they will happen
• Structured practices and communications may make work boring, but
they help when things start going wrong
• Ease into chaos engineering for resiliency
• Use your experiences to create checklists