In this presentation I discuss how poor implementations of root cause analysis undermine an organization’s attempts to enable a learning culture. Audience members will: 1. understand the intent of root cause analysis; 2. be able to recognize its limitations; 3. fix their implementations.
2. Origins of Root Cause Analysis
- First implemented in 1958 in Toyota manufacturing plants (5 Whys)
- Has since been adopted and tailored to many other industries
3. Why do RCA?
To better understand the underlying causes of problems so we can address them and prevent them from happening again.
4. When to do RCA?
- When issues happen more than once
- When an outage affects many users
- When a system is not functioning as designed
5. 5 Whys
- Start with a problem statement
- Ask why
- Repeat until root cause is found
6. Issues with 5 Whys
An incorrect or leading problem statement can point to the wrong issue.
Not very useful in complex situations, where you can’t answer "why" in the moment.
7. Issues with 5 Whys
It’s not repeatable
- Different people may get different results
- Same people at a different time may get different results
8. Issues with 5 Whys
Linear thinking leads teams to drive towards one root cause.
There is usually more than one root cause.
Human error is not a valid root cause.
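The point about linear thinking can be made concrete with a small sketch. It models an incident as a tree of contributing causes instead of a single chain of whys, so following every branch yields several root causes. The `Cause` class, the `root_causes` function, and the incident details are all hypothetical, invented for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class Cause:
    """One answer to "why?", with zero or more causes beneath it."""
    description: str
    contributing: list["Cause"] = field(default_factory=list)


def root_causes(cause: Cause) -> list[str]:
    """Collect every leaf of the tree; each leaf is a candidate root cause."""
    if not cause.contributing:
        return [cause.description]
    found: list[str] = []
    for child in cause.contributing:
        found.extend(root_causes(child))
    return found


# Hypothetical incident, not taken from a real postmortem.
incident = Cause("Deploy caused an outage", [
    Cause("Bad config reached production", [
        Cause("No config validation in the pipeline"),
        Cause("Review checklist does not cover config changes"),
    ]),
    Cause("Rollback took 40 minutes", [
        Cause("Rollback procedure was never rehearsed"),
    ]),
])

leaves = root_causes(incident)
```

A strictly linear 5 Whys would follow only one of these branches and stop at a single leaf.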
11. Cynefin
Conceptual framework created by Dave Snowden to organize
intellectual capital at IBM.
Uses quadrants to organize problems by complexity and
suggests a course of action.
15. Complicated
Problem: Increasing CI pipeline failures, occurring only during the day
- Sense: Failing pipelines ran at the same time as other builds
- Analyze: Concurrent pipeline collisions cause node resource contention
- Respond: Update pipelines to splay across nodes
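As a rough illustration of the response on this slide, splaying pipelines across nodes, here is a minimal sketch. It assumes pipelines can be pinned to a node and given a start delay. The function name `splay_pipelines`, the node and pipeline names, and the ten-minute window are assumptions for illustration, not the configuration actually used.

```python
def splay_pipelines(
    pipelines: list[str], nodes: list[str], window_seconds: int
) -> dict[str, tuple[str, int]]:
    """Assign each pipeline a node (round-robin) and a staggered start delay."""
    if not nodes:
        raise ValueError("at least one node is required")
    ordered = sorted(pipelines)  # stable order, so the schedule is repeatable
    step = window_seconds / max(len(ordered), 1)
    schedule: dict[str, tuple[str, int]] = {}
    for index, name in enumerate(ordered):
        node = nodes[index % len(nodes)]   # spread pipelines across nodes
        delay = int(index * step)          # spread start times across the window
        schedule[name] = (node, delay)
    return schedule


# Hypothetical pipeline and node names.
schedule = splay_pipelines(
    ["build-d", "build-a", "build-c", "build-b"],
    ["node-1", "node-2"],
    window_seconds=600,
)
```

Round-robin placement keeps the per-node count even, and the staggered delays keep pipelines on the same node from starting together.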
16. Complex
Problem: Discovery service became overloaded, triggering cascading failure in other services
Probe:
- Investigate logs
- Inspect exponential backoff code
- Run load tests
Sense:
- View existing alerts
- Analyze performance from load test
Respond:
- Add backpressure throttling to discovery service
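To show what backpressure throttling on a discovery service might look like, here is a minimal sketch. It assumes the service sheds load by rejecting requests once a fixed number are in flight, and that clients retry with jittered exponential backoff. `BackpressureLimiter`, `handle_lookup`, `backoff_delay`, and the 429 status are illustrative assumptions, not the service's real code.

```python
import random
import threading
from typing import Callable, Optional


class BackpressureLimiter:
    """Caps the number of in-flight requests; callers are rejected, not queued."""

    def __init__(self, max_in_flight: int) -> None:
        if max_in_flight < 1:
            raise ValueError("max_in_flight must be at least 1")
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def try_acquire(self) -> bool:
        # Non-blocking: returns False immediately when the service is saturated.
        return self._slots.acquire(blocking=False)

    def release(self) -> None:
        self._slots.release()


def handle_lookup(
    limiter: BackpressureLimiter, lookup: Callable[[str], str], key: str
) -> tuple[int, Optional[str]]:
    """Serve a lookup, or shed load with a 429 so callers back off."""
    if not limiter.try_acquire():
        return 429, None
    try:
        return 200, lookup(key)
    finally:
        limiter.release()


def backoff_delay(attempt: int, base: float = 0.1, cap: float = 10.0) -> float:
    """Client side: exponential backoff with full jitter, in seconds."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))
```

Rejecting early keeps queues from growing without bound on the overloaded service. The jitter keeps retrying clients from arriving in synchronized waves.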
17. Chaotic
Problem: Unable to connect to instances in AWS us-east-1a. No AWS service warnings.
- Act: Move instances to another AZ
- Sense: Still unable to connect
- Respond: Cannot connect in new AZ either
18. Disorder
Stems from a lack of agreement on the problem.
- Reduce: What do we know for sure?
- Analyze: What do we agree on?
- Iterate: Move to a quadrant, continue
19. Takeaways
- Be aware of the limitations of the RCA techniques you use
- Emergent behavior arises from complexity and increased rate of change
- Consider trying Cynefin to help you approach complex problems