One is the loneliest number
Much, much worse than two
Many of PagerDuty’s mission-critical services are based on Cassandra, and as a result we have built up a lot of operational experience over the past few years. Unfortunately, some of our best lessons have come from sizeable failures in production. One of those failures stemmed from having multiple services share the same Cassandra cluster, which was a major factor in PagerDuty’s largest outage of 2014. This talk will relive that outage, sort through the wreckage, and explain why isolating your Cassandra clusters is a best practice you should adopt.
7. 2015-12-08 ONE (IS THE LONELIEST NUMBER)
Cassandra Replication - Failure
[Diagram: Client and replicas R1, R2, R3, with a failure marked X]
8. 2015-12-08
Foreshadowing
• Series of small outages / degradations
• Repair process started
• High load, high latency
• Response: disable Thrift, turn off nodes
9. 2015-12-08
Coordinator Read Latency (in ms, by host)
6 seconds
~25 ms
15. 2015-12-08
The Plan
• Trigger repair…
… with lots of people watching
• Use our load shedding strategies for any problems:
• Proactively disable non-critical services
• Disable Thrift
16. 2015-12-08
Surprise!
• Cron triggers a repair of a different keyspace
• Plus a compaction for a large CF
17. 2015-12-08
Outgoing Notification Backlog Size
Normal
Bad
Horrible
18. 2015-12-08
Outgoing Notification Backlog Size
Normal
Bad
Horrible
:(
19. 2015-12-08
Cassandra Pending Tasks: ReadStage (by host)
Over 9000
23. 2015-12-08
or: What can we learn from Aimee Mann?
One is the loneliest number that you'll ever do
Two can be as bad as one
It's the loneliest number since the number one
No, is the saddest experience you'll ever know
Yes, it's the saddest experience you'll ever know
24. 2015-12-09
No, is the saddest experience you’ll ever know
•Cassandra sheds load when overloaded
•Shedding drops “stale” requests
•Clients see timeouts and have trouble making progress
•Sheds load if clients abandon the failed requests
•But if clients retry those requests…
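The shedding behaviour described above can be pictured with a toy queue. This is a minimal sketch under stated assumptions, not Cassandra's actual implementation: the function name `serve`, the tick-based clock, and the parameters are all hypothetical, chosen only to show that requests which wait past the timeout are dropped without any work being done for them.

```python
from collections import deque


def serve(arrivals, capacity_per_tick, timeout_ticks):
    """Toy model of a server that sheds stale requests (hypothetical).

    arrivals: number of requests arriving at each tick.
    Returns (served, shed) totals over the whole run.
    """
    queue = deque()  # arrival tick of each waiting request, oldest first
    served = shed = 0
    for now, count in enumerate(arrivals):
        queue.extend([now] * count)
        # Shed anything that has waited past the timeout: the client
        # has already given up, so doing the work would be wasted.
        while queue and now - queue[0] >= timeout_ticks:
            queue.popleft()
            shed += 1
        # Spend this tick's capacity on the requests that are left.
        for _ in range(min(capacity_per_tick, len(queue))):
            queue.popleft()
            served += 1
    return served, shed


# A burst of 10 requests against capacity 2 per tick and a 2-tick timeout.
print(serve([10, 0, 0], capacity_per_tick=2, timeout_ticks=2))  # (4, 6)
```

In this model the shed requests cost the server nothing, which only helps if the clients do not immediately send them again.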
25. 2015-12-09
So I heard you like retries…
[Diagram components: Interactive Request (from user), Load Balancer, App Host (×3), Event Processing, Notification Management, Cassandra Cluster (×3)]
Load balancer retries (H)
Service client retries (T)
Cass Client retries (S)
Retries are multiplicative
Total # of retries: O(S*H*T)
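A rough worked version of the multiplication on this slide, assuming each layer makes one original attempt plus its own retries and that every attempt fails (the worst case). The function name is hypothetical; S, T and H are the per-layer retry counts from the slide.

```python
def worst_case_attempts(s, t, h):
    """Worst-case number of attempts reaching Cassandra for one user request.

    s: Cassandra client retries, t: service client retries,
    h: load balancer retries. Each layer tries once, then retries.
    """
    return (1 + s) * (1 + t) * (1 + h)


# Two retries at every layer turns one request into 27 attempts.
print(worst_case_attempts(2, 2, 2))  # 27
```

Even a modest retry count at each layer compounds quickly, which is where the O(S*H*T) growth comes from.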
26. 2015-12-09
Yes, it’s the saddest experience you’ll ever know
•Dropped requests were retried
•…causing load amplification
•…causing more dropped requests
•…causing even more retries
•…causing misery.
•i.e. too much load leads to much too much load
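The feedback loop above can be illustrated with a deliberately crude model. It assumes that everything beyond capacity in a tick is dropped and that each drop comes back as a fixed number of retries in the next tick; the function and parameter names are hypothetical, and real systems have queues and timeouts that this ignores.

```python
def simulate_load(base_load, capacity, retries_per_drop, burst, ticks):
    """Offered load per tick when every dropped request is retried.

    burst is extra load added on tick 0 only (a one-off spike).
    """
    history = []
    retried = 0
    for tick in range(ticks):
        offered = base_load + retried + (burst if tick == 0 else 0)
        dropped = max(0, offered - capacity)
        retried = dropped * retries_per_drop  # comes back next tick
        history.append(offered)
    return history


# One retry per drop: the spike drains away.
print(simulate_load(80, 100, 1, burst=50, ticks=4))  # [130, 110, 90, 80]
# Three retries per drop: the same spike grows without bound.
print(simulate_load(80, 100, 3, burst=50, ticks=4))  # [130, 170, 290, 650]
```

The base load is identical in both runs and is below capacity; only the retry multiplier decides whether the system recovers.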
27. 2015-12-09
How does overload get started?
•Unpredictable workloads
•Could be from request volume
•In our case, from batch-style processes
•Repairs, compaction, application-level tasks (e.g. archiving)
28. 2015-12-09
PagerDuty system architecture
[Diagram components: Monitoring Events, Inbound Event Buffer, Data Access, Notification Management, Message Delivery, SMS, Phone Calls, Interactive Requests (from users), Load Balancer, App Host, Cassandra Cluster]
29. 2015-12-09
Workload A + Workload B = Workload A + B
…and more bursts are more worst
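To put numbers on the picture, here is a small sketch with made-up load figures (ops per time slot) for two workloads sharing one cluster. The function name and the sample values are assumptions for illustration only.

```python
def combined_peak(workload_a, workload_b):
    """Peak load on a shared cluster: the per-slot loads simply add."""
    return max(a + b for a, b in zip(workload_a, workload_b))


a = [10, 10, 50, 10]
b_aligned = [20, 20, 60, 20]   # B bursts in the same slot as A
b_shifted = [60, 20, 20, 20]   # same B, burst moved to another slot

print(combined_peak(a, b_aligned))  # 110
print(combined_peak(a, b_shifted))  # 70
```

The workloads are unchanged between the two runs; only the timing of the bursts differs, yet the peak the cluster must survive differs by more than half.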
30. 2015-12-09
One (cluster) is the loneliest number that you’ll ever do
•How many ops are A vs. B?
•Must reverse engineer the contributions
•Build (constantly evolving) models
•Hard to reason about system behaviour
•…and gets substantially harder when your entire production stack is overloaded
32. 2015-12-09
Stop poking the bear
•Only retry when necessary - is failure an option?
•Less risky to retry user-initiated requests
•Don’t retry retries (much)
•Specifically:
•Only try a single fallback C* host at the driver level, not N-1
•Only try a single fallback service host, not M-1
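The "single fallback" rule can be sketched generically as follows. This is not the API of any Cassandra driver or HTTP client: `send` is a hypothetical transport function that raises on failure, and the host list is assumed to be ordered by preference.

```python
def call_with_single_fallback(hosts, request, send):
    """Try the preferred host, then at most one fallback, never all hosts.

    send(host, request) is a hypothetical callable that returns a
    response or raises an exception on failure.
    """
    if not hosts:
        raise ValueError("no hosts given")
    last_error = None
    for host in hosts[:2]:  # primary plus one fallback, not N-1
        try:
            return send(host, request)
        except Exception as error:  # a real client would catch narrower errors
            last_error = error
    raise last_error
```

With four hosts available, a request that fails everywhere costs two attempts at this layer instead of four, which keeps the multiplier across layers small.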
33. 2015-12-09
Prepare for the worst case
•To avoid overload, must provision for the worst case
•So either scale for the (bursty) stars aligning…
•…or prevent stars from aligning in the first place
34. 2015-12-09
Preventing star-bursts, part 1: coordinate
•Explicit scheduling to interleave bursts
•Repairs, compactions, batch jobs - Cassandra & services
•Automation can help…
•…but still error prone
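One simple form of explicit scheduling is to give each burst its own slot so that none overlap. The sketch below is an illustration only: the job names, durations, and the `interleave` helper are hypothetical, and a real schedule would also need to cope with jobs that overrun.

```python
def interleave(durations, period):
    """Assign back-to-back start offsets so no two jobs overlap.

    durations: dict of job name -> run time, in the same unit as period.
    Raises ValueError if the jobs cannot all fit inside one period.
    """
    if sum(durations.values()) > period:
        raise ValueError("jobs do not fit in one period without overlapping")
    offsets = {}
    start = 0
    for name, duration in durations.items():
        offsets[name] = start
        start += duration
    return offsets


# Hours within a 24-hour cycle (made-up durations).
print(interleave({"repair": 4, "compaction": 3, "archive": 2}, period=24))
# {'repair': 0, 'compaction': 4, 'archive': 7}
```

The error-prone part noted above is exactly what this sketch leaves out: a job that runs longer than its assumed duration spills into the next slot.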
35. 2015-12-09
Preventing star-bursts, part 2: smooth, not chunky
•Jobs can be done more frequently
•But with smaller batch size
•In the limit, aims for continuous & constant intensity workload
•Some Cassandra options too:
•Compaction, transfer, and other throttle limits
•Levelled compaction vs. size-tiered compaction
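The trade between frequency and batch size is simple arithmetic. The sketch below uses a made-up daily volume and a hypothetical helper name to show how the per-run burst shrinks as the same work is spread over more runs.

```python
def per_run_batch(total_items_per_day, runs_per_day):
    """Items each run must handle if the daily work is split evenly.

    Rounds up so that the runs together always cover the full total.
    """
    return -(-total_items_per_day // runs_per_day)  # ceiling division


# One million items per day (an assumed figure).
print(per_run_batch(1_000_000, 1))     # 1000000: one big nightly burst
print(per_run_batch(1_000_000, 24))    # 41667: hourly
print(per_run_batch(1_000_000, 1440))  # 695: every minute
```

As the run count grows, the job approaches the continuous, constant-intensity workload the slide describes.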
36. 2015-12-09
Preventing star-bursts, part 3: isolation
•Air gap between each workload
•Distinct Cassandra cluster for each service/workload
•Cons:
•More infrastructure
•More configuration management
•Pros:
•Easy to monitor, reason about, diagnose, and scale
•Reduces the blast radius when failures happen (and they will)
37. 2015-12-09
PagerDuty system architecture: today
[Diagram components: Inbound Event Buffer, Notification Management, Message Delivery, Cassandra Cluster (×3)]
39. 2015-12-09
What have we learned?
• Retries: the devil’s in the details
• Variable workloads: bad, especially if unpredictable
• Workload peaks: additive, and bad in multiples
• Isolation: the gift that keeps on giving
40. 2015-12-09
One is the loneliest number that you'll ever do
Two can be as bad as one
It's the loneliest number since the number one
No, is the saddest experience you'll ever know
Yes, it's the saddest experience you'll ever know