2015-12-09
One (is the loneliest number)
donny@pagerduty.com & paul@pagerduty.com
2015-12-09MAKING PAGERDUTY MORE RELIABLE USING PXC
2015-12-09ONE (IS THE LONELIEST NUMBER)
Failure
2015-12-08
Background
• Shared cluster, 5 machines (with replication factor = 5)
• 10s of GBs of data
• In-flight data: 10s of MBs, maybe 100s
Cassandra Replication
[Diagram: Client, R1, R2, R3]
Cassandra Replication - Failure
[Diagram: Client, R1, R2, R3, with an X marking a failure]
Foreshadowing
• Series of small outages / degradations
• Repair process started
• High load, high latency
• Response: disable thrift, turn off nodes
[Figure, repeated over five slides: Coordinator Read Latency (in ms, by host); annotations: 6 seconds, ~25 ms]
The next day…
The Plan
• Trigger repair…
… with lots of people watching
• Use our load shedding strategies for any problems:
• Proactively disable non-critical services
• Disable thrift
Surprise!
• Cron triggers a repair of a different keyspace
• Plus a compaction for a large column family (CF)
[Figure, two slides: Outgoing Notification Backlog Size; levels marked Normal, Bad, Horrible; the second slide adds ":("]
[Figure: Cassandra Pending Tasks: ReadStage (by host); annotation: Over 9000]
[Figure: Cassandra CPU (by host); annotation: 100%]
Factory Reset
Success… kind of
What went wrong?
or: What can we learn from Aimee Mann?
One is the loneliest number that you'll ever do
Two can be as bad as one
It's the loneliest number since the number one
No, is the saddest experience you'll ever know
Yes, it's the saddest experience you'll ever know
No, is the saddest experience you’ll ever know
• Cassandra sheds load when overloaded
• Shedding drops “stale” requests
• Clients see timeouts and have trouble making progress
• Sheds load if clients abandon the failed requests
• But if clients retry those requests…
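A toy sketch of the shedding behaviour described on this slide: requests that have waited longer than a timeout are dropped instead of served. The queue layout, the `shed_stale` name, and the timeout value are illustrative assumptions, not Cassandra's actual implementation.

```python
from collections import deque


def shed_stale(queue, now, timeout):
    """Split queued requests into (served, dropped) by age.

    queue holds (request_id, enqueued_at) pairs; this layout is hypothetical.
    """
    served, dropped = [], []
    while queue:
        request_id, enqueued_at = queue.popleft()
        if now - enqueued_at > timeout:
            dropped.append(request_id)  # stale: the client has probably timed out already
        else:
            served.append(request_id)   # fresh enough to be worth the work
    return served, dropped


pending = deque([("a", 0.0), ("b", 4.0), ("c", 9.5)])
served, dropped = shed_stale(pending, now=10.0, timeout=5.0)
print("served:", served, "dropped:", dropped)
```

Dropping only reduces load if the client gives up; a retried request re-enters the queue as fresh work.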
So I heard you like retries…
[Diagram: Interactive Request (from user), Load Balancer, App Hosts, Event Processing, Notification Management, Cassandra Clusters]
• Cass Client retries (S)
• Service client retries (T)
• Load balancer retries (H)
• Retries are multiplicative
• Total # of retries: O(S*H*T)
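The multiplication can be made concrete with a little arithmetic. The sketch below assumes each layer makes one initial attempt plus its configured number of retries, and that every attempt at every layer fails; the function name and the sample retry counts are made up for illustration.

```python
def worst_case_attempts(cass_client_retries, service_client_retries, load_balancer_retries):
    """Worst-case number of Cassandra requests caused by one user request.

    Each layer contributes (1 + retries) attempts, and the layers nest,
    so the counts multiply: (1 + S) * (1 + T) * (1 + H).
    """
    s, t, h = cass_client_retries, service_client_retries, load_balancer_retries
    return (1 + s) * (1 + t) * (1 + h)


for s, t, h in [(0, 0, 0), (1, 1, 1), (3, 3, 3)]:
    print(f"S={s} T={t} H={h}: up to {worst_case_attempts(s, t, h)} Cassandra requests")
```

Even one retry per layer turns a single failing user request into as many as eight requests against an already overloaded cluster.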
Yes, it’s the saddest experience you’ll ever know
• Dropped requests were retried
• …causing load amplification
• …causing more dropped requests
• …causing even more retries
• …causing misery.
• i.e. too much load leads to much too much load
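The feedback loop can be illustrated with a deliberately simple model. It assumes a fixed capacity per tick, a short burst on top of a steady base load, and that every dropped request produces `retry_factor` new requests on the next tick. All names and numbers are hypothetical, not measurements from the incident.

```python
def simulate(base_load, burst_load, burst_ticks, capacity, retry_factor, ticks):
    """Return the total offered load per tick under a toy retry model."""
    offered = []
    retries = 0
    for tick in range(ticks):
        fresh = base_load + (burst_load if tick < burst_ticks else 0)
        total = fresh + retries
        dropped = max(0, total - capacity)   # everything above capacity is shed
        retries = dropped * retry_factor     # shed requests come back next tick
        offered.append(total)
    return offered


# Clients abandon failed requests: load returns to normal once the burst ends.
print(simulate(80, 40, 3, 100, retry_factor=0, ticks=6))
# Each dropped request is retried twice: load keeps growing after the burst ends.
print(simulate(80, 40, 3, 100, retry_factor=2, ticks=6))
```

In this model the system recovers on its own only when the retry traffic generated per dropped request stays small enough for the spare capacity to absorb it.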
How does overload get started?
• Unpredictable workloads
• Could be from request volume
• In our case, from batch-style processes
  • Repairs, compaction, application-level tasks (e.g. archiving)
PagerDuty system architecture
[Diagram: Monitoring Events, Inbound Event Buffer, Notification Management, Message Delivery, SMS and Phone Calls; Interactive Requests (from users), Load Balancer, App Host; Data Access; one Cassandra Cluster]
[Figure: Workload A + Workload B = Workload A + B]
…and more bursts are more worst
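A small sketch of why bursts on a shared cluster add up. Two workloads are modelled as lists of load per time slot; the values are invented for illustration, and `shift` simply rotates one workload so its burst lands in a different slot.

```python
def combined_peak(workload_a, workload_b):
    """Peak load seen by a cluster that serves both workloads."""
    return max(a + b for a, b in zip(workload_a, workload_b))


def shift(workload, offset):
    """Rotate a workload to the right by `offset` time slots."""
    offset %= len(workload)
    return workload[-offset:] + workload[:-offset] if offset else list(workload)


workload_a = [10, 10, 90, 10, 10, 10]  # burst in slot 2
workload_b = [20, 20, 70, 20, 20, 20]  # burst in the same slot

print("bursts aligned:", combined_peak(workload_a, workload_b))
print("bursts offset: ", combined_peak(workload_a, shift(workload_b, 2)))
```

The cluster has to be sized for the aligned case unless something guarantees the bursts never coincide.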
One (cluster) is the loneliest number that you’ll ever do
• How many ops are A vs. B?
• Must reverse engineer the contributions
• Build (constantly evolving) models
• Hard to reason about system behaviour
• …and gets substantially harder when your entire production stack is overloaded
How we fixed it
Stop poking the bear
• Only retry when necessary - is failure an option?
• Less risky to retry user-initiated requests
• Don’t retry retries (much)
• Specifically:
  • Only try a single fallback C* host at the driver level, not N-1
  • Only try a single fallback service host, not M-1
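One way to express "a single fallback host, not N-1" in code. This is a generic sketch, not the API of any Cassandra driver: `RequestFailed`, `call_with_single_fallback`, and the `send` callable are hypothetical names introduced for illustration.

```python
class RequestFailed(Exception):
    """Hypothetical error raised when a host fails or times out."""


def call_with_single_fallback(hosts, send):
    """Try the first host, then at most one fallback, then give up."""
    if not hosts:
        raise ValueError("at least one host is required")
    last_error = None
    for host in hosts[:2]:  # primary plus one fallback, never all N hosts
        try:
            return send(host)
        except RequestFailed as error:
            last_error = error
    raise last_error


attempted = []


def always_failing(host):
    attempted.append(host)
    raise RequestFailed(host)


try:
    call_with_single_fallback(["c1", "c2", "c3", "c4", "c5"], always_failing)
except RequestFailed:
    pass
print("hosts attempted:", attempted)
```

Capping each layer at two attempts bounds the worst-case amplification by a small constant per layer instead of letting it grow with cluster size.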
Prepare for the worst case
• To avoid overload, must provision for the worst case
• So either scale for the (bursty) stars aligning…
• …or prevent stars from aligning in the first place
Preventing star-bursts, part 1: coordinate
• Explicit scheduling to interleave bursts
• Repairs, compactions, batch jobs - Cassandra & services
• Automation can help…
• …but still error prone
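A minimal sketch of explicit scheduling: jobs are laid out back to back so that no two run at the same time, and a separate check detects overlaps in any schedule. The job names and durations are made-up examples, and a real scheduler would also need to handle jobs that overrun their estimate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    name: str
    duration_min: int  # estimated run time in minutes (hypothetical values below)


def interleave(jobs, start_min=0, gap_min=0):
    """Assign each job a (name, start, end) slot so that no two slots overlap."""
    schedule = []
    cursor = start_min
    for job in jobs:
        schedule.append((job.name, cursor, cursor + job.duration_min))
        cursor += job.duration_min + gap_min
    return schedule


def has_overlap(schedule):
    """True if any two slots in the schedule run at the same time."""
    ordered = sorted(schedule, key=lambda slot: slot[1])
    return any(prev[2] > nxt[1] for prev, nxt in zip(ordered, ordered[1:]))


jobs = [Job("repair", 120), Job("compaction", 60), Job("archiving", 30)]
plan = interleave(jobs, gap_min=15)
print(plan, "overlap:", has_overlap(plan))
```

A check like `has_overlap` could also be run against independently configured cron entries to catch two jobs landing in the same window.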
Preventing star-bursts, part 2: smooth, not chunky
• Jobs can be done more frequently
• But with smaller batch size
• In the limit, aims for continuous & constant intensity workload
• Some Cassandra options too:
  • Compaction, transfer, and other throttle limits
  • Levelled compaction vs. size-tiered compaction
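The "smaller batches, more often" idea as arithmetic. The sketch splits a fixed amount of work per period into evenly sized runs; the function name and the row counts are illustrative assumptions.

```python
def smooth(total_items_per_period, runs_per_period):
    """Split a period's work into evenly sized batches, one per run."""
    if runs_per_period <= 0:
        raise ValueError("runs_per_period must be positive")
    base, extra = divmod(total_items_per_period, runs_per_period)
    # The first `extra` runs take one additional item so nothing is lost.
    return [base + (1 if i < extra else 0) for i in range(runs_per_period)]


daily = smooth(24_000, 1)    # one big nightly job
hourly = smooth(24_000, 24)  # the same work spread over 24 runs

print("largest batch, daily: ", max(daily))
print("largest batch, hourly:", max(hourly))
```

The total work is unchanged; only the size of the largest burst shrinks, which is what matters for peak provisioning.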
Preventing star-bursts, part 3: isolation
• Air gap between each workload
• Distinct Cassandra cluster for each service/workload
• Cons:
  • More infrastructure
  • More configuration management
• Pros:
  • Easy to monitor, reason about, diagnose, and scale
  • Reduces the blast radius when failures happen (and they will)
PagerDuty system architecture: today
[Diagram: Inbound Event Buffer, Notification Management, Message Delivery, and three Cassandra Clusters]
Lessons learned
What have we learned?
• Retries: the devil’s in the details
• Variable workloads: bad, especially if unpredictable
• Workload peaks: additive, and bad in multiples
• Isolation: the gift that keeps on giving
One is the loneliest number
that you'll ever do
Two can be as bad as one
It's the loneliest number since the number one
No, is the saddest experience you'll ever know
Yes, it's the saddest experience you'll ever know
donny@pagerduty.com & paul@pagerduty.com
PAGERDUTY.COM/JOBS
Questions?
donny@pagerduty.com & paul@pagerduty.com

Weitere ähnliche Inhalte

Andere mochten auch

Coursera's Adoption of Cassandra
Coursera's Adoption of CassandraCoursera's Adoption of Cassandra
Coursera's Adoption of CassandraDataStax Academy
 
Introduction to .Net Driver
Introduction to .Net DriverIntroduction to .Net Driver
Introduction to .Net DriverDataStax Academy
 
Spark Cassandra Connector: Past, Present and Furure
Spark Cassandra Connector: Past, Present and FurureSpark Cassandra Connector: Past, Present and Furure
Spark Cassandra Connector: Past, Present and FurureDataStax Academy
 
Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office
Lessons Learned with Cassandra and Spark at the US Patent and Trademark OfficeLessons Learned with Cassandra and Spark at the US Patent and Trademark Office
Lessons Learned with Cassandra and Spark at the US Patent and Trademark OfficeDataStax Academy
 
Using Event-Driven Architectures with Cassandra
Using Event-Driven Architectures with CassandraUsing Event-Driven Architectures with Cassandra
Using Event-Driven Architectures with CassandraDataStax Academy
 
Traveler's Guide to Cassandra
Traveler's Guide to CassandraTraveler's Guide to Cassandra
Traveler's Guide to CassandraDataStax Academy
 
Standing Up Your First Cluster
Standing Up Your First ClusterStanding Up Your First Cluster
Standing Up Your First ClusterDataStax Academy
 
Apache Cassandra and Drivers
Apache Cassandra and DriversApache Cassandra and Drivers
Apache Cassandra and DriversDataStax Academy
 
Introduction to Data Modeling with Apache Cassandra
Introduction to Data Modeling with Apache CassandraIntroduction to Data Modeling with Apache Cassandra
Introduction to Data Modeling with Apache CassandraDataStax Academy
 
Production Ready Cassandra
Production Ready CassandraProduction Ready Cassandra
Production Ready CassandraDataStax Academy
 
Enabling Search in your Cassandra Application with DataStax Enterprise
Enabling Search in your Cassandra Application with DataStax EnterpriseEnabling Search in your Cassandra Application with DataStax Enterprise
Enabling Search in your Cassandra Application with DataStax EnterpriseDataStax Academy
 
Advanced Data Modeling with Apache Cassandra
Advanced Data Modeling with Apache CassandraAdvanced Data Modeling with Apache Cassandra
Advanced Data Modeling with Apache CassandraDataStax Academy
 
Make 2016 your year of SMACK talk
Make 2016 your year of SMACK talkMake 2016 your year of SMACK talk
Make 2016 your year of SMACK talkDataStax Academy
 
Cassandra Adoption on Cisco UCS & Open stack
Cassandra Adoption on Cisco UCS & Open stackCassandra Adoption on Cisco UCS & Open stack
Cassandra Adoption on Cisco UCS & Open stackDataStax Academy
 

Andere mochten auch (20)

Coursera's Adoption of Cassandra
Coursera's Adoption of CassandraCoursera's Adoption of Cassandra
Coursera's Adoption of Cassandra
 
New features in 3.0
New features in 3.0New features in 3.0
New features in 3.0
 
Introduction to .Net Driver
Introduction to .Net DriverIntroduction to .Net Driver
Introduction to .Net Driver
 
Spark Cassandra Connector: Past, Present and Furure
Spark Cassandra Connector: Past, Present and FurureSpark Cassandra Connector: Past, Present and Furure
Spark Cassandra Connector: Past, Present and Furure
 
Playlists at Spotify
Playlists at SpotifyPlaylists at Spotify
Playlists at Spotify
 
Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office
Lessons Learned with Cassandra and Spark at the US Patent and Trademark OfficeLessons Learned with Cassandra and Spark at the US Patent and Trademark Office
Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office
 
Using Event-Driven Architectures with Cassandra
Using Event-Driven Architectures with CassandraUsing Event-Driven Architectures with Cassandra
Using Event-Driven Architectures with Cassandra
 
Traveler's Guide to Cassandra
Traveler's Guide to CassandraTraveler's Guide to Cassandra
Traveler's Guide to Cassandra
 
Advanced Cassandra
Advanced CassandraAdvanced Cassandra
Advanced Cassandra
 
Bad Habits Die Hard
Bad Habits Die Hard Bad Habits Die Hard
Bad Habits Die Hard
 
Standing Up Your First Cluster
Standing Up Your First ClusterStanding Up Your First Cluster
Standing Up Your First Cluster
 
Apache Cassandra and Drivers
Apache Cassandra and DriversApache Cassandra and Drivers
Apache Cassandra and Drivers
 
Cassandra Core Concepts
Cassandra Core ConceptsCassandra Core Concepts
Cassandra Core Concepts
 
Introduction to Data Modeling with Apache Cassandra
Introduction to Data Modeling with Apache CassandraIntroduction to Data Modeling with Apache Cassandra
Introduction to Data Modeling with Apache Cassandra
 
Production Ready Cassandra
Production Ready CassandraProduction Ready Cassandra
Production Ready Cassandra
 
Coursera Cassandra Driver
Coursera Cassandra DriverCoursera Cassandra Driver
Coursera Cassandra Driver
 
Enabling Search in your Cassandra Application with DataStax Enterprise
Enabling Search in your Cassandra Application with DataStax EnterpriseEnabling Search in your Cassandra Application with DataStax Enterprise
Enabling Search in your Cassandra Application with DataStax Enterprise
 
Advanced Data Modeling with Apache Cassandra
Advanced Data Modeling with Apache CassandraAdvanced Data Modeling with Apache Cassandra
Advanced Data Modeling with Apache Cassandra
 
Make 2016 your year of SMACK talk
Make 2016 your year of SMACK talkMake 2016 your year of SMACK talk
Make 2016 your year of SMACK talk
 
Cassandra Adoption on Cisco UCS & Open stack
Cassandra Adoption on Cisco UCS & Open stackCassandra Adoption on Cisco UCS & Open stack
Cassandra Adoption on Cisco UCS & Open stack
 

Ähnlich wie Cassandra: One (is the loneliest number)

PagerDuty: Span the WAN? Yes you can!
PagerDuty: Span the WAN? Yes you can!PagerDuty: Span the WAN? Yes you can!
PagerDuty: Span the WAN? Yes you can!DataStax Academy
 
Five Lessons in Distributed Databases
Five Lessons  in Distributed DatabasesFive Lessons  in Distributed Databases
Five Lessons in Distributed Databasesjbellis
 
Cassandra Day London 2015: The Resilience of Apache Cassandra
Cassandra Day London 2015: The Resilience of Apache CassandraCassandra Day London 2015: The Resilience of Apache Cassandra
Cassandra Day London 2015: The Resilience of Apache CassandraDataStax Academy
 
Complicating Complexity: Performance in a New Machine Age
Complicating Complexity: Performance in a New Machine AgeComplicating Complexity: Performance in a New Machine Age
Complicating Complexity: Performance in a New Machine AgeMaurice Naftalin
 
Mark Callaghan, Facebook
Mark Callaghan, FacebookMark Callaghan, Facebook
Mark Callaghan, FacebookOntico
 
Nothing Good Ever Happens After 2am
Nothing Good Ever Happens After 2amNothing Good Ever Happens After 2am
Nothing Good Ever Happens After 2amDaniel Korn
 

Ähnlich wie Cassandra: One (is the loneliest number) (6)

PagerDuty: Span the WAN? Yes you can!
PagerDuty: Span the WAN? Yes you can!PagerDuty: Span the WAN? Yes you can!
PagerDuty: Span the WAN? Yes you can!
 
Five Lessons in Distributed Databases
Five Lessons  in Distributed DatabasesFive Lessons  in Distributed Databases
Five Lessons in Distributed Databases
 
Cassandra Day London 2015: The Resilience of Apache Cassandra
Cassandra Day London 2015: The Resilience of Apache CassandraCassandra Day London 2015: The Resilience of Apache Cassandra
Cassandra Day London 2015: The Resilience of Apache Cassandra
 
Complicating Complexity: Performance in a New Machine Age
Complicating Complexity: Performance in a New Machine AgeComplicating Complexity: Performance in a New Machine Age
Complicating Complexity: Performance in a New Machine Age
 
Mark Callaghan, Facebook
Mark Callaghan, FacebookMark Callaghan, Facebook
Mark Callaghan, Facebook
 
Nothing Good Ever Happens After 2am
Nothing Good Ever Happens After 2amNothing Good Ever Happens After 2am
Nothing Good Ever Happens After 2am
 

Mehr von DataStax Academy

Forrester CXNYC 2017 - Delivering great real-time cx is a true craft
Forrester CXNYC 2017 - Delivering great real-time cx is a true craftForrester CXNYC 2017 - Delivering great real-time cx is a true craft
Forrester CXNYC 2017 - Delivering great real-time cx is a true craftDataStax Academy
 
Introduction to DataStax Enterprise Graph Database
Introduction to DataStax Enterprise Graph DatabaseIntroduction to DataStax Enterprise Graph Database
Introduction to DataStax Enterprise Graph DatabaseDataStax Academy
 
Introduction to DataStax Enterprise Advanced Replication with Apache Cassandra
Introduction to DataStax Enterprise Advanced Replication with Apache CassandraIntroduction to DataStax Enterprise Advanced Replication with Apache Cassandra
Introduction to DataStax Enterprise Advanced Replication with Apache CassandraDataStax Academy
 
Cassandra on Docker @ Walmart Labs
Cassandra on Docker @ Walmart LabsCassandra on Docker @ Walmart Labs
Cassandra on Docker @ Walmart LabsDataStax Academy
 
Cassandra 3.0 Data Modeling
Cassandra 3.0 Data ModelingCassandra 3.0 Data Modeling
Cassandra 3.0 Data ModelingDataStax Academy
 
Data Modeling for Apache Cassandra
Data Modeling for Apache CassandraData Modeling for Apache Cassandra
Data Modeling for Apache CassandraDataStax Academy
 
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & Python
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & PythonCassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & Python
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & PythonDataStax Academy
 
Cassandra @ Sony: The good, the bad, and the ugly part 1
Cassandra @ Sony: The good, the bad, and the ugly part 1Cassandra @ Sony: The good, the bad, and the ugly part 1
Cassandra @ Sony: The good, the bad, and the ugly part 1DataStax Academy
 
Real Time Analytics with Dse
Real Time Analytics with DseReal Time Analytics with Dse
Real Time Analytics with DseDataStax Academy
 

Mehr von DataStax Academy (9)

Forrester CXNYC 2017 - Delivering great real-time cx is a true craft
Forrester CXNYC 2017 - Delivering great real-time cx is a true craftForrester CXNYC 2017 - Delivering great real-time cx is a true craft
Forrester CXNYC 2017 - Delivering great real-time cx is a true craft
 
Introduction to DataStax Enterprise Graph Database
Introduction to DataStax Enterprise Graph DatabaseIntroduction to DataStax Enterprise Graph Database
Introduction to DataStax Enterprise Graph Database
 
Introduction to DataStax Enterprise Advanced Replication with Apache Cassandra
Introduction to DataStax Enterprise Advanced Replication with Apache CassandraIntroduction to DataStax Enterprise Advanced Replication with Apache Cassandra
Introduction to DataStax Enterprise Advanced Replication with Apache Cassandra
 
Cassandra on Docker @ Walmart Labs
Cassandra on Docker @ Walmart LabsCassandra on Docker @ Walmart Labs
Cassandra on Docker @ Walmart Labs
 
Cassandra 3.0 Data Modeling
Cassandra 3.0 Data ModelingCassandra 3.0 Data Modeling
Cassandra 3.0 Data Modeling
 
Data Modeling for Apache Cassandra
Data Modeling for Apache CassandraData Modeling for Apache Cassandra
Data Modeling for Apache Cassandra
 
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & Python
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & PythonCassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & Python
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & Python
 
Cassandra @ Sony: The good, the bad, and the ugly part 1
Cassandra @ Sony: The good, the bad, and the ugly part 1Cassandra @ Sony: The good, the bad, and the ugly part 1
Cassandra @ Sony: The good, the bad, and the ugly part 1
 
Real Time Analytics with Dse
Real Time Analytics with DseReal Time Analytics with Dse
Real Time Analytics with Dse
 

Kürzlich hochgeladen

Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Orbitshub
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistandanishmna97
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxRemote DBA Services
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfOrbitshub
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Bhuvaneswari Subramani
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdfSandro Moreira
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWERMadyBayot
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityWSO2
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Jeffrey Haguewood
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusZilliz
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Zilliz
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Angeliki Cooney
 

Kürzlich hochgeladen (20)

Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 

Cassandra: One (is the loneliest number)

  • 1. 2015-12-09 One (is the loneliest number) donny@pagerduty.com & paul@pagerduty.com
  • 2. 2015-12-09MAKING PAGERDUTY MORE RELIABLE USING PXC
  • 3. 2015-12-09ONE (IS THE LONELIEST NUMBER)
  • 4. 2015-12-09MAKING PAGERDUTY MORE RELIABLE USING PXC Failure
  • 5. 2015-12-08 Background ONE (IS THE LONELIEST NUMBER) • Shared cluster, 5 machines (with replication factor = 5) • 10s of GBs of data • In-flight data: 10s of MBs, maybe 100s
  • 6. 2015-12-08ONE (IS THE LONELIEST NUMBER) Casssandra Replication Client R1 R2 R3
  • 7. 2015-12-08ONE (IS THE LONELIEST NUMBER) Casssandra Replication - Failure Client R1 R2 R3 X
  • 8. 2015-12-08ONE (IS THE LONELIEST NUMBER) Foreshadowing • Series of small outages / degradations • Repair process started • High load, high latency • Response: disable thrift, turn off nodes
  • 9. 2015-12-08ONE (IS THE LONELIEST NUMBER) Coordinator Read Latency (in ms, by host) 6 seconds ~25 ms
  • 10. 2015-12-08ONE (IS THE LONELIEST NUMBER) Coordinator Read Latency (in ms, by host)
  • 11. 2015-12-08ONE (IS THE LONELIEST NUMBER) Coordinator Read Latency (in ms, by host)
  • 12. 2015-12-08ONE (IS THE LONELIEST NUMBER) Coordinator Read Latency (in ms, by host)
  • 13. 2015-12-08ONE (IS THE LONELIEST NUMBER) Coordinator Read Latency (in ms, by host)
  • 14. 2015-12-09MAKING PAGERDUTY MORE RELIABLE USING PXC The next day…
  • 15. 2015-12-08ONE (IS THE LONELIEST NUMBER) The Plan • Trigger repair… … with lots of people watching • Use our load shedding strategies for any problems: • Proactively disable non-critical services • Disable thrift
  • 16. 2015-12-08ONE (IS THE LONELIEST NUMBER) Surprise! • Cron triggers a repair of a different keyspace • Plus a compaction for a large CF
  • 17. 2015-12-08ONE (IS THE LONELIEST NUMBER) Outgoing Notification Backlog Size Normal Bad Horrible
  • 18. 2015-12-08ONE (IS THE LONELIEST NUMBER) Outgoing Notification Backlog Size Normal Bad Horrible :(
  • 19. 2015-12-08ONE (IS THE LONELIEST NUMBER) Cassandra Pending Tasks: ReadStage (by host) Over 9000
  • 20. 2015-12-08ONE (IS THE LONELIEST NUMBER) Cassandra CPU (by host) 100%
  • 21. 2015-12-08ONE (IS THE LONELIEST NUMBER) Factory Reset Success… kind of
  • 22. 2015-12-09MAKING PAGERDUTY MORE RELIABLE USING PXC What went wrong?
  • 23. 2015-12-08ONE (IS THE LONELIEST NUMBER) or: What can we learn from Aimee Mann? One is the loneliest number that you'll ever do Two can be as bad as one It's the loneliest number since the number one No, is the saddest experience you'll ever know Yes, it's the saddest experience you'll ever know
  • 24. No, is the saddest experience you’ll ever know • Cassandra sheds load when overloaded • Shedding drops “stale” requests • Clients see timeouts and have trouble making progress • This sheds load if clients abandon the failed requests • But if clients retry those requests…
  • 25. So I heard you like retries… [diagram labels: Interactive Request (from user), Load Balancer, App Host (×3), Event Processing (×2), Notification Management, Cassandra Cluster (×3)] • Cass Client retries (S) • Service client retries (T) • Load balancer retries (H) • Retries are multiplicative • Total # of retries: O(S*H*T)
  • 26. Yes, it’s the saddest experience you’ll ever know • Dropped requests were retried • …causing load amplification • …causing more dropped requests • …causing even more retries • …causing misery. • i.e. too much load leads to much too much load
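The "retries are multiplicative" point from slides 25 and 26 can be illustrated with a small calculation. This is only a sketch: the function name and the retry counts are hypothetical, and it assumes the worst case, where every attempt at one layer fails and triggers a full set of attempts at the layer below.

```python
def worst_case_attempts(retries_per_layer):
    """Worst-case number of attempts that reach the bottom layer.

    Each layer makes 1 original attempt plus its retries, and every
    attempt at one layer fans out into a full set of attempts below it,
    so the per-layer attempt counts multiply.
    """
    total = 1
    for retries in retries_per_layer:
        total *= 1 + retries
    return total


# Hypothetical retry counts: H (load balancer), T (service client), S (Cassandra client).
h, t, s = 2, 2, 3
print(worst_case_attempts([h, t, s]))  # (1+2) * (1+2) * (1+3) = 36
```

With these made-up numbers, a single user request can turn into 36 requests against an already overloaded cluster, which is the amplification loop the slide describes.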
  • 27. How does overload get started? • Unpredictable workloads • Could be from request volume • In our case, from batch-style processes • Repairs, compaction, application-level tasks (e.g. archiving)
  • 28. PagerDuty system architecture [diagram labels: Monitoring Events, Inbound Event Buffer, Notification Management, Message Delivery, SMS, Phone Calls, Interactive Requests (from users), Load Balancer, App Host, Data Access, Cassandra Cluster]
  • 29. Workload A + Workload B = Workload A + B [charts] • …and more bursts are more worst
  • 30. One (cluster) is the loneliest number that you’ll ever do • How many ops are A vs. B? • Must reverse engineer the contributions • Build (constantly evolving) models • Hard to reason about system behaviour • …and gets substantially harder when your entire production stack is overloaded
  • 31. How we fixed it
  • 32. Stop poking the bear • Only retry when necessary - is failure an option? • Less risky to retry user-initiated requests • Don’t retry retries (much) • Specifically: • Only try a single fallback C* host at the driver level, not N-1 • Only try a single fallback service host, not M-1
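The "single fallback host, not N-1" rule from slide 32 might look roughly like the following. This is a generic sketch, not the Cassandra driver's API or PagerDuty's code: `call_with_single_fallback` and the `send(host, request)` callable are hypothetical names, and connection or timeout errors are assumed to be what signals a failed attempt.

```python
def call_with_single_fallback(hosts, request, send):
    """Try the first host, then at most one fallback host.

    `send(host, request)` is a hypothetical callable that returns a
    response or raises ConnectionError / TimeoutError on failure.
    """
    if not hosts:
        raise ValueError("no hosts available")
    last_error = None
    # Primary plus a single fallback, instead of walking all N hosts.
    for host in hosts[:2]:
        try:
            return send(host, request)
        except (ConnectionError, TimeoutError) as exc:
            last_error = exc
    # Both attempts failed: surface the error rather than retrying further.
    raise last_error
```

Capping each layer at two attempts bounds the worst-case fan-out across three layers at 2 * 2 * 2 = 8 attempts, no matter how many hosts each layer knows about.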
  • 33. Prepare for the worst case • To avoid overload, must provision for the worst case • So either scale for the (bursty) stars aligning… • …or prevent stars from aligning in the first place
  • 34. Preventing star-bursts, part 1: coordinate • Explicit scheduling to interleave bursts • Repairs, compactions, batch jobs - Cassandra & services • Automation can help… • …but still error prone
  • 35. Preventing star-bursts, part 2: smooth, not chunky • Jobs can be done more frequently • But with smaller batch size • In the limit, aims for continuous & constant intensity workload • Some Cassandra options too: • Compaction, transfer, and other throttle limits • Levelled compaction vs. size-tiered compaction
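The "smooth, not chunky" idea from slide 35 (smaller batches, run more often) can be sketched for an application-level job such as archiving. The names `chunks`, `run_smoothly`, `batch_size`, and `pause_seconds` are hypothetical, and a fixed pause between batches is an assumed pacing strategy, not something the slides specify.

```python
import time


def chunks(items, batch_size):
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def run_smoothly(items, batch_size, pause_seconds, process, sleep=time.sleep):
    """Process `items` in small batches, pausing between batches.

    Spreading the work out keeps the load on the datastore closer to a
    constant intensity instead of one large burst.
    Returns the number of batches processed.
    """
    batches = 0
    for batch in chunks(items, batch_size):
        if batches:
            sleep(pause_seconds)  # pace the job between batches
        process(batch)
        batches += 1
    return batches


# Example: 10 hypothetical records, 3 per batch, no real pause for the demo.
processed = []
run_smoothly(list(range(10)), 3, 0.0, processed.append)
print(processed)
```

Shrinking `batch_size` while running the job more often moves the workload toward the continuous, constant-intensity limit the slide describes.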
  • 36. Preventing star-bursts, part 3: isolation • Air gap between each workload • Distinct Cassandra cluster for each service/workload • Cons: • More infrastructure • More configuration management • Pros: • Easy to monitor, reason about, diagnose, and scale • Reduces the blast radius when failures happen (and they will)
  • 37. PagerDuty system architecture: today [diagram labels: Inbound Event Buffer, Notification Management, Message Delivery, Cassandra Cluster (×3)]
  • 38. Lessons learned
  • 39. What have we learned? • Retries: the devil’s in the details • Variable workloads: bad, especially if unpredictable • Workload peaks: additive, and bad in multiples • Isolation: the gift that keeps on giving
  • 40. One is the loneliest number that you'll ever do / Two can be as bad as one / It's the loneliest number since the number one / No, is the saddest experience you'll ever know / Yes, it's the saddest experience you'll ever know