SlideShare ist ein Scribd-Unternehmen logo
1 von 47
ABOUT NETFLIX
NETFLIX
ACTIVE - ACTIVE
WHAT IS ACTIVE-ACTIVE
Also called dual active, it is a phrase used to
describe a network of independent processing nodes
where each node has access to a replicated database
giving each node access and usage of single
application. In an active-active system all requests are
load balanced across all available processing capacity,
Where a failure occurs on a node, another node in the
network takes its place.
DOES AN INSTANCE FAIL?
• It can, plan for it
• Bad code / configuration pushes
• Latent issues
• Hardware failure
• Test with Chaos Monkey
DOES A ZONE FAIL?
• Rarely, but happened before
• Routing issues
• DC-specific issues
• App-specific issues within a zone
• Test with Chaos Gorilla
DOES A REGION FAIL?
• Full region – unlikely, very rare
• Individual Services can fail region-wide
• Most likely, a region-wide configuration issues
• Test with Chaos Kong
EVERYTHING FAILS… EVENTUALLY
• Keep your services running by embracing isolation and
redundancy
• Construct a highly agile and highly available service
from ephemeral and assumed broken components
ISOLATION
• Changes in one region should not affect others
• Regional outage should not affect others
• Network partitioning between regions should not affect
functionality / operations
REDUNDANCY
• Make more than one (of pretty much everything)
• Specifically, distribute services across Availability
Zones and regions
HISTORY: X-MAS EVE 2012
• Netflix multi-hour outage
• US-East1 regional Elastic Load Balancing issue
• “...data was deleted by a maintenance process
that was inadvertently run against the
production ELB state data”
ACTIVE-ACTIVE ARCHITECTURE
THE PROCESS
IDENTIFYING CLUSTERS FOR AA
SNITCH CHANGES
EC2Snitch EC2MultiRegionSnitch
Uses Private IPs Uses Public IPs
PRIAM.MULTIREGION.ENABLE =TRUE
tcp 7101-7101 [ ] [10.190.21.36/32, 10.232.200.17/32, 10.33.573.26/32,
10.20.151.165/32, 10.226.99.46/32, 10.244.143.193/32]
tcp 7103-7103 [ ] [54.196.221.136/32, 54.202.200.217/32, 54.203.57.226/32,
54.205.151.165/32, 54.226.99.46/32, 54.244.143.193/32]
SPIN UP NODES IN NEW REGION
us-east-1 us-west-2
APP
UPDATE KEYSPACE
Update keyspace <keyspace> with placement_strategy =
'NetworkTopologyStrategy'
and strategy_options = {us-east : 3, us-west-2 : 3};
Existing region and replication factor New region and replication factor
REBUILD NEW REGION
Run – nodetool rebuild us-east-1 on all us-west-2 nodes
RUN NODETOOL REPAIR
VALIDATION
BENCHMARKING GLOBAL CASSANDRA
WRITE INTENSIVE TEST OF CROSS-REGION REPLICATION
CAPACITY
16 X HI1.4XLARGE SSD NODES PER ZONE = 96 TOTAL
192 TB OF SSD IN SIX LOCATIONS UP AND RUNNING
CASSANDRA IN 20 MINUTES
Cassandra Replicas
Zone A
Cassandra Replicas
Zone B
Cassandra Replicas
Zone C
US-West-2 Region - Oregon
Cassandra Replicas
Zone A
Cassandra Replicas
Zone B
Cassandra Replicas
Zone C
US-East-1 Region - Virginia
Test
Load
Test
Load
Validation
Load
Interzone Traffic
1 Million Writes
CL.ONE (Wait for
One Replica to ack)
1 Million Reads
after 500 ms
CL.ONE with No
Data Loss
Interregional Traffic
Up to 9Gbits/s, 83ms 18 TB backups
from S3
TEST FOR THUNDERING HERD
TEST FOR RETRIES
FAILURE
RETRY
KEY METRICS USED
• 99 /95 th Read Latency (Client & C*)
• Dropped Metrics on C*
• Exceptions on C*
• Heap Usage on C*
• CPU Usage (Client & C*)
• Threads Pending on C*
CONFIGURATION FOR TEST
• 24 Node C* SSDs
• 220 Client instances
• 70+ Jmeter Instances
C* IOPS
TOTAL READ IOPS
TOTAL WRITE IOPS
95th LATENCY
99th LATENCY
CHECK FOR CEILING
NETWORK PARTITION
us-east-1 us-west-2
TAKEAWAYS
REPAIRS AFTER EXTENSION ARE PAINFUL !!
TIME TO REPAIR DEPENDS ON
• Number of regions
• Number of replicas
• Data size
• Amount of entropy
ADJUST GC_GRACE AFTER
EXTENSION
• Column Family Setting
• Defined in seconds
• Default 10 days
• Tweak gc_grace settings to
accommodate time taken to repair
• BEWARE of deleted columns
RUNBOOK
PLAN FOR CAPACITY
CONSISTENCY LEVEL
• Check the client for consistency level setting
• In a Multiregional cluster QUORUM <>
LOCAL_QUORUM
• Recommended consistency levels
LOCAL_ONE (CASSANDRA-6202) for reads
and LOCAL_QUORUM for writes
• For region resiliency avoid – ALL or
QUORUM calls
HOW DO WE KNOW IT WORKS?
CREATE CHAOS!!
Benchmark …
Time Consuming
But worth it!
Active Active - C* Behind the Scenes at Netflix

Weitere ähnliche Inhalte

Andere mochten auch

Cassandra on Mesos Across Multiple Datacenters at Uber (Abhishek Verma) | C* ...
Cassandra on Mesos Across Multiple Datacenters at Uber (Abhishek Verma) | C* ...Cassandra on Mesos Across Multiple Datacenters at Uber (Abhishek Verma) | C* ...
Cassandra on Mesos Across Multiple Datacenters at Uber (Abhishek Verma) | C* ...
DataStax
 

Andere mochten auch (20)

3800 die-bonder overview
3800 die-bonder overview3800 die-bonder overview
3800 die-bonder overview
 
Highly available, scalable and secure data with Cassandra and DataStax Enterp...
Highly available, scalable and secure data with Cassandra and DataStax Enterp...Highly available, scalable and secure data with Cassandra and DataStax Enterp...
Highly available, scalable and secure data with Cassandra and DataStax Enterp...
 
Cassandra Summit 2015: Real World DTCS For Operators
Cassandra Summit 2015: Real World DTCS For OperatorsCassandra Summit 2015: Real World DTCS For Operators
Cassandra Summit 2015: Real World DTCS For Operators
 
Securing Cassandra
Securing CassandraSecuring Cassandra
Securing Cassandra
 
Cassandra multi-datacenter operations essentials
Cassandra multi-datacenter operations essentialsCassandra multi-datacenter operations essentials
Cassandra multi-datacenter operations essentials
 
Multi-Region Cassandra Clusters
Multi-Region Cassandra ClustersMulti-Region Cassandra Clusters
Multi-Region Cassandra Clusters
 
Apache Cassandra Data Modeling with Travis Price
Apache Cassandra Data Modeling with Travis PriceApache Cassandra Data Modeling with Travis Price
Apache Cassandra Data Modeling with Travis Price
 
Lessons Learned from Real-World Deployments of Java EE 7 at JavaOne 2014
Lessons Learned from Real-World Deployments of Java EE 7 at JavaOne 2014Lessons Learned from Real-World Deployments of Java EE 7 at JavaOne 2014
Lessons Learned from Real-World Deployments of Java EE 7 at JavaOne 2014
 
DataStax - Analytics on Apache Cassandra - Paris Tech Talks meetup
DataStax - Analytics on Apache Cassandra - Paris Tech Talks meetupDataStax - Analytics on Apache Cassandra - Paris Tech Talks meetup
DataStax - Analytics on Apache Cassandra - Paris Tech Talks meetup
 
GumGum: Multi-Region Cassandra in AWS
GumGum: Multi-Region Cassandra in AWSGumGum: Multi-Region Cassandra in AWS
GumGum: Multi-Region Cassandra in AWS
 
Cassandra Operations at Netflix
Cassandra Operations at NetflixCassandra Operations at Netflix
Cassandra Operations at Netflix
 
An Introduction to Priam
An Introduction to PriamAn Introduction to Priam
An Introduction to Priam
 
Multi Data Center Strategies
Multi Data Center StrategiesMulti Data Center Strategies
Multi Data Center Strategies
 
Ficstar Software: Cassandra Installation to Optimization
Ficstar Software: Cassandra Installation to OptimizationFicstar Software: Cassandra Installation to Optimization
Ficstar Software: Cassandra Installation to Optimization
 
Target: Performance Tuning Cassandra at Target
Target: Performance Tuning Cassandra at TargetTarget: Performance Tuning Cassandra at Target
Target: Performance Tuning Cassandra at Target
 
DataStax: Extreme Cassandra Optimization: The Sequel
DataStax: Extreme Cassandra Optimization: The SequelDataStax: Extreme Cassandra Optimization: The Sequel
DataStax: Extreme Cassandra Optimization: The Sequel
 
Apache Cassandra and DataStax Enterprise Explained with Peter Halliday at Wil...
Apache Cassandra and DataStax Enterprise Explained with Peter Halliday at Wil...Apache Cassandra and DataStax Enterprise Explained with Peter Halliday at Wil...
Apache Cassandra and DataStax Enterprise Explained with Peter Halliday at Wil...
 
Performance tuning - A key to successful cassandra migration
Performance tuning - A key to successful cassandra migrationPerformance tuning - A key to successful cassandra migration
Performance tuning - A key to successful cassandra migration
 
Cassandra on Mesos Across Multiple Datacenters at Uber (Abhishek Verma) | C* ...
Cassandra on Mesos Across Multiple Datacenters at Uber (Abhishek Verma) | C* ...Cassandra on Mesos Across Multiple Datacenters at Uber (Abhishek Verma) | C* ...
Cassandra on Mesos Across Multiple Datacenters at Uber (Abhishek Verma) | C* ...
 
Cassandra EU 2012 - Netflix's Cassandra Architecture and Open Source Efforts
Cassandra EU 2012 - Netflix's Cassandra Architecture and Open Source EffortsCassandra EU 2012 - Netflix's Cassandra Architecture and Open Source Efforts
Cassandra EU 2012 - Netflix's Cassandra Architecture and Open Source Efforts
 

Mehr von DataStax Academy

Cassandra on Docker @ Walmart Labs
Cassandra on Docker @ Walmart LabsCassandra on Docker @ Walmart Labs
Cassandra on Docker @ Walmart Labs
DataStax Academy
 
Cassandra Adoption on Cisco UCS & Open stack
Cassandra Adoption on Cisco UCS & Open stackCassandra Adoption on Cisco UCS & Open stack
Cassandra Adoption on Cisco UCS & Open stack
DataStax Academy
 
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & Python
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & PythonCassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & Python
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & Python
DataStax Academy
 
Standing Up Your First Cluster
Standing Up Your First ClusterStanding Up Your First Cluster
Standing Up Your First Cluster
DataStax Academy
 
Real Time Analytics with Dse
Real Time Analytics with DseReal Time Analytics with Dse
Real Time Analytics with Dse
DataStax Academy
 
Introduction to Data Modeling with Apache Cassandra
Introduction to Data Modeling with Apache CassandraIntroduction to Data Modeling with Apache Cassandra
Introduction to Data Modeling with Apache Cassandra
DataStax Academy
 
Enabling Search in your Cassandra Application with DataStax Enterprise
Enabling Search in your Cassandra Application with DataStax EnterpriseEnabling Search in your Cassandra Application with DataStax Enterprise
Enabling Search in your Cassandra Application with DataStax Enterprise
DataStax Academy
 
Advanced Data Modeling with Apache Cassandra
Advanced Data Modeling with Apache CassandraAdvanced Data Modeling with Apache Cassandra
Advanced Data Modeling with Apache Cassandra
DataStax Academy
 

Mehr von DataStax Academy (20)

Forrester CXNYC 2017 - Delivering great real-time cx is a true craft
Forrester CXNYC 2017 - Delivering great real-time cx is a true craftForrester CXNYC 2017 - Delivering great real-time cx is a true craft
Forrester CXNYC 2017 - Delivering great real-time cx is a true craft
 
Introduction to DataStax Enterprise Graph Database
Introduction to DataStax Enterprise Graph DatabaseIntroduction to DataStax Enterprise Graph Database
Introduction to DataStax Enterprise Graph Database
 
Introduction to DataStax Enterprise Advanced Replication with Apache Cassandra
Introduction to DataStax Enterprise Advanced Replication with Apache CassandraIntroduction to DataStax Enterprise Advanced Replication with Apache Cassandra
Introduction to DataStax Enterprise Advanced Replication with Apache Cassandra
 
Cassandra on Docker @ Walmart Labs
Cassandra on Docker @ Walmart LabsCassandra on Docker @ Walmart Labs
Cassandra on Docker @ Walmart Labs
 
Cassandra 3.0 Data Modeling
Cassandra 3.0 Data ModelingCassandra 3.0 Data Modeling
Cassandra 3.0 Data Modeling
 
Cassandra Adoption on Cisco UCS & Open stack
Cassandra Adoption on Cisco UCS & Open stackCassandra Adoption on Cisco UCS & Open stack
Cassandra Adoption on Cisco UCS & Open stack
 
Data Modeling for Apache Cassandra
Data Modeling for Apache CassandraData Modeling for Apache Cassandra
Data Modeling for Apache Cassandra
 
Coursera Cassandra Driver
Coursera Cassandra DriverCoursera Cassandra Driver
Coursera Cassandra Driver
 
Production Ready Cassandra
Production Ready CassandraProduction Ready Cassandra
Production Ready Cassandra
 
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & Python
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & PythonCassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & Python
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & Python
 
Cassandra @ Sony: The good, the bad, and the ugly part 1
Cassandra @ Sony: The good, the bad, and the ugly part 1Cassandra @ Sony: The good, the bad, and the ugly part 1
Cassandra @ Sony: The good, the bad, and the ugly part 1
 
Cassandra @ Sony: The good, the bad, and the ugly part 2
Cassandra @ Sony: The good, the bad, and the ugly part 2Cassandra @ Sony: The good, the bad, and the ugly part 2
Cassandra @ Sony: The good, the bad, and the ugly part 2
 
Standing Up Your First Cluster
Standing Up Your First ClusterStanding Up Your First Cluster
Standing Up Your First Cluster
 
Real Time Analytics with Dse
Real Time Analytics with DseReal Time Analytics with Dse
Real Time Analytics with Dse
 
Introduction to Data Modeling with Apache Cassandra
Introduction to Data Modeling with Apache CassandraIntroduction to Data Modeling with Apache Cassandra
Introduction to Data Modeling with Apache Cassandra
 
Cassandra Core Concepts
Cassandra Core ConceptsCassandra Core Concepts
Cassandra Core Concepts
 
Enabling Search in your Cassandra Application with DataStax Enterprise
Enabling Search in your Cassandra Application with DataStax EnterpriseEnabling Search in your Cassandra Application with DataStax Enterprise
Enabling Search in your Cassandra Application with DataStax Enterprise
 
Bad Habits Die Hard
Bad Habits Die Hard Bad Habits Die Hard
Bad Habits Die Hard
 
Advanced Data Modeling with Apache Cassandra
Advanced Data Modeling with Apache CassandraAdvanced Data Modeling with Apache Cassandra
Advanced Data Modeling with Apache Cassandra
 
Advanced Cassandra
Advanced CassandraAdvanced Cassandra
Advanced Cassandra
 

Kürzlich hochgeladen

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 

Kürzlich hochgeladen (20)

Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 

Active Active - C* Behind the Scenes at Netflix

  • 1.
  • 4.
  • 6. WHAT IS ACTIVE-ACTIVE Also called dual active, it is a phrase used to describe a network of independent processing nodes where each node has access to a replicated database giving each node access and usage of single application. In an active-active system all requests are load balanced across all available processing capacity, Where a failure occurs on a node, another node in the network takes its place.
  • 7. DOES AN INSTANCE FAIL? • It can, plan for it • Bad code / configuration pushes • Latent issues • Hardware failure • Test with Chaos Monkey
  • 8. DOES A ZONE FAIL? • Rarely, but happened before • Routing issues • DC-specific issues • App-specific issues within a zone • Test with Chaos Gorilla
  • 9. DOES A REGION FAIL? • Full region – unlikely, very rare • Individual Services can fail region-wide • Most likely, a region-wide configuration issues • Test with Chaos Kong
  • 10. EVERYTHING FAILS… EVENTUALLY • Keep your services running by embracing isolation and redundancy • Construct a highly agile and highly available service from ephemeral and assumed broken components
  • 11. ISOLATION • Changes in one region should not affect others • Regional outage should not affect others • Network partitioning between regions should not affect functionality / operations
  • 12. REDUNDANCY • Make more than one (of pretty much everything) • Specifically, distribute services across Availability Zones and regions
  • 13. HISTORY: X-MAS EVE 2012 • Netflix multi-hour outage • US-East1 regional Elastic Load Balancing issue • “...data was deleted by a maintenance process that was inadvertently run against the production ELB state data”
  • 15.
  • 16.
  • 20. PRIAM.MULTIREGION.ENABLE =TRUE tcp 7101-7101 [ ] [10.190.21.36/32, 10.232.200.17/32, 10.33.573.26/32, 10.20.151.165/32, 10.226.99.46/32, 10.244.143.193/32] tcp 7103-7103 [ ] [54.196.221.136/32, 54.202.200.217/32, 54.203.57.226/32, 54.205.151.165/32, 54.226.99.46/32, 54.244.143.193/32]
  • 21.
  • 22. SPIN UP NODES IN NEW REGION us-east-1 us-west-2 APP
  • 23. UPDATE KEYSPACE Update keyspace <keyspace> with placement_strategy = 'NetworkTopologyStrategy' and strategy_options = {us-east : 3, us-west-2 : 3}; Existing region and replication factor New region and replication factor
  • 24. REBUILD NEW REGION Run – nodetool rebuild us-east-1 on all us-west-2 nodes
  • 27. BENCHMARKING GLOBAL CASSANDRA WRITE INTENSIVE TEST OF CROSS-REGION REPLICATION CAPACITY 16 X HI1.4XLARGE SSD NODES PER ZONE = 96 TOTAL 192 TB OF SSD IN SIX LOCATIONS UP AND RUNNING CASSANDRA IN 20 MINUTES Cassandra Replicas Zone A Cassandra Replicas Zone B Cassandra Replicas Zone C US-West-2 Region - Oregon Cassandra Replicas Zone A Cassandra Replicas Zone B Cassandra Replicas Zone C US-East-1 Region - Virginia Test Load Test Load Validation Load Interzone Traffic 1 Million Writes CL.ONE (Wait for One Replica to ack) 1 Million Reads after 500 ms CL.ONE with No Data Loss Interregional Traffic Up to 9Gbits/s, 83ms 18 TB backups from S3
  • 30. KEY METRICS USED • 99 /95 th Read Latency (Client & C*) • Dropped Metrics on C* • Exceptions on C* • Heap Usage on C* • CPU Usage (Client & C*) • Threads Pending on C*
  • 31. CONFIGURATION FOR TEST • 24 Node C* SSDs • 220 Client instances • 70+ Jmeter Instances
  • 33. TOTAL READ IOPS TOTAL WRITE IOPS
  • 38. REPAIRS AFTER EXTENSION ARE PAINFUL !!
  • 39. TIME TO REPAIR DEPENDS ON • Number of regions • Number of replicas • Data size • Amount of entropy
  • 40. ADJUST GC_GRACE AFTER EXTENSION • Column Family Setting • Defined in seconds • Default 10 days • Tweak gc_grace settings to accommodate time taken to repair • BEWARE of deleted columns
  • 43. CONSISTENCY LEVEL • Check the client for consistency level setting • In a Multiregional cluster QUORUM <> LOCAL_QUORUM • Recommended consistency levels LOCAL_ONE (CASSANDRA-6202) for reads and LOCAL_QUORUM for writes • For region resiliency avoid – ALL or QUORUM calls
  • 44. HOW DO WE KNOW IT WORKS? CREATE CHAOS!!
  • 45.