1) The document discusses Netflix's use of the Cassandra database distributed across Amazon Web Services (AWS) data centers.
2) Netflix uses Cassandra clusters spanning multiple AWS regions and availability zones to ensure high availability and prevent data loss.
3) During a recent AWS outage that impacted one availability zone, Netflix's Cassandra database was able to handle the loss of one third of its nodes without any loss of data or availability due to its multi-region, multi-zone architecture.
6. From the Netflix tech blog:
“Cassandra, our distributed cloud persistence store which is distributed across all zones and regions, dealt with the loss of one third of its regional nodes without any loss of data or availability.” [2]
7. Topics
Cassandra at Netflix
Constructing clusters in AWS with Priam
Resiliency
Observations on AWS, Cassandra and AWS/Cassandra
Monitoring and maintenance
References
8. Cassandra by the numbers
41 Number of production clusters
13 Number of multi-region clusters
4 Max regions, one cluster
90 Total TB of data across all clusters
621 Number of Cassandra nodes
72/34 Largest Cassandra cluster (nodes/data in TB)
80k/250k Max read/writes per second on a single cluster
3* Size of Operations team
* We are hiring DevOps and Developers. Stop by our booth!
9. Netflix Deployed on AWS
[Architecture diagram: Netflix service tiers (Content, Logs, Play, WWW, API, CS) deployed on AWS using S3, EC2, EMR, and Hive & Pig, handling terabytes of logs and petabytes of content, with CDNs and ISPs delivering terabits to customers.]
10. Constructing clusters in AWS with Priam
Tomcat webapp for Cassandra administration
Token management
Full and incremental backups
JMX metrics collection
cassandra.yaml configuration
REST API for most nodetool commands
AWS Security Groups for multi-region clusters
Open sourced, available on github [3]
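Because Priam fronts most nodetool commands with a REST API, routine operations can be driven over HTTP instead of SSH. A minimal client sketch; the base path, default port, and command names here are illustrative assumptions, not Priam's documented routes (see the project on GitHub [3] for the real API):

```python
from urllib.parse import urljoin
from urllib.request import urlopen

class PriamClient:
    """Tiny sketch of an HTTP client for Priam's REST layer.

    The /Priam/REST/v1/ base path and port 8080 are assumptions for
    illustration; check Priam's docs for the actual endpoints.
    """

    def __init__(self, host: str, port: int = 8080):
        self.base = f"http://{host}:{port}/Priam/REST/v1/"

    def url_for(self, command: str) -> str:
        # Build the URL for a nodetool-style command, e.g. "ring".
        return urljoin(self.base, command)

    def run(self, command: str) -> str:
        # One HTTP GET replaces shelling out to `nodetool <command>`.
        with urlopen(self.url_for(command)) as resp:
            return resp.read().decode()

client = PriamClient("localhost")
print(client.url_for("ring"))  # http://localhost:8080/Priam/REST/v1/ring
```

The same pattern covers backups and token operations: each administrative action becomes one GET against the node's local Priam instance.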
11. Autoscaling Groups
A: Constructing a cluster in AWS
ASGs do not map directly to nodetool ring output, but are used to define the cluster (# of instances, AZs, etc.).
Address DC Rack Status State Load Owns Token
…
###.##.##.### eu-west 1a Up Normal 108.97 GB 16.67% …
###.##.#.## us-east 1e Up Normal 103.72 GB 0.00% …
##.###.###.### eu-west 1b Up Normal 104.82 GB 16.67% …
##.##.##.### us-east 1c Up Normal 111.87 GB 0.00% …
##.###.##.### eu-west 1c Up Normal 95.51 GB 16.67% …
##.##.##.## us-east 1d Up Normal 105.85 GB 0.00% …
##.###.##.### eu-west 1a Up Normal 91.25 GB 16.67% …
###.##.##.### us-east 1e Up Normal 102.71 GB 0.00% …
##.###.###.### eu-west 1b Up Normal 101.87 GB 16.67% …
##.##.###.## us-east 1c Up Normal 102.83 GB 0.00% …
###.##.###.## eu-west 1c Up Normal 96.66 GB 16.67% …
##.##.##.### us-east 1d Up Normal 99.68 GB 0.00% …
AWS terminology:
Region
Availability Zone (AZ)
Instance
Amazon Machine Image (AMI): image loaded onto an AWS instance; all packages needed to run an application.
Security Group: defines access control between ASGs.
12. Cassandra Configuration
B: Constructing a cluster in AWS
App = cass_cluster (APP is not an AWS entity, but one that we use internally to denote a service; it is part of asgard [4], our open-sourced cloud application web interface.)
ASG #1: Availability Zone = A, Region = us-east, Instance count = 6, Instance type = m2.4xlarge
ASG #2: Availability Zone = B, Region = us-east, Instance count = 6, Instance type = m2.4xlarge
ASG #3: Availability Zone = C, Region = us-east, Instance count = 6, Instance type = m2.4xlarge
Multi-region clusters have the same configuration in each region; just repeat what you see here!
Full and incremental backups go to local-region S3 via Priam; external full backups to an alternate region are saved for 30 days.
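The incremental-backup path can be sketched as a periodic scan for SSTable data files that have not yet been shipped to S3. This is a simplified illustration: the upload itself, compression, throttling, and Priam's backup metadata are all omitted, and the "-Data.db" suffix is Cassandra's SSTable naming convention:

```python
import os

# Simplified sketch of Priam-style incremental backup: list SSTable
# data files in a Cassandra data directory that have not yet been
# uploaded to the local-region S3 bucket. The actual S3 upload and
# bookkeeping are left out of this sketch.
def sstables_to_upload(data_dir: str, already_uploaded: set) -> list:
    candidates = [f for f in os.listdir(data_dir) if f.endswith("-Data.db")]
    return sorted(f for f in candidates if f not in already_uploaded)
```

A backup daemon would run this scan on each node, upload every new file, and record it so the next pass skips it.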
13. Putting it all together
C: Constructing a cluster in AWS
The AMI contains the OS, base Netflix packages, Cassandra, and Priam.
(1) Alternate availability zones (a, b, c) around the ring to ensure data is written to multiple data centers.
(2) Survive the loss of a data center by ensuring that we only lose one node from each replication set.
Priam runs on each node and will:
* Assign tokens to each node, alternating AZs around the ring
* Perform nightly snapshot backups to S3
* Perform incremental SSTable backups to S3
* Bootstrap replacement nodes to use vacated tokens
* Collect JMX metrics for our monitoring systems
* Serve REST API calls for most nodetool functions
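The token scheme in (1) and (2) can be sketched for the RandomPartitioner's 2^127 token space: tokens are evenly spaced, and zones are assigned round-robin, so with RF=3 every replication set of three consecutive nodes spans all three AZs. This is a simplified sketch of what Priam automates (multi-region token offsets are omitted):

```python
# Sketch of alternating-AZ token assignment for the RandomPartitioner
# (token space 0 .. 2**127). Tokens are evenly spaced; zones alternate
# a, b, c around the ring, so any 3 consecutive nodes (one replication
# set at RF=3) span all three availability zones.
RING_SIZE = 2 ** 127

def assign_tokens(num_nodes: int, zones=("a", "b", "c")):
    return [
        {"token": i * RING_SIZE // num_nodes, "zone": zones[i % len(zones)]}
        for i in range(num_nodes)
    ]

ring = assign_tokens(6)
for node in ring:
    print(node["zone"], node["token"])

# With RF=3, every window of 3 consecutive nodes covers zones a, b, c,
# so losing one AZ removes at most one replica from each replication set.
```

This placement is what makes "each AZ has a full replica of the data" hold on the resiliency slides that follow.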
14. Resiliency - Instance
• RF=AZ=3
• Cassandra bootstrapping works really well
• Replace nodes immediately
• Repair often
15. Resiliency – One availability zone
RF=AZ=3
Alternating AZs ensures that each AZ has a full replica of the data
Provision the cluster to run at 2/3 capacity
Ride out a zone outage; do not move to another zone
Bootstrap one node at a time
Repair after recovery
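The 2/3-capacity rule follows from quorum arithmetic: with RF=3 and one replica per zone, a QUORUM operation needs floor(RF/2) + 1 = 2 replicas, so one zone can vanish but two cannot. A small sketch:

```python
# With RF = AZ = 3 and alternating zones, each replication set has
# exactly one replica per zone, so a zone outage costs each set at
# most one replica. QUORUM needs floor(RF/2) + 1 replicas.
def quorum(rf: int) -> int:
    return rf // 2 + 1

def survives(rf: int, zones_down: int) -> bool:
    # Assumes one replica per zone (RF equals the number of zones).
    return rf - zones_down >= quorum(rf)

print(survives(rf=3, zones_down=1))  # True: 2 of 3 replicas remain
print(survives(rf=3, zones_down=2))  # False: restore from backup and repair
```

This is exactly the multiple-AZ case on a later slide: once two zones are down, quorum cannot be satisfied and the path is restore plus repair.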
16. What happened on June 29th?
During outage
All Cassandra instances in us-east-1a were inaccessible
nodetool ring showed all nodes as DOWN
Monitoring other AZs to ensure availability
Recovery – power restored to us-east-1a
Most instances rejoined the cluster without issue
Most of the remainder required a reboot to recover
The remaining nodes had to be replaced, one at a time
17. Resiliency – Multiple availability zones
Outage; can no longer satisfy quorum
Restore from backup and repair
18. Resiliency - Region
Connectivity loss between regions – operate as island clusters until service is restored
Repair data between regions
If an entire region disappears, watch DVDs instead
19. Observations: AWS
Ephemeral drive performance is better than EBS
S3-backed AMIs help us weather EBS outages
Instances seldom die on their own
Use as many availability zones as you can afford
Understand how AWS launches instances
I/O is constrained in most AWS instance types
Repairs are very I/O intensive
Large size-tiered compactions can impact latency
SSDs[5] are game changers [6]
20. Observations: Cassandra
A slow node is worse than a down node
Cold cache increases load and kills latency
Use whatever dials you can find in an emergency
Remove node from coordinator list
Compaction throttling
Min/max compaction thresholds
Enable/disable gossip
Leveled compaction performance is very promising
1.1.x and 1.2.x should address some big issues
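Most of these dials are plain nodetool commands (setcompactionthroughput, disablegossip/enablegossip, setcompactionthreshold). A tiny helper that builds the command lines; the hosts shown are placeholders, and actually running the commands requires nodetool on the PATH:

```python
import subprocess

# Build (and optionally run) the nodetool command for an emergency dial.
def dial(host, *args, execute=False):
    cmd = ["nodetool", "-h", host] + list(args)
    if execute:
        subprocess.run(cmd, check=True)  # requires nodetool on PATH
    return cmd

# Throttle compaction to 16 MB/s to shed I/O load:
print(dial("10.0.0.1", "setcompactionthroughput", "16"))
# Silence a misbehaving node's gossip, then bring it back:
print(dial("10.0.0.1", "disablegossip"))
print(dial("10.0.0.1", "enablegossip"))
```

Removing a slow node from the coordinator list happens client-side, not through nodetool, which is why it appears as a separate dial above.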
23. Maintenance
Repair clusters regularly
Run off-line major compactions to avoid latency
SSDs will make this unnecessary
Always replace nodes when they fail
Periodically replace all nodes in the cluster
Upgrade to new versions
Binary (rpm) for major upgrades or emergencies
Rolling AMI push over time
24. References
1. A bad night: Netflix and Instagram go down amid Amazon Web Services outage (theverge.com)
2. Lessons Netflix learned from AWS Storm (techblog.netflix.com)
3. github / Netflix / priam (github.com)
4. github / Netflix / asgard (github.com)
5. Announcing High I/O Instances for Amazon (aws.amazon.com)
6. Benchmarking High Performance I/O with SSD for Cassandra on AWS (techblog.netflix.com)
Editor's notes
Outline of presentation
Jun 29 outage
Context - Cassandra and AWS - updated usage numbers - include architecture diagram with Cassandra called out
How clusters are constructed – blueprint diagrams should include #1 – AWS make-up – ASGs and AZs; #2 - instance particulars; #3 - Priam/S3
Resiliency - node, zone and region outages
Priam – bootstrapping, monitoring, backup and restore, open source
Monitoring - what we monitor - tools we use - Epic/Atlas and dashboards, and maintenance tasks - Jenkins
Things we monitor
Issues we have
Note on SSDs
Minimum cluster size = 6
… Developer in house …
Quickly find problems by looking into code
Documentation/tools for troubleshooting are scarce
… repairs …
Affect entire replication set, cause very high latency in I/O-constrained environment
… multi-tenant …
Hard to track changes being made
Shared resources mean that one service can affect another one
Individual usage only grows
Moving services to a new cluster with the service live is non-trivial
… smaller per-node data …
Instance-level operations (bootstrap, compact, etc.) are faster
Extension of Epic, using preconfigured dashboards for each cluster
Add additional metrics as we learn which to monitor