Netflix Cassandra Architecture at Scale
1. Cassandra in the Netflix Architecture
CassandraEU London March 28th, 2012
Denis Sheahan
2. Agenda
• Netflix and the Cloud
• Why Cassandra
• Cassandra Deployment @ Netflix
• Cassandra Operations
• Cassandra lessons learned
• Scalability
• Open Source
3. Netflix and the Cloud
With more than 23 million streaming members in the
United States, Canada, Latin America, the United
Kingdom and Ireland, Netflix, Inc. is the world's
leading internet subscription service for enjoying
movies and TV series.
Netflix.com is almost 100% deployed on the Amazon
Cloud
Source: http://ir.netflix.com
4. Out-Growing Data Center
http://techblog.netflix.com/2011/02/redesigning-netflix-api.html
[Chart: 37x growth from Jan 2010 to Jan 2011, outstripping datacenter capacity]
5. Netflix Deployed on AWS
[Architecture diagram: service stacks for Content, Logs, Play, WWW, API and CS – spanning video masters, S3, EC2, EMR/Hadoop, Hive, business intelligence, CDNs, DRM, CDN routing, bookmarks, logging, sign-up, search, movie/TV choosing, ratings, metadata, device config & actions, diagnostics, social/Facebook, international CS lookup, customer call log, and CS analytics]
8. Distributed Key-Value Store vs Central
SQL Database
• The datacenter had a central SQL database and DBA
• Schema changes required downtime
• The cloud, in contrast, has many key-value data
stores
– Joins take place in Java code
– No schema to change, no scheduled downtime
9. Goals for moving from Netflix DC to
the Cloud
• Faster
– Lower latency than the equivalent datacenter web services
• Scalable
– Avoid needing any more datacenter capacity as
subscriber count increases
• Available
– Substantially higher robustness and availability than
datacenter services
• Productive
– Optimize agility of a large development team with
automation and tools
10. Cassandra
• Faster
– Low latency, low latency variance
• Scalable
– Supports running on Amazon EC2
– High and scalable read and write throughput
– Support for Multi-region clusters
• Available
– We value Availability over Consistency; Cassandra is Eventually
Consistent
– Supports Amazon Availability Zones
– Data integrity checks and repairs
– Online Snapshot Backup, Restore/Rollback
• Productive
– We want FOSS + Support
12. Netflix Cassandra Use Cases
• Many different profiles
– Read heavy environments with a strict SLA
– Batch Write environments (70 rows per batch)
also serving low latency Reads
– Read Modify Write environments with large rows
– Write only environments with rapidly increasing
data sets
– And many more…
13. How much we use Cassandra
30 – Number of production clusters
12 – Number of multi-region clusters
3 – Max regions for one cluster
65 – Total TB of data across all clusters
472 – Number of Cassandra nodes
72/28 – Largest Cassandra cluster (nodes / data in TB)
6k/250k – Max reads / writes per second
14. Deployment Architecture
[Diagram: on AWS EC2, a Front End Load Balancer feeds an API Proxy, then a Load Balancer, then the API and Component Services, all registered with a Discovery Service; the data tier pairs memcached with Cassandra on EC2 internal disks, backed up to S3]
15. High Availability Deployment
• Fact of life – EC2 instances die
• We store 3 local replicas in 3 different Cassandra nodes
– One replica per EC2 Availability Zone (AZ)
• Minimum Cluster configuration is 6 nodes, 2 per AZ
– Single instance failure still leaves at least one node in each AZ
• Use Local Quorum for writes
• Use Consistency Level One for reads
• Entire cluster replicated in Multi-region deployments
• AWS Availability Zones
– Separate buildings
– Separate power etc.
– Fairly close together
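The availability/consistency trade-off in this deployment comes down to simple replica arithmetic. The sketch below (illustrative, not Netflix code) shows why Local Quorum writes with RF = 3 survive one lost replica, and why ONE-level reads make the system eventually rather than strongly consistent:

```python
def quorum(rf: int) -> int:
    """Replica acks required for a quorum write."""
    return rf // 2 + 1

rf = 3                    # three replicas, one per Availability Zone
write_acks = quorum(rf)   # LOCAL_QUORUM -> 2 of 3
read_acks = 1             # reads at consistency level ONE

# Writes survive the loss of one replica (one node, or one AZ's copy):
print(rf - write_acks)    # -> 1

# R + W <= RF, so a read may miss the latest write: eventual consistency,
# matching "we value Availability over Consistency".
print(read_acks + write_acks <= rf)   # -> True
```

With a minimum cluster of 6 nodes (2 per AZ), a single instance failure still leaves a full replica in every zone, so quorum writes continue uninterrupted.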
16. Astyanax - Cassandra Write Data Flows
Single Region, Multiple Availability Zone, Token Aware
• Token-aware clients write to Cassandra nodes across Zones A, B and C
• Nodes return an ack to the client; requests can choose to wait for one node, a quorum, or all nodes to ack the write
• Data is written to the internal commit log disks (no more than 10 seconds later)
• If a node goes offline, hinted handoff completes the write when the node comes back up
• SSTable disk writes and compactions occur asynchronously
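A token-aware client saves a network hop by hashing the row key itself and sending the request straight to a replica. A toy ring sketch of that idea follows (illustrative only: real Cassandra partitioners and NetworkTopologyStrategy handle token hashing and AZ-aware placement, which this simple successor-based version ignores):

```python
import bisect
import hashlib

class TokenRing:
    """Toy token ring: each node owns a token; a key's replicas are the
    owner of the next token clockwise plus its RF-1 successors."""

    def __init__(self, nodes, rf=3):
        self.rf = rf
        self.ring = sorted((self._token(n), n) for n in nodes)

    @staticmethod
    def _token(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def replicas(self, key: str):
        tokens = [t for t, _ in self.ring]
        # First node whose token follows the key's token, wrapping around
        i = bisect.bisect_right(tokens, self._token(key)) % len(self.ring)
        return [self.ring[(i + k) % len(self.ring)][1] for k in range(self.rf)]

# Six nodes, two per zone (node names invented for the example)
ring = TokenRing(["a1", "a2", "b1", "b2", "c1", "c2"])
print(ring.replicas("customer:12345"))   # three distinct nodes for RF=3
```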
17. Extending to Multi-Region
In production for UK/Ireland support
• Create cluster in EU
• Backup US cluster to S3
• Restore backup in EU
• Local repair of the EU cluster
• Global repair/join
[Diagram: US and EU clusters, each with Cassandra nodes (with local disks) in Zones A, B and C serving local clients, joined through the S3 backup]
18. Data Flows for Multi-Region Writes
Token Aware, Consistency Level = Local Quorum
• Client writes to local replicas
• Local write acks are returned to the client, which continues once 2 of 3 local nodes are committed
• The local coordinator writes to the remote coordinator
• When the data arrives, the remote coordinator node acks
• The remote coordinator sends the data to the other remote zones
• Remote nodes ack to the local coordinator
• Data is flushed to the internal commit log disks (no more than 10 seconds later)
• If a node or region goes offline, hinted handoff completes the write when it comes back up; nightly global compare-and-repair jobs ensure everything stays consistent
[Diagram: local (US) and remote (EU) clusters with Cassandra nodes in Zones A, B and C]
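The hinted-handoff behavior described above can be modeled as a coordinator that queues writes for unreachable replicas and replays them on recovery. This is a toy model with invented names, not Cassandra's implementation:

```python
from collections import defaultdict

class Coordinator:
    """Toy coordinator: applies a write to every reachable replica and
    stores a hint for each unreachable one, to be replayed on recovery."""

    def __init__(self):
        self.hints = defaultdict(list)   # node -> [(key, value), ...]

    def write(self, replicas, up_nodes, key, value, stores):
        acks = 0
        for node in replicas:
            if node in up_nodes:
                stores[node][key] = value
                acks += 1
            else:
                self.hints[node].append((key, value))
        return acks   # caller compares this against its consistency level

    def node_recovered(self, node, stores):
        # Replay every queued hint onto the recovered replica
        for key, value in self.hints.pop(node, []):
            stores[node][key] = value

stores = {"a": {}, "b": {}, "c": {}}
coord = Coordinator()
acks = coord.write(["a", "b", "c"], {"a", "b"}, "row1", "v1", stores)
print(acks)            # -> 2: enough for a Local Quorum even with "c" down
coord.node_recovered("c", stores)
print(stores["c"])     # -> {'row1': 'v1'}
```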
19. Priam – Cassandra Automation
Available at http://github.com/netflix
• Open Source Tomcat Code running as a sidecar on
each Cassandra node. Deployed as a separate rpm
• Zero touch auto-configuration
• Token allocation and assignment, including multi-region
• Broken node replacement and ring expansion
• Full and incremental backup/restore to/from S3
• Metrics collection and forwarding via JMX
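Priam's token allocation can be illustrated by spacing tokens evenly around the ring, with a small per-region offset so two regions sharing one ring never collide. This is an illustrative sketch, not Priam's actual algorithm:

```python
RING_SIZE = 2 ** 127   # RandomPartitioner token space

def assign_tokens(num_nodes: int, region_offset: int = 0):
    """Evenly spaced tokens for a ring of num_nodes; a per-region offset
    keeps a second region's tokens from colliding with the first's."""
    step = RING_SIZE // num_nodes
    return [(i * step + region_offset) % RING_SIZE for i in range(num_nodes)]

us_tokens = assign_tokens(6)                    # first token 0, then evenly spaced
eu_tokens = assign_tokens(6, region_offset=1)   # interleaved, offset by 1
print(set(us_tokens) & set(eu_tokens))          # -> set(): no collisions
```

Even spacing keeps each node's share of the key space equal, which is what lets Priam replace a broken node or expand the ring without manual token bookkeeping.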
20. Cassandra Backup/Restore
• Full Backup
– Time-based snapshot
– SSTables compressed and copied to S3
• Incremental
– Each SSTable write triggers a compressed copy to S3
• Archive
– Copy cross-region
• Restore
– Full restore, or create a new ring from a backup
[Diagram: a Cassandra ring backed up to S3]
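Because SSTables are immutable once written, the incremental path reduces to: compress the new file's bytes and hand them to an uploader. A minimal sketch, with the uploader as a stand-in for an S3 client (names invented):

```python
import gzip

def backup_sstable(sstable_bytes: bytes, upload) -> int:
    """Compress an immutable SSTable and pass it to `upload` (a stand-in
    for an S3 put); returns the compressed size."""
    compressed = gzip.compress(sstable_bytes)
    upload(compressed)
    return len(compressed)

uploaded = []
size = backup_sstable(b"row1:v1\n" * 1000, uploaded.append)
print(size < 8000)   # -> True: repetitive SSTable data compresses well
# Lossless: decompressing the uploaded object recovers the SSTable exactly
assert gzip.decompress(uploaded[0]) == b"row1:v1\n" * 1000
```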
22. Consoles, Monitors and Explorers
• Netflix Application Console (NAC)
– Primary AWS provisioning/config interface
• EPIC Counters
– Primary method for issue resolution
• Dashboards
• Cassandra Explorer
– Browse clusters, keyspaces, column families
• AWS Usage Analyzer
– Breaks down costs by application and resource
23. Cloud Deployment Model
[Diagram: an Elastic Load Balancer fronts an Auto Scaling Group of Instances; each instance is defined by a Security Group, a Launch Configuration and an Amazon Machine Image]
24. NAC
• Netflix Application Console (NAC) is Netflix’s primary
tool in both Prod and Test to:
• Create and Destroy Applications
• Create and Destroy Auto Scaling Groups (ASGs)
• Scale Instances up and down within an ASG and manage auto-scaling
• Manage launch configs and AMIs
• http://www.slideshare.net/joesondow
27. Cassandra Explorer
• Kiosk mode – no alerting
• High-level cluster status (thrift, gossip)
• Warns on a small set of metrics
28. Epic
• Netflix-wide monitoring and alerting tool based on RRD
• Priam sends all JMX data to Epic
• Very useful for finding specific issues
29. Dashboards
• Next level cluster details
• Throughput
• Latency, Gossip status, Maintenance operations
• Trouble indicators
• Useful for finding anomalies
• Most investigations start here
30. Things we monitor
• Cassandra
– Throughput, Latency, Compactions, Repairs
– Pending threads, Dropped operations
– Backup failures
– Recent restarts
– Schema changes
• System
– Disk space, Disk throughput, Load average
• Errors and exceptions in Cassandra, System and
Tomcat log files
31. Cassandra AWS Pain Points
• Compactions cause spikes, esp. on read-heavy systems
– Affects clients (hector, astyanax)
– Throttling in newer Cassandra versions helps
• Repairs are toxic to performance
• Disk performance on Cloud instances and its impact on
SSTable count
• Memory requirements due to filesystem cache
• Compression unusable in our environment
• Multi-tenancy performance unpredictable
• Java Heap size and OOM issues
32. Lessons learned
• In EC2, it is best to choose instances that are not multi-tenant
• Better to compact on our terms and not Cassandra’s.
Take nodes out of service for major compactions
• Size memtable flushes for optimizing compactions
– Helps when writes are uniformly distributed, easier to
determine flush patterns
– Best to optimize flushes based on memtable size, not time
– Makes minor compactions smoother
33. Lessons Learned (cont)
• Key and row caches
– Left unbounded, they can chew up JVM memory needed for
normal work
– Latencies will spike as the JVM has to fight for memory
– The off-heap row cache still maintains data structures on-heap
• mmap() as in-memory cache
– When the process is terminated, mmap pages are added to
the free list
34. Lessons Learned (cont)
• Sharding
– If a single row has many gets/mutates, the nodes
holding it will become hot spots
– If a row grows too large, it won’t fit into memory
• Problem for reads, compactions, and repairs
• Some of our indices ran afoul of this
• For more info see Jason Brown’s slides, Cassandra from the Trenches: slideshare.net/netflix
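A common fix for the hot or oversized rows described above is to split one logical row across several physical rows by suffixing the key with a shard index, spreading its load across the ring. The slides do not prescribe a specific scheme; this is one illustrative approach with invented names:

```python
import hashlib

NUM_SHARDS = 16

def shard_key(row_key: str, column_name: str) -> str:
    """Deterministically route a column to one of NUM_SHARDS physical
    rows, so a hot logical row's traffic spreads over many nodes."""
    digest = hashlib.md5(column_name.encode()).hexdigest()
    shard = int(digest, 16) % NUM_SHARDS
    return f"{row_key}:{shard}"

# Writes for one logical row now land on up to 16 physical rows;
# reading the whole logical row fans out across those shards.
keys = {shard_key("popular_index", f"col{i}") for i in range(1000)}
print(len(keys))   # at most 16
```

The trade-off is that full-row reads become multi-get fan-outs, but each physical row stays small enough to cache, compact, and repair.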
36. Scalability Testing
• Cloud Based Testing – frictionless, elastic
– Create/destroy any sized cluster in minutes
– Many test scenarios run in parallel
• Test Scenarios
– Internal app specific tests
– Simple “stress” tool provided with Cassandra
• Scale test, keep making the cluster bigger
– Check that tooling and automation works…
– How many ten column row writes/sec can we do?
43. Open Source @ Netflix
• Source at http://netflix.github.com
• Binaries at Maven: https://issues.sonatype.org/browse/OSSRH-2116
44. Cassandra JMeter Plugin
• Netflix uses JMeter across the fleet for
load testing
• JMeter plugin provides a wide range of
samplers for Get, Put, Delete and
Schema Creation
• Used extensively to load data, Cassandra
stress tests, feature testing etc.
• Described at https://github.com/Netflix/CassJMeter/wiki
45. Astyanax
Available at http://github.com/netflix
• Cassandra java client
• API abstraction on top of Thrift protocol
• “Fixed” Connection Pool abstraction (vs. Hector)
– Round robin with Failover
– Retry-able operations not tied to a connection
– Netflix PaaS Discovery service integration
– Host reconnect (fixed interval or exponential backoff)
– Token aware to save a network hop – lower latency
– Latency aware to avoid compacting/repairing nodes – lower
variance
• Simplified use of serializers via method overloading (vs.
Hector)
• ConnectionPoolMonitor interface
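The host-reconnect policy listed above (fixed interval or exponential backoff) can be sketched as follows; the parameters are illustrative, not Astyanax's actual defaults:

```python
def reconnect_delays(base_ms: int = 250, max_ms: int = 10_000, attempts: int = 8):
    """Exponential backoff: double the delay after each failed reconnect,
    capped at max_ms so a long outage doesn't grow the wait unboundedly."""
    return [min(max_ms, base_ms * (2 ** i)) for i in range(attempts)]

print(reconnect_delays())
# -> [250, 500, 1000, 2000, 4000, 8000, 10000, 10000]
```

The cap matters in a ring this size: without it, a node down for an hour would face multi-minute waits before its clients even attempt to reconnect.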
Editor's notes
Compaction is continually happening, after only a few seconds. Logs show a Memtable flush every 10–15 seconds and a minor compaction every 30 seconds or so. You can also see this in the iostat data: there are both reads and writes going to disk, and the majority are writes.
Stress command line:
java -jar stress.jar -d "144 node ids" -e ONE -n 27000000 -l 3 -i 1 -t 200 -p 7102 -o INSERT -c 10 -r
So it writes 10 columns per row, with the key id randomly chosen from 27 million ids. Thirty clients talk to the first 144 nodes and 30 talk to the second 144. For the insert we write three replicas, as specified in the keyspace:
Keyspace: Keyspace1
  Replication Strategy: org.apache.cassandra.locator.NetworkTopologyStrategy
  Durable Writes: true
  Options: [us-east:3]
  Column Families:
    ColumnFamily: Standard1
      Key Validation Class: org.apache.cassandra.db.marshal.BytesType
      Default column value validator: org.apache.cassandra.db.marshal.BytesType
      Columns sorted by: org.apache.cassandra.db.marshal.BytesType
      Row cache size / save period in seconds: 0.0/0
      Key cache size / save period in seconds: 200000.0/14400
      Memtable thresholds: 1.7671875/1440/128 (millions of ops/minutes/MB)
      GC grace seconds: 864000
      Compaction min/max thresholds: 4/32
      Read repair chance: 0.0
      Replicate on write: true
Cross-AZ traffic calculation – per node average 23640 + 19770 = 43410 KB/s; 288 nodes × 3600 s = 45007 GB/hour; two-thirds of that = 30000 GB/hour; at $0.01/GB that is $300/hour, so a 10-minute test run = 300/6 = $50.
Slide error: the test driver was m2.4xl, not m1.xl. Test driver TX 250 Mbit/s = 31 MBytes/s, RX 35 Mbit/s = 4.3 MBytes/s. 60 × 35 MB/s × 3600 = 7.5 TB/hr. Each write is about 400 bytes on disk.
Creating the tests is heavily dependent on AWS and the fact that we can only launch 96 instances at a time.
Looking at the AWS, Linux and Cassandra logs for the 288-way run: I kicked off the first ASG from 0→96 at 00:08:02; 27 seconds later the first Linux box was booted, at Sat Oct 22 00:08:29 UTC 2011. The last Linux box (number 288) booted at Sat Oct 22 01:09:51 UTC 2011, and the last Cassandra node came online at 01:12:41, about 3 minutes later. So just over an hour to get this bad boy up. Most of the time was waiting for the nodes to join the cluster; I waited for all 96 to join before starting the next AZ. AWS claims it took 4 minutes and 40 seconds to launch the 96 instances, which is pretty consistent across the 3 AZs, so about 15–16 minutes of the hour was AWS. It seems to take about 1 minute 30 seconds to boot a Linux instance.
Note that launching the 96 saw failures/retries in all 3 AZs: us-east-1a had 9 failures, us-east-1c had 1 failure, and us-east-1d also had 9 failures.
Run times varied a bit, mostly based on how long I could sustain the load with the number of clients writing 27 million records. In Cassandra stress you cannot specify an elapsed time, just a total number of transactions; the load also decays irregularly as threads terminate. 48-way sustained load for 570 seconds; 96-way for 550 seconds; 144-way for 660 seconds; 288-way for 780 seconds.
Complete connection pool abstraction. Queries and mutations are wrapped in objects created by the Keyspace implementation, making it possible to retry failed operations. This differs from other connection pool implementations, where the operation is created on a specific connection and must be completely redone if it fails.
Simplified serialization via method overloading. The low-level thrift library only understands data serialized to a byte array. Hector requires serializers to be specified for nearly every call; Astyanax minimizes the places where serializers are specified by using predefined ColumnFamily and ColumnPath definitions which carry the serializers. The API also overloads set and get operations for common data types.
The internal library does not log anything. All internal events are instead calls to a ConnectionPoolMonitor interface. This allows customization of log levels and filtering of repeating events outside the scope of the connection pool.
Super columns will soon be replaced by composite column names, so it is recommended not to use super columns at all and to use composite column names instead. There is some support for super columns in Astyanax, but those methods have been deprecated and will eventually be removed.