Netflix Cassandra Architecture at Scale
1. Cassandra in the Netflix Architecture
CassandraEU London March 28th, 2012
Denis Sheahan
2. Agenda
• Netflix and the Cloud
• Why Cassandra
• Cassandra Deployment @ Netflix
• Cassandra Operations
• Cassandra lessons learned
• Scalability
• Open Source
3. Netflix and the Cloud
With more than 23 million streaming members in the
United States, Canada, Latin America, the United
Kingdom and Ireland, Netflix, Inc. is the world's
leading internet subscription service for enjoying
movies and TV series.
Netflix.com is almost 100% deployed on the Amazon
Cloud
Source: http://ir.netflix.com
4. Out-Growing Data Center
http://techblog.netflix.com/2011/02/redesigning-netflix-api.html
[Chart: 37x growth from Jan 2010 to Jan 2011, outstripping datacenter capacity]
5. Netflix Deployed on AWS
[Architecture diagram: service stacks for Content, Logs, Play, WWW, API and CS – spanning video masters, S3, EC2, EMR/Hadoop, Hive, business intelligence, CDNs, DRM, CDN routing, bookmarks, logging, sign-up, search, movie/TV choosing, ratings, metadata, device config & actions, diagnostics, social/Facebook, international CS lookup, customer call log, and CS analytics]
8. Distributed Key-Value Store vs Central
SQL Database
• The datacenter had a central SQL database and DBA
• Schema changes required downtime
• The cloud, in contrast, has many key-value data
stores
– Joins take place in Java code
– No schema to change, no scheduled downtime
9. Goals for moving from Netflix DC to
the Cloud
• Faster
– Lower latency than the equivalent datacenter web services
• Scalable
– Avoid needing any more datacenter capacity as
subscriber count increases
• Available
– Substantially higher robustness and availability than
datacenter services
• Productive
– Optimize agility of a large development team with
automation and tools
10. Cassandra
• Faster
– Low latency, low latency variance
• Scalable
– Supports running on Amazon EC2
– High and scalable read and write throughput
– Support for Multi-region clusters
• Available
– We value Availability over Consistency; Cassandra is Eventually
Consistent
– Supports Amazon Availability Zones
– Data integrity checks and repairs
– Online Snapshot Backup, Restore/Rollback
• Productive
– We want FOSS + Support
12. Netflix Cassandra Use Cases
• Many different profiles
– Read heavy environments with a strict SLA
– Batch Write environments (70 rows per batch)
also serving low latency Reads
– Read Modify Write environments with large rows
– Write only environments with rapidly increasing
data sets
– And many more…
13. How much we use Cassandra
30 – Number of production clusters
12 – Number of multi-region clusters
3 – Max regions for one cluster
65 – Total TB of data across all clusters
472 – Number of Cassandra nodes
72/28 – Largest Cassandra cluster (nodes / data in TB)
6k/250k – Max reads / writes per second
14. Deployment Architecture
[Diagram: on AWS EC2, a Front End Load Balancer feeds an API Proxy, then a Load Balancer, then the API and Component Services, all registered with a Discovery Service; the data tier pairs memcached with Cassandra on EC2 internal disks, backed up to S3]
15. High Availability Deployment
• Fact of life – EC2 instances die
• We store 3 local replicas in 3 different Cassandra nodes
– One replica per EC2 Availability Zone (AZ)
• Minimum Cluster configuration is 6 nodes, 2 per AZ
– Single instance failure still leaves at least one node in each AZ
• Use Local Quorum for writes
• Use Consistency Level One for reads
• Entire cluster replicated in Multi-region deployments
• AWS Availability Zones
– Separate buildings
– Separate power etc.
– Fairly close together
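The availability/consistency trade-off in this deployment comes down to simple replica arithmetic. The sketch below (illustrative, not Netflix code) shows why Local Quorum writes with RF = 3 survive one lost replica, and why ONE-level reads make the system eventually rather than strongly consistent:

```python
def quorum(rf: int) -> int:
    """Replica acks required for a quorum write."""
    return rf // 2 + 1

rf = 3                    # three replicas, one per Availability Zone
write_acks = quorum(rf)   # LOCAL_QUORUM -> 2 of 3
read_acks = 1             # reads at consistency level ONE

# Writes survive the loss of one replica (one node, or one AZ's copy):
print(rf - write_acks)    # -> 1

# R + W <= RF, so a read may miss the latest write: eventual consistency,
# matching "we value Availability over Consistency".
print(read_acks + write_acks <= rf)   # -> True
```

With a minimum cluster of 6 nodes (2 per AZ), a single instance failure still leaves a full replica in every zone, so quorum writes continue uninterrupted.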
16. Astyanax - Cassandra Write Data Flows
Single Region, Multiple Availability Zone, Token Aware
• Token-aware clients write to Cassandra nodes across Zones A, B and C
• Nodes return an ack to the client; requests can choose to wait for one node, a quorum, or all nodes to ack the write
• Data is written to the internal commit log disks (no more than 10 seconds later)
• If a node goes offline, hinted handoff completes the write when the node comes back up
• SSTable disk writes and compactions occur asynchronously
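A token-aware client saves a network hop by hashing the row key itself and sending the request straight to a replica. A toy ring sketch of that idea follows (illustrative only: real Cassandra partitioners and NetworkTopologyStrategy handle token hashing and AZ-aware placement, which this simple successor-based version ignores):

```python
import bisect
import hashlib

class TokenRing:
    """Toy token ring: each node owns a token; a key's replicas are the
    owner of the next token clockwise plus its RF-1 successors."""

    def __init__(self, nodes, rf=3):
        self.rf = rf
        self.ring = sorted((self._token(n), n) for n in nodes)

    @staticmethod
    def _token(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def replicas(self, key: str):
        tokens = [t for t, _ in self.ring]
        # First node whose token follows the key's token, wrapping around
        i = bisect.bisect_right(tokens, self._token(key)) % len(self.ring)
        return [self.ring[(i + k) % len(self.ring)][1] for k in range(self.rf)]

# Six nodes, two per zone (node names invented for the example)
ring = TokenRing(["a1", "a2", "b1", "b2", "c1", "c2"])
print(ring.replicas("customer:12345"))   # three distinct nodes for RF=3
```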
17. Extending to Multi-Region
In production for UK/Ireland support
• Create cluster in EU
• Backup US cluster to S3
• Restore backup in EU
• Local repair of the EU cluster
• Global repair/join
[Diagram: US and EU clusters, each with Cassandra nodes (with local disks) in Zones A, B and C serving local clients, joined through the S3 backup]
18. Data Flows for Multi-Region Writes
Token Aware, Consistency Level = Local Quorum
• Client writes to local replicas
• Local write acks are returned to the client, which continues once 2 of 3 local nodes are committed
• The local coordinator writes to the remote coordinator
• When the data arrives, the remote coordinator node acks
• The remote coordinator sends the data to the other remote zones
• Remote nodes ack to the local coordinator
• Data is flushed to the internal commit log disks (no more than 10 seconds later)
• If a node or region goes offline, hinted handoff completes the write when it comes back up; nightly global compare-and-repair jobs ensure everything stays consistent
[Diagram: local (US) and remote (EU) clusters with Cassandra nodes in Zones A, B and C]
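The hinted-handoff behavior described above can be modeled as a coordinator that queues writes for unreachable replicas and replays them on recovery. This is a toy model with invented names, not Cassandra's implementation:

```python
from collections import defaultdict

class Coordinator:
    """Toy coordinator: applies a write to every reachable replica and
    stores a hint for each unreachable one, to be replayed on recovery."""

    def __init__(self):
        self.hints = defaultdict(list)   # node -> [(key, value), ...]

    def write(self, replicas, up_nodes, key, value, stores):
        acks = 0
        for node in replicas:
            if node in up_nodes:
                stores[node][key] = value
                acks += 1
            else:
                self.hints[node].append((key, value))
        return acks   # caller compares this against its consistency level

    def node_recovered(self, node, stores):
        # Replay every queued hint onto the recovered replica
        for key, value in self.hints.pop(node, []):
            stores[node][key] = value

stores = {"a": {}, "b": {}, "c": {}}
coord = Coordinator()
acks = coord.write(["a", "b", "c"], {"a", "b"}, "row1", "v1", stores)
print(acks)            # -> 2: enough for a Local Quorum even with "c" down
coord.node_recovered("c", stores)
print(stores["c"])     # -> {'row1': 'v1'}
```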
19. Priam – Cassandra Automation
Available at http://github.com/netflix
• Open Source Tomcat Code running as a sidecar on
each Cassandra node. Deployed as a separate rpm
• Zero touch auto-configuration
• Token allocation and assignment, including multi-region
• Broken node replacement and ring expansion
• Full and incremental backup/restore to/from S3
• Metrics collection and forwarding via JMX
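Priam's token allocation can be illustrated by spacing tokens evenly around the ring, with a small per-region offset so two regions sharing one ring never collide. This is an illustrative sketch, not Priam's actual algorithm:

```python
RING_SIZE = 2 ** 127   # RandomPartitioner token space

def assign_tokens(num_nodes: int, region_offset: int = 0):
    """Evenly spaced tokens for a ring of num_nodes; a per-region offset
    keeps a second region's tokens from colliding with the first's."""
    step = RING_SIZE // num_nodes
    return [(i * step + region_offset) % RING_SIZE for i in range(num_nodes)]

us_tokens = assign_tokens(6)                    # first token 0, then evenly spaced
eu_tokens = assign_tokens(6, region_offset=1)   # interleaved, offset by 1
print(set(us_tokens) & set(eu_tokens))          # -> set(): no collisions
```

Even spacing keeps each node's share of the key space equal, which is what lets Priam replace a broken node or expand the ring without manual token bookkeeping.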
20. Cassandra Backup/Restore
• Full Backup
– Time-based snapshot
– SSTables compressed and copied to S3
• Incremental
– Each SSTable write triggers a compressed copy to S3
• Archive
– Copy cross-region
• Restore
– Full restore, or create a new ring from a backup
[Diagram: a Cassandra ring backed up to S3]
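Because SSTables are immutable once written, the incremental path reduces to: compress the new file's bytes and hand them to an uploader. A minimal sketch, with the uploader as a stand-in for an S3 client (names invented):

```python
import gzip

def backup_sstable(sstable_bytes: bytes, upload) -> int:
    """Compress an immutable SSTable and pass it to `upload` (a stand-in
    for an S3 put); returns the compressed size."""
    compressed = gzip.compress(sstable_bytes)
    upload(compressed)
    return len(compressed)

uploaded = []
size = backup_sstable(b"row1:v1\n" * 1000, uploaded.append)
print(size < 8000)   # -> True: repetitive SSTable data compresses well
# Lossless: decompressing the uploaded object recovers the SSTable exactly
assert gzip.decompress(uploaded[0]) == b"row1:v1\n" * 1000
```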
22. Consoles, Monitors and Explorers
• Netflix Application Console (NAC)
– Primary AWS provisioning/config interface
• EPIC Counters
– Primary method for issue resolution
• Dashboards
• Cassandra Explorer
– Browse clusters, keyspaces, column families
• AWS Usage Analyzer
– Breaks down costs by application and resource
23. Cloud Deployment Model
[Diagram: an Elastic Load Balancer fronts an Auto Scaling Group of Instances; each instance is defined by a Security Group, a Launch Configuration and an Amazon Machine Image]
24. NAC
• Netflix Application Console (NAC) is Netflix’s primary
tool in both Prod and Test to:
• Create and Destroy Applications
• Create and Destroy Auto Scaling Groups (ASGs)
• Scale Instances up and down within an ASG and manage auto-scaling
• Manage launch configs and AMIs
• http://www.slideshare.net/joesondow
27. Cassandra Explorer
• Kiosk mode – no alerting
• High-level cluster status (thrift, gossip)
• Warns on a small set of metrics
28. Epic
• Netflix-wide monitoring and alerting tool based on RRD
• Priam sends all JMX data to Epic
• Very useful for finding specific issues
29. Dashboards
• Next level cluster details
• Throughput
• Latency, Gossip status, Maintenance operations
• Trouble indicators
• Useful for finding anomalies
• Most investigations start here
30. Things we monitor
• Cassandra
– Throughput, Latency, Compactions, Repairs
– Pending threads, Dropped operations
– Backup failures
– Recent restarts
– Schema changes
• System
– Disk space, Disk throughput, Load average
• Errors and exceptions in Cassandra, System and
Tomcat log files
31. Cassandra AWS Pain Points
• Compactions cause spikes, esp. on read-heavy systems
– Affects clients (hector, astyanax)
– Throttling in newer Cassandra versions helps
• Repairs are toxic to performance
• Disk performance on Cloud instances and its impact on
SSTable count
• Memory requirements due to filesystem cache
• Compression unusable in our environment
• Multi-tenancy performance unpredictable
• Java Heap size and OOM issues
32. Lessons learned
• In EC2, it is best to choose instances that are not multi-tenant
• Better to compact on our terms and not Cassandra’s.
Take nodes out of service for major compactions
• Size memtable flushes for optimizing compactions
– Helps when writes are uniformly distributed, easier to
determine flush patterns
– Best to optimize flushes based on memtable size, not time
– Makes minor compactions smoother
33. Lessons Learned (cont)
• Key and row caches
– Left unbounded, they can chew up JVM memory needed for
normal work
– Latencies will spike as the JVM has to fight for memory
– The off-heap row cache still maintains data structures on-heap
• mmap() as in-memory cache
– When the process is terminated, mmap pages are added to
the free list
34. Lessons Learned (cont)
• Sharding
– If a single row has many gets/mutates, the nodes
holding it will become hot spots
– If a row grows too large, it won’t fit into memory
• Problem for reads, compactions, and repairs
• Some of our indices ran afoul of this
• For more info see Jason Brown’s slides, Cassandra from the Trenches: slideshare.net/netflix
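A common fix for the hot or oversized rows described above is to split one logical row across several physical rows by suffixing the key with a shard index, spreading its load across the ring. The slides do not prescribe a specific scheme; this is one illustrative approach with invented names:

```python
import hashlib

NUM_SHARDS = 16

def shard_key(row_key: str, column_name: str) -> str:
    """Deterministically route a column to one of NUM_SHARDS physical
    rows, so a hot logical row's traffic spreads over many nodes."""
    digest = hashlib.md5(column_name.encode()).hexdigest()
    shard = int(digest, 16) % NUM_SHARDS
    return f"{row_key}:{shard}"

# Writes for one logical row now land on up to 16 physical rows;
# reading the whole logical row fans out across those shards.
keys = {shard_key("popular_index", f"col{i}") for i in range(1000)}
print(len(keys))   # at most 16
```

The trade-off is that full-row reads become multi-get fan-outs, but each physical row stays small enough to cache, compact, and repair.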
36. Scalability Testing
• Cloud Based Testing – frictionless, elastic
– Create/destroy any sized cluster in minutes
– Many test scenarios run in parallel
• Test Scenarios
– Internal app specific tests
– Simple “stress” tool provided with Cassandra
• Scale test, keep making the cluster bigger
– Check that tooling and automation works…
– How many ten column row writes/sec can we do?
43. Open Source @ Netflix
• Source at http://netflix.github.com
• Binaries at Maven: https://issues.sonatype.org/browse/OSSRH-2116
44. Cassandra JMeter Plugin
• Netflix uses JMeter across the fleet for
load testing
• JMeter plugin provides a wide range of
samplers for Get, Put, Delete and
Schema Creation
• Used extensively to load data, Cassandra
stress tests, feature testing etc.
• Described at https://github.com/Netflix/CassJMeter/wiki
45. Astyanax
Available at http://github.com/netflix
• Cassandra java client
• API abstraction on top of Thrift protocol
• “Fixed” Connection Pool abstraction (vs. Hector)
– Round robin with Failover
– Retry-able operations not tied to a connection
– Netflix PaaS Discovery service integration
– Host reconnect (fixed interval or exponential backoff)
– Token aware to save a network hop – lower latency
– Latency aware to avoid compacting/repairing nodes – lower
variance
• Simplified use of serializers via method overloading (vs.
Hector)
• ConnectionPoolMonitor interface
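The host-reconnect policy listed above (fixed interval or exponential backoff) can be sketched as follows; the parameters are illustrative, not Astyanax's actual defaults:

```python
def reconnect_delays(base_ms: int = 250, max_ms: int = 10_000, attempts: int = 8):
    """Exponential backoff: double the delay after each failed reconnect,
    capped at max_ms so a long outage doesn't grow the wait unboundedly."""
    return [min(max_ms, base_ms * (2 ** i)) for i in range(attempts)]

print(reconnect_delays())
# -> [250, 500, 1000, 2000, 4000, 8000, 10000, 10000]
```

The cap matters in a ring this size: without it, a node down for an hour would face multi-minute waits before its clients even attempt to reconnect.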
Editor's notes
Compaction is continually happening, after only a few seconds. Logs show a Memtable flush every 10–15 seconds and a minor compaction every 30 seconds or so. You can also see this in the iostat data: there are both reads and writes going to disk, and the majority are writes.
Stress command line:
java -jar stress.jar -d "144 node ids" -e ONE -n 27000000 -l 3 -i 1 -t 200 -p 7102 -o INSERT -c 10 -r
So it writes 10 columns per row, with the key id randomly chosen from 27 million ids. Thirty clients talk to the first 144 nodes and 30 talk to the second 144. For the insert we write three replicas, as specified in the keyspace:
Keyspace: Keyspace1
  Replication Strategy: org.apache.cassandra.locator.NetworkTopologyStrategy
  Durable Writes: true
  Options: [us-east:3]
  Column Families:
    ColumnFamily: Standard1
      Key Validation Class: org.apache.cassandra.db.marshal.BytesType
      Default column value validator: org.apache.cassandra.db.marshal.BytesType
      Columns sorted by: org.apache.cassandra.db.marshal.BytesType
      Row cache size / save period in seconds: 0.0/0
      Key cache size / save period in seconds: 200000.0/14400
      Memtable thresholds: 1.7671875/1440/128 (millions of ops/minutes/MB)
      GC grace seconds: 864000
      Compaction min/max thresholds: 4/32
      Read repair chance: 0.0
      Replicate on write: true
Cross-AZ traffic calculation – per node average 23640 + 19770 = 43410 KB/s; 288 nodes × 3600 s = 45007 GB/hour; two-thirds of that = 30000 GB/hour; at $0.01/GB that is $300/hour, so a 10-minute test run = 300/6 = $50.
Slide error: the test driver was m2.4xl, not m1.xl. Test driver TX 250 Mbit/s = 31 MBytes/s, RX 35 Mbit/s = 4.3 MBytes/s. 60 × 35 MB/s × 3600 = 7.5 TB/hr. Each write is about 400 bytes on disk.
Creating the tests is heavily dependent on AWS and the fact that we can only launch 96 instances at a time.
Looking at the AWS, Linux and Cassandra logs for the 288-way run: I kicked off the first ASG from 0→96 at 00:08:02; 27 seconds later the first Linux box was booted, at Sat Oct 22 00:08:29 UTC 2011. The last Linux box (number 288) booted at Sat Oct 22 01:09:51 UTC 2011, and the last Cassandra node came online at 01:12:41, about 3 minutes later. So just over an hour to get this bad boy up. Most of the time was waiting for the nodes to join the cluster; I waited for all 96 to join before starting the next AZ. AWS claims it took 4 minutes and 40 seconds to launch the 96 instances, which is pretty consistent across the 3 AZs, so about 15–16 minutes of the hour was AWS. It seems to take about 1 minute 30 seconds to boot a Linux instance.
Note that launching the 96 saw failures/retries in all 3 AZs: us-east-1a had 9 failures, us-east-1c had 1 failure, and us-east-1d also had 9 failures.
Run times varied a bit, mostly based on how long I could sustain the load with the number of clients writing 27 million records. In Cassandra stress you cannot specify an elapsed time, just a total number of transactions; the load also decays irregularly as threads terminate. 48-way sustained load for 570 seconds; 96-way for 550 seconds; 144-way for 660 seconds; 288-way for 780 seconds.
Complete connection pool abstraction. Queries and mutations are wrapped in objects created by the Keyspace implementation, making it possible to retry failed operations. This differs from other connection pool implementations, where the operation is created on a specific connection and must be completely redone if it fails.
Simplified serialization via method overloading. The low-level thrift library only understands data serialized to a byte array. Hector requires serializers to be specified for nearly every call; Astyanax minimizes the places where serializers are specified by using predefined ColumnFamily and ColumnPath definitions which carry the serializers. The API also overloads set and get operations for common data types.
The internal library does not log anything. All internal events are instead calls to a ConnectionPoolMonitor interface. This allows customization of log levels and filtering of repeating events outside the scope of the connection pool.
Super columns will soon be replaced by composite column names, so it is recommended not to use super columns at all and to use composite column names instead. There is some support for super columns in Astyanax, but those methods have been deprecated and will eventually be removed.