WSO2CON 2024 - WSO2's Digital Transformation Journey with Choreo: A Platforml...
CassandraSummit2015_Cassandra upgrades at scale @ NETFLIX
1. C* Upgrades at Scale
Ajay Upadhyay & Vinay Chella
Cloud Data Architects
2. Who are We??
● Ajay Upadhyay
○ Cloud Data Architect @ Netflix
● Vinay Chella
○ Cloud Data Architect @ Netflix
○ https://www.linkedin.com/in/vinaykumarchella
● Cloud Database Engineering [CDE]
● Providing Cassandra, Dynomite, Elastic
Search and other data stores as a service
3. ● About Netflix
● Cassandra @ Netflix
● What is our C* scale ??
● What do we upgrade, Why?
● How does it impact applications and services
● What makes C* upgrades smoother??
● Online Upgrade process
● Infrastructure for upgrade
● Pappy framework
● Migration Hiccups/ Roadblocks
Agenda
4. 4
• Started as DVD rental
• Internet delivery/ streaming of TV shows and movies directly on TVs,
computers, and mobile devices in the United States and internationally
About Netflix
5. Cassandra @ Netflix
• 95% of streaming data is stored in Cassandra
• Data ranges from customer details to Viewing history to streaming
bookmarks
10. What do we upgrade, Why?
• Software
• Cassandra
• SSL Certificates
• Priam
• OS
• Other sidecars
• Hardware
• AWS instance types
11. What do we upgrade, Why?(Continue…)
Why new release? What’s wrong with
current release?
12. • Typically nothing wrong with
current release
• Pretty stable and clusters may
still run with current release
without any issues
• End of life cycle??
What do we upgrade, Why?(Continue…)
13. • Good to be on latest and stable
release
• Lots of great features
• Bug fixes and enhancements /
new features typically are being
added to the new release
What do we upgrade, Why?(Continue…)
14. Application and service level impact/change
● Before Migration
● During migration
● After migration
15. None*
*so long as application is using latest Netflix
OSS components (Astyanax/ karyon)
Application and service level impact/change
22. Simulate PROD
• Pappy framework
• Mirror production traffic
• Identical AWS instance types
How do we certify a binary for the upgrade
23. Typical usage patterns:
• Read Heavy
• Write Heavy
• STCS
• LCS
• Aggressive TTLs
• Variable Payloads
How do we certify a binary for the upgrade
24. Usage Pattern
C* 1.2
C* 1.2
=
Metrics Upgrade C* 2.0 Metrics
MetricsMetrics
=
How do we certify a binary for the upgrade
During Upgrade:
25. Summary
• Functional validation
• Performance testing
• Load test with routine
maintenance tasks
• Repair and Compaction
• Replacement and Restarts
• Key Metrics Used
• 95th and 99th Latencies
• Dropped Metrics on C*
• Exceptions on C*
• Heap Usage on C*
• Threads Pending on C*
How do we certify a binary for the upgrade
34. Online upgrade process (steps)
Upgrade:
● Running binary upgrade of Priam and C* on a node
○ Continue to make sure that the cluster is still healthy
○ Disable node in Eureka
○ Disable thrift, gossip and run nodetool drain
○ Stop Cassandra and Priam
○ Install new C* package (tarball)
○ Make sure the cluster is still healthy
■ Disable/Enable gossip if new node does not see all others vice-versa
● Make sure the cluster is still healthy before next step
35. Online upgrade process (steps)
Post-Upgrade:
1. Set/change parameters for KeySpaces and CFs (if any)
2. Run SSTABLE Upgrade on all nodes (STCS - parallel, LCS - Serial) (if applicable)
3. Enable Chaos Monkey
4. Make sure the cluster is still healthy before considering the upgrade done
44. What is Pappy??
• Pappy is a fictional character featured in Popeye cartoons and comics.
• CDE’s new performance Test harness framework
• Netflix homegrown: Well integrated with netflix OSS infrastructure
(Archaius, Servo, Eureka and Karyan)
• To be Open Sourced @ Netflix- OSS https://netflix.github.io/
45. High level features
• Perf tests for
• Various Instance types comparison
• Varying data models in terms of payload, shard and comparators
• Different driver(s) versions
• Forklifter
• Wide Row Finder
• Pluggable architecture