
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & Python

  1. Cassandra Meetup
  2. Monitoring C* Health at Scale (Jason Cacciatore, @jasoncac)
  3. How do we assess health? ● Node Level - dmesg errors - gossip status - thresholds (heap, disk usage, …) ● Cluster Level - ring aggregate (a health-check sketch follows the transcript)
  4. Scope ● Production + Test ● Hundreds of clusters ● Over 9,000 nodes
  5. Current Architecture
  6. So… what’s the problem? ● Cron-based monitoring is problematic ○ No state, just a snapshot ● A stream processor is a better fit (a sketch of the stateful idea follows the transcript)
  7. Mantis ● A reactive stream-processing system that runs on Apache Mesos ○ Cloud native ○ Provides a flexible functional programming model ○ Supports job autoscaling ○ Deep integration into the Netflix ecosystem ● Current scale ○ ~350 jobs ○ 8 million messages/sec processed ○ 250 Gb/sec data processed ● QCon NY 2015 talk on Netflix Mantis
  8. Modeled as Mantis Jobs
  9. Resources
  10. Real-time Dashboard
  11. How can I try this? ● Priam - JMXNodeTool ● Stream Processor - Spark
  12. THANK YOU
  13. C* Gossip: the good, the bad and the ugly (Minh Do, @timiblossom)
  14. What is the Gossip protocol, or Gossip? ● A peer-to-peer communication protocol in a distributed system ● Inspired by the form of gossip seen in human social networks ● Nodes spread information to whichever peers they can contact ● Used in C* primarily as a membership protocol and for information sharing
  15. Gossip flow in C* ● At start, the Gossiper loads seed addresses from the configuration file into the gossip list ● doShadowRound on seeds ● Every 1s, gossip to up to 3 nodes among the peers: a random live peer, a seed peer, and an unreachable peer (a toy sketch of this selection follows the transcript)
  16. C* Gossip round in 3 stages
  17. How does Gossip help C*? ● Discover cluster topology (DC, Rack) ● Discover token owners ● Figure out peer statuses: ○ moving ○ leaving/left ○ normal ○ down ○ bootstrapping ● Exchange schema version ● Share Load (used disk space) / Severity (CPU) ● Share release version / net version
  18. What does Gossip not do for C*? ● Detect crashes in the Thrift or native servers ● Manage the cluster (need Priam or OpsCenter) ● Collect performance metrics: latencies, RPS, JVM stats, network stats, etc. ● Give C* admins a good night's sleep.
  19. Gossip race conditions ● Most Gossip issues and bugs are caused by incorrect logic in handling race conditions ● The larger the cluster, the higher the chance of hitting a race condition ● Several C* components running in different threads can affect gossip status: ○ Gossiper ○ FailureDetector ○ Snitches ○ StorageService ○ InboundTcpConnection ○ OutboundTcpConnection
  20. Pain Gossip can inflict on C* ● CASSANDRA-6125 An example of such a race condition ● CASSANDRA-10298 Gossiper does not clean out metadata on a dead peer properly, causing the dead peer to stay in the ring forever ● CASSANDRA-10371 Dead nodes remain in gossip and prevent a replacement because the FailureDetector cannot evict a down node ● CASSANDRA-8072 Unable to gossip to any seeds ● CASSANDRA-8336 Shutdown issue in which peers resurrect a down node
  21. Pain Gossip can inflict on C*, cont. ● CASSANDRA-8072 and CASSANDRA-7292 Problems with reusing the IP of a dead node on a new node ● CASSANDRA-10969 Long-running cluster (over 1 year) has trouble restarting ● CASSANDRA-8768 Issue when upgrading to a newer version ● CASSANDRA-10321 Gossip to dead nodes drove CPU usage to 100% ● A lemon node or an AWS network issue can cause one node not to see another, producing a confusing gossip view
  22. What can we do? ● Rolling-restart the C* cluster once in a while ● On AWS, when there is a gossip issue, try a reboot; if the gossip view is still bad, replace the node with a new instance ● Node assassination (unsafe and needs a repair/clean-up) ● Monitor network activity to take pre-emptive action ● Search the community for the issues reported in the system logs ● Fix it yourself ● Pray (a remediation sketch follows the transcript)
  23. THANK YOU
  24. References ● https://www.cs.cornell.edu/~asdas/research/dsn02-swim.pdf ● https://wiki.apache.org/cassandra/ArchitectureGossip
  25. Cassandra Tickler @chriskalan
  26. When does repair fall down? ● Running LCS on an old version of C* ● Space issues ● Repair gets stuck
  27. Solution - Cassandra Tickler
  28. Solution - Cassandra Tickler
  29. Solution - Cassandra Tickler https://github.com/ckalantzis/cassTickler (a minimal read-repair sketch follows the transcript)
  30. THANK YOU
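Slide 3 lists node-level health signals but not how they are collected. Below is a minimal Python sketch of the idea, checking a disk-usage threshold and collecting a per-endpoint STATUS value from `nodetool gossipinfo`. The threshold, the data path, and the parsing details are assumptions for illustration, not Netflix's actual tooling.

```python
import shutil
import subprocess

DISK_THRESHOLD = 0.80             # assumed threshold; the real limits are not in the slides
DATA_PATH = "/var/lib/cassandra"  # assumed default data directory

def disk_ok(path=DATA_PATH):
    # Node-level check: used fraction of the data volume vs. a threshold.
    usage = shutil.disk_usage(path)
    return usage.used / usage.total < DISK_THRESHOLD

def gossip_statuses():
    # Node-level check: one STATUS entry per endpoint from `nodetool gossipinfo`.
    # Output format varies slightly between C* versions.
    out = subprocess.run(["nodetool", "gossipinfo"],
                         capture_output=True, text=True, check=True).stdout
    statuses, endpoint = {}, None
    for line in out.splitlines():
        if line.startswith("/"):
            endpoint = line.strip().lstrip("/")
        elif line.strip().startswith("STATUS") and endpoint:
            statuses[endpoint] = line.split(":", 2)[-1].strip()  # e.g. "NORMAL,<token>"
    return statuses
```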
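Slide 6's point is that a cron job only sees one snapshot, while a stream processor can keep state across observations. The sketch below is not the Mantis API; it is just a toy Python illustration of the stateful part: alert only after several consecutive bad health events for a node (the window size is an assumption).

```python
from collections import defaultdict, deque

WINDOW = 3  # assumed: alert only after this many consecutive bad observations

# Per-node rolling window of recent health observations (True = healthy).
history = defaultdict(lambda: deque(maxlen=WINDOW))

def on_health_event(node, healthy):
    """Consume one observation from the health-event stream."""
    history[node].append(healthy)
    window = history[node]
    # A cron snapshot sees a single sample; keeping state lets us require
    # several consecutive failures before alerting, reducing false positives.
    if len(window) == WINDOW and not any(window):
        alert(node)

def alert(node):
    print(f"ALERT: {node} unhealthy for {WINDOW} consecutive checks")
```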
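For slide 15, here is a toy Python approximation of how one gossip round picks its targets. The real Gossiper applies probabilities when deciding whether to also contact a seed or an unreachable node; this sketch ignores those details and only shows the three categories.

```python
import random

def pick_gossip_targets(live_peers, seeds, unreachable):
    """Toy selection of up to 3 targets for one gossip round (runs every 1s in C*)."""
    targets = []
    if live_peers:
        targets.append(random.choice(live_peers))   # a random live peer
    if seeds:
        targets.append(random.choice(seeds))        # possibly a seed
    if unreachable:
        targets.append(random.choice(unreachable))  # possibly an unreachable peer
    return targets

# Example:
# pick_gossip_targets(["10.0.0.2", "10.0.0.3"], ["10.0.0.1"], ["10.0.0.9"])
```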
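For slide 22's remediation options, this sketch wraps two of them with `nodetool`: inspecting a node's view of the ring, and, as a last resort, assassinating a dead endpoint that refuses to leave gossip. Note that `nodetool assassinate` exists only in newer C* releases; older versions need the Gossiper MBean's `unsafeAssassinateEndpoint` over JMX instead. Hosts and IPs here are placeholders.

```python
import subprocess

def ring_view(host):
    """One node's view of the ring; compare across nodes to spot a confused gossip view."""
    return subprocess.run(["nodetool", "-h", host, "status"],
                          capture_output=True, text=True, check=True).stdout

def assassinate(host, dead_ip):
    # Last resort: force-remove a dead endpoint stuck in gossip.
    # Unsafe -- follow up with a repair/clean-up, as the slide warns.
    subprocess.run(["nodetool", "-h", host, "assassinate", dead_ip], check=True)
```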
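cassTickler's approach (slides 27-29) is to trigger read repair by re-reading data at ConsistencyLevel.ALL instead of running anti-entropy repair. The sketch below shows that general idea with the DataStax Python driver; the contact point, keyspace, table, and key column are placeholders, and the real script linked above handles paging, throttling, and errors more carefully.

```python
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["127.0.0.1"])          # placeholder contact point
session = cluster.connect("my_keyspace")  # placeholder keyspace

# Page through the partition keys cheaply at CL ONE.
keys = SimpleStatement("SELECT id FROM my_table",
                       fetch_size=1000,
                       consistency_level=ConsistencyLevel.ONE)

# Re-read each row at CL ALL: the coordinator compares all replicas,
# and any digest mismatch triggers a read repair for that row.
reread = session.prepare("SELECT * FROM my_table WHERE id = ?")
reread.consistency_level = ConsistencyLevel.ALL

for row in session.execute(keys):
    session.execute(reread, (row.id,))

cluster.shutdown()
```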
