This document summarizes a presentation about monitoring Cassandra systems. It discusses gathering metrics from Cassandra using JMX and nodetool, including thread pool statistics, latency histograms, and metric types. It also provides an overview of the Cassandra read/write process involving memtables and SSTables.
2. About Me
● Sr. Engineer at Pythian
o Lead of Cassandra Practice
● Remote in Minnesota
● DataStax MVP for Cassandra ‘14
● Interests
o Java, Clojure, Python dev
o Data science
o Hobbyist electronics
#CassandraSummit 2014
3. About Pythian
Pythian is a global data outsourcing and consulting company that
specializes in optimizing and managing mission-critical data systems.
Pythian blends the world’s leading data experts with advanced, secure
service delivery processes to create the industry’s best standard of care
for its clients.
Since its inception, Pythian has managed some of the world’s largest,
most business-critical data infrastructures.
Pythian currently manages more than 10,000 systems.
Pythian currently employs more than 350 people in 25 countries worldwide.
Pythian was founded in 1997.
4. About Cassandra
● No Single Point of Failure
● Fault Tolerant
● Awesome properties for an operations team that does
not want to get up at 3am
5. About Cassandra
● Nothing should be set up and forgotten about
● Easy to do with Cassandra though
o Fault tolerance on a properly configured setup handles
a single node being down or having temporary performance
issues
o No back pressure on writes until there is a lot of
trouble
6. Utilize the fault tolerance buffer
● Need to observe and react to current issues
● Predict future issues
● Divide this into two approaches
o Proactive
o Reactive
7. Proactive
● Daily and weekly checkups to prevent and
predict problems
o Capacity
o Performance bottlenecks
o Data Modeling issues
8. Reactive
● Something about best laid plans…
o Hardware failures
o Bugs
o Malicious or Non-Malicious users
● Alarms, PagerDuty
12. Metrics
but of course…
Without context, the data is just pretty graphs
13. JMX
● Java Management Extensions
● Complex… very engineered
● Resources represented as objects with
attributes and operations
● Used for monitoring or as input
14. JMX
● The annoying gateway to metrics
○ Poor tooling - requires Java
○ Slow, Memory Leaks
○ Historically and currently frustrating for ops (pre 2.0.8)
Connection flow, Client (You) to Cassandra:
○ Client: init connection to port 7199
○ Cassandra: reply with hostname:port for the RMI connection, port in 1024-65535
○ Client: gets new hostname:port, drops old connection and attempts to connect
○ Remaining diagram labels: 7199, 7199, Connected!
15. JMX
● Visual
o jconsole
o visualvm
● Command line
o jmxterm
o jmxsh
● MX4J
● Jolokia
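Of the tools listed, Jolokia exposes JMX over HTTP with JSON, so a client does not need Java. The following is a minimal sketch of a Jolokia read request, not a definitive client. It assumes a Jolokia agent is already attached to the Cassandra JVM; the URL, the helper names, and the choice of MBean are illustrative assumptions, and the MBean name may differ between Cassandra versions.

```python
import json
import urllib.request

# Assumed agent location; 8778 is the usual default for the Jolokia JVM agent.
JOLOKIA_URL = "http://localhost:8778/jolokia/"

# Assumed MBean name for client read latency; verify it against your version.
READ_LATENCY = (
    "org.apache.cassandra.metrics:type=ClientRequest,scope=Read,name=Latency"
)


def build_read_request(mbean, attribute=None):
    """Return the JSON body for a Jolokia 'read' operation."""
    body = {"type": "read", "mbean": mbean}
    if attribute is not None:
        body["attribute"] = attribute
    return body


def read_attribute(mbean, attribute, url=JOLOKIA_URL, timeout=5):
    """POST a read request and return the 'value' field of the reply."""
    data = json.dumps(build_read_request(mbean, attribute)).encode("utf-8")
    request = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        reply = json.loads(response.read().decode("utf-8"))
    if reply.get("status") != 200:
        raise RuntimeError("Jolokia error: %s" % reply.get("error"))
    return reply["value"]
```

With an agent running, `read_attribute(READ_LATENCY, "75thPercentile")` would return the attribute value; `build_read_request` can be inspected without any network access.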
24. Metrics
● Toolkit called Metrics, for metrics
o By Coda Hale @ Yammer
● Easy to use
● Easy to read (if you know Java)
● Popular
25. Types of Metrics
● Gauge
o instantaneous value
● Counter
o number that can be incremented & decremented
● Meter
o rate of events over time (1/5/15 min moving avg)
● Histogram
o representation of statistical distribution
o 50th, 75th, 95th, 98th, 99th, 99.9th percentiles
o average, median, min, max, standard deviation
● Timer
o rate of events (meter)
o histogram of duration
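The metric types above can be illustrated with a simplified sketch. This is not the Metrics library's implementation: the histogram here keeps every sample and uses nearest-rank percentiles (the library samples into a reservoir), and the meter's 5-second tick and one-minute smoothing factor are assumptions modelled on an exponentially weighted moving average. All class names are hypothetical.

```python
import math


class Counter:
    """A number that can be incremented and decremented."""

    def __init__(self):
        self.count = 0

    def inc(self, n=1):
        self.count += n

    def dec(self, n=1):
        self.count -= n


class Histogram:
    """Keeps every sample and reports nearest-rank percentiles."""

    def __init__(self):
        self.values = []

    def update(self, value):
        self.values.append(value)

    def percentile(self, q):
        if not self.values:
            raise ValueError("no samples recorded")
        ordered = sorted(self.values)
        rank = max(1, math.ceil(q * len(ordered)))
        return ordered[rank - 1]


class Meter:
    """One-minute exponentially weighted moving average of an event rate."""

    TICK_SECONDS = 5.0  # assumed tick interval; the caller invokes tick()

    def __init__(self):
        self.alpha = 1.0 - math.exp(-self.TICK_SECONDS / 60.0)
        self.uncounted = 0
        self.rate = None  # events per second, None until the first tick

    def mark(self, n=1):
        self.uncounted += n

    def tick(self):
        instant = self.uncounted / self.TICK_SECONDS
        self.uncounted = 0
        if self.rate is None:
            self.rate = instant
        else:
            self.rate += self.alpha * (instant - self.rate)


class Timer:
    """A meter of calls plus a histogram of their durations."""

    def __init__(self):
        self.meter = Meter()
        self.histogram = Histogram()

    def update(self, duration):
        self.meter.mark()
        self.histogram.update(duration)
```

A gauge needs no class in this sketch: it is any function that returns the current value when asked.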
26. JMX
75th percentile is 683 MICROSECONDS
(75% took 683us or less)
One minute rate is 13,915 calls per SECOND
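The units are easy to misread, so it can help to normalize them before alerting or graphing. The sketch below takes the units as explicit arguments because the names of the unit attributes on the MBean vary between Metrics versions; the function names and unit strings are illustrative assumptions.

```python
# Seconds per unit, keyed by unit names in the style shown on the slide.
UNIT_TO_SECONDS = {
    "NANOSECONDS": 1e-9,
    "MICROSECONDS": 1e-6,
    "MILLISECONDS": 1e-3,
    "SECONDS": 1.0,
}


def to_millis(value, unit):
    """Convert a duration in the given unit to milliseconds."""
    return value * UNIT_TO_SECONDS[unit.upper()] * 1000.0


def summarize_timer(p75, one_minute_rate, latency_unit, rate_unit):
    """Describe a timer's 75th percentile and one-minute rate in fixed units."""
    per_second = one_minute_rate / UNIT_TO_SECONDS[rate_unit.upper()]
    return (
        "75%% of calls took %.3f ms or less; "
        "about %.0f calls per second (one-minute rate)"
        % (to_millis(p75, latency_unit), per_second)
    )


# Values from the slide: 683 microseconds and 13,915 calls per second.
print(summarize_timer(683, 13915, "MICROSECONDS", "SECONDS"))
```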
27. JMX
● Overwhelming at first
● Hard to tell what they mean without the source
● Moves around a lot
● Fortunately there is nodetool
28. Nodetool
● JMX command line wrapper
● Many options
● Operations and diagnostic procedures
● For reactive analysis
o ad hoc, spot checks
30. Staged Event Driven Architecture
● Decomposes complex event system
● Set of stages (thread pools)
● Queue between each
● Shares a lot of pros and cons with SOA
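The stage-and-queue idea can be sketched in a few lines. This is a toy model under stated assumptions, not Cassandra's implementation: the `Stage` class, its fields, and the `demo` pipeline are all hypothetical. Each stage owns a thread pool and an unbounded inbox, and the number of queued items is what a "pending" count would reflect.

```python
import queue
import threading


class Stage:
    """A named thread pool that pulls work from its own queue."""

    _STOP = object()

    def __init__(self, name, handler, workers=2, downstream=None):
        self.name = name
        self.handler = handler
        self.downstream = downstream
        self.inbox = queue.Queue()  # unbounded: nothing pushes back on producers
        self.completed = 0
        self._lock = threading.Lock()
        self._threads = [
            threading.Thread(target=self._run, daemon=True)
            for _ in range(workers)
        ]
        for thread in self._threads:
            thread.start()

    def submit(self, item):
        self.inbox.put(item)

    @property
    def pending(self):
        """Approximate number of items waiting in this stage's queue."""
        return self.inbox.qsize()

    def _run(self):
        while True:
            item = self.inbox.get()
            if item is Stage._STOP:
                break
            result = self.handler(item)
            with self._lock:
                self.completed += 1
            if self.downstream is not None:
                self.downstream.submit(result)

    def shutdown(self):
        """Let queued work drain, then stop and join every worker."""
        for _ in self._threads:
            self.inbox.put(Stage._STOP)
        for thread in self._threads:
            thread.join()


def demo():
    """Run three items through a two-stage pipeline and return the results."""
    results = []
    sink = Stage("respond", results.append, workers=1)
    parse = Stage("parse", int, workers=2, downstream=sink)
    for text in ["1", "2", "3"]:
        parse.submit(text)
    parse.shutdown()
    sink.shutdown()
    return sorted(results), parse.completed
```

Because the inbox is unbounded, a slow stage simply accumulates pending work while upstream stages keep submitting, which is the trade-off a queue-per-stage design makes.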
32. Staged Event Driven Architecture
● Possible to overrun the processing capabilities of a
stage that is not in the request's feedback loop (e.g.
ReadRepairStage)
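A spot check for an overrun stage can be scripted around `nodetool tpstats`. The sketch below assumes the six-column pool layout (name, active, pending, completed, blocked, all time blocked) and should be adjusted if your version prints something different. The function names are hypothetical and the sample text is made up purely to exercise the parser.

```python
import subprocess

# Made-up sample in the assumed layout; not output from a real cluster.
SAMPLE = """\
Pool Name                    Active   Pending      Completed   Blocked  All time blocked
ReadStage                         0         0           1000         0                 0
MutationStage                     0         0           2000         0                 0
ReadRepairStage                   1        42            300         0                 0

Message type           Dropped
READ                         0
"""


def parse_tpstats(text):
    """Return {pool name: stats dict} for every six-column pool line."""
    pools = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 6:
            continue  # headers, blank lines, dropped-message rows
        name, numbers = parts[0], parts[1:]
        if not all(n.isdigit() for n in numbers):
            continue
        active, pending, completed, blocked, all_time_blocked = map(int, numbers)
        pools[name] = {
            "active": active,
            "pending": pending,
            "completed": completed,
            "blocked": blocked,
            "all_time_blocked": all_time_blocked,
        }
    return pools


def backed_up(pools, max_pending=0):
    """Names of pools with queued work above the threshold or blocked tasks."""
    return sorted(
        name
        for name, stats in pools.items()
        if stats["pending"] > max_pending or stats["blocked"] > 0
    )


def run_tpstats(host="127.0.0.1"):
    """Run nodetool against a node and return its raw tpstats output."""
    output = subprocess.check_output(["nodetool", "-h", host, "tpstats"])
    return output.decode("utf-8")
```

On a live node, `backed_up(parse_tpstats(run_tpstats()))` would list the stages worth a closer look; a single sample is only a snapshot, so repeated checks are more telling than one.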
68. Reporting Interface
● Configurable with yaml
o console, csv, ganglia, graphite, riemann
● Create reporter with premain agent
o compiling new jar with manifest
o add to classpath
o add javaagent in cassandra-env.sh
69. Garbage Collection
● Death, taxes, and a stop-the-world GC
● Common issue to all JVM based applications
70. Garbage Collection
Enable GC logging
● Virtually no overhead
● Can be very helpful in diagnosing
performance issues
73. Garbage Collection
Could be its own talk
Honorable mentions:
● https://github.com/chewiebug/GCViewer
● http://jworks.idv.tw/GcWeb/
● Python, R, Octave
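As a rough illustration of the scripting route, the sketch below totals stop-the-world pauses from a GC log. It assumes the JVM was started with `-XX:+PrintGCApplicationStoppedTime`, which writes "Total time for which application threads were stopped" lines; the slides do not say which GC flags were used, so treat that as an assumption. The function names and the 0.2 second threshold are hypothetical.

```python
import re

# Matches the pause line written when PrintGCApplicationStoppedTime is on.
STOPPED = re.compile(
    r"Total time for which application threads were stopped: "
    r"([0-9.]+) seconds"
)


def pause_seconds(lines):
    """Extract every stopped-time value, in seconds, from GC log lines."""
    pauses = []
    for line in lines:
        match = STOPPED.search(line)
        if match:
            pauses.append(float(match.group(1)))
    return pauses


def summarize(pauses, threshold=0.2):
    """Count, total, and worst pause, plus how many exceeded the threshold."""
    return {
        "count": len(pauses),
        "total": sum(pauses),
        "max": max(pauses) if pauses else 0.0,
        "over_threshold": sum(1 for p in pauses if p > threshold),
    }
```

Typical use would be `summarize(pause_seconds(open(path)))`, with `path` pointing at wherever the GC log is written.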
74. Logging
/var/log/cassandra/system.log
o provides a rolling log
o log4j
/var/log/cassandra/output.log
o captured standard error and standard out
o truncated on restart
System Logs
o syslog, dmesg, etc.
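A quick pass over system.log can surface warnings and errors before reading it in full. The sketch below assumes each log entry starts with its level (for example " INFO" or "ERROR"), as in a typical log4j pattern; if your log configuration puts the level elsewhere, the regular expression needs to change. The function names are hypothetical.

```python
import collections
import re

# Assumes the level is the first token on the line.
LEVEL = re.compile(r"^\s*(TRACE|DEBUG|INFO|WARN|ERROR|FATAL)\b")


def count_levels(lines):
    """Count log entries per level; continuation lines are ignored."""
    counts = collections.Counter()
    for line in lines:
        match = LEVEL.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def problem_lines(lines, levels=("WARN", "ERROR", "FATAL")):
    """Return the entries whose level is in the given set."""
    selected = []
    for line in lines:
        match = LEVEL.match(line)
        if match and match.group(1) in levels:
            selected.append(line.rstrip("\n"))
    return selected
```

For example, `problem_lines(open("/var/log/cassandra/system.log"))` would return only the WARN, ERROR, and FATAL entries.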