Presenter: Chris Lohfink, Engineer at Pythian
This session walks through the key metrics critical to operating a Cassandra cluster effectively. Without context, metrics are just pretty graphs. With context, they are a powerful tool for catching problems before they happen and for debugging production issues more quickly.
1. About Me
● Sr. Engineer at Pythian
o Lead of Cassandra Practice
#CassandraSummit 2014
● Remote in Minnesota
● Interests
o Java, Clojure, Python dev
o Data science
o Information Security
o Hobbyist electronics
2. About Pythian
Pythian is a global data outsourcing and consulting company that
specializes in optimizing and managing mission-critical data systems.
Pythian blends the world’s leading data experts with advanced, secure
service delivery processes to create the industry’s best standard of care
for its clients.
Since its inception, Pythian has managed some of the world’s largest,
most business-critical data infrastructures.
● Pythian currently manages more than 10,000 systems.
● Pythian currently employs more than 350 people in 25 countries worldwide.
● Pythian was founded in 1997.
3. About Cassandra
● No Single Point of Failure
● Fault Tolerant
● Awesome properties for an operations team who does
not want to get up at 3am
4. About Cassandra
● Nothing should be set up and forgotten about
● That is easy to do with Cassandra, though
o Fault tolerance on a properly configured setup handles a
single node being down or having temporary performance
issues
o No back pressure on writes until there is a lot of
trouble
5. Utilize the fault tolerance buffer
● Need to observe and react to current issues
● Predict future issues
● Divide this into two approaches
o Proactive
o Reactive
6. Proactive
● Daily and weekly checkups to prevent and
predict problems
o Capacity
o Performance bottlenecks
o Data Modeling issues
7. Reactive
● Something about best laid plans…
o Hardware failures
o Bugs
o Malicious or Non-Malicious users
● Alarms, Pager Duty
11. Metrics
but of course…
Without context, the data is just pretty graphs
12. JMX
● Java Management Extensions
● Complex… very engineered
● Resources represented as objects with
attributes and operations
● Used for monitoring or as input
13. JMX
● The annoying gateway to metrics
○ Poor tooling - requires java
○ Slow, Memory Leaks
○ Historically and currently frustrating for ops (pre 2.0.8)
[Diagram: JMX connection flow between Client (You) and Cassandra]
● Client opens an initial connection to Cassandra on port 7199
● Cassandra replies with a hostname:port for the RMI connection (port in 1024-65535)
● Client gets the new hostname:port, drops the old connection, and attempts to connect
● Connected!
14. JMX
● Visual
o jconsole
o visualvm
● Command line
o jmxterm
o jmxsh
● MX4J
● Jolokia
23. Metrics
● Toolkit called Metrics, for collecting metrics
o By Coda Hale @ Yammer
● Easy to use
● Popular
24. Types of Metrics
● Gauge
o instantaneous value
● Counter
o number that can be incremented & decremented
● Meter
o rate of events over time (1/5/15 min moving avg)
● Histogram
o representation of statistical distribution
§ 50, 75, 95, 98, 99, 99.9 percentile
§ average, median, min, max, standard deviation
● Timer
o rate of events (meter)
o histogram of duration
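The metric types above can be sketched in a few lines of Python. This is only an illustration of the concepts, not the Metrics library's API or implementation: all class and method names here are hypothetical, the histogram keeps every sample (a real implementation would use a bounded reservoir), and the meter reports only a mean rate rather than the 1/5/15 minute moving averages.

```python
import math
import time


class Gauge:
    """Instantaneous value, read on demand from a callable."""

    def __init__(self, read):
        self.read = read

    def value(self):
        return self.read()


class Counter:
    """A number that can be incremented and decremented."""

    def __init__(self):
        self.count = 0

    def inc(self, n=1):
        self.count += n

    def dec(self, n=1):
        self.count -= n


class Histogram:
    """Distribution of observed values (keeps every sample; illustration only)."""

    def __init__(self):
        self.samples = []

    def update(self, value):
        self.samples.append(value)

    def percentile(self, p):
        # Nearest-rank percentile: smallest sample with at least p% at or below it.
        ordered = sorted(self.samples)
        rank = max(1, math.ceil(p / 100.0 * len(ordered)))
        return ordered[rank - 1]


class Meter:
    """Rate of events over time (mean rate only; moving averages omitted)."""

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.start = clock()
        self.count = 0

    def mark(self, n=1):
        self.count += n

    def mean_rate(self):
        elapsed = self.clock() - self.start
        return self.count / elapsed if elapsed > 0 else 0.0


class Timer:
    """A meter for the rate of events plus a histogram of their durations."""

    def __init__(self, clock=time.monotonic):
        self.meter = Meter(clock)
        self.histogram = Histogram()

    def update(self, duration):
        self.meter.mark()
        self.histogram.update(duration)
```

Read this way, a timer answers two separate questions: how often something happens (the meter part) and how long it takes when it does (the histogram part).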
25. JMX
75th percentile is 683 MICROSECONDS
(75% took 683us or less)
One minute rate is 13,915 calls per SECOND
26. JMX
● Overwhelming at first
● Hard to tell what they mean without the source
● Moves around a lot
● Fortunately there is nodetool
27. Nodetool
● JMX command line wrapper
● Many options
● Operations and diagnostic procedures
● For reactive analysis
o ad hoc, spot checks
29. Staged Event Driven Architecture
● Decomposes complex event system
● Set of stages (thread pools)
● Queue between each
● Shares many of the pros and cons of SOA
31. Staged Event Driven Architecture
● It's easy to overrun the processing capacity of a stage
that is not in the request's feedback loop (e.g.
ReadRepairStage).
● No write back pressure
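A toy sketch of the staged idea may help: each stage is a thread pool with a queue in front of it, and because the queue is unbounded, submitting never blocks the producer. This is not Cassandra's implementation; the `Stage` class, its method names, and the stage names used in the usage note are all hypothetical.

```python
import queue
import threading


class Stage:
    """One stage: an unbounded queue feeding a small pool of worker threads."""

    def __init__(self, name, handler, workers=1, downstream=None):
        self.name = name
        self.handler = handler
        self.downstream = downstream
        self.tasks = queue.Queue()  # unbounded, so submit() never pushes back
        self.completed = 0
        self._lock = threading.Lock()
        self._threads = [
            threading.Thread(target=self._run, daemon=True) for _ in range(workers)
        ]

    def start(self):
        for thread in self._threads:
            thread.start()

    def submit(self, event):
        # Accepts work regardless of how far behind the workers are.
        self.tasks.put(event)

    def pending(self):
        # Backlog waiting in front of this stage.
        return self.tasks.qsize()

    def drain(self):
        # Block until everything queued so far has been handled.
        self.tasks.join()

    def _run(self):
        while True:
            event = self.tasks.get()
            result = self.handler(event)
            with self._lock:
                self.completed += 1
            if self.downstream is not None:
                self.downstream.submit(result)
            self.tasks.task_done()
```

For example, `Stage("double", lambda x: x * 2, workers=2, downstream=sink)` accepts any number of `submit()` calls before `start()` is ever called. The backlog shows up only in `pending()`, which is why a stage outside the request's feedback loop can fall far behind without the caller noticing.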
65. Nodetool compactionstats
nodetool compactionstats
org.apache.cassandra.metrics:type=Compaction
pending tasks: 1
compaction type   keyspace    table       completed   total      unit    Progress
Compaction        Keyspace1   Standard1   6076415     29605054   bytes   20.06%
Active compaction remaining time : 0h00m03s
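For spot checks it can be handy to pull these fields out of the text. The following is a minimal sketch that parses output shaped like the sample above. It assumes exactly this column layout and a single-word compaction type; the function name and dictionary keys are made up for illustration, and other Cassandra versions may print a different format.

```python
SAMPLE = """\
pending tasks: 1
compaction type keyspace table completed total unit Progress
Compaction Keyspace1 Standard1 6076415 29605054 bytes 20.06%
Active compaction remaining time : 0h00m03s
"""


def parse_compactionstats(text):
    """Parse compactionstats-style text into a dict (layout assumed from the sample)."""
    result = {"pending_tasks": None, "compactions": [], "remaining_time": None}
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("pending tasks:"):
            result["pending_tasks"] = int(line.split(":", 1)[1])
        elif line.startswith("Active compaction remaining time"):
            result["remaining_time"] = line.split(":", 1)[1].strip()
        else:
            fields = line.split()
            # A data row has 7 fields with numeric completed/total columns;
            # the header row splits into 8 words and is skipped.
            if len(fields) == 7 and fields[3].isdigit() and fields[4].isdigit():
                completed, total = int(fields[3]), int(fields[4])
                result["compactions"].append({
                    "type": fields[0],
                    "keyspace": fields[1],
                    "table": fields[2],
                    "completed": completed,
                    "total": total,
                    "unit": fields[5],
                    "progress": fields[6],
                    "remaining": total - completed,
                })
    return result


if __name__ == "__main__":
    print(parse_compactionstats(SAMPLE))
```

A steadily growing `pending_tasks` value across repeated runs is the thing to watch for, since it suggests compaction is not keeping up with writes.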
69. Nodetool
Much more!!
http://www.datastax.com/documentation/cassandra/2.0/cassandra/tools/toolsNodetool_r.html
70. OpsCenter
● Provides visibility to key metrics
● Alarming
● Basic orchestration and config management
● Constantly improving
● Free*
● Almost zero barrier to getting set up
● Very few reasons not to run it
71. OpsCenter
● Homogeneous tooling with rest of stack
o Integrate metrics in with what app is using
o orchestration and config management
● (paid version) “Good enough”
o a mature environment should have more
73. Reporting Interface
● Configurable with yaml
o console, csv, ganglia, graphite
● Create reporter with premain agent
o compiling new jar with manifest
o add to classpath
o add javaagent in cassandra-env.sh
74. Garbage Collection
● Death, taxes, and a stop-the-world GC
● Common issue for all JVM-based applications
75. Garbage Collection
Enable GC logging
● Virtually no overhead
● Can be very helpful in diagnosing
performance issues
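Once GC logging is on, the log can be mined for pauses. The sketch below assumes a HotSpot JVM started with `-XX:+PrintGCApplicationStoppedTime`, which writes "Total time for which application threads were stopped" lines. The sample lines, function names, and the 0.2 second threshold are illustrative assumptions, not values from the talk.

```python
import re

# Line written by HotSpot when -XX:+PrintGCApplicationStoppedTime is set.
STOPPED = re.compile(
    r"Total time for which application threads were stopped: ([0-9.]+) seconds"
)

# Made-up sample lines, for illustration only.
SAMPLE_LINES = [
    "12.345: Total time for which application threads were stopped: 0.0150000 seconds",
    "13.001: [GC (illustrative line that does not match)]",
    "47.890: Total time for which application threads were stopped: 1.2500000 seconds",
]


def stopped_times(lines):
    """Return every stop-the-world pause, in seconds, found in the log lines."""
    pauses = []
    for line in lines:
        match = STOPPED.search(line)
        if match:
            pauses.append(float(match.group(1)))
    return pauses


def summarize(pauses, threshold=0.2):
    """Summarize pauses; threshold (seconds) is an arbitrary example value."""
    return {
        "count": len(pauses),
        "total": sum(pauses),
        "max": max(pauses, default=0.0),
        "over_threshold": sum(1 for p in pauses if p > threshold),
    }


if __name__ == "__main__":
    print(summarize(stopped_times(SAMPLE_LINES)))
```

Feeding it a real log is a matter of passing an open file object instead of `SAMPLE_LINES`, since the function only iterates over lines.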
78. Garbage Collection
Could be its own talk
Honorable mentions:
● https://github.com/chewiebug/GCViewer
● http://jworks.idv.tw/GcWeb/
● Python, R, Octave
79. Logging
/var/log/cassandra/system.log
o provides a rolling log
o log4j
/var/log/cassandra/output.log
o captured standard error and standard out
o truncated on restart
System Logs
o syslog, dmesg, etc