I gave this talk at MinneBar 2014: http://sessions.minnestar.org/sessions/162
When I joined a SaaS startup already in progress as their first ops hire, what monitoring existed was a twisty maze of half-measures. The dev team dreaded on-call, and our Mean Time To Lost Sleep was way too low.
Improving visibility into our infrastructure and application performance required trying new tools and changing how we thought about what we were measuring. Join me for a tragicomic journey from the vale of blissful ignorance through the straits of Nagios and into the mountains of Graphite. We'll talk tools and pitfalls, missteps and dead ends, and everything we haven't yet done but should.
Tools covered will include Nagios, StatsD, Graphite, and Sentry, with some digressions into others such as NewRelic and MMS.
8. “Sensu has so many moving parts that I wouldn’t be able to sleep at night unless I set up a Nagios instance to make sure they were all running.”
-- @murphy_slaw (via @lozzd)
9. HBase: monitor all the ports?!?
hbck: the HBase consistency checker
nagios -> bash script -> parsing output of hbck
http://www.ymc.ch/en/how-to-monitor-hbase-health-by-nagios
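The nagios -> script -> hbck chain can be sketched in Python rather than bash. This is a minimal sketch, not the script from the talk or the linked post: it assumes the hbck summary contains a line like "N inconsistencies detected." and a line like "Status: OK", and the function names parse_hbck and main are hypothetical. The exit codes follow the usual Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
import re
import subprocess

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def parse_hbck(output):
    """Map hbck text output to a (nagios_code, message) pair.

    Assumption: the summary has lines like 'N inconsistencies detected.'
    and 'Status: OK'. Adjust the patterns for your HBase version.
    """
    count = re.search(r"(\d+) inconsistencies detected", output)
    status = re.search(r"^Status:\s*(\w+)", output, re.MULTILINE)
    if status is None:
        return UNKNOWN, "UNKNOWN: no Status line in hbck output"
    n = int(count.group(1)) if count else 0
    if status.group(1) == "OK" and n == 0:
        return OK, "OK: hbck reports no inconsistencies"
    return CRITICAL, "CRITICAL: hbck status %s, %d inconsistencies" % (status.group(1), n)


def main():
    """Run 'hbase hbck', print one status line, return the Nagios exit code."""
    try:
        result = subprocess.run(
            ["hbase", "hbck"], capture_output=True, text=True, timeout=300
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        print("UNKNOWN: could not run hbck: %s" % exc)
        return UNKNOWN
    code, message = parse_hbck(result.stdout)
    print(message)
    return code
```

To use it as a check, call sys.exit(main()) from a small wrapper script and point a Nagios command definition at that script. Keeping the parsing in its own function makes it testable without a running cluster.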
14. “cyber” monday:
1988 called; wants its word back.
the rewards of hubris
MMS showed the issue
but we weren't alerting on it
didn't understand the global write lock
15. “If it moves, we track it. Sometimes we’ll draw a graph of something that isn’t moving yet, just in case it decides to make a run for it.” -- @indec
http://codeascraft.com/2011/02/15/measure-anything-measure-everything/
16. Graphite & StatsD
➔ Graphite
◆ Store and visualize time-series data
◆ http://graphite.readthedocs.org/
➔ StatsD
◆ Measure everything! (Timers, counters, events, …)
◆ https://github.com/etsy/statsd/
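For a sense of how cheap "measure everything" is on the client side, here is a minimal sketch of the StatsD wire format sent over UDP. It is not the client we used; the helper names and the metric names in demo() are hypothetical, and it assumes a StatsD daemon listening on the default port 8125 on localhost.

```python
import socket


def statsd_line(name, value, metric_type, sample_rate=None):
    """Format one metric in the StatsD text protocol: name:value|type[|@rate]."""
    line = "%s:%s|%s" % (name, value, metric_type)
    if sample_rate is not None:
        line += "|@%s" % sample_rate
    return line


def send_metric(line, host="127.0.0.1", port=8125):
    """Fire-and-forget UDP send; nothing breaks if no daemon is listening."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(line.encode("ascii"), (host, port))
    finally:
        sock.close()


def demo():
    # Hypothetical metric names, for illustration only.
    send_metric(statsd_line("app.signup.count", 1, "c"))    # counter
    send_metric(statsd_line("app.render.time", 320, "ms"))  # timer, milliseconds
    send_metric(statsd_line("app.queue.depth", 42, "g"))    # gauge
```

Because the transport is UDP, instrumented code never blocks on the metrics pipeline, which is a large part of why it is safe to sprinkle timers and counters everywhere.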
17. Where we were
➔ Graphite 0.9.9 (wanted 0.9.12)
◆ over 2 years old
◆ missing new features (consolidateBy!)
➔ StatsD was newish, but…
◆ hand-rolled
◆ running in a screen session
◆ on a special snowflake box
18. Community cookbooks?
➔ Graphite ones good, but…
◆ focus on Apache (we use nginx)
◆ we haven’t moved to Chef 11 (gasp!)
➔ StatsD
◆ https://github.com/librato/statsd-cookbook
◆ launches daemons via upstart
◆ generates config file based on attributes
19. Graphite cookbook (Part 1)
➔ Install in a virtualenv (django, uwsgi, nginx)
➔ Dependencies recommended
◆ https://github.com/graphite-project/graphite-web/blob/master/requirements.txt
➔ libcairo2-dev package on Ubuntu 12.04 LTS
➔ install graphite’s 3 parts via pip
20. Graphite cookbook (Part 2)
➔ graphite-web
◆ Django app, renders graphs
➔ whisper
◆ fixed-size database for storing time-series data
◆ like RRD
➔ carbon
◆ carbon-cache.py - stores data
◆ carbon-aggregator.py - buffers, then stores
◆ carbon-relay.py - for sharding/replication
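"Fixed-size" for whisper means the file is allocated up front from the retention schema and never grows. A rough sketch of the arithmetic, assuming the commonly documented on-disk layout (16-byte metadata header, 12 bytes per archive header, 12 bytes per data point); the function name and the example retention are hypothetical:

```python
# Assumed sizes of the on-disk structures in a Whisper file, in bytes.
METADATA_SIZE = 16      # aggregation type, max retention, xFilesFactor, archive count
ARCHIVE_INFO_SIZE = 12  # offset, seconds per point, number of points
POINT_SIZE = 12         # 4-byte timestamp + 8-byte double value


def whisper_file_size(archives):
    """archives: list of (seconds_per_point, retention_seconds) pairs."""
    points = sum(retention // step for step, retention in archives)
    return METADATA_SIZE + ARCHIVE_INFO_SIZE * len(archives) + POINT_SIZE * points


# Hypothetical retention: 10s resolution for 1 day, then 1min for 30 days.
example = [(10, 86400), (60, 30 * 86400)]
print(whisper_file_size(example))  # 622120 bytes, whether or not any data arrives
```

The practical consequence is that disk usage is driven by the number of distinct metric names times the retention schema, not by how often data is sent.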
21. when in doubt: tcpdump is your friend
http://blog.johngoulah.com/2012/10/looking-under-the-covers-of-statsd/
22. carbon-aggravator (between 0.9.10 & 0.9.12)
# If set true, metric received will be forwarded to DESTINATIONS in addition to
# the output of the aggregation rules. If set false the carbon-aggregator will
# only ever send the output of aggregation.
FORWARD_ALL = True
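What that comment describes can be shown with a toy model of a single aggregation interval. This is not carbon's code: the function, the prefix-matching rule, and the metric names are all hypothetical simplifications (real rules use patterns and a choice of sum or avg). It only illustrates why flipping the setting changes which metrics reach the destinations.

```python
def aggregate(datapoints, rule, forward_all):
    """Toy model of one aggregation interval.

    datapoints: list of (metric_name, value)
    rule: (output_name, input_prefix); sums every metric under the prefix
    forward_all: mimics the FORWARD_ALL setting described above
    """
    output_name, input_prefix = rule
    matched = [value for name, value in datapoints if name.startswith(input_prefix)]
    emitted = []
    if forward_all:
        # Raw metrics are passed through in addition to the aggregate.
        emitted.extend(datapoints)
    if matched:
        emitted.append((output_name, sum(matched)))
    return emitted


# Hypothetical metrics from two web hosts.
points = [("web.host1.requests", 5), ("web.host2.requests", 7)]
rule = ("web.all.requests", "web.")
print(aggregate(points, rule, forward_all=True))
print(aggregate(points, rule, forward_all=False))
```

With forward_all=False only web.all.requests comes out, so per-host graphs silently go flat unless the raw metrics reach carbon-cache by another route.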
26. the ideal monitoring solution...
❏ finds real problems
❏ actionable alerting
❏ usable by all
❏ …?
http://www.quickmeme.com/img/f5/f512ff9bee084263df5571d3c81388019dcb063173e1dbcfa2babac9274576b6.jpg
27. What we’re actually using now
StatsD
Application-level error analysis
Alarms for autoscaling
Timers & counters
Log & host-level
Hadoop & HBase
visualization
MongoDB
Graphs
Time-series data graphing
client-side plugins
External uptime checks
oncall rotation/alerting
Threshold-based alarms
Dashboard