Monitoring means many things to many people. This talk looks at Systems Monitoring: how to keep an eye on a given system and use that information as part of its overall management. It covers why one monitors, what to monitor, how to monitor, the general design of a monitoring system, and how Prometheus is a good fit for all of this in terms of instrumentation, consoles, alerts, and general system health and sanity.
Prometheus is a next-generation monitoring system, publicly announced earlier this year and developed by companies including SoundCloud, locals Boxever, and Docker. Since launch there has been widespread interest and many community contributions.
For more information see http://prometheus.io or http://www.boxever.com/tag/monitoring
3. What is monitoring?
Host-based checks?
• Typically shell scripts with success/fail
• Failure causes alerts
• More blackbox than whitebox
• About machines, not services
4. Brian’s Pet Peeve #1
Thinking in terms of machines rather than services.
It’s the future: it’s not the “Webserver machine”, it’s one
machine that happens to run a webserver as part of the
webserver service.
5. What is monitoring?
Highly granular information about a subsystem?
• Tends to be focused on one subsystem, such as
incoming http requests
• No visibility into the rest of the system
6. What is monitoring?
High frequency high granularity profiling?
• Great for debugging once you’ve narrowed down the
problem
• Not so useful for general monitoring
7. What is monitoring?
Tailing logs?
• Easy to miss something
• Tend to get very noisy
• We have computers, why are humans doing repetitive
tasks?
9. Why: Alerting
We want to know when things go wrong
We want to know when things aren’t quite right
We want to know in advance of problems
10. Why: Debugging
When something is up, you need to debug.
You want to go from a high-level problem and drill down to
what’s causing it. You need to be able to reason about things.
Sometimes you want to go from code back to metrics.
11. Brian’s Pet Peeve #2
Instrumentation that you need to read the code to
understand.
Make the names such that a random person not intimately
familiar with the system would have a good chance of
guessing what they mean. Specify your units.
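As a small illustration, a sketch with the Python client library (the metric name and wording here are examples, not from the talk):

from prometheus_client import Histogram

# A vague name like "proc_time" forces readers into the code; this name tells
# a stranger what is measured and which unit it is in.
REQUEST_DURATION = Histogram(
    'http_request_duration_seconds',
    'Time taken to serve an HTTP request, in seconds.')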
12. Why: Trending
How the various bits of a system are being used.
For example, how many static requests per dynamic
request? How many sessions active at once? How many hit
a certain corner case?
For some stats, also want to know how they change over
time for capacity planning and design discussions.
13. A different approach
What if we instrumented everything?
• RPCs
• Interfaces between subsystems
• Business logic
• Every time you’d log something
What if we monitored systems and subsystems to know
how everything is generally doing?
14. That’s a lot of metrics
That could be tens of thousands of codepoints across an
entire system.
You’d need some way to make it easy to instrument all
code, not just the externally facing parts of applications.
You’d need something able to handle a million time series.
15. Presenting Prometheus
An open-source service monitoring system and time series
database.
Started in 2012, primarily developed at SoundCloud with
committers also at Boxever and Docker.
Publicly announced January 2015, many contributions and
users since then.
17. The Server
• Can handle over a million time series in one instance
• No dependencies such as HBase or Cassandra
• Stores data on local disk
• Written in Go
• Easy to run
18. Data Model
Tired of aggregating and alerting off metrics like
http.responses.500.myserver.mydc.production?
Time series have structured key-value pairs, e.g.
http_responses_total{
  response_code="500", instance="myserver",
  dc="mydc", env="production"}
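With labels, aggregation no longer means globbing over name components. For instance (illustrative queries, not from the slides):

# 500s across production, regardless of instance or datacenter:
sum(rate(http_responses_total{response_code="500", env="production"}[5m]))

# The same, broken out per datacenter:
sum by (dc) (rate(http_responses_total{response_code="500", env="production"}[5m]))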
19. Brian’s Pet Peeve #3
Munging structured data in a way that loses the structure
Is it so much to ask for some escaping, or at least sanitizing
any separators in the data?
20. Query Language
Aggregation based on the key-value labels
Arbitrarily complex math
And all of this can be used in pre-computed rules and alerts
21. Query Language: Example
Column families with the 10 highest read rates per second
topk(10,
  sum by(job, keyspace, columnfamily) (
    rate(cassandra_columnfamily_readlatency[5m])
  )
)
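Expressions like this can also be saved as pre-computed recording rules and alerts. A rough sketch in the pre-2.0 rule-file style (rule names and thresholds are illustrative, and the exact alert keywords varied between early versions):

# Recording rule: pre-compute the per-job 5xx rate.
job:http_responses_errors:rate5m =
  sum by (job) (rate(http_responses_total{response_code=~"5.."}[5m]))

# Alert when the pre-computed rate stays high for five minutes.
ALERT HighErrorRate
  IF job:http_responses_errors:rate5m > 10
  FOR 5m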
22. Client Libraries
How you instrument your code
• Official: Python, Java, Go, Ruby
• Unofficial: Bash, NodeJS, .Net
Text-based format, easy to write new client libraries
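The exposition format is plain text over HTTP, roughly like this (values are illustrative):

# HELP http_responses_total Total HTTP responses sent.
# TYPE http_responses_total counter
http_responses_total{response_code="200"} 10237
http_responses_total{response_code="500"} 42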
24. Client Libraries: In and Out
Client libraries don’t tie you to Prometheus instrumentation.
Custom collectors allow pulling data from other
instrumentation systems into the Prometheus client library.
Similarly, you can pull data out of the client library and
expose it however you wish.
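A minimal custom-collector sketch with the Python client, assuming some hypothetical fetch_legacy_stats() hook into the other system:

from prometheus_client.core import GaugeMetricFamily, REGISTRY

def fetch_legacy_stats():
    # Hypothetical: query whatever the existing instrumentation system exposes.
    return {'orders': 17, 'emails': 3}

class LegacyStatsCollector(object):
    def collect(self):
        # Called at scrape time, so the exposed values are always fresh.
        g = GaugeMetricFamily('legacy_queue_length',
                              'Queue lengths from the legacy stats system.',
                              labels=['queue'])
        for queue, length in fetch_legacy_stats().items():
            g.add_metric([queue], length)
        yield g

REGISTRY.register(LegacyStatsCollector())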
27. Things to have
• Client and server qps/errors/latency (see the sketch after this list)
• Every log message should be a metric
• Every failure should be a metric
• Threadpool/queue size, in progress, latency
• Business logic inputs and outputs
• Data sizes in/out
• Process cpu/ram/language internals (e.g. GC)
• Blackbox and end-to-end monitoring
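A minimal Python sketch of the first couple of items above (metric names are illustrative):

import random, time
from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter('myapp_requests_total', 'Requests handled.')
FAILURES = Counter('myapp_request_failures_total', 'Requests that failed.')
LATENCY = Histogram('myapp_request_duration_seconds',
                    'Time spent handling a request, in seconds.')

@LATENCY.time()
def handle_request():
    REQUESTS.inc()
    try:
        time.sleep(random.random() / 100)  # stand-in for real work
    except Exception:
        FAILURES.inc()
        raise

if __name__ == '__main__':
    start_http_server(8000)  # metrics served on http://localhost:8000/metrics
    while True:
        handle_request()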
28. Batch/Offline Processing Metrics
• Last time it succeeded
• Records processed/throughput
• Duration of major batch stages
• Heartbeats for end-to-end testing
Use the PushGateway for ephemeral jobs
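For an ephemeral batch job, a sketch along the lines of the PushGateway documentation (addresses and names are placeholders):

from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

registry = CollectorRegistry()
last_success = Gauge('mybatch_last_success_unixtime',
                     'Unixtime the batch job last completed successfully.',
                     registry=registry)

def run_the_batch_job():
    pass  # hypothetical: the actual work goes here

run_the_batch_job()
last_success.set_to_current_time()
push_to_gateway('pushgateway.example.com:9091', job='mybatch', registry=registry)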
29. Brian’s Pet Peeve #4
Wrapping instrumentation libraries to make them “simpler”
These tend to confuse abstractions, encourage bad practices and
make it difficult to write correct and usable instrumentation.
e.g. Prometheus values are doubles; if you only allow ints
then the end user has to do math to convert back to seconds.
30. Speaking of Correct Instrumentation
It’s better to have the math done in the server, not the client.
Many instrumentation systems use exponentially decaying averages.
Do you really want to do calculus during an outage?
Prometheus has monotonic counters.
Races and missed scrapes don’t lose data.
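For example, a counter only ever goes up in the client, and the server derives the rate at query time (illustrative metric name):

# The client just calls inc(); the calculus happens in the server.
# A missed scrape only widens the window the rate is computed over,
# instead of losing decayed state held in the client.
rate(myapp_request_failures_total[5m])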
31. Integrations
A more powerful data model needs integrations and
instrumentation written to take advantage of it.
Machine (Node Exporter), HAProxy, CloudWatch, StatsD,
collectd, JMX, Mesos, Consul, MySQL
Direct instrumentation: cAdvisor, etcd
32. Dashboards
• Promdash: Ruby on Rails web app
• Console templates: More power for those who like
checking things in
• Expression browser: Ad-hoc queries
• JSON interface: Roll your own
35. Dashboards
Goal: Make it easy to logically drill down into a problem.
Most services form a graph.
Make it easy to go from a console about one service, see
which of its backends is the problem, and repeat.
36. Brian’s Pet Peeve #5
Dashboard anti-patterns:
• Graph of a hundred plots
• Page of a thousand graphs
• Consoles that their creator can barely understand
• “Put it on a console somewhere”
37. Dashboard Guidelines
• Don’t put every possible metric on the dashboard
• Focus on the top few metrics, based on the most likely
failure modes and things you’ll want to know
• No more than 5 graphs per console, 5 plots per graph
• Have units, y-labels, legends and descriptions
• Split out or trim if dashboards are getting too complex
• It’s difficult for a dashboard to serve two masters
• Dashboards are not for alerting
38. Alerting
Alertmanager aggregates alerts from Prometheus servers
Supports notifications to PagerDuty, email and Pushover
Best practices:
• Alert on symptoms not causes
• Have a way to deal with non-critical alerts
39. What to Alert On
Online Serving: Overall latency, errors
Offline Processing: Propagation/processing delay
Batch Jobs: When it last succeeded
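Some illustrative alert expressions along those lines (thresholds and metric names are examples only):

# Online serving: alert on the user-visible symptom, the error ratio.
sum(rate(http_responses_total{response_code=~"5.."}[5m]))
  / sum(rate(http_responses_total[5m])) > 0.01

# Batch jobs: alert if the job hasn't succeeded recently.
time() - mybatch_last_success_unixtime > 2 * 3600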
40. The Live Demo
please work please work please work please work please work please work please work please work please work please work please work
41. The Future
Many features on roadmap:
• Service discovery
• Federation
• Long term storage
• HA Alertmanager
• More exporters, client libraries and integrations
42. Final Word
Systems Monitoring is your first port of call in an
emergency; keep it working without needing lots of effort.
Prometheus is awesome, but that can lead to non-critical data
taking up lots of management effort, crowding out critical metrics.
At some point, you have to move non-critical things to a generic
data processing system.
43. Finaler Word
How do you know your monitoring system is good?
• When you have superb monitoring for everything?
• When it causes a HDD to fail?
• When it finds two bugs in Go’s DNS library?
44. Finaler Word
How do you know your monitoring system is good?
• When you have superb monitoring for everything?
• When it causes a HDD to fail?
• When it finds two bugs in Go’s DNS library?
No, when the unittests find a bug in your filesystem!
45. Try it out!
http://www.boxever.com/tag/monitoring has step-by-step instructions on
monitoring:
• Machines
• Cassandra
• HAProxy
• Python batch jobs
• Java applications
Problems?
We’re on #prometheus on Freenode