In this talk, we'll explain our journey from having near-zero monitoring to having all of our infrastructure monitored with the necessary metrics and alerts.
We will share with the audience some of the mistakes we did and what lessons we have learned. We currently have around 200 instances monitored with a comfortable cost-effective in-house monitoring stack based on Prometheus.
We want to demonstrate that you don't need to have a big fleet to embrace Prometheus and that it is a non-expensive solution for monitoring.
----------
ShuttleCloud is a small startup specialized in email and contacts migrations. We developed a reliable migration platform in high availability used by clients like Gmail, GContacts and Comcast.
For example, Gmail alone has imported data for 3 million users with our API and we process hundreds of terabytes every month.
-------------
Follow us on Twitter:
@ShuttleCloud: https://twitter.com/ShuttleCloud
@ShuttleCloudEng: https://twitter.com/ShuttleCloudEng
ShuttleCloud.com
4. ShuttleFacts
• Gmail: 3 million users with our API
• High availability HA SLA 99.5%
• 6 TB/h
• +18k migrations per day
• ~30 million emails per day
• ~3 million contacts per day
• 247 providers around the world
5. What to expect from this presentation
• Disclaimers
• In the beginning…
• A new dawn
• Middle Ages
• Modern History
• Back to the Future
18. Why Prometheus?
• Metrics have labels - flexibility (can be added/changed)
• No need of external services (i.e. Sensu with RabbitMQ)
• Service Discovery from our DNS
• Easy to install and test
19. Some first steps
• Bronze Age: The targets were statically fed using Ansible
• Iron Age: DNS service discovery
• Operations Metrics from node_exporter
• Others using textfile in node_exporter:
20. Some first alerts
• Only Operations Alerts:
• Hard drive usage
• InstanceDown
• Absolute thresholds HD 85% capacity — Send email
HD 90% capacity — Page
30. HA
• Publisher code is not ready to have two instances of itself
• Publisher operations are not atomic
Prometheus
Publisher
Push
Gateway
Prometheus
Publisher
31. Clusterize
• Clustering solution because of other services in the same
ecosystem
Prometheus
Publisher
Push
Gateway
Prometheus
Publisher
42. Alert Manager
• 0.0.4 0.1.1(now 0.5)
• Alerts as a condition-based tree vs condition-based list
• Similar alerts are grouped when notified
• Pagerduty integration improved
49. Blackbox Exporter
• Metrics on certificates expiring date
• Kudos to:
(http://www.robustperception.io/get-alerted-before-your-ssl-certificates-expire/)