ZMON is Zalando's open-source (OS) monitoring system for the cloud, used for the company's own operations. It provides:
1) Flexible checks and alerts written in Python that integrate via REST APIs and OAuth.
2) Team dashboards and alert inheritance for great team collaboration.
3) Fast scaling metrics using Redis, KairosDB and Grafana for displaying historical data.
4) Notifications via email and mobile apps along with full authentication of all endpoints.
Atmosphere 2016 - Jan Mussler - ZMON: Zalando's OS approach to monitoring in the cloud and DCs
1. ZMON - OS monitoring in the cloud
Atmosphere 2016 | Krakow 17.5.2016 | jan.mussler@zalando.de | @JanMussler
2. 15 countries
3 fulfillment centers
18+ million active customers
3.0+ billion € revenue
135+ million visits per month
1.000+ employees in tech
Europe's Leading Fashion Platform
Visit us: tech.zalando.com
18. Plan B Deployment - Multi Region Setup (JWT issue/verification)
[Diagram: two regions, each with C* nodes, a Provider (Java) behind an ELB, and a Tokeninfo (Go) service behind an ELB]
19. Will create “entities” to describe deployment
ELBs, ASGs, Application, instances,...
Crawls AWS API every 60 sec to update
ZMON AWS Agent - Auto Discovery
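The auto-discovery idea on this slide can be sketched in a few lines of Python. This is only an illustration, not the agent's actual code: the input format, the `build_entities` and `sync` helpers, and the field names other than the general notion of typed entities are assumptions. A real agent would fill `discovered` from the AWS API on each 60-second crawl.

```python
from typing import Dict, List, Tuple


def build_entities(discovered: Dict[str, List[dict]], region: str) -> List[dict]:
    """Turn raw discovery results (hypothetical format) into flat entity dicts."""
    entities = []
    for kind, resources in discovered.items():
        for res in resources:
            # Start from the resource attributes, then set the identifying fields.
            entity = {k: v for k, v in res.items() if k != "name"}
            entity.update({
                "id": f"{kind}-{res['name']}[{region}]",
                "type": kind,
                "region": region,
            })
            entities.append(entity)
    return entities


def sync(previous: List[dict], current: List[dict]) -> Tuple[List[dict], List[str]]:
    """Compare two crawls: return entities to add or update, and ids to delete."""
    prev = {e["id"]: e for e in previous}
    cur = {e["id"]: e for e in current}
    changed = [e for entity_id, e in cur.items() if prev.get(entity_id) != e]
    removed = sorted(set(prev) - set(cur))
    return changed, removed
```

Diffing consecutive crawls keeps the update traffic small: only new, changed, or vanished resources need to be reported.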
25. HTTP requests reading JSON application metrics
Read JMX data via Jolokia/HTTP for Cassandra
Read Prometheus Node data for EC2 metrics
CloudWatch() queries for ELB metrics
Scalyr API queries for application logs
Check commands used so far
32. ● select subset of entities
● executes Python expression
○ powerful using eval with custom context
○ Builtins: HTTP, PostgreSQL, MySQL, CloudWatch,
Redis, SNMP/NRPE, tcp, SOAP, Scalyr, ES, ...
● returns "value" object
○ Before long, every check returned dicts
Checks
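A minimal sketch of the mechanism described on this slide: select a subset of entities, then evaluate a Python expression with `eval` in a custom context. The helper names (`select_entities`, `run_check`, `make_http`) and the stubbed `http` builtin are assumptions made for illustration; they are not ZMON's implementation.

```python
def select_entities(entities, filters):
    """Keep entities that match at least one filter (all keys of that filter equal)."""
    return [
        e for e in entities
        if any(all(e.get(k) == v for k, v in f.items()) for f in filters)
    ]


class FakeHttpResponse:
    """Stand-in for an HTTP result so the sketch runs without a network."""

    def __init__(self, payload):
        self._payload = payload

    def json(self):
        return self._payload


def make_http(entity):
    """Build a stubbed `http` builtin bound to one entity (hypothetical)."""
    def http(path):
        return FakeHttpResponse({"url": entity["host"] + path, "status": "UP"})
    return http


def run_check(command, entity, builtins):
    """Evaluate the check expression with only the supplied names in scope."""
    # Emptying __builtins__ limits the names available to the expression;
    # it is not a security sandbox.
    scope = {"__builtins__": {}, "entity": entity}
    scope.update(builtins)
    return eval(command, scope)
```

For example, `run_check('http("/health").json()', entity, {"http": make_http(entity)})` returns the dict produced by the stub, which matches the slide's point that checks return a "value" object, typically a dict.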
33. REST API to update or use web front end
zmon check-definitions update select-1-check.yaml
Managing checks
name: "Select 1"
owning_team: "Team ZMON"
command: |
  sql().execute("select 1 as a").results()
entities:
  - type: postgresql
interval: 15
description: "Test connection to PostgreSQL"
select-1-check.yaml
35. Trial Run - Quick feedback and easier development
37. ● Evaluated against a check's value; each alert is bound to a single check
● Defines the team and the responsible team
● Allows inheritance from another alert
● Evaluates a Python expression yielding True/False
● No "WARNING" state, no "UNKNOWN" state
● Priorities (color) and tags
Alerts
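The alert model above can be sketched as a boolean expression evaluated over a check's value. The `evaluate_alert` helper and the parameter mechanism are assumptions for illustration only; the slide states only that the expression yields True or False and that there is no WARNING or UNKNOWN state.

```python
def evaluate_alert(condition, value, params=None):
    """Evaluate an alert condition against a check value; the result is strictly a bool."""
    # `value` is the check result; `params` are optional named thresholds (hypothetical).
    scope = {"__builtins__": {}, "value": value}
    scope.update(params or {})
    # bool() collapses any truthy or falsy result into the two possible alert states.
    return bool(eval(condition, scope))
```

With the "Select 1" check, whose value would look like `{"a": 1}`, the condition `value['a'] != 1` evaluates to `False`, so the alert stays inactive.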
40. Downtimes
● Set or schedule downtimes using the UI
● Use API to automate downtimes, e.g. in deployment tool
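As a rough sketch of what a deployment tool could do before rolling out a new version: build a downtime window for the affected entities and check whether a given entity is currently covered. The payload fields and both helper names are hypothetical; the slide does not show the API's actual request format.

```python
def schedule_downtime(entity_ids, start, duration_s, comment):
    """Build a downtime record (hypothetical fields) covering [start, start + duration_s)."""
    return {
        "entities": list(entity_ids),
        "start_time": start,
        "end_time": start + duration_s,
        "comment": comment,
    }


def in_downtime(downtimes, entity_id, now):
    """True if any downtime covers this entity at time `now` (Unix seconds)."""
    return any(
        entity_id in d["entities"] and d["start_time"] <= now < d["end_time"]
        for d in downtimes
    )
```

A deployment script would create the record just before restarting instances, so that alerts firing during the restart are not treated as incidents.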
41. Anyone can add alerts to checks
Alerts are owned by a team
Monitor application boundaries/dependencies
Make use of inheritance to customize
Sharing and reuse of alerts and checks
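One way to picture alert inheritance is as a merge in which a child alert overrides only the fields it sets and takes everything else from its parent. The `parent_id` field and the `resolve_alert` helper below are assumptions for illustration, not ZMON's data model.

```python
def resolve_alert(alert_id, alerts, seen=()):
    """Return the effective alert: parent fields first, then the child's overrides."""
    if alert_id in seen:
        raise ValueError(f"inheritance cycle at alert {alert_id}")
    alert = alerts[alert_id]
    parent_id = alert.get("parent_id")
    if parent_id is not None:
        merged = resolve_alert(parent_id, alerts, seen + (alert_id,))
    else:
        merged = {}
    # Fields set on the child win over inherited ones.
    merged.update({k: v for k, v in alert.items() if k != "parent_id"})
    return merged
```

In this sketch, a team that wants a different threshold defines a child alert with only its own `team` and `condition`; the name and priority come from the shared parent.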
44. ZMON in AWS / Multi DC Setup
[Diagram: Team "Foo" (*.foo.example.org) and Team "Bar" (*.bar.example.org), with EC2 instances, ELBs, two ZMON Appliances, and a ZMON Data Service]
45. ● Scheduler supports queue filters by entity
○ e.g. {"dc":"dc1"} vs {"dc":"dc2"} queue filters
● Scheduler can apply base filter
○ only handles entities with {"dc":"dc1"}
● Worker can report home using:
○ Redis (we use this across DCs)
○ HTTPS (AWS->DC)
Multi DC / Zone deployment possible
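The two filter types on this slide can be sketched as follows. This is an illustration under assumptions: the `route` function, the queue names, and the fallback queue are hypothetical; only the filter shapes such as `{"dc":"dc1"}` come from the slide.

```python
def matches(entity, flt):
    """An entity matches when every key in the filter has the same value on the entity."""
    return all(entity.get(k) == v for k, v in flt.items())


def route(entity, base_filter, queue_filters, default_queue="default"):
    """Pick the queue for an entity, or None if this scheduler should not handle it."""
    # Base filter: a scheduler restricted to {"dc": "dc1"} skips all other entities.
    if not matches(entity, base_filter):
        return None
    # Queue filters: the first matching filter decides which queue gets the task.
    for queue, flt in queue_filters.items():
        if matches(entity, flt):
            return queue
    return default_queue
```

With queue filters, one scheduler feeds per-DC queues that workers in each data center consume; with a base filter, each data center runs its own scheduler that only sees its local entities.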
51. Spring Boot (extending metrics)
https://github.com/zalando/zmon-actuator
Python (Swagger first on Flask)
https://github.com/zalando/connexion
Clojure (Swagger first)
https://github.com/zalando-stups/friboo/
Example libraries and framework support ...