ZMON - OS monitoring in the cloud
Atmosphere 2016 | Krakow 17.5.2016 | jan.mussler@zalando.de | @JanMussler
15 countries
3 fulfillment centers
18+ million active customers
3.0+ billion € revenue
135+ million visits per month
1,000+ employees in tech
Europe's Leading Fashion Platform
Visit us: tech.zalando.com
Zalando’s Technology History
RADICAL AGILITY
AUTONOMY
➊ One AWS account per Team
➋ Deployment with Docker
➌ Managed SSH Access
➍ REST/OAuth 2.0 mandatory
➎ Traceability of changes
IN A NUTSHELL
STUPS
AWS
DEPLOYMENT
Senza CLI
Deploy Tool
Pier One
Docker Registry
docker pull
docker push
Taupage
AMI
ISOLATED AWS ACCOUNTS
[Diagram: Internet in front of two isolated AWS accounts, Team ABC (*.abc.example.org) and Team XYZ (*.xyz.example.org), each with an ELB and EC2 instances]
ZMON
Flexible and extendable: Checks & Alerts in Python
Integrate: REST APIs, OAuth 2.0, AWS Auto Discovery
Fully configurable via UI / API: no restarts required!
Great for teams: team dashboards, alerts inheritance
Fast/scaling metrics: Redis, KairosDB + Grafana 2
Hackweek 2015 - iOS app and Android app ;-)
ZMON - Highlights ;-)
Display historic data using Grafana 2
Notifications plus iOS and Android App
E-Mail
Full authentication for all endpoints
OAuth 2.0 login flow (e.g. via GitHub login)
“TV Tokens” for “read-only” dashboard login
Grafana 2 bundled and API implemented
● ZMON stores dashboards incl. tags/stars
● KairosDB proxy
● Elasticsearch proxy (in progress)
ZMON Controller -> UI + REST API
Example
Plan B Deployment - Multi Region Setup (JWT issue/verification)
[Diagram: in each region, a Provider (Java) and a Tokeninfo (Go) service, each behind an ELB, plus C* (Cassandra) nodes]
Will create “entities” to describe deployment
ELBs, ASGs, Application, instances,...
Crawls AWS API every 60 sec to update
ZMON AWS Agent - Auto Discovery
➜ ~ zmon entities get "planb-tokeninfo-cd44-oFM6x[aws:999:eu-west-1]"
id: planb-tokeninfo-cd44-oFM6x[aws:999:eu-west-1]
type: instance
application_id: planb-tokeninfo
host: 172.31.169.6
infrastructure_account: aws:999
instance_type: c4.xlarge
ip: 172.31.169.6
ports: { '9020': 9020, '9021': 9021 }
region: eu-west-1
source: registry.opensource.zalan.do/stups/planb-tokeninfo:cd44
stack_name: planb-tokeninfo-eu-west-1
stack_version: cd44
Example Instance Entity
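The discovery step can be pictured as a function from an instance description to an entity like the one above. The following is a minimal sketch, not the agent's actual code: the input dict and its keys (`name`, `private_ip`, ...) and the function name `build_instance_entity` are hypothetical, and the real agent obtains this data from the AWS API.

```python
def build_instance_entity(instance, account, region):
    """Map a (hypothetical) instance description to a ZMON-style entity dict."""
    # The id format mirrors the example entity: name[account:region]
    entity_id = "{}[{}:{}]".format(instance["name"], account, region)
    return {
        "id": entity_id,
        "type": "instance",
        "application_id": instance["application_id"],
        "host": instance["private_ip"],
        "ip": instance["private_ip"],
        "infrastructure_account": account,
        "region": region,
        "instance_type": instance["instance_type"],
        # Ports are keyed by their string form, as in the example entity
        "ports": {str(p): p for p in instance["ports"]},
        "stack_name": instance["stack_name"],
        "stack_version": instance["stack_version"],
    }


example = build_instance_entity(
    {
        "name": "planb-tokeninfo-cd44-oFM6x",
        "application_id": "planb-tokeninfo",
        "private_ip": "172.31.169.6",
        "instance_type": "c4.xlarge",
        "ports": [9020, 9021],
        "stack_name": "planb-tokeninfo-eu-west-1",
        "stack_version": "cd44",
    },
    account="aws:999",
    region="eu-west-1",
)
print(example["id"])
```

Keeping the mapping free of network calls makes it easy to test; a real agent would wrap it in a loop that re-runs on each crawl interval.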
Instance Metrics
● Memory usage
● Disk space usage
● CPU usage
● Application logs
● Application metrics
Monitoring Plan-B instances on AWS
Scalyr Agent: log shipping
Prometheus Node Agent: :9100/metrics
Taupage AMI (Ubuntu base)
Application Container: Go / Spring Boot / Cassandra
Docker runtime: :8080 -> app, :7979 -> metrics
Jolokia Request Example
Check Results
Alert on application metrics
HTTP requests reading JSON application metrics
Read JMX data via Jolokia/HTTP for Cassandra
Read Prometheus Node data for EC2 metrics
CloudWatch() queries for ELB metrics
Scalyr API queries for application logs
Check commands used so far
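Reading Prometheus node data comes down to parsing the text format served on `:9100/metrics`. The sketch below is a simplified, standalone parser and makes no claim about how ZMON's own HTTP builtin handles this; it assumes no trailing timestamps, and the metric names in `sample` are illustrative.

```python
def parse_prometheus_text(body):
    """Parse Prometheus text exposition lines into {series: float}.

    Simplified: assumes no trailing timestamps and keeps any label set
    as part of the series key.
    """
    metrics = {}
    for line in body.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # Split from the right, since label values may contain spaces
        series, value = line.rsplit(None, 1)
        metrics[series] = float(value)
    return metrics


sample = """\
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.42
node_filesystem_free{mountpoint="/"} 1.5e+09
"""
parsed = parse_prometheus_text(sample)
print(parsed["node_load1"])
```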
Annotated Metric Data in Grafana
Entities
● hosts, databases, applications, instances ...
● generic key value object
● 10000+ entities in our deployment
Entities
{
"id": "node01:8080",
"type": "instance",
"host": "node01",
"ports": {"8080":8080,"8181":8181},
"application_id": "zmon",
"application_version": "0.1.0",
"dc":"dc1"
}
Entity "node01:8080"
Entity Service (part of controller)
id: localhost:5432
type: postgres
host: localhost
port: 5432
shards:
local_zmon_db: "localhost:5432/local_zmon_db"
local-postgres.yaml
Integrated easy-to-use entity store with REST API
Build your own discovery agent (K8S, …)
>zmon entities push local-postgres.yaml
Checks
● select subset of entities
● executes Python expression
○ powerful using eval with custom context
○ Builtins: HTTP, PostgreSQL, MySQL, CloudWatch,
Redis, SNMP/NRPE, tcp, SOAP, Scalyr, ES, ...
● returns "value" object
○ In practice, almost every check soon returned dicts
Checks
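The check model described here (select entities, then evaluate a Python expression with a custom context) can be sketched as follows. This is an illustration of the idea only, not the ZMON worker: `run_check` and `fake_http` are hypothetical names, and the canned HTTP response stands in for a real builtin.

```python
def run_check(command, entity, builtins):
    """Evaluate a check command for one entity in a custom namespace."""
    context = {"__builtins__": {}, "entity": entity}
    # Bind each builtin factory to the entity so the command can call it bare
    for name, factory in builtins.items():
        context[name] = factory(entity)
    return eval(command, context)


def fake_http(entity):
    """Hypothetical stand-in for an HTTP builtin; returns canned JSON."""
    class Response:
        def json(self):
            return {"status": "UP", "host": entity["host"]}
    return lambda path: Response()


entities = [
    {"id": "node01:8080", "type": "instance", "host": "node01"},
    {"id": "localhost:5432", "type": "postgres", "host": "localhost"},
]
# Step 1: select the subset of entities the check applies to
selected = [e for e in entities if e["type"] == "instance"]
# Step 2: evaluate the command once per selected entity
results = {
    e["id"]: run_check("http('/health').json()", e, {"http": fake_http})
    for e in selected
}
print(results)
```

Emptying `__builtins__` only narrows the namespace; `eval` is not a security sandbox, so a production system needs further isolation.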
REST API to update or use web front end
zmon check-definitions update select-1-check.yaml
Managing checks
name: "Select 1"
owning_team: "Team ZMON"
command: |
sql().execute("select 1 as a").results()
entities:
- type: postgresql
interval: 15
description: "Test connection to PostgreSQL"
select-1-check.yaml
Trial Run - Quick feedback and easier development
Alerts
● Evaluated against a check’s value; each alert is bound to a single check
● Defines team and responsible team
● Can inherit from another alert
● Evaluates a Python expression yielding True/False
● No "WARNING" state, no "UNKNOWN" state
● Priorities (color) and tags
Alerts
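Alert inheritance and the True/False evaluation can be sketched in a few lines. The field names used here (`parent_id`, `condition`, `priority`) and the merge rule (child fields override the parent's) are assumptions for illustration, not ZMON's documented schema.

```python
def resolve_alert(alert, alerts_by_id):
    """Merge an alert with its parent chain; child fields override the parent's."""
    parent_id = alert.get("parent_id")
    if parent_id is None:
        return dict(alert)
    merged = resolve_alert(alerts_by_id[parent_id], alerts_by_id)
    merged.update({k: v for k, v in alert.items() if v is not None})
    return merged


def is_alerting(alert, value):
    """Evaluate the alert condition against a check value; always a bool."""
    context = {"__builtins__": {}, "value": value}
    return bool(eval(alert["condition"], context))


alerts = {
    1: {"id": 1, "name": "High latency", "condition": "value['median'] > 1000",
        "team": "Team ZMON", "priority": 2},
    # The child keeps name and priority, but changes team and threshold
    2: {"id": 2, "parent_id": 1, "condition": "value['median'] > 2000",
        "team": "Team Foo"},
}
child = resolve_alert(alerts[2], alerts)
print(child["name"], is_alerting(child, {"median": 1161}))
```

Because the result is forced to a bool, there is no room for a "WARNING" or "UNKNOWN" state: an alert is either active or it is not.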
Downtimes
● Set or schedule downtimes using the UI
● Use API to automate downtimes, e.g. in deployment tool
Anyone can add alerts to checks
Alerts are owned by team
Monitor application boundaries/dependencies
Make use of inheritance to customize
Sharing and reuse of alerts and checks
Deployment
ZMON Core + UI + KairosDB
[Architecture diagram. Components: Scheduler (JVM), Workers (Python), Redis, KairosDB (Java), Cassandra, Controller (Java), PostgreSQL, Frontend (AngularJS), CLI (Python). Labels: Queue/State, Check/Alert definition, Entity data, Metric Cache]
ZMON in AWS / Multi DC Setup
[Diagram: Team "Foo" (*.foo.example.org) and Team "Bar" (*.bar.example.org), each with EC2 instances, an ELB and a ZMON Appliance, plus a ZMON Data Service]
● Scheduler supports queue filters by entity
○ e.g. {"dc":"dc1"} vs {"dc":"dc2"} queue filters
● Scheduler can apply base filter
○ only handles entities with {"dc":"dc1"}
● Worker can report home using:
○ Redis (we use this across DCs)
○ HTTPS (AWS->DC)
Multi DC / Zone deployment possible
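The two filter kinds can be sketched as plain dictionary matching. This is a hedged illustration of the routing idea, not the scheduler's implementation: the function names, the queue names and the first-match rule are assumptions.

```python
def matches(entity, entity_filter):
    """True if every key/value pair of the filter is present in the entity."""
    return all(entity.get(k) == v for k, v in entity_filter.items())


def route(entities, base_filter, queue_filters, default_queue="default"):
    """Assign entities passing the base filter to the first matching queue."""
    assignment = {}
    for entity in entities:
        if not matches(entity, base_filter):
            continue  # this scheduler does not handle the entity at all
        queue = default_queue
        for name, entity_filter in queue_filters.items():
            if matches(entity, entity_filter):
                queue = name
                break
        assignment[entity["id"]] = queue
    return assignment


entities = [
    {"id": "node01:8080", "type": "instance", "dc": "dc1"},
    {"id": "node02:8080", "type": "instance", "dc": "dc2"},
]
# One scheduler for everything, with one queue per data center
by_queue = route(entities, {}, {"queue-dc1": {"dc": "dc1"},
                                "queue-dc2": {"dc": "dc2"}})
# One scheduler per data center: the base filter drops foreign entities
dc1_only = route(entities, {"dc": "dc1"}, {})
print(by_queue, dc1_only)
```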
Microservices
Expose your data / Convention on key names/structure
{
"zmon.response.200.GET.checks.all-active-check-definitions.count": 10,
"zmon.response.200.GET.checks.all-active-check-definitions.fifteenMinuteRate": 0.18071,
"zmon.response.200.GET.checks.all-active-check-definitions.fiveMinuteRate": 0.15181,
"zmon.response.200.GET.checks.all-active-check-definitions.oneMinuteRate": 0.10512,
"zmon.response.200.GET.checks.all-active-check-definitions.75thPercentile": 1173,
"zmon.response.200.GET.checks.all-active-check-definitions.95thPercentile": 1233,
"zmon.response.200.GET.checks.all-active-check-definitions.999thPercentile": 1282,
"zmon.response.200.GET.checks.all-active-check-definitions.99thPercentile": 1282,
"zmon.response.200.GET.checks.all-active-check-definitions.max": 1282,
"zmon.response.200.GET.checks.all-active-check-definitions.median": 1161,
"zmon.response.200.GET.checks.all-active-check-definitions.min": 1114
}
Application metrics
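A naming convention like the one in the sample makes the flat keys machine-readable. The sketch below assumes the layout `zmon.response.<status>.<method>.<path segments>.<stat>`, which is read off the sample above rather than taken from a specification; keys whose path segments contain dots would need extra handling.

```python
from collections import defaultdict


def parse_metric_key(key):
    """Split 'zmon.response.<status>.<method>.<path...>.<stat>' into fields."""
    parts = key.split(".")
    if len(parts) < 6 or parts[:2] != ["zmon", "response"]:
        raise ValueError("unexpected metric key: " + key)
    return {
        "status": int(parts[2]),
        "method": parts[3],
        "path": "/".join(parts[4:-1]),
        "stat": parts[-1],
    }


def group_by_endpoint(metrics):
    """Group flat metrics into {(method, path, status): {stat: value}}."""
    grouped = defaultdict(dict)
    for key, value in metrics.items():
        fields = parse_metric_key(key)
        endpoint = (fields["method"], fields["path"], fields["status"])
        grouped[endpoint][fields["stat"]] = value
    return dict(grouped)


flat = {
    "zmon.response.200.GET.checks.all-active-check-definitions.count": 10,
    "zmon.response.200.GET.checks.all-active-check-definitions.median": 1161,
}
grouped = group_by_endpoint(flat)
print(grouped)
```

Grouping by endpoint gives a check one dict per route, which is convenient for alert conditions that compare several statistics of the same endpoint.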
Continued ...
Spring Boot (extending metrics)
https://github.com/zalando/zmon-actuator
Python (Swagger first on Flask)
https://github.com/zalando/connexion
Clojure (Swagger first)
https://github.com/zalando-stups/friboo/
Example libraries and framework support ...
Demo:
https://demo.zmon.io
ZMON on Github:
https://github.com/zalando/zmon
Documentation:
https://docs.zmon.io
Zalando Tech:
https://tech.zalando.com

Atmosphere 2016 - Jan Mussler - ZMON: Zalando's OS approach to monitoring in the cloud and DCs

  • 1. ZMON - OS monitoring in the cloud Atmosphere 2016 | Krakow 17.5.2016 | jan.mussler@zalando.de | @JanMussler
  • 2. Europe's Leading Fashion Platform: 15 countries; 3 fulfillment centers; 18+ million active customers; 3.0+ billion € revenue; 135+ million visits per month; 1.000+ employees in tech. Visit us: tech.zalando.com
  • 5. STUPS in a nutshell: ➊ One AWS account per Team ➋ Deployment with Docker ➌ Managed SSH Access ➍ REST/OAuth 2.0 mandatory ➎ Traceability of changes
  • 6. AWS deployment [diagram: Senza CLI (deploy tool); Pier One (Docker registry, docker push / docker pull); Taupage AMI]
  • 7. Isolated AWS accounts [diagram: Internet; Team ABC (*.abc.example.org) and Team XYZ (*.xyz.example.org); ELB and EC2 instances]
  • 9. ZMON - Highlights ;-)
      Flexible and extendable: Checks & Alerts in Python
      Integrate: REST APIs, OAuth2, AWS Auto Discovery
      Fully configurable via UI / API: no restarts required!
      Great for teams: team dashboards, alerts inheritance
      Fast/scaling metrics: Redis, KairosDB + Grafana2
      Hackweek 2015 - iOS app and Android app ;-)
  • 14. Display historic data using Grafana 2
  • 15. Notifications plus iOS and Android App E-Mail
  • 16. ZMON Controller -> UI + REST API
      Full authentication for all endpoints
      OAuth2 login flow (e.g. via GitHub login)
      “TV Tokens” for “read-only” dashboard login
      Grafana 2 bundled and API implemented
      ● ZMON stores dashboards incl. tags/stars
      ● KairosDB proxy
      ● ElasticSearch proxy (in progress)
  • 18. Plan B Deployment - Multi Region Setup (JWT issue/verification) [diagram labels: ELB, Provider (Java), Tokeninfo (Go), C* Nodes]
  • 19. ZMON AWS Agent - Auto Discovery
      Will create “entities” to describe deployment
      ELBs, ASGs, Application, instances,...
      Crawls AWS API every 60 sec to update
  • 20. Example Instance Entity
      ➜ ~ zmon entities get "planb-tokeninfo-cd44-oFM6x[aws:999:eu-west-1]"
      id: planb-tokeninfo-cd44-oFM6x[aws:999:eu-west-1]
      type: instance
      application_id: planb-tokeninfo
      host: 172.31.169.6
      infrastructure_account: aws:999
      instance_type: c4.xlarge
      ip: 172.31.169.6
      ports: { '9020': 9020, '9021': 9021 }
      region: eu-west-1
      source: registry.opensource.zalan.do/stups/planb-tokeninfo:cd44
      stack_name: planb-tokeninfo-eu-west-1
      stack_version: cd44
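The ID in the example entity appears to carry the infrastructure account and region in square brackets. A minimal sketch of splitting such an ID follows; the `name[provider:account:region]` layout is inferred from this single example, and `split_entity_id` is a hypothetical helper, not part of the ZMON CLI.

```python
import re

# Hypothetical pattern: the "name[provider:account:region]" layout is an
# assumption based on the one example ID on the slide.
ENTITY_ID_PATTERN = re.compile(
    r"^(?P<name>[^\[\]]+)\[(?P<account>[^:\]]+:[^:\]]+):(?P<region>[^\]]+)\]$"
)


def split_entity_id(entity_id):
    """Split an instance entity ID into name, infrastructure account and region."""
    match = ENTITY_ID_PATTERN.match(entity_id)
    if match is None:
        raise ValueError("unexpected entity id layout: %r" % entity_id)
    return match.groupdict()


parts = split_entity_id("planb-tokeninfo-cd44-oFM6x[aws:999:eu-west-1]")
print(parts["name"], parts["account"], parts["region"])
```

The extracted account and region line up with the `infrastructure_account` and `region` fields of the entity shown on the slide.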
  • 21. Monitoring Plan-B instances on AWS
      Instance Metrics
      ● Memory usage
      ● Disk space usage
      ● CPU usage
      ● Application logs
      ● Application metrics
      [diagram labels: Taupage AMI (Ubuntu base); Scalyr Agent (log shipping); Prometheus Node Agent (:9100/metrics); Docker run time; Application Container (Go / Spring Boot / Cassandra); :8080 -> app; :7979 -> metrics]
  • 25. Check commands used so far
      HTTP requests reading JSON application metrics
      Read JMX data via Jolokia/HTTP for Cassandra
      Read Prometheus Node data for EC2 metrics
      CloudWatch() queries for ELB metrics
      Scalyr API queries for application logs
  • 26. Annotated Metric Data in Grafana
  • 27. Annotated Metric Data in Grafana
  • 29. Entities
      ● hosts, databases, applications, instances ...
      ● generic key value object
      ● 10000+ entities in our deployment
      Entity "node01:8080":
      {
        "id": "node01:8080",
        "type": "instance",
        "host": "node01",
        "ports": {"8080":8080,"8181":8181},
        "application_id": "zmon",
        "application_version": "0.1.0",
        "dc":"dc1"
      }
  • 30. Entity Service (part of controller)
      Integrated easy-to-use entity store with REST API
      Build your own discovery agent (K8S, …)
      local-postgres.yaml:
      id: localhost:5432
      type: postgres
      host: localhost
      port: 5432
      shards:
        local_zmon_db: "localhost:5432/local_zmon_db"
      >zmon entities push local-postgres.yaml
  • 32. Checks
      ● select subset of entities
      ● executes Python expression
        ○ powerful using eval with custom context
        ○ Builtins: HTTP, PostgreSQL, MySQL, CloudWatch, Redis, SNMP/NRPE, tcp, SOAP, Scalyr, ES, ...
      ● returns "value" object
        ○ Quickly, every check returned "dicts"
  • 33. Managing checks
      REST API to update or use web front end
      zmon check-definitions update select-1-check.yaml
      select-1-check.yaml:
      name: "Select 1"
      owning_team: "Team ZMON"
      command: |
        sql().execute("select 1 as a").results()
      entities:
        - type: postgresql
      interval: 15
      description: "Test connection to PostgreSQL"
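The "eval with custom context" idea from the Checks slide can be sketched roughly as follows. This is an illustration only, not ZMON's worker code: `fake_http` and `run_check` are hypothetical names, and the canned response stands in for a real HTTP builtin.

```python
# Minimal sketch of evaluating a check command with a custom context.
# fake_http and run_check are hypothetical, for illustration only.

def fake_http(url):
    """Stand-in for an HTTP builtin; returns canned JSON instead of doing I/O."""
    canned = {"http://node01:8080/metrics": {"requests": 120, "errors": 3}}
    return canned[url]


def run_check(command, entity):
    """Evaluate a check command for one entity and return its value."""
    context = {
        "__builtins__": {},   # hide Python's own builtins from the command
        "entity": entity,     # the entity the check runs against
        "http": fake_http,    # builtin function exposed to the command
    }
    return eval(command, context)


entity = {"id": "node01:8080", "type": "instance", "host": "node01"}
command = "http('http://' + entity['id'] + '/metrics')"
value = run_check(command, entity)
print(value)
```

Emptying `__builtins__` only narrows what the expression can see; it is not a security sandbox, so a real worker would need stronger isolation for untrusted commands.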
  • 35. Trial Run - Quick feedback and easier development
  • 37. Alerts
      ● Executes using a check’s value, bound to a single check
      ● Defines team and responsible team
      ● Allows inheritance from another alert
      ● Evaluates Python expression yielding True/False
      ● No "WARNING" state, no "UNKNOWN" state
      ● Priorities (color) and tags
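A rough sketch of how an alert condition might be evaluated against a check's value, and how an inheriting alert could override its parent. The field names (`condition`, `priority`) and the dict-merge model of inheritance are assumptions for illustration, not ZMON's actual schema.

```python
# Hypothetical sketch: alert conditions are Python expressions over "value".

def evaluate_alert(condition, value):
    """Return True if the alert is active for the given check value."""
    context = {"__builtins__": {}, "value": value}
    return bool(eval(condition, context))


check_value = {"requests": 120, "errors": 3}

# Parent alert (field names are assumptions).
parent = {"name": "Too many errors", "condition": "value['errors'] > 10", "priority": 1}

# Inheriting alert: starts from the parent's fields and overrides some of them.
child = dict(parent, name="Any errors", condition="value['errors'] > 0")

parent_active = evaluate_alert(parent["condition"], check_value)
child_active = evaluate_alert(child["condition"], check_value)
print(parent_active, child_active)
```

The result is strictly boolean, which matches the slide's point that there is no "WARNING" or "UNKNOWN" state; severity is expressed through the priority instead.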
  • 40. Downtimes
      ● Set or schedule downtimes using the UI
      ● Use API to automate downtimes, e.g. in deployment tool
  • 41. Sharing and reuse of alerts and checks
      Anyone can add alerts to checks
      Alerts are owned by team
      Monitor application boundaries/dependencies
      Make use of inheritance to customize
  • 43. ZMON Core + UI + KairosDB [architecture diagram labels: Scheduler (jvm); Redis; Queue/State; Workers (Python); Metric Cache; KairosDB (Java); Cassandra; Controller (Java); PostgreSQL; Check/Alert definition; Entity data; CLI (Python); Frontend (AngularJS)]
  • 44. ZMON in AWS / Multi DC Setup [diagram labels: Team "Foo" (*.foo.example.org); Team "Bar" (*.bar.example.org); EC2 Instances; ZMON Appliance; ZMON Data Service; ELB]
  • 45. Multi DC / Zone deployment possible
      ● Scheduler supports queue filters by entity
        ○ e.g. {"dc":"dc1"} vs {"dc":"dc2"} queue filters
      ● Scheduler can apply base filter
        ○ only handles entities with {"dc":"dc1"}
      ● Worker can report home using:
        ○ Redis (we use this across DCs)
        ○ HTTPS (AWS->DC)
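The two filter mechanisms on this slide can be sketched as below. This is a minimal illustration assuming a filter matches when every key/value pair equals the entity's; the queue names and the `route` function are hypothetical, not the scheduler's real configuration.

```python
# Hypothetical sketch of base filters and queue filters by entity.

def matches(entity, entity_filter):
    """True if every key/value pair of the filter is present in the entity."""
    return all(entity.get(key) == val for key, val in entity_filter.items())


def route(entity, base_filter, queue_filters, default_queue="default"):
    """Return the queue name for an entity, or None if the scheduler skips it."""
    if not matches(entity, base_filter):
        return None  # outside this scheduler's base filter
    for queue_name, queue_filter in queue_filters:
        if matches(entity, queue_filter):
            return queue_name
    return default_queue


# Queue names are made up for this example.
queue_filters = [("queue-dc1", {"dc": "dc1"}), ("queue-dc2", {"dc": "dc2"})]

node01 = {"id": "node01:8080", "type": "instance", "dc": "dc1"}
node02 = {"id": "node02:8080", "type": "instance", "dc": "dc2"}

print(route(node01, {}, queue_filters))             # scheduler handling all entities
print(route(node02, {"dc": "dc1"}, queue_filters))  # scheduler restricted to dc1
```

With an empty base filter one scheduler fans work out to per-DC queues; with a base filter, each DC can run its own scheduler that ignores foreign entities.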
  • 48. Expose your data / Convention on key names/structure
      {
        "zmon.response.200.GET.checks.all-active-check-definitions.count": 10,
        "zmon.response.200.GET.checks.all-active-check-definitions.fifteenMinuteRate": 0.18071,
        "zmon.response.200.GET.checks.all-active-check-definitions.fiveMinuteRate": 0.15181,
        "zmon.response.200.GET.checks.all-active-check-definitions.oneMinuteRate": 0.10512,
        "zmon.response.200.GET.checks.all-active-check-definitions.75thPercentile": 1173,
        "zmon.response.200.GET.checks.all-active-check-definitions.95thPercentile": 1233,
        "zmon.response.200.GET.checks.all-active-check-definitions.999thPercentile": 1282,
        "zmon.response.200.GET.checks.all-active-check-definitions.99thPercentile": 1282,
        "zmon.response.200.GET.checks.all-active-check-definitions.max": 1282,
        "zmon.response.200.GET.checks.all-active-check-definitions.median": 1161,
        "zmon.response.200.GET.checks.all-active-check-definitions.min": 1114
      }
  • 51. Example libraries and framework support ...
      Spring boot (extending metrics): https://github.com/zalando/zmon-actuator
      Python (Swagger first on Flask): https://github.com/zalando/connexion
      Clojure (Swagger first): https://github.com/zalando-stups/friboo/
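A key convention like the one above makes it straightforward to regroup the flat metrics per endpoint inside a check. The sketch below assumes the layout `<prefix>.response.<status>.<method>.<path segments>.<metric>`, which is read off the example keys rather than taken from a specification; `group_metrics` is a hypothetical helper.

```python
# Hypothetical helper: regroup flat metric keys by (status, method, path).

def group_metrics(flat):
    """Group '<prefix>.response.<status>.<method>.<path...>.<metric>' keys."""
    grouped = {}
    for key, value in flat.items():
        parts = key.split(".")
        if len(parts) < 6 or parts[1] != "response":
            continue  # key does not follow the assumed convention
        status, method = int(parts[2]), parts[3]
        path = "/".join(parts[4:-1])  # remaining segments form the endpoint path
        grouped.setdefault((status, method, path), {})[parts[-1]] = value
    return grouped


sample = {
    "zmon.response.200.GET.checks.all-active-check-definitions.count": 10,
    "zmon.response.200.GET.checks.all-active-check-definitions.median": 1161,
    "zmon.response.200.GET.checks.all-active-check-definitions.max": 1282,
}
grouped = group_metrics(sample)
print(grouped)
```

Once grouped, an alert could address a single endpoint's statistics directly instead of matching on long key strings.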