Please stop using Nagios
(so it can die peacefully)
Andy Sykes
Devops @ Forward3D
@supersheep
andy@forward3d.com
Do you use Nagios?
Tell me why you picked it.
Go on.
If you don't, why don't you?
Reasons for choosing Nagios

•  stupid simple plugin system
•  billions* of existing plugins
•  years of development behind it
•  you can hire people who know it
"Everybody uses it."**

* may not actually be true
** except me. and maybe you. and that guy at the back, who really likes Zabbix. you know
who you are.
So why did you pick Nagios?
Because it's the "safe", default choice.
Because we've grown accustomed to the things
that really, really suck about it.
It's a little like we've all got Stockholm
Syndrome.
What Nagios gets right
Incredibly simple plugin model.
Fairly secure (SSL between agents + master).
Very simple conceptually.
Reliable.
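
To make the "incredibly simple plugin model" concrete, here is a minimal sketch of a Nagios-style check in Python. The convention it relies on is the standard one: the exit status carries the state (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) and the first line of output is the human-readable summary, with optional perfdata after a "|". The function names and the 80/90 percent thresholds are assumptions made for illustration, not something from the talk.

```python
import shutil

# Nagios plugin convention: the exit status carries the state.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATE_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate_disk(used_percent, warn=80.0, crit=90.0):
    """Map a disk usage percentage to a (status, message) pair.

    The thresholds are illustrative defaults, not values from the talk.
    """
    if used_percent >= crit:
        status = CRITICAL
    elif used_percent >= warn:
        status = WARNING
    else:
        status = OK
    # One line of output; the text after "|" is perfdata.
    message = "DISK {} - {:.1f}% used | used_pct={:.1f}".format(
        STATE_NAMES[status], used_percent, used_percent
    )
    return status, message


def main(path="/"):
    """Run the check, print one line, and return the plugin exit status."""
    try:
        usage = shutil.disk_usage(path)
        used_percent = 100.0 * usage.used / usage.total
    except OSError as exc:
        print("DISK UNKNOWN - {}".format(exc))
        return UNKNOWN
    status, message = evaluate_disk(used_percent)
    print(message)
    return status
```

To use this as an actual plugin, the script would end with `raise SystemExit(main())` so that the return value becomes the process exit status. That is the whole contract, which is why so many plugins exist.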
Nagios, I hate thee; let me count thy ways
Doesn't scale. At all.
World's second most horrible configuration*.
Horrendous interface**.
Assumes a static infrastructure.
No decent programmatic interfaces***.
Throws away perfdata.
Stupid wire format for clients (NRPE/NSCA).
* the world's most horrible configuration is, obviously, Sendmail.
** even the paid Nagios XI one is ugly as sin and unusable.
*** if I catch you parsing status.dat, I will beat your ass.
More on configuration
Configuration has to be in two places:
Server has to know what checks to invoke
via NRPE.
Client has to know what checks it will be
asked to invoke with NRPE.
THIS IS MADNESS.
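
One way to see why two copies of the configuration hurt: the two sides can silently drift apart. The sketch below compares what the server expects to call over NRPE with what each client actually defines. The dictionaries are hypothetical stand-ins for the parsed server-side service definitions and each host's nrpe.cfg; the host name, command names, and `find_drift` are invented for illustration.

```python
# Hypothetical data standing in for the two config files.
# Server side: NRPE command names the master will ask each host to run.
server_services = {
    "web01": ["check_load", "check_disk", "check_chef_client"],
}
# Client side: command names actually defined in that host's nrpe.cfg.
client_commands = {
    "web01": ["check_load", "check_disk"],
}


def find_drift(server, client):
    """Return {host: (missing_on_client, unused_on_client)} for mismatched hosts."""
    drift = {}
    for host in sorted(set(server) | set(client)):
        wanted = set(server.get(host, []))
        defined = set(client.get(host, []))
        if wanted != defined:
            drift[host] = (sorted(wanted - defined), sorted(defined - wanted))
    return drift


print(find_drift(server_services, client_commands))
```

Here the master would ask web01 for `check_chef_client`, which the client has never heard of. Config management can paper over this, but the duplication itself remains.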
Scaling, or lack of it
No such thing as a Nagios cluster.
More checks = more work = longer before you
know something's happened!
Every check increases your master's load
average.
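
A back-of-envelope sketch of the "more checks = longer before you know" point. It assumes a single master that runs checks in fixed-size parallel batches, which is a simplification of any real scheduler, and all the numbers in the usage note are made up for illustration.

```python
import math


def sweep_seconds(num_checks, avg_check_seconds, parallel_workers):
    """Rough time for one master to run every check once.

    Simplified model: checks run in batches of `parallel_workers`,
    and each batch takes `avg_check_seconds`.
    """
    batches = math.ceil(num_checks / parallel_workers)
    return batches * avg_check_seconds
```

Under this model, `sweep_seconds(10000, 2.0, 50)` gives 400.0 seconds and `sweep_seconds(20000, 2.0, 50)` gives 800.0: doubling the checks on one box doubles the worst-case delay before a failure is noticed.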
Okay, yes, there’s mod_gearman
But it’s a hack at best.
No redundancy for the machine that distributes
the checks, so it’s not a real cluster.
API poverty
Can't easily integrate with other systems.
Can't easily write custom dashboards.
Can't get information out again!

Assumes a static infra
Master has to be told about a client before
things can happen.
The bandaids we make
Interface:
Opsview, Icinga, Shinken, others

API:
Parsing status.dat, NDO

Client wire format:
Opsview's NRPE, NRD

Config management:
Puppet types, Chef cookbooks
None of it is good enough.
The take-home point:

"If we keep using Nagios,
we'll never get anything
better."
(Writing monitoring systems is hard, and needs community involvement and
real world adoption. Nagios steals mindshare by being just good enough. It's
the monitoring system we deserve, but not the one we need right now.)
So, smart guy. What do we do?

Steal all the things that are great about Nagios.
(existing plugin investment, simplicity, security, reliability)

Strap them to something more awesome.
(scalable, API-ready, config management friendly, modern!)
THIS DOESN’T MEAN WRITING
YOUR OWN MONITORING SYSTEM
Points for thought:

●  What else are people using?
●  Should we greenfield or lift existing tools?
●  What tools could we go with?
My suggestion:

Like OMD, but better.
Wrap up a series of “best in breed” tools to
make one kickass monitoring tool.
What we need:
Core
Agent
Graphing
Anomaly detection
Alerting
UI
Core:
Holds configuration about hosts / services
Distributed across X masters
Check execution (poke)
Results queue (poke response)
There’s something we can use for this.
Sensu!
Sensu is often described as the “monitoring router”.
{
  "checks": {
    "chef_client": {
      "command": "check-chef-client.rb",
      "subscribers": [
        "production"
      ],
      "interval": 60,
      "handlers": [
        "pagerduty",
        "irc"
      ]
    }
  }
}

This check definition lives only on the server.
Client requires no registration for the server
to know about it
Uses Nagios status return codes
Doesn’t talk to the server - talks to
RabbitMQ
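
The flow described on these slides (server-side check definitions, clients that only subscribe, results coming back over a queue) can be sketched in a few lines. This is an in-memory toy, not Sensu's implementation: `Broker` stands in for RabbitMQ, and `Broker`, `Client`, and `publish_checks` are hypothetical names. The JSON is the check definition from the slide.

```python
import json
from collections import defaultdict

CHECKS_JSON = """
{
  "checks": {
    "chef_client": {
      "command": "check-chef-client.rb",
      "subscribers": ["production"],
      "interval": 60,
      "handlers": ["pagerduty", "irc"]
    }
  }
}
"""


class Broker:
    """In-memory stand-in for the message broker (RabbitMQ in the talk)."""

    def __init__(self):
        self.subscriptions = defaultdict(list)  # subscription name -> clients
        self.results = []                       # results queue read by the server

    def subscribe(self, name, client):
        # A client announces itself by subscribing; the server needs no host list.
        self.subscriptions[name].append(client)

    def publish_request(self, subscription, request):
        for client in self.subscriptions[subscription]:
            self.results.append(client.run(request))


class Client:
    def __init__(self, name, runner):
        self.name = name
        self.runner = runner  # callable(command) -> (exit_status, output)

    def run(self, request):
        status, output = self.runner(request["command"])
        return {"client": self.name, "check": request["name"],
                "status": status, "output": output}


def publish_checks(config, broker):
    """Server side: send each check to every subscription it names."""
    for name, check in config["checks"].items():
        request = {"name": name, "command": check["command"]}
        for subscription in check["subscribers"]:
            broker.publish_request(subscription, request)


broker = Broker()
# The runner is faked here; a real client would execute the command.
broker.subscribe("production", Client("web01", lambda cmd: (0, "ran " + cmd)))
publish_checks(json.loads(CHECKS_JSON), broker)
print(broker.results)
```

The point of the design is visible even in the toy: adding a host means starting a client with the right subscription, and nothing on the server changes.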
What we need:
Core
- Sensu-server
Agent
- Sensu-client
Graphing
Anomaly detection
Alerting
UI
Graphing is easy now.
If you’re not using Graphite, you should be.
Sensu “metric” checks can pump data to it.
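
For a sense of how little is involved in pumping a metric to Graphite: its plaintext protocol is one line per data point, `path value timestamp`, sent to the Carbon listener (port 2003 by default). The sketch below formats such lines and sends them over TCP. The metric path, host, and function names are assumptions for illustration; this is not how Sensu's own handlers are written.

```python
import socket
import time


def graphite_line(path, value, timestamp=None):
    """Format one data point in Graphite's plaintext protocol."""
    if timestamp is None:
        timestamp = int(time.time())
    return "{} {} {}\n".format(path, value, int(timestamp))


def send_to_graphite(lines, host="localhost", port=2003, timeout=5.0):
    """Send already-formatted lines to a Carbon plaintext listener."""
    payload = "".join(lines).encode("ascii")
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(payload)
```

For example, `graphite_line("web01.load.one", 0.42, 1400000000)` produces `"web01.load.one 0.42 1400000000\n"`. A metric check only has to print lines in this shape for something downstream to forward them.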
What we need:
Core
- Sensu-server
Agent
- Sensu-client
Graphing - Graphite
Anomaly detection
Alerting
UI
Anomaly detection is hard.
We’ve got all this metric data, but how do we check it?
- Skyline/Oculus (Etsy)
- Grok (very early days)
- ???
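
To illustrate why this is hard, here is about the simplest possible check on metric data: flag a point that sits more than three standard deviations from the recent mean. This is a deliberately naive sketch, not what Skyline or Grok do, and the function name and defaults are assumptions. It breaks down quickly on seasonal or trending data, which is exactly the problem.

```python
from statistics import mean, pstdev


def is_anomalous(history, latest, sigmas=3.0, min_points=10):
    """Return True if `latest` is far outside the spread of `history`."""
    if len(history) < min_points:
        return False  # not enough data to judge
    mu = mean(history)
    sd = pstdev(history)
    if sd == 0:
        return latest != mu  # flat series: any change is unusual
    return abs(latest - mu) > sigmas * sd
```

With a history like `[10, 11, 9, 10, 12, 10, 9, 11, 10, 10]`, a new value of 11 passes and a new value of 30 is flagged. A daily traffic peak would be flagged too, every day.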
What we need:
Core
- Sensu-server
Agent
- Sensu-client
Graphing - Graphite
Anomaly detection - ???
Alerting
UI
Alerting is tricky, but mostly solved.
Flapjack! - flapjack.io
Alerting is not the concern of your monitoring tool.
Push all alerts at Flapjack
- define gateways (PagerDuty, email)
- create relationships between checks and gateways
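
The "relationships between checks and gateways" idea amounts to a routing table kept outside the monitoring core. The sketch below shows the shape of it; it is not Flapjack's API, and `route_alert`, the route table, and the gateway callables are all hypothetical.

```python
def route_alert(event, routes, gateways):
    """Deliver `event` to every gateway linked to its check.

    routes:   {check name: [gateway names]}
    gateways: {gateway name: callable(event)}
    Returns the list of gateway names the event was sent to.
    """
    delivered = []
    for gateway_name in routes.get(event["check"], []):
        gateways[gateway_name](event)
        delivered.append(gateway_name)
    return delivered
```

With `routes = {"chef_client": ["pagerduty", "email"], "disk": ["email"]}`, a failing `disk` event goes to email only, and the monitoring core never needs to know who is on call.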
What we need:
Core
- Sensu-server
Agent
- Sensu-client
Graphing - Graphite
Anomaly detection - ???
Alerting - Flapjack
UI
User interfaces are hard.
What do we need from it?
- What’s broken
- When it broke, when it broke in the past
- Say “OK, I know it’s broken”
- View graphs to see how quickly it broke
- See every check everywhere, and filter the list
The Sensu Dashboard sucks.
No history!
Acknowledgements aren’t easy to do.
No graphing.
Can’t see anything that’s reporting an OK status.
This won’t do.
I’m going to have to write a UI. Sigh.
What we need:
Core
- Sensu-server
Agent
- Sensu-client
Graphing - Graphite
Anomaly detection - ???
Alerting - Flapjack
UI
- ???
In Summary

Nagios sucks.
There are good tools for each concern
of monitoring.
If we can package them together, we
can have something that rocks.
Thank You.

Contact
andy@forward3d.com (@supersheep)
