Sensu at Brightpearl
Turning a hatred of Nagios into a
love of Sensu
www.brightpearl.com
Who the hell am I?
Dave Tibbs
@LowlySysadm1n
● Systems Administrator at Brightpearl Inc
● Started at Brightpearl UK in October 2010
● Back then, only about 20 people in the company – I was the only Systems Administrator/General IT Dogsbody
● ~7 years' experience as a Sysadmin working with various flavours of Linux
Monitoring – who needs it anyway?
● Basically everyone – if you're running production software that people depend on, you need to know what's going on with your servers
● You can't rely on screaming users to let you know when things go wrong
● Certain metrics can be a very good indicator of failures before they happen – think disk space, memory consumption, failed backups, web requests/sec, etc.
Monitoring in place when I started
Right, better get some monitoring.
Nagios, then?
● Reputation of being the default, safe choice
● Claims to be “Industry Standard” on its website
● Historically people were put off by the extortionate costs of enterprise software (e.g. HP OpenView) – now cloud-based software still requires a subscription.
● Hey, Nagios is free.
● Neckbeards rejoice – it's open source.
In the beginning, it was joyous.
● MONITOR ALL TEH THINGZ
● (Relatively) low server count meant it was still manageable. Easy to tune alerts to specific servers.
● All the plugins you can imagine meant we could monitor RDS instances, internal office servers, UPS, etc.
● Email alerts for warnings kept us abreast of things that might happen
● PagerDuty integration for critical alerts
● Configuration assisted with Chef.
But then...
● As the number of servers increases, so does the configuration required
● ...and so do the spurious alerts, where the thresholds aren't so simple to set. Hosting cost constraints mean sometimes running close to the wire on some servers but not others.
● Because of this, NAGIOSAGEDDON in your email inbox. Soon enough, everyone's ignoring the alerts, especially the warnings. And especially if stuff is still working.
A quick note on Nagios checks.
● Monitoring host sends a check command over NRPE and waits for a response
● The queue of checks is processed one by one – if networking to certain hosts is slow, the whole list takes longer to process.
● If the list of checks doesn't get processed before the next check is due.....
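
To put rough numbers on the queueing problem described on this slide, here is a minimal Python sketch. It assumes strictly serial execution, as the slide describes, and every latency and interval in it is made up for illustration; the function name `completed_in_window` is hypothetical and is not part of Nagios or NRPE.

```python
from itertools import accumulate


def completed_in_window(latencies_ms, interval_ms):
    """Count the checks that finish before the next run is due.

    Assumes checks run strictly one after another, so each check
    finishes at the running total of all latencies before it.
    """
    finish_times = accumulate(latencies_ms)
    return sum(1 for finished_at in finish_times if finished_at <= interval_ms)


# Hypothetical figures: a 60 second check interval, 250 ms per healthy check.
INTERVAL_MS = 60_000
baseline = [250] * 100
doubled = baseline * 2
tripled = baseline * 3
# One slow host at the front of the queue delays everything behind it.
slow_host_first = [45_000] + [250] * 99

for label, queue in [
    ("baseline", baseline),
    ("doubled", doubled),
    ("tripled", tripled),
    ("slow host first", slow_host_first),
]:
    done = completed_in_window(queue, INTERVAL_MS)
    print(f"{label}: {done}/{len(queue)} checks finished inside the window")
```

With these invented numbers the doubled queue still fits, the tripled queue finishes only 240 of its 300 checks, and a single 45 second host at the head of the queue leaves 39 of 100 checks without a result before the next run is due.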
So Nagios sucks then?
● Well, Nagios gets some things right -
  ● The plugin model is simple (4 exit codes!) and reasonably well-designed
  ● It's pretty reliable
  ● SSL support = secure
● If you're running a small office/datacentre with servers and requirements that rarely or never change, it works – but still with a lot of painful setting up
● But as soon as you deviate from this, it all goes wrong.
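
The "4 exit codes" are 0 (OK), 1 (WARNING), 2 (CRITICAL) and 3 (UNKNOWN), plus a line of text on standard output. The sketch below shows roughly how little a plugin needs. The function `evaluate_disk`, its thresholds and the message wording are illustrative assumptions, not taken from any real plugin.

```python
# The four plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate_disk(percent_used, warn=80.0, crit=90.0):
    """Map a disk usage reading to (exit_code, status_line).

    The thresholds and the message format are hypothetical defaults
    chosen for this example.
    """
    if percent_used is None or not 0 <= percent_used <= 100:
        return UNKNOWN, "DISK UNKNOWN - no usable reading"
    if percent_used >= crit:
        return CRITICAL, f"DISK CRITICAL - {percent_used:.0f}% used"
    if percent_used >= warn:
        return WARNING, f"DISK WARNING - {percent_used:.0f}% used"
    return OK, f"DISK OK - {percent_used:.0f}% used"


# Made-up readings; a real plugin would measure the disk, print the
# status line and finish with sys.exit(code).
for reading in (42.0, 85.0, 97.0, None):
    code, line = evaluate_disk(reading)
    print(code, line)
```

Because the contract is just an exit code and a line of text, the same script can be run by anything that understands that convention, which is why existing plugins carry over to other tools.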
Yes, basically Nagios sucks.
● A lot has changed in the IT world in 15 years – Nagios hasn't.
● It's completely unscalable. There is no such thing as a Nagios cluster. More checks = more server load on the master
● The configuration format is horrible – Chef/Puppet only slightly dulls the pain
● It has a horrendous interface – even if you pay for Nagios XI, which isn't cheap
● It assumes a static infrastructure, which in the days of Cloud it almost never is.
● Configuration has to be duplicated in two places (on the server and on the client)
So what to do?
● Reached the limit of Nagios pain – determined to shake the Stockholm Syndrome we all appear to have
● Alerts are pretty much ignored by all; once the flood gets large enough they WILL end up filtered. Nagios has been stopped for days without anybody noticing.
● A monitoring system that people ignore is utterly pointless.
● Started to investigate alternatives.
Alternatives to Nagios
● Nagios XI – $$$ and apparently not much better.
● Zabbix – Not as much support as Nagios, lots of people seem to think it's worse. Configuration possibly even more complex
● Zenoss – Confusing config, issues with false positives and massive numbers of alerts
● Then I found Sensu.
What is this Sensu then?
● Much, much better model (queue-subscriber)
● Purpose-built for this, best tool for the job. Think Graphite for graphing, PagerDuty for alerting.
● Supports existing Nagios plugins
● Integrates with Graphite, PagerDuty
● Easy to scale – automatically handles clustering.
● Great REST API – you can do most things with it
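
The queue-subscriber idea can be sketched in a few lines of Python. This is an in-memory toy, not Sensu code and not a RabbitMQ client: the `Transport` and `Client` classes, the subscription names and the host names are all hypothetical. The point is only that the publisher addresses a subscription rather than a list of hosts.

```python
from collections import defaultdict


class Client:
    """Stand-in for a monitored host that runs checks on request."""

    def __init__(self, name):
        self.name = name

    def run(self, check_name):
        # A real client would execute a plugin; this toy always reports OK (0).
        return {"client": self.name, "check": check_name, "status": 0}


class Transport:
    """Toy message bus: check requests fan out to every subscriber."""

    def __init__(self):
        self._subscribers = defaultdict(list)
        self.results = []

    def subscribe(self, subscription, client):
        self._subscribers[subscription].append(client)

    def publish_check(self, subscription, check_name):
        # The publisher never names individual hosts.
        for client in self._subscribers[subscription]:
            self.results.append(client.run(check_name))


transport = Transport()
transport.subscribe("webserver", Client("web1"))
transport.subscribe("webserver", Client("web2"))
transport.subscribe("database", Client("db1"))

transport.publish_check("webserver", "check_http")

# A new instance only has to subscribe; nothing on the publishing side changes.
transport.subscribe("webserver", Client("web3"))
transport.publish_check("webserver", "check_http")

for result in transport.results:
    print(result)
```

The second publish reaches web3 without any change on the publishing side, which is the property that matters when instances come and go.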
No really, what is it?
● Often described as a “monitoring router”
● Results of “check” scripts are passed on to one or more handlers, depending on certain conditions
● Written in Ruby (yay!)
● Configuration is all in JSON
● Four main components:
  ● Server
  ● Client
  ● API
  ● Dashboard
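
To make the "monitoring router" idea concrete, here is a small Python sketch that reads a JSON check definition and passes a result to handlers based on its severity. The JSON is modelled on how I understand Sensu check definitions to look (`command`, `subscribers`, `interval`, `handlers`), so treat the field names as an assumption to verify against the documentation for your version. The `route` function, the `HANDLERS` table and its severity rules are hypothetical and written only for this example.

```python
import json

# Assumed shape of a check definition; verify the field names for your version.
CONFIG_JSON = """
{
  "checks": {
    "check_disk": {
      "command": "check_disk -w 20% -c 10%",
      "subscribers": ["webserver"],
      "interval": 60,
      "handlers": ["email", "pagerduty"]
    }
  }
}
"""

delivered = []  # records (handler_name, client) pairs so the demo is visible


def email(result):
    delivered.append(("email", result["client"]))


def pagerduty(result):
    delivered.append(("pagerduty", result["client"]))


# Hypothetical routing rules: which exit codes each handler cares about.
# Warnings (1) go to email only; criticals (2) also page someone.
HANDLERS = {
    "email": (email, {1, 2}),
    "pagerduty": (pagerduty, {2}),
}


def route(result, config):
    """Pass a check result to every handler whose condition matches."""
    check = config["checks"][result["check"]]
    fired = []
    for name in check["handlers"]:
        handler, severities = HANDLERS[name]
        if result["status"] in severities:
            handler(result)
            fired.append(name)
    return fired


config = json.loads(CONFIG_JSON)
for status in (0, 1, 2):
    result = {"client": "web1", "check": "check_disk", "status": status}
    print(status, route(result, config))
```

An OK result goes nowhere, a warning reaches email only, and a critical reaches both email and PagerDuty, which mirrors the email-for-warnings, pager-for-criticals split described earlier in the talk.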
Compared to Nagios, this is good
● Hosting our infrastructure in the cloud, we need our monitoring solution to be
  ● able to cope with changing instances/infrastructure
  ● aware of new servers without us having to remember to tell it
  ● able to cope with possibly rapid expansion
● Sensu fulfills these objectives reasonably well.
So is Sensu perfect?
● No, nothing is.
● The dashboard is immature – basically still a bit rubbish
● Current release is only version 0.12 – so the software as a whole is fairly immature.
● Fairly complicated install process, with dependencies on quite a bit more than Nagios. It's been Chef'd (and Chef'd well) but it seems easy for these dependencies to break with version inconsistencies.
But it's still immeasurably better.
● It'll scale well when our infrastructure expands
● Has performed great in a test environment
● Looking forward to rolling it out to production!


Editor's Notes

  1. EXPLAIN WHY NAGIOS CHECKS ARE BAD – an NRPE check is fired to each server; the more checks, the more they queue up. A check can fire off on a server before the previous one has completed – never get a result back. Chef kind of helps with configuration, but not by a lot. As there are more servers, there are more exceptions not covered so easily by configuration management. What follows NAGIOSAGEDDON? Mail queue overload and eventual crash. Alerts stop altogether, which nobody notices, because they're ignoring them.
  2. If the list of checks doesn't get processed before the next check is due..... we may never get results back for the later checks in the list. Or, consider that the server is able to process the checks required within the time “window” (e.g. 1 minute for checks that are made every minute) – what if the number of checks is doubled? Tripled?
  3. Reliability – when was the last time you saw the Nagios daemon crash? It's usually things external to Nagios that are the problem. Painful setting up – there are bolt-ons like Groundworks to improve setting up, but they're not that much better than arsing about with configuration files. Deviation = non-static hostnames in the cloud. Generally in a datacentre most things are static.
  4. A lot has changed in 15 years – the biggest changes being that a) everyone's running more servers and more services, and b) most people are relying on the cloud = many, many non-static IP addresses. Nagios is 15 years old, give or take – released in 1999 and the design hasn't changed much in years. It's not fair to expect them to have predicted the changes back then, but neither has the software moved with the times. Configuration duplication – the server has to be aware of what checks it wants clients to make, and the client has to be aware of what checks it's going to be expected to run. Absolutely crazy setup.
  5. Stockholm syndrome not just in our company or even with me – everyone seems to have it. Reference everyone defending Nagios when it's basically shit.
  6. “Sensu” from the Japanese word for “fan” - relates to the “fanout exchange”, one of the exchange types used by RabbitMQ.
  7. Server – orchestrates check executions, processes the results, and routes events from results to handlers. You can run more than one server and tasks are distributed amongst them. Client – Receives check execution requests, executes the checks, and publishes the results. API – Provides a REST-like interface to Sensu data, such as registered clients and current events. You can run more than one. Dashboard – UI for Sensu. Not great.