www.netways.de
Troubleshooting Icinga 2
OSMC | Thomas Widhalm | 2017-11-23
Me
●
Thomas Widhalm
●
Netways: Senior Consultant / Lead Support Engineer
●
Icinga: Team member QA & Docs
●
Co-Author Icinga 2 book
●
@widhalmt
●
Thomas.widhalm@netways.de / thomas.widhalm@icinga.com
●
Does not like making slides :-(
Please
●
I‘ll tell you lots of obvious things
– Read the docs
– Keep accounts at the ready
– Test your setup
– Do config checks
– Read logs
Please
Why do I still have to tell it?
Agenda
• Is Icinga 2 doing what I want it to do?
• Why is Icinga 2 not working?
• What can I do if Icinga 2 is not working?
• Where can I get help if Icinga 2 is not working?
• What information can I provide beforehand if Icinga 2 is not working?
Is Icinga doing what I want it to do?
Is Icinga 2 doing what I want it to do?
●
Flexibility is good and bad at the same time
Check what you are checking
●
Reachability of hosts (ping)
●
Basic OS checks (load, mem, disk,…)
●
Reachability of services (http, ports,…)
●
Details of services (certificates, status codes,…)
●
End-to-end monitoring (send mail – receive mail,…)
●
Business processes
Is this what you need to find out about outages?
Common mistakes
●
„It‘s pingable so it must be fully functional“
●
„I can connect to port 80/tcp so my LAMP stack is OK“
●
„We can reach our website from my office so our website is available“
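A minimal sketch of a check that looks past "the port is open": it only reports OK when the page returns status 200 and contains text that can only be rendered by a working stack. The function names and the marker text are assumptions for illustration; the exit codes follow the usual plugin convention (0 OK, 2 CRITICAL).

```python
import urllib.error
import urllib.request

# Standard monitoring-plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate_response(status, body, expected_text):
    """Map an HTTP response to a plugin state.

    A 200 alone is not enough: the page must also contain text that only
    appears when the whole stack (web server, application, database) works.
    """
    if status != 200:
        return CRITICAL, f"HTTP status {status}"
    if expected_text not in body:
        return CRITICAL, f"status 200 but '{expected_text}' missing from page"
    return OK, "status 200 and expected content found"


def check_url(url, expected_text, timeout=10):
    """Fetch url and evaluate it; network failures count as CRITICAL."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
            return evaluate_response(response.status, body, expected_text)
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status (4xx/5xx).
        return evaluate_response(err.code, "", expected_text)
    except (urllib.error.URLError, OSError) as err:
        return CRITICAL, f"connection failed: {err}"
```

Running the same check from a host outside the office network addresses the third mistake as well.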
The cure
●
Think you monitor everything you need? Think again. And again.
●
Do tests
– Turn off Services
– Turn off parts of your stacks (only database, only middleware,…)
– Block connections (Firewall)
●
If you had an outage without alarm, adjust your Checks!
Don‘t overdo it
●
Do you really need to know about every single switchport?
●
Think a lot about check intervals.
●
Don‘t kill your hosts/remote sites by „overmonitoring“
– API calls
– SNMP checks
– SQL Queries
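A rough back-of-the-envelope calculation shows how strongly the check interval drives the load on the monitored devices. The switch and port counts are made-up numbers, and the helper name is an assumption for illustration.

```python
def checks_per_second(object_count, check_interval_seconds):
    """Average number of check executions per second for one group of objects."""
    if check_interval_seconds <= 0:
        raise ValueError("check interval must be positive")
    return object_count / check_interval_seconds


# Hypothetical numbers: 200 switches with 48 ports each.
switchports = 200 * 48
every_minute = checks_per_second(switchports, 60)         # 160.0
every_five_minutes = checks_per_second(switchports, 300)  # 32.0
```

Each of those executions is an SNMP request, API call or SQL query that the remote side has to answer.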
Monitor your Monitoring
●
Monitoring hosts are production hosts. Treat them like that
●
Don‘t forget about your satellites
Keep your Monitoring running
About High Availability
●
„I won‘t need Monitoring to tell me that my whole VMware farm went offline.“
– Maybe not. But you will want it to help you keep an overview of what exactly went down.
●
„I‘ll have other problems than watching Monitoring during a major outage.“
– Why? Use it for coordinating what is up and what still needs attention.
●
„I‘ll cluster everything so I won‘t have to bother about my Monitoring.“
– Almost every HA solution needs regular attention. You might end up with more work to do than without high availability.
– Think about it. HA is a great thing to have but is it really what you want?
●
Consider having autonomous Monitoring hosts
– Hardware, extra UPS, SMS Modem, etc.
Rethink your notifications
●
Is separation of OS and services always a good thing?
– A service might be ok while the OS is close to collapsing
– Knowing about the OS helps with root cause analysis
– Knowing about failing services helps with prioritizing problems
●
Is „critical“ the only state to notify?
– Always test what state every plugin returns for what sort of problem
●
Could I flood users with alarms?
– Use dependencies
– Don‘t escalate every alarm to ticketing
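To test which state a plugin returns for which sort of problem, it helps to run it and translate the exit code. A small sketch, assuming the usual plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN); the helper names are made up for illustration.

```python
import subprocess

# Plugin exit codes and the states they stand for.
STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def state_name(exit_code):
    """Translate an exit code; anything unexpected is treated here as UNKNOWN."""
    return STATES.get(exit_code, "UNKNOWN")


def run_plugin(command, timeout=60):
    """Run a plugin (list of arguments) and return (state, first output line)."""
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout
        )
    except (OSError, subprocess.TimeoutExpired) as err:
        # Plugin missing, not executable, or hanging.
        return "UNKNOWN", str(err)
    lines = result.stdout.strip().splitlines()
    first_line = lines[0] if lines else ""
    return state_name(result.returncode), first_line
```

Break the monitored service on purpose, run the plugin, and compare the returned state with the states your notifications are configured for.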
No passive check without active check
●
Don‘t rely on passive data
– Passive checks
– SNMP Traps
– Logfiles
●
At least use one active check to know whether the sender has nothing to send or is dead
●
Passive data can enrich your information, though
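The reasoning above can be sketched as a small decision function: the active check answers "is the sender alive?", the passive data only adds detail. All names and the silence threshold are assumptions for illustration.

```python
from datetime import datetime, timedelta


def classify_sender(active_check_ok, last_passive_result, now, max_silence):
    """Combine one active check with the age of the last passive result.

    active_check_ok:     result of an active check against the sender
    last_passive_result: datetime of the last passive result, or None
    max_silence:         timedelta after which silence is worth a look
    """
    if not active_check_ok:
        # Without the active check this case looks identical to "nothing to send".
        return "sender is dead"
    if last_passive_result is None or now - last_passive_result > max_silence:
        return "sender is alive but has nothing to send"
    return "sender is alive and reporting"
```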
Why is Icinga not working?
Why is Icinga 2 not working?
Reasons for Icinga 2 not working
1) Configuration errors
2) Monitoring the wrong thing (see previous slides)
3) Not monitoring the monitoring system
4) Something else
5) Another thing
6) Bugs
www.netways.de
Configuration errors
●
Can be found with the icinga2 daemon -C command. Use it!
●
Check configuration on satellites with the icinga check
●
Beware of checks running on the wrong node
– Defining which host executes a plugin can be a bit tricky
– Use checks like disk that have host specific output
●
Use icinga2 object list to check if your apply-rules work as desired
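A sketch of how the output of icinga2 object list could be used to see which objects an apply rule actually created. The --type and --name options exist in the CLI; the header line format in the regular expression is an assumption based on typical output, and the helper names are made up.

```python
import re

# Assumed header format of `icinga2 object list` output, for example:
#   Object 'web01!disk' of type 'Service':
OBJECT_HEADER = re.compile(r"^Object '(?P<name>.+)' of type '(?P<type>\w+)':")


def object_list_command(object_type, name_pattern=None):
    """Build the command line, optionally filtered by a name wildcard."""
    command = ["icinga2", "object", "list", "--type", object_type]
    if name_pattern is not None:
        command += ["--name", name_pattern]
    return command


def object_names(output):
    """Extract the object names from the command output."""
    names = []
    for line in output.splitlines():
        match = OBJECT_HEADER.match(line)
        if match:
            names.append(match.group("name"))
    return names
```

Feed the result of `subprocess.run(object_list_command("Service", "*disk*"), capture_output=True, text=True).stdout` into `object_names` and compare the list with the hosts you expected the rule to match.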
Monitoring the wrong thing
●
Running plugin on the wrong host
●
Running checks that end up neither OK nor Critical when a service fails
●
Plugins that use Icinga states incorrectly (e.g. Critical instead of Unknown)
●
Sending notifications only on critical
– Rethink if „Unknown“ could indicate a loss of service, too
●
Checking specific metrics which might be ok even when the service is dead
Not monitoring the monitoring system
●
Full disks on Icinga hosts
●
Broken database connection between Icinga 2 and Icinga Web 2
●
Failed HA
●
Dead satellites
●
Notifications not going through
●
Time offset (->NTP!)
Monitor Icinga
●
Use the same basic checks as for other Linux hosts
– Focus on I/O and disk usage
●
Use the internal checks
– New checks introduced
– Updates for existing checks
●
Monitor services used by Icinga
– IDO Database
– Grapher
– Webserver
●
Provide an alternative way for notifications
– SMS
– Telegram
Internal Check „icinga“
●
Configuration check on satellites
●
Lots of performance data
– Use e.g. latency to identify very slow hosts
●
Shows version so you can use it for updates
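Performance data follows the plugin format label=value[unit];warn;crit;min;max, so it is easy to pick single metrics out of it. A minimal parser sketch; the sample labels in the usage line are made up and not taken from real output of the icinga check.

```python
import re
import shlex

# A number followed by an optional unit of measurement, e.g. "0.002s" or "42".
VALUE = re.compile(r"^(-?\d+(?:\.\d+)?)([A-Za-z%]*)$")


def parse_perfdata(perfdata):
    """Return {label: (value, unit)} for a perfdata string.

    Labels containing spaces are single-quoted, which shlex handles.
    Thresholds and min/max after the first ';' are ignored here.
    """
    metrics = {}
    for token in shlex.split(perfdata):
        label, separator, fields = token.rpartition("=")
        if not separator:
            continue  # not a label=value pair
        match = VALUE.match(fields.split(";")[0])
        if match:
            metrics[label] = (float(match.group(1)), match.group(2))
    return metrics
```

For example, `parse_perfdata("avg_latency=0.002s;;;0 'active host checks'=42")` yields one entry per label, which can then be compared against a latency threshold of your choice.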
Internal check „cluster“
●
Use on nodes with just a few connected endpoints
Internal check „cluster-zone“
●
Add to every agent but run on parent node
●
Use as parent for dependencies
Internal check „IDO“
●
Run on every node with IDO feature enabled
●
Very helpful performance data (esp. pending_queries)
Check services used by Icinga
●
Don‘t just monitor if they are available
●
Check if Icinga can connect (e.g. check_mysql)
●
Check if the service is doing OK, too (e.g. check_mysql_health)
Bugs
●
Error messages, log entries, crash dumps, etc.
●
Very few bugs come without the aforementioned symptoms
●
Check https://github.com/Icinga/icinga2/issues
●
Try to reproduce (e.g. Vagrant boxes)
●
File a bug report (even if you have a support contract)
Reproduce with Vagrant
●
Build your own boxes
●
Use the official ones
– https://github.com/Icinga/icinga-vagrant
– Some bugs are OS specific, many aren‘t
– Config errors are almost never OS specific
Why is Icinga 2 not working like I want it to?
●
What sort of „not working“?
●
Icinga 2 not running
– Configuration checks
– Logs
●
No notification
– icinga2 object list --type notification
– Logs
●
...
Logs
●
Collect them
– Elastic Stack (Logstash rules available soon)
– Graylog
●
Monitor them
●
Tweak them
– Use „Debug“ Log when searching for problems
– Keep an eye on the disk usage
●
Use „icingabeat“ for even more complete data collection
Monitor Icinga
●
Read the docs
– https://www.icinga.com/docs/icinga2/latest/doc/08-advanced-topics/#monitoring-icinga-2
●
Logs (again)
●
Prepare for complete loss of Icinga 2
– High availability
– Users watching Icinga Web 2 / Dashing
– Rudimentary monitoring (e.g. SMS Gateways)
– Extra Satellites with Notifications (Netways Web Services)
Reduce oddities
●
Try to stick to standards
●
Use official packages
– Vendors ship patched libraries with matching versions
– Custom built packages work most of the time but can have hard-to-find problems
●
Use certificates from the Icinga CA
– Or at least from a proven alternative like Puppet CA
– No benefit from certificates from your company's CA
What can I do if Icinga is not working?
What can I do if Icinga 2 is not working?
Read the docs
●
Troubleshooting section
– https://www.icinga.com/docs/icinga2/latest/doc/15-troubleshooting/
Configuration errors
●
Always run icinga2 daemon -C when changing something
●
Use version control system for your configuration
– Consider using a script that runs these steps in order:
1) Check configuration
2) Commit to version control
3) Reload Icinga
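The script idea above could look roughly like this. It is a sketch under assumptions: the configuration lives in /etc/icinga2, that directory is a Git repository, and the service is reloaded through systemd. The function name is made up.

```python
import subprocess


def deploy(message, config_dir="/etc/icinga2", run=subprocess.run):
    """Validate, commit and reload; stop at the first step that fails.

    Returns None on success, otherwise the command that failed.
    Note: `git commit` also fails when there is nothing to commit,
    in which case no reload happens.
    """
    steps = [
        ["icinga2", "daemon", "-C"],         # check configuration
        ["git", "add", "--all"],             # stage every change
        ["git", "commit", "-m", message],    # commit to version control
        ["systemctl", "reload", "icinga2"],  # reload Icinga
    ]
    for step in steps:
        if run(step, cwd=config_dir).returncode != 0:
            return step  # later steps are skipped
    return None
```

Because validation comes first, a broken configuration is neither committed nor loaded.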
Read the logs
●
Standard Mainlog should provide plenty of information
●
Use debuglog carefully
– Lots of data
– Lots of I/O
●
Don‘t forget about system logs
– /var/log/messages, dmesg, etc.
– OOM kills, etc.
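System logs can be scanned for OOM kills with a few lines. A sketch only: the pattern covers common kernel wordings but is an assumption, not a complete list, and the log path in the usage note differs between distributions.

```python
import re

# Common wordings in kernel messages about the OOM killer (not exhaustive).
OOM_PATTERN = re.compile(r"out of memory|oom-kill|oom_reaper", re.IGNORECASE)


def find_oom_lines(lines):
    """Return all lines that look like OOM killer activity."""
    return [line.rstrip("\n") for line in lines if OOM_PATTERN.search(line)]
```

Usage could be `with open("/var/log/messages") as log: print(find_oom_lines(log))`, or the same call on the output of dmesg.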
Check your Icinga nodes
●
External factors are common sources of monitoring problems
– Most common bottleneck: I/O.
●
IDO Database
●
Grapher
– Network bottlenecks
●
Checkresults
●
Configsync
●
API Log filling up
– Out of memory
●
Again: Monitor your monitoring nodes
Problems with single checks
●
Lots of components involved
●
Break them down
– Run plugin manually (as icinga user!)
– Review log of executing host
– Check again if the plugin is executed on the right host
Why always run checks as icinga user?
●
Check permissions (obviously)
●
Temporary files (not so obvious)
– Some plugins create temporary files
– When run as a different user, Icinga might not be permitted to use/change them
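Two small helpers illustrate both points: wrapping a manual plugin run in sudo so it runs as the monitoring user, and spotting temporary files that belong to somebody else. The user name "icinga" is an assumption (some distributions use a different account), and the helper names are made up.

```python
import os


def as_monitoring_user(plugin_command, user="icinga"):
    """Prefix a plugin command so that it runs as the monitoring user."""
    return ["sudo", "-u", user, "--", *plugin_command]


def foreign_owned(paths):
    """Return the existing paths that are not owned by the current user.

    Run this as the monitoring user against a plugin's temporary files:
    anything listed was probably created by a manual run as another user.
    """
    current_uid = os.getuid()
    return [
        path for path in paths
        if os.path.exists(path) and os.stat(path).st_uid != current_uid
    ]
```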
Compare different versions of reality
●
Check the configuration on disk
– Reviewing files
– icinga2 object list
●
Check the API
– https://www.icinga.com/docs/icinga2/latest/doc/12-icinga2-api/
– Prepare a user before problems occur
– Use a script, a curl configuration file or an alias for the connection
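A prepared script for API queries might look like the following sketch. Port 5665 and the /v1/objects/hosts path come from the Icinga 2 API; the user name, password and CA file location are placeholders you would replace with your own.

```python
import base64
import json
import ssl
import urllib.request


def build_request(path, user, password, base_url="https://localhost:5665"):
    """Prepare an authenticated API request without sending it."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        base_url + path,
        headers={
            "Accept": "application/json",
            "Authorization": f"Basic {token}",
        },
    )


def query(request, ca_file):
    """Send the request, verifying the server against the given CA file."""
    context = ssl.create_default_context(cafile=ca_file)
    with urllib.request.urlopen(request, context=context, timeout=10) as response:
        return json.load(response)
```

`query(build_request("/v1/objects/hosts", "diag", "secret"), "/path/to/ca.crt")` would then return what the running daemon knows about its hosts, which can be compared with the configuration on disk.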
Use a .curlrc file
Where can I get help if Icinga is not working?
Where can I get help?
Resources for help
●
Discussion Boards
– https://monitoring-portal.org
●
Mailing lists
– https://lists.icinga.org/mailman/listinfo/icinga-users
●
Partners
– https://www.icinga.com/partners/
Be prepared!
Common reactions to problems that need assistance to fix
●
„OMGOMGOMGOMG!!!“
– Fix by having disaster plans ready
– Gain confidence by tests and rehearsals
●
„What‘s my GitHub account again?“
– Store your account data in a password safe where you can find it
– Not as common as you might think...
●
„If I hit this problem, someone else will, too, so I‘ll wait for them to file a report.“
– Nope. Do it. NOW. (As long as it is a confirmed bug)
●
„I‘m just too dumb for RTFM.“
– You can file documentation enhancement issues, too
Even better – get involved early!
Ways to get involved
●
Become a regular reader (and poster) in the boards / lists
– Help others
– Get used to the community support channels
– Be informed about common / upcoming problems early
●
Review Issues on GitHub
– Help in solving „non-bug“ issues
– Provide information to issues you had yourself
– Again, be informed early
●
Contribute
– File pull requests on GitHub
Get on the team
●
Get on the team
– https://www.icinga.com/about/team/
– You can contribute in every way without being on the team
– If you think you contributed enough ask for acceptance
– Far from every team member is a developer
Support Contracts
●
Partners provide support, Icinga doesn‘t
●
Synchronized service levels
Bugs and Feature requests when you have a support contract
●
„If there‘s something to code, we‘ll need an issue“
– Development is always tracked on GitHub
●
You‘ll want to file your own issues
– Get informed of every reply as fast as possible
– Provide more information
– Gather reputation for finding bugs / reporting feature requests
What information can I provide beforehand?
What information can I provide beforehand if Icinga 2 is not working?
●
See documentation again
– https://www.icinga.com/docs/icinga2/latest/doc/15-troubleshooting/
What information (an overview)
●
What did you expect Icinga to do?
●
What didn‘t Icinga do?
– Common question from support: „What tells you that something went wrong?“
What information (an overview)
●
Your Icinga master
– HA? Cluster? Features?
– Database? Which? HA? Extra Host?
– Configuration. Flat Files? Director? API?
●
Your satellites
– HA? Cluster?
– How many?
●
Agents
– Which OSes?
– How many?
●
Grapher
– Which?
– Extra Host?
What information (an overview)
●
Environment
– OS version
– Software versions
– Virtualisation
– Installed Components/Modules
– Network (Firewall Zones, remote sites,...)
What information (an overview)
●
Oddities
– Certificates not from the Icinga CA
– Custom built packages
Automate it!
Automated Diagnostics Script
●
Temporary Source
– https://github.com/widhalmt/icinga2-diagnostics
– Will be moved ASAP
●
Current status: Quick hack after remote session for support customer
●
Goal
– Gather information asked in most replies to new support tickets
– Don‘t overwhelm with too much information (don‘t just copy the configuration)
– Sending full configuration for deep dives might become an option
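To illustrate the goal (this is not the script linked above, just a sketch of the idea), a few lines are enough to collect the basics asked for in most first replies. It only uses icinga2 --version and icinga2 feature list; the function name and dictionary keys are made up.

```python
import platform
import shutil
import subprocess


def gather_diagnostics(run=subprocess.run):
    """Collect a short overview instead of copying the whole configuration."""
    info = {"os": platform.platform()}
    if shutil.which("icinga2") is None:
        info["icinga2"] = "not found in PATH"
        return info
    commands = {
        "version": ["icinga2", "--version"],
        "features": ["icinga2", "feature", "list"],
    }
    for key, command in commands.items():
        result = run(command, capture_output=True, text=True)
        info[key] = result.stdout.strip()
    return info
```

The resulting dictionary is small enough to paste into a ticket or forum post.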
Sample output
Because ego dictates it...
Questions?

More Related Content

What's hot

Master-Master Replication and Scaling of an Application Between Each of the I...
Master-Master Replication and Scaling of an Application Between Each of the I...Master-Master Replication and Scaling of an Application Between Each of the I...
Master-Master Replication and Scaling of an Application Between Each of the I...vsoshnikov
 
Vagrant in 15 minutes
Vagrant in 15 minutesVagrant in 15 minutes
Vagrant in 15 minutesAnton Weiss
 
Sensu at brightpearl
Sensu at brightpearlSensu at brightpearl
Sensu at brightpearlDavid Tibbs
 
2020 ADDO Spring Break OWASP ZAP Automation
2020 ADDO Spring Break OWASP ZAP Automation2020 ADDO Spring Break OWASP ZAP Automation
2020 ADDO Spring Break OWASP ZAP AutomationSimon Bennetts
 
Nagios Conference 2014 - James Clark - Nagios Cool Tips and Tricks
Nagios Conference 2014 - James Clark - Nagios Cool Tips and TricksNagios Conference 2014 - James Clark - Nagios Cool Tips and Tricks
Nagios Conference 2014 - James Clark - Nagios Cool Tips and TricksNagios
 
How to Shrink from 5 Tiers to 2 in a Multitier Microservices Architecture
 How to Shrink from 5 Tiers to 2 in a Multitier Microservices Architecture How to Shrink from 5 Tiers to 2 in a Multitier Microservices Architecture
How to Shrink from 5 Tiers to 2 in a Multitier Microservices Architecturevsoshnikov
 
JavaOne 2014 Security Testing for Developers using OWASP ZAP
JavaOne 2014 Security Testing for Developers using OWASP ZAPJavaOne 2014 Security Testing for Developers using OWASP ZAP
JavaOne 2014 Security Testing for Developers using OWASP ZAPSimon Bennetts
 
2020 OWASP Thailand - ZAP intro
2020 OWASP Thailand - ZAP intro2020 OWASP Thailand - ZAP intro
2020 OWASP Thailand - ZAP introSimon Bennetts
 
Zabbix 3.2 presentation June 2017
Zabbix 3.2 presentation June 2017Zabbix 3.2 presentation June 2017
Zabbix 3.2 presentation June 2017Amirhossein Saberi
 
Google App Engine: For PHP Developers
Google App Engine: For PHP DevelopersGoogle App Engine: For PHP Developers
Google App Engine: For PHP DevelopersAbu Ashraf Masnun
 
OSDC 2015: Bernd Erk | Why favour Icinga over Nagios
OSDC 2015: Bernd Erk | Why favour Icinga over NagiosOSDC 2015: Bernd Erk | Why favour Icinga over Nagios
OSDC 2015: Bernd Erk | Why favour Icinga over NagiosNETWAYS
 
Webinar - Matteo Manchi: Dal web al nativo: Introduzione a React Native
Webinar - Matteo Manchi: Dal web al nativo: Introduzione a React Native Webinar - Matteo Manchi: Dal web al nativo: Introduzione a React Native
Webinar - Matteo Manchi: Dal web al nativo: Introduzione a React Native Codemotion
 
2014 ZAP Workshop 1: Getting Started
2014 ZAP Workshop 1: Getting Started2014 ZAP Workshop 1: Getting Started
2014 ZAP Workshop 1: Getting StartedSimon Bennetts
 
AllDayDevOps ZAP automation in CI
AllDayDevOps ZAP automation in CIAllDayDevOps ZAP automation in CI
AllDayDevOps ZAP automation in CISimon Bennetts
 
Dimitri Bellini and Pietro Antonacci - Manage Zabbix Proxies in Remote Networ...
Dimitri Bellini and Pietro Antonacci - Manage Zabbix Proxies in Remote Networ...Dimitri Bellini and Pietro Antonacci - Manage Zabbix Proxies in Remote Networ...
Dimitri Bellini and Pietro Antonacci - Manage Zabbix Proxies in Remote Networ...Zabbix
 
BSides Manchester 2014 ZAP Advanced Features
BSides Manchester 2014 ZAP Advanced FeaturesBSides Manchester 2014 ZAP Advanced Features
BSides Manchester 2014 ZAP Advanced FeaturesSimon Bennetts
 
JoinSEC 2013 London - ZAP Intro
JoinSEC 2013 London - ZAP IntroJoinSEC 2013 London - ZAP Intro
JoinSEC 2013 London - ZAP IntroSimon Bennetts
 

What's hot (20)

Master-Master Replication and Scaling of an Application Between Each of the I...
Master-Master Replication and Scaling of an Application Between Each of the I...Master-Master Replication and Scaling of an Application Between Each of the I...
Master-Master Replication and Scaling of an Application Between Each of the I...
 
Vagrant in 15 minutes
Vagrant in 15 minutesVagrant in 15 minutes
Vagrant in 15 minutes
 
Sensu at brightpearl
Sensu at brightpearlSensu at brightpearl
Sensu at brightpearl
 
2020 ADDO Spring Break OWASP ZAP Automation
2020 ADDO Spring Break OWASP ZAP Automation2020 ADDO Spring Break OWASP ZAP Automation
2020 ADDO Spring Break OWASP ZAP Automation
 
Nagios Conference 2014 - James Clark - Nagios Cool Tips and Tricks
Nagios Conference 2014 - James Clark - Nagios Cool Tips and TricksNagios Conference 2014 - James Clark - Nagios Cool Tips and Tricks
Nagios Conference 2014 - James Clark - Nagios Cool Tips and Tricks
 
How to Shrink from 5 Tiers to 2 in a Multitier Microservices Architecture
 How to Shrink from 5 Tiers to 2 in a Multitier Microservices Architecture How to Shrink from 5 Tiers to 2 in a Multitier Microservices Architecture
How to Shrink from 5 Tiers to 2 in a Multitier Microservices Architecture
 
JavaOne 2014 Security Testing for Developers using OWASP ZAP
JavaOne 2014 Security Testing for Developers using OWASP ZAPJavaOne 2014 Security Testing for Developers using OWASP ZAP
JavaOne 2014 Security Testing for Developers using OWASP ZAP
 
2020 OWASP Thailand - ZAP intro
2020 OWASP Thailand - ZAP intro2020 OWASP Thailand - ZAP intro
2020 OWASP Thailand - ZAP intro
 
OMD and Check_mk
OMD and Check_mkOMD and Check_mk
OMD and Check_mk
 
Zabbix 3.2 presentation June 2017
Zabbix 3.2 presentation June 2017Zabbix 3.2 presentation June 2017
Zabbix 3.2 presentation June 2017
 
Google App Engine: For PHP Developers
Google App Engine: For PHP DevelopersGoogle App Engine: For PHP Developers
Google App Engine: For PHP Developers
 
Sensu Monitoring
Sensu MonitoringSensu Monitoring
Sensu Monitoring
 
Sensu
SensuSensu
Sensu
 
OSDC 2015: Bernd Erk | Why favour Icinga over Nagios
OSDC 2015: Bernd Erk | Why favour Icinga over NagiosOSDC 2015: Bernd Erk | Why favour Icinga over Nagios
OSDC 2015: Bernd Erk | Why favour Icinga over Nagios
 
Webinar - Matteo Manchi: Dal web al nativo: Introduzione a React Native
Webinar - Matteo Manchi: Dal web al nativo: Introduzione a React Native Webinar - Matteo Manchi: Dal web al nativo: Introduzione a React Native
Webinar - Matteo Manchi: Dal web al nativo: Introduzione a React Native
 
2014 ZAP Workshop 1: Getting Started
2014 ZAP Workshop 1: Getting Started2014 ZAP Workshop 1: Getting Started
2014 ZAP Workshop 1: Getting Started
 
AllDayDevOps ZAP automation in CI
AllDayDevOps ZAP automation in CIAllDayDevOps ZAP automation in CI
AllDayDevOps ZAP automation in CI
 
Dimitri Bellini and Pietro Antonacci - Manage Zabbix Proxies in Remote Networ...
Dimitri Bellini and Pietro Antonacci - Manage Zabbix Proxies in Remote Networ...Dimitri Bellini and Pietro Antonacci - Manage Zabbix Proxies in Remote Networ...
Dimitri Bellini and Pietro Antonacci - Manage Zabbix Proxies in Remote Networ...
 
BSides Manchester 2014 ZAP Advanced Features
BSides Manchester 2014 ZAP Advanced FeaturesBSides Manchester 2014 ZAP Advanced Features
BSides Manchester 2014 ZAP Advanced Features
 
JoinSEC 2013 London - ZAP Intro
JoinSEC 2013 London - ZAP IntroJoinSEC 2013 London - ZAP Intro
JoinSEC 2013 London - ZAP Intro
 

Similar to OSMC 2017 | Troubleshooting-icinga 2 by Thomas Widhalm

Nagios Conference 2014 - Andy Brist - Nagios XI Failover and HA Solutions
Nagios Conference 2014 - Andy Brist - Nagios XI Failover and HA SolutionsNagios Conference 2014 - Andy Brist - Nagios XI Failover and HA Solutions
Nagios Conference 2014 - Andy Brist - Nagios XI Failover and HA SolutionsNagios
 
The Proper Care and Feeding of a MySQL Database for Busy Linux Admins -- SCaL...
The Proper Care and Feeding of a MySQL Database for Busy Linux Admins -- SCaL...The Proper Care and Feeding of a MySQL Database for Busy Linux Admins -- SCaL...
The Proper Care and Feeding of a MySQL Database for Busy Linux Admins -- SCaL...Dave Stokes
 
Proper Care and Feeding of a MySQL Database for Busy Linux Administrators
Proper Care and Feeding of a MySQL Database for Busy Linux AdministratorsProper Care and Feeding of a MySQL Database for Busy Linux Administrators
Proper Care and Feeding of a MySQL Database for Busy Linux AdministratorsDave Stokes
 
Performance Monitoring with Icinga2, Graphite und Grafana
Performance Monitoring with Icinga2, Graphite und GrafanaPerformance Monitoring with Icinga2, Graphite und Grafana
Performance Monitoring with Icinga2, Graphite und GrafanaIcinga
 
Don't Suck at Building Stuff - Mykel Alvis at Puppet Camp Altanta
Don't Suck at Building Stuff  - Mykel Alvis at Puppet Camp AltantaDon't Suck at Building Stuff  - Mykel Alvis at Puppet Camp Altanta
Don't Suck at Building Stuff - Mykel Alvis at Puppet Camp AltantaPuppet
 
CIRCUIT 2015 - Monitoring AEM
CIRCUIT 2015 - Monitoring AEMCIRCUIT 2015 - Monitoring AEM
CIRCUIT 2015 - Monitoring AEMICF CIRCUIT
 
RandomTest - Random Software Integration Tests That Just Work for C/C++, Java...
RandomTest - Random Software Integration Tests That Just Work for C/C++, Java...RandomTest - Random Software Integration Tests That Just Work for C/C++, Java...
RandomTest - Random Software Integration Tests That Just Work for C/C++, Java...dcieslak
 
Squid, SquidGuard, and Lightsquid - pfSense Hangout March 2014
Squid, SquidGuard, and Lightsquid - pfSense Hangout March 2014Squid, SquidGuard, and Lightsquid - pfSense Hangout March 2014
Squid, SquidGuard, and Lightsquid - pfSense Hangout March 2014Netgate
 
2. Icinga Meetup Zurich - Monitor your Monitoring
2. Icinga Meetup Zurich - Monitor your Monitoring2. Icinga Meetup Zurich - Monitor your Monitoring
2. Icinga Meetup Zurich - Monitor your MonitoringMarco Fretz
 
Proactive monitoring tools or services - Open Source
Proactive monitoring tools or services - Open Source Proactive monitoring tools or services - Open Source
Proactive monitoring tools or services - Open Source B.A.
 
Liferay portals in real projects
Liferay portals  in real projectsLiferay portals  in real projects
Liferay portals in real projectsIBACZ
 
Linuxfest Northwest Proper Care and Feeding Of a MySQL for Busy Linux Admins
Linuxfest Northwest Proper Care and Feeding Of a MySQL for Busy Linux AdminsLinuxfest Northwest Proper Care and Feeding Of a MySQL for Busy Linux Admins
Linuxfest Northwest Proper Care and Feeding Of a MySQL for Busy Linux AdminsDave Stokes
 
OSMC 2012 | Shinken by Jean Gabès
OSMC 2012 | Shinken by Jean GabèsOSMC 2012 | Shinken by Jean Gabès
OSMC 2012 | Shinken by Jean GabèsNETWAYS
 
Eko10 workshop - OPEN SOURCE DATABASE MONITORING
Eko10 workshop - OPEN SOURCE DATABASE MONITORINGEko10 workshop - OPEN SOURCE DATABASE MONITORING
Eko10 workshop - OPEN SOURCE DATABASE MONITORINGPablo Garbossa
 
Introduction To ICT Security Audit OWASP Day Malaysia 2011
Introduction To ICT Security Audit OWASP Day Malaysia 2011Introduction To ICT Security Audit OWASP Day Malaysia 2011
Introduction To ICT Security Audit OWASP Day Malaysia 2011Linuxmalaysia Malaysia
 
Eko10 Workshop Opensource Database Auditing
Eko10  Workshop Opensource Database AuditingEko10  Workshop Opensource Database Auditing
Eko10 Workshop Opensource Database AuditingJuan Berner
 
Monitoring in the cloud with Puppet
Monitoring in the cloud with PuppetMonitoring in the cloud with Puppet
Monitoring in the cloud with PuppetKris Buytaert
 
PyGrunn2013 High Performance Web Applications with TurboGears
PyGrunn2013  High Performance Web Applications with TurboGearsPyGrunn2013  High Performance Web Applications with TurboGears
PyGrunn2013 High Performance Web Applications with TurboGearsAlessandro Molina
 

Similar to OSMC 2017 | Troubleshooting-icinga 2 by Thomas Widhalm (20)

Nagios Conference 2014 - Andy Brist - Nagios XI Failover and HA Solutions
Nagios Conference 2014 - Andy Brist - Nagios XI Failover and HA SolutionsNagios Conference 2014 - Andy Brist - Nagios XI Failover and HA Solutions
Nagios Conference 2014 - Andy Brist - Nagios XI Failover and HA Solutions
 
The Proper Care and Feeding of a MySQL Database for Busy Linux Admins -- SCaL...
The Proper Care and Feeding of a MySQL Database for Busy Linux Admins -- SCaL...The Proper Care and Feeding of a MySQL Database for Busy Linux Admins -- SCaL...
The Proper Care and Feeding of a MySQL Database for Busy Linux Admins -- SCaL...
 
Proper Care and Feeding of a MySQL Database for Busy Linux Administrators
Proper Care and Feeding of a MySQL Database for Busy Linux AdministratorsProper Care and Feeding of a MySQL Database for Busy Linux Administrators
Proper Care and Feeding of a MySQL Database for Busy Linux Administrators
 
Performance Monitoring with Icinga2, Graphite und Grafana
Performance Monitoring with Icinga2, Graphite und GrafanaPerformance Monitoring with Icinga2, Graphite und Grafana
Performance Monitoring with Icinga2, Graphite und Grafana
 
Don't Suck at Building Stuff - Mykel Alvis at Puppet Camp Altanta
Don't Suck at Building Stuff  - Mykel Alvis at Puppet Camp AltantaDon't Suck at Building Stuff  - Mykel Alvis at Puppet Camp Altanta
Don't Suck at Building Stuff - Mykel Alvis at Puppet Camp Altanta
 
Nagios En
Nagios EnNagios En
Nagios En
 
CIRCUIT 2015 - Monitoring AEM
CIRCUIT 2015 - Monitoring AEMCIRCUIT 2015 - Monitoring AEM
CIRCUIT 2015 - Monitoring AEM
 
RandomTest - Random Software Integration Tests That Just Work for C/C++, Java...
RandomTest - Random Software Integration Tests That Just Work for C/C++, Java...RandomTest - Random Software Integration Tests That Just Work for C/C++, Java...
RandomTest - Random Software Integration Tests That Just Work for C/C++, Java...
 
Squid, SquidGuard, and Lightsquid - pfSense Hangout March 2014
Squid, SquidGuard, and Lightsquid - pfSense Hangout March 2014Squid, SquidGuard, and Lightsquid - pfSense Hangout March 2014
Squid, SquidGuard, and Lightsquid - pfSense Hangout March 2014
 
Adminblast 2013
Adminblast 2013Adminblast 2013
Adminblast 2013
 
2. Icinga Meetup Zurich - Monitor your Monitoring
2. Icinga Meetup Zurich - Monitor your Monitoring2. Icinga Meetup Zurich - Monitor your Monitoring
2. Icinga Meetup Zurich - Monitor your Monitoring
 
Proactive monitoring tools or services - Open Source
Proactive monitoring tools or services - Open Source Proactive monitoring tools or services - Open Source
Proactive monitoring tools or services - Open Source
 
Liferay portals in real projects
Liferay portals  in real projectsLiferay portals  in real projects
Liferay portals in real projects
 
Linuxfest Northwest Proper Care and Feeding Of a MySQL for Busy Linux Admins
Linuxfest Northwest Proper Care and Feeding Of a MySQL for Busy Linux AdminsLinuxfest Northwest Proper Care and Feeding Of a MySQL for Busy Linux Admins
Linuxfest Northwest Proper Care and Feeding Of a MySQL for Busy Linux Admins
 
OSMC 2012 | Shinken by Jean Gabès
OSMC 2012 | Shinken by Jean GabèsOSMC 2012 | Shinken by Jean Gabès
OSMC 2012 | Shinken by Jean Gabès
 
Eko10 workshop - OPEN SOURCE DATABASE MONITORING
Eko10 workshop - OPEN SOURCE DATABASE MONITORINGEko10 workshop - OPEN SOURCE DATABASE MONITORING
Eko10 workshop - OPEN SOURCE DATABASE MONITORING
 
Introduction To ICT Security Audit OWASP Day Malaysia 2011
Introduction To ICT Security Audit OWASP Day Malaysia 2011Introduction To ICT Security Audit OWASP Day Malaysia 2011
Introduction To ICT Security Audit OWASP Day Malaysia 2011
 
Eko10 Workshop Opensource Database Auditing
Eko10  Workshop Opensource Database AuditingEko10  Workshop Opensource Database Auditing
Eko10 Workshop Opensource Database Auditing
 
Monitoring in the cloud with Puppet
Monitoring in the cloud with PuppetMonitoring in the cloud with Puppet
Monitoring in the cloud with Puppet
 
PyGrunn2013 High Performance Web Applications with TurboGears
PyGrunn2013  High Performance Web Applications with TurboGearsPyGrunn2013  High Performance Web Applications with TurboGears
PyGrunn2013 High Performance Web Applications with TurboGears
 

Recently uploaded

Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Cizo Technology Services
 
Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)Ahmed Mater
 
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样umasea
 
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdf
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdfExploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdf
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdfkalichargn70th171
 
Automate your Kamailio Test Calls - Kamailio World 2024
OSMC 2017 | Troubleshooting Icinga 2 by Thomas Widhalm

  • 1. www.netways.de Troubleshooting Icinga 2 OSMC | Thomas Widhalm | 2017-11-23
  • 2. Me
      ● Thomas Widhalm
      ● Netways: Senior Consultant / Lead Support Engineer
      ● Icinga: Team member QA & Docs
      ● Co-Author Icinga 2 book
      ● @widhalmt
      ● Thomas.widhalm@netways.de / thomas.widhalm@icinga.com
      ● Does not like making slides :-(
  • 3. Please
      ● I’ll tell you lots of obvious things
          – Read the docs
          – Keep accounts at the ready
          – Test your setup
          – Do config checks
          – Read logs
  • 5. Why do I still have to say this?
  • 6. Agenda
      ● Is Icinga 2 doing what I want it to do?
      ● Why is Icinga 2 not working?
      ● What can I do if Icinga 2 is not working?
      ● Where can I get help if Icinga 2 is not working?
      ● What information can I provide beforehand if Icinga 2 is not working?
  • 7. Is Icinga doing what I want it to do?
  • 8. Is Icinga 2 doing what I want it to do?
      ● Flexibility is good and bad at the same time
  • 9. Check what you are checking
      ● Reachability of hosts (ping)
      ● Basic OS checks (load, mem, disk, …)
      ● Reachability of services (http, ports, …)
      ● Details of services (certificates, status codes, …)
      ● End-to-end monitoring (send mail – receive mail, …)
      ● Business processes
      Is this what you need to find out about outages?
  • 10. Common mistakes
      ● “It’s pingable so it must be fully functional”
      ● “I can connect to port 80/tcp so my LAMP stack is OK”
      ● “We can reach our website from my office so our website is available”
  • 11. The cure
      ● Think you monitor everything you need? Think again. And again.
      ● Do tests
          – Turn off services
          – Turn off parts of your stacks (only database, only middleware, …)
          – Block connections (firewall)
      ● If you had an outage without an alarm, adjust your checks!
  • 12. Don’t overdo it
      ● Do you really need to know about every single switchport?
      ● Think a lot about check intervals.
      ● Don’t kill your hosts/remote sites by “overmonitoring”
          – API calls
          – SNMP checks
          – SQL queries
  • 13. Monitor your Monitoring
      ● Monitoring hosts are production hosts. Treat them as such.
      ● Don’t forget about your satellites
  • 15. About High Availability
      ● “I won’t need Monitoring to tell me that my whole VMware farm went offline.”
          – Maybe not. But you will want it to help you keep an overview of what exactly went down.
      ● “I’ll have other problems than watching Monitoring during a major outage.”
          – Why? Use it for coordinating what is up and what still needs attention.
      ● “I’ll cluster everything so I won’t have to bother about my Monitoring.”
          – Almost every HA solution needs regular attention. You might end up with more work to do than without high availability.
          – Think about it. HA is a great thing to have, but is it really what you want?
      ● Consider having autonomous Monitoring hosts
          – Hardware, extra UPS, SMS modem, etc.
  • 16. Rethink your notifications
      ● Is separation of OS and services always a good thing?
          – A service might be OK while the OS is close to collapsing
          – Knowing about the OS helps with root cause analysis
          – Knowing about failing services helps with prioritizing problems
      ● Is “critical” the only state to notify on?
          – Always test what state every plugin returns for what sort of problem
      ● Could I flood users with alarms?
          – Use dependencies
          – Don’t escalate every alarm to ticketing
  • 17. No passive check without active check
      ● Don’t rely on passive data
          – Passive checks
          – SNMP traps
          – Logfiles
      ● Use at least one active check to know whether the sender has nothing to send or is dead
      ● Passive data can enrich your information, though
  • 20. Reasons for Icinga 2 not working
      1) Configuration errors
      2) Monitoring the wrong thing (see previous slides)
      3) Not monitoring the monitoring system
      4) Something else
      5) Another thing
      6) Bugs
  • 21. Configuration errors
      ● Can be found with the icinga2 daemon -C command. Use it!
      ● Check configuration on satellites with the icinga check
      ● Beware of checks running on the wrong node
          – Defining which host executes a plugin can be a bit tricky
          – Use checks like disk that have host-specific output
      ● Use icinga2 object list to check if your apply rules work as desired
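The icinga2 object list check of apply rules on the slide above can be scripted. The following is a minimal Python sketch, assuming the icinga2 binary is on the PATH of the node being inspected. The helper names object_list_cmd and list_objects are hypothetical and not part of Icinga; only the --type and --name options of icinga2 object list are used.

```python
import subprocess


def object_list_cmd(obj_type, name_pattern=None):
    """Build an `icinga2 object list` command line (hypothetical helper)."""
    cmd = ["icinga2", "object", "list", "--type", obj_type]
    if name_pattern is not None:
        # --name takes a wildcard pattern that is matched against object names
        cmd += ["--name", name_pattern]
    return cmd


def list_objects(obj_type, name_pattern=None):
    """Run the command on the local node and return its text output."""
    result = subprocess.run(
        object_list_cmd(obj_type, name_pattern),
        capture_output=True,
        text=True,
        check=True,  # raise if icinga2 exits with an error
    )
    return result.stdout
```

Calling list_objects("Service", "*disk*") on a node would print the Service objects whose names contain "disk", which is one way to see whether an apply rule created the objects you expected.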
  • 22. Monitoring the wrong thing
      ● Running a plugin on the wrong host
      ● Running checks that end up neither OK nor Critical when a service fails
      ● Plugins that use Icinga states incorrectly (e.g. Critical instead of Unknown)
      ● Sending notifications only on Critical
          – Rethink if “Unknown” could indicate a loss of service, too
      ● Checking specific metrics which might be OK even when the service is dead
  • 23. Not monitoring the monitoring system
      ● Full disks on Icinga hosts
      ● Broken database connection between Icinga 2 and Icinga Web 2
      ● Failed HA
      ● Dead satellites
      ● Notifications not going through
      ● Time offset (-> NTP!)
  • 24. Monitor Icinga
      ● Use the same basic checks as for other Linux hosts
          – Focus on I/O and disk usage
      ● Use the internal checks
          – New checks introduced
          – Updates for existing checks
      ● Monitor services used by Icinga
          – IDO database
          – Grapher
          – Webserver
      ● Provide an alternate way for notifications
          – SMS
          – Telegram
  • 25. Internal check “icinga”
      ● Configuration check on satellites
      ● Lots of performance data
          – Use e.g. latency to identify very slow hosts
      ● Shows the version so you can use it for updates
  • 26. Internal check “cluster”
      ● Use on nodes with just a few connected endpoints
  • 27. Internal check “cluster-zone”
      ● Add to every agent but run on the parent node
      ● Use as parent for dependencies
  • 28. Internal check “IDO”
      ● Run on every node with the IDO feature enabled
      ● Very helpful performance data (esp. pending_queries)
  • 29. Check services used by Icinga
      ● Don’t just monitor if they are available
      ● Check if Icinga can connect (e.g. check_mysql)
      ● Check if the service is doing OK, too (e.g. check_mysql_health)
  • 30. Bugs
      ● Error messages, log entries, crash dumps, etc.
      ● Very few bugs come without the aforementioned symptoms
      ● Check https://github.com/Icinga/icinga2/issues
      ● Try to reproduce (e.g. Vagrant boxes)
      ● File a bug report (even if you have a support contract)
  • 31. Reproduce with Vagrant
      ● Build your own boxes
      ● Use the official ones
          – https://github.com/Icinga/icinga-vagrant
          – Some bugs are OS-specific, many aren’t
          – Config errors are almost never OS-specific
  • 32. Why is Icinga 2 not working like I want it to?
      ● What sort of “not working”?
      ● Icinga 2 not running
          – Configuration checks
          – Logs
      ● No notification
          – icinga2 object list --type notification
          – Logs
      ● ...
  • 33. Logs
      ● Collect them
          – Elastic Stack (Logstash rules available soon)
          – Graylog
      ● Monitor them
      ● Tweak them
          – Use the “Debug” log when searching for problems
          – Keep an eye on the disk usage
      ● Use “icingabeat” for even more complete data collection
  • 34. Monitor Icinga
      ● Read the docs
          – https://www.icinga.com/docs/icinga2/latest/doc/08-advanced-topics/#monitoring-icinga-2
      ● Logs (again)
      ● Prepare for complete loss of Icinga 2
          – High availability
          – Users watching Icinga Web 2 / Dashing
          – Rudimentary monitoring (e.g. SMS gateways)
          – Extra satellites with notifications (Netways Web Services)
  • 35. Reduce oddities
      ● Try to stick to standards
      ● Use official packages
          – Vendors ship patched libraries with matching versions
          – Custom-built packages work most of the time but can have hard-to-find problems
      ● Use certificates from the Icinga CA
          – Or at least from a proven alternative like Puppet CA
          – No benefit from certificates from your company’s CA
  • 36. What can I do if Icinga is not working?
  • 37. What can I do if Icinga 2 is not working?
  • 38. Read the docs
      ● Troubleshooting section
          – https://www.icinga.com/docs/icinga2/latest/doc/15-troubleshooting/
  • 39. Configuration errors
      ● Always run icinga2 daemon -C when changing something
      ● Use a version control system for your configuration
          – Consider using a script:
              ● Check configuration
              ● Commit to version control
              ● Reload Icinga
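The three script steps on the slide above (check configuration, commit to version control, reload Icinga) could be sketched as follows. This is a minimal Python sketch under stated assumptions, not the speaker's script: it assumes git as the version control system, a systemd-managed icinga2 service, and /etc/icinga2 as the configuration directory. The names STEPS and deploy are hypothetical.

```python
import subprocess

# Assumptions: git is the version control system and icinga2 runs as a
# systemd service. Adjust both commands to the local setup.
STEPS = [
    ["icinga2", "daemon", "-C"],                               # check configuration
    ["git", "commit", "-a", "-m", "Icinga 2 config change"],   # commit to version control
    ["systemctl", "reload", "icinga2"],                        # reload Icinga
]


def deploy(steps=STEPS, config_dir="/etc/icinga2", run=subprocess.run):
    """Run the steps in order inside config_dir and stop at the first failure.

    Returns the list of commands that completed successfully.
    """
    done = []
    for cmd in steps:
        if run(cmd, cwd=config_dir).returncode != 0:
            break  # e.g. a failed config check must never lead to a reload
        done.append(cmd)
    return done
```

Stopping at the first non-zero exit code is the point of the ordering: a configuration that does not validate is neither committed nor loaded. Note that git commit also exits non-zero when there is nothing to commit, so in this sketch an unchanged configuration is not reloaded either.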
  • 40. Read the logs
      ● The standard main log should provide plenty of information
      ● Use the debug log carefully
          – Lots of data
          – Lots of I/O
      ● Don’t forget about system logs
          – /var/log/messages, dmesg, etc.
          – OOM kills, etc.
  • 41. Check your Icinga nodes
      ● External factors are common sources of monitoring problems
          – Most common bottleneck: I/O
              ● IDO database
              ● Grapher
          – Network bottlenecks
              ● Check results
              ● Config sync
              ● API log filling up
          – Out of memory
      ● Again: Monitor your monitoring nodes
  • 42. Problems with single checks
      ● Lots of components involved
      ● Break them down
          – Run the plugin manually (as the icinga user!)
          – Review the log of the executing host
          – Check again if the plugin is executed on the right host
  • 43. Why always run checks as the icinga user?
      ● Check permissions (obviously)
      ● Temporary files (not so obvious)
          – Some plugins create temporary files
          – When run as a different user, Icinga might not be permitted to use/change them
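Running a plugin manually as the service user, as the two slides above recommend, can be wrapped in a small helper. This is a minimal Python sketch assuming sudo is available and that the service account is called icinga; on some distributions Icinga 2 runs under a different account (for example nagios), so the user is a parameter. The function names are hypothetical.

```python
import subprocess


def as_icinga_user(plugin_cmd, user="icinga"):
    """Prefix a plugin command line so that it runs as the Icinga service user."""
    # `sudo -u USER` runs the command as USER; "--" ends sudo's own options
    return ["sudo", "-u", user, "--"] + list(plugin_cmd)


def run_plugin(plugin_cmd, user="icinga"):
    """Run the plugin as the service user and return (exit code, output).

    Plugin exit codes: 0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN.
    """
    result = subprocess.run(
        as_icinga_user(plugin_cmd, user),
        capture_output=True,
        text=True,
    )
    return result.returncode, result.stdout.strip()
```

Because the plugin then runs with the same permissions as under Icinga, any temporary files it creates are owned by the service user and permission problems show up exactly as they would in production.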
  • 45. Compare different versions of reality
      ● Check the configuration on disk
          – Reviewing files
          – icinga2 object list
      ● Check the API
          – https://www.icinga.com/docs/icinga2/latest/doc/12-icinga2-api/
          – Prepare a user before problems occur
          – Use a script, curl configuration file or alias for the connection
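A prepared script for the API connection, as suggested on the slide above, might start like this. It is a minimal Python sketch assuming the API listens on its default port 5665 and an API user with a password already exists; the function name api_request and the credentials in the usage note are placeholders. The function only builds the request, so it can be inspected before anything is sent.

```python
import base64
import urllib.request


def api_request(path, user, password, host="localhost", port=5665):
    """Build (but do not send) a GET request for the Icinga 2 REST API."""
    req = urllib.request.Request(f"https://{host}:{port}/v1/{path}")
    # HTTP basic authentication with the prepared API user
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req.add_header("Authorization", f"Basic {token}")
    # Ask the API for a JSON response
    req.add_header("Accept", "application/json")
    return req
```

A request built with api_request("objects/hosts", "apiuser", "secret") can be sent with urllib.request.urlopen, passing an ssl context that trusts the CA which signed the API certificate. Comparing the returned objects with the output of icinga2 object list shows whether the running daemon and the configuration on disk agree.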
  • 47. Where can I get help if Icinga is not working?
  • 49. Resources for help
      ● Discussion boards
          – https://monitoring-portal.org
      ● Mailing lists
          – https://lists.icinga.org/mailman/listinfo/icinga-users
      ● Partners
          – https://www.icinga.com/partners/
  • 51. Common reactions to problems that need assistance to fix
      ● “OMGOMGOMGOMG!!!”
          – Fix by having disaster plans ready
          – Gain confidence by tests and rehearsals
      ● “What’s my GitHub account again?”
          – Store your account data in a password safe where you can find it
          – Not as common as you might think...
      ● “If I hit this problem, someone else will, too, so I’ll wait for them to file a report.”
          – Nope. Do it. NOW. (As long as it is a confirmed bug)
      ● “I’m just too dumb for RTFM.”
          – You can file documentation enhancement issues, too
  • 52. Even better – get involved early!
  • 53. Ways to get involved
      ● Become a regular reader (and poster) in the boards / lists
          – Help others
          – Get used to the community support channels
          – Be informed about common / upcoming problems early
      ● Review issues on GitHub
          – Help in solving “non-bug” issues
          – Provide information on issues you had yourself
          – Again, be informed early
      ● Contribute
          – File pull requests on GitHub
  • 55. Get on the team
      ● Get on the team
          – https://www.icinga.com/about/team/
          – You can contribute in every way without being on the team
          – If you think you have contributed enough, ask for acceptance
          – Far from every team member is a developer
  • 56. Support Contracts
      ● Partners provide support, Icinga doesn’t
      ● Synchronized service levels
  • 57. Bugs and feature requests when you have a support contract
      ● “If there’s something to code, we’ll need an issue”
          – Development is always tracked on GitHub
      ● You’ll want to file your own issues
          – Get informed of every reply as fast as possible
          – Provide more information
          – Gather reputation for finding bugs / reporting feature requests
  • 58. What information can I provide beforehand?
  • 59. What information can I provide beforehand if Icinga 2 is not working?
      ● See documentation again
          – https://www.icinga.com/docs/icinga2/latest/doc/15-troubleshooting/
  • 60. What information (an overview)
      ● What did you expect Icinga to do?
      ● What didn’t Icinga do?
          – Common question from support: “What tells you that something went wrong?”
  • 61. What information (an overview)
      ● Your Icinga master
          – HA? Cluster? Features?
          – Database? Which? HA? Extra host?
          – Configuration. Flat files? Director? API?
      ● Your satellites
          – HA? Cluster?
          – How many?
      ● Agents
          – What OSes?
          – How many?
      ● Grapher
          – Which?
          – Extra host?
  • 62. What information (an overview)
      ● Environment
          – OS version
          – Software versions
          – Virtualisation
          – Installed components/modules
          – Network (firewall zones, remote sites, ...)
  • 63. What information (an overview)
      ● Oddities
          – Certificates not from the Icinga CA
          – Custom-built packages
  • 65. Automated Diagnostics Script
      ● Temporary source
          – https://github.com/widhalmt/icinga2-diagnostics
          – Will be moved ASAP
      ● Current status: Quick hack after a remote session for a support customer
      ● Goal
          – Gather the information asked for in most replies to new support tickets
          – Don’t overwhelm with too much information (don’t just copy the configuration)
          – Sending the full configuration for deep dives might become an option