What do I do if Icinga 2 stops working?
How can I find out what’s wrong and how do I fix it?
Where can I find help and what information should I provide?
And first and foremost: how do I know *if* Icinga 2 is doing what I want?
If you ask yourself at least one of these questions regularly, then this talk is for you.
2. www.netways.de
Me
●
Thomas Widhalm
●
Netways: Senior Consultant / Lead Support Engineer
●
Icinga: Team member QA & Docs
●
Co-Author Icinga 2 book
●
@widhalmt
●
Thomas.widhalm@netways.de / thomas.widhalm@icinga.com
●
Does not like making slides :-(
Please
●
I‘ll tell you lots of obvious things
– Read the docs
– Keep accounts at the ready
– Test your setup
– Do config checks
– Read logs
Agenda
• Is Icinga 2 doing what I want it to do?
• Why is Icinga 2 not working?
• What can I do if Icinga 2 is not working?
• Where can I get help if Icinga 2 is not working?
• What information can I provide beforehand if Icinga 2 is not working?
Check what you are checking
●
Reachability of hosts (ping)
●
Basic OS checks (load, mem, disk,…)
●
Reachability of services (http, ports,…)
●
Details of services (certificates, status codes,…)
●
End-to-end monitoring (send mail – receive mail,…)
●
Business processes
Is this what you need to find out about outages?
Common mistakes
●
„It‘s pingable so it must be fully functional“
●
„I can connect to port 80/tcp so my LAMP stack is OK“
●
„We can reach our website from my office so our website is available“
The cure
●
Think you monitor everything you need? Think again. And again.
●
Do tests
– Turn off services
– Turn off parts of your stacks (only database, only middleware,…)
– Block connections (Firewall)
●
If you had an outage without an alarm, adjust your checks!
Don‘t overdo it
●
Do you really need to know about every single switchport?
●
Think a lot about check intervals.
●
Don‘t kill your hosts/remote sites by „overmonitoring“
– API calls
– SNMP checks
– SQL Queries
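A rough calculation helps when thinking about check intervals. The sketch below is only an illustration: the function name and the example numbers are made up, and it ignores retry intervals and how long a single check runs.

```python
def checks_per_second(num_checks: int, interval_seconds: float) -> float:
    """Average number of check executions per second for a given check interval."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return num_checks / interval_seconds


# Hypothetical example: 6000 SNMP checks against the devices of one remote site.
every_minute = checks_per_second(6000, 60)
every_five_minutes = checks_per_second(6000, 300)
print(f"1 min interval: {every_minute:.0f} checks/s")
print(f"5 min interval: {every_five_minutes:.0f} checks/s")
```

Going from a one-minute to a five-minute interval cuts the average load on the monitored side to a fifth, which is often the difference between monitoring a site and overmonitoring it.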
About High Availability
●
„I won’t need monitoring to tell me that my whole VMware farm went offline.“
– Maybe not. But you will want it to help you keep an overview of what exactly went down.
●
„I’ll have other problems than watching the monitoring during a major outage.“
– Why? Use it to coordinate what is up and what still needs attention.
●
„I’ll cluster everything so I won’t have to bother about my monitoring.“
– Almost every HA solution needs regular attention. You might end up with more work to do than without high availability.
– Think about it. HA is a great thing to have, but is it really what you want?
●
Consider having autonomous monitoring hosts
– Hardware, extra UPS, SMS Modem, etc.
Rethink your notifications
●
Is separation of OS and services always a good thing?
– A service might be ok while the OS is close to collapsing
– Knowing about the OS helps with root cause analysis
– Knowing about failing services helps with prioritizing problems
●
Is „critical“ the only state to notify on?
– Always test which state each plugin returns for which kind of problem
●
Could I flood users with alarms?
– Use dependencies
– Don‘t escalate every alarm to ticketing
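To test which state a plugin returns, it is enough to run it and look at its exit code. The following is a minimal sketch, assuming the usual plugin convention of exit codes 0 to 3; the helper name `run_plugin` and the stand-in "plugin" are made up for illustration, and any other exit code is simply treated as UNKNOWN here.

```python
import subprocess
import sys
from typing import List, Tuple

# Exit codes as defined by the monitoring plugin convention.
STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def run_plugin(command: List[str], timeout: float = 30.0) -> Tuple[str, str]:
    """Run a plugin command and return (state name, first line of its output)."""
    result = subprocess.run(command, capture_output=True, text=True, timeout=timeout)
    # Assumption of this sketch: anything outside 0-3 counts as UNKNOWN.
    state = STATES.get(result.returncode, "UNKNOWN")
    lines = result.stdout.strip().splitlines()
    first_line = lines[0] if lines else ""
    return state, first_line


if __name__ == "__main__":
    # Stand-in for a real plugin: prints a message and exits with code 1.
    fake_plugin = [
        sys.executable,
        "-c",
        "import sys; print('DISK WARNING - 12% free'); sys.exit(1)",
    ]
    print(run_plugin(fake_plugin))
```

Running the same helper against a real plugin while the service is deliberately broken shows whether the failure ends up as Warning, Critical or Unknown, and therefore which states need a notification.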
No passive check without active check
●
Don‘t rely on passive data
– Passive checks
– SNMP Traps
– Logfiles
●
Use at least one active check to know whether the sender has nothing to send or is dead
●
Passive data can enrich your information, though
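The reasoning above can be written down as a small decision function. This is a sketch only: the function name, its parameters and the returned texts are invented for illustration, and in practice the two inputs would come from an active check and from the timestamp of the last passive result.

```python
from datetime import datetime, timedelta


def sender_status(last_passive_result: datetime, active_check_ok: bool,
                  now: datetime, max_silence: timedelta) -> str:
    """Combine one active check with the age of the last passive result."""
    if not active_check_ok:
        # The active check failed: silence means the sender is gone.
        return "sender is dead"
    if now - last_passive_result > max_silence:
        # The sender answers, it just has had nothing to report for a while.
        return "sender is alive but silent"
    return "sender is alive and reporting"


now = datetime(2024, 1, 1, 12, 0)
two_hours_ago = now - timedelta(hours=2)
print(sender_status(two_hours_ago, True, now, timedelta(hours=1)))
print(sender_status(two_hours_ago, False, now, timedelta(hours=1)))
```

Without the active check, both cases in the example would look identical: no passive data for two hours.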
Reasons for Icinga 2 not working
1) Configuration errors
2) Monitoring the wrong thing (see previous slides)
3) Not monitoring the monitoring system
4) Something else
5) Another thing
6) Bugs
Configuration errors
●
Can be found with the icinga2 daemon -C command. Use it!
●
Check configuration on satellites with the icinga check
●
Beware of checks running on the wrong node
– Defining which host executes a plugin can be a bit tricky
– Use checks like disk that have host-specific output
●
Use icinga2 object list to check whether your apply rules work as desired
Monitoring the wrong thing
●
Running plugin on the wrong host
●
Running checks that end up neither OK nor Critical (e.g. Warning or Unknown) when a service fails
●
Plugins that use Icinga states incorrectly (e.g. Critical instead of Unknown)
●
Sending notifications only on Critical
– Rethink if „Unknown“ could indicate a loss of service, too
●
Checking specific metrics which might be ok even when the service is dead
Not monitoring the monitoring system
●
Full disks on Icinga hosts
●
Broken database connection between Icinga 2 and Icinga Web 2
●
Failed HA
●
Dead satellites
●
Notifications not going through
●
Time offset (->NTP!)
Monitor Icinga
●
Use the same basic checks as for other Linux hosts
– Focus on I/O and disk usage
●
Use the internal checks
– New checks introduced
– Updates for existing checks
●
Monitor services used by Icinga
– IDO Database
– Grapher
– Webserver
●
Provide an alternative way to send notifications
– SMS
– Telegram
Check services used by Icinga
●
Don‘t just monitor if they are available
●
Check if Icinga can connect (e.g. check_mysql)
●
Check if the service is doing OK, too (e.g. check_mysql_health)
Bugs
●
Error messages, log entries, crash dumps, etc.
●
Very few bugs come without the aforementioned symptoms
●
Check https://github.com/Icinga/icinga2/issues
●
Try to reproduce (e.g. Vagrant boxes)
●
File a bug report (even if you have a support contract)
Reproduce with Vagrant
●
Build your own boxes
●
Use the official ones
– https://github.com/Icinga/icinga-vagrant
– Some bugs are OS specific, many aren‘t
– Config errors are almost never OS specific
Why is Icinga 2 not working like I want it to?
●
What sort of „not working“?
●
Icinga 2 not running
– Configuration checks
– Logs
●
No notification
– icinga2 object list --type notification
– Logs
●
...
Logs
●
Collect them
– Elastic Stack (Logstash rules available soon)
– Graylog
●
Monitor them
●
Tweak them
– Use „Debug“ Log when searching for problems
– Keep an eye on the disk usage
●
Use „icingabeat“ for even more complete data collection
Monitor Icinga
●
Read the docs
– https://www.icinga.com/docs/icinga2/latest/doc/08-advanced-topics/#monitoring-icinga-2
●
Logs (again)
●
Prepare for complete loss of Icinga 2
– High availability
– Users watching Icinga Web 2 / Dashing
– Rudimentary monitoring (e.g. SMS Gateways)
– Extra Satellites with Notifications (Netways Web Services)
Reduce oddities
●
Try to stick to standards
●
Use official packages
– Vendors ship patched libraries with matching versions
– Custom built packages work most of the time but can have hard-to-find problems
●
Use certificates from the Icinga CA
– Or at least from a proven alternative like Puppet CA
– No benefit from certificates from your company’s CA
Configuration errors
●
Always run icinga2 daemon -C when changing something
●
Use a version control system for your configuration
– Consider using a script:
●
Check configuration
●
Commit to version control
●
Reload Icinga
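The three steps of such a script could look like the sketch below. It rests on assumptions that may not match your setup: the configuration lives in /etc/icinga2, that directory is already a Git repository, Icinga 2 is managed by systemd, and the script runs with sufficient privileges. The function names are made up for this example.

```python
import subprocess
from typing import Callable, List

# Assumption: default configuration directory, already under Git version control.
CONFIG_DIR = "/etc/icinga2"


def build_steps(message: str, config_dir: str = CONFIG_DIR) -> List[List[str]]:
    """Return the commands for: check configuration, commit, reload."""
    return [
        ["icinga2", "daemon", "-C"],                         # check configuration
        ["git", "-C", config_dir, "add", "--all"],           # stage all changes
        ["git", "-C", config_dir, "commit", "-m", message],  # commit to version control
        ["systemctl", "reload", "icinga2"],                  # reload Icinga
    ]


def deploy(message: str, run: Callable[[List[str]], int] = subprocess.call) -> bool:
    """Run the steps in order and stop at the first one that fails."""
    for step in build_steps(message):
        if run(step) != 0:
            print("Step failed, stopping:", " ".join(step))
            return False
    return True


# Usage on an Icinga node (not executed here):
#   deploy("Add disk checks for the new web servers")
```

Because the loop stops at the first failing command, a broken configuration is never committed or reloaded. One side effect of this simple design: if there is nothing to commit, `git commit` returns a non-zero exit code and the reload is skipped as well.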
Read the logs
●
The standard main log should provide plenty of information
●
Use the debug log carefully
– Lots of data
– Lots of I/O
●
Don‘t forget about system logs
– /var/log/messages, dmesg, etc.
– OOM kills, etc.
Check your Icinga nodes
●
External factors are common sources of monitoring problems
– Most common bottleneck: I/O.
●
IDO Database
●
Grapher
– Network bottlenecks
●
Check results
●
Config sync
●
API Log filling up
– Out of memory
●
Again: Monitor your monitoring nodes
Problems with single checks
●
Lots of components involved
●
Break them down
– Run plugin manually (as icinga user!)
– Review log of executing host
– Check again if the plugin is executed on the right host
Why always run checks as icinga user?
●
Check permissions (obviously)
●
Temporary files (not so obvious)
– Some plugins create temporary files
– When run as a different user, Icinga might not be permitted to use/change them
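One way to make sure a manual test run uses the right user is to wrap the plugin call in `sudo -u`. The sketch below only builds the command line. The user name `icinga` and the plugin path are assumptions that differ between distributions (some packages use the user `nagios` or a different plugin directory), and the helper name is made up.

```python
import shlex
from typing import List


def as_icinga_user(plugin_command: List[str], user: str = "icinga") -> List[str]:
    """Prefix a plugin command so that it runs as the Icinga service user."""
    return ["sudo", "-u", user, "--"] + plugin_command


# Assumed plugin path; adjust it to your distribution.
command = as_icinga_user(
    ["/usr/lib64/nagios/plugins/check_disk", "-w", "20%", "-c", "10%", "-p", "/"]
)
print(shlex.join(command))
```

The printed line can be pasted into a shell on the host that is supposed to execute the check. Temporary files created during that run then belong to the same user Icinga uses later.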
Compare different versions of reality
●
Check the configuration on disk
– Reviewing files
– icinga2 object list
●
Check the API
– https://www.icinga.com/docs/icinga2/latest/doc/12-icinga2-api/
– Prepare a user before problems occur
– Use a script, a curl configuration file or an alias for the connection
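A prepared script for the API connection can be very small. The sketch below only builds the request for the `/v1/status` endpoint on the default API port 5665. The API user `diag`, its password and the helper name are hypothetical, and such a user has to exist in the ApiUser configuration before problems occur.

```python
import base64
import urllib.request


def build_status_request(user: str, password: str,
                         host: str = "localhost", port: int = 5665) -> urllib.request.Request:
    """Build an authenticated request for the Icinga 2 API status endpoint."""
    url = f"https://{host}:{port}/v1/status"
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request = urllib.request.Request(url)
    request.add_header("Accept", "application/json")      # the API expects this header
    request.add_header("Authorization", f"Basic {token}")  # HTTP basic authentication
    return request


# Hypothetical API user prepared in advance.
request = build_status_request("diag", "secret")
print(request.full_url)
```

To actually send the request, pass it to `urllib.request.urlopen` together with an SSL context that trusts the Icinga CA certificate, for example one created with `ssl.create_default_context(cafile=...)`. The location of the CA file depends on the installation.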
Common reactions to problems that need assistance to fix
●
„OMGOMGOMGOMG!!!“
– Fix by having disaster plans ready
– Gain confidence by tests and rehearsals
●
„What‘s my GitHub account again?“
– Store your account data in a password safe where you can find it
– Not as common as you might think...
●
„If I hit this problem, someone else will, too, so I’ll wait for them to file a report.“
– Nope. Do it. NOW. (As long as it is a confirmed bug)
●
„I‘m just too dumb for RTFM.“
– You can file documentation enhancement issues, too
Ways to get involved
●
Become a regular reader (and poster) in the boards / lists
– Help others
– Get used to the community support channels
– Be informed about common / upcoming problems early
●
Review Issues on GitHub
– Help in solving „non-bug“ issues
– Add information to issues you have run into yourself
– Again, be informed early
●
Contribute
– File pull requests on GitHub
Get on the team
●
Get on the team
– https://www.icinga.com/about/team/
– You can contribute in every way without being on the team
– If you think you have contributed enough, ask to be accepted
– Far from every team member is a developer
Bugs and Feature requests when you have a support contract
●
„If there‘s something to code, we‘ll need an issue“
– Development is always tracked on GitHub
●
You‘ll want to file your own issues
– Get informed of every reply as fast as possible
– Provide more information
– Gather reputation for finding bugs / reporting feature requests
What information can I provide beforehand if Icinga 2 is not working?
●
See documentation again
– https://www.icinga.com/docs/icinga2/latest/doc/15-troubleshooting/
What information (an overview)
●
What did you expect Icinga to do?
●
What didn‘t Icinga do?
– Common question from support: „What tells you that something went wrong?“
What information (an overview)
●
Your Icinga master
– HA? Cluster? Features?
– Database? Which? HA? Extra Host?
– Configuration. Flat Files? Director? API?
●
Your satellites
– HA? Cluster?
– How many?
●
Agents
– Which OSes?
– How many?
●
Grapher
– Which?
– Extra Host?
What information (an overview)
●
Environment
– OS version
– Software versions
– Virtualisation
– Installed Components/Modules
– Network (Firewall Zones, remote sites,...)
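Some of the environment details listed above can be collected automatically. The following is an independent, minimal sketch rather than an official tool: the function name and the dictionary keys are made up, and it only covers the OS and the Icinga 2 version.

```python
import platform
import shutil
import subprocess
from typing import Dict


def gather_environment() -> Dict[str, str]:
    """Collect basic OS and Icinga 2 version information for a support request."""
    info = {
        "os": platform.platform(),      # OS name and version as reported by Python
        "kernel": platform.release(),
        "machine": platform.machine(),
    }
    if shutil.which("icinga2"):
        result = subprocess.run(["icinga2", "--version"], capture_output=True, text=True)
        info["icinga2"] = result.stdout.strip()
    else:
        info["icinga2"] = "icinga2 binary not found in PATH"
    return info


for key, value in gather_environment().items():
    print(f"{key}: {value}")
```

Virtualisation, installed modules and the network layout still have to be described by hand, since they cannot be read reliably from a single node.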
Automated Diagnostics Script
●
Temporary Source
– https://github.com/widhalmt/icinga2-diagnostics
– Will be moved ASAP
●
Current status: Quick hack after remote session for support customer
●
Goal
– Gather information asked in most replies to new support tickets
– Don‘t overwhelm with too much information (don‘t just copy the configuration)
– Sending full configuration for deep dives might become an option