This document provides tips for maintaining sanity when using Nagios for system monitoring. It recommends starting with global practices like thorough documentation, centralized authentication, and using existing plugins when possible. For small-medium setups, it emphasizes well-structured configuration files and automation. Large installations should consider distributed monitoring, centralized management, and web-based configuration tools. The document warns against over-engineering practices and provides additional resources for learning more about Nagios.
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
Staying Sane with Nagios
1. Staying Sane with Nagios
Matt Simmons
@standaloneSA
standalone.sysadmin@gmail.com
http://www.standalone-sysadmin.com
2. Introduction & Outline
Confessions:
I am not actually a Nagios Expert
I do actually LIKE Nagios
Outline:
Global Sanity
Small & Medium Shops
Large Scale Shops
Add Ons
Warnings
Additional Resources
3. I know what you're thinking...
Nagios?
Sane???
Unlikely!!!
Serenity Now!!!
6. Global Sanity: Documentation
Read the documentation
Object Definitions
http://nagios.sourceforge.net/docs/3_0/objectdefinitions.html
Use 3_0 when searching
Bookmark the good ones
Nagiosbook.org will soon be coming out with 3.x docs
http://www.nagiosbook.org/
7. Global Sanity: Central Auth
Centralized Authentication
LDAP / AD with Apache
(I use Likewise Open)
Domain users -> Nagios Contacts
msimmons@EXAMPLE.COM
Access to CGI interface
8. Global Sanity:
Do Not Reinvent the Wheel...
Nagios Exchange
http://exchange.nagios.org/
Pros:
Nearly 2000 Listings
>1600 plugins
Cons:
Varying quality and reliability
Old, unmaintained, code rot, etc
9. Global Sanity:
...unless you have to
Writing your own Nagios Plugins
Great guide
http://nagiosplug.sourceforge.net/developer-guidelines.html
Extended Output
Huge Community
Any language you want
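A minimal sketch of the plugin contract, in Python. What Nagios relies on is the exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) and one line of output, optionally followed by a pipe and performance data. The function names and the "higher is worse" threshold logic are illustrative assumptions, not taken from the guidelines.

```python
# Conventional plugin exit codes understood by Nagios.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate(value, warn, crit):
    """Map a measured value onto an exit code (assumes higher is worse)."""
    if value is None:
        return UNKNOWN
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def format_output(name, value, warn, crit):
    """Return (exit_code, status_line) for a hypothetical metric `name`."""
    code = evaluate(value, warn, crit)
    if code == UNKNOWN:
        return code, f"{name.upper()} UNKNOWN - no data"
    # Text before the pipe is for humans; text after it is performance data.
    line = (f"{name.upper()} {LABELS[code]} - {name} is {value}"
            f" | {name}={value};{warn};{crit}")
    return code, line
```

A real plugin would print the status line and pass the exit code to sys.exit().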
10. Small & Medium Shops
Not exclusively small or medium, just a non-automated way of doing things
For people who:
Manually edit / create entries in config files
Don't use extensive 3rd party management software
Have a small team of responsible admins
Don't require large distributed monitoring networks
14. Config File Hierarchy
Default config is stupid.
cfg_dir directive is key
*.cfg – recursively
Hierarchy should resemble “real life”
Allows for additional “group” security
Use what makes sense to you and document it
15. Config File Hierarchy: Example
Output of “tree -d” on my Nagios objects directory
|-- commands
|-- computers
|   |-- groups
|   |-- linux
|   |   `-- services
|   `-- windows
|-- misc
`-- network
    |-- firewalls
    |-- links
    |-- routers
    `-- switches
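To preview which files a recursive cfg_dir would pick up from a tree like this, a short Python sketch can walk the directory and list every *.cfg file. The function name is hypothetical, and the sorted order is only for readable output; it says nothing about the order Nagios itself reads files in.

```python
import os


def find_cfg_files(root):
    """Return every *.cfg file under root, searching subdirectories too."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # walk subdirectories in a stable order
        for name in sorted(filenames):
            if name.endswith(".cfg"):
                found.append(os.path.join(dirpath, name))
    return found
```

Files without the .cfg suffix (notes, backups, disabled definitions) are skipped, which makes renaming a file a quick way to take it out of the config.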
16. Regular Expressions
Not all regexes are created equal
use_regexp_matching
Only when object names contain:
*
?
use_true_regexp_matching
'man regex'
All object names
Caution: Unintended consequences
23. Script / Automate
Automate as much as possible
New Services
New Hosts
Commands
mkhost.sh as a template
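The idea behind a script like mkhost.sh can be sketched in Python: fill a host definition from a few arguments so nobody types it by hand. This is not the author's script. The function name is hypothetical, and "generic-host" is an assumed template name; substitute whichever host template your config defines.

```python
def render_host(host_name, address, alias=None, template="generic-host"):
    """Return a Nagios host object definition as text.

    `template` is an assumed name for an existing host template.
    """
    fields = [
        ("use", template),
        ("host_name", host_name),
        ("alias", alias or host_name),
        ("address", address),
    ]
    body = "\n".join(f"    {key:<12}{value}" for key, value in fields)
    return "define host {\n" + body + "\n}\n"
```

Writing the result to a file under the matching directory of the config hierarchy, for example print(render_host("web01", "10.0.0.5")), keeps new hosts consistent with existing ones.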
24. Use alternate contacts file when testing new features
Coworkers are under enough stress as it is
No messy explanations
Use symlinks to point to “real” contacts file
28. When checking disk usage
Do not specify the partitions to check
Instead, specify the partitions to NOT check
Too easy to forget to add new partitions.
If possible, use a plugin that produces statistics for graphing usage trends
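The exclude-list approach can be sketched in Python as follows. Every mount is checked unless it is named in the exclusion list, so a newly added partition is covered without touching the check. The function names and the 90 percent default are assumptions for illustration.

```python
import shutil


def percent_used(mount):
    """Return the percentage of space used on a mounted filesystem."""
    usage = shutil.disk_usage(mount)
    if usage.total == 0:  # some pseudo-filesystems report zero size
        return 0.0
    return 100.0 * usage.used / usage.total


def over_threshold(usage_by_mount, exclude=(), warn_pct=90.0):
    """Return {mount: percent} for non-excluded mounts at or above warn_pct."""
    excluded = set(exclude)
    return {
        mount: pct
        for mount, pct in sorted(usage_by_mount.items())
        if mount not in excluded and pct >= warn_pct
    }
```

For example, over_threshold({"/": 50.0, "/data": 95.0, "/mnt/cdrom": 100.0}, exclude=["/mnt/cdrom"]) reports only /data, and a new mount above the threshold would show up without any config change.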
30. Alternate Communication Method
When the network is down, email is down too
Have a non-email contact method
SMS, cell modem, smoke signals
Test it occasionally
31. Use parents
Establish a path FROM THE NAGIOS SERVER
Failure will trigger “unreachable” states
“u” notification flag
Only useful for non-local-subnet hosts typically
If the local switch dies, alerts don't go out anyway
Typically
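A small Python sketch of how parents separate "down" from "unreachable", as I understand the rule: a failed host with at least one working parent is DOWN, while a failed host whose parents have all failed is UNREACHABLE. The function name and the host names in the usage note are hypothetical.

```python
def classify(host, parents, failed):
    """Return 'UP', 'DOWN', or 'UNREACHABLE' for one host.

    parents: dict mapping each host to its list of parent hosts
             (empty for hosts directly next to the Nagios server).
    failed:  set of hosts whose own checks are failing.
    """
    if host not in failed:
        return "UP"
    host_parents = parents.get(host, [])
    # No parents, or at least one parent still answering: the host itself is down.
    if not host_parents or any(p not in failed for p in host_parents):
        return "DOWN"
    # Every path from the Nagios server is broken, so the real state is unknown.
    return "UNREACHABLE"
```

With parents = {"router": [], "switch": ["router"], "web01": ["switch"]} and failed = {"switch", "web01"}, the switch is DOWN and web01 is UNREACHABLE, so only the switch needs to page anyone.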
32. Use Dependencies
Available for both hosts and services
The disks didn't blow up, SNMP crashed
What do you mean, the website is unavailable when the database crashes?
Dependencies != parents
Parents establish a line between the host and Nagios
Dependencies establish logical object relationships
33. Notifications are Commands
Use Them
Execute what you need, when you need, where you need through extra-Nagios scripts
Your imagination is the limit
Electrical relays?
Flashing lights?
HALON release?
Please don't.
34. Use Passive Checks
(when necessary / appropriate)
For “normal” passive checks, specify freshness checks
Useful for SNMP traps
Combine with snmptrapd
Distributed Monitoring
Use for capacity reasons
Physical separation calls for separate Nagios installs (in my opinion)
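Passive results reach Nagios as external commands written to its command file. The Python sketch below builds one PROCESS_SERVICE_CHECK_RESULT line; the function name and the example host and service names are hypothetical, and the command file location depends on the command_file setting of your install.

```python
import time


def passive_service_result(host, service, return_code, output, now=None):
    """Format one passive service check result as an external command line."""
    timestamp = int(time.time() if now is None else now)
    return (f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;"
            f"{host};{service};{return_code};{output}\n")
```

A trap handler or remote poller would append this line to the command file. The host and service must already be defined, with passive checks enabled, for the result to be accepted.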
35. Macros GOOD
60 bajillion available
http://nagios.sourceforge.net/docs/3_0/macrolist.html
On Demand Macros
Specify “remote” macros from other hosts
$HOSTMACRO:SOMEHOST$
Custom Variable Macros
_MACADDRESS 00:01:02:03:04:05
$_HOSTMACADDRESS$
Available as environment variables in scripts
$NAGIOS_MACRONAME
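A hedged Python sketch of a script reading macros from its environment, assuming environment macros are enabled in your install. NAGIOS_HOSTNAME and NAGIOS_HOSTSTATE follow the $NAGIOS_MACRONAME pattern above; the NAGIOS__HOSTMACADDRESS spelling for the custom variable is my understanding of the convention and worth verifying on your own system. The function name is hypothetical.

```python
import os


def describe_alert(env=None):
    """Build a one-line summary from Nagios macros found in the environment."""
    env = os.environ if env is None else env
    host = env.get("NAGIOS_HOSTNAME", "unknown-host")
    state = env.get("NAGIOS_HOSTSTATE", "UNKNOWN")
    # Assumed name for the custom _MACADDRESS host variable.
    mac = env.get("NAGIOS__HOSTMACADDRESS", "n/a")
    return f"{host} is {state} (MAC {mac})"
```

Passing a plain dict as env makes the script easy to try outside Nagios before wiring it into a notification or event handler command.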
36. Use Flap Detection
Or not. Who wants a charged cellphone battery?
Measures state changes:
Weighted measure of the last 21 checks
More recent counts higher
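The weighting can be illustrated with a short Python sketch. It counts state changes across a history (21 checks give 20 possible changes) and weights newer changes more heavily. The linear 0.8 to 1.2 weights are an assumption chosen for illustration, so the numbers will not necessarily match what Nagios computes internally.

```python
def percent_state_change(states, low=0.8, high=1.2):
    """Weighted percent state change over a history ordered oldest first.

    low/high are assumed weights for the oldest and newest possible change.
    """
    changes = len(states) - 1  # number of possible state changes
    if changes < 1:
        return 0.0
    total = 0.0
    for i in range(1, len(states)):
        if states[i] != states[i - 1]:
            # i == 1 is the oldest possible change, i == changes the newest.
            frac = (i - 1) / (changes - 1) if changes > 1 else 1.0
            total += low + frac * (high - low)
    return total * 100.0 / changes
```

Comparing the result against low and high flap thresholds then decides whether the host or service is considered flapping.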
37. Large Shops
Too many nodes to easily configure by hand, or too many nodes to deal with using one server
Scaling Nagios
Centralized Management
Web Configurators
40. Nagios Web Configuration
Dozens, if not hundreds
I don't know of a great one.
May be worth building or finding one that matches your inventory system
Don't double-up on data if you don't have to
41. Malproductive Practices
Overreliance on Event Handlers
Please don't do anything terribly important.
Edge cases are scary.
Overuse of inheritance
Spaghetti code
Hard to trace
Overcomplication
Simple is nearly always better
42. Learn More
Mailing List
Nagios Users
https://lists.sourceforge.net/lists/listinfo/nagios-users
LinkedIn
Nagios Users
http://www.linkedin.com/groupAnswers?viewQuestions=&gid=