Daniel Wittenberg's presentation on Scaling Nagios Core 4.
The presentation was given during the Nagios World Conference North America held Sept 20-Oct 2nd, 2013 in Saint Paul, MN. For more information on the conference (including photos and videos), visit: http://go.nagios.com/nwcna
2. About Me
● Unix/Linux admin since mid 90's
● Nagios/Netsaint user since early 2000's
● Owned/operated consulting business for almost 10 years that provided distributed monitoring using Nagios
● Previously employed by Fortune 50 Insurance company
● Currently Monitoring Platform Manager at IPsoft Inc.
3. About IPsoft
● Provider of Remote Infrastructure Management and automation services
● ITIL and 6 Sigma compliance management framework
● Automation that resolves 56% of all incidents, and 90% L1
● Monitoring, Automation, Event Correlation, Management...
● Offices around the world in ten countries
● http://www.ipsoft.com
5. My Configuration
● ~700 Nagios Servers
● ~130,000 Monitored Devices
● ~3,000,000 Service Checks
● Mix of customized Nagios 3.2.3 and 4.0.0
● Scientific Linux 6.2/6.4
● Managed by Puppet 3.x
● 2/3 on VMware ESX, the rest are bare metal
● Adding new Nagios servers almost daily
6. What's different with Nagios 4
SPEED!
● Current testing shows Nagios 4 is on average 500% faster than 3.2.3
7. What's different with Nagios 4
Some things that would impact performance/stability
http://nagios.sourceforge.net/docs/nagioscore/4/en/whatsnew.html
● Embedded Perl - Gone
● external_command_buffer_slots - Gone
● -x option to not verify circular paths no longer needed in rc scripts
● Configuration verification algorithm changes, massive startup speed increase
● Event queue algorithm changes, helps with CPU utilization (see Andreas' 2012 presentation)
● Disk I/O reduced to virtually 0
● NEW query handler interface, better communication with core
● NEW core workers - reduce I/O, memory, CPU
● Completely re-written spec file for better installs, debug modes
8. Perf Testing Lab Setup
● Servers are all ESX 5 based VM's on the same cluster
● Variable CPU cores, 4GB memory
● Metrics used to consider a test failure:
  ● CPU Block Queue > 3
  ● CPU I/O Wait > 3
  ● CPU Idle < 10%
  ● Service Check Latency > 1s
  ● Host Check Latency > 1s
● 30 minute run time, > 3% failure rate failed the test
● Fully automated increasing work load, consistent results
● Add 1 host + 1 service check, try to get "best case" numbers without check latency
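The pass/fail rule on this slide can be sketched in Python. This is a minimal illustration, not the harness used in the talk: the `Sample` class, its field names, and the reading of "failure rate" as the fraction of measurement intervals that breach any threshold are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One measurement interval. Field names are hypothetical."""
    cpu_block_queue: float    # blocked-process count
    cpu_iowait: float         # I/O wait reading
    cpu_idle_pct: float       # percent of CPU time idle
    service_latency_s: float  # service check latency in seconds
    host_latency_s: float     # host check latency in seconds


def sample_failed(s: Sample) -> bool:
    """True if the interval breaches any threshold listed on the slide."""
    return (
        s.cpu_block_queue > 3
        or s.cpu_iowait > 3
        or s.cpu_idle_pct < 10
        or s.service_latency_s > 1.0
        or s.host_latency_s > 1.0
    )


def run_failed(samples, max_failure_rate=0.03) -> bool:
    """True if more than 3% of the intervals failed (assumed interpretation)."""
    if not samples:
        raise ValueError("no samples collected")
    failures = sum(1 for s in samples if sample_failed(s))
    return failures / len(samples) > max_failure_rate
```

With 100 intervals, three breaching intervals still pass under this reading and a fourth fails the run.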
10. Test Results
CPU Cores   Service Checks (3.2.3)   Service Checks (4.0.0rc1)   Difference
1           1700                     10500                       617%
2           3300                     20800                       630%
4           6500                     35300                       543%
8           11700                    45100                       385%
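The "Difference" figures appear to be the 4.0.0rc1 count expressed as a percentage of the 3.2.3 count, truncated to a whole number. The short Python check below reproduces them under that assumption and also prints checks per core, which shows how the gain tapers off as cores are added. The function names are illustrative.

```python
# (cores, service checks on 3.2.3, service checks on 4.0.0rc1), from the slide
RESULTS = [
    (1, 1700, 10500),
    (2, 3300, 20800),
    (4, 6500, 35300),
    (8, 11700, 45100),
]


def ratio_percent(old: int, new: int) -> int:
    """New throughput as a truncated percentage of the old throughput."""
    return new * 100 // old


def per_core(checks: int, cores: int) -> float:
    """Average number of service checks handled per CPU core."""
    return checks / cores


for cores, v3, v4 in RESULTS:
    print(
        f"{cores} core(s): {ratio_percent(v3, v4)}% of 3.2.3, "
        f"{per_core(v4, cores):.0f} checks/core on 4.0.0rc1"
    )
```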
11. Other software used
● Customized livestatus based on Andreas updates for Nagios 4
  ● https://github.com/ageric/livestatus
● Developing custom "single pane" interface to replace CGI/Check_mk Multisite
● Developing full REST API to talk to QH, livestatus and config files
● nagios-qh.rb Query Handler interface to gather loadctl metrics
  ● https://www.dropbox.com/s/h6zn0ecycqb1xrc/nagios-qh.rb
● Custom load control daemon that talks to QH
● Custom Event Broker to send perf data directly to ActiveMQ for post-processing
● Custom agent, like NRPE on steroids without limitations like buffer size
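A rough idea of what a Query Handler client for loadctl metrics involves, sketched in Python. This is not the nagios-qh.rb script from the slide. The socket path, the `#core loadctl` request terminated by a NUL byte, and the semicolon-separated `key=value` reply are assumptions about the Nagios 4 query handler and should be checked against the Nagios Core documentation for the installed version.

```python
import socket

# Assumed location of the query handler socket; it varies by installation.
QH_SOCKET = "/usr/local/nagios/var/rw/nagios.qh"


def build_query(handler: str, command: str, one_shot: bool = True) -> bytes:
    """Build a NUL-terminated request (assumed wire format)."""
    prefix = "#" if one_shot else "@"
    return f"{prefix}{handler} {command}".encode() + b"\0"


def parse_kv(reply: str) -> dict:
    """Parse a reply of the assumed form 'key=value;key=value;'."""
    text = reply.rstrip("\0\n ")
    pairs = (item.split("=", 1) for item in text.split(";") if "=" in item)
    return {key.strip(): value.strip() for key, value in pairs}


def query_loadctl(path: str = QH_SOCKET, timeout: float = 5.0) -> dict:
    """Send a loadctl query over the Unix socket and return the parsed reply."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        sock.connect(path)
        sock.sendall(build_query("core", "loadctl"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # server closed the connection
                break
            chunks.append(data)
    return parse_kv(b"".join(chunks).decode(errors="replace"))
```

`query_loadctl()` needs a running Nagios 4 instance and read/write access to the socket. The two helper functions can be exercised without one.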
12. Other performance tweaks
● Sysctl Changes
  ● net.ipv4.tcp_fin_timeout
  ● net.ipv4.tcp_keepalive_probes
  ● net.ipv4.tcp_tw_recycle
  ● net.ipv4.tcp_tw_reuse
● No longer need RAMDISK, but still in the default sysconfig/RC script for now
● Keep logging levels as low as possible
● Disable CGI's whenever possible
● Disable Environment Macros
● Don't use resource macros when you don't need to, they are not cached
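The slide names the sysctl keys but not the values used, so the Python sketch below only reads the current settings from `/proc/sys` as a starting point for tuning. The keepalive entry is taken here to be `net.ipv4.tcp_keepalive_probes` and the reuse entry is written with the kernel's spelling `net.ipv4.tcp_tw_reuse`; both are assumptions about what the slide intended. `net.ipv4.tcp_tw_recycle` was removed in Linux 4.12, so on newer kernels it is reported as unavailable.

```python
SYSCTLS = [
    "net.ipv4.tcp_fin_timeout",
    "net.ipv4.tcp_keepalive_probes",  # assumed intended key
    "net.ipv4.tcp_tw_recycle",        # absent on Linux 4.12 and later
    "net.ipv4.tcp_tw_reuse",
]


def sysctl_path(name: str, root: str = "/proc/sys") -> str:
    """Map a dotted sysctl key to its file under /proc/sys."""
    return f"{root}/{name.replace('.', '/')}"


def read_sysctl(name: str, root: str = "/proc/sys"):
    """Return the current value as a string, or None if it cannot be read."""
    try:
        with open(sysctl_path(name, root)) as fh:
            return fh.read().strip()
    except OSError:
        return None


if __name__ == "__main__":
    for key in SYSCTLS:
        value = read_sysctl(key)
        print(f"{key} = {value if value is not None else 'unavailable'}")
```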
13. Other performance tweaks
● /etc/security/limits.d/nagios.conf
  ● ipmon soft nofile 131072
  ● ipmon hard nofile 131072
  ● ipmon soft nproc 131072
  ● ipmon hard nproc 131072
● Nearly disable the OOM killer for the nagios process, so it is killed last
  ● echo '-16' > /proc/<nagios pid>/oom_adj
● Re-nice puppet to run at 10 so it has less impact (true for any extra services)
  ● /etc/sysconfig/puppet - NICELEVEL=10
● This should apply to any other running services that might take resources
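The OOM adjustment above can also be applied from a script, for example right after Nagios starts. A minimal Python sketch follows; the function name and the `proc_root` parameter (there only to make testing possible) are illustrative. The legacy `oom_adj` file accepts values from -17 to 15, where -17 disables OOM kills for the process entirely. Newer kernels deprecate it in favor of `oom_score_adj`, which uses a different scale.

```python
import os

# Valid range of the legacy oom_adj interface; -17 exempts the process.
OOM_ADJ_MIN, OOM_ADJ_MAX = -17, 15


def set_oom_adj(pid: int, value: int = -16, proc_root: str = "/proc") -> str:
    """Write an oom_adj value for the given pid and return the path written.

    Lowering the value normally requires root privileges.
    """
    if not OOM_ADJ_MIN <= value <= OOM_ADJ_MAX:
        raise ValueError(f"oom_adj must be between {OOM_ADJ_MIN} and {OOM_ADJ_MAX}")
    path = os.path.join(proc_root, str(pid), "oom_adj")
    with open(path, "w") as fh:
        fh.write(f"{value}\n")
    return path
```

The setting is per process and does not survive a restart, so it has to be reapplied whenever the nagios pid changes.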
14. Common Perf Tools
● vmstat / top - cpu/memory
● iostat / iotop - disk usage
● iptraf - network
● sar - cpu/memory/disk
● strace - immediate debugging, also debugging QA
● esxtop - VM stats
● tuned - can dynamically tune system
● perf record -p <pid> / perf list / perf top -u nagios
15. How to keep it running good
● Monitor everything...you can never have too much info!
● CPU load and CPU stats (idle/wait/user/system)
● Disk space, inodes free
● All application/system logs (apache, syslog, nagios.log, etc.)
● Hardware status
● Swap / Physical Memory Usage
● Puppet state (state.yaml)
● Apache Stats (if have GUI/API)
● Network performance and stats (errors, throughput, etc.)
● NTP time and drift (more important on VM's)
17. Known Issues (and complaints)
● Workers on smaller (1-2 core) systems are easily overloaded
● No remote workers (yet)
● Still have to restart to add new hosts/services
● No REST API natively
● Livestatus (or similar) not native
18. Questions?
● Daniel.Wittenberg@ipsoft.com
● dwittenberg2008@gmail.com
● @dwittenberg2008
● www.linkedin.com/in/dwittenberg
● nagios and nagios-devel IRC
● Nagios Users and Devel mailing lists
● Always looking to hire new people so contact me!