Troy Lea's presentation on Leveraging and Understanding Performance Data and Graphs.
The presentation was given during the Nagios World Conference North America held Sept 30-Oct 2, 2013 in Saint Paul, MN. For more information on the conference (including photos and videos), visit: http://go.nagios.com/nwcna
3. 3
About This Presentation
Understanding how performance data is stored
in the back end and how Nagios accesses it
Goal is to give you key pieces of information
A good reference for understanding concepts
This presentation is centered around Nagios XI
Valid for other Nagios implementations
6. 6
Basic Concepts - Part 3
Service check command is executed by the monitoring engine
Monitoring engine receives the result of the check
Data received has performance data
Performance data is anything after the | (pipe)
The performance data is inserted into an RRD file
When viewing the performance graph, PNP4Nagios retrieves the
performance data from the RRD file and generates a pretty graph
Every time the service check receives performance data, it inserts
this performance data into the RRD file which allows you to look at
trends over time
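The split rule above can be sketched with plain shell parameter expansion. This is an illustration only (the sample line is borrowed from the check_icmp example used later in this deck), not how the monitoring engine itself parses output:

```shell
# Sample plugin output, status text and performance data separated by a pipe:
line='OK - 127.0.0.1: rta 2.687ms, lost 0% | rta=2.687ms;3000.000;5000.000;0; pl=0%;80;100;;'

# Everything before the first pipe: status text the monitoring engine processes.
echo "${line%%|*}"

# Everything after the pipe: performance data destined for the RRD file.
echo "${line#*|}"
```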
7. 7
Plugins
The power of Nagios is in the plugins!
Monitor what you want, how you want!
Resources available that clearly define the
guidelines around creating plugins
Nagios Plug-in Developer Guidelines
http://nagiosplug.sourceforge.net/developer-guidelines.html
PNP Documentation
http://docs.pnp4nagios.org/pnp-0.4/doc_complete
8. 8
Plugin Output Explained - Part 1
Plugins produce data divided into two parts
The pipe symbol "|" is used as a delimiter
Example check_icmp
OK - 127.0.0.1: rta 2.687ms, lost 0% |
rta=2.687ms;3000.000;5000.000;0; pl=0%;80;100;;
Data to the left of the pipe symbol is processed
by the monitoring engine
Data to the right of the pipe symbol is used for
inserting into RRD and XML files
9. 9
Plugin Output Explained - Part 2
The exit code Nagios receives from the plugin
determines the state of the service
0 = OK
1 = WARNING
2 = CRITICAL
3 = UNKNOWN
The exit code is not "visible" when running a
check from the command line or looking at the
output returned from the plugin
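A quick way to see the otherwise invisible exit code from the command line is to print $? immediately after running the plugin. A minimal sketch, using a throwaway demo plugin (the path and output here are made up for illustration):

```shell
# Create a throwaway plugin that reports WARNING (exit code 1).
cat > /tmp/check_demo.sh <<'EOF'
#!/bin/sh
echo "WARNING - demo value high | demo=42;40;50;;"
exit 1
EOF
chmod +x /tmp/check_demo.sh

# Run it, then echo the exit code Nagios would act on.
/tmp/check_demo.sh
echo "exit code: $?"   # prints: exit code: 1
```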
10. 10
Plugin Output Explained - Part 3
No performance data = no pretty graphs
You can create a plugin using whatever
language and tools are available
All that matters is the end result which is
returned back to Nagios when the plugin has
finished running
11. 11
Plugin Output Explained - Part 4
Examples:
Shell script
Something you might want to check on the Nagios
host itself
Perl script
Remotely checking a device using SNMP OR using
third party APIs like the VMware vSphere SDK to
remotely access virtual environments
Visual Basic script
Using NSClient on a Windows host to perform a
check (like RDP usage)
12. 12
Performance Data Specifics - Part 1
Asterisk (*) fields are required; everything
else is optional
In this instance, rta is the FIRST DS, or DS 1
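The field layout from the plugin developer guidelines is 'label'=value[UOM];[warn];[crit];[min];[max]. As an illustrative sketch (variable names are my own), one datasource can be split on semicolons in the shell:

```shell
# One datasource from the check_icmp example; fields are semicolon-separated:
#   'label'=value[UOM];[warn];[crit];[min];[max]
perf='rta=2.687ms;3000.000;5000.000;0;'

# Split into its fields (label and value stay joined by the equals sign).
IFS=';' read -r value warn crit min max <<EOF
$perf
EOF

echo "value=$value warn=$warn crit=$crit min=$min max=$max"
```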
13. 13
Performance Data Specifics - Part 2
Multiple DS
Each DS is separated by a space
rta=2.687ms;3000.000;5000.000;0; pl=0%;80;100;;
The label can contain spaces; however, the label
MUST then be enclosed in single quotes
'Round Trip Average'=2.687ms;3000.000;5000.000;0;
'Packet Loss'=0%;80;100;;
14. 14
Basic Plugin - Part 1
Example shell script demonstrating how a plugin
outputs performance data
NUMBER1=$(( ( RANDOM % 100 ) + 1 ))
NUMBER2=$(( ( RANDOM % 1000 ) + 1 ))
echo "OK - Number 1: $NUMBER1 Number 2: $NUMBER2 | 'Number 1'=$NUMBER1;;;; 'Number 2'=$NUMBER2;;;;"
exit 0
15. 15
Basic Plugin - Part 2
Here is the output each time it is run:
OK - Number 1: 4 Number 2: 74 | 'Number 1'=4;;;; 'Number 2'=74;;;;
OK - Number 1: 52 Number 2: 758 | 'Number 1'=52;;;; 'Number 2'=758;;;;
OK - Number 1: 73 Number 2: 60 | 'Number 1'=73;;;; 'Number 2'=60;;;;
OK - Number 1: 29 Number 2: 338 | 'Number 1'=29;;;; 'Number 2'=338;;;;
OK - Number 1: 87 Number 2: 612 | 'Number 1'=87;;;; 'Number 2'=612;;;;
16. 16
Basic Plugin - Part 3
Performance data displayed as a pretty graph
Demonstration of how you can generate
performance data in a plugin
17. 17
Basic Plugin - Part 4
Now let's add warning and critical thresholds to
the performance data string
Number1
WARNING @ 50
CRITICAL @ 75
Number2
WARNING @ 500
CRITICAL @ 750
echo "OK - Number 1: $NUMBER1 Number 2: $NUMBER2 | 'Number 1'=$NUMBER1;50;75;; 'Number 2'=$NUMBER2;500;750;;"
18. 18
Basic Plugin - Part 5
Here is the output each time it is run:
OK - Number 1: 4 Number 2: 74 | 'Number 1'=4;50;75;; 'Number 2'=74;500;750;;
OK - Number 1: 52 Number 2: 758 | 'Number 1'=52;50;75;; 'Number 2'=758;500;750;;
OK - Number 1: 73 Number 2: 60 | 'Number 1'=73;50;75;; 'Number 2'=60;500;750;;
OK - Number 1: 29 Number 2: 338 | 'Number 1'=29;50;75;; 'Number 2'=338;500;750;;
OK - Number 1: 87 Number 2: 612 | 'Number 1'=87;50;75;; 'Number 2'=612;500;750;;
19. 19
Basic Plugin - Part 6
This demonstrates how the performance data
does not have any effect on the state of the service
Warning and Critical thresholds are inside
the .xml file
20. 20
.rrd and .xml files
Used for recording the results from Nagios checks
Useful for observing daily trends of your environment
Invaluable for helping resolve performance issues
RRD = Round Robin Database
XML = Information about the Nagios check
PNP4Nagios uses the RRD and XML files to
generate pretty graphs
21. 21
Location of .rrd and .xml files
When a service check returns performance data,
Nagios dumps this into:
/usr/local/nagios/var/spool/perfdata
A background process detects the spooled data
and creates / updates the relevant .rrd and .xml
The Performance Data files live in:
/usr/local/nagios/share/perfdata/<host>
22. 22
Extract .rrd data
You can extract data from an .rrd file
Example (from the CLI):
rrdtool fetch /usr/local/nagios/share/perfdata/localhost/_HOST_.rrd MAX -r 900 -s -1h
23. 23
.rrd and .xml Gotcha - Part 1
The .xml file can contain sensitive data
<NAGIOS_SERVICECHECKCOMMAND>check_emc_clariion!$HOSTADDRESS$!-u readonly!-p Str0ngPassw0rd!-t sp_cbt_busy!--sp A!--warn 70!--crit 90!</NAGIOS_SERVICECHECKCOMMAND>
24. 24
.rrd and .xml Gotcha - Part 2
Perhaps use a central credential file
<NAGIOS_SERVICECHECKCOMMAND>check_vmware_host!check_vmware_config_vcenter01!cpu!90!95!!!!</NAGIOS_SERVICECHECKCOMMAND>
25. 25
.rrd and .xml Gotcha - Part 3
RRD Data is averaged out over time
Looking at performance graphs for the past day / week /
month / year will show results with less spiky data
This generally only occurs with data that has lots of
peaks and troughs
Constant data like disk space used will generally not
average out that much
It all depends on your environment!
When reviewing RRD data you need to take these
factors into consideration; it's all relative!
26. 26
Graphs - How Templates Are Used - Part 1
http://docs.pnp4nagios.org/pnp-0.4/tpl
27. 27
Graphs - How Templates Are Used - Part 2
PNP4Nagios queries the XML file for the
<TEMPLATE> tag
Each datasource has its own <TEMPLATE> tag
<TEMPLATE>check-host-alive</TEMPLATE>
It can also be a trailing string in the performance
data (good for distributed monitoring)
OK - 127.0.0.1: rta 2.687ms, lost 0% |
rta=2.687ms;3000.000;5000.000;0; pl=0%;80;100;;
[check_icmp]
28. 28
Graphs - How Templates Are Used - Part 3
From the example graphs:
<TEMPLATE>check-host-alive</TEMPLATE>
<TEMPLATE>check_local_load_alt</TEMPLATE>
PNP4Nagios looks for a php file with this name
in the following folders:
/usr/local/nagios/share/pnp/templates.dist
/usr/local/nagios/share/pnp/templates
29. 29
Graphs - How Templates Are Used - Part 4
check-host-alive
/usr/local/nagios/share/pnp/templates.dist/check-host-alive.php
This PHP file generates the performance graph
check_local_load_alt
check_local_load_alt.php does NOT exist
Default template is used:
/usr/local/nagios/share/pnp/templates.dist/default.php
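The lookup just described can be sketched in shell. This is a rough illustration only: the paths are the XI defaults from the slide, the loop follows the order the slide lists the folders in, and real PNP precedence between the two folders may differ:

```shell
tpl="check_local_load_alt"   # value taken from the <TEMPLATE> tag
found=""
for dir in /usr/local/nagios/share/pnp/templates.dist \
           /usr/local/nagios/share/pnp/templates; do
  if [ -z "$found" ] && [ -f "$dir/$tpl.php" ]; then
    found="$dir/$tpl.php"
  fi
done

# Fall back to default.php when no matching template file exists:
echo "${found:-/usr/local/nagios/share/pnp/templates.dist/default.php}"
```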
30. 30
Graphs - Creating Your Own Template - Part 1
The check_command name is what Nagios uses
to insert into the <TEMPLATE> tag in the XML
file (how PNP determines which template to use)
So for this example I have created a copy of an
existing command
check_xi_service_nsclient_alt
31. 31
Graphs - Creating Your Own Template - Part 2
The service definition using the new command
32. 32
Graphs - Creating Your Own Template - Part 3
The graph currently being generated
Default Template being used
Check Command being used
.rrd and .xml files currently contain valid data
33. 33
Graphs - Creating Your Own Template - Part 4
Copy the file:
/usr/local/nagios/share/pnp/templates.dist/default.php
To the following location with the name:
/usr/local/nagios/share/pnp/templates/check_xi_service_nsclient_alt.php
Edit check_xi_service_nsclient_alt.php
34. 34
Graphs - Creating Your Own Template - Part 5
In the graph we are removing the bottom two lines
Default Template
Check Command command name
Which are lines 62 and 63
$def[$i] .= 'COMMENT:"Default Template\r" ';
$def[$i] .= 'COMMENT:"Check Command ' . $TEMPLATE[$i] . '\r" ';
Save check_xi_service_nsclient_alt.php
35. 35
Graphs - Creating Your Own Template - Part 6
How easy was that!
Updated graph
Template Name and Check Command removed
36. 36
PNP Templates In Detail - Part 1
Let's get into specifics
The template we just modified
It's not that complicated! (LOL)
37. 37
PNP Templates In Detail - Part 2
.rrd files can have multiple datasources (DS)
Round Trip Time and Packet Loss for example
38. 38
PNP Templates In Detail - Part 3
Example of .rrd file with five DS
Two graphs generated using these DS
39. 39
PNP Templates In Detail - Part 4
Default Template creates one graph per DS
This is a simple PHP foreach loop
The code within the loop references the relevant
DS by the $i variable
40. 40
PNP Templates In Detail - Part 5
This section of the template uses three DS
One graph will be generated using three DS
$opt[1] and $def[1] are references to the first graph
being generated
41. 41
PNP Templates In Detail - Part 6
Number formatting
Our modified template and the relevant code
The relevant information:
%3.4lf
42. 42
PNP Templates In Detail - Part 7
The three DS template and the relevant code
The relevant information:
%4.0lf
43. 43
PNP Templates In Detail - Part 8
Numbers are displayed with four decimal places
%3.4lf
Numbers are displayed as whole numbers
%4.0lf
44. 44
PNP Templates In Detail - Part 9
PNP documentation defines the number
formatting using the printf standard defined here
http://en.wikipedia.org/wiki/Printf
The number (1) and the letter "L" look alike
%3.4lf contains a lower case "L", not the digit one
The syntax is
%[parameter][flags][width][.precision][length]type
45. 45
PNP Templates In Detail - Part 10
width
When the number is generated on the graph, it is
allocated a minimum width; this helps you
align numbers in columns
precision
Determines if the number displayed is a whole
number, or a number with a specific number of digits
following the decimal place
46. 46
PNP Templates In Detail - Part 11
%3.4lf
width = 3
precision = .4
hence the displayed number is 25.3800
%4.0lf
width = 4
precision = .0
hence the displayed number is 14
Because the precision is 0, NO decimal place is used
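The same width/precision behaviour can be tried with the shell's printf builtin. Note this uses plain %f; the "l" length modifier belongs to the C-style formatter rrdtool uses:

```shell
printf '%3.4f\n' 25.38   # precision .4 pads to four decimals: 25.3800
printf '%4.0f\n' 14      # width 4 left-pads the whole number to four characters
```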
47. 47
MRTG - Part 1
MRTG = Multi Router Traffic Grapher
Nagios addon that is useful for monitoring
network switch and router bandwidth using SNMP
Its configuration can be complicated to understand
48. 48
MRTG - Part 2
Nagios XI Wizard called "Network Switch /
Router" automates the configuration of MRTG
MRTG configuration file
/etc/mrtg/mrtg.cfg
MRTG runs as a cron job every five minutes
cron comes from the Greek word for time, χρόνος
[chronos]
Hence cron is a software utility on Linux which is a
time-based job scheduler
In the Windows world it's the Task Scheduler
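For reference, an MRTG cron entry typically looks like the line below. The exact path and flags are illustrative assumptions (distribution packages vary), not taken from the presentation:

```shell
# Hypothetical /etc/cron.d/mrtg entry: run MRTG every five minutes as root.
*/5 * * * * root LANG=C LC_ALL=C /usr/bin/mrtg /etc/mrtg/mrtg.cfg
```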
49. 49
MRTG - Part 3
When MRTG runs, it gathers data from the
devices defined in the mrtg.cfg file
It dumps this data into the folder
/var/lib/mrtg
For every port monitored, an .rrd file is created
(no .xml file created at this point)
Another background process will then take the
data in /var/lib/mrtg and put it into the correct
location
/usr/local/nagios/share/perfdata/<host>
50. 50
MRTG Gotcha - Part 1
When the Wizard populates the mrtg.cfg file it will
add ALL ports on the switch to the config file
Even if you only selected to monitor 10 ports on
the switch
The Nagios XI Service Configuration will only have 10
ports defined as service definitions
Every time the MRTG cron job runs, it will collect
data from all ports on the switch (as defined in the
mrtg.cfg file)
Extra CPU cycles, extra disk space
51. 51
MRTG Gotcha - Part 2
On a 48 port switch this might not concern you
But in a stack of two 48 port switches this
becomes 96 ports, plus other internal ports like
link aggregation ports (perhaps another 32)
So these additional 128 ports have now added
8700+ configuration lines to the mrtg.cfg file
128 ports consume about 24 MB of .rrd disk
space
In my past environment, the mrtg.cfg file was
59,000 lines long!
52. 52
MRTG Gotcha - Part 3
Suggestion
Clean up the mrtg.cfg file
Remove the ports you do not wish to gather data on
Can this cause problems?
Yes!
Problem 1
Monitoring additional ports later using the wizard will
not work
The wizard will NOT re-add the ports to the mrtg.cfg file
Wizard detects switch / router is already in the mrtg.cfg file
53. 53
MRTG Gotcha - Part 4
Problem 2 - Adding a switch (or module) to an
existing switch
Monitoring additional ports later using the wizard will
not work
The wizard will NOT add newly detected ports to the
mrtg.cfg file
Wizard detects switch / router is already in the mrtg.cfg file
Very similar behaviour to Problem 1
Only relevant when the new switch / module is managed
through the existing IP Address / FQDN
Common with stacked switches, adding another switch to
the stack
54. 54
MRTG Gotcha - Part 5
Solutions to Problems 1 & 2
cfgmaker
This is how the Wizard configures mrtg.cfg
The wizard updates the existing mrtg.cfg using a PHP
function (not available from the CLI)
Run cfgmaker @ CLI to generate a config file
Add the contents of the config file to the existing mrtg.cfg
cfgmaker --noreversedns public@192.168.1.1 --output=output.txt
55. 55
MRTG Gotcha - Part 6
Problem 3 - With a frequently changing
environment, keep mrtg.cfg clean
Monitoring WAN links for remote routers?
WAN link no longer exists?
Disable / Delete service definition(s) in Core Configuration
Manager (CCM)
You will NEED to remove the device from mrtg.cfg
Why?
MRTG will still try and collect data from WAN links no longer
accessible
Causes delays and can make MRTG run past the default 5
minute schedule ... can cause graph anomalies
56. 56
MRTG Gotcha - Part 7
Problem 4 - Firmware Upgrade causes port
numbering to change
Major firmware revision applied to switch / router
New data collected for ports is no longer the same pattern
Internal port numbering has changed
mrtg.cfg queries specific port numbers, does not use port
names or descriptions
Example
Old Firmware: WAN = Port 1 LAN = Port 2
New Firmware: WAN = Port 0 LAN = Port 1
I have seen this behaviour on SonicWALL firewalls
58. 58
Discount Offer
But wait, there's more ...
When visiting the Nagios XI website, use my affiliate link
http://www.nagios.com/#ref=3oHG00
Editor's Notes
Good afternoon all and thank you for coming to my session. My name is Troy Lea and I'm here to talk to you about leveraging and understanding performance data and graphs in Nagios.
First a little about me. I'm primarily a Windows tech, starting back in DOS 6 and Windows 3.1. I've worked in a variety of support roles over the years, and my last role involved the development and maintenance of a cloud computing platform based on Windows Remote Desktop; I primarily looked after the backend infrastructure. I've been using Nagios XI since 2009. I originally tried Nagios before XI was released, however being a Windows guy there were some Linux barriers that I just could not get my head around. I love Nagios XI because it is delivered as a virtual appliance. Within minutes of importing that VM and powering it on you have a fully functional monitoring product. Before I caught the Nagios bug, my programming experience was all Windows related: batch files, VB scripts and PowerShell. I had dabbled in a little HTML but only because I had to. Since then I've learnt HTML, PHP, CSS, Javascript, Perl, Bash ... whatever is required to get the result I needed.
In the world of monitoring there is more to Nagios than sending alerts because a server is about to run out of hard disk space. Collecting and storing performance data is one of the most useful features in Nagios; with this information you can get an understanding of your environment's day to day trends. Analysing this data can be very helpful, perhaps to look at growth, or to identify performance bottlenecks. This session is about understanding how the performance data is stored in the back end and how Nagios accesses it. Topics covered in this session are: basic concepts; understanding the .rrd and .xml files; understanding how PNP generates graphs; creating custom graph templates in PNP; writing plugins that will output the performance data you want; understanding how MRTG works. Everything I will talk about is documented on the Internet, however that information does not always appear on the first page of your Google search results. It's especially difficult when you are learning a new language or concept; the information out there is not always helpful, or it can get overwhelming. Even though this is an advanced technical session, it's aimed at delivering the core concepts and information to help you get the results you need (and impress the boss). As I've mentioned before, I'm primarily a Windows tech. So some of the material I talk about might be obvious to a Linux tech, however to a Windows tech it can get frustrating, so my goal here is to make the content accessible to anyone. This presentation is centered around Nagios XI. There are references to locations of files and components; your implementation of Nagios may differ slightly, however the concepts are still the same.
I'll start off quickly explaining the basic concepts. Let's look at a common service that is used in monitoring, a free disk space check. Here is the service configuration and the current service status.
Here is this command and the output we see when we execute it from the CLI. The data after the pipe symbol is the performance data, I will explain this in more detail later on. Here is the Advanced Status Detail of the service showing the performance data string.
Here is the performance graph for this service, the end result. The chain of events that occur are ...
When I first began using Nagios, it became apparent that the power behind Nagios came with plugins. The ability to monitor what you want, how you want, using a variety of different methods really appealed to me. I think everyone who starts developing plugins for Nagios has a very similar journey: we modify an existing plugin to make it suit our environment; we then create a simple plugin using an existing one to do something completely different; before we know it we are writing very complex plugins. There are two exceptional resources available that clearly define the guidelines around creating plugins. Nagios Plug-in Developer Guidelines: http://nagiosplug.sourceforge.net/developer-guidelines.html - the information here is very clear and easy to understand, and I am constantly referring to it. PNP Documentation: http://docs.pnp4nagios.org/pnp-0.4/doc_complete - this has some more detailed information and examples in relation to the performance data and how it needs to be formatted.
Taken directly from the PNP documentation. When the plugin produces performance data, it is divided into two parts. The pipe symbol ("|") is used as a delimiter. Example check_icmp: OK - 127.0.0.1: rta 2.687ms, lost 0% | rta=2.687ms;3000.000;5000.000;0; pl=0%;80;100;; Something I want to make really clear here is: the data to the left of the pipe symbol is processed by the monitoring engine; the data to the right of the pipe symbol is used for inserting into RRD files for performance data.
The only information not shown here is the exit code Nagios receives from the plugin that determines the state of the check: 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN
If your plugin does not output performance data, then graphs will not be available for that service. So it's as basic as that. You can create your plugin using whatever language you need to, as it fits your purpose and needs. All that matters is the end result which is returned back to Nagios when the plugin has finished running.
Shell script: something you might want to check on the Nagios host itself. Perl script: remotely checking a device using SNMP, or using third party APIs like the VMware vSphere SDK to remotely access virtual environments. Visual Basic script: using NSClient on a Windows host to perform a check (like RDP usage).
Here is a breakdown of the performance data. The asterisk (*) fields are required fields, everything else is optional. In this instance, rta is the FIRST datasource, or datasource 1.
A plugin can output multiple datasources. Each datasource is separated by a space and the format is the same. Example: rta=2.687ms;3000.000;5000.000;0; pl=0%;80;100;; The label can have spaces if you desire, however the label MUST be enclosed by single quotes. Example: 'Round Trip Average'=2.687ms;3000.000;5000.000;0; 'Packet Loss'=0%;80;100;;
Here is a basic plugin I have created to demonstrate outputting performance data using a shell script. This is just a simple script that generates two random numbers and outputs them. For demonstration purposes this script will always return an OK state. NUMBER1=$(( ( RANDOM % 100 ) + 1 )) NUMBER2=$(( ( RANDOM % 1000 ) + 1 )) echo "OK - Number 1: $NUMBER1 Number 2: $NUMBER2 | 'Number 1'=$NUMBER1;;;; 'Number 2'=$NUMBER2;;;;" exit 0
Here is the output each time it is run:
Here are the graphs displayed after the check has been running for a while.
Now I am going to define a warning and critical threshold in the performance data string; this will show you how they appear in the graphs. Number1: WARNING @ 50, CRITICAL @ 75. Number2: WARNING @ 500, CRITICAL @ 750. echo "OK - Number 1: $NUMBER1 Number 2: $NUMBER2 | 'Number 1'=$NUMBER1;50;75;; 'Number 2'=$NUMBER2;500;750;;"
Here is the output each time it is run:
This demonstrates how the performance data does not have any effect on the state of the services. Also, if you were to look into the XML file generated for this service, this is where the warning and critical thresholds are stored.
What are Performance Data Files? Performance data files are used for recording the results from Nagios checks, which in turn become useful for observing the daily trends of your environment. Being able to look at hourly/daily/weekly/monthly/yearly historical data can be invaluable when trying to resolve performance issues. It helps get to the bottom of those customer complaints like "the server is slow". There are two files created by Nagios for every check that generates performance data. The RRD file is a Round Robin Database. That means that after some time the oldest data will be dropped at the "end" and it will be replaced by new values "at the beginning". This is the file that contains all the historical data. The XML file contains detailed information about the check that generated the performance data: things like warning and critical thresholds, and the names of the checks. This file is updated at the same time as the RRD file, so it will always contain information obtained from when the check was last run. How are these files used? When you are viewing performance graphs in Nagios, they are generated by an application called PNP4Nagios. PNP4Nagios uses the XML and RRD files to generate these graphs. PNP4Nagios allows you to create your own customised graphs based on the information in the XML file and then displays the historical data in the RRD file. It takes a couple of service check runs initially to collect performance data before you will see performance graphs. How long it takes to see the data in the performance graphs depends on the frequency of your service checks.
Initially, when a service check returns performance data, Nagios dumps this into: /usr/local/nagios/var/spool/perfdata. Another background process will then detect this spooled perfdata and create/update the relevant .rrd and .xml files. The Performance Data files live in: /usr/local/nagios/share/perfdata/<host>. There is a folder for each host. The host object files are called _HOST_ (the check_icmp command that determines if a host is up or down). All the other files are relevant to the service objects defined for each host.
If you want to extract the data from an .rrd file you can do it with the following command: rrdtool fetch /usr/local/nagios/share/perfdata/localhost/_HOST_.rrd MAX -r 900 -s -1h If you don't specify start and end times, the data retrieved will be from the past day.
The .xml file can contain sensitive data. When the .xml file is created/updated, a lot of information is stored in this file that is relevant to the check command that was run, which could have a password stored in plain text. For example, here is a service check that has a password stored in the definition, and here is the line in the .xml file: <NAGIOS_SERVICECHECKCOMMAND>check_emc_clariion!$HOSTADDRESS$!-u readonly!-p Str0ngPassw0rd!-t sp_cbt_busy!--sp A!--warn 70!--crit 90!</NAGIOS_SERVICECHECKCOMMAND>
There are many methods to work around this behaviour if you are not comfortable with it. For example, this service check uses a file that contains the credentials, and you can see that the credentials are not inside the .xml file: <NAGIOS_SERVICECHECKCOMMAND>check_vmware_host!check_vmware_config_vcenter01!cpu!90!95!!!!</NAGIOS_SERVICECHECKCOMMAND>
RRD Data is averaged out over time. Performance graphs for the past day / week / month / year will show results with less spiky data. This generally only occurs with data that has lots of peaks and troughs; the lower troughs will cause the overall average to be less, so the peaks will appear lower. Something like active user sessions will have a peak through business hours and then a drop to almost nothing out of hours. Constant data like disk space used will generally not average out that much. It all depends on your environment! When reviewing RRD data you need to take these factors into consideration, as it's all relative.
When you are viewing performance graphs in Nagios, they are generated by an application called PNP4Nagios. Here are two examples. The difference between the two graphs is that the first one has a PNP template and hence it's a little prettier, compared to the second graph that is generic and tells you that it is using the Default Template.
So how does this work? http://docs.pnp4nagios.org/pnp-0.4/tpl When the RRD and XML files are created / updated, the check_command directive defined in the service object is added to the XML file under each <DATASOURCE> tag as the TEMPLATE tag. In relation to distributed monitoring, if PNP finds a string enclosed in brackets at the end of the performance data it will be recognized as a check command and will be used as the PNP template. OK - 127.0.0.1: rta 2.687ms, lost 0% | rta=2.687ms;3000.000;5000.000;0; pl=0%;80;100;; [check_icmp] When PNP goes to display the graph, it queries the XML file and gets the TEMPLATE tag for each datasource.
For the example graphs shown on previous slides, these values are: <TEMPLATE>check-host-alive</TEMPLATE> and <TEMPLATE>check_local_load_alt</TEMPLATE> In the examples above, these values are: check-host-alive and check_local_load_alt. It then looks in the following folders to see if it can find a php file that has one of these names: /usr/local/nagios/share/pnp/templates.dist /usr/local/nagios/share/pnp/templates
In the first example above it finds the following file: /usr/local/nagios/share/pnp/templates.dist/check-host-alive.php So it uses this PHP file to generate the performance graph  In the second example above it cannot find any file named check_local_load_alt.php so it uses the default template which is: /usr/local/nagios/share/pnp/templates.dist/default.php
Creating your own templates isn't too hard, but it is a little complex and will require some trial and error. The best starting point is to find an existing template and modify it to your liking. As described in the previous slide, the name of the check_command is what Nagios uses to insert into the <TEMPLATE> tag in the XML file (how PNP determines which template to use). So for this example I have created a copy of an existing command called "check_xi_service_nsclient_alt". You can see the command is identical to the original command except for the name.
Here is the service I am using that I want to view custom graphs for, you can now see it is using the new command.
And here is the graph being generated by this service; you can see it is currently using the default template and it is also telling you the check command. So that's our starting point: we know the data currently exists in the RRD and XML files and we are ready to create our custom template.
Copy the file: /usr/local/nagios/share/pnp/templates.dist/default.php To the following location with the name: /usr/local/nagios/share/pnp/templates/check_xi_service_nsclient_alt.php Edit the file check_xi_service_nsclient_alt.php
I am going to remove the bottom two lines (Default Template, Check Command command name), which are lines 62 and 63: $def[$i] .= 'COMMENT:"Default Template\r" '; $def[$i] .= 'COMMENT:"Check Command ' . $TEMPLATE[$i] . '\r" '; Save the file, and then go and reload the performance graph and we will see the new template.
Reload the performance graph and we will see the new template. The blue arrow I've added to the graph is showing where the template name and command name used to be. How easy was that!
Now I'll get a little more technical. Here is the modified template we just created. There are a few sections in here that can get overwhelming, but once you understand it, it's not that complicated.
An RRD file can have multiple data sources. An example of this is the check-host-alive command, a ping test used for host definitions. The performance data returned from this service contains two datasources: Round Trip Time and Packet Loss. When you view the graphs for this service you actually see two graphs. Each datasource increases the size of the .rrd file.
Here is a check command that generates five data sources, and the PNP template uses these to generate two performance graphs. The first graph uses three datasources and the second graph uses two data sources.
So going back to the template we modified. The default template is designed to create one graph per data source. It does this by looking at the RRD file, looping through each datasource and generating a graph for each. This is a simple PHP foreach loop, and the code within the loop references the relevant datasource by the $i variable. So that's how individual graphs can be generated for each datasource in a generic fashion.
In a previous slide I showed you a check command that generated five datasources, and the first graph contained three of these datasources. Because I created the check command, I know that it will always output five data sources in the performance data and they will always be output in the same numerical order. I will explain this in further detail later on when we get to the section on creating your own plugins. Here is the first part of the template that shows you how this is achieved: On line 10 we define var1 as the 1st datasource $DS[1]. On line 11 we define var2 as the 2nd datasource $DS[2]. On line 12 we define var3 as the 3rd datasource $DS[3]. And then throughout the rest of the code, the graphs that are generated are pulling the specific data from the RRD files for each specific datasource. $opt[1] and $def[1] are the references for the first graph being generated. Not shown here is the code that generates the second graph, which is referenced as $opt[2] and $def[2].
The last part I will talk about in relation to templates is number formatting. Things here can get very complex indeed. Here is an example of the numbers displayed on the custom template we modified, along with the corresponding code. The relevant piece I am going to refer to is %3.4lf
Here is an example of the numbers displayed for the five-data-source .rrd file, along with the corresponding code. The relevant piece I am going to refer to is %4.0lf
What I am highlighting here is: on the first graph, the numbers are displayed with four decimal places; on the second graph, the numbers are displayed as whole numbers.
The PNP documentation defines the number formatting using the printf standard described here: http://en.wikipedia.org/wiki/Printf I must point out that because the number 1 and the letter "l" look alike, the format %3.4lf contains a lower case "L", not a one. The syntax is %[parameter][flags][width][.precision][length]type
Specifically I am going to focus on two of these: width - when the number is rendered on the graph it is allocated a minimum width, which helps you align numbers in a column style. precision - determines whether the number displayed is a whole number, or a number with a specific number of digits following the decimal place.
%3.4lf: width = 3, precision = .4, hence the displayed number is 25.3800. %4.0lf: width = 4, precision = .0, hence the displayed number is 14; because the precision is 0, no decimal place is used. To be honest, I haven't spent time looking into the other options available in the formatting syntax, as width and precision were the only ones I needed to get the results I was after.
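You can reproduce the same width/precision behaviour with the standard printf utility on the command line (note that rrdtool's GPRINT format needs the extra "l" length modifier, while plain shell printf does not):

```shell
# precision .4: four digits after the decimal point
printf '%3.4f\n' 25.38    # prints 25.3800

# precision .0: whole number, no decimal place;
# width 4 pads the result with spaces on the left
printf '%4.0f\n' 14       # prints "  14"
```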
MRTG stands for the Multi Router Traffic Grapher
In Nagios XI, MRTG uses a config file (/etc/mrtg/mrtg.cfg) that contains all the devices, and their ports, that it is going to gather data on. When you run the Network Switch / Router wizard, it populates the MRTG config file with the device you just queried. MRTG is run as a cron job every 5 minutes, defined in /etc/cron.d/mrtg. The name cron comes from the Greek word for time, χρόνος (chronos). Cron is the time-based job scheduler on Linux; in the Windows world the equivalent is the Task Scheduler.
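For reference, the cron entry in /etc/cron.d/mrtg typically looks something like the following on a stock install (illustrative only; the exact flags and paths vary by distribution and version):

```
# /etc/cron.d/mrtg -- run MRTG every 5 minutes as root
*/5 * * * * root LANG=C LC_ALL=C /usr/bin/mrtg /etc/mrtg/mrtg.cfg --lock-file /var/lock/mrtg/mrtg_l
```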
When MRTG runs, it gathers data from the devices defined in mrtg.cfg and writes it into the folder /var/lib/mrtg. For every port monitored, an .rrd file is created. NOTE: no .xml file is generated. In Nagios XI, the service checks defined for the ports you want to monitor run a command that looks for the .rrd file in the "/var/lib/mrtg" folder and then puts this information into the regular location for performance data: "/usr/local/nagios/share/perfdata/<host>/<service>"
As I explained before, when you run the Network Switch / Router wizard, it populates the MRTG config file with the details of the device you just queried. In the wizard you may have selected only 10 ports on the switch to monitor. Regardless of the selections you make in the wizard, mrtg.cfg will be populated with all ports on the switch. Nagios itself will only have service definitions for the 10 ports you selected to monitor.
What you can do here is edit the mrtg.cfg file and remove all of the ports you do not wish to gather data on. However, this can cause another issue in the future, which I will explain here. Let's say you now need to monitor an additional two ports on that switch. Running the Network Switch / Router wizard again takes you through all the steps to select these ports. However, due to how the wizard works, when it detects that the switch already exists in the mrtg.cfg file it will not update mrtg.cfg. Even though you edited the mrtg.cfg file in the past and removed these ports, the wizard does not look at that level of detail.
A similar behaviour occurs in relation to switch stacking. For example, I have a stack of two 48-port switches (96 ports in total). In the past I ran the wizard and monitored everything I needed. Now we have added an additional 48-port switch to the stack, taking the total to 144 ports. Because this is a stack of switches, it is all monitored through one IP address, so the same behaviour explained above occurs: running the Network Switch / Router wizard again takes you through all the steps to select these additional ports, but because the wizard detects that the switch already exists in mrtg.cfg, it will not update the file.
Use the cfgmaker tool to update the mrtg.cfg file
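A sketch of what that looks like; the community string, IP address, and paths below are placeholders for your own environment, and you should check cfgmaker's man page for the options your version supports:

```shell
# Regenerate the MRTG config for a device. 'public' and
# 192.168.1.1 are placeholder SNMP community string and switch IP.
cfgmaker --ifref=ip \
    --global "WorkDir: /var/lib/mrtg" \
    --output /etc/mrtg/mrtg.cfg \
    public@192.168.1.1
```

Note that --output overwrites the target file, so back up your existing mrtg.cfg first if it contains other devices you want to keep.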
When you are monitoring an environment that changes frequently, it helps to keep the mrtg.cfg file clean. For example, in my environment we have clients with multiple WAN links connected in a private IP cloud, and we monitor the client routers on these WAN links. From time to time WAN links are decommissioned. While we remove these client routers from the Nagios XI configuration, MRTG is still trying to collect data from them. If the WAN IP no longer exists, MRTG will time out while trying to contact those routers. These timeouts add up, especially as your mrtg.cfg file accumulates more and more decommissioned client routers. Keeping in mind that MRTG runs every five minutes, these timeouts can cause MRTG to run longer than five minutes, at which point it's not really running every five minutes anymore.
Firmware upgrades on client routers can cause issues as well. We've noticed this behaviour specifically on SonicWALL firewalls. What can happen is that when a major firmware revision is released, the numbering of ports inside the firmware changes. For example, the WAN port we monitored was port 1 and the LAN port was port 2; after the firmware upgrade, the WAN port became port 0 and the LAN port became port 1. We were only monitoring the WAN port using MRTG, but MRTG kept gathering data from the SonicWALL for port 1, so the MRTG graphs now reflected all the data for the LAN port on the router and not the WAN port. What we saw was a massive jump in the graphs, because we were collecting all the local LAN traffic passing through that port when we were only interested in the WAN port activity.