Troy Lea's presentation on Leveraging and Understanding Performance Data and Graphs.
The presentation was given during the Nagios World Conference North America held Sept 30-Oct 2, 2013 in Saint Paul, MN. For more information on the conference (including photos and videos), visit: http://go.nagios.com/nwcna
3. 3
About This Presentation
Understanding how performance data is stored
in the back end and how Nagios accesses it
Goal is to give you key pieces of information
A good reference for understanding concepts
This presentation is centered around Nagios XI
Valid for other Nagios implementations
6. 6
Basic Concepts - Part 3
Service check command is executed by the monitoring engine
Monitoring engine receives the result of the check
Data received has performance data
Performance data is anything after the | (pipe)
The performance data is inserted into an RRD file
When viewing the performance graph, PNP4Nagios retrieves the
performance data from the RRD file and generates a pretty graph
Every time the service check receives performance data, it inserts
this performance data into the RRD file which allows you to look at
trends over time
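The split rule above can be sketched with plain shell parameter expansion. This is an illustration only (the sample line is borrowed from the check_icmp example used later in this deck), not how the monitoring engine itself parses output:

```shell
# Sample plugin output, status text and performance data separated by a pipe:
line='OK - 127.0.0.1: rta 2.687ms, lost 0% | rta=2.687ms;3000.000;5000.000;0; pl=0%;80;100;;'

# Everything before the first pipe: status text the monitoring engine processes.
echo "${line%%|*}"

# Everything after the pipe: performance data destined for the RRD file.
echo "${line#*|}"
```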
7. 7
Plugins
The power of Nagios is in the plugins!
Monitor what you want, how you want!
Resources available that clearly define the
guidelines around creating plugins
Nagios Plug-in Developer Guidelines
http://nagiosplug.sourceforge.net/developer-guidelines.html
PNP Documentation
http://docs.pnp4nagios.org/pnp-0.4/doc_complete
8. 8
Plugin Output Explained - Part 1
Plugins produce data divided into two parts
The pipe symbol "|" is used as a delimiter
Example check_icmp
OK - 127.0.0.1: rta 2.687ms, lost 0% |
rta=2.687ms;3000.000;5000.000;0; pl=0%;80;100;;
Data to the left of the pipe symbol is processed
by the monitoring engine
Data to the right of the pipe symbol is used for
inserting into RRD and XML files
9. 9
Plugin Output Explained - Part 2
The exit code Nagios receives from the plugin
determines the state of the service
0 = OK
1 = WARNING
2 = CRITICAL
3 = UNKNOWN
The exit code is not "visible" when running a
check from the command line or looking at the
output returned from the plugin
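A quick way to see the otherwise invisible exit code from the command line is to print $? immediately after running the plugin. A minimal sketch, using a throwaway demo plugin (the path and output here are made up for illustration):

```shell
# Create a throwaway plugin that reports WARNING (exit code 1).
cat > /tmp/check_demo.sh <<'EOF'
#!/bin/sh
echo "WARNING - demo value high | demo=42;40;50;;"
exit 1
EOF
chmod +x /tmp/check_demo.sh

# Run it, then echo the exit code Nagios would act on.
/tmp/check_demo.sh
echo "exit code: $?"   # prints: exit code: 1
```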
10. 10
Plugin Output Explained - Part 3
No performance data = no pretty graphs
You can create a plugin using whatever
language and tools are available
All that matters is the end result which is
returned back to Nagios when the plugin has
finished running
11. 11
Plugin Output Explained - Part 4
Examples:
Shell script
Something you might want to check on the Nagios
host itself
Perl script
Remotely checking a device using SNMP OR using
third party APIs like the VMware vSphere SDK to
remotely access virtual environments
Visual Basic script
Using NSClient on a Windows host to perform a
check (like RDP usage)
12. 12
Performance Data Specifics - Part 1
Asterisk (*) fields are required; everything
else is optional
In this instance, rta is the FIRST DS, or DS 1
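The field layout from the plugin developer guidelines is 'label'=value[UOM];[warn];[crit];[min];[max]. As an illustrative sketch (variable names are my own), one datasource can be split on semicolons in the shell:

```shell
# One datasource from the check_icmp example; fields are semicolon-separated:
#   'label'=value[UOM];[warn];[crit];[min];[max]
perf='rta=2.687ms;3000.000;5000.000;0;'

# Split into its fields (label and value stay joined by the equals sign).
IFS=';' read -r value warn crit min max <<EOF
$perf
EOF

echo "value=$value warn=$warn crit=$crit min=$min max=$max"
```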
13. 13
Performance Data Specifics - Part 2
Multiple DS
Each DS is separated by a space
rta=2.687ms;3000.000;5000.000;0; pl=0%;80;100;;
The label can contain spaces; however, the label
MUST then be enclosed in single quotes
'Round Trip Average'=2.687ms;3000.000;5000.000;0;
'Packet Loss'=0%;80;100;;
14. 14
Basic Plugin - Part 1
Example shell script demonstrating how a plugin
outputs performance data
NUMBER1=$(( ( RANDOM % 100 ) + 1 ))
NUMBER2=$(( ( RANDOM % 1000 ) + 1 ))
echo "OK - Number 1: $NUMBER1 Number 2: $NUMBER2 | 'Number 1'=$NUMBER1;;;; 'Number 2'=$NUMBER2;;;;"
exit 0
15. 15
Basic Plugin - Part 2
Here is the output each time it is run:
OK - Number 1: 4 Number 2: 74 | 'Number 1'=4;;;; 'Number 2'=74;;;;
OK - Number 1: 52 Number 2: 758 | 'Number 1'=52;;;; 'Number 2'=758;;;;
OK - Number 1: 73 Number 2: 60 | 'Number 1'=73;;;; 'Number 2'=60;;;;
OK - Number 1: 29 Number 2: 338 | 'Number 1'=29;;;; 'Number 2'=338;;;;
OK - Number 1: 87 Number 2: 612 | 'Number 1'=87;;;; 'Number 2'=612;;;;
16. 16
Basic Plugin - Part 3
Performance data displayed as a pretty graph
Demonstration of how you can generate
performance data in a plugin
17. 17
Basic Plugin - Part 4
Now let's add warning and critical thresholds to
the performance data string
Number1
WARNING @ 50
CRITICAL @ 75
Number2
WARNING @ 500
CRITICAL @ 750
echo "OK - Number 1: $NUMBER1 Number 2: $NUMBER2 | 'Number 1'=$NUMBER1;50;75;; 'Number 2'=$NUMBER2;500;750;;"
18. 18
Basic Plugin - Part 5
Here is the output each time it is run:
OK - Number 1: 4 Number 2: 74 | 'Number 1'=4;50;75;; 'Number 2'=74;500;750;;
OK - Number 1: 52 Number 2: 758 | 'Number 1'=52;50;75;; 'Number 2'=758;500;750;;
OK - Number 1: 73 Number 2: 60 | 'Number 1'=73;50;75;; 'Number 2'=60;500;750;;
OK - Number 1: 29 Number 2: 338 | 'Number 1'=29;50;75;; 'Number 2'=338;500;750;;
OK - Number 1: 87 Number 2: 612 | 'Number 1'=87;50;75;; 'Number 2'=612;500;750;;
19. 19
Basic Plugin - Part 6
This demonstrates how the performance data
does not have any effect on the state of the service
Warning and Critical thresholds are inside
the .xml file
20. 20
.rrd and .xml files
Used for recording the results from Nagios checks
Useful for observing daily trends of your environment
Invaluable for helping resolve performance issues
RRD = Round Robin Database
XML = Information about the Nagios check
PNP4Nagios uses the RRD and XML files to
generate pretty graphs
21. 21
Location of .rrd and .xml files
When a service check returns performance data,
Nagios dumps this into:
/usr/local/nagios/var/spool/perfdata
A background process detects the spooled data
and creates / updates the relevant .rrd and .xml
The Performance Data files live in:
/usr/local/nagios/share/perfdata/<host>
22. 22
Extract .rrd data
You can extract data from an .rrd file
Example (from the CLI):
rrdtool fetch /usr/local/nagios/share/perfdata/localhost/_HOST_.rrd MAX -r 900 -s -1h
23. 23
.rrd and .xml Gotcha - Part 1
The .xml file can contain sensitive data
<NAGIOS_SERVICECHECKCOMMAND>check_emc_clariion!$HOSTADDRESS$!-u readonly!-p Str0ngPassw0rd!-t sp_cbt_busy!--sp A!--warn 70!--crit 90!</NAGIOS_SERVICECHECKCOMMAND>
24. 24
.rrd and .xml Gotcha - Part 2
Perhaps use a central credential file
<NAGIOS_SERVICECHECKCOMMAND>check_vmware_host!check_vmware_config_vcenter01!cpu!90!95!!!!</NAGIOS_SERVICECHECKCOMMAND>
25. 25
.rrd and .xml Gotcha - Part 3
RRD Data is averaged out over time
Looking at performance graphs for the past day / week /
month / year will show results with less spiky data
This generally only occurs with data that has lots of
peaks and troughs
Constant data like disk space used will generally not
average out that much
It all depends on your environment!
When reviewing RRD data you need to take these
factors into consideration; it's all relative!
26. 26
Graphs - How Templates Are Used - Part 1
http://docs.pnp4nagios.org/pnp-0.4/tpl
27. 27
Graphs - How Templates Are Used - Part 2
PNP4Nagios queries the XML file for the
<TEMPLATE> tag
Each datasource has its own <TEMPLATE> tag
<TEMPLATE>check-host-alive</TEMPLATE>
It can also be a trailing string in the performance
data (good for distributed monitoring)
OK - 127.0.0.1: rta 2.687ms, lost 0% |
rta=2.687ms;3000.000;5000.000;0; pl=0%;80;100;;
[check_icmp]
28. 28
Graphs - How Templates Are Used - Part 3
From the example graphs:
<TEMPLATE>check-host-alive</TEMPLATE>
<TEMPLATE>check_local_load_alt</TEMPLATE>
PNP4Nagios looks for a php file with this name
in the following folders:
/usr/local/nagios/share/pnp/templates.dist
/usr/local/nagios/share/pnp/templates
29. 29
Graphs - How Templates Are Used - Part 4
check-host-alive
/usr/local/nagios/share/pnp/templates.dist/check-host-alive.php
This PHP file generates the performance graph
check_local_load_alt
check_local_load_alt.php does NOT exist
Default template is used:
/usr/local/nagios/share/pnp/templates.dist/default.php
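The lookup just described can be sketched in shell. This is a rough illustration only: the paths are the XI defaults from the slide, the loop follows the order the slide lists the folders in, and real PNP precedence between the two folders may differ:

```shell
tpl="check_local_load_alt"   # value taken from the <TEMPLATE> tag
found=""
for dir in /usr/local/nagios/share/pnp/templates.dist \
           /usr/local/nagios/share/pnp/templates; do
  if [ -z "$found" ] && [ -f "$dir/$tpl.php" ]; then
    found="$dir/$tpl.php"
  fi
done

# Fall back to default.php when no matching template file exists:
echo "${found:-/usr/local/nagios/share/pnp/templates.dist/default.php}"
```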
30. 30
Graphs - Creating Your Own Template - Part 1
The check_command name is what Nagios uses
to insert into the <TEMPLATE> tag in the XML
file (how PNP determines which template to use)
So for this example I have created a copy of an
existing command
check_xi_service_nsclient_alt
31. 31
Graphs - Creating Your Own Template - Part 2
The service definition using the new command
32. 32
Graphs - Creating Your Own Template - Part 3
The graph currently being generated
Default Template being used
Check Command being used
.rrd and .xml files currently contain valid data
33. 33
Graphs - Creating Your Own Template - Part 4
Copy the file:
/usr/local/nagios/share/pnp/templates.dist/default.php
To the following location with the name:
/usr/local/nagios/share/pnp/templates/check_xi_service_nsclient_alt.php
Edit check_xi_service_nsclient_alt.php
34. 34
Graphs - Creating Your Own Template - Part 5
In the graph we are removing the bottom two lines
Default Template
Check Command command name
Which are lines 62 and 63
$def[$i] .= 'COMMENT:"Default Template\r" ';
$def[$i] .= 'COMMENT:"Check Command ' . $TEMPLATE[$i] . '\r" ';
Save check_xi_service_nsclient_alt.php
35. 35
Graphs - Creating Your Own Template - Part 6
How easy was that!
Updated graph
Template Name and Check Command removed
36. 36
PNP Templates In Detail - Part 1
Let's get into specifics
The template we just modified
It's not that complicated! (LOL)
37. 37
PNP Templates In Detail - Part 2
.rrd files can have multiple datasources (DS)
Round Trip Time and Packet Loss for example
38. 38
PNP Templates In Detail - Part 3
Example of .rrd file with five DS
Two graphs generated using these DS
39. 39
PNP Templates In Detail - Part 4
Default Template creates one graph per DS
This is a simple PHP foreach loop
The code within the loop references the relevant
DS by the $i variable
40. 40
PNP Templates In Detail - Part 5
This section of the template uses three DS
One graph will be generated using three DS
$opt[1] and $def[1] are references to the first graph
being generated
41. 41
PNP Templates In Detail - Part 6
Number formatting
Our modified template and the relevant code
The relevant information:
%3.4lf
42. 42
PNP Templates In Detail - Part 7
The three DS template and the relevant code
The relevant information:
%4.0lf
43. 43
PNP Templates In Detail - Part 8
Numbers are displayed with four decimal places
%3.4lf
Numbers are displayed as whole numbers
%4.0lf
44. 44
PNP Templates In Detail - Part 9
PNP documentation defines the number
formatting using the printf standard defined here
http://en.wikipedia.org/wiki/Printf
The number (1) and the letter "L" look alike
%3.4lf contains a lower case "L", not the digit one
The syntax is
%[parameter][flags][width][.precision][length]type
45. 45
PNP Templates In Detail - Part 10
width
When the number is generated on the graph, it is
allocated a minimum width; this helps you
align numbers in columns
precision
Determines if the number displayed is a whole
number, or a number with a specific number of digits
following the decimal place
46. 46
PNP Templates In Detail - Part 11
%3.4lf
width = 3
precision = .4
hence the displayed number is 25.3800
%4.0lf
width = 4
precision = .0
hence the displayed number is 14
Because the precision is 0, NO decimal place is used
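The same width/precision behaviour can be tried with the shell's printf builtin. Note this uses plain %f; the "l" length modifier belongs to the C-style formatter rrdtool uses:

```shell
printf '%3.4f\n' 25.38   # precision .4 pads to four decimals: 25.3800
printf '%4.0f\n' 14      # width 4 left-pads the whole number to four characters
```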
47. 47
MRTG - Part 1
MRTG = Multi Router Traffic Grapher
Nagios addon that is useful for monitoring
network switch and router bandwidth using SNMP
Its configuration can be complicated to understand
48. 48
MRTG - Part 2
Nagios XI Wizard called "Network Switch /
Router" automates the configuration of MRTG
MRTG configuration file
/etc/mrtg/mrtg.cfg
MRTG runs as a cron job every five minutes
cron comes from the Greek word for time, χρόνος
[chronos]
Hence cron is a software utility on Linux which is a
time-based job scheduler
In the Windows world it's the Task Scheduler
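For reference, an MRTG cron entry typically looks like the line below. The exact path and flags are illustrative assumptions (distribution packages vary), not taken from the presentation:

```shell
# Hypothetical /etc/cron.d/mrtg entry: run MRTG every five minutes as root.
*/5 * * * * root LANG=C LC_ALL=C /usr/bin/mrtg /etc/mrtg/mrtg.cfg
```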
49. 49
MRTG - Part 3
When MRTG runs, it gathers data from the
devices defined in the mrtg.cfg file
It dumps this data into the folder
/var/lib/mrtg
For every port monitored, an .rrd file is created
(no .xml file created at this point)
Another background process will then take the
data in /var/lib/mrtg and put it into the correct
location
/usr/local/nagios/share/perfdata/<host>
50. 50
MRTG Gotcha - Part 1
When the Wizard populates the mrtg.cfg file it will
add ALL ports on the switch to the config file
Even if you only selected to monitor 10 ports on
the switch
The Nagios XI Service Configuration will only have 10
ports defined as service definitions
Every time the MRTG cron job runs, it will collect
data from all ports on the switch (as defined in the
mrtg.cfg file)
Extra CPU cycles, extra disk space
51. 51
MRTG Gotcha - Part 2
On a 48 port switch this might not concern you
But in a stack of two 48 port switches this
becomes 96 ports, plus other internal ports like
link aggregation ports (perhaps another 32)
So these additional 128 ports have now added
8700+ configuration lines to the mrtg.cfg file
128 ports consume about 24 MB of .rrd disk
space
In my past environment, the mrtg.cfg file was
59,000 lines long!
52. 52
MRTG Gotcha - Part 3
Suggestion
Clean up the mrtg.cfg file
Remove the ports you do not wish to gather data on
Can this cause problems?
Yes!
Problem 1
Monitoring additional ports later using the wizard will
not work
The wizard will NOT re-add the ports to the mrtg.cfg file
Wizard detects switch / router is already in the mrtg.cfg file
53. 53
MRTG Gotcha - Part 4
Problem 2 - Adding a switch (or module) to an
existing switch
Monitoring additional ports later using the wizard will
not work
The wizard will NOT add newly detected ports to the
mrtg.cfg file
Wizard detects switch / router is already in the mrtg.cfg file
Very similar behaviour to Problem 1
Only relevant when the new switch / module is managed
through the existing IP Address / FQDN
Common with stacked switches, adding another switch to
the stack
54. 54
MRTG Gotcha - Part 5
Solutions to Problems 1 & 2
cfgmaker
This is how the Wizard configures mrtg.cfg
The wizard updates the existing mrtg.cfg using a PHP
function (not available from the CLI)
Run cfgmaker @ CLI to generate a config file
Add the contents of the config file to the existing mrtg.cfg
cfgmaker --noreversedns public@192.168.1.1 --output=output.txt
55. 55
MRTG Gotcha - Part 6
Problem 3 - With a frequently changing
environment, keep mrtg.cfg clean
Monitoring WAN links for remote routers?
WAN link no longer exists?
Disable / Delete service definition(s) in Core Configuration
Manager (CCM)
You will NEED to remove the device from mrtg.cfg
Why?
MRTG will still try and collect data from WAN links no longer
accessible
Causes delays and can make MRTG run past the default 5
minute schedule ... can cause graph anomalies
56. 56
MRTG Gotcha - Part 7
Problem 4 - Firmware Upgrade causes port
numbering to change
Major firmware revision applied to switch / router
New data collected for ports is no longer the same pattern
Internal port numbering has changed
mrtg.cfg queries specific port numbers, does not use port
names or descriptions
Example
Old Firmware: WAN = Port 1 LAN = Port 2
New Firmware: WAN = Port 0 LAN = Port 1
I have seen this behaviour on SonicWALL firewalls
58. 58
Discount Offer
But wait, there's more ...
When visiting the Nagios XI website, use my affiliate link
http://www.nagios.com/#ref=3oHG00
Editor's Notes
Good afternoon all and thank you for coming to my session. My name is Troy Lea and I'm here to talk to you about leveraging and understanding performance data and graphs in Nagios.
First a little about me. I'm primarily a Windows tech, starting back in DOS 6 and Windows 3.1. I've worked in a variety of support roles over the years, and my last role involved the development and maintenance of a cloud computing platform based on Windows Remote Desktop; I primarily looked after the backend infrastructure. I've been using Nagios XI since 2009. I originally tried Nagios before XI was released, however being a Windows guy there were some Linux barriers that I just could not get my head around. I love Nagios XI because it is delivered as a virtual appliance. Within minutes of importing that VM and powering it on you have a fully functional monitoring product. Before I caught the Nagios bug, my programming experience was all Windows related: batch files, VB scripts and PowerShell. I had dabbled in a little HTML but only because I had to. Since then I've learnt HTML, PHP, CSS, Javascript, Perl, Bash ... whatever is required to get the result I needed.
In the world of monitoring there is more to Nagios than sending alerts because a server is about to run out of hard disk space. Collecting and storing performance data is one of the most useful features in Nagios; with this information you can get an understanding of your environment's day to day trends. Analysing this data can be very helpful, perhaps to look at growth, or to identify performance bottlenecks. This session is about understanding how the performance data is stored in the back end and how Nagios accesses it. Topics covered in this session are: basic concepts; understanding the .rrd and .xml files; understanding how PNP generates graphs; creating custom graph templates in PNP; writing plugins that will output the performance data you want; understanding how MRTG works. Everything I will talk about is documented on the Internet, however that information does not always appear on the first page of your Google search results. It's especially difficult when you are learning a new language or concept; the information out there is not always helpful, or it can get overwhelming. Even though this is an advanced technical session, it's aimed at delivering the core concepts and information to help you get the results you need (and impress the boss). As I've mentioned before, I'm primarily a Windows tech. So some of the material I talk about might be obvious to a Linux tech, however to a Windows tech it can get frustrating, so my goal here is to make the content accessible to anyone. This presentation is centered around Nagios XI. There are references to locations of files and components; your implementation of Nagios may differ slightly, however the concepts are still the same.
I'll start off quickly explaining the basic concepts. Let's look at a common service that is used in monitoring, a free disk space check. Here is the service configuration and the current service status.
Here is this command and the output we see when we execute it from the CLI. The data after the pipe symbol is the performance data, I will explain this in more detail later on. Here is the Advanced Status Detail of the service showing the performance data string.
Here is the performance graph for this service, the end result. The chain of events that occur are ...
When I first began using Nagios, it became apparent that the power behind Nagios came with plugins. The ability to monitor what you want, how you want, using a variety of different methods really appealed to me. I think everyone who starts developing plugins for Nagios has a very similar journey: we modify an existing plugin to make it suit our environment; we then create a simple plugin using an existing one to do something completely different; before we know it we are writing very complex plugins. There are two exceptional resources available that clearly define the guidelines around creating plugins. Nagios Plug-in Developer Guidelines: http://nagiosplug.sourceforge.net/developer-guidelines.html - the information here is very clear and easy to understand, and I am constantly referring to it. PNP Documentation: http://docs.pnp4nagios.org/pnp-0.4/doc_complete - this has some more detailed information and examples in relation to the performance data and how it needs to be formatted.
Taken directly from the PNP documentation. When the plugin produces performance data, it is divided into two parts. The pipe symbol ("|") is used as a delimiter. Example check_icmp: OK - 127.0.0.1: rta 2.687ms, lost 0% | rta=2.687ms;3000.000;5000.000;0; pl=0%;80;100;; Something I want to make really clear here is: the data to the left of the pipe symbol is processed by the monitoring engine; the data to the right of the pipe symbol is used for inserting into RRD files for performance data.
The only information not shown here is the exit code Nagios receives from the plugin that determines the state of the check: 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN
If your plugin does not output performance data, then graphs will not be available for that service. So it's as basic as that. You can create your plugin using whatever language you need to, as it fits your purpose and needs. All that matters is the end result which is returned back to Nagios when the plugin has finished running.
Shell script: something you might want to check on the Nagios host itself. Perl script: remotely checking a device using SNMP, or using third party APIs like the VMware vSphere SDK to remotely access virtual environments. Visual Basic script: using NSClient on a Windows host to perform a check (like RDP usage).
Here is a breakdown of the performance data. The asterisk (*) fields are required fields, everything else is optional. In this instance, rta is the FIRST datasource, or datasource 1.
A plugin can output multiple datasources. Each datasource is separated by a space and the format is the same. Example: rta=2.687ms;3000.000;5000.000;0; pl=0%;80;100;; The label can have spaces if you desire, however the label MUST be enclosed by single quotes. Example: 'Round Trip Average'=2.687ms;3000.000;5000.000;0; 'Packet Loss'=0%;80;100;;
Here is a basic plugin I have created to demonstrate outputting performance data using a shell script. This is just a simple script that generates two random numbers and outputs them. For demonstration purposes this script will always return an OK state. NUMBER1=$(( ( RANDOM % 100 ) + 1 )) NUMBER2=$(( ( RANDOM % 1000 ) + 1 )) echo "OK - Number 1: $NUMBER1 Number 2: $NUMBER2 | 'Number 1'=$NUMBER1;;;; 'Number 2'=$NUMBER2;;;;" exit 0
Here is the output each time it is run:
Here are the graphs displayed after the check has been running for a while.
Now I am going to define a warning and critical threshold in the performance data string; this will show you how they appear in the graphs. Number1: WARNING @ 50, CRITICAL @ 75. Number2: WARNING @ 500, CRITICAL @ 750. echo "OK - Number 1: $NUMBER1 Number 2: $NUMBER2 | 'Number 1'=$NUMBER1;50;75;; 'Number 2'=$NUMBER2;500;750;;"
Here is the output each time it is run:
This demonstrates how the performance data does not have any effect on the state of the services. Also, if you were to look into the XML file generated for this service, this is where the warning and critical thresholds are stored.
What are Performance Data Files? Performance data files are used for recording the results from Nagios checks, which in turn become useful for observing the daily trends of your environment. Being able to look at hourly/daily/weekly/monthly/yearly historical data can be invaluable when trying to resolve performance issues. It helps get to the bottom of those customer complaints like "the server is slow". There are two files created by Nagios for every check that generates performance data. The RRD file is a Round Robin Database. That means that after some time the oldest data will be dropped at the "end" and it will be replaced by new values "at the beginning". This is the file that contains all the historical data. The XML file contains detailed information about the check that generated the performance data: things like warning and critical thresholds, and the names of the checks. This file is updated at the same time as the RRD file, so it will always contain information obtained from when the check was last run. How are these files used? When you are viewing performance graphs in Nagios, they are generated by an application called PNP4Nagios. PNP4Nagios uses the XML and RRD files to generate these graphs. PNP4Nagios allows you to create your own customised graphs based on the information in the XML file and then displays the historical data in the RRD file. It takes a couple of service check runs initially to collect performance data before you will see performance graphs. How long it takes to see the data in the performance graphs depends on the frequency of your service checks.
Initially, when a service check returns performance data, Nagios dumps this into: /usr/local/nagios/var/spool/perfdata. Another background process will then detect this spooled perfdata and create/update the relevant .rrd and .xml files. The Performance Data files live in: /usr/local/nagios/share/perfdata/<host>. There is a folder for each host. The host object files are called _HOST_ (the check_icmp command that determines if a host is up or down). All the other files are relevant to the service objects defined for each host.
If you want to extract the data from an .rrd file you can do it with the following command: rrdtool fetch /usr/local/nagios/share/perfdata/localhost/_HOST_.rrd MAX -r 900 -s -1h If you don't specify start and end times, the data retrieved will be from the past day.
The .xml file can contain sensitive data. When the .xml file is created/updated, a lot of information is stored in this file that is relevant to the check command that was run, which could have a password stored in plain text. For example, here is a service check that has a password stored in the definition, and here is the line in the .xml file: <NAGIOS_SERVICECHECKCOMMAND>check_emc_clariion!$HOSTADDRESS$!-u readonly!-p Str0ngPassw0rd!-t sp_cbt_busy!--sp A!--warn 70!--crit 90!</NAGIOS_SERVICECHECKCOMMAND>
There are many methods to work around this behaviour if you are not comfortable with it. For example, this service check uses a file that contains the credentials, and you can see that the credentials are not inside the .xml file: <NAGIOS_SERVICECHECKCOMMAND>check_vmware_host!check_vmware_config_vcenter01!cpu!90!95!!!!</NAGIOS_SERVICECHECKCOMMAND>
RRD Data is averaged out over time. Performance graphs for the past day / week / month / year will show results with less spiky data. This generally only occurs with data that has lots of peaks and troughs; the lower troughs will cause the overall average to be less, so the peaks will appear lower. Something like active user sessions will have a peak through business hours and then a drop to almost nothing out of hours. Constant data like disk space used will generally not average out that much. It all depends on your environment! When reviewing RRD data you need to take these factors into consideration, as it's all relative.
When you are viewing performance graphs in Nagios, they are generated by an application called PNP4Nagios. Here are two examples. The difference between the two graphs is that the first one has a PNP template and hence it's a little prettier, compared to the second graph that is generic and tells you that it is using the Default Template.
So how does this work? http://docs.pnp4nagios.org/pnp-0.4/tpl When the RRD and XML files are created / updated, the check_command directive defined in the service object is added to the XML file under each <DATASOURCE> tag as the TEMPLATE tag. In relation to distributed monitoring, if PNP finds a string enclosed in brackets at the end of the performance data it will be recognized as a check command and will be used as the PNP template. OK - 127.0.0.1: rta 2.687ms, lost 0% | rta=2.687ms;3000.000;5000.000;0; pl=0%;80;100;; [check_icmp] When PNP goes to display the graph, it queries the XML file and gets the TEMPLATE tag for each datasource.
For the example graphs shown on previous slides, these values are: <TEMPLATE>check-host-alive</TEMPLATE> and <TEMPLATE>check_local_load_alt</TEMPLATE> In the examples above, these values are: check-host-alive and check_local_load_alt. It then looks in the following folders to see if it can find a php file that has one of these names: /usr/local/nagios/share/pnp/templates.dist /usr/local/nagios/share/pnp/templates
In the first example above it finds the following file: /usr/local/nagios/share/pnp/templates.dist/check-host-alive.php So it uses this PHP file to generate the performance graph  In the second example above it cannot find any file named check_local_load_alt.php so it uses the default template which is: /usr/local/nagios/share/pnp/templates.dist/default.php
Creating your own templates isn't too hard, but it is a little complex and will require some trial and error. The best starting point is to find an existing template and modify it to your liking. As described in the previous slide, the name of the check_command is what Nagios uses to insert into the <TEMPLATE> tag in the XML file (how PNP determines which template to use). So for this example I have created a copy of an existing command called "check_xi_service_nsclient_alt". You can see the command is identical to the original command except for the name.
Here is the service I am using that I want to view custom graphs for, you can now see it is using the new command.
And here is the graph being generated by this service; you can see it is currently using the default template and it is also telling you the check command. So that's our starting point: we know the data currently exists in the RRD and XML files and we are ready to create our custom template.
Copy the file: /usr/local/nagios/share/pnp/templates.dist/default.php To the following location with the name: /usr/local/nagios/share/pnp/templates/check_xi_service_nsclient_alt.php Edit the file check_xi_service_nsclient_alt.php
I am going to remove the bottom two lines (Default Template, Check Command command name), which are lines 62 and 63: $def[$i] .= 'COMMENT:"Default Template\r" '; $def[$i] .= 'COMMENT:"Check Command ' . $TEMPLATE[$i] . '\r" '; Save the file, and then go and reload the performance graph and we will see the new template.
Reload the performance graph and we will see the new template. The blue arrow I've added to the graph is showing where the template name and command name used to be. How easy was that!
Now I'll get a little more technical. Here is the modified template we just created. There are a few sections in here that can get overwhelming, but once you understand it, it's not that complicated.
An RRD file can have multiple data sources. An example of this is the check-host-alive command, a ping test used for host definitions. The performance data returned from this service contains two datasources: Round Trip Time and Packet Loss. When you view the graphs for this service you actually see two graphs. Each datasource increases the size of the .rrd file.
Here is a check command that generates five data sources, and the PNP template uses these to generate two performance graphs. The first graph uses three datasources and the second graph uses two data sources.
So going back to the template we modified. The default template is designed to create one graph per data source. It does this by looking at the RRD file, looping through each datasource and generating a graph for each. This is a simple PHP foreach loop, and the code within the loop references the relevant datasource by the $i variable. So that's how individual graphs can be generated for each datasource in a generic fashion.
In a previous slide I showed you a check command that generated five datasources, and the first graph contained three of these datasources. Because I created the check command, I know that it will always output five data sources in the performance data and they will always be output in the same numerical order. I will explain this in further detail later on when we get to the section on creating your own plugins. Here is the first part of the template that shows you how this is achieved: On line 10 we define var1 as the 1st datasource $DS[1]. On line 11 we define var2 as the 2nd datasource $DS[2]. On line 12 we define var3 as the 3rd datasource $DS[3]. And then throughout the rest of the code, the graphs that are generated are pulling the specific data from the RRD files for each specific datasource. $opt[1] and $def[1] are the references for the first graph being generated. Not shown here is the code that generates the second graph, which is referenced as $opt[2] and $def[2].
The last part I will talk about in relation to templates is number formatting. Things here can get very complex indeed. Here is an example of the numbers displayed on the custom template we modified, along with the corresponding code. The relevant piece I am going to refer to is %3.4lf
Here is an example of the numbers displayed for the five-data-source .rrd file, along with the corresponding code. The relevant piece I am going to refer to is %4.0lf
What I am highlighting here is: on the first graph, the numbers are displayed with four decimal places; on the second graph, the numbers are displayed as whole numbers.
The PNP documentation defines the number formatting using the printf standard described here: http://en.wikipedia.org/wiki/Printf I must point out that because the number 1 and the letter "l" look alike, the format %3.4lf contains a lower case "L", not a one. The syntax is %[parameter][flags][width][.precision][length]type
Specifically I am going to focus on two of these: width - when the number is rendered on the graph it is allocated a minimum width, which helps you align numbers in a column style. precision - determines whether the number displayed is a whole number, or a number with a specific number of digits following the decimal place.
%3.4lf: width = 3, precision = .4, hence the displayed number is 25.3800. %4.0lf: width = 4, precision = .0, hence the displayed number is 14; because the precision is 0, no decimal place is used. To be honest, I haven't spent time looking into the other options available in the formatting syntax, as width and precision were the only ones I needed to get the results I was after.
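You can reproduce the same width/precision behaviour with the standard printf utility on the command line (note that rrdtool's GPRINT format needs the extra "l" length modifier, while plain shell printf does not):

```shell
# precision .4: four digits after the decimal point
printf '%3.4f\n' 25.38    # prints 25.3800

# precision .0: whole number, no decimal place;
# width 4 pads the result with spaces on the left
printf '%4.0f\n' 14       # prints "  14"
```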
MRTG stands for the Multi Router Traffic Grapher
In Nagios XI, MRTG uses a config file (/etc/mrtg/mrtg.cfg) that contains all the devices, and their ports, that it is going to gather data on. When you run the Network Switch / Router wizard, it populates the MRTG config file with the device you just queried. MRTG is run as a cron job every 5 minutes, defined in /etc/cron.d/mrtg. The name cron comes from the Greek word for time, χρόνος (chronos). Cron is the time-based job scheduler on Linux; in the Windows world the equivalent is the Task Scheduler.
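For reference, the cron entry in /etc/cron.d/mrtg typically looks something like the following on a stock install (illustrative only; the exact flags and paths vary by distribution and version):

```
# /etc/cron.d/mrtg -- run MRTG every 5 minutes as root
*/5 * * * * root LANG=C LC_ALL=C /usr/bin/mrtg /etc/mrtg/mrtg.cfg --lock-file /var/lock/mrtg/mrtg_l
```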
When MRTG runs, it gathers data from the devices defined in mrtg.cfg and writes it into the folder /var/lib/mrtg. For every port monitored, an .rrd file is created. NOTE: no .xml file is generated. In Nagios XI, the service checks defined for the ports you want to monitor run a command that looks for the .rrd file in the "/var/lib/mrtg" folder and then puts this information into the regular location for performance data: "/usr/local/nagios/share/perfdata/<host>/<service>"
As I explained before, when you run the Network Switch / Router wizard, it populates the MRTG config file with the details of the device you just queried. In the wizard you may have selected only 10 ports on the switch to monitor. Regardless of the selections you make in the wizard, mrtg.cfg will be populated with all ports on the switch. Nagios itself will only have service definitions for the 10 ports you selected to monitor.
What you can do here is edit the mrtg.cfg file and remove all of the ports you do not wish to gather data on. However, this can cause another issue in the future, which I will explain here. Let's say you now need to monitor an additional two ports on that switch. Running the Network Switch / Router wizard again takes you through all the steps to select these ports. However, due to how the wizard works, when it detects that the switch already exists in the mrtg.cfg file it will not update mrtg.cfg. Even though you edited the mrtg.cfg file in the past and removed these ports, the wizard does not look at that level of detail.
A similar behaviour occurs in relation to switch stacking. For example, I have a stack of two 48-port switches (96 ports in total). In the past I ran the wizard and monitored everything I needed. Now we have added an additional 48-port switch to the stack, taking the total to 144 ports. Because this is a stack of switches, it is all monitored through one IP address, so the same behaviour explained above occurs: running the Network Switch / Router wizard again takes you through all the steps to select these additional ports, but because the wizard detects that the switch already exists in mrtg.cfg, it will not update the file.
Use the cfgmaker tool to update the mrtg.cfg file
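A sketch of what that looks like; the community string, IP address, and paths below are placeholders for your own environment, and you should check cfgmaker's man page for the options your version supports:

```shell
# Regenerate the MRTG config for a device. 'public' and
# 192.168.1.1 are placeholder SNMP community string and switch IP.
cfgmaker --ifref=ip \
    --global "WorkDir: /var/lib/mrtg" \
    --output /etc/mrtg/mrtg.cfg \
    public@192.168.1.1
```

Note that --output overwrites the target file, so back up your existing mrtg.cfg first if it contains other devices you want to keep.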
When you are monitoring an environment that changes frequently, it helps to keep the mrtg.cfg file clean. For example, in my environment we have clients with multiple WAN links connected in a private IP cloud, and we monitor the client routers on these WAN links. From time to time WAN links are decommissioned. While we remove these client routers from the Nagios XI configuration, MRTG is still trying to collect data from them. If the WAN IP no longer exists, MRTG will time out while trying to contact those routers. These timeouts add up, especially as your mrtg.cfg file accumulates more and more decommissioned client routers. Keeping in mind that MRTG runs every five minutes, these timeouts can cause MRTG to run longer than five minutes, at which point it's not really running every five minutes anymore.
Firmware upgrades on client routers can cause issues as well. We've noticed this behaviour specifically on SonicWALL firewalls. What can happen is that when a major firmware revision is released, the numbering of ports inside the firmware changes. For example, the WAN port we monitored was port 1 and the LAN port was port 2; after the firmware upgrade, the WAN port became port 0 and the LAN port became port 1. We were only monitoring the WAN port using MRTG, but MRTG kept gathering data from the SonicWALL for port 1, so the MRTG graphs now reflected all the data for the LAN port on the router and not the WAN port. What we saw was a massive jump in the graphs, because we were collecting all the local LAN traffic passing through that port when we were only interested in the WAN port activity.