38. System Monitoring
● What to monitor
● How to monitor
● Which kind of data to collect
● When to use the collected data
39. What to monitor
● Applications / Daemons
● that provide Services
● on one or more Ports
● running on one or more Machines.
Some basic definitions: we are interested in monitoring applications, that is, programs that provide services on one or more ports, running on one or more machines. Of course, these principles apply to just about any program or daemon.
40. Remote vs. Local Checks
● Remote checks
  – Port availability
  – Network
  – Dummy requests / Status servlet
● Local checks
  – Required processes running?
  – Networks available?
  – Error messages in logfiles?
  – Application throughput?
  – System parameters OK?
Check the application logfiles for unusual messages. Check the logfiles for throughput. This is often the best and easiest check: let your users be your testers.
Count the user-generated requests and how many of them were OK. Then calculate a request rate and plot it. (I will show you later how to do that.) Such a graph can be a tremendous help. Imagine getting a call from the helpdesk that a user has reported that your application isn't working. (Don't we all love such specific reports...) If you have such a graph, a single glance is enough to tell the helpdesk: "Well, in the past hour, 3000 requests were processed and only 5 showed an error, so from my point of view, everything is running fine. Walk him through his configuration again..."
41. When to analyze
● Instant notification
  – Daily Operations
  – Boundaries/Limits checking
● Historical data
  – Problem determination and analysis
  – Statistics / Graphs
For daily operations, you want to be notified as soon as possible if any problems turn up or if parameters overstep their boundaries.
For problem analysis, you want all that data to be available at a later point in time in a way that is well-suited for searching.
42. How to collect
● Dedicated monitoring host
  – collects data from application machines
  – global data loss in case of network/monitoring failure
● Store data locally on each host
  – and backup/archive that data!
Beware: the monitoring host is a single point of failure.
If there is a network problem, the monitoring software crashes, or the monitoring host has a hardware failure, ALL data is gone or doesn't get collected.
43. Why History?
● For problem determination and analysis
● Statistics / Graphs
  – Show bottlenecks
  – Prove upgrade needs
  – Managers love it
Store the data locally on the machines, and archive those logs.
Use the backup/archiving mechanisms already in place for your application and webserver logs.
44. Ready-Made vs. Roll-Your-Own
– Deploying existing tools is fast
– Gives you generic checks
– Can be customized over time
– Does it store monitoring data locally?
– Recommendation: roll-your-own system monitoring
45. Using NAGIOS
● http://www.nagios.org/
● Network and Port Checking
● Needs plugins for remote checking
● Use 'ssh' or NRPE
● Think about security: use a non-privileged user. 'sudo' is your friend.
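The NAGIOS configuration itself is not shown in the deck; a minimal sketch of a service that runs a remote check over 'ssh' could look roughly like this (the host name 'app1', the user 'monitor' and the plugin paths are placeholders):

define command {
    command_name  check_remote_by_ssh
    command_line  $USER1$/check_by_ssh -H $HOSTADDRESS$ -l monitor -C "$ARG1$"
}

define service {
    use                  generic-service
    host_name            app1
    service_description  App port check
    check_command        check_remote_by_ssh!/usr/local/nagios/libexec/check_tcp -H localhost -p 8080
}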
46. Data Format
● Should be simple, human-readable, parseable, flexible, grepable.
● YAML? Not grepable.
● CSV? Not flexible.
● Enhance CSV with in-field header info:
  ts=20070828_11:50:00,task=YAPC::2007,status=attending
47. Data Format (parsing)
● Easily parseable:
  @d = split(',', $_); # don't, use Text::CSV_XS
  %h = map {(split('=', $_, 2))} @d;
● use Text::CSV_XS
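A minimal sketch of the Text::CSV_XS variant (the field names ts/task/status come from the example record above; $fh is assumed to be an already-opened logfile handle):

use Text::CSV_XS;

my $csv = Text::CSV_XS->new({ binary => 1 });   # handles quoting and embedded commas

while (my $line = <$fh>) {
    chomp $line;
    $csv->parse($line) or next;                 # skip lines that don't parse
    my %h = map { split('=', $_, 2) } $csv->fields();
    print "$h{ts} $h{task} $h{status}\n";
}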
48. Checking Logfiles
● How to count “matching loglines per minute”?
  # in shell
  tail -f access_log | grep " 200 " | wc -l
  # doesn't work :(
● Use File::Tail (available from CPAN)
49. Using File::Tail
%patterns = (
'/var/logs/httpd/access_log' => [
{
regexp => qr/ 200 /,
outfile => '/var/logs/monitoring/access_ok.last',
history => '/var/logs/monitoring/access_ok.log',
count => 0,
},
{
regexp => qr/ 504 /,
outfile => '/var/logs/monitoring/access_gw_errors.last',
history => '/var/logs/monitoring/access_gw_errors.log',
count => 0,
},
],
);
For each logfile, count several matching regexps and store
the counts into two files, outfile and history.
50. Using File::Tail (2)
my @tails = map { File::Tail->new(name        => $_,
                                  maxinterval => 10,
                                  resetafter  => 120,
                                  reset_tail  => 0,
                                 );
                } keys %patterns;
while (1) {
    my ($nfound, $timeleft, @pending) =
        File::Tail::select(undef, undef, undef, undef, @tails);
    foreach my $f (@pending) {
        my $line    = $f->read();
        my $logfile = $f->{input};
        foreach my $p (@{$patterns{$logfile}}) {
            $p->{count}++
                if ($line =~ m/$p->{regexp}/);
        }
    }
}
In the script, first generate all those File::Tail objects, each representing one logfile.
Wait for new lines to arrive via File::Tail's select().
Process the list of objects that have new lines.
For each line, run all the regexps for that logfile against it and increment the counter on a match.
Periodically save the counts to files and reset the counters.
51. Using File::Tail (3)
$SIG{ALRM} = sub {
    my $time = strftime("%Y%m%d_%H:%M:%S", localtime());
    foreach my $p (map {@$_} values %patterns) {      # expand list
        $p->{count} ||= 0;                            # be sure there is a num there
        open (STATFILE, ">$p->{outfile}.$$")          # write to a temp file
            or die "$0: ERROR opening stat file $p->{outfile}: $!\n";
        print STATFILE "$p->{count}\n";
        close STATFILE;
        rename "$p->{outfile}.$$" => $p->{outfile};   # atomic update
        open (STATFILE, ">>$p->{history}")            # append to history
            or die "$0: ERROR opening history file $p->{history}: $!\n";
        print STATFILE "ts=$time,cnt=$p->{count}\n";
        close STATFILE;
        $p->{count} = 0;                              # reset counter
    }
    alarm 60;                                         # re-arm alarm
};
alarm 60;   # upon alarm, write statistic files and reset counters
Use an alarm() handler.
Alternatively, keep track of the time by specifying a timeout to select() and periodically checking the elapsed time.
Go through all the regexps and counters and save the values to files.
One history file with time information in it that can be used for plotting later on, and one file with just the value, which can be used for the instant checks by NAGIOS.
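Such a '.last' file can then be read by a tiny NAGIOS plugin. A hedged sketch (the thresholds are invented; the file path is the outfile from the %patterns example; exit codes follow the usual plugin convention):

#!/usr/bin/perl
# check_access_ok -- alert if the per-minute request count drops too low
use strict;
use warnings;

my $file = '/var/logs/monitoring/access_ok.last';
my ($warn, $crit) = (100, 10);   # requests per minute, made-up limits

open(my $fh, '<', $file)
    or do { print "UNKNOWN: cannot read $file: $!\n"; exit 3 };
my $count = <$fh>;
close $fh;
chomp $count if defined $count;
$count = 0 unless defined $count and $count =~ /^\d+$/;

if    ($count <= $crit) { print "CRITICAL: only $count requests/min\n"; exit 2 }
elsif ($count <= $warn) { print "WARNING: only $count requests/min\n";  exit 1 }
else                    { print "OK: $count requests/min\n";            exit 0 }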
52. 'nmon'
● 'nmon' courtesy of IBM (unsupported)
● Performance Data
  – CPU utilization, memory use, kernel statistics and run queue
  – Disk I/O rates, transfers, and read/write ratios; filesystem size and free space; disk adapters; paging space and paging rates; user-defined disk groups
  – Network I/O rates, transfers, and read/write ratios
  – Machine details, CPU and OS specification, top processes
  – Dynamic LPAR changes, AIX and Linux (on POWER hardware)
'nmon' instead of standard command-line tools like 'ps', 'iostat', etc.
Linux or AIX only.
Similar to 'top' or 'topas', only on steroids.
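For reference, the two basic ways to run it (interval and count are arbitrary; check your nmon version's help for the exact flags):

# interactive mode, like 'top' on steroids
nmon

# data-capture mode: one CSV snapshot every 60 seconds, 1440 times (= 24 hours)
nmon -f -s 60 -c 1440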
54. 'nmon' screenshot
──nmonv10r───e=diskESS─────────Host=db1prd3────────Refresh=2 secs───11:44.57──
┌─AdapterI/OStatistics───────────────────────────────────────────────────────┐
│Name %busy read write xfers Disks AdapterType │
│fscsi0 18.5 19547.9 26.0 KB/s 662.0 8 FC SCSI I/O Controller│
│fscsi1 16.5 20079.9 30.0 KB/s 673.0 8 FC SCSI I/O Controller│
│sisscsia0 1.0 4.0 0.0 KB/s 1.0 2 PCIX Ultra320 SCSI Ad│
│TOTALS 3 adapters 39631.7 56.0 KB/s 1336.0 18 TOTAL(MB/s)=38.8 │
└──────────────────────────────────────────────────────────────────────────────┘
┌─DiskBusyMap─Key(%): @=90 #=80 X=70 8=60 O=50 0=40 o=30 +=20 =10 .=5 _=0%───┐
│hdisks numbers> 1 2 3 4 │
│ 01234567890123456789012345678901234567890123456789 │
│hdisk0 to 49 ____oX_.__.______ │
└──────────────────────────────────────────────────────────────────────────────┘
┌─TopProcesses──Procs=267mode=3[1=Basic2=CPU3=Perf4=Size5=I/O]───────────┐
│ PID %CPU Size Res Res Res Char RAM Paging Command │
│ Used KB Set Text Data I/O Use io other repage │
│ 929812 30.6 8136 1076 16 1060 2047 0% 162 1242 0 db2sysc │
│ 1528038 11.1 1200 268 16 252 32015K 0% 342 144 0 db2sysc │
│ 1007782 2.5 12796 9688 16 9672 0 0% 19 27 0 db2sysc │
│ 1503320 1.1 1200 232 16 216 2311K 0% 83 12 0 db2sysc │
│ 503930 0.8 28404 25524 16 25508 0 1% 0 0 0 db2sysc │
│ 1577164 0.5 5480 5600 252 5348 1172 0% 0 0 0 nmon │
│ 729318 0.2 1936 2136 392 1744 525 0% 0 0 0 hats_nim │
└──────────────────────────────────────────────────────────────────────────────┘
The DiskBusyMap can be really helpful.
[It can help to find out why the database doesn't perform well under high load. It might be that, even though tablespaces are distributed across a lot of disks, most of the time only one disk gets used and thus becomes the bottleneck.]
55. 'nmon'
● Output
  – On screen (console, telnet, VNC, PuTTY or X Windows) using curses for low CPU impact, updated once every two seconds.
  – Save the data to a comma-separated file for analysis and longer-term data capture.
  – The nmon Analyser Excel 2000 spreadsheet loads the nmon output file and automatically creates dozens of graphs.
  – Filter this data, add it to an RRD database
● See
  – http://www.ibm.com/developerworks/aix/library/au-analyze_aix/index.html
  – http://www-941.ibm.com/collaboration/wiki/display/WikiPtype/nmon
'nmon' can collect all system information over a longer period and write it to a file.
Unfortunately, the output format is geared more towards dropping that data into a spreadsheet and analyzing it there.
That's what Perl is for!
56. Wrapping 'nmon'
my $LogDir   = "/var/log/nmon";
my $FifoFile = "$LogDir/nmonstats.fifo";
my $NmonCmd  = "/usr/local/bin/nmon -F $FifoFile -p -DET -I 0 -s 60 -c 1440";
mkfifo($FifoFile, 0600);
open(PIPE, "$NmonCmd|");                   # will print PID
open($readFH, $FifoFile)
    or die "Cannot open $FifoFile: $!\n";
$nmonPid = readline(*PIPE);                # nmon writes its PID in the first line
close(PIPE);
chomp $nmonPid;
$nmonPid = $1 if ($nmonPid =~ m/(\d+)/);   # untaint
my $selectFH = new IO::Select ($readFH);
A wrapper around 'nmon' that turns it into a useful data-gathering daemon.
Use a named pipe (aka FIFO) as the output file for 'nmon', so the output can be processed as it happens. Also useful for other programs.
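The read loop itself is not in the deck; a hedged sketch of consuming the FIFO with IO::Select might look like this (CPU_ALL is just one example of an nmon record type):

# read nmon's CSV output from the fifo as it is written
while (1) {
    foreach my $fh ($selectFH->can_read(60)) {      # wait up to 60s for new data
        defined(my $line = <$fh>) or exit 0;        # writer closed the fifo
        chomp $line;
        my ($type, @fields) = split /,/, $line;
        if ($type eq 'CPU_ALL') {                   # overall CPU statistics
            # filter/convert the fields here, then feed them to an RRD
        }
    }
}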
57. daemonize
sub go_background {
    use POSIX qw(setsid);
    my($pid, $sess_id, $i);
    ## Fork and exit parent
    if ($pid = fork) { exit 0; }
    ## Detach ourselves from the terminal
    die "Cannot detach from controlling terminal.\n"
        unless setsid();
    ## Prevent possibility of acquiring a controlling terminal
    $SIG{'HUP'} = 'IGNORE';
    if ($pid = fork) { exit 0; }
    ## Reopen stdin/out/err to /dev/null
    open(STDIN,  "+>/dev/null");
    open(STDOUT, "+>/dev/null");
    open(STDERR, "+>/dev/null");
}
Run the script as a daemon in the background.
Standard double-fork, exiting in between, so we end up with init as the parent.
Detach from the controlling terminal with setsid().
Redirect stdin/out/err to /dev/null.
But now all diagnostics must go to a logfile.
58. Writing to logfiles
# set new handlers
$SIG{__WARN__} = sub { write_log($_[0]); };
$SIG{__DIE__}  = sub { write_log($_[0]); CORE::die(@_); };
# all warnings and errors go to the log file
sub write_log {
    my $msg = shift;
    use POSIX qw(strftime);
    my $ts = "[" . strftime('%Y%m%d_%H:%M:%S', localtime) . "] ";
    $msg =~ s{^}{$ts}gm;                   # insert at beginning of each line
    $msg .= "\n" unless $msg =~ m/\n\Z/;   # make sure it ends in a newline
    open(my $log, ">>$LogFile");
    print $log $msg;
    close $log;
}
Install handlers for warn() and die().
Open and close the logfile for each log message.
This makes logfile rotation really easy (just rename).
But it shouldn't be used for high-volume logging.
59. Safely using 'ssh'
● .ssh/authorized_keys
● command="/path/to/script"
● no-pty
● no-port-forwarding
● no-X11-forwarding
● no-agent-forwarding
● from="10.1.1.1"
command="/path/to/script",no-pty,no-port-forwarding ssh-dss AAAA...== comment
An 'authorized_keys' file that allows connections from remote sites based on public keys. Generate a key pair on the monitoring machine and paste the public key into 'authorized_keys' on the monitored machine. Then 'ssh' and 'scp' can be used without having to specify the password for that user.
Also specify which program or script to run when a certain key is used to log in. This allows a remote user to execute certain critical, privileged commands without giving him full shell access.
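Putting it together: generate the key pair on the monitoring host and add one (single-line) entry on the monitored host. The key type, forced command path and IP address are placeholders:

# on the monitoring host: a dedicated key pair, no passphrase (unattended use)
ssh-keygen -t rsa -f ~/.ssh/monitoring_key -N ''

# on the monitored host, one line in ~/.ssh/authorized_keys:
command="/usr/local/bin/monitor-agent",from="10.1.1.1",no-pty,no-port-forwarding,no-X11-forwarding,no-agent-forwarding ssh-rsa AAAA...== monitoring@monhost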
60. Connecting 'ssh' to multiple sites
foreach my $host (@Hosts) {
    if (not exists $HostInfo{$host}{pid}
        and $now - $HostLastActive{$host} > $ReconnectDelay) {
        $HostLastActive{$host} = time();
        my ($rfh, $wfh, $efh) = (gensym, gensym, gensym);
        my $pid;
        eval {   # open new connection
            $pid = open3($wfh, $rfh, $efh,
                         qw(/usr/bin/ssh -T -o BatchMode=yes -l user),
                         $host, "perl");
        };
        if ($@) {
            warn "Cannot ssh to $host: $@\n";
            if ($pid) {   # clean up child
                kill TERM => $pid
                    and waitpid $pid, 0;
            }
            next;   # retry later
        }
Kick off several ssh connections at once, using open3().
Spawn a perl interpreter on the remote side.
Catch errors by handling EOF and by timeout, as we know that the logfiles being tailed are supposed to grow every minute or so.
We cannot use the Net::SSH* modules because we cannot select() on them (no fileno()).
63. Pushing perl code to remote sites
● From within a perl script
● Start a perl interpreter via 'ssh'
● Feed it the script via stdin
● Finish with __END__
● Script will start to run
● Script can use stdin/out to communicate
Don't install the perl scripts on the remote machines; simply push the perl code down to the interpreter via stdin. A minimal sketch follows below.
Use File::Tail instead of 'tail -f'.
Push down checking code for use with NAGIOS? Performance penalty. Maybe use a persistent collecting daemon?
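A minimal sketch of such a push, reusing open3() as on the previous slides (the host name 'remotehost' and the toy payload are invented):

use strict;
use warnings;
use IPC::Open3;
use Symbol qw(gensym);

my ($wfh, $rfh, $efh) = (gensym, gensym, gensym);
my $pid = open3($wfh, $rfh, $efh,
                qw(/usr/bin/ssh -T -o BatchMode=yes remotehost perl));

print $wfh <<'__REMOTE__';   # feed the remote perl its program on stdin ...
use Sys::Hostname;
print "monitoring agent running on ", hostname(), "\n";
__REMOTE__
print $wfh "__END__\n";      # ... and finish with __END__ so it starts running
close $wfh;

print while <$rfh>;          # whatever the remote script prints comes back here
waitpid $pid, 0;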
64. Pushing modules
use File::Tail;                        # just to get it into @INC
my $script;
foreach my $m (qw(File/Tail.pm)) {     # loop over needed modules
    my $fn;
    foreach my $prefix (@INC) {        # find them in @INC
        next if (ref($prefix));        # skip coderefs
        $fn = "$prefix/$m";
        last if (-f $fn);              # found it?
        undef $fn;
    }
    die "Cannot find $m\n" if not $fn;
Push modules too!
@INC contains the search paths for modules, so search it for the source file.
65. Pushing modules (2)
    # read module file and append it to $script
    open(my $fh, $fn)
        or die "Cannot open $fn: $!\n";
    $script .= "BEGIN { eval <<'__END_OF_MODULE__'\n";
    local $/ = undef;                  # slurp mode
    $script .= <$fh>;
    $script .= "\n__END_OF_MODULE__\n}\n";
    close $fh;
}
# now add the monitoring script
$script .= <<'__EOT__';
...
__EOT__
print PIPE_TO_REMOTE_PERL $script;
# for binary modules: PAR, the Perl Archive toolkit,
# allows packaging modules into one file that
# can be eval'd by perl.
Slurp in the module source and wrap it in an eval so it gets its own namespace.
Push it down along with the check script.
Only works with pure-perl modules. For modules with binary libraries one could theoretically use PAR, the Perl Archive toolkit, which allows you to put all modules into one file that can be evaluated.
Need to pack a perl application into one executable? Remember PAR and pp, the Perl Packager.
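For the packaging route, the usual pp invocation is a one-liner (the script name is made up):

# bundle the script plus all modules it uses into one self-contained executable
pp -o monitor_agent monitor_agent.pl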
66. RRDtool
● RRD = Round-Robin Database (think ring buffer)
● RRDtool by Tobias Oetiker
  http://www.rrdtool.org/
● Data logging and graphing system.
● Used by BigSister, Cacti, NetMRG, webminstats
● Perl bindings available:
  use RRDs;
Store the collected data in RRDs created by RRDtool.
RRD stands for Round-Robin Database, meaning that you simply drop your data points in there one at a time and the database keeps a certain time span (for example the last 3 months) in its ring buffer.
RRDtool can create graphs from the data in several ways and has a powerful data-manipulation engine built in.
RRDtool only plots time-series data and cannot be used to generate arbitrary graphs like pie charts or x-y scatter plots.
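Creating the RRD itself is not shown in the deck; a hedged sketch with RRDs::create, using the data-source names from the graphing example below (step size, heartbeat and retention are arbitrary):

use RRDs;

RRDs::create("$RRDPathPrefix/$host.rrd",
             '--step', 60,              # expect one update per minute
             'DS:html:GAUGE:120:0:U',   # name:type:heartbeat:min:max
             'DS:xml:GAUGE:120:0:U',
             'DS:was:GAUGE:120:0:U',
             'DS:soap:GAUGE:120:0:U',
             'RRA:MAX:0.5:1:43200',     # keep 1-minute maxima for ~30 days
);
my $err = RRDs::error;
warn $err if $err;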
67. Storing data into RRDs
use RRDs;
my @update = ("$RRDPathPrefix/$host.rrd",
              "--template" => join(":", @Datatags),
              join(":", @data{@Datatags}),
             );
RRDs::update(@update);
my $res = RRDs::error;
warn $res if $res;
It is easy to store data into an RRD database.
--template tells RRDtool the names and order of the data points (similar to the header line of a CSV).
Just supply the data points in that order.
68. Creating graphs with RRDtool
# files are named /rrdpath/hostname.rrd
rrdtool graph drawing.png -u 150 --rigid -v "requests/min" \
    -t "Requests $host" --start -86400 --end now \
    --height 600 --width 1000 \
    DEF:html=${host}.rrd:html:MAX \
    DEF:xml=${host}.rrd:xml:MAX \
    DEF:was=${host}.rrd:was:MAX \
    DEF:soap=${host}.rrd:soap:MAX \
    AREA:was#9F9F9F:"WAS Requests" \
    LINE2:soap#F8B932:"SOAP Requests" \
    LINE2:html#0000FF:"HTML Requests" \
    LINE2:xml#00FF00:"XML Requests" \
    STACK:html#FFFF00:"XML+HTML Requests" \
    STACK:soap#000000:"SOAP+XML+HTML Requests"
To create graphs from an RRD database, use the command-line rrdtool from cron jobs.
Use the Perl bindings for CGI scripts.
Specify the filename and the size of the graph, the time span to plot, and which data points are to be drawn and how.
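From a CGI script, the same definition can be handed to the Perl bindings; a sketch mirroring the command line above (only one DEF/LINE pair shown):

use RRDs;

RRDs::graph('drawing.png',
            '--start', '-86400', '--end', 'now',
            '--height', 600, '--width', 1000,
            '-u', 150, '--rigid', '-v', 'requests/min',
            '-t', "Requests $host",
            "DEF:html=$host.rrd:html:MAX",
            "LINE2:html#0000FF:HTML Requests");
my $err = RRDs::error;
warn $err if $err;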
70. Creating graphs with 'drraw'
● A better, point-and-click way: use 'drraw'
● http://web.taranis.org/drraw/
● Installs as a CGI app
● Define graphs on-the-fly (somewhat WYSIWYG)
● Turn graphs into templates that can be applied to selected .rrd files
● Combine multiple templates into dashboards, letting you view all your data at once.
'drraw' has a more point-and-click or WYSIWYG approach. It's a CGI application that needs only minimal configuration: just tell it where to find your .rrd files.
Select which .rrd files to use and 'drraw' will show all data channels that are in the RRD. Assign colors, disable the unwanted data channels, set graph dimensions, etc.
Store a finished graph as a template and use it with other RRDs that have the same structure.
Combine templates to form a dashboard that shows all available data at once.
72. Have fun!
● Support Open Source.
● Be nice to your fellow people.
● Pay it forward:
  For each favor you receive, do three strangers a favor.
  http://www.payitforwardmovement.org/
● And have a great day!
Useful tools: NAGIOS, nmon, ssh, sudo, RRDtool and drraw.
Perl modules: File::Tail, Text::CSV_XS and PAR.