38. System Monitoring
● What to monitor
● How to monitor
● Which kind of data to collect
● When to use the collected data
39. What to monitor
● Applications / Daemons
● that provide Services
● on one or more Ports
● running on one or more Machines.
Some basic definitions: we are interested in monitoring applications, that is, programs that provide services on one or more ports, running on one or more machines. Of course, these principles apply to just about any program or daemon.
40. Remote vs. Local Checks
● Remote checks
  – Port availability
  – Network
  – Dummy requests / Status servlet
● Local checks
  – Required processes running?
  – Networks available?
  – Error messages in logfiles?
  – Application throughput?
  – System parameters OK?
Check the application logfiles for unusual messages. Check the logfiles for throughput. This is often the best and easiest check: let your users be your testers.
Count the user-generated requests and how many of them were OK. Then calculate a request rate and plot it. (I will show you later how to do that.) Such a graph can be a tremendous help. Imagine getting a call from the helpdesk that a user has reported that your application isn't working. (Don't we all love such specific reports...) If you have such a graph, a single glance is enough to tell the helpdesk: "Well, in the past hour, 3000 requests were processed and only 5 showed an error, so from my point of view, everything is running fine. Walk him through his configuration again..."
41. When to analyze
● Instant notification
  – Daily Operations
  – Boundaries/Limits checking
● Historical data
  – Problem determination and analysis
  – Statistics / Graphs
For daily operations, you want to be notified as soon as possible if any problems turn up or if parameters overstep their boundaries.
For problem analysis, you want all that data to be available at a later point in time in a way that is well-suited for searching.
42. How to collect
● Dedicated monitoring host
  – collects data from application machines
  – global data loss in case of network/monitoring failure
● Store data locally on each host
  – and backup/archive that data!
Beware: the monitoring host is a single point of failure.
If there is a network problem, the monitoring software crashes, or the monitoring host has a hardware failure, ALL data is gone or doesn't get collected.
43. Why History?
● For problem determination and analysis
● Statistics / Graphs
  – Show bottlenecks
  – Prove upgrade needs
  – Managers love it
Store the data locally on the machines, and archive those logs.
Use the backup/archiving mechanisms already in place for your application and webserver logs.
44. Ready-Made vs. Roll-Your-Own
– Deploying existing tools is fast
– Gives you generic checks
– Can be customized over time
– Does it store monitoring data locally?
– Recommendation: roll-your-own system monitoring
45. Using NAGIOS
● http://www.nagios.org/
● Network and Port Checking
● Needs plugins for remote checking
● Use 'ssh' or NRPE
● Think about security: use a non-privileged user. 'sudo' is your friend.
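The NAGIOS configuration itself is not shown in the deck; a minimal sketch of a service that runs a remote check over 'ssh' could look roughly like this (the host name 'app1', the user 'monitor' and the plugin paths are placeholders):

define command {
    command_name  check_remote_by_ssh
    command_line  $USER1$/check_by_ssh -H $HOSTADDRESS$ -l monitor -C "$ARG1$"
}

define service {
    use                  generic-service
    host_name            app1
    service_description  App port check
    check_command        check_remote_by_ssh!/usr/local/nagios/libexec/check_tcp -H localhost -p 8080
}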
46. Data Format
● Should be simple, human-readable, parseable, flexible, grepable.
● YAML? Not grepable.
● CSV? Not flexible.
● Enhance CSV with in-field header info:
  ts=20070828_11:50:00,task=YAPC::2007,status=attending
47. Data Format (parsing)
● Easily parseable:
  @d = split(',', $_); # don't, use Text::CSV_XS
  %h = map {(split('=', $_, 2))} @d;
● use Text::CSV_XS
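A minimal sketch of the Text::CSV_XS variant (the field names ts/task/status come from the example record above; $fh is assumed to be an already-opened logfile handle):

use Text::CSV_XS;

my $csv = Text::CSV_XS->new({ binary => 1 });   # handles quoting and embedded commas

while (my $line = <$fh>) {
    chomp $line;
    $csv->parse($line) or next;                 # skip lines that don't parse
    my %h = map { split('=', $_, 2) } $csv->fields();
    print "$h{ts} $h{task} $h{status}\n";
}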
48. Checking Logfiles
● How to count “matching loglines per minute”?
  # in shell
  tail -f access_log | grep " 200 " | wc -l
  # doesn't work :(
● Use File::Tail (available from CPAN)
49. Using File::Tail
%patterns = (
'/var/logs/httpd/access_log' => [
{
regexp => qr/ 200 /,
outfile => '/var/logs/monitoring/access_ok.last',
history => '/var/logs/monitoring/access_ok.log',
count => 0,
},
{
regexp => qr/ 504 /,
outfile => '/var/logs/monitoring/access_gw_errors.last',
history => '/var/logs/monitoring/access_gw_errors.log',
count => 0,
},
],
);
For each logfile, count several matching regexps and store
the counts into two files, outfile and history.
50. Using File::Tail (2)
my @tails = map { File::Tail->new(name        => $_,
                                  maxinterval => 10,
                                  resetafter  => 120,
                                  reset_tail  => 0,
                                 );
                } keys %patterns;
while (1) {
    my ($nfound, $timeleft, @pending) =
        File::Tail::select(undef, undef, undef, undef, @tails);
    foreach my $f (@pending) {
        my $line    = $f->read();
        my $logfile = $f->{input};
        foreach my $p (@{$patterns{$logfile}}) {
            $p->{count}++
                if ($line =~ m/$p->{regexp}/);
        }
    }
}
In the script, first generate all those File::Tail objects, each representing one logfile.
Wait for new lines to arrive via File::Tail's select().
Process the list of objects that have new lines.
For each line, run all the regexps for that logfile against it and increment the counter on a match.
Periodically save the counts to files and reset the counters.
51. Using File::Tail (3)
$SIG{ALRM} = sub {
    my $time = strftime("%Y%m%d_%H:%M:%S", localtime());
    foreach my $p (map {@$_} values %patterns) {      # expand list
        $p->{count} ||= 0;                            # be sure there is a num there
        open (STATFILE, ">$p->{outfile}.$$")          # write to a temp file
            or die "$0: ERROR opening stat file $p->{outfile}: $!\n";
        print STATFILE "$p->{count}\n";
        close STATFILE;
        rename "$p->{outfile}.$$" => $p->{outfile};   # atomic update
        open (STATFILE, ">>$p->{history}")            # append to history
            or die "$0: ERROR opening history file $p->{history}: $!\n";
        print STATFILE "ts=$time,cnt=$p->{count}\n";
        close STATFILE;
        $p->{count} = 0;                              # reset counter
    }
    alarm 60;                                         # re-arm alarm
};
alarm 60;   # upon alarm, write statistic files and reset counters
Use an alarm() handler.
Alternatively, keep track of the time by specifying a timeout to select() and periodically checking the elapsed time.
Go through all the regexps and counters and save the values to files.
One history file with time information in it that can be used for plotting later on, and one file with just the value, which can be used for the instant checks by NAGIOS.
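Such a '.last' file can then be read by a tiny NAGIOS plugin. A hedged sketch (the thresholds are invented; the file path is the outfile from the %patterns example; exit codes follow the usual plugin convention):

#!/usr/bin/perl
# check_access_ok -- alert if the per-minute request count drops too low
use strict;
use warnings;

my $file = '/var/logs/monitoring/access_ok.last';
my ($warn, $crit) = (100, 10);   # requests per minute, made-up limits

open(my $fh, '<', $file)
    or do { print "UNKNOWN: cannot read $file: $!\n"; exit 3 };
my $count = <$fh>;
close $fh;
chomp $count if defined $count;
$count = 0 unless defined $count and $count =~ /^\d+$/;

if    ($count <= $crit) { print "CRITICAL: only $count requests/min\n"; exit 2 }
elsif ($count <= $warn) { print "WARNING: only $count requests/min\n";  exit 1 }
else                    { print "OK: $count requests/min\n";            exit 0 }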
52. 'nmon'
● 'nmon' courtesy of IBM (unsupported)
● Performance Data
  – CPU utilization, memory use, kernel statistics and run queue
  – Disk I/O rates, transfers, and read/write ratios; filesystem size and free space; disk adapters; paging space and paging rates; user-defined disk groups
  – Network I/O rates, transfers, and read/write ratios
  – Machine details, CPU and OS specification, top processes
  – Dynamic LPAR changes, AIX and Linux (on POWER hardware)
'nmon' instead of standard command-line tools like 'ps', 'iostat', etc.
Linux or AIX only.
Similar to 'top' or 'topas', only on steroids.
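For reference, the two basic ways to run it (interval and count are arbitrary; check your nmon version's help for the exact flags):

# interactive mode, like 'top' on steroids
nmon

# data-capture mode: one CSV snapshot every 60 seconds, 1440 times (= 24 hours)
nmon -f -s 60 -c 1440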
54. 'nmon' screenshot
──nmonv10r───e=diskESS─────────Host=db1prd3────────Refresh=2 secs───11:44.57──
┌─AdapterI/OStatistics───────────────────────────────────────────────────────┐
│Name %busy read write xfers Disks AdapterType │
│fscsi0 18.5 19547.9 26.0 KB/s 662.0 8 FC SCSI I/O Controller│
│fscsi1 16.5 20079.9 30.0 KB/s 673.0 8 FC SCSI I/O Controller│
│sisscsia0 1.0 4.0 0.0 KB/s 1.0 2 PCIX Ultra320 SCSI Ad│
│TOTALS 3 adapters 39631.7 56.0 KB/s 1336.0 18 TOTAL(MB/s)=38.8 │
└──────────────────────────────────────────────────────────────────────────────┘
┌─DiskBusyMap─Key(%): @=90 #=80 X=70 8=60 O=50 0=40 o=30 +=20 =10 .=5 _=0%───┐
│hdisks numbers> 1 2 3 4 │
│ 01234567890123456789012345678901234567890123456789 │
│hdisk0 to 49 ____oX_.__.______ │
└──────────────────────────────────────────────────────────────────────────────┘
┌─TopProcesses──Procs=267mode=3[1=Basic2=CPU3=Perf4=Size5=I/O]───────────┐
│ PID %CPU Size Res Res Res Char RAM Paging Command │
│ Used KB Set Text Data I/O Use io other repage │
│ 929812 30.6 8136 1076 16 1060 2047 0% 162 1242 0 db2sysc │
│ 1528038 11.1 1200 268 16 252 32015K 0% 342 144 0 db2sysc │
│ 1007782 2.5 12796 9688 16 9672 0 0% 19 27 0 db2sysc │
│ 1503320 1.1 1200 232 16 216 2311K 0% 83 12 0 db2sysc │
│ 503930 0.8 28404 25524 16 25508 0 1% 0 0 0 db2sysc │
│ 1577164 0.5 5480 5600 252 5348 1172 0% 0 0 0 nmon │
│ 729318 0.2 1936 2136 392 1744 525 0% 0 0 0 hats_nim │
└──────────────────────────────────────────────────────────────────────────────┘
The DiskBusyMap can be really helpful.
[It can help to find out why the database doesn't perform well under high load. It might be that, even though tablespaces are distributed across a lot of disks, most of the time only one disk gets used and thus becomes the bottleneck.]
55. 'nmon'
● Output
  – On screen (console, telnet, VNC, PuTTY or X Windows) using curses for low CPU impact, updated once every two seconds.
  – Save the data to a comma-separated file for analysis and longer-term data capture.
  – The nmon Analyser Excel 2000 spreadsheet loads the nmon output file and automatically creates dozens of graphs.
  – Filter this data, add it to an RRD database
● See
  – http://www.ibm.com/developerworks/aix/library/au-analyze_aix/index.html
  – http://www-941.ibm.com/collaboration/wiki/display/WikiPtype/nmon
'nmon' can collect all system information over a longer period and write it to a file.
Unfortunately, the output format is geared more towards dropping that data into a spreadsheet and analyzing it there.
That's what Perl is for!
56. Wrapping 'nmon'
my $LogDir   = "/var/log/nmon";
my $FifoFile = "$LogDir/nmonstats.fifo";
my $NmonCmd  = "/usr/local/bin/nmon -F $FifoFile -p -DET -I 0 -s 60 -c 1440";
mkfifo($FifoFile, 0600);
open(PIPE, "$NmonCmd|");                   # will print PID
open($readFH, $FifoFile)
    or die "Cannot open $FifoFile: $!\n";
$nmonPid = readline(*PIPE);                # nmon writes its PID in the first line
close(PIPE);
chomp $nmonPid;
$nmonPid = $1 if ($nmonPid =~ m/(\d+)/);   # untaint
my $selectFH = new IO::Select ($readFH);
A wrapper around 'nmon' that turns it into a useful data-gathering daemon.
Use a named pipe (aka FIFO) as the output file for 'nmon', so the output can be processed as it happens. Also useful for other programs.
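The read loop itself is not in the deck; a hedged sketch of consuming the FIFO with IO::Select might look like this (CPU_ALL is just one example of an nmon record type):

# read nmon's CSV output from the fifo as it is written
while (1) {
    foreach my $fh ($selectFH->can_read(60)) {      # wait up to 60s for new data
        defined(my $line = <$fh>) or exit 0;        # writer closed the fifo
        chomp $line;
        my ($type, @fields) = split /,/, $line;
        if ($type eq 'CPU_ALL') {                   # overall CPU statistics
            # filter/convert the fields here, then feed them to an RRD
        }
    }
}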
57. daemonize
sub go_background {
    use POSIX qw(setsid);
    my($pid, $sess_id, $i);
    ## Fork and exit parent
    if ($pid = fork) { exit 0; }
    ## Detach ourselves from the terminal
    die "Cannot detach from controlling terminal.\n"
        unless setsid();
    ## Prevent possibility of acquiring a controlling terminal
    $SIG{'HUP'} = 'IGNORE';
    if ($pid = fork) { exit 0; }
    ## Reopen stdin/out/err to /dev/null
    open(STDIN,  "+>/dev/null");
    open(STDOUT, "+>/dev/null");
    open(STDERR, "+>/dev/null");
}
Run the script as a daemon in the background.
Standard double-fork, exiting in between, so we end up with init as the parent.
Detach from the controlling terminal with setsid().
Redirect stdin/out/err to /dev/null.
But now all diagnostics must go to a logfile.
58. Writing to logfiles
# set new handlers
$SIG{__WARN__} = sub { write_log($_[0]); };
$SIG{__DIE__}  = sub { write_log($_[0]); CORE::die(@_); };
# all warnings and errors go to the log file
sub write_log {
    my $msg = shift;
    use POSIX qw(strftime);
    my $ts = "[" . strftime('%Y%m%d_%H:%M:%S', localtime) . "] ";
    $msg =~ s{^}{$ts}gm;                   # insert at beginning of each line
    $msg .= "\n" unless $msg =~ m/\n\Z/;   # make sure it ends in a newline
    open(my $log, ">>$LogFile");
    print $log $msg;
    close $log;
}
Install handlers for warn() and die().
Open and close the logfile for each log message.
This makes logfile rotation really easy (just rename).
But it shouldn't be used for high-volume logging.
59. Safely using 'ssh'
● .ssh/authorized_keys
● command="/path/to/script"
● no-pty
● no-port-forwarding
● no-X11-forwarding
● no-agent-forwarding
● from="10.1.1.1"
command="/path/to/script",no-pty,no-port-forwarding ssh-dss AAAA...== comment
An 'authorized_keys' file that allows connections from remote sites based on public keys. Generate a key pair on the monitoring machine and paste the public key into 'authorized_keys' on the monitored machine. Then 'ssh' and 'scp' can be used without having to specify the password for that user.
Also specify which program or script to run when a certain key is used to log in. This allows a remote user to execute certain critical, privileged commands without giving him full shell access.
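Putting it together: generate the key pair on the monitoring host and add one (single-line) entry on the monitored host. The key type, forced command path and IP address are placeholders:

# on the monitoring host: a dedicated key pair, no passphrase (unattended use)
ssh-keygen -t rsa -f ~/.ssh/monitoring_key -N ''

# on the monitored host, one line in ~/.ssh/authorized_keys:
command="/usr/local/bin/monitor-agent",from="10.1.1.1",no-pty,no-port-forwarding,no-X11-forwarding,no-agent-forwarding ssh-rsa AAAA...== monitoring@monhost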
60. Connecting 'ssh' to multiple sites
foreach my $host (@Hosts) {
    if (not exists $HostInfo{$host}{pid}
        and $now - $HostLastActive{$host} > $ReconnectDelay) {
        $HostLastActive{$host} = time();
        my ($rfh, $wfh, $efh) = (gensym, gensym, gensym);
        my $pid;
        eval {   # open new connection
            $pid = open3($wfh, $rfh, $efh,
                         qw(/usr/bin/ssh -T -o BatchMode=yes -l user),
                         $host, "perl");
        };
        if ($@) {
            warn "Cannot ssh to $host: $@\n";
            if ($pid) {   # clean up child
                kill TERM => $pid
                    and waitpid $pid, 0;
            }
            next;   # retry later
        }
Kick off several ssh connections at once, using open3().
Spawn a perl interpreter on the remote side.
Catch errors by handling EOF and by timeout, as we know that the logfiles being tailed are supposed to grow every minute or so.
We cannot use the Net::SSH* modules because we cannot select() on them (no fileno()).
63. Pushing perl code to remote sites
● From within a perl script
● Start a perl interpreter via 'ssh'
● Feed it the script via stdin
● Finish with __END__
● Script will start to run
● Script can use stdin/out to communicate
Don't install the perl scripts on the remote machines; simply push the perl code down to the interpreter via stdin. A minimal sketch follows below.
Use File::Tail instead of 'tail -f'.
Push down checking code for use with NAGIOS? Performance penalty. Maybe use a persistent collecting daemon?
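A minimal sketch of such a push, reusing open3() as on the previous slides (the host name 'remotehost' and the toy payload are invented):

use strict;
use warnings;
use IPC::Open3;
use Symbol qw(gensym);

my ($wfh, $rfh, $efh) = (gensym, gensym, gensym);
my $pid = open3($wfh, $rfh, $efh,
                qw(/usr/bin/ssh -T -o BatchMode=yes remotehost perl));

print $wfh <<'__REMOTE__';   # feed the remote perl its program on stdin ...
use Sys::Hostname;
print "monitoring agent running on ", hostname(), "\n";
__REMOTE__
print $wfh "__END__\n";      # ... and finish with __END__ so it starts running
close $wfh;

print while <$rfh>;          # whatever the remote script prints comes back here
waitpid $pid, 0;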
64. Pushing modules
use File::Tail;                        # just to get it into @INC
my $script;
foreach my $m (qw(File/Tail.pm)) {     # loop over needed modules
    my $fn;
    foreach my $prefix (@INC) {        # find them in @INC
        next if (ref($prefix));        # skip coderefs
        $fn = "$prefix/$m";
        last if (-f $fn);              # found it?
        undef $fn;
    }
    die "Cannot find $m\n" if not $fn;
Push modules too!
@INC contains the search paths for modules, so search it for the source file.
65. Pushing modules (2)
    # read module file and append it to $script
    open(my $fh, $fn)
        or die "Cannot open $fn: $!\n";
    $script .= "BEGIN { eval <<'__END_OF_MODULE__'\n";
    local $/ = undef;                  # slurp mode
    $script .= <$fh>;
    $script .= "\n__END_OF_MODULE__\n}\n";
    close $fh;
}
# now add the monitoring script
$script .= <<'__EOT__';
...
__EOT__
print PIPE_TO_REMOTE_PERL $script;
# for binary modules: PAR, the Perl Archive toolkit,
# allows packaging modules into one file that
# can be eval'd by perl.
Slurp in the module source and wrap it in an eval so it gets its own namespace.
Push it down along with the check script.
Only works with pure-perl modules. For modules with binary libraries one could theoretically use PAR, the Perl Archive toolkit, which allows you to put all modules into one file that can be evaluated.
Need to pack a perl application into one executable? Remember PAR and pp, the Perl Packager.
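For the packaging route, the usual pp invocation is a one-liner (the script name is made up):

# bundle the script plus all modules it uses into one self-contained executable
pp -o monitor_agent monitor_agent.pl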
66. RRDtool
● RRD = Round-Robin Database (think ring buffer)
● RRDtool by Tobias Oetiker
  http://www.rrdtool.org/
● Data logging and graphing system.
● Used by BigSister, Cacti, NetMRG, webminstats
● Perl bindings available:
  use RRDs;
Store the collected data in RRDs created by RRDtool.
RRD stands for Round-Robin Database, meaning that you simply drop your data points in there one at a time and the database keeps a certain time span (for example the last 3 months) in its ring buffer.
RRDtool can create graphs from the data in several ways and has a powerful data-manipulation engine built in.
RRDtool only plots time-series data and cannot be used to generate arbitrary graphs like pie charts or x-y scatter plots.
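Creating the RRD itself is not shown in the deck; a hedged sketch with RRDs::create, using the data-source names from the graphing example below (step size, heartbeat and retention are arbitrary):

use RRDs;

RRDs::create("$RRDPathPrefix/$host.rrd",
             '--step', 60,              # expect one update per minute
             'DS:html:GAUGE:120:0:U',   # name:type:heartbeat:min:max
             'DS:xml:GAUGE:120:0:U',
             'DS:was:GAUGE:120:0:U',
             'DS:soap:GAUGE:120:0:U',
             'RRA:MAX:0.5:1:43200',     # keep 1-minute maxima for ~30 days
);
my $err = RRDs::error;
warn $err if $err;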
67. Storing data into RRDs
use RRDs;
my @update = ("$RRDPathPrefix/$host.rrd",
              "--template" => join(":", @Datatags),
              join(":", @data{@Datatags}),
             );
RRDs::update(@update);
my $res = RRDs::error;
warn $res if $res;
It is easy to store data into an RRD database.
--template tells RRDtool the names and order of the data points (similar to the header line of a CSV).
Just supply the data points in that order.
68. Creating graphs with RRDtool
# files are named /rrdpath/hostname.rrd
rrdtool graph drawing.png -u 150 --rigid -v "requests/min" \
    -t "Requests $host" --start -86400 --end now \
    --height 600 --width 1000 \
    DEF:html=${host}.rrd:html:MAX \
    DEF:xml=${host}.rrd:xml:MAX \
    DEF:was=${host}.rrd:was:MAX \
    DEF:soap=${host}.rrd:soap:MAX \
    AREA:was#9F9F9F:"WAS Requests" \
    LINE2:soap#F8B932:"SOAP Requests" \
    LINE2:html#0000FF:"HTML Requests" \
    LINE2:xml#00FF00:"XML Requests" \
    STACK:html#FFFF00:"XML+HTML Requests" \
    STACK:soap#000000:"SOAP+XML+HTML Requests"
To create graphs from an RRD database, use the command-line rrdtool from cron jobs.
Use the Perl bindings for CGI scripts.
Specify the filename and the size of the graph, the time span to plot, and which data points are to be drawn and how.
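From a CGI script, the same definition can be handed to the Perl bindings; a sketch mirroring the command line above (only one DEF/LINE pair shown):

use RRDs;

RRDs::graph('drawing.png',
            '--start', '-86400', '--end', 'now',
            '--height', 600, '--width', 1000,
            '-u', 150, '--rigid', '-v', 'requests/min',
            '-t', "Requests $host",
            "DEF:html=$host.rrd:html:MAX",
            "LINE2:html#0000FF:HTML Requests");
my $err = RRDs::error;
warn $err if $err;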
70. Creating graphs with 'drraw'
● A better, point-and-click way: use 'drraw'
● http://web.taranis.org/drraw/
● Installs as a CGI app
● Define graphs on-the-fly (somewhat WYSIWYG)
● Turn graphs into templates that can be applied to selected .rrd files
● Combine multiple templates into dashboards, letting you view all your data at once.
'drraw' has a more point-and-click or WYSIWYG approach. It's a CGI application that needs only minimal configuration: just tell it where to find your .rrd files.
Select which .rrd files to use and 'drraw' will show all data channels that are in the RRD. Assign colors, disable the unwanted data channels, set graph dimensions, etc.
Store a finished graph as a template and use it with other RRDs that have the same structure.
Combine templates to form a dashboard that shows all available data at once.
72. Have fun!
● Support Open Source.
● Be nice to your fellow people.
● Pay it forward:
  For each favor you receive, do three strangers a favor.
  http://www.payitforwardmovement.org/
● And have a great day!
Useful tools: NAGIOS, nmon, ssh, sudo, RRDtool and drraw.
Perl modules: File::Tail, Text::CSV_XS and PAR.