[Slony1-general] recommend monitoring script to use with Nagios?

Tue Jan 23 15:28:04 PST 2007

Mark Stosberg <mark at summersault.com> writes:
> I'm wondering what people are using to monitor Slony with Nagios. The
> docs reference:
>
> psql_replication_check.pl
>
> ... but that uses the very ancient "Pg.pm" and not DBI/DBD::Pg.
>
> I also see "test_slony_state-dbi.pl", which uses DBI, but is not
> referenced as a Nagios plugin.
>
> I'm also curious if anyone is using Nagios' event handling as part of
> failover-- using it to notice that a slave is down, and then make a
> pro-active change to handle failover.
>
> I'm evaluating various options to handle failover. DBIx::HA seems like
> a reasonable solution for Perl, but I'm still exploring other options
> at this point.

What we're using actively, in practice, is twofold:

1.  We have a Nagios check that looks at a view that we call
"replication_status."

And that is indeed based on psql_replication_check.pl...

The view has three fields:

some-tld at state:5432=# select * from replication_status ;
 object_name  |       created_on       |  age
--------------+------------------------+-------
 mydomain.tld | 2006-12-31 23:59:52+00 | 33075
(1 row)

[I ran it against a copy of a database as of 12/31, so the check found
that replication was behind by rather a lot!]

Behind the scenes, the view selects the most recently updated object
on a transaction table, which is a table we expect to see updated very
frequently.
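
To make the shape of such a view concrete, here is a rough, self-contained
sketch. It is not our actual view: it uses SQLite (via Python) in place of
PostgreSQL purely so that it runs without a server, and the table name
"transactions", its columns, and the choice of whole minutes for "age" are
all assumptions made for illustration.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database holding a hypothetical transaction
    table and a replication_status view defined over it."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical stand-in for the frequently updated transaction table.
    conn.execute(
        "CREATE TABLE transactions (object_name TEXT, created_on TEXT)"
    )
    # The view exposes only the newest row, plus how old that row is.
    # Assumptions: age is in whole minutes, created_on is stored as UTC.
    conn.execute(
        """
        CREATE VIEW replication_status AS
        SELECT object_name,
               created_on,
               CAST((julianday('now') - julianday(created_on)) * 1440
                    AS INTEGER) AS age
        FROM transactions
        ORDER BY created_on DESC
        LIMIT 1
        """
    )
    return conn


if __name__ == "__main__":
    conn = build_demo_db()
    conn.executemany(
        "INSERT INTO transactions VALUES (?, ?)",
        [
            ("older.tld", "2006-12-30 10:00:00"),
            ("mydomain.tld", "2006-12-31 23:59:52"),
        ],
    )
    # Prints the single newest row together with its computed age.
    print(conn.execute("SELECT * FROM replication_status").fetchone())
```

On a subscriber node, a large age from a view like this means the newest
replicated row is old, which is what the Nagios check keys on.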

Having the 3 pieces of information is kind of nice:
 - The age is a good basis for raising an alarm;
 - The "created_on" column may well identify when something broke;
 - The "object_name" gives the folks doing monitoring some information
   as to where things have stopped.  

  If people watching Nagios see that the object keeps changing, then
  they know replication hasn't actually ceased to work, which means
  they might not page me at 3am after getting hit by some load that is
  causing a node that's replicating across a somewhat slow WAN
  connection to fall behind a bit :-).

On your system, this would have to point to some sort of "object of
interest" that you expect to see frequently updated.
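
For anyone who would rather not depend on Pg.pm, the decision logic of such
a check is small enough to sketch in a few lines. The following is only an
illustration in Python, not the psql_replication_check.pl code: the function
names and the thresholds are hypothetical, and fetching the row from the
view is left out. The exit codes (0, 1, 2, 3) are the standard Nagios plugin
return codes.

```python
import sys

# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def classify(age, warn, crit):
    """Map the view's age column onto a Nagios state.

    age, warn and crit must share whatever unit the view reports.
    A missing age (no row came back) is reported as UNKNOWN.
    """
    if age is None:
        return UNKNOWN
    if age >= crit:
        return CRITICAL
    if age >= warn:
        return WARNING
    return OK


def status_line(object_name, created_on, age, warn, crit):
    """Return (exit_code, message) including all three view columns,
    so whoever is watching can see whether the object keeps changing."""
    code = classify(age, warn, crit)
    message = "REPLICATION %s: newest object %s created %s, age %s" % (
        LABELS[code], object_name, created_on, age)
    return code, message


if __name__ == "__main__":
    # Hypothetical thresholds; the row is the sample shown earlier.
    code, message = status_line(
        "mydomain.tld", "2006-12-31 23:59:52+00", 33075, warn=60, crit=120)
    print(message)
    sys.exit(code)
```

Putting object_name and created_on into the status text is what lets the
people watching Nagios tell "behind but moving" apart from "stopped".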

Yeah, it's using Pg.  It hasn't stopped working :-).

2.  We've got an MRTG data collector that looks at the sl_status view,
collecting data on how far each node is behind.

In the long run, that's likely to be more useful, particularly since
you can do trend analysis and such like...
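
As a rough idea of what a collector like this has to produce: MRTG runs an
external program and reads four lines from it (two integer values, an uptime
string, and a target name). The sketch below is not our collector; it only
formats values that are assumed to have been read from sl_status already.
The cluster schema name "_mycluster" and the function name are hypothetical,
while st_received, st_lag_num_events and st_lag_time are columns of the
sl_status view.

```python
# Query an actual collector might run against the origin node.
# "_mycluster" is a hypothetical cluster schema name.
SL_STATUS_QUERY = """
SELECT st_received,
       st_lag_num_events,
       EXTRACT(EPOCH FROM st_lag_time) AS lag_seconds
FROM "_mycluster".sl_status
"""


def mrtg_output(lag_events, lag_seconds, name, uptime=""):
    """Format one node's lag as the four lines MRTG reads from an
    external program: value 1, value 2, uptime string, target name."""
    lines = [
        str(int(lag_events)),   # first variable: events behind
        str(int(lag_seconds)),  # second variable: seconds behind (truncated)
        uptime,                 # uptime string; left empty in this sketch
        name,                   # name of the target being graphed
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical values for a subscriber that is slightly behind.
    print(mrtg_output(12, 34.7, "node2"))
```

Graphing events and seconds side by side makes it easier to tell a node that
is busy but keeping up from one that has stalled.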
-- 
(reverse (concatenate 'string "ofni.sailifa.ac" "@" "enworbbc"))
<http://cbbrowne.com/info/monitoring.html>
Christopher Browne
(416) 673-4124 (land)