In the tools directory, you will find two scripts, test_slony_state.pl and test_slony_state-dbi.pl (see Section 5.1.1). One uses the Perl/DBI interface; the other uses the Pg interface.
Both do essentially the same thing: they connect to a Slony-I node (you can pick any one) and, from it, determine all the nodes in the cluster. They then run a series of queries (read-only, so this should be quite safe to run) that examine various Slony-I tables, looking for conditions that suggest problems, including:
Bloating of tables like pg_listener, sl_log_1, sl_log_2, sl_seqlog
Analysis of Event propagation
Analysis of Event confirmation propagation
If communications are partially broken, replication may still happen while confirmations fail to get back, which prevents nodes from clearing out old events and old replication data.
Running this once an hour or once a day can help you detect symptoms of problems early, before they lead to performance degradation.
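As an illustration of the confirmation check described above, the following Python sketch flags node pairs whose most recent confirmation is older than a threshold. It is not how test_slony_state.pl is implemented: the cluster schema name "_testcluster", the function name, and the 30-minute threshold are assumptions made for the example, and the query is shown only as a string so that the example runs without a database.

```python
from datetime import datetime, timedelta

# Assumed cluster schema "_testcluster"; the query could be run on any node.
CONFIRM_QUERY = (
    'SELECT con_origin, con_received, max(con_timestamp) '
    'FROM "_testcluster".sl_confirm '
    'GROUP BY con_origin, con_received'
)

def stale_confirmations(rows, now, max_age=timedelta(minutes=30)):
    """Return (origin, receiver) pairs whose newest confirmation is too old.

    rows: iterable of (con_origin, con_received, newest con_timestamp).
    """
    return [
        (origin, receiver)
        for origin, receiver, last_confirmed in rows
        if now - last_confirmed > max_age
    ]
```

A pair that stays in this list across several runs suggests that confirmations from that receiver are not getting back to the origin.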
The tools directory contains four scripts that may be used to monitor Slony-I instances:
test_slony_replication is a Perl script to which you pass connection information for a Slony-I node. It then queries sl_path and other information on that node to determine the shape of the requested replication set.
It then runs some test queries against a test table called slony_test, which must be added to the set of tables being replicated and which is defined as follows:
CREATE TABLE slony_test (
    description text,
    mod_date timestamp with time zone,
    "_Slony-I_testcluster_rowID" bigint DEFAULT nextval('"_testcluster".sl_rowid_seq'::text) NOT NULL
);
The last column in that table was added by Slony-I because the table was defined without a primary key.
For each Slony-I node that is active for the requested replication set, the script writes one line of output to a file called cluster.fact.log.
There is an additional finalquery option that allows you to pass in an application-specific SQL query that can determine something about the state of your application.
log.pm is a Perl module that manages logging for the Perl scripts.
run_rep_tests.sh is a "wrapper" script that runs test_slony_replication.
If you have several Slony-I clusters, you might set up configuration in this file to connect to all those clusters.
nagios_slony_test is a script that queries the log files. The idea is to run the replication tests every so often (we run them every 6 minutes) and to set up a system monitoring tool such as Nagios to use this script to check the state recorded in those logs.
It seemed rather more efficient to have a cron job run the tests and have Nagios check the results rather than having Nagios run the tests directly. The tests can exercise the whole Slony-I cluster at once rather than Nagios invoking updates over and over again.
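The split between a cron-driven test run and a cheap Nagios check can be sketched as follows. This is a minimal illustration in Python, not the actual nagios_slony_test script: the log line format (a node name, a tab, and a lag in seconds) and the thresholds are assumptions made for the example. The status values follow the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
# Hypothetical log format: "<node_name>\t<lag_seconds>", one node per line.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def classify_lag(lag_seconds, warn=300.0, crit=900.0):
    """Map a replication lag in seconds to a Nagios status code."""
    if lag_seconds >= crit:
        return CRITICAL
    if lag_seconds >= warn:
        return WARNING
    return OK

def check_log_lines(lines, warn=300.0, crit=900.0):
    """Return (status, message) for the worst node found in the log lines."""
    worst = OK
    details = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            node, lag_text = line.split("\t")
            lag = float(lag_text)
        except ValueError:
            # Wrong number of fields or a non-numeric lag value.
            return UNKNOWN, "unparseable log line: %r" % line
        worst = max(worst, classify_lag(lag, warn, crit))
        details.append("%s=%.0fs" % (node, lag))
    if not details:
        return UNKNOWN, "no test results found"
    return worst, ", ".join(details)
```

A wrapper invoked by Nagios would read the log file written by the cron job, print the message, and exit with the returned status, so that Nagios itself never touches the database.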
The methodology of the previous section is designed to minimize the cost of submitting replication test queries. On a busy cluster supporting hundreds of users, however, the cost of running a few extra queries is likely to be negligible, while the setup cost of configuring the tables and data injectors is fairly high.
Three other methods for analyzing the state of replication have stood out:
For an application-oriented test, it has been useful to set up a view on some frequently updated table that pulls application-specific information.
For instance, one might look at statistics about either the most recent application transaction or the most recently created application object, using one of the following views:
create view replication_test as select now() - txn_time as age, object_name from transaction_table order by txn_time desc limit 1;
create view replication_test as select now() - created_on as age, object_name from object_table order by id desc limit 1;
There is a downside: this approach requires regular activity in the system, so that new transactions keep arriving. If something breaks down with your application, you may start getting spurious warnings that replication is behind even though replication is working fine.
The Slony-I-defined view sl_status provides information on how up to date different nodes are. Its contents are only really interesting on origin nodes, as the events generated on other nodes are generally ignorable.
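One way to soften this downside is to read the view on both the origin and a subscriber and compare the two ages. The Python sketch below is only an illustration under stated assumptions: the function name, the 10-minute threshold, and the idea that both ages are already available as datetime.timedelta values (as a PostgreSQL driver would typically return for an interval) are all hypothetical.

```python
from datetime import timedelta

def diagnose(origin_age, subscriber_age, max_age=timedelta(minutes=10)):
    """Interpret the 'age' column of replication_test read from two nodes.

    origin_age / subscriber_age: age of the newest row seen on each node.
    """
    if subscriber_age <= max_age:
        return "ok"
    if origin_age > max_age:
        # The newest row is old even on the origin: the application has
        # gone quiet, so the view cannot say anything about replication.
        return "application idle"
    return "replication behind"
```

When the result is "application idle", the view gives no evidence either way about replication, so a check based on it would be better reported as unknown than as a replication failure.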
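A monitoring job might read sl_status on the origin and flag subscribers that lag too far behind. The following Python sketch assumes a cluster named testcluster (so that the schema is "_testcluster", as in the slony_test table definition earlier) and uses the view's st_received, st_lag_num_events, and st_lag_time columns. The thresholds and the function name are hypothetical, and running the query through a database driver is left out so that the example stays self-contained.

```python
from datetime import timedelta

# Query to run on the origin node; "_testcluster" is an assumed schema name.
SL_STATUS_QUERY = (
    'SELECT st_received, st_lag_num_events, st_lag_time '
    'FROM "_testcluster".sl_status'
)

def lagging_nodes(rows, max_events=100, max_time=timedelta(minutes=5)):
    """Return the ids of receiving nodes whose lag exceeds either threshold.

    rows: iterable of (st_received, st_lag_num_events, st_lag_time) tuples,
    with st_lag_time given as a timedelta.
    """
    return [
        node
        for node, lag_events, lag_time in rows
        if lag_events > max_events or lag_time > max_time
    ]
```

Checking both the event count and the elapsed time guards against two different situations: a busy origin where many events pile up quickly, and a quiet one where a few unconfirmed events sit for a long time.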
See also the Section 5.1.3 discussion.