6. Slony-I Maintenance

Slony-I actually does a lot of its necessary maintenance itself, in a "cleanup" thread:

6.1. Interaction with PostgreSQL autovacuum

Recent versions of PostgreSQL support an "autovacuum" process which notices when tables are modified, thereby creating dead tuples, and vacuums those tables, "on demand." It has been observed that this can interact somewhat negatively with Slony-I's own vacuuming policies on its own tables.

Slony-I requests vacuums on its tables immediately after completing the transactions that clean out old data, which should be the ideal time to do so. Autovacuum, however, may notice the changes a bit earlier and attempt vacuuming while those transactions are still incomplete, rendering the work pretty useless. It therefore seems preferable to configure autovacuum to avoid vacuuming the Slony-I-managed configuration tables.

The following query (change the cluster name to match your local configuration) will identify the tables that autovacuum should be configured not to process:

mycluster=# select oid, relname from pg_class where relnamespace = (select oid from pg_namespace where nspname = '_' || 'MyCluster') and relhasindex;
  oid  |   relname    
-------+--------------
 17946 | sl_nodelock
 17963 | sl_setsync
 17994 | sl_trigger
 17980 | sl_table
 18003 | sl_sequence
 17937 | sl_node
 18034 | sl_listen
 18017 | sl_path
 18048 | sl_subscribe
 17951 | sl_set
 18062 | sl_event
 18069 | sl_confirm
 18074 | sl_seqlog
 18078 | sl_log_1
 18085 | sl_log_2
(15 rows)

The following statement will populate pg_catalog.pg_autovacuum with suitable configuration information:

INSERT INTO pg_catalog.pg_autovacuum
       (vacrelid, enabled, vac_base_thresh, vac_scale_factor,
        anl_base_thresh, anl_scale_factor,
        vac_cost_delay, vac_cost_limit,
        freeze_min_age, freeze_max_age)
SELECT oid, 'f', -1, -1, -1, -1, -1, -1, -1, -1
  FROM pg_catalog.pg_class
 WHERE relnamespace = (SELECT oid FROM pg_namespace
                        WHERE nspname = '_' || 'MyCluster')
   AND relhasindex;

6.2. Watchdogs: Keeping Slons Running

There are a couple of "watchdog" scripts available that monitor the slon processes and restart them should they die for some reason, such as a network "glitch" that causes loss of connectivity.

You might want to run them...

The "best new way" of managing slon processes is via the combination of Section 21.2, which creates a configuration file for each node in a cluster, and Section 21.4, which uses those configuration files.

This approach is preferable to the older "watchdog" systems in that you can very precisely "nail down," in each config file, the exact desired configuration for each node, without needing to be concerned with what options the watchdog script may or may not give you. This is particularly important if you are using log shipping, where forgetting the -a option could ruin your log shipped node, and thereby your whole day.
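As a rough sketch of what that looks like in practice (every value here, the config file path, cluster name, connection string, and archive directory, is an example rather than anything taken from this document, and the option names should be checked against the slon runtime configuration documentation for your version):

#!/bin/sh
# Hypothetical per-node configuration and launch of a slon.  All paths and
# names below are examples only.
CONF=/etc/slony1/node1.conf

cat > "$CONF" <<'EOF'
# Option names follow slon's runtime configuration; verify them for your version.
cluster_name='MyCluster'
conn_info='host=localhost port=5432 dbname=mydb user=slony'
# On a log shipping node, the archive directory would also be pinned down here
# (the config-file counterpart of the -a command line option):
# archive_dir='/var/log/slony1/archive'
EOF

# Start the slon against that configuration file.
slon -f "$CONF"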

6.3. Parallel to Watchdog: generate_syncs.sh

A new script for Slony-I 1.1 is generate_syncs.sh, which addresses the following kind of situation.

Suppose you have a somewhat flaky server on which the slon daemon might not run all the time; you might return from a weekend away only to discover the following situation.

On Friday night, something went "bump" and, while the database came back up, none of the slon daemons survived. Your online application then saw nearly three days' worth of reasonably heavy transaction load.

When you restart slon on Monday, it hasn't done a SYNC on the master since Friday, so that the next "SYNC set" comprises all of the updates between Friday and Monday. Yuck.

If you run generate_syncs.sh as a cron job every 20 minutes, it will force in a periodic SYNC on the origin, which means that the updates accumulated between Friday and Monday are split across more than 100 SYNCs that can be applied incrementally, making the cleanup a lot less unpleasant.
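For example, a crontab entry along the following lines would do it (the installation path is an example only; adjust it, and any arguments your copy of the script requires, to suit your site):

# Hypothetical crontab entry: run generate_syncs.sh every 20 minutes.
*/20 * * * * /usr/local/slony1/tools/generate_syncs.sh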

Note that if SYNCs are running regularly, this script won't bother doing anything.

6.4. Testing Slony-I State

In the tools directory you will find scripts (see Section 5.1) called test_slony_state.pl and test_slony_state-dbi.pl. One uses the Perl/DBI interface; the other uses the Pg interface.

Both do essentially the same thing: they connect to a Slony-I node (you can pick any one) and, from that, determine all the nodes in the cluster. They then run a series of read-only queries (so this should be quite safe to run) which examine various Slony-I tables, looking for a variety of conditions suggestive of problems, including:

Running this once an hour or once a day can help you detect symptoms of problems early, before they lead to performance degradation.
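A sketch of how such a check might be scheduled follows; the script location is an example, and the option names shown (--host, --database, --user, --cluster) are assumptions, so check the script's usage output for the options your version actually accepts:

#!/bin/sh
# Hypothetical wrapper, run hourly from cron: test the cluster state and mail
# any output to the DBA.  Paths, option names, and addresses are examples.
perl /usr/local/slony1/tools/test_slony_state-dbi.pl \
    --host=localhost --database=mydb --user=slony --cluster=MyCluster \
    2>&1 | mail -s "Slony-I state report" dba@example.com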

6.5. Replication Test Scripts

In the tools directory may be found four scripts that can be used to monitor Slony-I instances:

6.6. Other Replication Tests

The methodology of the previous section is designed with a view to minimizing the cost of submitting replication test queries; on a busy cluster, supporting hundreds of users, the cost associated with running a few queries is likely to be pretty irrelevant, and the setup cost to configure the tables and data injectors is pretty high.

Three other methods for analyzing the state of replication have stood out:

6.7. Log Files

slon daemons generate some more-or-less verbose log files, depending on what debugging level is turned on. You might assortedly wish to:

6.8. mkservice

6.8.1. slon-mkservice.sh

Create a slon service directory for use with svscan from daemontools. This uses multilog in a pretty basic way, which seems to be standard for daemontools / multilog setups. If you want clever logging, see logrep below. Currently this script has very limited error handling capabilities.

For non-interactive use, set the following environment variables: BASEDIR, SYSUSR, PASSFILE, DBUSER, HOST, PORT, DATABASE, CLUSTER, and SLON_BINARY. If any of these are not set, the script asks for the configuration information interactively. A sketch of a non-interactive invocation follows the list below.

  • BASEDIR where you want the service directory structure for the slon to be created. This should not be the /var/service directory.

  • SYSUSR the unix user under which the slon (and multilog) process should run.

  • PASSFILE location of the .pgpass file to be used. (default ~sysusr/.pgpass)

  • DBUSER the postgres user the slon should connect as (default slony)

  • HOST what database server to connect to (default localhost)

  • PORT what port to connect to (default 5432)

  • DATABASE which database to connect to (default dbuser)

  • CLUSTER the name of your Slony-I cluster (default database)

  • SLON_BINARY the full path name of the slon binary (default: the slon found via which)
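A sketch of such a non-interactive run, with every value below (paths, user, database, cluster) being an example rather than anything mandated by the script:

#!/bin/sh
# Hypothetical non-interactive invocation of slon-mkservice.sh; all of the
# values below are examples only.
BASEDIR=/usr/local/svc-src \
SYSUSR=slony \
PASSFILE=/home/slony/.pgpass \
DBUSER=slony \
HOST=localhost \
PORT=5432 \
DATABASE=mydb \
CLUSTER=mycluster \
SLON_BINARY=/usr/local/pgsql/bin/slon \
./slon-mkservice.sh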

6.8.2. logrep-mkservice.sh

This uses tail -F to pull data from log files, allowing you to use multilog filters (by setting CRITERIA) to create special-purpose log files. The goal is to provide a way to monitor log files in near real time for "interesting" data without either hacking up the initial log file or wasting CPU/IO by re-scanning the same log repeatedly.

For non-interactive use, set the following environment variables: BASEDIR, SYSUSR, SOURCE, EXTENSION, and CRITERIA. If any of these are not set, the script asks for the configuration information interactively.

  • BASEDIR where you want the service directory structure for the logrep to be created. This should not be the /var/service directory.

  • SYSUSR unix user under which the service should run.

  • SOURCE name of the service with the log you want to follow.

  • EXTENSION a tag to differentiate this logrep from others using the same source.

  • CRITERIA the multilog filter you want to use.

A trivial example of this would be to provide a log file of all slon ERROR messages, which could be used to trigger a nagios alarm:

EXTENSION='ERRORS'
CRITERIA="'-*' '+* * ERROR*'"

(Reset the monitor by rotating the log using svc -a $svc_dir.)
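Pulled together, a non-interactive invocation setting up that error log might look like the following sketch; the base directory, system user, and source service name are examples only:

#!/bin/sh
# Hypothetical non-interactive setup of an ERRORS log follower; the BASEDIR,
# SYSUSR, and SOURCE values are examples, not taken from this document.
BASEDIR=/usr/local/svc-src \
SYSUSR=slony \
SOURCE=slon_mycluster \
EXTENSION='ERRORS' \
CRITERIA="'-*' '+* * ERROR*'" \
./logrep-mkservice.sh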

A more interesting application is a subscription progress log:

EXTENSION='COPY'
CRITERIA="'-*' '+* * ERROR*' '+* * WARN*' '+* * CONFIG enableSubscription*' '+* * DEBUG2 remoteWorkerThread_* prepare to copy table*' '+* * DEBUG2 remoteWorkerThread_* all tables for set * found on subscriber*' '+* * DEBUG2 remoteWorkerThread_* copy*' '+* * DEBUG2 remoteWorkerThread_* Begin COPY of table*' '+* * DEBUG2 remoteWorkerThread_* * bytes copied for table*' '+* * DEBUG2 remoteWorkerThread_* * seconds to*' '+* * DEBUG2 remoteWorkerThread_* set last_value of sequence*' '+* * DEBUG2 remoteWorkerThread_* copy_set*'"

If you have a subscription log, then it's easy to determine whether a given slon is in the process of handling copies or other subscription activity. If the log isn't empty and doesn't end with a "CONFIG enableSubscription: sub_set:1" line (or whatever set number you've subscribed to), then the slon is currently in the middle of its initial copies.
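That test is easy enough to script; here is a minimal sketch, in which the path to the multilog "current" file is an example and must be pointed at wherever your COPY service writes its log:

#!/bin/sh
# Hypothetical check: refuse to restart a slon if its COPY log suggests an
# initial copy is still in progress.  The log path is an example only.
LOG=/usr/local/svc-src/logrep-COPY/main/current

if [ -s "$LOG" ] && ! tail -n 1 "$LOG" | grep -q 'CONFIG enableSubscription'; then
    echo "slon appears to be mid-copy; not restarting" >&2
    exit 1
fi
echo "no initial copy in progress; safe to restart the slon"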

If you happen to be monitoring the mtime of your primary slony logs to determine whether your slon has gone brain-dead, checking this is a good way to avoid mistakenly clobbering it in the middle of a subscribe. As a bonus, recall that since the slons are running under svscan, you only need to kill the slon (via the svc interface) and let svscan start it up again later. I've also found the COPY logs handy for following subscribe activity interactively.