[Slony1-general] Odd slony problem

Sat Mar 12 18:05:33 PST 2005

Tass Chapman wrote:

>
> We are seeing the following issue occur sporadically on our master and 
> forwarders, sometimes a few times within 48 hours, sometimes not for 
> a few weeks. It stops the SLON daemon though, so we have to restart it 
> to get our replication working again.
>
> DEBUG1 cleanupThread:    0.007 seconds for delete logs
> FATAL  cleanupThread: "vacuum analyze "_user_master".sl_event; vacuum
> analyze "_user_master".sl_confirm; vacuum analyze
> "_user_master".sl_setsync; vacuum analyze "_user_master".sl_log_1; vacuum
> analyze "_user_master".sl_log_2;vacuum analyze
> "_user_master".sl_seqlog;vacuum analyze pg_catalog.pg_listener;" - ERROR:
> duplicate key violates unique constraint "pg_statistic_relid_att_index"
> DEBUG1 syncThread: thread done
> DEBUG1 main: scheduler mainloop returned
> INFO   remoteListenThread_2: disconnecting from 'dbname=master
> host={HOST_NAME} port=5432 user={USER} password={PASSWORD}'
> DEBUG1 remoteListenThread_2: thread done
> DEBUG1 localListenThread: thread done
> DEBUG1 remoteWorkerThread_2: thread done
> DEBUG1 main: done
>
> At this point it prints some ASCII ESC characters as it stops.
>
> We have several SLON clusters running on our master, going to a few 
> dozen systems in total.
> Running LFS with a kernel of 2.6.9 SMP, SLONY 1.0.5 and Postgres 7.4.6.
>
> We have set processor affinity as well.
>
> Any suggestions? Is this a known issue?

This seems consistent with Slony-I running an analyze that tries to 
update stats in pg_statistic concurrently with some other process doing 
the same.

Are you running pg_autovacuum or some other vacuuming regimen that 
periodically runs ANALYZE on one or another of the tables you saw in 
that FATAL message?

If you're periodically doing ANALYZEs, and Slony-I is too, the two will 
only collide when their runs happen to overlap, which is consistent with 
the error occurring sporadically.

If your ANALYZE is a pretty big one, involving many tables, it would 
make sense that the periodicity could go "in phase," so that the fatal 
condition would happen with considerable regularity, and could go "out 
of phase," so it would nearly disappear.

If it took five minutes to restart the slon, that would lead to a phase 
shift, which would either make the conflicts worse or lessen them.
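
To make the "in phase" / "out of phase" point concrete, here is a rough 
sketch in Python. The periods, durations, and offsets are made-up numbers 
for illustration only; they are not Slony-I's actual cleanup interval or 
anyone's real ANALYZE schedule, and count_collisions is a hypothetical 
helper, not part of Slony-I.

```python
def count_collisions(period_a, dur_a, period_b, dur_b, offset_b, horizon):
    """Count runs of job A that overlap in time with any run of job B.

    Job A starts at 0, period_a, 2*period_a, ... and runs for dur_a seconds.
    Job B starts at offset_b, offset_b + period_b, ... and runs for dur_b.
    All values are whole seconds; horizon is how long we simulate.
    """
    b_starts = range(offset_b, horizon, period_b)
    collisions = 0
    for start_a in range(0, horizon, period_a):
        # Two half-open intervals overlap when each starts before the other ends.
        if any(start_a < sb + dur_b and sb < start_a + dur_a for sb in b_starts):
            collisions += 1
    return collisions


# Hypothetical numbers: a 5-second cleanup vacuum and a 60-second
# external ANALYZE, both repeating every 10 minutes.
in_phase = count_collisions(600, 5, 600, 60, offset_b=0, horizon=6000)
out_of_phase = count_collisions(600, 5, 600, 60, offset_b=300, horizon=6000)
print(in_phase, out_of_phase)
```

With these assumed numbers, starting both jobs together makes every 
cleanup run overlap the ANALYZE, while a five-minute offset (such as a 
slow restart might introduce) makes none of them overlap.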

I'd observe that cfengine, an automated configuration management engine, 
has the habit of sleeping for random periods of time (it calls this a 
"splaytime") before getting started in order to try to avoid 'thundering 
herd' and 'getting in phase' problems.  I probably ought to modify the 
watchdog process to add a bit of random "fuzz" time to avoid these 
issues, and it might even be worth doing the same to the cleanup thread.
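
As a sketch of what such "fuzz" could look like (in Python for brevity; 
this is not code from Slony-I or the watchdog, and the function name and 
parameters are assumptions for illustration), each sleep between runs 
gets a random offset so that two periodic jobs cannot stay locked in 
phase:

```python
import random


def jittered_interval(base_seconds, max_fuzz_seconds, rng=random):
    """Return base_seconds shifted by a random amount in
    [-max_fuzz_seconds, +max_fuzz_seconds].

    rng is anything with a uniform(a, b) method; it defaults to the
    random module and can be a seeded random.Random for repeatable runs.
    """
    return base_seconds + rng.uniform(-max_fuzz_seconds, max_fuzz_seconds)


# Hypothetical usage: a nominal 10-minute cycle with up to a minute of fuzz.
rng = random.Random(2005)
intervals = [jittered_interval(600, 60, rng) for _ in range(5)]
print(intervals)
```

A loop would then sleep for jittered_interval(...) seconds instead of a 
fixed period. The fuzz does not prevent two ANALYZEs from ever 
overlapping; it only keeps an unlucky alignment from repeating on every 
cycle.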