Sat Mar 12 18:05:33 PST 2005
- Previous message: [Slony1-general] Odd slony problem
- Next message: [Slony1-general] Odd slony problem
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Tass Chapman wrote:
>
> We are seeing the following issue occur sporadically on our master and
> forwarders, sometimes a few times within 48 hours, sometimes not for
> a few weeks. It stops the SLON daemon though, so we have to restart it
> to get our replication working again.
>
> DEBUG1 cleanupThread: 0.007 seconds for delete logs
> FATAL cleanupThread: "vacuum analyze "_user_master".sl_event; vacuum
> analyze "_user_master".sl_confirm; vacuum analyze
> "_user_master".sl_setsync; vacuum analyze "_user_master".sl_log_1;
> vacuum analyze "_user_master".sl_log_2; vacuum analyze
> "_user_master".sl_seqlog; vacuum analyze pg_catalog.pg_listener;" - ERROR:
> duplicate key violates unique constraint "pg_statistic_relid_att_index"
> DEBUG1 syncThread: thread done
> DEBUG1 main: scheduler mainloop returned
> INFO remoteListenThread_2: disconnecting from 'dbname=master
> host={HOST_NAME} port=5432 user={USER} password={PASSWORD}'
> DEBUG1 remoteListenThread_2: thread done
> DEBUG1 localListenThread: thread done
> DEBUG1 remoteWorkerThread_2: thread done
> DEBUG1 main: done
>
> Then at this point some ASCII ESC as it stops.
>
> We have several SLON clusters running on our master, going to a few
> dozen systems in total.
> Running LFS with a kernel of 2.6.9 SMP, SLONY 1.0.5 and Postgres 7.4.6.
>
> We have set processor affinity as well.
>
> Any suggestions? Is this a known issue?

This seems consistent with Slony-I running an ANALYZE that tries to update stats in pg_statistic concurrently with some other process doing the same.

Are you running pg_autovacuum or some other vacuuming regimen that periodically runs ANALYZE on one or another of the tables you saw in that FATAL message? If you're periodically doing ANALYZEs, and Slony-I is too, that is consistent with the failure occurring sporadically.
If your ANALYZE is a pretty big one, involving many tables, it would make sense that the two schedules could go "in phase," so that the fatal condition would happen with considerable regularity, and could go "out of phase," so that it would nearly disappear. If it took five minutes to restart the slon, that would lead to a phase shift, which would either make the conflicts worse or lessen them.

I'd observe that cfengine, an automated configuration management engine, has the habit of sleeping for a random period of time (it calls this a "splaytime") before getting started, in order to avoid 'thundering herd' and 'getting in phase' problems.

I probably ought to modify the watchdog process to add a bit of random "fuzz" time to avoid these issues, and it might even be worth doing the same to the cleanup thread.
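To illustrate the "fuzz" idea, here is a minimal sketch in Python of a periodic job whose interval is randomized on every cycle. It is not Slony-I code and not how the watchdog or cleanup thread is implemented; the names `jittered_interval`, `run_times`, and `fuzz_fraction`, and the 600-second base interval, are all assumptions made up for the example.

```python
import random


def jittered_interval(base_seconds, fuzz_fraction=0.25, rng=random):
    """Return base_seconds plus or minus a random share of itself.

    fuzz_fraction is a hypothetical tuning knob: 0.25 means the result
    falls anywhere within 25% of base_seconds on either side.
    """
    if base_seconds <= 0:
        raise ValueError("base_seconds must be positive")
    if not 0.0 <= fuzz_fraction < 1.0:
        raise ValueError("fuzz_fraction must be in [0, 1)")
    fuzz = base_seconds * fuzz_fraction
    return base_seconds + rng.uniform(-fuzz, fuzz)


def run_times(base_seconds, runs, fuzz_fraction=0.25, rng=random):
    """Return the start times (seconds from now) of the next `runs` cycles.

    Each cycle waits a freshly jittered interval, so two jobs with the
    same base period drift apart instead of staying locked in phase.
    """
    times = []
    now = 0.0
    for _ in range(runs):
        now += jittered_interval(base_seconds, fuzz_fraction, rng)
        times.append(now)
    return times


if __name__ == "__main__":
    # Assumed 10-minute base period, purely for illustration.
    for start in run_times(600, 5):
        print(f"next run at +{start:.1f}s")
```

Randomizing every cycle, rather than only the initial delay, means that two jobs which happen to collide once are unlikely to collide again on the next round. It does not remove the underlying conflict between concurrent ANALYZE runs; it only makes repeated collisions less likely.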