Christopher Browne cbbrowne
Thu Jan 26 11:29:32 PST 2006
Glen Eustace wrote:

>I am using slony-1.1.0 with postgresql-8.0.6 and have a situation that I
>hope has a better resolution than I am currently using.
>
>One of my 2 slaves is some distance away and over the last 6 months or
>so we have had quite a few network brownouts or blackouts between it and
>the master. After such an event, replication fails and the only way I am
>managing to get it to go again is to drop the node and database and
>start again. I have done this now so many times I have scripted it so
>that I can get the slave back online relatively quickly.
>
>I get errors, like the following, in the slony log
>
>2006-01-26 08:11:13 NZDT ERROR  remoteWorkerThread_1: "start transaction; set enable_seqscan = off; set enable_indexscan = on; " PGRES_FATAL_ERROR
>2006-01-26 08:11:13 NZDT ERROR  remoteWorkerThread_1: "close LOG; " PGRES_FATAL_ERROR
>2006-01-26 08:11:13 NZDT ERROR  remoteWorkerThread_1: "rollback transaction; set enable_seqscan = default; set enable_indexscan = default; " PGRES_FATAL_ERROR
>2006-01-26 08:11:13 NZDT ERROR  remoteWorkerThread_1: helper 1 finished with error
>2006-01-26 08:11:13 NZDT ERROR  remoteWorkerThread_1: SYNC aborted
>
>Stopping and restarting all the various slony processes doesn't seem to
>clear things.
>
>NB: It only ever seems to happen after a network event.  Any advice on
>how to get replication started again without rebuilding would be
>appreciated.
>
One thought...

You might want to turn the logging up to a higher level; it looks as
though it's at level 1, and I'd expect "-d 2" to give more useful
information.
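For example (just a sketch; substitute your own cluster name, conninfo, and
log path), restarting the slon for the remote node at a higher debug level
might look like:

    slon -d 2 mycluster "dbname=mydb host=slavehost user=slony" \
        >> /var/log/slony/slon_node3.log 2>&1 &

The extra remoteWorkerThread output should make it clearer why the SYNC is
being aborted.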

Another notion...

My suspicion is that the connection between the slon and the database it
is managing was broken by the network event.  Higher debug levels might
display a message like "a slon is already servicing node #2"; that would
be a good tell-tale sign...

The next time this happens, connect to the database and look at
pg_stat_activity to see what slony-related backends are in use.  My
suspicion is that you'll see several of them, possibly (if
stats_command_string is enabled) showing "<IDLE> in transaction".
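Something along these lines should show them (column names are as of 8.0;
current_query only carries the query text when stats_command_string is on,
and "mydb" is a stand-in for whichever database slony is replicating into):

    psql -d mydb -c "select procpid, usename, query_start, current_query
                     from pg_stat_activity order by query_start;"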

Solution #1...   Those idle-in-transaction backends are, in effect,
'zombies' of sorts.  They haven't yet figured out that the network
connection has died and won't be coming back.  Depending on the TCP/IP
keepalive configuration, they can persist for up to a couple of hours.
Kill them off, and see if starting new slon processes works out better.
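PostgreSQL 8.0 has no pg_terminate_backend(), so the killing has to be done
at the operating system level on the database host, using the procpid
values from pg_stat_activity (illustrative PIDs only; make sure they really
belong to the stale slony connections before killing anything):

    # on the database server, as the postgres user
    kill 12345 12346     # SIGTERM ends those idle-in-transaction sessions

Once they are gone, restart the slon for that node and watch the log.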

Solution #2...  It is preferable for each slon to run on the same network
as the database it is managing.  That would prevent some of the above from
happening; notably, restarting the slons would then actually do some good.
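Concretely (again a sketch with made-up names), that means starting the
slon for the distant subscriber on the distant host itself, so that its
administrative connection to its own node stays local and only the
connections back to the provider have to cross the unreliable link:

    # run this on the distant slave, not on the master's side of the WAN
    slon -d 2 mycluster "dbname=mydb host=localhost user=slony" &

With that arrangement, a network event only costs you the remote
connections, which a slon restart (or slon's own retry logic) can
re-establish.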


