[Slony1-general] Slony Watchdog failed starting up the child process

Tue Jul 23 12:22:39 PDT 2013

On 07/23/2013 03:08 PM, Christopher Browne wrote:
> My intuition from seeing it say "FATAL" is that that's indicating "death
> of process," and that there's not much coming back from it.
>
> This behaviour is pretty consistent with what happens with a Postgres
> postmaster; if the attempt to start up fails due to seeming already to
> have a postmaster, it doesn't retry, pg_ctl immediately gives up.
>

This came up a few years ago with bug #132.

http://git.postgresql.org/gitweb/?p=slony1-engine.git;a=commit;h=acd46819bad1613764708b138ebcfa895467ac51

That commit changed slon to behave as Rose expected, retrying to get the
node lock every few seconds.

A few weeks later we modified this so that slon only retries getting the
node lock after a slon-requested restart, and does not retry if the
initial start fails.

http://git.postgresql.org/gitweb/?p=slony1-engine.git;a=commit;h=7d3e6659542ad337feb2fbe39f05b780c37afe97

I don't really remember the discussion around this change or exactly
why we didn't like my original patch. It was possibly for reasons like
the one you argue above: if slon keeps looping it never really 'starts',
and that is hard to detect.

> By the way, is this possibly because of a zombied old connection that
> got disconnected due to firewall glitch or such?  If so, you should
> probably see about lowering the TCP keepalive parameters both in the
> slon.conf file and in postgresql.conf
>
> (On postgresql.conf, see tcp_keepalives_(idle|interval|count), and on
> slon.conf, see tcp_keepalive, tcp_keepalive_(idle|interval|count).)
>
>

No matter how low you make the postgresql.conf settings, it is always
possible for the replacement slon to start before PostgreSQL detects
the timeout. I don't know how low you can make the TCP timeout settings
before that has other side effects.
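
To put a rough number on that window: with TCP keepalives, a dead peer on
an idle connection is noticed after about idle + interval * count seconds.
A minimal sketch in Python; the function name and the 60/10/3 values are
mine and purely illustrative, not recommended settings.

```python
def keepalive_detection_window(idle_s: int, interval_s: int, count: int) -> int:
    """Approximate seconds before a dead peer is noticed on an idle connection.

    The kernel waits idle_s seconds of silence, then sends up to `count`
    probes spaced interval_s seconds apart before dropping the connection.
    """
    return idle_s + interval_s * count


# Hypothetical values, chosen only for illustration.
window = keepalive_detection_window(idle_s=60, interval_s=10, count=3)
print(f"a stale backend may hold the node lock for roughly {window}s")
```

Any replacement slon started inside that window can still find the old
backend holding the node lock.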

One option is to push the issue to whatever is starting the slon and let
it retry (which is what we do now).  Another option is to let slon loop
x times trying to get the node lock before giving up, but we didn't seem
to like that 3 years ago.
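
Either way the logic is a bounded retry. As a sketch of what the thing
starting slon might do (Python, not slon's actual code; `start_with_retries`
and its parameters are hypothetical names, and `start` stands in for
whatever launches slon and reports whether it got the node lock):

```python
import time
from typing import Callable


def start_with_retries(
    start: Callable[[], bool],
    attempts: int = 5,
    delay_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call start() up to `attempts` times, pausing delay_s between failures.

    Returns True as soon as start() succeeds, False once every attempt
    has failed, so the caller can tell "never started" from "running".
    """
    for attempt in range(1, attempts + 1):
        if start():
            return True
        if attempt < attempts:
            # Give the server time to notice the dead connection.
            sleep(delay_s)
    return False
```

Because the loop gives up after a fixed number of attempts, a slon that
never starts still ends in a clear failure rather than looping forever.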
