[Slony1-general] slons ... stop ...

Fri Jan 7 20:37:26 PST 2005

We have run into a bit of an anomaly that sorta looks like some odd
deadlock.

We've got a whole bunch of slon daemons in one of our environments,
roughly a dozen of them, across 4 Slony-I clusters.  The platform is
AIX, Slony-I 1.0.5, PostgreSQL 7.4.6.

Some sort of "strange and mystical" thing happened at 04:20 UTC;
roughly half of them hung up and stopped at roughly the same moment.

[the following has been subtly rewritten to keep certain internal
details private, of course...]

dbserv[/opt/log/DOTA/slony1]$ for i in node*; do
> cd $i
> ls -alt | head | grep "4\:20"
> cd ..
> done
-rw-------   1 postgres dba         3142497 Jan 07 04:20 srsa_2005-01-06_22:04:47.log
-rw-------   1 postgres dba         2719731 Jan 07 04:20 srsa_2005-01-06_22:04:47.log
-rw-------   1 postgres dba         2323728 Jan 07 04:20 srsa_2005-01-06_22:04:47.log
dbserv[/opt/log/DOTA/slony1]$ cd ../../DOTB/slony1 
dbserv[/opt/log/DOTB/slony1]$ for i in node*; do cd $i; ls -alt | head | grep "4\:20"; cd ..; done
-rw-------   1 postgres dba         5100346 Jan 07 04:20 srsb_2005-01-06_22:04:47.log
-rw-------   1 postgres dba         1365573 Jan 07 04:20 srsb_2005-01-07_03:31:25.log
-rw-------   1 postgres dba         1394114 Jan 07 04:20 srsb_2005-01-07_03:30:34.log
-rw-------   1 postgres dba         1500946 Jan 07 04:20 srsb_2005-01-07_03:27:59.log
dbserv[/opt/log/DOTB/slony1]$ cd ../../LOGDBS/slony1 
dbserv[/opt/log/LOGDBS/slony1]$ for i in node*; do cd $i; ls -alt | head | grep "4\:20"; cd ..; done
-rw-------   1 postgres dba         4670613 Jan 07 04:20 blogs_2005-01-07_00:34:54.log

None of them contain any out-of-the-ordinary messages; the logs for
origin nodes have the usual series of confirmations.  Subscribers have
messages about SYNCs coming through, and about the successful
processing of syncs.  They all look as though everything is working
just fine until, at 04:20 UTC, they got caught up in the "Carbonite
Maneuver" (to combine two of the better things presented on Star Trek
and Star Wars!) in some sort of state of frozenness.

The slon processes were present; my first theory would have been, "oh,
dear, their parent process died, so they got thrown out."  But a ps
command found that they were still (seemingly) running.

The database instances were examined for locks that might have been
causing some form of DBMS deadlock; no relevant locks were found.

After some peering around, the slon processes were killed and
restarted, and replication continued apace, no seeming problems
continuing.  (Another situation came up later, but it was quite
another matter...)

This just seems ... odd.

As is visible in the log file names, most of them were started at very
nearly the same time.  All of the slon processes were started by a
central "watchdog" at about the same moment, so the similarity in the
times that these all stopped seems unlikely to be a pure coincidence.

Mind you, there were several slons that _didn't_ hang up, so it's not
true to say that they _all_ went down.

The best theory outstanding is that perhaps there's some subtlety in
thread handling (which goes without saying!), and that since these
processes all had similar quantities of Slony-I event data go through
them, there might be some filling up of some limited resource based on
that.

But that's at best a bit of gossip at this point.

I'm throwing out this "bit of gossip" on the offchance that anyone
else has seen a similar condition take place, where slons just freeze.
If there's a bug involved, it'll take some keen observation to track
it down.
-- 
"cbbrowne","@","ca.afilias.info"
<http://dev6.int.libertyrms.com/>
Christopher Browne
(416) 673-4124 (land)