Fri Jan 7 20:37:26 PST 2005
- Previous message: [Slony1-general] Schema-Update
- Next message: [Slony1-general] slons ... stop ...
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
We have run into a bit of an anomaly that sorta looks like some odd deadlock. We've got a whole bunch of slon daemons in one of our environments, roughly a dozen of them, across 4 Slony-I clusters. The platform is AIX, Slony-I 1.0.5, PostgreSQL 7.4.6. Some sort of "strange and mystical" thing happened at 04:20 UTC; roughly half of them hung up and stopped at roughly the same moment. [the following has been subtly rewritten to keep certain internal details private, of course...] dbserv[/opt/log/DOTA/slony1]$ for i in node*; do > cd $i > ls -alt | head | grep "4\:20" > cd .. > done -rw------- 1 postgres dba 3142497 Jan 07 04:20 srsa_2005-01-06_22:04:47.log -rw------- 1 postgres dba 2719731 Jan 07 04:20 srsa_2005-01-06_22:04:47.log -rw------- 1 postgres dba 2323728 Jan 07 04:20 srsa_2005-01-06_22:04:47.log dbserv[/opt/log/DOTA/slony1]$ cd ../../DOTB/slony1 dbserv[/opt/log/DOTB/slony1]$ for i in node*; do cd $i; ls -alt | head | grep "4\:20"; cd ..; done -rw------- 1 postgres dba 5100346 Jan 07 04:20 srsb_2005-01-06_22:04:47.log -rw------- 1 postgres dba 1365573 Jan 07 04:20 srsb_2005-01-07_03:31:25.log -rw------- 1 postgres dba 1394114 Jan 07 04:20 srsb_2005-01-07_03:30:34.log -rw------- 1 postgres dba 1500946 Jan 07 04:20 srsb_2005-01-07_03:27:59.log dbserv[/opt/log/DOTB/slony1]$ cd ../../LOGDBS/slony1 dbserv[/opt/log/LOGDBS/slony1]$ for i in node*; do cd $i; ls -alt | head | grep "4\:20"; cd ..; done -rw------- 1 postgres dba 4670613 Jan 07 04:20 blogs_2005-01-07_00:34:54.log None of them contain any out-of-the-ordinary messages; the logs for origin nodes have the usual series of confirmations. Subscribers have messages about SYNCs coming through, and about the successful processing of syncs. They all look as though everything is working just fine until, at 04:20 UTC, they got caught up in the "Carbonite Maneuver" (to combine two of the better things presented on Star Trek and Star Wars!) in some sort of state of frozenness. The slon processes were present; my first theory would have been, "oh, dear, their parent process died, so they got thrown out." But a ps command found that they were still (seemingly) running. The database instances were examined for locks that might have been causing some form of DBMS deadlock; no relevant locks were found. After some peering around, the slon processes were killed and restarted, and replication continued apace, no seeming problems continuing. (Another situation came up later, but it was quite another matter...) This just seems ... odd. As is visible in the log file names, most of them were started at very nearly the same time. All of the slon processes were started by a central "watchdog" at about the same moment, so the similarity in the times that these all stopped seems unlikely to be a pure coincidence. Mind you, there were several slons that _didn't_ hang up, so it's not true to say that they _all_ went down. The best theory outstanding is that perhaps there's some subtlety in thread handling (which goes without saying!), and that since these processes all had similar quantities of Slony-I event data go through them, there might be some filling up of some limited resource based on that. But that's at best a bit of gossip at this point. I'm throwing out this "bit of gossip" on the offchance that anyone else has seen a similar condition take place, where slons just freeze. If there's a bug involved, it'll take some keen observation to track it down. -- "cbbrowne","@","ca.afilias.info" <http://dev6.int.libertyrms.com/> Christopher Browne (416) 673-4124 (land)
- Previous message: [Slony1-general] Schema-Update
- Next message: [Slony1-general] slons ... stop ...
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
More information about the Slony1-general mailing list