Fri May 22 10:03:44 PDT 2009
- Previous message: [Slony1-general] sl_nodelock values ("pg_catalog".pg_backend_pid()); " - ERROR: duplicate key value
- Next message: [Slony1-general] cleanup_interval parameter not working
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
On 5/22/2009 2:33 PM, Brad Nicholson wrote:
> On Fri, 2009-05-22 at 12:49 -0400, Brian A. Seklecki wrote:
>> All:
>>
>> So this problem with slon(8) daemons continues to vex us. During a
>> switchover, we see "No Worker Thread" errors:
>>
>> 2009 May 22 06:37:17 -04:00 bdb01 [slon][55352] [local2] [err] slon[55352]:
>>   [12-1] [55352] CONFIG storeSet: set_id=1 set_origin=3 set_comment='All CORES tables'
>> 2009 May 22 06:37:17 -04:00 bdb01 [slon][55352] [local2] [warning] slon[55352]:
>>   [13-1] [55352] WARN remoteWorker_wakeup: node 3 - no worker thread
>>
>> Followed by:
>>
>> 2009 May 22 06:37:17 -04:00 bdb01 [slon][55352] [local2] [err] slon[55352]:
>>   [19-1] [55352] FATAL localListenThread: "select "_DBNAME".cleanupNodelock(); insert into
>> 2009 May 22 06:37:17 -04:00 bdb01 [slon][55352] [local2] [err] slon[55352]:
>>   [19-2] "_DBNAME".sl_nodelock values ( 2, 0, "pg_catalog".pg_backend_pid()); "
>>   - ERROR: duplicate key value violates
>>
>> The screwed-up thing is that, as far as we know, all three slon(8)
>> daemons on all three configurations are active, healthy, and responding
>> before we execute the switchover.
>>
>> We know because we have nagios watching SYNC events and checking that
>> sl_log table row counts are within acceptable ranges.
>>
>> Any advice on further troubleshooting this? Maybe attach a ktrace(8)
>> to the process and try to re-create the error?
>>
>> We're running the latest Slony/PostgreSQL (postgresql-server-8.3.7 +
>> slony1-1.2.15) on FBSD6/amd64.
>>
>> ~BAS
>
> This looks like the same issue that one of our guys was trying to figure
> out.
>
> Restarting the slon lets the failover proceed, but it sort of sucks
> that you have to do that.

It appears to me this is a race condition. During the switchover, the slon
processes try to restart. The call to cleanupNodeLock() is supposed to remove
the stale entry.
If memory serves, cleanupNodeLock() does check whether the corresponding
backend still exists via kill(backendpid, 0). What I think happens is that the
slon process is instructed to restart, so it drops the connection, restarts
immediately, reconnects, and tries to acquire a node lock. But this can happen
faster than the backend from the old connection is able to terminate and the
postmaster is able to collect its exit status via wait(). So kill(backendpid,
0) reports (correctly) that the backend is still alive, and cleanupNodeLock()
assumes there is a valid, active node lock in place.

I presume the correct way to fix this is not to depend entirely on
cleanupNodeLock() to remove the lock. Just prior to closing the backend
connection, the node should delete the lock entry itself. Additionally, we may
want to introduce a small sleep+retry loop in cleanupNodeLock().

Jan

--
Anyone who trades liberty for security deserves neither liberty nor security.
    -- Benjamin Franklin