Jan Wieck JanWieck at Yahoo.com
Fri May 22 10:03:44 PDT 2009
On 5/22/2009 2:33 PM, Brad Nicholson wrote:
> On Fri, 2009-05-22 at 12:49 -0400, Brian A. Seklecki wrote:
>> All:
>> 
>> So this problem with slon(8) daemons continues to vex us.  During a
>> switchover, we see "No Worker Thread" errors:
>> 
>>  2009 May 22 06:37:17 -04:00 bdb01 [slon][55352] [local2] [err] 
>>  slon[55352]: [12-1] [55352] CONFIG storeSet: set_id=1 set_origin=3
>>  set_comment='All CORES tables'
>>  2009 May 22 06:37:17 -04:00 bdb01 [slon][55352] [local2] [warning]
>>  slon[55352]: [13-1] [55352] WARN   remoteWorker_wakeup: node 3 - no
>>  worker thread
>> 
>> Followed by:
>> 
>> 
>>  2009 May 22 06:37:17 -04:00 bdb01 [slon][55352] [local2] [err]
>>  slon[55352]: [19-1] [55352] FATAL  localListenThread: "select 
>>  "_DBNAME".cleanupNodelock(); insert into
>>  2009 May 22 06:37:17 -04:00 bdb01 [slon][55352] [local2] [err]
>>  slon[55352]: [19-2]  "_DBNAME".sl_nodelock values (   
>>  2, 0, "pg_catalog".pg_backend_pid()); " - ERROR:  duplicate key value
>>  violates
>> 
>> The screwed up thing is that, as far as we know, all three slon(8)
>> daemons on all three configurations are active, healthy, and responding
>> before we execute the switchover.
>> 
>> We know because we have Nagios watching SYNC events and checking that
>> sl_log table row counts are within acceptable ranges.
>> 
>> Any advice on troubleshooting this further?  Maybe attach a ktrace(8)
>> to the process and try to re-create the error.
>> 
>> We're running the latest Slony/PostgreSQL (postgresql-server-8.3.7 +
>> slony1-1.2.15) on FBSD6/amd64.
>> 
>> ~BAS
> 
> This looks like the same issue that one of our guys was trying to figure
> out.
> 
> Restarting the slon lets the failover proceed, but it sort of sucks
> that you have to do that.
> 

It appears to me this is a race condition. During the switchover, the 
slon processes try to restart.

The call to cleanupNodeLock() is supposed to remove the stale entry. If 
memory serves, cleanupNodeLock() checks whether the corresponding backend 
still exists via kill(backendpid, 0).
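For what it's worth, here is a minimal, hypothetical sketch of that 
liveness test (not the actual Slony source): kill() with signal 0 
delivers no signal at all, it only performs the existence and permission 
checks, so it can tell whether a pid is still around.

```c
/* Hypothetical sketch of the liveness test described above -- not the
 * actual Slony source.  kill(pid, 0) delivers no signal; it only
 * performs the existence and permission checks. */
#include <errno.h>
#include <signal.h>
#include <stdbool.h>
#include <sys/types.h>

static bool
backend_is_alive(pid_t backendpid)
{
    if (kill(backendpid, 0) == 0)
        return true;            /* process exists and we may signal it */
    return errno == EPERM;      /* EPERM: exists but not ours; ESRCH: gone */
}
```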

What I think happens is that the slon process is instructed to restart, 
so it drops the connection, instantaneously restarts, reconnects and 
tries to gain the node lock. But this may happen before the backend from 
the old connection has had time to terminate and before the postmaster 
has reaped its exit status via wait(). So kill(backendpid, 0) reports 
(correctly) that the backend is still alive, and cleanupNodeLock() 
assumes that a valid and active node lock is in place.

I presume the correct way to fix this is not to depend entirely on 
cleanupNodeLock() to remove the lock. Just prior to closing the backend 
connection, the node should delete the lock entry itself.
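In SQL terms the idea would be something like the following, issued by 
the node on its own connection just before disconnecting. The column 
name nl_backendpid is my recollection of the sl_nodelock schema, so 
treat it as an assumption:

```sql
-- Hypothetical sketch: the node removes its own lock row right before
-- closing the connection, instead of relying on cleanupNodelock() to
-- reap it later.  nl_backendpid is assumed from the sl_nodelock schema.
delete from "_DBNAME".sl_nodelock
 where nl_backendpid = "pg_catalog".pg_backend_pid();
```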

Additionally, we may want to introduce a little sleep+retry loop in 
cleanupNodeLock().
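A hedged sketch of what that could look like, with a hypothetical 
try_cleanup() standing in for the real cleanupNodeLock() call (the retry 
count and sleep interval are made up):

```c
/* Hypothetical sleep+retry wrapper -- not the actual Slony code.
 * try_cleanup() stands in for a call to cleanupNodeLock(); it returns
 * true once the stale lock is gone (or was never there). */
#include <stdbool.h>
#include <unistd.h>

#define LOCK_RETRIES  5
#define LOCK_SLEEP_US 200000    /* 200 ms between attempts */

static bool
acquire_node_lock_with_retry(bool (*try_cleanup)(void))
{
    for (int i = 0; i < LOCK_RETRIES; i++)
    {
        if (try_cleanup())
            return true;        /* lock cleaned up; safe to proceed */
        usleep(LOCK_SLEEP_US);  /* give the old backend time to exit */
    }
    return false;               /* old backend never went away */
}

/* For illustration only: a fake cleanup that succeeds on the third try,
 * simulating a backend that takes a moment to terminate. */
static int fake_attempts;
static bool fake_cleanup(void) { return ++fake_attempts >= 3; }
```

Whether 5 x 200 ms is the right budget is obviously a guess; the point 
is just to outlast the postmaster's wait() on the dying backend.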


Jan

-- 
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin


