A slon exited with:

"select "_disorder_replica".cleanupNodelock(); insert into "_disorder_replica".sl_nodelock values ( 4, 0, "pg_catalog".pg_backend_pid()); " - ERROR: duplicate key value violates unique constraint "sl_nodelock-pkey"

What appears to have happened is that, following a move set, the slon called slon_retry(), exited, and was restarted. The restarted slon did some processing and eventually encountered the above error. At this stage I can't say whether the bug is:

1) The first slon instance exited without cleaning up
2) The second slon instance should have cleaned up after the first (check whether the backend is still alive and, if not, delete the row)
3) Something else

2010-06-01 10:34:17,433 [db4 stdout] DEBUG info.slony.clustertest.testcoordinator.slony.SlonLauncher db4 - 2010-06-01 10:34:17 EDTCONFIG enableSubscription: sub_set=2
2010-06-01 10:34:17,433 [db4 stdout] DEBUG info.slony.clustertest.testcoordinator.slony.SlonLauncher db4 - subscription is complete:2010-06-01 10:34:17 EDTCONFIG enableSubscription: sub_set=2
2010-06-01 10:34:17,433 [db4 stdout] DEBUG info.slony.clustertest.testcoordinator.slony.SlonLauncher db4 - 2010-06-01 10:34:17 EDTWARN remoteWorker_wakeup: node 3 - no worker thread
2010-06-01 10:34:17,433 [db4 stdout] DEBUG info.slony.clustertest.testcoordinator.slony.SlonLauncher db4 - 2010-06-01 10:34:17 EDTDEBUG2 sched_wakeup_node(): no_id=3 (0 threads + worker signaled)
2010-06-01 10:34:17,433 [db4 stdout] DEBUG info.slony.clustertest.testcoordinator.slony.SlonLauncher db4 - 2010-06-01 10:34:17 EDTCONFIG main: last local event sequence = 5000000002
2010-06-01 10:34:17,433 [db4 stdout] DEBUG info.slony.clustertest.testcoordinator.slony.SlonLauncher db4 - 2010-06-01 10:34:17 EDTCONFIG main: configuration complete - starting threads
2010-06-01 10:34:17,433 [db4 stdout] DEBUG info.slony.clustertest.testcoordinator.slony.SlonLauncher db4 - 2010-06-01 10:34:17 EDTINFO localListenThread: thread starts
2010-06-01 10:34:17,433 [db4 stdout] DEBUG info.slony.clustertest.testcoordinator.slony.SlonLauncher db4 - 2010-06-01 10:34:17 EDTCONFIG version for "dbname=test4 host=localhost user=slony password=slony" is 80309
2010-06-01 10:34:17,433 [db4 stdout] DEBUG info.slony.clustertest.testcoordinator.slony.SlonLauncher db4 - 2010-06-01 10:34:17 EDTDEBUG1 local_listen "dbname=test4 host=localhost user=slony password=slony": backend pid = 4464
2010-06-01 10:34:17,433 [db4 stdout] DEBUG info.slony.clustertest.testcoordinator.slony.SlonLauncher db4 - 2010-06-01 10:34:17 EDTCONFIG slon: worker process created - pid = 4460
2010-06-01 10:34:17,447 [db4 stdout] DEBUG info.slony.clustertest.testcoordinator.slony.SlonLauncher db4 - 2010-06-01 10:34:17 EDTFATAL localListenThread: "select "_disorder_replica".cleanupNodelock(); insert into "_disorder_replica".sl_nodelock values ( 4, 0, "pg_catalog".pg_backend_pid()); " - ERROR: duplicate key value violates unique constraint "sl_nodelock-pkey"
2010-06-01 10:34:17,447 [db4 stdout] DEBUG info.slony.clustertest.testcoordinator.slony.SlonLauncher db4 -
2010-06-01 10:34:17,447 [db4 stdout] DEBUG info.slony.clustertest.testcoordinator.slony.SlonLauncher db4 - 2010-06-01 10:34:17 EDTDEBUG2 slon_abort() from pid=4460
2010-06-01 10:34:17,447 [db4 stdout] DEBUG info.slony.clustertest.testcoordinator.slony.SlonLauncher db4 - 2010-06-01 10:34:17 EDTINFO slon: shutdown requested
2010-06-01 10:34:17,447 [db4 stdout] DEBUG info.slony.clustertest.testcoordinator.slony.SlonLauncher db4 - 2010-06-01 10:34:17 EDTINFO slon: notify worker process to shutdown
2010-06-01 10:34:17,448 [db4stderr] ERROR info.slony.clustertest.testcoordinator.slony.SlonLauncher db4 - NOTICE: Slony-I: cleanup stale sl_nodelock entry for pid=4072
2010-06-01 10:34:37,450 [db4 stdout] DEBUG info.slony.clustertest.testcoordinator.slony.SlonLauncher db4 - 2010-06-01 10:34:37 EDTINFO slon: child termination timeout - kill child
2010-06-01 10:34:37,450 [db4 stdout] DEBUG info.slony.clustertest.testcoordinator.slony.SlonLauncher db4 - 2010-06-01 10:34:37 EDTCONFIG slon: child terminated status: 9; pid: 4460, current worker pid: 4460
2010-06-01 10:34:37,450 [db4 stdout] INFO info.slony.clustertest.testcoordinator.slony.SlonLauncher db4 - abnormal termination of slon2010-06-01 10:34:37 EDTCONFIG slon: child terminated status: 9; pid: 4460, current worker pid: 4460
2010-06-01 10:34:37,450 [db4 stdout] DEBUG info.slony.clustertest.testcoordinator.slony.SlonLauncher db4 - 2010-06-01 10:34:37 EDTINFO slon: done
2010-06-01 10:34:17,412 [db4 stdout] DEBUG info.slony.clustertest.testcoordinator.slony.SlonLauncher db4 - 2010-06-01 10:34:17 EDTCONFIG remoteWorkerThread_5: update provider configuration
2010-06-01 10:34:17,412 [db4 stdout] DEBUG info.slony.clustertest.testcoordinator.slony.SlonLauncher db4 - 2010-06-01 10:34:17 EDTINFO remoteWorkerThread_5: thread done
2010-06-01 10:34:17,412 [db4 stdout] DEBUG info.slony.clustertest.testcoordinator.slony.SlonLauncher db4 - 2010-06-01 10:34:17 EDTDEBUG1 cleanupThread: thread done
2010-06-01 10:34:17,412 [db4 stdout] DEBUG info.slony.clustertest.testcoordinator.slony.SlonLauncher db4 - 2010-06-01 10:34:17 EDTCONFIG main: done
2010-06-01 10:34:17,412 [db4 stdout] DEBUG info.slony.clustertest.testcoordinator.slony.SlonLauncher db4 - 2010-06-01 10:34:17 EDTCONFIG slon: child terminated status: 0; pid: 3974, current worker pid: 3974
2010-06-01 10:34:17,413 [db4 stdout] INFO info.slony.clustertest.testcoordinator.slony.SlonLauncher db4 - abnormal termination of slon2010-06-01 10:34:17 EDTCONFIG slon: child terminated status: 0; pid: 3974, current worker pid: 3974
2010-06-01 10:34:17,413 [db4 stdout] DEBUG info.slony.clustertest.testcoordinator.slony.SlonLauncher db4 - 2010-06-01 10:34:17 EDTCONFIG slon: restart of worker
2010-06-01 10:34:17,413 [db4 stdout] DEBUG info.slony.clustertest.testcoordinator.slony.SlonLauncher db4 - 2010-06-01 10:34:17 EDTCONFIG main: slon version 2.0.3 starting up
2010-06-01 10:34:17,413 [db4 stdout] DEBUG info.slony.clustertest.testcoordinator.slony.SlonLauncher db4 - 2010-06-01 10:34:17 EDTINFO slon: watchdog process started
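The failing statement above pairs a cleanup call with an insert: cleanupNodelock() only removes a row when the backend pid it records is no longer alive, and the insert then collides if a stale row survived. A minimal Python sketch of that shape, as an illustration only (the real cleanupNodelock() is a stored procedure operating on sl_nodelock; the in-memory table and function names here are assumptions):

```python
import os

# In-memory stand-in for the sl_nodelock table:
# (nl_nodeid, nl_conncnt) -> nl_backendpid
nodelock = {}

def pid_is_alive(pid):
    """Probe a pid with signal 0: checks for existence without sending a signal."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # process exists but belongs to another user
    return True

def cleanup_nodelock():
    """Sketch of cleanupNodelock(): drop rows whose recorded backend pid is gone."""
    for key, pid in list(nodelock.items()):
        if not pid_is_alive(pid):
            del nodelock[key]

def acquire_nodelock(node_id, conncnt, backend_pid):
    """Mimics the failing statement: clean up stale rows, then insert our own."""
    cleanup_nodelock()
    key = (node_id, conncnt)
    if key in nodelock:
        # Analogue of the duplicate-key violation on sl_nodelock's primary
        # key: the old backend was still alive, so its row survived cleanup.
        raise RuntimeError("duplicate key value violates unique constraint")
    nodelock[key] = backend_pid
```

If the old backend is still winding down when the restarted slon runs this sequence, the stale row survives the cleanup and the insert collides, which matches the FATAL in the log above.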
I've seen this happen more often now. I think the following is happening:

- A slon_retry() call is made following the move set
- slon_retry signals the parent slon process, which in turn sends a kill to the child
- The child exits; I can find no 'cleanup' step that the child runs on a kill to ensure the PostgreSQL connections are closed or to remove entries from sl_nodelock
- The child is restarted by the parent
- The child calls cleanupNodelock(), which checks whether the backend pid registered in sl_nodelock is still around. I think sometimes the old backend process is still around (hasn't yet exited), maybe because it is in the middle of a query and hasn't yet noticed that the slon it was talking to has gone away
- Since the backend process is still around, the row isn't deleted from sl_nodelock, causing the insert into sl_nodelock to fail

Since this has happened more than once, and it causes the watchdog to exit as well, I am bumping the priority.

Options to fix this include:

1) Having the slon worker exit properly and remove itself from the sl_nodelock table before exiting
2) Increasing the 'sleep' time before restarting the child. This doesn't really fix the problem; it just makes it less likely
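Option 1 amounts to the worker installing a termination hook so its sl_nodelock row is deleted on the way out. A hedged Python sketch of that shape (the real slon is C and would do this from its signal/exit path; delete_my_nodelock is a hypothetical stand-in for the SQL delete):

```python
import atexit
import signal
import sys

def delete_my_nodelock():
    # Hypothetical stand-in for something like:
    #   delete from "_disorder_replica".sl_nodelock
    #    where nl_backendpid = "pg_catalog".pg_backend_pid();
    print("nodelock row removed")

def on_sigterm(signum, frame):
    # Turn the parent's kill into an orderly exit so the
    # atexit cleanup below runs before the process goes away.
    sys.exit(0)

atexit.register(delete_my_nodelock)
signal.signal(signal.SIGTERM, on_sigterm)
```

Note that the watchdog's fallback in the log above is an uncatchable SIGKILL ("child termination timeout - kill child", status 9), which no handler can intercept, so the restart-side cleanupNodelock() liveness check is still needed as a backstop.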
This has been committed to REL_2_0_STABLE and master