Brian A. Seklecki lavalamp at spiritual-machines.org
Mon Apr 13 07:47:22 PDT 2009
Bug #81 Filed:

http://www.slony.info/bugzilla/show_bug.cgi?id=81

~BAS

On Fri, 27 Feb 2009, Bill Moran wrote:

>
> I'm seeing a problem similar to that described in this thread:
> http://lists.slony.info/pipermail/slony1-general/2008-August/008492.html
>
> However, I was unable to isolate the cause.  The solution was to restart
> the slons on all nodes, which unstuck things and all tests show that
> the cluster is replicating happily again.
>
> I'm trying to determine how/why the problem happened.  In this case it
> was triggered by moving an origin around (as part of a routine disaster
> drill).  The set moved to the backup without problem, but we didn't notice
> that the slon on node 2 had died until we went to move things back.  At
> that point, the move set hung, and eventually bailed with "timeout exceeded
> while waiting for event confirmation"
>
> If anyone has any insight into this, I'm all ears.  Otherwise, I can set
> up a test bed if there's anything I can do to provide more information.
>
> Some basic details:
> FreeBSD 6.3
> PostgreSQL 8.3.5
> Slony 1.2.15
>
> Cluster =
> node 1: masterdb in facil0
> node 2: localslavedb in facil0
> node 3: remoteslavedb in facil1
>
> 2009 Feb 27 06:31:43 -05:00 localslavedb [slon][55583] [local2] [err] slon[55583]: [8393-1] [55583] CONFIG moveSet: set_id=1 old_origin=1 new_origin=3
> 2009 Feb 27 06:31:43 -05:00 localslavedb [slon][55583] [local2] [debug] slon[55583]: [8394-1] [55583] DEBUG1 remoteWorkerThread_3: helper thread for provider 3 created
> 2009 Feb 27 06:31:43 -05:00 localslavedb [slon][55583] [local2] [err] slon[55583]: [8395-1] [55583] CONFIG storeListen: li_origin=1 li_receiver=2 li_provider=1
> 2009 Feb 27 06:31:43 -05:00 localslavedb [slon][55583] [local2] [err] slon[55583]: [8396-1] [55583] CONFIG storeListen: li_origin=1 li_receiver=2 li_provider=3
> 2009 Feb 27 06:31:43 -05:00 localslavedb [slon][55583] [local2] [err] slon[55583]: [8397-1] [55583] CONFIG storeListen: li_origin=3 li_receiver=2 li_provider=3
> 2009 Feb 27 06:31:43 -05:00 localslavedb [slon][55583] [local2] [debug] slon[55583]: [8398-1] [55583] DEBUG1 remoteWorkerThread_1: helper thread for provider 1 terminated
> 2009 Feb 27 06:31:43 -05:00 localslavedb [slon][55583] [local2] [debug] slon[55583]: [8399-1] [55583] DEBUG1 remoteWorkerThread_1: disconnecting from data provider 1
> 2009 Feb 27 06:31:52 -05:00 localslavedb [slon][55583] [local2] [debug] slon[55583]: [8400-1] [55583] DEBUG1 remoteWorkerThread_3: connected to data provider 3 on 'dbname=database_name_prod host=remoteslavedb-ifx.bdb.facil1
> 2009 Feb 27 06:31:52 -05:00 localslavedb [slon][55583] [local2] [debug] slon[55583]: [8400-2] user=database_name_slony password=[redacted]'
> 2009 Feb 27 06:32:01 -05:00 localslavedb [slon][55583] [local2] [info] slon[55583]: [8401-1] [55583] INFO localListenThread: got restart notification
> 2009 Feb 27 06:32:01 -05:00 localslavedb [slon][1058] [local2] [debug] slon[1058]: [2-1] [1058] DEBUG1 slon: retry requested
> 2009 Feb 27 06:32:01 -05:00 localslavedb [slon][55583] [local2] [debug] slon[55583]: [8402-1] [55583] DEBUG1 main: scheduler mainloop returned
> 2009 Feb 27 06:32:01 -05:00 localslavedb [slon][55583] [local2] [info] slon[55583]: [8403-1] [55583] INFO remoteListenThread_3: disconnecting from 'dbname=database_name_prod host=remoteslavedb-ifx.bdb.facil1
> 2009 Feb 27 06:32:01 -05:00 localslavedb [slon][55583] [local2] [info] slon[55583]: [8403-2] user=database_name_slony password=[redacted]'
> 2009 Feb 27 06:32:01 -05:00 localslavedb [slon][55583] [local2] [info] slon[55583]: [8404-1] [55583] INFO remoteListenThread_1: disconnecting from 'dbname=database_name_prod host=masterdb-db.cluster00.facil0
> 2009 Feb 27 06:32:01 -05:00 localslavedb [slon][55583] [local2] [info] slon[55583]: [8404-2] user=database_name_slony password=[redacted]'
> 2009 Feb 27 06:32:01 -05:00 localslavedb [slon][55583] [local2] [debug] slon[55583]: [8405-1] [55583] DEBUG1 syncThread: thread done
> 2009 Feb 27 06:32:01 -05:00 localslavedb [slon][55583] [local2] [debug] slon[55583]: [8406-1] [55583] DEBUG1 cleanupThread: thread done
> 2009 Feb 27 06:32:01 -05:00 localslavedb [slon][55583] [local2] [debug] slon[55583]: [8407-1] [55583] DEBUG1 remoteWorkerThread_1: thread done
> 2009 Feb 27 06:32:02 -05:00 localslavedb [slon][55583] [local2] [debug] slon[55583]: [8408-1] [55583] DEBUG1 remoteListenThread_3: thread done
> 2009 Feb 27 06:32:04 -05:00 localslavedb [slon][55583] [local2] [debug] slon[55583]: [8409-1] [55583] DEBUG1 remoteListenThread_1: thread done
> 2009 Feb 27 06:32:04 -05:00 localslavedb [slon][55583] [local2] [debug] slon[55583]: [8410-1] [55583] DEBUG1 main: done
> 2009 Feb 27 06:32:04 -05:00 localslavedb [slon][1058] [local2] [debug] slon[1058]: [3-1] [1058] DEBUG1 slon: restart of worker
> 2009 Feb 27 06:32:04 -05:00 localslavedb [slon][1058] [local2] [err] slon[1058]: [1-1] [1058] CONFIG main: slon version 1.2.15 starting up
> 2009 Feb 27 06:32:04 -05:00 localslavedb [slon][27776] [local2] [err] slon[27776]: [2-1] [27776] CONFIG main: local node id = 2
> 2009 Feb 27 06:32:04 -05:00 localslavedb [slon][27776] [local2] [err] slon[27776]: [3-1] [27776] CONFIG main: launching sched_start_mainloop
> 2009 Feb 27 06:32:04 -05:00 localslavedb [slon][27776] [local2] [err] slon[27776]: [4-1] [27776] CONFIG main: loading current cluster configuration
> 2009 Feb 27 06:32:04 -05:00 localslavedb [slon][27776] [local2] [err] slon[27776]: [5-1] [27776] CONFIG storeNode: no_id=1 no_comment='Master node'
> 2009 Feb 27 06:32:04 -05:00 localslavedb [slon][27776] [local2] [err] slon[27776]: [6-1] [27776] CONFIG storeNode: no_id=3 no_comment='remote slave node'
> 2009 Feb 27 06:32:04 -05:00 localslavedb [slon][27776] [local2] [err] slon[27776]: [7-1] [27776] CONFIG storePath: pa_server=1 pa_client=2 pa_conninfo="dbname=database_name_prod host=masterdb-db.cluster00.facil0
> 2009 Feb 27 06:32:04 -05:00 localslavedb [slon][27776] [local2] [err] slon[27776]: [7-2] user=database_name_slony password=[redacted]" pa_connretry=10
> 2009 Feb 27 06:32:04 -05:00 localslavedb [slon][27776] [local2] [err] slon[27776]: [8-1] [27776] CONFIG storePath: pa_server=3 pa_client=2 pa_conninfo="dbname=database_name_prod host=remoteslavedb-ifx.bdb.facil1
> 2009 Feb 27 06:32:04 -05:00 localslavedb [slon][27776] [local2] [err] slon[27776]: [8-2] user=database_name_slony password=[redacted]" pa_connretry=10
> 2009 Feb 27 06:32:04 -05:00 localslavedb [slon][27776] [local2] [err] slon[27776]: [9-1] [27776] CONFIG storeListen: li_origin=1 li_receiver=2 li_provider=1
> 2009 Feb 27 06:32:04 -05:00 localslavedb [slon][27776] [local2] [err] slon[27776]: [10-1] [27776] CONFIG storeListen: li_origin=1 li_receiver=2 li_provider=3
> 2009 Feb 27 06:32:04 -05:00 localslavedb [slon][27776] [local2] [err] slon[27776]: [11-1] [27776] CONFIG storeListen: li_origin=3 li_receiver=2 li_provider=3
> 2009 Feb 27 06:32:04 -05:00 localslavedb [slon][27776] [local2] [err] slon[27776]: [12-1] [27776] CONFIG storeSet: set_id=1 set_origin=3 set_comment='All tables'
> 2009 Feb 27 06:32:04 -05:00 localslavedb [slon][27776] [local2] [warning] slon[27776]: [13-1] [27776] WARN remoteWorker_wakeup: node 3 - no worker thread
> 2009 Feb 27 06:32:04 -05:00 localslavedb [slon][27776] [local2] [err] slon[27776]: [14-1] [27776] CONFIG storeSubscribe: sub_set=1 sub_provider=3 sub_forward='t'
> 2009 Feb 27 06:32:04 -05:00 localslavedb [slon][27776] [local2] [warning] slon[27776]: [15-1] [27776] WARN remoteWorker_wakeup: node 3 - no worker thread
> 2009 Feb 27 06:32:04 -05:00 localslavedb [slon][27776] [local2] [err] slon[27776]: [16-1] [27776] CONFIG enableSubscription: sub_set=1
> 2009 Feb 27 06:32:04 -05:00 localslavedb [slon][27776] [local2] [warning] slon[27776]: [17-1] [27776] WARN remoteWorker_wakeup: node 3 - no worker thread
> 2009 Feb 27 06:32:04 -05:00 localslavedb [slon][27776] [local2] [err] slon[27776]: [18-1] [27776] CONFIG main: configuration complete - starting threads
> 2009 Feb 27 06:32:04 -05:00 localslavedb [slon][27776] [local2] [debug] slon[27776]: [19-1] [27776] DEBUG1 localListenThread: thread starts
> 2009 Feb 27 06:32:04 -05:00 localslavedb [slon][27776] [local2] [err] slon[27776]: [20-1] [27776] FATAL localListenThread: "select "_database_name_prod".cleanupNodelock(); insert into
> 2009 Feb 27 06:32:04 -05:00 localslavedb [slon][27776] [local2] [err] slon[27776]: [20-2] "_database_name_prod".sl_nodelock values ( 2, 0, "pg_catalog".pg_backend_pid()); " - ERROR: duplicate key value violates
> 2009 Feb 27 06:32:04 -05:00 localslavedb [slon][27776] [local2] [err] slon[27776]: [20-3] unique constraint "sl_nodelock-pkey"
> 2009 Feb 27 06:32:04 -05:00 localslavedb [slon][1058] [local2] [debug] slon[1058]: [2-1] [1058] DEBUG1 slon: shutdown requested
> 2009 Feb 27 06:32:24 -05:00 localslavedb [slon][1058] [local2] [debug] slon[1058]: [3-1] [1058] DEBUG1 slon: child termination timeout - kill child
> 2009 Feb 27 06:32:24 -05:00 localslavedb [slon][1058] [local2] [debug] slon[1058]: [4-1] [1058] DEBUG1 slon: done
>
> --
> Bill Moran
> http://www.potentialtech.com
> http://people.collaborativefusion.com/~wmoran/
> _______________________________________________
> Slony1-general mailing list
> Slony1-general at lists.slony.info
> http://lists.slony.info/mailman/listinfo/slony1-general
>

l8*
	-lava (Brian A. Seklecki - Pittsburgh, PA, USA)
	       http://www.spiritual-machines.org/

    "Show me a young conservative and I'll show you someone with no heart.
    Show me an old liberal and I'll show you someone with no brains."
    ~ Winston Churchill


More information about the Slony1-general mailing list