Mikhail Kolesnik mike at openbunker.org
Thu Feb 14 02:40:58 PST 2008
Hello.

I am using the following configuration: 2 slaves, directly subscribed
to one master. The problem sometimes shows up during controlled
failover with 'move set' (doing it as described in [1]). When doing
move set from one node to another the third node sometimes does not
follow the configuration changes. It keeps complaining about: 
'ACCEPT_SET - MOVE_SET or FAILOVER_SET not received yet - sleep'
and still points to the wrong origin. This happens nearly in every
third case.

Sending 'restart node 3' or killing slon processes directly helps
_sometimes_.

I've played with 'sync' and 'wait for event' a bit in different places
to make the third node receive that 'move set', but with no luck. 

Script-generated input for slon, which is used to setup, and failover
is attached. Set is a sample pgbench database with one additional
explicit primary key added.

Any suggestions? What can trigger such a behavior?

Here is a piece of log with successful 'move set':
2008-02-14 09:45:21 EET DEBUG2 remoteWorkerThread_2: SYNC 134 done in 0.018 seconds
2008-02-14 09:45:21 EET DEBUG2 remoteWorkerThread_2: Received event 2,135 ACCEPT_SET
2008-02-14 09:45:21 EET DEBUG2 start processing ACCEPT_SET
2008-02-14 09:45:21 EET DEBUG2 ACCEPT: set=1
2008-02-14 09:45:21 EET DEBUG2 ACCEPT: old origin=1
2008-02-14 09:45:21 EET DEBUG2 ACCEPT: new origin=2
2008-02-14 09:45:21 EET DEBUG2 ACCEPT: move set seq=147
2008-02-14 09:45:21 EET DEBUG2 got parms ACCEPT_SET
2008-02-14 09:45:21 EET DEBUG2 ACCEPT_SET - node not origin
2008-02-14 09:45:21 EET DEBUG2 ACCEPT_SET - MOVE_SET or FAILOVER_SET exists - adjusting setsync status
2008-02-14 09:45:21 EET DEBUG2 ACCEPT_SET - done
2008-02-14 09:45:21 EET DEBUG2 slon_retry() from pid=18367
2008-02-14 09:45:21 EET DEBUG1 slon: retry requested
2008-02-14 09:45:21 EET DEBUG2 slon: notify worker process to shutdown
2008-02-14 09:45:21 EET INFO   remoteListenThread_2: disconnecting from 'dbname=pgb host=192.168.0.202 port=5432 user=slony password=sln'
2008-02-14 09:45:21 EET DEBUG1 localListenThread: thread done
2008-02-14 09:45:21 EET DEBUG1 remoteListenThread_2: thread done
2008-02-14 09:45:21 EET INFO   remoteListenThread_1: disconnecting from 'dbname=pgb host=192.168.0.201 port=5432 user=slony password=sln'
2008-02-14 09:45:21 EET DEBUG1 cleanupThread: thread done
2008-02-14 09:45:21 EET DEBUG1 remoteListenThread_1: thread done
2008-02-14 09:45:21 EET DEBUG1 syncThread: thread done
2008-02-14 09:45:21 EET DEBUG1 main: scheduler mainloop returned
2008-02-14 09:45:21 EET DEBUG2 main: wait for remote threads

Here is unsuccessful one:
2008-02-14 09:58:24 EET DEBUG2 remoteWorkerThread_2: SYNC 11 processing
2008-02-14 09:58:24 EET DEBUG2 remoteWorkerThread_2: no sets need syncing for this event
2008-02-14 09:58:24 EET DEBUG2 remoteWorkerThread_2: Received event 2,12 ACCEPT_SET
2008-02-14 09:58:24 EET DEBUG2 start processing ACCEPT_SET
2008-02-14 09:58:24 EET DEBUG2 ACCEPT: set=1
2008-02-14 09:58:24 EET DEBUG2 ACCEPT: old origin=1
2008-02-14 09:58:24 EET DEBUG2 ACCEPT: new origin=2
2008-02-14 09:58:24 EET DEBUG2 ACCEPT: move set seq=24
2008-02-14 09:58:24 EET DEBUG2 got parms ACCEPT_SET
2008-02-14 09:58:24 EET DEBUG2 ACCEPT_SET - node not origin
2008-02-14 09:58:24 EET DEBUG2 ACCEPT_SET - MOVE_SET or FAILOVER_SET not received yet - sleep
2008-02-14 09:58:26 EET DEBUG2 remoteListenThread_1: queue event 1,23 SYNC
2008-02-14 09:58:26 EET DEBUG2 remoteWorkerThread_1: Received event 1,23 SYNC
2008-02-14 09:58:26 EET DEBUG2 remoteListenThread_1: queue event 1,24 MOVE_SET
2008-02-14 09:58:26 EET DEBUG2 remoteListenThread_1: queue event 1,25 SYNC
2008-02-14 09:58:26 EET DEBUG2 calc sync size - last time: 1 last length: 2001 ideal: 29 proposed size: 3
2008-02-14 09:58:26 EET DEBUG2 remoteWorkerThread_1: SYNC 23 processing
2008-02-14 09:58:26 EET DEBUG2 remoteWorkerThread_1: syncing set 1 with 4 table(s) from provider 1
2008-02-14 09:58:26 EET DEBUG4  ssy_action_list value:
2008-02-14 09:58:26 EET DEBUG2  ssy_action_list length: 0
2008-02-14 09:58:26 EET DEBUG2 remoteWorkerThread_1: current local log_status is 0
2008-02-14 09:58:26 EET DEBUG3 remoteWorkerThread_1: activate helper 1
2008-02-14 09:58:26 EET DEBUG4 remoteWorkerThread_1: waiting for log data

More logs attached.

Slony-I 1.2.13
postgresql 8.2.6
# gcc -v
Using built-in specs.
Target: i686-pc-linux-gnu
Configured with: ../gcc-4.1.2/configure --prefix=/usr
--libexecdir=/usr/lib --enable-languages=c,c++,objc
--enable-threads=posix --enable-__cxa_atexit --enable-clocale=gnu
--enable-shared --disable-nls --with-x=no Thread model: posix gcc
version 4.1.2 (CRUX)

[1] http://slony.info/adminguide/current/doc/adminguide/failover.html

-- 
Mikhail Kolesnik
ICQ: 260259143
IRC: mike_k at freenode/#crux, rusnet/#yalta
-------------- next part --------------
A non-text attachment was scrubbed...
Name: move_set_FAIL.log
Type: text/x-log
Size: 4877 bytes
Desc: not available
Url : http://lists.slony.info/pipermail/slony1-general/attachments/20080214/f0c5596f/move_set_FAIL-0001.bin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: move_set_OK.log
Type: text/x-log
Size: 14054 bytes
Desc: not available
Url : http://lists.slony.info/pipermail/slony1-general/attachments/20080214/f0c5596f/move_set_OK-0001.bin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: initial_setup.slonik
Type: application/octet-stream
Size: 2228 bytes
Desc: not available
Url : http://lists.slony.info/pipermail/slony1-general/attachments/20080214/f0c5596f/initial_setup-0001.obj
-------------- next part --------------
A non-text attachment was scrubbed...
Name: controlled_failover.slonik
Type: application/octet-stream
Size: 169 bytes
Desc: not available
Url : http://lists.slony.info/pipermail/slony1-general/attachments/20080214/f0c5596f/controlled_failover-0001.obj


More information about the Slony1-general mailing list