[Slony1-general] 1.2.14rc still does not appear to handle switchover cleanly

Thu May 1 12:17:26 PDT 2008

As posted by Mikhail Kolesnik in a discussion with Stéphane A. Schildknecht

An issue was possibly introduced in 1.2.12 that causes failover
problems. From the conversation it appears that folks thought it was in
the 1.2.13 branch, but I believe this major issue appeared before
1.2.13.

I've done exhaustive testing, and 1.2.11 allows failover with no
issues. I have a 4-node scheme:

1 masterhost
1 slavehost
2 qslavehosts (query only)

Slony-I 1.2.11 and PostgreSQL 8.2.5

Other than a possible single instance of:
ACCEPT_SET - MOVE_SET or FAILOVER_SET not received yet - sleep

That message is short-lived, and failover works back and forth between
master and slave all day long. I switch over, wait for replication,
switch back, wait for replication, and so on. No issues.
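
For anyone wanting to reproduce the back-and-forth test, here is a
minimal sketch of how the slonik input for one switchover could be
generated. It uses the standard slonik commands lock set, wait for
event and move set. The cluster name cls comes from the slon command
line shown further down; the node ids (1 and 2), set id 1, the conninfo
strings and the helper name switchover_script are assumptions for
illustration, not values taken from my setup scripts.

```python
def switchover_script(cluster, conninfo, set_id, old_origin, new_origin):
    """Build slonik text that moves one set from old_origin to new_origin."""
    lines = [f"cluster name = {cluster};"]
    # One admin conninfo line per node that slonik has to talk to.
    for node_id, dsn in sorted(conninfo.items()):
        lines.append(f"node {node_id} admin conninfo = '{dsn}';")
    lines += [
        # Block writes on the old origin, wait until the new origin has
        # confirmed that event, then hand the set over.
        f"lock set (id = {set_id}, origin = {old_origin});",
        f"wait for event (origin = {old_origin}, confirmed = {new_origin});",
        f"move set (id = {set_id}, old origin = {old_origin}, "
        f"new origin = {new_origin});",
        f"wait for event (origin = {old_origin}, confirmed = {new_origin});",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical node ids and hosts, for illustration only.
CONNINFO = {
    1: "dbname=clsdb host=masterhost user=postgres",
    2: "dbname=clsdb host=slavehost user=postgres",
}

forward = switchover_script("cls", CONNINFO, set_id=1, old_origin=1, new_origin=2)
back = switchover_script("cls", CONNINFO, set_id=1, old_origin=2, new_origin=1)
print(forward)
```

Each generated script would be fed to slonik in turn, with a pause in
between until the subscribers have caught up.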

Now with 1.2.12 and 1.2.14rc the problem shows up. I have not tested
1.2.13 yet, but since it's apparent in 1.2.12 and in 1.2.14rc even with
the "patch/possible fix", I'm guessing the issue is very much in 1.2.13
as well. This is a large issue, as failover and switchover are key
elements in this application.

The symptoms in 1.2.12 and 1.2.14rc are that the qslaves freak the
heck out. We can get failover to work, but we MUST drop the affected
qslave host and re-add it, and when one is doing weekly indexes and has
to end up rebuilding each time, that's an issue.

The qslaves get into this state (ps -ef) during and after a failover.
One can stop slon and the failover will take place, but once you
restart slon the node is instantly in a bad state (2008-05-01 11:54:42
PDT DEBUG2 ACCEPT_SET - MOVE_SET or FAILOVER_SET not received yet -
sleep).

Qslavehost
postgres  6467  1161  0 11:47 ?        00:00:00 postgres: postgres clsdb 10.40.5.243(54273) idle
postgres  6468  6442  0 11:48 pts/0    00:00:00 slon -f /data/pgsql/slon.conf cls dbname=clsdb user=postgres
postgres  6545  6468  0 11:51 pts/0    00:00:00 slon -f /data/pgsql/slon.conf cls dbname=clsdb user=postgres
postgres  6549  1161  0 11:51 ?        00:00:00 postgres: postgres clsdb 10.40.5.250(54310) idle
postgres  6552  1161  0 11:51 ?        00:00:00 postgres: postgres clsdb 10.40.5.250(54311) idle in transaction
postgres  6558  1161  0 11:51 ?        00:00:00 postgres: postgres clsdb 10.40.5.250(54312) LOCK TABLE waiting  <--- this is wrong
postgres  6560  1161  0 11:51 ?        00:00:00 postgres: postgres clsdb 10.40.5.250(54313) idle
postgres  6561  1161  0 11:51 ?        00:00:00 postgres: postgres clsdb 10.40.5.250(54315) idle
postgres  6563  1161  0 11:51 ?        00:00:00 postgres: postgres clsdb 10.40.5.250(54316) idle
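
To spot this state without eyeballing the process list, something like
the following sketch could be run against the ps -ef output on a
qslave. It assumes one process per line (as ps prints it, before any
mail wrapping) and the usual "postgres: user database client(port)
activity" backend title; the function name waiting_backends is made up
for the example.

```python
import re

# Title a PostgreSQL backend sets for itself, as it appears in `ps -ef`:
#   postgres: <user> <database> <client-ip>(<port>) <activity>
BACKEND = re.compile(
    r"^\S+\s+(?P<pid>\d+)\s.*?postgres: (?P<user>\S+) (?P<db>\S+) "
    r"(?P<client>\S+)\((?P<port>\d+)\) (?P<activity>.+)$"
)


def waiting_backends(ps_output):
    """Return (pid, client, activity) for backends blocked on a lock."""
    hits = []
    for line in ps_output.splitlines():
        m = BACKEND.match(line.strip())
        if not m:
            continue  # slon processes and other lines are ignored
        activity = m.group("activity").strip()
        # A backend blocked on a lock has "waiting" at the end of its title.
        if activity.endswith("waiting"):
            hits.append((int(m.group("pid")), m.group("client"), activity))
    return hits


sample = """\
postgres  6468  6442  0 11:48 pts/0    00:00:00 slon -f /data/pgsql/slon.conf cls dbname=clsdb user=postgres
postgres  6552  1161  0 11:51 ?        00:00:00 postgres: postgres clsdb 10.40.5.250(54311) idle in transaction
postgres  6558  1161  0 11:51 ?        00:00:00 postgres: postgres clsdb 10.40.5.250(54312) LOCK TABLE waiting
postgres  6560  1161  0 11:51 ?        00:00:00 postgres: postgres clsdb 10.40.5.250(54313) idle
"""

print(waiting_backends(sample))
```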

The logs show:

2008-05-01 11:54:20 PDT DEBUG2 syncThread: new sl_action_seq 1 - SYNC 54
2008-05-01 11:54:20 PDT DEBUG2 remoteListenThread_1: LISTEN
2008-05-01 11:54:22 PDT DEBUG2 ACCEPT_SET - MOVE_SET or FAILOVER_SET not received yet - sleep
2008-05-01 11:54:25 PDT DEBUG2 remoteListenThread_1: queue event 1,153 SYNC
2008-05-01 11:54:25 PDT DEBUG2 remoteListenThread_1: UNLISTEN
2008-05-01 11:54:27 PDT DEBUG2 remoteListenThread_4: LISTEN
2008-05-01 11:54:30 PDT DEBUG2 syncThread: new sl_action_seq 1 - SYNC 55
2008-05-01 11:54:30 PDT DEBUG2 localListenThread: Received event 3,54 SYNC
2008-05-01 11:54:30 PDT DEBUG2 localListenThread: Received event 3,55 SYNC
2008-05-01 11:54:32 PDT DEBUG2 ACCEPT_SET - MOVE_SET or FAILOVER_SET not received yet - sleep
2008-05-01 11:54:34 PDT DEBUG2 remoteListenThread_2: queue event 2,72 SYNC
2008-05-01 11:54:34 PDT DEBUG2 remoteListenThread_2: UNLISTEN
2008-05-01 11:54:37 PDT DEBUG2 remoteListenThread_4: LISTEN
2008-05-01 11:54:39 PDT DEBUG2 remoteListenThread_2: queue event 2,73 SYNC
2008-05-01 11:54:39 PDT DEBUG2 remoteListenThread_2: UNLISTEN
2008-05-01 11:54:40 PDT DEBUG2 syncThread: new sl_action_seq 1 - SYNC 56
2008-05-01 11:54:40 PDT DEBUG2 remoteListenThread_1: queue event 1,154 SYNC
2008-05-01 11:54:42 PDT DEBUG2 ACCEPT_SET - MOVE_SET or FAILOVER_SET not received yet - sleep
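
In the excerpt above the ACCEPT_SET message comes back every 10
seconds. A rough sketch for telling the short-lived case from the stuck
one is to count the messages in a slon log and measure how long they
span. It assumes unwrapped log lines that begin with a "YYYY-MM-DD
HH:MM:SS" timestamp, as above; the function name accept_set_wait is
invented for the example.

```python
from datetime import datetime

MARKER = "ACCEPT_SET - MOVE_SET or FAILOVER_SET not received yet - sleep"


def accept_set_wait(log_lines):
    """Return (count, seconds between first and last occurrence of MARKER)."""
    stamps = []
    for line in log_lines:
        if MARKER in line:
            # The first 19 characters hold the 'YYYY-MM-DD HH:MM:SS' timestamp.
            stamps.append(datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S"))
    if not stamps:
        return 0, 0.0
    return len(stamps), (stamps[-1] - stamps[0]).total_seconds()


sample = [
    "2008-05-01 11:54:22 PDT DEBUG2 " + MARKER,
    "2008-05-01 11:54:25 PDT DEBUG2 remoteListenThread_1: queue event 1,153 SYNC",
    "2008-05-01 11:54:32 PDT DEBUG2 " + MARKER,
    "2008-05-01 11:54:42 PDT DEBUG2 " + MARKER,
]

count, seconds = accept_set_wait(sample)
print(count, seconds)
```

A single hit is what 1.2.11 shows; a count that keeps growing after
slon is restarted is the bad state described here.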

I have also seen this on occasion:

2008-05-01 11:52:06 PDT DEBUG2 ACCEPT_SET - MOVE_SET or FAILOVER_SET not received yet - sleep
2008-05-01 11:52:07 PDT DEBUG2 syncThread: new sl_action_seq 1 - SYNC 34
2008-05-01 11:52:07 PDT DEBUG4 version for "dbname=clsdb host=devidb04.domain.com user=postgres password=SECURED" is 80205
2008-05-01 11:52:07 PDT ERROR  remoteListenThread_3: db_getLocalNodeId() returned 2 - wrong database?
2008-05-01 11:52:08 PDT DEBUG2 remoteListenThread_1: queue event 1,122 SYNC
2008-05-01 11:52:08 PDT DEBUG2 remoteListenThread_1: UNLISTEN
2008-05-01 11:52:10 PDT DEBUG2 remoteListenThread_2: LISTEN
2008-05-01 11:52:11 PDT DEBUG2 remoteListenThread_2: queue event 3,35 SYNC
2008-05-01 11:52:11 PDT DEBUG2 remoteListenThread_2: UNLISTEN
2008-05-01 11:52:11 PDT DEBUG2 remoteWorkerThread_3: Received event 3,35 SYNC
2008-05-01 11:52:11 PDT DEBUG2 calc sync size - last time: 1 last length: 11025 ideal: 5 proposed size: 3
2008-05-01 11:52:11 PDT DEBUG2 remoteWorkerThread_3: SYNC 35 processing
2008-05-01 11:52:11 PDT DEBUG2 remoteWorkerThread_3: no sets need syncing for this event
2008-05-01 11:52:12 PDT DEBUG2 localListenThread: Received event 4,34 SYNC
2008-05-01 11:52:13 PDT DEBUG2 remoteListenThread_1: LISTEN
2008-05-01 11:52:16 PDT DEBUG2 ACCEPT_SET - MOVE_SET or FAILOVER_SET not received yet - sleep
2008-05-01 11:52:16 PDT DEBUG2 remoteListenThread_2: queue event 2,47 SYNC
2008-05-01 11:52:16 PDT DEBUG2 remoteListenThread_2: UNLISTEN
2008-05-01 11:52:16 PDT DEBUG2 remoteListenThread_1: LISTEN
2008-05-01 11:52:17 PDT DEBUG2 remoteListenThread_1: queue event 1,123 STORE_PATH
2008-05-01 11:52:17 PDT DEBUG2 remoteListenThread_1: UNLISTEN
2008-05-01 11:52:17 PDT DEBUG2 remoteListenThread_1: queue event 1,124 STORE_LISTEN
2008-05-01 11:52:17 PDT DEBUG2 syncThread: new sl_action_seq 1 - SYNC 35
2008-05-01 11:52:17 PDT DEBUG4 version for "dbname=clsdb host=devidb04.domain.com user=postgres password=SECURED" is 80205
2008-05-01 11:52:17 PDT ERROR  remoteListenThread_3: db_getLocalNodeId() returned 2 - wrong database?
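
The ERROR lines can be pulled out of a slon log with a sketch like the
one below. My reading of the message, which is an assumption and not
something confirmed in this thread, is that the listen thread for node
3 connected to a database that reports itself as node 2, which would
point at the conninfo stored for that node. The function name
node_id_mismatches is made up, and the input is assumed to be unwrapped
log lines.

```python
import re

# Thread number is the node slon expected to reach; the second number is
# what the database at the other end of the connection reported.
MISMATCH = re.compile(
    r"remoteListenThread_(?P<expected>\d+): "
    r"db_getLocalNodeId\(\) returned (?P<actual>\d+)"
)


def node_id_mismatches(log_lines):
    """Return sorted unique (expected_node, reported_node) pairs."""
    found = set()
    for line in log_lines:
        m = MISMATCH.search(line)
        if m:
            found.add((int(m.group("expected")), int(m.group("actual"))))
    return sorted(found)


sample = [
    "2008-05-01 11:52:07 PDT ERROR  remoteListenThread_3: "
    "db_getLocalNodeId() returned 2 - wrong database?",
    "2008-05-01 11:52:08 PDT DEBUG2 remoteListenThread_1: queue event 1,122 SYNC",
    "2008-05-01 11:52:17 PDT ERROR  remoteListenThread_3: "
    "db_getLocalNodeId() returned 2 - wrong database?",
]

print(node_id_mismatches(sample))
```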

Sometimes in 1.2.12 and 1.2.14rc the failover works, but you're not
going to get more than one successful failover before you have to drop
and re-add a node. This situation also causes switchover to hang until
you kill slon on the affected qslave.

I'm more than happy to work through this, as I really want to push out
8.3.1 and would love to have a functioning 1.2.14 slon release, but
something bad happened between 1.2.11 and current. Either it's
something new that I have not added to my setup scripts, or it's the
code.

I'll work with someone on this!!!

Thanks
Tory