Thu May 1 12:17:26 PDT 2008
- Previous message: [Slony1-general] configure:7134: error: Headers for libpqserver are not found in the includeserverdir
- Next message: [Slony1-general] 1.2.14rc still does not appear to handle switchover cleanly
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
As posted by Mikhail Kolesnik in a discussion with Stéphane A. Schildknecht, an issue was possibly introduced in 1.2.12 that causes failover problems. From that conversation it appears folks thought it was in the 1.2.13 branch, but I believe this major issue appeared before 1.2.13.

I've done exhaustive testing, and 1.2.11 allows failover with no issues. I have a 4-node scheme: 1 masterhost, 1 slavehost, 2 qslavehosts (query only), running 1.2.11 and PostgreSQL 8.2.5.

Other than a possible single instance of:

ACCEPT_SET - MOVE_SET or FAILOVER_SET not received yet - sleep

which is short lived, failover works back and forth between master and slave, all day long. I switchover, wait for replication, switchback, wait for replication, and so on. No issues.

Now with 1.2.12 and 1.2.14rc: I have not tested 1.2.13 yet, but since the problem is apparent in 1.2.12 and in 1.2.14rc even with the "patch/possible fix", I'm guessing the issue is very much in 1.2.13 as well. This is a large issue, as failover and switchover are key elements in this application.

The symptoms in 1.2.12 and 1.2.14rc are that the qslaves freak the heck out. We can get failover to work, but we MUST drop the affected qslave host and re-add it, and when one is doing weekly indexes and has to end up rebuilding each time, that's an issue.

The qservers get into this state (ps -ef) during and after a failover. One can stop slon and the failover will take place, but once you restart slon the node is instantly in a bad state (2008-05-01 11:54:42 PDT DEBUG2 ACCEPT_SET - MOVE_SET or FAILOVER_SET not received yet - sleep).

Qslavehost:

postgres 6467 1161 0 11:47 ? 00:00:00 postgres: postgres clsdb 10.40.5.243(54273) idle
postgres 6468 6442 0 11:48 pts/0 00:00:00 slon -f /data/pgsql/slon.conf cls dbname=clsdb user=postgres
postgres 6545 6468 0 11:51 pts/0 00:00:00 slon -f /data/pgsql/slon.conf cls dbname=clsdb user=postgres
postgres 6549 1161 0 11:51 ? 00:00:00 postgres: postgres clsdb 10.40.5.250(54310) idle
postgres 6552 1161 0 11:51 ? 00:00:00 postgres: postgres clsdb 10.40.5.250(54311) idle in transaction
postgres 6558 1161 0 11:51 ? 00:00:00 postgres: postgres clsdb 10.40.5.250(54312) LOCK TABLE waiting <--- this is wrong
postgres 6560 1161 0 11:51 ? 00:00:00 postgres: postgres clsdb 10.40.5.250(54313) idle
postgres 6561 1161 0 11:51 ? 00:00:00 postgres: postgres clsdb 10.40.5.250(54315) idle
postgres 6563 1161 0 11:51 ? 00:00:00 postgres: postgres clsdb 10.40.5.250(54316) idle

The logs show:

2008-05-01 11:54:20 PDT DEBUG2 syncThread: new sl_action_seq 1 - SYNC 54
2008-05-01 11:54:20 PDT DEBUG2 remoteListenThread_1: LISTEN
2008-05-01 11:54:22 PDT DEBUG2 ACCEPT_SET - MOVE_SET or FAILOVER_SET not received yet - sleep
2008-05-01 11:54:25 PDT DEBUG2 remoteListenThread_1: queue event 1,153 SYNC
2008-05-01 11:54:25 PDT DEBUG2 remoteListenThread_1: UNLISTEN
2008-05-01 11:54:27 PDT DEBUG2 remoteListenThread_4: LISTEN
2008-05-01 11:54:30 PDT DEBUG2 syncThread: new sl_action_seq 1 - SYNC 55
2008-05-01 11:54:30 PDT DEBUG2 localListenThread: Received event 3,54 SYNC
2008-05-01 11:54:30 PDT DEBUG2 localListenThread: Received event 3,55 SYNC
2008-05-01 11:54:32 PDT DEBUG2 ACCEPT_SET - MOVE_SET or FAILOVER_SET not received yet - sleep
2008-05-01 11:54:34 PDT DEBUG2 remoteListenThread_2: queue event 2,72 SYNC
2008-05-01 11:54:34 PDT DEBUG2 remoteListenThread_2: UNLISTEN
2008-05-01 11:54:37 PDT DEBUG2 remoteListenThread_4: LISTEN
2008-05-01 11:54:39 PDT DEBUG2 remoteListenThread_2: queue event 2,73 SYNC
2008-05-01 11:54:39 PDT DEBUG2 remoteListenThread_2: UNLISTEN
2008-05-01 11:54:40 PDT DEBUG2 syncThread: new sl_action_seq 1 - SYNC 56
2008-05-01 11:54:40 PDT DEBUG2 remoteListenThread_1: queue event 1,154 SYNC
2008-05-01 11:54:42 PDT DEBUG2 ACCEPT_SET - MOVE_SET or FAILOVER_SET not received yet - sleep

I also note that I've seen this on occasion:

2008-05-01 11:52:06 PDT DEBUG2 ACCEPT_SET - MOVE_SET or FAILOVER_SET not received yet - sleep
2008-05-01 11:52:07 PDT DEBUG2 syncThread: new sl_action_seq 1 - SYNC 34
2008-05-01 11:52:07 PDT DEBUG4 version for "dbname=clsdb host=devidb04.domain.com user=postgres password=SECURED" is 80205
2008-05-01 11:52:07 PDT ERROR remoteListenThread_3: db_getLocalNodeId() returned 2 - wrong database?
2008-05-01 11:52:08 PDT DEBUG2 remoteListenThread_1: queue event 1,122 SYNC
2008-05-01 11:52:08 PDT DEBUG2 remoteListenThread_1: UNLISTEN
2008-05-01 11:52:10 PDT DEBUG2 remoteListenThread_2: LISTEN
2008-05-01 11:52:11 PDT DEBUG2 remoteListenThread_2: queue event 3,35 SYNC
2008-05-01 11:52:11 PDT DEBUG2 remoteListenThread_2: UNLISTEN
2008-05-01 11:52:11 PDT DEBUG2 remoteWorkerThread_3: Received event 3,35 SYNC
2008-05-01 11:52:11 PDT DEBUG2 calc sync size - last time: 1 last length: 11025 ideal: 5 proposed size: 3
2008-05-01 11:52:11 PDT DEBUG2 remoteWorkerThread_3: SYNC 35 processing
2008-05-01 11:52:11 PDT DEBUG2 remoteWorkerThread_3: no sets need syncing for this event
2008-05-01 11:52:12 PDT DEBUG2 localListenThread: Received event 4,34 SYNC
2008-05-01 11:52:13 PDT DEBUG2 remoteListenThread_1: LISTEN
2008-05-01 11:52:16 PDT DEBUG2 ACCEPT_SET - MOVE_SET or FAILOVER_SET not received yet - sleep
2008-05-01 11:52:16 PDT DEBUG2 remoteListenThread_2: queue event 2,47 SYNC
2008-05-01 11:52:16 PDT DEBUG2 remoteListenThread_2: UNLISTEN
2008-05-01 11:52:16 PDT DEBUG2 remoteListenThread_1: LISTEN
2008-05-01 11:52:17 PDT DEBUG2 remoteListenThread_1: queue event 1,123 STORE_PATH
2008-05-01 11:52:17 PDT DEBUG2 remoteListenThread_1: UNLISTEN
2008-05-01 11:52:17 PDT DEBUG2 remoteListenThread_1: queue event 1,124 STORE_LISTEN
2008-05-01 11:52:17 PDT DEBUG2 syncThread: new sl_action_seq 1 - SYNC 35
2008-05-01 11:52:17 PDT DEBUG4 version for "dbname=clsdb host=devidb04.domain.com user=postgres password=SECURED" is 80205
2008-05-01 11:52:17 PDT ERROR remoteListenThread_3: db_getLocalNodeId() returned 2 - wrong database?

Sometimes in 1.2.12 and 1.2.14rc the failover works, but you're not going to get more than one successful failover before you have to drop and re-add a node. This situation also causes switchover to hang until you kill slon on the affected qslave.

I'm more than happy to work through this, as I really want to push out 8.3.1 and would love to have a functioning 1.2.14 slon release, but something bad happened between 1.2.11 and current. Either it's something new that I have not added to my setup scripts, or it's the code.

I'll work with someone on this!!!

Thanks
Tory
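
For anyone trying to reproduce this, the two log symptoms quoted above can be checked for mechanically. The following is only a rough sketch in Python, not part of Slony-I: the function name `summarize_slon_log` and the `wait_threshold` cutoff are made up for illustration, and the message patterns are copied from the excerpts in this post, so other versions or log levels may word them differently.

```python
import re
from collections import Counter

# Message text copied from the log excerpts in this post.
ACCEPT_SET_WAIT = "ACCEPT_SET - MOVE_SET or FAILOVER_SET not received yet - sleep"

# Matches the "wrong database?" error and captures the thread name and node id.
WRONG_NODE = re.compile(
    r"ERROR\s+(remoteListenThread_\d+): "
    r"db_getLocalNodeId\(\) returned (\d+) - wrong database\?"
)


def summarize_slon_log(lines, wait_threshold=3):
    """Count the two symptoms in an iterable of slon log lines.

    wait_threshold is a hypothetical cutoff: a single ACCEPT_SET wait was
    harmless on 1.2.11, so only repeated waits are flagged as "stuck".
    """
    waits = 0
    wrong_node = Counter()
    for line in lines:
        if ACCEPT_SET_WAIT in line:
            waits += 1
        match = WRONG_NODE.search(line)
        if match:
            wrong_node[(match.group(1), int(match.group(2)))] += 1
    return {
        "accept_set_waits": waits,
        "wrong_node_errors": dict(wrong_node),
        "stuck": waits >= wait_threshold,
    }


if __name__ == "__main__":
    import sys

    # Usage: python check_slon_log.py < slon.log
    print(summarize_slon_log(sys.stdin))
```

Run against the slon log of a qslave after a switchover, a non-empty `wrong_node_errors` or `stuck` being true would correspond to the bad state described above. It says nothing about the cause.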