Wed Sep 28 20:41:52 PDT 2005
- Previous message: [Slony1-general] Forcing an existing node to copy data fresh
- Next message: [Slony1-general] Is non-sequential node numbering okay?
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Darcy Buskermolen wrote:

>On Monday 22 August 2005 17:12, elein wrote:
>
>>Slony 1.1. Three nodes. 10 set(1) => 20 => 30.
>>
>>I ran failover from node10 to node20.
>>
>>On node30, the origin of the set was changed
>>from 10 to 20, however, drop node10 failed
>>because of the row in sl_setsync.
>>
>>This causes slon on node30 to quit and the cluster to
>>become unstable. Which in turn prevents putting
>>node10 back into the mix.
>>
>>Please tell me I'm not the first one to run into
>>this...
>>
>>The only clean workaround I can see is to drop
>>node 30. Re-add it. And then re-add node10. This
>>leaves us w/o a backup for the downtime.
>>
>>This is what is in some of the tables for node20:
>>
>>gb2=# select * from sl_node;
>> no_id | no_active |         no_comment         | no_spool
>>-------+-----------+----------------------------+----------
>>    20 | t         | Node 20 - gb2 at localhost | f
>>    30 | t         | Node 30 - gb3 at localhost | f
>>(2 rows)
>>
>>gb2=# select * from sl_set;
>> set_id | set_origin | set_locked |     set_comment
>>--------+------------+------------+----------------------
>>      1 |         20 |            | Set 1 for gb_cluster
>>
>>gb2=# select * from sl_setsync;
>> ssy_setid | ssy_origin | ssy_seqno | ssy_minxid | ssy_maxxid | ssy_xip | ssy_action_list
>>-----------+------------+-----------+------------+------------+---------+-----------------
>>(0 rows)
>>
>>This is what I have for node30:
>>
>>gb3=# select * from sl_node;
>> no_id | no_active |         no_comment         | no_spool
>>-------+-----------+----------------------------+----------
>>    10 | t         | Node 10 - gb at localhost  | f
>>    20 | t         | Node 20 - gb2 at localhost | f
>>    30 | t         | Node 30 - gb3 at localhost | f
>>(3 rows)
>>
>>gb3=# select * from sl_set;
>> set_id | set_origin | set_locked |     set_comment
>>--------+------------+------------+----------------------
>>      1 |         20 |            | Set 1 for gb_cluster
>>(1 row)
>>
>>gb3=# select * from sl_setsync;
>> ssy_setid | ssy_origin | ssy_seqno | ssy_minxid | ssy_maxxid | ssy_xip | ssy_action_list
>>-----------+------------+-----------+------------+------------+---------+-----------------
>>         1 |         10 |       235 |    1290260 |    1290261 |         |
>>(1 row)
>>
>>frustrated,
>>--elein
>
>Elein,
>I can share your frustration. I have just for the first time started to
>investigate failover, and I have yet to be able to have a clean failover
>happen; no matter how I do a failover, I end up with nodes that are no longer
>in sync with the other nodes. My time is fairly short this week, but I hope
>to be able to spend some time on it. I've pushed all my other Slony work to
>the back burner to come to a solid resolution to this.
>
>Jan/Chris, are either of you able to reproduce stable failovers in a
>multi-node setup (more than a single origin/subscriber pair)?
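[For reference, the sequence elein describes would look roughly like the slonik sketch below. The cluster name is taken from the set comment above; the admin conninfo strings are guesses and would need to match the actual gb/gb2/gb3 databases.]

  cluster name = gb_cluster;

  # Placeholder conninfo strings for the three databases.
  node 10 admin conninfo = 'dbname=gb host=localhost';
  node 20 admin conninfo = 'dbname=gb2 host=localhost';
  node 30 admin conninfo = 'dbname=gb3 host=localhost';

  # Promote node 20 to origin of set 1 after node 10 fails ...
  failover (id = 10, backup node = 20);

  # ... then remove the failed node.  This is the step that breaks on
  # node 30, where sl_setsync still holds a row with ssy_origin = 10.
  drop node (id = 10, event node = 20);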
I finally put together a suitable environment to do some testing of this...

I'm running into a case where, upon failover from node 1 to node 2, I get
the following error message from slonik:

-sh-2.05b$ slonik failover.slonik
failover.slonik:6: NOTICE:  Slony-I: terminating DB connection of faile node with pid 13903
CONTEXT:  PL/pgSQL function "failednode" line 75 at perform
FATAL:  terminating connection due to administrator command
failover.slonik:6: NOTICE:  failedNode: set 1 has no other direct receivers - move now
failover.slonik:6: NOTICE:  Slony-I: terminating DB connection of faile node with pid 13905
CONTEXT:  PL/pgSQL function "failednode" line 75 at perform
FATAL:  terminating connection due to administrator command
failover.slonik:6: NOTICE:  failedNode: set 1 has no other direct receivers - move now
-sh-2.05b$

After which point things are not QUITE ok: I find that node #2 still has
the "denyaccess" triggers in place.

Interestingly, if I do a LOCK SET/MOVE SET to shift the origin to node 3,
then shift it back to node 2, all seems to be well again (a sketch of that
round trip is at the end of this message).

That was with 1.0.5, not 1.1 (because I had the compile handy :-)). I'd not
expect material differences in 1.1, as this code hasn't changed
substantially there.

The case of a single origin/subscriber pair isn't worth researching much,
since there isn't really much point to FAILOVER there: losing the origin
means you no longer have a replication cluster. Supposing there were
anomalies there, I'd find that somewhat uninteresting, as it makes just as
much sense to do an UNINSTALL NODE and drop replication from the surviving
node altogether.
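[The LOCK SET/MOVE SET round trip mentioned above would look roughly like
this in slonik. Node numbers follow the 1/2/3 test cluster; the cluster
name and conninfo strings are placeholders, and the WAIT FOR EVENT steps
are only there to let each MOVE_SET propagate before the next command.]

  cluster name = testcluster;

  # Placeholder conninfo strings for the two surviving nodes.
  node 2 admin conninfo = 'dbname=test2 host=localhost';
  node 3 admin conninfo = 'dbname=test3 host=localhost';

  # Shift the origin of set 1 from node 2 over to node 3 ...
  lock set (id = 1, origin = 2);
  move set (id = 1, old origin = 2, new origin = 3);
  wait for event (origin = 2, confirmed = 3);

  # ... and then back again; after this, the leftover "denyaccess"
  # triggers on node 2 appear to be cleaned up.
  lock set (id = 1, origin = 3);
  move set (id = 1, old origin = 3, new origin = 2);
  wait for event (origin = 3, confirmed = 2);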