Christopher Browne cbbrowne
Wed Sep 28 20:41:52 PDT 2005
Darcy Buskermolen wrote:

>On Monday 22 August 2005 17:12, elein wrote:
>  
>
>>Slony 1.1.  Three nodes. 10 set(1) => 20 => 30.
>>
>>I ran failover from node10 to node20.
>>
>>On node30, the origin of the set was changed
>>from 10 to 20; however, drop node10 failed
>>because of the row in sl_setsync.
>>
>>This causes slon on node30 to quit and the cluster to
>>become unstable, which in turn prevents putting
>>node10 back into the mix.
>>
>>Please tell me I'm not the first one to run into
>>this...
>>
>>The only clean workaround I can see is to drop
>>node 30, re-add it, and then re-add node10.  This
>>leaves us w/o a backup for the downtime.
>>
>>
>>This is what is in some of the tables for node20:
>>
>>gb2=# select * from sl_node;
>> no_id | no_active |       no_comment        | no_spool
>>-------+-----------+-------------------------+----------
>>    20 | t         | Node 20 - gb2@localhost | f
>>    30 | t         | Node 30 - gb3@localhost | f
>>(2 rows)
>>
>>gb2=# select * from sl_set;
>> set_id | set_origin | set_locked |     set_comment
>>--------+------------+------------+----------------------
>>      1 |         20 |            | Set 1 for gb_cluster
>>(1 row)
>>
>>gb2=# select * from sl_setsync;
>> ssy_setid | ssy_origin | ssy_seqno | ssy_minxid | ssy_maxxid | ssy_xip | ssy_action_list
>>-----------+------------+-----------+------------+------------+---------+-----------------
>>(0 rows)
>>
>>This is what I have for node30:
>>
>>gb3=# select * from sl_node;
>> no_id | no_active |       no_comment        | no_spool
>>-------+-----------+-------------------------+----------
>>    10 | t         | Node 10 - gb@localhost  | f
>>    20 | t         | Node 20 - gb2@localhost | f
>>    30 | t         | Node 30 - gb3@localhost | f
>>(3 rows)
>>
>>gb3=# select * from sl_set;
>> set_id | set_origin | set_locked |     set_comment
>>--------+------------+------------+----------------------
>>      1 |         20 |            | Set 1 for gb_cluster
>>(1 row)
>>
>>gb3=# select * from sl_setsync;
>> ssy_setid | ssy_origin | ssy_seqno | ssy_minxid | ssy_maxxid | ssy_xip | ssy_action_list
>>-----------+------------+-----------+------------+------------+---------+-----------------
>>         1 |         10 |       235 | 1290260    | 1290261    |         |
>>(1 row)
>>
>>frustrated,
>>--elein
>>    
>>
>Elein,
>I can share your frustration. I have just for the first time started to 
>investigate failover, and I have yet to see a clean failover 
>happen; no matter how I do it, I end up with nodes that are no longer 
>in sync with the other nodes.  My time is fairly short this week, but I hope 
>to be able to spend some time on it. I've pushed all my other slony work to 
>the back burner to come to a solid resolution to this.
>
>Jan/Chris, are either of you able to reproduce stable failovers in a multi-node 
>setup (more than a single origin/subscriber pair)?
>  
>
I finally put together a suitable environment to do some testing of this...

I'm running into a case where, upon failover from node 1 to node 2, I
get the following error message from slonik:

-sh-2.05b$ slonik failover.slonik
failover.slonik:6: NOTICE:  Slony-I: terminating DB connection of failed
node with pid 13903
CONTEXT:  PL/pgSQL function "failednode" line 75 at perform
FATAL:  terminating connection due to administrator command
failover.slonik:6: NOTICE:  failedNode: set 1 has no other direct
receivers - move now
failover.slonik:6: NOTICE:  Slony-I: terminating DB connection of failed
node with pid 13905
CONTEXT:  PL/pgSQL function "failednode" line 75 at perform
FATAL:  terminating connection due to administrator command
failover.slonik:6: NOTICE:  failedNode: set 1 has no other direct
receivers - move now
-sh-2.05b$

After which point things are not QUITE ok.
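
For reference, the script itself is just the standard FAILOVER invocation;
a minimal sketch of its shape (the cluster name and conninfo strings below
are placeholders, not what I actually used):

cluster name = testcluster;                           # placeholder
node 1 admin conninfo = 'dbname=db1 host=localhost';  # placeholder conninfos
node 2 admin conninfo = 'dbname=db2 host=localhost';
node 3 admin conninfo = 'dbname=db3 host=localhost';

# node 1 is the dead origin; node 2 takes over set 1
failover (id = 1, backup node = 2);

# elein's report suggests the follow-up drop of the failed node is where
# things can also go wrong:
# drop node (id = 1, event node = 2);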

I find that node #2 still has the "denyaccess" triggers in place.
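
A quick way to see them is to poke at pg_trigger on node 2; the loose
pattern match is because the trigger names embed the cluster name and a
per-table id:

-- lists tables still carrying Slony-I's deny-access triggers; on a
-- proper set origin this should return no rows
select c.relname as table_name, t.tgname as trigger_name
  from pg_trigger t, pg_class c
 where c.oid = t.tgrelid
   and t.tgname like '%denyaccess%';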

Interestingly, if I do a LOCK SET/MOVE SET to shift origin to node 3,
then shift it back to node 2, all seems to be well again.
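
In slonik terms, that workaround amounts to roughly the following (again,
cluster name and conninfo strings are placeholders):

cluster name = testcluster;                           # placeholder
node 2 admin conninfo = 'dbname=db2 host=localhost';  # placeholder conninfos
node 3 admin conninfo = 'dbname=db3 host=localhost';

# push the set off to node 3...
lock set (id = 1, origin = 2);
move set (id = 1, old origin = 2, new origin = 3);

# ...let the subscribers catch up, then pull it back
lock set (id = 1, origin = 3);
move set (id = 1, old origin = 3, new origin = 2);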

That was with 1.0.5, not 1.1 (because I had the compile handy :-)).  I'd
not expect material differences in 1.1, as this code hasn't changed
substantially between the two releases.

The case of a single origin/subscriber pair isn't worth researching much,
as there isn't much point to FAILOVER there: losing the origin means you
no longer have a replication cluster.  Supposing there were anomalies in
that case, I'd find them somewhat uninteresting, as it makes just as much
sense to do an UNINSTALL NODE and drop replication from the surviving
node altogether.
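
That is, on the surviving node, something as small as (placeholders again):

cluster name = testcluster;                           # placeholder
node 2 admin conninfo = 'dbname=db2 host=localhost';  # placeholder conninfo

# drops the Slony-I schema from node 2 and restores its original triggers,
# leaving a plain standalone database behind
uninstall node (id = 2);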

