Bug 130 - failover does not seem to update sl_set
Summary: failover does not seem to update sl_set
Status: RESOLVED DUPLICATE of bug 80
Alias: None
Product: Slony-I
Classification: Unclassified
Component: stored procedures
Version: 2.0
Hardware: PC Linux
Importance: medium enhancement
Assignee: Slony Bugs List
URL:
Depends on:
Blocks:
 
Reported: 2010-05-26 14:26 UTC by Steve Singer
Modified: 2010-08-25 07:24 UTC

See Also:


Description Steve Singer 2010-05-26 14:26:43 UTC
Observed with 2.0.3

In a cluster as follows

1
 \
  \
   3===4
    \
     \
      5


 A failover of set 1=>3 seems to work.

However, DROP NODE (id=1) fails with:

<stdin>:12: PGRES_FATAL_ERROR select "_disorder_replica".dropNode(1);  - ERROR:  Slony-I: Node 1 is still origin of one or more sets

When I query sl_set it shows that node 1 is still the origin of the set even though the failover command seemed to work okay.
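A query along these lines is enough to show the problem; a minimal sketch, assuming the "_disorder_replica" schema from the error above and run against the backup node (node 3):

  -- Show which node each replication set currently records as its origin.
  -- After a successful failover 1=>3, set_origin should read 3, not 1.
  SELECT set_id, set_origin, set_comment
    FROM "_disorder_replica".sl_set
   ORDER BY set_id;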
Comment 1 Christopher Browne 2010-06-15 15:25:07 UTC
(In reply to comment #0)
> Observed with 2.0.3
> 
> In a cluster as follows
> 
> 1
>  \
>   \
>    3===4
>     \
>      \
>       5
> 
> 
>  A failover of set 1=>3 seems to work.
> 
> However DROP NODE (id=1) fails with 
> 
> <stdin>:12: PGRES_FATAL_ERROR select "_disorder_replica".dropNode(1);  - ERROR:
>  Slony-I: Node 1 is still origin of one or more sets
> 
> When I query sl_set it shows that node 1 is still the origin of the set even
> though the failover command seemed to work okay.

Where did you run these commands?

If you did so against node 1, then I'd fully expect this behaviour.  Node #1 doesn't really know that it's "shunned."  It certainly doesn't if the disks were ground into a powder (in which case your queries would fail because there's no database there anymore!).

But if node 1 failed due to a network partition, it can't be expected to ever become aware that it has been shunned.

If the requests were hitting ex-node #1, then I don't think these results are necessarily wrong.
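For what it's worth, the usual pattern is to point the slonik preamble (and the DROP NODE event) only at surviving nodes; a minimal sketch, with hypothetical connection strings:

  cluster name = disorder_replica;
  # Only surviving nodes appear in the preamble; failed node 1 is omitted.
  node 3 admin conninfo = 'dbname=disorder host=host3 user=slony';
  node 4 admin conninfo = 'dbname=disorder host=host4 user=slony';
  node 5 admin conninfo = 'dbname=disorder host=host5 user=slony';

  # Generate the DROP_NODE event on a surviving node, not on node 1.
  drop node (id = 1, event node = 3);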
Comment 2 Steve Singer 2010-06-23 06:47:11 UTC
According to the documentation, DROP NODE shouldn't work with an EVENT_NODE=1 (the node being dropped), so I don't think that was the case.

I think what was happening is that FAILOVER returns right away, but the failover processing doesn't complete until later (the other subscribers have to process an ACCEPT_SET). No WAIT FOR commands were being issued after the FAILOVER; a sketch of the intended ordering follows the excerpt below.

See the comments on bug #129; the relevant portion is duplicated below.

----------------
2) As part of a failover we want to mark the failed node as inactive in
sl_node, and make it so that WAIT FOR confirmed=all doesn't wait on the
failed node to confirm things.

3) slonik needs to remember the sequence number returned by failedNode2 so
that it is possible to WAIT FOR that event on the backup node to ensure it
is confirmed by all. Exactly how a slonik script can wait still needs to be
figured out. This won't be done until 2.1.
------------------------
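The ordering a slonik script would want, then, is roughly the sketch below (the SYNC / WAIT FOR EVENT pair is only a stand-in for the wait mechanism that, per the excerpt, still needs to be figured out; and per item 2, confirmed=all is itself problematic while the failed node is still listed in sl_node):

  # Move the sets from failed node 1 to backup node 3.
  failover (id = 1, backup node = 3);

  # Stand-in wait step: generate a SYNC on the new origin and wait until the
  # surviving subscribers confirm it, so that the origin change in sl_set has
  # actually propagated before node 1 is dropped.
  sync (id = 3);
  wait for event (origin = 3, confirmed = all, wait on = 3);

  drop node (id = 1, event node = 3);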
Comment 3 Steve Singer 2010-08-11 14:18:31 UTC
I think this is covered by the proposed patch in bug 136?
Comment 4 Jan Wieck 2010-08-25 07:24:40 UTC
This will be covered by fixing bug 80.

*** This bug has been marked as a duplicate of bug 80 ***