Observed with 2.0.3
In a cluster as follows:
A failover of set 1=>3 seems to work.
However, DROP NODE (id=1) fails with:
<stdin>:12: PGRES_FATAL_ERROR select "_disorder_replica".dropNode(1); - ERROR: Slony-I: Node 1 is still origin of one or more sets
When I query sl_set it shows that node 1 is still the origin of the set even though the failover command seemed to work okay.
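For reference, the origin each set records can be checked directly in the cluster catalog with a query along these lines (a sketch only; the schema name is taken from the error message above, and sl_set's set_origin column holds the origin node id):

```sql
-- Show which node each replication set currently records as its origin.
-- Run this against the node you expect to have become the new origin.
SELECT set_id, set_origin, set_comment
  FROM "_disorder_replica".sl_set;
```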
(In reply to comment #0)
> Observed with 2.0.3
> In a cluster as follows
> A failover of set 1=>3 seems to work.
> However DROP NODE (id=1) fails with
> <stdin>:12: PGRES_FATAL_ERROR select "_disorder_replica".dropNode(1); - ERROR:
> Slony-I: Node 1 is still origin of one or more sets
> When I query sl_set it shows that node 1 is still the origin of the set even
> though the failover command seemed to work okay.
Where did you run these commands?
If you did so against node 1, then I'd fully expect this behaviour. Node #1 doesn't really know that it has been "shunned." It certainly doesn't if its disks were ground into powder (in which case your queries would fail because there's no database there anymore!).
But if node 1 failed due to a network partition, it can't be expected to ever become aware that it has been shunned.
If the requests were hitting ex-node #1, then I don't think these results are necessarily wrong.
According to the documentation DROP NODE shouldn't work with an EVENT_NODE=1 (the node being dropped) so I don't think that was the case.
I think what was happening is that FAILOVER returns right away, but the failover processing doesn't complete until later (the other subscribers have to process an ACCEPT_SET). No WAIT FOR commands were being issued after the FAILOVER.
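A hedged sketch of the ordering a slonik script would want, with a wait between the failover and the drop. The EVENT NODE / WAIT ON values here are illustrative assumptions, not a recipe verified to work on 2.0.3:

```
FAILOVER (ID = 1, BACKUP NODE = 3);

# Wait until the failover event, originating on the backup node,
# has been confirmed by all remaining nodes before dropping the
# failed node.
WAIT FOR EVENT (ORIGIN = 3, CONFIRMED = ALL, WAIT ON = 3);

DROP NODE (ID = 1, EVENT NODE = 3);
```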
See the comments on bug #129; the relevant portion is duplicated below:
2) As part of a failover we want to mark the failed node as being inactive in sl_node, and make it so that WAIT FOR confirmed=all doesn't wait on the failed node to confirm things.
3) slonik needs to remember the sequence number returned by failedNode2 so that it is possible to WAIT FOR that event on the backup node, to ensure it is confirmed by all. Exactly how a slonik script can wait still needs to be figured out. This won't be done until 2.1.
I think this is covered by the proposed patch in bug #136?
Will be covered by fixing Bug 80
*** This bug has been marked as a duplicate of bug 80 ***