Created attachment 31
log and slonik scripts
When dropping the "failed" node immediately after the failover() command, a node can receive the dropNode_int() command while the failed node is still referenced in the sl_set table. dropNode_int() then fails because the foreign key constraint "set_origin-no_id-ref" is violated.
The slon daemon then restarts itself and tries again to drop the node, which fails again, so it keeps restarting in an endless loop, rendering the node unusable.
If you wait a bit after the failover() before dropping the node, everything works fine.
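The failure mode can be shown in miniature. This is only a sketch under stated assumptions, not Slony-I's real schema or code: it uses Python's sqlite3 module with two cut-down tables named after sl_node and sl_set, and the helpers apply_failover() and drop_node() are hypothetical stand-ins for what the FAILOVER event and dropNode_int() do to those tables.

```python
import sqlite3

def make_cluster():
    """Build a toy 4-node cluster with one set originating on node 1."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked
    conn.execute("CREATE TABLE sl_node (no_id INTEGER PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE sl_set ("
        " set_id INTEGER PRIMARY KEY,"
        " set_origin INTEGER NOT NULL REFERENCES sl_node (no_id))"
    )
    conn.executemany("INSERT INTO sl_node VALUES (?)", [(1,), (2,), (3,), (4,)])
    conn.execute("INSERT INTO sl_set VALUES (1, 1)")
    return conn

def apply_failover(conn, failed, backup):
    """Hypothetical stand-in: move every set of the failed node to the backup."""
    conn.execute(
        "UPDATE sl_set SET set_origin = ? WHERE set_origin = ?", (backup, failed)
    )

def drop_node(conn, node):
    """Hypothetical stand-in for dropNode_int(): True if the node row was removed."""
    try:
        conn.execute("DELETE FROM sl_node WHERE no_id = ?", (node,))
        return True
    except sqlite3.IntegrityError:
        # sl_set.set_origin still points at the node being dropped
        return False

# DROP NODE arrives before the failover has been applied on this node
early = make_cluster()
dropped_early = drop_node(early, 1)

# DROP NODE arrives after the failover has been applied
late = make_cluster()
apply_failover(late, 1, 4)
dropped_late = drop_node(late, 1)
```

In the first case the delete is rejected by the foreign key, which corresponds to the error the slon daemon keeps hitting on every restart; in the second case the same delete goes through.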
Attached are the slonik scripts to set up the test and (hopefully) reproduce the problem, plus part of the slon daemon log from the node that went wild.
The setup was as follows: 4 nodes (1-4), with node 1 as master and nodes 2-4 as slaves subscribing directly to the master. The master is then assumed to be broken and node 4 should become the new master, so nodes 2 and 3 are re-subscribed to node 4, then the failover is executed, followed by a wait, and then the "failed" old master is dropped.
I have to apologize: I forgot to switch languages on my dev machine, so some of the error messages in the slon log are in German.
I'll think about this over the weekend; my first reaction is to treat this as a documentation patch, and to recommend not rushing to drop the node out of the cluster until you actually get the failover completed.
As a first response, that's definitely what I'd recommend.
When you drop it "too quickly," that introduces the risk, which you ran into, that some later node gets the DROP NODE event before receiving the FAILOVER event.
There's no easy way to evade that problem!
However, my second reaction is that it's not particularly reasonable for this mistake to be allowed to break the cluster.
As a first thought on a solution, we might check to see if there's a FAILOVER_SET event pending, and somehow defer/ignore the DROP NODE.
Gonna have to sleep on that...
This issue is also referenced in bug 129.
There has been some discussion about making FAILOVER mark nodes as disabled so they don't get included in the set of nodes that wait for ... confirmed=all uses. (That on its own won't fix the issue.)
The issue is actually the async processing of events coming from different nodes.
The FAILOVER_NODE is faked by slonik to be coming from the failed node. This guarantees that every subscriber will drain out all outstanding SYNC events from the failed node before starting to consume changes from the next origin (either backup node or temporary origin).
The next origin will issue ACCEPT_SET. The purpose of the ACCEPT_SET event, which is also seen in MOVE_SET, is that a subscriber suspends processing events from the accepting node, until it has seen the corresponding FAILOVER_SET or MOVE_SET, so that it doesn't throw away data from the accepting node. The accepting node can modify the tables and create sl_log data long before everybody else is caught up.
What we want to do is to reproduce the ACCEPT_SET logic in slon for DROP_NODE and suspend processing events from the DROP_NODE origin until there are no more sets from that origin in slon's runtime config.
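To make the proposal concrete, here is a rough sketch of the suspend-until-safe idea. It is a toy model, not slon code: the Subscriber class, its per-origin queues and the event tuples are all hypothetical, and the real runtime config and event handling are considerably more involved. The only rule it implements is the one described above: a DROP_NODE at the head of an origin's queue blocks that queue for as long as the local config still has a set whose origin is the node being dropped.

```python
from collections import deque

class Subscriber:
    """Toy model of one node consuming events from several origins."""

    def __init__(self, set_origins):
        self.set_origins = dict(set_origins)  # set_id -> current origin node
        self.queues = {}                      # event origin -> pending events
        self.log = []                         # (origin, event kind) in applied order

    def enqueue(self, origin, event):
        self.queues.setdefault(origin, deque()).append(event)

    def _blocked(self, event):
        # A DROP_NODE must wait while any set still originates on that node.
        return event[0] == "DROP_NODE" and event[1] in self.set_origins.values()

    def _apply(self, origin, event):
        if event[0] == "FAILOVER":
            _, failed, backup = event
            for set_id, set_origin in list(self.set_origins.items()):
                if set_origin == failed:
                    self.set_origins[set_id] = backup
        self.log.append((origin, event[0]))

    def run(self):
        """Apply events until every queue is empty or suspended."""
        progress = True
        while progress:
            progress = False
            for origin in sorted(self.queues):
                queue = self.queues[origin]
                while queue and not self._blocked(queue[0]):
                    self._apply(origin, queue.popleft())
                    progress = True

# Node 2 subscribes to set 1, whose origin is the failed node 1.
node2 = Subscriber({1: 1})

# The DROP_NODE from node 4 arrives first: the queue for origin 4 is suspended.
node2.enqueue(4, ("DROP_NODE", 1))
node2.enqueue(4, ("SYNC",))
node2.run()
log_while_suspended = list(node2.log)

# The events attributed to the failed node arrive later and unblock origin 4.
node2.enqueue(1, ("SYNC",))
node2.enqueue(1, ("FAILOVER", 1, 4))
node2.run()
```

In this model nothing from origin 4 is applied until the failover has moved the set away from node 1, while events from other origins keep flowing. It does not address what happens if the failover event never arrives.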
*** Bug 130 has been marked as a duplicate of this bug. ***
This should also be in 2.0.
We have to push this one back to devel.
There are several issues with a premature DROP NODE. One is that the function dropNode_int() cleans up after the dropped node: it deletes every reference to that node from sl_path, sl_listen, sl_confirm and sl_event. This can potentially destroy the FAILOVER_NODE or MOVE_SET event before it has been forwarded to everybody else.
However, we cannot easily detect what needs to be waited for because it is possible to have a multi-node failure, so some other node will never confirm those events.
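A small illustration of why "wait until everybody has confirmed" is not enough. This is a hypothetical sketch: the function names and the idea of passing in a known list of dead nodes are assumptions made for the example, and a real node has no such list available, which is exactly the difficulty.

```python
def unconfirmed(all_nodes, event_origin, confirmed_by):
    """Nodes other than the event's origin that have not confirmed it yet."""
    return set(all_nodes) - {event_origin} - set(confirmed_by)

def wait_can_finish(all_nodes, event_origin, confirmed_by, dead_nodes):
    """A wait on all confirmations can only finish if no missing node is dead."""
    return not (unconfirmed(all_nodes, event_origin, confirmed_by) & set(dead_nodes))

nodes = [1, 2, 3, 4]

# The failover event is attributed to failed node 1; so far only node 4 confirmed it.
waiting_on = unconfirmed(nodes, event_origin=1, confirmed_by=[4])

# Single failure: nodes 2 and 3 are alive and will confirm eventually.
single_failure_ok = wait_can_finish(nodes, 1, [4], dead_nodes=[1])

# Multi-node failure: node 3 died as well and will never confirm.
multi_failure_ok = wait_can_finish(nodes, 1, [4], dead_nodes=[1, 3])
```

With node 3 also gone, the set of missing confirmations never empties, so a DROP NODE that waits for it would be deferred forever.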
At this point I don't have a plan for how to finally fix this problem. It might require a new event type.