At node startup, check whether the nodes providing my subscriptions believe my node exists:
- If they do, all is well
- If they don't, then presumably I'm a failed node, and I should stop with a fatal error
- If a connection cannot be established, then warn of this (probably with a pretty quick timeout) but continue, for now...
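The three-way policy above can be sketched as a small decision function. This is purely illustrative; the names (`Verdict`, `startup_verdict`) are hypothetical and not part of Slony-I.

```python
# Hypothetical sketch of the proposed startup policy; names are
# illustrative, not Slony-I APIs.
from enum import Enum

class Verdict(Enum):
    OK = "all is well"
    FATAL = "presumed failed node; stop with a fatal error"
    WARN = "provider unreachable; warn and continue for now"

def startup_verdict(provider_knows_me):
    """provider_knows_me: True (provider sees my node), False (it does
    not), or None (connection could not be established)."""
    if provider_knows_me is None:
        return Verdict.WARN   # connection failure: warn, keep going
    return Verdict.OK if provider_knows_me else Verdict.FATAL
```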
Added "bug171" branch:
Added code to perform the test:
- Pull the list of all parent nodes
- If none, then we're probably OK
- For each one found:
  - Try to connect
    - Failure -> warning (not a FATAL error!)
  - See if the parent agrees that the slon's node is a subscriber
    - If so, all is OK
    - If not, then the local node was probably dropped by a failover, so slon dies with a FATAL error
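The per-provider loop might look roughly like the following sketch, with the connection attempt and the catalog lookup stubbed out as callbacks. All function names here are hypothetical stand-ins, not the actual patch.

```python
# Illustrative sketch of the bug171 startup test. connect() and
# is_subscriber() are stand-ins for a libpq connection attempt and a
# query against the provider's Slony-I catalog.
def check_subscriptions(providers, connect, is_subscriber, local_id):
    """Return 'FATAL' if any reachable provider denies that local_id is
    a subscriber; unreachable providers only produce warnings."""
    if not providers:
        return "OK"           # no parent nodes: probably fine
    for p in providers:
        conn = connect(p)
        if conn is None:
            # connection failure is a warning, not a FATAL error
            print(f"WARNING: cannot connect to provider {p}")
            continue
        if not is_subscriber(conn, local_id):
            # local node likely dropped by a failover
            return "FATAL"
    return "OK"
```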
A few initial comments:
You state "If they don't, then presumably I'm a failed node, and I should stop with a fatal error"
I'm not sure we can presume that. The problem might be with the provider node, not the local node.
Consider a cluster of the form
If something happens to node a and you want to do a failover from a-->b
then node c might need to learn about cluster changes from node b via slon.
You don't want node c exiting on startup when it could talk to b.
Similarly the case
The provider 'b' might have the problem not node c. You can't assume that node c (the local node) is gone.
Your patch also does not check the return code from the connect. As I read the patch, a connection failure to one of its providers at startup will prevent slon from starting, even though in your first comment you say the feature should be resilient to this.
Maybe it would be better for the remote listener to check the remote database to see if it has an associated sl_node entry for the local id. If not, the remote listener should do nothing, sleep, and retry periodically.
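That alternative is essentially a poll-until-registered loop. A minimal sketch, assuming a `has_sl_node_entry` callback standing in for the real sl_node query (the callback and parameters are hypothetical):

```python
# Sketch of the suggested remote-listener behaviour: do nothing, sleep,
# and retry until the remote node's catalog shows the local id.
import time

def wait_for_registration(has_sl_node_entry, local_id,
                          poll_seconds=10, max_attempts=None):
    """Block until has_sl_node_entry(local_id) is true; returns False
    if max_attempts polls are exhausted first."""
    attempts = 0
    while not has_sl_node_entry(local_id):
        attempts += 1
        if max_attempts is not None and attempts >= max_attempts:
            return False      # give up after a bounded number of polls
        time.sleep(poll_seconds)
    return True
```

A bounded `max_attempts` is optional; the comment above suggests retrying indefinitely rather than exiting.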
Further discussion shows that it is problematic to pin down the cases where this check should cause slon to terminate with a fatal error.
The trouble is that there is little certainty in the determination that things are well and truly broken.
1. There isn't a straightforward way to indicate that a node is being "shunned" due to failover.
2. It may be that one or the other of the disagreeing nodes is simply behind in processing events from other places, and the contradiction may shortly clear itself up.
There's no straightforward way to distinguish either case.
Once more of the "automated WAIT FOR" logic is in place, we may have clearer answers; attaching dependency and deferring...