Bugzilla – Bug 171
Health Check: Verify node's parentage
Last modified: 2010-12-30 08:46:57 PST
At node startup, check whether the nodes providing my subscriptions
believe my node exists:
- If they do, all is well
- If they don't, then presumably I'm a failed node, and I should stop with a
  fatal error
- If a connection cannot be established, then warn of this (probably with a
  pretty quick timeout) but continue, for now...
Added "bug171" branch:
Added code to perform the test:
- Pull the list of all parent nodes
- If there are none, then we're probably OK
- For each one found:
  - Try to connect
    - Failure -> warning (not a FATAL error!)
  - See if the parent agrees that the slon's node is a subscriber
    - If so, all is OK
    - If not, then the local node has probably been dropped out by a
      failover, so slon dies with a FATAL error
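The decision logic above can be sketched roughly as follows. This is a
hypothetical Python model of the check, not the actual slon C code; the names
`connect` and `is_subscriber` are stand-ins for the real libpq connection
attempt and the query against the provider's Slony catalog tables:

```python
def check_parentage(local_node, providers, connect, is_subscriber):
    """Model of the bug171 startup test (all names hypothetical).

    local_node    -- id of the node this slon serves
    providers     -- ids of all nodes providing our subscriptions
    connect       -- callable(node_id) -> connection, or None on failure
    is_subscriber -- callable(conn, node_id) -> True if the provider
                     believes node_id is one of its subscribers
    """
    if not providers:
        return "ok"                # no parents: we're probably OK
    result = "ok"
    for p in providers:
        conn = connect(p)
        if conn is None:
            result = "warn"        # connection failure: warn, don't die
            continue
        if not is_subscriber(conn, local_node):
            return "fatal"         # provider disagrees: slon exits
    return result
```

Note that a connection failure only downgrades the result to a warning, while
a provider that explicitly disagrees short-circuits to a fatal outcome.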
A few initial comments:
You state: "If they don't, then presumably I'm a failed node, and I should
stop with a fatal error."
I'm not sure we can presume that. The problem might be with the provider node,
not the failed node.
Consider a cluster where node a provides subscriptions to both node b and
node c. If something happens to node a and you want to do a failover from
a-->b, then node c might need to learn about the cluster changes from node b
via slon. You don't want node c exiting on startup when it could talk to b.
Similarly, consider the case where node b is node c's provider. The provider
'b' might have the problem, not node c. You can't assume that node c (the
local node) is gone.
Your patch also does not check the return code from the connect. As I read
your patch, if slon has a connection failure to one of its providers at
startup, then it will not start, even though in the first comment you mention
that the feature should be resilient to this.
Maybe it would be better for the remote listener to check the remote database
to see whether it has an sl_node entry for the local node id. If not, the
remote listener should do nothing/sleep and retry periodically.
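That alternative could look something like the sketch below. The names are
hypothetical; in a real implementation `has_local_entry` would query sl_node
on the remote database for the local node id:

```python
import time

def wait_for_registration(has_local_entry, poll_seconds=60, max_attempts=None):
    """Poll until the provider knows about the local node.

    has_local_entry -- callable() -> bool; stands in for checking the
                       remote database's sl_node for the local node id
    poll_seconds    -- sleep between retries
    max_attempts    -- optional cap, after which we give up
    """
    attempts = 0
    while not has_local_entry():
        attempts += 1
        if max_attempts is not None and attempts >= max_attempts:
            return False           # still unknown; caller decides next step
        time.sleep(poll_seconds)   # do nothing/sleep, then retry
    return True
```

The point of the sketch is that a missing entry leads to waiting rather than
a fatal exit, so a node that is merely behind eventually catches up.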
Further discussion showed that it is problematic to pin down the cases where
this check should cause the slon to terminate with a fatal error.
The trouble is that there is little certainty in the determination that things
are well and truly broken.
1. There isn't a straightforward way to indicate that a node is being
"shunned" due to failover.
2. It may be that one or the other of the disagreeing nodes is simply behind
in processing events from other places, and the contradiction may shortly
clear up.
There's no straightforward way to determine either case.
Once more of the "automated WAIT FOR" logic is in place, we may have clearer
answers; attaching dependency and deferring...