Bug 171 - Health Check: Verify node's parentage
Status: NEW
Product: Slony-I
Component: slon
Version: devel
Hardware: All All
Importance: low enhancement
Assigned To: Slony Bugs List
Depends on: 179
Reported: 2010-12-03 12:54 PST by Christopher Browne
Modified: 2010-12-30 08:46 PST
CC: 1 user

Description Christopher Browne 2010-12-03 12:54:01 PST
Per [http://wiki.postgresql.org/wiki/SlonyBrainstorming#Parentage_Check]

At node startup, check to see if the nodes providing my subscriptions
believe my node exists:
- If they do, all is well
- If they don't, then presumably I'm a failed node, and I should stop with a
fatal error
- If a connection cannot be established, then warn of this (probably with a
pretty quick timeout) but continue, for now... (a sketch of this check follows)
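A minimal sketch of such a check, using libpq; it assumes a Slony-I catalog
where sl_node carries a no_id column, and the schema name "_cluster", the
function name, and its arguments are all illustrative, not the actual slon
code:

#include <stdio.h>
#include <libpq-fe.h>

/* Returns 1 if the provider believes the local node exists, 0 if it
 * does not, and -1 if the provider could not be reached or queried. */
static int
check_parent_knows_me(const char *provider_conninfo, int local_node_id)
{
    PGconn     *conn;
    PGresult   *res;
    char        query[128];
    int         known;

    conn = PQconnectdb(provider_conninfo);
    if (PQstatus(conn) != CONNECTION_OK)
    {
        /* Warn of the connection failure, but do not treat it as fatal. */
        fprintf(stderr, "WARNING: cannot reach provider: %s",
                PQerrorMessage(conn));
        PQfinish(conn);
        return -1;
    }

    snprintf(query, sizeof(query),
             "select 1 from \"_cluster\".sl_node where no_id = %d",
             local_node_id);
    res = PQexec(conn, query);
    if (PQresultStatus(res) != PGRES_TUPLES_OK)
    {
        fprintf(stderr, "WARNING: parentage query failed: %s",
                PQerrorMessage(conn));
        known = -1;
    }
    else
        known = (PQntuples(res) > 0);
    PQclear(res);
    PQfinish(conn);
    return known;
}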
Comment 1 Christopher Browne 2010-12-09 10:33:17 PST
Added "bug171" branch: 
https://github.com/cbbrowne/slony1-engine/tree/bug171

Added code to perform the test:

https://github.com/cbbrowne/slony1-engine/commit/23402816c258920aa0e3e0e23685c71748395689

 - Pull list of all parent nodes
 - If none, then we're probably OK
 - For each one found
   - try to connect
     - Failure -> warning (not a FATAL error!)
     - See if parent agrees that the slon's node is a subscriber
       - If so, all is OK
       - If not, then the local node was probably dropped by a
         failover, so slon dies with a FATAL error (sketched below)
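In outline, that loop might look like the following sketch. It reuses the
hypothetical check_parent_knows_me() from the earlier sketch, and
provider_conninfo() stands in for however slon actually resolves a
provider's conninfo (e.g. from sl_path):

#include <stdio.h>
#include <stdlib.h>

extern int  check_parent_knows_me(const char *conninfo, int local_node_id);
extern const char *provider_conninfo(int node_id);

static void
verify_parentage(int local_node_id, const int *providers, int n_providers)
{
    int         i;

    /* No parent nodes: nothing to verify, we're probably OK. */
    if (n_providers == 0)
        return;

    for (i = 0; i < n_providers; i++)
    {
        int known = check_parent_knows_me(provider_conninfo(providers[i]),
                                          local_node_id);

        if (known < 0)
            continue;           /* unreachable: already warned, keep going */
        if (known == 0)
        {
            /* Provider has no record of us: probably dropped by a
             * failover, so die with a FATAL error. */
            fprintf(stderr, "FATAL: node %d unknown to provider node %d\n",
                    local_node_id, providers[i]);
            exit(1);
        }
    }
}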
Comment 2 Steve Singer 2010-12-14 13:20:45 PST
A few initial comments:

You state "If they don't, then presumably I'm a failed node, and I should stop
with a fatal error"

I'm not sure we can presume that.  The problem might be with the provider
node, not the failed node.

Consider a cluster of the form

b<--a--->c

If something happens to node a and you want to do a failover from a-->b,
then node c might need to learn about cluster changes from node b via slon.
You don't want node c exiting on startup when it could talk to b.

Similarly, in the case

a-->b-->c

the provider 'b' might have the problem, not node c.  You can't assume that
node c (the local node) is gone.


Your patch also does not check the return code from the connect.  As I read
your patch, I think it means that if slon has a connection failure to one of
its providers at startup time, it will not start, though in the first comment
you mention that the feature should be resilient to this.

Maybe it would be better for the remote listener to check the remote database
to see if it has an associated sl_node entry for the local id.  If not, the
remote listener should do nothing/sleep and retry periodically.
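A sketch of that alternative, again built on the hypothetical
check_parent_knows_me() helper from the earlier sketches; the retry interval
is an arbitrary illustrative value:

#include <unistd.h>

extern int  check_parent_knows_me(const char *conninfo, int local_node_id);

#define PARENTAGE_RETRY_SECONDS 10

static void
wait_until_parent_knows_me(const char *provider_conninfo, int local_node_id)
{
    /* Instead of exiting with a FATAL error, do nothing/sleep and retry
     * until the provider has an sl_node entry for the local node id. */
    while (check_parent_knows_me(provider_conninfo, local_node_id) <= 0)
        sleep(PARENTAGE_RETRY_SECONDS);
}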
Comment 3 Christopher Browne 2010-12-16 08:54:45 PST
Further discussion shows that it is problematic to pin down the cases where
this check should cause the slon to terminate with a fatal error.

The trouble is that there is little certainty in the determination that things
are well and truly broken.

1.  There isn't a straightforward way to indicate that a node is being
"shunned" due to failover.

2.  It may be that one or the other of the disagreeing nodes is simply behind
in processing events from other places, and the contradiction may shortly clear
itself up.

There's no straightforward way to determine either.

Once more of the "automated WAIT FOR" logic is in place, we may have clearer
answers; attaching dependency and deferring...