| Summary: | Health Check: Verify node's parentage | | |
|---|---|---|---|
| Product: | Slony-I | Reporter: | Christopher Browne <cbbrowne> |
| Component: | slon | Assignee: | Slony Bugs List <slony1-bugs> |
| Status: | NEW | | |
| Severity: | enhancement | CC: | slony1-bugs |
| Priority: | low | | |
| Version: | devel | | |
| Hardware: | All | | |
| OS: | All | | |
| Bug Depends on: | 179 | | |
| Bug Blocks: | | | |
Description
Christopher Browne, 2010-12-03 12:54:01 UTC
Added "bug171" branch: https://github.com/cbbrowne/slony1-engine/tree/bug171

Added code to perform the test: https://github.com/cbbrowne/slony1-engine/commit/23402816c258920aa0e3e0e23685c71748395689

- Pull the list of all parent nodes
- If there are none, then we're probably OK
- For each one found:
  - Try to connect
    - Failure -> warning (not a FATAL error!)
  - See if the parent agrees that the slon's node is a subscriber
    - If so, all is OK
    - If not, then the local node was probably dropped out by a failover, so slon dies with a FATAL error

A few initial comments:

You state: "If they don't, then presumably I'm a failed node, and I should stop with a fatal error." I'm not sure we can presume that; the problem might be with the provider node, not the failed node.

Consider a cluster of the form b <-- a --> c. If something happens to node a and you want to do a failover from a to b, then node c might need to learn about cluster changes from node b via slon. You don't want node c exiting on startup when it could talk to b.

Similarly, in the case a --> b --> c, the provider b might have the problem, not node c. You can't assume that node c (the local node) is gone.

Your patch also does not check the return code from the connect. As I read your patch, if slon has a connection failure to one of its providers at startup, then it will not start, although in the first comment you mention that the feature should be resilient to this.

Maybe it would be better for the remote listener to check the remote database to see if it has an associated sl_node entry for the local id. If not, the remote listener should do nothing/sleep and retry periodically.

Further discussion shows that it's problematic to enumerate the cases where this check should cause the slon to terminate with a fatal error. The trouble is that there is little certainty in the determination that things are well and truly broken:

1. There isn't a straightforward way to indicate that a node is being "shunned" due to failover.
2. It may be that one or the other of the disagreeing nodes is simply behind in processing events from other places, and the contradiction may shortly clear itself up.

There's no straightforward way to determine either. Once more of the "automated WAIT FOR" logic is in place, we may have clearer answers; attaching dependency and deferring...
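The health-check steps from the description can be sketched as follows. This is a minimal, hypothetical Python model of the proposed logic, not the actual C slon code: `check_parentage` and `connect` are illustrative stand-ins for the libpq connection attempt and the query of the provider's `sl_subscribe` table, and it incorporates the review comment that a connection failure should only warn, never kill slon.

```python
def check_parentage(local_id, parents, connect):
    """Model of the proposed parentage health check.

    parents    -- ids of this node's provider nodes
    connect(p) -- stand-in for connecting to provider p and reading its
                  set of known subscriber node ids; raises ConnectionError
                  if the provider is unreachable.
    Returns "ok", "warn", or "fatal".
    """
    warned = False
    for parent in parents:
        try:
            subscribers = connect(parent)
        except ConnectionError:
            # Provider unreachable: warn and move on. The fault may be
            # with the provider, not with us, so slon must not die here.
            warned = True
            continue
        if local_id not in subscribers:
            # Provider is reachable but does not list us as a subscriber:
            # under the original proposal, the local node was probably
            # dropped by a failover, so slon exits with a FATAL error.
            return "fatal"
    # No parents at all, or every reachable parent agrees: probably OK.
    return "warn" if warned else "ok"
```

Note that the later comments argue the "fatal" branch is exactly the uncertain case (the provider may simply be behind on events), which is why the bug was deferred pending the "automated WAIT FOR" work.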
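The alternative suggested in the review (have the remote listener sleep and retry instead of exiting) could look like the sketch below. It is purely illustrative: `wait_for_registration` and `fetch_node_ids` are hypothetical names, with `fetch_node_ids()` standing in for querying the provider's `sl_node` table for an entry matching the local node id.

```python
import time

def wait_for_registration(local_id, fetch_node_ids, interval=10.0, attempts=None):
    """Poll until the provider knows about the local node.

    fetch_node_ids() -- stand-in for reading node ids from the
                        provider's sl_node table.
    interval         -- seconds to sleep between polls.
    attempts         -- give up after this many polls (None = retry forever).
    Returns True once local_id appears, False if attempts are exhausted.
    """
    tries = 0
    while attempts is None or tries < attempts:
        if local_id in fetch_node_ids():
            return True                 # provider now has our sl_node entry
        tries += 1
        time.sleep(interval)            # "do nothing/sleep and retry"
    return False
```

This shape avoids the startup-failure problem the reviewer raises: a temporarily missing or lagging provider delays the listener rather than killing the whole slon process.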