| Summary: | Health Check: Verify node's parentage | | |
|---|---|---|---|
| Product: | Slony-I | Reporter: | Christopher Browne <cbbrowne> |
| Component: | slon | Assignee: | Slony Bugs List <slony1-bugs> |
| Status: | NEW | | |
| Severity: | enhancement | CC: | slony1-bugs |
| Priority: | low | | |
| Version: | devel | | |
| Hardware: | All | | |
| OS: | All | | |
| Bug Depends on: | 179 | | |
| Bug Blocks: | | | |
Description
Christopher Browne, 2010-12-03 12:54:01 UTC
Added "bug171" branch: https://github.com/cbbrowne/slony1-engine/tree/bug171

Added code to perform the test: https://github.com/cbbrowne/slony1-engine/commit/23402816c258920aa0e3e0e23685c71748395689

- Pull the list of all parent nodes
- If there are none, then we're probably OK
- For each one found:
  - Try to connect
    - Failure -> warning (not a FATAL error!)
  - See if the parent agrees that the slon's node is a subscriber
    - If so, all is OK
    - If not, then the local node was probably dropped out by a failover, so slon dies with a FATAL error

A few initial comments:

You state: "If they don't, then presumably I'm a failed node, and I should stop with a fatal error." I'm not sure we can presume that; the problem might be with the provider node, not the failed node.

Consider a cluster of the form b <-- a --> c. If something happens to node a and you want to do a failover from a to b, then node c might need to learn about cluster changes from node b via slon. You don't want node c exiting on startup when it could talk to b.

Similarly, in the case a --> b --> c, the provider b might have the problem, not node c. You can't assume that node c (the local node) is gone.

Your patch also does not check the return code from the connect. As I read your patch, if slon has a connection failure to one of its providers at startup, then it will not start, although in the first comment you mention that the feature should be resilient to this.

Maybe it would be better for the remote listener to check the remote database to see if it has an associated sl_node entry for the local id. If not, the remote listener should do nothing/sleep and retry periodically.

Further discussion shows that it's problematic to enumerate the cases where this check should cause the slon to terminate with a fatal error. The trouble is that there is little certainty in the determination that things are well and truly broken:

1. There isn't a straightforward way to indicate that a node is being "shunned" due to failover.
2. It may be that one or the other of the disagreeing nodes is simply behind in processing events from other places, and the contradiction may shortly clear itself up.

There's no straightforward way to determine either. Once more of the "automated WAIT FOR" logic is in place, we may have clearer answers; attaching dependency and deferring...
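The health-check steps from the description can be sketched as follows. This is a minimal, hypothetical Python model of the proposed logic, not the actual C slon code: `check_parentage` and `connect` are illustrative stand-ins for the libpq connection attempt and the query of the provider's `sl_subscribe` table, and it incorporates the review comment that a connection failure should only warn, never kill slon.

```python
def check_parentage(local_id, parents, connect):
    """Model of the proposed parentage health check.

    parents    -- ids of this node's provider nodes
    connect(p) -- stand-in for connecting to provider p and reading its
                  set of known subscriber node ids; raises ConnectionError
                  if the provider is unreachable.
    Returns "ok", "warn", or "fatal".
    """
    warned = False
    for parent in parents:
        try:
            subscribers = connect(parent)
        except ConnectionError:
            # Provider unreachable: warn and move on. The fault may be
            # with the provider, not with us, so slon must not die here.
            warned = True
            continue
        if local_id not in subscribers:
            # Provider is reachable but does not list us as a subscriber:
            # under the original proposal, the local node was probably
            # dropped by a failover, so slon exits with a FATAL error.
            return "fatal"
    # No parents at all, or every reachable parent agrees: probably OK.
    return "warn" if warned else "ok"
```

Note that the later comments argue the "fatal" branch is exactly the uncertain case (the provider may simply be behind on events), which is why the bug was deferred pending the "automated WAIT FOR" work.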
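The alternative suggested in the review (have the remote listener sleep and retry instead of exiting) could look like the sketch below. It is purely illustrative: `wait_for_registration` and `fetch_node_ids` are hypothetical names, with `fetch_node_ids()` standing in for querying the provider's `sl_node` table for an entry matching the local node id.

```python
import time

def wait_for_registration(local_id, fetch_node_ids, interval=10.0, attempts=None):
    """Poll until the provider knows about the local node.

    fetch_node_ids() -- stand-in for reading node ids from the
                        provider's sl_node table.
    interval         -- seconds to sleep between polls.
    attempts         -- give up after this many polls (None = retry forever).
    Returns True once local_id appears, False if attempts are exhausted.
    """
    tries = 0
    while attempts is None or tries < attempts:
        if local_id in fetch_node_ids():
            return True                 # provider now has our sl_node entry
        tries += 1
        time.sleep(interval)            # "do nothing/sleep and retry"
    return False
```

This shape avoids the startup-failure problem the reviewer raises: a temporarily missing or lagging provider delays the listener rather than killing the whole slon process.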