Christopher Browne cbbrowne
Tue Jun 21 17:56:57 PDT 2005
31337 .. wrote:

>Ok, I'm a new Slony user, I've only been messing with it for a few
>days. I will need to implement this very soon, and I need to come up
>with a failover/switchover policy. What are you guys doing to decide
>"master node is down"? I have considered setting up another database
>on the master, and having a separate server do a write to that
>database, then try to read it back. If the write and read succeed,
>the server is ok; if either fails, start the switchover script to
>make the next node the master.
>Are there any other easier ways to detect when the master node has
>gone down?
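
For concreteness, a minimal sketch of that write-then-read probe,
assuming a hypothetical "heartbeat" table and connection string
(adjust the DSN and timeout to taste):

    # Write-then-read probe sketch, using psycopg2.  The table name,
    # DSN, and timeout are hypothetical placeholders.
    import sys
    import psycopg2

    DSN = "host=master dbname=probe user=monitor connect_timeout=5"

    def master_is_alive():
        try:
            conn = psycopg2.connect(DSN)
            cur = conn.cursor()
            # Write a heartbeat row, then read it back.
            cur.execute("INSERT INTO heartbeat (ts) VALUES (now())")
            conn.commit()
            cur.execute("SELECT max(ts) FROM heartbeat")
            cur.fetchone()
            conn.close()
            return True
        except psycopg2.Error:
            return False

    if __name__ == "__main__":
        if master_is_alive():
            sys.exit(0)
        sys.exit(1)
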
This depends on everything up to and including hardware fault analysis
tools.

--> What if an Ethernet cable somewhere between the hosts has an
intermittent fault?

That will lead to the "attempted write" failing.

--> What if a power supply on a (router|disk array|computer) fails? 

That can disconnect one or another component, and lead to the "attempted
write" failing.

Those are all hardware failures that would lead to a 'fault' being
raised by your test; only you can answer whether your "fault test" can
'safely' impose the policy that a fault detected this way triggers
FAIL OVER, which treats the 'possibly dead' node as destroyed.
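
At best, such a probe can report which *layer* failed, though it still
cannot tell a dead server from a dead cable in between.  A rough
sketch, with host, port, and connection details all hypothetical:

    # Distinguish "cannot reach the host at all" from "host reachable
    # but postgres not answering".
    import socket
    import psycopg2

    def classify_failure(host="master", port=5432):
        # Layer 1: can we open a TCP connection to the postmaster?
        try:
            s = socket.create_connection((host, port), timeout=5)
            s.close()
        except socket.error:
            return "network: host (or the path to it) unreachable"
        # Layer 2: TCP works; does the database answer a trivial query?
        try:
            conn = psycopg2.connect(host=host, port=port,
                                    dbname="probe", user="monitor")
            conn.cursor().execute("SELECT 1")
            conn.close()
        except psycopg2.Error:
            return "database: host reachable, postgres not answering"
        return "ok"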

As far as *I* am concerned, failover is the sort of thing that would
involve me calling one of our network admins to verify that the master
is well and truly broken from the network perspective, and then
escalating to the appropriate management level for a Manager to say
"Yes, Chris, fail it over."  (And yes, I'd make that call at 3am, if
need be...)
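
For what it's worth, once a human has said yes, the mechanics of the
failover itself boil down to a slonik FAILOVER script.  A sketch, with
the cluster name, node ids, and conninfos all made up:

    # Run a slonik FAILOVER script; node 1 (the dead master) is
    # abandoned and node 2 takes over.  All names are hypothetical.
    import subprocess

    SLONIK_SCRIPT = """
    cluster name = mycluster;
    node 1 admin conninfo = 'host=master dbname=mydb user=slony';
    node 2 admin conninfo = 'host=standby dbname=mydb user=slony';
    failover (id = 1, backup node = 2);
    """

    # slonik reads its commands from stdin; a nonzero exit status
    # means the failover did not complete.
    subprocess.run(["slonik"], input=SLONIK_SCRIPT, text=True, check=True)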

Really and truly, making this policy is NOT a matter for discussion on
this list; it is a matter for you to discuss with your "powers that be"
in order to properly factor the *business* factors into the policy.

It might well be that you discover you need to buy some more hardware to
help improve the ability to analyze hardware faults.  And it is worth
pointing out that people spend literally millions of dollars on tools
like HP OpenView, IBM Tivoli, and such, and they have NOT become any
sort of magical "silver bullet" to correctly diagnose hardware faults.

We've got some guys that spend some of their time (and hence some
not-nominal amount of money) generating Nagios tests, and that again
doesn't provide any sort of "diagnosis for free."  When they discover
that something breaks, particularly in a complex network environment, it
then takes a sharp technical mind to figure out what broke, and why. 
The automated tools can do no more than provide some clues.
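
For reference, a Nagios check is just a program that prints one status
line and exits 0 (OK), 1 (WARNING), or 2 (CRITICAL); a trivial one for
the master, with a hypothetical DSN, might be:

    # Minimal Nagios-style check: one status line, exit 0=OK,
    # 2=CRITICAL.  It tells you THAT something broke, not WHAT broke.
    import sys
    import psycopg2

    try:
        conn = psycopg2.connect("host=master dbname=probe user=monitor "
                                "connect_timeout=5")
        conn.cursor().execute("SELECT 1")
        conn.close()
    except psycopg2.Error as e:
        print("MASTER CRITICAL: %s" % e)
        sys.exit(2)
    print("MASTER OK")
    sys.exit(0)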

Sorry not to be more directly helpful, but I would not want you to fool
yourself into thinking that there is some easy answer right around the
corner.

Automatic FAIL OVER represents Risky Business...

