Tue Jun 21 17:56:57 PDT 2005
- Previous message: [Slony1-general] what to consider for failover policy?
- Next message: [Slony1-general] metilinx
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
31337 .. wrote:

>Ok, I'm a new slony'er, only been messing with it for a few days. I
>will need to implement this very soon, and I need to come up with a
>failover/switchover policy. What are you guys doing to say 'master
>node is down'? I have considered setting up another database on the
>master, and having a separate server do a 'write' to the database,
>then try to read it. If the read/write succeeded, then the server is
>ok, if it fails, to start the switchover script to change the next
>node to be the master.
>Are there any other easier ways to detect when the master node has gone down?
>

This depends on everything up to and including hardware fault
analysis tools.

--> What if an Ethernet cable somewhere between the hosts has an
    intermittent fault? That will lead to the "attempted write"
    failing.

--> What if a power supply on a (router|disk array|computer) fails?
    That can disconnect one or another component, and lead to the
    "attempted write" failing.

Those are all sorts of hardware failures that would lead to a 'fault'
being raised by your test; only you can answer the question of whether
your "fault test" can 'safely' impose the policy that detecting faults
in that fashion leads to using FAIL OVER to indicate that the
'possibly dead' node should be treated as destroyed.

As far as *I* am concerned, failover is the sort of thing that would
involve me calling one of our network admins to verify that the master
is well and truly broken from the network perspective, and then
escalating to the appropriate management level for a Manager to say
"Yes, Chris, fail it over." (And yes, I'd make that call at 3am, if
need be...)

Really and truly, making this policy is NOT a matter for discussion on
this list; it is a matter for you to discuss with your "powers that
be" in order to properly factor the *business* factors into the
policy. It might well be that you discover you need to buy some more
hardware to help improve the ability to analyze hardware faults.

And it is worth pointing out that people spend literally millions of
dollars on tools like HP OpenView, IBM Tivoli, and such, and they have
NOT become any sort of magical "silver bullet" to correctly diagnose
hardware faults.

We've got some guys that spend some of their time (and hence some
not-nominal amount of money) generating Nagios tests, and that again
doesn't provide any sort of "diagnosis for free." When they discover
that something breaks, particularly in a complex network environment,
it then takes a sharp technical mind to figure out what broke, and
why. The automated tools can do no more than provide some clues.

Sorry not to be more directly helpful, but I would not want you to
fool yourself into thinking that there is some easy answer right
around the corner. Automatic FAIL OVER represents Risky Business...
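
For anyone who still wants the write-then-read test from the original question, here is a minimal sketch of one way to wire it up so that it only produces a clue for a human, in line with the points above. Everything in it is an assumption for illustration: the names `watch_master`, `probe`, and `alert`, the threshold of three consecutive failures, and the ten-second interval are all hypothetical and are not part of Slony-I. The probe itself (the actual write and read against the master) is deliberately left to the caller. Note that the sketch never runs a switchover or FAIL OVER by itself.

```python
import time
from typing import Callable


def watch_master(probe: Callable[[], bool],
                 alert: Callable[[str], None],
                 failures_needed: int = 3,
                 interval_s: float = 10.0,
                 max_checks: int = 100,
                 sleep: Callable[[float], None] = time.sleep) -> bool:
    """Run `probe` repeatedly and page a human after repeated failures.

    `probe` is a hypothetical callable that performs the write-then-read
    test and returns True on success. `alert` is a hypothetical callable
    that notifies a person. Returns True if an alert was raised.
    """
    consecutive = 0
    for _ in range(max_checks):
        try:
            ok = probe()
        except Exception:
            # A raised error (timeout, refused connection, ...) counts as a
            # failed probe; it does not tell us *what* broke.
            ok = False

        # One success resets the count, so an intermittent fault (for
        # example a flaky cable) does not accumulate into an alert.
        consecutive = 0 if ok else consecutive + 1

        if consecutive >= failures_needed:
            # Only a clue: the master may be fine and the network may not be.
            alert(f"master probe failed {consecutive} times in a row; "
                  "verify with network admins before any FAIL OVER")
            return True

        sleep(interval_s)
    return False
```

A caller would pass in a `probe` that does the write and read against the master with whatever database driver is in use, and an `alert` that pages whoever is on call. Requiring several consecutive failures only filters out brief blips; it cannot tell a dead master from a dead router, which is why the decision stays with a person.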