Andrew Sullivan ajs
Tue Sep 19 08:55:26 PDT 2006
On Tue, Sep 19, 2006 at 12:21:33PM -0300, User Marc wrote:
> As an addendum to this ... in over 10 years, what happened this summer is 
> a first 

I don't want to get into a bun fight over what has happened over
time; but while the extended outage this summer does seem to me to
have been a first, it is by no means the first serious failure over
the past several years.  Speaking for myself, I would have had a very
different reaction to all of this if we'd never before seen critical
services go offline: bad things sometimes happen.  But it doesn't
take much digging in the -hackers archive, for example, to discover
complaints about anonymous CVS not working.  Yes, it's usually fixed
quickly.  That's not the issue.  The issue is that it broke, and the
way it got fixed is someone noticed.

> Right now, all postgresql.org related vServers are backed up to the local 
> network, onto a second 64bit HP Proliant server, in case the one they are 
> on blows up ... as well, they are all backed up to a backup server on my 
> network here ... as well, over the next couple of days now that things 
> have finally started to quiet down, they will also be backed up to a 
> second *off site*, 64bit server, where they could come online very quickly 
> in case all my servers happen to blow sky high ...

None of that addresses the issue of who makes the decision, when,
how, and with what degree of confidence to fail over, to retire a
dead vServer, to restore from backup, &c.  I have to agree with Jan
that the answer to "we have a hit-by-bus problem" is not to have an
emergency panic-only backup person.  That's no way to run the
infrastructure of a project that purports to provide
industrial-quality tools.

A
-- 
Andrew Sullivan  | ajs at crankycanuck.ca
A certain description of men are for getting out of debt, yet are
against all taxes for raising money to pay it off.
		--Alexander Hamilton