Andrew Hammond andrew.george.hammond
Wed Sep 20 10:18:47 PDT 2006
On 9/19/06, Andrew Sullivan <ajs at crankycanuck.ca> wrote:
> I'm going to take this list, and expand it some, in point form.
>
> On Tue, Sep 19, 2006 at 10:27:03AM -0700, Darcy Buskermolen wrote:
> > OK, well then, at this end let's all start compiling a list of requirements
> > for the Slony project.
> >
> > 1) Reliable, redundant, distributed infrastructure (manpower, hardware,
> > bandwidth)
>
> I.   On service availability
>
> A.  Reliable infrastructure
>
> 1.      Service uptime (!= individual server uptime, maybe)
>
> 2.      Network uptime

s/uptime/availability/ ?

If we differentiate between services using tiers, then network uptime
becomes implicit. For example:

i. First-tier services include DNS resolution, essential (perhaps
static) web pages, and backups / off-site synchronization / replication

ii. Second-tier services include CVS (or whatever source control
system is selected) and maybe mailing lists

iii. Third-tier services include non-static web pages and mailing-list archives

I see you're aiming at something like this below, but maybe breaking
the tiers out explicitly makes the document easier to maintain?

> 3.      In case of service failure, individuals must be available to
> solve the problem
>
>         a.  there should always be at least two people in the project
> who know how to repair any given service
>
>         b.  there should always be at least two such people who have the
> access rights to repair any given service
>
>         c.  there should always be at least two such people who have
> the authority to decide to repair any given service
>
>         d.  ideally, the "at least two people" principle above means
> "at any one time".  So when people take vacation, are offline, &c.,
> someone else should be able to step in.

e. there should be a clear, obvious way to determine, in the event of
a failure, who gets contacted first, second, etc. and how they are to
be contacted (to the extent that this is possible without getting the
attention of spammers and their ilk).

> B.  Service redundancy and distribution: how to achieve reliability.
>
> 1.      To the extent technically feasible, every service should be
> delivered from at least two machines.
>
> 2.      To the extent technically feasible, every project-critical
> service should be delivered from at least two geographically and
> topologically distributed locations.

I think that redundant locations are certainly desirable. I don't
know if they're reasonably achievable, but we should save discussion
about how to do it until after we've decided what we want to do.

> 3.      To the extent that (1) and (2) are not technically feasible,
> "warm standby" systems should be prepared for failover conditions.

4. To the extent technically feasible, Slony itself should be used to
implement this redundancy.

For example, if the documentation is in the form of a Postgres-driven
wiki, it would be excellent to document the installation and operation of
that wiki as a practical example of best practices in action for
newbies. (I'm working on the assumption that db-driven CMSes are common
enough, useful enough, and simple enough to be interesting to newbies.)
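
To make that concrete, here's a rough slonik sketch of replicating
such a wiki's database from a primary to a standby. The cluster name,
conninfo strings, and table name are invented for illustration, and
the exact syntax varies a bit between Slony versions, so treat this
as a sketch rather than a recipe:

  # Replicate a hypothetical wiki database from node 1 (db1) to node 2 (db2).
  cluster name = wikicluster;

  node 1 admin conninfo = 'dbname=wiki host=db1 user=slony';
  node 2 admin conninfo = 'dbname=wiki host=db2 user=slony';

  init cluster (id = 1, comment = 'origin: db1');
  store node (id = 2, comment = 'subscriber: db2', event node = 1);

  # Paths the slon daemons use to reach one another.
  store path (server = 1, client = 2,
              conninfo = 'dbname=wiki host=db1 user=slony');
  store path (server = 2, client = 1,
              conninfo = 'dbname=wiki host=db2 user=slony');

  # One replication set holding the wiki's tables (assumes they
  # have primary keys).
  create set (id = 1, origin = 1, comment = 'wiki tables');
  set add table (set id = 1, origin = 1, id = 1,
                 fully qualified name = 'public.pages');

  subscribe set (id = 1, provider = 1, receiver = 2, forward = no);

With a slon daemon running per node, the standby then tracks the
origin, and the script itself becomes newbie documentation.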

> C.      Policy and high availability
>
> 1.      Services should be classified according to "critical",
> "valuable", &c. (or some such similar scale); and each of these
> levels should have some planned level of response time to failures.
> (An initial suggestion is "24/7" and "12/5" service levels, but
> I'm open to suggestions here.)
>
> 2.      A communication plan for failures is at least as important as
> the ability to fix problems: a well-communicated failure with
> information well-distributed to the community will cause less damage
> than one poorly acknowledged.

Chris' comments about pulling the subscriber list seem reasonable.
However, I think we need to stick with getting our requirements down
first; then we can start figuring out implementation.

> 3.      Predictable outages of longer duration are preferable to
> unpredicted outages of any duration
>
> 4.      Occasional scheduled "fire drills" should be conducted to
> test the viability of infrastructure plans.
>
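
A concrete drill here could be a scheduled switchover of a replication
set's origin. Reusing the illustrative node and set numbers from the
sketch above (again, hypothetical names; syntax varies by version):

  # Planned switchover: move the set's origin from node 1 to node 2.
  lock set (id = 1, origin = 1);
  move set (id = 1, old origin = 1, new origin = 2);

  # The unplanned equivalent, if node 1 were actually dead:
  # failover (id = 1, backup node = 2);

Running that on a schedule, and then moving the origin back, would
exercise both the infrastructure and the people.
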
> > 2) visible integration with PostgreSQL and other components
>
> Would a "replication for PostgreSQL home" be helpful here?  Like a
> www.postgresql.org/replication home site that included Slony,
> pgcluster, pgpool, &c?

While clearly valuable and important, that doesn't obviously relate
directly to Slony project infrastructure. Do we want to offer to share
hosting with other PostgreSQL replication solutions?

> > 3) Increased usability (newbie) documentation
> > 4) Increased usability (newbie) tools
>
> I think these are important, but for an infrastructure discussion,
> presumably what we want is the ease of delivering these:
>
> III.    Easy documentation maintenance
>
> A.      The ability to deliver user-friendly documentation for new
> users
>
> B.      Integration of documentation with the main web site
>
> C.      Easy maintenance by community, so that no individual is a
> potential "blocker" on documentation updates

D.   Aggressively encourage and support feedback, especially from new
users, so that the quality of the documentation can be continually
improved.

> IV.     Easy delivery of tools
>
> A.      New-user tools
>
> B.      Test infrastructure
>
> I'm sure this isn't everything, but please feel free, all of you, to
> rip into me.

Me too. </aol>

Drew


