[Slony1-general] Switching master-slave roles after a failover

Wed Feb 2 18:35:52 PST 2005

k-ohara at excite.co.jp wrote:

>How can I manage to implement Step 4 in the following scenario:
>
> step #1. A was master; B was slave.
> step #2. B detects a failure; promotes itself to a master.
> step #3. The cause of the failure is resolved and removed by admin.
> step #4. A becomes a new slave manually or automatically.
>
>I know I should rebuild server A `from scratch'
>if the cause is disk error or something.
>But in some cases (e.g. NIC error), A's disk is safe and sound.
>
>In such cases, I thought I could switch master-slave roles
>even after the failover command, but manuals and mailing list
>archives seemingly suggest not to do that.
>
>My idea was to kill all slon daemons, drop all slony schemata from
>both servers, pg_dump/undump from B to A if needed,
>then to re-install the schemata with reversed roles.
>  
>
The problem with NOT rebuilding A from scratch is that you may leave
the two nodes in an inconsistent state.

Consider in a little more detail:

Step 1. A was "origin", B was a subscriber

Step 2. Network failure takes place so that B decides to take over via
FAILOVER.

Conditions at the time of step 2: when B takes over, the database on A
holds 25 committed, replicable transactions that never made it to B.

FAILOVER treats those transactions as lost. But in fact, they are
sitting on A, committed.

You may resolve the cause of the failure, but that does not resolve
those 25 transactions, which are in a sort of "limbo": committed on A
but not replicated anywhere else. Indeed, users may have re-attempted
the transactions on B, so that logical equivalents are now waiting to be
replicated to subscribers. At that point the systems are out of sync,
and you have a conflict that Slony-I is not equipped to rectify.
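
To make the divergence concrete, here is a toy sketch in Python. It is
not Slony-I code; the Node class, the transaction ids, and the counts of
100 commits and 10 re-entered transactions are all made up for
illustration. Only the figure of 25 unreplicated transactions comes from
the scenario above.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy stand-in for a database node; not a Slony-I structure."""
    name: str
    committed: list = field(default_factory=list)  # transaction ids, in commit order


def replicate(origin, subscriber, upto):
    """Copy the origin's first `upto` transactions to the subscriber."""
    subscriber.committed = list(origin.committed[:upto])


def stranded(here, there):
    """Transactions committed on `here` that `there` never saw."""
    seen = set(there.committed)
    return [t for t in here.committed if t not in seen]


a = Node("A")
b = Node("B")

# A commits 100 transactions; only the first 75 reach B before the failure.
a.committed = [f"a-{i}" for i in range(1, 101)]
replicate(a, b, upto=75)

# B takes over, and users redo some of the lost work there under new ids.
b.committed += [f"b-{i}" for i in range(1, 11)]

only_on_a = stranded(a, b)  # committed on A, unknown to B
only_on_b = stranded(b, a)  # committed on B after the takeover, unknown to A

print(len(only_on_a), "transactions exist only on A")
print(len(only_on_b), "transactions exist only on B")
```

Both lists are non-empty, so neither node's history is a prefix of the
other's. Simply reversing the replication direction cannot reconcile
that.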

The only safe thing to do is to rebuild A from scratch.

That is why FAILOVER _abandons_ the failed node.

If those 25 transactions represent business promises (e.g. they
promised to ship products to customers), then you need to resolve this
by looking at what was outstanding on node A at the end. Some of the 25
transactions may be irrelevant; others might be Really Important.
Evaluating that isn't something Slony-I can do.
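
As a rough sketch of what that manual review might start from, the
Python below compares two snapshots of one table, keyed by primary key.
The triage function, the "orders" table, and its rows are hypothetical;
in practice you would load the snapshots from A and B yourself, and a
human still has to decide what each difference means.

```python
def triage(rows_a, rows_b):
    """Compare two {primary_key: row} snapshots of the same table.

    Returns rows found only on A, rows found only on B, and rows present
    on both nodes whose contents differ.
    """
    only_a = {k: v for k, v in rows_a.items() if k not in rows_b}
    only_b = {k: v for k, v in rows_b.items() if k not in rows_a}
    differ = {
        k: (rows_a[k], rows_b[k])
        for k in rows_a.keys() & rows_b.keys()
        if rows_a[k] != rows_b[k]
    }
    return only_a, only_b, differ


# Hypothetical "orders" table, snapshotted from A (old origin) and B (new origin).
orders_a = {
    101: ("widget", "shipped"),
    102: ("gadget", "promised"),   # committed on A, never replicated
    103: ("gizmo", "promised"),
}
orders_b = {
    101: ("widget", "shipped"),
    103: ("gizmo", "cancelled"),   # changed on B after the failover
    104: ("gadget", "promised"),   # user re-entered order 102 on B
}

only_a, only_b, differ = triage(orders_a, orders_b)
print("only on A:", only_a)
print("only on B:", only_b)
print("differ:", differ)
```

Note that the code cannot tell that order 104 on B is a re-entry of
order 102 on A; that judgement is exactly the part that needs a person.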

I have added some further discussion of this in the appropriate places
in the Admin Guide; it is checked into CVS, and will probably soon be
published on some web site near you...
-- 
<http://cbbrowne.com/info/failover.html>