Jan Wieck JanWieck at Yahoo.com
Mon Feb 7 12:00:50 PST 2011
Problem

The current steps necessary to perform failover in a Slony replication 
cluster where multiple nodes have failed are complicated. Even a single 
node failover with multiple surviving subscribers can lead to unexpected 
failures and broken configurations. Many DBAs do not have the detailed 
knowledge of Slony-I internals that is needed to write slonik scripts 
that do the necessary waiting between critical steps in the failover 
procedure.


Proposal summary

To solve this problem we will develop a slonik failover command that 
receives, at once, all the information about which nodes have failed 
and which sets should be taken over by which backup node. This command 
will perform sanity checks on the remaining nodes, then execute all the 
necessary steps, including waiting for event propagation, to arrive at 
a new, working cluster shape in which the failed nodes are dropped from 
the configuration of the remaining cluster.


Command syntax

The new FAILOVER CLUSTER command will specify the list of failed nodes. 
This list is a string of space-separated node-ID numbers. Following it 
are the actions to be taken for every set that originates on one of the 
failed nodes (a hypothetical invocation is sketched after this list). 
There are two possible actions:

    * Specify a backup node that will be the new origin. This backup
      node must be a forwarding subscriber of the set and one of the
      surviving nodes.

    * Abandon the set. This can happen when there is no forwarding
      subscriber of the set left.
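
Since this command is the very subject of this proposal and does not 
exist yet, the following is only a syntax sketch of how an invocation 
might look. All node and set numbers are made up: nodes 1 and 2 have 
failed, node 3 takes over set 1, node 4 takes over set 2, and set 3 
has no surviving forwarding subscriber left.

    # Hypothetical syntax sketch only -- not implemented yet.
    failover cluster (
        failed nodes = '1 2',
        set 1 backup node = 3,
        set 2 backup node = 4,
        set 3 abandoned
    );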


Stage 1 - Requirements and sanity checks

For the FAILOVER CLUSTER command it is necessary that the remaining 
nodes have a sufficient path network.

    * For every surviving set, the surviving subscribers must have a path
      to either the new origin (backup node), or another surviving
      forwarder for that set.

    * Nodes that are currently not subscribed to any set at all must have
      paths that somehow allow a functional listen network to be
      generated.

    * The slonik script that performs the FAILOVER CLUSTER command must
      specify admin conninfo entries for all remaining nodes, since
      slonik will need to connect to all of them (a preamble sketch
      follows this list).
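
The conninfo preamble of such a script is ordinary slonik syntax. A 
minimal sketch, assuming a cluster named "mycluster" and surviving 
nodes 3, 4 and 5 (all names and addresses are made up):

    # Standard slonik preamble; slonik needs a connection to every
    # surviving node to run FAILOVER CLUSTER.
    cluster name = mycluster;
    node 3 admin conninfo = 'host=db3 dbname=app user=slony';
    node 4 admin conninfo = 'host=db4 dbname=app user=slony';
    node 5 admin conninfo = 'host=db5 dbname=app user=slony';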


Stage 2 - Disabling slon processes

In order for slonik to perform the failover procedure without 
concurrently running slon processes interfering with it via event 
processing, we need a way to tell a local node's slon process not to 
start up normally, but to stop and retry before spawning off all the 
worker threads. This should be a column inside the sl_node table, 
possibly the existing no_active column, which seems unused. Stage 2 
sets this flag on all surviving nodes and restarts the slon processes 
via NOTIFY.
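
A minimal sketch of what stage 2 could execute on each surviving node, 
assuming (as proposed above, not decided) that the existing 
sl_node.no_active column is reused as the flag and the cluster is 
named "mycluster":

    -- Assumption: no_active = false means "do not start up normally".
    UPDATE _mycluster.sl_node
       SET no_active = false
     WHERE no_id = _mycluster.getLocalNodeId('_mycluster');
    -- Make the running slon process restart and see the flag; this is
    -- the channel a RESTART NODE notification uses.
    NOTIFY "_mycluster_Restart";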


Stage 3 - Disabling paths to failed nodes

Failures are not necessarily caused by DB server problems, but are often 
the result of network problems.

In any failover case we must not merely assume that the failed nodes 
are no longer reachable. It is actually best practice to forcibly make 
failed nodes unreachable by network management means, because Slony 
cannot prevent outside applications from still accessing those "failed" 
nodes. What we can do is make sure that the surviving nodes, as defined 
by the FAILOVER CLUSTER command, will not receive any more events from 
those nodes that could interfere with the failover procedure. We 
accomplish this by updating, on all remaining nodes, all sl_path 
entries to or from any of the failed nodes to something that does not 
represent a valid conninfo string. This way the remaining nodes will no 
longer be able to connect and thus no longer receive any events 
(outstanding or newly created).
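
A minimal SQL sketch of this step, assuming failed nodes 1 and 2 and a 
cluster named "mycluster"; it would be executed on every remaining 
node:

    -- Overwrite the conninfo of every path to or from a failed node
    -- with a string that can never produce a valid connection.
    UPDATE _mycluster.sl_path
       SET pa_conninfo = '<node failed>'
     WHERE pa_server IN (1, 2)
        OR pa_client IN (1, 2);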


Stage 4 - Remove abandoned sets from the configuration

All sets that have been specified as abandoned will be removed from the 
configuration of all surviving nodes. After doing this, slonik will 
determine which surviving node that was subscribed to the set is the 
most advanced, in order to inform the administrator which node has the 
most recent data.
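
A minimal SQL sketch of removing an abandoned set, assuming set 3 is 
the abandoned one and the cluster is named "mycluster"; an actual 
implementation would likely go through the usual configuration 
functions rather than raw deletes:

    -- Remove all subscriptions of the abandoned set, then the set.
    DELETE FROM _mycluster.sl_subscribe WHERE sub_set = 3;
    DELETE FROM _mycluster.sl_set WHERE set_id = 3;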


Stage 5 - Reshaping the cluster

This is the most complicated step in the failover procedure. Consider 
the following cluster configuration:

      A ----- C ----- D
      |       |
      |       |
      B      /
      |     /
      |    /
      E --/

Node A is origin to set 1 with subscribers B, C, D and E
Node C is origin to set 2 with subscribers A, B, D and E
Node E receives both sets via node B

Failure case is that nodes A and B fail and C is the designated backup 
node for set 1.

Although node E has a path to node C, which could have been created 
after the failure and prior to executing FAILOVER CLUSTER, it will not 
use it to listen for events from node C. Its subscription to set 2, 
which originates on node C, uses node B as data provider. This causes 
node E to listen for events from node C via B, which is now a disabled 
path.

Stage 5.1 - Determining the temporary origins

Per set, slonik will find the most advanced forwarding subscriber and 
make that node the temporary origin of the set. In the above example, 
this can be either node C or E. If multiple nodes qualify, the 
designated backup node is chosen. In this test case there will only be 
a temporary origin if node E is the most advanced. In that case there 
will be, as today, a FAILOVER_SET event, faked to originate from node 
A, injected into node E, followed by a MOVE_SET event to transfer 
ownership to node C. If node C, the designated backup node, is the 
more advanced, only a FAILOVER_SET originating from A will be injected 
into node C, with no following MOVE_SET.

Note that at this point, the slon processes are still disabled so none 
of the events are propagating yet.
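
A sketch of how slonik could rank the candidates, assuming failed 
origin node 1, set 1, and a cluster named "mycluster"; it uses the 
confirmations recorded in sl_confirm as the measure of how far each 
forwarding subscriber has advanced:

    -- The first row is the most advanced forwarding subscriber and
    -- thereby the candidate for temporary origin of set 1.
    SELECT s.sub_receiver AS node_id, max(c.con_seqno) AS last_confirmed
      FROM _mycluster.sl_subscribe s
      JOIN _mycluster.sl_confirm c ON c.con_received = s.sub_receiver
     WHERE s.sub_set = 1
       AND s.sub_forward
       AND c.con_origin = 1
     GROUP BY s.sub_receiver
     ORDER BY last_confirmed DESC;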


Stage 5.2 - Reshaping subscriptions

Per set, slonik will

    * resubscribe every subscriber of that set that has a direct path
      to the (temporary) origin directly to it

    * resubscribe every subscriber of that set that does not have a
      direct path to the (temporary) origin to an already resubscribed
      forwarder. This is an iterative process (a sketch of a single
      resubscription follows this list). It is possible that slonik
      finds a non-forwarding subscriber that is more advanced than the
      temporary origin or backup node. This node's subscription must
      forcibly be dropped with a WARNING, because it has processed
      changes the other nodes don't have, but for which the sl_log
      entries have been lost in the failure.
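
A minimal SQL sketch of one such resubscription, assuming node 5 is 
repointed at the new (temporary) origin node 3 for set 1 in a cluster 
named "mycluster"; all node and set numbers are made up:

    -- Repoint the data provider of node 5's subscription to set 1.
    UPDATE _mycluster.sl_subscribe
       SET sub_provider = 3
     WHERE sub_set = 1
       AND sub_receiver = 5;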

Stage 5.3 - Sanitizing sl_event

It is possible that there were SUBSCRIBE_SET events outstanding when 
the failure occurred. Furthermore, the data provider for such an event 
may be one of the failed nodes. Since the slon processes are stopped 
at this time, the new subscriber will not be able to perform the 
subscription at all, so these events must be purged from sl_event on 
all nodes and WARNING messages given.
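
A sketch of the purge, assuming (the payload layout is an assumption 
here) that a SUBSCRIBE_SET event carries its data provider in ev_data2 
and that nodes 1 and 2 have failed:

    -- Remove outstanding subscription events whose data provider is
    -- one of the failed nodes; the ev_data columns are text.
    DELETE FROM _mycluster.sl_event
     WHERE ev_type = 'SUBSCRIBE_SET'
       AND ev_data2 IN ('1', '2');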


Stage 6 - Enable slon processes and wait

The slon processes will now start processing all the outstanding events 
that were generated or still queued up, until all the faked 
FAILOVER_SET events have been confirmed by all other remaining nodes. 
This may take some time if other subscriptions were in progress when 
the failure happened; those subscriptions were interrupted when we 
disabled the slon processes in stage 2 and have just been restarted.
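
A sketch of the wait condition slonik could poll on each surviving 
node, assuming the faked FAILOVER_SET was injected as event 12345 of 
failed origin 1 and the surviving nodes are 3, 4 and 5 (all numbers 
are made up):

    -- True once all surviving nodes have confirmed the faked event.
    SELECT count(DISTINCT con_received) = 3
      FROM _mycluster.sl_confirm
     WHERE con_origin = 1
       AND con_seqno >= 12345
       AND con_received IN (3, 4, 5);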


Stage 7 - Sanity check and drop failed nodes

After stage 6 has completed, none of the surviving nodes should list 
any of the failed nodes as the origin of any set, and all events 
originating from one of the failed nodes should be confirmed by all 
surviving nodes. If those conditions are met, it is safe to initiate 
DROP NODE events for all failed nodes.
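
The final cleanup can use the existing DROP NODE command. A slonik 
sketch, assuming failed nodes 1 and 2 and surviving node 3 as the 
event node:

    # Drop both failed nodes from the surviving cluster.
    drop node (id = 1, event node = 3);
    drop node (id = 2, event node = 3);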


Examining more failure scenarios

Case 2 - 5 node cluster with 2 sets.

      A ----- C ----- E
      |       |
      |       |
      B       D

Node A is origin of set 1, subscribed to nodes B, C, D and E.
Node B is origin of set 2, subscribed to nodes A, C, D and E.

Failure case is nodes A and B fail, node C is designated backup for set 
1, node D is designated backup for set 2.

    * The remaining cluster has a sufficient path network to pass stage 1
    * Stage 2 cannot fail by definition
    * Stage 3 cannot fail by definition
    * Stage 4 is irrelevant for this example
    * Stage 5 will decide that C is the new origin of set 1 via
      FAILOVER_SET and that node C will be the temporary origin for set 2
      via FAILOVER_SET followed by an immediate MOVE_SET to D.
      All subscriptions for E and D will point to C as data provider.
    * Stage 6 will wait until both FAILOVER_SET and the MOVE_SET event
      from stage 5 have been confirmed by C, D and E.
    * Stage 7 should pass its sanity checks and drop nodes A and B.


Case 3 - 5 node cluster with 3 sets.

      A ----- C ----- E
      |       |
      |       |
      B ----- D

Node A is origin of set 1, subscribed to nodes B, C, D and E. Node D 
uses node C as data provider.
Node B is origin of set 2, subscribed to nodes A and C. Node D is 
currently subscribing to set 2 using the origin node B as provider.
Node B is origin of set 3, subscribed to node A only.

Failure case is nodes A and B fail, node C is designated backup for 
sets 1 and 2.

    * The remaining cluster has a sufficient path network to pass stage 1
    * Stage 2 cannot fail by definition
    * Stage 3 cannot fail by definition
    * Stage 4 will remove set 3 from sl_set on all nodes.
    * Stage 5 will decide that C is the new origin of sets 1 and 2 via
      FAILOVER_SET. Subscriptions for set 1 will point to node C.
      It is quite possible that the SUBSCRIBE_SET event for node D
      subscribing to set 2 from B is interrupted by the failover but,
      since it causes very little work, has already been processed and
      confirmed by nodes C and E. Stage 5.3 will delete these events
      from sl_event on nodes C and E. The event would otherwise fail
      when D receives it from E, because D cannot connect to the data
      provider B.
    * Stage 6 will wait until both FAILOVER_SET events from stage 5 have
      been confirmed by C, D and E.
    * Stage 7 should pass its sanity checks and drop nodes A and B.


Comments?
Jan

-- 
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin

