Mon Feb 7 12:00:50 PST 2011
- Previous message: [Slony1-hackers] automatic WAIT FOR proposal
- Next message: [Slony1-hackers] Multi node failover draft
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Problem

The current steps necessary to perform failover in a Slony replication cluster where multiple nodes have failed are complicated. Even a single-node failover with multiple surviving subscribers can lead to unexpected failures and broken configurations. Many DBAs do not have the detailed knowledge of Slony-I internals needed to write slonik scripts that perform the necessary waiting between critical steps of the failover procedure.

Proposal summary

To solve this problem we will develop a slonik failover command that receives, in one invocation, all the information about which nodes have failed and which sets should be taken over by which backup node. This command will perform sanity checks on the remaining nodes, then execute all the necessary steps, including waiting for event propagation, to arrive at a new, working cluster shape in which the failed nodes are dropped from the configuration of the remaining cluster.

Command syntax

The new FAILOVER CLUSTER command will specify the list of failed nodes. This list is a string of space-separated node ID numbers. Following it are the actions to be taken for every set that originates on one of the failed nodes. There are two possible actions:

* Specify a backup node that will become the new origin. This backup node must be a forwarding subscriber of the set and one of the surviving nodes.

* Abandon the set. This can happen when there is not a single forwarding subscriber of the set left.

Stage 1 - Requirements and sanity checks

For the FAILOVER CLUSTER command it is necessary that the remaining nodes have a sufficient path network.

* For every surviving set, the surviving subscribers must have a path to either the new origin (the backup node) or another surviving forwarder of that set.

* Nodes that are currently not subscribed to any set at all must have paths that allow a functional listen network to be generated.
* The slonik script that performs the FAILOVER CLUSTER command must specify admin conninfos for all remaining nodes, since slonik will need to connect to all of them.

Stage 2 - Disabling slon processes

In order for slonik to perform the failover procedure without concurrently running slon processes interfering with it via event processing, we need a way to tell a local node's slon process not to start up normally, but to stop and retry before spawning off all the worker threads. This should be a column inside the sl_node table, possibly the existing no_active column, which seems unused. Stage 2 sets this flag on all surviving nodes and restarts the slon processes via NOTIFY.

Stage 3 - Disabling paths to failed nodes

Failures are not necessarily caused by DB server problems; they are often the result of network problems. In any failover case, we must not merely assume that the failed nodes are no longer reachable. It is actually best practice to forcibly make failed nodes unreachable by network management means. Slony cannot prevent outside applications from still accessing those "failed" nodes. But we can make sure that the surviving nodes, as defined by the FAILOVER CLUSTER command, will not receive any more events from those nodes that might interfere with the failover procedure. We accomplish this by updating all sl_path entries to or from any of the failed nodes, on all remaining nodes, to something that does not represent a valid conninfo string. This way, the remaining nodes will no longer be able to connect and thus no longer receive any events (outstanding or newly created).

Stage 4 - Removing abandoned sets from the configuration

All sets that have been specified as abandoned will be removed from the configuration of all surviving nodes. After doing this, slonik will determine which was the most advanced surviving node subscribed to the set, in order to inform the administrator which node has the most advanced data.
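The stage 1 path requirement and the stage 3 path disabling can be sketched as follows. This is a minimal Python sketch over in-memory stand-ins for the sl_path catalog; the function names and data shapes are hypothetical illustrations, not part of the proposal:

```python
def check_set_paths(backup, subscribers, forwarders, paths):
    """Stage 1 check for one surviving set: every surviving subscriber
    needs a path to the backup node or to another surviving forwarder.
    paths is a set of frozenset({a, b}) node pairs (sl_path entries).
    Returns the list of nodes violating the requirement."""
    providers = set(forwarders) | {backup}
    return [n for n in subscribers
            if n != backup
            and not any(frozenset((n, p)) in paths
                        for p in providers if p != n)]

def disable_failed_paths(sl_path, failed_nodes):
    """Stage 3: overwrite the conninfo of every sl_path row to or from
    a failed node with a string that can never yield a connection."""
    failed = set(failed_nodes)
    for row in sl_path:
        if row["pa_server"] in failed or row["pa_client"] in failed:
            row["pa_conninfo"] = "<node failed over>"
    return sl_path
```

For example, with surviving nodes C, D, E, backup C and only the paths C-D and D-E left, the check passes as long as D is a forwarder, because E can hang off D.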
Stage 5 - Reshaping the cluster

This is the most complicated step in the failover procedure. Consider the following cluster configuration:

    A ----- C ----- D
    |               |
    B              /
    |             /
    E ----------/

Node A is origin of set 1 with subscribers B, C, D and E.
Node C is origin of set 2 with subscribers A, B, D and E.
Node E receives both sets via node B.

The failure case is that nodes A and B fail and C is the designated backup node for set 1. Although node E has a path to node C, which could have been created after the failure and prior to executing FAILOVER CLUSTER, it will not use it to listen for events from node C. Its subscription to set 2, originating from node C, uses node B as data provider. This causes node E to listen for events from node C via B, which is now a disabled path.

Stage 5.1 - Determining the temporary origins

Per set, slonik will find the most advanced forwarding subscriber and make that node the temporary origin of the set. In the above example, this can be either node C or E. If multiple nodes qualify, the designated backup node is chosen. In this test case there will only be a temporary origin if node E is the most advanced. In that case, as is done today, a FAILOVER_SET event faked to originate from node A is injected into node E, followed by a MOVE_SET event to transfer ownership to node C. If node C, the designated backup node, is more advanced, only a FAILOVER_SET originating from A is injected into node C, without a following MOVE_SET. Note that at this point the slon processes are still disabled, so none of these events are propagating yet.

Stage 5.2 - Reshaping subscriptions

Per set, slonik will

* resubscribe every subscriber of that set to the (temporary) origin;

* resubscribe every subscriber of that set that does not have a direct path to the (temporary) origin to an already resubscribed forwarder. This is an iterative process.
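The iterative reshaping in stage 5.2 can be sketched like this. It is a hypothetical model, not the proposed implementation: paths are undirected node pairs, and the resulting plan maps each subscriber to its new data provider:

```python
def reshape_subscriptions(origin, subscribers, forwarders, paths):
    """Attach subscribers with a direct path to the (temporary) origin
    first, then iteratively hang the remaining subscribers off already
    resubscribed forwarders.  Returns {subscriber: provider}."""
    providers = {origin}            # nodes that can already serve the set
    pending = set(subscribers) - {origin}
    plan = {}
    progress = True
    while pending and progress:
        progress = False
        for node in sorted(pending):
            reachable = [p for p in sorted(providers)
                         if frozenset((node, p)) in paths]
            if reachable:
                plan[node] = reachable[0]
                if node in forwarders:
                    providers.add(node)  # others may subscribe via it now
                pending.discard(node)
                progress = True
    if pending:
        raise RuntimeError("no provider for nodes: %s" % sorted(pending))
    return plan
```

With origin C, subscribers C, D, E, forwarder D and paths C-D and D-E, the first pass subscribes D to C, which then makes D available as provider for E on the same pass.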
It is possible that slonik finds a non-forwarding subscriber that is more advanced than the temporary origin or backup node. This node's subscription must be forcibly dropped with a WARNING, because it has processed changes the other nodes do not have, but for which the sl_log entries have been lost in the failure.

Stage 5.3 - Sanitizing sl_event

It is possible that SUBSCRIBE_SET events were outstanding when the failure occurred. Furthermore, the data provider for such an event may be one of the failed nodes. Since the slon processes are stopped at this time, the new subscriber will not be able to perform the subscription at all, so these events must be purged from sl_event on all nodes and WARNING messages given.

Stage 6 - Enabling slon processes and waiting

The slon processes will now start processing all the outstanding events that were generated or still queued up, until all the faked FAILOVER_SET events have been confirmed by all other remaining nodes. This may take some time if other subscriptions were in progress when the failure happened. Those subscriptions were interrupted when we disabled the slon processes in stage 2 and have just been restarted.

Stage 7 - Sanity checks and dropping failed nodes

After waiting for stage 6 to complete, none of the surviving nodes should list any of the failed nodes as the origin of any set, and all events originating from one of the failed nodes should be confirmed by all surviving nodes. If those conditions are met, it is safe to initiate DROP NODE events for all failed nodes.

Examining more failure scenarios

Case 2 - 5 node cluster with 2 sets

    A ----- C ----- E
    |       |
    B       D

Node A is origin of set 1, subscribed to nodes B, C, D and E.
Node B is origin of set 2, subscribed to nodes A, C, D and E.

The failure case is that nodes A and B fail, node C is the designated backup for set 1, and node D is the designated backup for set 2.
* The remaining cluster has a sufficient path network to pass stage 1.
* Stage 2 cannot fail by definition.
* Stage 3 cannot fail by definition.
* Stage 4 is irrelevant for this example.
* Stage 5 will decide that C is the new origin of set 1 via FAILOVER_SET, and that node C will be the temporary origin for set 2 via FAILOVER_SET followed by an immediate MOVE_SET to D. All subscriptions for E and D will point to C as data provider.
* Stage 6 will wait until both the FAILOVER_SET and MOVE_SET events from stage 5 have been confirmed by C, D and E.
* Stage 7 should pass its sanity checks and drop nodes A and B.

Case 3 - 5 node cluster with 3 sets

    A ----- C ----- E
    |       |
    B ----- D

Node A is origin of set 1, subscribed to nodes B, C, D and E. Node D uses node C as data provider.
Node B is origin of set 2, subscribed to nodes A and C. Node D is currently subscribing to set 2, using the origin node B as provider.
Node B is origin of set 3, subscribed to node A only.

The failure case is that nodes A and B fail and node C is the designated backup for sets 1 and 2.

* The remaining cluster has a sufficient path network to pass stage 1.
* Stage 2 cannot fail by definition.
* Stage 3 cannot fail by definition.
* Stage 4 will remove set 3 from sl_set on all nodes.
* Stage 5 will decide that C is the new origin of sets 1 and 2 via FAILOVER_SET. Subscriptions for set 1 will point to node C. It is quite possible that the SUBSCRIBE_SET event for node D subscribing to set 2 from B was interrupted by the failover but, since it causes very little work, had already been processed and confirmed by nodes C and E. Stage 5.3 will delete these events from sl_event on nodes C and E. The event would otherwise fail when D receives it from E, because D cannot connect to the data provider B.
* Stage 6 will wait until both FAILOVER_SET events from stage 5 have been confirmed by C, D and E.
* Stage 7 should pass its sanity checks and drop nodes A and B.

Comments?
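The stage 5.1 decision and the stage 7 exit condition that these walkthroughs rely on can be sketched as follows. This is a simplified model in which a per-node "last applied event sequence" counter stands in for how advanced a node is; all names and data shapes are hypothetical:

```python
def plan_set_takeover(set_id, backup, forwarders, last_seq):
    """Stage 5.1: the most advanced surviving forwarder becomes the
    temporary origin (ties go to the designated backup node).  Returns
    the events to inject, faked to originate from the failed origin."""
    best = max(last_seq[n] for n in forwarders)
    candidates = [n for n in forwarders if last_seq[n] == best]
    temp = backup if backup in candidates else candidates[0]
    events = [("FAILOVER_SET", set_id, temp)]
    if temp != backup:
        # The temporary origin hands the set over to the backup node.
        events.append(("MOVE_SET", set_id, temp, backup))
    return events

def safe_to_drop(failed, surviving, set_origins, unconfirmed):
    """Stage 7: no survivor may still see a failed node as a set origin,
    and no survivor may have unconfirmed events from a failed node."""
    failed = set(failed)
    return all(not failed & set(set_origins[n].values())
               and not failed & unconfirmed[n]
               for n in surviving)
```

In case 2, if C has applied more of the failed origin's events than D, planning the takeover of set 2 with backup D yields exactly the FAILOVER_SET to C followed by the MOVE_SET to D described above.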
Jan

--
Anyone who trades liberty for security deserves neither liberty nor security.
    -- Benjamin Franklin