This bug is present in 2.2.0 beta 4.

Consider a 6 node cluster with two sets, where the subscribe network for the two sets looks like:

set 1

        1 ---------> 2
       / \          / \
      3   4        5   6

set 2

        3 ---------> 5
       / \          / \
      1   4        2   6

If nodes 1, 3 and 4 all fail at the same time (for example, because they are in the same location), then

    FAILOVER( node=(id=1, backup node=2),
              node=(id=3, backup node=5));

doesn't work, because slonik still thinks that node 4 is part of the cluster and will wait for it.

Requiring the user to drop node 4 before the failover, like

    drop node (id=4, event node=2);

is also a problem: slonik will still think that nodes 1 and 3 are valid nodes, and if any ev_origin=4 events have made it to some nodes but not others, slonik will wait for those events to be replicated everywhere.
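For concreteness, a minimal slonik script for this scenario might look like the sketch below; the cluster name and conninfo strings are placeholders I am assuming, not taken from an actual test:

    cluster name = testcluster;

    # admin conninfo is only given for the surviving nodes, since 1, 3 and 4 are down
    node 2 admin conninfo = 'dbname=slonytest host=node2';
    node 5 admin conninfo = 'dbname=slonytest host=node5';
    node 6 admin conninfo = 'dbname=slonytest host=node6';

    # this hangs: slonik still treats node 4 as a live member of the cluster
    failover ( node = (id = 1, backup node = 2),
               node = (id = 3, backup node = 5));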
If one tries the FAILOVER command like:

    FAILOVER(node=(id=1, backup node=2),
             node=(id=3, backup node=5),
             node=(id=4, backup node=2));

slonik segfaults.

This turns out to reveal a more general problem: if you try a FAILOVER when no valid failover targets exist for a node (in sl_failover_targets), slonik segfaults. We can avoid this with a check on num_nodes in fail_node_promote, but I am wondering if that just masks the symptom. If a node really has no failover targets, what should happen?

If the node is an origin node, then I think we need to log an error. Slonik currently has an error message "error no failover candidates for %d", but it isn't working properly (requires a 2 line fix).

If the node is not an origin, then we could treat the node as failed so we don't consider it as a failover candidate for another node, but not actually call the failover functions for it. This still leaves us with the problem of what to do with the sl_subscribe entries with sl_receiver=this_node when sl_provider=another_failed_node. Currently slony tries to set sl_provider=backup_node, but if the paths are missing this fails. We could create 'fake paths', or remove the sl_subscribe rows whose sl_receiver is the failed node (with no backup) that is going away.

Other ideas are also welcome.
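For what it's worth, the 'fake path' idea would presumably amount to something like the following slonik command (a sketch only: the node numbers follow the example above, and the conninfo is a throwaway placeholder since the failed receiver is unreachable anyway):

    # hypothetical: give the backup node (2) a path to the failed receiver (4)
    # purely so that repointing sl_subscribe's provider to the backup node does
    # not trip over a missing sl_path row; the connection would never be used
    store path (server = 2, client = 4, conninfo = 'dbname=unused', connretry = 10);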
My proposal is that we should do the following:

1. Require users to list all of the failed nodes on the FAILOVER command, even non-origin nodes.

2. If a node isn't listed in the FAILOVER command as a failed node, and it is a RECEIVER of a set from a failed node, then it MUST have a direct PATH to the backup node. We will check this early on and give an error; users can then do a STORE PATH or add that node to the list of failed nodes (see the sketch below).

3. If a node is a non-origin and is listed as a failed node, we redirect it to the backup node by manipulating sl_subscribe, and we post a FAILOVER event on the backup node so all other nodes can update sl_subscribe as needed. We avoid issues with foreign key violations between sl_subscribe -> sl_path and sl_listen -> sl_path because the direct path exists.
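To make point 2 concrete: suppose (hypothetically, since in the example cluster above every surviving receiver already feeds from a backup node) a surviving node 7 received a set from failed node 3 and had no direct path to the backup node 5. Under this proposal slonik would stop early with an error, and the user would either add the missing path before retrying the failover, roughly:

    # hypothetical node 7 and conninfo, shown only to illustrate the remedy
    store path (server = 5, client = 7, conninfo = 'dbname=slonytest host=node7', connretry = 10);

or list node 7 as a failed node in the FAILOVER command.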
Created attachment 165: proposed fixes for 296

I have pushed a set of fixes for these issues at https://github.com/ssinger/slony1-engine/commits/bug296, or in the attached patch.

This set of changes follows points 1 and 2 from my previous comment but doesn't exactly follow point 3. We pass the complete list of failed nodes in the FAILOVER_NODE event, so in the above example we would do

    FAILOVER(node=(id=1, backup node=2),
             node=(id=3, backup node=5),
             node=(id=4, backup node=2));

and slonik will generate two FAILOVER_NODE events:

    FAILOVER_NODE (backup_node=2, failed_node=1, seq_no=XXXXXXX, failed_node_list=[1,3,4])
    FAILOVER_NODE (backup_node=5, failed_node=3, seq_no=XXXXXXX, failed_node_list=[1,3,4])

Each node that processes these events will know that the other two nodes have also failed, and will not repoint the sl_subscribe line for node 4 to a backup node.
A version of this patch, along with additional failover fixes, has been pushed to master for 2.2.0 b5.