Steve Singer ssinger at ca.afilias.info
Fri Oct 3 10:40:45 PDT 2014
On 10/03/2014 08:27 AM, Glyn Astill wrote:
> Hi All,
>
> I'm looking at a slony setup using 2.1.4, with 4 nodes in the following
> configuration:
>
>      Node 1 --> Node 2
>      Node 1 --> Node 3 --> Node 4
>
> Node 1 is the origin of all sets, and node 3 is a provider of all to
> node 4.  What I'm looking to do is fail over to node 2 when both nodes 1
> and 3 have gone down.
>
> Is this possible?


Improved handling of multiple nodes failing at once was one of the big
changes in 2.2.

You might want to try something like

NODE 1 ADMIN CONNINFO = 'dbname=TEST host=localhost port=5432';
NODE 2 ADMIN CONNINFO = 'dbname=TEST host=localhost port=5433';
NODE 3 ADMIN CONNINFO = 'dbname=TEST host=localhost port=5434';
NODE 4 ADMIN CONNINFO = 'dbname=TEST host=localhost port=5435';

FAILOVER (ID = 1, BACKUP NODE = 2);
SUBSCRIBE SET (ID = 1, PROVIDER = 2, RECEIVER = 4, FORWARD = YES);

DROP NODE (ID = 3, EVENT NODE = 2);
DROP NODE (ID = 1, EVENT NODE = 2);

But I haven't tried to set up a cluster in this configuration, so I can't
say for sure whether it will work.  As a general comment, I think trying
to reshape the cluster before the FAILOVER command will be problematic.

When I started doing a lot of failover tests with 2.1 I discovered a lot
of cases that wouldn't work, or wouldn't work reliably.  That led to
major changes to failover in 2.2.
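
For what it's worth, I'd also wait for the failover event to be confirmed
everywhere before dropping the failed nodes; untested, but something along
these lines:

   FAILOVER (ID = 1, BACKUP NODE = 2);
   WAIT FOR EVENT (ORIGIN = 2, CONFIRMED = ALL, WAIT ON = 2);
   DROP NODE (ID = 3, EVENT NODE = 2);
   DROP NODE (ID = 1, EVENT NODE = 2);

That way the surviving nodes should all agree on the new set origins
before any configuration is removed.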




>
> In both a live environment that I've not had chance to move to 2.2 and
> my test environment I'm seeing the same issues, for my test environment
> the slonik script is:
>
>      CLUSTER NAME = test_replication;
>
>      NODE 1 ADMIN CONNINFO = 'dbname=TEST host=localhost port=5432
> user=slony';
>      NODE 2 ADMIN CONNINFO = 'dbname=TEST host=localhost port=5433
> user=slony';
>      NODE 3 ADMIN CONNINFO = 'dbname=TEST host=localhost port=5434
> user=slony';
>      NODE 4 ADMIN CONNINFO = 'dbname=TEST host=localhost port=5435
> user=slony';
>
>      SUBSCRIBE SET (ID = 1, PROVIDER = 2, RECEIVER = 4, FORWARD = YES);
>      WAIT FOR EVENT (ORIGIN = 2, CONFIRMED = 4, WAIT ON = 2);
>      SUBSCRIBE SET (ID = 2, PROVIDER = 2, RECEIVER = 4, FORWARD = YES);
>      WAIT FOR EVENT (ORIGIN = 2, CONFIRMED = 4, WAIT ON = 2);
>      SUBSCRIBE SET (ID = 3, PROVIDER = 2, RECEIVER = 4, FORWARD = YES);
>      WAIT FOR EVENT (ORIGIN = 2, CONFIRMED = 4, WAIT ON = 2);
>
>      DROP NODE (ID = 3, EVENT NODE = 2);
>
>      FAILOVER (
>          ID = 1, BACKUP NODE = 2
>      );
>
>      DROP NODE (ID = 1, EVENT NODE = 2);
>
> slonik is failing at the first subscribe set line as follows:
>
>      $ slonik test.scr
>      test.scr:8: could not connect to server: Connection refused
>          Is the server running on host "localhost" (127.0.0.1) and accepting
>          TCP/IP connections on port 5432?
>      test.scr:8: could not connect to server: Connection refused
>          Is the server running on host "localhost" (127.0.0.1) and accepting
>          TCP/IP connections on port 5434?
>      test.scr:8: could not connect to server: Connection refused
>          Is the server running on host "localhost" (127.0.0.1) and accepting
>          TCP/IP connections on port 5432?
>      Segmentation fault
>
> I get the same behaviour until I bring node 1 back up, then the script
> almost succeeds, but for an error
> stating that a record in sl_event already exists:
>
>      $ slonik ~/test.scr
>      ~/test.scr:8: could not connect to server: Connection refused
>          Is the server running on host "localhost" (127.0.0.1) and accepting
>          TCP/IP connections on port 5434?
>      waiting for events  (1,5000000172) only at (1,5000000162) to be
> confirmed on node 4
>      executing failedNode() on 2
>      ~/test.scr:17: NOTICE:  failedNode: set 1 has no other direct
> receivers - move now
>      ~/test.scr:17: NOTICE:  failedNode: set 2 has no other direct
> receivers - move now
>      ~/test.scr:17: NOTICE:  failedNode: set 3 has no other direct
> receivers - move now
>      ~/test.scr:17: NOTICE:  failedNode: set 1 has other direct
> receivers - change providers only
>      ~/test.scr:17: NOTICE:  failedNode: set 2 has other direct
> receivers - change providers only
>      ~/test.scr:17: NOTICE:  failedNode: set 3 has other direct
> receivers - change providers only
>      NOTICE: executing "_test_replication".failedNode2 on node 2
>      ~/test.scr:17: waiting for event (1,5000000175).  node 4 only on
> event 5000000162
>      NOTICE: executing "_test_replication".failedNode2 on node 2
>      ~/test.scr:17: PGRES_FATAL_ERROR lock table
> "_test_replication".sl_event_lock,
> "_test_replication".sl_config_lock;select
> "_test_replication".failedNode2(1,2,2,'5000000174','5000000176');  -
> ERROR:  duplicate key value violates unique constraint "sl_event-pkey"
>      DETAIL:  Key (ev_origin, ev_seqno)=(1, 5000000176) already exists.
>      CONTEXT:  SQL statement "insert into "_test_replication".sl_event
>                  (ev_origin, ev_seqno, ev_timestamp,
>                  ev_snapshot,
>                  ev_type, ev_data1, ev_data2, ev_data3)
>                  values
>                  (p_failed_node, p_ev_seqfake, CURRENT_TIMESTAMP,
>                  v_row.ev_snapshot,
>                  'FAILOVER_SET', p_failed_node::text, p_backup_node::text,
>                  p_set_id::text)"
>      PL/pgSQL function
> _test_replication.failednode2(integer,integer,integer,bigint,bigint)
> line 14 at SQL statement
>      NOTICE: executing "_test_replication".failedNode2 on node 2
>      ~/test.scr:17: waiting for event (1,5000000177).  node 4 only on
> event 5000000175
>      ~/test.scr:21: begin transaction; -
>
>   After this sl_set on node 4 still has node 1 as the origin for one of
> the sets
>   (Is this possibly because I'm not waiting properly or waiting on the
> wrong node?):
>
>      TEST=# table _test_replication.sl_set;
>       set_id | set_origin | set_locked |    set_comment
>      --------+------------+------------+-------------------
>            2 |          1 |            | Replication set 2
>            1 |          2 |            | Replication set 1
>            3 |          2 |            | Replication set 3
>      (3 rows)
>
> I've attached the slon logs if that would provide any better insight.
>
> Any help would be greatly appreciated.
>
> Thanks
> Glyn
>