kris_b78 at o2.pl kris_b78
Tue Jan 25 18:12:09 PST 2005
Dear Slony enthusiasts, 

I am trying to set up a replication system consisting of 3 nodes (I am using Slony-I 1.0.5 and PostgreSQL 7.4.5). I start the nodes one by one: first node 1, the master (init cluster, create set and so on - no problems there), then nodes 2 and 3, the slaves (just store node, store paths and listens, and subscribe set - no problems there either). When all three nodes are up and running, the sl_listen table on each node looks more or less as follows:

 li_origin | li_provider | li_receiver
-----------+-------------+-------------
         2 |           2 |           1
         1 |           1 |           2
         3 |           3 |           1
         1 |           1 |           3
         2 |           2 |           3
         3 |           3 |           2

(sl_path looks similar - separate paths between each pair of nodes, 6 paths altogether). As you can see, each node listens for events directly on every other node. (The order of the entries differs from node to node, but I assume that does not matter.) The sl_subscribe table:

 sub_set | sub_provider | sub_receiver | sub_forward | sub_active
---------+--------------+--------------+-------------+------------
       1 |            1 |            2 | t           | t
       1 |            1 |            3 | t           | t

- which means nodes 2 and 3 are direct receivers of set 1, which originates on node 1. The data is replicated fine, switchovers are performed smoothly, and everything seems to be OK.
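For completeness, the way I add each slave looks roughly like this (sketched here for node 3; the conninfo strings match the ones in my failover script below, and the comment text is just illustrative):

	cluster name = clusterix;

	node 1 admin conninfo = 'dbname=clusterix1 hostaddr=127.0.0.1 user=clusterix';
	node 3 admin conninfo = 'dbname=clusterix3 hostaddr=127.0.0.1 user=clusterix';

	# register node 3 and the paths/listens between it and node 1
	store node (id = 3, comment = 'Node 3');
	store path (server = 1, client = 3, conninfo = 'dbname=clusterix1 hostaddr=127.0.0.1 user=clusterix');
	store path (server = 3, client = 1, conninfo = 'dbname=clusterix3 hostaddr=127.0.0.1 user=clusterix');
	store listen (origin = 1, provider = 1, receiver = 3);
	store listen (origin = 3, provider = 3, receiver = 1);

	# subscribe node 3 to set 1, with forwarding enabled
	subscribe set (id = 1, provider = 1, receiver = 3, forward = yes);

(The paths and listens between nodes 2 and 3 are stored the same way.)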
The big problem is failover. I tried the simplest thing - a failover to node 3. My failover script is very simple:
	
	cluster name = clusterix;

	node 1 admin conninfo = 'dbname=clusterix1 hostaddr=127.0.0.1 user=clusterix';
	node 2 admin conninfo = 'dbname=clusterix2 hostaddr=127.0.0.1 user=clusterix';
	node 3 admin conninfo = 'dbname=clusterix3 hostaddr=127.0.0.1 user=clusterix';

	try {
		failover (id = 1, backup node = 3);
	}
	on error {
		echo 'failover error';
		exit 13;
	}
 
The failover command succeeds. On node 2 the sl_subscribe and sl_set tables change to:
 sub_set | sub_provider | sub_receiver | sub_forward | sub_active
---------+--------------+--------------+-------------+------------
       1 |            3 |            2 | t           | t
and

 set_id | set_origin | set_locked |     set_comment
--------+------------+------------+----------------------
      1 |          3 |            | All clusterix tables

which is exactly what I'd expect. But on node 3, which was supposed to become my new master node, these tables look somewhat strange:

 sub_set | sub_provider | sub_receiver | sub_forward | sub_active
---------+--------------+--------------+-------------+------------
       1 |            2 |            3 | t           | t
       1 |            3 |            2 | t           | t

and

 set_id | set_origin | set_locked |     set_comment
--------+------------+------------+----------------------
      1 |          1 |            | All clusterix tables

As you can see, according to sl_subscribe, node 3 is both the provider and the receiver of node 2, and node 2 is both the provider and the receiver of node 3 - which makes no sense to me. Not to mention that the origin of set 1, according to sl_set, is still node 1, which has failed. In the end I can neither write anything to my database on node 3 (Slony thinks it is still being replicated) nor drop node 1 (Slony tells me it is still the origin of set 1). So the big question is:
WHAT AM I DOING WRONG? While investigating the problem I found that the sl_event table on node 3 does not contain the FAILOVER_SET event (which is present on node 2). I tried to delve deeper into the contents of the Slony tables, but found no clues. Now, to make things even more awkward: when I start up all three nodes (exactly as described at the beginning), switch over to node 3 (works fine) and THEN fail over to node 1 - it works! I reckon this is because node 1 was started first, but I found no differences in the contents of the Slony tables that would clearly explain such behaviour. I will really appreciate any help in solving this problem.
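In case anyone wants to reproduce my check: I looked for the event with a simple query against the cluster schema (assuming the usual schema name, the cluster name prefixed with an underscore - _clusterix in my case):

	select ev_origin, ev_seqno, ev_type
	from _clusterix.sl_event
	where ev_type = 'FAILOVER_SET'
	order by ev_origin, ev_seqno;

Run on each node after the failover, this returns a row on node 2 but nothing on node 3.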

Chris Bandurski
chris at gv.pl

       





More information about the Slony1-general mailing list