elein elein
Wed Oct 5 00:30:06 PDT 2005
Yes, it should. But it doesn't.  I believe no message is ever
sent to the 3rd node. This is the same in my example.  See also the sl_setsync
table.  It has a reference to node 1 (or 10).
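
One quick way to confirm which events never reached node 3 is to diff the
two sl_event listings quoted below. This is only a minimal sketch in Python,
assuming the psql output has been pasted in as text (the two samples here
are just a few rows taken from the listings). The helper names parse_events
and missing_on are made up for illustration and are not part of Slony-I.

```python
import re

# Matches one data row of psql output, with or without leading "> " quote markers.
ROW = re.compile(
    r"^\s*(?:>\s*)*\d{4}-\d{2}-\d{2} [\d:.]+\s*\|\s*(\d+)\s*\|\s*(\d+)\s*\|\s*(\w+)\s*$"
)

def parse_events(text):
    """Return {(ev_origin, ev_seqno): ev_type} for every row found in text."""
    events = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            origin, seqno, ev_type = m.groups()
            events[(int(origin), int(seqno))] = ev_type
    return events

def missing_on(reference, other, ignore=("SYNC",)):
    """Events present in reference but absent from other, skipping ignored types."""
    return sorted(
        (key, ev_type)
        for key, ev_type in reference.items()
        if key not in other and ev_type not in ignore
    )

# A few rows copied from the node 2 and node 3 listings in this thread.
node2 = """
 2005-10-04 18:03:03.238632 |         2 |       14 | SYNC
 2005-10-04 18:03:38.947704 |         1 |      327 | FAILOVER_SET
 2005-10-04 18:03:48.819508 |         3 |       13 | SYNC
"""
node3 = """
 2005-10-04 18:03:03.238632 |         2 |       14 | SYNC
 2005-10-04 18:03:48.819508 |         3 |       13 | SYNC
"""

for (origin, seqno), ev_type in missing_on(parse_events(node2), parse_events(node3)):
    print(f"missing on node 3: origin={origin} seqno={seqno} type={ev_type}")
```

SYNC events are skipped by default because the two listings were taken at
slightly different times, so the newest SYNCs will always differ.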

On Tue, Oct 04, 2005 at 06:17:06PM -0400, Fiel Cabral wrote:
> The sl_event table on Node 2 contains a FAILOVER_SET event but node 3 (the
> backup node specified in the failover command) does not. Should the backup
> node's sl_event table contain the FAILOVER_SET?
> 
> sl_event on node 2 contains a FAILOVER_SET:
>         ev_timestamp        | ev_origin | ev_seqno |       ev_type
> ----------------------------+-----------+----------+---------------------
>  2005-10-04 17:49:10.487603 |         2 |        1 | STORE_PATH
>  2005-10-04 17:49:10.70457  |         2 |        2 | STORE_PATH
>  2005-10-04 17:49:10.712416 |         2 |        3 | STORE_LISTEN
>  2005-10-04 17:49:10.77891  |         2 |        4 | STORE_LISTEN
>  2005-10-04 17:49:38.146642 |         2 |        5 | SUBSCRIBE_SET
>  2005-10-04 17:49:05.608095 |         1 |      306 | STORE_NODE
>  2005-10-04 17:49:05.608095 |         1 |      307 | ENABLE_NODE
>  2005-10-04 17:49:08.029042 |         1 |      308 | STORE_NODE
>  2005-10-04 17:49:08.029042 |         1 |      309 | ENABLE_NODE
>  2005-10-04 17:49:10.641208 |         1 |      310 | STORE_PATH
>  2005-10-04 17:49:10.679501 |         1 |      311 | STORE_PATH
>  2005-10-04 17:49:10.722549 |         1 |      312 | STORE_LISTEN
>  2005-10-04 17:49:10.751999 |         1 |      313 | STORE_LISTEN
>  2005-10-04 17:55:02.413185 |         2 |        6 | SYNC
>  2005-10-04 17:49:42.44082  |         1 |      314 | ENABLE_SUBSCRIPTION
>  2005-10-04 17:49:10.60801  |         3 |        1 | STORE_PATH
>  2005-10-04 17:49:42.769833 |         1 |      315 | ENABLE_SUBSCRIPTION
>  2005-10-04 17:49:10.678128 |         3 |        2 | STORE_PATH
>  2005-10-04 17:49:10.713706 |         3 |        3 | STORE_LISTEN
>  2005-10-04 17:49:10.743235 |         3 |        4 | STORE_LISTEN
>  2005-10-04 17:49:38.417454 |         3 |        5 | SUBSCRIBE_SET
>  2005-10-04 17:49:52.680621 |         1 |      316 | SYNC
>  2005-10-04 17:50:53.010532 |         1 |      317 | SYNC
>  2005-10-04 17:51:53.112317 |         1 |      318 | SYNC
>  2005-10-04 17:52:53.146222 |         1 |      319 | SYNC
>  2005-10-04 17:53:53.192119 |         1 |      320 | SYNC
>  2005-10-04 17:54:53.602106 |         1 |      321 | SYNC
>  2005-10-04 17:55:53.710807 |         1 |      322 | SYNC
>  2005-10-04 17:56:02.893106 |         2 |        7 | SYNC
>  2005-10-04 17:56:42.786823 |         3 |        6 | SYNC
>  2005-10-04 17:56:53.833985 |         1 |      323 | SYNC
>  2005-10-04 17:57:03.007883 |         2 |        8 | SYNC
>  2005-10-04 17:57:43.692981 |         3 |        7 | SYNC
>  2005-10-04 17:57:53.902912 |         1 |      324 | SYNC
>  2005-10-04 17:58:03.062867 |         2 |        9 | SYNC
>  2005-10-04 17:58:43.736478 |         3 |        8 | SYNC
>  2005-10-04 17:58:53.953325 |         1 |      325 | SYNC
>  2005-10-04 17:59:03.112996 |         2 |       10 | SYNC
>  2005-10-04 17:59:43.77303  |         3 |        9 | SYNC
>  2005-10-04 17:59:54.095892 |         1 |      326 | SYNC
>  2005-10-04 18:00:03.155204 |         2 |       11 | SYNC
>  2005-10-04 18:00:43.810793 |         3 |       10 | SYNC
>  2005-10-04 18:01:03.196571 |         2 |       12 | SYNC
>  2005-10-04 18:01:43.865925 |         3 |       11 | SYNC
>  2005-10-04 18:02:03.216029 |         2 |       13 | SYNC
>  2005-10-04 18:02:43.905505 |         3 |       12 | SYNC
>  2005-10-04 18:03:03.238632 |         2 |       14 | SYNC
>  2005-10-04 18:03:38.947704 |         1 |      327 | FAILOVER_SET
>  2005-10-04 18:03:48.819508 |         3 |       13 | SYNC
>  2005-10-04 18:03:49.921361 |         2 |       15 | SYNC
>  2005-10-04 18:04:48.875801 |         3 |       14 | SYNC
>  2005-10-04 18:04:49.970829 |         2 |       16 | SYNC
>  2005-10-04 18:05:48.92941  |         3 |       15 | SYNC
>  2005-10-04 18:05:49.985511 |         2 |       17 | SYNC
>  2005-10-04 18:06:48.963277 |         3 |       16 | SYNC
>  2005-10-04 18:06:49.998737 |         2 |       18 | SYNC
>  2005-10-04 18:07:49.033346 |         3 |       17 | SYNC
>  2005-10-04 18:07:50.028334 |         2 |       19 | SYNC
>  2005-10-04 18:08:49.051861 |         3 |       18 | SYNC
>  2005-10-04 18:08:50.056542 |         2 |       20 | SYNC
>  2005-10-04 18:09:49.075309 |         3 |       19 | SYNC
>  2005-10-04 18:09:50.093277 |         2 |       21 | SYNC
> (62 rows)
> 
> sl_event on node 3 (backup node) does not have the FAILOVER_SET:
> 
>         ev_timestamp        | ev_origin | ev_seqno |       ev_type
> ----------------------------+-----------+----------+---------------------
>  2005-10-04 17:49:10.60801  |         3 |        1 | STORE_PATH
>  2005-10-04 17:49:10.678128 |         3 |        2 | STORE_PATH
>  2005-10-04 17:49:10.713706 |         3 |        3 | STORE_LISTEN
>  2005-10-04 17:49:10.743235 |         3 |        4 | STORE_LISTEN
>  2005-10-04 17:49:38.417454 |         3 |        5 | SUBSCRIBE_SET
>  2005-10-04 17:49:10.487603 |         2 |        1 | STORE_PATH
>  2005-10-04 17:49:08.029042 |         1 |      308 | STORE_NODE
>  2005-10-04 17:49:10.70457  |         2 |        2 | STORE_PATH
>  2005-10-04 17:49:08.029042 |         1 |      309 | ENABLE_NODE
>  2005-10-04 17:49:10.712416 |         2 |        3 | STORE_LISTEN
>  2005-10-04 17:49:10.641208 |         1 |      310 | STORE_PATH
>  2005-10-04 17:49:10.77891  |         2 |        4 | STORE_LISTEN
>  2005-10-04 17:49:10.679501 |         1 |      311 | STORE_PATH
>  2005-10-04 17:49:38.146642 |         2 |        5 | SUBSCRIBE_SET
>  2005-10-04 17:49:10.722549 |         1 |      312 | STORE_LISTEN
>  2005-10-04 17:55:02.413185 |         2 |        6 | SYNC
>  2005-10-04 17:56:02.893106 |         2 |        7 | SYNC
>  2005-10-04 17:49:10.751999 |         1 |      313 | STORE_LISTEN
>  2005-10-04 17:49:42.44082  |         1 |      314 | ENABLE_SUBSCRIPTION
>  2005-10-04 17:56:42.786823 |         3 |        6 | SYNC
>  2005-10-04 17:57:03.007883 |         2 |        8 | SYNC
>  2005-10-04 17:49:42.769833 |         1 |      315 | ENABLE_SUBSCRIPTION
>  2005-10-04 17:49:52.680621 |         1 |      316 | SYNC
>  2005-10-04 17:50:53.010532 |         1 |      317 | SYNC
>  2005-10-04 17:51:53.112317 |         1 |      318 | SYNC
>  2005-10-04 17:52:53.146222 |         1 |      319 | SYNC
>  2005-10-04 17:53:53.192119 |         1 |      320 | SYNC
>  2005-10-04 17:54:53.602106 |         1 |      321 | SYNC
>  2005-10-04 17:55:53.710807 |         1 |      322 | SYNC
>  2005-10-04 17:56:53.833985 |         1 |      323 | SYNC
>  2005-10-04 17:57:43.692981 |         3 |        7 | SYNC
>  2005-10-04 17:57:53.902912 |         1 |      324 | SYNC
>  2005-10-04 17:58:03.062867 |         2 |        9 | SYNC
>  2005-10-04 17:58:43.736478 |         3 |        8 | SYNC
>  2005-10-04 17:58:53.953325 |         1 |      325 | SYNC
>  2005-10-04 17:59:03.112996 |         2 |       10 | SYNC
>  2005-10-04 17:59:43.77303  |         3 |        9 | SYNC
>  2005-10-04 17:59:54.095892 |         1 |      326 | SYNC
>  2005-10-04 18:00:03.155204 |         2 |       11 | SYNC
>  2005-10-04 18:00:43.810793 |         3 |       10 | SYNC
>  2005-10-04 18:01:03.196571 |         2 |       12 | SYNC
>  2005-10-04 18:01:43.865925 |         3 |       11 | SYNC
>  2005-10-04 18:02:03.216029 |         2 |       13 | SYNC
>  2005-10-04 18:02:43.905505 |         3 |       12 | SYNC
>  2005-10-04 18:03:03.238632 |         2 |       14 | SYNC
>  2005-10-04 18:03:48.819508 |         3 |       13 | SYNC
>  2005-10-04 18:03:49.921361 |         2 |       15 | SYNC
>  2005-10-04 18:04:48.875801 |         3 |       14 | SYNC
>  2005-10-04 18:04:49.970829 |         2 |       16 | SYNC
>  2005-10-04 18:05:48.92941  |         3 |       15 | SYNC
>  2005-10-04 18:05:49.985511 |         2 |       17 | SYNC
>  2005-10-04 18:06:48.963277 |         3 |       16 | SYNC
>  2005-10-04 18:06:49.998737 |         2 |       18 | SYNC
>  2005-10-04 18:07:49.033346 |         3 |       17 | SYNC
>  2005-10-04 18:07:50.028334 |         2 |       19 | SYNC
>  2005-10-04 18:08:49.051861 |         3 |       18 | SYNC
>  2005-10-04 18:08:50.056542 |         2 |       20 | SYNC
>  2005-10-04 18:09:49.075309 |         3 |       19 | SYNC
>  2005-10-04 18:09:50.093277 |         2 |       21 | SYNC
>  2005-10-04 18:10:49.100012 |         3 |       20 | SYNC
>  2005-10-04 18:10:50.117138 |         2 |       22 | SYNC
> (61 rows)
> 
> 
> On 10/4/05, Fiel Cabral <e4696wyoa63emq6w3250kiw60i45e1 at gmail.com> wrote:
> 
>     The problem persists after the node IDs were changed from [1, 2, 3] to [10,
>     20, 30].
> 
>     Inside gdb, the failedNode2 query did not return an error (function return
>     value was 0).
> 
>     Node 2 was able to change set_origin to node 3.
>     Node 3 is stuck with set_origin = node 1.
> 
> 
>     On 10/4/05, Fiel Cabral < e4696wyoa63emq6w3250kiw60i45e1 at gmail.com > wrote:
> 
>         Thanks Elein. I'll run gdb and step through slonik_failed_node to
>         (maybe) see if failedNode2 is failing.
> 
> 
> 
>         On 10/4/05, elein <elein at varlena.com > wrote:
> 
>             Fiel,
> 
>             In my own tests, with node 10->20->30, failover from 10 to 20
>             failed because node 30 was unusable and had to be recreated
>             from scratch. This is a serious bug in my book.
> 
>             In one case the problem seemed to be dropping the first node
>             "too soon".  I have not tested that case so I don't know that
>             this was the problem.
> 
>             What I have verified is that the third node never received any
>             message regarding the failover and did not change its
>             information to get its table set from the new origin, 20.
> 
>             Also, try not to use Node 1, 2, 3.  Node 1 has some special meaning
>             in some cases that you will want to avoid.
> 
>             We are with you, not ignoring you.
> 
>             --elein
> 
>             On Tue, Oct 04, 2005 at 11:13:19AM -0400, Fiel Cabral wrote:
>             > Right after running the failover command I issue the DROP NODE command to drop
>             > node 1. slonik prints an error message and exits with return value 12:
>             >
>             > sys:17: TRY: drop node
>             > sys:19: PGRES_FATAL_ERROR select "_whatever".dropNode(1);  - ERROR:  Slony-I:
>             > Node 1 is still origin of one or more sets
>             >
>             > Something should have changed the origin to node 3 but it isn't happening.
>             >
>             >
>             > On 10/4/05, Fiel Cabral <e4696wyoa63emq6w3250kiw60i45e1 at gmail.com
>             > wrote:
>             >
>             >     I have 3 nodes. Nodes 2 and 3 are subscribers of node 1 and I'm trying to
>             >     failover from node 1 to node 3. The failover command succeeds but the
>             >     database of node 3 is still read-only and the origin is still node 1. I
>             >     don't have the same problem when doing failover with only two nodes because
>             >     the set is moved immediately by failedNode.
>             >
>             >     failedNode (in the code below) is able to set the provider successfully.
>             >
>             >     Some code elsewhere is actually moving the replication set. Where is that
>             >     code? Is it in slon or slonik or in the sql scripts?
>             >
>             >     How do I find out that slon caught the signal and is doing the right thing
>             >     in response to the signal?
>             >
>             >         raise notice ''failedNode: set % has other direct receivers - change providers only'', v_row.set_id;
>             >         -- ----
>             >         -- Backup node is not the only direct subscriber. This
>             >         -- means that at this moment, we redirect all direct
>             >         -- subscribers to receive from the backup node, and the
>             >         -- backup node itself to receive from another one.
>             >         -- The admin utility will wait for the slon engine to
>             >         -- restart and then call failedNode2() on the node with
>             >         -- the highest SYNC and redirect this to it on
>             >         -- backup node later.
>             >         -- ----
>             >     ... etc ...
>             >
>             >         -- ----
>             >         -- Make sure the node daemon will restart
>             >         -- ----
>             >         notify "_@CLUSTERNAME@_Restart";
>             >
>             >         816
>             >
>             >     -Fiel
>             >
>             >
>             >
>             >
>             >
>             >
> 
>             > _______________________________________________
>             > Slony1-general mailing list
>             > Slony1-general at gborg.postgresql.org
>             > http://gborg.postgresql.org/mailman/listinfo/slony1-general
> 
> 
> 
> 
> 
> 
> 



