Fiel Cabral e4696wyoa63emq6w3250kiw60i45e1
Tue Oct 4 22:35:57 PDT 2005
The problem persists after the node IDs were changed from [1, 2, 3] to [10,
20, 30].

Inside gdb, the failedNode2 query did not return an error (function return
value was 0).

Node 2 was able to change its set_origin to node 3.
Node 3 is stuck with set_origin = node 1.
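
To check the origin that each node believes a set has, one can query the
Slony-I catalog table sl_set directly on every node (a sketch; the schema
name "_whatever" is taken from the error message quoted below):

    -- run on each node; after a successful failover, set_origin should
    -- show the backup node everywhere
    select set_id, set_origin from "_whatever".sl_set;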

On 10/4/05, Fiel Cabral <e4696wyoa63emq6w3250kiw60i45e1 at gmail.com> wrote:
>
> Thanks Elein. I'll run gdb and step through slonik_failed_node to (maybe)
> see if failedNode2 is failing.
>
>
> On 10/4/05, elein <elein at varlena.com> wrote:
> >
> > Fiel,
> >
> > In my own tests, with node 10->20->30, failover from 10 to 20 failed
> > because node 30 was unusable and had to be recreated from scratch.
> > This is a serious bug in my book.
> >
> > In one case the problem seemed to be dropping the first node
> > "too soon". I have not tested that case, so I don't know whether
> > that was the problem.
> >
> > What I have verified is that the third node never received any message
> > regarding the failover and did not change its information
> > to get its table set from the new origin, 20.
> >
> > Also, try not to use nodes 1, 2, 3. Node 1 has special meaning
> > in some cases that you will want to avoid.
> >
> > We are with you, not ignoring you.
> >
> > --elein
> >
> > On Tue, Oct 04, 2005 at 11:13:19AM -0400, Fiel Cabral wrote:
> > > Right after running the failover command I issue the DROP NODE command
> > > to drop node 1. slonik prints an error message and exits with return
> > > value 12:
> > >
> > > sys:17: TRY: drop node
> > > sys:19: PGRES_FATAL_ERROR select "_whatever".dropNode(1); - ERROR:
> > > Slony-I: Node 1 is still origin of one or more sets
> > >
> > > Something should have changed the origin to node 3 but it isn't
> > > happening.
> > >
> > >
> > > On 10/4/05, Fiel Cabral <e4696wyoa63emq6w3250kiw60i45e1 at gmail.com>
> > > wrote:
> > >
> > > I have 3 nodes. Nodes 2 and 3 are subscribers of node 1, and I'm
> > > trying to fail over from node 1 to node 3. The failover command
> > > succeeds, but the database of node 3 is still read-only and the origin
> > > is still node 1. I don't have the same problem when doing failover
> > > with only two nodes, because the set is moved immediately by
> > > failedNode.
> > >
> > > failedNode (in the code below) is able to set the provider
> > > successfully.
> > >
> > > Some code elsewhere is actually moving the replication set. Where is
> > > that code? Is it in slon or slonik or in the SQL scripts?
> > >
> > > How do I find out that slon caught the signal and is doing the right
> > > thing in response to the signal?
> > >
> > > 784    raise notice ''failedNode: set % has other direct receivers -
> > >        change providers only'', v_row.set_id;
> > > 785    -- ----
> > > 786    -- Backup node is not the only direct subscriber. This
> > > 787    -- means that at this moment, we redirect all direct
> > > 788    -- subscribers to receive from the backup node, and the
> > > 789    -- backup node itself to receive from another one.
> > > 790    -- The admin utility will wait for the slon engine to
> > > 791    -- restart and then call failedNode2() on the node with
> > > 792    -- the highest SYNC and redirect this to it on
> > > 793    -- backup node later.
> > > 794    -- ----
> > > ... etc ...
> > > 811
> > > 812    -- ----
> > > 813    -- Make sure the node daemon will restart
> > > 814    -- ----
> > > 815    notify "_@CLUSTERNAME@_Restart";
> > > 816
> > >
> > > -Fiel
> > >
> > > _______________________________________________
> > > Slony1-general mailing list
> > > Slony1-general at gborg.postgresql.org
> > > http://gborg.postgresql.org/mailman/listinfo/slony1-general
> >
> >
>