Thu Sep 29 20:28:39 PDT 2005
- Previous message: [Slony1-general] Failover failures
- Next message: [Slony1-general] Failover Stalls
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
I believe I waited a while for the failover to get
over to the third node. Should I have killed and
restarted the slon processes?
If I get a chance today, I will try to reconstruct
my test cases.
--elein
On Wed, Sep 28, 2005 at 06:08:34PM -0400, Christopher Browne wrote:
> elein wrote:
>
> >Chris Browne has been helpful but busy.
> >Can someone else try to look into this problem?
> >
> >It seems to be a basic failure on failover when
> >there are cascaded replicas. You are left unable
> >to add the failed node back in.
> >
> >
> I am getting convinced that the problem (a problem?) lies in trying to
> delete the dead node too soon.
>
> If I work slowly enough, it all "works out;" I can drop the failed node
> and add it back in.
>
> If I wait (oh, say 10 minutes) between FAILOVER and dropping node 1,
> then that's enough time for references to node 1 to get purged out of
> event logs and sl_setsync and such.
>
> If I don't wait, the "unrelated" nodes wind up falling over, unhappy
> about the attempt to delete a node that they still are referencing.
>
> This seems a nicely evil bit of log...
>
> ERROR remoteWorkerThread_2: "begin transaction; set transaction
> isolation level serializable; lock table "_failtest".sl_config_lock;
> select "_failtest".dropNode_int(1); notify "_failtest_Restart"; notify
> "_failtest_Event"; notify "_failtest_Confirm"; insert into
> "_failtest".sl_event (ev_origin, ev_seqno, ev_timestamp,
> ev_minxid, ev_maxxid, ev_xip, ev_type , ev_data1 ) values ('2',
> '102', '2005-09-28 21:55:53.176607', '462075', '462076', '',
> 'DROP_NODE', '1'); insert into "_failtest".sl_confirm (con_origin,
> con_received, con_seqno, con_timestamp) values (2, 3, '102',
> CURRENT_TIMESTAMP); commit transaction;" PGRES_FATAL_ERROR ERROR:
> update or delete on "sl_node" violates foreign key constraint
> "ssy_origin-no_id-ref" on "sl_setsync"
> DETAIL: Key (no_id)=(1) is still referenced from table "sl_setsync".
>
> This bit of log file gets generated on node #3 (the extra subscriber)
> when I submit the DROP NODE request.
>
> It's not suggesting an analytical fix to me just yet... I'm not sure I
> want to automagically delete the sl_setsync entry at the point
> that this happens...
>
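A minimal sketch, assuming Python and a hypothetical `count_refs` callback. Instead of sleeping a fixed ten minutes between FAILOVER and DROP NODE, one could poll each surviving node until no `sl_setsync` row still names the failed node as `ssy_origin`, and only then submit the drop. The query text and the `ssy_origin` column are assumptions inferred from the `ssy_origin-no_id-ref` constraint in the log above. How the query is actually run against each node is left to the caller.

```python
import time
from typing import Callable, Iterable

# Hypothetical diagnostic query. Assumes sl_setsync has an ssy_origin column,
# which the "ssy_origin-no_id-ref" foreign key in the error message points at.
# Plain string formatting is for illustration only; real code should bind parameters.
SETSYNC_REFS_SQL = (
    'select count(*) from "_{cluster}".sl_setsync where ssy_origin = {node}'
)


def wait_until_unreferenced(
    nodes: Iterable[int],
    failed_node: int,
    count_refs: Callable[[int, int], int],
    timeout_s: float = 600.0,
    poll_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll every surviving node until none still references failed_node.

    count_refs(node, failed_node) is a hypothetical callback that runs the
    query above on `node` and returns the row count.
    Returns True once every count is zero, False if timeout_s elapses first.
    """
    node_list = list(nodes)
    deadline = clock() + timeout_s
    while True:
        remaining = {n: count_refs(n, failed_node) for n in node_list}
        if all(c == 0 for c in remaining.values()):
            return True  # safe (as far as sl_setsync goes) to issue DROP NODE
        if clock() >= deadline:
            return False  # still referenced; do not drop yet
        sleep(poll_s)
```

This only checks the one reference the foreign key error names; leftovers elsewhere (for example `sl_event` rows originating from node 1) would need their own checks, and it sidesteps rather than answers the question of whether slon should clean up `sl_setsync` itself.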