Christopher Browne cbbrowne
Tue Jan 16 07:41:14 PST 2007
Thomas Pundt <mlists at rp-online.de> writes:
> | A node with id = 3 lags 17 days behind the origin. While that node
> | is lagging, all events are kept in the log.
> |
> | So you need to either let that node catch up or remove it from the
> | cluster. When that is completed, the log will be cleaned after about
> | 20 minutes.
>
> yes, that's clear; what I don't know is what I can do to let the node catch
> up with the rest. I suspect that somehow maybe a confirmation event got lost.
> Node 3 seems to be up to date itself:
>
> This is the master (node 1):
> [pg81 at pgmaster:~] echo 'select count(1) from "_RPO".sl_event;' | psql 
> RPONLINE -p5481
>  count  
> --------
>  935314
> (1 row)
>
> This is from node 3:
> [pg81 at pgmaster:~] echo 'select count(1) from "_RPO".sl_event;' | psql 
> RPONLINE -p5481 -hpgdb2
>  count 
> -------
>   1050
> (1 row)
>
> What I'd like to know is: is there anything (except rebuilding node 3) I can
> do to get rid of the old events?

It's not at all clear from this whether node 3 is actually behind or not.

On the origin node run the query:
   select * from "_RPO".sl_status;

If *that* reports node 3 as being way behind, then that's a pretty
good indication that node 3 is way behind.
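
If you want to pin it down to node 3 specifically, something along
these lines ought to do it (column names from memory; check the view
definition if yours differ):

   echo 'select st_received, st_lag_num_events, st_lag_time
           from "_RPO".sl_status
          where st_received = 3;' | psql RPONLINE -p5481

If st_lag_num_events is huge and st_lag_time is anywhere near those 17
days, node 3 genuinely is behind.  If both are small, then it's the
confirmations that aren't making it back, not the data.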

One possibility comes to mind:

We've got one environment where the network is a little bit flakey;
occasionally a node will cease to successfully pass back
confirmations.  It's replicating fine; it's just not reporting that it
is.

Restarting the slon processes cleared out old connections and brought
back sanity.

It's easy enough (and safe enough) to stop and restart slon processes;
you might try doing that and see if that clears some blockage of
confirmation events.
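
For what it's worth, the restart itself is nothing fancy.  Roughly like
this, run on whatever box hosts the slon for node 3; I'm guessing at
the conninfo and log location, so adjust to match how you normally
start slon:

   # stop the slon(s) for the cluster; use your usual wrapper if you have one
   pkill -f 'slon RPO'

   # restart the slon for node 3; note the cluster name slon wants is
   # RPO, i.e. the schema name "_RPO" without the leading underscore
   nohup slon RPO 'dbname=RPONLINE host=pgdb2 port=5481 user=pg81' \
         >> /tmp/slon_node3.log 2>&1 &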

Otherwise, if node #3 is honest-to-goodness Way Behind, then the thing
to investigate is what's up with that.  That's causing sl_log_1 to
build up in size as a side-effect; don't worry about the size; that'll
fix itself after you fix what's broken about node #3.
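
A quick way to see how much room the cleanup actually has: check what
node 3 has confirmed back to the origin, since sl_log_1 only gets
trimmed up to events that every subscriber has confirmed.  On node 1
(column names from memory again):

   echo 'select con_origin, max(con_seqno) as last_confirmed,
                max(con_timestamp) as confirmed_at
           from "_RPO".sl_confirm
          where con_received = 3
          group by con_origin;' | psql RPONLINE -p5481

If last_confirmed sits weeks behind the newest ev_seqno in sl_event on
node 1, that is exactly why sl_log_1 keeps growing.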

Maybe you need to drop node 3 or unsubscribe its sets, and rebuild it;
that's not obvious at this point.  Check the logs for node 3 to see
what *is* up with it...
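
If it does come to dropping it, the slonik for that is short; the admin
conninfos below are guesses based on what you posted, so fix them up
before running anything:

slonik <<_EOF_
cluster name = RPO;
node 1 admin conninfo = 'dbname=RPONLINE host=pgmaster port=5481 user=pg81';
node 3 admin conninfo = 'dbname=RPONLINE host=pgdb2 port=5481 user=pg81';
drop node (id = 3, event node = 1);
_EOF_

If you'd rather keep the node and just kill its subscription, swap the
drop node line for

   unsubscribe set (id = 1, receiver = 3);

using whichever set id(s) node 3 actually subscribes to (I'm assuming
set 1 here).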
-- 
(reverse (concatenate 'string "ofni.sailifa.ac" "@" "enworbbc"))
<http://dba2.int.libertyrms.com/>
Christopher Browne
(416) 673-4124 (land)


