[Slony1-general] pg_dump and replication lag in 2.0.7

Thu Sep 8 10:43:29 PDT 2011

On 11-09-08 11:43 AM, Glyn Astill wrote:

>
>      SELECT st_origin, st_received, st_lag_num_events, round(extract(epoch from st_lag_time))
>      FROM "<my_replication_cluster>".sl_status;
>
> A graph for the weeks leading up to and after the upgrade is attached.  I upgraded on the night of the 25th/26th and ignoring any other downtime where I was obviously fiddling with things, you can see the syncs going out after that date.  As you can imagine, I'm massively embarrassed that it took me 3 months to notice it happening.
>

st_lag_time measures the difference between now() and the last 
unconfirmed event.  The pg_dump locks sl_event, which prevents SYNCs 
from being created, so there might not be any unconfirmed events for 
this check to measure.

Sometime between 2.0.4 and 2.0.6 we fixed a bug that prevented SYNC 
events from being generated from pure slaves.  I suspect your check is 
now measuring the other half of replication.  If you select from 
sl_status you should see at least two rows; it isn't clear whether 
you're graphing both of them or just one.

If now()-st_last_event_ts gets too high, it means that SYNC events are 
not being generated.  You might want to alert on both SYNC events not 
being generated and events not being confirmed.
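
Something along these lines would show both numbers side by side, one 
row per origin/receiver pair.  Treat it as an untested sketch: it reuses 
the cluster placeholder from your query, the column aliases are names 
made up for illustration, and the thresholds to alert on are left to you.

```sql
-- Sketch only: substitute your own cluster schema for the placeholder.
SELECT st_origin,
       st_received,
       st_lag_num_events,
       -- seconds since the oldest unconfirmed event (events not confirmed)
       round(extract(epoch from st_lag_time)) AS confirm_lag_secs,
       -- seconds since the origin last generated an event (SYNCs not generated)
       round(extract(epoch from (now() - st_last_event_ts))) AS sync_age_secs
FROM "<my_replication_cluster>".sl_status;
```

Run on each node, a large confirm_lag_secs points at events not being 
confirmed, while a large sync_age_secs points at SYNC events not being 
generated, for example while a pg_dump holds its lock on sl_event.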

>
> Glyn