[Slony1-general] pg_dump and replication lag in 2.0.7

Fri Sep 9 01:27:58 PDT 2011

> From: Steve Singer <ssinger at ca.afilias.info>
> On 11-09-08 11:43 AM, Glyn Astill wrote: 
>> 
>>       SELECT st_origin, st_received, st_lag_num_events,
>>              round(extract(epoch from st_lag_time))
>>       FROM "<my_replication_cluster>".sl_status;
>> 
>> A graph for the weeks leading up to and after the upgrade is attached.  I
>> upgraded on the night of the 25th/26th and, ignoring any other downtime where
>> I was obviously fiddling with things, you can see the syncs going out after
>> that date.  As you can imagine, I'm massively embarrassed that it took me 3
>> months to notice it happening.
>> 
> 
> st_lag_time is a measure of the difference between now() and the last
> unconfirmed event.  The pg_dump locks sl_event, which prevents the SYNCs
> from being created, so there might not be any unconfirmed events for this
> check to measure.
> 
> 
> Sometime between 2.0.4 and 2.0.6 we fixed a bug that prevented SYNC events from
> being generated from pure slaves. I suspect your check is now measuring the
> other half of replication (if you do your select from sl_status you should see
> at least two rows; it isn't clear if you're graphing both of them or just
> one).
> 
> If now() - st_last_event_ts gets too high, it means that SYNC events are not
> being generated.  You might want to alert on both SYNC events not being
> generated and events not being confirmed.
> 
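
For reference, the two checks suggested above could be combined into one query along these lines. This is only a sketch: the helper name slony_lag_sql, the column aliases, the schema name and the database name are all placeholders of mine, not anything from Slony itself. It relies only on the sl_status columns already mentioned in this thread.

```shell
# Print a query that reports both kinds of lag for each sl_status row:
#   confirm_lag_secs - age of the oldest unconfirmed event (st_lag_time)
#   sync_age_secs    - time since the last event was generated
# $1 is the Slony schema name (placeholder; substitute your own).
slony_lag_sql() {
  schema="$1"
  cat <<EOF
SELECT st_origin,
       st_received,
       st_lag_num_events,
       round(extract(epoch FROM st_lag_time)) AS confirm_lag_secs,
       round(extract(epoch FROM now() - st_last_event_ts)) AS sync_age_secs
FROM "${schema}".sl_status;
EOF
}

# Usage (schema and database names are placeholders):
#   slony_lag_sql _my_replication_cluster | psql -At -d mydb
```

Alerting when either number climbs past a threshold would cover both SYNCs not being generated and events not being confirmed.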

Okay, you know better than me.  However, I'm positive that when we were on 1.2 and I was in overnight, our slaves were up to date whilst the backups were running.  It's only circumstantial of course, but I'm pretty sure I'd have noticed in 3 years if not, as I'd query those slaves all the time.

I've excluded the Slony schema from the dump now, so we're all good anyway.
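
A minimal sketch of that kind of exclusion, for anyone hitting the same thing. It uses pg_dump's --exclude-schema option; the function name, schema name, database name and output path are placeholders rather than the exact command used here. The Slony schema is normally the cluster name with a leading underscore.

```shell
# Dump a database while skipping the Slony schema, so pg_dump takes no
# locks on sl_event and the other Slony tables.
# $1: Slony schema name, $2: database name, $3: output file (all placeholders).
dump_without_slony() {
  pg_dump --exclude-schema="$1" --file="$3" "$2"
}

# Usage (names are placeholders):
#   dump_without_slony _my_replication_cluster mydb /backups/mydb.sql
```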

Thanks
Glyn