Christopher Browne cbbrowne
Tue Feb 8 15:42:34 PST 2005
Postgres Learner wrote:

>Hi all!
>I am relatively new to Slony, though I do have some experience
>with Postgres. I need to set Slony up in a special production
>environment. Basically, our production database remains idle for most
>of the day and does some extremely heavy processing in batches for
>around 30 minutes a day.
>
>We need to set up a replicated database to fail over to in case
>something goes wrong, so I was thinking about using Slony. Since we
>need to process our batches REALLY REALLY fast, I was thinking of
>stopping the Slony daemon while we process the batches and restarting
>it after we are done. This seems to work fine - I set it up, and
>Slony continues replication from where it left off after I shut down
>the daemons. Please tell me if this is wrong.
>
>Now the problem is that I can't figure out how to measure the time
>taken to replicate/resync after the Slony daemon restarts. Basically,
>I want to know the window after which one server can safely go down
>without causing problems. I looked into the documentation and even
>searched the net, but couldn't find anything. Please point me in
>the right direction.
>  
>
Vivek has suggested some useful ideas for determining whether things
are back up to date.

I'll suggest another: inject a change right at what you consider to be
"the end," and then check whether that change has propagated. If it
has, then all the previous changes have made it through as well, and
you should be safe to shut things down.
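
For instance (a rough sketch; "marker" here is a hypothetical
single-row table that you would have to create and add to the
replication set ahead of time):

    -- On the origin, immediately after the last batch statement:
    INSERT INTO marker (batch_run, stamped_at) VALUES (42, now());

    -- On the subscriber, poll until the row appears:
    SELECT stamped_at FROM marker WHERE batch_run = 42;

Since changes are applied on the subscriber in origin commit order,
once that row shows up, everything committed before it is there too.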

What we have traditionally done in our environment, as a sort of
end-to-end test that replication is working, is to add a
"replication_test" table, replicate it, update it periodically, and
check whether the updates make it through.
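
In rough outline, that looks something like this (the names are
illustrative, the table has to be added to a replication set like any
other, and comparing timestamps across servers assumes their clocks
are reasonably well synchronized):

    -- One-time setup on the origin, replicated to the subscriber:
    CREATE TABLE replication_test (
        id         integer PRIMARY KEY,
        last_touch timestamptz NOT NULL
    );
    INSERT INTO replication_test VALUES (1, now());

    -- On the origin, periodically (say, once a minute from cron):
    UPDATE replication_test SET last_touch = now() WHERE id = 1;

    -- On the subscriber, estimate the lag:
    SELECT now() - last_touch AS apparent_lag FROM replication_test;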

Alternatively, if you have some sort of transaction ID or batch ID, you
might check on the subscriber to see whether the last one found on the
origin has made it through.
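
Something along these lines, supposing the batches land in a table with
a monotonically increasing batch_id (hypothetical names again):

    -- On the origin, once the batch run finishes:
    SELECT max(batch_id) FROM batches;

    -- On the subscriber, poll until it reports the same value:
    SELECT max(batch_id) FROM batches;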

The other issue worth thinking about is whether this usage "abuses"
the replication system such that it would be considered "poor usage."
I do have a couple of thoughts...

1.  You should leave the slon running against the origin node all the 
time, including when it is undergoing the heavy processing.

If you shut it off, you'll discover that all of the changes made during
that 30-minute period get treated as one really big SYNC, and things
will behave badly when you turn the "subscriber" slon back on and it
tries to grab all of that data at once.

If you instead leave it on, there will likely be hundreds to thousands
of SYNCs during those 30 minutes, and turning replication back on will
"play better."

An alternative to running that slon is to look in CVS HEAD for
"generate_syncs.sh" and run it as often as possible during the peak
time (I'd run it at least once per minute, e.g. from cron, and would
prefer more often than that). It won't cost much, performance-wise,
and will cut down on the grief at the end of the 30 minutes.

2.  You should try to keep the slons running most of the time so that
the systems stay largely in sync and rows do not build up in sl_log_1
and sl_seqlog.

If you shut off replication for days at a time, those tables will
build up, and performance will suffer.
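
A quick way to keep an eye on that buildup (again, "_mycluster" stands
in for your actual cluster name):

    SELECT (SELECT count(*) FROM "_mycluster".sl_log_1)  AS log_rows,
           (SELECT count(*) FROM "_mycluster".sl_seqlog) AS seqlog_rows;

If those counts keep growing from day to day, the cleanup thread is
not getting a chance to trim them, and catching up will only get
slower.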

