[Solved] Re: [Slony1-general] replication Lag, sync grouping not happening

Thu Feb 21 14:47:44 PST 2008

Ow Mun Heng <Ow.Mun.Heng at wdc.com> writes:
> On Thu, 2008-02-21 at 18:47 +0000, Christopher Browne wrote:
>> Ow Mun Heng <Ow.Mun.Heng at wdc.com> writes:
>> > I'm not sure what is happening. Usually, when the slave lags behind, I
>> > stop the slon process and add the -o10 -g500 options to the master
>> > process, and I then see the syncs on the subscriber being grouped
>> > together.
>> >
>> > As of right now, I'm not seeing this happen; it's just processing
>> > the syncs 1 by 1, and that is taking a long time.
>> >
>> > I also tried the -o10 -g500 on both the master and the slave and still
>> > it goes 1 by 1.
>> > 2008-02-22 01:58:51 MYT DEBUG2 remoteHelperThread_1_1: inserts=0 updates=0 deletes=0
>> > 2008-02-22 01:58:51 MYT DEBUG2 remoteWorkerThread_1: SYNC 483711 done in 88.818 seconds
>> > 2008-02-22 02:00:07 MYT DEBUG2 remoteHelperThread_1_1: inserts=1984 updates=0 deletes=0
>> > 2008-02-22 02:00:10 MYT DEBUG2 remoteWorkerThread_1: SYNC 483712 done in 78.634 seconds
>> > 2008-02-22 02:01:34 MYT DEBUG2 remoteHelperThread_1_1: inserts=529 updates=0 deletes=56
>> > 2008-02-22 02:01:36 MYT DEBUG2 remoteWorkerThread_1: SYNC 483713 done in 85.745 seconds
>> > 2008-02-22 02:03:25 MYT DEBUG2 remoteHelperThread_1_1: inserts=1532 updates=0 deletes=0
>> > 2008-02-22 02:03:28 MYT DEBUG2 remoteWorkerThread_1: SYNC 483714 done in 112.476 seconds
>> > 2008-02-22 02:05:47 MYT DEBUG2 remoteHelperThread_1_1: inserts=1557 updates=0 deletes=0
>> > 2008-02-22 02:05:49 MYT DEBUG2 remoteWorkerThread_1: SYNC 483715 done in 140.691 seconds
>> > 2008-02-22 02:08:26 MYT DEBUG2 remoteHelperThread_1_1: inserts=2600 updates=0 deletes=225
>> > 2008-02-22 02:08:27 MYT DEBUG2 remoteWorkerThread_1: SYNC 483716 done in 157.839 seconds
>> 
>> I believe that -o10 tells Slony-I to aim for an estimated SYNC
>> processing time of 10ms per group; the value is measured in
>> milliseconds, not seconds.
>> 
>> That being the case, if the last *single* SYNC took "lots more than
>> 10ms," then the slon will not be considering processing several SYNCs
>> at once.  (And note that since the times were also >>> 10s, the
>> principle would still hold if -o was measuring in seconds.)
>> 
>> Based on the timings you indicate, the only way that you'll see SYNC
>> grouping is if you set the value to something more like 200000.
>
> Master : slon -d4 -c2 -g500 -s60000 -o200000 -f slon_master.conf
> Slave :  slon -d2     -g500         -o200000 -f slon_slave1.conf | egrep -i 'done in|inserts='
>
> 2008-02-22 03:11:58 MYT DEBUG2 remoteHelperThread_1_1: inserts=807 updates=0 deletes=130
> 2008-02-22 03:12:04 MYT DEBUG2 remoteWorkerThread_1: SYNC 483756 done in 62.840 seconds
> 2008-02-22 03:13:19 MYT DEBUG2 remoteHelperThread_1_1: inserts=4626 updates=0 deletes=8
> 2008-02-22 03:13:19 MYT DEBUG2 remoteWorkerThread_1: SYNC 483759 done in 75.382 seconds
> 2008-02-22 03:14:49 MYT DEBUG2 remoteHelperThread_1_1: inserts=8824 updates=0 deletes=418
> 2008-02-22 03:14:50 MYT DEBUG2 remoteWorkerThread_1: SYNC 483766 done in 90.575 seconds
> 2008-02-22 03:17:06 MYT DEBUG2 remoteHelperThread_1_1: inserts=19587 updates=0 deletes=566
> 2008-02-22 03:17:07 MYT DEBUG2 remoteWorkerThread_1: SYNC 483781 done in 136.992 seconds
> 2008-02-22 03:20:12 MYT DEBUG2 remoteHelperThread_1_1: inserts=24451 updates=1138 deletes=484
> 2008-02-22 03:20:14 MYT DEBUG2 remoteWorkerThread_1: SYNC 483802 done in 187.493 seconds
>
> Seems like it is starting to process the SYNCs in groups again.
>
> To be frank, the -o and -s options really befuddle me. I've read the docs
> but I guess I don't understand them well enough to know whether these
> options work on the master or the slave (hence, as above, I just put them
> on both master and slave).

Yes, the docs aren't clear enough.

-o and -g are relevant to subscriber behaviour; they do not affect the
origin in any way.

-s and -t are relevant to origin behaviour, primarily.  (Events are,
every so often, generated on ALL nodes, so it's not solely an "origin
thing.")

Clearly this warrants elaborating on the docs...
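
One way to check whether the subscriber is grouping SYNCs is to compare
consecutive SYNC numbers in the "done in" log lines. The sketch below
is only an illustration: it assumes that the SYNC number logged by
remoteWorkerThread is the last SYNC of the group just applied, so the
gap to the previous logged number approximates the group size. The
function name sync_groups and the SAMPLE list are made up for this
example; the log lines themselves are taken from the output quoted
above.

```python
import re

# Matches e.g. "SYNC 483759 done in 75.382 seconds"
SYNC_RE = re.compile(r"SYNC (\d+) done in ([\d.]+) second")

def sync_groups(log_lines):
    """Return (sync_no, group_size, seconds) for each 'done in' line
    after the first one.

    group_size is the gap to the previously logged SYNC number; this
    assumes the logged number is the last SYNC of the applied group.
    """
    results = []
    prev = None
    for line in log_lines:
        m = SYNC_RE.search(line)
        if not m:
            continue  # skip the inserts=/updates=/deletes= lines
        sync_no, seconds = int(m.group(1)), float(m.group(2))
        if prev is not None:
            results.append((sync_no, sync_no - prev, seconds))
        prev = sync_no
    return results

SAMPLE = [
    "2008-02-22 03:12:04 MYT DEBUG2 remoteWorkerThread_1: SYNC 483756 done in 62.840 seconds",
    "2008-02-22 03:13:19 MYT DEBUG2 remoteWorkerThread_1: SYNC 483759 done in 75.382 seconds",
    "2008-02-22 03:14:50 MYT DEBUG2 remoteWorkerThread_1: SYNC 483766 done in 90.575 seconds",
    "2008-02-22 03:17:07 MYT DEBUG2 remoteWorkerThread_1: SYNC 483781 done in 136.992 seconds",
    "2008-02-22 03:20:14 MYT DEBUG2 remoteWorkerThread_1: SYNC 483802 done in 187.493 seconds",
]

if __name__ == "__main__":
    for sync_no, size, seconds in sync_groups(SAMPLE):
        print(f"SYNC {sync_no}: ~{size} SYNCs in {seconds:.1f}s")
```

A gap of 1 between consecutive lines, as in the first log excerpt
(483711, 483712, ...), means no grouping; gaps that keep growing mean
the slon is widening its groups.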

> On another front, I tend to believe that one of the reasons for the lag
> is that my disks are slow (1x 500GB IDE 7200 rpm, and they're bogged
> down; atop shows 90% usage nearly 80% of the time). On top of that, I
> noticed that it slows down even more when sl_log_1/2 becomes
> large (~2GB), and no amount of vacuum/reindex/recreate index will get it
> back up to speed (fetch 100 from log becomes really slow, >500 secs).
>
> Chris (you) already showed me how to manually force a logswitch, so
> now I'm considering setting up a job to force the switch about every
> 6 hours, just to keep the size under control. Is this a good idea?

You can automate that by, um, ...  Hey, that's not well enough
documented :-).

The default, as controlled by data in sl_registry, is to switch logs
once per week.

Actually, looking at the code, the way it accesses sl_registry,
there's not a straightforward way to change that to either daily or
multiple times per day.  I think for CVS HEAD, I'm inclined to clean
that stuff out because there is a cleaner way, using the new parameter,
cleanup_interval.

For now, I think that running a script that starts the log switch
every 6 hours is probably about the right idea.
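
A minimal sketch of such a script follows. It assumes that the log
switch is started by calling the logswitch_start() stored function in
the cluster's schema; check the function name against the Slony-I
version you have installed. The cluster name "mycluster", the database
name "mydb", the host, and the script file name are placeholders
invented for this example.

```python
import subprocess

def logswitch_command(cluster, dbname, host="localhost"):
    """Build the psql command line that asks Slony-I to start a log switch.

    Assumption: the cluster schema is "_<cluster>" and it provides a
    logswitch_start() function; verify this on your installation.
    """
    sql = f'select "_{cluster}".logswitch_start();'
    # -A -t: unaligned, tuples-only output, convenient for scripting
    return ["psql", "-h", host, "-d", dbname, "-A", "-t", "-c", sql]

def force_logswitch(cluster, dbname, host="localhost"):
    """Run the command and return whatever psql printed."""
    result = subprocess.run(
        logswitch_command(cluster, dbname, host),
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()

if __name__ == "__main__":
    # Placeholder names; replace with your own cluster and database.
    print(force_logswitch("mycluster", "mydb"))
```

Saved as, say, force_logswitch.py, it could be run from cron every 6
hours with an entry such as "0 */6 * * * python3 force_logswitch.py",
pointed at the node whose sl_log tables are growing.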
-- 
select 'cbbrowne' || '@' || 'linuxdatabases.info';
http://www3.sympatico.ca/cbbrowne/nonrdbms.html
"...make -k all to compile  everything in the core distribution.  This
will take anywhere from 15 minutes  (on a Cray Y-MP) to 12 hours."  
-- X Window System Release Notes