Marc Munro
Wed Aug 17 23:05:34 PDT 2005
Christopher, thanks for this.  I tried -g 80 but it made no noticeable
difference.  I will try the higher settings and also -o 0.

Yes, we do have some pretty massive activity going on.  This is a
simulation of a heavy website hit.  We have a whole bunch of robots
firing traffic at the database.  Unfortunately I had to kill the cluster
as it was giving us too many headaches, and there are other demands on
both my time and the database.  I will definitely be trying again
though, probably in a couple of weeks.

I am still concerned about what the xxid functions are doing.  The fact
that they filter the sl_log_1 query so drastically means that the query
will always do a great deal of work for the relatively few rows it
returns, and reading the code for these functions gave me no clues.  Any
clarification you could provide would be much appreciated.

Thanks again.
__
Marc

On Wed, 2005-08-17 at 17:23 -0400, Christopher Browne wrote:
> Marc Munro wrote:
> 
> >Brad, thank you so much for your response.  I have more information but
> >still no solution.
> >
> >In response to Brad's suggestions:
> >
> >1) pg_stat_activity
> >Interesting.  There is an idle in transaction process from the provider.
> >Shutting down and restarting the database resulted in a new session with
> >the same status.  This appears to have something to do with Slony.
> >  
> >
> That may be fine, as long as the idle-in-transaction process isn't old.
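> 
> A quick way to see how old it is, assuming stats_command_string is
> enabled (the column names vary a bit between PostgreSQL versions, so
> treat this as a sketch):
> 
>     select procpid, usename, query_start, current_query
>       from pg_stat_activity
>      where current_query like '<IDLE> in transaction%';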
> 
> >2) vacuum subscriber's pg_listener
> >No bloat to report.  Only a handful of rows existed.
> >
> >  
> >
> The question is how many rows it *couldn't* purge; how many dead tuples
> were kicking around; e.g., the output of VACUUM VERBOSE ANALYZE on the
> table.
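> 
> Something along these lines, run on the subscriber, is what I had in
> mind; the "Vac" and "UnUsed" figures in the INFO lines are the
> interesting part:
> 
>     vacuum verbose analyze pg_listener;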
> 
> >3) vacuum subscriber's sl_log_1
> >Nothing - there are no rows in the subscriber.
> >
> >4) Vacuuming subscriber
> >Already done.
> >
> >Results of further investigation:
> >I think the problem may simply be that the load on the provider during
> >my initial sync was very high.  This has resulted in sl_log_1 becoming
> >huge.  In my slony logs on the subscriber I often see lines like:
> >
> >2005-08-17 09:08:47 PDT DEBUG2 remoteHelperThread_1_1: 83.952 seconds
> >delay for first row
> >
> >Vacuuming the sl_log_1 table on the provider shows it to be pretty big:
> >
> >INFO:  --Relation _cage2.sl_log_1--
> >INFO:  Pages 714558: Changed 1084, Empty 0; Tup 35986240: Vac 0, Keep 0,
> >UnUsed 7172.
> >	Total CPU 52.50s/9.31u sec elapsed 177.23 sec.
> >INFO:  --Relation pg_toast.pg_toast_2177383887--
> >INFO:  Pages 0: Changed 0, Empty 0; Tup 0: Vac 0, Keep 0, UnUsed 0.
> >	Total CPU 0.00s/0.00u sec elapsed 0.00 sec.
> >INFO:  Analyzing _cage2.sl_log_1
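> >
> >(At the default 8 kB block size, those 714,558 pages work out to
> >roughly 5.5 GB for sl_log_1 alone, holding about 36 million live rows.)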
> >
> >I have tracked down a specific slony query that takes a long time to run
> >on the provider (100+ secs):
> >
> >select log_origin, log_xid, log_tableid, log_actionseq, log_cmdtype,
> >log_cmddata from "_cage2".sl_log_1 where log_origin = 1 and (  (
> > log_tableid in
> >(100,200,300,400,500,600,700,800,900,1000,1100,1200,1300,1400,1500,1600,1700,1800,1900,2000,2100,2200,2300,2400,2500,2600,2700,2800,2900,3000,3100,3200,3300,3400,3500,3600,3700,3800,3900,4000,4100,4200,4300,4400,4500,4600,4700,4800,4900,5000,5100,5200,5300,5400,5500,5600,5700,5800,5900,6000,6100,6200,6300,6400,6500,6600,6700,6800,6900,7000,7100,7200,7300,7400,7500,99999)
> >    and (log_xid < '2106689363' and "_cage2".xxid_lt_snapshot(log_xid,
> >'2106433975:2106689363:''2106433975'',''2106688480'''))
> >    and (log_xid >= '2106433975' and "_cage2".xxid_ge_snapshot(log_xid,
> >'2106433975:2106689266:''2106433975'',''2106688480'''))
> >) ) order by log_actionseq;
> >
> >This returns 662 rows.  If I remove the xxid_ge_snapshot and
> >xxid_lt_snapshot function calls, it returns 2,135,674 rows.  This would
> >seem to be the root of my performance problem.
> >
> >I don't understand what the functions do, but the fact that they prune
> >the result set so drastically means that the query does a lot of work
> >for relatively few rows.
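> >
> >For reference, prefixing the statement with EXPLAIN ANALYZE should show
> >whether the planner does a sequential scan over the whole table and
> >applies the xxid functions only as a filter (log_tableid list
> >abbreviated here for readability):
> >
> >explain analyze
> >select log_origin, log_xid, log_tableid, log_actionseq, log_cmdtype,
> >       log_cmddata
> >  from "_cage2".sl_log_1
> > where log_origin = 1
> >   and log_tableid in (100, 200, 300)  -- full list as in the query above
> >   and log_xid < '2106689363'
> >   and "_cage2".xxid_lt_snapshot(log_xid,
> >       '2106433975:2106689363:''2106433975'',''2106688480''')
> >   and log_xid >= '2106433975'
> >   and "_cage2".xxid_ge_snapshot(log_xid,
> >       '2106433975:2106689266:''2106433975'',''2106688480''')
> > order by log_actionseq;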
> >
> >Can anyone explain what is going on here?
> >  
> >
> You're hitting a "Rod Taylor"-style situation, where you have a pretty
> enormous amount of activity going on.
> 
> Each SYNC is running for a really long time, and there's not much way
> of avoiding that :-(.
> 
> What Rod did under analogous conditions was to group a lot of SYNCs
> together.  Since it'll probably be doing a Seq Scan, you might as well
> try to process most of sl_log_1.
> 
> With the default "-g 6", you'll probably see replication falling behind :-(
> 
> Suggestion: set the -g option as high as you can get away with.  The
> SYNCs will run for quite a long time, but at least they'll find lots of
> data to replicate.  In v 1.0.x, the maximum setting out of the box for
> -g is "-g 100", which tries to process 100 SYNCs at a time, if you're
> running behind.
> 
> If you're running 1.1.x, the maximum value is 10000, and you'd probably
> find it useful to set it pretty high until things catch up.  And set the
> "desired_sync_time" value (the -o option to slon) to 0, so that it
> doesn't do the "adaptive" thing.
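> 
> For example, on the subscriber's slon, if you're on 1.1.x (the cluster
> name comes from the "_cage2" schema above; the conninfo is just a
> placeholder):
> 
>     slon -g 500 -o 0 cage2 "dbname=yourdb host=subscriber user=slony"
> 
> The 500 is only illustrative; 1.0.x caps -g at 100 and 1.1.x at 10000,
> as above, and -o 0 switches the adaptive sync-time logic off.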