Jan Wieck JanWieck
Thu Nov 11 16:56:21 PST 2004
On 11/11/2004 8:53 AM, Rod Taylor wrote:

> On Thu, 2004-11-11 at 08:28 -0500, Jan Wieck wrote:
>> On 11/10/2004 9:28 PM, Rod Taylor wrote:
>> 
>> > On Wed, 2004-11-10 at 09:25 -0500, Rod Taylor wrote:
>> >> The declare'd statement which finds the 2000 or so statements within a
>> >> "snapshot" seems to take a little over a minute to FETCH the first 100
>> >> entries. Statement attached.
>> > 
>> > After bumping the snapshots to 100, it managed to get through the
>> > contents of the 48-hour transaction within about 8 hours -- the 48-hour
>> > transaction was running while Slony was doing the COPY step.
>> > 
>> > I think Slony could do significantly better if it retrieved all SYNCs
>> > which have the same minxid in one shot -- even if this goes far beyond
>> > the maximum sync group size.
>> 
>> The reason for not going beyond the maximum sync group size is to avoid 
>> redoing all the work if anything goes wrong. It is bad enough that 
>> the copy_set needs to be done in one single, humongous transaction.
>> 
>> What you seem to be experiencing here are several of the nice performance 
>> improvements that PostgreSQL received after 7.2 ... just in reverse.
> 
> I disagree. I fixed most of those within Slony. The issue now (common
> to all versions of PostgreSQL and Slony) is that at one point the
> log_xid range being requested by the LOG cursor covered 22 million tuples
> (17 million XIDs), which is trimmed back down by the *snapshot functions.
> 
>     and (log_xid < '1715803287' and _test1_xxid_lt_snapshot(log_xid,
> '1715764209:1715803287:''1715764209'',''1715803088'',''1715785290'''))
>     and (log_xid >= '1698717542' and _test1_xxid_ge_snapshot(log_xid,
> '1698717542:1705890743:''1705890707'',''1705821719'',''1705889897'',''1705890741'',''1705859044'',''1705890086'',''1705890344'',''1705885231'',''1698717542'''))
> 
> And one from about mid-way through looked like:
> 
>     and (log_xid < '1705890743' and _test1_xxid_lt_snapshot(log_xid,
> '1698717542:1705890743:''1705890707'',''1705821719'',''1705889897'',''1705890741'',''1705859044'',''1705890086'',''1705890344'',''1705885231'',''1698717542'''))
>     and (log_xid >= '1698717542' and _test1_xxid_ge_snapshot(log_xid,
> '1698717542:1705866099:''1705866060'',''1705866038'',''1705821719'',''1705859044'',''1705866092'',''1705865763'',''1705866097'',''1705858396'',''1698717542'''))

Now I see what's going on here ... 1698717542 was probably the 
transaction doing the COPY (or some other really long-running thing), 
which leads to the problem that during this initial catchup it really 
has to sift through all existing log rows, since the xid-based index is 
useless.

You are right ... going with the usual group size doesn't do any good 
here. I still think there should be some upper limit to it, but that 
could be way above 100; maybe 10,000 or so.

Thanks for being so persistent :-)

> 
> This repeats hundreds of times. When the log_xid range is large, you're
> going to sift through a ton of data, no matter what version of PostgreSQL
> is used.
> 
> If the 1698717542 transaction had been twice as long, it would have been
> falling behind.

I guess the right strategy for grouping should be

     keep adding to the group until either
     a) the absolute upper limit is exceeded, or
     b) the minxid changes AND the normal group size is exceeded

Does that sound reasonable to everyone?
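
In C, a minimal sketch of that rule could look like the following. This is
only an illustration of the grouping logic; the type and limit names
(SyncEvent, SYNC_GROUP_SIZE, SYNC_GROUP_MAXSIZE, group_syncs) are made up
here and are not actual Slony-I identifiers:

    typedef unsigned int TransactionId;   /* simplified; Slony uses xxid */

    typedef struct SyncEvent
    {
        long long     ev_seqno;   /* SYNC event sequence number */
        TransactionId minxid;     /* lowest xid of the event's snapshot */
    } SyncEvent;

    #define SYNC_GROUP_SIZE    100      /* normal grouping target */
    #define SYNC_GROUP_MAXSIZE 10000    /* absolute upper limit */

    /*
     * Collect SYNC events into one group, starting at events[0].
     * Returns the number of events to process in a single transaction.
     */
    static int
    group_syncs(SyncEvent *events, int nevents)
    {
        int n = 0;

        while (n < nevents)
        {
            /* a) never exceed the absolute upper limit */
            if (n >= SYNC_GROUP_MAXSIZE)
                break;

            /* b) stop when the minxid changes AND the normal size is reached */
            if (n >= SYNC_GROUP_SIZE &&
                events[n].minxid != events[n - 1].minxid)
                break;

            n++;
        }
        return n;
    }

The effect would be that all SYNCs stuck behind the same long-running
transaction (like 1698717542 above) get grouped together up to the cap,
instead of re-scanning the same huge log_xid range 100 events at a time.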


Jan

-- 
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck at Yahoo.com #

