Sat Nov 19 16:07:10 PST 2005
> I think I found the underlying cause. The buffer lines never shrink.
> They grow continuously to fit the largest string that has passed
> through them. We used to see a gradual increase because the occasional
> long record would grow a buffer, and eventually enough large records
> would be processed to exceed the process's virtual memory. The new
> problem is caused by adding a new table with lots of long lines, which
> causes all the buffers to grow large and prevents even a single SYNC
> from being processed.
So it's a very strange sort of memory leak...
> I have a patch which frees the data string, instead of resetting it,
> when it is above a certain size (I chose 32 KB). There is a danger
> that a single fetch could be too big when lots of really huge rows
> come together.
32K sounds reasonable; I'd be keen on seeing that patch.
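Just so I'm picturing the same thing, here's roughly the shape I'd expect
that check to take. The buffer structure and names here are invented for
illustration, not the actual slon buffer code, so the real patch will
doubtless look a bit different:

  #include <stdlib.h>

  #define BUF_SHRINK_THRESHOLD (32 * 1024)    /* the proposed 32 KB limit */

  /* Invented stand-in for the per-line buffer: malloc'ed storage plus
   * bookkeeping for how much is allocated and how much is in use. */
  typedef struct
  {
      char   *data;
      size_t  alloc;
      size_t  used;
  } line_buf;

  /* Called after a row has been processed: either keep the allocation
   * for reuse or, if an oversized row has inflated it, free it so the
   * next row allocates only what it actually needs. */
  static void
  line_buf_recycle(line_buf *buf)
  {
      if (buf->alloc > BUF_SHRINK_THRESHOLD)
      {
          free(buf->data);
          buf->data = NULL;
          buf->alloc = 0;
      }
      buf->used = 0;
  }

That way one monster row only costs its memory once, rather than pinning
it for the rest of the process's life.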
The problem of the "single fetch" is one I have had in mind for a while
now. I have created a would-be solution that *doesn't* work; I think I'll
have to try again.
My "doesn't work" solution is thus...
- When rows are inserted into sl_log_?, we count the width of the data
field (well, it's only approximate, but that's good enough for this
purpose) and store it as sl_length
- The cursor that does the fetches is changed; we leave out the data
field, which means that the return set will remain compact, so it would be
pretty safe to arbitrarily do FETCH 1000 rather than the present FETCH
100.
The processing is then different. Within each FETCH, we loop through the
rows, looking at the length field.
We collect entries into a query set, one by one, until the aggregate size
crosses some threshold; let's say 10MB. Then we pass that query set to a
set-returning function which passes back, in order, the full data for the
set of items requested.
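In rough C terms, the loop I have in mind looks something like the
following. The libpq calls are real, but the cursor leaves out the usual
"which rows belong to this SYNC" criteria for brevity, and
add_to_query_set() / fetch_and_apply_query_set() are invented placeholders
for "remember this actionseq" and "call the set-returning function for the
remembered rows and apply the results":

  #include <stdlib.h>
  #include <libpq-fe.h>

  #define BATCH_SIZE_LIMIT (10 * 1024 * 1024)   /* ~10MB of data per batch */

  /* Invented helpers, not real slon functions. */
  extern void add_to_query_set(const char *log_actionseq);
  extern void fetch_and_apply_query_set(PGconn *conn);

  static void
  apply_sync(PGconn *conn)
  {
      long batch_bytes = 0;

      PQclear(PQexec(conn, "begin;"));

      /* Leaving the data field out of the cursor keeps each FETCH small,
       * so pulling 1000 rows at a time stays safe. */
      PQclear(PQexec(conn,
          "declare LOG cursor for "
          "select log_origin, log_xid, log_actionseq, sl_length "
          "  from sl_log_1 order by log_actionseq;"));

      for (;;)
      {
          PGresult   *res = PQexec(conn, "fetch 1000 from LOG;");
          int         ntuples = PQntuples(res);
          int         i;

          for (i = 0; i < ntuples; i++)
          {
              add_to_query_set(PQgetvalue(res, i, 2));    /* log_actionseq */
              batch_bytes += atol(PQgetvalue(res, i, 3)); /* sl_length */

              if (batch_bytes > BATCH_SIZE_LIMIT)
              {
                  /* Batch is "full": have the set-returning function hand
                   * back the data for just these rows, apply it, and start
                   * a fresh batch. */
                  fetch_and_apply_query_set(conn);
                  batch_bytes = 0;
              }
          }
          PQclear(res);
          if (ntuples == 0)
              break;
      }

      /* Whatever is left over becomes one final, smaller batch. */
      fetch_and_apply_query_set(conn);
      PQclear(PQexec(conn, "close LOG; commit;"));
  }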
Thus, if you have a lot of 11MB rows, those will be processed more or less
one by one. That doesn't strike me as being likely to be spectacularly
inefficient; the extra transaction and query overhead would be hidden
nicely by the huge size of the row.
Unfortunately, the SRF (Set Returning Function) seems prone to do seq
scans on sl_log_1 *for each row!!!* Performance winds up sucking quite
spectacularly :-(.
I have two thoughts on resolving the performance issue:
1. Rather than having the SRF query for the rows individually, it could
simulate the 'outer' query, so it would try to pull all of them at once.
2. We draw the sl_log_? entries for the given SYNC set into a temp table,
where an added sequence field, sl_seqno, starts at 1 and goes up to however
many rows are associated with the SYNC.
Application becomes dead simple; the "outer" query becomes "select
sl_seqno, sl_length from sl_log_temp", and all we need to track, in the
inner loop, is how sl_seqno progresses. (sl_seqno obviously needs an index.)
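To make that concrete, the setup would be shaped something like this;
again just a rough sketch, with the temp table layout and names invented
for illustration, and the SYNC selection criteria on sl_log_1 omitted:

  #include <libpq-fe.h>

  static void
  build_sync_temp_table(PGconn *conn)
  {
      /* One extra copy of the SYNC set's log rows, numbered 1..N as they
       * are inserted in actionseq order. */
      PQclear(PQexec(conn,
          "create temp table sl_log_temp ("
          "    sl_seqno      serial,"
          "    sl_length     int4,"
          "    log_actionseq int8,"
          "    log_cmddata   text"
          ");"));
      PQclear(PQexec(conn,
          "create index sl_log_temp_idx on sl_log_temp (sl_seqno);"));
      PQclear(PQexec(conn,
          /* the usual "which rows belong to this SYNC" conditions would
           * go in this select */
          "insert into sl_log_temp (sl_length, log_actionseq, log_cmddata) "
          "    select sl_length, log_actionseq, log_cmddata "
          "      from sl_log_1 "
          "     order by log_actionseq;"));

      /* The outer query is then just
       *     select sl_seqno, sl_length from sl_log_temp order by sl_seqno;
       * and the inner loop pulls a batch's worth of data with something like
       *     select log_cmddata from sl_log_temp
       *      where sl_seqno between $first and $last
       *      order by sl_seqno;
       */
  }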
Two costs accrue:
- We make an extra copy of all the rows of sl_log_? that are applied
- The temp table lifecycle generates dead tuples in various pg_catalog
tables, so those catalogs will need regular vacuuming
But your change should be orthogonal to any of this stuff...