[Slony1-general] Proposal: using COPY to pull sl_log_? data to subscribers

Mon Jan 28 07:41:41 PST 2008

Simon Riggs <simon at 2ndquadrant.com> writes:
> On Mon, 2008-01-07 at 12:43 -0500, Christopher Browne wrote:
>
>>   1.  Some processing load gets taken off the provider
>> 
>>       The "LOG cursor" query becomes a "COPY [foo] to stdout" query,
>>       and it's worth noting that we lose the need to have an ORDER BY
>>       clause, which eliminates a sort.
>
> If you define the cursors as NO SCROLL it will improve larger sorts.

From Googling around a bit on this, it looks like this is significant
for PG >= 8.2, so it seems as though this would be a good change to
put into HEAD, possibly to be replaced by COPY later...

It seems to be just a 3-line change, as there are 3 queries that
create the LOG cursor (one with just sl_log_1, one with just sl_log_2,
and one combining both).
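
To make the shape of that change concrete, here is a rough sketch in
Python that merely assembles the three DECLARE statements.  The
function name, the column list, and the "_mycluster" namespace are
assumptions for illustration; the real queries in slon carry
additional WHERE clauses (origin, transaction ranges) that are left
out here.

```python
# Hypothetical sketch: simplified versions of the three LOG cursor
# queries, with NO SCROLL added.  Column names are assumed.
LOG_COLUMNS = ("log_origin, log_xid, log_tableid, "
               "log_actionseq, log_cmdtype, log_cmddata")


def log_cursor_sql(namespace, tables):
    """Build a DECLARE for the LOG cursor over one or both log tables.

    namespace -- schema holding the log tables, e.g. '_mycluster' (assumed)
    tables    -- ('sl_log_1',), ('sl_log_2',) or ('sl_log_1', 'sl_log_2')
    """
    selects = [f"SELECT {LOG_COLUMNS} FROM {namespace}.{t}" for t in tables]
    # With both tables, the ORDER BY applies to the whole UNION ALL.
    query = " UNION ALL ".join(selects)
    # NO SCROLL declares that the cursor is never fetched backward.
    return f"DECLARE LOG NO SCROLL CURSOR FOR {query} ORDER BY log_actionseq"


if __name__ == "__main__":
    for tables in (("sl_log_1",), ("sl_log_2",), ("sl_log_1", "sl_log_2")):
        print(log_cursor_sql("_mycluster", tables))
```

The only difference from a plain cursor declaration is the NO SCROLL
keyword pair, which is why the change stays at one line per query.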

>>   3.  We use COPY to load data onto the subscriber
>> 
>>       There are two very large benefits to this, in that:
>>        i) COPY should be *WAY* faster than the INSERT presently used;
>>        ii) We can COPY in specific-sized-buffer chunks, which eliminates
>>            the somewhat-overly baroque code that tries to limit slon
>>            memory usage.
>
> Why not allow a table-specific pair of log tables? These would be
> dedicated just to that table, completely separate from the general
> table. That way a large INSERT-heavy table could isolate its data from
> other tables, allowing a COPY just on that table, though 
>
> That way you aren't changing the underlying mechanisms, which work
> nicely, but allow a different route for large INSERT-heavy tables, which
> are likely the main cause of problems in this area.

The option of using a separate table is somewhat interesting, but it
is, on the one hand, orthogonal to the proposed change, and on the
other, I expect that the replace-DELETE-with-TRUNCATE approach has
already gleaned most of the possible improvements.

The "use COPY to load data" proposal shouldn't have any significant
effect on how the log data is stored; it doesn't (at this point, at
least) change the structure of the log tables.  It does two major
things:

  1.  Replaces the "LOG cursor" with a pair of COPY statements, one
      that pulls log data from the source, and one that drops that
      data, verbatim, onto the subscriber.

      If we decide that "heavily updated table T1" *supremely* needs
      this, due to heavy update patterns on it, I rather think that
      letting the other tables' data come along for the ride will come
      "nearly for free."

  2.  Replaces the logic where sl_log_* data is interpreted (e.g. to
      determine whether the request is an INSERT/DELETE/UPDATE, and on
      what table) inside the slon, with interpreting this on the
      subscriber.

      I don't see a benefit in dividing this process up; if it's a
      "win" to do it on the subscriber, for the heavily updated
      table(s), then should it not also be an improvement for the
      lightly-updated tables?
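
As a rough illustration of those two pieces, here is a Python sketch,
not actual slon code.  All function names are hypothetical, the
target table for the subscriber-side COPY is left as a parameter
because where the data should land is not settled, and the statement
templates assume that log_cmddata holds the tail of the statement.

```python
import io

COLS = ("log_origin, log_xid, log_tableid, "
        "log_actionseq, log_cmdtype, log_cmddata")  # assumed column list


def copy_pair_sql(namespace, log_table, target_table):
    """Return (provider_sql, subscriber_sql) for the pair of COPYs."""
    # COPY (SELECT ...) TO STDOUT requires PG >= 8.2.
    out_sql = f"COPY (SELECT {COLS} FROM {namespace}.{log_table}) TO STDOUT"
    in_sql = f"COPY {target_table} ({COLS}) FROM STDIN"
    return out_sql, in_sql


def relay(source, sink, chunk_size=8192):
    """Pump bytes from source to sink in fixed-size chunks.

    At most chunk_size bytes are held at a time, which is the point of
    copying in specific-sized buffers.
    """
    total = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            break
        sink.write(chunk)
        total += len(chunk)
    return total


# Assumed mapping from log_cmdtype to the statement built around cmddata.
TEMPLATES = {
    "I": "insert into {table} {data};",
    "U": "update only {table} set {data};",
    "D": "delete from only {table} where {data};",
}


def to_statement(table, cmdtype, cmddata):
    """Interpret one log row as SQL, as the subscriber side would."""
    if cmdtype not in TEMPLATES:
        raise ValueError(f"unknown log_cmdtype: {cmdtype!r}")
    return TEMPLATES[cmdtype].format(table=table, data=cmddata)


if __name__ == "__main__":
    src, dst = io.BytesIO(b"row\n" * 5000), io.BytesIO()
    print(relay(src, dst), "bytes relayed")
    print(to_statement('"public"."t1"', "D", "\"id\"='1'"))
```

Note that nothing in relay() or to_statement() cares which table a
row belongs to, which is the sense in which the other tables' data
comes along "nearly for free."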

For both sides of the processing, I don't see a benefit falling out of
dividing replicated data into "low-update rates" versus "high update
rates."  If the "use COPY and evaluate on the subscriber" method is
dominantly better than the present method, then we should migrate to
it.

Indeed, splitting up the process has, in my view, a substantial
detriment, namely that we would have *TWO* replication processing
mechanisms in the code base to test and maintain.  That won't come
for free :-(.
-- 
select 'cbbrowne' || '@' || 'linuxdatabases.info';
http://cbbrowne.com/info/wp.html
Rules  of  the Evil  Overlord  #97.  "My  dungeon  cells  will not  be
furnished with  objects that  contain reflective surfaces  or anything
that can be unravelled." <http://www.eviloverlord.com/>