Christopher Browne cbbrowne at ca.afilias.info
Tue Jul 3 09:33:12 PDT 2007
"Andrew Hammond" <andrew.george.hammond at gmail.com> writes:
> On 6/29/07, Christopher Browne <cbbrowne at mail.libertyrms.com> wrote:
> A really interesting win would be in detecting cases where you can go from
>
> WHERE id IN ( a list )
>
> to
>
> WHERE a < id AND id < b
>
> However I think this is only possible at the time the transaction
> happens (how else will you know if your sequence is contiguous?). And
> that suggests to me that it's not reasonable to do at this time.

That also seems near-nondeterministic: we would be capturing data
based on the state of things on the data source when the transactions
(multiple!) happen, while the effects will be applied based on the
state of things on destination nodes, at a different point in time.

I'll see about doing an experiment on this to see if, for the DELETE
case, it seems to actually help.  It may be that the performance
effects are small to none, so that the added code complication isn't
worthwhile.
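To make the idea concrete, here is the sort of folding such an
experiment would test (table and values invented for illustration;
this is not actual sl_log output):

```sql
-- Per-row DELETEs as currently captured in the log:
DELETE FROM orders WHERE id = 101;
DELETE FROM orders WHERE id = 102;
DELETE FROM orders WHERE id = 103;

-- Folded into one statement:
DELETE FROM orders WHERE id IN (101, 102, 103);

-- And, if we could prove the ids contiguous at capture time,
-- collapsed further into a range:
DELETE FROM orders WHERE 100 < id AND id < 104;
```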

> Also, ISTM that the big reason we don't like statement based
> replication is that SQL has many non-deterministic aspects. However,
> there is probably a pretty darn big subset of SQL which is provably
> deterministic. And for that subset, would it be any less
> rigorous to transmit those statements than to transmit the per-row
> change statements like we currently do?

Well, by capturing the values, we have captured a deterministic form
of the update.
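An invented example of the distinction: the statement itself may be
nondeterministic, while the captured row values are not:

```sql
-- Replaying this statement would evaluate now() afresh on each node:
UPDATE accounts SET updated_at = now() WHERE id = 42;

-- Capturing the values yields a deterministic per-row change:
UPDATE accounts SET updated_at = '2007-07-03 09:33:12' WHERE id = 42;
```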

Jan and I had a chat last week on ideas of how to do "wilder
transformations" (e.g. adding/dropping columns, or replicating
"WHERE FOO IN ('BAR')"); what we arrived at was that, in such cases,
we'd need custom 'logtrigger' functions that would have full access
to OLD.* and NEW.* (i.e. the two sets of columns, old and new), and
would then use them, perhaps with arbitrary complexity, to construct
sl_log_n entries.

The "fully general" logtrigger function would be *way* less efficient
than the present ones; you don't get complex transformations for free.
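As a rough sketch (the function name, trigger details, and the
simplified subset of sl_log_1 columns shown here are only suggestive,
not the actual Slony-I implementation), such a custom logtrigger might
look like:

```sql
CREATE OR REPLACE FUNCTION my_custom_logtrigger() RETURNS trigger AS $$
BEGIN
    -- With full access to OLD.* and NEW.*, arbitrary transformations
    -- are possible, at the price of a per-row PL/pgSQL call.
    IF TG_OP = 'UPDATE' THEN
        -- Hypothetical: log only a column that actually changed.
        IF NEW.val IS DISTINCT FROM OLD.val THEN
            INSERT INTO sl_log_1 (log_tableid, log_cmdtype, log_cmddata)
            VALUES (TG_RELID, 'U',
                    'val=' || quote_literal(NEW.val) ||
                    ' where id=' || quote_literal(OLD.id));
        END IF;
    END IF;
    RETURN NULL;  -- AFTER trigger; return value is ignored
END;
$$ LANGUAGE plpgsql;
```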

>> It would take some parsing of the log_cmddata to do this, nonetheless,
>> I think it ought to be possible to compress this into some smaller
>> number of queries.  Again, if we limited each query to process 100
>> tuples, at most, that would still seem like enough to call it a "win."
>
> I can see two places to find these wins. When the statement is parsed
> (probably very affordable) and, as you mentioned above, by inspecting
> the log tables. I think that we'd have to be pretty clever with the
> log tables to avoid having it get too expensive. I wonder if full text
> indexing with an "sql stemmer" might be a clever way to index that data
> usefully.

I have a *small* regret in this; it would be very nice if data in
sl_log_[n].log_cmddata were split into two portions:

1.  For an INSERT, split between the column name list and the VALUES
    portion;

    You could, in principle, join together a set of VALUES entries for
    the same table as long as the lists of column names match.

2.  For an UPDATE, split between the SET portion and the WHERE
    portion;

    You could, in principle, join together a set of entries which
    have identical SET portions by folding together the WHERE
    clauses.

3.  For DELETE, there's nothing to be split :-).

    It's trivial to fold DELETE requests together as I previously
    showed.
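For cases 1 and 2 above, the folding would look something like this
(hypothetical table; note the multi-row VALUES form requires
PostgreSQL 8.2 or later):

```sql
-- Case 1: INSERTs with matching column lists...
INSERT INTO t (id, name) VALUES (1, 'a');
INSERT INTO t (id, name) VALUES (2, 'b');
-- ...join into one multi-row VALUES statement:
INSERT INTO t (id, name) VALUES (1, 'a'), (2, 'b');

-- Case 2: UPDATEs with identical SET portions...
UPDATE t SET flag = 't' WHERE id = 5;
UPDATE t SET flag = 't' WHERE id = 8;
-- ...fold by merging the WHERE clauses:
UPDATE t SET flag = 't' WHERE id = 5 OR id = 8;
```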

> Two downsides of the parser approach that I can see are
> 1) the postgresql parser / planner is already plenty complex
> 2) it doesn't group stuff across multiple statements

I don't see any possibility of using a parser-based approach; that
jumps us back into statement-based replication, which is susceptible
to nondeterminism problems.

Remember, the thought we started with was:
   "What if we could do something that would make mass operations less
    expensive?"

I don't want to introduce anything that can materially increase
processing costs.

The more intelligent we try to get, the more expensive the
logtrigger() function gets, and if the price is high enough, then we
gain nothing.

The only "win" I see is if we can opportunistically join some
statements together.  If we have to make the log trigger function
universally *WAY* more expensive, well, that's a performance loss :-(.
-- 
let name="cbbrowne" and tld="cbbrowne.com" in String.concat "@" [name;tld];;
http://cbbrowne.com/info/unix.html
Rules of the  Evil Overlord #207. "Employees will  have conjugal visit
trailers which  they may use provided  they call in  a replacement and
sign out on  the timesheet. Given this, anyone caught  making out in a
closet  while  leaving  their   station  unmonitored  will  be  shot."
<http://www.eviloverlord.com/>


More information about the Slony1-general mailing list