Christopher Browne cbbrowne at ca.afilias.info
Thu Apr 24 08:54:12 PDT 2008
Jacques Caron <jc at oxado.com> writes:
> At 20:56 23/04/2008, Christopher Browne wrote:
>>I don't see any reason in the code why limiting the number of events
>>processed should break anything.  I think I'd want to set the limit
>>based on a configuration parameter, but at first blush, the following
>>seems reasonable:
>>
>>Index: remote_listen.c
>>===================================================================
>>RCS file: /home/cvsd/slony1/slony1-engine/src/slon/remote_listen.c,v
>>retrieving revision 1.40
>>diff -c -u -r1.40 remote_listen.c
>>--- remote_listen.c     6 Feb 2008 20:20:50 -0000       1.40
>>+++ remote_listen.c     23 Apr 2008 18:29:14 -0000
>>@@ -697,7 +697,7 @@
>>         {
>>                 slon_appendquery(&query, ")");
>>         }
>>-       slon_appendquery(&query, " order by e.ev_origin, e.ev_seqno");
>>+       slon_appendquery(&query, " order by e.ev_origin, e.ev_seqno limit 2000");
>>
>>         rtcfg_unlock();
>
> I've been running slon with a similar change for a little while
> (finally catching up on a huge backlog) and it seems not to have
> caused any problems. But indeed a configuration parameter may be a
> good idea.

It's not very hard to make it configurable, and that way changing the
limit doesn't force a recompile, which is of decidedly non-zero
value :-).
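
Just to sketch what I have in mind (the option name is made up here,
not something in the tree, and I'm leaning on slon_appendquery's
printf-style formatting):

/* Hypothetical option (placeholder name); it would be registered with
 * the other slon config parameters and default to the 2000 used in the
 * hard-coded patch above. */
extern int remote_listen_fetch_limit;

/* ... then, in the query-building code, in place of the hard-coded LIMIT: */
    if (remote_listen_fetch_limit > 0)
        slon_appendquery(&query,
                         " order by e.ev_origin, e.ev_seqno limit %d",
                         remote_listen_fetch_limit);
    else
        slon_appendquery(&query, " order by e.ev_origin, e.ev_seqno");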

>>- Every time a message is added to the queue in remote_listen.c, we
>>   add to the counter
>>
>>- Every time a message is processed from the queue in remote_worker.c,
>>   we decrement the counter
>>
>>- In remote_listen.c, any time the size of the queue is larger than
>>   "os_event_threshold" (defaults to > the LIMIT used in the query in
>>   remote_listen.c), then we sleep for "os_event_sleep" milliseconds
>>   before processing another iteration of the "event search loop."
>
> Sounds good. There could be a check that keeps the thread in a sleep
> loop until the counter comes below said threshold as well. Or the
> limit in the select could be (threshold - count). Obviously if count >
> threshold we want to skip the select and go back to sleep (and of
> course care must be taken with the poll/listen switches).

I'm not sure we want to stop queueing events altogether for any
extended period of time.  *That* seems like a risky thing to do.
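
Something along these lines is what I had in mind -- a single sleep
per iteration when the backlog is large, rather than a hard stop.
Names here are placeholders for illustration, not existing slon
symbols:

#include <pthread.h>
#include <unistd.h>

/* Placeholder tunables; these would become real config parameters,
 * with the threshold defaulting to the LIMIT used in the query. */
static int os_event_threshold = 2000;
static int os_event_sleep     = 100;    /* milliseconds */

/* Counter shared by the listener and worker threads. */
static pthread_mutex_t os_event_lock = PTHREAD_MUTEX_INITIALIZER;
static int             os_events     = 0;

/* remote_listen.c: call whenever an event is added to the queue. */
static void
os_event_enqueued(void)
{
    pthread_mutex_lock(&os_event_lock);
    os_events++;
    pthread_mutex_unlock(&os_event_lock);
}

/* remote_worker.c: call whenever an event has been processed. */
static void
os_event_processed(void)
{
    pthread_mutex_lock(&os_event_lock);
    os_events--;
    pthread_mutex_unlock(&os_event_lock);
}

/* remote_listen.c: call at the top of each iteration of the event
 * search loop; sleeps once if the backlog exceeds the threshold,
 * then lets the loop go on queueing. */
static void
os_event_throttle(void)
{
    int backlog;

    pthread_mutex_lock(&os_event_lock);
    backlog = os_events;
    pthread_mutex_unlock(&os_event_lock);

    if (backlog > os_event_threshold)
        usleep(os_event_sleep * 1000);
}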

>>Alternatively, this could get more sophisticated, with some extra
>>config parms:
>>
>>  * os_event_limit            - How many events to pull at a time,
>>                                and the threshold for further stuff
>>  * os_event_initialsleep     - If os_events > limit, then,
>>                                initially, sleep this many ms
>>  * os_event_increment        - When os_events continues to be >
>>                                os_event_limit, add this to the
>>                                sleep time
>>  * os_event_maxsleep         - Don't let sleep time exceed this
>
> Don't know if we need that much "complexity"? I think the above logic
> ("only get as much as we need") might be simpler and requires less
> tuning (and fewer default values to pick!).

If the defaults are reasonable, almost nobody will ever need to touch
them :-).

And it seems to me that if we put in *some* delays, that's likely to
be enough to resolve 95% of the problem cases.  I'm reluctant to take
the approach of stopping queueing altogether for an extended period.

- If the problem is that there is a backlog of irrelevant SYNCs
  because some node was out of commission for a while, the "LIMIT N"
  combined with having just about *any* sort of delay in the listener
  loop should enable painless catchup.

- If the problem is that there is a backlog of SYNCs that *will* need
  to be processed, then metering them in, via "LIMIT N" + some delays,
  should prevent the slon from blowing up.  If it *does* blow up, it'll
  restart, *hopefully* after getting some work done.

The alternative solution is to do "strict metering" where we don't
allow the queue to grow past some pre-defined size.  But I'm not sure
what that size should be.
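
For completeness, the incremental variant is only a few lines on top
of the counter sketched earlier; again, the names are placeholders:

/* Placeholder tunables for the incremental-backoff variant. */
static int os_event_limit        = 2000; /* events per fetch / threshold  */
static int os_event_initialsleep = 50;   /* ms, first time over the limit */
static int os_event_increment    = 50;   /* ms added while still over it  */
static int os_event_maxsleep     = 2000; /* ms, ceiling on the delay      */

static int os_event_cursleep     = 0;    /* current delay in ms */

/* Called once per event-search-loop iteration with the current
 * backlog; returns how many ms to sleep before this iteration. */
static int
os_event_delay(int backlog)
{
    if (backlog <= os_event_limit)
        os_event_cursleep = 0;                       /* caught up   */
    else if (os_event_cursleep == 0)
        os_event_cursleep = os_event_initialsleep;   /* first delay */
    else
    {
        os_event_cursleep += os_event_increment;     /* back off    */
        if (os_event_cursleep > os_event_maxsleep)
            os_event_cursleep = os_event_maxsleep;
    }
    return os_event_cursleep;
}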

>> > I also found out that setting desired_sync_time to 0 and increasing
>> > significantly sync_group_maxsize helps a lot when catching up. Is
>> > there a specific reason to have a low default value for this? Since
>> > it's bounded by the number of available events anyway, I'm not sure
>> > how low values actually help anything -- at least when
>> > desired_sync_time=0.
>>
>>There is a reason, if you're running log shipping; you might want to
>>be sure that each SYNC is kept separate, so that you could most
>>closely associate the set of data with its SYNC time.
>
> So it should be either 1 or a relatively large value, then? Not sure
> the current default makes much sense?

That seems right.  Maybe we should increase the default in 2.0...

> But I haven't had much experience with this other than watching slon
> trying to cope with this backlog :-( Not sure how things work out in
> a "normal" situation.
>
> Also, something stupid: the code has:
>
> max_sync = ((last_sync_group_size * 200) / 100) + 1;
>
> I'm quite sure this can be simplified as:
>
> max_sync = last_sync_group_size * 2 + 1;
>
> without changing the result (actually the compiler probably does it
> already). Either that or the intended result got lost somewhere :-)

Hmm.  At some point, some of the calculations used larger
numerator/denominator to avoid accidentally truncating an integer to
0.  That was evidently not one of the places where that was necessary.
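
To spell out the arithmetic (a throwaway check, not code from the
tree):

#include <assert.h>

int
main(void)
{
    int n;

    /* With a 200% factor the two forms are always identical. */
    for (n = 0; n < 1000; n++)
        assert((n * 200) / 100 + 1 == n * 2 + 1);

    /* Where multiply-before-divide does matter: a factor that isn't a
     * whole multiple, applied to a small value. */
    assert((3 * 110) / 100 == 3);   /* multiply first: stays at 3   */
    assert((3 / 100) * 110 == 0);   /* divide first: truncates to 0 */

    return 0;
}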

>>For sure, the "desired_sync_time" is only an approximation.  It has
>>the implicit assumption that the run time for a set of SYNCs is
>>roughly proportional to the number of SYNCs, which isn't always true.
>>
>>Any policy that is applied here is necessarily an approximation, so I
>>don't know that it is too likely to see *huge* improvements from a
>>substitute policy.  If you can describe another that is readily coded,
>>I'll certainly listen :-).
>
> I don't have any good suggestions for that yet, I've been doing the
> catching up with desired_sync_time = 0 so that code is effectively
> disabled :-) One possible option, though, would be to split the "time
> for first fetch" from the rest, and maybe consider that as a
> "constant" in the computation. I.e. consider that:
>
> estimated_time_it_takes_to_handle_n_syncs = time_for_first_fetch
>     + n * (time_for_remainder_of_last_run / number_of_syncs_in_last_run)
>
> rather than n * time_of_last_run / number_of_syncs_in_last_run as I
> believe it is right now? I'm wild-guessing here :-)
>
> Jacques.

Ah, you could be right there.  Yes, it may be that the "time for first
fetch" is nearly constant, and so should be taken out of the estimate.

Mind you, we may be "gilding buggy whips" here, trying to improve an
estimate that is fundamentally flawed: there is no real reason to
expect two SYNCs to be equally expensive when system load varies.
-- 
"cbbrowne","@","linuxfinances.info"
http://cbbrowne.com/info/linuxxian.html
If a hole in the street is a manhole, is a hole in a man a streethole?

