Christopher Browne cbbrowne at ca.afilias.info
Wed Apr 23 11:56:13 PDT 2008
Jacques Caron <jc at oxado.com> writes:
> Hi all,
>
> There is a bit of an issue in the way slon gets events from the master
> when there is a (very) large backlog: the remote listener thread tries
> to read all the available events into memory, which means:
> - the process can grow quite a lot, eat useful cache memory from the
> DB (if slon is running on the same instance as the DB which is the
> usual case), eventually start swapping, and it worsens the backlog
> - and/or the process can actually exceed memory limits, exit and
> restart, fetching events again
> - the initial load of all available events itself may never complete
> (if it exceeds memory limits), and thus no replication happens since
> it won't start working until this initial load is complete
>
> One first and easy fix for the last problem is to add a simple "LIMIT
> x" in remoteListen_receive_events. This will at least allow slon to
> start handling events while more are loaded. In situations where
> events can still be read a lot faster than they are handled (which is
> usually the case), forcing a sleep in the loop helps, but I'm not sure
> how this could be made to work in the general case.

That sounds pretty plausible.

I don't see any reason in the code why limiting the number of events
processed should break anything.  I think I'd want to set the limit
based on a configuration parameter, but at first blush, the following
seems reasonable:

Index: remote_listen.c
===================================================================
RCS file: /home/cvsd/slony1/slony1-engine/src/slon/remote_listen.c,v
retrieving revision 1.40
diff -c -u -r1.40 remote_listen.c
--- remote_listen.c	6 Feb 2008 20:20:50 -0000	1.40
+++ remote_listen.c	23 Apr 2008 18:29:14 -0000
@@ -697,7 +697,7 @@
 	{
 		slon_appendquery(&query, ")");
 	}
-	slon_appendquery(&query, " order by e.ev_origin, e.ev_seqno");
+	slon_appendquery(&query, " order by e.ev_origin, e.ev_seqno limit 2000");
 
 	rtcfg_unlock();
 


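Rather than hardcoding 2000, the clause could be built from the config
parameter I mentioned.  The sketch below is hypothetical (the variable name
remote_listen_fetch_limit is mine, and snprintf() stands in for
slon_appendquery(), which takes a printf-style format), but it shows the
shape of it:

```c
#include <stdio.h>

/* Hypothetical config knob; in slon it would be registered in
 * confoptions.c with a default of 2000. */
static int remote_listen_fetch_limit = 2000;

/* Build the tail of the event query with the configurable limit.
 * Stands in for: slon_appendquery(&query,
 *     " order by e.ev_origin, e.ev_seqno limit %d", ...). */
static void append_limit_clause(char *buf, size_t len)
{
	snprintf(buf, len,
			 " order by e.ev_origin, e.ev_seqno limit %d",
			 remote_listen_fetch_limit);
}
```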
> A further and better fix would be to also add a count of "outstanding"
> events (that would be incremented when new events are loaded and
> decremented once they have been handled), and to have the listener
> thread sleep a bit when that count exceeds a given threshold. No need
> to have tens of millions of events in memory (with the possible
> complications given above) if we handle at most a few thousand at a
> time...

That's not a bad thought...

This would *definitely* point at adding a config parameter or two...

Yes, indeed, we could maintain a counter on the queue (or perhaps
across queues?), so that we track the number of outstanding
messages:

- Every time a message is added to the queue in remote_listen.c, we
  add to the counter

- Every time a message is processed from the queue in remote_worker.c,
  we decrement the counter

- In remote_listen.c, any time the size of the queue is larger than
  "os_event_threshold" (defaults to > the LIMIT used in the query in
  remote_listen.c), then we sleep for "os_event_sleep" milliseconds
  before processing another iteration of the "event search loop."

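The two counter updates could look something like this; the names are made
up, and the real patch would wire these calls into remote_listen.c and
remote_worker.c:

```c
#include <pthread.h>

/* Hypothetical shared state; the listener and worker threads both
 * touch it, so it needs a mutex. */
static pthread_mutex_t os_event_lock = PTHREAD_MUTEX_INITIALIZER;
static long os_events = 0;		/* events fetched but not yet applied */

/* Called by the remote listener after queueing a batch of events. */
static void os_events_add(long n)
{
	pthread_mutex_lock(&os_event_lock);
	os_events += n;
	pthread_mutex_unlock(&os_event_lock);
}

/* Called by the remote worker after applying one event. */
static void os_events_done(void)
{
	pthread_mutex_lock(&os_event_lock);
	os_events -= 1;
	pthread_mutex_unlock(&os_event_lock);
}

/* Listener-side check: nonzero if the backlog exceeds the threshold
 * and the listener should sleep before fetching again. */
static int os_events_over_threshold(long os_event_threshold)
{
	long		n;

	pthread_mutex_lock(&os_event_lock);
	n = os_events;
	pthread_mutex_unlock(&os_event_lock);
	return n > os_event_threshold;
}
```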
Alternatively, this could get more sophisticated, with some extra
config parms:

 * os_event_limit            - How many events to pull at a time,
                               and the threshold for further stuff
 * os_event_initialsleep     - If os_events > limit, then,
                               initially, sleep this many ms
 * os_event_increment        - When os_events continues to be > 
                               os_event_limit, add this to the
                               sleep time
 * os_event_maxsleep         - Don't let sleep time exceed this

With defaults...
  os_event_limit = 2000
  os_event_initialsleep = 2000
  os_event_increment = 500    # add 0.5s each time
  os_event_maxsleep = 15000   

Any time the queue shrinks below os_event_limit, then we reset the
sleep time back to os_event_initialsleep.
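Putting those four parameters together, the sleep policy reduces to a small
pure function.  This is just a sketch of the rules above (grow by
os_event_increment while over the limit, cap at os_event_maxsleep, reset
when the queue drains); the function name is mine:

```c
/* Defaults for the proposed config parameters. */
#define OS_EVENT_LIMIT			2000
#define OS_EVENT_INITIALSLEEP	2000	/* ms */
#define OS_EVENT_INCREMENT		500		/* ms */
#define OS_EVENT_MAXSLEEP		15000	/* ms */

/* Given the current queue depth and the previous sleep time, return
 * how long the listener should sleep (0 = don't sleep).  The sleep
 * time ramps up while the backlog stays over the limit and resets
 * once the worker catches up. */
static int next_sleep_ms(long queue_depth, int prev_sleep_ms)
{
	if (queue_depth <= OS_EVENT_LIMIT)
		return 0;					/* caught up: reset the backoff */
	if (prev_sleep_ms == 0)
		return OS_EVENT_INITIALSLEEP;	/* first time over the limit */
	if (prev_sleep_ms + OS_EVENT_INCREMENT > OS_EVENT_MAXSLEEP)
		return OS_EVENT_MAXSLEEP;
	return prev_sleep_ms + OS_EVENT_INCREMENT;
}
```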

> I also found out that setting desired_sync_time to 0 and increasing
> significantly sync_group_maxsize helps a lot when catching up. Is
> there a specific reason to have a low default value for this? Since
> it's bounded by the number of available events anyway, I'm not sure
> how low values actually help anything -- at least when
> desired_sync_time=0.

There is a reason, if you're running log shipping; you might want to
be sure that each SYNC is kept separate, so that you could most
closely associate the set of data with its SYNC time.

> Finally, when in some situations fetching from the log is slow (that
> can happen when trying to fetch log items that happened during a
> long transaction, as the bounds for the index search are quite
> large), I am not sure that the logic behind desired_sync_time and
> such works very well: here the time it takes is not proportional to
> the number of events (the time per event actually decreases when the
> number of events handled at once increases, as most of the time is
> spent -wasted?- in the initial fetch).
>
> Obviously if there was a way to build an index that better matches
> the fetches it would help, but I'm not quite sure that is possible
> (I haven't quite figured the whole minxid/maxxid/xip etc. thing
> yet).
>
> Comments?

There are some improvements in 2.0 to the query on the log table,
notably for the special case where you have a really long running
transaction.

For sure, the "desired_sync_time" is only an approximation.  It has
the implicit assumption that the run time for a set of SYNCs is
roughly proportional to the number of SYNCs, which isn't always true.

Any policy applied here is necessarily an approximation, so I doubt
a substitute policy would yield *huge* improvements.  If you can
describe one that is readily coded, I'll certainly listen :-).
-- 
let name="cbbrowne" and tld="linuxdatabases.info" in String.concat "@" [name;tld];;
http://linuxfinances.info/info/x.html
"When campaigning, be swift as  the wind; in leisurely march, majestic
as the forest; in raiding and plundering, like fire; in standing, firm
as  the  mountains.   As  unfathomable  as the  clouds,  move  like  a
thunderbolt."  -- Sun Tzu, "The Art of War"

