Wed Apr 23 09:42:43 PDT 2008
Hi all,

There is a bit of an issue in the way slon gets events from the master when there is a (very) large backlog: the remote listener thread tries to read all the available events into memory, which means:

- the process can grow quite a lot, eat into cache memory that would be useful to the DB (if slon is running on the same machine as the DB, which is the usual case), eventually start swapping, and so make the backlog worse;

- and/or the process can actually exceed memory limits, exit and restart, and then fetch all the events again;

- the initial load of all available events may itself never complete (if it exceeds memory limits), so no replication happens at all, since slon won't start working until this initial load is complete.

One first and easy fix for the last problem is to add a simple "LIMIT x" to the query in remoteListen_receive_events. This at least allows slon to start handling events while more are being loaded. In situations where events can still be read much faster than they are handled (which is usually the case), forcing a sleep in the loop helps, but I'm not sure how this could be made to work in the general case.

A further and better fix would be to also keep a count of "outstanding" events (incremented when new events are loaded, decremented once they have been handled), and to have the listener thread sleep a bit whenever that count exceeds a given threshold (see the P.S. below for a rough sketch). There is no need to hold tens of millions of events in memory (with the possible complications given above) if we handle at most a few thousand at a time...

I also found that setting desired_sync_time to 0 and significantly increasing sync_group_maxsize helps a lot when catching up (example settings in the second P.S.). Is there a specific reason to have a low default value for sync_group_maxsize? Since the group size is bounded by the number of available events anyway, I'm not sure how low values actually help anything -- at least when desired_sync_time=0.

Finally, in some situations fetching from the log is slow (this can happen when fetching log rows generated during a long transaction, as the bounds for the index search are then quite large), and I am not sure that the logic behind desired_sync_time and such works very well there: the time it takes is not proportional to the number of events (the time per event actually decreases as the number of events handled at once increases, since most of the time is spent (wasted?) in the initial fetch). Obviously, if there were a way to build an index that better matched these fetches it would help, but I'm not sure that is possible (I haven't quite figured out the whole minxid/maxxid/xip etc. thing yet).

Comments?

Jacques.
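P.S. To make the "outstanding events" idea a bit more concrete, here is a rough pseudo-C sketch of the throttling I have in mind. All names here (fetch_events, events_outstanding, the constants) are made up for illustration and are not the actual slon symbols; in the real code the fetch happens in remoteListen_receive_events and the counter would have to be shared properly with the worker thread.

/*
 * Pseudo-C sketch of a throttled remote listener loop.
 * Illustration only: names and values are invented.
 */
#include <unistd.h>

#define FETCH_LIMIT         1000    /* the "LIMIT x" on the event query   */
#define MAX_OUTSTANDING    10000    /* stop fetching above this backlog   */
#define BACKOFF_USEC      500000    /* how long to sleep when backing off */

/* Shared counter: incremented when events are queued, decremented by
 * the worker thread once an event has been applied.  (Would need a
 * mutex or an atomic in the real code.) */
static volatile long events_outstanding = 0;

static int
fetch_events(int limit)
{
    /* Stand-in for the real fetch: the real code would run the event
     * query with "LIMIT limit", queue the rows for the worker and do
     * events_outstanding += number_of_rows_fetched. */
    (void) limit;
    return 0;
}

static void
listener_loop(void)
{
    for (;;)
    {
        /* Don't pile up millions of events in memory: if the worker
         * is far behind, back off instead of fetching more. */
        while (events_outstanding > MAX_OUTSTANDING)
            usleep(BACKOFF_USEC);

        /* Bounded fetch: even with a huge backlog the first batch is
         * small, so replication can start right away instead of
         * waiting for the whole backlog to be read. */
        if (fetch_events(FETCH_LIMIT) == 0)
            usleep(BACKOFF_USEC);   /* nothing new yet, wait a bit */
    }
}

int
main(void)
{
    listener_loop();                /* never returns in this sketch */
    return 0;
}

The point is simply that the listener never holds more than roughly FETCH_LIMIT + MAX_OUTSTANDING events at a time, instead of the whole backlog.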
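P.P.S. For reference, the catch-up tuning mentioned above is just two settings in the slon runtime config file. The values below are only examples of the kind of thing I used, not recommendations:

# example values only -- tune to the actual backlog
# desired_sync_time=0 disables the adaptive group sizing;
# sync_group_maxsize then caps how many SYNC events get grouped
# into a single apply transaction
desired_sync_time=0
sync_group_maxsize=100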