[Slony1-general] Slony Falling Behind

Tue Sep 27 17:06:20 PDT 2005

Hi all,
	I am having trouble getting Slony to catch up after a
weekend of heavy updates. I've tried adjusting the -s and -g settings
for Slony, but that hasn't had any visible effect on the situation.
I'm not sure how to proceed from here, so here's where I say "uncle"
and ask for help from those wiser than I :-)  I'm not sure what info you
need, so I'll just have at it (if I leave anything out, let me know and
I'll get it for you):

setup: replicate 3 databases from node 1 to node 2
	Each database is replicated using a separate slony config file (using
the slon-tools)

	node 1: provider
		PG 7.4.7
		Slony 1.1.0

	node 2: subscriber
		PG 8.0.3
		Slony 1.1.0

Problem:
	node 2 is ~3 days behind node 1 after a series of heavy updates over
the weekend, and it is falling further behind.

Symptoms:
	The FETCH statement on the provider is eating 100% of one of the CPUs
on the server, without touching the disks at all.  The subscriber spends
99% of its time idle in transaction (waiting for the FETCH to finish?)

Details:

from the node2 slony log:

2005-09-27 10:55:28 CDT DEBUG2 remoteListenThread_1: queue event 1,1691509 SYNC
2005-09-27 10:55:28 CDT DEBUG2 remoteHelperThread_1_1: 120.122 seconds delay for first row
2005-09-27 10:55:28 CDT DEBUG2 remoteHelperThread_1_1: 120.122 seconds until close cursor
2005-09-27 10:55:28 CDT DEBUG2 remoteWorkerThread_1: new sl_rowid_seq value: 1000000000000000
2005-09-27 10:55:28 CDT DEBUG2 remoteWorkerThread_1: SYNC 1610537 done in 120.170 seconds
2005-09-27 10:55:28 CDT DEBUG2 remoteWorkerThread_1: Received event 1,1610538 SYNC
2005-09-27 10:55:28 CDT DEBUG2 remoteWorkerThread_1: SYNC 1610538 processing
2005-09-27 10:55:28 CDT DEBUG2 remoteWorkerThread_1: syncing set 1 with 114 table(s) from provider 1
2005-09-27 10:55:30 CDT DEBUG2 syncThread: new sl_action_seq 1 - SYNC 635243
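
To watch how long each SYNC takes over time, the "done in" entries can be pulled out of the slon log. This is only a minimal sketch: the function name sync_durations is made up for illustration, and it assumes the DEBUG2 message format shown in the excerpt above. It collapses whitespace first, so it works whether or not the entries have been wrapped across lines.

```python
import re

# Matches slon DEBUG2 entries such as "SYNC 1610537 done in 120.170 seconds".
# The pattern is based only on the log excerpt above (an assumption about format).
SYNC_DONE = re.compile(r"SYNC (\d+) done in ([\d.]+) seconds")


def sync_durations(log_text):
    """Return (sync_number, seconds) pairs for every completed SYNC in the text."""
    # Collapse all whitespace so entries wrapped across lines still match.
    flat = " ".join(log_text.split())
    return [(int(num), float(secs)) for num, secs in SYNC_DONE.findall(flat)]


sample = """2005-09-27 10:55:28 CDT DEBUG2 remoteWorkerThread_1: SYNC 1610537 done
in 120.170 seconds"""

print(sync_durations(sample))  # [(1610537, 120.17)]
```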

from sl_status:
select * from _pl_replication.sl_status ;
st_origin                 | 1
st_received               | 2
st_last_event             | 1691576
st_last_event_ts          | 2005-09-27 10:58:05.907383
st_last_received          | 1610538
st_last_received_ts       | 2005-09-27 10:55:44.046123
st_last_received_event_ts | 2005-09-24 14:02:55.407712
st_lag_num_events         | 81038
st_lag_time               | 2 days 20:55:12.07423
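
To put those numbers in perspective, here is a rough back-of-the-envelope calculation. It is a sketch under loud assumptions: that SYNCs are applied one at a time, that each takes about as long as SYNC 1610537 did in the log above, and that no new events arrive in the meantime. Grouping SYNCs (the -g setting) would change the picture, so treat the result as a worst-case illustration rather than a prediction.

```python
# Values copied from the sl_status output and the slon log above.
st_last_event = 1691576      # latest event generated on the origin
st_last_received = 1610538   # latest event received by the subscriber
seconds_per_sync = 120.170   # duration of SYNC 1610537 (assumed typical)

# Number of events the subscriber is behind; matches st_lag_num_events.
lag_events = st_last_event - st_last_received

# Naive estimate: one SYNC at a time, no grouping, no new events arriving.
naive_catch_up_days = lag_events * seconds_per_sync / 86400

print(lag_events)  # 81038
print(f"{naive_catch_up_days:.1f} days")
```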

from sl_log_1:
 select count(*) from _pl_replication.sl_log_1;
  count
----------
 14427361

Any ideas or suggestions on how to reverse this process so it will start
catching up again?