Christopher Browne cbbrowne
Tue Sep 27 20:08:19 PDT 2005
Joe Markwardt wrote:

>Hi all,
>	I am having some problems trying to get Slony to catch up after a
>weekend of heavy updates. I've tried playing with the -s and -g settings
>for slon, but that hasn't seemed to have any effect on the situation.
>I'm not sure where to proceed from here, so here's where I say "uncle"
>and ask for help from those wiser than I :-)  I'm not sure what info you
>need, so I'll just have at it (if I leave anything out, let me know and
>I'll get it for you):
>
>setup: replicate 3 databases from node 1 to node 2
>	Each database is replicated using a separate slony config file (using
>the slon-tools)
>
>	node 1: provider
>		PG 7.4.7
>		Slony 1.1.0
>
>	node 2: subscriber
>		PG 8.0.3
>		Slony 1.1.0
>
>Problem:
>	node 2 is ~ 3 days behind node 1 after a series of heavy updates on the
>weekend and falling further behind.
>
>Symptoms:
>	the FETCH statement on the provider is eating 100% of one of the CPUs
>on the server, while not touching the disks at all.  The subscriber spends
>99% of its time "idle in transaction" (waiting for the fetch to finish?)
>
>Details:
>
>from the node2 slony log:
>
>2005-09-27 10:55:28 CDT DEBUG2 remoteListenThread_1: queue event 1,1691509 SYNC
>2005-09-27 10:55:28 CDT DEBUG2 remoteHelperThread_1_1: 120.122 seconds delay for first row
>2005-09-27 10:55:28 CDT DEBUG2 remoteHelperThread_1_1: 120.122 seconds until close cursor
>2005-09-27 10:55:28 CDT DEBUG2 remoteWorkerThread_1: new sl_rowid_seq value: 1000000000000000
>2005-09-27 10:55:28 CDT DEBUG2 remoteWorkerThread_1: SYNC 1610537 done in 120.170 seconds
>2005-09-27 10:55:28 CDT DEBUG2 remoteWorkerThread_1: Received event 1,1610538 SYNC
>2005-09-27 10:55:28 CDT DEBUG2 remoteWorkerThread_1: SYNC 1610538 processing
>2005-09-27 10:55:28 CDT DEBUG2 remoteWorkerThread_1: syncing set 1 with 114 table(s) from provider 1
>2005-09-27 10:55:30 CDT DEBUG2 syncThread: new sl_action_seq 1 - SYNC 635243
>
>from sl_status:
>select * from _pl_replication.sl_status ;
> st_origin                 | 1
> st_received               | 2
> st_last_event             | 1691576
> st_last_event_ts          | 2005-09-27 10:58:05.907383
> st_last_received          | 1610538
> st_last_received_ts       | 2005-09-27 10:55:44.046123
> st_last_received_event_ts | 2005-09-24 14:02:55.407712
> st_lag_num_events         | 81038
> st_lag_time               | 2 days 20:55:12.07423
>
>from sl_log_1:
> select count(*) from _pl_replication.sl_log_1;
>  count
>----------
> 14427361
>
>Any ideas or suggestions on how to reverse this process so it will start
>catching up again?
>

We have just been able to point to a Probably Really Good answer to
this; the fix just got checked into CVS HEAD and 1_1_STABLE...

Let me suggest that you create an extra index:

create index sl_log_1_idx2
    on _pl_replication.sl_log_1 (log_xid _pl_replication.xxid_ops);

This needs to be created on any provider node; it won't be particularly
valuable on nodes that only subscribe.
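
If you want to double-check afterwards that the index really exists on the
provider (the _pl_replication schema name here is simply taken from your
cluster), a quick look at pg_indexes will show it:

    select indexname, indexdef
      from pg_indexes
     where schemaname = '_pl_replication'
       and tablename  = 'sl_log_1';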

It should MASSIVELY increase the speed of the queries; you will more
than likely see the "120.122 seconds delay for first row" message
diminish to just a few seconds.
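
Once the index is in place, you can watch the backlog drain using the same
sl_status view you quoted; if st_lag_num_events and st_lag_time shrink
between runs, you're catching up:

    select st_lag_num_events, st_lag_time
      from _pl_replication.sl_status;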

In the short term, you might also want to consider two other things:

 a) Set the "-s" option somewhat higher until things catch up; the default
is 10000 ms, so "-s60000" would generate one SYNC per minute, which might
be useful...

 b) It looks like you're processing just one SYNC at a time; make sure
the "-g" option is set to some value; we often run with "-g40" to group
40 SYNCs together at a time (see the example command line below)...
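
Since you're launching slon through the slon-tools wrapper, where exactly
to set these depends on your configuration, but on a bare slon command
line the two options combined would look roughly like this (the conninfo
below is made up for illustration; the cluster name is inferred from your
_pl_replication schema):

    slon -s 60000 -g 40 pl_replication "dbname=mydb host=node2 user=postgres"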

