Jan Wieck JanWieck
Tue Sep 27 22:30:46 PDT 2005
On 9/27/2005 4:16 PM, Joe Markwardt wrote:

> On Tue, 2005-09-27 at 15:38 -0400, Christopher Browne wrote:
>> Jan Wieck wrote:
>> 
>> > On 9/27/2005 3:08 PM, Christopher Browne wrote:
>> >
>> >> Joe Markwardt wrote:
>> >>
>> >>> Hi all,
>> >>>     I am having some problems trying to get slony to catch up after
>> >>> a weekend of heavy updates. I've tried playing with the -s and -g
>> >>> settings for slony, but that hasn't seemed to have any effect on the
>> >>> situation. I'm not sure where to proceed from here, so here's where I
>> >>> say "uncle" and ask for help from those wiser than I :-)  I'm not
>> >>> sure what info you need, so I'll just have at it (if I leave anything
>> >>> out, let me know and I'll get it for you):
>> >>>
>> >>> setup: replicate 3 databases from node 1 to node 2
>> >>>     Each database is replicated using a separate slony config file
>> >>> (using the slon-tools)
>> >>>
>> >>>     node 1: provider
>> >>>         PG 7.4.7
>> >>>         Slony 1.1.0
>> >>>
>> >>>     node 2: subscriber
>> >>>         PG 8.0.3
>> >>>         Slony 1.1.0
>> >>>
>> >>> Problem:
>> >>>     node 2 is ~3 days behind node 1 after a series of heavy updates
>> >>> on the weekend, and it is falling further behind.
>> >>>
>> >>> Symptoms:
>> >>>     the FETCH statement on the provider is eating 100% of one of the
>> >>> CPUs on the server, not touching the disks at all.  The subscriber
>> >>> spends 99% of its time "idle in transaction" (waiting for the FETCH
>> >>> to finish?)
>> >>>
>> >>> Details:
>> >>>
>> >>> from the node2 slony log:
>> >>>
>> >>> 2005-09-27 10:55:28 CDT DEBUG2 remoteListenThread_1: queue event 1,1691509 SYNC
>> >>> 2005-09-27 10:55:28 CDT DEBUG2 remoteHelperThread_1_1: 120.122 seconds delay for first row
>> >>> 2005-09-27 10:55:28 CDT DEBUG2 remoteHelperThread_1_1: 120.122 seconds until close cursor
>> >>> 2005-09-27 10:55:28 CDT DEBUG2 remoteWorkerThread_1: new sl_rowid_seq value: 1000000000000000
>> >>> 2005-09-27 10:55:28 CDT DEBUG2 remoteWorkerThread_1: SYNC 1610537 done in 120.170 seconds
>> >>> 2005-09-27 10:55:28 CDT DEBUG2 remoteWorkerThread_1: Received event 1,1610538 SYNC
>> >>> 2005-09-27 10:55:28 CDT DEBUG2 remoteWorkerThread_1: SYNC 1610538 processing
>> >>> 2005-09-27 10:55:28 CDT DEBUG2 remoteWorkerThread_1: syncing set 1 with 114 table(s) from provider 1
>> >>> 2005-09-27 10:55:30 CDT DEBUG2 syncThread: new sl_action_seq 1 - SYNC 635243
>> >>>
>> >>> from sl_status:
>> >>> select * from _pl_replication.sl_status ;
>> >>>  st_origin                 | 1
>> >>>  st_received               | 2
>> >>>  st_last_event             | 1691576
>> >>>  st_last_event_ts          | 2005-09-27 10:58:05.907383
>> >>>  st_last_received          | 1610538
>> >>>  st_last_received_ts       | 2005-09-27 10:55:44.046123
>> >>>  st_last_received_event_ts | 2005-09-24 14:02:55.407712
>> >>>  st_lag_num_events         | 81038
>> >>>  st_lag_time               | 2 days 20:55:12.07423
>> >>>
>> >>> from sl_log_1:
>> >>> select count(*) from _pl_replication.sl_log_1;
>> >>>  count
>> >>> ----------
>> >>> 14427361
>> >>>
>> >>> Any ideas or suggestions on how to reverse this process so it will
>> >>> start
>> >>> catching up again?
>> >>>
>> >>>  
>> >>>
>> >>
>> >> We have just been able to successfully point at a Probably Really
>> >> Good answer to this; it just got checked into CVS HEAD and
>> >> 1_1_STABLE...
>> >>
>> >> Let me suggest that you create an extra index:
>> >>
>> >> create index sl_log_1_idx2 on _pl_replication.sl_log_1(log_xid
>> >> _pl_replication.xxid_ops);
>> >
>> >
>> > Is that an optimization problem introduced by using BETWEEN instead
>> > of the excessive OR chains?
>> 
>> No, the BETWEEN patch is not in CVS yet, so nobody would be using it
>> by accident :-).
>> 
>> The extra index has proven useful with both version 1.0.5 and 1.1.0.
>> 
> Unfortunately it doesn't appear that it helped me out any :-/
> 
> pl=# \d _pl_replication.sl_log_1
>          Table "_pl_replication.sl_log_1"
>     Column     |         Type         | Modifiers
> ---------------+----------------------+-----------
>  log_origin    | integer              |
>  log_xid       | _pl_replication.xxid |
>  log_tableid   | integer              |
>  log_actionseq | bigint               |
>  log_cmdtype   | character(1)         |
>  log_cmddata   | text                 |
> Indexes:
>     "sl_log_1_idx1" btree (log_origin, log_xid _pl_replication.xxid_ops, log_actionseq)
>     "sl_log_1_idx2" btree (log_xid _pl_replication.xxid_ops)
> 
> 
> 
> 2005-09-27 14:39:06 CDT DEBUG2 remoteHelperThread_1_1: 119.242 seconds delay for first row
> 2005-09-27 14:39:06 CDT DEBUG2 remoteHelperThread_1_1: 119.243 seconds until close cursor
> 2005-09-27 14:39:06 CDT DEBUG2 remoteWorkerThread_1: new sl_rowid_seq value: 1000000000000000
> 2005-09-27 14:39:06 CDT DEBUG2 remoteWorkerThread_1: SYNC 1610644 done in 119.289 seconds
> 
> I also changed the -s and -g parameters, and while slon appears to be
> generating syncs less often, it doesn't really look to be catching up
> any faster.
> 
> postgres     12839  0.0  0.6 288840 51708 pts/3  S    14:48   0:00 /usr/local/pgsql/bin//slon -s 60000 -d2 -g80 pl_replication host=192.168.1.200 dbname=pl
> 
> 
> 
> Any other changes you can suggest?

Did you give it an ANALYZE after creating the index?
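For example (using the _pl_replication schema from your \d output above;
the EXPLAIN is only an illustrative way to check that the new index can
be used -- the xid value in it is made up):

    ANALYZE _pl_replication.sl_log_1;

    EXPLAIN SELECT count(*)
      FROM _pl_replication.sl_log_1
     WHERE log_xid >= '1000000'::_pl_replication.xxid;

After a weekend of heavy updates the planner's statistics for sl_log_1
may be badly out of date, so it can keep choosing a sequential scan over
the ~14 million log rows even though the index exists.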


Jan


> 
> _______________________________________________
> Slony1-general mailing list
> Slony1-general at gborg.postgresql.org
> http://gborg.postgresql.org/mailman/listinfo/slony1-general


-- 
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck at Yahoo.com #

