Joe Markwardt joe
Tue Sep 27 22:37:42 PDT 2005
On Tue, 2005-09-27 at 17:30 -0400, Jan Wieck wrote:
> On 9/27/2005 4:16 PM, Joe Markwardt wrote:
> 
> > On Tue, 2005-09-27 at 15:38 -0400, Christopher Browne wrote:
> >> Jan Wieck wrote:
> >> 
> >> > On 9/27/2005 3:08 PM, Christopher Browne wrote:
> >> >
> >> >> Joe Markwardt wrote:
> >> >>
> >> >>> Hi all,
> >> >>>     I am having some problems trying to get slony to catch up
> >> >>> after a weekend of heavy updates. I've tried playing with the
> >> >>> -s and -g settings for slony, but that hasn't seemed to have
> >> >>> any effect on the situation. I'm not sure where to proceed
> >> >>> from here, so here's where I say "uncle" and ask for help from
> >> >>> those wiser than I :-)  I'm not sure what info you need, so
> >> >>> I'll just have at it (if I leave anything out, let me know and
> >> >>> I'll get it for you):
> >> >>>
> >> >>> setup: replicate 3 databases from node 1 to node 2
> >> >>>     Each database is replicated using a separate slony config
> >> >>> file (using the slon-tools)
> >> >>>
> >> >>>     node 1: provider
> >> >>>         PG 7.4.7
> >> >>>         Slony 1.1.0
> >> >>>
> >> >>>     node 2: subscriber
> >> >>>         PG 8.0.3
> >> >>>         Slony 1.1.0
> >> >>>
> >> >>> Problem:
> >> >>>     node 2 is ~ 3 days behind node 1 after a series of heavy
> >> >>> updates on the weekend and falling further behind.
> >> >>>
> >> >>> Symptoms:
> >> >>>     the FETCH statement on the provider is eating 100% of one
> >> >>> of the CPUs on the server, not touching the disks at all.  The
> >> >>> subscriber spends 99% of its time Idle in Transaction (waiting
> >> >>> for the fetch to finish?)
> >> >>>
> >> >>> Details:
> >> >>>
> >> >>> from the node2 slony log:
> >> >>>
> >> >>> 2005-09-27 10:55:28 CDT DEBUG2 remoteListenThread_1: queue event 1,1691509 SYNC
> >> >>> 2005-09-27 10:55:28 CDT DEBUG2 remoteHelperThread_1_1: 120.122 seconds delay for first row
> >> >>> 2005-09-27 10:55:28 CDT DEBUG2 remoteHelperThread_1_1: 120.122 seconds until close cursor
> >> >>> 2005-09-27 10:55:28 CDT DEBUG2 remoteWorkerThread_1: new sl_rowid_seq value: 1000000000000000
> >> >>> 2005-09-27 10:55:28 CDT DEBUG2 remoteWorkerThread_1: SYNC 1610537 done in 120.170 seconds
> >> >>> 2005-09-27 10:55:28 CDT DEBUG2 remoteWorkerThread_1: Received event 1,1610538 SYNC
> >> >>> 2005-09-27 10:55:28 CDT DEBUG2 remoteWorkerThread_1: SYNC 1610538 processing
> >> >>> 2005-09-27 10:55:28 CDT DEBUG2 remoteWorkerThread_1: syncing set 1 with 114 table(s) from provider 1
> >> >>> 2005-09-27 10:55:30 CDT DEBUG2 syncThread: new sl_action_seq 1 - SYNC 635243
> >> >>>
> >> >>> from sl_status:
> >> >>> select * from _pl_replication.sl_status ;
> >> >>>  st_origin | st_received | st_last_event |      st_last_event_ts      | st_last_received |    st_last_received_ts     | st_last_received_event_ts  | st_lag_num_events |      st_lag_time
> >> >>> -----------+-------------+---------------+----------------------------+------------------+----------------------------+----------------------------+-------------------+-----------------------
> >> >>>          1 |           2 |       1691576 | 2005-09-27 10:58:05.907383 |          1610538 | 2005-09-27 10:55:44.046123 | 2005-09-24 14:02:55.407712 |             81038 | 2 days 20:55:12.07423
> >> >>>
> >> >>> from sl_log_1:
> >> >>> select count(*) from _pl_replication.sl_log_1;
> >> >>>  count
> >> >>> ----------
> >> >>> 14427361
> >> >>>
> >> >>> Any ideas or suggestions on how to reverse this process so it
> >> >>> will start catching up again?
> >> >>>
> >> >>>  
> >> >>>
> >> >>
> >> >> We have just been able to successfully point at a Probably
> >> >> Really Good answer to this; it just got checked into CVS HEAD
> >> >> and 1_1_STABLE...
> >> >>
> >> >> Let me suggest that you create an extra index:
> >> >>
> >> >> create index sl_log_1_idx2 on _pl_replication.sl_log_1(log_xid
> >> >> _pl_replication.xxid_ops);
> >> >
> >> >
> >> > Is that an optimization problem introduced by using BETWEEN
> >> > instead of the excessive OR chains?
> >> 
> >> No, the BETWEEN patch is not in CVS yet, so nobody would be using
> >> it by accident :-).
> >> 
> >> The extra index has proven useful with both version 1.0.5 and 1.1.0.
> >> 
> > Unfortunately it doesn't appear that it helped me out any :-/
> > 
> > pl=# \d _pl_replication.sl_log_1
> >          Table "_pl_replication.sl_log_1"
> >     Column     |         Type         | Modifiers
> > ---------------+----------------------+-----------
> >  log_origin    | integer              |
> >  log_xid       | _pl_replication.xxid |
> >  log_tableid   | integer              |
> >  log_actionseq | bigint               |
> >  log_cmdtype   | character(1)         |
> >  log_cmddata   | text                 |
> > Indexes:
> >     "sl_log_1_idx1" btree (log_origin, log_xid _pl_replication.xxid_ops, log_actionseq)
> >     "sl_log_1_idx2" btree (log_xid _pl_replication.xxid_ops)
> > 
> > 
> > 
> > 2005-09-27 14:39:06 CDT DEBUG2 remoteHelperThread_1_1: 119.242 seconds delay for first row
> > 2005-09-27 14:39:06 CDT DEBUG2 remoteHelperThread_1_1: 119.243 seconds until close cursor
> > 2005-09-27 14:39:06 CDT DEBUG2 remoteWorkerThread_1: new sl_rowid_seq value: 1000000000000000
> > 2005-09-27 14:39:06 CDT DEBUG2 remoteWorkerThread_1: SYNC 1610644 done in 119.289 seconds
> > 
> > I did also make the changes to the -s and -g parameters and it appears
> > to be generating syncs less often, but it doesn't really look to be
> > catching up any faster.
> > 
> > postgres  12839  0.0  0.6 288840 51708 pts/3  S  14:48  0:00 /usr/local/pgsql/bin//slon -s 60000 -d2 -g80 pl_replication host=192.168.1.200 dbname=pl
> > 
> > 
> > 
> > Any other changes you can suggest?
> 
> Did you give it an ANALYZE after creating the index?
> 
> 
> Jan
> 
Yep.  And I restarted the slons as well, so they should be using the
index as long as the planner thinks it will help.  Is there any way I
can see the query that slon is generating to create the cursor?  If I
could see that query, then I could run some EXPLAIN ANALYZEs and see
what they say...
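
In the meantime, here's roughly what I plan to try once I can see the
statement.  This is only a sketch: the SELECT below is my guess at the
general shape of the cursor query against sl_log_1, not the query slon
actually builds, and it assumes stats_command_string is turned on in
postgresql.conf on the provider so that pg_stat_activity shows the
running statement.

-- on the provider (7.4.7): catch whatever slon is currently running
-- (needs stats_command_string = true, otherwise current_query won't
-- show the statement text)
select procpid, current_query
  from pg_stat_activity
 where current_query like '%sl_log_1%';

-- then feed the captured SELECT to the planner; this query is only a
-- placeholder guess, not the real cursor definition
explain analyze
select log_origin, log_xid, log_tableid, log_actionseq,
       log_cmdtype, log_cmddata
  from _pl_replication.sl_log_1
 where log_origin = 1
 order by log_actionseq;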

-Joe