Jan Wieck JanWieck at Yahoo.com
Mon Sep 10 15:46:47 PDT 2007
On 9/10/2007 4:33 PM, Cyril SCETBON wrote:
> 
> Cyril SCETBON wrote:
>>
>>
>> Jan Wieck wrote:
>>> On 9/7/2007 9:36 AM, Cyril SCETBON wrote:
>>>> Hi,
>>>>
>>>> I got this configuration                Node1 --> Node2 (5 seconds 
>>>> late)
>>>>                                                           |
>>>>                                                           --> Node3 
>>>> (2 hours late)
>>>>
>>>> Node2 is processing each SYNC from Node3 and Node2, but Node3 is 
>>>> processing each SYNC from Node2 but not from Node1 which is the 
>>>> origin of the sets :
>>>>
>>>> On Node3 we see  `grep processing 
>>>> /var/log/slony1/node3-pns_profiles_preprod.log|awk '{print 
>>>> $5}'|sort|uniq -c`
>>>>      19 remoteWorkerThread_1:
>>>>     963 remoteWorkerThread_2:
>>>>
>>>> On Node2 we see `grep processing 
>>>> /var/log/slony1/node2-pns_profiles_preprod.log |awk '{print 
>>>> $5}'|sort|uniq -c`
>>>>    1570 remoteWorkerThread_1:
>>>>     865 remoteWorkerThread_3:
>>>>
>>>> Why is there so many SYNC not processed on Node3 ???
>>>>
>>>> Node3 got 22440 queue event and 25 Received event from 
>>>> remoteWorkerThread_1, while Node2 got 4467 queue event and 1578 
>>>> Received event from the same worker.
>>>>
>>>> Is there something to do ?
>>>
>>> How about looking for some error messages?
>> None.
> I've put slon in debug level 2
>>>
>>> What comes to mind would be that sl_event is grossly out of shape and 
>>> that the event selection times out.
>> Seems vacuuming sl_log_1 takes too much time cause of 
>> vacuum_cost_delay and that selecting from this table use a seq scan. 
>> I'm investiguating.
> I forced vacuum to go faster and checked slon logs of subscribers. They 
> got similar disks capabilities which seems to be the bottleneck on all 
> node (wait io ~=50% in vmstat).
> 
> I found replication tasks time are different :
> 
> On node 3 :
>                      delay in seconds = 585.974ms
>                      cleanupEvent in seconds = 9.25167s
> 
> On node 2 :
>                      delay in seconds = 37.6463ms
>                      cleanupEvent in seconds = 0.203265s
> 
> May these times explain why node 3 is late compared to node 2 ? What do 
> you think I have to investiguate now ?

Considering that node 2 can pretty well keep up but node 3 is falling 
way behind, the problem cannot be caused by node 1. Neither can it be 
caused by the event selection of node 3, so that leaves us with either 
the log selection done by node 3 against the data provider node 2, or 
the actual speed of node 3 itself.

In debug level 2, what does node 3's slon usually report as "delay for 
first row" when processing SYNC events?


Jan

-- 
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck at Yahoo.com #


More information about the Slony1-general mailing list