Wed Sep 19 10:36:31 PDT 2007
Jan Wieck wrote:
> On 9/11/2007 2:49 AM, Cyril SCETBON wrote:
>>
>> Jan Wieck wrote:
>>> On 9/10/2007 4:33 PM, Cyril SCETBON wrote:
>>>>
>>>> Cyril SCETBON wrote:
>>>>>
>>>>> Jan Wieck wrote:
>>>>>> On 9/7/2007 9:36 AM, Cyril SCETBON wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> I've got this configuration:
>>>>>>>
>>>>>>>     Node1 --> Node2 (5 seconds late)
>>>>>>>       |
>>>>>>>       --> Node3 (2 hours late)
>>>>>>>
>>>>>>> Node2 is processing each SYNC from Node1 and Node3, but Node3 is
>>>>>>> processing each SYNC from Node2 and not from Node1, which is the
>>>>>>> origin of the sets.
>>>>>>>
>>>>>>> On Node3 we see, with
>>>>>>> `grep processing /var/log/slony1/node3-pns_profiles_preprod.log | awk '{print $5}' | sort | uniq -c`:
>>>>>>>
>>>>>>>      19 remoteWorkerThread_1:
>>>>>>>     963 remoteWorkerThread_2:
>>>>>>>
>>>>>>> On Node2 we see, with
>>>>>>> `grep processing /var/log/slony1/node2-pns_profiles_preprod.log | awk '{print $5}' | sort | uniq -c`:
>>>>>>>
>>>>>>>    1570 remoteWorkerThread_1:
>>>>>>>     865 remoteWorkerThread_3:
>>>>>>>
>>>>>>> Why are there so many SYNCs not processed on Node3?
>>>>>>>
>>>>>>> Node3 got 22440 "queue event" and 25 "Received event" entries from
>>>>>>> remoteWorkerThread_1, while Node2 got 4467 "queue event" and 1578
>>>>>>> "Received event" entries from the same worker.
>>>>>>>
>>>>>>> Is there something I can do?
>>>>>>
>>>>>> How about looking for some error messages?
>>>>>
>>>>> None.
>>>>
>>>> I've put slon in debug level 2.
>>>>
>>>>>> What comes to mind would be that sl_event is grossly out of shape
>>>>>> and that the event selection times out.
>>>>>
>>>>> It seems vacuuming sl_log_1 takes too much time because of
>>>>> vacuum_cost_delay, and selecting from this table uses a seq scan.
>>>>> I'm investigating.
>>>>
>>>> I forced vacuum to go faster and checked the slon logs of the
>>>> subscribers. They have similar disk capabilities, which seem to be
>>>> the bottleneck on all nodes (wait I/O ~= 50% in vmstat).
>>>>
>>>> I found that the replication task times differ:
>>>>
>>>> On node 3:
>>>>   delay in seconds = 585.974ms
>>>>   cleanupEvent in seconds = 9.25167s
>>>>
>>>> On node 2:
>>>>   delay in seconds = 37.6463ms
>>>>   cleanupEvent in seconds = 0.203265s
>>>>
>>>> Could these times explain why node 3 is late compared to node 2?
>>>> What do you think I should investigate now?
>>>
>>> Considering that node 2 can pretty well keep up but node 3 is
>>> falling way behind, the problem cannot be caused by node 1. Neither
>>> can it be caused by the event selection of node 3, so that leaves us
>>> with either the log selection done by node 3 against the data
>>> provider (node 2), or the actual speed of node 3 itself.
>>>
>>> In debug level 2, what does node 3's slon usually report as "delay
>>> for first row" when processing SYNC events?
>>
>> That's what I gave as 'delay in seconds' above.
>
> OK, so the origin can provide log rows almost instantaneously, while
> node 2 apparently has some issues doing the same. Although half a
> second isn't a catastrophe, it indicates that there are already some
> performance issues handling the overall workload on that system.
>
> Now when it comes to node 3, this means it is not doing any actual
> replication work for 500 ms per sync group, which should not pose a
> real problem. So my guess is that node 3 is simply too slow to keep
> up with the write load of the origin, or that the network connection
> is too slow to actually deliver the log data fast enough.
> If this is a WAN connection (which by itself can explain 500 ms for
> the first FETCH of 100 log rows), you might want to try using an ssh
> tunnel with compression.

Although I already use SSH compression, it's not better. On the provider
there are 400 writes/s. Do you think it would be worth increasing
SLON_DATA_FETCH_SIZE to 500 or 1000 in the remote worker to improve
performance? Network latency is 18 ms for a ping to the other
geographical site, vs 0.2 ms within the same site.

I've got 1024 tables spread across 64 sets. I was also thinking that
spreading these 64 sets over 2 databases on the same host might improve
performance, by running 2 different Slony clusters on the same machine:
smaller sl_log_? tables, and 2 different slon daemons (one per cluster)
taking care of the replication.

> The other thing to check is to make sure all databases are tuned.

Both hosts can serve more than 500 writes/s. (Some command sketches for
the points above follow at the end of this message.)

--
Cyril SCETBON
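For pulling Jan's "delay for first row" figure out of a debug-2 slon log, a
pipeline in the same spirit as the ones earlier in the thread should work
(the exact log message text is an assumption, matched to what Jan quoted):

    # summarize the per-SYNC "delay for first row" lines that node 3's slon
    # prints at debug level 2; log path taken from earlier in this thread
    grep 'delay for first row' /var/log/slony1/node3-pns_profiles_preprod.log | tail -n 20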
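On the vacuum_cost_delay point: a quick way to check the setting and run a
one-off vacuum of sl_log_1 without cost-based throttling. The database name
and the _pns_profiles_preprod schema are assumptions derived from the log
file names above; adjust them to your cluster:

    # show the current cost-based vacuum delay
    psql -d pns_profiles_preprod -c "SHOW vacuum_cost_delay"

    # one-off vacuum/analyze of sl_log_1 with the delay disabled for that
    # session only, passed to the backend via libpq's PGOPTIONS
    PGOPTIONS='-c vacuum_cost_delay=0' \
        vacuumdb -d pns_profiles_preprod -z -v -t '_pns_profiles_preprod.sl_log_1'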
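For the compressed ssh tunnel Jan suggested, something along these lines
should work. Host names, ports, the user and the cluster name here are
placeholders; the important part is that the path node 3 uses to reach its
data provider (node 2) has to point at the tunnel endpoint:

    # on node 3: forward a local port to node 2's PostgreSQL, with ssh
    # compression (-C); -N means "no remote command, just forward"
    ssh -C -N -L 15432:localhost:5432 postgres@node2.example.com &

    # repoint node 3's path to its provider at the tunnel endpoint
    # (STORE PATH overwrites an existing path entry); conninfo details
    # are placeholders for this sketch
    slonik <<'EOF'
    cluster name = pns_profiles_preprod;
    node 3 admin conninfo = 'host=node3.example.com dbname=pns_profiles_preprod user=slony';
    store path (server = 2, client = 3,
                conninfo = 'host=127.0.0.1 port=15432 dbname=pns_profiles_preprod user=slony');
    EOF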
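As for SLON_DATA_FETCH_SIZE: it is a compile-time constant in the Slony-I
sources (I'm assuming it still lives in src/slon/remote_worker.c in your
version), so trying 500 or 1000 means editing the define and rebuilding:

    # locate the constant in the source tree (path is an assumption)
    grep -rn 'SLON_DATA_FETCH_SIZE' src/slon/
    # raise the value in the #define (e.g. 100 -> 500), then rebuild,
    # reinstall, and restart the slon daemons to pick up the new binary
    make && make install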