Tue Feb 18 19:34:54 PST 2014
- Previous message: [Slony1-general] Still having issues with wide area replication. large table , copy set 2 failed
- Next message: [Slony1-general] Still having issues with wide area replication. large table , copy set 2 failed
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
On 02/16/14 20:06, Jeff Frost wrote:
>
> On Feb 16, 2014, at 5:00 PM, Tory M Blue <tmblue at gmail.com> wrote:
>
>> As can be seen the connection is reaped, slon/postgres continue on their
>> way, it's not until the next data copy is required that it finds its
>> connection is no longer there. Why it can't recreate a connection as one
>> would do if they stopped and started slon is kind of beyond me. Just not
>> 100% sure where it's being killed.
>
> Because the initial sync must be done as a single transaction.
>
>> 2014-02-16 16:40:46 PST CONFIG remoteWorkerThread_1: 7183.069 seconds to copy table "tracking"."spotlightimp"
>> 2014-02-16 16:40:46 PST CONFIG remoteWorkerThread_1: copy table "tracking"."adimp"
>> 2014-02-16 16:40:46 PST CONFIG remoteWorkerThread_1: Begin COPY of table "tracking"."adimp"
>> 2014-02-16 16:40:46 PST ERROR remoteWorkerThread_1: "select "_cls".copyFields(19);"
>> 2014-02-16 16:40:46 PST WARN remoteWorkerThread_1: data copy for set 2 failed 1 times - sleep 15 seconds
>> NOTICE: Slony-I: Logswitch to sl_log_2 initiated
>> CONTEXT: SQL statement "SELECT "_cls".logswitch_start()"
>> PL/pgSQL function _cls.cleanupevent(interval) line 96 at PERFORM
>> 2014-02-16 16:40:49 PST INFO cleanupThread: 6541.365 seconds for cleanupEvent()
>>
>> Am I doing this wrong? Since I've seen connections at 15 minutes of
>> processing complete fine, I thought that 30 minutes is more than enough.
>> So send the first "hey, are you still there" at 15 minutes, then continue
>> with them every 5 minutes, for a count of 30.
>>
>> But the above seems to have been reaped in the 20 minute area..
>>
>> net.ipv4.tcp_keepalive_time = 600
>> net.ipv4.tcp_keepalive_probes = 30
>> net.ipv4.tcp_keepalive_intvl = 300
>
> Set it so that it's sending keepalives every 30 seconds.
>
> Something like this:
>
> net.ipv4.tcp_keepalive_time = 30
> net.ipv4.tcp_keepalive_probes = 10
> net.ipv4.tcp_keepalive_intvl = 30
>

Jeff is right.
Really understanding these values may help too. Since Slony lets you set
them specifically for Slony in its config file, that is where it really
should be done, rather than setting the global values in the kernel. Those
kernel values should also be adjusted to something more appropriate than
what would suit a link to the Moon, but that's another story.

tcp_keepalive_time is the number of seconds the kernel waits after the
last transmission on a socket before it starts to "probe". In this case,
the end of the COPY, when Slony goes into the long "<IDLE> in transaction"
while building the indexes, is the start of that timer.

tcp_keepalive_intvl is the interval between keepalive packets. But that
interval only kicks in once tcp_keepalive_time has elapsed.

tcp_keepalive_probes is the number of keepalive packets that need to go
missing in a row before the connection is considered "lost", resulting in
a "connection reset by peer". One single keepalive received resets the
whole thing.

I actually go a lot more aggressive on this. Something like

    tcp_keepalive_time = 5
    tcp_keepalive_probes = 24
    tcp_keepalive_intvl = 5

Pretty much all "remote" Slony connections do something every couple of
seconds. And a link that cannot deal with a few keepalive packets per
minute is not suitable for running Slony over anyway. And who cares about
wasting bandwidth on an idle link? Seriously! It's not like we are paying
by the kilobyte of used bandwidth these days, are we?

What is also important here is that my settings above lead to a timeout
and cancellation of the PostgreSQL backend within about 2 minutes
(5 seconds idle plus 24 probes at 5-second intervals). There is something
really bad about hanging backends of lost Slony connections when you
actually need to fail over. You really want them to time out in a timely
fashion. Trust me.

Regards,
Jan

-- 
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin
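For anyone who wants to check the arithmetic on the three sets of values in this thread, here is a minimal Python sketch. It assumes the usual approximation that a dead peer is detected after roughly time + probes * intvl seconds of silence. The Keepalive class, the SETTINGS labels and the apply_to_socket helper are made up for illustration; they are not part of Slony or PostgreSQL. The per-socket option names (TCP_KEEPIDLE, TCP_KEEPCNT, TCP_KEEPINTVL) are the Linux ones exposed by Python's socket module and may be missing on other platforms, so the helper skips any that are absent.

```python
import socket
from dataclasses import dataclass


@dataclass(frozen=True)
class Keepalive:
    """Hypothetical container for the three keepalive knobs in this thread."""
    time: int    # idle seconds before the first probe (tcp_keepalive_time)
    probes: int  # unanswered probes before giving up (tcp_keepalive_probes)
    intvl: int   # seconds between probes (tcp_keepalive_intvl)

    def first_probe_after(self) -> int:
        # An idle connection stays completely silent for this long.
        return self.time

    def dead_peer_detected_after(self) -> int:
        # Approximation: idle wait, then every probe goes unanswered.
        return self.time + self.probes * self.intvl


# The three configurations quoted in the thread (labels are illustrative).
SETTINGS = {
    "original": Keepalive(time=600, probes=30, intvl=300),
    "Jeff": Keepalive(time=30, probes=10, intvl=30),
    "Jan": Keepalive(time=5, probes=24, intvl=5),
}


def apply_to_socket(sock: socket.socket, ka: Keepalive) -> None:
    """Override the kernel-wide values for one socket only.

    Illustrative only: this shows the per-connection idea, not how slon
    itself applies its configuration.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    for name, value in (
        ("TCP_KEEPIDLE", ka.time),
        ("TCP_KEEPCNT", ka.probes),
        ("TCP_KEEPINTVL", ka.intvl),
    ):
        opt = getattr(socket, name, None)  # absent on some platforms
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)


if __name__ == "__main__":
    for label, ka in SETTINGS.items():
        print(
            f"{label}: first probe after {ka.first_probe_after()} s, "
            f"dead peer detected after ~{ka.dead_peer_detected_after()} s"
        )
```

Under that approximation the original values give a first probe only after 600 seconds and a detection time of about 9600 seconds, Jeff's give 30 and about 330 seconds, and the aggressive values give 5 and about 125 seconds, which matches the "within 2 minutes" figure above.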