Once in a while, I experience a problem replicating a new set, where the slon remote worker thread is terminated by a signal 11. I'm not much of a C programmer, so I haven't been able to patch this myself. I don't know how much of the following is relevant, but here is what I do know: It only occurs when debugging is set at 2 or higher. According to strace, the last thing the process does before receiving the sig 11 is output the log entry "syncing set %d with %d table(s) from provider %d\n" (with real data of course), which appears to be called from remote_worker.c at line 4232. The log entry is fully output before the thread is killed. In my case, it always happens on set id 100, which I use as a temporary set when adding new tables and sequences, and only on the first sync of the set after subscribing it to a new node. Reducing the debugging to 0 has so far gotten past the issue all 3 times it has happened, and later syncs at a higher debug level work fine. It's not reproducible except at the specific time it's happening (ie. I don't have a test case that can demonstrate it). I've looked at the logging code and can't see a buffer issue in it. I think probably it's just causing an interaction with the real problem elsewhere.
I've found one bug which would cause the symptoms you describe, although there may be others :). Present in 1.2.15 and 1.2.16, at least. The problem should manifest itself when an overly long line is printed to the log (so it doesn't happen when the log level is low). Here's the case that was causing my slon to segfault; it's in sync_event() in remote_worker.c: slon_log(SLON_DEBUG4, " ssy_action_list value: %s\n", ssy_action_list); ssy_action_list in this case was over 135000 characters. When this got into slon_log and that function found that its buffer was not big enough, it tried to resize the buffer and call vsnprintf() again, but /it reused the va_arg/, which is not allowed. Here is one possible approach to fixing the problem: --- old_misc.c 2009-09-15 13:57:29.000000000 -0600 +++ misc.c 2009-09-15 14:11:10.000000000 -0600 @@ -172,7 +172,10 @@ off = strlen(outbuf); - while (vsnprintf(&outbuf[off], outsize - off, fmt, ap) >= outsize - off) + va_list apcopy; + va_copy(apcopy, ap); + + while (vsnprintf(&outbuf[off], outsize - off, fmt, apcopy) >= outsize - off) { outsize *= 2; outbuf = realloc(outbuf, outsize); @@ -181,7 +184,12 @@ perror("slon_log: realloc()"); slon_retry(); } + va_end(apcopy); + va_copy(apcopy, ap); } + + va_end(apcopy); + #ifdef HAVE_SYSLOG if (Use_syslog >= 1) {
Created attachment 34 [details] The patch from the above comment Sorry, I'm such a Bugzilla noob. This attachment is probably better.
This still bites us when I don't rebuild our packages with the above patch. It will happen anytime someone does a sufficiently large set subscription with debugging level of 2 or higher. Is there anything else I can do to encourage upstream merging of this fix?
It took me two days to find Paul's patch. Finally I also managed to make my boss regain confidence in Slony-I. A signal 11 was the blocker for our move with multiple big databases. Slony-I is a really important software for PostgreSQL infrastructures. Please, don't let it be unappreciated because of 'slon: child terminated status: 11'.
Jan implemented this in a slightly shorter fashion in 1.2.20: http://main.slony.info/viewcvs/viewvc.cgi/slony1-engine/src/slon/misc.c?r1=1.22.2.3&r2=1.22.2.4&pathrev=REL_1_2_STABLE http://main.slony.info/viewcvs/viewvc.cgi/slony1-engine/src/slon/misc.c?r1=1.26.2.2&r2=1.26.2.3&pathrev=REL_2_0_STABLE The issue is certainly present in 1.2.15 and 1.2.16, but it shouldn't be as of 1.2.20.
The fix from Jan will not address this problem. The problem is that the va_list "ap" gets reused by a new call to vsnprintf() when outbuf is too small and has to be realloc'd. Since vsnprintf has to read the contents of ap, ap becomes undefined after it returns. If you want to use the list of variable arguments multiple times (i.e., call vsnprintf multiple times), you must copy it using va_copy(). I can verify that it is still in 1.2.20; we were bit by it just last week, when I posted my comment, and we run 1.2.20.
I have committed this patch against 1.2 and 2.0 CVS. I also added a test case to the utf8 test case that seems to reproduce the crash without the patch.