Bug 55 - Possible buffer overflow in slon, still happens as of 1.2.14
Summary: Possible buffer overflow in slon, still happens as of 1.2.14
Status: RESOLVED FIXED
Alias: None
Product: Slony-I
Classification: Unclassified
Component: slon (show other bugs)
Version: 1.2
Hardware: PC Linux
: low minor
Assignee: Slony Bugs List
URL:
Depends on:
Blocks:
 
Reported: 2008-07-18 09:46 UTC by Alan Hodgson
Modified: 2010-03-31 07:14 UTC (History)
2 users (show)

See Also:


Attachments
The patch from the above comment (598 bytes, patch)
2009-09-15 13:14 UTC, paul cannon
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Alan Hodgson 2008-07-18 09:46:01 UTC
Once in a while, I experience a problem replicating a new set, where the slon remote worker thread is terminated by a signal 11.

I'm not much of a C programmer, so I haven't been able to patch this myself.

I don't know how much of the following is relevant, but here is what I do know:

It only occurs when debugging is set at 2 or higher. 

According to strace, the last thing the process does before receiving the sig 11 is output the log entry "syncing set %d with %d table(s) from provider %d\n" (with real data of course), which appears to be called from remote_worker.c at line 4232. The log entry is fully output before the thread is killed.

In my case, it always happens on set id 100, which I use as a temporary set when adding new tables and sequences, and only on the first sync of the set after subscribing it to a new node.

Reducing the debugging to 0 has so far gotten past the issue all 3 times it has happened, and later syncs at a higher debug level work fine. It's not reproducible except at the specific time it's happening (ie. I don't have a test case that can demonstrate it).

I've looked at the logging code and can't see a buffer issue in it. I think probably it's just causing an interaction with the real problem elsewhere.
Comment 1 paul cannon 2009-09-15 13:13:02 UTC
I've found one bug which would cause the symptoms you describe, although there may be others :). Present in 1.2.15 and 1.2.16, at least.

The problem should manifest itself when an overly long line is printed to the log (so it doesn't happen when the log level is low). Here's the case that was causing my slon to segfault; it's in sync_event() in remote_worker.c:

            slon_log(SLON_DEBUG4, " ssy_action_list value: %s\n",
                     ssy_action_list);

ssy_action_list in this case was over 135000 characters. When this got into slon_log and that function found that its buffer was not big enough, it tried to resize the buffer and call vsnprintf() again, but /it reused the va_arg/, which is not allowed.

Here is one possible approach to fixing the problem:

--- old_misc.c	2009-09-15 13:57:29.000000000 -0600
+++ misc.c	2009-09-15 14:11:10.000000000 -0600
@@ -172,7 +172,10 @@
 
 	off = strlen(outbuf);
 
-	while (vsnprintf(&outbuf[off], outsize - off, fmt, ap) >= outsize - off)
+	va_list apcopy;
+	va_copy(apcopy, ap);
+
+	while (vsnprintf(&outbuf[off], outsize - off, fmt, apcopy) >= outsize - off)
 	{
 		outsize *= 2;
 		outbuf = realloc(outbuf, outsize);
@@ -181,7 +184,12 @@
 			perror("slon_log: realloc()");
 			slon_retry();
 		}
+		va_end(apcopy);
+		va_copy(apcopy, ap);
 	}
+
+	va_end(apcopy);
+
 #ifdef HAVE_SYSLOG
 	if (Use_syslog >= 1)
 	{
Comment 2 paul cannon 2009-09-15 13:14:53 UTC
Created attachment 34 [details]
The patch from the above comment

Sorry, I'm such a Bugzilla noob. This attachment is probably better.
Comment 3 paul cannon 2010-03-24 09:44:25 UTC
This still bites us when I don't rebuild our packages with the above patch. It will happen anytime someone does a sufficiently large set subscription with debugging level of 2 or higher.

Is there anything else I can do to encourage upstream merging of this fix?
Comment 4 stephan friese 2010-03-25 12:24:37 UTC
It took me two days to find Paul's patch. Finally I also managed to make my boss regain confidence in Slony-I. A signal 11 was the blocker for our move with multiple big databases.

Slony-I is a really important software for PostgreSQL infrastructures. Please, don't let it be unappreciated because of 'slon: child terminated status: 11'.
Comment 5 Christopher Browne 2010-03-29 13:59:52 UTC
Jan implemented this in a slightly shorter fashion in 1.2.20:

http://main.slony.info/viewcvs/viewvc.cgi/slony1-engine/src/slon/misc.c?r1=1.22.2.3&r2=1.22.2.4&pathrev=REL_1_2_STABLE
http://main.slony.info/viewcvs/viewvc.cgi/slony1-engine/src/slon/misc.c?r1=1.26.2.2&r2=1.26.2.3&pathrev=REL_2_0_STABLE

The issue is certainly present in 1.2.15 and 1.2.16, but it shouldn't be as of 1.2.20.
Comment 6 paul cannon 2010-03-29 15:16:16 UTC
The fix from Jan will not address this problem. The problem is that the va_list "ap" gets reused by a new call to vsnprintf() when outbuf is too small and has to be realloc'd. Since vsnprintf has to read the contents of ap, ap becomes undefined after it returns.

If you want to use the list of variable arguments multiple times (i.e., call vsnprintf multiple times), you must copy it using va_copy().

I can verify that it is still in 1.2.20; we were bit by it just last week, when I posted my comment, and we run 1.2.20.
Comment 7 Steve Singer 2010-03-31 07:14:13 UTC
I have committed this patch against 1.2 and 2.0 CVS.

I also added a test case to the utf8 test case that seems to reproduce the crash without the patch.