Seems as though this is a duplicate of bug #264, and that patch appears to be in 2.1.4, so I'm not sure why I'm still seeing this behavior. After an initial sync that takes about 8hrs, slon generates this extremely long query (that log line by itself is 107MB in size) that looks like: 2013-11-13 10:44:26 EST ERROR remoteWorkerThread_2: "declare LOG cursor for select log_origin, log_txid, log_tableid, log_actionseq, log_cmdtype, octet_length(log_cmddata), case when octet_length(log_cmddata) <= 8192 then log_cmddata else null end from "_migration".sl_log_1 where log_origin = 2 and log_tableid in (7,9,23,24,38,39,42,47,48,53,57,74,75,86,97,116,120,128,129,130,136,137,143) and log_txid >= '6236982' and log_txid < '6236996' and "pg_catalog".txid_visible_in_snapshot(log_txid, '6236996:6236996:') and ( log_actionseq <> '5342437751' and log_actionseq <> '5342362294' and log_actionseq <> '5341832481' and log_actionseq <> '5342413870' and log_actionseq <> '5342418592' and log_actionseq <> '5342202064' and log_actionseq <> '5341939025' and log_actionseq <> '5341887545' and log_actionseq <> '5342114362' and log_actionseq <> '5342300971' and log_actionseq <> '5341759946' and log_actionseq <> '5342195755' and log_actionseq <> '5342198736' and log_actionseq <> '5342218415' and log_actionseq <> '5342188807' and log_actionseq <> '5342225657' and log_actionseq <> '5342434567' and log_actionseq <> '5342088216' and log_actionseq <> '5342293611' and log_actionseq <> '5341866011' and ..... and log_actionseq <> '5342342863' ) order by log_actionseq" PGRES_FATAL_ERROR ERROR: stack depth limit exceeded max_stack_depth is 4MB on these systems, but even setting it to the max doesn't do the job. BTW, the versions in bugzilla only go up to 2.0, which is why this is filed under devel.
I do see the order by clause on the log_actionseq select in the REL_2_1_STABLE branch, but the log snippet you pasted clearly shows those numbers out of order. Can you try to reindex the sl_log_* tables? Note that in 2.2 the problem is gone entirely because we no longer use the log_actionseq for that. Instead the log selection for the first SYNC after subscribe is based on the snapshot of the copy_set() operation. At the time we did not feel comfortable to backpatch that into 2.1. That might be a reasonable thing to do now.
For some reason this didn't make it back into bugzilla: I actually went and downloaded the source for 2.1.4 and 2.1.3 and both of them appear to have the ORDER BY, so I'm not sure what's going on here. Unfortunately, I truncated the sl_log_* tables so we could get this sync completed as it's holding up a migration. How stable is 2.2.1 considered? I have a feeling we are going to hit this problem again today, so is there any further info I could gather before I tear it down?
Afilias is running 2.2.0 in production and we have done extensive testing on 2.2.1, will be upgrading to it shortly. So we definitely consider it "production".
That said, we discussed the matter when I was in TO last week and we do think that backpatching the 2.2 changes, that eliminate the action sequence list altogether, is appropriate now. A subsequent 2.1 release should have that.
Thanks! That's good to hear. Perhaps we'll upgrade to 2.2 if we can't get around the issue before the backport. So, is there any further info I can gather because I think it's likely to hit this issue again today or not really because you're just going to backport the 2.2 code?
The thing I don't understand is why you are hitting this issue at all. The 2.1 STABLE branch has the ORDER BY clause, so the list should be ordered. The log you pasted shows an unordered list. I hate to ask this, but is it possible that you upgraded Slony while there was an outstanding SUBSCRIBE going on?
I agree, I actually went and downloaded the source for both 2.1.3 and 2.1.4 and they both have the ORDER BY in the code. I actually went and asked on #postgresql-apt to just check that the 2.1.4 debian package was really build from 2.1.4, but haven't heard back yet. It actually happened a couple times and in each case the cluster was completely dropped and recreated. very strange!
The place where that action sequence ID list is coming from is the sl_setsync.ssy_action_list. That datum is created by copy_set() as part of the SUBSCRIBE operation. The only way I can imagine you ended up with a 2.1.4 running into that problem is that a SUBSCRIBE was done with a previous version, the sl_setsync entry created and then a successful upgrade to 2.1.4 was performed. That of course didn't fix the existing sl_setsync entry and the system is still stuck and cannot complete the initial SYNC after SUBSCRIBE.
So, I just looked in sl_setsync for this new pair of servers and the ids are not in order: '5864690123','5864690319','5864683875','5864690590','5864681889','5864660235','5864685143' BUT, there is only about 400K of data in that column. Also, these two servers have only ever had 2.1.4 on them prior to the subscription.
Apparently a construct of (select x, ... order by x) union (select x, ... order by x) does NOT create a sorted list at all. It may have created something with one break in the sort order in some older PostgreSQL release, but right now I think we really need to backport the patch that went into 2.2.