Bug 202 - remote_worker error creating log cursor
Summary: remote_worker error creating log cursor
Status: NEW
Alias: None
Product: Slony-I
Classification: Unclassified
Component: slon
Version: 2.0
Hardware: PC Linux
Importance: low major
Assignee: Slony Bugs List
URL:
Depends on:
Blocks:
 
Reported: 2011-04-21 01:43 UTC by Jason Tevnan
Modified: 2011-05-02 00:39 UTC
CC List: 1 user

See Also:


Attachments
Full cursor declaration (2.08 KB, text/x-sql)
2011-04-21 01:43 UTC, Jason Tevnan

Description Jason Tevnan 2011-04-21 01:43:14 UTC
Created attachment 89
Full cursor declaration

Slony cannot continue replicating after a malformed txid_snapshot is generated.

As far as I can see from the remote_worker.c source file, the log cursor is generated on the fly; in that case the txid_snapshot was calculated incorrectly.

In my case, in the line:
"pg_catalog".txid_visible_in_snapshot(log_txid, '204108434:204108466:204108434,204108435,204108435,204108439,204108441,204108444,204108447,204108449,204108451,204108453')
the xip_list contains two identical entries: <snip>204108435,204108435,</snip>
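A minimal way to see the failure (my reconstruction, not the exact statement slon issued) is to feed such a snapshot literal back through the txid_snapshot input function on its own; with the duplicated entry it is rejected before any cursor can be opened:

-- abbreviated xip_list, but it keeps the duplicated 204108435 entry
SELECT '204108434:204108466:204108434,204108435,204108435,204108439'::txid_snapshot;
-- should fail with an "invalid input syntax for type txid_snapshot" error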

Due to the nature of this error, I set the severity to major (I had to do an initial setup of the replication cluster).

If this is a known issue, please close it. 
Regards
Comment 1 Steve Singer 2011-04-21 05:51:51 UTC
Did you happen to get a dump of sl_event from your origin containing the details of the SYNC events involved?

Also, what versions of Slony and PostgreSQL are you using?
Comment 2 Jason Tevnan 2011-04-21 06:11:54 UTC
Of course I forgot the most important info, the version details:
Slony version: 2.0.4
Postgres: 8.4.7
running on Ubuntu 10.04 LTS

The only dump I have of the replication schema is dated 3 hours prior to the incident.


I have been following up on the issue myself as well, and admit that the issue is not in the remote_worker, but rather in the event creator: 

approx. line 3834 in remote_worker.c:
(void) slon_mkquery(&new_qual,
                    "(log_txid < '%s' and "
                    "\"pg_catalog\".txid_visible_in_snapshot(log_txid, '%s'))",
                    event->ev_maxtxid_c, event->ev_snapshot_c);
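For illustration, substituting the values from this report for the two %s placeholders gives a qualification like the one below (a sketch only: ev_maxtxid_c is assumed here to equal the snapshot's xmax, and log_txid is replaced by a literal so the expression runs without the log table). The second substitution is where the malformed snapshot literal ends up:

SELECT (204108440 < '204108466' and
        "pg_catalog".txid_visible_in_snapshot(204108440,
          '204108434:204108466:204108434,204108435,204108435,204108439,204108441,204108444,204108447,204108449,204108451,204108453'));
-- should fail while the snapshot literal is parsed, so the real cursor is never declared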
Comment 3 Steve Singer 2011-04-21 06:48:24 UTC
(In reply to comment #2)

The event creation code (in slony1_funcs.c) just calls

"pg_catalog".txid_current_snapshot()

to get the list of in-progress txids.
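For readers following along, the snapshot text format is 'xmin:xmax:xip_list', and its parts can be inspected with the stock accessor functions (illustration only, unrelated to the broken snapshot above):

-- capture a snapshot and look at its pieces; txid_snapshot_xip() returns the
-- in-progress txids that make up the xip_list portion of the text form
SELECT txid_current_snapshot();                      -- e.g. 100:105:101,103
SELECT txid_snapshot_xmin(s), txid_snapshot_xmax(s)
  FROM txid_current_snapshot() AS s;
SELECT txid_snapshot_xip(s)
  FROM txid_current_snapshot() AS s;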
Comment 4 Jason Tevnan 2011-04-21 07:10:35 UTC
> "pg_catalog".txid_current_snapshot()

I see how the list is constructed now: transaction ids between the current txid and the old txid are then packed into the event. Of course, without a dump of the event table it is hard to reconstruct, but the only way two identical txids could occur as in my original post is if the event table has two identical entries for this txid.
Please correct me if I am wrong, but wouldn't a unique constraint on ev_snapshot prevent a problem like this from occurring?
Thanks for your time.
Comment 5 Jan Wieck 2011-04-25 07:55:19 UTC
As far as I can tell, this is erroneous data produced by Postgres itself. Slony uses txid_current_snapshot() to get that snapshot datum, which now apparently cannot pass through a data type output+input cycle.

I'm going to run a few tests this afternoon and contact the PG hackers list after that.
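A simple round-trip check along those lines (a guess at the kind of test meant here, not the actual test plan) would be:

-- if txid_current_snapshot() ever emits a duplicate xip entry, the cast back
-- from text should fail, reproducing the cursor error reported above
SELECT txid_current_snapshot()::text::txid_snapshot;

Presumably this would have to be repeated under concurrent transaction activity to ever hit the bad case.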
Comment 6 Jason Tevnan 2011-05-02 00:39:34 UTC
(In reply to comment #5)
> As far as I can tell, this is erroneous data produced by Postgres itself. Slony
> uses txid_current_snapshot() to get that snapshot datum, which now apparently
> cannot pass through a data type output+input cycle.
> 
> I'm going to run a few tests this afternoon and contact the PG hackers list
> after that.

Did your tests offer any explanations for this behavior?