[Slony1-general] Aieeee! Problem with 2.0.4

Sat Jul 31 23:00:58 PDT 2010

Karl Denninger wrote:
> I upgraded somewhat-recently from 2.0.2 to 2.0.4, and now I've got a
> serious problem.
>
> The reason for the "gotta do it now" was that somehow one of the tables
> got out of sync, and a delete was failing to propagate - hanging the
> process.
>
> OK, ok, so 2.0.2 with Postgres 8.4.4 is a bit old and mismatched.  So I
> upgraded to 2.0.4 on all the nodes, and told the subscriber to reload -
> ditched the client config and re-subscribed the sets.
>
> All went well until a very large table came up - it failed.
>
> There's no error in the logs indicating why, other than the following:
>
> Jul 31 22:52:53 dbms TICKER[70295]: [153-1] CONFIG remoteWorkerThread_3: copy table "public"."images"
> Jul 31 22:52:53 dbms TICKER[70295]: [154-1] CONFIG remoteWorkerThread_3: Begin COPY of table "public"."images"
> Jul 31 22:54:24 dbms TICKER[70295]: [155-1] ERROR  remoteWorkerThread_3: PGgetCopyData() server closed the connection unexpectedly
> Jul 31 22:54:24 dbms TICKER[70295]: [155-2]     This probably means the server terminated abnormally
> Jul 31 22:54:24 dbms TICKER[70295]: [155-3]     before or while processing the request.
> Jul 31 22:54:24 dbms TICKER[70295]: [156-1] WARN   remoteWorkerThread_3: data copy for set 1 failed 1 times - sleep 15 seconds
>
> And in 15 seconds, the entire process of trying to re-init the node
> starts over - from the beginning!
>
> Near as I can tell, it's failing pretty early on.
>
> The source host is fine.  This particular table contains a BYTEA field,
> and it's BIG.  ~20ish gigs big.  But I've re-initialized in the past
> without problems.  I tried going back to 2.0.2, and that still fails. 
> Both servers are running with encoding set to SQL_ASCII, if it matters.
>
> When it fails the SERVER's COPY is still running - so the client is
> definitely wrong on the reported error.  I have NOTHING in the server's
> SLON log and there are no comms problems between the two hosts.
>
> I'm going to run a dump of the table and see if I can manually bring it
> over to the other host and load it.  There's nothing going on with the
> master that implicates the data being damaged......
>
> Ideas?
>
> -- Karl
>   
Oh boy, I think I know what's going on....

This looks like a problem in the SSL code (!!!!!)

I shut off SSL, and the copy is now ~4x further along than the point
where it previously failed.  For obvious reasons this is decidedly
un-good - a perusal of the server's log (the Postgres server's, not
slony's) turned up several SSL errors, which would account for the
problem and for what the client thought were disconnects - but weren't.

Specifically, I got a bunch of these...

Jul 31 23:00:00 tickerforum postgres[27093]: [9593-2] STATEMENT:  copy "public"."images" ("post_ordinal","ordinal","caption","image","login","file_type","thumb","thumb_width","thumb_height","hidden") to stdout;
Jul 31 23:00:00 tickerforum postgres[27093]: [9594-1] LOG:  SSL error: internal error
Jul 31 23:00:00 tickerforum postgres[27093]: [9594-2] STATEMENT:  copy "public"."images" ("post_ordinal","ordinal","caption","image","login","file_type","thumb","thumb_width","thumb_height","hidden") to stdout;
Jul 31 23:00:01 tickerforum postgres[27093]: [9595-1] LOG:  SSL error: internal error
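
For anyone wanting to check their own server log for the same symptom,
here is a minimal Python sketch that tallies "SSL error" LOG entries per
backend PID.  It assumes syslog-style lines in the format shown above,
with each entry on a single line in the actual log file.  The function
name and the regular expression are illustrative only, not part of
Slony or Postgres.

```python
import re
from collections import Counter

# Matches syslog-style PostgreSQL entries such as:
#   Jul 31 23:00:00 tickerforum postgres[27093]: [9594-1] LOG:  SSL error: internal error
# Assumption: the log prefix looks like the excerpt in this post.
LINE_RE = re.compile(
    r"postgres\[(?P<pid>\d+)\]: \[\d+-\d+\] (?P<level>[A-Z]+):\s+(?P<msg>.*)"
)


def count_ssl_errors(lines):
    """Return a Counter mapping backend PID -> number of 'SSL error' LOG entries."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.search(line)
        if m and m.group("level") == "LOG" and m.group("msg").startswith("SSL error"):
            counts[int(m.group("pid"))] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python count_ssl_errors.py < /path/to/postgres.log
    for pid, n in count_ssl_errors(sys.stdin).items():
        print(f"backend {pid}: {n} SSL error(s)")
```

A PID that shows repeated SSL errors alongside a long-running COPY
statement would match the pattern seen here.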

This isn't SLONY's issue, but it's definitely a problem.  I'll report it
over on the Postgres list in the morning...

In the meantime, I think I can get the copy to go with SSL off, and then
turn it back on once the copy is complete.
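
The post does not say how SSL was shut off.  One possible client-side
way, sketched below in Python, is to force `sslmode=disable` in the
libpq connection string used for the subscription.  `sslmode` is a
standard libpq parameter; the helper name and the example conninfo
values are hypothetical, and the simple whitespace split assumes no
quoted values containing spaces.

```python
def with_ssl_disabled(conninfo: str) -> str:
    """Return a libpq conninfo string with sslmode forced to 'disable'.

    Any existing sslmode=... setting is dropped first.
    Assumption: keyword=value pairs are separated by whitespace and
    no value is quoted with embedded spaces.
    """
    parts = [p for p in conninfo.split() if not p.startswith("sslmode=")]
    parts.append("sslmode=disable")
    return " ".join(parts)


if __name__ == "__main__":
    # Hypothetical conninfo; substitute the real database, host and user.
    print(with_ssl_disabled("dbname=ticker host=master.example.com user=slony"))
```

Once the initial copy completes, the original connection string (or
the server's SSL setting) can be restored.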

-- Karl