Tory M Blue tmblue at gmail.com
Fri Nov 30 08:44:46 PST 2012
On Fri, Nov 30, 2012 at 6:39 AM, Steve Singer <ssinger at ca.afilias.info> wrote:

> On 12-11-29 11:23 PM, Tory M Blue wrote:
>
>>
>> Well, this is frustrating. I was successful in replicating a smaller
>> data set without issue. Once I try to replicate large amounts of data,
>> it seems to fail and restart at what feels like the end of the biggest
>> table of each set.
>>
>> 2012-11-29 19:54:06 PST CONFIG remoteWorkerThread_1: 16341.883 seconds
>> to copy table "tracking"."spotimpressions"
>> 2012-11-29 19:54:06 PST CONFIG remoteWorkerThread_1: copy table
>> "tracking"."impressions"
>> 2012-11-29 19:54:06 PST CONFIG remoteWorkerThread_1: Begin COPY of table
>> "tracking"."impressions"
>> 2012-11-29 19:54:06 PST ERROR  remoteWorkerThread_1: "select
>> "_admissioncls".copyFields(19);"
>> 2012-11-29 19:54:06 PST WARN   remoteWorkerThread_1: data copy for set 2
>> failed 1 times - sleep 15 seconds
>>
>> This large table ran for 4+ hours, and the minute it starts on the very
>> next table, it "fails". Identical behavior when doing set 1, which also
>> has a large table:
>>
>>
>> 1235574-2012-11-29 12:22:12 PST CONFIG remoteWorkerThread_1: Begin COPY
>> of table "cls"."customers"
>> 1235665-2012-11-29 12:22:12 PST ERROR  remoteWorkerThread_1: "select
>> "_admissioncls".copyFields(8);**"
>> 1235759:2012-11-29 12:22:12 PST WARN   remoteWorkerThread_1: data copy
>> for set 1 failed 1 times - sleep 15 seconds
>> Followed sometime later by this
>> 2012-11-29 12:22:28 PST DEBUG2 remoteWorkerThread_2: forward confirm
>> 3,5001168772 received by 4
>> 2012-11-29 12:22:28 PST INFO   copy_set 1 - omit=f - bool=0
>> 2012-11-29 12:22:28 PST INFO   omit is FALSE
>>
>>
>> So what's going on? It appears to have made it through the heavy
>> lifting, but it immediately fails as it starts a much smaller
>> table. Why does it wait until it has made it through the largest table
>> in the set before it says "bahh, just kidding"?
>>
>> AHHH, interesting: yet again, at the moment of "fail", a log switchover
>> is starting. This is identical for each and every failure. Why does a
>> log switch appear right before every failure?!
>>
>> Can I disable this for a test, i.e. disable the log switch?
>>
>
> You can disable/alter the log switch by making the cleanup interval in the
> slon on the master very large, longer than your tests/subscriptions
> take to run (see cleanup_interval (interval) at
> http://www.slony.info/documentation/2.1/slon-config-interval.html ).
>
> This is for testing purposes; I'm not recommending this as a solution. I
> also doubt this is the cause of your problem (but let us know if it does
> turn out to be that, because it would mean something is wrong, somewhere).
>
> You never did send me the output of:
> select "_admissioncls".copyFields(19);  or the equivalent from your master.
>  You also never sent any information about the schema on the problem table.
>
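
Steve's cleanup_interval suggestion could be sketched as a slon runtime-config entry on the origin node; the file name and the exact value below are assumptions for illustration, not something taken from this thread:

```ini
# slon.conf for the origin node (testing only):
# stretch the cleanup/log-switch interval well past the time the
# subscription takes, so no log switch can land mid-COPY
cleanup_interval="12 hours"
```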
I actually did, or thought I did.

I ran this query on the non-replicated host, and I ran it on the master,
and it came back fine. So it's there, but for some reason it's the last
thing printed on the replication node before each failure.

I've gone ahead and replicated a smaller database (identical schema,
tables, etc.) over the tubes with no issue; it's finishing up now. But the
size here causes one table to be in the copy stage for over an hour. It
seems to finish, and right after that is when things go south and it
restarts.

Thanks again!
Tory

admissionclsdb=# select "_admissioncls".copyFields(8);
                                   copyfields
--------------------------------------------------------------------------------
 ("cust_seq_id","customer_id","customer_name","disabled","rimfire_account","default_status","impression_credits","cost_fractional_cpm","acct_list","pricing_type","pricing_model","subscription_price","monthly_price","monthly_minimum","account_type")
(1 row)



> You might want to turn query logging on for the origin/provider node (at
> least for the slony user).  This will tell us exactly what the SQL being
> executed is when the error occurs.
>
>
I'll give this a try.
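
The query logging Steve describes can be scoped to just the replication role rather than the whole server; a sketch, assuming the Slony superuser role is named slony:

```sql
-- On the origin/provider node. Takes effect for new sessions,
-- so restart the slons afterwards.
ALTER ROLE slony SET log_statement = 'all';
```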


> Possibilities include:
> 1) copyFields() is still returning something bad, i.e. ')' for this table,
> so the SQL that later gets executed is
> COPY ()) FROM "tracking"."impressions";
>
> or some other bad SQL in the copy.
>
> 2) The connection is actually aborting during the copy for
> connection-related reasons.  In the past people have reported issues
> where their firewall resets connections after x minutes.  We've also in
> the past had issues with openssl where some limit was reached and the
> connection was killed due to an openssl issue.
>
I really thought about this, but the one large table dies exactly after
it's finished, in just over an hour; the last test, with the bigger table,
took over 4 hours. Neither of them appeared to have an EOF or other
connection-reaping event.
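
If connection reaping ever does become a suspect, TCP keepalives can be enabled from the client side via standard libpq conninfo parameters; a sketch, where the host, database, and timing values are assumptions:

```
# conninfo passed to slon for this node's provider connection
"host=master dbname=admissionclsdb user=slony
 keepalives=1 keepalives_idle=60 keepalives_interval=10 keepalives_count=3"
```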


> 3) Something else
>
>
I'm sure it's something. I just can't figure it out.

Thanks again!
Tory


More information about the Slony1-general mailing list