Christopher Browne cbbrowne at ca.afilias.info
Fri Jun 22 14:46:56 PDT 2007
Shaun Thomas wrote:
> Howdy folks,
>
> We're in the middle of a migration / upgrade, and I've got a giant slony 
> set in place, and I get no errors on anything, and syncing starts up 
> just great.  But something seems to be weird here:
>
> 2007-06-21 19:08:44 CDT FATAL  cleanupThread: "delete 
> from "_replication".sl_log_1 where log_origin = '10' and log_xid 
> < '757377'; delete from "_replication".sl_log_2 where log_origin = '10' 
> and log_xid < '757377'; delete from "_replication".sl_seqlog where 
> seql_origin = '10' and seql_ev_seqno < '2'; 
> select "_replication".logswitch_finish(); " - server closed the 
> connection unexpectedly
>
> After it copies a huge amount, say 15-17GB of our 40-45GB total, the 
> pace slows from about 300MB per minute to 5MB / minute, then to almost 
> nothing.  The remote system we're mirroring to has an idle disconnect 
> which is likely killing the connection in question, causing a giant 
> rollback of current progress.  The FATAL error above tells me it's 
> doing a log switch on Node 10, which makes no sense, since Node 10 is a 
> slave, and should have no events.  This is also the same error I get, 
> every single time, even though the log_xid number itself may change.
>
> So my questions:
>
> 1. Why is log switching on node 10, instead of node 1, which is 
> providing the data?
>   
That'll take place routinely on all nodes; Slony needs to switch between
sl_log_1 and sl_log_2 periodically, and does so on every node.
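
For reference, each node records which log table is currently active in
the sl_log_status sequence; a quick way to peek at the switch state on
any node (assuming the "_replication" cluster name from your error
message) is:

```
-- Assumes cluster name "_replication" as in the log above.
-- If memory serves: 0 = sl_log_1 active, 1 = sl_log_2 active,
-- 2/3 = a switch to sl_log_1/sl_log_2 is still in progress.
SELECT last_value FROM "_replication".sl_log_status;
```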

There's something odd about the problem with logswitch_finish(); can you
check your logs to see if the DBMS saw a Signal 11 or such at 19:08:44? 
If node 1 is the origin, that set of queries should trivially run
quickly with no muss and fuss on node 10, as there shouldn't be *any*
data in sl_log_1/2 on node 10.
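
You can verify that directly on node 10; if it really is a pure
subscriber, both counts should come back at or near zero, and the
cleanup deletes should be near-instant:

```
-- Run on node 10 (cluster name "_replication" taken from the log above).
SELECT count(*) FROM "_replication".sl_log_1;
SELECT count(*) FROM "_replication".sl_log_2;
```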

It seems as though there may be something funky happening at the network
level, not particularly diagnosable (nor controllable) at the DBMS level...
> 2. Why is this mysterious log switch stalling the data copy, so our idle 
> timer slaughters the initial table COPY commands mid-progress?
>
>   
The log switch shouldn't be having that effect; it doesn't make sense
for it to break things.
> 3. Is there some way the initial copy can *not* be an "all or nothing" 
> proposition?  45GB seems an awfully huge first-bite, and it seems 
> unfair that not a single error or disconnect may occur during the 
> entire process of copying that much data.  Checkpoints?  Something? 
> Maybe a configuration for a heartbeat, anything I missed?
>
>   
If you have multiple tables, you could set up a replication set per
table, and subscribe one table at a time.  In practice, you probably
have five tables that are bigger than all the others put together; if
you set up a set for each of those 5, and a set for "the rest," that's
probably about as good as it can get.
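
A rough slonik sketch of that approach (the conninfo strings, table
names, and set/table IDs here are made-up placeholders, not from your
setup):

```
cluster name = replication;
node 1 admin conninfo = 'dbname=mydb host=master user=slony';
node 10 admin conninfo = 'dbname=mydb host=slave user=slony';

# One set per big table; repeat for each of the large tables,
# then one more set for "the rest".
create set (id = 2, origin = 1, comment = 'big table one');
set add table (set id = 2, origin = 1, id = 101,
               fully qualified name = 'public.big_table_1');
subscribe set (id = 2, provider = 1, receiver = 10, forward = no);
```

Subscribe one set at a time, and once each one has caught up you can use
MERGE SET to fold them back into a single set so there's only one thing
to maintain afterwards.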
> 4. Is it possible to somehow... bootstrap the mirror?  Make an exact 
> data copy of the current database and have slony only copy updates 
> after a certain point?  I mean, I could probably do a dump/restore and 
> let slony keep everything up to date, before our systems launch the 
> nightly insert jobs.
>
>   
Jan's thinking about having a way to do this with Slony-I 2.x with PG
8.3; it's still a glimmer in the eye, at this point...


More information about the Slony1-general mailing list