Jan Wieck JanWieck at Yahoo.com
Mon Jun 25 07:30:34 PDT 2007
On 6/22/2007 6:07 PM, Shaun Thomas wrote:
> On Friday 22 June 2007 04:46:56 pm Christopher Browne wrote:
> 
> 
>> That'll take place routinely on all nodes; slony
>> needs to switch between sl_log_1 and sl_log_2 periodically,
>> and does so on every node. 
> 
> I was wondering about that.  I was mostly suspicious because it copies 
> like mad until about 15-17G, and then slows to a crawl before erupting 
> with those errors and aborting the sync.  The logswitch was just where 
> it kept dying, so I kept it as a possible candidate.

Any error on any DB connection will usually cause slon to restart 
everything internally. Do you by any chance run the slon serving the 
node behind the firewall from the outside? In that case you will have 
trouble, because the cleanup thread is idle for most of its life, and by 
the time it tries to do its work, your firewall has already pulled the 
plug on its connection. That is never going to work; you will have to 
run the slon on the remote site.

The slowdown you experience might be related to how Slony does the 
initial copy. For each table it disables index maintenance, copies the 
data, then re-enables the indexes and issues a REINDEX for that table. 
So if at that point it is copying medium-sized tables with many indexes, 
this is exactly what I would expect to happen.
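Conceptually, the per-table copy boils down to the following sequence (a 
simplified sketch only; the real code manipulates the catalog directly, 
the table name here is hypothetical, and the exact steps vary by Slony 
and PostgreSQL version):

```sql
-- Sketch of Slony-I's per-table initial copy (simplified illustration).

-- 1. Disable index maintenance for the table:
UPDATE pg_catalog.pg_class
   SET relhasindex = false
 WHERE relname = 'my_table';          -- hypothetical table name

-- 2. Bulk-load the data without touching any index:
COPY my_table FROM STDIN;

-- 3. Re-enable index maintenance:
UPDATE pg_catalog.pg_class
   SET relhasindex = true
 WHERE relname = 'my_table';

-- 4. Rebuild every index from scratch; on a large table with many
--    indexes this step dominates the total copy time:
REINDEX TABLE my_table;
```

This is why a big table can appear to "hang": the COPY itself is fast, 
but the REINDEX at the end has to rebuild every index before the 
subscription transaction can proceed.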

> 
>> There's something odd about the problem with logswitch_finish();
>> can you check your logs to see if the DBMS saw a Signal 11 or
>> such at 19:08:44?  
> 
> I didn't see any Sig-11, but postgres *did* whine about the client 
> unexpectedly closing the connection.  But considering the slony client 
> was just as confused about the connection drop, that's what made me 
> think of the Savvis firewall getting pissy.

Slon itself never does a proper PQfinish() call to close the 
connections. So whenever slon is stopped or internally restarted, you 
will see those messages about clients disconnecting. However, the 
firewall is still the source of your troubles. It is just plain wrong to 
kill a perfectly healthy TCP connection just because it wasn't used for 
a while. You might want to try using the tcp_keepalives_* config options 
in the postgresql.conf file of the poor server behind that stupid flamewall.
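For example, in postgresql.conf (the values below are illustrative; what 
matters is that the first probe is sent well before the firewall's idle 
timeout expires):

```
# Send TCP keepalive probes on otherwise idle connections so the
# firewall sees traffic before its idle timeout kicks in.
tcp_keepalives_idle = 300       # idle seconds before the first probe
tcp_keepalives_interval = 60    # seconds between probes
tcp_keepalives_count = 5        # lost probes before the OS gives up
```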

> 
>> If you have multiple tables, you could set up a replication set per
>> table, and subscribe one table at a time.  In practice, you probably
>> have five tables that are bigger than all the others put together;
> 
> That's exactly the case.  Not counting indexes, I have one 5GB table 
> with 44M rows, a 4GB table with 50M rows, and a 2.5GB table with 10M 
> rows.  
> If I put those in their own sets, the rest would have no problem being 
> in the remainder.
> 
> But I wonder about something - does slony turn off indexes to facilitate 
> the data copy, and then reenable them so they're all created after the 
> copy is done?  If that's the case, that could be my problem.  17G is 
> about the size of all the tables with no indexes, fully vacuumed.  If 
> slony is waiting around forever for the indexes to finish generating 
> before committing and declaring the initial copy successful, that could 
> account for my idle time.

It does that table by table.


Jan

-- 
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck at Yahoo.com #

