Christopher Browne cbbrowne
Thu Sep 29 15:29:14 PDT 2005
Joe Markwardt wrote:

>On Thu, 2005-09-29 at 01:09 +0300, Hannu Krosing wrote:
>
>>On Wed, 2005-09-28 at 12:37 -0500, Joe Markwardt wrote:
>>
>>>On Tue, 2005-09-27 at 21:08 -0400, Christopher Browne wrote:
>>>
>>>>Joe Markwardt wrote:
>>>>
>>>>>On Wed, 2005-09-28 at 01:20 +0300, Hannu Krosing wrote:
>>>>>
>>>>>>On Tue, 2005-09-27 at 16:37 -0500, Joe Markwardt wrote:
>>>>>>
>>>>>>>Yep.  And I restarted the slons as well, so they should be using the
>>>>>>>index as long as the planner thinks it will help.  Is there any way I can
>>>>>>>see the query that slon is generating to create the cursor?  If I could
>>>>>>>see that query then I could do some explain analyzes and see what they
>>>>>>>say...
>>>>>>>
>>>>>>I usually look at the queries in the postgres log.
>>>>>>
>>>>>>In your case you could set up postgres to log only long queries, say
>>>>>>over 50 sec, and then just tail -f the log.
>>>>>>
>>>>>>When analysing, note that slon specifically sets enable_seqscan and
>>>>>>enable_indexscan before running the query on sl_log_1.
>>>>>>
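(For reference, logging only the long-running statements amounts to something
like the following in postgresql.conf; the 50-second threshold is just the
value suggested above, and the setting takes milliseconds:)

   # log any statement that runs longer than ~50 seconds
   # (log_min_duration_statement exists in PostgreSQL 7.4 and later;
   #  -1 disables it, 0 logs every statement)
   log_min_duration_statement = 50000

A pg_ctl reload (or a SIGHUP to the postmaster) is enough to pick it up.
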
>>>>>Thanks for the tip, but the only queries that are being logged are the
>>>>>FETCH statements... I suppose I could set it up to log all queries (and
>>>>>plans) and then restart slon, then turn off the query logging.
>>>>>
>>>>Right you are...
>>>>
>>>>The interesting query is the one that sets up the cursor, and it doesn't
>>>>take long :-(.
>>>>
>>>>You can tell how many SYNCs are being processed together by looking at
>>>>two consecutive lines that say "SYNC XXXX done in Y.xyz seconds"
>>>>
>>>>(They'll have lines in between...)
>>>>
>>>>If it's "SYNC 1610644 done in 119 seconds" followed (2 minutes later) by
>>>>"SYNC 1610724 done in 119 seconds", then you know that it's doing groups
>>>>of 80, and even if it's not catching up quickly, that's at least the
>>>>right direction to go in.
>>>>
>>>>
>>>Chris, going through the logs from yesterday afternoon, it appears that
>>>it is varying the number of SYNCs processed each time; isn't -o0
>>>supposed to force it to always do 80?
>>>
>>>I keep getting log entries like this:
>>>
>>>2005-09-28 11:06:29 CDT DEBUG3 calc sync size - last time: 1 last
>>>length: 127436 ideal: 80 proposed size: 2
>>>
>>>with varying values for proposed size...
>>>
>>>Attached you'll find the output of the explain analyzes of the cursor
>>>declaration I got from the logs.  It appears that it is using the new
>>>index, and the new index is indeed faster than a sequential scan on the
>>>table.  I'm not sure if there is anything else that could be done to
>>>make this query run faster while it's catching up, but I'll play around
>>>some.
>>>
>>So it takes 114 secs for an index scan over ~300k xids' worth of rows,
>>returning nothing.  What hardware are you running on (RAM/disks/CPU)?
>>
>6-disk 15Krpm FC RAID 1+0 array, dual Opteron 246s, 8GB of RAM
>
>From what I can tell from the plan, it's ~120 secs for an index scan over
>~14 million rows, expecting to find 300K but not finding any.  Also,
>according to vmstat and iostat, I'm not hitting the disk at all, so I'm
>assuming that means the entire sl_log_1 table, or at least the index it's
>using for this query, is cached in memory...
>
That's not too surprising.
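If you want a quick sanity check on just how big sl_log_1 and its indexes
have gotten (and hence whether they plausibly fit in that 8GB), something
along these lines against pg_class will tell you; the schema name prefix is
left off here, as elsewhere in this thread:

   -- relpages * 8kB is the on-disk size; reltuples is the planner's estimate
   select relname, relkind, relpages, reltuples
     from pg_class
    where relname like 'sl_log_1%';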

>>Depending on how big sl_log_1 is, you may want to do a reindex on
>>sl_log_1.
>>
>Hmm, well, as part of my testing I dropped all the indexes on sl_log_1 and
>recreated them, so this has in essence already been done, if I understand
>the REINDEX command correctly.
>
Correct.

>>Or maybe even
>>cluster  sl_log_1_idx2 on sl_log_1;
>>
>
>That's an idea; I'll give it a try and see if it helps.
>
clustering essentially does a reindex and VACUUM FULL in one swell foop,
which, if the table has way too much dead space, is a better deal than
VACUUM FULL.

Based on your other comments, I'd suggest that you wait a while until
the cleanup thread gets around to purging out at least the 5M row "clot"
in sl_log_1.  Once you see that "vacuum verbose analyze
_my_slony_schema.sl_log_1;" is running reasonably quickly and is finding
only a fairly small number of live tuples (ideally, hundreds; more
likely some thousands, hopefully not tens of thousands...), THEN I'd do
the cluster.

1.  Because the VACUUM was just run, and ran quickly, you know the data
is all in cache.
2.  Because there is only a limited population of live tuples, it won't
spend forever building the index and ordering the remaining tuples.
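
In concrete terms, the sequence would look something like this (schema name
as above; the CLUSTER spelling is the one Hannu quoted, and note that CLUSTER
takes an exclusive lock on the table while it rewrites it):

   -- repeat until this runs quickly and reports only a modest number
   -- of live tuples
   vacuum verbose analyze _my_slony_schema.sl_log_1;

   -- then rewrite/reorder the table and rebuild its indexes in one pass
   cluster sl_log_1_idx2 on _my_slony_schema.sl_log_1;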

>
>>OTOH, if your client-side processing needs allow it, you could just
>>re-subscribe the biggest tables (preferably dropping all indexes except the
>>PK or UNIQUE index used by Slony before doing it).  It is often faster than
>>catching up.
>>
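(For what it's worth, the re-subscribe approach boils down to a slonik script
roughly like this, assuming the big tables live in set 1 with node 1 as the
origin and node 2 as the subscriber; the ids are placeholders, and you'd drop
the non-essential indexes on the subscriber before running it:)

   include 'preamble';
   # drop the existing subscription, then copy the set over from scratch
   unsubscribe set (id = 1, receiver = 2);
   subscribe set (id = 1, provider = 1, receiver = 2, forward = no);
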
>
>I was hoping to find a way to get replication to catch up (and stay
>reasonably caught up) because, from what I can tell, while this update load
>is unusual for now, it looks like it could become commonplace
>within the next 6 months.  And while re-subscribing is way faster (it takes
>about 8 hours to complete for the entire set), it's not a "good" thing to
>do every weekend after the heavy updates run.  If that's what I have to
>do to deal with single queries that change large numbers of rows and
>maintain a decent lag time between the two nodes, then I guess that's what I
>have to do.  It would be nice if slony could handle this edge case a
>little more gracefully (just replicate the one query instead of turning it
>into 5 million), but I certainly can't see any way around it when you use
>triggers, and it fits the rest of our needs well.
>
>Anyway, it appears that it finally got over the 5-million-row update
>transaction, and now it's flying along the way it usually does.
>
>2005-09-29 08:36:15 CDT DEBUG2 remoteWorkerThread_1: SYNC 1691933 done
>in 0.798 seconds
>
>From the way things look, it should be caught up before the end of the
>day :-)
>
It seems to me that you got hit by a bit too much at once, which was
unfortunate.

Some things to consider...

1.  Slony-I 1.2 should handle subscriptions *way* faster, as it
deactivates indexing while copying data.  In common scenarios this looks
likely to triple the speed of the COPY_SET compared to earlier versions.

2.  There is a way to handle the "5 million updates" scenario, albeit in
a manner that requires some planning, and *possibly* an application outage.

You could submit the 5-million-row update via EXECUTE SCRIPT; the result
would be that the update is submitted as a single query on each node.
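
The slonik invocation for that looks roughly like the following; the set id,
the path to the SQL file, and the event node are placeholders for your own
configuration:

   include 'preamble';
   # set id: the set containing the affected tables
   # filename: the file holding the big UPDATE
   # event node: the origin node, where the script is submitted
   execute script (
       set id = 1,
       filename = '/tmp/mass_update.sql',
       event node = 1
   );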

There is a serious caveat to this (one which is pretty well inevitable
and not a "bug to be fixed"), namely that EXECUTE SCRIPT needs to lock
all the tables in the set for the duration of the queries.

That is, on each node, EXECUTE SCRIPT does the following:

 for each r_table in replication_set_1 do
     lock table r_table
     remove replication triggers from table r_table
 done
 execute the SQL script
 for each r_table in replication_set_1 do
     restore replication triggers to table r_table
     unlock table r_table
 done

That means that every table in the set is locked, then the 5M row update
runs for (oh, say) 8 minutes, and then every table is unlocked.

Applications that try to access the origin node concurrently with this
may variously:
 - Cause deadlocks that prevent the EXECUTE SCRIPT from running at all
 - Get locked out for 8 minutes

If you can set some quasi-regular schedule for doing these sorts of mass
updates, then it at least won't come as a surprise...

I know some people have had difficulty with deadlocks when trying to run
EXECUTE SCRIPT because they couldn't declare an outage; your mileage may
certainly vary :-).

One quasi-devious thought which comes to mind is to try to use LOCK SET
to assert an outage...

So you might loop along these lines...

#!/bin/bash
signal=1
while [ "$signal" -eq 1 ]; do
  slonik <<EOF
include 'preamble';
try {
    lock set (id = 1);
    exit 0;
} on error {
    exit 1;    # probably means we hit a deadlock
}
EOF
  signal=$?
done  # We don't get out of the loop until we successfully lock the set...
# Now, the set is locked...
slonik <<EOF
include 'preamble';
unlock set (id = 1);     # unlock it, then run the script...
execute script ...
EOF

