Jeff threshar at torgo.978.org
Tue Jul 15 07:46:45 PDT 2008
So, we've had a problem recently where Slony would get itself into
trouble: sl_log would grow and grow and, of course, our beloved
readers would back up scanning the data.

The slave machines were mostly idle, replication was mostly in sync  
(although when the backlog got high it would lag a bit - but we're  
talking 5-10 minutes) and events were being produced and confirmed.   
So why was the cleanup thread not giving any love?

There were no long running txns and no idle txns.
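
(Something along these lines is enough to check that - just a sketch,
assuming the 8.x catalog layout where pg_stat_activity has procpid and
idle transactions show up as '<IDLE> in transaction'; a txn can be older
than its current query, so it isn't airtight:)

select procpid, usename, query_start, backend_start, current_query
  from pg_stat_activity
 where current_query = '<IDLE> in transaction'
    or query_start < now() - interval '30 minutes'
 order by query_start;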

So after learning the guts of Slony a bit I figured it out. It helped
that another problem arose along the way, where I caught pg_dump with
its hand in the cookie jar - our db has grown big enough that pg_dump
isn't really useful anymore, since it holds locks on those tables for
far too long, which was also causing backlog (logswitch_finish taking
the exclusive lock on the config table). That problem is already
documented from what I see.
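
(As an aside, one easy way to spot that kind of lock pile-up is to look
for ungranted lock requests. A sketch - nothing pg_dump-specific about
it, and the regclass cast just makes the relation readable:)

select l.pid, l.relation::regclass as relation, l.mode, l.granted
  from pg_locks l
 where not l.granted;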

Anyway, I still had the issue of it growing, so here's the rundown of
me hunting it down. I'd wager if I hadn't been woken up at 3am the
last few nights I would have come across this sooner.

The cleanup thread keeps only the latest confirmation and kills all
events previous to that latest confirm. Then we take the lowest xid
from the events we have left and nuke everything in the log before
that. That lowest xid (ev_minxid) is the lowest xid that was still
running when the event occurred. So, let's do some digging.
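
In rough SQL terms it boils down to something like this (just a
read-only sketch of the rule as I understand it, not the actual
cleanup code; _replication is our cluster schema):

-- the confirmations that survive are the newest per origin/receiver pair:
select con_origin, con_received, max(con_seqno) as keep_seqno
  from _replication.sl_confirm
 group by con_origin, con_received;

-- events older than the oldest surviving confirmation get dropped, and
-- the lowest ev_minxid among the events that remain is the cutoff: only
-- sl_log rows with a log_xid below it can be removed.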

First, what is our lowest xid in the events?

somedb@[local] PRODUCTION # select * from _replication.sl_event order by ev_minxid asc limit 1;
-[ RECORD 1 ]+---------------------------------------------------------------
ev_origin    | 1
ev_seqno     | 27348789
ev_timestamp | 2008-07-15 09:53:22.322241
ev_minxid    | 2803277293
ev_maxxid    | 2804440182
ev_xip       | '2803277293','2803468364','2803476694','2804440169','2803463327','2803481337','2804440181','2803482209','2804392536','2803495014','2804288504','2803461584','2803492317','2803435238'
ev_type      | SYNC
ev_data1     |
ev_data2     |
ev_data3     |
ev_data4     |
ev_data5     |
ev_data6     |
ev_data7     |
ev_data8     |


Our earliest event has a minxid of 2803277293.

So, let's see what is in the log (we're using log 2 right now):

somedb@[local] PRODUCTION # select count(*) from _replication.sl_log_2  
where log_xid < '2803277293';
  count
-------
      0
(1 row)

Hmmm, nothing earlier than that. What about after?

somedb@[local] PRODUCTION # select count(*) from _replication.sl_log_2  
where log_xid > '2803277293';
  count
--------
  374009
(1 row)


That's a good amount of backlog in there.

So what is this txn? Let's dig it up. An open txn will always have at
least one lock, so we can use pg_locks, which gets us the pid:

somedb@[local] PRODUCTION # select distinct pid from pg_locks where  
transaction = '2803277293';
  pid
------
  8207
(1 row)

OK, pid 8207 is being naughty - now we can look it up in
pg_stat_activity:

somedb@[local] PRODUCTION # select * from pg_stat_activity where procpid = 8207;
-[ RECORD 1 ]-+-------------------------------
datid         | 447595047
datname       | xxxx
procpid       | 8207
usesysid      | 447595046
usename       | xxxx
current_query | [some query]
waiting       | f
query_start   | 2008-07-15 10:03:55.408807-04
backend_start | 2008-07-14 20:30:01.099457-04
client_addr   | somebox
client_port   | 38380
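
(Incidentally, the two lookups can be rolled into one query - a sketch,
using the same pre-8.3 column names as above:)

select distinct a.procpid, a.usename, a.client_addr, a.client_port,
       a.backend_start, a.current_query
  from pg_locks l
  join pg_stat_activity a on a.procpid = l.pid
 where l.transaction = '2803277293';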


Now, you may be able to identify it via the query that is running, but
if not, there is hope (and further proof). We have the client port as
38380 on somebox:


[root@somebox ~]# /usr/sbin/lsof -n | grep 38380
pgpool    13561    root    9u   IPv4   14878546   TCP 1.2.3.4:38380->xxxxx:postgres (ESTABLISHED)

We run pgpool, so the app is going through there. If you were
connecting directly you'd see whatever process it was.

[root@somebox ~]# lsof -n -p 13561 | grep PGSQL

Look for the latest socket:

pgpool  13561  root  8u  unix 0xffff81004f066100   14879580 /tmp/.s.PGSQL.5432

Next, search for 1 less than that 580 number, which should be our client:

[root@somebox ~]# /usr/sbin/lsof -n | grep 14879579
badapp  13573  xxxxxx  4u  unix 0xffff81004f067180   14879579 socket

There you have it. Now you can add more commits to badapp, or see wtf
it is doing, or kill it. Then at your next cleanup event it should be
able to nuke the logs. (If you had a major backlog you may want to
start a log switch by calling logswitch_start() so you can get a
truncate in there to clear out the dead rows. Or you could vac full
it, but that is a silly idea.)
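
(For reference, that looks something like this - a sketch, using the
same _replication cluster schema as above:)

select _replication.logswitch_start();

-- then keep an eye on the two log tables; the idle one gets truncated
-- at the next switch:
select (select count(*) from _replication.sl_log_1) as log_1_rows,
       (select count(*) from _replication.sl_log_2) as log_2_rows;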

--
Jeff Trout <jeff at jefftrout.com>
http://www.stuarthamm.net/
http://www.dellsmartexitin.com/




