Tue Jul 15 07:46:45 PDT 2008
So, we've had a problem recently where Slony would get itself into trouble: sl_log would grow and grow, and of course our beloved readers would back up scanning the data. The slave machines were mostly idle, replication was mostly in sync (when the backlog got high it would lag a bit, but we're talking 5-10 minutes), and events were being produced and confirmed. So why was the cleanup thread not giving any love? There were no long-running transactions and no idle transactions.

After learning the guts of Slony a bit, I figured it out. (It also helped that another problem arose where I caught pg_dump with its hand in the cookie jar: our db has grown big enough that pg_dump isn't really useful anymore. It holds locks on those tables for far too long, which was also causing backlog, since logswitch_finish() takes an exclusive lock on the config table. That problem is already documented from what I can see.) Even so, I still had the issue of the log growing, so here's the rundown of hunting it down. I'd wager that if I hadn't been woken up at 3am the last few nights, I would have come across this sooner.

The cleanup thread deletes all but the latest confirmed event, along with all events previous to that latest confirm. Then it takes the lowest xid from the events that are left and nukes everything in the log before it. An event's lowest xid is the lowest xid that was still running when the event occurred.

So, let's do some digging. First, what is the lowest xid in our events?

somedb@[local] PRODUCTION # select * from _replication.sl_event order by ev_minxid asc limit 1;
-[ RECORD 1 ]-+---------------------------
ev_origin     | 1
ev_seqno      | 27348789
ev_timestamp  | 2008-07-15 09:53:22.322241
ev_minxid     | 2803277293
ev_maxxid     | 2804440182
ev_xip        | '2803277293','2803468364','2803476694','2804440169','2803463327','2803481337','2804440181','2803482209','2804392536','2803495014','2804288504','2803461584','2803492317','2803435238'
ev_type       | SYNC
ev_data1      |
ev_data2      |
ev_data3      |
ev_data4      |
ev_data5      |
ev_data6      |
ev_data7      |
ev_data8      |

Our earliest event has an ev_minxid of 2803277293. So let's see what is in the log (we're using log 2 right now):

somedb@[local] PRODUCTION # select count(*) from _replication.sl_log_2 where log_xid < '2803277293';
 count
-------
     0
(1 row)

Hmm, nothing earlier than that. What about after?

somedb@[local] PRODUCTION # select count(*) from _replication.sl_log_2 where log_xid > '2803277293';
 count
--------
 374009
(1 row)

That's a good amount of backlog in there. So what is this transaction? Let's dig it up. An open transaction will always hold at least one lock, so we can use pg_locks, which gets us the pid:

somedb@[local] PRODUCTION # select distinct pid from pg_locks where transaction = '2803277293';
 pid
------
 8207
(1 row)

Ok.
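(Side note: you can collapse the pg_locks and pg_stat_activity steps into a single query. This is just a sketch, assuming the same pre-8.3 catalogs as the session here, i.e. pg_locks.transaction and pg_stat_activity.procpid:

-- find the backend holding our oldest event xid open, in one shot
select a.procpid, a.usename, a.query_start, a.client_addr, a.client_port
  from pg_locks l
  join pg_stat_activity a on a.procpid = l.pid
 where l.transaction = '2803277293';

Same answer, one less copy/paste.)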
pid 8207 is being naughty. Now we can look it up in pg_stat_activity:

somedb@[local] PRODUCTION # select * from pg_stat_activity where procpid = 8207;
-[ RECORD 1 ]-+------------------------------
datid         | 447595047
datname       | xxxx
procpid       | 8207
usesysid      | 447595046
usename       | xxxx
current_query | [some query]
waiting       | f
query_start   | 2008-07-15 10:03:55.408807-04
backend_start | 2008-07-14 20:30:01.099457-04
client_addr   | somebox
client_port   | 38380

Now, you may be able to identify the culprit from the query that is running, but if not, there is still hope (further proof follows). We have the client port, 38380, on somebox:

[root at somebox ~]# /usr/sbin/lsof -n | grep 38380
pgpool 13561 root 9u IPv4 14878546 TCP 1.2.3.4:38380->xxxxx:postgres (ESTABLISHED)

We run pgpool, so the app is going through there. If you were connecting directly, you'd see whatever process it was right here. Since it's pgpool, list that pgpool process's open files and look for the latest socket:

[root at somebox ~]# lsof -n -p 13561 | grep PGSQL
pgpool 13561 root 8u unix 0xffff81004f066100 14879580 /tmp/.s.PGSQL.5432

Next, search for one less than that 14879580 socket number, which should be our client:

[root at somebox ~]# /usr/sbin/lsof -n | grep 14879579
badapp 13573 xxxxxx 4u unix 0xffff81004f067180 14879579 socket

There you have it. Now you can add more commits to badapp, see what it is doing, or kill it. At your next cleanup event, Slony should then be able to nuke the logs. If you had a major backlog, you may want to start a log switch by calling logswitch_start(), so you can get a TRUNCATE in there to clear out the dead rows. Or you could VACUUM FULL the log table, but that would be a silly idea.
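(For reference, kicking off that log switch is just a function call in the cluster schema. A sketch, assuming the "_replication" cluster schema from the sessions above and that sl_log_2 is the active log:

somedb@[local] PRODUCTION # select _replication.logswitch_start();

New rows then go to sl_log_1, and once the old transactions drain out, the cleanup thread runs logswitch_finish() and gets to TRUNCATE sl_log_2 instead of slowly deleting from it.)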
--
Jeff Trout <jeff at jefftrout.com>
http://www.stuarthamm.net/
http://www.dellsmartexitin.com/