Mon Sep 21 11:23:27 PDT 2009
- Previous message: [Slony1-general] Why does sl_status event lag grow, even though events *are* replicated?
- Next message: [Slony1-general] Why does sl_status event lag grow, even though events *are* replicated?
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
On Mon, 2009-09-21 at 12:47 +0100, Peter Geoghegan wrote: > Hello, > > This is a re-hash of an earlier mail that went unanswered. Doing this > probably doesn't show decorum, but I'm at the end of my tether with > this problem. I'd appreciate any help that you can give to fix this > problem, because it's a real irritation - my production system is > affected by it. At this point, I'd welcome even wild speculation. > > Replication ostensibly works fine. We replicate from a windows Master > (node 1), using Hiroshi Saito's Slony-I 2.0.2 binaries, to 2 OpenSuse > slaves (nodes 2 and 3). It's all fairly standard. > > When I restart a slave database (in the following example node 2), > replication continues to work (at least as far as can be immediately > observed), but sl_status shows: > > st_origin | st_received | st_last_event | st_last_event_ts | > st_last_received | st_last_received_ts | > st_last_received_event_ts | st_lag_num_events | st_lag_time > -------------+-----------------+------------------------------------------------+----------------------------+------------------+----------------------------+----------------------------+-------------------+----------------- > 1 | 3 |38689 | "2009-07-30 12:11:51.796" | > 38688;"2009-07-30 12:12:02.428316" | "2009-07-30 12:11:41.859" > |1 |"00:00:14.015" > 1 | 2 |38689 | "2009-07-30 12:11:51.796" | > 38605;"2009-07-30 11:52:35.119048" | "2009-07-30 11:58:05.734" > |84 |"00:13:50.14" > > Node 2's st_lag_num_events grows and grows, until the slony-I service > (all slon daemons) is restarted on the master, at which time it > returns to zero, just as before. This is very annoying, because > sl_status is how my application monitors the state of the replication > cluster, and when its broken it confuses users. I can restart the slon > services (slon daemons) and have the event lag return to zero, but > that's not acceptable in a production system. > > Bear in mind, replication isn't broken at any point - only sl_status is. > > When I run test_slony_state-dbi.pl on the master while the event lag > continues to grow, it outputs the following: > > peter at peter-development-machine:~/slony1-2.0.2/tools> > ./test_slony_state-dbi.pl --host=10.0.0.80 --database=lustre > --cluster=lustre_cluster --user=postgres --password=my_password > DSN: dbi:Pg:dbname=lustre;host=10.0.0.80;user=postgres;password=my_password; > =========================== > Rummage for DSNs > ============================= > Query: > > select p.pa_server, p.pa_conninfo > from "_lustre_cluster".sl_path p > -- where exists (select * from "_lustre_cluster".sl_subscribe s where > -- (s.sub_provider = p.pa_server or > s.sub_receiver = p.pa_server) and > -- sub_active = 't') > group by pa_server, pa_conninfo; > > > Tests for node 1 - DSN = dbi:Pg:dbname=lustre host=10.0.0.80 > user=postgres password=my_password > ======================================== > pg_listener info: > Pages: 0 > Tuples: 0 > > Size Tests > ================================================ > sl_log_1 0 0.000000 > sl_log_2 0 0.000000 > sl_seqlog 0 0.000000 > > Listen Path Analysis > =================================================== > No problems found with sl_listen > > -------------------------------------------------------------------------------- > Summary of event info > Origin Min SYNC Max SYNC Min SYNC Age Max SYNC Age > ================================================================================ > 1 38605 38699 00:00:00 00:15:00 0 > 2 20 20 01:08:00 01:08:00 1 > 3 30 30 01:02:00 01:02:00 1 > > > --------------------------------------------------------------------------------- > Summary of sl_confirm aging > Origin Receiver Min SYNC Max SYNC Age of latest SYNC Age of > eldest SYNC > ================================================================================= > 1 2 38605 38605 00:20:00 00:20:00 0 > 1 3 38627 38698 00:00:00 00:11:00 0 > 2 1 20 20 01:03:00 01:03:00 1 > 2 3 20 20 01:02:00 01:02:00 1 > 3 1 30 30 01:02:00 01:02:00 1 > 3 2 30 30 01:08:00 01:08:00 1 > > > ------------------------------------------------------------------------------ > > Listing of old open connections on node 1 > Database PID User Query Age > Query > ================================================================================ > > > Tests for node 3 - DSN = dbi:Pg:dbname=lustre_slave host=10.0.0.82 > user=postgres password=my_password > ======================================== > pg_listener info: > Pages: 0 > Tuples: 0 > > Size Tests > ================================================ > sl_log_1 0 0.000000 > sl_log_2 0 0.000000 > sl_seqlog 0 0.000000 > > Listen Path Analysis > =================================================== > No problems found with sl_listen > > -------------------------------------------------------------------------------- > Summary of event info > Origin Min SYNC Max SYNC Min SYNC Age Max SYNC Age > ================================================================================ > 1 38605 38699 00:00:00 00:15:00 0 > 2 20 20 01:08:00 01:08:00 1 > 3 30 30 01:02:00 01:02:00 1 > > > --------------------------------------------------------------------------------- > Summary of sl_confirm aging > Origin Receiver Min SYNC Max SYNC Age of latest SYNC Age of > eldest SYNC > ================================================================================= > 1 2 38605 38605 00:21:00 00:21:00 0 > 1 3 38629 38699 00:00:00 00:11:00 0 > 2 1 20 20 01:03:00 01:03:00 1 > 2 3 20 20 01:03:00 01:03:00 1 > 3 1 30 30 01:03:00 01:03:00 1 > 3 2 30 30 01:08:00 01:08:00 1 > > > ------------------------------------------------------------------------------ > > Listing of old open connections on node 3 > Database PID User Query Age > Query > ================================================================================ > > > Tests for node 2 - DSN = dbi:Pg:dbname=lustre_slave host=10.0.0.81 > user=postgres password=my_password > ======================================== > pg_listener info: > Pages: 0 > Tuples: 0 > > Size Tests > ================================================ > sl_log_1 0 0.000000 > sl_log_2 0 0.000000 > sl_seqlog 0 0.000000 > > Listen Path Analysis > =================================================== > No problems found with sl_listen > > -------------------------------------------------------------------------------- > Summary of event info > Origin Min SYNC Max SYNC Min SYNC Age Max SYNC Age > ================================================================================ > 1 38573 38699 -00:05:00 00:15:00 0 > 2 20 21 00:15:00 01:03:00 0 > 3 30 30 00:57:00 00:57:00 1 > > > --------------------------------------------------------------------------------- > Summary of sl_confirm aging > Origin Receiver Min SYNC Max SYNC Age of latest SYNC Age of > eldest SYNC > ================================================================================= > 1 2 38607 38699 00:00:00 00:15:00 0 > 1 3 38573 38698 -00:05:00 00:15:00 0 > 2 1 20 20 00:57:00 00:57:00 1 > 2 3 20 20 00:57:00 00:57:00 1 > 3 1 30 30 00:57:00 00:57:00 1 > 3 2 30 30 01:02:00 01:02:00 1 > > > ------------------------------------------------------------------------------ > > Listing of old open connections on node 2 > Database PID User Query Age > Query > ================================================================================ > > peter at peter-development-machine:~/slony1-2.0.2/tools> > > Why is this happening? It's a bug. The events are making it from the origin to the receiver, but the confirmations are not making it back. We've experienced the same situation many times in an all Linux environment - but that was with Slony 1.1 -- Brad Nicholson 416-673-4106 Database Administrator, Afilias Canada Corp.
- Previous message: [Slony1-general] Why does sl_status event lag grow, even though events *are* replicated?
- Next message: [Slony1-general] Why does sl_status event lag grow, even though events *are* replicated?
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
More information about the Slony1-general mailing list