Mon Sep 21 04:47:15 PDT 2009
- Previous message: [Slony1-general] replicating views with slony
- Next message: [Slony1-general] Why does sl_status event lag grow, even though events *are* replicated?
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Hello,

This is a re-hash of an earlier mail that went unanswered. Re-posting probably doesn't show decorum, but I'm at the end of my tether with this problem. I'd appreciate any help that you can give, because it's a real irritation: my production system is affected by it. At this point, I'd welcome even wild speculation.

Replication ostensibly works fine. We replicate from a Windows master (node 1), using Hiroshi Saito's Slony-I 2.0.2 binaries, to two openSUSE slaves (nodes 2 and 3). It's all fairly standard.

When I restart a slave database (node 2 in the following example), replication continues to work (at least as far as can be immediately observed), but sl_status shows:

 st_origin | st_received | st_last_event | st_last_event_ts          | st_last_received | st_last_received_ts          | st_last_received_event_ts | st_lag_num_events | st_lag_time
-----------+-------------+---------------+---------------------------+------------------+------------------------------+---------------------------+-------------------+----------------
         1 |           3 |         38689 | "2009-07-30 12:11:51.796" |            38688 | "2009-07-30 12:12:02.428316" | "2009-07-30 12:11:41.859" |                 1 | "00:00:14.015"
         1 |           2 |         38689 | "2009-07-30 12:11:51.796" |            38605 | "2009-07-30 11:52:35.119048" | "2009-07-30 11:58:05.734" |                84 | "00:13:50.14"

Node 2's st_lag_num_events grows and grows until the Slony-I service (all slon daemons) is restarted on the master, at which point it returns to zero, just as before. This is very annoying, because sl_status is how my application monitors the state of the replication cluster, and when it's broken it confuses users. I can restart the slon services and have the event lag return to zero, but that's not acceptable in a production system. Bear in mind that replication isn't broken at any point; only sl_status is.
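
As a sanity check on the sl_status figures above, here is a minimal Python sketch of the kind of check a monitoring application might run. It assumes that st_lag_num_events is simply st_last_event minus st_last_received, which matches both rows shown (38689 - 38688 = 1 and 38689 - 38605 = 84). The names StatusRow and lagging_receivers and the threshold of 10 events are hypothetical and not part of Slony-I.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StatusRow:
    """A subset of one sl_status row (hypothetical helper, not a Slony-I type)."""
    st_origin: int
    st_received: int
    st_last_event: int
    st_last_received: int

    @property
    def lag_num_events(self) -> int:
        # Assumption: event lag is the gap between the newest event on the
        # origin and the newest event this receiver is known to have confirmed.
        return self.st_last_event - self.st_last_received


def lagging_receivers(rows: List[StatusRow], max_lag_events: int) -> List[int]:
    """Return the receiver node ids whose event lag exceeds max_lag_events."""
    return [r.st_received for r in rows if r.lag_num_events > max_lag_events]


# The two rows reported by sl_status in this mail.
rows = [
    StatusRow(st_origin=1, st_received=3, st_last_event=38689, st_last_received=38688),
    StatusRow(st_origin=1, st_received=2, st_last_event=38689, st_last_received=38605),
]

for row in rows:
    print(f"node {row.st_received}: lag {row.lag_num_events} events")

# Hypothetical alert threshold of 10 events.
print("lagging receivers:", lagging_receivers(rows, max_lag_events=10))
```

A check like this flags node 2 even though its data is current, because it only sees the last confirmation recorded on the master, not what the slave has actually applied.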
When I run test_slony_state-dbi.pl on the master while the event lag continues to grow, it outputs the following:

peter@peter-development-machine:~/slony1-2.0.2/tools> ./test_slony_state-dbi.pl --host=10.0.0.80 --database=lustre --cluster=lustre_cluster --user=postgres --password=my_password
DSN: dbi:Pg:dbname=lustre;host=10.0.0.80;user=postgres;password=my_password;
===========================
Rummage for DSNs
=============================
Query:
   select p.pa_server, p.pa_conninfo from "_lustre_cluster".sl_path p
--   where exists (select * from "_lustre_cluster".sl_subscribe s where
--     (s.sub_provider = p.pa_server or s.sub_receiver = p.pa_server) and
--     sub_active = 't')
   group by pa_server, pa_conninfo;

Tests for node 1 - DSN = dbi:Pg:dbname=lustre host=10.0.0.80 user=postgres password=my_password
========================================
pg_listener info:
Pages: 0
Tuples: 0

Size Tests
================================================
  sl_log_1     0   0.000000
  sl_log_2     0   0.000000
  sl_seqlog    0   0.000000

Listen Path Analysis
===================================================
No problems found with sl_listen

--------------------------------------------------------------------------------
Summary of event info
 Origin  Min SYNC  Max SYNC  Min SYNC Age  Max SYNC Age
================================================================================
      1     38605     38699      00:00:00      00:15:00   0
      2        20        20      01:08:00      01:08:00   1
      3        30        30      01:02:00      01:02:00   1

---------------------------------------------------------------------------------
Summary of sl_confirm aging
 Origin  Receiver  Min SYNC  Max SYNC  Age of latest SYNC  Age of eldest SYNC
=================================================================================
      1         2     38605     38605            00:20:00            00:20:00   0
      1         3     38627     38698            00:00:00            00:11:00   0
      2         1        20        20            01:03:00            01:03:00   1
      2         3        20        20            01:02:00            01:02:00   1
      3         1        30        30            01:02:00            01:02:00   1
      3         2        30        30            01:08:00            01:08:00   1

------------------------------------------------------------------------------
Listing of old open connections on node 1
 Database  PID  User  Query Age  Query
================================================================================

Tests for node 3 - DSN = dbi:Pg:dbname=lustre_slave host=10.0.0.82 user=postgres password=my_password
========================================
pg_listener info:
Pages: 0
Tuples: 0

Size Tests
================================================
  sl_log_1     0   0.000000
  sl_log_2     0   0.000000
  sl_seqlog    0   0.000000

Listen Path Analysis
===================================================
No problems found with sl_listen

--------------------------------------------------------------------------------
Summary of event info
 Origin  Min SYNC  Max SYNC  Min SYNC Age  Max SYNC Age
================================================================================
      1     38605     38699      00:00:00      00:15:00   0
      2        20        20      01:08:00      01:08:00   1
      3        30        30      01:02:00      01:02:00   1

---------------------------------------------------------------------------------
Summary of sl_confirm aging
 Origin  Receiver  Min SYNC  Max SYNC  Age of latest SYNC  Age of eldest SYNC
=================================================================================
      1         2     38605     38605            00:21:00            00:21:00   0
      1         3     38629     38699            00:00:00            00:11:00   0
      2         1        20        20            01:03:00            01:03:00   1
      2         3        20        20            01:03:00            01:03:00   1
      3         1        30        30            01:03:00            01:03:00   1
      3         2        30        30            01:08:00            01:08:00   1

------------------------------------------------------------------------------
Listing of old open connections on node 3
 Database  PID  User  Query Age  Query
================================================================================

Tests for node 2 - DSN = dbi:Pg:dbname=lustre_slave host=10.0.0.81 user=postgres password=my_password
========================================
pg_listener info:
Pages: 0
Tuples: 0

Size Tests
================================================
  sl_log_1     0   0.000000
  sl_log_2     0   0.000000
  sl_seqlog    0   0.000000

Listen Path Analysis
===================================================
No problems found with sl_listen

--------------------------------------------------------------------------------
Summary of event info
 Origin  Min SYNC  Max SYNC  Min SYNC Age  Max SYNC Age
================================================================================
      1     38573     38699     -00:05:00      00:15:00   0
      2        20        21      00:15:00      01:03:00   0
      3        30        30      00:57:00      00:57:00   1

---------------------------------------------------------------------------------
Summary of sl_confirm aging
 Origin  Receiver  Min SYNC  Max SYNC  Age of latest SYNC  Age of eldest SYNC
=================================================================================
      1         2     38607     38699            00:00:00            00:15:00   0
      1         3     38573     38698           -00:05:00            00:15:00   0
      2         1        20        20            00:57:00            00:57:00   1
      2         3        20        20            00:57:00            00:57:00   1
      3         1        30        30            00:57:00            00:57:00   1
      3         2        30        30            01:02:00            01:02:00   1

------------------------------------------------------------------------------
Listing of old open connections on node 2
 Database  PID  User  Query Age  Query
================================================================================
peter@peter-development-machine:~/slony1-2.0.2/tools>

Why is this happening?

Regards,
Peter Geoghegan