Brad Nicholson bnichols at ca.afilias.info
Mon Sep 21 11:23:27 PDT 2009
On Mon, 2009-09-21 at 12:47 +0100, Peter Geoghegan wrote:
> Hello,
> 
> This is a re-hash of an earlier mail that went unanswered. Doing this
> probably doesn't show decorum, but I'm at the end of my tether with
> this problem. I'd appreciate any help that you can give to fix this
> problem, because it's a real irritation - my production system is
> affected by it. At this point, I'd welcome even wild speculation.
> 
> Replication ostensibly works fine. We replicate from a windows Master
> (node 1), using Hiroshi Saito's Slony-I 2.0.2 binaries, to 2 OpenSuse
> slaves (nodes 2 and 3). It's all fairly standard.
> 
> When I restart a slave database (in the following example node 2),
> replication continues to work (at least as far as can be immediately
> observed), but sl_status shows:
> 
> st_origin | st_received | st_last_event |   st_last_event_ts |
> st_last_received |    st_last_received_ts     |
> st_last_received_event_ts  | st_lag_num_events |   st_lag_time
> -------------+-----------------+------------------------------------------------+----------------------------+------------------+----------------------------+----------------------------+-------------------+-----------------
> 1            |        3        |38689  |  "2009-07-30 12:11:51.796" |
> 38688;"2009-07-30 12:12:02.428316" | "2009-07-30 12:11:41.859"
> |1                            |"00:00:14.015"
> 1            |        2        |38689  |  "2009-07-30 12:11:51.796" |
> 38605;"2009-07-30 11:52:35.119048" | "2009-07-30 11:58:05.734"
> |84                          |"00:13:50.14"
> 
> Node 2's st_lag_num_events grows and grows, until the slony-I service
> (all slon daemons) is restarted on the master, at which time it
> returns to zero, just as before. This is very annoying, because
> sl_status is how my application monitors the state of the replication
> cluster, and when its broken it confuses users. I can restart the slon
> services (slon daemons) and have the event lag return to zero, but
> that's not acceptable in a production system.
> 
> Bear in mind, replication isn't broken at any point - only sl_status is.
> 
> When I run test_slony_state-dbi.pl on the master while the event lag
> continues to grow, it outputs the following:
> 
> peter at peter-development-machine:~/slony1-2.0.2/tools>
> ./test_slony_state-dbi.pl --host=10.0.0.80 --database=lustre
> --cluster=lustre_cluster --user=postgres --password=my_password
> DSN: dbi:Pg:dbname=lustre;host=10.0.0.80;user=postgres;password=my_password;
> ===========================
> Rummage for DSNs
> =============================
> Query:
> 
>  select p.pa_server, p.pa_conninfo
>  from "_lustre_cluster".sl_path p
> --   where exists (select * from "_lustre_cluster".sl_subscribe s where
> --                          (s.sub_provider = p.pa_server or
> s.sub_receiver = p.pa_server) and
> --                          sub_active = 't')
>  group by pa_server, pa_conninfo;
> 
> 
> Tests for node 1 - DSN = dbi:Pg:dbname=lustre host=10.0.0.80
> user=postgres password=my_password
> ========================================
> pg_listener info:
> Pages: 0
> Tuples: 0
> 
> Size Tests
> ================================================
>      sl_log_1         0  0.000000
>      sl_log_2         0  0.000000
>     sl_seqlog         0  0.000000
> 
> Listen Path Analysis
> ===================================================
> No problems found with sl_listen
> 
> --------------------------------------------------------------------------------
> Summary of event info
>  Origin  Min SYNC  Max SYNC Min SYNC Age Max SYNC Age
> ================================================================================
>     1     38605     38699     00:00:00     00:15:00    0
>     2        20        20     01:08:00     01:08:00    1
>     3        30        30     01:02:00     01:02:00    1
> 
> 
> ---------------------------------------------------------------------------------
> Summary of sl_confirm aging
>  Origin   Receiver   Min SYNC   Max SYNC  Age of latest SYNC  Age of
> eldest SYNC
> =================================================================================
>       1          2      38605      38605      00:20:00      00:20:00    0
>       1          3      38627      38698      00:00:00      00:11:00    0
>       2          1         20         20      01:03:00      01:03:00    1
>       2          3         20         20      01:02:00      01:02:00    1
>       3          1         30         30      01:02:00      01:02:00    1
>       3          2         30         30      01:08:00      01:08:00    1
> 
> 
> ------------------------------------------------------------------------------
> 
> Listing of old open connections on node 1
>      Database             PID            User    Query Age
>    Query
> ================================================================================
> 
> 
> Tests for node 3 - DSN = dbi:Pg:dbname=lustre_slave host=10.0.0.82
> user=postgres password=my_password
> ========================================
> pg_listener info:
> Pages: 0
> Tuples: 0
> 
> Size Tests
> ================================================
>      sl_log_1         0  0.000000
>      sl_log_2         0  0.000000
>     sl_seqlog         0  0.000000
> 
> Listen Path Analysis
> ===================================================
> No problems found with sl_listen
> 
> --------------------------------------------------------------------------------
> Summary of event info
>  Origin  Min SYNC  Max SYNC Min SYNC Age Max SYNC Age
> ================================================================================
>     1     38605     38699     00:00:00     00:15:00    0
>     2        20        20     01:08:00     01:08:00    1
>     3        30        30     01:02:00     01:02:00    1
> 
> 
> ---------------------------------------------------------------------------------
> Summary of sl_confirm aging
>  Origin   Receiver   Min SYNC   Max SYNC  Age of latest SYNC  Age of
> eldest SYNC
> =================================================================================
>       1          2      38605      38605      00:21:00      00:21:00    0
>       1          3      38629      38699      00:00:00      00:11:00    0
>       2          1         20         20      01:03:00      01:03:00    1
>       2          3         20         20      01:03:00      01:03:00    1
>       3          1         30         30      01:03:00      01:03:00    1
>       3          2         30         30      01:08:00      01:08:00    1
> 
> 
> ------------------------------------------------------------------------------
> 
> Listing of old open connections on node 3
>      Database             PID            User    Query Age
>    Query
> ================================================================================
> 
> 
> Tests for node 2 - DSN = dbi:Pg:dbname=lustre_slave host=10.0.0.81
> user=postgres password=my_password
> ========================================
> pg_listener info:
> Pages: 0
> Tuples: 0
> 
> Size Tests
> ================================================
>      sl_log_1         0  0.000000
>      sl_log_2         0  0.000000
>     sl_seqlog         0  0.000000
> 
> Listen Path Analysis
> ===================================================
> No problems found with sl_listen
> 
> --------------------------------------------------------------------------------
> Summary of event info
>  Origin  Min SYNC  Max SYNC Min SYNC Age Max SYNC Age
> ================================================================================
>     1     38573     38699    -00:05:00     00:15:00    0
>     2        20        21     00:15:00     01:03:00    0
>     3        30        30     00:57:00     00:57:00    1
> 
> 
> ---------------------------------------------------------------------------------
> Summary of sl_confirm aging
>  Origin   Receiver   Min SYNC   Max SYNC  Age of latest SYNC  Age of
> eldest SYNC
> =================================================================================
>       1          2      38607      38699      00:00:00      00:15:00    0
>       1          3      38573      38698     -00:05:00      00:15:00    0
>       2          1         20         20      00:57:00      00:57:00    1
>       2          3         20         20      00:57:00      00:57:00    1
>       3          1         30         30      00:57:00      00:57:00    1
>       3          2         30         30      01:02:00      01:02:00    1
> 
> 
> ------------------------------------------------------------------------------
> 
> Listing of old open connections on node 2
>      Database             PID            User    Query Age
>    Query
> ================================================================================
> 
> peter at peter-development-machine:~/slony1-2.0.2/tools>
> 
> Why is this happening?


It's a bug. The events are making it from the origin to the receiver,
but the confirmations are not making it back.  

We've experienced the same situation many times in an all Linux
environment - but that was with Slony 1.1 

-- 
Brad Nicholson  416-673-4106
Database Administrator, Afilias Canada Corp.




More information about the Slony1-general mailing list