[Slony1-general] Why does sl_status event lag grow, even though events *are* replicated?

Mon Sep 21 04:47:15 PDT 2009

Hello,

This is a re-hash of an earlier mail that went unanswered. Doing this
probably doesn't show decorum, but I'm at the end of my tether with
this problem. I'd appreciate any help that you can give to fix this
problem, because it's a real irritation - my production system is
affected by it. At this point, I'd welcome even wild speculation.

Replication ostensibly works fine. We replicate from a windows Master
(node 1), using Hiroshi Saito's Slony-I 2.0.2 binaries, to 2 OpenSuse
slaves (nodes 2 and 3). It's all fairly standard.

When I restart a slave database (in the following example node 2),
replication continues to work (at least as far as can be immediately
observed), but sl_status shows:

st_origin | st_received | st_last_event |   st_last_event_ts |
st_last_received |    st_last_received_ts     |
st_last_received_event_ts  | st_lag_num_events |   st_lag_time
-------------+-----------------+------------------------------------------------+----------------------------+------------------+----------------------------+----------------------------+-------------------+-----------------
1            |        3        |38689  |  "2009-07-30 12:11:51.796" |
38688;"2009-07-30 12:12:02.428316" | "2009-07-30 12:11:41.859"
|1                            |"00:00:14.015"
1            |        2        |38689  |  "2009-07-30 12:11:51.796" |
38605;"2009-07-30 11:52:35.119048" | "2009-07-30 11:58:05.734"
|84                          |"00:13:50.14"

Node 2's st_lag_num_events grows and grows, until the slony-I service
(all slon daemons) is restarted on the master, at which time it
returns to zero, just as before. This is very annoying, because
sl_status is how my application monitors the state of the replication
cluster, and when its broken it confuses users. I can restart the slon
services (slon daemons) and have the event lag return to zero, but
that's not acceptable in a production system.

Bear in mind, replication isn't broken at any point - only sl_status is.

When I run test_slony_state-dbi.pl on the master while the event lag
continues to grow, it outputs the following:

peter at peter-development-machine:~/slony1-2.0.2/tools>
./test_slony_state-dbi.pl --host=10.0.0.80 --database=lustre
--cluster=lustre_cluster --user=postgres --password=my_password
DSN: dbi:Pg:dbname=lustre;host=10.0.0.80;user=postgres;password=my_password;
===========================
Rummage for DSNs
=============================
Query:

 select p.pa_server, p.pa_conninfo
 from "_lustre_cluster".sl_path p
--   where exists (select * from "_lustre_cluster".sl_subscribe s where
--                          (s.sub_provider = p.pa_server or
s.sub_receiver = p.pa_server) and
--                          sub_active = 't')
 group by pa_server, pa_conninfo;

Tests for node 1 - DSN = dbi:Pg:dbname=lustre host=10.0.0.80
user=postgres password=my_password
========================================
pg_listener info:
Pages: 0
Tuples: 0

Size Tests
================================================
     sl_log_1         0  0.000000
     sl_log_2         0  0.000000
    sl_seqlog         0  0.000000

Listen Path Analysis
===================================================
No problems found with sl_listen

--------------------------------------------------------------------------------
Summary of event info
 Origin  Min SYNC  Max SYNC Min SYNC Age Max SYNC Age
================================================================================
    1     38605     38699     00:00:00     00:15:00    0
    2        20        20     01:08:00     01:08:00    1
    3        30        30     01:02:00     01:02:00    1

---------------------------------------------------------------------------------
Summary of sl_confirm aging
 Origin   Receiver   Min SYNC   Max SYNC  Age of latest SYNC  Age of
eldest SYNC
=================================================================================
      1          2      38605      38605      00:20:00      00:20:00    0
      1          3      38627      38698      00:00:00      00:11:00    0
      2          1         20         20      01:03:00      01:03:00    1
      2          3         20         20      01:02:00      01:02:00    1
      3          1         30         30      01:02:00      01:02:00    1
      3          2         30         30      01:08:00      01:08:00    1

------------------------------------------------------------------------------

Listing of old open connections on node 1
     Database             PID            User    Query Age
   Query
================================================================================

Tests for node 3 - DSN = dbi:Pg:dbname=lustre_slave host=10.0.0.82
user=postgres password=my_password
========================================
pg_listener info:
Pages: 0
Tuples: 0

Size Tests
================================================
     sl_log_1         0  0.000000
     sl_log_2         0  0.000000
    sl_seqlog         0  0.000000

Listen Path Analysis
===================================================
No problems found with sl_listen

--------------------------------------------------------------------------------
Summary of event info
 Origin  Min SYNC  Max SYNC Min SYNC Age Max SYNC Age
================================================================================
    1     38605     38699     00:00:00     00:15:00    0
    2        20        20     01:08:00     01:08:00    1
    3        30        30     01:02:00     01:02:00    1

---------------------------------------------------------------------------------
Summary of sl_confirm aging
 Origin   Receiver   Min SYNC   Max SYNC  Age of latest SYNC  Age of
eldest SYNC
=================================================================================
      1          2      38605      38605      00:21:00      00:21:00    0
      1          3      38629      38699      00:00:00      00:11:00    0
      2          1         20         20      01:03:00      01:03:00    1
      2          3         20         20      01:03:00      01:03:00    1
      3          1         30         30      01:03:00      01:03:00    1
      3          2         30         30      01:08:00      01:08:00    1

------------------------------------------------------------------------------

Listing of old open connections on node 3
     Database             PID            User    Query Age
   Query
================================================================================

Tests for node 2 - DSN = dbi:Pg:dbname=lustre_slave host=10.0.0.81
user=postgres password=my_password
========================================
pg_listener info:
Pages: 0
Tuples: 0

Size Tests
================================================
     sl_log_1         0  0.000000
     sl_log_2         0  0.000000
    sl_seqlog         0  0.000000

Listen Path Analysis
===================================================
No problems found with sl_listen

--------------------------------------------------------------------------------
Summary of event info
 Origin  Min SYNC  Max SYNC Min SYNC Age Max SYNC Age
================================================================================
    1     38573     38699    -00:05:00     00:15:00    0
    2        20        21     00:15:00     01:03:00    0
    3        30        30     00:57:00     00:57:00    1

---------------------------------------------------------------------------------
Summary of sl_confirm aging
 Origin   Receiver   Min SYNC   Max SYNC  Age of latest SYNC  Age of
eldest SYNC
=================================================================================
      1          2      38607      38699      00:00:00      00:15:00    0
      1          3      38573      38698     -00:05:00      00:15:00    0
      2          1         20         20      00:57:00      00:57:00    1
      2          3         20         20      00:57:00      00:57:00    1
      3          1         30         30      00:57:00      00:57:00    1
      3          2         30         30      01:02:00      01:02:00    1

------------------------------------------------------------------------------

Listing of old open connections on node 2
     Database             PID            User    Query Age
   Query
================================================================================

peter at peter-development-machine:~/slony1-2.0.2/tools>

Why is this happening?

Regards,
Peter Geoghegan