[Slony1-general] performance problems

Thu Oct 10 13:34:02 PDT 2019

Hello all!

We have Slony 2.1.4 with Postgres 9.3 and Postgres 10 Nodes. There is 
one master (node=1) and 53 slaves (node=2..54). Node 1 is always the 
master - no other node will ever be master, all slaves subscribe 
directly to the master. The slaves subscribe a single set from the master.

Recently we have frequently performance problems with very slow queries 
on our master DB. We think this correlates with the size of the sl_event 
table on the master. With ~10.000 rows everything is fine, but with 
 >50.000 rows we notice rather slow queries, INSERT/SELECT into the 
sl_event table.

So the question is - why does the sl_event table grows? One reason may 
be an offline slave. If the slave is offline too long, we kick him out 
of the cluster - problem solved.

But sometimes all slaves are in sync but the sl_event table grows.

I am not a Slony-expert, but it seems to the related to old 
confirmations in the sl_conf table. For example, today I removed and 
resubscribed node 9. On node 1 I see:

select ev_origin, ev_seqno, ev_timestamp, ev_type from 
_regdnscluster.sl_event order by ev_timestamp;
  ev_origin |  ev_seqno  |         ev_timestamp          | ev_type
-----------+------------+-------------------------------+------------
          9 | 5000000001 | 2019-10-10 14:19:41.514101+00 | STORE_PATH
          9 | 5000000002 | 2019-10-10 14:19:53.097102+00 | SYNC
          9 | 5000000003 | 2019-10-10 14:20:03.118798+00 | SYNC
thousands of 9 | .......| 14:20 - 19:43                 | always SYNC
          9 | 5000001943 | 2019-10-10 19:43:34.923708+00 | SYNC
          9 | 5000001944 | 2019-10-10 19:43:44.933323+00 | SYNC
          9 | 5000001945 | 2019-10-10 19:43:54.943484+00 | SYNC
          9 | 5000001946 | 2019-10-10 19:44:04.952488+00 | SYNC
         21 | 5003835850 | 2019-10-10 19:44:06.106565+00 | SYNC
         47 | 5000021416 | 2019-10-10 19:44:06.148101+00 | SYNC
         12 | 5003464447 | 2019-10-10 19:44:06.212744+00 | SYNC
         51 | 5000021650 | 2019-10-10 19:44:07.262471+00 | SYNC

select ev_origin,count(*) from _regdnscluster.sl_event group by 1 order 
by 2 desc;
  ev_origin | count
-----------+-------
          9 |  2029
          1 |   530
         21 |    89
         45 |    89
         18 |    88
         30 |    87
         16 |    87
         51 |    87
         42 |    87
         14 |    87
         50 |    87

So, events from node 9 a kept although all nodes are in sync and have a 
small lag.

a) Why is node 9 producing events at all? It does not have any 
replications sets and is only a slave.

b) As far as I understand the cleanup function, these old events are not 
deleted as there are older confirmations in sl_confirm.

select * from _regdnscluster.sl_confirm order by con_timestamp ;
  con_origin | con_received | con_seqno  |         con_timestamp
------------+--------------+------------+-------------------------------
           9 |           43 |          0 | 2019-10-10 14:19:41.447874+00
           9 |            8 |          0 | 2019-10-10 14:19:41.456426+00
           9 |           41 |          0 | 2019-10-10 14:19:41.45702+00
           8 |           34 | 5000008516 | 2019-10-10 19:56:34.146168+00
           8 |           31 | 5000008516 | 2019-10-10 19:56:34.262533+00
           8 |           54 | 5000008516 | 2019-10-10 19:56:34.413593+00
           8 |           43 | 5000008516 | 2019-10-10 19:56:34.422288+00
.....

select * from _regdnscluster.sl_confirm where con_origin=9 and 
con_received=43 order by con_timestamp ;
  con_origin | con_received | con_seqno |         con_timestamp
------------+--------------+-----------+-------------------------------
           9 |           43 |         0 | 2019-10-10 14:19:41.447874+00

If I understand it correct, this means that node 43 acknowledges the 
receipt of an event from node 9 (probably the STORE_PATH event) and this 
confirmation is not cleared as there is no newer confirmation from node 
43 about events from node 9.

select con_origin, con_received, count(*) from _regdnscluster.sl_confirm 
WHERE con_received = 43 group by 1,2 order by 3;

  con_origin | con_received | count
------------+--------------+-------
           9 |           43 |     1
           4 |           43 |     7
          30 |           43 |     8
           8 |           43 |     8
          42 |           43 |     9
          54 |           43 |     9
...

c) So, why is node 43 not more confirming other events originated by node 9?

d) How can I solve this problem? My slaves will always be pure slaves. 
So can I blindly delete all events from the events table with ev_origin 
<>1? (or at least all SYNC events)?

Thanks

Klaus