Tory M Blue tmblue at gmail.com
Wed Jun 12 20:57:47 PDT 2019
On Wed, Jun 12, 2019 at 9:33 AM Christopher Browne <cbbrowne at afilias.info>
wrote:

> On Tue, Jun 11, 2019 at 3:44 PM Tory M Blue <tmblue at gmail.com> wrote:
>
>>
>>
>> On Tue, Jun 11, 2019 at 12:26 PM Christopher Browne <
>> cbbrowne at afilias.info> wrote:
>>
>>>
>>>
>>> On Tue, Jun 11, 2019 at 1:30 PM Tory M Blue <tmblue at gmail.com> wrote:
>>>
>>>> I'm wondering if there is anything I can do to prevent what appears to
>>>> be slony replication backing up (not replicating) during a pg_dump. I'm using a
>>>> -N slonschema, so I'm wondering if there are any other techniques to prevent
>>>> slon replication delay during pg_dumps?
>>>>
>>>
>>> Hmm.  I remember the "big deal" being to ensure that the Slony schema
>>> was left out, as dumping it would definitely lead to locking that would
>>> cause replication to fall behind.
>>>
>>> It's possible that something new has come along that would lock in a
>>> troublesome way.
>>>
>>> I suggest that you take a look in pg_catalog.pg_locks to see what
>>> objects Slony is waiting on, thus, looking for:
>>>
>>>   select * from pg_catalog.pg_locks where not granted;
>>>
>>> Some further work needs to be done to trace that to exactly what objects
>>> are locked.
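>>>
>>> For instance, something along these lines (untested sketch; these are the
>>> 9.2+ column names, older 9.x uses procpid/current_query) names the relation
>>> and the session involved for each ungranted lock:
>>>
>>>   select l.pid, c.relname, l.mode, a.query
>>>     from pg_catalog.pg_locks l
>>>     left join pg_catalog.pg_class c on c.oid = l.relation
>>>     join pg_catalog.pg_stat_activity a on a.pid = l.pid
>>>    where not l.granted;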
>>>
>>
>> Ya, will take a look. I also thought that bypassing the slon schema would
>> solve it, but apparently not; we can have a 2-10 million row count in sl_log
>> before it clears, and it's apparent that the replication is not happening,
>> so I'll poke around some more. Glad to see folks are still around :)
>>
>
> Well, growth of sl_log during the backup is fine, and not an indication of
> anything untoward.  Given that you opened a transaction when pg_dump
> began, and that has not completed, that will lead to Slony not trimming
> anything out of sl_log during the backup.  That's normal, routine stuff.
> Nothing's wrong (not on the basis of that, anyways).
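>
> If you want to see which transaction is holding the trimming back, the
> long-running pg_dump session should stand out as the oldest open transaction.
> A rough sketch, using the 9.2+ column names (older 9.x has procpid and
> current_query instead):
>
>   select pid, usename, application_name, xact_start, state
>     from pg_stat_activity
>    where xact_start is not null
>    order by xact_start
>    limit 5;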
>
> On the origin node, you can query the view sl_status, which will show how
> far behind replication appears to be.  (The view is available on all nodes;
> it does not return terribly interesting results on nodes other than the
> origin.)
>
> Here's an example of replication being very far behind:
>
> usaorigin=# select * from _usa.sl_status limit 3;
> -[ RECORD 1 ]-------------+------------------------------
> st_origin                 | 1
> st_received               | 3
> st_last_event             | 5000002124
> st_last_event_ts          | 2019-04-10 20:02:43.833414+00
> st_last_received          | 5000002124
> st_last_received_ts       | 2019-04-10 20:02:43.885806+00
> st_last_received_event_ts | 2019-04-10 20:02:43.833414+00
> st_lag_num_events         | 0
> st_lag_time               | 62 days 20:23:14.029816
> -[ RECORD 2 ]-------------+------------------------------
> st_origin                 | 1
> st_received               | 4
> st_last_event             | 5000002124
> st_last_event_ts          | 2019-04-10 20:02:43.833414+00
> st_last_received          | 5000002122
> st_last_received_ts       | 2019-04-10 20:02:33.870577+00
> st_last_received_event_ts | 2019-04-10 20:02:23.76373+00
> st_lag_num_events         | 2
> st_lag_time               | 62 days 20:23:34.0995
> -[ RECORD 3 ]-------------+------------------------------
> st_origin                 | 1
> st_received               | 5
> st_last_event             | 5000002124
> st_last_event_ts          | 2019-04-10 20:02:43.833414+00
> st_last_received          | 5000002122
> st_last_received_ts       | 2019-04-10 20:02:33.898605+00
> st_last_received_event_ts | 2019-04-10 20:02:23.76373+00
> st_lag_num_events         | 2
> st_lag_time               | 62 days 20:23:34.0995
>
> That is a replication cluster where I shut down slon processes back in
> April :-)
>
> It's only behind by a few events, for the self-same reason :-)
>
> If the last event received coincides with the time at which you started
> the pg_dump run, well, that sure sounds like a smoking gun.
>
> If, on the other hand, SYNC events are being processed, just at slower
> speed than you'd like, then perhaps the problem is that the I/O induced by
> pg_dump is delaying replication, and that's more a hardware problem than a
> software one.
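>
> One way to tell "stalled" from "merely slow" is to watch whether the
> subscribers' confirmations keep advancing.  A rough sketch against the demo
> cluster above (substitute your own cluster name and origin node id); run it,
> wait a minute, and run it again:
>
>   select con_received, max(con_seqno) as newest_confirmed,
>          max(con_timestamp) as confirmed_at
>     from _usa.sl_confirm
>    where con_origin = 1
>    group by con_received
>    order by con_received;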
>
> And if replication is a few seconds behind, while the system is busy, as
> indicated by st_lag_time being a few seconds, then it's possible that there
> is no problem there.
>

First, the environment: Postgres 9.x on CentOS 7.2.
Replication flow:
primary -> secondary -> 3 query slaves

Okay, here it is:

My dump started:

Start export of schema clsdb for user postgres at Wed Jun 12 19:00:48 PDT 2019

So if I'm reading this correctly, it's 20:45 but st_last_received_event_ts is
showing 19:38, over an hour ago? And st_lag_time is over an hour?

Primary:

select * from _cls.sl_status limit 3;

-[ RECORD 1 ]-------------+------------------------------
st_origin                 | 11
st_received               | 14
st_last_event             | 5036030946
st_last_event_ts          | 2019-06-12 20:45:29.465785-07
st_last_received          | 5036030138
st_last_received_ts       | 2019-06-12 19:38:11.065976-07
st_last_received_event_ts | 2019-06-12 19:38:03.311267-07
st_lag_num_events         | 808
st_lag_time               | 01:07:33.039694
-[ RECORD 2 ]-------------+------------------------------
st_origin                 | 11
st_received               | 15
st_last_event             | 5036030946
st_last_event_ts          | 2019-06-12 20:45:29.465785-07
st_last_received          | 5036030138
st_last_received_ts       | 2019-06-12 19:38:12.202539-07
st_last_received_event_ts | 2019-06-12 19:38:03.311267-07
st_lag_num_events         | 808
st_lag_time               | 01:07:33.039694
-[ RECORD 3 ]-------------+------------------------------
st_origin                 | 11
st_received               | 12
st_last_event             | 5036030946
st_last_event_ts          | 2019-06-12 20:45:29.465785-07
st_last_received          | 5036030138
st_last_received_ts       | 2019-06-12 19:38:08.342637-07
st_last_received_event_ts | 2019-06-12 19:38:03.311267-07
st_lag_num_events         | 808
st_lag_time               | 01:07:33.039694


I have 2.4 million rows in the sl_log_1 table, which looks like it started
to hockey stick right around 19:30.
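
For watching that backlog, something along these lines (rough sketch; counts
both log tables per origin, since Slony switches between sl_log_1 and sl_log_2):

select 'sl_log_1' as log_table, log_origin, count(*)
  from _cls.sl_log_1 group by 1, 2
union all
select 'sl_log_2', log_origin, count(*)
  from _cls.sl_log_2 group by 1, 2;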



On the secondary (which replicates to my query DBs, where the dump is run
from, not from the primary), sl_status shows these lines:


Secondary:


 select * from _cls.sl_status limit 3;

-[ RECORD 1 ]-------------+------------------------------
st_origin                 | 12
st_received               | 11
st_last_event             | 5008650405
st_last_event_ts          | 2019-06-12 20:45:35.365757-07
st_last_received          | 5008650000
st_last_received_ts       | 2019-06-12 19:38:03.815508-07
st_last_received_event_ts | 2019-06-12 19:38:03.338486-07
st_lag_num_events         | 405
st_lag_time               | 01:07:35.106539
-[ RECORD 2 ]-------------+------------------------------
st_origin                 | 12
st_received               | 13
st_last_event             | 5008650405
st_last_event_ts          | 2019-06-12 20:45:35.365757-07
st_last_received          | 5008650403
st_last_received_ts       | 2019-06-12 20:45:17.701455-07
st_last_received_event_ts | 2019-06-12 20:45:15.355233-07
st_lag_num_events         | 2
st_lag_time               | 00:00:23.089792
-[ RECORD 3 ]-------------+------------------------------
st_origin                 | 12
st_received               | 14
st_last_event             | 5008650405
st_last_event_ts          | 2019-06-12 20:45:35.365757-07
st_last_received          | 5008650404
st_last_received_ts       | 2019-06-12 20:45:26.642094-07
st_last_received_event_ts | 2019-06-12 20:45:25.360167-07
st_lag_num_events         | 1
st_lag_time               | 00:00:13.084858