Bug 365 - Replication Lag? - All nodes appear to lag when a single provider node is unreachable
Summary: Replication Lag? - All nodes appear to lag when a single provider node is unreachable
Status: NEW
Alias: None
Product: Slony-I
Classification: Unclassified
Component: slon
Version: devel
Hardware: PC All
Importance: low minor
Assignee: Slony Bugs List
URL:
Depends on:
Blocks:
 
Reported: 2016-09-07 10:19 UTC by Glyn Astill
Modified: 2016-09-07 10:19 UTC
CC List: 1 user

See Also:


Description Glyn Astill 2016-09-07 10:19:04 UTC
This bug was originally reported on the Slony1-general mailing list back in February:

    http://lists.slony.info/pipermail/slony1-general/2016-February/013267.html

I read the message and recalled seeing similar behaviour myself, but then got waylaid by something else and forgot about it.

I remembered just now and was able to reproduce it in a quick test with 2.2.4 (I'm assuming this hasn't been fixed in 2.2.5), as follows:

 - Have multiple subscribers to a set that are also providers, i.e. were subscribed with FORWARD = YES
 - Stop postgres on one of those subscribers

What appears to happen is that changes are still replicated to the remaining subscribers, and confirms are generated on those subscribers, but the confirms don't make their way back to the origin until the stopped postgres instance is started again.

In my test setup I have 4 nodes, as follows (though I'm pretty sure node 5 being subscribed to node 4 is irrelevant to the issue):

node 8 = origin of set
node 7 = forwarding subscriber to set, subscribed to node 8
node 4 = forwarding subscriber to set, subscribed to node 8
node 5 = forwarding subscriber to set, subscribed to node 4
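A minimal Python sketch of the test topology above, showing why the lag is unexpected. It assumes, as a simplification, that confirms travel back to the origin along the same provider chain the data came down; Slony actually routes events over the paths configured in sl_listen, so this models only the expected behaviour, not slon itself. The helpers `path_to_origin` and `nodes_cut_off` are hypothetical names used for illustration.

```python
# Hypothetical model of the test cluster: subscriber -> provider.
# Node 8 is the set origin and has no provider.
PROVIDERS = {7: 8, 4: 8, 5: 4}
ORIGIN = 8


def path_to_origin(node, providers=PROVIDERS, origin=ORIGIN):
    """Return the chain of nodes from `node` up to the origin, inclusive."""
    path = [node]
    while path[-1] != origin:
        path.append(providers[path[-1]])
    return path


def nodes_cut_off(down, providers=PROVIDERS, origin=ORIGIN):
    """Subscribers whose provider chain to the origin passes through `down`.

    Assumes confirms retrace the subscription chain; this is a
    simplification of Slony's real event routing.
    """
    return sorted(
        n for n in providers if down in path_to_origin(n, providers, origin)
    )


if __name__ == "__main__":
    for node in sorted(PROVIDERS):
        print(node, path_to_origin(node))
    # Under this model only node 5 loses its route to the origin,
    # whereas the report observes confirms from nodes 4 and 7 stalling too.
    print("cut off when node 5 is down:", nodes_cut_off(5))
```

Under this model, stopping node 4 would cut off both 4 and 5, but stopping node 5 should leave nodes 7 and 4 with an intact route to node 8.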

Here is the log from the slon against the origin (node 8):

2016-09-07_155951 BSTERROR  slon_connectdb: PQconnectdb("dbname=TEST host=192.168.0.102 user=slony") failed - could not connect to server: Connection refused
        Is the server running on host "192.168.0.102" and accepting
        TCP/IP connections on port 5432?
2016-09-07_155951 BSTWARN   remoteListenThread_5: DB connection failed - sleep 10 seconds

2016-09-07_155951 BSTERROR  slon_connectdb: PQconnectdb("dbname=TEST host=192.168.0.102 user=slony") failed - could not connect to server: Connection refused
        Is the server running on host "192.168.0.102" and accepting
        TCP/IP connections on port 5432?
2016-09-07_155951 BSTERROR  remoteWorkerThread_7: cannot connect to data provider 5 on 'dbname=TEST host=192.168.0.102 user=slony'

2016-09-07_155958 BSTERROR  slon_connectdb: PQconnectdb("dbname=TEST host=192.168.0.102 user=slony") failed - could not connect to server: Connection refused
        Is the server running on host "192.168.0.102" and accepting
        TCP/IP connections on port 5432?
2016-09-07_155958 BSTERROR  remoteWorkerThread_4: cannot connect to data provider 5 on 'dbname=TEST host=192.168.0.102 user=slony'

2016-09-07_160001 BSTERROR  slon_connectdb: PQconnectdb("dbname=TEST host=192.168.0.102 user=slony") failed - could not connect to server: Connection refused
        Is the server running on host "192.168.0.102" and accepting
        TCP/IP connections on port 5432?
2016-09-07_160001 BSTWARN   remoteListenThread_5: DB connection failed - sleep 10 seconds

And on another subscriber (node 7):

2016-09-07_155957 BSTERROR  slon_connectdb: PQconnectdb("dbname=TEST host=192.168.0.102 user=slony") failed - could not connect to server: Connection refused
        Is the server running on host "192.168.0.102" and accepting
        TCP/IP connections on port 5432?
2016-09-07_155957 BSTERROR  remoteWorkerThread_4: cannot connect to data provider 5 on 'dbname=TEST host=192.168.0.102 user=slony'

2016-09-07_160001 BSTERROR  slon_connectdb: PQconnectdb("dbname=TEST host=192.168.0.102 user=slony") failed - could not connect to server: Connection refused
        Is the server running on host "192.168.0.102" and accepting
        TCP/IP connections on port 5432?
2016-09-07_160001 BSTWARN   remoteListenThread_5: DB connection failed - sleep 10 seconds
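To see at a glance which slon threads are waiting on the stopped node, a small Python sketch can pull the thread name and the unreachable node id out of log lines like the ones above. The regular expressions are assumptions inferred only from these excerpts, not from a documented slon log format, and `blocked_threads` is a hypothetical helper.

```python
import re
from collections import defaultdict

# Patterns inferred from the log excerpts in this report (assumed format).
WORKER_RE = re.compile(
    r"(remoteWorkerThread_\d+): cannot connect to data provider (\d+)"
)
LISTEN_RE = re.compile(r"(remoteListenThread_(\d+)): DB connection failed")


def blocked_threads(log_text):
    """Map each unreachable node id to the sorted slon threads reporting it."""
    blocked = defaultdict(set)
    for line in log_text.splitlines():
        m = WORKER_RE.search(line) or LISTEN_RE.search(line)
        if m:
            blocked[int(m.group(2))].add(m.group(1))
    return {node: sorted(threads) for node, threads in blocked.items()}


ORIGIN_LOG = """\
2016-09-07_155951 BSTWARN   remoteListenThread_5: DB connection failed - sleep 10 seconds
2016-09-07_155951 BSTERROR  remoteWorkerThread_7: cannot connect to data provider 5 on 'dbname=TEST host=192.168.0.102 user=slony'
2016-09-07_155958 BSTERROR  remoteWorkerThread_4: cannot connect to data provider 5 on 'dbname=TEST host=192.168.0.102 user=slony'
"""

if __name__ == "__main__":
    print(blocked_threads(ORIGIN_LOG))
```

Run against the origin excerpt, it groups the listener for node 5 with the remote worker threads for nodes 7 and 4, all of which report node 5 as the node they cannot reach.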