| Summary: | slon sometimes does not recover from a network outage | | |
|---|---|---|---|
| Product: | Slony-I | Reporter: | Steve Singer <ssinger> |
| Component: | slon | Assignee: | Steve Singer <ssinger> |
| Status: | RESOLVED FIXED | | |
| Severity: | normal | CC: | slony1-bugs |
| Priority: | medium | | |
| Version: | devel | | |
| Hardware: | Other | | |
| OS: | Other | | |
| Attachments: | Patch for typing issue that gets revealed once bug126 is patched | | |
| | Fix compiler warnings | | |
| | fix for bug126 | | |
Description

Steve Singer 2010-05-18 12:44:40 UTC

Does this involve:

a) Setting GUC defaults for:

* tcp_keepalives_count
* tcp_keepalives_idle
* tcp_keepalives_interval

b) Something else?

If it's a matter of setting GUCs, then a Gentle User might use the sql_on_connection parameter to set these GUCs: <http://slony.info/documentation/slon-config-connection.html>

If there are values to be suggested, then let's put this into the documentation, probably in the "best practices" section: http://git.postgresql.org/gitweb?p=slony1-engine.git;a=blob;f=doc/adminguide/bestpractices.sgml

I'll note that the default GUC values on 8.4 are these:

```
org=# show tcp_keepalives_count;
 tcp_keepalives_count
----------------------
 9
(1 row)

org=# show tcp_keepalives_idle;
 tcp_keepalives_idle
---------------------
 7200
(1 row)

org=# show tcp_keepalives_interval;
 tcp_keepalives_interval
-------------------------
 75
(1 row)

org=# select version();
                                             version
--------------------------------------------------------------------------------------------------
 PostgreSQL 8.4.4 on x86_64-unknown-linux-gnu, compiled by GCC gcc (Debian 4.4.4-6) 4.4.4, 64-bit
(1 row)
```

No, that controls server-side keepalives; we want client-side keepalives. Client-side keepalive support was actually introduced in 9.0, see http://www.postgresql.org/docs/9.0/static/libpq-connect.html#LIBPQ-KEEPALIVES

Prior to 9.0, applications like slony could call PQsocket() to get the socket and then manually adjust values on it, which is what we will have to do if we want the feature when running against older versions of PG.

In either case we need to set up a system that replicates a slon-detected socket failure, to make sure we don't have some other horrible issue waiting in the code path that gets hit once the keepalive times out.
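For pre-9.0 libpq, which lacks client-side keepalive connection options, the PQsocket()/setsockopt() approach described above might look roughly like the sketch below. This is a minimal illustration, not the actual bug126 patch; the helper name enable_keepalives and the idle/interval/count values are invented for the example.

```c
/*
 * Minimal sketch of enabling TCP keepalives on an established libpq
 * connection by fetching the underlying socket with PQsocket() and
 * adjusting it with setsockopt().  Illustrative only; not the bug126 patch.
 */
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <libpq-fe.h>

static int
enable_keepalives(PGconn *conn)
{
    int fd = PQsocket(conn);
    int on = 1;
    int idle = 5;      /* seconds of idle before the first probe (example value) */
    int interval = 5;  /* seconds between probes (example value) */
    int count = 5;     /* unanswered probes before the peer is declared dead */

    if (fd < 0)
        return -1;

    /* Turn keepalives on; without this the TCP_KEEP* tuning values are inert. */
    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0)
        return -1;

    /* The probe-tuning options are platform-specific, hence the guards. */
#ifdef TCP_KEEPIDLE
    setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle));
#endif
#ifdef TCP_KEEPINTVL
    setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &interval, sizeof(interval));
#endif
#ifdef TCP_KEEPCNT
    setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &count, sizeof(count));
#endif
    return 0;
}
```

A slon-like daemon would call something like this right after each successful PQconnectdb(), so a dead peer is noticed after roughly idle + interval * count seconds instead of after the 7200-second kernel default shown above.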
Created attachment 83 [details]
Patch for typing issue that gets revealed once bug126 is patched

This patch fixes a type conversion issue with signed vs. unsigned types. Once bug126 is fixed, it exposes a bug that is addressed by this patch.

Created attachment 84 [details]
Fix compiler warnings
This patch fixes other signed-vs-unsigned compiler warnings that were found during development of the bug fix.
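As an illustration of the class of warning these patches address (not the actual contents of attachment 83 or 84), a signed error return stored in an unsigned variable can silently make an error check unreachable:

```c
#include <stdio.h>

/* Illustrative only; not code from the attached patches. */
static int get_value(void)
{
    return -1;                      /* error indicator */
}

int main(void)
{
    unsigned int v = get_value();   /* -1 wraps to a huge unsigned value */

    /* gcc -Wextra warns: comparison of unsigned expression < 0 is always false */
    if (v < 0)
        printf("error detected\n"); /* unreachable branch */
    else
        printf("v = %u (error lost)\n", v);

    int w = get_value();            /* fix: keep the value in a signed type */
    if (w < 0)
        printf("error detected correctly\n");

    return 0;
}
```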
Created attachment 85 [details]
fix for bug126

This is the actual fix for bug 126.

I have a fix for this bug available in the attached patches or at https://github.com/ssinger/slony1-engine/tree/bug126; the github branch has all three patches applied. (The fix is pretty useless without the typing issue fixed.)

An item that showed up in my testing is worth thinking about. If the network connection between the slon and the local node goes away, the slon will exit (I think this is what we want), and it then restarts and keeps looping trying to connect to the local node (again, this is what we want). However, when the network comes back and the slon connects to the postgres instance, if the timeout settings on the PostgreSQL side are such that the old backend has not yet timed out, then an sl_nodelock row will still exist. The slon will then exit on an error (and not retry). I'm not sure this is the result we want. In bug132 we made the slon loop on getting the node lock if it could not get the node lock due to a slon-requested restart; in this case we are restarting due to a slon error. I'm not sure if we want to change this behavior or not.
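To make the stale-lock scenario above concrete, a hypothetical diagnostic query along these lines could show whether a lingering sl_nodelock row still belongs to a live backend. It assumes a cluster named slony_regress1, the sl_nodelock columns nl_nodeid/nl_conncnt/nl_backendpid, and the pre-9.2 procpid column of pg_stat_activity; none of this is part of the patches.

```sql
-- Hypothetical diagnostic: list each node-lock row and whether the
-- backend that took it is still alive (a dead backend means a stale lock).
SELECT nl.nl_nodeid,
       nl.nl_conncnt,
       nl.nl_backendpid,
       (sa.procpid IS NOT NULL) AS backend_still_alive
  FROM "_slony_regress1".sl_nodelock nl
  LEFT JOIN pg_stat_activity sa ON sa.procpid = nl.nl_backendpid;
```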
I have added a patch to cause all the regression tests to set values for TCP keepalive: https://github.com/cbbrowne/slony1-engine/commit/0c08ba8302efc3795388c6246f6fe1a19f9d7f9e

One may see that this sets values for the parameters, per the slon log:

```
Jan 10 11:09:18 cbbrowne slon-slony_regress1-2[14449]: [1-1] CONFIG main: slon version 2.1.0 starting up
Jan 10 11:09:18 cbbrowne slon-slony_regress1-2[14449]: [2-1] INFO slon: watchdog process started
Jan 10 11:09:18 cbbrowne slon-slony_regress1-2[14449]: [3-1] CONFIG slon: watchdog ready - pid = 14449
Jan 10 11:09:18 cbbrowne slon-slony_regress1-2[14449]: [4-1] CONFIG slon: worker process created - pid = 14451
Jan 10 11:09:18 cbbrowne slon-slony_regress1-2[14451]: [4-1] CONFIG main: Integer option vac_frequency = 2
Jan 10 11:09:18 cbbrowne slon-slony_regress1-2[14451]: [5-1] CONFIG main: Integer option log_level = 2
Jan 10 11:09:18 cbbrowne slon-slony_regress1-2[14451]: [6-1] CONFIG main: Integer option sync_interval = 2010
Jan 10 11:09:18 cbbrowne slon-slony_regress1-2[14451]: [7-1] CONFIG main: Integer option sync_interval_timeout = 15000
Jan 10 11:09:18 cbbrowne slon-slony_regress1-2[14451]: [8-1] CONFIG main: Integer option sync_group_maxsize = 8
Jan 10 11:09:18 cbbrowne slon-slony_regress1-2[14451]: [9-1] CONFIG main: Integer option desired_sync_time = 60000
Jan 10 11:09:18 cbbrowne slon-slony_regress1-2[14451]: [10-1] CONFIG main: Integer option syslog = 1
Jan 10 11:09:18 cbbrowne slon-slony_regress1-2[14451]: [11-1] CONFIG main: Integer option quit_sync_provider = 0
Jan 10 11:09:18 cbbrowne slon-slony_regress1-2[14451]: [12-1] CONFIG main: Integer option quit_sync_finalsync = 0
Jan 10 11:09:18 cbbrowne slon-slony_regress1-2[14451]: [13-1] CONFIG main: Integer option sync_max_rowsize = 4096
Jan 10 11:09:18 cbbrowne slon-slony_regress1-2[14451]: [14-1] CONFIG main: Integer option sync_max_largemem = 1048576
Jan 10 11:09:18 cbbrowne slon-slony_regress1-2[14451]: [15-1] CONFIG main: Integer option remote_listen_timeout = 300
Jan 10 11:09:18 cbbrowne slon-slony_regress1-2[14451]: [16-1] CONFIG main: Integer option tcp_keepalive_idle = 5
Jan 10 11:09:18 cbbrowne slon-slony_regress1-2[14451]: [17-1] CONFIG main: Integer option tcp_keepalive_interval = 5
Jan 10 11:09:18 cbbrowne slon-slony_regress1-2[14451]: [18-1] CONFIG main: Integer option tcp_keepalive_count = 5
Jan 10 11:09:18 cbbrowne slon-slony_regress1-2[14451]: [19-1] CONFIG main: Boolean option log_pid = 0
Jan 10 11:09:18 cbbrowne slon-slony_regress1-2[14451]: [20-1] CONFIG main: Boolean option log_timestamp = 1
Jan 10 11:09:18 cbbrowne slon-slony_regress1-2[14451]: [21-1] CONFIG main: Boolean option tcp_keepalive = 1
Jan 10 11:09:18 cbbrowne slon-slony_regress1-2[14451]: [22-1] CONFIG main: Real option real_placeholder = 0.000000
Jan 10 11:09:18 cbbrowne slon-slony_regress1-2[14451]: [23-1] CONFIG main: String option cluster_name = slony_regress1
Jan 10 11:09:18 cbbrowne slon-slony_regress1-2[14451]: [24-1] CONFIG main: String option conn_info = dbname=slonyregress2 host=localhost user=postgres port=7090
Jan 10 11:09:18 cbbrowne slon-slony_regress1-2[14451]: [25-1] CONFIG main: String option pid_file = /tmp/slony-regress.XIjI9L/slon-pid.2
Jan 10 11:09:18 cbbrowne slon-slony_regress1-2[14451]: [26-1] CONFIG main: String option log_timestamp_format = %Y-%m-%d %H:%M:%S %Z
Jan 10 11:09:18 cbbrowne slon-slony_regress1-2[14451]: [27-1] CONFIG main: String option archive_dir = [NULL]
Jan 10 11:09:18 cbbrowne slon-slony_regress1-2[14451]: [28-1] CONFIG main: String option sql_on_connection = SET log_min_duration_statement to '1000';
Jan 10 11:09:18 cbbrowne slon-slony_regress1-2[14451]: [29-1] CONFIG main: String option lag_interval = 2 seconds
Jan 10 11:09:18 cbbrowne slon-slony_regress1-2[14451]: [30-1] CONFIG main: String option command_on_logarchive = [NULL]
Jan 10 11:09:18 cbbrowne slon-slony_regress1-2[14451]: [31-1] CONFIG main: String option syslog_facility = LOCAL0
Jan 10 11:09:18 cbbrowne slon-slony_regress1-2[14451]: [32-1] CONFIG main: String option syslog_ident = slon-slony_regress1-2
Jan 10 11:09:18 cbbrowne slon-slony_regress1-2[14451]: [33-1] CONFIG main: String option cleanup_interval = 30 seconds
Jan 10 11:09:18 cbbrowne slon-slony_regress1-2[14451]: [34-1] CONFIG main: local node id = 2
```

I haven't run any deep tests involving inducing network outages, but this certainly plays OK when the network *isn't* broken.

Committed to master:

* 44170107c88836df55519f29a467e4bdbbb1689b
* f6aeede568572577e86b81bd758415cbf9bdb3b6
* 660fa6787bca4690a3ace1fa93c9507a16963c61
* a6893dbbaa782d7ca4b22ff1cee2b7953a29e89c
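For reference, a slon.conf fragment that would produce the keepalive lines in the log above might look like the following sketch. The parameter names are taken from the log output; the values are the regression-test values, not tuning recommendations, and the boolean spelling is assumed.

```
# slon.conf sketch: client-side TCP keepalives as exercised by the
# regression tests (values from the log above; boolean spelling assumed).
tcp_keepalive = true
tcp_keepalive_idle = 5
tcp_keepalive_interval = 5
tcp_keepalive_count = 5
```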