We've received a report of slon not recovering properly from a network outage. It appears that the remote listener thread (8341) encountered a network error while the network was down. No network errors were observed for the remote worker threads. After the error, the remote listener for 8341 apparently continued to queue events (but no logs are available). Replication started to fall behind and did not resume after the network was restored. Restarting slon made replication work again.

My theory is that we were waiting on a socket read() inside libpq when the network died. Since we were not trying to send an event, no packets were generated that would have notified libpq that the network connection had died. Setting KEEPALIVE on the connections to postgres should address this. We don't appear to be doing that currently.

2010-05-16 20:59:35 UTC DEBUG2 remoteListenThread_8344: LISTEN
2010-05-16 20:59:35 UTC DEBUG2 remoteListenThread_8346: LISTEN
2010-05-16 20:59:35 UTC DEBUG2 remoteWorkerThread_8344: forward confirm 8394,112865 received by 8344
2010-05-16 20:59:35 UTC DEBUG2 remoteWorkerThread_8344: forward confirm 8394,112865 received by 8346
2010-05-16 20:59:38 UTC DEBUG2 remoteListenThread_8346: queue event 8346,157585 SYNC
2010-05-16 20:59:38 UTC DEBUG2 remoteListenThread_8346: UNLISTEN
2010-05-16 20:59:38 UTC DEBUG2 remoteWorkerThread_8346: Received event 8346,157585 SYNC
2010-05-16 20:59:38 UTC DEBUG2 calc sync size - last time: 1 last length: 10001 ideal: 5 proposed size: 3
2010-05-16 20:59:38 UTC DEBUG2 remoteWorkerThread_8346: SYNC 157585 processing
2010-05-16 20:59:38 UTC DEBUG2 remoteWorkerThread_8346: no sets need syncing for this event
2010-05-16 20:59:38 UTC DEBUG2 remoteListenThread_8344: queue event 8344,148724 SYNC
2010-05-16 20:59:38 UTC DEBUG2 remoteListenThread_8344: queue event 8346,157585 SYNC
2010-05-16 20:59:38 UTC DEBUG2 remoteWorker_event: event 8346,157585 ignored - duplicate
2010-05-16 20:59:38 UTC DEBUG2 remoteListenThread_8344: UNLISTEN
2010-05-16 20:59:38 UTC DEBUG2 remoteWorkerThread_8344: Received event 8344,148724 SYNC
2010-05-16 20:59:38 UTC DEBUG2 remoteWorkerThread_8344: SYNC 148724 processing
2010-05-16 20:59:38 UTC DEBUG2 remoteWorkerThread_8344: no sets need syncing for this event
2010-05-16 20:59:38 UTC ERROR  remoteListenThread_8341: "select con_origin, con_received, max(con_seqno) as con_seqno, max(con_timestamp) as con_timestamp from "_oxrsin".sl_confirm where con_received <> 8394 group by con_origin, con_received" could not receive data from server: Connection timed out
2010-05-16 20:59:38 UTC DEBUG2 remoteWorkerThread_8344: forward confirm 8346,157585 received by 8344
2010-05-16 20:59:42 UTC DEBUG2 syncThread: new sl_action_seq 1 - SYNC 112866
2010-05-16 20:59:45 UTC DEBUG2 localListenThread: Received event 8394,112866 SYNC
2010-05-16 20:59:45 UTC DEBUG2 remoteListenThread_8344: LISTEN
2010-05-16 20:59:45 UTC DEBUG2 remoteListenThread_8346: LISTEN
2010-05-16 20:59:45 UTC DEBUG2 remoteListenThread_8344: LISTEN
2010-05-16 20:59:45 UTC DEBUG2 remoteListenThread_8346: LISTEN
2010-05-16 20:59:45 UTC DEBUG2 remoteWorkerThread_8346: forward confirm 8394,112866 received by 8344
2010-05-16 20:59:45 UTC DEBUG2 remoteWorkerThread_8346: forward confirm 8344,148724 received by 8346
2010-05-16 20:59:45 UTC DEBUG2 remoteWorkerThread_8344: forward confirm 8394,112866 received by 8346
2010-05-16 20:59:48 UTC DEBUG2 remoteListenThread_8346: queue event 8346,157586 SYNC
2010-05-16 20:59:48 UTC DEBUG2 remoteListenThread_8346: UNLISTEN
2010-05-16 20:59:48 UTC DEBUG2 remoteWorkerThread_8346: Received event 8346,157586 SYNC
2010-05-16 20:59:48 UTC DEBUG2 calc sync size - last time: 1 last length: 10002 ideal: 5 proposed size: 3
2010-05-16 20:59:48 UTC DEBUG2 remoteWorkerThread_8346: SYNC 157586 processing
2010-05-16 20:59:48 UTC DEBUG2 remoteWorkerThread_8346: no sets need syncing for this event
2010-05-16 20:59:48 UTC DEBUG2 remoteListenThread_8344: queue event 8344,148725 SYNC
2010-05-16 20:59:48 UTC DEBUG2 remoteListenThread_8344: queue event 8346,157586 SYNC
2010-05-16 20:59:48 UTC DEBUG2 remoteWorker_event: event 8346,157586 ignored - duplicate
2010-05-16 20:59:48 UTC DEBUG2 remoteListenThread_8344: UNLISTEN
2010-05-16 20:59:48 UTC DEBUG2 remoteWorkerThread_8344: Received event 8344,148725 SYNC
2010-05-16 20:59:48 UTC DEBUG2 remoteWorkerThread_8344: SYNC 148725 processing
2010-05-16 20:59:48 UTC DEBUG2 remoteWorkerThread_8344: no sets need syncing for this event
2010-05-16 20:59:48 UTC DEBUG2 remoteWorkerThread_8344: forward confirm 8346,157586 received by 8344
2010-05-16 20:59:49 UTC DEBUG2 remoteWorkerThread_8346: forward confirm 8344,148725 received by 8346
2010-05-16 20:59:52 UTC DEBUG2 syncThread: new sl_action_seq 1 - SYNC 112867
2010-05-16 20:59:55 UTC DEBUG2 remoteListenThread_8344: LISTEN
2010-05-16 20:59:55 UTC DEBUG2 remoteListenThread_8346: LISTEN
2010-05-16 20:59:55 UTC DEBUG2 remoteWorkerThread_8344: forward confirm 8394,112867 received by 8344
2010-05-16 20:59:55 UTC DEBUG2 remoteWorkerThread_8346: forward confirm 8394,112867 received by 8346
Does this involve:

a) Setting GUC defaults for:
     tcp_keepalives_count
     tcp_keepalives_idle
     tcp_keepalives_interval

b) Something else?

If it's a matter of setting GUCs, then a Gentle User might use the sql_on_connection parameter to set these GUCs. <http://slony.info/documentation/slon-config-connection.html>

If there are values to be suggested, then let's put this into the documentation, probably in the "best practices" section. http://git.postgresql.org/gitweb?p=slony1-engine.git;a=blob;f=doc/adminguide/bestpractices.sgml
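For the sake of concreteness, a slon.conf entry using sql_on_connection to tighten those GUCs might look like the fragment below. The specific values are illustrative only, not tested recommendations:

```
# Hypothetical example: shorten keepalive timings on each slon connection.
sql_on_connection='SET tcp_keepalives_idle = 60; SET tcp_keepalives_interval = 10; SET tcp_keepalives_count = 5;'
```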
I'll note that the default GUC values on 8.4 are thus:

org=# show tcp_keepalives_count;
 tcp_keepalives_count
----------------------
 9
(1 row)

org=# show tcp_keepalives_idle ;
 tcp_keepalives_idle
---------------------
 7200
(1 row)

org=# show tcp_keepalives_interval ;
 tcp_keepalives_interval
-------------------------
 75
(1 row)

org=# select version();
                                             version
--------------------------------------------------------------------------------------------------
 PostgreSQL 8.4.4 on x86_64-unknown-linux-gnu, compiled by GCC gcc (Debian 4.4.4-6) 4.4.4, 64-bit
(1 row)
No, that controls server-side keepalives; we want client-side keepalives. Client-side keepalive settings were actually introduced in 9.0, see http://www.postgresql.org/docs/9.0/static/libpq-connect.html#LIBPQ-KEEPALIVES

Prior to 9.0, applications like slony could call PQsocket() and then manually adjust values on the socket, which is what we will have to do if we want this feature when running against older versions of PG. In either case we need to set up a way to reproduce a slon-detected socket failure, to make sure we don't have some other horrible issue waiting in the code path that gets hit once the keepalive times out.
Created attachment 83 [details]
Patch for typing issue that gets revealed once bug 126 is patched

This patch fixes a type conversion issue with signed vs. unsigned types. Fixing bug 126 exposes a bug that is addressed by this patch.
Created attachment 84 [details]
Fix compiler warnings

This patch fixes other signedness-related compiler warnings that were found during development of the bug fix.
Created attachment 85 [details]
fix for bug126

This is the actual fix for bug 126.
I have a fix for this bug available in the attached patches, or at https://github.com/ssinger/slony1-engine/tree/bug126

The github branch has all three patches applied. (The fix is pretty useless without the typing issue fixed.)

An item that showed up in my testing is worth thinking about. If the network connection between the slon and the local node goes away, the slon will exit (I think this is what we want), and it then restarts and keeps looping trying to connect to the local node (again, this is what we want). However, when the network comes back and the slon connects to the postgres instance, if the timeout settings on the postgres side are such that the old backend has not yet timed out, then an sl_nodelock row will still exist. The slon will then exit on an error (and not retry). I'm not sure this is the result we want. In bug 132 we made the slon loop on acquiring the node lock if it could not get the lock due to a slon-requested restart. In this case we are restarting due to a slon error. I'm not sure if we want to change this behavior or not.
I have added a patch to cause all the regression tests to set values for TCP keepalive: https://github.com/cbbrowne/slony1-engine/commit/0c08ba8302efc3795388c6246f6fe1a19f9d7f9e

As the slon log shows, this sets values for the keepalive parameters:

Jan 10 11:09:18 cbbrowne slon-slony_regress1-2[14449]: [1-1] CONFIG main: slon version 2.1.0 starting up
Jan 10 11:09:18 cbbrowne slon-slony_regress1-2[14449]: [2-1] INFO slon: watchdog process started
Jan 10 11:09:18 cbbrowne slon-slony_regress1-2[14449]: [3-1] CONFIG slon: watchdog ready - pid = 14449
Jan 10 11:09:18 cbbrowne slon-slony_regress1-2[14449]: [4-1] CONFIG slon: worker process created - pid = 14451
Jan 10 11:09:18 cbbrowne slon-slony_regress1-2[14451]: [4-1] CONFIG main: Integer option vac_frequency = 2
Jan 10 11:09:18 cbbrowne slon-slony_regress1-2[14451]: [5-1] CONFIG main: Integer option log_level = 2
Jan 10 11:09:18 cbbrowne slon-slony_regress1-2[14451]: [6-1] CONFIG main: Integer option sync_interval = 2010
Jan 10 11:09:18 cbbrowne slon-slony_regress1-2[14451]: [7-1] CONFIG main: Integer option sync_interval_timeout = 15000
Jan 10 11:09:18 cbbrowne slon-slony_regress1-2[14451]: [8-1] CONFIG main: Integer option sync_group_maxsize = 8
Jan 10 11:09:18 cbbrowne slon-slony_regress1-2[14451]: [9-1] CONFIG main: Integer option desired_sync_time = 60000
Jan 10 11:09:18 cbbrowne slon-slony_regress1-2[14451]: [10-1] CONFIG main: Integer option syslog = 1
Jan 10 11:09:18 cbbrowne slon-slony_regress1-2[14451]: [11-1] CONFIG main: Integer option quit_sync_provider = 0
Jan 10 11:09:18 cbbrowne slon-slony_regress1-2[14451]: [12-1] CONFIG main: Integer option quit_sync_finalsync = 0
Jan 10 11:09:18 cbbrowne slon-slony_regress1-2[14451]: [13-1] CONFIG main: Integer option sync_max_rowsize = 4096
Jan 10 11:09:18 cbbrowne slon-slony_regress1-2[14451]: [14-1] CONFIG main: Integer option sync_max_largemem = 1048576
Jan 10 11:09:18 cbbrowne slon-slony_regress1-2[14451]: [15-1] CONFIG main: Integer option remote_listen_timeout = 300
Jan 10 11:09:18 cbbrowne slon-slony_regress1-2[14451]: [16-1] CONFIG main: Integer option tcp_keepalive_idle = 5
Jan 10 11:09:18 cbbrowne slon-slony_regress1-2[14451]: [17-1] CONFIG main: Integer option tcp_keepalive_interval = 5
Jan 10 11:09:18 cbbrowne slon-slony_regress1-2[14451]: [18-1] CONFIG main: Integer option tcp_keepalive_count = 5
Jan 10 11:09:18 cbbrowne slon-slony_regress1-2[14451]: [19-1] CONFIG main: Boolean option log_pid = 0
Jan 10 11:09:18 cbbrowne slon-slony_regress1-2[14451]: [20-1] CONFIG main: Boolean option log_timestamp = 1
Jan 10 11:09:18 cbbrowne slon-slony_regress1-2[14451]: [21-1] CONFIG main: Boolean option tcp_keepalive = 1
Jan 10 11:09:18 cbbrowne slon-slony_regress1-2[14451]: [22-1] CONFIG main: Real option real_placeholder = 0.000000
Jan 10 11:09:18 cbbrowne slon-slony_regress1-2[14451]: [23-1] CONFIG main: String option cluster_name = slony_regress1
Jan 10 11:09:18 cbbrowne slon-slony_regress1-2[14451]: [24-1] CONFIG main: String option conn_info = dbname=slonyregress2 host=localhost user=postgres port=7090
Jan 10 11:09:18 cbbrowne slon-slony_regress1-2[14451]: [25-1] CONFIG main: String option pid_file = /tmp/slony-regress.XIjI9L/slon-pid.2
Jan 10 11:09:18 cbbrowne slon-slony_regress1-2[14451]: [26-1] CONFIG main: String option log_timestamp_format = %Y-%m-%d %H:%M:%S %Z
Jan 10 11:09:18 cbbrowne slon-slony_regress1-2[14451]: [27-1] CONFIG main: String option archive_dir = [NULL]
Jan 10 11:09:18 cbbrowne slon-slony_regress1-2[14451]: [28-1] CONFIG main: String option sql_on_connection = SET log_min_duration_statement to '1000';
Jan 10 11:09:18 cbbrowne slon-slony_regress1-2[14451]: [29-1] CONFIG main: String option lag_interval = 2 seconds
Jan 10 11:09:18 cbbrowne slon-slony_regress1-2[14451]: [30-1] CONFIG main: String option command_on_logarchive = [NULL]
Jan 10 11:09:18 cbbrowne slon-slony_regress1-2[14451]: [31-1] CONFIG main: String option syslog_facility = LOCAL0
Jan 10 11:09:18 cbbrowne slon-slony_regress1-2[14451]: [32-1] CONFIG main: String option syslog_ident = slon-slony_regress1-2
Jan 10 11:09:18 cbbrowne slon-slony_regress1-2[14451]: [33-1] CONFIG main: String option cleanup_interval = 30 seconds
Jan 10 11:09:18 cbbrowne slon-slony_regress1-2[14451]: [34-1] CONFIG main: local node id = 2

I haven't run any deep tests involving inducing network outages, but this certainly plays OK when the network *isn't* broken.
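For reference, the slon.conf fragment behind those logged values would look something like this (option names as they appear in the log; the 5-second values are aggressive settings chosen for testing, not production recommendations):

```
tcp_keepalive = 1
tcp_keepalive_idle = 5
tcp_keepalive_interval = 5
tcp_keepalive_count = 5
```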
Committed to master:

44170107c88836df55519f29a467e4bdbbb1689b
f6aeede568572577e86b81bd758415cbf9bdb3b6
660fa6787bca4690a3ace1fa93c9507a16963c61
a6893dbbaa782d7ca4b22ff1cee2b7953a29e89c