| Summary: | slon sometimes does not recover from a network outage | | |
|---|---|---|---|
| Product: | Slony-I | Reporter: | Steve Singer <ssinger> |
| Component: | slon | Assignee: | Steve Singer <ssinger> |
| Status: | RESOLVED FIXED | | |
| Severity: | normal | CC: | slony1-bugs |
| Priority: | medium | | |
| Version: | devel | | |
| Hardware: | Other | | |
| OS: | Other | | |
| Attachments: | Patch for typing issue that gets revealed once bug126 is patched | | |
| | Fix compiler warnings | | |
| | fix for bug126 | | |
Description

Steve Singer 2010-05-18 12:44:40 UTC

Does this involve:

a) Setting GUC defaults for:

* tcp_keepalives_count
* tcp_keepalives_idle
* tcp_keepalives_interval

b) Something else?

If it's a matter of setting GUCs, then a Gentle User might use the sql_on_connection parameter to set these GUCs: <http://slony.info/documentation/slon-config-connection.html>

If there are values to be suggested, then let's put this into the documentation, probably in the "best practices" section: http://git.postgresql.org/gitweb?p=slony1-engine.git;a=blob;f=doc/adminguide/bestpractices.sgml

I'll note that the default GUC values on 8.4 are these:

```
org=# show tcp_keepalives_count;
 tcp_keepalives_count
----------------------
 9
(1 row)

org=# show tcp_keepalives_idle;
 tcp_keepalives_idle
---------------------
 7200
(1 row)

org=# show tcp_keepalives_interval;
 tcp_keepalives_interval
-------------------------
 75
(1 row)

org=# select version();
                                             version
--------------------------------------------------------------------------------------------------
 PostgreSQL 8.4.4 on x86_64-unknown-linux-gnu, compiled by GCC gcc (Debian 4.4.4-6) 4.4.4, 64-bit
(1 row)
```

No, that controls server-side keepalives; we want client-side keepalives. Client-side keepalive support was actually introduced in 9.0, see http://www.postgresql.org/docs/9.0/static/libpq-connect.html#LIBPQ-KEEPALIVES

Prior to 9.0, applications like slony could call PQsocket() to get the socket and then manually adjust values on it, which is what we will have to do if we want the feature when running against older versions of PG.

In either case we need to set up a system that replicates a slon-detected socket failure, to make sure we don't have some other horrible issue waiting in the code path that gets hit once the keepalive times out.
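For pre-9.0 libpq, which lacks client-side keepalive connection options, the PQsocket()/setsockopt() approach described above might look roughly like the sketch below. This is a minimal illustration, not the actual bug126 patch; the helper name enable_keepalives and the idle/interval/count values are invented for the example.

```c
/*
 * Minimal sketch of enabling TCP keepalives on an established libpq
 * connection by fetching the underlying socket with PQsocket() and
 * adjusting it with setsockopt().  Illustrative only; not the bug126 patch.
 */
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <libpq-fe.h>

static int
enable_keepalives(PGconn *conn)
{
    int fd = PQsocket(conn);
    int on = 1;
    int idle = 5;      /* seconds of idle before the first probe (example value) */
    int interval = 5;  /* seconds between probes (example value) */
    int count = 5;     /* unanswered probes before the peer is declared dead */

    if (fd < 0)
        return -1;

    /* Turn keepalives on; without this the TCP_KEEP* tuning values are inert. */
    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0)
        return -1;

    /* The probe-tuning options are platform-specific, hence the guards. */
#ifdef TCP_KEEPIDLE
    setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle));
#endif
#ifdef TCP_KEEPINTVL
    setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &interval, sizeof(interval));
#endif
#ifdef TCP_KEEPCNT
    setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &count, sizeof(count));
#endif
    return 0;
}
```

A slon-like daemon would call something like this right after each successful PQconnectdb(), so a dead peer is noticed after roughly idle + interval * count seconds instead of after the 7200-second kernel default shown above.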
Created attachment 83 [details]
Patch for typing issue that gets revealed once bug126 is patched

This patch fixes a type conversion issue with signed vs. unsigned types. Once bug126 is fixed, it exposes a bug that is addressed by this patch.

Created attachment 84 [details]
Fix compiler warnings
This patch fixes other signed-vs-unsigned compiler warnings that were found during development of the bug fix.
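As an illustration of the class of warning these patches address (not the actual contents of attachment 83 or 84), a signed error return stored in an unsigned variable can silently make an error check unreachable:

```c
#include <stdio.h>

/* Illustrative only; not code from the attached patches. */
static int get_value(void)
{
    return -1;                      /* error indicator */
}

int main(void)
{
    unsigned int v = get_value();   /* -1 wraps to a huge unsigned value */

    /* gcc -Wextra warns: comparison of unsigned expression < 0 is always false */
    if (v < 0)
        printf("error detected\n"); /* unreachable branch */
    else
        printf("v = %u (error lost)\n", v);

    int w = get_value();            /* fix: keep the value in a signed type */
    if (w < 0)
        printf("error detected correctly\n");

    return 0;
}
```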
Created attachment 85 [details]
fix for bug126

This is the actual fix for bug 126.

I have a fix for this bug available in the attached patches or at https://github.com/ssinger/slony1-engine/tree/bug126; the github branch has all three patches applied. (The fix is pretty useless without the typing issue fixed.)

An item that showed up in my testing is worth thinking about. If the network connection between the slon and the local node goes away, the slon will exit (I think this is what we want), and it then restarts and keeps looping trying to connect to the local node (again, this is what we want). However, when the network comes back and the slon connects to the postgres instance, if the timeout settings on the PostgreSQL side are such that the old backend has not yet timed out, then an sl_nodelock row will still exist. The slon will then exit on an error (and not retry). I'm not sure this is the result we want. In bug132 we made the slon loop on getting the node lock if it could not get the node lock due to a slon-requested restart; in this case we are restarting due to a slon error. I'm not sure if we want to change this behavior or not.
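To make the stale-lock scenario above concrete, a hypothetical diagnostic query along these lines could show whether a lingering sl_nodelock row still belongs to a live backend. It assumes a cluster named slony_regress1, the sl_nodelock columns nl_nodeid/nl_conncnt/nl_backendpid, and the pre-9.2 procpid column of pg_stat_activity; none of this is part of the patches.

```sql
-- Hypothetical diagnostic: list each node-lock row and whether the
-- backend that took it is still alive (a dead backend means a stale lock).
SELECT nl.nl_nodeid,
       nl.nl_conncnt,
       nl.nl_backendpid,
       (sa.procpid IS NOT NULL) AS backend_still_alive
  FROM "_slony_regress1".sl_nodelock nl
  LEFT JOIN pg_stat_activity sa ON sa.procpid = nl.nl_backendpid;
```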
I have added a patch to cause all the regression tests to set values for TCP keepalive: https://github.com/cbbrowne/slony1-engine/commit/0c08ba8302efc3795388c6246f6fe1a19f9d7f9e

One may see that this sets values for the parameters, per the slon log:

```
Jan 10 11:09:18 cbbrowne slon-slony_regress1-2[14449]: [1-1] CONFIG main: slon version 2.1.0 starting up
Jan 10 11:09:18 cbbrowne slon-slony_regress1-2[14449]: [2-1] INFO slon: watchdog process started
Jan 10 11:09:18 cbbrowne slon-slony_regress1-2[14449]: [3-1] CONFIG slon: watchdog ready - pid = 14449
Jan 10 11:09:18 cbbrowne slon-slony_regress1-2[14449]: [4-1] CONFIG slon: worker process created - pid = 14451
Jan 10 11:09:18 cbbrowne slon-slony_regress1-2[14451]: [4-1] CONFIG main: Integer option vac_frequency = 2
Jan 10 11:09:18 cbbrowne slon-slony_regress1-2[14451]: [5-1] CONFIG main: Integer option log_level = 2
Jan 10 11:09:18 cbbrowne slon-slony_regress1-2[14451]: [6-1] CONFIG main: Integer option sync_interval = 2010
Jan 10 11:09:18 cbbrowne slon-slony_regress1-2[14451]: [7-1] CONFIG main: Integer option sync_interval_timeout = 15000
Jan 10 11:09:18 cbbrowne slon-slony_regress1-2[14451]: [8-1] CONFIG main: Integer option sync_group_maxsize = 8
Jan 10 11:09:18 cbbrowne slon-slony_regress1-2[14451]: [9-1] CONFIG main: Integer option desired_sync_time = 60000
Jan 10 11:09:18 cbbrowne slon-slony_regress1-2[14451]: [10-1] CONFIG main: Integer option syslog = 1
Jan 10 11:09:18 cbbrowne slon-slony_regress1-2[14451]: [11-1] CONFIG main: Integer option quit_sync_provider = 0
Jan 10 11:09:18 cbbrowne slon-slony_regress1-2[14451]: [12-1] CONFIG main: Integer option quit_sync_finalsync = 0
Jan 10 11:09:18 cbbrowne slon-slony_regress1-2[14451]: [13-1] CONFIG main: Integer option sync_max_rowsize = 4096
Jan 10 11:09:18 cbbrowne slon-slony_regress1-2[14451]: [14-1] CONFIG main: Integer option sync_max_largemem = 1048576
Jan 10 11:09:18 cbbrowne slon-slony_regress1-2[14451]: [15-1] CONFIG main: Integer option remote_listen_timeout = 300
Jan 10 11:09:18 cbbrowne slon-slony_regress1-2[14451]: [16-1] CONFIG main: Integer option tcp_keepalive_idle = 5
Jan 10 11:09:18 cbbrowne slon-slony_regress1-2[14451]: [17-1] CONFIG main: Integer option tcp_keepalive_interval = 5
Jan 10 11:09:18 cbbrowne slon-slony_regress1-2[14451]: [18-1] CONFIG main: Integer option tcp_keepalive_count = 5
Jan 10 11:09:18 cbbrowne slon-slony_regress1-2[14451]: [19-1] CONFIG main: Boolean option log_pid = 0
Jan 10 11:09:18 cbbrowne slon-slony_regress1-2[14451]: [20-1] CONFIG main: Boolean option log_timestamp = 1
Jan 10 11:09:18 cbbrowne slon-slony_regress1-2[14451]: [21-1] CONFIG main: Boolean option tcp_keepalive = 1
Jan 10 11:09:18 cbbrowne slon-slony_regress1-2[14451]: [22-1] CONFIG main: Real option real_placeholder = 0.000000
Jan 10 11:09:18 cbbrowne slon-slony_regress1-2[14451]: [23-1] CONFIG main: String option cluster_name = slony_regress1
Jan 10 11:09:18 cbbrowne slon-slony_regress1-2[14451]: [24-1] CONFIG main: String option conn_info = dbname=slonyregress2 host=localhost user=postgres port=7090
Jan 10 11:09:18 cbbrowne slon-slony_regress1-2[14451]: [25-1] CONFIG main: String option pid_file = /tmp/slony-regress.XIjI9L/slon-pid.2
Jan 10 11:09:18 cbbrowne slon-slony_regress1-2[14451]: [26-1] CONFIG main: String option log_timestamp_format = %Y-%m-%d %H:%M:%S %Z
Jan 10 11:09:18 cbbrowne slon-slony_regress1-2[14451]: [27-1] CONFIG main: String option archive_dir = [NULL]
Jan 10 11:09:18 cbbrowne slon-slony_regress1-2[14451]: [28-1] CONFIG main: String option sql_on_connection = SET log_min_duration_statement to '1000';
Jan 10 11:09:18 cbbrowne slon-slony_regress1-2[14451]: [29-1] CONFIG main: String option lag_interval = 2 seconds
Jan 10 11:09:18 cbbrowne slon-slony_regress1-2[14451]: [30-1] CONFIG main: String option command_on_logarchive = [NULL]
Jan 10 11:09:18 cbbrowne slon-slony_regress1-2[14451]: [31-1] CONFIG main: String option syslog_facility = LOCAL0
Jan 10 11:09:18 cbbrowne slon-slony_regress1-2[14451]: [32-1] CONFIG main: String option syslog_ident = slon-slony_regress1-2
Jan 10 11:09:18 cbbrowne slon-slony_regress1-2[14451]: [33-1] CONFIG main: String option cleanup_interval = 30 seconds
Jan 10 11:09:18 cbbrowne slon-slony_regress1-2[14451]: [34-1] CONFIG main: local node id = 2
```

I haven't run any deep tests involving inducing network outages, but this certainly plays OK when the network *isn't* broken.

Committed to master:

* 44170107c88836df55519f29a467e4bdbbb1689b
* f6aeede568572577e86b81bd758415cbf9bdb3b6
* 660fa6787bca4690a3ace1fa93c9507a16963c61
* a6893dbbaa782d7ca4b22ff1cee2b7953a29e89c
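For reference, a slon.conf fragment that would produce the keepalive lines in the log above might look like the following sketch. The parameter names are taken from the log output; the values are the regression-test values, not tuning recommendations, and the boolean spelling is assumed.

```
# slon.conf sketch: client-side TCP keepalives as exercised by the
# regression tests (values from the log above; boolean spelling assumed).
tcp_keepalive = true
tcp_keepalive_idle = 5
tcp_keepalive_interval = 5
tcp_keepalive_count = 5
```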