CVS User Account cvsuser
Mon Sep 27 16:06:45 PDT 2004
Log Message:
-----------
1. Better described the "it's falling behind because pg_listener needs
vacuuming" problem.

2. Added in the "node stopped listening to replication events" problem

Modified Files:
--------------
    slony1-engine/doc/howto:
        helpitsbroken.txt (r1.10 -> r1.11)

Index: helpitsbroken.txt
===================================================================
RCS file: /usr/local/cvsroot/slony1/slony1-engine/doc/howto/helpitsbroken.txt,v
retrieving revision 1.10
retrieving revision 1.11
diff -Ldoc/howto/helpitsbroken.txt -Ldoc/howto/helpitsbroken.txt -u -w -r1.10 -r1.11
--- doc/howto/helpitsbroken.txt
+++ doc/howto/helpitsbroken.txt
@@ -266,7 +266,7 @@
 ddlscript() / EXECUTE SCRIPT, thus eliminating the sequence everywhere
 "at once."  Or they may be applied by hand to each of the nodes.
 
-13.  Performance Sucks after a while
+13.  Some nodes start consistently falling behind
 
 I have been running Slony-I on a node for a while, and am seeing
 system performance suffering.
@@ -276,7 +276,9 @@
    fetch 100 from LOG;
 
 This is characteristic of pg_listener (which is the table containing
-NOTIFY data) having plenty of dead tuples in it.
+NOTIFY data) having plenty of dead tuples in it.  That makes NOTIFY
+events take a long time, and causes the affected node to gradually
+fall further and further behind.
 
 You quite likely need to do a VACUUM FULL on pg_listener, to
 vigorously clean it out, and need to vacuum pg_listener really
@@ -285,7 +287,7 @@
 Slon daemons already vacuum a bunch of tables, and cleanup_thread.c
 contains a list of tables that are frequently vacuumed automatically.
 In Slony-I 1.0.2, pg_listener is not included.  In later versions, it
-will be, so that you probably don't need to worry about this anymore.
+will be, so this may be an obsolete problem.
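+
+Until you are running a version that vacuums it for you, a manual
+cleanup along these lines (a sketch; run it as a superuser, on a
+schedule matching your event volume) keeps pg_listener under control:
+
+   -- report how many dead row versions pg_listener is carrying
+   VACUUM VERBOSE pg_listener;
+   -- reclaim the space; note that VACUUM FULL takes an exclusive lock
+   VACUUM FULL pg_listener;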
 
 14.  I started doing a backup using pg_dump, and suddenly Slony stops
 replicating anything.
@@ -341,3 +343,42 @@
 
 Conclusion: Even if there is not going to be a subscriber around, you
 _really_ want to have a slon running to service the "master" node.
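+
+For reference, getting that slon running can be as simple as the
+following (a sketch; the cluster name and conninfo are placeholders
+for your own values):
+
+   slon mycluster "dbname=mydb host=localhost user=postgres"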
+
+16.  I pointed a subscribing node to a different parent and it stopped
+replicating
+
+We noticed this happening when we wanted to re-initialize a node,
+where the configuration was as follows:
+
+ Node 1 - master
+ Node 2 - child of node 1 - the node we're reinitializing
+ Node 3 - child of node 2 - node that should keep replicating
+
+The subscription for node 3 was changed to have node 1 as provider,
+and we did DROP SET/SUBSCRIBE SET for node 2 to get it repopulating.
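+
+For node 3, that change amounts to a slonik SUBSCRIBE SET along these
+lines (a sketch; the set id of 1 and the forwarding flag are
+assumptions for illustration):
+
+   subscribe set (id = 1, provider = 1, receiver = 3, forward = no);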
+
+Unfortunately, replication to node 3 suddenly stopped.
+
+The problem was that there was not a suitable set of "listener paths"
+in sl_listen to allow the events from node 1 to propagate to node 3.
+The events were going through node 2, and blocking behind the
+SUBSCRIBE SET event that node 2 was working on.
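+
+You can see which paths a node will use by looking at sl_listen on
+that node (a sketch; it assumes the cluster is named oxrslive, so the
+Slony-I configuration lives in the _oxrslive schema):
+
+   select li_origin, li_provider, li_receiver from _oxrslive.sl_listen;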
+
+The following slonik script dropped out the listen paths where node 3
+had to go through node 2, and added in direct listens between nodes 1
+and 3.
+
+cluster name = oxrslive;
+ node 1 admin conninfo='host=32.85.68.220 dbname=oxrslive user=postgres port=5432';
+ node 2 admin conninfo='host=32.85.68.216 dbname=oxrslive user=postgres port=5432';
+ node 3 admin conninfo='host=32.85.68.244 dbname=oxrslive user=postgres port=5432';
+ node 4 admin conninfo='host=10.28.103.132 dbname=oxrslive user=postgres port=5432';
+try {
+      store listen (origin = 1, receiver = 3, provider = 1);
+      store listen (origin = 3, receiver = 1, provider = 3);
+      drop listen (origin = 1, receiver = 3, provider = 2);
+      drop listen (origin = 3, receiver = 1, provider = 2);
+}
+
+Immediately after this script was run, SYNC events started propagating
+again to node 3.

