[Slony1-commit] By cbbrowne: Documentation augmented...

Tue Jul 25 17:36:04 PDT 2006

Log Message:
-----------
Documentation augmented...

elein pointed out the good question "When is it OK / NOT OK to kill off
slons?"

I've added some comments on this to the best practices and FAQ.  And
added a link to the "generate_syncs.sh" script...

Modified Files:
--------------
    slony1-engine/doc/adminguide:
        bestpractices.sgml (r1.21 -> r1.22)
        faq.sgml (r1.59 -> r1.60)
        maintenance.sgml (r1.22 -> r1.23)

-------------- next part --------------
Index: maintenance.sgml
===================================================================
RCS file: /usr/local/cvsroot/slony1/slony1-engine/doc/adminguide/maintenance.sgml,v
retrieving revision 1.22
retrieving revision 1.23
diff -Ldoc/adminguide/maintenance.sgml -Ldoc/adminguide/maintenance.sgml -u -w -r1.22 -r1.23

--- doc/adminguide/maintenance.sgml
+++ doc/adminguide/maintenance.sgml
@@ -82,7 +82,8 @@
 thereby your whole day.</para>
 
 </sect2>
-<sect2><title>Parallel to Watchdog: generate_syncs.sh</title>
+
+<sect2 id="gensync"><title>Parallel to Watchdog: generate_syncs.sh</title>
 
 <para>A new script for &slony1; 1.1 is
 <application>generate_syncs.sh</application>, which addresses the following kind of
Index: bestpractices.sgml
===================================================================
RCS file: /usr/local/cvsroot/slony1/slony1-engine/doc/adminguide/bestpractices.sgml,v
retrieving revision 1.21
retrieving revision 1.22
diff -Ldoc/adminguide/bestpractices.sgml -Ldoc/adminguide/bestpractices.sgml -u -w -r1.21 -r1.22
--- doc/adminguide/bestpractices.sgml
+++ doc/adminguide/bestpractices.sgml
@@ -154,7 +154,6 @@
 
 <para> In practice, strewing &lslon; processes and configuration
 across a dozen servers turns out to be inconvenient to manage.</para>
-
 </listitem>
 
 <listitem><para> &lslon; processes should run in the same
@@ -175,6 +174,31 @@
 condition. </para>
 </listitem>
 
+<listitem><para> Before getting too excited about having fallen into
+some big problem, consider killing and restarting all the &lslon;
+processes.  Historically, this has frequently been able to
+resolve <quote>stickiness.</quote> </para>
+
+<para> With very few exceptions, it is generally not a big deal to
+kill off and restart the &lslon; processes.  Each &lslon; connects to
+one database for which it is the manager, and then connects to other
+databases as needed to draw in events.  If you kill off a &lslon;, all
+you do is interrupt those connections.  If
+a <command>SYNC</command> or other event is sitting there
+half-processed, there's no problem: the transaction will roll back,
+and when the &lslon; restarts, it will restart that event from
+scratch.</para>
+
+<para> The exception, where it is undesirable to restart a &lslon;, is
+where a <command>COPY_SET</command> is running on a large replication
+set, such that stopping the &lslon; may discard several hours' worth of
+load work. </para>
+
+<para> In early versions of &slony1;, connections would frequently
+get a bit <quote>deranged</quote>, and restarting the &lslon;s would
+clean that up.  This has become much rarer, but restarting the &lslon;
+still occasionally proves useful.</para> </listitem>
+
 <listitem>
 <para>The <link linkend="ddlchanges"> Database Schema Changes </link>
 section outlines some practices that have been found useful for
Index: faq.sgml
===================================================================
RCS file: /usr/local/cvsroot/slony1/slony1-engine/doc/adminguide/faq.sgml,v
retrieving revision 1.59
retrieving revision 1.60
diff -Ldoc/adminguide/faq.sgml -Ldoc/adminguide/faq.sgml -u -w -r1.59 -r1.60
--- doc/adminguide/faq.sgml
+++ doc/adminguide/faq.sgml
@@ -462,6 +462,63 @@
 threaten the entire server.  </para></answer>
 </qandaentry>
 
+<qandaentry>
+<question><para> When can I shut down &lslon; processes?</para>
+
+<para> Are there risks to doing so?  How about
+benefits?</para></question>
+
+<answer><para> Generally, it's no big deal to shut down a &lslon;
+process.  Each one is <quote>merely</quote> a &postgres; client,
+managing one node, which spawns threads to manage receiving events
+from other nodes.  </para>
+
+<para>The <quote>event listening</quote> threads are no big deal; they
+are doing nothing fancier than periodically checking remote nodes to
+see if they have work to be done on this node.  If you kill off the
+&lslon; these threads will be closed, which should have little or no
+impact on much of anything.  Events generated while the &lslon; is
+down will be picked up when it is restarted.</para>
+
+<para> The <quote>node managing</quote> thread is a bit more
+interesting; on a subscriber, you can expect this thread to spend
+most of its time processing <command>SYNC</command> events.  If you
+shut off the &lslon; during an event, the transaction
+will fail and be rolled back, so that when the &lslon; restarts, it
+will have to go back and reprocess the event.</para>
+
+<para> The only situation where this will
+cause <emphasis>particular</emphasis> <quote>heartburn</quote> is if
+the event being processed was one which takes a long time to process,
+such as <command>COPY_SET</command> for a large replication
+set. </para>
+
+<para> The other thing that <emphasis>might</emphasis> cause trouble
+is if the &lslon; runs fairly distant from nodes that it connects to;
+you could discover that database connections are left <command>idle in
+transaction</command>.  This would normally only occur if the network
+connection is destroyed without either the &lslon; or the database being made
+aware of it.  In that case, you may discover
+that <quote>zombied</quote> connections are left around for as long as
+two hours if you don't go in by hand and kill off the &postgres;
+backends.</para>
+
+<para> There is one other case that could cause trouble: when the
+&lslon; managing the origin node is not running,
+no <command>SYNC</command> events run against that node.  If the
+&lslon; stays down for an extended period of time, and something
+like <xref linkend="gensync"> isn't running, you could be left
+with <emphasis>one big <command>SYNC</command></emphasis> to process
+when it comes back up.  But that is only a concern if that &lslon; is
+down for an extended period of time; shutting it down for a few
+seconds shouldn't cause any great problem. </para> </answer>
+
+<answer><para> In short, if you don't have something like an
+18-hour <command>COPY_SET</command> under way, it's normally not at all a
+big deal to take a &lslon; down for a little while, or perhaps even
+cycle <emphasis>all</emphasis> the &lslon;s. </para> </answer>
+</qandaentry>
+
 </qandadiv>
 
 <qandadiv id="faqconfiguration"> <title> &slony1; FAQ: Configuration Issues </title>