Bug 176

Summary: Cluster Analysis Tool
Product: Slony-I
Reporter: Christopher Browne <cbbrowne>
Component: core scripts
Assignee: Christopher Browne <cbbrowne>
Status: ASSIGNED
Severity: enhancement
CC: slony1-bugs
Priority: medium    
Version: devel   
Hardware: All   
OS: All   
URL: http://git.postgresql.org/gitweb?p=slony1-engine.git;a=blob;f=tools/test_slony_state.pl;h=fdc9dcc060229f39a1e1ac8608e33d63054658bf;hb=refs/heads/master
Attachments:
- Subscription sample
- Listen Paths diagram sample
- Connection Paths for Slons - sample diagram

Description Christopher Browne 2010-12-06 09:52:29 PST
There is an existing tool that does some analysis of cluster configuration; see
test_slony_state.pl.

It is desirable to have something that generates diagrams of the relationships
between nodes, capturing:
# Nodes
# Subscription Sets, and the paths they take
# Paths between nodes
# Listen paths

It would be nice for the Subscription Set diagram to include an indication of
the replication state/lag for each node, such as:
# Event Number
# Events Behind Parent
# Time Behind Parent
# Events Behind Origin
# Time Behind Origin
Comment 1 Christopher Browne 2010-12-06 13:26:19 PST
Set up branch on Github:  https://github.com/cbbrowne/slony1-engine/tree/bug176

Have drafted up a simple example script:

https://github.com/cbbrowne/slony1-engine/blob/bug176/tools/analyze-cluster.sh

Will attach sample diagrams
Comment 2 Christopher Browne 2010-12-06 13:28:30 PST
Created an attachment (id=76)
Subscription sample

Here is a sample diagram that shows the sets and the subscription paths.

It does not, at this point, show anything about status of the sets; that
shouldn't be difficult to add, later, if this seems a viable approach.
Comment 3 Christopher Browne 2010-12-06 13:29:09 PST
Created an attachment (id=77)
Listen Paths diagram sample

Sample output of draft script
Comment 4 Christopher Browne 2010-12-06 13:29:42 PST
Created an attachment (id=78)
Connection Paths for Slons - sample diagram

Sample diagram for slon connection paths
Comment 5 Christopher Browne 2011-02-08 12:25:57 PST
More diagrams that might be useful...

- For each set, what nodes are permissible failover targets?

- For each set, which nodes are vulnerable to loss on failover?
Comment 6 Christopher Browne 2011-12-16 10:07:13 PST
I have committed a "--text" output revision to the analyze-cluster.sh script

https://github.com/cbbrowne/slony1-engine/commit/6b3f2ff10f826a6c544129dac8624be69074cdd8

-> % ./analyze-cluster.sh --help
analyze-slony-cluster [options]

  --text                  - Do not generate any graphics or HTML
  --help                  - Request help
  --cluster=clustername   - Optional specification of cluster to be used
  --output-directory=/tmp/somewhere  Indicates destination for graphics/HTML output

Additionally, uses libpq environment variables
(PGHOST/PGPORT/PGDATABASE/...) to indicate the database to check

WARNINTERVAL used to indicate intervals of event confirmation delay that indicate WARNING
DANGERINTERVAL used to indicate intervals of event confirmation delay that indicate DANGER

Here's a sample of running it in --text mode:

postgres@cbbrowne [12:56:27] [~/slony1-engine.github/tools] [bug176 *]
-> % ./analyze-cluster.sh --text
# analyze-cluster.sh running
# Text output only, to STDOUT
Generating output according to node [1]
Nodes in cluster
 node |    description    |    event_lag    |   Timeliness
------+-------------------+-----------------+----------------
    1 | Regress test node | 00:00:00        | Up To Date
    2 | node 2            | 00:14:03.607481 | Behind:Danger!
(2 rows)

If nodes have Timeliness marked as Behind:Warning events have not propagated in > 30 seconds, and status for the node may not be completely up to date.
If nodes have Timeliness marked as Behind:Danger events have not propagated in > 5 minutes, and status for the node is considered dangerously out of date

Connections used by slon processes to manage inter-node communications
 From Server | To Client |                          conninfo                           | Retry Time
-------------+-----------+-------------------------------------------------------------+------------
           1 |         2 | dbname=slonyregress1 host=localhost user=postgres port=7091 |         10
           2 |         1 | dbname=slonyregress2 host=localhost user=postgres port=7091 |         10
(2 rows)


Replication Sets
 Set ID | Origin Node |   Description    | Tables | Sequences
--------+-------------+------------------+--------+-----------
      1 |           1 | All test1 tables |      4 |         0
(1 row)

Subscriptions that node 1 is aware of
 Set | Receiver | Provider | Does Receiver Forward? | Considered Active? | Provider is Origin? | Origin Confirmation Aging
-----+----------+----------+------------------------+--------------------+---------------------+---------------------------
   1 |        2 |        1 | t                      | t                  | t                   | 00:19:07.5263
(1 row)

Origin Confirmation Aging approximates how far behind subscriptions may be, according to this node.
Activity going on in node 1's database
        Thread        | Slon PID | Node Serviced | DB Connection PID | Thread Activity  |   Event    | Event Type |   Start of Activity
----------------------+----------+---------------+-------------------+------------------+------------+------------+------------------------
 local_cleanup        |    24473 |             0 |             24489 | cleanupEvent     |            | n/a        | 2011-12-16 17:50:07+00
 local_monitor        |    24473 |             0 |             24485 | thread main loop |            | n/a        | 2011-12-16 17:39:00+00
 local_sync           |    24473 |             0 |             24488 | thread main loop |            | n/a        | 2011-12-16 17:54:52+00
 local_listen         |    24473 |             1 |             24479 | thread main loop |            | n/a        | 2011-12-16 17:54:51+00
 remote listener      |    24473 |             2 |             24487 | receiving events |            | n/a        | 2011-12-16 17:54:44+00
 remoteWorkerThread_2 |    24473 |             2 |             24486 | SYNC             | 5000000177 | SYNC       | 2011-12-16 17:54:37+00
(6 rows)

Note:
   local_monitor only reports in once when slon starts up
   local_cleanup only reports in when it does a cleanup

Event summary
 Origin Node | Event Type | Count | Max Event # |       Latest Occurrence       |      Aging
-------------+------------+-------+-------------+-------------------------------+-----------------
           1 | SYNC       |    47 |  5000000205 | 2011-12-16 17:54:51.675061+00 | 00:03:15.444986
           2 | SYNC       |     1 |  5000000110 | 2011-12-16 17:38:49.39947+00  | 00:19:17.720577
(2 rows)

The notion here is to generate some useful "dumps" of the state of the
replication cluster.

This *only* looks at the cluster from the perspective of a single node (i.e.
it uses the usual libpq environment variables to select The Single Database
that it connects to).

It gives some indication as to what bits of the data might be out of date.
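
The Timeliness column in the node listing can be read as a simple threshold
check on event_lag. The sketch below is an illustration, not the script's
actual code: it assumes the 30 second and 5 minute thresholds quoted in the
output are the defaults behind WARNINTERVAL and DANGERINTERVAL, and the exact
spelling of the warning label is a guess, since only "Up To Date" and
"Behind:Danger!" appear in the sample table.

```python
from datetime import timedelta

# Assumed defaults, taken from the explanatory text in the sample output.
WARN_INTERVAL = timedelta(seconds=30)
DANGER_INTERVAL = timedelta(minutes=5)


def timeliness(event_lag, warn=WARN_INTERVAL, danger=DANGER_INTERVAL):
    """Classify a node by how long its events have gone unconfirmed."""
    if event_lag > danger:
        return "Behind:Danger!"
    if event_lag > warn:
        return "Behind:Warning"  # label spelling is an assumption
    return "Up To Date"


if __name__ == "__main__":
    # The two event_lag values shown in the sample "Nodes in cluster" table.
    print(timeliness(timedelta(0)))
    print(timeliness(timedelta(minutes=14, seconds=3.607481)))
```

Checking the danger threshold first keeps the two conditions from overlapping:
any lag above 5 minutes is also above 30 seconds, so the order matters.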

Another way of looking at this that would lead to a substantially different
implementation would be to try to do the following:

1.  Get conninfo information for *all* the nodes.
2.  Connect to all the nodes, and pull data about nodes, sets, subscriptions,
and such.
3.  Display the stuff that they all agree on, which should typically be the
case for *all* the configuration.
4.  Display separately the stuff that they disagree on.  The disagreements are
likely to fall into two categories:
 a) Configuration that is in progress, not yet propagated everywhere
 b) Configuration that has broken
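
Steps 3 and 4 above could be sketched roughly as follows. This is only an
illustration under assumptions: the connection and query work of steps 1 and 2
is left out, each node's view is assumed to already be a dictionary of
configuration items, and the function name and key layout are hypothetical.

```python
from collections import defaultdict


def split_by_agreement(views):
    """Separate configuration all nodes agree on from what they dispute.

    views: dict mapping node id -> {config_key: value} as seen by that node.
    Keys and values must be hashable (e.g. tuples and strings).
    Returns (agreed, disputed); disputed maps each key to the value seen by
    every node, with None where a node does not know the item at all.
    """
    observed = defaultdict(dict)
    for node_id, config in views.items():
        for key, value in config.items():
            observed[key][node_id] = value

    agreed, disputed = {}, {}
    for key, per_node in observed.items():
        values = set(per_node.values())
        # Agreement means every node reported the key with the same value.
        if len(per_node) == len(views) and len(values) == 1:
            agreed[key] = next(iter(values))
        else:
            disputed[key] = {node: per_node.get(node) for node in views}
    return agreed, disputed


if __name__ == "__main__":
    views = {
        1: {("set", 1): "origin=1", ("path", 1, 2): "dbname=slonyregress1"},
        2: {("set", 1): "origin=1"},  # path not yet propagated to node 2
    }
    agreed, disputed = split_by_agreement(views)
    print("agreed:", agreed)
    print("disputed:", disputed)
```

This does not distinguish case a) from case b); telling "in progress" apart
from "broken" would presumably need additional information, such as how long
the disagreement has persisted.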