Summary:     Cluster Analysis Tool

Product:     Slony-I                 Reporter: Christopher Browne <cbbrowne>
Component:   core scripts            Assignee: Christopher Browne <cbbrowne>
Status:      ASSIGNED
Severity:    enhancement             CC:       slony1-bugs
Priority:    medium
Version:     devel
Hardware:    All
OS:          All
URL:         http://git.postgresql.org/gitweb?p=slony1-engine.git;a=blob;f=tools/test_slony_state.pl;h=fdc9dcc060229f39a1e1ac8608e33d63054658bf;hb=refs/heads/master

Attachments: Subscription sample
             Listen Paths diagram sample
             Connection Paths for Slons - sample diagram
Description
Christopher Browne
2010-12-06 09:52:29 UTC
Set up branch on GitHub: https://github.com/cbbrowne/slony1-engine/tree/bug176

Have drafted up a simple example script:
https://github.com/cbbrowne/slony1-engine/blob/bug176/tools/analyze-cluster.sh

Will attach sample diagrams.

Created attachment 76 [details]
Subscription sample
Here is a sample diagram that shows the sets and the subscription paths.
It does not, at this point, show anything about the status of the sets; that shouldn't be difficult to add later if this approach seems viable.
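A diagram like the one attached can be produced by emitting Graphviz "dot" text for each subscription edge (origin to receiver). Here is a minimal hypothetical sketch; the node names and labels are invented for illustration, not taken from the attachment:

```shell
#!/bin/sh
# Hypothetical sketch: emit Graphviz "dot" text describing one subscription
# edge (origin -> receiver).  Node names and labels are illustrative only.
DOT=$(cat <<'EOF'
digraph subscriptions {
    node1 [label="node 1 - origin of set 1"];
    node2 [label="node 2"];
    node1 -> node2 [label="set 1"];
}
EOF
)
echo "$DOT"
# Render (if Graphviz is installed): echo "$DOT" | dot -Tpng -o subscription.png
```

Adding set status later would then be a matter of decorating the node and edge labels (or colors) before handing the text to dot.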
Created attachment 77 [details]
Listen Paths diagram sample
Sample output of draft script
Created attachment 78 [details]
Connection Paths for Slons - sample diagram
Sample diagram for slon connection paths
More diagrams that might be useful:
- For each set, what nodes are permissible failover targets?
- For each set, which nodes are vulnerable to loss on failover?

I have committed a "--text" output revision to the analyze-cluster.sh script:
https://github.com/cbbrowne/slony1-engine/commit/6b3f2ff10f826a6c544129dac8624be69074cdd8

-> % ./analyze-cluster.sh --help
analyze-slony-cluster [options]
  --text                            - Do not generate any graphics or HTML
  --help                            - Request help
  --cluster=clustername             - Optional specification of cluster to be used
  --output-directory=/tmp/somewhere - Indicates destination for graphics/HTML output

Additionally, uses libpq environment variables (PGHOST/PGPORT/PGDATABASE/...)
to indicate the database to check.

  WARNINTERVAL   - used to indicate intervals of event confirmation delay that indicate WARNING
  DANGERINTERVAL - used to indicate intervals of event confirmation delay that indicate DANGER

Here's a sample of running it in --text mode:

postgres@cbbrowne [12:56:27] [~/slony1-engine.github/tools] [bug176 *]
-> % ./analyze-cluster.sh --text
# analyze-cluster.sh running
# Text output only, to STDOUT
Generating output according to node [1]

Nodes in cluster

 node |    description    |    event_lag    |   Timeliness
------+-------------------+-----------------+----------------
    1 | Regress test node | 00:00:00        | Up To Date
    2 | node 2            | 00:14:03.607481 | Behind:Danger!
(2 rows)

If nodes have Timeliness marked as Behind:Warning, events have not propagated
in > 30 seconds, and status for the node may not be completely up to date.
If nodes have Timeliness marked as Behind:Danger, events have not propagated
in > 5 minutes, and status for the node is considered dangerously out of date.

Connections used by slon processes to manage inter-node communications

 From Server | To Client |                          conninfo                           | Retry Time
-------------+-----------+-------------------------------------------------------------+------------
           1 |         2 | dbname=slonyregress1 host=localhost user=postgres port=7091 |         10
           2 |         1 | dbname=slonyregress2 host=localhost user=postgres port=7091 |         10
(2 rows)

Replication Sets

 Set ID | Origin Node |   Description    | Tables | Sequences
--------+-------------+------------------+--------+-----------
      1 |           1 | All test1 tables |      4 |         0
(1 row)

Subscriptions that node 1 is aware of

 Set | Receiver | Provider | Does Receiver Forward? | Considered Active? | Provider is Origin? | Origin Confirmation Aging
-----+----------+----------+------------------------+--------------------+---------------------+---------------------------
   1 |        2 |        1 | t                      | t                  | t                   | 00:19:07.5263
(1 row)

Origin Confirmation Aging approximates how far behind subscriptions may be,
according to this node.
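The slon connection table shown above corresponds to the cluster's sl_path catalog (columns pa_server, pa_client, pa_conninfo, pa_connretry). A sketch of the underlying query follows; the cluster schema name "_slony_regress1" is an assumption for illustration, so substitute your own cluster name:

```shell
#!/bin/sh
# Sketch: reproduce the "Connections used by slon processes" table by querying
# sl_path directly.  The schema name below is an assumption for illustration.
SCHEMA='"_slony_regress1"'
QUERY="select pa_server as \"From Server\",
       pa_client as \"To Client\",
       pa_conninfo as conninfo,
       pa_connretry as \"Retry Time\"
  from ${SCHEMA}.sl_path
 order by pa_server, pa_client;"
echo "$QUERY"
# Against a live node (uses the usual libpq variables): psql -c "$QUERY"
```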
Activity going on in node 1's database

        Thread        | Slon PID | Node Serviced | DB Connection PID | Thread Activity  |   Event    | Event Type |   Start of Activity
----------------------+----------+---------------+-------------------+------------------+------------+------------+------------------------
 local_cleanup        |    24473 |             0 |             24489 | cleanupEvent     |            | n/a        | 2011-12-16 17:50:07+00
 local_monitor        |    24473 |             0 |             24485 | thread main loop |            | n/a        | 2011-12-16 17:39:00+00
 local_sync           |    24473 |             0 |             24488 | thread main loop |            | n/a        | 2011-12-16 17:54:52+00
 local_listen         |    24473 |             1 |             24479 | thread main loop |            | n/a        | 2011-12-16 17:54:51+00
 remote listener      |    24473 |             2 |             24487 | receiving events |            | n/a        | 2011-12-16 17:54:44+00
 remoteWorkerThread_2 |    24473 |             2 |             24486 | SYNC             | 5000000177 | SYNC       | 2011-12-16 17:54:37+00
(6 rows)

Note: local_monitor only reports in once, when slon starts up;
local_cleanup only reports in when it does a cleanup.

Event summary

 Origin Node | Event Type | Count | Max Event # |       Latest Occurrence       |      Aging
-------------+------------+-------+-------------+-------------------------------+-----------------
           1 | SYNC       |    47 |  5000000205 | 2011-12-16 17:54:51.675061+00 | 00:03:15.444986
           2 | SYNC       |     1 |  5000000110 | 2011-12-16 17:38:49.39947+00  | 00:19:17.720577
(2 rows)

The notion here is to generate some useful "dumps" of the state of the
replication cluster. This is *only* looking at this from the perspective of a
single node (e.g. it uses the usual libpq environment variables to control The
Single Database that it connects to). It gives some indication as to what bits
of the data might be out of date.

Another way of looking at this, which would lead to a substantially different
implementation, would be to try to do the following:

1. Get conninfo information for *all* the nodes.
2. Connect to all the nodes, and pull data about nodes, sets, subscriptions, and such.
3. Display the stuff that they all agree on, which should typically be the case for *all* the configuration.
4. Display separately the stuff that they disagree on.

The disagreements are likely to fall into two categories:
a) Configuration that is in progress, not yet propagated everywhere
b) Configuration that has broken
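Those steps might be sketched roughly as below. This is a hypothetical outline, not part of the committed script: the schema name and file names are invented, the live psql calls are left commented out, and two fabricated per-node dumps stand in for real cluster data so the agree/disagree split can be demonstrated with comm(1):

```shell
#!/bin/sh
# Hypothetical sketch of the multi-node comparison: dump each node's view of
# the configuration, then split agreement from disagreement.
CLUSTER='"_slony_regress1"'
DIR=$(mktemp -d)

# 1+2. Pull conninfo for all nodes from sl_path, then dump each node's own
#      view of the subscription configuration (requires a live cluster):
# psql -Atc "select distinct pa_conninfo from ${CLUSTER}.sl_path" |
# while read -r conninfo; do
#     psql "$conninfo" -Atc \
#         "select sub_set, sub_provider, sub_receiver, sub_forward, sub_active
#            from ${CLUSTER}.sl_subscribe order by 1, 2, 3" \
#         > "$DIR/node-$(echo "$conninfo" | cksum | cut -d' ' -f1).cfg"
# done

# 3+4. With the per-node dumps sorted, comm(1) separates what the nodes agree
#      on from what they disagree on.  Two fabricated dumps for illustration,
#      differing in the sub_active flag:
printf '1|1|2|t|t\n' > "$DIR/node1.cfg"
printf '1|1|2|t|f\n' > "$DIR/node2.cfg"
AGREED=$(comm -12 "$DIR/node1.cfg" "$DIR/node2.cfg")
DIFFER=$(comm -3 "$DIR/node1.cfg" "$DIR/node2.cfg")
echo "agreed on: ${AGREED:-nothing}"
echo "differ on: $DIFFER"
```

Rows that only some nodes report would then be candidates for the "in progress" category, while rows that contradict each other outright suggest broken configuration.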