Bug 175

Summary:	Monitoring cluster better
Product:	Slony-I	Reporter:	Christopher Browne <cbbrowne>
Component:	core scripts	Assignee:	Christopher Browne <cbbrowne>
Status:	RESOLVED FIXED
Severity:	enhancement	CC:	slony1-bugs
Priority:	low
Version:	devel
Hardware:	All
OS:	All
URL:	http://wiki.postgresql.org/wiki/SlonyBrainstorming#Slon_monitoring
Bug Depends on:	167
Bug Blocks:

Description Christopher Browne 2010-12-03 13:45:56 UTC

* slon records in a queryable form what it's working on
* Requires writing (+COMMIT) at the start of the event loop
* [As noted by Yehuda] This ought to also be useful with slonik, to allow indicating "whazzup"?

Debate took place surrounding various mechanisms.  The only one without dramatically unacceptable properties was to add a table to the Slony-I schema.

; SNMP
: Existing standard for network-based monitoring
: Downside - requires extra infrastructure including libraries and possibly additional tooling to use it
; [http://www.spread.org/ Spread]
: Fast
: Downside - requires extra infrastructure that's not particularly "standard"
; NOTIFY/LISTEN
: Built-in to Postgres - no new infrastructure needed
: In Postgres 9.0+ can carry payload
: Only becomes visible upon COMMIT
: Results are not stored; listeners must pre-declare their interest
; SQL table
: Built-in to Postgres - no new infrastructure needed
: Only becomes visible upon COMMIT
: Another table to manage and clean up

==== Monitoring Requirements ====

Crucial facts that we want to know about:
# Is replication behind?
# What are components (e.g. - slon, slonik) doing? <BR> Note that slon has numerous threads
# Recent abnormal states for events (e.g. - error messages)
# Are any non-SYNC events outstanding?
# Backlog volume?
# What is the cluster's configuration?

===== Replication Behind? =====

The existing view sl_status captures this from a what-is-confirmed perspective.  That is not perfect, but it is not obvious that there is high priority to enhancing this.

===== What are components doing? =====

Nothing relevant is captured in a usable fashion.

It is thought that what we may do is to add a table where each thread would capture ''what am I doing?'' (which would replace whatever was previously being done)

This table would contain a tuple for:
# Each remote worker thread
# Cleanup thread
# Each remote listener thread
# Local SYNC thread

It would track things such as:
# Time processing started
# What thread/process am I?
# What node am I for?
# What am I doing?  <BR> Possibly in several pieces, to cover the following sorts of facts:
## Event ID
## Event type <BR> Though this could be pulled from sl_event, given node and event ID
## Additional event activity <BR> Again, could be pulled from sl_event, given node and event ID

Note that the contents of this table should be quite tiny; a tuple per slon thread on a node.

This also needs to be able to capture what '''slonik''' is doing; this seems more troublesome.
# It is possible to have multiple slonik instances acting concurrently - multiple concurrent events!
# There is no natural "event loop" such that slonik activities would be expected to clean themselves up over time

====== Suggested slon implementation ======

Two approaches emerged for establishing connections to capture this monitoring data
# Each thread opens its own DB connection <BR> Unacceptable: Leads to ''enormous'' increase in use of DB connections that are mostly basically idle
# Establish a "monitoring thread"
## A message queue allows other threads to stow entries (complete with timestamps) that the monitoring thread periodically flushes to the database
## It is plausible that this thread could be merged into the existing local SYNC thread which isn't terribly busy

===== Recent abnormal states for events  =====

This captures messages about the most recent problem that occurred, storing:
# Time of abnormality
# Event ID
# Node ID
# Description / Error Message

===== non-SYNC events outstanding? =====

This information is already captured, and may be revealed by running a query that asks, on the source node, for all events that are:
# Not SYNC events
# Have not been confirmed by the subscriber

===== Backlog volume =====

[http://www.slony.info/bugzilla/show_bug.cgi?id=166 Bug #166]

This seems troublesome; calculating the number of sl_log_* tuples involved in a particular SYNC requires running the same complex query that the remote_worker thread uses to determine which tuples are to be applied.

This is a query that is complex to generate that is fairly expensive to run.

Note that [http://www.slony.info/bugzilla/show_bug.cgi?id=167 Bug #167] is changing this query.

===== Cluster configuration =====

There is an existing tool that does some analysis of cluster configuration; see [http://git.postgresql.org/gitweb?p=slony1-engine.git;a=blob;f=tools/test_slony_state.pl;h=fdc9dcc060229f39a1e1ac8608e33d63054658bf;hb=refs/heads/master test_slony_state.pl]

It is desirable to have something that generates diagrams of the relationships between nodes, capturing:
# Nodes
# Subscription Sets, and the paths they take
# Paths between nodes
# Listen paths

It would be nice for the Subscription Set diagram to include indication of replication state/lag for each node, indicating things like:
# Event Number
# Events Behind Parent
# Time Behind Parent
# Events Behind Origin
# Time Behind Origin

Comment 1 Christopher Browne 2010-12-06 14:02:16 UTC

Note that the cluster analysis portion has been put into bug #176

Comment 2 Christopher Browne 2010-12-14 10:16:56 UTC

Set up a branch for bug #175
https://github.com/cbbrowne/slony1-engine/tree/bug175

Comment 3 Christopher Browne 2011-01-27 09:23:06 UTC

More diagrams that might be useful...

- For each set, what nodes are permissible failover targets?

- For each set, which nodes are vulnerable to loss on failover?

Comment 4 Christopher Browne 2011-02-14 10:03:51 UTC

Sample output:

-> % psql -d slonyregress2 -c "select * from _slony_regress1.sl_components ;"
       co_actor       | co_pid | co_node | co_connection_pid | co_activity |      co_starttime      |  co_event  | co_eventtype
----------------------+--------+---------+-------------------+-------------+------------------------+------------+--------------
 local_sync           |   1399 |       0 |              1410 |             | 2011-02-14 17:37:32+00 |            |
 local_monitor        |   1399 |       0 |              1411 |             | 2011-02-14 17:37:32+00 |            |
 remote listener      |   1399 |       1 |              1412 |             | 2011-02-14 17:37:32+00 |            |
 local_cleanup        |   1399 |       0 |              1413 |             | 2011-02-14 17:37:32+00 |            |
 local_listen         |   1399 |       2 |              1403 | SYNC        | 2011-02-14 17:37:50+00 |            | SYNC
 remoteWorkerThread_1 |   1399 |       1 |              1409 | SYNC        | 2011-02-14 17:37:56+00 | 5000000053 | SYNC
(6 rows)

Comment 5 Christopher Browne 2011-02-17 12:29:50 UTC

Suggestion from Jan...

Implement as a stack.

When pushing, see if the entry is for "my" thread; if so, update rather than pushing a new entry.

Comment 6 Christopher Browne 2011-02-18 15:23:36 UTC

I started taking a poke at adding this monitoring to Slonik, and that looks like it's liable to be troublesome and/or overly intrusive.

The trouble is that for this to be useful, it needs to generate extra connections to databases, and do many COMMITs during the course of a Slonik script, which is going to lead to a flurry of extra DB connections.

I'm not sure but that this is more trouble than the would-be gain.  It also looks like it requires quite a lot of extra instrumentation code in slonik.c.  So the "first shot" on this involves not changing slonik.

Comment 7 Christopher Browne 2011-02-25 13:50:28 UTC

I have run pgindent against this, to apply typical PG indentation policy.

Should be ready for review.

Comment 8 Steve Singer 2011-03-14 06:48:38 UTC

(In reply to comment #7)
> I have run pgindent against this, to apply typical PG indentation policy.
> 
> Should be ready for review.

Review at 

http://lists.slony.info/pipermail/slony1-hackers/2011-March/000446.html

Comment 9 Christopher Browne 2011-03-15 10:16:15 UTC

(In reply to comment #8)
> (In reply to comment #7)
> > I have run pgindent against this, to apply typical PG indentation policy.
> > 
> > Should be ready for review.
> 
> Review at 
> 
> http://lists.slony.info/pipermail/slony1-hackers/2011-March/000446.html

I have responded to this on the list; see:
http://lists.slony.info/pipermail/slony1-hackers/2011-March/000447.html

There are patches for virtually all of the items mentioned in the review:

https://github.com/cbbrowne/slony1-engine/commit/c6082bc77f460e985c7a2891c45269c9648347c9

https://github.com/cbbrowne/slony1-engine/commit/72ed0337641eefda095cfcb706562d71d9c03413

https://github.com/cbbrowne/slony1-engine/commit/7a3c5f54dcf72bcf67a49f6ed5b80ed673fc0ed5

https://github.com/cbbrowne/slony1-engine/commit/0543c96506eb55012c4366f78009f886b9464d59

https://github.com/cbbrowne/slony1-engine/commit/03f52daf1fb7ae2d819ecd0c411342a594f6884d

https://github.com/cbbrowne/slony1-engine/commit/d00574d2d1793a372467bc103a7772030fd939ab

There were two other issues mentioned which I'll quote as items to specifically respond to.

Comment 10 Christopher Browne 2011-03-15 10:17:30 UTC

Per Steve Singer: <http://lists.slony.info/pipermail/slony1-hackers/2011-March/000446.html>

A general comment:
This monitor thread calls slon_retry() a bunch of times.  We've created a whole bunch of new failure points for slon.  I'm not sure issues recording the monitor state should result in a slon falling over. I feel that if the monitor thread hits an error it can reset the database connections for the monitor thread but it should leave the threads doing real work alone.   Having said that we might also want a maximum size of the queue.  If the queue exceeds a certain size (maybe because the monitor thread keeps hitting issues) then we stop putting new  elements on the queue or discard the oldest ones.

Comment 11 Christopher Browne 2011-03-15 10:18:07 UTC

Per Steve Singer:
<http://lists.slony.info/pipermail/slony1-hackers/2011-March/000446.html>

I would also like to know what the performance impact in terms of system load is of running the monitoring stuff.  Ie run some tests without the monitoring stuff measure average cpu load and run the same test with the monitoring code and measure average cpu load.  I don't think the impact is going to be high but we should still measure it.

Comment 12 Christopher Browne 2011-03-16 13:46:31 UTC

Per Steve's recommendation, monitoring thread problems aren't necessarily serious problems warranting restarting the slon.

This patch implements that:
https://github.com/cbbrowne/slony1-engine/commit/d1a5be153e51cd10ac12a3dcb9cbb63856a18306

Note that there are still some slon_retry() calls remaining, in the scenario where the slon process runs out of memory.  That seems pretty fatal to me...

I realized that there are cases where we have a horrid memory leak; if the monitoring thread isn't working, then we shouldn't be collecting monitoring entries.  Thus, made two changes:

1.  When we exit from the monitoring thread main loop, turn monitoring off.

2.  In the "push" function, if monitoring is shut off, return immediately, do not attempt to queue up monitoring data.

https://github.com/cbbrowne/slony1-engine/commit/a61f4642ed29bc727d2361a4000500b81573ea11

I realized that the one entry for the monitoring thread will indicate the time when the slon daemon started.  Document that...

https://github.com/cbbrowne/slony1-engine/commit/d3f66cccb54168ca8d6a80261337a66228720cd8

Comment 13 Christopher Browne 2011-03-17 10:42:33 UTC

I don't see a particularly good way to readily analyze CPU loading from this addition, and I'm not sure that's even the best measurement, in any case.

As an alternative, I instrumented the code in two places, with the following patch:
diff --git a/src/slon/monitor_thread.c b/src/slon/monitor_thread.c
index 108b338..d88abfd 100644
--- a/src/slon/monitor_thread.c
+++ b/src/slon/monitor_thread.c
@@ -180,6 +180,8 @@ monitorThread_main(void *dummy)
                                                free(state.activity);
                                        if (state.event_type != NULL)
                                                free(state.event_type);
+                                       slon_log(SLON_INFO, "monitor_thread - record - %s\n",
+                                                        dstring_data(&monquery));
                                        res = PQexec(dbconn, dstring_data(&monquery));
                                        if (PQresultStatus(res) != PGRES_TUPLES_OK)
                                        {
@@ -384,6 +386,9 @@ monitor_state(const char *actor, int node, pid_t conn_pid, /* @null@ */ const ch
        {
                tos->event_type = NULL;
        }
+       slon_log(SLON_INFO, "monitor_state(pid:%d node:%d connpid:%d event:%lld actor[%s] activity[%s] event_type[%s]\n",
+                        tos->pid, tos->node, tos->conn_pid, tos->event,
+                        tos->actor, tos->activity, tos->event_type);
        pthread_mutex_unlock(&stack_lock);
 }


a) Each time an entry is pushed onto the stack, it is recorded in the log
b) Each time an entry is to be submitted to the DB, the query is recorded in the log.

In running a typical regression test (e.g. - test1), I saw the following results...

Test ran from  Mar 17 13:18:00 to Mar 17 13:19:06

 grep monitor_thread /var/log/messages | grep select  | wc -l

-> There were 214 calls to the component_state() function, across two nodes, so...

214 / (66*2) = 1.6 calls per second

-> There were 471 calls to monitor_state(), across those nodes, that induced these 214 calls.

That suggests to me that Jan's recommendation to optimize by allowing successive calls to overwrite a repeated entry for the same thread is doing a lot of good, reducing the DB load by more than half.

I find it compelling to expect this to be a pretty nominal addition to system load for Slony.

Poking at the table:

slonyregress1@localhost->  analyze verbose _slony_regress1.sl_components ;
INFO:  analyzing "_slony_regress1.sl_components"
INFO:  "sl_components": scanned 1 of 1 pages, containing 6 live rows and 17 dead rows; 6 rows in sample, 6 estimated total rows
ANALYZE
slonyregress1@localhost->  \c slonyregress2
You are now connected to database "slonyregress2" as user "postgres".
slonyregress2@localhost->  analyze verbose _slony_regress1.sl_components ;
INFO:  analyzing "_slony_regress1.sl_components"
INFO:  "sl_components": scanned 1 of 1 pages, containing 6 live rows and 11 dead rows; 6 rows in sample, 6 estimated total rows
ANALYZE

Comment 14 Christopher Browne 2011-03-17 10:43:06 UTC

Thing remembered...

sl_components should be added to the set of tables that the cleanup thread considers maintaining.

https://github.com/cbbrowne/slony1-engine/commit/cd3519bba5a03ce2f577747324f4a9ebc697d317

Comment 15 Steve Singer 2011-03-18 07:44:48 UTC

Some additional comments based on the latest changes (these are mostly issues that I just didn't catch in my first review)

1. We also need to include support in the upgradeSchema() function to take a 2.0.x slony install and add in the new tables for monitoring

2. In monitor thread - dbconn is never closed. Especially in the cases where we 'stop' the monitoring thread we should shutdown the connection (if we are able). We also MIGHT want to think about emptying the queue at this point. We aren't going to empty it later and if the database connection was stuck for a while before it timed out leading to the error the queue might have grown in size.

I also still wonder if we shouldn't be putting a maximum size on the queue, I know we discussed this the other day and it started to sound complicated. I think
there must be a way of doing this with throwing away old entries that isn't complicated. If I can find some time I'll see if I can come up with something. Until then I don't think that needs to hold this patch up.

I also bet we will get complaints from people
"my slon has been running fine for 3 months but slon_component table stopped updating a few weeks ago". Where some network blimp disabled the monitor_thread but the rest of slony continued to chug along. Though i think this situation is better htan slon restarting as we had before.

3. Line 189 dstring_free(monquery) before breaking out of the loop on the error.

The discussion on how we should be searching for performance regressions in slony is maybe outside the scope of this bug other than we should deal with this before we release 2.1 (looking to see if any of the 2.1 changes are a performance issue. I have ideas on how to do this but Jan might already have a setup for testing this).

Comment 16 Christopher Browne 2011-03-18 09:21:30 UTC

(In reply to comment #15)
> Some additional comments based on the latest changes (these are mostly issues
> that I just didn't catch in my first review)
> 
> 
> 1.  We also need to include support in the upgradeSchema() function to take a
> 2.0.x slony install and add in the new tables for monitoring

I thought I had done that.  I'll add in the table if it's missing.

https://github.com/cbbrowne/slony1-engine/commit/d51049b54f014ee4f00f974fc9629a7d64ca9727
 
> 2.  In monitor thread - dbconn is never closed.  Especially in the cases where
> we 'stop' the monitoring thread we should shutdown the connection (if we are
> able). We also MIGHT want to think about emptying the queue at this point.  We
> aren't going to empty it later and if the database connection was stuck for a
> while before it timed out leading to the error the queue might have grown in
> size.  

dbconn isn't closed directly - it's closed via slon_disconnect(conn).

As for emptying the queue, I don't see much value in it.  The condition on the loop is on the waits returning SCHED_STATUS_OK.  I don't think you get to ask it to terminate, or is there a signal you could throw that would kick just the one thread?

> I also still wonder if we shouldn't be putting a maximum size on the queue, I
> know we discussed this the other day and it started to sound complicated.  I
> think
> there must be a way of doing this with throwing away old entries that isn't
> complicated.  If I can find some time I'll see if I can come up with something.
> Until then I don't think that needs to hold this patch up.

Idle thought: pick two numbers, m and n, and, once we've reached size m, generate a warning every time the queue size mod n = 0.

For instance, m=100, n=10.  Once the queue is of size >100, complain every time the size increases and (size % 10) = 0.

Actually, we're doubling the size of the array each time, so it's perfectly reasonable to gripe in the logs any time we resize and the size > 100.  I'll do that...
 
> I also bet we will get complaints from people 
> "my slon has been running fine for 3 months but slon_component table stopped
> updating a few weeks ago".   Where some network blimp disabled the
> monitor_thread but the rest of slony continued to chug along.  Though i think
> this situation is better htan slon restarting as we had before.

That's the risk any time we add anything :-).
 
> 3. Line 189 dstring_free(monquery) before breaking out of the loop on the
> error.

Fixed.  https://github.com/cbbrowne/slony1-engine/commit/24da0a137120f4032858230836b39ed14da755fb

> The discussion on how we should be searching for performance regressions in
> slony is maybe outside the scope of this bug other than we should deal with
> this before we release 2.1  (looking to see if any of the 2.1 changes are a
> performance issue.  I have ideas on how to do this but Jan might already have a
> setup for testing this).

I'm happy with that.

I have done a little bit of analytics that suggest that this change should have minimal impact on performance, and that's certainly what we expected.

And it seems to make sense to look at performance more globally.  It's acceptable for us to lose a little here and there on one feature or another if:

a) We gain material reliability improvements (e.g. - reasonable value exchange)
b) There are other changes that provide offsetting performance increases.  (Bug #167, I'm looking at you!  :-))

Comment 17 Steve Singer 2011-03-18 11:01:56 UTC

> 
> > 2.  In monitor thread - dbconn is never closed.  Especially in the cases where
> > we 'stop' the monitoring thread we should shutdown the connection (if we are
> > able). We also MIGHT want to think about emptying the queue at this point.  We
> > aren't going to empty it later and if the database connection was stuck for a
> > while before it timed out leading to the error the queue might have grown in
> > size.  
> 
> dbconn isn't closed directly - it's closed via slon_disconnect(conn).

Didn't see that.  We then need to call slon_disconnect on line 92


> 
> As for emptying the queue, I don't see much value in it.  The condition on the
> loop is on the waits returning SCHED_STATUS_OK.  I don't think you get to ask
> it to terminate, or is there a signal you could throw that would kick just the
> one thread?
> 

Lines: 121, 191, and to a lesser extend 212 and 95 are all on code paths where we break out of the loop but don't free the space already used by the queue.


> Actually, we're doubling the size of the array each time, so it's perfectly
> reasonable to gripe in the logs any time we resize and the size > 100.  I'll do
> that...

For the moment I guess this is fine to commit, At some point I will try to find time to implement a properly bounded queue based on the number of threads
such that old entries get thrown away when the queue hits the bound size.

Comment 18 Christopher Browne 2011-03-18 12:11:16 UTC

Committed to head

http://git.postgresql.org/gitweb?p=slony1-engine.git;a=commit;h=59fa974c06a1cae9f355809ea81eaa6d443f0a16