Bug 175

Summary: Monitoring cluster better
Product: Slony-I
Reporter: Christopher Browne <cbbrowne>
Component: core scripts
Assignee: Christopher Browne <cbbrowne>
Status: RESOLVED FIXED
Severity: enhancement
CC: slony1-bugs
Priority: low
Version: devel
Hardware: All
OS: All
URL: http://wiki.postgresql.org/wiki/SlonyBrainstorming#Slon_monitoring
Bug Depends on: 167
Bug Blocks:

Description Christopher Browne 2010-12-03 13:45:56 PST
* slon records in a queryable form what it's working on
* Requires writing (+COMMIT) at the start of the event loop
* [As noted by Yehuda] This ought to also be useful with slonik, to allow
indicating "whazzup"?

Debate took place surrounding various mechanisms.  The only one without
dramatically unacceptable properties was to add a table to the Slony-I schema.

; SNMP
: Existing standard for network-based monitoring
: Downside - requires extra infrastructure including libraries and possibly
additional tooling to use it
; [http://www.spread.org/ Spread]
: Fast
: Downside - requires extra infrastructure that's not particularly "standard"
; NOTIFY/LISTEN
: Built-in to Postgres - no new infrastructure needed
: In Postgres 9.0+ can carry payload
: Only becomes visible upon COMMIT
: Results are not stored; listeners must pre-declare their interest
; SQL table
: Built-in to Postgres - no new infrastructure needed
: Only becomes visible upon COMMIT
: Another table to manage and clean up
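
To illustrate the NOTIFY/LISTEN properties above, a minimal sketch (the
channel name "slony_monitor" is hypothetical; payloads require Postgres 9.0+):

 -- The listener must pre-declare its interest; it sees nothing sent earlier.
 LISTEN slony_monitor;

 -- The sender; the notification only becomes visible to listeners at COMMIT,
 -- and it is not stored anywhere afterward.
 BEGIN;
 NOTIFY slony_monitor, 'remoteWorkerThread_1: SYNC 5000000053';
 COMMIT;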

==== Monitoring Requirements ====

Crucial facts that we want to know about:
# Is replication behind?
# What are components (e.g. - slon, slonik) doing? <BR> Note that slon has
numerous threads
# Recent abnormal states for events (e.g. - error messages)
# Are any non-SYNC events outstanding?
# Backlog volume?
# What is the cluster's configuration?

===== Replication Behind? =====

The existing view sl_status captures this from a what-is-confirmed perspective.
That is not perfect, but enhancing it does not appear to be a high priority.
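
For example, a quick check can be as simple as the following (a sketch;
"_slony_regress1" stands in for the actual cluster schema, and the lag columns
are as recalled from the sl_status view definition):

 -- Per-subscriber lag, as seen from the origin node.
 select st_origin, st_received, st_lag_num_events, st_lag_time
   from _slony_regress1.sl_status;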

===== What are components doing? =====

Nothing relevant is captured in a usable fashion.

It is thought that we may add a table where each thread would capture ''what
am I doing?'', with each new entry replacing whatever that thread previously
recorded.

This table would contain a tuple for:
# Each remote worker thread
# Cleanup thread
# Each remote listener thread
# Local SYNC thread

It would track things such as:
# Time processing started
# What thread/process am I?
# What node am I for?
# What am I doing?  <BR> Possibly in several pieces, to cover the following
sorts of facts:
## Event ID
## Event type <BR> Though this could be pulled from sl_event, given node and
event ID
## Additional event activity <BR> Again, could be pulled from sl_event, given
node and event ID

Note that the contents of this table should be quite tiny: one tuple per slon
thread on a node.
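
As a sketch, such a table might look like the following (the column names
match the sl_components sample in comment 4 below; the types are guesses, not
the committed definition):

 create table sl_components (
     co_actor          text not null primary key, -- thread name, e.g. "local_sync"
     co_pid            integer not null,          -- pid of the slon process
     co_node           integer not null,          -- node the thread works for
     co_connection_pid integer not null,          -- backend pid of the thread's DB connection
     co_activity       text,                      -- what am I doing?
     co_starttime      timestamptz not null,      -- when processing started
     co_event          bigint,                    -- event ID, if applicable
     co_eventtype      text                       -- event type, e.g. "SYNC"
 );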

This also needs to be able to capture what '''slonik''' is doing; this seems
more troublesome.
# It is possible to have multiple slonik instances acting concurrently -
multiple concurrent events!
# There is no natural "event loop" such that slonik activities would be
expected to clean themselves up over time

====== Suggested slon implementation ======

Two approaches emerged for establishing connections to capture this monitoring
data:
# Each thread opens its own DB connection <BR> Unacceptable: leads to an
''enormous'' increase in the number of DB connections, most of them essentially
idle
# Establish a "monitoring thread" (sketched below)
## A message queue allows other threads to stow entries (complete with
timestamps) that the monitoring thread periodically flushes to the database
## It is plausible that this thread could be merged into the existing local
SYNC thread, which isn't terribly busy
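
In sketch form, the second approach might look like this (illustrative names
only, not the actual slon source):

 #include <pthread.h>
 #include <stdlib.h>
 #include <string.h>
 #include <time.h>

 /* Sketch only: worker threads enqueue timestamped entries cheaply; a
  * single monitoring thread owns the one extra DB connection and
  * periodically drains the queue into the monitoring table. */
 typedef struct mon_entry
 {
     char             *actor;     /* which thread */
     char             *activity;  /* what it is doing */
     time_t            stamp;     /* stamped at enqueue time, not flush time */
     struct mon_entry *next;
 } mon_entry;

 static mon_entry       *mon_queue = NULL;
 static pthread_mutex_t  mon_lock = PTHREAD_MUTEX_INITIALIZER;

 /* Called from any worker thread; no DB work happens here. */
 static void
 monitor_enqueue(const char *actor, const char *activity)
 {
     mon_entry  *e = malloc(sizeof(mon_entry));

     e->actor = strdup(actor);
     e->activity = strdup(activity);
     e->stamp = time(NULL);
     pthread_mutex_lock(&mon_lock);
     e->next = mon_queue;
     mon_queue = e;
     pthread_mutex_unlock(&mon_lock);
 }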

===== Recent abnormal states for events  =====

This captures messages about the most recent problem that occurred, storing:
# Time of abnormality
# Event ID
# Node ID
# Description / Error Message
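
A sketch of such a table (hypothetical name and columns, for illustration
only):

 create table sl_recent_errors (
     re_errtime timestamptz not null, -- time of abnormality
     re_event   bigint,               -- event ID
     re_node    integer not null,     -- node ID
     re_message text                  -- description / error message
 );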

===== non-SYNC events outstanding? =====

This information is already captured, and may be revealed by running a query
that asks, on the source node, for all events that:
# are not SYNC events
# have not yet been confirmed by the subscriber
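
For instance, something along these lines, run on the source node (a sketch
assuming the usual sl_event/sl_confirm layout; the cluster schema name and
receiver node 2 are example values):

 -- Non-SYNC events originating here that node 2 has not yet confirmed.
 select ev_origin, ev_seqno, ev_type, ev_timestamp
   from _slony_regress1.sl_event e
  where e.ev_type <> 'SYNC'
    and not exists (select 1
                      from _slony_regress1.sl_confirm c
                     where c.con_origin = e.ev_origin
                       and c.con_received = 2
                       and c.con_seqno >= e.ev_seqno);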

===== Backlog volume =====

[http://www.slony.info/bugzilla/show_bug.cgi?id=166 Bug #166]

This seems troublesome; calculating the number of sl_log_* tuples involved in a
particular SYNC requires running the same complex query that the remote_worker
thread uses to determine which tuples are to be applied.

This query is complex to generate and fairly expensive to run.

Note that [http://www.slony.info/bugzilla/show_bug.cgi?id=167 Bug #167] is
changing this query.

===== Cluster configuration =====

There is an existing tool that does some analysis of cluster configuration; see
[http://git.postgresql.org/gitweb?p=slony1-engine.git;a=blob;f=tools/test_slony_state.pl;h=fdc9dcc060229f39a1e1ac8608e33d63054658bf;hb=refs/heads/master
test_slony_state.pl]

It is desirable to have something that generates diagrams of the relationships
between nodes, capturing:
# Nodes
# Subscription Sets, and the paths they take
# Paths between nodes
# Listen paths

It would be nice for the Subscription Set diagram to include indication of
replication state/lag for each node, indicating things like:
# Event Number
# Events Behind Parent
# Time Behind Parent
# Events Behind Origin
# Time Behind Origin
Comment 1 Christopher Browne 2010-12-06 14:02:16 PST
Note that the cluster analysis portion has been put into bug #176
Comment 2 Christopher Browne 2010-12-14 10:16:56 PST
Set up a branch for bug #175
https://github.com/cbbrowne/slony1-engine/tree/bug175
Comment 3 Christopher Browne 2011-01-27 09:23:06 PST
More diagrams that might be useful...

- For each set, what nodes are permissible failover targets?

- For each set, which nodes are vulnerable to loss on failover?
Comment 4 Christopher Browne 2011-02-14 10:03:51 PST
Sample output:

-> % psql -d slonyregress2 -c "select * from _slony_regress1.sl_components ;"
       co_actor       | co_pid | co_node | co_connection_pid | co_activity |      co_starttime      |  co_event  | co_eventtype
----------------------+--------+---------+-------------------+-------------+------------------------+------------+--------------
 local_sync           |   1399 |       0 |              1410 |             | 2011-02-14 17:37:32+00 |            |
 local_monitor        |   1399 |       0 |              1411 |             | 2011-02-14 17:37:32+00 |            |
 remote listener      |   1399 |       1 |              1412 |             | 2011-02-14 17:37:32+00 |            |
 local_cleanup        |   1399 |       0 |              1413 |             | 2011-02-14 17:37:32+00 |            |
 local_listen         |   1399 |       2 |              1403 | SYNC        | 2011-02-14 17:37:50+00 |            | SYNC
 remoteWorkerThread_1 |   1399 |       1 |              1409 | SYNC        | 2011-02-14 17:37:56+00 | 5000000053 | SYNC
(6 rows)
Comment 5 Christopher Browne 2011-02-17 12:29:50 PST
Suggestion from Jan...

Implement as a stack.

When pushing, see if the entry is for "my" thread; if so, update rather than
pushing a new entry.
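
In sketch form (hypothetical types and names, not slon's actual code):

 #include <pthread.h>
 #include <stdlib.h>
 #include <string.h>

 /* Sketch of the suggestion: a stack of per-thread entries where a push
  * from a thread that already has an entry updates it in place, so
  * repeated updates collapse into one pending row per thread. */
 typedef struct entry
 {
     pthread_t     thread;
     char          activity[128];
     struct entry *next;
 } entry;

 static entry           *stack_top = NULL;
 static pthread_mutex_t  stack_lock = PTHREAD_MUTEX_INITIALIZER;

 static void
 monitor_push(const char *activity)
 {
     entry      *e;

     pthread_mutex_lock(&stack_lock);
     for (e = stack_top; e != NULL; e = e->next)
         if (pthread_equal(e->thread, pthread_self()))
             break;              /* found my own entry: just update it */
     if (e == NULL)              /* first push from this thread */
     {
         e = calloc(1, sizeof(entry));
         e->thread = pthread_self();
         e->next = stack_top;
         stack_top = e;
     }
     strncpy(e->activity, activity, sizeof(e->activity) - 1);
     pthread_mutex_unlock(&stack_lock);
 }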
Comment 6 Christopher Browne 2011-02-18 15:23:36 PST
I started taking a poke at adding this monitoring to Slonik, and that looks
like it's liable to be troublesome and/or overly intrusive.

The trouble is that, for this to be useful, it needs to open extra connections
to databases and do many COMMITs during the course of a Slonik script, leading
to a flurry of extra DB connections.

I suspect this is more trouble than the would-be gain.  It also looks like it
requires quite a lot of extra instrumentation code in slonik.c.  So the "first
shot" on this involves not changing slonik.
Comment 7 Christopher Browne 2011-02-25 13:50:28 PST
I have run pgindent against this, to apply typical PG indentation policy.

Should be ready for review.
Comment 8 Steve Singer 2011-03-14 06:48:38 PDT
(In reply to comment #7)
> I have run pgindent against this, to apply typical PG indentation policy.
> 
> Should be ready for review.

Review at 

http://lists.slony.info/pipermail/slony1-hackers/2011-March/000446.html
Comment 9 Christopher Browne 2011-03-15 10:16:15 PDT
(In reply to comment #8)
> (In reply to comment #7)
> > I have run pgindent against this, to apply typical PG indentation policy.
> > 
> > Should be ready for review.
> 
> Review at 
> 
> http://lists.slony.info/pipermail/slony1-hackers/2011-March/000446.html

I have responded to this on the list; see:
http://lists.slony.info/pipermail/slony1-hackers/2011-March/000447.html

There are patches for virtually all of the items mentioned in the review:

https://github.com/cbbrowne/slony1-engine/commit/c6082bc77f460e985c7a2891c45269c9648347c9

https://github.com/cbbrowne/slony1-engine/commit/72ed0337641eefda095cfcb706562d71d9c03413

https://github.com/cbbrowne/slony1-engine/commit/7a3c5f54dcf72bcf67a49f6ed5b80ed673fc0ed5

https://github.com/cbbrowne/slony1-engine/commit/0543c96506eb55012c4366f78009f886b9464d59

https://github.com/cbbrowne/slony1-engine/commit/03f52daf1fb7ae2d819ecd0c411342a594f6884d

https://github.com/cbbrowne/slony1-engine/commit/d00574d2d1793a372467bc103a7772030fd939ab

There were two other issues mentioned which I'll quote as items to specifically
respond to.
Comment 10 Christopher Browne 2011-03-15 10:17:30 PDT
Per Steve Singer:
<http://lists.slony.info/pipermail/slony1-hackers/2011-March/000446.html>

A general comment:
This monitor thread calls slon_retry() a bunch of times, so we've created a
whole bunch of new failure points for slon.  I'm not sure issues in recording
the monitor state should result in a slon falling over.  I feel that if the
monitor thread hits an error, it can reset its own database connections, but it
should leave the threads doing real work alone.  Having said that, we might
also want a maximum size for the queue.  If the queue exceeds a certain size
(maybe because the monitor thread keeps hitting issues), then we stop putting
new elements on the queue or discard the oldest ones.
Comment 11 Christopher Browne 2011-03-15 10:18:07 PDT
Per Steve Singer:
<http://lists.slony.info/pipermail/slony1-hackers/2011-March/000446.html>

I would also like to know what the performance impact of running the
monitoring code is, in terms of system load.  I.e., run some tests without the
monitoring code and measure average CPU load, then run the same tests with the
monitoring code and measure again.  I don't think the impact is going to be
high, but we should still measure it.
Comment 12 Christopher Browne 2011-03-16 13:46:31 PDT
Per Steve's recommendation, monitoring thread problems are no longer
necessarily treated as serious problems warranting restarting the slon.

This patch implements that:
https://github.com/cbbrowne/slony1-engine/commit/d1a5be153e51cd10ac12a3dcb9cbb63856a18306

Note that there are still some slon_retry() calls remaining, in the scenario
where the slon process runs out of memory.  That seems pretty fatal to me...

I realized that there are cases where we have a horrid memory leak: if the
monitoring thread isn't working, then we shouldn't be collecting monitoring
entries.  Thus, I made two changes:

1.  When we exit from the monitoring thread main loop, turn monitoring off.

2.  In the "push" function, if monitoring is shut off, return immediately; do
not attempt to queue up monitoring data.

https://github.com/cbbrowne/slony1-engine/commit/a61f4642ed29bc727d2361a4000500b81573ea11
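
The push-side guard from change 2 amounts to something like this (a sketch;
the flag and function names are illustrative, not the committed code):

 #include <stdbool.h>

 /* Set to false when the monitoring thread's main loop exits (change 1),
  * so entries stop accumulating with nobody left to flush them. */
 static volatile bool monitoring_active = true;

 static void
 monitor_state_sketch(const char *actor, const char *activity)
 {
     if (!monitoring_active)
         return;                 /* change 2: do not queue what nobody flushes */
     /* ... otherwise stamp the entry and push it as usual ... */
     (void) actor;
     (void) activity;
 }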

I realized that the one entry for the monitoring thread will indicate the time
when the slon daemon started.  Document that...

https://github.com/cbbrowne/slony1-engine/commit/d3f66cccb54168ca8d6a80261337a66228720cd8
Comment 13 Christopher Browne 2011-03-17 10:42:33 PDT
I don't see a particularly good way to readily analyze CPU loading from this
addition, and I'm not sure that's even the best measurement, in any case.

As an alternative, I instrumented the code in two places, with the following
patch:
diff --git a/src/slon/monitor_thread.c b/src/slon/monitor_thread.c
index 108b338..d88abfd 100644
--- a/src/slon/monitor_thread.c
+++ b/src/slon/monitor_thread.c
@@ -180,6 +180,8 @@ monitorThread_main(void *dummy)
 							free(state.activity);
 						if (state.event_type != NULL)
 							free(state.event_type);
+						slon_log(SLON_INFO, "monitor_thread - record - %s\n",
+								 dstring_data(&monquery));
 						res = PQexec(dbconn, dstring_data(&monquery));
 						if (PQresultStatus(res) != PGRES_TUPLES_OK)
 						{
@@ -384,6 +386,9 @@ monitor_state(const char *actor, int node, pid_t conn_pid, /* @null@ */ const ch
 	{
 		tos->event_type = NULL;
 	}
+	slon_log(SLON_INFO, "monitor_state(pid:%d node:%d connpid:%d event:%lld actor[%s] activity[%s] event_type[%s]\n",
+			 tos->pid, tos->node, tos->conn_pid, tos->event,
+			 tos->actor, tos->activity, tos->event_type);
 	pthread_mutex_unlock(&stack_lock);
 }


a) Each time an entry is pushed onto the stack, it is recorded in the log.
b) Each time an entry is about to be submitted to the DB, the query is recorded
in the log.

In running a typical regression test (e.g. - test1), I saw the following
results...

The test ran from Mar 17 13:18:00 to Mar 17 13:19:06, i.e. 66 seconds.

 grep monitor_thread /var/log/messages | grep select | wc -l

-> There were 214 calls to the component_state() function across the two nodes,
so...

214 / (66 * 2) = 1.6 calls per second per node

-> There were 471 calls to monitor_state() across those nodes, which induced
those 214 calls.

That suggests to me that Jan's recommendation, to optimize by letting
successive calls overwrite a repeated entry for the same thread, is doing a lot
of good, reducing the DB load by more than half.

I find it reasonable to expect this to be a pretty nominal addition to system
load for Slony.

Poking at the table:

slonyregress1@localhost->  analyze verbose _slony_regress1.sl_components ;
INFO:  analyzing "_slony_regress1.sl_components"
INFO:  "sl_components": scanned 1 of 1 pages, containing 6 live rows and 17
dead rows; 6 rows in sample, 6 estimated total rows
ANALYZE
slonyregress1@localhost->  \c slonyregress2
You are now connected to database "slonyregress2" as user "postgres".
slonyregress2@localhost->  analyze verbose _slony_regress1.sl_components ;
INFO:  analyzing "_slony_regress1.sl_components"
INFO:  "sl_components": scanned 1 of 1 pages, containing 6 live rows and 11
dead rows; 6 rows in sample, 6 estimated total rows
ANALYZE
Comment 14 Christopher Browne 2011-03-17 10:43:06 PDT
Thing remembered...

sl_components should be added to the set of tables that the cleanup thread
considers maintaining.

https://github.com/cbbrowne/slony1-engine/commit/cd3519bba5a03ce2f577747324f4a9ebc697d317
Comment 15 Steve Singer 2011-03-18 07:44:48 PDT
Some additional comments based on the latest changes (these are mostly issues
that I just didn't catch in my first review)


1.  We also need to include support in the upgradeSchema() function to take a
2.0.x slony install and add in the new tables for monitoring

2.  In the monitor thread, dbconn is never closed.  Especially in the cases
where we 'stop' the monitoring thread, we should shut down the connection (if
we are able).  We also MIGHT want to think about emptying the queue at this
point; we aren't going to empty it later, and if the database connection was
stuck for a while before it timed out leading to the error, the queue might
have grown in size.

I also still wonder whether we should be putting a maximum size on the queue.
I know we discussed this the other day and it started to sound complicated, but
I think there must be an uncomplicated way of doing this that throws away old
entries.  If I can find some time I'll see if I can come up with something.
Until then I don't think that needs to hold this patch up.


I also bet we will get complaints from people: "my slon has been running fine
for 3 months but the sl_components table stopped updating a few weeks ago",
where some network blip disabled the monitor_thread but the rest of slony
continued to chug along.  Though I think this situation is better than slon
restarting, as we had before.


3. At line 189, call dstring_free(monquery) before breaking out of the loop on
the error.

The discussion of how we should search for performance regressions in slony is
perhaps outside the scope of this bug, other than that we should deal with it
before we release 2.1 (looking to see whether any of the 2.1 changes are a
performance issue; I have ideas on how to do this, but Jan might already have a
setup for testing it).
Comment 16 Christopher Browne 2011-03-18 09:21:30 PDT
(In reply to comment #15)
> Some additional comments based on the latest changes (these are mostly issues
> that I just didn't catch in my first review)
> 
> 
> 1.  We also need to include support in the upgradeSchema() function to take a
> 2.0.x slony install and add in the new tables for monitoring

I thought I had done that.  I'll add in the table if it's missing.

https://github.com/cbbrowne/slony1-engine/commit/d51049b54f014ee4f00f974fc9629a7d64ca9727

> 2.  In the monitor thread, dbconn is never closed.  Especially in the cases
> where we 'stop' the monitoring thread, we should shut down the connection (if
> we are able).  We also MIGHT want to think about emptying the queue at this
> point; we aren't going to empty it later, and if the database connection was
> stuck for a while before it timed out leading to the error, the queue might
> have grown in size.

dbconn isn't closed directly - it's closed via slon_disconnect(conn).

As for emptying the queue, I don't see much value in it.  The condition on the
loop is on the waits returning SCHED_STATUS_OK.  I don't think you get to ask
it to terminate, or is there a signal you could throw that would kick just the
one thread?

> I also still wonder whether we should be putting a maximum size on the queue.
> I know we discussed this the other day and it started to sound complicated,
> but I think there must be an uncomplicated way of doing this that throws away
> old entries.  If I can find some time I'll see if I can come up with
> something.  Until then I don't think that needs to hold this patch up.

Idle thought: pick two numbers, m and n, and, once we've reached size m,
generate a warning every time the queue size mod n = 0.

For instance, m=100, n=10.  Once the queue is of size >100, complain every time
the size increases and (size % 10) = 0.

Actually, we're doubling the size of the array each time, so it's perfectly
reasonable to gripe in the logs any time we resize and the size > 100.  I'll do
that...
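
In sketch form (slon_log and the SLON_WARN level are assumed to be as in the
patch earlier in this bug; the rest is hypothetical):

 /* Sketch: the queue's backing array doubles on overflow, so resizes are
  * rare; complaining at each resize past a threshold is nearly free. */
 #define MONITOR_GRIPE_THRESHOLD 100

 static void
 monitor_queue_grow(size_t *capacity)
 {
     *capacity *= 2;
     if (*capacity > MONITOR_GRIPE_THRESHOLD)
         slon_log(SLON_WARN,
                  "monitoring queue grew to %d entries; "
                  "is the monitoring thread keeping up?\n",
                  (int) *capacity);
     /* ... realloc the backing array to the new capacity ... */
 }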

> I also bet we will get complaints from people: "my slon has been running fine
> for 3 months but the sl_components table stopped updating a few weeks ago",
> where some network blip disabled the monitor_thread but the rest of slony
> continued to chug along.  Though I think this situation is better than slon
> restarting, as we had before.

That's the risk any time we add anything :-).

> 3. At line 189, call dstring_free(monquery) before breaking out of the loop
> on the error.

Fixed. 
https://github.com/cbbrowne/slony1-engine/commit/24da0a137120f4032858230836b39ed14da755fb

> The discussion of how we should search for performance regressions in slony
> is perhaps outside the scope of this bug, other than that we should deal with
> it before we release 2.1 (looking to see whether any of the 2.1 changes are a
> performance issue; I have ideas on how to do this, but Jan might already have
> a setup for testing it).

I'm happy with that.

I have done a little bit of analysis that suggests this change should have
minimal impact on performance, and that's certainly what we expected.

And it seems to make sense to look at performance more globally.  It's
acceptable for us to lose a little here and there on one feature or another if:

a) We gain material reliability improvements (e.g. - reasonable value exchange)
b) There are other changes that provide offsetting performance increases.  (Bug
#167, I'm looking at you!  :-))
Comment 17 Steve Singer 2011-03-18 11:01:56 PDT
> 
> > 2.  In the monitor thread, dbconn is never closed.  Especially in the cases
> > where we 'stop' the monitoring thread, we should shut down the connection
> > (if we are able).  We also MIGHT want to think about emptying the queue at
> > this point; we aren't going to empty it later, and if the database
> > connection was stuck for a while before it timed out leading to the error,
> > the queue might have grown in size.
> 
> dbconn isn't closed directly - it's closed via slon_disconnect(conn).

Didn't see that.  We then need to call slon_disconnect on line 92.


> 
> As for emptying the queue, I don't see much value in it.  The condition on the
> loop is on the waits returning SCHED_STATUS_OK.  I don't think you get to ask
> it to terminate, or is there a signal you could throw that would kick just the
> one thread?
> 

Lines 121, 191, and to a lesser extent 212 and 95, are all on code paths where
we break out of the loop but don't free the space already used by the queue.


> Actually, we're doubling the size of the array each time, so it's perfectly
> reasonable to gripe in the logs any time we resize and the size > 100.  I'll do
> that...

For the moment I guess this is fine to commit.  At some point I will try to
find time to implement a properly bounded queue, based on the number of
threads, such that old entries get thrown away when the queue hits the bound
size.
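
A sketch of one such bounded queue, sized off the thread count and overwriting
the oldest slot when full (hypothetical names and sizes, not a patch):

 #include <pthread.h>
 #include <string.h>

 /* Illustrative bound: a small multiple of the number of slon threads. */
 #define MON_RING_SIZE 64

 typedef struct
 {
     char actor[64];
     char activity[128];
 } mon_slot;

 static mon_slot         ring[MON_RING_SIZE];
 static size_t           ring_next = 0;  /* next slot to overwrite */
 static pthread_mutex_t  ring_lock = PTHREAD_MUTEX_INITIALIZER;

 static void
 ring_push(const char *actor, const char *activity)
 {
     mon_slot   *s;

     pthread_mutex_lock(&ring_lock);
     s = &ring[ring_next];
     strncpy(s->actor, actor, sizeof(s->actor) - 1);
     s->actor[sizeof(s->actor) - 1] = '\0';
     strncpy(s->activity, activity, sizeof(s->activity) - 1);
     s->activity[sizeof(s->activity) - 1] = '\0';
     ring_next = (ring_next + 1) % MON_RING_SIZE;    /* oldest gets dropped */
     pthread_mutex_unlock(&ring_lock);
 }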
Comment 18 Christopher Browne 2011-03-18 12:11:16 PDT
Committed to head

http://git.postgresql.org/gitweb?p=slony1-engine.git;a=commit;h=59fa974c06a1cae9f355809ea81eaa6d443f0a16