[Slony1-bugs] [Bug 175] Monitoring cluster better

Fri Mar 18 09:21:30 PDT 2011

http://www.slony.info/bugzilla/show_bug.cgi?id=175

--- Comment #16 from Christopher Browne <cbbrowne at ca.afilias.info> 2011-03-18 09:21:30 PDT ---
(In reply to comment #15)
> Some additional comments based on the latest changes (these are mostly issues
> that I just didn't catch in my first review)
> 
> 
> 1.  We also need to include support in the upgradeSchema() function to take a
> 2.0.x slony install and add in the new tables for monitoring

I thought I had done that.  I'll add in the table if it's missing.

https://github.com/cbbrowne/slony1-engine/commit/d51049b54f014ee4f00f974fc9629a7d64ca9727

> 2.  In monitor thread - dbconn is never closed.  Especially in the cases where
> we 'stop' the monitoring thread we should shutdown the connection (if we are
> able). We also MIGHT want to think about emptying the queue at this point.  We
> aren't going to empty it later and if the database connection was stuck for a
> while before it timed out leading to the error the queue might have grown in
> size.  

dbconn isn't closed directly - it's closed via slon_disconnect(conn).

As for emptying the queue, I don't see much value in it.  The condition on the
loop is on the waits returning SCHED_STATUS_OK.  I don't think you get to ask
it to terminate, or is there a signal you could throw that would kick just the
one thread?

> I also still wonder if we shouldn't be putting a maximum size on the queue, I
> know we discussed this the other day and it started to sound complicated.  I
> think
> there must be a way of doing this with throwing away old entries that isn't
> complicated.  If I can find some time I'll see if I can come up with something.
> Until then I don't think that needs to hold this patch up.

Idle thought: pick two numbers, m and n, and, once we've reached size m,
generate a warning every time the queue size mod n = 0.

For instance, m=100, n=10.  Once the queue is of size >100, complain every time
the size increases and (size % 10) = 0.

Actually, we're doubling the size of the array each time, so it's perfectly
reasonable to gripe in the logs any time we resize and the size > 100.  I'll do
that...

> I also bet we will get complaints from people 
> "my slon has been running fine for 3 months but slon_component table stopped
> updating a few weeks ago".   Where some network blimp disabled the
> monitor_thread but the rest of slony continued to chug along.  Though i think
> this situation is better htan slon restarting as we had before.

That's the risk any time we add anything :-).

> 3. Line 189 dstring_free(monquery) before breaking out of the loop on the
> error.

Fixed. 
https://github.com/cbbrowne/slony1-engine/commit/24da0a137120f4032858230836b39ed14da755fb

> The discussion on how we should be searching for performance regressions in
> slony is maybe outside the scope of this bug other than we should deal with
> this before we release 2.1  (looking to see if any of the 2.1 changes are a
> performance issue.  I have ideas on how to do this but Jan might already have a
> setup for testing this).

I'm happy with that.

I have done a little bit of analytics that suggest that this change should have
minimal impact on performance, and that's certainly what we expected.

And it seems to make sense to look at performance more globally.  It's
acceptable for us to lose a little here and there on one feature or another if:

a) We gain material reliability improvements (e.g. - reasonable value exchange)
b) There are other changes that provide offsetting performance increases.  (Bug
#167, I'm looking at you!  :-))

-- 
Configure bugmail: http://www.slony.info/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.