[Slony1-hackers] [Slony1-bugs] [Bug 175] Monitoring cluster better

Tue Mar 15 10:12:16 PDT 2011

On Mon, Mar 14, 2011 at 9:43 AM, Steve Singer <ssinger at ca.afilias.info> wrote:
> On 11-02-25 04:50 PM, bugzilla-daemon at main.slony.info wrote:
>>
>> http://www.slony.info/bugzilla/show_bug.cgi?id=175
>>
>> Christopher Browne<cbbrowne at ca.afilias.info>  changed:
>>
>>            What    |Removed                     |Added
>>
>> ----------------------------------------------------------------------------
>>          Depends on|176                         |
>>
>> --- Comment #7 from Christopher Browne<cbbrowne at ca.afilias.info>
>>  2011-02-25 13:50:28 PST ---
>> I have run pgindent against this, to apply typical PG indentation policy.
>>
>> Should be ready for review.
>>
>
> I've looked this over. See below
>
>
>
>
> monitoring.sgml
> ---------------------
> The paragraphs "It might seem somewhat desirable...." I don't think is
> really needed in the documentation.  I think that is more approriate as a
> comment in the code.  Users shouldn't be bothered with the reasons why
> we didn't make it work some other way.
>
>
> I would make the paragraphs "In order for timestamps..." and "If the host
> where"
> the same paragraph.

No objections
https://github.com/cbbrowne/slony1-engine/commit/72ed0337641eefda095cfcb706562d71d9c03413

> prerequisites.sgml
> ------------------
> - We also capture timestamps in sl_event.  I don't think we  use these
> timestamps for much but the fact that we store them here is probably
> worth a mention.
> -We also log timestamps in the slon log, and syncing them up is annoying
> without the timestamps matching between slon servers.

No objections
https://github.com/cbbrowne/slony1-engine/commit/72ed0337641eefda095cfcb706562d71d9c03413

> slonconf.sgml
> ----------------
> How do I disable the monitoring thread 0? -1? The documentation should
> mention this.
>
> slony1_base.sql
> -------------------
> co_starttime has a default of now().  We we using the time/clock on the
> database server or the time/clock on the server slon runs on.  I don't think
> we should be mixing them.

No objections
https://github.com/cbbrowne/slony1-engine/commit/c6082bc77f460e985c7a2891c45269c9648347c9

> slony1_funcs.sql
> -----------------------
>
> On the UPDATE statement in component_state  the where clause won't
> update the row if co_starttime >= i_starttime.  I don't like this.
> Let's say the clock stays at the same time for two messages because
> NTP is adjusting things.  I would still want the second value to be
> stored and overwrite the value from the first call.
>
> A function that is supposed to store something but then doesn't because
> one of my parameters (i_starttime) was too small and then returns
> without an error return code/exception is also very unexpected.

Unfortunately, we're working with a *stack*, which means we pull
entries off in reverse order.  It's definitely troublesome to change
this.  Per conversation, I'll not change this for now...

> confoptions.c
> --------------------
> I sort of hinted at this above.  I'd like a way to disable the monitor
> thread ie allow 0 to mean disabled.

Yep, will do.

> monitor_thread.c
> ------------------------------
> Line 91:  Here we delete any rows from sl_components for backends that are
> no longer active.  This sort of confuses me.
>
> slon 1 - unix pid 1234
>   localListener - connects to backend pid 2345 on host db1
>   remoteListenerThread_2  connnects to pid 3456 on host db2
>   remoteListenerThread_3  connects to pid 4567 on host db3
>
> 3456 and 4567 won't be in pg_stat_activity on host db1.  They are backends
> running on host db2.   I'm not sure we need/want this delete statement.

Removed.

> Line 113:  I don't like this #define BEGINQUERY.
> I'd rather see
> const char * begin_query="start transaction;"
> I don't see what the macro buys you that a const char * doesn't and the
> const char plays nicer with debuggers and is less suspicious to read.
> Having said that on line 85 you create a slon string beginquery.  I'd rather
> see the const char * over macro or the slondstring

Fair enough.

> line 120: The comment here made me think that something
> after stack_pop() was doing the locking and that an explicit
> unlock is required.  I'd just skip that comment since stack_pop() always
> releases the lock it obtains.

Fair enough.

> Line 173 AND line 177 :
> Free monquery (dstring_terminate)

Added an else clause for this.

> Line 184:
> As I mentioned above I don't think the delete query does what we want. I
> think it probably needs to be eliminated.

Removed.

> Line 195:
> I'm not sure I see the point in making COMMITQUERY be an alias for "commit;"
> but if there is one I'd rather see a const char * used.

Changed to slon dstring, as with begin;

> Line 219:
> I think monquery needs to be freed in the while loop (I mention it above)
> not outside of it since you do a dstring_init in the while loop
> (alternativly you could allocate it outside the while loop, free it here and
> just reset it in the loop)

Yep, it's allocated and deallocated inside the while loop.  I have
added a free on commitquery, to parallel the one for beginquery.

> Line 227:
> I'd rather see this (#define INITIAL_STACK_SIZE) as a static int defined at
> the top of the file. Easier on debuggers.

Done.

> Line 246:
> actor ,activity and event_type can be declared 'const char *"

Good call, done.

> Line 318:
> ns[len]=(char)0;  would normally be written as ns[len]=NULL; in C.

Hmm.  Warning here..
monitor_thread.c: In function 'monitor_state':
monitor_thread.c:307: warning: assignment makes integer from pointer
without a cast

This isn't really stating a pointer; we are just wanting to assign a
string null value, which isn't quite the same as pointer NULL.

Using...
ns[len] = '\0';

> Line 416:
> These functions are not used but my real concern is that they access the
> shared variables without aquiring the mutex and they don't specifiy in the
> comments that the caller is responsible for aquiring the mutex.  If we are
> going to keep the functions we should add a comment to that effect.

I'll add the mutex to stack_dump(), and comment on the need for the
mutex to guard entry_dump().

Stopping here, for the moment...

The above issues are addressed via patches up to
https://github.com/cbbrowne/slony1-engine/commit/d00574d2d1793a372467bc103a7772030fd939ab

I'll attach the following two issues as two further comments to the bug.

> A general comment:
> This monitor thread calls slon_retry() a bunch of times.  We've created a
> whole bunch of new failure points for slon.  I'm not sure issues recording
> the monitor state should result in a slon falling over. I feel that if the
> monitor thread hits an error it can reset the database connections for the
> monitor thread but it should leave the threads doing real work alone.
> Having said that we might also want a maximum size of the queue.  If the
> queue exceeds a certain size (maybe because the monitor thread keeps hitting
> issues) then we stop putting new  elements on the queue or discard the
> oldest ones.
>
> I would also like to know what the performance impact in terms of system
> load is of running the monitoring stuff.  Ie run some tests without the
> monitoring stuff measure average cpu load and run the same test with the
> monitoring code and measure average cpu load.  I don't think the impact is
> going to be high but we should still measure it.