Florian G. Pflug fgp
Mon Mar 6 16:00:03 PST 2006
Christopher Browne wrote:
>>>I think I need to write up a function to evaluate sl_listen in order to
>>>have a proper test...
>>
>>If you just want to check that every node is reachable, that would be
>>quite easy - e.g. with
>>"select sum(1) from @NAMESPACE@.sl_listen group by sl_receiver, sl_origin".
>>If this doesn't return N*(N-1), with N being the number of nodes, then
>>something is fishy. Of course, this will only work if the algorithm is
>>taken to be correct, so it's not really a test of the algorithm...
>>
> That only tests counts of things.
Not necessarily. If
a) sl_listen has a foreign key from sl_origin to sl_node.no_id, and
   from sl_receiver to sl_node.no_id, and
b) the algorithm in RebuildListenNodes is correct,
then it follows that the existence of N*(N-1) records in sl_listen which
pairwise differ in either sl_origin or sl_receiver implies a working
flow of events from every node to every other node.

As I said, this is of course _not_ a test for the correctness of the
algorithm, but rather a test for the correctness of a given Slony
configuration (it will detect missing sl_path entries, for example).
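
For reference, a tightened-up version of that count check could look like
this (an untested sketch; I'm assuming the usual li_origin/li_receiver
column names and leaving out the cluster schema prefix). Since the grouped
query above returns one row per pair, it is really the number of returned
rows that has to equal N*(N-1):

    -- Completeness check: the number of distinct (origin, receiver)
    -- pairs in sl_listen must be N*(N-1), with N the number of nodes.
    select (select count(*)
              from (select distinct li_origin, li_receiver
                      from sl_listen) as pairs)
           = (select count(*) * (count(*) - 1)
                from sl_node) as listen_network_complete;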

> The trouble that we had with the former algorithm was that node 20 was
> listening for events originating from node 10/11 from node 21.  That was
> actually completely infeasible; 21 would never get those events.
But this situation cannot possibly occur with the "new" algorithm, given
that it is correct.

>>A real test would probably mean implementing another algorithm that
>>should lead to the same result, and then comparing those... Which sounds
>>like a lot of work :-(
>>
> I don't think it's forcibly all that bad.  We iterate up to (select
> count(*) from sl_node) times, extending a network by a level each time,
> so as to make sure that each origin and destination winds up covered.
Well, this is quite similar to what ReachableFromNode(origin, blacklist) does.
The problem with test-by-reimplementation is that nobody guarantees that
the second implementation is any better, or less buggy, than the first one.
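
FWIW, the level-by-level extension you describe could look roughly like the
function below (an untested sketch, again assuming the usual li_* column
names and leaving out the cluster schema prefix). It computes the set of
nodes that events originating on a given node can actually reach by
following sl_listen entries, adding one level per pass:

    create or replace function check_listen_reachable(p_origin int4)
    returns boolean as $$
    declare
        v_nodes    int4;
        v_next     int4;
        v_added    boolean;
        v_reached  int4[] := array[p_origin];
    begin
        select count(*) into v_nodes from sl_node;

        -- Extend the reachable set by one level per pass;
        -- N passes are always enough to reach a fixed point.
        for i in 1 .. v_nodes loop
            v_added := false;
            for v_next in
                select distinct li_receiver
                  from sl_listen
                 where li_origin = p_origin
                   and li_provider = any (v_reached)
                   and not (li_receiver = any (v_reached))
            loop
                v_reached := v_reached || v_next;
                v_added := true;
            end loop;
            exit when not v_added;
        end loop;

        -- Events from p_origin should reach every node in the cluster.
        return array_upper(v_reached, 1) = v_nodes;
    end;
    $$ language plpgsql;

Running that for every origin and checking that it always returns true would
be an independent test of a generated configuration, in the sense that it
only follows sl_listen without knowing how those rows were chosen.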

Don't get me wrong - I'm all for testing. But in the case of a single
algorithm (as opposed to a complex software system), a proof of correctness
is a viable alternative to testing. The basic idea of the algorithm is:
"For receiver x, use as event-provider for a given origin any neighbour
that can receive events from that origin under the assumption that x does
not exist". Now, that sounds so simple that I'm ready to assume it leads to
a correct algorithm - I just can't imagine any case in which the above
could lead to a wrong result.

What remains to be checked IMHO is whether the implementation really
implements that idea, and whether it works correctly for every corner case.
The first is hard to check, but in our simple case I'd say peer review is
probably sufficient. The second is easier, because there are not many
corner cases - initial creation of a cluster is probably the most important
one.

>>Anyway, I trust the algorithm itself much more than I trust my
>>subscribed-set-special-case.
>>This part is what really needs to be checked, IMHO.
> I'll see about a "feasibility check"...
That'd be great.
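
In case it helps as a starting point for that: such a feasibility check could
perhaps boil down to a per-row test on sl_listen - again an untested sketch,
assuming the usual li_*/pa_* column names. Every chosen provider should either
be the origin itself or listen for that origin too, and there has to be an
sl_path entry the receiver can use to reach that provider:

    -- Per-row sanity check on sl_listen: list entries whose provider
    -- neither is the origin nor listens for that origin itself, or
    -- where the receiver has no sl_path entry towards the provider.
    select l.li_origin, l.li_provider, l.li_receiver
      from sl_listen l
     where (l.li_provider <> l.li_origin
            and not exists (select 1 from sl_listen p
                             where p.li_origin   = l.li_origin
                               and p.li_receiver = l.li_provider))
        or not exists (select 1 from sl_path pa
                        where pa.pa_server = l.li_provider
                          and pa.pa_client = l.li_receiver);

An empty result doesn't prove the configuration correct, but any row it does
return points at an sl_listen entry that can never deliver events.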

greetings, Florian Pflug


