Christopher Browne cbbrowne at
Tue Mar 8 13:17:17 PST 2011
I have been taking a poke at the code in Steve's branch.

I like that the code for doing this is mostly controlled in just a
couple of places, slonik_submitEvent(), slonik_wait_caughtup().

It seems to me that there's value in renaming things a little bit; in
most cases, what we want slonik to wait for is verification that the
node is caught up with respect to configuration events.  In
(virtually?) all cases, we don't actually care whether replication is
up to date; what breaks replication is the nodes disagreeing on the
state of the cluster's configuration.

Possibly the name should be slonik_wait_config_caughtup(), for
instance?  That makes it clearer that it's not waiting for SYNCs to
get through, just for the essential configuration events.

Aside from that, the code is looking fairly reasonable to me.

I'm running the clustertest at present, and it's running fine, with
some caveats.  It has been running the Fail Node Test for quite a
while (1h 15min), which seems long.  Two steps in the Clone Node test
failed, which I think isn't a surprise:

-> % cat ../Clone\ Node/testDetail.txt
pass,slonik - creating nodes+paths+sets,146
pass,slonik - adding tables to set,147
pass,slonik - subscribing set,148
pass,db6 created okay,149
pass,database restored okay,150
pass,clone finish succeeded,151
fail,sync did not finish in the timelimit:true,false,152
fail,slonik completed on success:143.0,0.0,153
pass,slonik - uninstalling nodes,154

I think that CLONE NODE PREPARE needs to call slonik_wait_caughtup()
to make sure that the source node is up to date vis-a-vis
configuration events, which fits in with the earlier suggestion.

More information about the Slony1-hackers mailing list