Jan Wieck JanWieck at Yahoo.com
Tue Jun 17 11:08:04 PDT 2008
I just committed a small fix to the remote worker. The bug was actually 
revealed after a change I made to the ducttape test #2. I added wait for 
event commands there in order to start subscribing node 3, which 
cascades from node 2, as soon as node 2 had finished its copy set.

The problem was that node 3, as a node "not subscribed to anything at all", 
was listening on node 1 for events originating from node 1. That is 
fine under normal circumstances. However, in this specific setup the 
attempt is to subscribe to a set originating on node 1, cascaded through 
node 2 as data provider, which at this point is certainly lagging behind 
(it has just started to catch up after the copy set). What happens is that 
the SUBSCRIBE_SET event originates on node 2 (data provider) and travels to 
node 1 (origin). There it causes the ENABLE_SUBSCRIPTION event to be 
generated. This event is received by node 3 "directly", which causes 
node 3 to wait and check at 5-second intervals whether node 2 has finally 
caught up to at least that ENABLE_SUBSCRIPTION event.

In that wait loop, it never processed any confirm forward messages, 
which were added to the end of the internal message queue. I changed a 
few things to make sure that confirm forward messages are kept at the 
head of the remote worker's internal message queue.
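
The idea of keeping confirm forward messages at the head of the queue can be sketched in a few lines of Python. This is only a toy model, not the actual remote worker code (which is written in C); the names WorkerQueue, CONFIRM, EVENT and process_confirms_while_waiting are hypothetical and exist only for illustration.

```python
# Toy model: confirm messages are always kept ahead of other messages,
# so a wait loop can still consume them while events stay queued.
# All names here are hypothetical, not taken from the Slony-I sources.

CONFIRM = "confirm"
EVENT = "event"


class WorkerQueue:
    def __init__(self):
        self._msgs = []
        self._n_confirms = 0  # number of confirm messages at the head

    def put(self, kind, payload):
        if kind == CONFIRM:
            # Insert behind earlier confirms but ahead of everything else.
            self._msgs.insert(self._n_confirms, (kind, payload))
            self._n_confirms += 1
        else:
            self._msgs.append((kind, payload))

    def peek_kind(self):
        return self._msgs[0][0] if self._msgs else None

    def get(self):
        if not self._msgs:
            return None
        msg = self._msgs.pop(0)
        if msg[0] == CONFIRM:
            self._n_confirms -= 1
        return msg


def process_confirms_while_waiting(queue, handle_confirm):
    """Consume the confirms at the head; leave other messages queued."""
    handled = 0
    while queue.peek_kind() == CONFIRM:
        handle_confirm(queue.get()[1])
        handled += 1
    return handled
```

With confirms appended at the tail instead, a loop like process_confirms_while_waiting would never reach them while an unprocessed event sits at the head, which is the situation described above.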

There have been repeated comments that wait for event does not work in 
connection with subscribe set. This bug may have been one cause; the other 
might be that people don't realize that subscribing to a set internally 
creates two events, and both need to be waited for in the right order.

The correct sequence of slonik commands to wait for a subscribe is:

     subscribe set (...);
     wait for event (origin = <data provider>, confirmed = <set origin>,
             wait on = <set origin>, timeout = 0);
     sync (id = <set origin>);
     wait for event (origin = <set origin>, confirmed = <new subscriber>,
             wait on = <new subscriber>, timeout = 0);

The first "wait for event" waits until the actual subscribe set command 
has been processed by the origin of the data set. The following "sync" 
command is necessary to update slonik's idea of what the last event 
sequence on the set origin is. The second "wait for event" now will wait 
until that very sync has been confirmed by the new subscriber, which 
means that it has finished not only the copy set, but also the very 
first sync operation thereafter.
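
To make the placeholders in the sequence above concrete, here is a small Python helper that fills them in for the setup described earlier (set origin on node 1, data provider node 2, new subscriber node 3). The function name subscribe_and_wait, the set id 1 and the forward setting are assumptions made for illustration; the slonik preamble (cluster name and admin conninfo lines) is omitted and would still be needed in a real script.

```python
# Illustrative helper: renders the four-step slonik sequence for given nodes.
# subscribe_and_wait and its parameters are hypothetical names.

def subscribe_and_wait(set_id, origin, provider, subscriber, forward=False):
    fwd = "yes" if forward else "no"
    return "\n".join([
        # 1. Request the subscription.
        f"subscribe set (id = {set_id}, provider = {provider}, "
        f"receiver = {subscriber}, forward = {fwd});",
        # 2. Wait until the set origin has confirmed the provider's event.
        f"wait for event (origin = {provider}, confirmed = {origin}, "
        f"wait on = {origin}, timeout = 0);",
        # 3. Generate a sync on the set origin so slonik knows its last event.
        f"sync (id = {origin});",
        # 4. Wait until the new subscriber has confirmed that sync.
        f"wait for event (origin = {origin}, confirmed = {subscriber}, "
        f"wait on = {subscriber}, timeout = 0);",
    ])


if __name__ == "__main__":
    print(subscribe_and_wait(set_id=1, origin=1, provider=2, subscriber=3))
```

For a non-cascaded subscription the provider and the origin are the same node, so the first wait simply names that node in all three places.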

The "wait for event" has a timeout. In the case of subscribe set operations, 
which are known to lead to hours or in some cases even days of lag, such a 
timeout is certainly unwanted. It is disabled with timeout = 0.


Jan

-- 
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin


