[Slony1-hackers] automatic WAIT FOR proposal

Wed Dec 22 13:30:21 PST 2010

On 10-12-22 10:45 AM, Steve Singer wrote:
> On 10-12-10 04:54 PM, Christopher Browne wrote:

>
> I will write a design proposal that explores what things might look like
> if we went that route.
>

First I will define some notation.

Let an event with sequence number s originating at node n be known as 
E(n,s)  (ie 1,1234)

An event E(n,s) has been confirmed by node a if the function
C(a,E(n,s)) returns true.

If a direct path exists between two nodes a,b then the function P(a,b) 
returns true.  Note P(a,b) != P(b,a)

I will now define a function that returns the events from other nodes 
confirmed on node a when E(a,s) is generated.

Visible(a,E(a,s) = [E(1,s1),E(2,s2)....E(z,sz)]

For example:  If  when event 1,100 was generated on node 1 node 1 had 
sl_confirm entires indicating that it has confirmed events 2,5 3,10 and 
4,50.
Visible(1,E(1,100)) = [E(1,100), E(2,5), E(3,10), E(4,50) ]

The order in which a set of events  leading up to an Event E(n,s) is 
processed on a node can be expressed as an ordered list returned by the 
Order(a,n,s) function:  Order(3,2,2)= E(1,1), E(2,1),E(1,2),E(3,1),E(2,2)

This means that on node 3, the events leading up to processing of event 
E(2,2) were  E(2,1) then E(1,2) then E(3,1).

A race condition is said to exists when ever the system allows two 
different orderings for  Order(a,n,s) that produce a different resulting 
action.

Two different orderings leading to an event being processed might not 
lead to a race condition if their are no dependencies between the events 
that were processed in a different order.

Two events conflict  Conflict(E(n1,s1),E(n2,s2))=TRUE
If a different result is produced by
Order(n3,E(n1,s1)=E(n2,s2),E(n1,s1) compared with
Order(n3,E(n2,s2)=E(n1,s1),E(n2,s2)

Whether two events conflict  is dependent on the event type and the 
arguments to those events.

Proposition P1:  If an event E(1,1) followed by an event E(1,1+x) is 
submitted to node 1 by slonik then  these two events will not cause a 
race condition with each other.

Proof:
The createEvent() function obtains an exclusive lock on sl_event and 
gets the next sequence number for the event id.  Because the exclusive 
lock on sl_event is held until the transaction is committed it is 
impossible for a row to be visible in sl_event with a higher sequence 
number than another row that will be committed later.  (assumes all 
events get generated while holding the lock on sl_event).

This means that on the event node  E(1,1+x) will always happen after 
E(1,1).   It also means that if a remote node sees E(1,1+x) in the 
sl_event table then it will also see (or have already processed) E(1,1) 
because E(1,1) would have committed before and we don't delete sl_event 
rows that have not been processed.    Therefore as long as the remote 
nodes process the events from a particular node in event id order (which 
they do) then the remote node will always process E(1,1) before it 
processes E(1,1+x).

Proposition 2:

If an Event (n1,s1) is only applied on node n3 when
Visible(n1,E(n1,s1)) CONTAINED IN Order(n3,E(n1,s1))
then no race condition is possible.

Counter Proof:
Consider the following sequence of events:
E(1,2) : Visible(1,2)=>E(2,1) E(3,1)
E(2,2):  Visible(2,2)=>E(1,1) E(3,1)

gives
Order(1,E(2,2)=> E(1,2), E(2,2)
Order(2,E(1,2)=> E(2,2), E(1,2)

At node three E(2,2) might get applied before or after E(1,2).

The second part of the counter proof is to show that there exist event 
types that can be generated on nodes 1,2 such that
Conflict(E(1,2),E(2,2)=TRUE.

create set(id=1,origin=1);  #E(1,2)
create set(id=1,origin=2); #E(2,2)

The behavior on node 3 is clearly differ based on which command arrives 
at node 3 first. Furthermore node 1 and node 2 will be left
in different states as well.

Proposition 3:
If an event E(n1,s1) is not applied on node n3 until all events in 
Visible(n1,E(n1,s1) have all been confirmed on n3 then no event E(n2,s1) 
exists that produces a race condition on n3 unless the conflict existed 
on the event nodes n1 and n2.

Proof:
Assume Visible(1,E(1,1)) does not include an event E(2,1) such that
there exists a race condition at node 3.

This means that
Order(3,E(1,1)==> E(2,1),E(1,1) produces different result than
Order(3,E(1,1)===> E(1,1),E(2,1)
this means that Conflict(E(1,1), E(2,1))=TRUE
and that conflict would exist on node 1 and node 2.
If E(2,1) had reached node 1 before E(1,1) was created then E(2,1) would 
be part of Visible(1,E(1,1)) and would always be confirmed on node 3 
before E(2,1) gets applied the assumptions in the proposition.

If E(1,1) had reached node 2 before E(2,1) was generated then 
Visible(2,E(2,1)) would include E(1,1) and E(1,1) would always get 
applied on node 3 before E(2,1) because of the assumptions we made in 
our proposition.

This means that if IF  a) we can somehow avoid race conditions at the 
event nodes AND  b) implement the rule "apply all events that were 
visible on the event origin before applying the event" then we can 
eliminate race conditions.

i) I feel (a) needs to be handled by slonik and can be handled by slonik 
waiting for the next event node to be caught up to previous event nodes 
before submitting an event to it (discussed in a previous email on this 
thread, though I have not formally tried to prove this

ii) can be implemented by adding a new column to sl_event that stores an 
array of event tuples - which consist of the highest events from each 
node confirmed on the node at the time sl_event is generated.  The a 
remoteWorker won't apply an event to its local node until the 
pre-conditions have been met.

Questions:  What is the performance impact of getting the highest 
confirmed event id values? This is unknown - but will probably involve 
querying sl_event and/or sl_confirm.

Is it possible for an event to be included  in Visible(1,E(1,1)) that 
might never make it to a node n3 that is trying to process E(1,1)?
I think no.  Every event  E(x,y) that is in Visible(1,E(1,1)) is in the 
sl_event table on node 1 by virtue of it having been confirmed on node 1 
but not confirmed elsewhere.  If slon always pulls events (no matter 
what the event origin) into node 3 from the node 1 sl_event table then 
it will include E(x,y).  Even if node 3 isn't directly talking to node 
1, any sl_event table in the cluster that contains E(1,1) would also 
contain E(x,y) by virtue of that event being on node 1 at the time.

I feel this produces a solution to wait for where slonik only needs to 
wait to make sure the next event node has received all events so far.

This does NOT deal with race conditions that can be created by multiple 
concurrent slonik scripts using different event nodes but I might be 
able to solve this with an sl_sloniklock type of table.

Issues relating to TRY blocks are still somewhat outstanding - as 
discussed in the other email.