196 – snapshot id values appear to travel backwards in sl_event

Bug 196 - snapshot id values appear to travel backwards in sl_event

Summary: snapshot id values appear to travel backwards in sl_event

Status:	RESOLVED FIXED

Alias:	None

Product:	Slony-I
Classification:	Unclassified
Component:	slon (show other bugs)
Version:	devel
Hardware:	PC Linux

Importance:	low normal
Assignee:	Christopher Browne

URL:

Depends on:
Blocks:

Reported:	2011-02-16 12:55 UTC by Steve Singer
Modified:	2011-04-12 13:23 UTC (History)
CC List:	1 user (show)

See Also:

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Steve Singer 2011-02-16 12:55:58 UTC

^
test1=# select ev_origin,ev_seqno,ev_type,ev_timestamp,ev_snapshot FROM _disorder_replica.sl_event where ev_seqno>=5000000050  AND ev_seqno<=5000000054;
 ev_origin |  ev_seqno  | ev_type |        ev_timestamp        |   ev_snapshot    
-----------+------------+---------+----------------------------+------------------
         1 | 5000000050 | SYNC    | 2011-02-15 17:50:27.10242  | 3719843:3719843:
         1 | 5000000051 | SYNC    | 2011-02-15 17:50:27.180308 | 3719845:3719845:
         1 | 5000000052 | SYNC    | 2011-02-15 17:50:28.505496 | 3719843:3719843:
         1 | 5000000053 | SYNC    | 2011-02-15 17:50:33.125079 | 3719917:3719917:
         1 | 5000000054 | SYNC    | 2011-02-15 17:50:35.159158 | 3719921:3719921:
(5 rows)


1,5000000050  and 1,5000000052 have the same event snapshot.  1,5000000051 is a later snapshot than 1,5000000052. 

This resulted in a slon trying to insert the same row into a table twice on the remote node.


This was seen in 2.0.6+ with the SlonKilling.js test from the disorder test suite.

The problem this bug is concerned with is that the exclusive lock on sl_event that we take in slony_funcs.c comes too late.  THe snapshot was already allocated.  We need to get that exclusive lock before the snapshot is allocated. 

The solution proposed is to modify slonik to get the exclsuive lock before calling any stored functions (since calling the stored functions creates the snapshot).

Comment 1 Christopher Browne 2011-03-01 12:10:46 UTC

Preliminary implementation...

https://github.com/cbbrowne/slony1-engine/tree/bug196

--------------------------------------------------------------------------------
Add sl_event_lock table that is what gets locked to prevent mis-serialization of creation of sl_event entries.

Add pre_event_create() function that takes out that lock which will be called by anything that does an event create.

Add code to upgrade schema function which creates sl_event_lock if it is missing.

Slon SYNC thread needs to call pre_event_create() before calling createEvent().

Various functions for slonik actions call pre_event_create() before calling createEvent().
--------------------------------------------------------------------------------

Comment 2 Steve Singer 2011-03-01 12:14:16 UTC

(In reply to comment #1)
> Preliminary implementation...
> 
> https://github.com/cbbrowne/slony1-engine/tree/bug196
> 
> --------------------------------------------------------------------------------
> Add sl_event_lock table that is what gets locked to prevent mis-serialization
> of creation of sl_event entries.
> 
> Add pre_event_create() function that takes out that lock which will be called
> by anything that does an event create.


I think that function needs to be a C function not a stored procedure.  The act of calling the stored procedure creates a snapshot - when the SELECT starts.  THe lock doesn't get created until the 'LOCK statement' inside the stored procedure runs.

Comment 3 Christopher Browne 2011-03-02 10:04:46 UTC

OK, updated to expressly submit LOCK TABLE sl_event_lock, rather than allowing race condition by doing it inside a stored function.

There a few more places to adjust, as there are stored functions that generate events:

a) failednode2() generates an event, so the table needs to be locked before calling this function in src/slonik/slonik.c

b) failoverSet_int() generates an event, hence src/slon/remote_worker.c needs to lock the table first.

https://github.com/cbbrowne/slony1-engine/commit/215054a2ff4f3833aec9cd7f39fcb45a1ee00455

It has been speculated that there's an event created as part of subscription that needs to be guarded; I still need to seek that out, if it exists.

Comment 4 Christopher Browne 2011-03-02 11:44:09 UTC

OK, found the extra spot that needs the guard.

https://github.com/cbbrowne/slony1-engine/commit/38e62558d5917f0b56634d28f3d920fe19ff4e7e

I also added docs to the slon README file.

Comment 5 Steve Singer 2011-04-12 07:05:17 UTC

Testing has encountered


2011-04-12 09:56:41 EDTINFO   main: running scheduler mainloop
2011-04-12 09:56:41 EDTINFO   monitorThread: thread starts
2011-04-12 09:56:41 EDTCONFIG version for "host=localhost dbname=test3" is 80312
2011-04-12 09:56:41 EDTCONFIG version for "host=localhost dbname=test3" is 80312
2011-04-12 09:56:41 EDTCONFIG version for "dbname=test5 host=localhost port=5432 user=slon password=slon" is 80312
2011-04-12 09:56:41 EDTCONFIG version for "host=localhost dbname=test3" is 80312
2011-04-12 09:56:41 EDTCONFIG storeSet: set_id=1 set_origin=3 set_comment='<unchanged>' - update set

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff5ba6700 (LWP 31602)]
__strlen_sse2 () at ../sysdeps/x86_64/multiarch/../strlen.S:31
31	../sysdeps/x86_64/multiarch/../strlen.S: No such file or directory.
	in ../sysdeps/x86_64/multiarch/../strlen.S
(gdb) up
#1  0x000000000042294e in slon_appendquery_int (dsp=0x7ffff5ba5b40, fmt=0x42b21c "s.failoverSet_int(%d, %d, %d, %s); ", ap=0x7ffff5b92130)
    at dbutils.c:540
540							dstring_append(dsp, s);
(gdb) 
#2  0x0000000000422837 in slon_appendquery (dsp=0x7ffff5ba5b40, 
    fmt=0x42b1f8 "lock table %s.sl_event_lock;select %s.failoverSet_int(%d, %d, %d, %s); ") at dbutils.c:511
511		slon_appendquery_int(dsp, fmt, ap);
(gdb) up
#3  0x000000000040dba9 in remoteWorkerThread_main (cdata=0x6466f0) at remote_worker.c:1202
1202					slon_appendquery(&query1,
(gdb) 

Proposed fix at https://github.com/ssinger/slony1-engine/tree/bug196 

https://github.com/ssinger/slony1-engine/commit/c1dbb32a1a993e0d3d32b49526d1ab3fb18ec68d

Comment 6 Christopher Browne 2011-04-12 08:16:05 UTC

(In reply to comment #5)

> https://github.com/ssinger/slony1-engine/commit/c1dbb32a1a993e0d3d32b49526d1ab3fb18ec68d

Good catch, I have merged that in, thanks!

Comment 7 Steve Singer 2011-04-12 11:53:04 UTC

(In reply to comment #6)
> (In reply to comment #5)
> 
> > https://github.com/ssinger/slony1-engine/commit/c1dbb32a1a993e0d3d32b49526d1ab3fb18ec68d
> 
> Good catch, I have merged that in, thanks!


This now seems fine to commit to 2.1

Comment 8 Christopher Browne 2011-04-12 13:23:18 UTC

Ran a regression test to make sure there aren't any dumb errors, and merged into master.