^ test1=# select ev_origin,ev_seqno,ev_type,ev_timestamp,ev_snapshot FROM _disorder_replica.sl_event where ev_seqno>=5000000050 AND ev_seqno<=5000000054; ev_origin | ev_seqno | ev_type | ev_timestamp | ev_snapshot -----------+------------+---------+----------------------------+------------------ 1 | 5000000050 | SYNC | 2011-02-15 17:50:27.10242 | 3719843:3719843: 1 | 5000000051 | SYNC | 2011-02-15 17:50:27.180308 | 3719845:3719845: 1 | 5000000052 | SYNC | 2011-02-15 17:50:28.505496 | 3719843:3719843: 1 | 5000000053 | SYNC | 2011-02-15 17:50:33.125079 | 3719917:3719917: 1 | 5000000054 | SYNC | 2011-02-15 17:50:35.159158 | 3719921:3719921: (5 rows) 1,5000000050 and 1,5000000052 have the same event snapshot. 1,5000000051 is a later snapshot than 1,5000000052. This resulted in a slon trying to insert the same row into a table twice on the remote node. This was seen in 2.0.6+ with the SlonKilling.js test from the disorder test suite. The problem this bug is concerned with is that the exclusive lock on sl_event that we take in slony_funcs.c comes too late. THe snapshot was already allocated. We need to get that exclusive lock before the snapshot is allocated. The solution proposed is to modify slonik to get the exclsuive lock before calling any stored functions (since calling the stored functions creates the snapshot).
Preliminary implementation... https://github.com/cbbrowne/slony1-engine/tree/bug196 -------------------------------------------------------------------------------- Add sl_event_lock table that is what gets locked to prevent mis-serialization of creation of sl_event entries. Add pre_event_create() function that takes out that lock which will be called by anything that does an event create. Add code to upgrade schema function which creates sl_event_lock if it is missing. Slon SYNC thread needs to call pre_event_create() before calling createEvent(). Various functions for slonik actions call pre_event_create() before calling createEvent(). --------------------------------------------------------------------------------
(In reply to comment #1) > Preliminary implementation... > > https://github.com/cbbrowne/slony1-engine/tree/bug196 > > -------------------------------------------------------------------------------- > Add sl_event_lock table that is what gets locked to prevent mis-serialization > of creation of sl_event entries. > > Add pre_event_create() function that takes out that lock which will be called > by anything that does an event create. I think that function needs to be a C function not a stored procedure. The act of calling the stored procedure creates a snapshot - when the SELECT starts. THe lock doesn't get created until the 'LOCK statement' inside the stored procedure runs.
OK, updated to expressly submit LOCK TABLE sl_event_lock, rather than allowing race condition by doing it inside a stored function. There a few more places to adjust, as there are stored functions that generate events: a) failednode2() generates an event, so the table needs to be locked before calling this function in src/slonik/slonik.c b) failoverSet_int() generates an event, hence src/slon/remote_worker.c needs to lock the table first. https://github.com/cbbrowne/slony1-engine/commit/215054a2ff4f3833aec9cd7f39fcb45a1ee00455 It has been speculated that there's an event created as part of subscription that needs to be guarded; I still need to seek that out, if it exists.
OK, found the extra spot that needs the guard. https://github.com/cbbrowne/slony1-engine/commit/38e62558d5917f0b56634d28f3d920fe19ff4e7e I also added docs to the slon README file.
Testing has encountered 2011-04-12 09:56:41 EDTINFO main: running scheduler mainloop 2011-04-12 09:56:41 EDTINFO monitorThread: thread starts 2011-04-12 09:56:41 EDTCONFIG version for "host=localhost dbname=test3" is 80312 2011-04-12 09:56:41 EDTCONFIG version for "host=localhost dbname=test3" is 80312 2011-04-12 09:56:41 EDTCONFIG version for "dbname=test5 host=localhost port=5432 user=slon password=slon" is 80312 2011-04-12 09:56:41 EDTCONFIG version for "host=localhost dbname=test3" is 80312 2011-04-12 09:56:41 EDTCONFIG storeSet: set_id=1 set_origin=3 set_comment='<unchanged>' - update set Program received signal SIGSEGV, Segmentation fault. [Switching to Thread 0x7ffff5ba6700 (LWP 31602)] __strlen_sse2 () at ../sysdeps/x86_64/multiarch/../strlen.S:31 31 ../sysdeps/x86_64/multiarch/../strlen.S: No such file or directory. in ../sysdeps/x86_64/multiarch/../strlen.S (gdb) up #1 0x000000000042294e in slon_appendquery_int (dsp=0x7ffff5ba5b40, fmt=0x42b21c "s.failoverSet_int(%d, %d, %d, %s); ", ap=0x7ffff5b92130) at dbutils.c:540 540 dstring_append(dsp, s); (gdb) #2 0x0000000000422837 in slon_appendquery (dsp=0x7ffff5ba5b40, fmt=0x42b1f8 "lock table %s.sl_event_lock;select %s.failoverSet_int(%d, %d, %d, %s); ") at dbutils.c:511 511 slon_appendquery_int(dsp, fmt, ap); (gdb) up #3 0x000000000040dba9 in remoteWorkerThread_main (cdata=0x6466f0) at remote_worker.c:1202 1202 slon_appendquery(&query1, (gdb) Proposed fix at https://github.com/ssinger/slony1-engine/tree/bug196 https://github.com/ssinger/slony1-engine/commit/c1dbb32a1a993e0d3d32b49526d1ab3fb18ec68d
(In reply to comment #5) > https://github.com/ssinger/slony1-engine/commit/c1dbb32a1a993e0d3d32b49526d1ab3fb18ec68d Good catch, I have merged that in, thanks!
(In reply to comment #6) > (In reply to comment #5) > > > https://github.com/ssinger/slony1-engine/commit/c1dbb32a1a993e0d3d32b49526d1ab3fb18ec68d > > Good catch, I have merged that in, thanks! This now seems fine to commit to 2.1
Ran a regression test to make sure there aren't any dumb errors, and merged into master.