Bug 229 - creating a set then adding a table to a different set can deadlock
Summary: creating a set then adding a table to a different set can deadlock
Status: RESOLVED FIXED
Alias: None
Product: Slony-I
Classification: Unclassified
Component: slonik
Version: devel
Hardware: PC Linux
Importance: low enhancement
Assignee: Steve Singer
URL:
Depends on:
Blocks:
 
Reported: 2011-08-08 06:31 UTC by Steve Singer
Modified: 2011-11-08 11:00 UTC
CC List: 1 user

See Also:


Attachments
test case (1.16 KB, patch) - 2011-08-08 11:16 UTC, Steve Singer
fix for bug (3.44 KB, patch) - 2011-08-08 11:17 UTC, Steve Singer
create the WaitForTest (4.11 KB, patch) - 2011-08-16 04:54 UTC, Steve Singer

Description Steve Singer 2011-08-08 06:31:35 UTC
A tester has reported the following in 2.1.0 beta3:

A slonik script such as 

create set (id=2, origin=2, comment='second set');
create set (id=3, origin=3, comment='third set');
set add table (set id=2, origin=2, tables='disorder.do_item_*');

----

can leave slonik hanging, waiting for an event confirmation that never arrives.

The issue seems to be that the add table code will:

1. Open a transaction and obtain a lock on sl_config_lock on the origin node (node 2)
2. Perform an implicit 'wait for event' to ensure that the last event submitted to node 3 (the previous origin) has propagated.

However, because slonik holds a lock on sl_config_lock on node 2, the slon for node 2 (the remote worker) might not be able to replicate events to node 2 (the store node), since applying many events to node 2 requires obtaining sl_config_lock (and sometimes sl_event_lock).

slonik should not invoke the implicit 'wait for' behaviour while it is holding an open transaction.
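
A manual workaround along these lines (a sketch only, not taken from the reporter's script; the exact wait for event parameters are an assumption) would be to make the wait explicit before the set add table, so slonik never has to fall back on the implicit wait while it is holding locks:

create set (id=2, origin=2, comment='second set');
create set (id=3, origin=3, comment='third set');

# Explicitly wait until the CREATE SET event generated on node 3 has been
# confirmed, before slonik opens the SET ADD TABLE transaction (and takes
# sl_config_lock) on node 2.
wait for event (origin=3, confirmed=all, wait on=3);

set add table (set id=2, origin=2, tables='disorder.do_item_*');

With the explicit wait, the confirmation check happens while slonik holds no locks, which is the same ordering the fix is meant to guarantee automatically.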
Comment 1 Steve Singer 2011-08-08 11:16:52 UTC
Created attachment 118 [details]
test case

Test case for the bug
Comment 2 Steve Singer 2011-08-08 11:17:08 UTC
Created attachment 119 [details]
fix for bug
Comment 3 Steve Singer 2011-08-08 11:18:50 UTC
Please arrange for a review of this fix.
Comment 4 Steve Singer 2011-08-08 13:20:28 UTC
I don't know if this is related to this change or to another recent change, but

I am occasionally (one or two tests will hit this over a full disorder run) getting something like

@5ef4b65d - 2011-08-08 14:37:58 EDT DEBUG2 remoteWorkerThread_4: Received event #4 from 5000000002 type:STORE_PATH
2011-08-08 14:37:58,028 [slonik add pathsstderr] ERROR info.slony.clustertest.testcoordinator.slony.SlonikScript slonik add paths:java.lang.UNIXProcess@5d79a22d - <stdin>:19: PGRES_FATAL_ERROR lock table "_disorder_replica".sl_event_lock, "_disorder_replica".sl_config_lock;select "_disorder_replica".storePath(5, 2, 'dbname=test5 host=localhost port=5432 user=slon password=slon', 10);  - ERROR:  duplicate key value violates unique constraint "sl_node-pkey"
2011-08-08 14:37:58,028 [slonik add pathsstderr] ERROR info.slony.clustertest.testcoordinator.slony.SlonikScript slonik add paths:java.lang.UNIXProcess@5d79a22d - CONTEXT:  SQL statement "insert into "_disorder_replica".sl_node (no_id, no_active, no_comment) values ( $1 , 'f',  $2 )"
2011-08-08 14:37:58,028 [slonik add pathsstderr] ERROR info.slony.clustertest.testcoordinator.slony.SlonikScript slonik add paths:java.lang.UNIXProcess@5d79a22d - PL/pgSQL function "storenode_int" line 22 at SQL statement


as I test this change.
Comment 5 Jan Wieck 2011-08-12 14:14:34 UTC
Unless the whole sequence is done in a try block, I am wondering if the automatic wait caused by switching event nodes is happening at the wrong moment here. It appears that it takes place AFTER the transaction for the "set add table" has taken out the lock.
Comment 6 Steve Singer 2011-08-15 06:04:30 UTC
(In reply to comment #5)
> Unless the whole sequence is done in a try block, I am wondering if the
> automatic wait caused by switching event nodes is happening at the wrong moment
> here. It appears that it takes place AFTER the transaction for the "set add
> table" has taken out the lock.

The point of the patch is to avoid exactly that.

With this patch the intended behaviour is:

1. Open a transaction against the new origin and lock sl_event_lock.
2. Get ready to submit a SYNC event to the new origin. This triggers an
   implicit 'wait for event' on the old origin, until the sl_confirm is seen.
3. Once the old origin is caught up, the SYNC is submitted to the new origin.
4. The transaction commits
5. Open a new transaction getting ready for the set add table.

Steps 1-4 should ensure that everything is caught up before the transaction in step 5 is opened.
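
For illustration, the effect of steps 1-4 corresponds roughly to issuing explicit commands like the following before the set add table (a sketch only; the patch does this implicitly inside slonik, and the parameter values here are assumptions):

# Steps 1-4 made explicit: wait until the old origin (node 3) is caught up,
# then generate a SYNC on the new origin (node 2); each command runs in its
# own short transaction that commits before the next one starts.
wait for event (origin=3, confirmed=2, wait on=3);
sync (id=2);

# Step 5: only now is the transaction for SET ADD TABLE opened on node 2,
# so sl_config_lock is never held while slonik is waiting.
set add table (set id=2, origin=2, tables='disorder.do_item_*');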

Does my patch not do what I think it does? Or should I commit it?
Comment 7 Christopher Browne 2011-08-15 12:50:03 UTC
I'll take a peek.
Comment 8 Christopher Browne 2011-08-15 14:56:55 UTC
Having some trouble "taking a look" - the patch refers to clustertest/disorder/tests/WaitForTest.js, which doesn't seem to be in any tree I can see.

Possibly you've got a private branch that's not visible?
Comment 9 Steve Singer 2011-08-16 04:54:40 UTC
Created attachment 122 [details]
create the WaitForTest
Comment 10 Steve Singer 2011-08-16 04:55:14 UTC
Sorry, I have added the missing patch.
You can also see this at https://github.com/ssinger/slony1-engine/tree/bug229
Comment 11 Christopher Browne 2011-08-24 10:28:19 UTC
(In reply to comment #10)
> Sorry I have added the missing patch.
> You can also see this at https://github.com/ssinger/slony1-engine/tree/bug229

I had success with that...

- The test fails without commit 7a8aa60e9a881e7df337ddf6c662a87801f6372b.

- The test succeeds with that patch applied.
Comment 12 Steve Singer 2011-11-08 11:00:15 UTC
This was fixed in 2.1.0 with commit 95d82dc387deaaa6d47bf97087dac32176abf648