Bug 218 - slon often has to retry transactions due to concurrent update in 2.1.0.b2
Summary: slon often has to retry transactions due to concurrent update in 2.1.0.b2
Status: RESOLVED FIXED
Alias: None
Product: Slony-I
Classification: Unclassified
Component: slon (show other bugs)
Version: devel
Hardware: PC Linux
: low enhancement
Assignee: Steve Singer
URL:
Depends on:
Blocks:
 
Reported: 2011-06-07 13:52 UTC by Steve Singer
Modified: 2011-08-11 05:02 UTC (History)
1 user (show)

See Also:


Attachments
fix for bug 218 (1.07 KB, patch)
2011-06-08 10:15 UTC, Steve Singer
Details
Patch to fix order of locking (2.02 KB, patch)
2011-08-09 07:37 UTC, Steve Singer
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Steve Singer 2011-06-07 13:52:56 UTC
Often I see slon having to rollback/retry transactions in 2.1.0.b2 due to 

 could not serialize access due to concurrent update
CONTEXT:  SQL statement "delete from "_slonyregress".sl_listen"
	PL/pgSQL function "rebuildlistenentries" line 5 at SQL statement
	SQL statement "SELECT "_slonyregress".RebuildListenEntries()"
	PL/pgSQL function "storepath_int" line 48 at PERFORM
STATEMENT:  select "_slonyregress".storePath_int(2, 3, 'dbname=slonyregress2 host=localhost port=9432 user=slony password=test', 10); insert into "_slonyregress".sl_event     (ev_origin, ev_seqno, ev_timestamp,      ev_snapshot, ev_type , ev_data1, ev_data2, ev_data3, ev_data4    ) values ('3', '5000000002', '2011-06-07 16:49:12.861808-04', '217337:217337:', 'STORE_PATH', '2', '3', 'dbname=slonyregress2 host=localhost port=9432 user=slony password=test', '10'); insert into "_slonyregress".sl_confirm 	(con_origin, con_received, con_seqno, con_timestamp)    values (3, 1, '5000000002', now()); commit transaction;


type errors.  

A run of the regression tests encountered this issue 43 types.  When I run those tests against REL_2_0_STABLE (2.0.7rc1) I don't get any of those errors.

In one instance of testing a slon kept aborting, retrying and failing on serialize access errors for many minutes before being lucky enough to make progress.
Comment 1 Steve Singer 2011-06-08 08:46:13 UTC
The following is an example from the postgresql query log that shows the issue.


slon.remoteWorkerThread_3 slonyregress1 4def9413.2f41 0LOG:  statement: begin transaction; set transaction isolation level serializable; 

slon.remoteWorkerThread_3 slonyregress1 4def9413.2f41 0LOG:  statement: select "_slonyregress".storePath_int(1, 3, 'dbname=slonyregress1 host=localhost port=9432 user=slony password=test', 10); insert into "_slonyregress".sl_event     (ev_origin, ev_seqno, ev_timestamp,      ev_snapshot, ev_type , ev_data1, ev_data2, ev_data3, ev_data4    ) values ('3', '5000000001', '2011-06-08 11:24:03.10834-04', '43245:43245:', 'STORE_PATH', '1', '3', 'dbname=slonyregress1 host=localhost port=9432 user=slony password=test', '10'); insert into "_slonyregress".sl_confirm       (con_origin, con_received, con_seqno, con_timestamp)    values (3, 1, '5000000001', now()); commit transaction;

slon.remoteWorkerThread_2 slonyregress1 4def9413.2f3e 0LOG:  statement: begin transaction; set transaction isolation level serializable; 

slon.remoteWorkerThread_2 slonyregress1 4def9413.2f3e 0LOG:  statement: select "_slonyregress".storePath_int(1, 2, 'dbname=slonyregress1 host=localhost port=9432 user=slony password=test', 10); insert into "_slonyregress".sl_event     (ev_origin, ev_seqno, ev_timestamp,      ev_snapshot, ev_type , ev_data1, ev_data2, ev_data3, ev_data4    ) values ('2', '5000000001', '2011-06-08 11:24:03.018181-04', '43243:43243:', 'STORE_PATH', '1', '2', 'dbname=slonyregress1 host=localhost port=9432 user=slony password=test', '10'); insert into "_slonyregress".sl_confirm      (con_origin, con_received, con_seqno, con_timestamp)    values (2, 1, '5000000001', now()); commit transaction;


slon.remoteWorkerThread_2 slonyregress1 4def9413.2f3e 43258ERROR:  could not serialize access due to read/write dependencies among transactions
slon.remoteWorkerThread_2 slonyregress1 4def9413.2f3e 43258DETAIL:  Cancelled on conflict out to pivot 43254, during read.
slon.remoteWorkerThread_2 slonyregress1 4def9413.2f3e 43258HINT:  The transaction might succeed if retried.
slon.remoteWorkerThread_2 slonyregress1 4def9413.2f3e 43258CONTEXT:  SQL statement "delete from "_slonyregress".sl_listen"
slon.remoteWorkerThread_2 slonyregress1 4def9413.2f3e 43258STATEMENT:  select "_slonyregress".storePath_int(1, 2, 'dbname=slonyregress1 host=localhost port=9432 user=slony password=test', 10); insert into "_slonyregress".sl_event     (ev_origin, ev_seqno, ev_timestamp,      ev_snapshot, ev_type , ev_data1, ev_data2, ev_data3, ev_data4    ) values ('2', '5000000001', '2011-06-08 11:24:03.018181-04', '43243:43243:', 'STORE_PATH', '1', '2', 'dbname=slonyregress1 host=localhost port=9432 user=slony password=test', '10'); insert into "_slonyregress".sl_confirm       (con_origin, con_received, con_seqno, con_timestamp)    values (2, 1, '5000000001', now()); commit transaction;
slon.remoteWorkerThread_3 slonyregress1 4def9413.2f41 0LOG:  statement: select li_origin, li_provider from "_slonyregress".sl_listen where li_receiver = 1


I am not sure why the lock table 'sl_config_lock' at the top of storePath_int isn't stopping this.
Comment 2 Steve Singer 2011-06-08 08:54:41 UTC
The answer to my previous question is in the PostgreSQL documentation for 'lock table'
----------
To achieve a similar effect when running a transaction at the Serializable isolation level, you have to execute the LOCK TABLE statement before executing any SELECT or data modification statement. A serializable transaction's view of data will be frozen when its first SELECT or data modification statement begins. A LOCK TABLE later in the transaction will still prevent concurrent writes — but it won't ensure that what the transaction reads corresponds to the latest committed values. 
-----------------

We should move the lock table on sl_config_lock to be outside of the storePath_int call, similar to what we did for sl_event_lock.
Comment 3 Steve Singer 2011-06-08 10:15:18 UTC
Created attachment 104 [details]
fix for bug 218

Lock sl_config_lock before calling storePath_int
Comment 4 Christopher Browne 2011-06-08 11:55:44 UTC
Looks like a good change to me.

I kicked off regression tests against an instance with this patch; tests are running fine thus far.

If there's a particular test that tends to expose failures, it would be a fine thing to point out which one it is...
Comment 6 Jan Wieck 2011-06-09 11:56:45 UTC
Does this actually solve the problem or does it only lower the chance of running into it?

You remember we had the discussion about LOCK TABLE with respect to snapshots and realized, that LOCK TABLE statements at the beginning of a transaction actually cause the lock to be taken out BEFORE the transaction snapshot is generated. However, this LOCK TABLE cannot be done from inside of a stored procedure because SELECTing that SP causes the snapshot to be generated before the LOCK TABLE statement itself.

My gut feeling is that we need slonik to do the LOCK on the sl_config_log at the beginning of any transaction, where it intends to call SPs that change the configuration.


Jan
Comment 7 Steve Singer 2011-06-09 12:13:21 UTC
(In reply to comment #6)
> Does this actually solve the problem or does it only lower the chance of
> running into it?
> 
> You remember we had the discussion about LOCK TABLE with respect to snapshots
> and realized, that LOCK TABLE statements at the beginning of a transaction
> actually cause the lock to be taken out BEFORE the transaction snapshot is
> generated. However, this LOCK TABLE cannot be done from inside of a stored
> procedure because SELECTing that SP causes the snapshot to be generated before
> the LOCK TABLE statement itself.
> 
> My gut feeling is that we need slonik to do the LOCK on the sl_config_log at
> the beginning of any transaction, where it intends to call SPs that change the
> configuration.
> 
> 
> Jan

I am inclined to agree with you.  I don't think we should be taking locks in the stored procedures, so slonik + a few parts of slon might need to do this.
Comment 8 Jan Wieck 2011-06-09 12:23:19 UTC
For what release would you consider that?
Comment 9 Steve Singer 2011-06-09 12:29:27 UTC
(In reply to comment #8)
> For what release would you consider that?

I see no reason to fix this in 2.0.

In 2.1 we removed the 'lock sl_config_lock' at the top of the slon remote worker thread, we  did this because it could cause deadlocks with the sl_event_lock that slonik now obtains.   An argument can be made that since we are messing with locking behaviour in 2.1 anyway, and some of our chances exposed some of the behaviour issues under testing then we should properly fix the problem in 2.1 

Thoughts?
Comment 10 Jan Wieck 2011-06-09 13:42:28 UTC
I agree. But that causes us to go back to development and out of BETA.


Jan
Comment 11 Christopher Browne 2011-06-09 14:33:04 UTC
I don't think this destroys "beta"; it should be a moderate amount of work to shift the locks on sl_config_lock into the first query run.

I'll take a look at this.
Comment 12 Jan Wieck 2011-06-09 14:43:18 UTC
Attempting to LOCK and already locked table only wastes some CPU cycles, so doing it in a rather generic fashion would be OK, IMHO.
Comment 13 Christopher Browne 2011-06-10 09:44:50 UTC
I have opened up my own "bug218" branch, and added the following patch:

https://github.com/cbbrowne/slony1-engine/commit/9bd169a5f872ddef48d73168e66970c53b39110d

It adds requests for locks on sl_config_lock to remote_worker.c and slonik.c.

The interesting part is that, in many cases, the initial statement locks both sl_config_lock *and* sl_event_lock, which seems likely to be quite an improvement in safety.

I'm running regression tests against this right now; if there's some other test load that is better to try, let me know.
Comment 14 Steve Singer 2011-06-10 10:31:12 UTC
Also remember to remove the 'LOCK @NAMESPACE@.sl_config_lock' lines from slony_funcs.sql - since a) the locking is now being done elsewhere, no point in paying a penalty twice.  b) Leaving it in there is a bit confusing since someone else might come along and think that it is okay to obtain sl_config_lock in the stored functions
Comment 15 Christopher Browne 2011-06-10 13:21:46 UTC
(In reply to comment #14)
> Also remember to remove the 'LOCK @NAMESPACE@.sl_config_lock' lines from
> slony_funcs.sql - since a) the locking is now being done elsewhere, no point in
> paying a penalty twice.  b) Leaving it in there is a bit confusing since
> someone else might come along and think that it is okay to obtain
> sl_config_lock in the stored functions

Good point, though the price isn't too terribly high.

Have done so, and had a successful run of all the regression tests against this.

https://github.com/cbbrowne/slony1-engine/commit/35fbb13b44e2136b808a5cd7e8457f616fcc5ccb
Comment 16 Steve Singer 2011-07-04 07:50:44 UTC
(In reply to comment #15)
> (In reply to comment #14)
> > Also remember to remove the 'LOCK @NAMESPACE@.sl_config_lock' lines from
> > slony_funcs.sql - since a) the locking is now being done elsewhere, no point in
> > paying a penalty twice.  b) Leaving it in there is a bit confusing since
> > someone else might come along and think that it is okay to obtain
> > sl_config_lock in the stored functions
> 
> Good point, though the price isn't too terribly high.
> 
> Have done so, and had a successful run of all the regression tests against
> this.
> 
> https://github.com/cbbrowne/slony1-engine/commit/35fbb13b44e2136b808a5cd7e8457f616fcc5ccb

in cleanupThread.c  there is no explicit transaction (that I can see)

This means that the LOCK TABLE  a and the 'select ..cleanupEvent(...) are done
in separate transactions, so the lock is already released before the call to cleanupEvent.

in remote_worker.c you seem to lock config_lock before event_lock while in slonik.c you lock event_lock before config_lock.  We should try to always get these locks in the same order (I am not sure which order is better or if it matters)
Comment 17 Steve Singer 2011-07-05 13:45:19 UTC
I am starting to suspect that commit 9f0ff6c21eb9f7493de01a4bda16f178e97c80aa
is causing the disorder tests (Move Set, when run as part of a larger set) to fail.

It *seems* that prior to this commit it passes while after I get failures.
Comment 18 Christopher Browne 2011-07-06 08:41:48 UTC
(In reply to comment #16)
> in cleanupThread.c  there is no explicit transaction (that I can see)
> 
> This means that the LOCK TABLE  a and the 'select ..cleanupEvent(...) are done
> in separate transactions, so the lock is already released before the call to
> cleanupEvent.

Eek, yes, that's doubtless so.

There wasn't an explicit transaction because the cleanup thread used to have, as a major component, running VACUUM against Slony's tables, and that can't take place inside a transaction.

We clearly have activity that needs a transaction, so this needs to complicate a little bit.
Comment 19 Christopher Browne 2011-07-07 14:58:51 UTC
I have committed the BEGIN/COMMIT addition...  Over to you, Steve...

https://github.com/cbbrowne/slony1-engine/commit/f3e0776f33c345227df64f06275679ed15bf1a83
Comment 20 Steve Singer 2011-07-08 11:21:15 UTC
(In reply to comment #19)
> I have committed the BEGIN/COMMIT addition...  Over to you, Steve...
> 
> https://github.com/cbbrowne/slony1-engine/commit/f3e0776f33c345227df64f06275679ed15bf1a83

This looks okay to me, this type of change requires a fair amount of testing but it is probably best to commit this to master so it can be part of all the other beta testing.
Comment 21 Christopher Browne 2011-07-11 11:40:32 UTC
Committed latest changes...

http://git.postgresql.org/gitweb/?p=slony1-engine.git;a=commit;h=30023c6980f09cba689ca3ae4bc694ba54be559d

http://git.postgresql.org/gitweb/?p=slony1-engine.git;a=commit;h=ab1853d75628fa483328a4e54a82895d6ef93a19

That's not a very nice set of patches, unfortunately.  I didn't "squash" it hard enough :-(.
Comment 22 Steve Singer 2011-08-08 13:39:12 UTC
It looks like the merge issues on this ended up merging in the version of this fix that has remote_worker obtain sl_config_lock then sl_event_lock  compared to slonik that obtains sl_event_lock then sl_config_lock.

See my review up thread dated "2011-06-08 08:46:13 PDT" on this bug.

I see the occasional deadlock with the disorder tests due to this.
Comment 23 Steve Singer 2011-08-09 07:37:02 UTC
Created attachment 120 [details]
Patch to fix order of locking
Comment 24 Steve Singer 2011-08-10 14:20:16 UTC
Chris please my patch to fix the locking ordering. You can also see that commit https://github.com/ssinger/slony1-engine/commit/6cf2ba44eccdd10de07d64e8526ed5bcb1cd80c0
Comment 25 Christopher Browne 2011-08-10 15:13:03 UTC
(In reply to comment #24)
> Chris please my patch to fix the locking ordering. You can also see that commit
> https://github.com/ssinger/slony1-engine/commit/6cf2ba44eccdd10de07d64e8526ed5bcb1cd80c0

Ah, quite right.

It's easy to see all the cases, and with the patch, they are now all consistent in locking sl_event_lock before sl_config_lock.

slonik/slonpostgres@cbbrowne [06:11:41] [~/slony1-engine.github/src] [master *]
-> % grep sl_config */*.c | grep sl_event
slon/remote_worker.c:                                                                            "lock table %s.sl_event_lock,%s.sl_config_lock;",
slon/remote_worker.c:                                                            "lock table %s.sl_event_lock,%s.sl_config_lock;"
slon/remote_worker.c:                                                            "lock table %s.sl_event_lock, %s.sl_config_lock;"
slon/remote_worker.c:                                                            "lock table %s.sl_event_lock,%s.sl_config_lock;"
slonik/slonik.c:                                 "lock table \"_%s\".sl_event_lock, \"_%s\".sl_config_lock;"
slonik/slonik.c:                                 "lock table \"_%s\".sl_event_lock, \"_%s\".sl_config_lock;"
slonik/slonik.c:                                 "lock table \"_%s\".sl_event_lock, \"_%s\".sl_config_lock;"
slonik/slonpostgres@cbbrowne [06:11:41] [~/slony1-engine.github/src] [master *]
-> % grep sl_config */*.c | grep sl_event
zsh: correct 'sl_event' to 'slevent' [nyae]? n
slon/remote_worker.c:                                                                            "lock table %s.sl_event_lock,%s.sl_config_lock;",
slon/remote_worker.c:                                                            "lock table %s.sl_event_lock,%s.sl_config_lock;"
slon/remote_worker.c:                                                            "lock table %s.sl_event_lock, %s.sl_config_lock;"
slon/remote_worker.c:                                                            "lock table %s.sl_event_lock,%s.sl_config_lock;"
slonik/slonik.c:                                 "lock table \"_%s\".sl_event_lock, \"_%s\".sl_config_lock;"
slonik/slonik.c:                                 "lock table \"_%s\".sl_event_lock, \"_%s\".sl_config_lock;"
slonik/slonik.c:                                 "lock table \"_%s\".sl_event_lock, \"_%s\".sl_config_lock;"
slonik/slonik.c:                                                 "lock table \"_%s\".sl_event_lock, \"_%s\".sl_config_lock;"
slonik/slonik.c:                                                 "lock table \"_%s\".sl_event_lock, \"_%s\".sl_config_lock;"
slonik/slonik.c:                                 "lock table \"_%s\".sl_event_lock, \"_%s\".sl_config_lock;"
slonik/slonik.c:                                 "lock table \"_%s\".sl_event_lock, \"_%s\".sl_config_lock;"
slonik/slonik.c:                                 "lock table \"_%s\".sl_event_lock, \"_%s\".sl_config_lock;"
slonik/slonik.c:                                 "lock table \"_%s\".sl_event_lock, \"_%s\".sl_config_lock;"
slonik/slonik.c:                                 "lock table \"_%s\".sl_event_lock, \"_%s\".sl_config_lock;"
slonik/slonik.c:                                 "lock table \"_%s\".sl_event_lock, \"_%s\".sl_config_lock;"
slonik/slonik.c:                                 "lock table \"_%s\".sl_event_lock, \"_%s\".sl_config_lock;"
slonik/slonik.c:                                 "lock table \"_%s\".sl_event_lock, \"_%s\".sl_config_lock;"
slonik/slonik.c:                                 "lock table \"_%s\".sl_event_lock, \"_%s\".sl_config_lock;",
slonik/slonik.c:                                 "lock table \"_%s\".sl_event_lock, \"_%s\".sl_config_lock;",
slonik/slonik.c:                                 "lock table \"_%s\".sl_event_lock, \"_%s\".sl_config_lock;"
slonik/slonik.c:                                 "lock table \"_%s\".sl_event_lock, \"_%s\".sl_config_lock;"
slonik/slonik.c:                                 "lock table \"_%s\".sl_event_lock, \"_%s\".sl_config_lock;"
slonik/slonik.c:                                 "lock table \"_%s\".sl_event_lock, \"_%s\".sl_config_lock;"
slonik/slonik.c:                                 "lock table \"_%s\".sl_event_lock, \"_%s\".sl_config_lock;"
slonik/slonik.c:                                 "lock table \"_%s\".sl_event_lock, \"_%s\".sl_config_lock;"
slonik/slonik.c:                                 "lock table \"_%s\".sl_event_lock, \"_%s\".sl_config_lock;"
slonik/slonik.c:                                 "lock table \"_%s\".sl_event_lock, \"_%s\".sl_config_lock;"
slonik/slonik.c:                                 "lock table \"_%s\".sl_event_lock, \"_%s\".sl_config_lock;"
ik.c:                                                 "lock table \"_%s\".sl_event_lock, \"_%s\".sl_config_lock;"
slonik/slonik.c:                                                 "lock table \"_%s\".sl_event_lock, \"_%s\".sl_config_lock;"
slonik/slonik.c:                                 "lock table \"_%s\".sl_event_lock, \"_%s\".sl_config_lock;"
slonik/slonik.c:                                 "lock table \"_%s\".sl_event_lock, \"_%s\".sl_config_lock;"
slonik/slonik.c:                                 "lock table \"_%s\".sl_event_lock, \"_%s\".sl_config_lock;"
slonik/slonik.c:                                 "lock table \"_%s\".sl_event_lock, \"_%s\".sl_config_lock;"
slonik/slonik.c:                                 "lock table \"_%s\".sl_event_lock, \"_%s\".sl_config_lock;"
slonik/slonik.c:                                 "lock table \"_%s\".sl_event_lock, \"_%s\".sl_config_lock;"
slonik/slonik.c:                                 "lock table \"_%s\".sl_event_lock, \"_%s\".sl_config_lock;"
slonik/slonik.c:                                 "lock table \"_%s\".sl_event_lock, \"_%s\".sl_config_lock;"
slonik/slonik.c:                                 "lock table \"_%s\".sl_event_lock, \"_%s\".sl_config_lock;",
slonik/slonik.c:                                 "lock table \"_%s\".sl_event_lock, \"_%s\".sl_config_lock;",
slonik/slonik.c:                                 "lock table \"_%s\".sl_event_lock, \"_%s\".sl_config_lock;"
slonik/slonik.c:                                 "lock table \"_%s\".sl_event_lock, \"_%s\".sl_config_lock;"
slonik/slonik.c:                                 "lock table \"_%s\".sl_event_lock, \"_%s\".sl_config_lock;"
slonik/slonik.c:                                 "lock table \"_%s\".sl_event_lock, \"_%s\".sl_config_lock;"
slonik/slonik.c:                                 "lock table \"_%s\".sl_event_lock, \"_%s\".sl_config_lock;"
slonik/slonik.c:                                 "lock table \"_%s\".sl_event_lock, \"_%s\".sl_config_lock;"
slonik/slonik.c:                                 "lock table \"_%s\".sl_event_lock, \"_%s\".sl_config_lock;"
slonik/slonik.c:                                 "lock table \"_%s\".sl_event_lock, \"_%s\".sl_config_lock;"
slonik/slonik.c:                                 "lock table \"_%s\".sl_event_lock, \"_%s\".sl_config_lock;"
ik.c:                                 "lock table \"_%s\".sl_event_lock, \"_%s\".sl_config_lock;"
slonik/slonik.c:                                 "lock table \"_%s\".sl_event_lock, \"_%s\".sl_config_lock;"
slonik/slonik.c:                                 "lock table \"_%s\".sl_event_lock, \"_%s\".sl_config_lock;"
slonik/slonik.c:                                 "lock table \"_%s\".sl_event_lock, \"_%s\".sl_config_lock;",
slonik/slonik.c:                                 "lock table \"_%s\".sl_event_lock, \"_%s\".sl_config_lock;",
slonik/slonik.c:                                 "lock table \"_%s\".sl_event_lock, \"_%s\".sl_config_lock;"
slonik/slonik.c:                                 "lock table \"_%s\".sl_event_lock, \"_%s\".sl_config_lock;"
slonik/slonik.c:                                 "lock table \"_%s\".sl_event_lock, \"_%s\".sl_config_lock;"
slonik/slonik.c:                                 "lock table \"_%s\".sl_event_lock, \"_%s\".sl_config_lock;"
slonik/slonik.c:                                 "lock table \"_%s\".sl_event_lock, \"_%s\".sl_config_lock;"
slonik/slonik.c:                                 "lock table \"_%s\".sl_event_lock, \"_%s\".sl_config_lock;"
slonik/slonik.c:                                 "lock table \"_%s\".sl_event_lock, \"_%s\".sl_config_lock;"
slonik/slonik.c:                                 "lock table \"_%s\".sl_event_lock, \"_%s\".sl_config_lock;"
slonik/slonik.c:                                 "lock table \"_%s\".sl_event_lock, \"_%s\".sl_config_lock;"