Jeff Frost jeff at frostconsultingllc.com
Tue Sep 25 07:33:26 PDT 2007
On Mon, 24 Sep 2007, Andrew Hammond wrote:

>>>>> SELECT alterTableForReplication((SELECT tab_id FROM sl_table WHERE
>>>>> tab_reloid = (SELECT oid FROM pg_class WHERE relname='cart')));
>>>>>
>>>>> Or of course it might mess things up even more. :)
>>>> The one change to suggest that comes to *my* mind is that we perhaps ought
>>>> to change Slony-I to aggressively lock the replicated tables ASAP at the
>>>> start of the DDL script event.
>>>>
>>>> That will either immediately succeed, and eliminate any possibility of
>>>> future deadlock, or immediately fail, before we even try to alter anything.
>>>>
>>>> This same change was made to the SUBSCRIBE SET process (well, "COPY_SET",
>>>> strictly speaking, but that's not at all user visible...), as we saw cases
>>>> where gradually escalating locks on tables led to deadlocks that could
>>>> waste hours' worth of subscription work.
>>>>
>>>> But it seems likely to me that there's more to the problem than we're
>>>> hearing, because deadlock shouldn't cause any corruption of anything - to
>>>> the contrary, it may be expected to prevent it.
>>>
>>> This is the latest SQL that caused the problem (note there is no COMMIT in
>>> the SQL):
>>>
>>> --------
>>> CREATE TABLE orders.amazon_items
>>> (
>>>  id serial NOT NULL,
>>>  order_id integer NOT NULL,
>>>  item_id integer NOT NULL,
>>>  amazon_item_id character varying(14) NOT NULL,
>>>  CONSTRAINT amazon_items_pkey PRIMARY KEY (id),
>>>  CONSTRAINT amazon_items_order_id_fkey FOREIGN KEY (order_id)
>>>      REFERENCES orders.orders (id) MATCH SIMPLE
>>>      ON UPDATE NO ACTION ON DELETE NO ACTION,
>>>  CONSTRAINT amazon_items_item_id_fkey FOREIGN KEY (item_id)
>>>      REFERENCES orders.items (id) MATCH SIMPLE
>>>      ON UPDATE NO ACTION ON DELETE NO ACTION
>>> )
>>> WITH OIDS;
>>> ALTER TABLE orders.amazon_items OWNER TO thenerds;
>>> --------
>>>
>>> It was called by the following slonik script:
>>>
>>> --------
>>> #!/usr/bin/slonik
>>> include </nerds/preamble.slonik>;
>>>
>>>        EXECUTE SCRIPT (
>>>                SET ID = 1,
>>>                FILENAME = '/nerds/thenerds.sql',
>>>                EVENT NODE = 1
>>>        );
>>> --------
>>>
>>> and caused the following deadlock to occur:
>>>
>>> Sep 18 15:27:54 sql1 slon[12252]: [39-1] 2007-09-18 15:27:54 EDT ERROR
>>> remoteWorkerThread_1: "select "_nerdcluster".ddlScript_complete_int(1, -1); "
>>> PGRES_FATAL_ERROR
>>> Sep 18 15:27:54 sql1 slon[12252]: [39-2]  ERROR:  deadlock detected
>>> Sep 18 15:27:54 sql1 slon[12252]: [39-3] DETAIL:  Process 12263 waits for
>>> AccessExclusiveLock on relation 121589880 of database 121589046; blocked
>>> by process 12096.
>>> Sep 18 15:27:54 sql1 slon[12252]: [39-4] Process 12096 waits for
>>> AccessShareLock on relation 121589817 of database 121589046; blocked by
>>> process 12263.
>>>
>>> Which then left some of the tables on that slave in a bad state, breaking
>>> replication:
>>>
>>> 2007-09-18 15:56:06 EDT ERROR  remoteWorkerThread_1: "select
>>> "_nerdcluster".ddlScript_prepare_int(1, -1); " PGRES_FATAL_ERROR ERROR:
>>> Slony-I: alterTableRestore(): Table "public"."carts" is not in altered state
>>> CONTEXT:  SQL statement "SELECT  "_nerdcluster".alterTableRestore( $1 )"
>>> PL/pgSQL function "ddlscript_prepare_int" line 46 at perform
>>>
>>> Note that it's just an AccessShareLock that's killing us.  It looks like
>>> that's caused by a SELECT query doing searches.  Our application does not
>>> produce any extraneous locking; it simply runs SELECTs on that server.
>>>
>>> Interestingly, before we started using the slave for queries, we would have
>>> the deadlocks happen on the master when doing DDL changes, but this never
>>> caused the tables on the master to get into a bad state.  You could just
>>> re-run your EXECUTE SCRIPT and it would usually work fine the second time.
>>>
>>> What other info can I provide?
>>
>> So, what I'm getting from all this is that while the deadlocks can occur,
>> Slony should gracefully error out and return the tables to their previously
>> altered state, but that doesn't seem to happen on these nodes.  Note that
>> it's only a problem when there's a deadlock on the slave, not on the master.
>> Should I file this as a bug?  If so, do I just need to send an email to
>> slony1-bugs?
>>
>
> If you can reproduce the problem then yes, absolutely. Ideally, write a new
> test-case that demonstrates it.

It seems quite reproducible on this particular setup.  Basically, run any 
slonik command which uses EXECUTE SCRIPT: if you get a deadlock on node2, it 
will leave some of node2's tables in the unaltered state (i.e. the 
replication triggers are missing).

The problem is, I'm not sure how to write a test case which would guarantee a 
deadlock on node2 only.  I'll write this up and send it to 
slony1-bugs at lists.slony.info.

-- 
Jeff Frost, Owner 	<jeff at frostconsultingllc.com>
Frost Consulting, LLC 	http://www.frostconsultingllc.com/
Phone: 650-780-7908	FAX: 650-649-1954

