Jeff Frost jeff at frostconsultingllc.com
Fri Sep 21 08:37:33 PDT 2007
On Wed, 19 Sep 2007, Jeff Frost wrote:

> On Wed, 19 Sep 2007, Christopher Browne wrote:
>
>>> "Bails out" is the wrong description. "Clobbered by the pgsql deadlock 
>>> detection system, leaving the cluster in an unstable state" would be more 
>>> accurate, if that's what happened. I don't know that there's a way to 
>>> catch that clobberage and run a "finally" type thing.
>> The thing is, that's not what happens.
>> 
>> Unless there's a COMMIT somewhere in the DDL script (in which case all bets 
>> are off, everywhere), the deadlock should lead to one of two things 
>> happening:
>> 
>> 1. The other process that was holding onto the tables might fail and roll 
>> back, and the DDL script would complete, or
>> 
>> 2.  The DDL script will fail and roll back, leading to the state of the 
>> tables falling back to what it was before DDL script processing began.
>> 
>> In either case, the results should leave the node in a consistent state, 
>> either:
>> a) With the DDL request having gone in, or
>> b) With the DDL request *not* having gone in.
>> 
>> Unless there's an extra COMMIT in the code (and I just looked at the code, 
>> and Did Not See One), the failure resulting from a deadlock should be 
>> benign, restoring the tables to the "previously altered state."
>>>
>>>     So, is there a reasonable way to fix this without
>>>     dropping/resubscribing the node?
>>> 
>>> 
>>> Well, to start with, you might want to figure out why your application is 
>>> taking such aggressive locks. And make sure in the future that it doesn't 
>>> happen again (not much point fixing it if it's just gonna re-occur). If 
>>> you are using a separate superuser account to connect to your database and 
>>> run your slons (generally the "slony" user) then this is really easy to 
>>> do: tweak your pg_hba to only allow connections from slony and then kick 
>>> all active non-slony connections. Revert your pg_hba at the end of the 
>>> maintenance.
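
For illustration, once pg_hba.conf has been narrowed down to the slony user
and reloaded, the remaining non-slony sessions could be kicked with something
like the query below. This is only a sketch: it assumes the slons connect as
a user literally named "slony", and it uses pg_terminate_backend(), which only
exists on PostgreSQL 8.4 and later (and on 9.2+ the column is "pid" rather
than "procpid"); on older servers the backends would have to be killed from
the shell instead.

--------
-- Terminate every backend not owned by the (hypothetical) slony user.
SELECT pg_terminate_backend(procpid)
  FROM pg_stat_activity
 WHERE usename <> 'slony';
--------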
>>> 
>>> If you're willing to experiment with using slony internal functions, you 
>>> could put the table in question into altered state. Something like this on 
>>> the offending node might work.
>>> 
>>> SELECT alterTableForReplication((SELECT tab_id FROM sl_table WHERE 
>>> tab_reloid = (SELECT oid FROM pg_class WHERE relname='cart')));
>>> 
>>> Or of course it might mess things up even more. :)
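
Since the Slony-I functions and catalogs live in the cluster's own schema, a
manual call like the one above usually needs to be schema-qualified. A hedged
version of the same idea, assuming the cluster is "_nerdcluster" as in the
logs further down and that the affected table is public.carts:

--------
-- Untested sketch: put the table back into "altered" (replicated) state on
-- the offending node.  Cluster schema and table name are assumptions taken
-- from the log excerpts below; adjust them if they differ.
SELECT "_nerdcluster".alterTableForReplication(tab_id)
  FROM "_nerdcluster".sl_table
 WHERE tab_reloid = 'public.carts'::regclass;
--------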
>> The one change that comes to *my* mind is that we perhaps ought to change 
>> Slony-I to aggressively lock the replicated tables ASAP at the start of the 
>> DDL script event.
>> 
>> That will either immediately succeed, and eliminate any possibility of 
>> future deadlock, or immediately fail, before we even try to alter anything.
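
In the meantime, the same idea can be approximated by hand: put explicit LOCK
statements for the replicated tables at the very top of the file handed to
EXECUTE SCRIPT, so the script either gets all its locks up front or fails
before any table has been touched. A rough, untested sketch follows (the table
names are only examples, and no BEGIN/COMMIT is added because slon already
runs the script inside a single transaction); whether this actually closes the
window on a given Slony-I version would need to be verified.

--------
-- Grab the exclusive locks up front; this blocks here (or fails immediately
-- if NOWAIT is appended) rather than deadlocking halfway through the script.
LOCK TABLE public.carts IN ACCESS EXCLUSIVE MODE;
LOCK TABLE orders.orders IN ACCESS EXCLUSIVE MODE;
LOCK TABLE orders.items IN ACCESS EXCLUSIVE MODE;
-- ... the actual DDL statements follow ...
--------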
>> 
>> This same change was made to the SUBSCRIBE SET process (well, "COPY_SET", 
>> strictly speaking, but that's not at all user visible...), as we saw cases 
>> where gradually escalating locks on tables led to deadlocks that could 
>> waste hours' worth of subscription work.
>> 
>> But it seems likely to me that there's more to the problem than we're 
>> hearing, because deadlock shouldn't cause any corruption of anything - to 
>> the contrary, it may be expected to prevent it.
>
> This is the latest SQL that caused the problem (note there is no COMMIT in 
> the SQL):
>
> --------
> CREATE TABLE orders.amazon_items
> (
>  id serial NOT NULL,
>  order_id integer NOT NULL,
>  item_id integer NOT NULL,
>  amazon_item_id character varying(14) NOT NULL,
>  CONSTRAINT amazon_items_pkey PRIMARY KEY (id),
>  CONSTRAINT amazon_items_order_id_fkey FOREIGN KEY (order_id)
>      REFERENCES orders.orders (id) MATCH SIMPLE
>      ON UPDATE NO ACTION ON DELETE NO ACTION,
>  CONSTRAINT amazon_items_item_id_fkey FOREIGN KEY (item_id)
>      REFERENCES orders.items (id) MATCH SIMPLE
>      ON UPDATE NO ACTION ON DELETE NO ACTION
> )
> WITH OIDS;
> ALTER TABLE orders.amazon_items OWNER TO thenerds;
> --------
>
> It was called by the following slonik script:
>
> --------
> #!/usr/bin/slonik
> include </nerds/preamble.slonik>;
>
>        EXECUTE SCRIPT (
>                SET ID = 1,
>                FILENAME = '/nerds/thenerds.sql',
>                EVENT NODE = 1
>        );
> --------
>
> and caused the following deadlock to occur:
>
> Sep 18 15:27:54 sql1 slon[12252]: [39-1] 2007-09-18 15:27:54 EDT ERROR
> remoteWorkerThread_1: "select "_nerdcluster".ddlScript_complete_int(1, -1); "
> PGRES_FATAL_ERROR
> Sep 18 15:27:54 sql1 slon[12252]: [39-2]  ERROR:  deadlock detected
> Sep 18 15:27:54 sql1 slon[12252]: [39-3] DETAIL:  Process 12263 waits for
> AccessExclusiveLock on relation 121589880 of database 121589046; blocked by
> process 12096.
> Sep 18 15:27:54 sql1 slon[12252]: [39-4] Process 12096 waits for
> AccessShareLock on relation 121589817 of database 121589046;
> blocked by process 12263.
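
To see which tables those relation OIDs refer to, a lookup along these lines
can be run in the same database on the slave (the OIDs below are copied from
the DETAIL lines above):

--------
-- Map the relation OIDs from the deadlock DETAIL back to schema-qualified
-- table names.
SELECT n.nspname, c.relname
  FROM pg_class c
  JOIN pg_namespace n ON n.oid = c.relnamespace
 WHERE c.oid IN (121589880, 121589817);
--------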
>
> Which then left some of the tables on that slave in a bad state, breaking 
> replication:
>
> 2007-09-18 15:56:06 EDT ERROR  remoteWorkerThread_1: "select
> "_nerdcluster".ddlScript_prepare_int(1, -1); " PGRES_FATAL_ERROR ERROR:
> Slony-I: alterTableRestore(): Table "public"."carts" is not in altered state
> CONTEXT:  SQL statement "SELECT  "_nerdcluster".alterTableRestore( $1 )"
> PL/pgSQL function "ddlscript_prepare_int" line 46 at perform
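
One way to see how far the node's idea of "altered state" has drifted is to
list the per-table flag Slony-I keeps; a sketch, assuming a 1.x cluster named
"_nerdcluster" whose sl_table still carries the tab_altered column:

--------
-- Which replicated tables does this node currently believe are altered?
SELECT tab_id, tab_reloid::regclass AS table_name, tab_altered
  FROM "_nerdcluster".sl_table
 ORDER BY tab_id;
--------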
>
> Note that it's just an AccessShareLock that's killing us.  Looks like that's 
> caused by a SELECT query used for searches.  Our application does not 
> produce any extraneous locking; it simply runs SELECTs on that server.
>
> Interestingly, before we started using the slave for queries, we would have 
> the deadlocks happen on the master when doing DDL changes, but this never 
> caused the tables on the master to get into a bad state.  You could just 
> re-run your EXECUTE SCRIPT and it would usually work fine the second time.
>
> What other info can I provide?

So, what I'm getting from all this is that while the deadlocks can occur, 
Slony should gracefully error out and return the tables to their previously 
altered state, but that doesn't seem to happen on these nodes.  Note that it's 
only a problem when there's a deadlock on the slave, not on the master. 
Should I file this as a bug?  If so, do I just need to send an email to 
slony1-bugs?

-- 
Jeff Frost, Owner 	<jeff at frostconsultingllc.com>
Frost Consulting, LLC 	http://www.frostconsultingllc.com/
Phone: 650-780-7908	FAX: 650-649-1954

