Tue Sep 25 07:33:26 PDT 2007
- Previous message: [Slony1-general] bug in deadlock handling?
- Next message: [Slony1-general] How far is a receiver from the provider
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
On Mon, 24 Sep 2007, Andrew Hammond wrote: >>>>> SELECT alterTableForReplication((SELECT tab_id FROM sl_table WHERE >>>>> tab_reloid = (SELECT oid FROM pg_class WHERE relname='cart'))); >>>>> >>>>> Or of course it might mess things up even more. :) >>>> The one change to suggest that comes to *my* mind is that we perhaps >> ought >>>> to change Slony-I to aggressively lock the replicated tables ASAP at >> the >>>> start of the DDL script event. >>>> >>>> That will either immediately succeed, and eliminate any possibility of >>>> future deadlock, or immediately fail, before we even try to alter >> anything. >>>> >>>> This same change was made to the SUBSCRIBE SET process (well, >> "COPY_SET", >>>> strictly speaking, but that's not at all user visible...), as we saw >> cases >>>> where gradually escalating locks on tables led to deadlocks that could >>>> waste hours worth of subscription work. >>>> >>>> But it seems likely to me that there's more to the problem than we're >>>> hearing, because deadlock shouldn't cause any corruption of anything - >> to >>>> the contrary, it may be expected to prevent it. >>> >>> this is the latest SQL that caused the problem (note there is not a >> COMMIT in >>> the sql): >>> >>> -------- >>> CREATE TABLE orders.amazon_items >>> ( >>> id serial NOT NULL, >>> order_id integer NOT NULL, >>> item_id integer NOT NULL, >>> amazon_item_id character varying(14) NOT NULL, >>> CONSTRAINT amazon_items_pkey PRIMARY KEY (id), >>> CONSTRAINT amazon_items_order_id_fkey FOREIGN KEY (order_id) >>> REFERENCES orders.orders (id) MATCH SIMPLE >>> ON UPDATE NO ACTION ON DELETE NO ACTION, >>> CONSTRAINT amazon_items_item_id_fkey FOREIGN KEY (item_id) >>> REFERENCES orders.items (id) MATCH SIMPLE >>> ON UPDATE NO ACTION ON DELETE NO ACTION >>> ) >>> WITH OIDS; >>> ALTER TABLE orders.amazon_items OWNER TO thenerds; >>> -------- >>> >>> It was called by the following slonik script: >>> >>> -------- >>> #!/usr/bin/slonik >>> include </nerds/preamble.slonik>; >>> >>> EXECUTE SCRIPT ( >>> SET ID = 1, >>> FILENAME = '/nerds/thenerds.sql', >>> EVENT NODE = 1 >>> ); >>> -------- >>> >>> and caused the following deadlock to occur: >>> >>> 15:27:54 sql1 slon[12252]: [39-1] 2007-09-18 15:27:54 EDT ERROR >>> remoteWorkerThread_1: "select "_nerdcluster".ddlScript >>> _complete_int(1, -1); " PGRES_FATAL_ERROR >>> Sep 18 15:27:54 sql1 slon[12252]: [39-2] ERROR: deadlock detected >>> Sep 18 15:27:54 sql1 slon[12252]: [39-3] DETAIL: Process 12263 waits >> for >>> AccessExclusiveLock on relation 121589880 of databas >>> e 121589046; blocked by process 12096. >>> Sep 18 15:27:54 sql1 slon[12252]: [39-4] Process 12096 waits for >>> AccessShareLock on relation 121589817 of database 121589046; >>> blocked by process 12263. >>> >>> Which then left the some of the tables on that slave in a bad state >> breaking >>> replication: >>> >>> 2007-09-18 15:56:06 EDT ERROR remoteWorkerThread_1: "select >>> "_nerdcluster".ddlScript_prepare_int(1, -1); " PGRES_FATAL_ERROR ERROR: >>> Slony-I: alterTableRestore(): Table "public"."carts" is not in altered >> state >>> CONTEXT: SQL statement "SELECT "_nerdcluster".alterTableRestore( $1 )" >>> PL/pgSQL function "ddlscript_prepare_int" line 46 at perform >>> >>> Note that it's just an AccessShareLock that's killing us. Looks like >> that's >>> caused by a select query which does searches. Our application does not >>> produce any extraneous locking, it simply does SELECTS on that server. >>> >>> Interestingly, before we started using the slave for queries, we would >> have >>> the deadlocks happen on the master when doing DDL changes, but this >> never >>> caused the tables on the master to get into a bad state. You could just >>> re-run your EXECUTE SCRIPT and it would usually work fine the second >> time. >>> >>> What other info can I provide? >> >> So, what I'm getting from all this is that while the deadlocks can occur, >> slony should gracefully error out and return the tables to their >> previously >> altered state, but that doesn't seem to happen on these nodes. Note that >> it's >> only a problem when there's a deadlock on the slave, not on the master. >> Should I file this as a bug? If so, do I just need to send an email to >> slony1-bugs? >> > > If you can reproduce the problem then yes, absolutely. Ideally, write a new > test-case that demonstrates it. It seems quite reproducible on this particular setup. Basically, run any slonik command which uses EXECUTE SCRIPT and if you get a deadlock on node2, it will leave some of node2's tables in the non modified state (i.e. the triggers are missing). Problem is, I'm not sure how to write a test case which would guarantee a deadlock on node2 only. I'll write this up and send it to slony1-bugs at lists.slony.info. -- Jeff Frost, Owner <jeff at frostconsultingllc.com> Frost Consulting, LLC http://www.frostconsultingllc.com/ Phone: 650-780-7908 FAX: 650-649-1954
- Previous message: [Slony1-general] bug in deadlock handling?
- Next message: [Slony1-general] How far is a receiver from the provider
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
More information about the Slony1-general mailing list