Andrew Hammond andrew.george.hammond at gmail.com
Thu Sep 20 14:51:48 PDT 2007
On 9/20/07, Jeff Frost <jeff at frostconsultingllc.com> wrote:
>
> On Thu, 20 Sep 2007, Stéphane Schildknecht wrote:
>
> >> --------
> >> CREATE TABLE orders.amazon_items
> >> (
> >>   id serial NOT NULL,
> >>   order_id integer NOT NULL,
> >>   item_id integer NOT NULL,
> >>   amazon_item_id character varying(14) NOT NULL,
> >>   CONSTRAINT amazon_items_pkey PRIMARY KEY (id),
> >>   CONSTRAINT amazon_items_order_id_fkey FOREIGN KEY (order_id)
> >>       REFERENCES orders.orders (id) MATCH SIMPLE
> >>       ON UPDATE NO ACTION ON DELETE NO ACTION,
> >>   CONSTRAINT amazon_items_item_id_fkey FOREIGN KEY (item_id)
> >>       REFERENCES orders.items (id) MATCH SIMPLE
> >>       ON UPDATE NO ACTION ON DELETE NO ACTION
> >> )
> >> WITH OIDS;
> >> ALTER TABLE orders.amazon_items OWNER TO thenerds;
> >> --------
> >>
> >> It was called by the following slonik script:
> >>
> >> --------
> >> #!/usr/bin/slonik
> >> include </nerds/preamble.slonik>;
> >>
> >>         EXECUTE SCRIPT (
> >>                 SET ID = 1,
> >>                 FILENAME = '/nerds/thenerds.sql',
> >>                 EVENT NODE = 1
> >>         );
> >> --------
> >>
> >> and caused the following deadlock to occur:
> >>
> >> Sep 18 15:27:54 sql1 slon[12252]: [39-1] 2007-09-18 15:27:54 EDT ERROR
> >> remoteWorkerThread_1: "select "_nerdcluster".ddlScript_complete_int(1, -1); "
> >> PGRES_FATAL_ERROR
> >> Sep 18 15:27:54 sql1 slon[12252]: [39-2] ERROR:  deadlock detected
> >> Sep 18 15:27:54 sql1 slon[12252]: [39-3] DETAIL:  Process 12263 waits for
> >> AccessExclusiveLock on relation 121589880 of database 121589046; blocked
> >> by process 12096.
> >> Sep 18 15:27:54 sql1 slon[12252]: [39-4] Process 12096 waits for
> >> AccessShareLock on relation 121589817 of database 121589046; blocked by
> >> process 12263.
> >>
> >> Which then left some of the tables on that slave in a bad state,
> >> breaking replication:
> >>
> >> 2007-09-18 15:56:06 EDT ERROR  remoteWorkerThread_1: "select
> >> "_nerdcluster".ddlScript_prepare_int(1, -1); " PGRES_FATAL_ERROR
> >> ERROR:  Slony-I: alterTableRestore(): Table "public"."carts" is not in
> >> altered state
> >> CONTEXT:  SQL statement "SELECT  "_nerdcluster".alterTableRestore( $1 )"
> >> PL/pgSQL function "ddlscript_prepare_int" line 46 at perform
> >>
> >> Note that it's just an AccessShareLock that's killing us.  It looks
> >> like it's caused by a SELECT query that does searches.  Our application
> >> does not produce any extraneous locking; it simply runs SELECTs on
> >> that server.
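
For the archives: you can turn the relation OIDs in that DETAIL line back
into table names, and see which backends hold or want locks on them, with
something along these lines (procpid and current_query are the 8.x-era
catalog column names, so adjust if you're on something newer):

--------
-- Resolve the relation OID from the deadlock message to a table name.
SELECT 121589880::regclass;

-- Show who holds or is waiting on locks against that relation.
SELECT l.pid, l.mode, l.granted, a.current_query
  FROM pg_locks l
  JOIN pg_stat_activity a ON a.procpid = l.pid
 WHERE l.relation = 121589880;
--------
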
> >>
> >> Interestingly, before we started using the slave for queries, we would
> >> have the deadlocks happen on the master when doing DDL changes, but
> >> this never caused the tables on the master to get into a bad state.
> >> You could just re-run your EXECUTE SCRIPT and it would usually work
> >> fine the second time.
> >>
> >> What other info can I provide?
> >>
> > First, I don't think you have to use the EXECUTE SCRIPT command to
> > create a new table.
> >
> > I'd also like to know whether you use pgpool: on the slave, the master,
> > or both?
> >
> > Does re-executing the script offer any solution?
>
> You're quite right, and I think we'll start just executing the CREATE
> TABLE statements on all nodes independently, without EXECUTE SCRIPT. But
> the same problem happens with ALTER TABLE as well; this just happened to
> be the latest example that caused a problem.



He is quite wrong if the tables you're referencing are replicated. If that
is the case, you can create the tables without their RI constraints and then
add the constraints afterwards via EXECUTE SCRIPT with ALTER TABLE
statements. I suspect you already know this based on your comment about
ALTER TABLE problems, but I want to make sure it's clear in the archives.
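
A minimal sketch of what I mean, reusing the cluster preamble and slonik
conventions from the script above (the /nerds/add_fkeys.sql file name is
made up for the example):

--------
-- /nerds/add_fkeys.sql: the table was created on every node directly,
-- without its two foreign keys; this adds them afterwards through
-- Slony-I so every node applies them at the same point in the stream.
ALTER TABLE orders.amazon_items
  ADD CONSTRAINT amazon_items_order_id_fkey FOREIGN KEY (order_id)
      REFERENCES orders.orders (id);
ALTER TABLE orders.amazon_items
  ADD CONSTRAINT amazon_items_item_id_fkey FOREIGN KEY (item_id)
      REFERENCES orders.items (id);
--------

--------
#!/usr/bin/slonik
include </nerds/preamble.slonik>;

EXECUTE SCRIPT (
        SET ID = 1,
        FILENAME = '/nerds/add_fkeys.sql',
        EVENT NODE = 1
);
--------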


> We are using pgpool in master/slave mode.  So, pgpool is load balancing
> between node 1 and node 2.  Node 3 is not in the pgpool cluster.
>
> Unfortunately, since the first run of the script leaves node 2 in a bad
> state and breaks replication, running the script again doesn't help.  So
> far I've had to drop node 2 from replication and re-subscribe to fix the
> problem.
>
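
For the archives, that recovery looks roughly like this in slonik, assuming
set 1 originates on node 1 and node 2 is the broken subscriber (the conninfo
strings are placeholders):

--------
#!/usr/bin/slonik
include </nerds/preamble.slonik>;

-- Throw away the broken subscriber entirely.
DROP NODE (ID = 2, EVENT NODE = 1);

-- Re-register it and re-subscribe; Slony-I then re-copies the whole set.
STORE NODE (ID = 2, COMMENT = 'subscriber node 2', EVENT NODE = 1);
STORE PATH (SERVER = 1, CLIENT = 2, CONNINFO = 'dbname=... host=...');
STORE PATH (SERVER = 2, CLIENT = 1, CONNINFO = 'dbname=... host=...');
SUBSCRIBE SET (ID = 1, PROVIDER = 1, RECEIVER = 2, FORWARD = NO);
--------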