Andrew Hammond andrew.george.hammond at gmail.com
Mon Sep 24 16:09:02 PDT 2007
On 9/21/07, Jeff Frost <jeff at frostconsultingllc.com> wrote:
>
> On Wed, 19 Sep 2007, Jeff Frost wrote:
>
> > On Wed, 19 Sep 2007, Christopher Browne wrote:
> >
> >>> Bails out is the wrong description. Clobbered by the pgsql deadlock
> >>> detection system, leaving the cluster in an unstable state would be
> >>> more accurate, if that's what happened. I don't know that there's a
> >>> way to catch that clobberage and run a "finally" type thing.
> >> The thing is, that's not what happens.
> >>
> >> Unless there's a COMMIT somewhere in the DDL script, in which case all
> >> bets are off, everywhere, the deadlock should lead to one of two things
> >> happening:
> >>
> >> 1. The other process that was holding onto the tables might fail and
> >> roll back, and the DDL script would complete, or
> >>
> >> 2.  The DDL script will fail and roll back, leading to the state of the
> >> tables falling back to what it was before DDL script processing began.
> >>
> >> In either case, the results should leave the node in a consistent
> >> state, either:
> >> a) With the DDL request having gone in, or
> >> b) With the DDL request *not* having gone in.
> >>
> >> Unless there's an extra COMMIT in the code (and I just looked at the
> >> code, and Did Not See One), the failure resulting from a deadlock
> >> should be benign, restoring the tables to the "previously altered
> >> state."
> >>>
> >>>     So, is there a reasonable way to fix this without
> >>>     dropping/resubscribing the node?
> >>>
> >>>
> >>> Well, to start with, you might want to figure out why your application
> >>> is taking such aggressive locks. And make sure in the future that it
> >>> doesn't happen again (not much point fixing it if it's just gonna
> >>> re-occur). If you are using a separate superuser account to connect to
> >>> your database and run your slons (generally the "slony" user) then
> >>> this is really easy to do: tweak your pg_hba to only allow connections
> >>> from slony and then kick all active non-slony connections. Revert your
> >>> pg_hba at the end of the maintenance.
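
To be concrete about the lockout I had in mind: something along these lines
during the maintenance window only (assuming your slons connect as a
dedicated "slony" user; the users, addresses and auth methods below are just
placeholders for whatever your setup actually uses):

--------
# pg_hba.conf, maintenance window only -- let slony in, reject everyone else
local   all     slony                           trust
host    all     slony   127.0.0.1/32            trust
host    all     all     0.0.0.0/0               reject
--------

Reload ("pg_ctl reload"), then list whatever non-slony backends are still
connected and kill them from the shell:

--------
SELECT procpid, usename, client_addr
  FROM pg_stat_activity
 WHERE usename <> 'slony';
--------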
> >>>
> >>> If you're willing to experiment with using slony internal functions,
> >>> you could put the table in question into altered state. Something
> >>> like this on the offending node might work.
> >>>
> >>> SELECT alterTableForReplication((SELECT tab_id FROM sl_table WHERE
> >>> tab_reloid = (SELECT oid FROM pg_class WHERE relname='cart')));
> >>>
> >>> Or of course it might mess things up even more. :)
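
(On your cluster those calls would need to be schema-qualified, i.e.
"_nerdcluster".alterTableForReplication(...) and "_nerdcluster".sl_table,
unless the cluster schema is already on your search_path.)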
> >> The one change to suggest that comes to *my* mind is that we perhaps
> >> ought to change Slony-I to aggressively lock the replicated tables ASAP
> >> at the start of the DDL script event.
> >>
> >> That will either immediately succeed, and eliminate any possibility of
> >> future deadlock, or immediately fail, before we even try to alter
> >> anything.
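
Until something like that is in Slony itself, you can approximate it by
hand: put explicit LOCK TABLE statements for the replicated tables at the
very top of the file you feed to EXECUTE SCRIPT (the whole file runs in one
transaction, so no BEGIN/COMMIT). The script then waits for all its locks up
front, or with NOWAIT fails immediately, in either case before anything has
been altered. Untested sketch of the idea, using the table from your log as
an example:

--------
-- grab the locks Slony will need before touching anything, so a long-running
-- SELECT makes us wait or fail here rather than deadlock halfway through
LOCK TABLE public.carts IN ACCESS EXCLUSIVE MODE;
-- ...one LOCK TABLE per replicated table...

-- the actual DDL follows
CREATE TABLE orders.amazon_items ( ... );
--------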
> >>
> >> This same change was made to the SUBSCRIBE SET process (well,
> >> "COPY_SET", strictly speaking, but that's not at all user visible...),
> >> as we saw cases where gradually escalating locks on tables led to
> >> deadlocks that could waste hours worth of subscription work.
> >>
> >> But it seems likely to me that there's more to the problem than we're
> >> hearing, because deadlock shouldn't cause any corruption of anything -
> >> to the contrary, it may be expected to prevent it.
> >
> > this is the latest SQL that caused the problem (note there is not a
> > COMMIT in the sql):
> >
> > --------
> > CREATE TABLE orders.amazon_items
> > (
> >  id serial NOT NULL,
> >  order_id integer NOT NULL,
> >  item_id integer NOT NULL,
> >  amazon_item_id character varying(14) NOT NULL,
> >  CONSTRAINT amazon_items_pkey PRIMARY KEY (id),
> >  CONSTRAINT amazon_items_order_id_fkey FOREIGN KEY (order_id)
> >      REFERENCES orders.orders (id) MATCH SIMPLE
> >      ON UPDATE NO ACTION ON DELETE NO ACTION,
> >  CONSTRAINT amazon_items_item_id_fkey FOREIGN KEY (item_id)
> >      REFERENCES orders.items (id) MATCH SIMPLE
> >      ON UPDATE NO ACTION ON DELETE NO ACTION
> > )
> > WITH OIDS;
> > ALTER TABLE orders.amazon_items OWNER TO thenerds;
> > --------
> >
> > It was called by the following slonik script:
> >
> > --------
> > #!/usr/bin/slonik
> > include </nerds/preamble.slonik>;
> >
> >        EXECUTE SCRIPT (
> >                SET ID = 1,
> >                FILENAME = '/nerds/thenerds.sql',
> >                EVENT NODE = 1
> >        );
> > --------
> >
> > and caused the following deadlock to occur:
> >
> > 15:27:54 sql1 slon[12252]: [39-1] 2007-09-18 15:27:54 EDT ERROR
> > remoteWorkerThread_1: "select "_nerdcluster".ddlScript
> > _complete_int(1, -1); " PGRES_FATAL_ERROR
> > Sep 18 15:27:54 sql1 slon[12252]: [39-2]  ERROR:  deadlock detected
> > Sep 18 15:27:54 sql1 slon[12252]: [39-3] DETAIL:  Process 12263 waits
> > for AccessExclusiveLock on relation 121589880 of database 121589046;
> > blocked by process 12096.
> > Sep 18 15:27:54 sql1 slon[12252]: [39-4] Process 12096 waits for
> > AccessShareLock on relation 121589817 of database 121589046;
> > blocked by process 12263.
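
As an aside, those relation OIDs map back to table names with a catalog
query like this, which tells you exactly which tables the two processes
were fighting over:

--------
SELECT n.nspname, c.relname
  FROM pg_class c
  JOIN pg_namespace n ON n.oid = c.relnamespace
 WHERE c.oid IN (121589880, 121589817);
--------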
> >
> > Which then left some of the tables on that slave in a bad state,
> > breaking replication:
> >
> > 2007-09-18 15:56:06 EDT ERROR  remoteWorkerThread_1: "select
> > "_nerdcluster".ddlScript_prepare_int(1, -1); " PGRES_FATAL_ERROR ERROR:
> > Slony-I: alterTableRestore(): Table "public"."carts" is not in altered state
> > CONTEXT:  SQL statement "SELECT  "_nerdcluster".alterTableRestore( $1 )"
> > PL/pgSQL function "ddlscript_prepare_int" line 46 at perform
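
If you want to see how far out of whack that node's view of things is, and
assuming your Slony-I version still tracks the altered flag in sl_table (the
1.x series does, if I remember right), something like this run on the broken
node should show which tables it currently believes are altered:

--------
-- which replicated tables does this node believe are in altered state?
SELECT t.tab_id, c.relname, t.tab_altered
  FROM "_nerdcluster".sl_table t
  JOIN pg_class c ON c.oid = t.tab_reloid
 ORDER BY t.tab_id;
--------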
> >
> > Note that it's just an AccessShareLock that's killing us.  Looks like
> > that's caused by a select query which does searches.  Our application
> > does not produce any extraneous locking, it simply does SELECTS on that
> > server.
> >
> > Interestingly, before we started using the slave for queries, we would
> > have the deadlocks happen on the master when doing DDL changes, but this
> > never caused the tables on the master to get into a bad state.  You
> > could just re-run your EXECUTE SCRIPT and it would usually work fine the
> > second time.
> >
> > What other info can I provide?
>
> So, what I'm getting from all this is that while the deadlocks can occur,
> slony should gracefully error out and return the tables to their
> previously altered state, but that doesn't seem to happen on these nodes.
> Note that it's only a problem when there's a deadlock on the slave, not on
> the master.  Should I file this as a bug?  If so, do I just need to send
> an email to slony1-bugs?
>

If you can reproduce the problem, then yes, absolutely. Ideally, write a new
test case that demonstrates it.


Andrew