Andrew Hammond andrew.george.hammond at gmail.com
Wed Sep 19 11:37:50 PDT 2007
On 9/18/07, Jeff Frost <jeff at frostconsultingllc.com> wrote:
>
> Hi guys, I've got an interesting situation on a slony 1.2.10 3 node
> cluster.
> Both slaves get their data direct from the master.  Everything has been
> running well up to a few days ago.  Now every time we try to add a new
> table
> to the cluster, we end up with the following error:
>

"Every time"? You have tried more than once?

What I don't see in your problem report is a detailed description, starting
from the cluster in a known good state, of exactly what you did. Without that
(typescripts or copies of the shell scripts you used), it's hard to figure
out what went wrong.

> 2007-09-18 15:56:06 EDT ERROR  remoteWorkerThread_1: "select
> "_nerdcluster".ddlScript_prepare_int(1, -1); " PGRES_FATAL_ERROR
> ERROR:  Slony-I: alterTableRestore(): Table "public"."carts" is not in
> altered state
> CONTEXT:  SQL statement "SELECT  "_nerdcluster".alterTableRestore( $1 )"
> PL/pgSQL function "ddlscript_prepare_int" line 46 at perform
>
> It looks like the problem is being caused by a deadlock:
>
> 15:27:54 sql1 slon[12252]: [39-1] 2007-09-18 15:27:54 EDT ERROR
> remoteWorkerThread_1: "select "_nerdcluster".ddlScript_complete_int(1, -1); "
> PGRES_FATAL_ERROR
> Sep 18 15:27:54 sql1 slon[12252]: [39-2] ERROR:  deadlock detected
> Sep 18 15:27:54 sql1 slon[12252]: [39-3] DETAIL:  Process 12263 waits for
> AccessExclusiveLock on relation 121589880 of database 121589046; blocked
> by process 12096.
> Sep 18 15:27:54 sql1 slon[12252]: [39-4] Process 12096 waits for
> AccessShareLock on relation 121589817 of database 121589046; blocked by
> process 12263.
>
> So, my theory is that the execute script alters the tables back to their
> normal states, doesn't get all the locks it wants and bails out without
> putting them back to their previously altered state, thus breaking
> replication.


"Bails out" is the wrong description. "Clobbered by the PostgreSQL deadlock
detection system, leaving the cluster in an unstable state" would be more
accurate, if that's what happened. I don't know of a way to catch that
clobberage and run a "finally"-type cleanup.
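
If you want to see which tables the aborted DDL event left un-altered, one
rough check (just a catalog query, not an official Slony tool, and it assumes
the 1.2-style deny-access triggers whose names contain "denyaccess") is to
list the cluster triggers still present on the subscriber:

-- Replicated tables in "altered" state carry a deny-access trigger on a
-- 1.2 subscriber; exact trigger naming varies by Slony version.
SELECT n.nspname, c.relname, t.tgname
  FROM pg_trigger t
  JOIN pg_class c ON c.oid = t.tgrelid
  JOIN pg_namespace n ON n.oid = c.relnamespace
 WHERE t.tgname LIKE '%denyaccess%'
 ORDER BY 1, 2;

Replicated tables missing from that list are the ones that were restored to
their normal state and never put back.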

> So, is there a reasonable way to fix this without dropping/resubscribing
> the node?
>

Well, to start with, you might want to figure out why your application is
taking such aggressive locks, and make sure it doesn't happen again in the
future (there's not much point fixing this if it's just going to recur). If
you use a separate superuser account to run your slons (generally the
"slony" user), then locking the application out during maintenance is really
easy: tweak pg_hba.conf to only allow connections from the slony user, kick
all active non-slony connections, and revert pg_hba.conf at the end of the
maintenance.
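
For what it's worth, a rough sketch of that lockout (the subnet, auth method
and account name below are made-up placeholders, and the pg_stat_activity
column names are the 8.x ones; adjust for your environment and version):

-- In pg_hba.conf, leave only the slony superuser able to connect, e.g.:
--   host    all    slony    192.168.1.0/24    md5
-- then reload the postmaster config (pg_ctl reload) and list the
-- remaining non-slony backends:
SELECT procpid, usename, datname
  FROM pg_stat_activity
 WHERE usename <> 'slony';
-- Kick those PIDs from the shell; revert pg_hba.conf when you're done.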

If you're willing to experiment with Slony's internal functions, you could
put the table in question back into its altered state. Something like this
on the offending node might work:

SELECT "_nerdcluster".alterTableForReplication(
    (SELECT tab_id FROM "_nerdcluster".sl_table
      WHERE tab_reloid = (SELECT oid FROM pg_class WHERE relname = 'carts')));
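
If that works, sl_table.tab_altered should flip back to true for that table
and the stuck DDL event should get past the error on its next retry; it's
worth checking tab_altered first so you only run it against tables that
really are un-altered.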

Or of course it might mess things up even more. :)

Andrew