Stéphane Schildknecht stephane.schildknecht at postgresqlfr.org
Wed Sep 19 01:39:24 PDT 2007
Jeff Frost wrote:
> On Tue, 18 Sep 2007, Jeff Frost wrote:
>
>> Hi guys, I've got an interesting situation on a Slony 1.2.10 3-node
>> cluster. Both slaves get their data directly from the master.
>> Everything had been running well until a few days ago.  Now every
>> time we try to add a new table to the cluster, we end up with the
>> following error:
>>
>> 2007-09-18 15:56:06 EDT ERROR  remoteWorkerThread_1: "select
>> "_nerdcluster".ddlScript_prepare_int(1, -1); " PGRES_FATAL_ERROR
>> ERROR: Slony-I: alterTableRestore(): Table "public"."carts" is not in
>> altered state
>> CONTEXT:  SQL statement "SELECT  "_nerdcluster".alterTableRestore( $1 )"
>> PL/pgSQL function "ddlscript_prepare_int" line 46 at perform
>>
>> It looks like the problem is being caused by a deadlock:
>>
>> Sep 18 15:27:54 sql1 slon[12252]: [39-1] 2007-09-18 15:27:54 EDT ERROR
>> remoteWorkerThread_1: "select "_nerdcluster".ddlScript_complete_int(1, -1); " PGRES_FATAL_ERROR
>> Sep 18 15:27:54 sql1 slon[12252]: [39-2]  ERROR:  deadlock detected
>> Sep 18 15:27:54 sql1 slon[12252]: [39-3] DETAIL:  Process 12263 waits
>> for AccessExclusiveLock on relation 121589880 of database 121589046;
>> blocked by process 12096.
>> Sep 18 15:27:54 sql1 slon[12252]: [39-4] Process 12096 waits for
>> AccessShareLock on relation 121589817 of database 121589046;
>> blocked by process 12263.
>>
>> So, my theory is that the execute script alters the tables back to
>> their normal states, doesn't get all the locks it wants and bails out
>> without putting them back to their previously altered state, thus
>> breaking replication.
>>
>> So, is there a reasonable way to fix this without
>> dropping/resubscribing the node?
>
> BTW, this is a quiet error...i.e. slonik does not indicate anything
> untoward happened.  Also, the problem only occurs on node 2 and not on
> node 3.  This isn't surprising, as there's no activity on node 3 (it's
> just a warm standby), but node 2 is actively processing queries.
>
Hi,

I encountered the same problem and came to the conclusion that pgpool
was the cause.

I have 5 nodes in the replication cluster. The databases on two of them
are accessed through pgpool. One of the two is a spare, not accessed
under normal conditions; the other is heavily accessed.

When I try to propagate a DDL script with a slonik "EXECUTE SCRIPT"
command, the master and the spare slave are modified correctly, but the
production node fails with the same messages about altered state.
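
For reference, I run the propagation roughly like this (the cluster
name, conninfo, and set/node ids below are only examples, not my real
values):

    # Sketch: propagate ddl.sql to all nodes through Slony-I.
    slonik <<'EOF'
    cluster name = mycluster;
    node 1 admin conninfo = 'dbname=mydb host=master user=slony';
    execute script (
        set id = 1,                 # replication set containing the tables
        filename = '/tmp/ddl.sql',  # DDL to run on every node
        event node = 1              # origin node for the DDL event
    );
    EOF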

The only way I found to repair that node was to drop all the Slony
objects on it (uninstall the node) and then reinitialize it. As a full
subscription is a more-than-three-hour process, I now stop pgpool before
running the EXECUTE SCRIPT. The last "execute script" I ran completed
without error.
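
The procedure is basically the following sketch (how pgpool is stopped
and restarted depends on your installation; an init script would do just
as well):

    # Sketch: quiesce pgpool while the DDL propagates through Slony-I.
    pgpool -m fast stop              # or: /etc/init.d/pgpool stop
    slonik /path/to/execute_script.slonik
    pgpool                           # restart pgpool (it daemonizes itself)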

I wondered how I could stop pgpool without preventing the application
from connecting to the database, and reaching pgpool or the database
through a tunnel seemed like a good procedure.

I mean pg listens on 5432, pgpool on 5433, and the application connects
through 5434. I then just have to modify the tunnel to bypass pgpool. It
seems to work that way.
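
Concretely, with ssh port forwarding it looks roughly like this (the
hostname is an example):

    # Normal operation: the application connects to 5434, which is
    # forwarded to pgpool on 5433.
    ssh -f -N -L 5434:localhost:5433 dbhost

    # During EXECUTE SCRIPT: re-point the tunnel straight at PostgreSQL
    # on 5432, bypassing pgpool.
    ssh -f -N -L 5434:localhost:5432 dbhost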
I just wonder if it is really a good solution. Do you have any thoughts
on this way of doing things?

Best regards,

-- 
Stéphane SCHILDKNECHT
Président de PostgreSQLFr
http://www.postgresqlfr.org


