[Slony1-general] deadlock between slonik failover and slon processes

Tue Mar 7 14:30:37 PST 2006

Using slony1-1.1.5-RC2, in testing failover, we are getting deadlock
errors like the following from the db log:

[ copy table statements ]
2006-03-07 15:06:28 EST-[29406] NOTICE:  truncate of "monkeys"."session"
succeeded
2006-03-07 15:06:28 EST-[29406] STATEMENT:  select
"_tzreplic".prepareTableForCopy(151); copy "monkeys"."session"
("id","accessadress","accessid","entryaddress","entryid","properties","s
tarttime","updatetime") from stdin; 
2006-03-07 15:06:28 EST-[29406] NOTICE:  truncate of "monkeys"."vc"
succeeded
2006-03-07 15:06:28 EST-[29406] STATEMENT:  select
"_tzreplic".prepareTableForCopy(152); copy "monkeys"."vc"
("id","used","capacity") from stdin; 
2006-03-07 15:06:28 EST-[29406] NOTICE:  truncate of "monkeys"."vp"
succeeded
2006-03-07 15:06:28 EST-[29406] STATEMENT:  select
"_tzreplic".prepareTableForCopy(153); copy "monkeys"."vp"
("id","used","capacity") from stdin; 
2006-03-07 15:06:31 EST-[29438] NOTICE:  failedNode: set 1 has no other
direct receivers - move now
2006-03-07 15:06:31 EST-[29438] STATEMENT:  select
"_tzreplic".failedNode(2, 3); 
2006-03-07 15:06:37 EST-[29438] NOTICE:  failedNode: set 118 has no
other direct receivers - move now
2006-03-07 15:06:37 EST-[29438] STATEMENT:  select
"_tzreplic".failedNode(2, 3); 
2006-03-07 15:06:38 EST-[29438] ERROR:  deadlock detected
2006-03-07 15:06:38 EST-[29438] DETAIL:  Process 29438 waits for
ShareLock on transaction 3856; blocked by process 29406.
 Process 29406 waits for ShareLock on transaction 3851; blocked by
process 29438.
2006-03-07 15:06:38 EST-[29438] CONTEXT:  SQL statement "delete from
"_tzreplic".sl_setsync where ssy_setid =  $1 "
 PL/pgSQL function "failednode" line 115 at SQL statement
2006-03-07 15:06:38 EST-[29438] STATEMENT:  select
"_tzreplic".failedNode(2, 3); 

So, we establish replication for 2 sets, then we issue a slonik FAILOVER
(pid 29438). The deadlock seems to be between the slonik connection and
the remote_worker connection in the slon. Another example of the same
type of problem:

2006-03-07 14:26:39 EST-[2629] NOTICE:  failedNode: set 74 has no other
direct receivers - move now
2006-03-07 14:26:39 EST-[2629] STATEMENT:  select
"_tzreplic".failedNode(1, 2); 
2006-03-07 14:26:42 EST-[2629] NOTICE:  failedNode: set 111 has no other
direct receivers - move now
2006-03-07 14:26:42 EST-[2629] STATEMENT:  select
"_tzreplic".failedNode(1, 2); 
2006-03-07 14:26:42 EST-[2629] NOTICE:  failedNode: set 123 has no other
direct receivers - move now
2006-03-07 14:26:42 EST-[2629] STATEMENT:  select
"_tzreplic".failedNode(1, 2); 
2006-03-07 14:26:43 EST-[2629] NOTICE:  failedNode: set 135 has no other
direct receivers - move now
2006-03-07 14:26:43 EST-[2629] STATEMENT:  select
"_tzreplic".failedNode(1, 2); 
2006-03-07 14:26:43 EST-[2635] NOTICE:  Slony-I: cleanup stale
sl_nodelock entry for pid=974
2006-03-07 14:26:43 EST-[2635] STATEMENT:  select
"_tzreplic".cleanupNodelock(); insert into "_tzreplic".sl_nodelock
values (    2, 0, "pg_catalog".pg_backend_pid()); 
2006-03-07 14:26:44 EST-[2629] NOTICE:  failedNode2(): faking
FAILOVER_SET event
2006-03-07 14:26:44 EST-[2629] STATEMENT:  select
"_tzreplic".failedNode2(1,2,74,'650','651'); 
2006-03-07 14:26:46 EST-[2629] NOTICE:  failedNode2(): faking
FAILOVER_SET event
2006-03-07 14:26:46 EST-[2629] STATEMENT:  select
"_tzreplic".failedNode2(1,2,111,'650','652'); 
2006-03-07 14:26:46 EST-[2629] NOTICE:  failedNode2(): faking
FAILOVER_SET event
2006-03-07 14:26:46 EST-[2629] STATEMENT:  select
"_tzreplic".failedNode2(1,2,123,'650','653'); 
2006-03-07 14:26:46 EST-[2629] NOTICE:  failedNode2(): faking
FAILOVER_SET event
2006-03-07 14:26:46 EST-[2629] STATEMENT:  select
"_tzreplic".failedNode2(1,2,135,'650','654'); 
2006-03-07 14:26:48 EST-[2636] ERROR:  deadlock detected
2006-03-07 14:26:48 EST-[2636] DETAIL:  Process 2636 waits for ShareLock
on transaction 613918; blocked by process 2629.
 Process 2629 waits for ExclusiveLock on relation 20809 of database
16384; blocked by process 2636.
2006-03-07 14:26:48 EST-[2636] STATEMENT:  notify "_tzreplic_Event";
notify "_tzreplic_Confirm"; insert into "_tzreplic".sl_event
(ev_origin, ev_seqno, ev_timestamp,      ev_minxid, ev_maxxid, ev_xip,
ev_type     ) values ('1', '651', '2006-03-07 14:26:33.770196',
'867522', '867523', '', 'SYNC'); insert into "_tzreplic".sl_confirm
(con_origin, con_received, con_seqno, con_timestamp)    values (1, 2,
'651', now()); commit transaction;
2006-03-07 14:26:48 EST-[2636] LOG:  unexpected EOF on client connection

We didn't have the these sorts of deadlock issues with slony1-1.0.5, but
we were also only using one replication set at that point, and postgres
7.4.5, so too many variables have changed, obviously. The common theme
appears to be the interaction of the slonik failover command and the
slon running on the node being failed-over-to....

Has anybody else run into this kind of behavior?

- DAP
======================================================
David Parker    Tazz Networks

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://gborg.postgresql.org/pipermail/slony1-general/attachments/20060307/ea3b8b25/attachment.html