David Parker dparker
Thu Mar 9 12:51:37 PST 2006
slony1-1.1.5-rc2
postgres 8.1.2
 
With multiple replication sets defined (typically 2-4), in testing
failover, I am consistently getting deadlocks on the sl_setsync table
between the slonik failover connection and the local connection used by
remote_worker.c in the sync_event method.
 
I have node A as the origin of, say, 3 sets, and node B has a
subscription to each set. The deadlock errors are showing up in the db
log on node B after I attempt run a slonik failover on node B. 
 
Is anybody else using multiple replication sets? We have to dynamically
add schemas to our database, which need to be replicated. Rather than do
a MERGE SET for each schema added, we created a separate replication set
for each one. This now appears to have been a big mistake, and we are
re-working the code to do a MERGE SET when we add new tables, but it
seems that multiple replication sets should work.
 
Here is yet another example from the log:
 
2006-03-09 10:01:36 EST-[19314] LOG:  duration: 0.148 ms  statement:
begin transaction; set transaction isolation level serializable; 
2006-03-09 10:01:36 EST-[19314] LOG:  duration: 0.534 ms  statement:
select SSY.ssy_setid, SSY.ssy_seqno,     SSY.ssy_minxid, SSY.ssy_maxxid,
SSY.ssy_xip,     SSY.ssy_action_list from "_tzreplic".sl_setsync SSY
where SSY.ssy_seqno < '180'     and SSY.ssy_setid in (271,388,400,412); 
2006-03-09 10:01:36 EST-[19314] LOG:  duration: 1.144 ms  statement:
select T.tab_id, T.tab_set,     "_tzreplic".slon_quote_input('"' ||
PGN.nspname || '"."' ||     PGC.relname || '"') as tab_fqname from
"_tzreplic".sl_table T,     "pg_catalog".pg_class PGC,
"pg_catalog".pg_namespace PGN where T.tab_set = 388     and PGC.oid =
T.tab_reloid     and PGC.relnamespace = PGN.oid; 
2006-03-09 10:01:36 EST-[19314] LOG:  duration: 0.908 ms  statement:
select T.tab_id, T.tab_set,     "_tzreplic".slon_quote_input('"' ||
PGN.nspname || '"."' ||     PGC.relname || '"') as tab_fqname from
"_tzreplic".sl_table T,     "pg_catalog".pg_class PGC,
"pg_catalog".pg_namespace PGN where T.tab_set = 400     and PGC.oid =
T.tab_reloid     and PGC.relnamespace = PGN.oid; 
2006-03-09 10:01:36 EST-[19314] LOG:  duration: 0.886 ms  statement:
select T.tab_id, T.tab_set,     "_tzreplic".slon_quote_input('"' ||
PGN.nspname || '"."' ||     PGC.relname || '"') as tab_fqname from
"_tzreplic".sl_table T,     "pg_catalog".pg_class PGC,
"pg_catalog".pg_namespace PGN where T.tab_set = 412     and PGC.oid =
T.tab_reloid     and PGC.relnamespace = PGN.oid; 
2006-03-09 10:01:36 EST-[19314] LOG:  duration: 4.225 ms  statement:
select T.tab_id, T.tab_set,     "_tzreplic".slon_quote_input('"' ||
PGN.nspname || '"."' ||     PGC.relname || '"') as tab_fqname from
"_tzreplic".sl_table T,     "pg_catalog".pg_class PGC,
"pg_catalog".pg_namespace PGN where T.tab_set = 271     and PGC.oid =
T.tab_reloid     and PGC.relnamespace = PGN.oid; 
2006-03-09 10:01:36 EST-[19316] LOG:  duration: 0.246 ms  statement:
start transaction;set transaction isolation level serializable;select
last_value from "_tzreplic".sl_action_seq;
2006-03-09 10:01:36 EST-[19316] LOG:  duration: 0.047 ms  statement:
rollback transaction;
2006-03-09 10:01:37 EST-[19316] LOG:  duration: 0.267 ms  statement:
start transaction;set transaction isolation level serializable;select
last_value from "_tzreplic".sl_action_seq;
2006-03-09 10:01:37 EST-[19316] LOG:  duration: 0.050 ms  statement:
rollback transaction;
2006-03-09 10:01:37 EST-[19445] NOTICE:  failedNode: set 388 has no
other direct receivers - move now
2006-03-09 10:01:37 EST-[19445] STATEMENT:  select
"_tzreplic".failedNode(1, 2); 
2006-03-09 10:01:38 EST-[19316] LOG:  duration: 0.320 ms  statement:
start transaction;set transaction isolation level serializable;select
last_value from "_tzreplic".sl_action_seq;
2006-03-09 10:01:38 EST-[19316] LOG:  duration: 0.049 ms  statement:
rollback transaction;
2006-03-09 10:01:38 EST-[19445] ERROR:  deadlock detected
2006-03-09 10:01:38 EST-[19445] DETAIL:  Process 19445 waits for
ShareLock on transaction 5609; blocked by process 19314.
 Process 19314 waits for ShareLock on transaction 5596; blocked by
process 19445.
2006-03-09 10:01:38 EST-[19445] CONTEXT:  SQL statement "delete from
"_tzreplic".sl_setsync where ssy_setid =  $1 "
 PL/pgSQL function "failednode" line 115 at SQL statement
2006-03-09 10:01:38 EST-[19445] STATEMENT:  select
"_tzreplic".failedNode(1, 2); 
2006-03-09 10:01:38 EST-[19314] LOG:  duration: 2599.338 ms  statement:
update "_tzreplic".sl_setsync set     ssy_seqno = '180', ssy_minxid =
'6111', ssy_maxxid = '6112',     ssy_xip = '', ssy_action_list = ''
where ssy_setid in (271,388,400,412) and ssy_seqno < '180'; 
2006-03-09 10:01:38 EST-[19445] LOG:  duration: 0.064 ms  statement:
rollback transaction;
2006-03-09 10:01:38 EST-[19314] LOG:  duration: 0.238 ms  statement:
insert into "_tzreplic".sl_seqlog  (seql_seqid, seql_origin,
seql_ev_seqno, seql_last_value)  values (0, 1, '180',
'1000000000000000'); 
2006-03-09 10:01:38 EST-[19314] LOG:  duration: 3.842 ms  statement:
notify "_tzreplic_Event"; notify "_tzreplic_Confirm"; insert into
"_tzreplic".sl_event     (ev_origin, ev_seqno, ev_timestamp,
ev_minxid, ev_maxxid, ev_xip, ev_type     ) values ('1', '180',
'2006-03-09 10:01:36.232092', '6111', '6112', '', 'SYNC'); insert into
"_tzreplic".sl_confirm  (con_origin, con_received, con_seqno,
con_timestamp)    values (1, 2, '180', now()); commit transaction;
 
My interpretation of this log is that the sync_event in the
remote_worker thread started running, then a slonik failover command
came in, deadlocked and got kicked out, and the sync_event transaction
was allowed to finish. Does that make sense?
 
Thanks.
 
- DAP
======================================================
David Parker    Tazz Networks
 
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://gborg.postgresql.org/pipermail/slony1-general/attachments/20060309/85d3ad8e/attachment.html



More information about the Slony1-general mailing list