Yoshiharu Mori y-mori at sraoss.co.jp
Mon Mar 3 05:02:20 PST 2008
Hello.

> >> [...]
> >> Hey, I should test failover before updating to 1.2.13...
> >
> > I have some strange periodic problems with 'ACCEPT_SET - MOVE_SET or
> > FAILOVER_SET not received yet - sleep' on 1.2.12 and 1.2.13. Looks
> > similar to this one.
> >
> > I should try to downgrade to 1.2.11 and try if my 'move set' problems
> > will disappear. Here is the initial problem description:
> > http://lists.slony.info/pipermail/slony1-general/2008-February/007445.html
> 
> There's something about this that isn't making sense...
> 
> I just did a CVS diff between 1.2.11 and REL_1_2_STABLE, and didn't
> see anything that ought to have anything to do with this.
> 
> I haven't yet done any testing of this case, out of the samples
> described; I intend to do so; but it's not making sense that changing
> between 1.2.11 and 1.2.13 should make any difference in this...

Sorry, I should have checked more carefully.

I think this problem is caused not by the difference between versions but by the behavior of "remoteWorkerThread".

When the 'ACCEPT_SET - MOVE_SET or FAILOVER_SET not received yet - sleep' problem occurs,
pg_locks looks like this:

----
testdb=# SELECT relname,granted,pid,mode from pg_locks as l , pg_class as c where c.oid = l.relation and locktype='relation';
          relname           | granted |  pid  |        mode
----------------------------+---------+-------+---------------------
 pg_class_oid_index         | t       | 15778 | AccessShareLock
 pg_class_relname_nsp_index | t       | 15778 | AccessShareLock
 pg_locks                   | t       | 15778 | AccessShareLock
 pg_class                   | t       | 15778 | AccessShareLock
 sl_event                   | t       | 15771 | AccessShareLock
 sl_event-pkey              | t       | 15771 | AccessShareLock
 sl_config_lock             | f       | 15770 | AccessExclusiveLock <-- attention!
 sl_config_lock             | t       | 15771 | AccessExclusiveLock
----
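
To make the waiting relationship in this output explicit, here is a minimal sketch in Python (an illustration only, not part of Slony-I). The name find_blocked is hypothetical, and the rows are copied by hand from the output above. It makes a simplifying assumption: any granted lock on the same relation counts as the blocker. That holds here because the waiting request is an AccessExclusiveLock, but it ignores the general lock conflict rules.

```python
from collections import defaultdict

# Rows copied from the pg_locks output above: (relname, granted, pid, mode)
ROWS = [
    ("sl_event", True, 15771, "AccessShareLock"),
    ("sl_event-pkey", True, 15771, "AccessShareLock"),
    ("sl_config_lock", False, 15770, "AccessExclusiveLock"),
    ("sl_config_lock", True, 15771, "AccessExclusiveLock"),
]


def find_blocked(rows):
    """Return (waiting_pid, holding_pid, relname) for each ungranted request."""
    # Collect the pids that currently hold a granted lock on each relation.
    holders = defaultdict(list)
    for relname, granted, pid, _mode in rows:
        if granted:
            holders[relname].append(pid)

    # Pair every ungranted request with the other pids holding that relation.
    blocked = []
    for relname, granted, pid, _mode in rows:
        if not granted:
            for holder in holders[relname]:
                if holder != pid:
                    blocked.append((pid, holder, relname))
    return blocked


print(find_blocked(ROWS))  # [(15770, 15771, 'sl_config_lock')]
```

So pid 15770 is waiting for the lock that pid 15771 holds on sl_config_lock.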

Next, I examined why "lock table sl_config_lock" was executed twice.

In the case of failover or move set, two events are generated:
one is "FAILOVER/MOVE_SET", the other is "ACCEPT_SET".
The "FAILOVER/MOVE_SET" event is processed by remoteWorkerThread_1, which does an INSERT INTO sl_event,
and the "ACCEPT_SET" event is processed by remoteWorkerThread_2, which runs a SELECT against sl_event.

Both events lock the sl_config_lock table as follows.
---
begin transaction; set transaction isolation level serializable; lock table "_testdbcluster".sl_config_lock;
---

If the statements are executed in the order remoteWorkerThread_1 (INSERT), then remoteWorkerThread_2 (SELECT), the problem does not occur:

---- PostgreSQL SQL log, SUCCESS case: note pid=15407 ----
2008-03-03 18:56:15 JST[15407]LOG:  statement: begin transaction; set transaction isolation level serializable; /* FAILOVER_SET */ lock table "_testdbcluster".sl_config_lock;
2008-03-03 18:56:15 JST[15408]LOG:  statement: begin transaction; set transaction isolation level serializable; /* ACCEPT_SET */ lock table "_testdbcluster".sl_config_lock;
2008-03-03 18:56:15 JST[15407]LOG:  statement: select "_testdbcluster".failoverSet_int(1, 2, 1, 16); notify "_testdbcluster_Event"; insert into "_testdbcluster".sl_event     (ev_origin, ev_seqno, ev_timestamp,      ev_minxid, ev_maxxid, ev_xip, ev_type , ev_data1, ev_data2, ev_data3    ) values ('1', '16', '2008-03-03 18:56:14.173481', '798269', '798271', '''798270''', 'FAILOVER_SET', '1', '2', '1'); insert into "_testdbcluster".sl_confirm   (con_origin, con_received, con_seqno, con_timestamp)    values (1, 3, '16', now()); commit transaction;
-------------------------------

But if they are executed in the order remoteWorkerThread_2 (SELECT), then remoteWorkerThread_1 (INSERT),
we get the 'ACCEPT_SET - MOVE_SET or FAILOVER_SET not received yet - sleep' loop.

---- PostgreSQL SQL log, FAILED case: note pid=15771 ----
2008-03-03 19:13:51 JST[15771]LOG:  statement: begin transaction; set transaction isolation level serializable; /* ACCEPT_SET */ lock table "_testdbcluster".sl_config_lock;
2008-03-03 19:13:51 JST[15770]LOG:  statement: begin transaction; set transaction isolation level serializable; /* FAILOVER_SET */ lock table "_testdbcluster".sl_config_lock;
2008-03-03 19:13:51 JST[15771]LOG:  statement: select 1 from "_testdbcluster".sl_event where      (ev_origin = 1 and       ev_seqno = 22 and       ev_type = 'MOVE_SET' and       ev_data1 = '1' and      ev_data2 = '1' and       ev_data3 = '2') or      (ev_origin = 1 and       ev_seqno = 22 and       ev_type = 'FAILOVER_SET' and       ev_data1 = '1' and       ev_data2 = '2' and       ev_data3 = '1');
----------------------------------------------

Because remoteWorkerThread_2 already holds the lock from "lock table sl_config_lock", remoteWorkerThread_1 cannot insert the "FAILOVER/MOVE_SET" event into sl_event!

I think this is a serious bug.
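
The ordering dependence can be sketched with a small deterministic model in Python. This is an illustration of the behavior described in this mail, not Slony-I code. The function name simulate and the max_polls cutoff are hypothetical. The model assumes, as the pg_locks output suggests, that the ACCEPT_SET transaction keeps holding sl_config_lock while it sleeps and retries.

```python
def simulate(lock_order, max_polls=3):
    """Model two transactions that both begin with LOCK TABLE sl_config_lock.

    lock_order: the order in which the two workers request the lock,
                e.g. ["FAILOVER_SET", "ACCEPT_SET"].
    Returns (log_lines, completed).
    """
    sl_event = set()          # event types already inserted into sl_event
    queue = list(lock_order)  # the lock is granted in request order
    log = []
    polls = 0
    while queue:
        holder = queue[0]     # transaction currently holding the lock
        if holder == "FAILOVER_SET":
            sl_event.add("FAILOVER_SET")
            log.append("FAILOVER_SET: inserted into sl_event, commit")
            queue.pop(0)      # commit releases the lock
        elif "FAILOVER_SET" in sl_event:
            log.append("ACCEPT_SET: event found, commit")
            queue.pop(0)
        else:
            # ACCEPT_SET holds the lock, so FAILOVER_SET can never insert.
            polls += 1
            log.append("ACCEPT_SET: not received yet - sleep")
            if polls >= max_polls:
                return log, False  # stop the model; the real loop goes on
    return log, True


print(simulate(["FAILOVER_SET", "ACCEPT_SET"])[1])  # True
print(simulate(["ACCEPT_SET", "FAILOVER_SET"])[1])  # False
```

In this model only the order of the two LOCK statements decides the outcome, which matches the two logs above.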

My environment is CentOS x86_64 with a dual-core CPU.

Regards,

-- 
SRA OSS, Inc. Japan
Yoshiharu Mori <y-mori at sraoss.co.jp>
http://www.sraoss.co.jp/

