[Slony1-general] Rapid-fire updates to table missed by slony

Thu Aug 26 08:03:50 PDT 2010

On Aug 25, 2010, at 1:17 PM, Jan Wieck wrote:

> On 8/26/2010 9:11 AM, Guy Helmer wrote:
>> On Aug 23, 2010, at 2:42 PM, Steve Singer wrote:
>>> Guy Helmer wrote:
>>>> I'm seeing something odd occasionally on a fairly new slony1 (1.2.20) replication set involving one slave.  At times, the application inserts a record into a particular table, updates the record several times, and then deletes the record, sometimes in fairly quick succession (but not always).
>>>> When I run the test-slony-state script, sometimes I find that the replication is failing, and when I look deeper, I find that Slony is having trouble replicating the changes to this table because of rows in the slave table that shouldn't be there.  After I manually remove the conflicting rows, Slony is then able to finish the backlogged replication.
>>>> Is there anything in particular I should look for in the log file prior to this problem?
>>> Shortly after the problem happens, you're going to want to look at sl_log_1, sl_log_2, and sl_event to figure out what was going on.
>>> You want to find what sync the delete should have been part of and what sync the failing insert was part of, and try to figure out why the delete wasn't applied to the slave by the time it tried the insert.
>>> You would also want to look at the logs slon generates to see if that sync did get applied, and look in sl_confirm to verify that.
>>> Honestly, I suspect that something else is going on; I find your description somewhat hard to reconcile with how things work.
>> Thanks for the advice.  It has happened again.  Due to the timing of the issue corresponding somewhat closely with a software update where we took the database & slony down for the maintenance, I am wondering if we might be taking things down in incorrect order...
>> I didn't notice the problem until test-slony-state saw the problem during last night's check, so the data is about 21 hours old.  sl_log_1 contains this for the stuck table:
>> mydb=# SELECT * FROM _replication.sl_log_1 WHERE log_tableid = 28 ORDER BY log_xid;
>> log_origin | log_xid | log_tableid | log_actionseq | log_cmdtype |              log_cmddata
>>------------+---------+-------------+---------------+-------------+----------------------------------------
>>          1 | 2062810 |          28 |          6854 | I           | ("user_id","status") values ('1','2')
>>          1 | 2063155 |          28 |          6881 | I           | ("user_id","status") values ('3','2')
>>          1 | 2063342 |          28 |          6908 | I           | ("user_id","status") values ('3','2')
>>          1 | 2072564 |          28 |          6980 | I           | ("user_id","status") values ('34','2')
>>          1 | 2072564 |          28 |          6984 | D           | "user_id"='34'
>>          1 | 2072564 |          28 |          6986 | I           | ("user_id","status") values ('34','2')
>>          1 | 2072564 |          28 |          6990 | D           | "user_id"='34'
>>          1 | 2072564 |          28 |          6992 | I           | ("user_id","status") values ('34','2')
>>          1 | 2072580 |          28 |          7002 | I           | ("user_id","status") values ('34','2')
>>          1 | 2072586 |          28 |          7021 | D           | "user_id"='34'
>>          1 | 2072586 |          28 |          7023 | I           | ("user_id","status") values ('34','2')
>>          1 | 2072586 |          28 |          7027 | D           | "user_id"='34'
>>          1 | 2072586 |          28 |          7029 | I           | ("user_id","status") values ('34','2')
>>          1 | 2072586 |          28 |          7033 | D           | "user_id"='34'
>>          1 | 2072586 |          28 |          7035 | I           | ("user_id","status") values ('34','2')
>> (19 rows)
>> There are two consecutive inserts for user_id 34 (user_id is the primary key) -- is that a possible problem?
> 
> It looks like there is one delete for user_id=34 missing. This could be caused by a corrupted index on sl_log_1. Can you do a
> 
>    REINDEX TABLE _replication.sl_log_1;
> 
> and then repeat that SELECT?
> 

I had already manually intervened in the slave's table to get the replication working again, so the sl_log_1 table was empty.  I have run the REINDEX TABLE _replication.sl_log_1 command, and the table is still empty...
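
Since the live sl_log_1 rows are gone, the quoted output is the only evidence left, and it can be checked mechanically. Below is a minimal Python sketch that replays the quoted rows and flags any INSERT for a user_id that is already present. It makes two assumptions that are not confirmed anywhere in this thread: that ordering by log_actionseq approximates the order in which the changes are applied, and that the slave table held none of these keys beforehand. The names ROWS and find_conflicts are hypothetical, chosen for illustration.

```python
# Rows transcribed from the quoted sl_log_1 output for log_tableid = 28:
# (log_xid, log_actionseq, log_cmdtype, user_id)
ROWS = [
    (2062810, 6854, "I", 1),
    (2063155, 6881, "I", 3),
    (2063342, 6908, "I", 3),
    (2072564, 6980, "I", 34),
    (2072564, 6984, "D", 34),
    (2072564, 6986, "I", 34),
    (2072564, 6990, "D", 34),
    (2072564, 6992, "I", 34),
    (2072580, 7002, "I", 34),
    (2072586, 7021, "D", 34),
    (2072586, 7023, "I", 34),
    (2072586, 7027, "D", 34),
    (2072586, 7029, "I", 34),
    (2072586, 7033, "D", 34),
    (2072586, 7035, "I", 34),
]


def find_conflicts(rows):
    """Replay rows in log_actionseq order (an assumption about apply order).

    Returns (log_actionseq, log_xid, user_id) for every INSERT whose
    primary key is already present, i.e. whose DELETE appears to be missing.
    """
    present = set()   # keys currently in the simulated slave table
    conflicts = []
    for xid, seq, cmd, key in sorted(rows, key=lambda r: r[1]):
        if cmd == "I":
            if key in present:
                conflicts.append((seq, xid, key))
            present.add(key)
        elif cmd == "D":
            present.discard(key)
    return conflicts


if __name__ == "__main__":
    for seq, xid, key in find_conflicts(ROWS):
        print(f"actionseq {seq} (xid {xid}): INSERT of user_id {key} "
              f"with no preceding DELETE")
```

Under those assumptions the replay flags actionseq 7002 (xid 2072580) for user_id 34, which matches the missing delete described above, and also actionseq 6908 for user_id 3. The second hit may only mean the matching delete was logged elsewhere (for example in sl_log_2) or fell among the rows not shown, since the output reports 19 rows but only 15 appear here.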

Thanks,
Guy