Cyril Scetbon cscetbon.ext at orange-ftgroup.com
Tue Apr 28 05:44:13 PDT 2009
Hi,

The slony log files show that the watchdog restarted the worker process 
on every host in my configuration (1 provider, 3 receivers) at the same 
time:

2009-04-28 08:47:37 CEST INFO   remoteWorkerThread_101: SYNC 4516360 
sync_event timing:  pqexec (s/count)- provider 0.001/1 - subscriber 
0.035/1 - IUD 0.000/2
2009-04-28 08:47:38 CEST ERROR  remoteListenThread_102: "select 
ev_origin, ev_seqno, ev_timestamp,        ev_snapshot,        
"pg_catalog".txid_snapshot_xmin(ev_snapshot),        
"pg_catalog".txid_snapshot_xmax(ev_snapshot),        ev_type,        
ev_data1, ev_data2,        ev_data3, ev_data4,        ev_data5,
ev_data6,        ev_data7, ev_data8 from 
"_pns_slony_voila_archi_0".sl_event e where (e.ev_origin = '102' and 
e.ev_seqno > '29') or (e.ev_origin = '103' and e.ev_seqno > '62') order 
by e.ev_origin, e.ev_seqno limit 40" - FATAL:  terminating connection 
due to administrator command
2009-04-28 08:47:38 CEST ERROR  remoteListenThread_102: "select 
con_origin, con_received,     max(con_seqno) as con_seqno,     
max(con_timestamp) as con_timestamp from 
"_pns_slony_voila_archi_0".sl_confirm where con_received <> 104 group by 
con_origin, con_received" 2009-04-28 08:47:38 CEST CONFIG slon: child 
terminated status: 11; pid: 28152, current worker pid: 28152
2009-04-28 08:47:38 CEST CONFIG slon: restart of worker in 10 seconds

The first error was triggered by restarting PostgreSQL on one receiver 
without stopping slon first. Is this a known source of errors?
The slon process terminated with status=0 on the receiver where 
PostgreSQL was restarted, and segfaulted (child status=11) on the 
others (each watchdog then started a new slon process). This seems 
to be what caused the problems with events and confirmations.
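
To compare the exit codes across hosts, the watchdog lines can be pulled out of each slon log with a short script. This is only a sketch: the regular expression is modelled on the single log line quoted above, the function name is made up for illustration, and it assumes the reported status is a raw signal number (11 being SIGSEGV on Linux), which is my reading of the log rather than something confirmed from the slon source.

```python
import re
import signal

# Pattern modelled on the watchdog line quoted in the log excerpt above.
CHILD_EXIT = re.compile(r"slon: child terminated status: (\d+); pid: (\d+)")


def describe_child_exit(line):
    """Return (pid, status, label) for a watchdog exit line, or None."""
    match = CHILD_EXIT.search(line)
    if match is None:
        return None
    status = int(match.group(1))
    pid = int(match.group(2))
    if status == 0:
        label = "clean exit"
    else:
        # Assumption: a non-zero status is the number of the signal
        # that killed the worker.
        try:
            label = signal.Signals(status).name
        except ValueError:
            label = "unknown"
    return pid, status, label


line = ("2009-04-28 08:47:38 CEST CONFIG slon: child "
        "terminated status: 11; pid: 28152, current worker pid: 28152")
print(describe_child_exit(line))
```

Running it over the logs of all four nodes should make it easy to see which workers exited cleanly and which ones crashed.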

Regards.

Cyril Scetbon wrote:
> A simple restart of all slon processes seems to have resolved the 
> issue. Weird...
>
> Cyril Scetbon wrote:
>> I ran test_slony_state and it reports the following:
>>
>> Check of event info
>> ---------------------------------------------------
>> Problem : Events not propagating to node 2
>> Problem : Events not propagating to node 4
>> Problem : Events not propagating to node 3
>>
>> Check of sl_confirm aging
>> ---------------------------------------------------
>> Confirmations not propagating from 2 to 1
>> Confirmations not propagating from 2 to 3
>> Confirmations not propagating from 2 to 4
>> Confirmations not propagating from 3 to 1
>> Confirmations not propagating from 3 to 2
>> Confirmations not propagating from 3 to 4
>> Confirmations not propagating from 4 to 1
>> Confirmations not propagating from 4 to 2
>> Confirmations not propagating from 4 to 3
>>
>> Here are the results from one of my databases:
>>
>> - for Confirmations
>>
>> select con_origin, con_received,
>>         min(con_seqno) as minseq, max(con_seqno) as maxseq,
>>         date_trunc('minutes', min(now() - con_timestamp)) as age1,
>>         date_trunc('minutes', max(now() - con_timestamp)) as age2,
>>         min(now() - con_timestamp) > '00:30:00' as tooold
>>     from _pns_slony_voila_archi_0.sl_confirm
>>     group by con_origin, con_received
>>     order by con_origin, con_received;
>> con_origin | con_received | minseq  | maxseq  |   age1   |   age2   | tooold
>> ------------+--------------+---------+---------+----------+----------+--------
>>        101 |          102 | 4464029 | 4464792 | 00:00:00 | 00:16:00 | f
>>        101 |          103 | 4464027 | 4464792 | 00:00:00 | 00:16:00 | f
>>        101 |          104 | 4464024 | 4464792 | 00:00:00 | 00:16:00 | f
>>        102 |          101 |      29 |      29 | 03:39:00 | 03:39:00 | t
>>        102 |          103 |      29 |      29 | 03:39:00 | 03:39:00 | t
>>        102 |          104 |      29 |      29 | 03:39:00 | 03:39:00 | t
>>        103 |          101 |      62 |      62 | 03:39:00 | 03:39:00 | t
>>        103 |          102 |      62 |      62 | 03:39:00 | 03:39:00 | t
>>        103 |          104 |      62 |      62 | 03:39:00 | 03:39:00 | t
>>        104 |          101 |      57 |      57 | 03:39:00 | 03:39:00 | t
>>        104 |          102 |      57 |      57 | 03:39:00 | 03:39:00 | t
>>        104 |          103 |      57 |      57 | 03:39:00 | 03:39:00 | t
>>
>>
>> - for Events
>>
>> select ev_origin, min(ev_seqno), max(ev_seqno),
>>         date_trunc('minutes', min(now() - ev_timestamp)),
>>         date_trunc('minutes', max(now() - ev_timestamp)),
>>         min(now() - ev_timestamp) > '00:30:00' as agehi
>>     from _pns_slony_voila_archi_0.sl_event group by ev_origin;
>> ev_origin |   min   |   max   | date_trunc | date_trunc | agehi
>> -----------+---------+---------+------------+------------+-------
>>       103 |      62 |      62 | 03:49:00   | 03:49:00   | t
>>       104 |      57 |      57 | 03:49:00   | 03:49:00   | t
>>       102 |      29 |      29 | 03:49:00   | 03:49:00   | t
>>       101 | 4464493 | 4465346 | 00:00:00   | 00:14:00   | f
>>
>>
>> What could be causing these errors, and how can I track them down? FYI, I 
>> have logs, but only at debug level 1.
>>
>> I did not have issues in 1.2.15.
>>
>> Regards.
>
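
The sl_confirm aging check quoted above boils down to flagging every (origin, receiver) pair whose youngest confirmation is older than 30 minutes. A minimal sketch of that logic, using a few rows copied from the quoted output; the function name is made up for illustration, and the message wording simply imitates the test_slony_state report rather than reproducing its actual code:

```python
from datetime import timedelta

# Same cutoff as the '00:30:00' literal in the quoted query.
THRESHOLD = timedelta(minutes=30)

# (con_origin, con_received, age of youngest confirmation),
# a subset of the rows from the quoted sl_confirm output.
rows = [
    (101, 102, timedelta(0)),
    (101, 103, timedelta(0)),
    (102, 101, timedelta(hours=3, minutes=39)),
    (103, 104, timedelta(hours=3, minutes=39)),
    (104, 101, timedelta(hours=3, minutes=39)),
]


def stale_confirmations(rows, threshold=THRESHOLD):
    """Return the (origin, receiver) pairs whose youngest confirmation is too old."""
    return [(origin, received)
            for origin, received, age in rows
            if age > threshold]


for origin, received in stale_confirmations(rows):
    print(f"Confirmations not propagating from {origin} to {received}")
```

With these rows, only the pairs originating on nodes 102, 103 and 104 are flagged, which matches the tooold column in the quoted output.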

-- 
Cyril SCETBON

