Wed Apr 29 11:58:29 PDT 2009
- Previous message: [Slony1-general] [Slony 2.0.1] Events and Confirmations not propagating
- Next message: [Slony1-general] hyphen in slony cluster names
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Any help would be great ...
Cyril Scetbon wrote:
> I'm still investigating ....
>
> The slon restart does not resolve the error. After the restart I have
> something like :
>
> select ev_origin, min(ev_seqno), max(ev_seqno),
> date_trunc('minutes', min(now() - ev_timestamp)),
> date_trunc('minutes', max(now() - ev_timestamp)),
> min(now() - ev_timestamp) > '00:30:00' as agehi
> from _pns_slony_voila_archi_0.sl_event group by ev_origin;
> ev_origin | min | max | date_trunc | date_trunc | agehi
> -----------+---------+---------+------------+------------+-------
> 103 | 65 | 66 | 00:01:00 | 00:47:00 | f
> 104 | 60 | 61 | 00:01:00 | 00:47:00 | f
> 102 | 33 | 34 | 00:01:00 | 00:47:00 | f
> 101 | 4541581 | 4542350 | 00:00:00 | 00:13:00 | f
> (4 rows)
>
> So, another event is generated (for example, 66 from node 103)
>
> log extract :
> 2009-04-28 15:51:53 CEST DEBUG2 remoteWorkerThread_102: SYNC 33
> processing
> 2009-04-28 15:51:53 CEST DEBUG1 remoteWorkerThread_102: no sets need
> syncing for this event
> 2009-04-28 15:51:53 CEST DEBUG2 remoteListenThread_104: queue event
> 102,33 SYNC
> 2009-04-28 15:51:53 CEST DEBUG2 remoteWorker_event: event 102,33
> ignored - duplicate
> 2009-04-28 15:51:53 CEST DEBUG2 remoteListenThread_104: queue event
> 104,60 SYNC
> 2009-04-28 15:51:53 CEST DEBUG2 remoteWorker_event: event 104,60
> ignored - duplicate
> 2009-04-28 15:51:53 CEST DEBUG2 remoteWorkerThread_104: SYNC 60
> processing
> 2009-04-28 15:51:53 CEST DEBUG1 remoteWorkerThread_104: no sets need
> syncing for this event
> 2009-04-28 15:51:53 CEST DEBUG2 remoteWorkerThread_103: SYNC 65
> processing
> 2009-04-28 15:51:53 CEST DEBUG1 remoteWorkerThread_103: no sets need
> syncing for this event
>
> after a few minutes I have :
>
> select ev_origin, min(ev_seqno), max(ev_seqno),
> date_trunc('minutes', min(now() - ev_timestamp)),
> date_trunc('minutes', max(now() - ev_timestamp)),
> min(now() - ev_timestamp) > '00:30:00' as agehi
> from _pns_slony_voila_archi_0.sl_event group by ev_origin;
> ev_origin | min | max | date_trunc | date_trunc | agehi
> -----------+---------+---------+------------+------------+-------
> 103 | 66 | 66 | 00:12:00 | 00:12:00 | f
> 104 | 61 | 61 | 00:12:00 | 00:12:00 | f
> 102 | 34 | 34 | 00:12:00 | 00:12:00 | f
> 101 | 4542310 | 4543008 | 00:00:00 | 00:11:00 | f
>
> log extract :
> 2009-04-28 16:37:49 CEST DEBUG2 remoteWorkerThread_103: SYNC 66
> processing
> 2009-04-28 16:37:49 CEST DEBUG1 remoteWorkerThread_103: no sets need
> syncing for this event
> 2009-04-28 16:37:49 CEST DEBUG2 remoteWorkerThread_104: SYNC 61
> processing
> 2009-04-28 16:37:49 CEST DEBUG1 remoteWorkerThread_104: no sets need
> syncing for this event
> 2009-04-28 16:37:53 CEST DEBUG2 remoteWorkerThread_102: SYNC 34
> processing
> 2009-04-28 16:37:53 CEST DEBUG1 remoteWorkerThread_102: no sets need
> syncing for this event
>
> Precedent events (65,60 and 33) have been cleared, and a new one per
> node has been generated. But, the error continues and after 30 minutes
> I have :
>
> select ev_origin, min(ev_seqno), max(ev_seqno),
> date_trunc('minutes', min(now() - ev_timestamp)),
> date_trunc('minutes', max(now() - ev_timestamp)),
> min(now() - ev_timestamp) > '00:30:00' as agehi
> from _pns_slony_voila_archi_0.sl_event group by ev_origin;
> ev_origin | min | max | date_trunc | date_trunc | agehi
> -----------+---------+---------+------------+------------+-------
> 103 | 66 | 66 | 00:41:00 | 00:41:00 | t
> 104 | 61 | 61 | 00:41:00 | 00:41:00 | t
> 102 | 34 | 34 | 00:41:00 | 00:41:00 | t
> 101 | 4543634 | 4544715 | 00:00:00 | 00:18:00 | f
>
> In Summary :
>
> Events are not correctly propagated from nodes 102, 103 and 104. If I
> restart every slons processes new events are generated on these nodes
> but still not correctly propagated. However the last event on each
> node seems to be propagated thanks to the restart stage.
>
> Any idea ?
>
> Cyril Scetbon wrote:
>> Hi,
>>
>> I found in slony log files that the worker has restarted the process
>> on each host in my configuration (1 provider, 3 receivers) at the
>> same time :
>>
>> 2009-04-28 08:47:37 CEST INFO remoteWorkerThread_101: SYNC 4516360
>> sync_event timing: pqexec (s/count)- provider 0.001/1 - subscriber
>> 0.035/1 - IUD 0.000/2
>> 2009-04-28 08:47:38 CEST ERROR remoteListenThread_102: "select
>> ev_origin, ev_seqno, ev_timestamp, ev_snapshot,
>> "pg_catalog".txid_snapshot_xmin(ev_snapshot),
>> "pg_catalog".txid_snapshot_xmax(ev_snapshot), ev_type,
>> ev_data1, ev_data2, ev_data3, ev_data4, ev_data5,
>> ev_data6, ev_data7, ev_data8 from
>> "_pns_slony_voila_archi_0".sl_event e where (e.ev_origin = '102' and
>> e.ev_seqno > '29') or (e.ev_origin = '103' and e.ev_seqno > '62')
>> order by e.ev_origin, e.ev_seqno limit 40" - FATAL: terminating
>> connection due to administrator command
>> 2009-04-28 08:47:38 CEST ERROR remoteListenThread_102: "select
>> con_origin, con_received, max(con_seqno) as con_seqno,
>> max(con_timestamp) as con_timestamp from
>> "_pns_slony_voila_archi_0".sl_confirm where con_received <> 104 group
>> by con_origin, con_received" 2009-04-28 08:47:38 CEST CONFIG slon:
>> child terminated status: 11; pid: 28152, current worker pid: 28152
>> 2009-04-28 08:47:38 CEST CONFIG slon: restart of worker in 10 seconds
>>
>> The origin of the first error is that postgresql has been restarted
>> on one receiver without stopping slon before. Is it a known source of
>> errors ?
>> The slon process has been terminated with status=0 on the receiver
>> where postgresql has been restarted and has segfault (chid status=11)
>> on the others (each watchdog has restart a new slon process). This
>> fact seems to cause the error with events and confirmations.
>>
>> Regards.
>>
>> Cyril Scetbon wrote:
>>> A simple restart of all slon processes seems to have resolved the
>>> issue. weird ....
>>>
>>> Cyril Scetbon wrote:
>>>> I use test_slony_state and see some informations like :
>>>>
>>>> Check of event info
>>>> ---------------------------------------------------
>>>> Problem : Events not propagating to node 2
>>>> Problem : Events not propagating to node 4
>>>> Problem : Events not propagating to node 3
>>>>
>>>> Check of sl_confirm aging
>>>> ---------------------------------------------------
>>>> Confirmations not propagating from 2 to 1
>>>> Confirmations not propagating from 2 to 3
>>>> Confirmations not propagating from 2 to 4
>>>> Confirmations not propagating from 3 to 1
>>>> Confirmations not propagating from 3 to 2
>>>> Confirmations not propagating from 3 to 4
>>>> Confirmations not propagating from 4 to 1
>>>> Confirmations not propagating from 4 to 2
>>>> Confirmations not propagating from 4 to 3
>>>>
>>>> You can see the results on one of my databases :
>>>>
>>>> - for Confirmations
>>>>
>>>> select con_origin, con_received, min(con_seqno) as minseq,
>>>> max(con_seqno) as maxseq, date_trunc('minutes',
>>>> min(now()-con_timestamp)) as age1, date_trunc('minutes',
>>>> max(now()-con_timestamp)) as age2, min(now() - con_timestamp) >
>>>> '00:30:00' as tooold from _pns_slony_voila_archi_0.sl_confirm group
>>>> by con_origin, con_received order by con_origin, con_received;
>>>> con_origin | con_received | minseq | maxseq | age1 | age2
>>>> | tooold
>>>> ------------+--------------+---------+---------+----------+----------+--------
>>>>
>>>> 101 | 102 | 4464029 | 4464792 | 00:00:00 | 00:16:00
>>>> | f
>>>> 101 | 103 | 4464027 | 4464792 | 00:00:00 | 00:16:00
>>>> | f
>>>> 101 | 104 | 4464024 | 4464792 | 00:00:00 | 00:16:00
>>>> | f
>>>> 102 | 101 | 29 | 29 | 03:39:00 | 03:39:00
>>>> | t
>>>> 102 | 103 | 29 | 29 | 03:39:00 | 03:39:00
>>>> | t
>>>> 102 | 104 | 29 | 29 | 03:39:00 | 03:39:00
>>>> | t
>>>> 103 | 101 | 62 | 62 | 03:39:00 | 03:39:00
>>>> | t
>>>> 103 | 102 | 62 | 62 | 03:39:00 | 03:39:00
>>>> | t
>>>> 103 | 104 | 62 | 62 | 03:39:00 | 03:39:00
>>>> | t
>>>> 104 | 101 | 57 | 57 | 03:39:00 | 03:39:00
>>>> | t
>>>> 104 | 102 | 57 | 57 | 03:39:00 | 03:39:00
>>>> | t
>>>> 104 | 103 | 57 | 57 | 03:39:00 | 03:39:00
>>>> | t
>>>>
>>>>
>>>> - for Events
>>>>
>>>> select ev_origin, min(ev_seqno), max(ev_seqno),
>>>> date_trunc('minutes', min(now() - ev_timestamp)),
>>>> date_trunc('minutes', max(now() - ev_timestamp)),
>>>> min(now() - ev_timestamp) > '00:30:00' as agehi
>>>> from _pns_slony_voila_archi_0.sl_event group by ev_origin;
>>>> ev_origin | min | max | date_trunc | date_trunc | agehi
>>>> -----------+---------+---------+------------+------------+-------
>>>> 103 | 62 | 62 | 03:49:00 | 03:49:00 | t
>>>> 104 | 57 | 57 | 03:49:00 | 03:49:00 | t
>>>> 102 | 29 | 29 | 03:49:00 | 03:49:00 | t
>>>> 101 | 4464493 | 4465346 | 00:00:00 | 00:14:00 | f
>>>>
>>>>
>>>> What can be the source of these errors and how to track them ? FYI,
>>>> I have logs but in debug level 1.
>>>>
>>>> I did not have issues in 1.2.15.
>>>>
>>>> Regards.
>>>
>>
>
--
Cyril SCETBON
- Previous message: [Slony1-general] [Slony 2.0.1] Events and Confirmations not propagating
- Next message: [Slony1-general] hyphen in slony cluster names
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
More information about the Slony1-general mailing list