Aaron Randall aaron.randall at visionoss.com
Tue Apr 11 02:03:29 PDT 2006
Hi Guys,

Thanks for the replies...

Jan Wieck wrote:
> On 4/7/2006 2:16 PM, Christopher Browne wrote:
>> Aaron Randall <aaron.randall at visionoss.com> writes:
>>
>>> Hi all!
>>>
>>> I am seeing a problem occurring after a few days of replication 
>>> between two of my servers - they replicate fine and then suddenly 
>>> the slon process stops on the slave. 
>>
>> Does the slon start back up happily after this?
Yes, it does.  I get a message along the lines of "cleaning up old slon 
process" (sorry, I can't give the exact wording; it's a live system, so 
I can't reproduce it).  But yes, the whole process starts up nicely 
again.
>>
>>> The log file gives good information...I just need help in 
>>> understanding it.  Here is the point in the slave logs where the 
>>> slon process shuts down:
>>>
>>> "2006-03-31 12:47:40 GMT DEBUG2 remoteHelperThread_1_1: 0.007 
>>> seconds until close cursor
>>> 2006-03-31 12:47:40 GMT DEBUG2 remoteWorkerThread_1: new 
>>> sl_rowid_seq value: 1000000000000000
>>> 2006-03-31 12:47:40 GMT DEBUG2 remoteWorkerThread_1: SYNC 244391 
>>> done in 0.034 seconds
>>> 2006-03-31 12:47:47 GMT DEBUG2 syncThread: new sl_action_seq 1 - 
>>> SYNC 230540
>>> 2006-03-31 12:47:47 GMT DEBUG2 localListenThread: Received event 
>>> 2,230540 SYNC
>>> 2006-03-31 12:47:47 GMT DEBUG2 remoteWorkerThread_1: forward confirm 
>>> 2,230540 received by 1
>>> 2006-03-31 12:47:50 GMT DEBUG2 remoteListenThread_1: queue event 
>>> 1,244392 SYNC
>>> 2006-03-31 12:47:50 GMT DEBUG2 remoteWorkerThread_1: Received event 
>>> 1,244392 SYNC
>>> 2006-03-31 12:47:50 GMT DEBUG2 remoteWorkerThread_1: SYNC 244392 
>>> processing
>>> 2006-03-31 12:47:50 GMT DEBUG2 remoteWorkerThread_1: syncing set 1 
>>> with 250 table(s) from mytable 1
>>> 2006-03-31 12:47:50 GMT DEBUG2 remoteHelperThread_1_1: 0.006 seconds 
>>> delay for first row
>>> 2006-03-31 12:47:50 GMT DEBUG2 remoteHelperThread_1_1: 0.007 seconds 
>>> until close cursor
>>> 2006-03-31 12:47:50 GMT DEBUG2 remoteWorkerThread_1: new 
>>> sl_rowid_seq value: 1000000000000000
>>> 2006-03-31 12:47:50 GMT DEBUG2 remoteWorkerThread_1: SYNC 244392 
>>> done in 0.032 seconds
>>> 2006-03-31 12:47:56 GMT FATAL  syncThread: "start transaction;set 
>>> transaction isolation level serializable;select last_value from 
>>> "_my_replication".sl_action_seq;" - FATAL:  terminating connection 
>>> due to administrator command
>>> server closed the connection unexpectedly
>>>         This probably means the server terminated abnormally
>>>         before or while processing the request.
>
> There must be a) something in the postmaster log explaining why the 
> postmaster killed the backend and b) probably a coredump somewhere 
> laying around in $PGDATA, explaining in more detail what happened.
>
>
> Jan
I will take a look next time I have access and post the results if 
needed; the checks I have in mind are sketched below.  Thanks for the tips!
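
For reference, here is roughly what I intend to run on the slave once I
get access again (the log and binary locations below are guesses for our
setup, so they will need adjusting):

    # 1. Look for the postmaster's explanation of why the backend was killed
    grep -i 'terminat' /var/log/postgresql/postmaster.log

    # 2. Look for a core file left under $PGDATA, as Jan suggests
    find "$PGDATA" -name 'core*' -type f

    # 3. If a core file turns up, pull a backtrace from it
    #    (path to the postgres binary assumed)
    gdb /usr/local/pgsql/bin/postgres "$PGDATA/core"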
>
>
>>> 2006-03-31 12:47:56 GMT DEBUG1 slon: shutdown requested
>>> 2006-03-31 12:47:56 GMT DEBUG2 slon: notify worker process to shutdown
>>> 2006-03-31 12:47:56 GMT DEBUG2 slon: wait for worker process to 
>>> shutdown
>>> 2006-03-31 12:47:56 GMT INFO   remoteListenThread_1: disconnecting 
>>> from 'host=1.1.1.2 dbname=mydb user=slonyuser port=5432'
>>> 2006-03-31 12:47:56 GMT DEBUG1 remoteListenThread_1: thread done
>>> 2006-03-31 12:47:56 GMT DEBUG1 localListenThread: thread done
>>> 2006-03-31 12:47:56 GMT DEBUG1 cleanupThread: thread done
>>> 2006-03-31 12:47:56 GMT DEBUG1 main: scheduler mainloop returned
>>> 2006-03-31 12:47:56 GMT DEBUG2 main: wait for remote threads
>>> 2006-03-31 12:47:56 GMT DEBUG2 sched_wakeup_node(): no_id=1 (0 
>>> threads + worker signaled)
>>> 2006-03-31 12:47:56 GMT DEBUG1 remoteWorkerThread_1: helper thread 
>>> for provider 1 terminated
>>> 2006-03-31 12:47:56 GMT DEBUG1 remoteWorkerThread_1: disconnecting 
>>> from data provider 1
>>> 2006-03-31 12:47:56 GMT DEBUG1 remoteWorkerThread_1: thread done
>>> 2006-03-31 12:47:56 GMT DEBUG2 main: notify parent that worker is done
>>> 2006-03-31 12:47:56 GMT DEBUG1 main: done
>>> 2006-03-31 12:47:56 GMT DEBUG2 slon: worker process shutdown ok
>>> 2006-03-31 12:47:56 GMT DEBUG2 slon: exit(-1)
>>> "
>>
>> Something sent a SIGTERM signal to the backend supporting the
>> syncThread, which, if memory serves, could mean that *any* of the
>> backends that slon is listening to were terminated.
>>
>> You should figure out why something is sending SIGTERM signals to your
>> databases; this isn't a Slony-I issue per se.
>>
>> Out of memory problems have historically caused this; you should check
>> database logs to see what's up.  Slony-I won't fix your database
>> problems; it is simply vulnerable to them :-(.
>
>
>
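
In case it helps anyone else who finds this thread: for the out-of-memory
theory I am also planning to check the kernel and database logs for signs
of the OOM killer, roughly like this (log paths are guesses for our
distro):

    # Kernel messages about processes killed under memory pressure
    dmesg | grep -i 'out of memory'
    grep -i 'oom\|out of memory' /var/log/messages

    # Postmaster messages about backends terminated by a signal
    grep -i 'terminated by signal' /var/log/postgresql/postmaster.log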


