Jan Wieck
Mon Apr 10 06:55:31 PDT 2006
On 4/7/2006 2:16 PM, Christopher Browne wrote:
> Aaron Randall <aaron.randall at visionoss.com> writes:
> 
>> Hi all!
>>
>> I am seeing a problem occurring after a few days of replication between 
>> two of my servers - they replicate fine and then suddenly the slon 
>> process stops on the slave. 
> 
> Does the slon start back up happily after this?
> 
>> The log file gives good information...I 
>> just need help in understanding it.  Here is the point in the slave logs 
>> where the slon process shuts down:
>>
>> "2006-03-31 12:47:40 GMT DEBUG2 remoteHelperThread_1_1: 0.007 seconds 
>> until close cursor
>> 2006-03-31 12:47:40 GMT DEBUG2 remoteWorkerThread_1: new sl_rowid_seq 
>> value: 1000000000000000
>> 2006-03-31 12:47:40 GMT DEBUG2 remoteWorkerThread_1: SYNC 244391 done in 
>> 0.034 seconds
>> 2006-03-31 12:47:47 GMT DEBUG2 syncThread: new sl_action_seq 1 - SYNC 230540
>> 2006-03-31 12:47:47 GMT DEBUG2 localListenThread: Received event 
>> 2,230540 SYNC
>> 2006-03-31 12:47:47 GMT DEBUG2 remoteWorkerThread_1: forward confirm 
>> 2,230540 received by 1
>> 2006-03-31 12:47:50 GMT DEBUG2 remoteListenThread_1: queue event 
>> 1,244392 SYNC
>> 2006-03-31 12:47:50 GMT DEBUG2 remoteWorkerThread_1: Received event 
>> 1,244392 SYNC
>> 2006-03-31 12:47:50 GMT DEBUG2 remoteWorkerThread_1: SYNC 244392 processing
>> 2006-03-31 12:47:50 GMT DEBUG2 remoteWorkerThread_1: syncing set 1 with 
>> 250 table(s) from mytable 1
>> 2006-03-31 12:47:50 GMT DEBUG2 remoteHelperThread_1_1: 0.006 seconds 
>> delay for first row
>> 2006-03-31 12:47:50 GMT DEBUG2 remoteHelperThread_1_1: 0.007 seconds 
>> until close cursor
>> 2006-03-31 12:47:50 GMT DEBUG2 remoteWorkerThread_1: new sl_rowid_seq 
>> value: 1000000000000000
>> 2006-03-31 12:47:50 GMT DEBUG2 remoteWorkerThread_1: SYNC 244392 done in 
>> 0.032 seconds
>> 2006-03-31 12:47:56 GMT FATAL  syncThread: "start transaction;set 
>> transaction isolation level serializable;select last_value from 
>> "_my_replication".sl_action_seq;" - FATAL:  terminating connection due 
>> to administrator command
>> server closed the connection unexpectedly
>>         This probably means the server terminated abnormally
>>         before or while processing the request.

There must be a) something in the postmaster log explaining why the 
postmaster killed the backend, and b) probably a core dump lying around 
somewhere in $PGDATA, explaining in more detail what happened.
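A quick sketch of where to look on the subscriber's host. The paths below ($PGDATA location, pg_log subdirectory, postgres binary path) are assumptions and vary by installation; adjust them to yours:

```shell
#!/bin/sh
# Where to look after "FATAL: terminating connection due to administrator
# command". PGDATA default is an assumption; override via the environment.
PGDATA="${PGDATA:-/var/lib/pgsql/data}"

# a) The postmaster log should record why the backend was terminated:
grep -h "terminating connection" "$PGDATA"/pg_log/*.log 2>/dev/null || true

# b) A core dump may be lying around in the data directory:
find "$PGDATA" -maxdepth 2 -type f -name "core*" 2>/dev/null || true

# If a core file turns up, a backtrace tells the rest of the story
# (binary path is an assumption):
#   gdb /usr/bin/postgres "$PGDATA"/core.<pid> -ex bt -ex quit
```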


Jan


>> 2006-03-31 12:47:56 GMT DEBUG1 slon: shutdown requested
>> 2006-03-31 12:47:56 GMT DEBUG2 slon: notify worker process to shutdown
>> 2006-03-31 12:47:56 GMT DEBUG2 slon: wait for worker process to shutdown
>> 2006-03-31 12:47:56 GMT INFO   remoteListenThread_1: disconnecting from 
>> 'host=1.1.1.2 dbname=mydb user=slonyuser port=5432'
>> 2006-03-31 12:47:56 GMT DEBUG1 remoteListenThread_1: thread done
>> 2006-03-31 12:47:56 GMT DEBUG1 localListenThread: thread done
>> 2006-03-31 12:47:56 GMT DEBUG1 cleanupThread: thread done
>> 2006-03-31 12:47:56 GMT DEBUG1 main: scheduler mainloop returned
>> 2006-03-31 12:47:56 GMT DEBUG2 main: wait for remote threads
>> 2006-03-31 12:47:56 GMT DEBUG2 sched_wakeup_node(): no_id=1 (0 threads + 
>> worker signaled)
>> 2006-03-31 12:47:56 GMT DEBUG1 remoteWorkerThread_1: helper thread for 
>> provider 1 terminated
>> 2006-03-31 12:47:56 GMT DEBUG1 remoteWorkerThread_1: disconnecting from 
>> data provider 1
>> 2006-03-31 12:47:56 GMT DEBUG1 remoteWorkerThread_1: thread done
>> 2006-03-31 12:47:56 GMT DEBUG2 main: notify parent that worker is done
>> 2006-03-31 12:47:56 GMT DEBUG1 main: done
>> 2006-03-31 12:47:56 GMT DEBUG2 slon: worker process shutdown ok
>> 2006-03-31 12:47:56 GMT DEBUG2 slon: exit(-1)
>> "
> 
> Something sent a SIGTERM signal to the backend supporting the
> syncThread, which, if memory serves, could mean that *any* of the
> backends that slon is listening to were terminated.
> 
> You should figure out why something is sending SIGTERM signals to your
> databases; this isn't a Slony-I issue per se.
> 
> Out of memory problems have historically caused this; you should check
> database logs to see what's up.  Slony-I won't fix your database
> problems; it is simply vulnerable to them :-(.
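If it was the kernel's OOM killer, the evidence lands in the kernel log rather than the postmaster log. The sample lines below are fabricated to show the pattern to grep for; on a real box you would run the same grep against `dmesg` output or /var/log/messages:

```shell
#!/bin/sh
# Sketch: what an OOM kill of a postgres backend looks like in the kernel
# log. The heredoc is a made-up sample; in practice pipe `dmesg` (or the
# syslog file) through the same grep.
cat <<'EOF' | grep -Ei "oom-killer|out of memory"
Mar 31 12:47:55 db1 kernel: postgres invoked oom-killer: gfp_mask=0x201d2
Mar 31 12:47:55 db1 kernel: Out of Memory: Killed process 4312 (postgres).
EOF
```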


-- 
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck at Yahoo.com #


