[Slony1-general] getting postgresql server crashes with slony

Mon Jan 29 18:34:36 PST 2007

Hi,

I've been having some serious problems ever since I installed slony
earlier today. Let me give some background information first.

Today I installed slony (1.2.6) on an 8.1.4 server (node 1) and am
replicating to an 8.2.1 server (node 2). It's a simple master (node 1)
-> slave (node 2) setup. Nothing fancy. I'm getting ready to migrate
to 8.2 and was planning on using slony to do it. I've used slony with
mixed success in the past, and I thought I'd give it another go. Of
course everything went fine in our test environment ;).

PostgreSQL had been running happily without a problem. It had never
crashed since I installed 8.1. This server handles about 20 million
queries a day, so it's not sitting around idle. Most of them are
selects, but there is a constant trickle of updates, inserts and
deletes. Within 2 hours of starting up slony, node 1 crashed twice.
The postgres logs look like this:

LOG:  server process (PID 29842) was terminated by signal 11
LOG:  terminating any other active server processes
WARNING:  terminating connection because of crash of another server  
process
DETAIL:  The postmaster has commanded this server process to roll  
back the current transaction and exit, because another server process  
exited abnormally and possibly corrupted shared memory.
.
.
.
HINT:  In a moment you should be able to reconnect to the database  
and repeat your command.
LOG:  all server processes terminated; reinitializing
LOG:  database system was interrupted at 2007-01-29 19:45:03 CST
LOG:  checkpoint record is at 37C/C5F428E4
.
.
.
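
For anyone wanting to check their own postgres logs for the same
pattern, here is a minimal sketch in Python. It only assumes the crash
line has the same wording as the "terminated by signal" line quoted
above; the function name and the sample list are made up for
illustration and are not part of slony or postgres.

```python
import re

# Matches the postmaster line that reports a backend killed by a signal,
# using the wording of the log excerpt quoted above (an assumption for
# other PostgreSQL versions).
CRASH_RE = re.compile(r"server process \(PID (\d+)\) was terminated by signal (\d+)")

def find_crashes(log_lines):
    """Return (pid, signal) tuples for every backend crash found in the log."""
    crashes = []
    for line in log_lines:
        match = CRASH_RE.search(line)
        if match:
            crashes.append((int(match.group(1)), int(match.group(2))))
    return crashes

# Hypothetical sample built from the lines quoted in this post.
sample = [
    "LOG:  server process (PID 29842) was terminated by signal 11",
    "LOG:  terminating any other active server processes",
]
print(find_crashes(sample))
```

Counting the tuples per day gives a quick picture of how often the
backends are dying and with which signal.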

The node 1 slon log is as follows when the crash happened:

2007-01-29 19:42:25 CST DEBUG1 cleanupThread:    0.312 seconds for  
delete logs
2007-01-29 19:45:09 CST FATAL  syncThread: "start transaction;set  
transaction isolation level serializable;select last_value from  
"_mobycluster".sl_action_seq;" - server closed the connection  
unexpectedly
         This probably means the server terminated abnormally
         before or while processing the request.
2007-01-29 19:45:09 CST DEBUG1 slon: retry requested
2007-01-29 19:45:09 CST INFO   remoteListenThread_2: disconnecting  
from '***************'

The node 2 slon log is as follows when the crash happened:

2007-01-29 19:42:16 CST DEBUG1 cleanupThread:    0.196 seconds for  
delete logs
2007-01-29 19:45:09 CST ERROR  remoteListenThread_1: "select  
ev_origin, ev_seqno, ev_timestamp,        ev_minxid, ev_maxxid,  
ev_xip,        ev_type,        ev_data1, ev_data2,        ev_data3,  
ev_data4,        ev_data5, ev_data6,        ev_data7, ev_data8 from  
"_mobycluster".sl_event e where (e.ev_origin = '1' and e.ev_seqno >  
'8720') order by e.ev_origin, e.ev_seqno" - server closed the  
connection unexpectedly
         This probably means the server terminated abnormally
         before or while processing the request.
2007-01-29 19:45:19 CST DEBUG1 remoteListenThread_1: connected to  
'****************'
2007-01-29 19:45:26 CST ERROR  remoteWorkerThread_1: "start  
transaction; set enable_seqscan = off; set enable_indexscan = on; "  
PGRES_FATAL_ERROR server closed the connection unexpectedly
         This probably means the server terminated abnormally
         before or while processing the request.
2007-01-29 19:45:26 CST ERROR  remoteWorkerThread_1: "close LOG; "
PGRES_FATAL_ERROR
2007-01-29 19:45:26 CST ERROR  remoteWorkerThread_1: "rollback
transaction; set enable_seqscan = default; set enable_indexscan =
default; " PGRES_FATAL_ERROR
2007-01-29 19:45:26 CST ERROR  remoteWorkerThread_1: helper 1
finished with error

Is it possible that slony is causing these crashes? Since
$libdir/slony1_funcs.so is loaded into the postgres processes, I
think it's certainly possible. I also think that a coincidence is a
little too much of a reach. However, I would love to hear what the
experts think. What's the best way to track this down? Any advice on
what I should do? I'm very close to uninstalling slony, but if there
is something I can do to help identify the problem so that it can be
fixed, I'd like to help.

I realize that I'm not running 8.1.6, and I've checked the release
notes for .5 and .6 and only see one reference to a crash fix:
"Disallow aggregate functions in UPDATE commands, except within
sub-SELECTs (Tom)". I'm certainly not doing this kind of update, and
I'm pretty sure slony isn't either.

On a much more benign note, I keep seeing a bunch of these in both  
slon logs for node 1 and node 2.

NOTICE:  Slony-I: cleanup stale sl_nodelock entry for pid=8572
CONTEXT:  SQL statement "SELECT  "_mobycluster".cleanupNodelock()"
         PL/pgSQL function "cleanupevent" line 77 at perform

Are they something to be worried about?

Thanks in advance for any help.

Brian Hirt