Jason L. Buberel jason at buberel.org
Sun Sep 16 20:45:54 PDT 2007
To follow up on this:

The problem appears to have been caused by the presence of 
'pg_bulkload'. There seems to be some fundamental conflict between that 
utility and Slony. Although I cannot be 100% sure, these are the steps I 
went through before reaching that conclusion:

1. Leaving pg_bulkload installed on all three nodes in the cluster, I 
completely removed Slony from each node by stopping the slon daemons and 
running 'drop schema _ar_primary cascade' on each of them.

2. I then recreated the cluster using the original slonik configuration 
scripts. At this point, I did NOT start the slon daemons.
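
The original slonik scripts are not worth reproducing in full; they are 
just the standard preamble plus the init cluster / store node / store 
path / subscribe set sequence. A heavily abbreviated sketch (node 
numbers, the srv3 port and comments are illustrative; the cluster name 
is inferred from the _ar_primary schema):

    slonik <<_EOF_
    cluster name = ar_primary;

    node 1 admin conninfo = 'dbname=altos_research host=srv1 user=postgres port=5433';
    node 2 admin conninfo = 'dbname=altos_research host=srv2 user=postgres port=54824';
    node 3 admin conninfo = 'dbname=altos_research host=srv3 user=postgres port=5432';

    init cluster (id = 1, comment = 'srv1');
    store node (id = 2, comment = 'srv2');
    store node (id = 3, comment = 'srv3');
    # ... followed by the store path / set add table / subscribe set
    # statements from the original scripts
    _EOF_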

3. I ran several queries against each node in the cluster. Everything 
worked fine.
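
(By "queries" I just mean ordinary sanity checks from psql along these 
lines -- the table name is only a placeholder for one of the replicated 
tables:)

    # repeat against srv2 and srv3 with their respective ports
    psql -U postgres -d altos_research -h srv1 -p 5433 \
        -c 'select count(*) from some_replicated_table;'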

4. I then started the slon daemons (the command line used is sketched 
after the log excerpt below). Within a few minutes of the first attempts 
at synchronization, postmaster processes started dying. The syslog 
showed the telltale fatal messages:

Sep 15 17:28:12 srv3 slon[31511]: [2407-1] DEBUG2 remoteListenThread_2: LISTEN
Sep 15 17:28:17 srv3 slon[31511]: [2408-1] DEBUG2 syncThread: new sl_action_seq 23036 - SYNC 1019078
Sep 15 17:28:21 srv3 slon[31511]: [2409-1] DEBUG2 localListenThread: Received event 3,1019078 SYNC
Sep 15 17:28:22 srv3 slon[31511]: [2411-1] WARN   cleanupThread: "select "_ar_primary".logswitch_weekly(); " - server closed the connection unexpectedly
Sep 15 17:28:22 srv3 slon[31511]: [2411-2]      This probably means the server terminated abnormally
Sep 15 17:28:22 srv3 slon[31511]: [2411-3]      before or while processing the request.
Sep 15 17:28:22 srv3 slon[31511]: [2412-1] FATAL  cleanupThread: "delete from "_ar_primary".sl_log_1 where log_origin = '2' and log_xid < '151888804'; delete from
Sep 15 17:28:22 srv3 slon[31511]: [2412-2]  "_ar_primary".sl_log_2 where log_origin = '2' and log_xid < '151888804'; delete from "_ar_primary".sl_seqlog where seql_origin
Sep 15 17:28:22 srv3 slon[31511]: [2412-3]  = '2' and seql_ev_seqno < '1687734'; select "_ar_primary".logswitch_finish(); " -
Sep 15 17:28:22 srv3 slon[31511]: [2413-1] DEBUG2 slon_retry() from pid=31511
Sep 15 17:28:22 srv3 slon[31440]: [6-1] DEBUG2 slon: notify worker process to shutdown
Sep 15 17:28:22 srv3 slon[31511]: [2414-1] INFO   remoteListenThread_2: disconnecting from 'dbname=altos_research host=srv2 user=postgres port=54824'
Sep 15 17:28:22 srv3 postgres-8.2[30743]: [4-1] LOG:  00000: server process (PID 31525) was terminated by signal 11
Sep 15 17:28:23 srv3 slon[31511]: [2418-1] INFO   remoteListenThread_1: disconnecting from 'dbname=altos_research host=srv1 user=postgres port=5433'
Sep 15 17:28:23 srv3 postgres-8.2[30743]: [4-2] LOCATION:  LogChildExit, postmaster.c:2430
Sep 15 17:28:23 srv3 postgres-8.2[30743]: [5-1] LOG:  00000: terminating any other active server processes
Sep 15 17:28:23 srv3 postgres-8.2[30743]: [5-2] LOCATION:  HandleChildCrash, postmaster.c:2315
Sep 15 17:28:23 srv3 slon[31511]: [2421-1] DEBUG2 main: wait for remote threads
Sep 15 17:28:23 srv3 postgres-8.2[31510]: [4-1] WARNING:  57P02: terminating connection because of crash of another server process
Sep 15 17:28:23 srv3 postgres-8.2[31510]: [4-2] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server
Sep 15 17:28:23 srv3 postgres-8.2[31510]: [4-3]  process exited abnormally and possibly corrupted shared memory.
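
For completeness, the slon daemons in step 4 were started along these 
lines on each node, one slon per node pointed at that node's local 
database (the cluster name is inferred from the _ar_primary schema; the 
conninfo matches what appears in the log above):

    # example for node 1; srv2/srv3 use their own conninfo strings
    slon ar_primary 'dbname=altos_research host=srv1 user=postgres port=5433' &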


5. I then stopped the slon daemons, and restarted each of the 
postmasters. I executed a few queries, and everything worked well.
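
(The restart itself was nothing exotic -- just pg_ctl against each 
node's data directory; the path here is a placeholder:)

    # on each node, with the slon daemon already stopped
    pg_ctl -D /var/lib/pgsql/data restart -m fast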

6. Next, I fully uninstalled pg_bulkload (executing the two uninstall.sql 
scripts, removing the .so file, and removing references to the pg_ctl 
wrapper script 'postgresql').
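
(In outline -- the uninstall script name and paths are approximate; use 
whatever your pg_bulkload distribution actually ships:)

    # drop the pg_bulkload functions from the database
    psql -U postgres -d altos_research -f uninstall_pg_bulkload.sql
    # (plus the second uninstall script shipped with pg_bulkload)

    # remove the shared library from the PostgreSQL lib directory
    rm "$(pg_config --pkglibdir)"/pg_bulkload.so

    # and stop invoking pg_bulkload's 'postgresql' pg_ctl wrapper at startup
    # (point init scripts and cron jobs back at plain pg_ctl)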

7. I restarted all of the postmaster daemons. I then restarted all of 
the slon daemons. I executed several queries, and everything is once 
again working normally. Replication is fine.

So let that be a warning to everyone :)

-jason

Christopher Browne wrote:
> logswitch_weekly() doesn't invoke any "wacky C stuff" that I'd expect
> to be at all sensitive that way.  Indeed, it doesn't do much.
>
> - It checks what the date is, to see if it's time to switch logs yet.
>
> - It saves that time.
>
> - It fiddles with the value of the sequence sl_log_status.
>
> - And... nothing else.
>
> This doesn't smell at all like a plausible cause for a sig11.
>
> And the log doesn't tell us that it caused the problem - rather it
> tells us that the problem terminated it.  The cleanup thread work was
> being done on process 13841, terminated due to what was up with 13861.
>
> I think you need to look, if you can, at what backend process 13861
> was doing.  That, I believe, is the thing that's responsible for the
> problem.
>   

