Sun Sep 16 20:45:54 PDT 2007
- Previous message: [Slony1-general] logswtich.weekly causing postmaster to crash?
- Next message: [Slony1-general] Re: conflict beween slony and pg_bulkload causing postmaster crash
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
To follow up on this: The problem appears to have been caused by the presence of 'pg_bulkload'. There seems to be some essential conflict between that utility and slony. Although I cannot be 100% sure, these are the steps I went through prior to reaching that conclusion: 1. On the three nodes in the cluster, leaving pg_bulkload installed on all nodes, I completely removed slony from each node by stopping the slon daemons and doing a 'drop schema _ar_primary cascade' on each node. 2. I then recreated the cluster using the original slonik configuration scripts. At this point, I did NOT start the slon daemons. 3. I ran several queries against each node in the cluster. Everything worked fine. 4. I then started the slon daemons. Within a few minutes after the first attempts at synchronization, postmaster processes starting dying. The syslog showed the telltale fatal messages: Sep 15 17:28:12 srv3 slon[31511]: [2407-1] DEBUG2 remoteListenThread_2: LISTEN Sep 15 17:28:17 srv3 slon[31511]: [2408-1] DEBUG2 syncThread: new sl_action_seq 23036 - SYNC 1019078 Sep 15 17:28:21 srv3 slon[31511]: [2409-1] DEBUG2 localListenThread: Received event 3,1019078 SYNC Sep 15 17:28:22 srv3 slon[31511]: [2411-1] WARN cleanupThread: "select "_ar_primary".logswitch_weekly(); " - server closed the connection unexpectedly Sep 15 17:28:22 srv3 slon[31511]: [2411-2] This probably means the server terminated abnormally Sep 15 17:28:22 srv3 slon[31511]: [2411-3] before or while processing the request. Sep 15 17:28:22 srv3 slon[31511]: [2412-1] FATAL cleanupThread: "delete from "_ar_primary".sl_log_1 where log_origin = '2' and log_xid < '151888804'; delete from Sep 15 17:28:22 srv3 slon[31511]: [2412-2] "_ar_primary".sl_log_2 where log_origin = '2' and log_xid < '151888804'; delete from "_ar_primary".sl_seqlog where seql_origin Sep 15 17:28:22 srv3 slon[31511]: [2412-3] = '2' and seql_ev_seqno < '1687734'; select "_ar_primary".logswitch_finish(); " - Sep 15 17:28:22 srv3 slon[31511]: [2413-1] DEBUG2 slon_retry() from pid=31511 Sep 15 17:28:22 srv3 slon[31440]: [6-1] DEBUG2 slon: notify worker process to shutdown Sep 15 17:28:22 srv3 slon[31511]: [2414-1] INFO remoteListenThread_2: disconnecting from 'dbname=altos_research host=srv2 user=postgres port=54824' Sep 15 17:28:22 srv3 postgres-8.2[30743]: [4-1] LOG: 00000: server process (PID 31525) was terminated by signal 11 Sep 15 17:28:23 srv3 slon[31511]: [2418-1] INFO remoteListenThread_1: disconnecting from 'dbname=altos_research host=srv1 user=postgres port=5433' Sep 15 17:28:23 srv3 postgres-8.2[30743]: [4-2] LOCATION: LogChildExit, postmaster.c:2430 Sep 15 17:28:23 srv3 postgres-8.2[30743]: [5-1] LOG: 00000: terminating any other active server processes Sep 15 17:28:23 srv3 postgres-8.2[30743]: [5-2] LOCATION: HandleChildCrash, postmaster.c:2315 Sep 15 17:28:23 srv3 slon[31511]: [2421-1] DEBUG2 main: wait for remote threads Sep 15 17:28:23 srv3 postgres-8.2[31510]: [4-1] WARNING: 57P02: terminating connection because of crash of another server process Sep 15 17:28:23 srv3 postgres-8.2[31510]: [4-2] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server Sep 15 17:28:23 srv3 postgres-8.2[31510]: [4-3] process exited abnormally and possibly corrupted shared memory. 5. I then stopped the slon daemons, and restarted each of the postmasters. I executed a few queries, and everything worked well. 6. Next, I fully uninstalled pg_bulklod (executing the two uninstall.sql scripts, removing the .so file, and removing references to the pg_ctl wrapper script 'postgresql'. 7. I restarted all of the postmaster daemons. I then restarted all of the slon daemons. I executed several queries, and everything is once again working normally. Replication is fine. So let that be a warning to everyone :) -jason Christopher Browne wrote: > logswitch_weekly() doesn't invoke any "wacky C stuff" that I'd expect > to be at all sensitive that way. Indeed, it doesn't do much. > > - It checks what the date is, to see if it's time to switch logs yet. > > - It saves that time. > > - It fiddles with the value of the sequence sl_log_status. > > - And... nothing else. > > This doesn't smell at all like a plausible cause for a sig11. > > And the log doesn't tell us that it caused the problem - rather it > tells us that the problem terminated it. The cleanup thread work was > being done on process 13841, terminated due to what was up with 13861. > > I think you need to look, if you can, at what backend process 13861 > was doing. That, I believe, is the thing that's responsible for the > problem. >
- Previous message: [Slony1-general] logswtich.weekly causing postmaster to crash?
- Next message: [Slony1-general] Re: conflict beween slony and pg_bulkload causing postmaster crash
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
More information about the Slony1-general mailing list