[Slony1-general] child terminated status: 11 -> restart of worker in 10 seconds

Tue May 25 04:19:55 PDT 2010

Hello!

I have 6 pairs of servers which use Slony-I (1.2.15) with PostgreSQL 8.3.7
on Ubuntu 9.04.

Each pair is a master/slave combination, which replicates 2 DBs, one
smaller and one bigger (on disk they total 16 GB on the masters).

The system is a high availability setup with automatic failovers.
Failovers have occurred, after which the master/slave servers have switched
places, so to speak, and replication has (manually) been restarted in the
other direction.

That has worked fine until just recently, when one server pair stopped
replicating.

First I noticed "Too many open files" errors on the new slave (on the
bigger DB) just before it had managed to replicate all data.

--------------------------
2010-05-25 07:06:08 UTC ERROR  slon_connectdb: PQconnectdb("dbname=BIGDB
host=10.10.130.51 user=postgres password=pass") failed - could not
create socket: Too many open files
2010-05-25 07:06:08 UTC ERROR  remoteWorkerThread_1: cannot connect to
data provider 1 on 'dbname=BIGDB host=10.10.130.51 user=postgres
password=pass'
2010-05-25 07:06:08 UTC DEBUG1 slon: shutdown requested
2010-05-25 07:06:08 UTC DEBUG2 slon: notify worker process to shutdown
2010-05-25 07:06:08 UTC INFO   remoteListenThread_1: disconnecting from
'dbname=BIGDB host=10.10.130.51 user=postgres password=pass'
--------------------------

The .51 IP is the slave's IP.
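
A minimal sketch of how the descriptor usage of a running slon process
could be compared with its limit, assuming a Linux /proc filesystem. The
helper names (open_fd_count, parse_nofile_limit, fd_usage) are made up for
illustration, and the pid has to be supplied by hand.

```python
import os

def open_fd_count(pid):
    """Count the open file descriptors of a process (Linux /proc only)."""
    return len(os.listdir(f"/proc/{pid}/fd"))

def parse_nofile_limit(limits_text):
    """Extract the soft 'Max open files' limit from /proc/<pid>/limits text.

    Returns an int, or None if the limit is 'unlimited' or the line is missing.
    """
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            # Fields after splitting: Max open files <soft> <hard> files
            soft = line.split()[3]
            return None if soft == "unlimited" else int(soft)
    return None

def fd_usage(pid):
    """Return (open_fds, soft_limit) for the given pid."""
    with open(f"/proc/{pid}/limits") as fh:
        limit = parse_nofile_limit(fh.read())
    return open_fd_count(pid), limit
```

Calling fd_usage(pid) with the pid of the slon worker (as the same user or
as root) returns the current descriptor count and the soft limit, so a
count that creeps towards the limit during the initial copy would show up
before the connect fails.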

So I restarted the system, and hoped for the best.

For the smaller DB I saw this on the slave:
--------------------------
2010-05-25 07:33:39 UTC DEBUG1 slon: child termination timeout - kill child
2010-05-25 07:33:39 UTC DEBUG2 slon: child terminated status: 9; pid:
30273, current worker pid: 30273
2010-05-25 07:33:39 UTC DEBUG1 slon: done
2010-05-25 07:33:39 UTC DEBUG2 slon: remove pid file
2010-05-25 07:33:39 UTC DEBUG2 slon: exit(0)
--------------------------

For the bigger DB I saw this on the slave:
--------------------------
2010-05-25 11:14:29 UTC DEBUG2 remoteListenThread_1: queue event 1,1320 SYNC
2010-05-25 11:14:29 UTC DEBUG2 remoteListenThread_1: queue event 1,1321 SYNC
2010-05-25 11:14:29 UTC DEBUG2 remoteWorkerThread_1: syncing set 1 with
30 table(s) from provider 1
2010-05-25 11:14:29 UTC DEBUG2 remoteListenThread_1: queue event 1,1322 SYNC
2010-05-25 11:14:29 UTC DEBUG2 remoteListenThread_1: queue event 1,1323 SYNC
2010-05-25 11:14:29 UTC DEBUG2 slon: child terminated status: 11; pid:
28735, current worker pid: 28735
2010-05-25 11:14:29 UTC DEBUG1 slon: restart of worker in 10 seconds
--------------------------

So I restarted /etc/init.d/slony1, and looked again.
The smaller DB all of a sudden worked fine, and caught up.
But the bigger DB just keeps getting these:
--------------------------
...
2010-05-25 11:16:00 UTC DEBUG2 remoteListenThread_1: queue event 1,2533 SYNC
2010-05-25 11:16:00 UTC DEBUG2 slon: child terminated status: 11; pid:
29027, current worker pid: 29027
2010-05-25 11:16:00 UTC DEBUG1 slon: restart of worker in 10 seconds
--------------------------
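
A small sketch for reading the status numbers, under the assumption that
slon logs the raw integer returned by wait() for the worker process. That
assumption is consistent with the "kill child" message followed by status 9
in the log of the smaller DB, but it is not confirmed here. The function
name describe_status is made up for illustration.

```python
import os
import signal

def describe_status(status):
    """Interpret an integer as a raw POSIX wait() status."""
    if os.WIFSIGNALED(status):
        sig = os.WTERMSIG(status)
        return f"killed by signal {sig} ({signal.Signals(sig).name})"
    if os.WIFEXITED(status):
        return f"exited with code {os.WEXITSTATUS(status)}"
    return "stopped or unknown"

# The two status values that appear in the slon logs.
for status in (9, 11):
    print(status, describe_status(status))
```

On Linux this reports 9 as SIGKILL and 11 as SIGSEGV, i.e. the worker of
the bigger DB would be dying from a segmentation fault each time.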

In syslog I see this on the slave (slon segfault errors every 10s):
--------------------------
May 25 11:16:30 semc-sh62 kernel: [20053518.436336] slon[29076]:
segfault at 273936 ip 00007fd69e8bac40 sp 00007fd69ad48698 error 4 in
libc-2.9.so[7fd69e83a000+168000]
May 25 11:16:40 semc-sh62 kernel: [20053528.548794] slon[29104]:
segfault at 273936 ip 00007f359f4f4c40 sp 00007f359b982698 error 4 in
libc-2.9.so[7f359f474000+168000]
--------------------------
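
A sketch that pulls the two syslog lines apart, to check whether both
crashes happen at the same place inside libc. It assumes the usual x86
meaning of the "error" bits (1 = page was present, 2 = write access,
4 = user mode); parse_segfault and the dictionary keys are made-up names.

```python
import re

SEGFAULT_RE = re.compile(
    r"(?P<prog>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>\d+) "
    r"in (?P<lib>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)

def parse_segfault(line):
    """Parse one kernel segfault message; return a dict or None."""
    # Collapse whitespace first, which also undoes the mail line wrapping.
    m = SEGFAULT_RE.search(" ".join(line.split()))
    if m is None:
        return None
    err = int(m.group("err"))
    ip, base = int(m.group("ip"), 16), int(m.group("base"), 16)
    return {
        "prog": m.group("prog"),
        "pid": int(m.group("pid")),
        "fault_addr": int(m.group("addr"), 16),
        "lib": m.group("lib"),
        "offset": ip - base,             # position of the crash inside the library
        "page_present": bool(err & 1),   # assumed x86 error bit 0
        "write": bool(err & 2),          # assumed x86 error bit 1
        "user_mode": bool(err & 4),      # assumed x86 error bit 2
    }

# The two messages from syslog, copied from above.
LINES = [
    "May 25 11:16:30 semc-sh62 kernel: [20053518.436336] slon[29076]: "
    "segfault at 273936 ip 00007fd69e8bac40 sp 00007fd69ad48698 error 4 in "
    "libc-2.9.so[7fd69e83a000+168000]",
    "May 25 11:16:40 semc-sh62 kernel: [20053528.548794] slon[29104]: "
    "segfault at 273936 ip 00007f359f4f4c40 sp 00007f359b982698 error 4 in "
    "libc-2.9.so[7f359f474000+168000]",
]

for line in LINES:
    info = parse_segfault(line)
    print(info["prog"], info["pid"], info["lib"], hex(info["offset"]))
```

Both lines give the same offset into libc-2.9.so (0x80c40) and the same
faulting address, so under the assumptions above it looks like the same
user-mode read of an unmapped address on every restart, not a random crash.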

What could cause this?
Looking at the size of /var/lib/postgresql/8.3/ I can see that it has
almost succeeded in replicating the DB, but something goes wrong just
before it finishes.

Slon loglevel is set to 4.
Are there any Slony sl_* tables I can look in for info?
I tried Google but only get 8 results for "child terminated status: 11" +
"restart of worker", and none of those provide a solution.

  wbr / Alexander