Tue May 25 04:19:55 PDT 2010
- Previous message: [Slony1-general] Upgrade from 2.0.3
- Next message: [Slony1-general] child terminated status: 11 -> restart of worker in 10 seconds
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Hello!

I have 6 pairs of servers which use Slony (1.2.15) with PostgreSQL 8.3.7 on Ubuntu 9.04. Each pair is a master/slave combination that replicates 2 DBs, one smaller and one bigger (on disk they total 16 GB on the masters).

The system is a high-availability setup with automatic failovers. Failovers have occurred, after which the master and slave servers switched places, so to speak, and replication was (manually) restarted in the other direction. That worked fine until just recently, when 1 server pair failed to replicate.

First I noticed "Too many open files" errors on the new slave (on the bigger DB) just before it had managed to replicate all data.
--------------------------
2010-05-25 07:06:08 UTC ERROR slon_connectdb: PQconnectdb("dbname=BIGDB host=10.10.130.51 user=postgres password=pass") failed - could not create socket: Too many open files
2010-05-25 07:06:08 UTC ERROR remoteWorkerThread_1: cannot connect to data provider 1 on 'dbname=BIGDB host=10.10.130.51 user=postgres password=pass'
2010-05-25 07:06:08 UTC DEBUG1 slon: shutdown requested
2010-05-25 07:06:08 UTC DEBUG2 slon: notify worker process to shutdown
2010-05-25 07:06:08 UTC INFO remoteListenThread_1: disconnecting from 'dbname=BIGDB host=10.10.130.51 user=postgres password=pass'
--------------------------
The .51 IP is the slave's IP. So I restarted the system and hoped for the best.
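One way to check whether a slon process is approaching its descriptor limit before the "Too many open files" error appears is to compare the number of entries under /proc/<pid>/fd with the "Max open files" row of /proc/<pid>/limits. The following is a minimal Python sketch, assuming Linux with /proc mounted; the function names are hypothetical and not part of Slony, and reading another user's process normally requires root or the same user.

```python
import os


def count_open_fds(pid="self"):
    """Count the open file descriptors of a process via /proc (Linux only)."""
    return len(os.listdir(f"/proc/{pid}/fd"))


def parse_soft_nofile(limits_text):
    """Extract the soft 'Max open files' limit from /proc/<pid>/limits text.

    Returns an int, or None if the limit is 'unlimited' or the row is missing.
    """
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            # Row layout: Max open files <soft> <hard> files
            soft = line.split()[3]
            return None if soft == "unlimited" else int(soft)
    return None


def soft_nofile_limit(pid="self"):
    """Read the soft open-files limit of a process from /proc/<pid>/limits."""
    with open(f"/proc/{pid}/limits") as handle:
        return parse_soft_nofile(handle.read())


if __name__ == "__main__":
    # Replace "self" with the slon worker's pid to inspect that process.
    print("open fds:", count_open_fds("self"))
    print("soft limit:", soft_nofile_limit("self"))
```

Running this periodically against the slon pid would show whether the descriptor count climbs steadily, which would point to a leak, or simply sits near a low limit.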
For the smaller DB I saw this on the slave:
--------------------------
2010-05-25 07:33:39 UTC DEBUG1 slon: child termination timeout - kill child
2010-05-25 07:33:39 UTC DEBUG2 slon: child terminated status: 9; pid: 30273, current worker pid: 30273
2010-05-25 07:33:39 UTC DEBUG1 slon: done
2010-05-25 07:33:39 UTC DEBUG2 slon: remove pid file
2010-05-25 07:33:39 UTC DEBUG2 slon: exit(0)
--------------------------
For the bigger DB I saw this on the slave:
--------------------------
2010-05-25 11:14:29 UTC DEBUG2 remoteListenThread_1: queue event 1,1320 SYNC
2010-05-25 11:14:29 UTC DEBUG2 remoteListenThread_1: queue event 1,1321 SYNC
2010-05-25 11:14:29 UTC DEBUG2 remoteWorkerThread_1: syncing set 1 with 30 table(s) from provider 1
2010-05-25 11:14:29 UTC DEBUG2 remoteListenThread_1: queue event 1,1322 SYNC
2010-05-25 11:14:29 UTC DEBUG2 remoteListenThread_1: queue event 1,1323 SYNC
2010-05-25 11:14:29 UTC DEBUG2 slon: child terminated status: 11; pid: 28735, current worker pid: 28735
2010-05-25 11:14:29 UTC DEBUG1 slon: restart of worker in 10 seconds
--------------------------
So I restarted /etc/init.d/slony1 and looked again. The smaller DB all of a sudden worked fine and caught up. But the bigger DB just keeps getting these:
--------------------------
...
2010-05-25 11:16:00 UTC DEBUG2 remoteListenThread_1: queue event 1,2533 SYNC
2010-05-25 11:16:00 UTC DEBUG2 slon: child terminated status: 11; pid: 29027, current worker pid: 29027
2010-05-25 11:16:00 UTC DEBUG1 slon: restart of worker in 10 seconds
--------------------------
In syslog I see this on the slave (slon segfault errors every 10 seconds):
--------------------------
May 25 11:16:30 semc-sh62 kernel: [20053518.436336] slon[29076]: segfault at 273936 ip 00007fd69e8bac40 sp 00007fd69ad48698 error 4 in libc-2.9.so[7fd69e83a000+168000]
May 25 11:16:40 semc-sh62 kernel: [20053528.548794] slon[29104]: segfault at 273936 ip 00007f359f4f4c40 sp 00007f359b982698 error 4 in libc-2.9.so[7f359f474000+168000]
--------------------------
What could cause this? Looking at the size of /var/lib/postgresql/8.3/ I can see that it has almost succeeded in replicating the DB, but something is going boink.

Slon loglevel is set to 4. Are there any Slony sl_* tables I can look in for info?

I tried Google but only get 8 results for "child terminated status: 11" + "restart of worker", and none of those provide a solution.

wbr / Alexander
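The kernel segfault lines carry enough information to tell whether every crash hits the same instruction: subtracting the library's load address (the value in brackets before the "+") from the instruction pointer gives an offset inside libc-2.9.so that does not depend on where the library was mapped. The sketch below is a minimal Python example of that arithmetic. It assumes the x86 page-fault error-code bits (1 = page present, 2 = write, 4 = user mode) and assumes that the number slon logs as "child terminated status" is the raw wait() status; the function names are hypothetical.

```python
import os
import re
import signal

# Matches the kernel's "segfault at ..." message as it appears in syslog.
SEGFAULT_RE = re.compile(
    r"(?P<prog>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>\d+) "
    r"in (?P<lib>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def decode_segfault(line):
    """Return the decoded fields of a kernel segfault line, or None."""
    match = SEGFAULT_RE.search(line)
    if match is None:
        return None
    err = int(match["err"])
    return {
        "pid": int(match["pid"]),
        "library": match["lib"],
        # Offset of the faulting instruction inside the library image.
        "offset": hex(int(match["ip"], 16) - int(match["base"], 16)),
        "fault_addr": hex(int(match["addr"], 16)),
        "access": "write" if err & 2 else "read",
        "mode": "user" if err & 4 else "kernel",
        "page_present": bool(err & 1),
    }


def decode_wait_status(status):
    """Name the signal that killed a child, assuming a raw wait() status."""
    if os.WIFSIGNALED(status):
        return signal.Signals(os.WTERMSIG(status)).name
    return None


if __name__ == "__main__":
    line = (
        "May 25 11:16:30 semc-sh62 kernel: [20053518.436336] slon[29076]: "
        "segfault at 273936 ip 00007fd69e8bac40 sp 00007fd69ad48698 error 4 "
        "in libc-2.9.so[7fd69e83a000+168000]"
    )
    print(decode_segfault(line))
    print(decode_wait_status(11), decode_wait_status(9))
```

Applied to the two syslog lines above, both crashes resolve to offset 0x80c40 in libc-2.9.so with the same faulting address 0x273936, a user-mode read of an unmapped page. Under the stated assumption, status 11 corresponds to SIGSEGV and status 9 to SIGKILL, which is consistent with the "kill child" message logged for the smaller DB.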