Wed Aug 20 03:15:20 PDT 2008
- Previous message: [Slony1-general] Partial replication
- Next message: [Slony1-general] Replication node suddenly lagging, CPU bound postmaster
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Hi everyone. I have a replicating node that suddenly started to lag, in a 4-node Slony cluster that had worked well for months. The node is powerful enough (older, slower machines in the cluster manage to keep up). Network and block devices are mostly idle (with regard to interrupts/second and throughput). Strangely, replication on this node seems CPU bound in the postmaster process doing the actual inserts/updates for slon (that postmaster process has been stuck at 99% CPU since the beginning of the problem). Neither Slony (at "slon -d2" level) nor PostgreSQL logged any warning or error message, and replication has not stopped on this node (it makes progress, but too slowly to keep up, so it is now 3 days behind the master). Any clue? Where should I look to better diagnose this problem (I may be missing the obvious; I'm very new to Slony and PostgreSQL)?

More details:

* Slony 1.2.13 on PostgreSQL 8.1.10
* "psql_replication_check.pl" and "test_slony_state" find no problems.
* A few hours before I noticed the problem, a tablespace on this node was almost (98%) full. I solved that by moving a large index to another tablespace (with "ALTER INDEX foo SET TABLESPACE bar;").
* Selects run as fast as usual on this node (i.e. faster than on some other nodes).
* According to pg_stat_activity, the only activity on the busy postmaster is the replication's inserts/updates, and there is no significant other activity running on this machine besides replication.
* No processes other than replication write to this node's database.
* The replicating/inserting postmaster process eats 99% of one CPU. Slon uses between 4% and 20% of another CPU (this is an 8-CPU machine).
* The block devices aren't busy at all (iostat shows very low block I/O, about 100KB to 500KB of reads and writes per second on a hardware RAID10 array of 15Krpm SCSI drives that can handle much more). Block devices are mostly idle on this node.
* Replication is not totally halted, it has just become very slow (the node's lag behind the master keeps increasing, but it makes progress). So I guess it's not locked rebuilding a large index or halted on some error condition. And I can see the running replication insert/update queries evolve over time (according to pg_stat_activity).
* This node is one of the fastest machines in the cluster (for both CPU and block devices), and the slower nodes have no problem keeping up with replication.
* The cluster had been working for months, and there was no activity peak the day it failed.
* We didn't change the replication set and didn't alter any replicated table.
* Stracing the working postmaster process, I didn't see any suspect/long syscalls that could explain the CPU usage or lag (just a few lseek, read, send, recv; since the block devices and network link are mostly idle, those shouldn't cause slowdowns). I guess the process burns CPU doing some internal heavy computation (?).
* I restarted the whole slon cluster twice, restarted this node's PostgreSQL, and even (in despair) rebooted the node.
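For the record, these are the kinds of queries I'm using to watch the busy backend and to rule out lock waits (column names as in PostgreSQL 8.1, where pg_stat_activity has procpid and current_query; 17450 is just the busy backend's pid taken from ps, adapt as needed):

```sql
-- What the busy backend (pid 17450) is executing right now:
SELECT procpid, usename, waiting, query_start, current_query
FROM pg_stat_activity
WHERE procpid = 17450;

-- Any ungranted locks that could be stalling the replication backend:
SELECT pid, locktype, relation::regclass AS rel, mode
FROM pg_locks
WHERE NOT granted;
```

So far the first query only ever shows the replication inserts/updates, and the second returns no rows, which is why I don't think this is a locking problem.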
Logs etc.:

* Example logs:

Aug 18 17:39:22 mynode cluster_myclust: 2008-08-18 17:39:22 CEST DEBUG2 remoteListenThread_2: LISTEN
Aug 18 17:39:22 mynode cluster_myclust: 2008-08-18 17:39:22 CEST DEBUG2 remoteWorkerThread_2: forward confirm 1,66023883 received by 2
Aug 18 17:39:23 mynode cluster_myclust: 2008-08-18 17:39:23 CEST DEBUG2 remoteWorkerThread_2: forward confirm 1,66023884 received by 2
Aug 18 17:39:23 mynode cluster_myclust: 2008-08-18 17:39:23 CEST DEBUG2 remoteWorkerThread_2: forward confirm 2,7658887 received by 1
Aug 18 17:39:23 mynode cluster_myclust: 2008-08-18 17:39:23 CEST DEBUG2 remoteWorkerThread_2: forward confirm 3,7751837 received by 1
Aug 18 17:39:23 mynode cluster_myclust: 2008-08-18 17:39:23 CEST DEBUG2 remoteWorkerThread_2: forward confirm 4,1550604 received by 1
Aug 18 17:39:23 mynode cluster_myclust: 2008-08-18 17:39:23 CEST DEBUG2 remoteListenThread_1: queue event 1,66023883 SYNC
Aug 18 17:39:23 mynode cluster_myclust: 2008-08-18 17:39:23 CEST DEBUG2 remoteListenThread_1: queue event 1,66023884 SYNC
Aug 18 17:39:23 mynode cluster_myclust: 2008-08-18 17:39:23 CEST DEBUG2 remoteListenThread_1: queue event 1,66023885 SYNC
Aug 18 17:39:23 mynode cluster_myclust: 2008-08-18 17:39:23 CEST DEBUG2 remoteListenThread_1: UNLISTEN
[...]
Aug 19 13:57:51 mynode cluster_myclust: 2008-08-19 13:57:51 CEST DEBUG2 remoteListenThread_1: queue event 1,65832742 SYNC
Aug 19 13:57:51 mynode cluster_myclust: 2008-08-19 13:57:51 CEST DEBUG2 remoteListenThread_1: queue event 1,65832743 SYNC
Aug 19 13:57:51 mynode cluster_myclust: 2008-08-19 13:57:51 CEST DEBUG2 remoteWorkerThread_1: Received event 1,65832742 SYNC
Aug 19 13:57:51 mynode cluster_myclust: 2008-08-19 13:57:51 CEST DEBUG2 remoteListenThread_1: queue event 1,65832744 SYNC
Aug 19 13:57:51 mynode cluster_myclust: 2008-08-19 13:57:51 CEST DEBUG2 calc sync size - last time: 1 last length: 1660 ideal: 36 proposed size: 3
Aug 19 13:57:51 mynode cluster_myclust: 2008-08-19 13:57:51 CEST DEBUG2 remoteListenThread_1: queue event 1,65832745 SYNC
Aug 19 13:57:51 mynode cluster_myclust: 2008-08-19 13:57:51 CEST DEBUG2 remoteListenThread_1: queue event 1,65832746 SYNC
[...]
Aug 19 13:57:51 mynode cluster_myclust: 2008-08-19 13:57:51 CEST DEBUG2 remoteListenThread_1: queue event 1,65832758 SYNC
Aug 19 13:57:51 mynode cluster_myclust: 2008-08-19 13:57:51 CEST DEBUG2 remoteListenThread_1: queue event 1,65832759 SYNC
Aug 19 13:57:51 mynode cluster_myclust: 2008-08-19 13:57:51 CEST DEBUG2 remoteWorkerThread_1: SYNC 65832744 processing

* strace -ttt -p 17450 :

1219073660.529469 _llseek(196, 17924096, [17924096], SEEK_SET) = 0
1219073660.529530 read(196, "p\v\0\0\0vke\1\0\0\0\344\0\20\33\360\37\3 \330\2370\0\270"..., 8192) = 8192
1219073660.531583 _llseek(214, 6086656, [6086656], SEEK_SET) = 0
1219073660.531680 read(214, "p\v\0\0\350oke\1\0\0\0<\2\340\v\0 \3 \260\233h\0|\233h"..., 8192) = 8192
1219073660.536787 send(6, "\3\0\0\0&\0\0\0\3\0\0\0*D\0\0<IDLE> in transa"..., 38, 0) = 38
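Note the multi-millisecond gaps between consecutive timestamps above, with disks and network idle: that time is presumably spent in userspace between syscalls. A rough way I've been totalling those gaps from a saved trace (strace.out is just where I redirected the strace output; with -ttt the first field of each line is an epoch timestamp in seconds):

```shell
# Total the time spent between consecutive syscalls (i.e. in userspace),
# from `strace -ttt` output where field 1 is an epoch timestamp in seconds.
awk 'NR > 1 { between += $1 - prev }
     { prev = $1 }
     END { printf "time between syscalls: %.3f s over %d lines\n", between, NR }' strace.out
```

On this backend almost all of the wall-clock time ends up between syscalls, which is consistent with the 99% CPU usage and the idle block devices.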