Tignor, Tom ttignor at akamai.com
Tue Jun 27 08:59:56 PDT 2017
Hello Slony-I community,

Hoping someone can advise on a strange and serious problem. We performed a Slony service failover yesterday and, for the first time ever, our FAILOVER operation errored out. We recently expanded our cluster to 7 consumers fed by a single provider. There are no load issues during normal operations. As the error output below shows, though, our node 4 and node 5 consumers never received the events they needed. Here’s where it gets weird: closer inspection has shown that the node 2->4 and node 2->5 path data went missing from the service at some point. That seems clearly to be the main issue, yet both node 4 and node 5 continued to find and process node 2 SYNC events for a full week! The logs show this continued across multiple restarts.

How can this happen? If missing path data stymies the failover, wouldn’t it also prevent normal SYNC processing?

When a failover is begun with inadequate path data, what’s the best resolution? Can path data be applied quickly to allow the failover to succeed?

Thanks in advance for any insights.


---- failover error ----

/tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: NOTICE:  calling restart node 1
/tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:55: 2017-06-26 18:33:02
executing preFailover(1,1) on 2
executing preFailover(1,1) on 3
executing preFailover(1,1) on 4
executing preFailover(1,1) on 5
executing preFailover(1,1) on 6
executing preFailover(1,1) on 7
executing preFailover(1,1) on 8
NOTICE: executing "_ams_cluster".failedNode2 on node 2
/tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting for event (2,5000061664).  node 8 only on event 5000061654, node 4 only on event 5000061654, node 5 only on event 5000061655, node 3 only on event 5000061662, node 6 only on event 5000061654, node 7 only on event 5000061656
/tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting for event (2,5000061664).  node 4 only on event 5000061657, node 5 only on event 5000061663, node 3 only on event 5000061663, node 6 only on event 5000061663
/tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting for event (2,5000061664).  node 4 only on event 5000061663, node 5 only on event 5000061663, node 6 only on event 5000061663
/tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting for event (2,5000061664).  node 4 only on event 5000061663, node 5 only on event 5000061663
/tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting for event (2,5000061664).  node 4 only on event 5000061663, node 5 only on event 5000061663
/tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting for event (2,5000061664).  node 4 only on event 5000061663, node 5 only on event 5000061663
/tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting for event (2,5000061664).  node 4 only on event 5000061663, node 5 only on event 5000061663
/tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting for event (2,5000061664).  node 4 only on event 5000061663, node 5 only on event 5000061663
/tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting for event (2,5000061664).  node 4 only on event 5000061663, node 5 only on event 5000061663
/tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting for event (2,5000061664).  node 4 only on event 5000061663, node 5 only on event 5000061663
/tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting for event (2,5000061664).  node 4 only on event 5000061663, node 5 only on event 5000061663
/tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting for event (2,5000061664).  node 4 only on event 5000061663, node 5 only on event 5000061663
/tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting for event (2,5000061664).  node 4 only on event 5000061663, node 5 only on event 5000061663
/tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting for event (2,5000061664).  node 4 only on event 5000061663, node 5 only on event 5000061663
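
For anyone reading the wait output above, here is a minimal Python sketch that summarizes how far each node is behind the awaited event. It assumes the line format matches exactly what slonik printed above; the function name `lagging_nodes` and the regular expressions are illustrative and not part of Slony-I.

```python
import re

# Matches fragments such as "node 4 only on event 5000061663".
NODE_EVENT = re.compile(r"node (\d+) only on event (\d+)")
# Matches the awaited event, e.g. "waiting for event (2,5000061664)".
WAIT_TARGET = re.compile(r"waiting for event \((\d+),(\d+)\)")


def lagging_nodes(line):
    """Return (origin, awaited_event, {node: events_behind}) for one
    slonik wait line, or None if the line is not a wait line."""
    target = WAIT_TARGET.search(line)
    if target is None:
        return None
    origin, wanted = int(target.group(1)), int(target.group(2))
    behind = {int(node): wanted - int(event)
              for node, event in NODE_EVENT.findall(line)}
    return origin, wanted, behind


# Sample taken from the repeated final line of the failover output.
sample = (
    "waiting for event (2,5000061664).  "
    "node 4 only on event 5000061663, node 5 only on event 5000061663"
)
print(lagging_nodes(sample))
```

Run against the last lines of the output, this reports nodes 4 and 5 each one event short of (2,5000061664), which is the point where the failover stalled.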


---- node 4 log archive ----

bos-mpt5c:odin-9353 ttignor$ egrep 'disableNode: no_id=2|storePath: pa_server=2 pa_client=4|restart notification' prod4/node4-pathconfig.out
2017-06-15 15:14:00 UTC [5688] INFO   localListenThread: got restart notification
2017-06-15 15:14:10 UTC [8431] CONFIG storePath: pa_server=2 pa_client=4 pa_conninfo="dbname=ams
2017-06-15 15:53:00 UTC [8431] INFO   localListenThread: got restart notification
2017-06-15 15:53:10 UTC [23701] CONFIG storePath: pa_server=2 pa_client=4 pa_conninfo="dbname=ams
2017-06-16 17:29:13 UTC [10253] CONFIG storePath: pa_server=2 pa_client=4 pa_conninfo="dbname=ams
2017-06-16 20:43:42 UTC [2707] CONFIG storePath: pa_server=2 pa_client=4 pa_conninfo="dbname=ams
2017-06-19 15:11:45 UTC [2707] CONFIG disableNode: no_id=2
2017-06-19 15:11:45 UTC [2707] INFO   localListenThread: got restart notification
2017-06-20 18:40:15 UTC [31224] INFO   localListenThread: got restart notification
2017-06-21 14:31:42 UTC [6253] INFO   localListenThread: got restart notification
2017-06-21 14:35:26 UTC [32367] INFO   localListenThread: got restart notification
2017-06-26 18:21:25 UTC [9278] INFO   localListenThread: got restart notification
2017-06-26 18:33:04 UTC [28839] INFO   localListenThread: got restart notification
2017-06-26 18:33:30 UTC [1785] INFO   localListenThread: got restart notification
bos-mpt5c:odin-9353 ttignor$


---- node 5 log archive ----

bos-mpt5c:odin-9353 ttignor$ egrep 'disableNode: no_id=2|storePath: pa_server=2 pa_client=5|restart notification' prod5/node5-pathconfig.out
2017-06-15 15:13:56 UTC [20700] INFO   localListenThread: got restart notification
2017-06-15 15:14:06 UTC [20374] CONFIG storePath: pa_server=2 pa_client=5 pa_conninfo="dbname=ams
2017-06-15 15:53:01 UTC [20374] INFO   localListenThread: got restart notification
2017-06-15 15:53:11 UTC [2859] CONFIG storePath: pa_server=2 pa_client=5 pa_conninfo="dbname=ams
2017-06-16 17:28:19 UTC [2859] INFO   localListenThread: got restart notification
2017-06-16 17:28:29 UTC [10753] CONFIG storePath: pa_server=2 pa_client=5 pa_conninfo="dbname=ams
2017-06-19 15:11:40 UTC [10753] CONFIG disableNode: no_id=2
2017-06-19 15:11:40 UTC [10753] INFO   localListenThread: got restart notification
2017-06-20 18:40:11 UTC [450] INFO   localListenThread: got restart notification
2017-06-21 14:31:41 UTC [22300] INFO   localListenThread: got restart notification
2017-06-21 14:35:28 UTC [26777] INFO   localListenThread: got restart notification
2017-06-26 18:21:27 UTC [28366] INFO   localListenThread: got restart notification
2017-06-26 18:33:04 UTC [29345] INFO   localListenThread: got restart notification
2017-06-26 18:33:27 UTC [1299] INFO   localListenThread: got restart notification
bos-mpt5c:odin-9353 ttignor$
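
The pattern in both archives (storePath entries stop once disableNode is logged on 2017-06-19) can be checked mechanically. The following is a rough Python sketch that assumes the slon log line format shown above; the helper name `path_not_restored` is hypothetical. It only reflects what the slon log recorded and says nothing about the actual contents of the path table.

```python
import re


def path_not_restored(log_lines, server, client):
    """Return True if the most recent 'disableNode' for `server` in the
    log is not followed by a 'storePath' entry for server -> client."""
    disable = re.compile(rf"disableNode: no_id={server}\b")
    store = re.compile(rf"storePath: pa_server={server} pa_client={client}\b")
    restored = True  # no disableNode seen yet, so nothing to restore
    for line in log_lines:
        if disable.search(line):
            restored = False
        elif store.search(line):
            restored = True
    return not restored


# Excerpt from the node 4 archive above.
node4 = [
    '2017-06-16 20:43:42 UTC [2707] CONFIG storePath: pa_server=2 pa_client=4 pa_conninfo="dbname=ams',
    "2017-06-19 15:11:45 UTC [2707] CONFIG disableNode: no_id=2",
    "2017-06-19 15:11:45 UTC [2707] INFO   localListenThread: got restart notification",
    "2017-06-20 18:40:15 UTC [31224] INFO   localListenThread: got restart notification",
]
print(path_not_restored(node4, server=2, client=4))
```

For the excerpt above the check returns True, matching what the egrep output shows by eye: restarts continue after the disableNode, but no further storePath for 2->4 is logged.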


            Tom    ☺

