Tignor, Tom ttignor at akamai.com
Wed Jun 28 05:21:32 PDT 2017
	Hi Steve,
	Thanks for the info. I was able to repro this problem in testing and saw that, as soon as I added the missing path back, the still-in-progress failover op continued and completed successfully.
	We do issue DROP NODE commands when we need to restore a replica from scratch, which did occur. However, the restore workflow should also issue STORE PATH commands between the new replica node and every other node, in both directions. I'm still investigating this.
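A minimal sketch of the kind of check that could confirm the restore workflow left every path in place. It assumes a full mesh is intended, so that every ordered (server, client) pair of nodes should have an sl_path entry. The function name `missing_paths` and the example node ids are illustrative only and are not taken from the restore tooling.

```python
from itertools import permutations


def missing_paths(node_ids, existing):
    """Return the (server, client) pairs absent from `existing`.

    Assumes a full mesh: every ordered pair of distinct nodes needs a path.
    """
    expected = set(permutations(node_ids, 2))
    return sorted(expected - set(existing))


# Hypothetical example: three nodes, with the 2->4 path lost.
nodes = [2, 3, 4]
have = [(2, 3), (3, 2), (3, 4), (4, 2), (4, 3)]
print(missing_paths(nodes, have))
```

The `existing` pairs would come from the (pa_server, pa_client) columns of sl_path on whichever node is being checked.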
	What still confuses me is the recurring “remoteWorkerThread_X: SYNC” output even though no path was configured. If the path is missing, how does slon continue to get SYNC events?

	Tom ☺


On 6/27/17, 5:04 PM, "Steve Singer" <steve at ssinger.info> wrote:

    On 06/27/2017 11:59 AM, Tignor, Tom wrote:
    
    
    The disableNode() in the log makes it look like someone did a DROP NODE.
    
    If the only issue is that you're missing active paths in sl_path, you can
    add/update the paths with slonik.
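As a rough illustration of that suggestion, the sketch below builds slonik STORE PATH statements for a list of missing (server, client) pairs. The helper name `store_path_statements` and the conninfo placeholder are hypothetical; the real conninfo strings are truncated in the logs quoted below and are not reproduced here. A complete slonik script would also need the usual preamble (cluster name and admin conninfo lines), which is omitted.

```python
def store_path_statements(pairs, conninfo_by_server):
    """Build one slonik STORE PATH statement per (server, client) pair.

    `conninfo_by_server` maps a server node id to the conninfo string
    that clients should use to reach it (placeholder values here).
    """
    statements = []
    for server, client in pairs:
        conninfo = conninfo_by_server[server]
        statements.append(
            f"store path (server = {server}, client = {client}, "
            f"conninfo = '{conninfo}');"
        )
    return statements


# Placeholder conninfo: substitute the real connection string for node 2.
conninfo = {2: "PLACEHOLDER_CONNINFO_NODE2"}
for line in store_path_statements([(2, 4), (2, 5)], conninfo):
    print(line)
```

Generating the statements from a list of pairs keeps the fix reviewable before anything is run against the cluster.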
    
    
    
    
    > Hello Slony-I community,
    >
    >              Hoping someone can advise on a strange and serious problem.
    > We performed a slony service failover yesterday. For the first time
    > ever, our slony service FAILOVER op errored out. We recently expanded
    > our cluster to 7 consumers from a single provider. There are no load
    > issues during normal operations. As the error output below shows,
    > though, our node 4 and node 5 consumers never got the events they
    > needed. Here’s where it gets weird: closer inspection has shown that
    > node 2->4 and node 2->5 path data went missing out of the service at
    > some point. It seems clear that’s the main issue, but in spite of that,
    > both node 4 and node 5 continued to find and process node 2 SYNC events
    > for a full week! The logs show this happened in spite of multiple restarts.
    >
    > How can this happen? If missing path data stymies the failover, wouldn’t
    > it also prevent normal SYNC processing?
    >
    > In the case where a failover is begun with inadequate path data, what’s
    > the best resolution? Can path data be quickly applied to allow failover
    > to succeed?
    >
    >              Thanks in advance for any insights.
    >
    > ---- failover error ----
    >
    > /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: NOTICE:
    > calling restart node 1
    >
    > /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:55:
    > 2017-06-26 18:33:02
    >
    > executing preFailover(1,1) on 2
    >
    > executing preFailover(1,1) on 3
    >
    > executing preFailover(1,1) on 4
    >
    > executing preFailover(1,1) on 5
    >
    > executing preFailover(1,1) on 6
    >
    > executing preFailover(1,1) on 7
    >
    > executing preFailover(1,1) on 8
    >
    > NOTICE: executing "_ams_cluster".failedNode2 on node 2
    >
    > /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting
    > for event (2,5000061664).  node 8 only on event 5000061654, node 4 only
    > on event 5000061654, node 5 only on event 5000061655, node 3 only on
    > event 5000061662, node 6 only on event 5000061654, node 7 only on event
    > 5000061656
    >
    > /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting
    > for event (2,5000061664).  node 4 only on event 5000061657, node 5 only
    > on event 5000061663, node 3 only on event 5000061663, node 6 only on
    > event 5000061663
    >
    > /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting
    > for event (2,5000061664).  node 4 only on event 5000061663, node 5 only
    > on event 5000061663, node 6 only on event 5000061663
    >
    > /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting
    > for event (2,5000061664).  node 4 only on event 5000061663, node 5 only
    > on event 5000061663
    >
    > /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting
    > for event (2,5000061664).  node 4 only on event 5000061663, node 5 only
    > on event 5000061663
    >
    > /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting
    > for event (2,5000061664).  node 4 only on event 5000061663, node 5 only
    > on event 5000061663
    >
    > /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting
    > for event (2,5000061664).  node 4 only on event 5000061663, node 5 only
    > on event 5000061663
    >
    > /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting
    > for event (2,5000061664).  node 4 only on event 5000061663, node 5 only
    > on event 5000061663
    >
    > /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting
    > for event (2,5000061664).  node 4 only on event 5000061663, node 5 only
    > on event 5000061663
    >
    > /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting
    > for event (2,5000061664).  node 4 only on event 5000061663, node 5 only
    > on event 5000061663
    >
    > /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting
    > for event (2,5000061664).  node 4 only on event 5000061663, node 5 only
    > on event 5000061663
    >
    > /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting
    > for event (2,5000061664).  node 4 only on event 5000061663, node 5 only
    > on event 5000061663
    >
    > /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting
    > for event (2,5000061664).  node 4 only on event 5000061663, node 5 only
    > on event 5000061663
    >
    > /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting
    > for event (2,5000061664).  node 4 only on event 5000061663, node 5 only
    > on event 5000061663
    >
    > ---- node 4 log archive ----
    >
    > bos-mpt5c:odin-9353 ttignor$ egrep 'disableNode: no_id=2|storePath:
    > pa_server=2 pa_client=4|restart notification' prod4/node4-pathconfig.out
    >
    > 2017-06-15 15:14:00 UTC [5688] INFO   localListenThread: got restart
    > notification
    >
    > 2017-06-15 15:14:10 UTC [8431] CONFIG storePath: pa_server=2 pa_client=4
    > pa_conninfo="dbname=ams
    >
    > 2017-06-15 15:53:00 UTC [8431] INFO   localListenThread: got restart
    > notification
    >
    > 2017-06-15 15:53:10 UTC [23701] CONFIG storePath: pa_server=2
    > pa_client=4 pa_conninfo="dbname=ams
    >
    > 2017-06-16 17:29:13 UTC [10253] CONFIG storePath: pa_server=2
    > pa_client=4 pa_conninfo="dbname=ams
    >
    > 2017-06-16 20:43:42 UTC [2707] CONFIG storePath: pa_server=2 pa_client=4
    > pa_conninfo="dbname=ams
    >
    > 2017-06-19 15:11:45 UTC [2707] CONFIG disableNode: no_id=2
    >
    > 2017-06-19 15:11:45 UTC [2707] INFO   localListenThread: got restart
    > notification
    >
    > 2017-06-20 18:40:15 UTC [31224] INFO   localListenThread: got restart
    > notification
    >
    > 2017-06-21 14:31:42 UTC [6253] INFO   localListenThread: got restart
    > notification
    >
    > 2017-06-21 14:35:26 UTC [32367] INFO   localListenThread: got restart
    > notification
    >
    > 2017-06-26 18:21:25 UTC [9278] INFO   localListenThread: got restart
    > notification
    >
    > 2017-06-26 18:33:04 UTC [28839] INFO   localListenThread: got restart
    > notification
    >
    > 2017-06-26 18:33:30 UTC [1785] INFO   localListenThread: got restart
    > notification
    >
    > bos-mpt5c:odin-9353 ttignor$
    >
    > ---- node 5 log archive ----
    >
    > bos-mpt5c:odin-9353 ttignor$ egrep 'disableNode: no_id=2|storePath:
    > pa_server=2 pa_client=5|restart notification' prod5/node5-pathconfig.out
    >
    > 2017-06-15 15:13:56 UTC [20700] INFO   localListenThread: got restart
    > notification
    >
    > 2017-06-15 15:14:06 UTC [20374] CONFIG storePath: pa_server=2
    > pa_client=5 pa_conninfo="dbname=ams
    >
    > 2017-06-15 15:53:01 UTC [20374] INFO   localListenThread: got restart
    > notification
    >
    > 2017-06-15 15:53:11 UTC [2859] CONFIG storePath: pa_server=2 pa_client=5
    > pa_conninfo="dbname=ams
    >
    > 2017-06-16 17:28:19 UTC [2859] INFO   localListenThread: got restart
    > notification
    >
    > 2017-06-16 17:28:29 UTC [10753] CONFIG storePath: pa_server=2
    > pa_client=5 pa_conninfo="dbname=ams
    >
    > 2017-06-19 15:11:40 UTC [10753] CONFIG disableNode: no_id=2
    >
    > 2017-06-19 15:11:40 UTC [10753] INFO   localListenThread: got restart
    > notification
    >
    > 2017-06-20 18:40:11 UTC [450] INFO   localListenThread: got restart
    > notification
    >
    > 2017-06-21 14:31:41 UTC [22300] INFO   localListenThread: got restart
    > notification
    >
    > 2017-06-21 14:35:28 UTC [26777] INFO   localListenThread: got restart
    > notification
    >
    > 2017-06-26 18:21:27 UTC [28366] INFO   localListenThread: got restart
    > notification
    >
    > 2017-06-26 18:33:04 UTC [29345] INFO   localListenThread: got restart
    > notification
    >
    > 2017-06-26 18:33:27 UTC [1299] INFO   localListenThread: got restart
    > notification
    >
    > bos-mpt5c:odin-9353 ttignor$
    >
    >              Tom ☺
    >
    >
    >
    > _______________________________________________
    > Slony1-general mailing list
    > Slony1-general at lists.slony.info
    > http://lists.slony.info/mailman/listinfo/slony1-general
    >
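The node 4 and node 5 log excerpts above can be read mechanically: find the last CONFIG line that touches a given path and see whether it was a storePath or a disableNode. The sketch below does that for unwrapped slon log lines. It assumes that a disableNode for the server with no later storePath means the path is no longer configured, which is an interpretation of the logs rather than a documented guarantee. The function name `path_state` is illustrative.

```python
import re

# Matches the two CONFIG line shapes seen in the excerpts above.
EVENT = re.compile(
    r"CONFIG\s+(?:storePath:\s+pa_server=(\d+)\s+pa_client=(\d+)"
    r"|disableNode:\s+no_id=(\d+))"
)


def path_state(log_text, server, client):
    """Return 'stored', 'disabled', or None from the last relevant CONFIG line."""
    state = None
    for match in EVENT.finditer(log_text):
        srv, cli, disabled = match.groups()
        if disabled is not None:
            # disableNode on the server side overrides any earlier storePath.
            if int(disabled) == server:
                state = "disabled"
        elif (int(srv), int(cli)) == (server, client):
            state = "stored"
    return state


# Two lines from the node 4 excerpt, joined back onto single lines.
sample = """\
2017-06-16 20:43:42 UTC [2707] CONFIG storePath: pa_server=2 pa_client=4 pa_conninfo="dbname=ams
2017-06-19 15:11:45 UTC [2707] CONFIG disableNode: no_id=2
"""
print(path_state(sample, 2, 4))
```

Run against the full node 4 and node 5 logs, a result of 'disabled' would match the observation that no storePath for 2->4 or 2->5 appears after 2017-06-19.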
    
    


