Steve Singer steve at ssinger.info
Wed Feb 28 19:24:28 PST 2018
On Mon, 26 Feb 2018, Tignor, Tom wrote:

>
> 	In the slony1 log of our primary host (the same one which later showed the bug) we had:
>
> 2018-02-21 19:19:49 UTC [22582] INFO   remoteWorkerThread_8: SYNC 5002075962 done in 1.725 seconds
> 2018-02-21 19:19:50 UTC [22582] CONFIG disableNode: no_id=8
>
> 	The timestamp of the problem event was somewhat earlier: 2018-02-21 
> 19:19:41.
> 	To me it looks like there is a race condition and despite the order 
> of the log events, the DROP NODE was processed first and the SYNC was 
> accepted later, and thus created an sl_event record. Of course, that’s 
> simply a guess.

This might be possible.

If the node being dropped is 8, and the event node for the DROP NODE is 4,
then some third node, say 3, might see a sequence like this:

remoteWorkerThread_8: starts processing the SYNC
remoteWorkerThread_4: processes the DROP NODE, updates the runtime config
remoteWorkerThread_8: finishes processing the SYNC and COMMITs

I'd feel more comfortable if we had a way of reproducing this though.

One option might be to have the remoteWorkerThread recheck the runtime 
config before the COMMIT to make sure that the node is still active.
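The interleaving and the proposed recheck can be sketched with a toy simulation. This is Python purely for illustration (the real slon workers are C threads), and `active_nodes` / `sl_event` below are hypothetical in-memory stand-ins for slon's runtime config and the sl_event table:

```python
import threading
import time

# Hypothetical stand-ins for slon's runtime config and the sl_event table.
active_nodes = set()
config_lock = threading.Lock()
sl_event = []


def process_sync(origin, seqno, recheck_before_commit):
    time.sleep(0.05)                      # SYNC work overlaps the DROP NODE below
    if recheck_before_commit:
        with config_lock:
            if origin not in active_nodes:
                return "aborted"          # node was dropped mid-SYNC: skip COMMIT
    sl_event.append((origin, seqno))      # COMMIT writes the sl_event row
    return "committed"


def drop_node(no_id):
    with config_lock:
        active_nodes.discard(no_id)       # dropNode_int also deletes sl_event rows


def run(recheck):
    active_nodes.update({4, 8})
    sl_event.clear()
    out = {}
    worker = threading.Thread(
        target=lambda: out.setdefault("r", process_sync(8, 5002075962, recheck)))
    worker.start()                        # remoteWorkerThread_8 starts the SYNC
    drop_node(8)                          # remoteWorkerThread_4 processes DROP NODE
    worker.join()
    return out["r"], len(sl_event)


print(run(False))   # today: the SYNC commits after the drop, leaving a stale row
print(run(True))    # with the recheck: the stale COMMIT is skipped
```

Without the recheck, the SYNC's COMMIT lands after the drop has already cleaned up, recreating the (8, 5002075962) row; with it, the worker notices the node is gone and skips the insert.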




>
> 	Tom    ☺
>
>
> On 2/26/18, 12:01 PM, "Steve Singer" <steve at ssinger.info> wrote:
>
>    On Mon, 26 Feb 2018, Tignor, Tom wrote:
>
>    >
>    > 	Thanks. I see the deletes added for sl_seqlog and sl_log_script. The
>    > constraint violation appearing in the errors was for sl_event. Do we
>    > expect these changes fully remove all state, including sl_event? The
>    > checkNodeDeleted function doesn’t look at sl_event.
>
>    It could be that this is a different issue, then.
>    That function (dropNode_int) should have removed the rows from sl_event.
>    The question then becomes: did it fail to remove them for some reason, or did
>    they get added back later, and if so, how?
>
>
>    > 	Tom    ☺
>    >
>    >
>    > On 2/26/18, 11:03 AM, "Steve Singer" <steve at ssinger.info> wrote:
>    >
>    >    On Mon, 26 Feb 2018, Tignor, Tom wrote:
>    >
>    >    You can get it from the github branch (latest commit) at
>    >    https://github.com/ssinger/slony1-engine/tree/bug375
>    >
>    >
>    >
>    >
>    >    >
>    >    > 	Steve,
>    >    > 	The patch link actually goes to a page which says “Bugzilla is down for maintenance.” Is there a way to see the patch currently? Does it exist or is it scheduled in some Slony-I release?
>    >    >
>    >    > 	Tom    ☺
>    >    >
>    >    >
>    >    > On 2/22/18, 6:06 PM, "Steve Singer" <steve at ssinger.info> wrote:
>    >    >
>    >    >    On Thu, 22 Feb 2018, Tignor, Tom wrote:
>    >    >
>    >    >    Looks like?
>    >    >    http://lists.slony.info/pipermail/slony1-general/2016-September/013331.html
>    >    >
>    >    >    I can't remember if that was what prompted
>    >    >    http://lists.slony.info/pipermail/slony1-hackers/2016-December/000560.html
>    >    >
>    >    >
>    >    >    https://github.com/ssinger/slony1-engine/tree/bug375
>    >    >
>    >    >    I can't seem to find a reason why this wasn't committed.
>    >    >
>    >    >
>    >    >    >
>    >    >    >
>    >    >    >
>    >    >    >                 Hello slony1 community,
>    >    >    >
>    >    >    >                 We have a head scratcher here. It appears a DROP NODE command was not fully processed. The
>    >    >    > command was issued and confirmed on all our nodes at approximately 2018-02-21 19:19:50 UTC. When we went to
>    >    >    > restore it over two hours later, all replication stopped on an sl_event constraint violation. Investigation
>    >    >    > showed a SYNC event for the dropped node with a timestamp of just a few seconds before the drop. I believe this
>    >    >    > is a first for us. The DROP NODE command is supposed to remove all state for the dropped node. Is that right? Is
>    >    >    > there a potential race condition somewhere which could leave behind state?
>    >    >    >
>    >    >    >                 Thanks in advance,
>    >    >    >
>    >    >    >
>    >    >    >
>    >    >    > ---- master log replication freeze error ----
>    >    >    >
>    >    >    > 2018-02-21 21:38:52 UTC [5775] ERROR  remoteWorkerThread_8: "insert into "_ams_cluster".sl_event (ev_origin, ev_seqno, ev_timestamp, ev_snapshot, ev_type) values ('8', '5002075962', '2018-02-21 19:19:41.958719+00', '87044110:87044110:', 'SYNC'); insert into "_ams_cluster".sl_confirm (con_origin, con_received, con_seqno, con_timestamp) values (8, 1, '5002075962', now()); select "_ams_cluster".logApplySaveStats('_ams_cluster', 8, '0.139 s'::interval); commit transaction;" PGRES_FATAL_ERROR ERROR:  duplicate key value violates unique constraint "sl_event-pkey"
>    >    >    >
>    >    >    > DETAIL:  Key (ev_origin, ev_seqno)=(8, 5002075962) already exists.
>    >    >    >
>    >    >    > 2018-02-21 21:38:52 UTC [13649] CONFIG slon: child terminated signal: 9; pid: 5775, current worker pid: 5775
>    >    >    >
>    >    >    > 2018-02-21 21:38:52 UTC [13649] CONFIG slon: restart of worker in 10 seconds
>    >    >    >
>    >    >    > ---- master log replication freeze error ----
>    >    >    >
>    >    >    >
>    >    >    >
>    >    >    >
>    >    >    >
>    >    >    > ---- master DB leftover event ----
>    >    >    >
>    >    >    > ams at ams6.cmb.netmgmt:~$ psql -U akamai -d ams
>    >    >    >
>    >    >    > psql (9.1.24)
>    >    >    >
>    >    >    > Type "help" for help.
>    >    >    >
>    >    >    >
>    >    >    >
>    >    >    > ams=# select * from sl_event_bak;
>    >    >    >
>    >    >    >  ev_origin |  ev_seqno  |         ev_timestamp          |    ev_snapshot     | ev_type | ev_data1 | ev_data2 | ev_data3 | ev_data4 | ev_data5 | ev_data6 | ev_data7 | ev_data8
>    >    >    > -----------+------------+-------------------------------+--------------------+---------+----------+----------+----------+----------+----------+----------+----------+----------
>    >    >    >          8 | 5002075962 | 2018-02-21 19:19:41.958719+00 | 87044110:87044110: | SYNC    |          |          |          |          |          |          |          |
>    >    >    >
>    >    >    > (1 row)
>    >    >    >
>    >    >    >
>    >    >    >
>    >    >    > ams=#
>    >    >    >
>    >    >    > ---- master DB leftover event ----
>    >    >    >
>    >    >    >
>    >    >    >
>    >    >    > ---- master log drop node record ----
>    >    >    >
>    >    >    > 2018-02-21 19:19:50 UTC [22582] CONFIG disableNode: no_id=8
>    >    >    >
>    >    >    > 2018-02-21 19:19:50 UTC [22582] CONFIG storeListen: li_origin=4 li_receiver=1 li_provider=4
>    >    >    >
>    >    >    > 2018-02-21 19:19:50 UTC [22582] CONFIG storeListen: li_origin=7 li_receiver=1 li_provider=7
>    >    >    >
>    >    >    > 2018-02-21 19:19:50 UTC [22582] CONFIG storeListen: li_origin=3 li_receiver=1 li_provider=3
>    >    >    >
>    >    >    > 2018-02-21 19:19:50 UTC [22582] CONFIG remoteWorkerThread_4: update provider configuration
>    >    >    >
>    >    >    > 2018-02-21 19:19:50 UTC [22582] CONFIG remoteWorkerThread_4: connection for provider 4 terminated
>    >    >    >
>    >    >    > 2018-02-21 19:19:50 UTC [22582] CONFIG remoteWorkerThread_4: disconnecting from data provider 4
>    >    >    >
>    >    >    > 2018-02-21 19:19:50 UTC [22582] CONFIG remoteWorkerThread_4: connection for provider 7 terminated
>    >    >    >
>    >    >    > ---- master log drop node record ----
>    >    >    >
>    >    >    >
>    >    >    >
>    >    >    > ---- replica log drop node record ----
>    >    >    >
>    >    >    > 2018-02-21 19:19:51 UTC [22650] WARN   remoteWorkerThread_1: got DROP NODE for local node ID
>    >    >    >
>    >    >    > NOTICE:  Slony-I: Please drop schema "_ams_cluster"
>    >    >    >
>    >    >    > 2018-02-21 19:19:53 UTC [22650] INFO   remoteWorkerThread_7: SYNC 5001868819 done in 2.153 seconds
>    >    >    >
>    >    >    > NOTICE:  drop cascades to 243 other objects
>    >    >    >
>    >    >    > DETAIL:  drop cascades to table _ams_cluster.sl_node
>    >    >    >
>    >    >    > drop cascades to table _ams_cluster.sl_nodelock
>    >    >    >
>    >    >    > drop cascades to table _ams_cluster.sl_set
>    >    >    >
>    >    >    > drop cascades to table _ams_cluster.sl_setsync
>    >    >    >
>    >    >    > drop cascades to table _ams_cluster.sl_table
>    >    >    >
>    >    >    > drop cascades to table _ams_cluster.sl_sequence
>    >    >    >
>    >    >    > ---- replica log drop node record ----
>    >    >    >
>    >    >    >
>    >    >    >
>    >    >    >
>    >    >    >
>    >    >    >                 Tom    ☺
>    >    >    >
>    >    >    >
>    >    >    >
>    >    >    >
>    >    >    >
>    >    >    >
>    >    >    >
>    >    >
>    >    >
>    >    >
>    >
>    >
>    >
>
>
>


More information about the Slony1-general mailing list