Wed Feb 28 19:24:28 PST 2018
On Mon, 26 Feb 2018, Tignor, Tom wrote:

> In the slony1 log of our primary host (the same one which later showed
> the bug) we had:
>
> 2018-02-21 19:19:49 UTC [22582] INFO remoteWorkerThread_8: SYNC 5002075962 done in 1.725 seconds
> 2018-02-21 19:19:50 UTC [22582] CONFIG disableNode: no_id=8
>
> The timestamp of the problem event was somewhat earlier: 2018-02-21 19:19:41.
> To me it looks like there is a race condition and despite the order of
> the log events, the DROP NODE was processed first and the SYNC was
> accepted later, and thus created an sl_event record. Of course, that’s
> simply a guess.

This might be possible. If the node being dropped is 8, and the event node
for the DROP NODE is 4, then some third node, say 3, might see a sequence
like this:

  remoteWorkerThread_8: starts processing the SYNC
  remoteWorkerThread_4: processes the DROP NODE, updates the runtime config
  remoteWorkerThread_8: finishes processing the SYNC and COMMITs

I'd feel more comfortable if we had a way of reproducing this, though. One
option might be to have the remoteWorkerThread recheck the runtime config
before the COMMIT to make sure that the node is still active.

> Tom (
>
> On 2/26/18, 12:01 PM, "Steve Singer" <steve at ssinger.info> wrote:
>
> On Mon, 26 Feb 2018, Tignor, Tom wrote:
>
> > Thanks. I see the deletes added for sl_seqlog and sl_log_script. The
> > constraint violation appearing in the errors was for sl_event. Do we
> > expect these changes fully remove all state, including sl_event? The
> > checkNodeDeleted function doesn’t look at sl_event.
>
> It could be that this is a different issue then.
> That function (dropNode_int) should have removed the rows from sl_event.
> The question then becomes: did it not remove them for some reason, or did
> they get added back later, and if so, how?
>
> > Tom (
> >
> > On 2/26/18, 11:03 AM, "Steve Singer" <steve at ssinger.info> wrote:
> >
> > On Mon, 26 Feb 2018, Tignor, Tom wrote:
> >
> > You can get it from the github branch (latest commit) at
> > https://github.com/ssinger/slony1-engine/tree/bug375
> >
> > > Steve,
> > > The patch link actually goes to a page which says “Bugzilla is down
> > > for maintenance.” Is there a way to see the patch currently? Does it
> > > exist or is it scheduled in some Slony-I release?
> > >
> > > Tom (
> > >
> > > On 2/22/18, 6:06 PM, "Steve Singer" <steve at ssinger.info> wrote:
> > >
> > > On Thu, 22 Feb 2018, Tignor, Tom wrote:
> > >
> > > Looks like?
> > > http://lists.slony.info/pipermail/slony1-general/2016-September/013331.html
> > >
> > > I can't remember if that was what prompted
> > > http://lists.slony.info/pipermail/slony1-hackers/2016-December/000560.html
> > >
> > > https://github.com/ssinger/slony1-engine/tree/bug375
> > >
> > > I can't seem to find a reason why this wasn't committed.
> > >
> > > > Hello slony1 community,
> > > > We have a head scratcher here. It appears a DROP NODE command was
> > > > not fully processed. The command was issued and confirmed on all
> > > > our nodes at approximately 2018-02-21 19:19:50 UTC. When we went to
> > > > restore it over two hours later, all replication stopped on an
> > > > sl_event constraint violation. Investigation showed a SYNC event
> > > > for the dropped node with a timestamp of just a few seconds before
> > > > the drop. I believe this is a first for us. The DROP NODE command
> > > > is supposed to remove all state for the dropped node. Is that
> > > > right? Is there a potential race condition somewhere which could
> > > > leave behind state?
> > > > Thanks in advance,
> > > >
> > > > ---- master log replication freeze error ----
> > > > 2018-02-21 21:38:52 UTC [5775] ERROR remoteWorkerThread_8: "insert into "_ams_cluster".sl_event (ev_origin, ev_seqno, ev_timestamp, ev_snapshot, ev_type ) values ('8', '5002075962', '2018-02-21 19:19:41.958719+00', '87044110:87044110:', 'SYNC'); insert into "_ams_cluster".sl_confirm (con_origin, con_received, con_seqno, con_timestamp) values (8, 1, '5002075962', now()); select "_ams_cluster".logApplySaveStats('_ams_cluster', 8, '0.139 s'::interval); commit transaction;" PGRES_FATAL_ERROR ERROR: duplicate key value violates unique constraint "sl_event-pkey"
> > > > DETAIL: Key (ev_origin, ev_seqno)=(8, 5002075962) already exists.
> > > > 2018-02-21 21:38:52 UTC [13649] CONFIG slon: child terminated signal: 9; pid: 5775, current worker pid: 5775
> > > > 2018-02-21 21:38:52 UTC [13649] CONFIG slon: restart of worker in 10 seconds
> > > > ---- master log replication freeze error ----
> > > >
> > > > ---- master DB leftover event ----
> > > > ams at ams6.cmb.netmgmt:~$ psql -U akamai -d ams
> > > > psql (9.1.24)
> > > > Type "help" for help.
> > > >
> > > > ams=# select * from sl_event_bak;
> > > >  ev_origin |  ev_seqno  |         ev_timestamp          |    ev_snapshot     | ev_type | ev_data1 | ev_data2 | ev_data3 | ev_data4 | ev_data5 | ev_data6 | ev_data7 | ev_data8
> > > > -----------+------------+-------------------------------+--------------------+---------+----------+----------+----------+----------+----------+----------+----------
> > > >          8 | 5002075962 | 2018-02-21 19:19:41.958719+00 | 87044110:87044110: | SYNC    |          |          |          |          |          |          |          |
> > > > (1 row)
> > > >
> > > > ams=#
> > > > ---- master DB leftover event ----
> > > >
> > > > ---- master log drop node record ----
> > > > 2018-02-21 19:19:50 UTC [22582] CONFIG disableNode: no_id=8
> > > > 2018-02-21 19:19:50 UTC [22582] CONFIG storeListen: li_origin=4 li_receiver=1 li_provider=4
> > > > 2018-02-21 19:19:50 UTC [22582] CONFIG storeListen: li_origin=7 li_receiver=1 li_provider=7
> > > > 2018-02-21 19:19:50 UTC [22582] CONFIG storeListen: li_origin=3 li_receiver=1 li_provider=3
> > > > 2018-02-21 19:19:50 UTC [22582] CONFIG remoteWorkerThread_4: update provider configuration
> > > > 2018-02-21 19:19:50 UTC [22582] CONFIG remoteWorkerThread_4: connection for provider 4 terminated
> > > > 2018-02-21 19:19:50 UTC [22582] CONFIG remoteWorkerThread_4: disconnecting from data provider 4
> > > > 2018-02-21 19:19:50 UTC [22582] CONFIG remoteWorkerThread_4: connection for provider 7 terminated
> > > > ---- master log drop node record ----
> > > >
> > > > ---- replica log drop node record ----
> > > > 2018-02-21 19:19:51 UTC [22650] WARN remoteWorkerThread_1: got DROP NODE for local node ID
> > > > NOTICE: Slony-I: Please drop schema "_ams_cluster"
> > > > 2018-02-21 19:19:53 UTC [22650] INFO remoteWorkerThread_7: SYNC 5001868819 done in 2.153 seconds
> > > > NOTICE: drop cascades to 243 other objects
> > > > DETAIL: drop cascades to table _ams_cluster.sl_node
> > > > drop cascades to table _ams_cluster.sl_nodelock
> > > > drop cascades to table _ams_cluster.sl_set
> > > > drop cascades to table _ams_cluster.sl_setsync
> > > > drop cascades to table _ams_cluster.sl_table
> > > > drop cascades to table _ams_cluster.sl_sequence
> > > > ---- replica log drop node record ----
> > > >
> > > > Tom ☺
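
To make the suggested safeguard concrete: expressed at the SQL level, the
pre-COMMIT recheck Steve describes could look roughly like the sketch below.
This is only an illustration of the idea, not the actual slon code (the real
check would live in the C remoteWorkerThread and its in-memory runtime
configuration); the schema name "_ams_cluster" and node id 8 are taken from
the logs in this thread.

    -- Sketch: run inside the still-open SYNC transaction, immediately
    -- before "commit transaction;". Under READ COMMITTED this statement
    -- sees a DROP NODE committed by another connection after the SYNC
    -- transaction began.
    select no_id
      from "_ams_cluster".sl_node
     where no_id = 8;

    -- No row returned means the node was dropped mid-SYNC: roll back
    -- instead of committing, so no sl_event / sl_confirm rows are left
    -- behind for the dropped node.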
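
As for the leftover state itself, the stale row Tom found (preserved in
sl_event_bak above) can be located, and cleared before the node is re-added,
with plain SQL along these lines; the column names match the psql output
above, and the key values are the ones from the duplicate-key error:

    -- Events that survived DROP NODE for origin 8; dropNode_int is
    -- expected to have deleted these.
    select ev_origin, ev_seqno, ev_timestamp, ev_type
      from "_ams_cluster".sl_event
     where ev_origin = 8;

    -- Removing the stale row clears the duplicate-key failure on
    -- "sl_event-pkey" for key (8, 5002075962).
    delete from "_ams_cluster".sl_event
     where ev_origin = 8
       and ev_seqno = 5002075962;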