Mon Jul 24 07:47:22 PDT 2017
- Previous message: [Slony1-general] more missing paths, and <event_pending>
- Next message: [Slony1-general] Slony 2.2.6 release plans
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Hi again,
I’ve made some progress here on my own. Checking the DBs of the nodes that were not hearing from node 4, I found they had sl_event and sl_confirm entries for node 4 at sequence number 5000071346, dated 5 days ago. The node 4 DB itself had its sl_event_seq sequence at only 5000040947. It seems clear the bad state in the other nodes was left over from before my last node 4 restore, so node 4's new, lower-numbered events presumably looked already-seen to them. My solution was to advance the node 4 sequence to 5000071347. As soon as I did, I saw new node 4 SEQ events accumulating in the other nodes' sl_event tables. After that, store path ops worked fine.
Seems like this could be useful to others. Is there a bug fix or doc update to derive from this? Let me know if I should write something up more formally or open a ticket.
ams=# select * from _ams_cluster.sl_event where ev_origin = 4;
 ev_origin |  ev_seqno  |         ev_timestamp          |    ev_snapshot     | ev_type | ev_data1 | ev_data2 | ev_data3 | ev_data4 | ev_data5 | ev_data6 | ev_data7 | ev_data8
-----------+------------+-------------------------------+--------------------+---------+----------+----------+----------+----------+----------+----------+----------+----------
         4 | 5000071346 | 2017-07-19 20:27:26.418196+00 | 15346449:15346449: | SYNC    |          |          |          |          |          |          |          |
(1 row)
ams=#
ams=# select * from _ams_cluster.sl_confirm where con_origin = 4;
 con_origin | con_received | con_seqno  |         con_timestamp
------------+--------------+------------+-------------------------------
          4 |            6 | 5000071346 | 2017-07-19 20:35:33.504667+00
          4 |            3 | 5000071346 | 2017-07-19 20:29:09.763466+00
          4 |            9 | 5000071346 | 2017-07-19 20:29:22.496843+00
          4 |            8 | 5000071346 | 2017-07-19 20:27:27.9303+00
          4 |            1 | 5000071346 | 2017-07-19 20:27:26.705526+00
          4 |            7 | 5000071346 | 2017-07-20 18:04:01.978874+00
(6 rows)
ams=#
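For anyone hitting the same state, the arithmetic behind the fix can be sketched roughly as follows. This is only an illustration in Python, not a Slony-provided tool or a documented recovery procedure: the function names are hypothetical, and the generated statement assumes PostgreSQL's standard setval(regclass, bigint) applied to the _ams_cluster.sl_event_seq sequence mentioned above. Keep in mind that after setval(seq, n) the next nextval() returns n + 1.

```python
def stale_seqno_target(local_seq, remote_seqnos):
    """Return the value to advance the origin's event sequence to,
    or None if no remote node holds a seqno ahead of the origin.

    local_seq     -- current value of the origin's sl_event_seq
    remote_seqnos -- ev_seqno / con_seqno values other nodes hold for this origin
    """
    highest_remote = max(remote_seqnos, default=local_seq)
    if highest_remote <= local_seq:
        return None  # remote state is not ahead of the origin; nothing to fix
    return highest_remote + 1


def setval_statement(cluster_schema, target):
    """Build the SQL to run on the origin node (hypothetical helper)."""
    return f"SELECT setval('{cluster_schema}.sl_event_seq', {target});"


# Values taken from the output above: every other node held 5000071346
# for origin 4, while node 4's own sequence stood at 5000040947.
target = stale_seqno_target(5000040947, [5000071346] * 7)
print(target)
print(setval_statement("_ams_cluster", target))
```

With the numbers from this thread the sketch yields 5000071347, the same value the sequence was advanced to by hand.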
Tom ☺
On 7/22/17, 10:39 AM, "Tignor, Tom" <ttignor at akamai.com> wrote:
Hi Steve,
Thanks for the store path desc. That’s what I surmised generally. I should note: when problems arise with subscribers, we have a utility to drop and re-store the node, and then re-store paths to all other nodes.
To answer your questions: node 4 has all expected state, 7*6=42 connections, i.e.
Sl_path server = 1, client = 3
Sl_path server = 1, client = 4
Sl_path server = 1, client = 6
Sl_path server = 1, client = 7
Sl_path server = 1, client = 8
Sl_path server = 1, client = 9
Sl_path server = 3, client = 1
Sl_path server = 3, client = 4
Sl_path server = 3, client = 6
Sl_path server = 3, client = 7
Sl_path server = 3, client = 8
Sl_path server = 3, client = 9
…
All the other nodes have 37 connections. The following are missing in each DB:
Sl_path server = 3, client = 4
Sl_path server = 6, client = 4
Sl_path server = 7, client = 4
Sl_path server = 8, client = 4
Sl_path server = 9, client = 4
Moreover, the Sl_path server = 1, client = 4 path shows the conninfo as <event pending>.
Just a guess: is there possibly some sl_event table entry which, if deleted, will allow the node-4-client store path ops to get processed?
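The path accounting above (N*(N-1) expected paths, compared against what each node's sl_path actually holds) can be sketched as a small set difference. This is a hedged illustration in Python: the helper names are hypothetical, the node IDs are the seven that appear in this thread, and the observed pairs would in practice come from querying pa_server and pa_client in each node's sl_path table.

```python
from itertools import permutations


def expected_paths(nodes):
    """Every ordered (server, client) pair between distinct nodes: N*(N-1) paths."""
    return set(permutations(nodes, 2))


def missing_paths(nodes, observed):
    """Sorted (server, client) pairs that are expected but absent from `observed`."""
    return sorted(expected_paths(nodes) - set(observed))


nodes = [1, 3, 4, 6, 7, 8, 9]
full = expected_paths(nodes)
print(len(full))  # 42

# Simulate what the other nodes report: five client=4 paths are absent.
absent = {(3, 4), (6, 4), (7, 4), (8, 4), (9, 4)}
observed_on_other_node = full - absent
print(len(observed_on_other_node))  # 37
print(missing_paths(nodes, observed_on_other_node))
```

Note that a row whose conninfo reads <event pending> still counts as present in this comparison, so it would need a separate check on the conninfo value.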
Tom ☺
On 7/21/17, 9:53 PM, "Steve Singer" <steve at ssinger.info> wrote:
On Fri, 21 Jul 2017, Tignor, Tom wrote:
>
>
>
> Hello again, Slony-I community,
>
> After our last missing path issue, we’ve taken a new interest in keeping all our path/conninfo
> data up to date. We have a cluster running with 7 nodes. Each has conninfo to all the others, so we expect N=7;
> N*(N-1) = 42 paths. We’re having persistent problems with our paths for node 4. Node 4 itself has fully accurate
> path data. However, all the other nodes have missing or inaccurate data for node-4-client conninfo. Specifically:
> node 1 shows:
>
>
>
> 1 | 4 | <event pending> | 10
>
>
>
> For the other five nodes, the node-4-client conninfo is just missing. In other words, there are no
> pa_server=X, pa_client=4 rows in sl_path for these nodes. Again, the node 4 DB itself shows all the paths we
> expect.
>
> Does anyone have thoughts on how this is caused and how it could be fixed? Repeated “store path”
> operations all complete without errors but do not change state. Service restarts haven’t worked either.
When you issue a store path command with line client=4 server=X
slonik connects to db4 and
A) updates sl_path
B) creates an event in sl_event of ev_type=STORE_PATH with ev_origin=4
This event then needs to propagate to the other nodes in the network.
When this event propagates to the other nodes then the remoteWorkerThread_4
in each of the other nodes will process this STORE_PATH entry, and you
should see a
CONFIG storePath: pa_server=X pa_client=4
message in each of the other slons.
If this happens you should see the actual path in sl_path. Since you're not, I
assume that this isn't happening.
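One way to check whether that step happened on a given node is to scan its slon log for the storePath message. The sketch below is a rough Python illustration, not part of Slony: the function name and the sample log lines are hypothetical, and only the "CONFIG storePath: pa_server=X pa_client=4" fragment is taken from the description above (real log lines will carry additional prefix text, which the search tolerates).

```python
import re

# Message fragment as described above; anything before it on the line is ignored.
STORE_PATH_RE = re.compile(r"CONFIG storePath: pa_server=(\d+) pa_client=(\d+)")


def servers_stored_for_client(log_lines, client):
    """Return the sorted pa_server IDs for which a storePath message names `client`."""
    servers = set()
    for line in log_lines:
        match = STORE_PATH_RE.search(line)
        if match and int(match.group(2)) == client:
            servers.add(int(match.group(1)))
    return sorted(servers)


# Hypothetical log excerpt, invented purely for illustration.
sample_log = [
    "CONFIG storePath: pa_server=1 pa_client=4",
    "CONFIG storePath: pa_server=3 pa_client=6",
    "some unrelated line",
    "CONFIG storePath: pa_server=9 pa_client=4",
]
print(servers_stored_for_client(sample_log, client=4))  # [1, 9]
```

An empty result for client 4 on a node would suggest the STORE_PATH event never reached that node's remote worker.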
Where on the chain of events are things breaking down?
Do you have other paths from other nodes with client=[X,Y,Z] server=4?
Steve
>
> Thanks in advance,
>
>
>
> Tom ☺