Mon Dec 7 18:48:05 PST 2015
On 12/07/2015 09:25 PM, Josh Berkus wrote:
> On 12/07/2015 11:32 AM, Josh Berkus wrote:
>> On 12/07/2015 10:56 AM, Josh Berkus wrote:
>>> So, the prepare clone method above worked perfectly twice. But then we
>>> tried to bring up a new node as a prepared clone from node 11 and things
>>> went to hell.
>>
>> One thing I just realized was different between the first two,
>> successful, runs and the failed runs: the first two times, we didn't
>> have pg_hba.conf configured, so when we brought up slony on the new node
>> it couldn't connect until we fixed that.
>>
>> So I'm wondering if there's a timing issue here somewhere.
>
> So, this problem was less interesting than I thought. As it turns out,
> the sysadmin was handling "make sure slony doesn't start on the server"
> by letting it autostart, then shutting it down. In the couple minutes
> it was running, though, it did enough to prevent finish clone from working.
>
I wonder if there is more going on here.

In remoteWorker_event we have:
	if (node->last_event >= ev_seqno)
	{
		rtcfg_unlock();
		slon_log(SLON_DEBUG2,
				 "remoteWorker_event: event %d," INT64_FORMAT
				 " ignored - duplicate\n",
				 ev_origin, ev_seqno);
		return;
	}

	/*
	 * We lock the worker threads message queue before bumping the nodes last
	 * known event sequence to avoid that another listener queues a later
	 * message before we can insert this one.
	 */
	pthread_mutex_lock(&(node->message_lock));
	node->last_event = ev_seqno;
	rtcfg_unlock();
It seems strange to me that we obtain the mutex lock only after checking
node->last_event. Does the rtcfg_lock prevent the race condition, making
the message_lock redundant for this check? If not, do we need to obtain
node->message_lock before we do the comparison?
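For illustration only, here is a minimal, self-contained sketch of what
taking node->message_lock before the comparison might look like, so that
the duplicate check and the bump of last_event are atomic with respect to
other listener threads. The mock struct and function names are mine, not
Slony-I's, and the rtcfg_lock handling is left out:

#include <pthread.h>
#include <stdint.h>

struct mock_node			/* stand-in for the real node struct */
{
	int64_t		last_event;
	pthread_mutex_t	message_lock;
};

/* Returns 1 if the event should be processed, 0 if it is a duplicate. */
static int
accept_event(struct mock_node *node, int64_t ev_seqno)
{
	int accepted = 0;

	pthread_mutex_lock(&node->message_lock);
	if (node->last_event < ev_seqno)
	{
		/* bump last_event while still holding the lock */
		node->last_event = ev_seqno;
		accepted = 1;
	}
	pthread_mutex_unlock(&node->message_lock);

	return accepted;
}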
The CLONE_NODE handler in remote_worker sets last_event by calling
rtcfg_getNodeLastEvent, which obtains the rtcfg_lock but not the message
lock.
The clone node handler in remote_worker seems to do this:

1. call rtcfg_storeNode (which obtains and then releases the config lock)
2. call cloneNodePrepare_int()
3. query the last event id
4. call rtcfg_getNodeLastEvent(), which re-obtains and then releases the
   config lock
I wonder if, sometime after step 1 but before step 4, a remote listener
queries events from the new node and adds them to the queue because
last_event hasn't yet been set.

Maybe cloneNodePrepare needs to obtain the message queue lock at step 1
and hold it until step 4, and remoteWorker_event then needs to obtain
that lock a bit earlier.
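To make the idea concrete, here is a rough sketch of that locking order,
again with a mock type rather than the real Slony-I structures and with
the intermediate calls reduced to a comment. The point is only that the
message lock is held across the whole window in which last_event is
still unset:

#include <pthread.h>
#include <stdint.h>

struct mock_node
{
	int64_t		last_event;
	pthread_mutex_t	message_lock;
};

/*
 * Hypothetical clone-prepare path: take the message lock at step 1 and
 * hold it until last_event has been set at step 4, so no listener can
 * queue events for the new node while last_event is still zero.
 */
static void
clone_node_prepare(struct mock_node *new_node, int64_t provider_last_event)
{
	pthread_mutex_lock(&new_node->message_lock);	/* step 1 */

	/*
	 * ... store the node, do the prepare work, and query the last
	 * event id (steps 2 and 3) ...
	 */

	new_node->last_event = provider_last_event;	/* step 4 */
	pthread_mutex_unlock(&new_node->message_lock);
}

remoteWorker_event would then take the same lock before its duplicate
comparison, as in the earlier sketch, so both paths serialize on
node->message_lock.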