[Slony1-general] STILL can't migrate a node.

Sat Feb 23 10:28:57 PST 2008

Jan Wieck wrote:
> On 2/23/2008 12:20 AM, Craig James wrote:
>> A little more info on this problem...
>>
>> Craig James wrote:
>>> I'm trying to migrate a node for the second time, and no luck.  Last 
>>> time I tried it, it just got stuck, and due to lack of time, I didn't 
>>> investigate.
>>>
>>> This time I watched -- it got stuck again, doing some sort of huge 
>>> SELECT statement.  I was under the impression that migrating a node 
>>> was a fairly simple operation that should happen in a short time 
>>> (less than a minute?) even for large databases.
>>>
>>> I waited 10 minutes, during which the entire system was completely 
>>> locked up (no other process could access the database), and our web 
>>> site was offline.  I finally had to kill all of the slon daemons and 
>>> kill Postgres to get our site back on the air, then run the 
>>> node-unlock command to get Slony back in shape.
>>>
>>> This system appears to otherwise be working well.  I can insert, 
>>> update and delete records, and they're copied to the slave node 
>>> immediately.
>>>
>>> What's up?  Am I just too impatient?
>>
>> I tried it again, after vacuuming the slony tables that are subject to 
>> bloat.  This time I shut everything off, started the migration of the 
>> master to node 2, and waited for 35 minutes, but the SELECT never 
>> finished.  vmstat showed massive I/O and CPU activity the whole time.
> 
> What SELECT are you referring to? I don't see where in the MOVE SET you 
> have to perform any SELECT.

You tell me.  It is the slon daemon that is executing this SELECT.  There were no other connections to the database the second time I tried this.

>> Again, after I killed postgres, restarted, and unlocked the node, 
>> Slony went back to performing perfectly.
> 
> Killing postgres is a bad idea. Stop that habit right now, before you 
> physically corrupt any of your databases.

Thanks for the advice, but I don't think it's a problem.  That's one of the features of a robust relational database with a write-ahead log -- it can withstand being killed without corrupting data.  Besides, I had no choice: my web site went offline because slon apparently took an exclusive lock on the tables, blocking all other activity.  And I killed a SELECT, not an INSERT or UPDATE.

But that's a topic for a separate discussion ... I have to fix this Slony problem first.

> Anyhow, apparently the LOCK SET part of the process succeeds. So what I 
> now assume is that the WAIT FOR EVENT never finishes. First, you don't 
> need a WAIT FOR EVENT between LOCK SET and MOVE SET. Both events are 
> executed on the origin, so by the time the LOCK SET finishes, everything 
> is ready for the MOVE.

I don't think it got as far as this, but I don't know the internals.  When I execute the script, the SELECT starts, and that's where everything comes to a sudden halt.

> But what this indicates is that node 2 never confirms the LOCK SET. Can 
> it be that you actually have a problem with the connection from node 2 
> to node 1? What is the content of the view sl_status on both nodes?

Both nodes seem to be normal: st_last_event_ts is just a few seconds prior to the query, and st_lag_time is 00:00:11.465251 (node 1) and 00:00:07.50164 (node 2).
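
For anyone who wants to check lag values like these mechanically, here is a minimal Python sketch. It assumes st_lag_time has already been fetched as plain 'HH:MM:SS.ffffff' text, that the second value belongs to node 2, and that one minute is a reasonable alarm threshold. The helper names parse_lag and lagging_nodes are hypothetical, the threshold is an arbitrary choice, and intervals that include a day component are not handled.

```python
from datetime import timedelta


def parse_lag(text):
    """Parse an 'HH:MM:SS[.ffffff]' interval string into a timedelta."""
    hours, minutes, seconds = text.strip().split(":")
    whole, _, frac = seconds.partition(".")
    # Right-pad the fraction so '.50164' is read as 501640 microseconds.
    micros = int(frac.ljust(6, "0")[:6]) if frac else 0
    return timedelta(hours=int(hours), minutes=int(minutes),
                     seconds=int(whole), microseconds=micros)


def lagging_nodes(lags, threshold=timedelta(minutes=1)):
    """Return the names of nodes whose lag exceeds the (assumed) threshold."""
    return [name for name, text in lags.items() if parse_lag(text) > threshold]


# Values reported above; attributing the second one to node 2 is an assumption.
lags = {"node 1": "00:00:11.465251", "node 2": "00:00:07.50164"}
print(lagging_nodes(lags))
```

With the two values above, neither node exceeds the one-minute threshold, which matches the reading that replication itself is keeping up.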

> If you want to speed up this communication in order to meet your Sat. 
> noon deadline, I'll be available on IRC, channel #slony on freenode.

Thanks, but I don't use an IRC client.  I hope you get this.

Craig