Jan Wieck JanWieck at Yahoo.com
Thu Apr 22 17:26:28 PDT 2010
On 4/22/2010 7:33 PM, Jaime Casanova wrote:
> On Thu, Apr 22, 2010 at 12:58 AM, Jan Wieck <JanWieck at yahoo.com> wrote:
>
>> You may be able to fix things by reinserting that sl_subscribe row with
>> sub_active = false, then restart the slon for node 2 and see how far that
>> gets you.
>>
>
> yes, that makes the receiver start accepting events again... it's trying
> to get up to date now...
> thanks for your help...

Jaime was kind enough to provide me with a dump of the Slony schema of 
node 2, and we were able to figure out completely what happened.

The whole mess was started by using direct DDL against a subscriber 
under Slony 1.2.x. The attempted fix for this was to drop the table from 
the replication set via SET DROP TABLE, fix the table definitions and 
resubscribe it via a temp set. The subscription failed because of an 
inconsistency between the system catalog and the slony catalog on the 
subscriber.

The exact steps after that are not 100% clear to me yet, but I think I 
understand them well enough to be able to reproduce them later down the 
road. SUBSCRIBE SET is actually a two-step operation. In the first 
step, the SUBSCRIBE_SET event causes the new subscriber and everyone in 
the path to create the sl_subscribe row, which causes all data 
forwarders to keep replication data until the new subscriber has 
confirmed it. The second step is an internal event, ENABLE_SUBSCRIPTION, 
that is generated automatically by the origin of the set and that kicks 
off the actual copy_set() call.
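
As a rough illustration of those two steps, here is a toy Python model. 
It is not Slony code: the dictionaries, function names, node ids and set 
id are made up for the example, and the assumption that step two flips 
the row to active is inferred from the sub_active flag mentioned above.

```python
# Toy model only: names and structures are illustrative, not Slony-I source.

def subscribe_set(catalogs, path, set_id, provider, receiver):
    """Step 1 (SUBSCRIBE_SET): every node in the path records the
    subscription as an inactive sl_subscribe row."""
    for node in path:
        catalogs[node][(set_id, receiver)] = {"provider": provider, "active": False}

def enable_subscription(catalogs, path, set_id, receiver, copy_set):
    """Step 2 (ENABLE_SUBSCRIPTION): run the copy, then mark the row active
    (assumption: activation follows a successful copy)."""
    copy_set(set_id, receiver)  # if this raises, the rows stay inactive
    for node in path:
        catalogs[node][(set_id, receiver)]["active"] = True

catalogs = {1: {}, 2: {}}  # node id -> that node's sl_subscribe rows
path = [1, 2]              # hypothetical: origin is node 1, subscriber is node 2
copied = []                # records which (set, receiver) pairs were copied

subscribe_set(catalogs, path, set_id=1, provider=1, receiver=2)
state_after_step_1 = catalogs[2][(1, 2)]["active"]

enable_subscription(catalogs, path, 1, 2, lambda s, r: copied.append((s, r)))
state_after_step_2 = catalogs[2][(1, 2)]["active"]
```

The point of the split is that the row exists, inactive, on every node 
in the path before any data is copied.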

That copy_set() failed due to the catalog inconsistency. What Jaime 
tried then was an UNSUBSCRIBE SET, which slonik issued against the 
half-subscribed node 2, deleting the sl_subscribe row. The code in 
copy_set() doesn't use the parameters from the event, but expects the 
in-memory runtime configuration data to know the data provider for the 
set. Since the sl_subscribe row is gone now, that information is 
missing, and copy_set() sees -1, which is the default provider value for 
a set the node isn't subscribed to.
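
The failure mode can be sketched with another toy Python model. Again, 
this is not the slon implementation: RuntimeConfig, its methods and the 
set id are hypothetical, and only the -1 default and the order of events 
are taken from the description above.

```python
# Toy model only: names are illustrative, not the actual slon code.
NO_PROVIDER = -1  # default provider for a set the node is not subscribed to

class RuntimeConfig:
    """In-memory view of the subscriptions a node knows about."""

    def __init__(self):
        self.providers = {}  # set_id -> provider node id

    def store_subscribe(self, set_id, provider):
        self.providers[set_id] = provider

    def drop_subscribe(self, set_id):
        self.providers.pop(set_id, None)

    def provider_of(self, set_id):
        return self.providers.get(set_id, NO_PROVIDER)

def copy_set(config, set_id):
    # The provider comes from the runtime configuration,
    # not from the parameters of the event being processed.
    provider = config.provider_of(set_id)
    if provider == NO_PROVIDER:
        raise RuntimeError(f"set {set_id}: no data provider (got {provider})")
    return provider

node2 = RuntimeConfig()
node2.store_subscribe(set_id=1, provider=1)  # SUBSCRIBE_SET creates the row
node2.drop_subscribe(set_id=1)               # UNSUBSCRIBE SET deletes it again
try:
    copy_set(node2, 1)                       # a later copy_set() attempt
    outcome = "copied"
except RuntimeError as exc:
    outcome = str(exc)
```

Reinserting the row, as suggested earlier in the thread, corresponds to 
calling store_subscribe() again so that the lookup succeeds.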

I don't know exactly what the right fix for this bug is. My first gut 
feeling is to ignore the ENABLE_SUBSCRIPTION and generate another 
UNSUBSCRIBE_SET event just to clear out any sl_subscribe row existing in 
the cluster. Since I am in Toronto right now, I can discuss this with 
Steve Singer tomorrow morning.

Thank you Jaime. Your patience on this matter helped to track down a 
very nasty bug that apparently had been lingering in the system for a 
long time.


Jan

-- 
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin


