Shahaf Abileah shahaf at redfin.com
Fri Aug 15 14:34:30 PDT 2008
Hello list,

 

I've been using Slony in production for months and things have worked
out pretty well so far.  But this past week I ran into problems when we
switched from Postgres 8.2 to Postgres 8.3 (still using Slony version
1.2.13).  I don't think the error results from the version upgrade, but
I could be wrong.

 

I set up the master database (query-1) and then proceeded to set up
replication to two slaves (query-2 and query-4).  I kicked off the slon
processes for all three simultaneously.  Replication to query-4 got
started right away and succeeded - it took 2 hours for the initial bulk
copy of data, and soon after that it was caught up with incremental
changes.  However, replication to query-2 died right away.  The slon
process for query-2 quit immediately and I saw the following message in
the log:

 

2008-08-15 13:57:01 PDT FATAL  localListenThread: "select
"_stingray_cluster".cleanupNodelock(); insert into
"_stingray_cluster".sl_nodelock values (    2, 0,
"pg_catalog".pg_backend_pid()); " - ERROR:  duplicate key value violates
unique constraint "sl_nodelock-pkey"

 

Now, I searched the interweb for solutions to this problem and I found
several suggestions, but none of them have helped so far...

 

1.	People suggest that this happens when you try to start multiple
slon processes for the same node at the same time.  I'm quite certain
that this is not the case.  I've checked and double-checked and
quadruple-checked.  For what it's worth, remember that I've successfully
run Slony replication to multiple slaves for months.

 

Here are the ones that are running:

 

25537 pts/3    S      0:00 slon -d 2 stingray_cluster
host=query-1.colo.redfin.com dbname=my_db user=my_user
password=my_password

25541 pts/3    Sl     0:00  \_ slon -d 2 stingray_cluster
host=query-1.colo.redfin.com dbname=my_db user=my_user
password=my_password

25544 pts/3    S      0:00 slon -d 2 stingray_cluster
host=query-4.colo.redfin.com dbname=my_db user=my_user
password=my_password

25545 pts/3    Sl     0:00  \_ slon -d 2 stingray_cluster
host=query-4.colo.redfin.com dbname=my_db user=my_user
password=my_password

 

And the one that starts and then immediately dies has query-2 as the
host.

 

2.	Some people suggest that this happens when a slon process dies
and leaves behind a stale connection, one buried so deep in the network
stack that neither Slony nor Postgres is aware of it.  For this, people
suggest either waiting about two hours for a timeout to kick in, or
having the admin kill -2 the zombie process/connection.  Apparently this
happens more often if you try to run Slony across a WAN.

 

Well, our slon processes all run on the master node, and the other nodes
are right there on the same LAN.  (We did once try to run Slony across a
WAN, but that never worked.)

 

This duplicate key issue first happened more than 12 hours ago, and it
continues to happen.  Each time I try to start a slon process for
query-2, it dies right away with the same error.

 

So, I don't think there's a zombie connection, but I also don't know
exactly how to check for one.  I can say that all the processes in
pg_stat_activity have an xact_start that is less than an hour old.
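For reference, here's roughly the check I ran (assuming the 8.3
pg_stat_activity column names, i.e. procpid rather than pid) to see
whether any of the pids recorded in sl_nodelock still correspond to a
live backend:

```sql
-- Which sl_nodelock entries still point at a live backend?
select a.procpid, a.usename, a.backend_start, a.xact_start
  from pg_stat_activity a
 where a.procpid in
       (select nl_backendpid from _stingray_cluster.sl_nodelock);
```

If the node-2 pids (14774 and 20569) don't show up in that result, I'd
take it as a hint that those rows are stale, but I'd appreciate a sanity
check on that reasoning.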

 

3.	One person mentioned having to go into the sl_nodelock table and
manually whack the bad row.  I'm a little concerned about doing this
kind of surgery because I don't know the sl_nodelock table, and
therefore I don't know what the consequences of whacking a row would be
(especially if I accidentally whack the wrong one).

 

Here's what I see in that table:

 

stingray_prod=# select * from _stingray_cluster.sl_nodelock;

 nl_nodeid | nl_conncnt | nl_backendpid

-----------+------------+---------------

         2 |         11 |         14774

         2 |         15 |         20569

         1 |          0 |         25551

         4 |         16 |         25568

         4 |         17 |         25570

(5 rows)

 

What do these columns mean exactly?  Is there documentation on this
table somewhere?  Which one is likely to be the offending row?  Is it
crazy to whack the row manually?  Would I need to also whack
corresponding rows in other tables?
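If it does come down to manual cleanup, here's my best guess at what the
delete would look like - restricted to node 2, and only for rows whose
backend pid no longer matches a live connection (again assuming the 8.3
procpid column name).  I have NOT run this, and I'd love confirmation
before I do:

```sql
-- Tentative cleanup, node 2 only: drop sl_nodelock rows whose recorded
-- backend pid is no longer present in pg_stat_activity.
delete from _stingray_cluster.sl_nodelock
 where nl_nodeid = 2
   and nl_backendpid not in (select procpid from pg_stat_activity);
```

Is that the right shape, or am I missing a row (or a table) that also
needs attention?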

 

Any help with this would be much appreciated.

 

On a related note - is there a psql command (like \d) that I can use to
list all the objects in my _stingray_cluster schema?
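The closest I've come up with is querying the catalogs directly, which
works but feels like I'm missing a shortcut:

```sql
-- List everything in the _stingray_cluster schema, by object type
-- (relkind: r = table, i = index, S = sequence, v = view).
select c.relkind, c.relname
  from pg_class c
  join pg_namespace n on n.oid = c.relnamespace
 where n.nspname = '_stingray_cluster'
 order by c.relkind, c.relname;
```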

 

Thanks,

 

--S

 



More information about the Slony1-general mailing list