John Sidney-Woollett johnsw
Tue Feb 15 09:55:53 PST 2005
Hopefully this will prevent data loss or problems for others using slony 
1.0.5 and pg 7.4.6...

We just got bitten by something we didn't foresee when completely 
uninstalling a slony replication cluster from the master and slave...

MAKE SURE YOU STOP YOUR APPLICATION RUNNING AGAINST YOUR MASTER DATABASE 
WHEN REMOVING THE WHOLE SLONY CLUSTER, or at least re-cycle all your 
open connections after the event!

The connections appear to "remember" or refer to objects which are 
removed by the uninstall node script. And you get lots of errors as a 
result (and possible data loss??)...

Question: Why do our database objects still refer to removed slony 
objects after they are removed?

John Sidney-Woollett

More info...

Our system is a web application which runs against a postgres 7.4.6 
database. Tomcat is restarted at 5am each day.

Last Friday afternoon we uninstalled the slony cluster (1 master + 1 
slave) so that we could add a new second slave. (I find it easier to 
uninstall the cluster and then reapply a new setup, subscribe script 
rather than trying to add the single node.)

The cluster was successfully removed, and then rebuilt with 1 master and 
2 slave nodes.

However, we didn't stop and start our web application which uses Tomcat 
connection pool and continued to run against the master (during the 
uninstall and rebuild). Everything appeared fine.

Only today while checking something else did I notice lots of
ERROR:  relation with OID 1036711 does not exist
errors in the postgres serverlog

In our tomcat application we also saw lots of errors like
org.postgresql.util.PSQLException: ERROR: relation with OID 1036711 does 
not exist
at 
org.postgresql.util.PSQLException.parseServerError(PSQLException.java:139)
at org.postgresql.core.QueryExecutor.executeV3(QueryExecutor.java:154)
at org.postgresql.core.QueryExecutor.execute(QueryExecutor.java:101)
at org.postgresql.core.QueryExecutor.execute(QueryExecutor.java:43)
...

Basically these errors started when the cluster was uninstalled and 
continued until the web application was restarted (at 5am the following 
morning).

I'm not sure what object had OID 1036711 - maybe the slony replication 
schema?? Is there a way to find out after the event?

I do have daily full backups of the master each day going back 1 month - 
I could load one into another database and lookup the OID, if someone 
could explain how to do that. And if that would be useful?

But the net effect is that we got bizarre failures in our application, 
and large gaps in primary key values (using sequences) in certain tables 
within the database. On the whole the app seemed to be behaving fine, 
with the odd error that we put down to user error... (bad assumption, I 
guess in hindsight)

At the moment I'm trying to figure out exactly what kind of damage our 
data has suffered.

If anyone has any ideas or suggestions as to what went wrong or what to 
check for, I'd appreciate hearing.

Thanks

John


More information about the Slony1-general mailing list