Jeff Frost jeff at frostconsultingllc.com
Wed Dec 19 13:17:24 PST 2007
Casey, if you search through the archives for deadlock you'll find a similar problem I had.  This was fixed in 1.2.12.  Sorry for top-posting; my phone doesn't quote worth a flip.

------- Original Message -------
From: Casey Duncan <casey at pandora.com>
To: slony1-general at lists.slony.info
Sent: 12/19/07, 3:51:16 PM
Subject: [Slony1-general] Bad failure mode encountered during EXECUTE SCRIPT(maybe fixed in 1.2.12?)

Running slony 1.2.10, we encountered an issue with a schema upgrade
when it propagated to the secondary in a two-node slony mirror. Our
typical procedure for doing schema upgrades is as follows:

1. Point the application server clients to the secondary (the app
becomes "read-only" as a result)
2. After a few minutes, turn off the slon daemons so we can control  
when the schema changes propagate
3. Apply the schema changes (to the primary) via slonik using execute
script, also adding and removing tables from the replication set as
needed (see the sketch of a typical script after this list).
4. After the upgrade completes on the primary, point the app servers  
back to it, so the secondary is quiet.
5. Turn on the slon daemons and wait for the changes to propagate to  
the secondary.
6. Go to sleep or play video games or what-have-you 8^)
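
For context, the script handed to execute script in step 3 is just a
plain SQL file. A stripped-down example of what one might look like
(the table, column and index names here are placeholders for
illustration, not our real schema):

  -- DDL applied on the primary via slonik's execute script; slony drops
  -- its triggers on each replicated table before running this and is
  -- supposed to put them back afterwards
  ALTER TABLE public.some_table ADD COLUMN example_flag boolean;
  CREATE INDEX some_table_example_flag_idx
      ON public.some_table (example_flag);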

This has worked well for us for dozens (if not hundreds) of upgrades,
with very occasional hiccups in step 5 due to timing-related issues
(when adding tables, etc.). Last night, though, we had a problem with
step 4: a wedged pgpool process prevented the app servers from
switching back to the primary. When we tried to switch them, they
timed out trying to connect and spontaneously failed over to the
secondary once again, as they are programmed to do. Unfortunately
this went unnoticed as we proceeded to step 5. Due to the level of
concurrency, this led to the following deadlock error on the secondary:

2007-12-18 21:41:12.174 PST [d:radio_prod_seg11 u:slony p:6313 9] ERROR:  deadlock detected
2007-12-18 21:41:12.174 PST [d:radio_prod_seg11 u:slony p:6313 10] DETAIL:  Process 6313 waits for AccessExclusiveLock on relation 20355 of database 19833; blocked by process 11991.
         Process 11991 waits for AccessShareLock on relation 20083 of database 19833; blocked by process 6313.
2007-12-18 21:41:12.174 PST [d:radio_prod_seg11 u:slony p:6313 11] CONTEXT:  SQL statement "lock table "public"."feedback_packed" in access exclusive mode"
         PL/pgSQL function "altertableforreplication" line 53 at execute statement
         SQL statement "SELECT  "_radio".alterTableForReplication( $1 )"
         PL/pgSQL function "ddlscript_complete_int" line 11 at perform
2007-12-18 21:41:12.174 PST [d:radio_prod_seg11 u:slony p:6313 12] STATEMENT:  select "_radio".ddlScript_complete_int(1, -1);
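
In hindsight, a quick query against pg_locks on the secondary would
have shown what process 11991 was and what it was holding. Something
along these lines (written against the 8.x catalogs, where
pg_stat_activity still exposes procpid and current_query):

  -- who holds / waits on the two relations named in the deadlock report
  select l.pid, l.relation::regclass as relation, l.mode, l.granted,
         a.usename, a.current_query
    from pg_locks l
    join pg_stat_activity a on a.procpid = l.pid
   where l.relation in (20355, 20083);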

Unfortunately this too went unnoticed at first; what we did notice
was that the app servers were still failed over and we couldn't
connect to the primary. This was quickly tracked down to the problem
with pgpool, which was fixed by restarting it. After that, we
succeeded in switching the app servers back to the primary.

At that point the slon daemons were running, but the secondary daemon  
was not getting anywhere. A look at the logs revealed this error:

ERROR  remoteWorkerThread_1: "select "_radio".ddlScript_prepare_int(1, -1); " PGRES_FATAL_ERROR ERROR:  Slony-I: alterTableRestore(): Table "public"."settlement_log" is not in altered state
  INFO  <output from /usr/lib/postgresql/bin/slon> CONTEXT:  SQL statement "SELECT  "_radio".alterTableRestore( $1 )"

This resulted in a bit of a sinking feeling in my gut, since I knew  
this likely meant that none of the replicated tables were in the  
altered state, thus none of the slony triggers were present to  
prevent data from being written to the secondary. A quick look at the  
tables revealed this to be the case. Unfortunately by this time the  
app servers were happily writing to the primary, thus the mirror was  
broken and the slony pair was hosed.

So the end result was that data got written to the secondary, and it
is more-or-less lost because other rows with the same primary key
values were subsequently added to the primary.

This situation points to a lack of atomicity in execute script.
Obviously the removal of the slony triggers and the schema changes
were applied in a separate transaction from the restoration of the
triggers. This leads to the dangerous possibility that the former
will succeed but the latter will fail, which is exactly the issue I
ran into here.
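
From the log output above, the remote worker appears to issue the
pieces separately, roughly like this (my reading of the logs, not a
claim about how 1.2.10 is structured internally):

  -- 1. un-alter the tables: slony triggers come off so the DDL can run
  select "_radio".ddlScript_prepare_int(1, -1);

  -- 2. the DDL from the script is applied here

  -- 3. re-alter the tables: the log/denyaccess triggers go back on;
  --    this is the call that deadlocked on the secondary, leaving the
  --    tables unprotected
  select "_radio".ddlScript_complete_int(1, -1);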

Glancing at the changelog for slony 1.2.12, there appear to be
changes around transaction handling in execute script. Would those
changes prevent this problem? I plan to upgrade soon anyway, but
obviously such a fix makes upgrading even sooner more compelling.

Thanks.

-Casey
