Wed Dec 19 13:17:24 PST 2007
Casey, if you search through the archives for "deadlock" you'll find a similar problem I had. This has been fixed in 1.2.12. Sorry for top posting; my phone doesn't quote worth a flip.

------- Original Message -------
From: Casey Duncan <casey at pandora.com>
To: slony1-general at lists.slony.info
Sent: 12/19/07, 3:51:16 PM
Subject: [Slony1-general] Bad failure mode encountered during EXECUTE SCRIPT (maybe fixed in 1.2.12?)

Running slony 1.2.10, we encountered an issue with a schema upgrade when it propagated to the secondary in a two-node slony mirror. Our typical procedure for doing schema upgrades is as follows:

1. Point the application server clients to the secondary (the app becomes "read-only" as a result).
2. After a few minutes, turn off the slon daemons so we can control when the schema changes propagate.
3. Apply the schema changes (to the primary) via slonik using EXECUTE SCRIPT, also adding and removing tables from the replication set as needed (see the sketch below).
4. After the upgrade completes on the primary, point the app servers back to it, so the secondary is quiet.
5. Turn on the slon daemons and wait for the changes to propagate to the secondary.
6. Go to sleep or play video games or what-have-you 8^)
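For concreteness, a step 3 script looks roughly like this (the cluster name matches ours; the conninfo strings, filename, and table ID are made up for illustration):

    cluster name = radio;
    node 1 admin conninfo = 'host=primary dbname=radio_prod user=slony';
    node 2 admin conninfo = 'host=secondary dbname=radio_prod user=slony';

    # tables being dropped by the upgrade come out of the set first
    set drop table (origin = 1, id = 42);

    # the DDL runs through slony so it replays on node 2 later
    execute script (
        set id = 1,
        filename = '/tmp/upgrade.sql',
        event node = 1
    );

    # (newly created tables go in via a temporary set that is
    # subscribed and then merged; omitted here for brevity)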
This has worked well for us for dozens (if not hundreds) of upgrades, with very occasional hiccups in step 5 due to timing-related issues (when adding tables, etc.). Last night, though, we had a problem with step 4: a wedged pgpool process prevented the app servers from switching back to the primary. When we tried to switch them, they timed out trying to connect and spontaneously failed over to the secondary once again, as they are programmed to. Unfortunately this went unnoticed as we proceeded to step 5. Due to the level of concurrency, this led to the following deadlock error on the secondary:

2007-12-18 21:41:12.174 PST [d:radio_prod_seg11 u:slony p:6313 9] ERROR:  deadlock detected
2007-12-18 21:41:12.174 PST [d:radio_prod_seg11 u:slony p:6313 10] DETAIL:  Process 6313 waits for AccessExclusiveLock on relation 20355 of database 19833; blocked by process 11991.
        Process 11991 waits for AccessShareLock on relation 20083 of database 19833; blocked by process 6313.
2007-12-18 21:41:12.174 PST [d:radio_prod_seg11 u:slony p:6313 11] CONTEXT:  SQL statement "lock table "public"."feedback_packed" in access exclusive mode"
        PL/pgSQL function "altertableforreplication" line 53 at execute statement
        SQL statement "SELECT "_radio".alterTableForReplication( $1 )"
        PL/pgSQL function "ddlscript_complete_int" line 11 at perform
2007-12-18 21:41:12.174 PST [d:radio_prod_seg11 u:slony p:6313 12] STATEMENT:  select "_radio".ddlScript_complete_int(1, -1);

Unfortunately this too went unnoticed at first; what we did notice was that the app servers were still failed over and we couldn't connect to the primary. This was quickly tracked down to the problem with pgpool, which was fixed by restarting it. After that, we succeeded in switching the app servers back to the primary. At that point the slon daemons were running, but the secondary daemon was not getting anywhere. A look at the logs revealed this error:

ERROR  remoteWorkerThread_1: "select "_radio".ddlScript_prepare_int (1, -1); " PGRES_FATAL_ERROR ERROR:  Slony-I: alterTableRestore(): Table "public"."settlement_log" is not in altered state
INFO   <output from /usr/lib/postgresql/bin/slon>
CONTEXT:  SQL statement "SELECT "_radio".alterTableRestore( $1 )"

This resulted in a bit of a sinking feeling in my gut, since I knew it likely meant that none of the replicated tables were in the altered state, and thus none of the slony triggers were present to prevent data from being written to the secondary. A quick look at the tables revealed this to be the case.
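(The "quick look" was nothing fancy. On the secondary, something along these lines shows which replicated tables still have their deny-access trigger; the trigger naming convention is from memory for 1.2, so treat it as a sketch:)

    -- Run on the secondary. Tables with a NULL tgname in the output
    -- have lost their Slony-I deny-access trigger and are open for
    -- writes. Assumes triggers are named _<cluster>_denyaccess_<tab id>.
    select st.tab_nspname, st.tab_relname, t.tgname
      from "_radio".sl_table st
      left join pg_trigger t
        on t.tgrelid = st.tab_reloid
       and t.tgname = '_radio_denyaccess_' || st.tab_id::text
     order by 1, 2;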
Unfortunately, by this time the app servers were happily writing to the primary, thus the mirror was broken and the slony pair was hosed. So the end result was that data got written to the secondary and is more-or-less lost, because other rows with the same primary key values were subsequently added to the primary.

This situation points to a lack of atomicity in EXECUTE SCRIPT. Evidently the removal of the slony triggers and the schema changes were applied in a separate transaction from the restoration of the triggers. This leads to the dangerous possibility that the former will succeed but the latter will fail, which is the issue I ran into here.
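Piecing it together from the logs above, the subscriber-side sequence appears to be effectively the following (paraphrased from the logged function calls; I haven't read the implementation closely):

    -- 1) drop the deny-access triggers so the DDL can run
    select "_radio".ddlScript_prepare_int(1, -1);   -- calls alterTableRestore()

    -- 2) ... the DDL script statements execute here ...

    -- 3) put the triggers back
    select "_radio".ddlScript_complete_int(1, -1);  -- calls alterTableForReplication()

    -- In our case step 3 died in the deadlock after steps 1 and 2 had
    -- already taken effect, leaving every replicated table unprotected.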
Glancing at the changelog for slony 1.2.12, there appear to be changes around transaction handling in EXECUTE SCRIPT. Would those changes prevent this problem? Note that I plan to upgrade soon anyway, but obviously such a fix makes doing so even sooner that much more compelling. Thanks.

-Casey