12. Locking Issues

One of the usual merits of PostgreSQL's use of Multi-Version Concurrency Control (MVCC) is that it eliminates a whole host of reasons to need to lock database objects. On some other database systems, you need to acquire a table lock in order to insert data into the table; that can severely hinder performance. On other systems, read locks can impede writes; with MVCC, PostgreSQL eliminates that whole class of locks, in that "old reads" can access "old tuples." Most of the time, this allows the gentle user of PostgreSQL not to worry very much about locks. Slony-I configuration events normally grab locks on an internal table, sl_config_lock, which should not be visible to applications unless they are performing actions on Slony-I components.

Unfortunately, there are several sorts of Slony-I events that do require exclusive locks on PostgreSQL tables, with the result that modifying Slony-I configuration can bring back some of those "locking irritations." In particular:

  • set add table

    A momentary exclusive table lock must be acquired on the "origin" node in order to add the trigger that collects updates for that table. It only needs to be acquired long enough to establish the new trigger.
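
    For example, a minimal slonik sketch of adding a table to a set; the cluster name, connection info, and IDs here are illustrative, not taken from this document:

        cluster name = mycluster;
        node 1 admin conninfo = 'dbname=mydb host=origin-host user=slony';

        set add table (set id = 1, origin = 1, id = 5,
                       fully qualified name = 'public.accounts');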

  • move set

    When a set origin is shifted from one node to another, exclusive locks must be acquired on each replicated table on both the old origin and the new origin in order to change the triggers on the tables.
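
    For example, the usual sequence for moving set 1 from node 1 to node 2 might look like the following sketch (reusing the slonik preamble from the set add table example above, plus an admin conninfo for node 2; IDs are illustrative):

        lock set (id = 1, origin = 1);
        wait for event (origin = 1, confirmed = 2, wait on = 1);
        move set (id = 1, old origin = 1, new origin = 2);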

  • lock set

    This operation expressly requests locks on each of the tables in a given replication set on the origin node.
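
    For example (IDs illustrative), locking set 1 on its origin, and releasing the locks again if the planned move set is abandoned:

        lock set (id = 1, origin = 1);
        unlock set (id = 1, origin = 1);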

  • execute script

    This operation runs a set of SQL queries; for it to work, the Slony-I triggers must be removed, then the queries (which potentially update the data) run, and then the triggers restored. The operation therefore must acquire table locks on all replicated tables on each node.
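
    For example, a sketch of running a DDL script against set 1 (the file path and IDs are illustrative):

        execute script (set id = 1, filename = '/tmp/add_column.sql',
                        event node = 1);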

  • During the SUBSCRIBE_SET event on a new subscriber

    In a sense, this is the least provocative scenario, since, before the replication set has been populated, it is pretty reasonable to say that the node is "unusable" and that Slony-I could reasonably demand exclusive access to the node.

    A change in version 1.2 is that an explicit LOCK TABLE SQL request is submitted in the loop that validates that all of the tables are there. This means that all tables in the replication set are locked with an exclusive lock for the entire duration of the subscription process. Locking the tables early ensures that the subscription cannot fail, after copying some of the data, because some other process was holding on to a table.

    In any case, note that this one began with the wording "on a new subscriber." The locks are applied on the new subscriber. They are not applied on the provider or on the origin.
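
    For example, a sketch of subscribing node 2 to set 1 from provider node 1 (IDs are illustrative):

        subscribe set (id = 1, provider = 1, receiver = 2, forward = yes);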

  • pg_autovacuum may not be part of Slony-I, but those who run it find that it wakes up roughly once per minute and may, at any time, start vacuuming a table, thereby taking out a ShareUpdateExclusiveLock on it. This may block the other events for an unpredictable period of time.

  • Each time an event is generated (including SYNC events), Slony-I obtains an exclusive lock on the sl_event table long enough to insert the event into sl_event. This is not normally an issue, as Slony-I should be the only program using sl_event. However, it means that any non-Slony transaction that reads from sl_event can cause replication to pause. If you pg_dump your database, avoid dumping your Slony schemas; otherwise pg_dump's locking will compete with Slony-I's own locking, which could stop Slony-I replication for the duration of the pg_dump. Exclude the Slony schemas from pg_dump with --exclude-schema=schemaname.
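
    For example, if the cluster is named "mycluster", its schema is _mycluster, and a dump that avoids it might look like this sketch (names are illustrative):

        pg_dump --exclude-schema=_mycluster mydb > mydb.dump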

Each of these actions requires, at some point, modifying each of the tables in the affected replication set, which requires acquiring an exclusive lock on each table. Some users who have tried running these operations on Slony-I nodes that were actively servicing applications have experienced difficulties with deadlocks and/or with the operations hanging up.

The obvious question: "What to do about such deadlocks?"

Several possibilities suggest themselves:

  • Announce an application outage to avoid deadlocks

    If you can temporarily block applications from using the database, that will provide a window of time during which there is nothing running against the database other than administrative processes under your control.

  • Try the operation, hoping for things to work

    Since nothing prevents applications from leaving access locks in your way, you may find yourself deadlocked. But if the number of remaining locks is small, you may be able to negotiate with users to "get in edgewise."

  • Use pgpool

    If you can use this or some similar "connection broker," you may be able to tell the connection manager to stop using the database for a little while, thereby letting it "block" the applications for you. Ideally, the connection manager would hold up user queries for a little while, so that the brief database outage merely looks to the users like a period when things were running slowly.

  • Rapid Outage Management

    The following procedure may minimize the period of the outage:

    • Modify pg_hba.conf so that only the slony user will have access to the database.

    • Issue a kill -SIGHUP to the PostgreSQL postmaster.

      This will not kill off existing possibly-long-running queries, but will prevent new ones from coming in. There is an application impact in that incoming queries will be rejected until the end of the process.

    • If "all looks good," then it should be safe to proceed with the Slony-I operation.

    • If some old query is lingering around, you may need to kill -SIGQUIT one of the PostgreSQL processes. This will restart the backend and kill off any lingering queries. You probably need to restart the slon processes that attach to the node.

      At that point, it will be safe to proceed with the Slony-I operation; there will be no competing processes.

    • Reset pg_hba.conf to allow other users in, and kill -SIGHUP the postmaster to make it reload the security configuration.
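
    A minimal sketch of the lockdown and reload steps; the user name, addresses, and file locations are illustrative and should be adjusted to your installation:

        # pg_hba.conf during the maintenance window -- allow only the slony user:
        #     local   all   slony                  trust
        #     host    all   slony   127.0.0.1/32   trust
        # then tell the postmaster to re-read pg_hba.conf:
        kill -SIGHUP `head -1 $PGDATA/postmaster.pid`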

  • Section 17 suggests some additional techniques that may be useful, such as moving tables between replication sets in such a way that you minimize the set of tables that need to be locked.

Regrettably, there is no perfect answer to this. If it is necessary to submit a SLONIK MOVE SET request, then it is presumably necessary to accept the brief application outage. As Slony-I/pgpool linkages improve, that may become a better way to handle this.