Christopher Browne cbbrowne
Mon Oct 23 15:27:30 PDT 2006
paul cannon wrote:
> Our installation of slony 1.2.0 is struggling.
>
> It started trying to do a log switchover several days ago. Since then,
> every 10-20 minutes we get this in the logs:
>
> NOTICE:  Slony-I: log switch to sl_log_2 still in progress - sl_log_1 not truncated
>
> Between each time, it looks like it's doing a DELETE on both logs and a
> VACUUM on sl_log_1.
>
> sl_log_status on the master node is 3, which from the code means
> everything should be going into sl_log_2, right? And yet /both/ logs are
> continually growing, aside from the periodic DELETEs.
>
> What's going on? Is this expected behavior? The way I understand it,
> only one log should be growing and the other should be getting cleaned
> out as the events in it propagate and are deleted.
>
> I'm hoping that eventually a log switchover will be able to complete,
> but it doesn't look like that's going to happen as long as things are
> still getting put into the inactive log.
>
> There aren't any long-running transactions on any of the nodes.
>
> Ideas?
>
> I ran into the problem at
> http://gborg.postgresql.org/project/slony1/bugs/bugupdate.php?1591,
> before applying my patch as shown there. Could that have messed up the
> log switch process and left it only partially switched somehow?
>
A simpler patch is thus:

cbbrowne@dba2:Slony-I/slony1_1_2_STABLE/src/slon> cvs diff -u remote_worker.c
Index: remote_worker.c
===================================================================
RCS file: /usr/local/cvsroot/slony1/slony1-engine/src/slon/remote_worker.c,v
retrieving revision 1.124.2.1
diff -c -u -r1.124.2.1 remote_worker.c
cvs diff: conflicting specifications of output style
--- remote_worker.c    23 Oct 2006 21:57:01 -0000    1.124.2.1
+++ remote_worker.c    23 Oct 2006 22:14:44 -0000
@@ -5508,8 +5508,15 @@
                                 "from %s.sl_log_1 "
                                 "where log_origin = '%s' "
                                 "  and log_xid = '%s' "
+                                "  and log_actionseq = '%s' UNION ALL "
+                                "select log_cmddata "
+                                "from %s.sl_log_2 "
+                                "where log_origin = '%s' "
+                                "  and log_xid = '%s' "
                                 "  and log_actionseq = '%s'",
                                 rtcfg_namespace,
+                                log_origin, log_xid, log_actionseq,
+                                rtcfg_namespace,
                                 log_origin, log_xid, log_actionseq);
                         res2 = PQexec(dbconn, dstring_data(&query2));
                         if (PQresultStatus(res2) != PGRES_TUPLES_OK)

That has the merit of not requiring any variables to communicate which
log is which :-).  Jan may come out with something else...

That data is going to both log tables seems rather curious.

According to the logic in slony1_funcs.c:

	switch (log_status)
	{
		case 0:
		case 2:
			cs->plan_active_log = cs->plan_insert_log_1;
			break;

		case 1:
		case 3:
			cs->plan_active_log = cs->plan_insert_log_2;
			break;

		default:
			elog(ERROR, "Slony-I: illegal log status %d", log_status);
			break;
	}

Based on that, you should indeed only be adding tuples to sl_log_2.

Corresponding logic in logswitch_final:

    -- ----
    -- status = 3: sl_log_2 active, cleanup sl_log_1
    -- ----
    if v_current_status = 3 then
        -- ----
        -- The cleanup thread calls us after it did the delete and
        -- vacuum of both log tables. If sl_log_1 is empty now, we
        -- can truncate it and the log switch is done.
        -- ----
        for v_dummy in select 1 from @NAMESPACE@.sl_log_1 loop
            -- ----
            -- Found a row ... log switch is still in progress.
            -- ----
            raise notice ''Slony-I: log switch to sl_log_2 still in progress - sl_log_1 not truncated'';
            return -1;
        end loop;

        raise notice ''Slony-I: log switch to sl_log_2 complete - truncate sl_log_1'';
        truncate @NAMESPACE@.sl_log_1;
        perform "pg_catalog".setval(''@NAMESPACE@.sl_log_status'', 1);
        -- Run addPartialLogIndices() to try to add indices to unused sl_log_? table
        perform @NAMESPACE@.addPartialLogIndices();
        return 2;
    end if;

The only reason you should be getting that log message is that there are
still live tuples in sl_log_1.

That's curious; they shouldn't be getting added in by anything (notably
if that sequence value is set to 3).

It seems worth looking at what's in sl_log_1, and trying to associate
those entries with SYNC events, to see why the old entries aren't
getting cleaned out.


