Bug 335 - Disable node on failed node event
Summary: Disable node on failed node event
Status: NEW
Alias: None
Product: Slony-I
Classification: Unclassified
Component: slon
Version: devel
Hardware: All All
Importance: low enhancement
Assignee: Slony Bugs List
Depends on:
Reported: 2014-03-19 13:19 UTC by Jan Wieck
Modified: 2014-03-19 13:54 UTC

Description Jan Wieck 2014-03-19 13:19:35 UTC
Currently there is a possibility that a failed node can process the FAILED_NODE event for itself and even make it to the DROP_NODE, in which case it will drop its own Slony schema. That schema can contain vital information, such as sl_log data that had not yet replicated.

The current proposal is to disable the node by setting no_active=False when it processes FAILED_NODE, then terminate slon. slon checks that flag (or should do so) and refuses to start when it is False.

We should further add a unique identifier, such as a UUID, to sl_node. On startup, the remote listener will check that the value for its entry in the event provider matches its own and refuse to work if it does not. This prevents a temporarily failed node from processing bogus data after a "DROP NODE, STORE NODE" sequence happened while it was failed, if it never received any of those events or the FAILED_NODE event.
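
The two startup checks proposed above can be sketched as follows. This is only an illustration of the decision logic in Python, not slon code: NodeRow, the no_uuid field, and startup_refusal are hypothetical names, and refusing when the event provider has no entry at all is an added assumption.

```python
import uuid
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class NodeRow:
    """Stand-in for one sl_node row (hypothetical shape)."""
    no_id: int
    no_active: bool
    no_uuid: uuid.UUID  # the proposed identifier column; the name is an assumption


def startup_refusal(local: NodeRow, provider_view: Optional[NodeRow]) -> Optional[str]:
    """Return a reason to refuse startup, or None if the node may run.

    local         -- this node's own sl_node row
    provider_view -- the event provider's sl_node row for this node, if any
    """
    # Check 1: FAILED_NODE processing set no_active to False.
    if not local.no_active:
        return "node is disabled (no_active is False)"
    # Assumption: a missing entry on the provider is also treated as fatal.
    if provider_view is None:
        return "event provider has no entry for this node"
    # Check 2: a DROP NODE / STORE NODE sequence would have issued a new UUID.
    if provider_view.no_uuid != local.no_uuid:
        return "node UUID differs from the event provider's entry"
    return None
```

A node that was dropped and stored again while it was down would fail the second check, because the provider's row would carry a different identifier than the stale local one.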
Comment 1 Christopher Browne 2014-03-19 13:54:17 UTC
Here is a function that generates well-formed UUIDs of Type 1 form:

create or replace function public.make_retroactive_uuid (p_trid integer, p_date timestamptz, p_suffix text)
returns uuid as
$$
declare
   c_uuid uuid;
   c_epoch bigint;
   c_tlow character(8);
   c_tmid character(4);
   c_version character(1);
   c_thi character(4);
   c_seq character(4);
   c_node character(12);
begin
   -- c_epoch counts 100 ns intervals since 1582-10-15, the UUID epoch;
   -- 12219292800 is the number of seconds from there to 1970-01-01.
   if p_date is null then
     c_epoch := 10000000*(extract(epoch from '1970-01-01'::timestamptz)+12219292800::bigint)::bigint;
   else
     c_epoch := (10000000*(extract(epoch from p_date)+12219292800::bigint))::bigint;
   end if;
   -- time_low: low 32 bits of the timestamp, XORed (#) with higher bits of p_trid
   c_tlow := lpad(to_hex(mod(c_epoch, 4294967296::bigint)
                         # mod(p_trid/8192, 256)
                    ), 8, '0');
   -- time_mid: next 16 bits of the timestamp
   c_tmid := lpad(to_hex(mod(c_epoch /4294967296::bigint, 65536)),4,'0');
   c_version := '1';
   -- time_hi: remaining high bits of the timestamp (three hex digits)
   c_thi := lpad(to_hex(c_epoch /281474976710656::bigint),3,'0');
   -- clock sequence: variant bits '10' plus the low 13 bits of p_trid
   c_seq := lpad(to_hex((B'10000000' | mod(p_trid/256, 32)::bit(8))::integer),2,'0')
              || lpad(to_hex(mod(p_trid, 256)), 2, '0');
   -- node: caller-supplied 12 hex digits
   c_node := p_suffix;
   c_uuid := (c_tlow || '-' || c_tmid || '-' || c_version || c_thi || '-' || c_seq || '-' || c_node)::uuid;
   return c_uuid;
end;
$$ language plpgsql;
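
To make the field layout easier to follow, here is a rough Python port of the same arithmetic. It is a sketch for cross-checking only and not part of the proposal; the name make_retroactive_uuid_py is made up, and the timestamp is passed as an integer count of 100 ns ticks since the Unix epoch (an assumption chosen to avoid floating-point rounding).

```python
import uuid

# Seconds between the UUID epoch (1582-10-15) and the Unix epoch (1970-01-01).
GREGORIAN_OFFSET_S = 12219292800


def make_retroactive_uuid_py(trid: int, unix_ticks_100ns: int, suffix: str) -> uuid.UUID:
    """Build a version 1 UUID from a transaction id, a timestamp and a node suffix."""
    ts = unix_ticks_100ns + GREGORIAN_OFFSET_S * 10_000_000

    # time_low: low 32 bits of the timestamp, XORed with bits 13..20 of trid.
    time_low = (ts % 2**32) ^ ((trid // 8192) % 256)
    # time_mid: next 16 bits; time_hi: whatever remains above bit 48.
    time_mid = (ts // 2**32) % 65536
    time_hi = ts // 2**48

    # Clock sequence: variant bits '10', then bits 8..12 and 0..7 of trid.
    clock_hi = 0x80 | ((trid // 256) % 32)
    clock_low = trid % 256

    text = (f"{time_low:08x}-{time_mid:04x}-1{time_hi:03x}-"
            f"{clock_hi:02x}{clock_low:02x}-{suffix}")
    return uuid.UUID(text)
```

Calling make_retroactive_uuid_py(1, 13952620559113792, "f6b3a3220461") uses the tick count for 2014-03-19 20:47:35.9113792 UTC, the time shown in the uuid -d output further down, and yields b39dde40-afa7-11e3-8001-f6b3a3220461.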

Usage example:

select public.make_retroactive_uuid(1, now(), 'f6b3a3220461');
(1 row)

That can then be decoded using OSSP UUID code:

> uuid -d b39dde40-afa7-11e3-8001-f6b3a3220461
encode: STR:     b39dde40-afa7-11e3-8001-f6b3a3220461
        SIV:     238751509672186172578688542368744997985
decode: variant: DCE 1.1, ISO/IEC 11578:1996
        version: 1 (time and node based)
        content: time:  2014-03-19 20:47:35.911379.2 UTC
                 clock: 1 (usually random)
                 node:  f6:b3:a3:22:04:61 (local unicast)

Note that type 1 is the usual sort generated by default.
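
If the OSSP uuid tool is not at hand, the same fields can be read back with Python's standard uuid module. This is a small sketch for cross-checking the decode above; decode_v1 is just an illustrative helper name.

```python
import uuid
from datetime import datetime, timedelta, timezone

# Version 1 timestamps count 100 ns intervals from this instant.
GREGORIAN_EPOCH = datetime(1582, 10, 15, tzinfo=timezone.utc)


def decode_v1(text: str) -> dict:
    """Split a version 1 UUID into its version, time, clock sequence and node."""
    u = uuid.UUID(text)
    # datetime resolves microseconds only, so the last 100 ns digit is dropped.
    created = GREGORIAN_EPOCH + timedelta(microseconds=u.time // 10)
    return {
        "version": u.version,
        "time": created,
        "clock": u.clock_seq,
        "node": f"{u.node:012x}",
    }


fields = decode_v1("b39dde40-afa7-11e3-8001-f6b3a3220461")
print(fields)
```

The clock value matches the p_trid of 1 and the node matches the p_suffix passed in the usage example.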