Tue Sep 21 21:35:41 PDT 2004
- Previous message: [Slony1-general] Replicating complex (?) databases
- Next message: [Slony1-general] Error while running slave slon process
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
After watching the common "bump in the night" scenarios that pop
up, I want to set up a smarter "watchdog" script that watches each
slon instance and decides whether it needs to be restarted.
At present, there's a script that basically "zaps" the slons every now
and again and restarts them. That is far from ideal, for a couple of
reasons that I can see:
1. It leaves PG backends around waiting for notifications, which
causes dead tuples on pg_listener to linger around, and whatever
other ills are engendered by "zombie" transactions.
2. Sometimes it causes the slon instances to get a bit deranged such
that they need a "restart node".
My thought is to have the "watchdog" be smarter in three ways:
a) It should only kill the slon if there seems reason to do so.
The case where we _definitely_ need it is when a VPN network
connection falls down, so that events no longer get through.
That suggests looking to see how recently events have made it
through.
Here's the query I'm thinking of.
oxrslive=# select now() - ev_timestamp > '00:20:00'::interval as event_old, now() - ev_timestamp as age,
oxrslive-# ev_timestamp, ev_seqno, ev_origin as origin
oxrslive-# from _oxrslive.sl_event events, _oxrslive.sl_subscribe slony_master
oxrslive-# where
oxrslive-# events.ev_origin = slony_master.sub_provider and
oxrslive-# not exists (select * from _oxrslive.sl_subscribe providers
oxrslive(# where providers.sub_receiver = slony_master.sub_provider and
oxrslive(# providers.sub_set = slony_master.sub_set and
oxrslive(# slony_master.sub_active = 't' and
oxrslive(# providers.sub_active = 't')
oxrslive-# order by ev_origin desc, ev_seqno desc limit 1;
 event_old |       age       |        ev_timestamp        | ev_seqno | origin
-----------+-----------------+----------------------------+----------+--------
 f         | 00:00:01.025902 | 2004-09-21 19:16:43.804917 |   621069 |      1
(1 row)
It looks for the latest timestamp associated with an event coming
from a "master" node, and returns "t" in the first field if the
interval since the last event exceeds 20 minutes (which I'm
treating as a provisional parameter value).
Is there anything particularly deranged about that? Or should I
be looking to see which 'active' origin has checked in least
recently?
b) It should submit a "restart node" if it notices, in the logs:
FATAL localListenThread: Another slon daemon is serving this node already
Question: How exuberant should it be about this? Tell all the
nodes to restart? Or just the offending one?
c) If the slon process has died, it should restart it, and probably
throw out a "Help! Call a dba!" if this has happened too many times
recently.
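
To make the three rules concrete, here is a rough sketch of the
decision step only, in Python. It is not an existing Slony-I tool.
The function name, the action strings, and the max_restarts limit are
all hypothetical. The only values taken from above are the 20 minute
threshold and the FATAL message text. It assumes the caller has
already run the query, gathered the log lines written since the last
check, and tested whether the slon process is alive. The order in
which the rules are checked is also an assumption.

```python
from datetime import datetime, timedelta

# Provisional threshold from the post; all other names are hypothetical.
STALE_AFTER = timedelta(minutes=20)
DUPLICATE_SLON = "Another slon daemon is serving this node already"


def decide(now, last_event_at, new_log_lines, slon_running,
           recent_restarts, max_restarts=3):
    """Pick one watchdog action for a single node.

    now, last_event_at: datetime values; last_event_at is the newest
        ev_timestamp seen from the origin, or None if none was found.
    new_log_lines: slon log lines written since the previous check.
    slon_running: True if the slon process is still alive.
    recent_restarts: times of restarts inside the escalation window.
    """
    # (b) the log reports a second slon serving the same node
    if any(DUPLICATE_SLON in line for line in new_log_lines):
        return "restart_node"
    # (c) the slon has died: restart it, or escalate if this keeps happening
    if not slon_running:
        if len(recent_restarts) >= max_restarts:
            return "page_dba"
        return "start_slon"
    # (a) no event has arrived from the origin for too long
    if last_event_at is None or now - last_event_at > STALE_AFTER:
        return "kill_and_restart_slon"
    return "leave_alone"
```

For example, with a one second old event, no new log lines and a live
slon, decide() returns "leave_alone"; with an event older than 20
minutes it returns "kill_and_restart_slon". Whether "restart_node"
should go to every node or only to the offending one is still the
open question from (b); the sketch does not settle it.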
--
let name="cbbrowne" and tld="ca.afilias.info" in String.concat "@" [name;tld];;
<http://dev6.int.libertyrms.com/>
Christopher Browne
(416) 673-4124 (land)