[Slony1-general] Master server reboot

Mon Sep 19 10:19:20 PDT 2005

Hi Christopher,

At 05:59 AM 9/19/2005, Christopher Browne wrote:
>"Deon van der Merwe" <dvdm at truteq.co.za> writes:
> > We have a master and 4 slave servers running Slony1 on a LAN.
> > Everything is working great, except for one thing:
> > - each of the slon processes runs on its own server
> > - they each run in an endless loop, so that they can always start again for
> > whatever reason
> > - we had to do a reboot of the master server
> > - after the reboot, all the slaves reconnected
> > - the problem is this: the actual replication of data stopped.  With a
> > restart of the slon process on every slave the replication started to work
> > again.
> >
> > My questions are:
> > - what is the expected behavior for the above scenario?
> > - I need to investigate some more... What can/should/must I check in order
> > to find out why this happened?  That is, if I am able to repeat it!
> > - I will need to find out if I can repeat what happened...
> >
> > We are running on FC4 (so that is PostgreSQL 8.0.3) on all the servers
> > using Slony-I 1.1.0.
>
>So, the only database that "fell over" was the master?

Correct.  All 4 slaves were untouched, and we rebooted only the master.

>It sounds like what happened is that the remote worker threads that
>pointed to the "master" saw that DB go away, and shut down the one
>relevant remote worker thread.
>
>This left all the other threads up and running, which would have been
>OK had subscriptions been provided by the other threads...

From what I could see (from the little that I know of Slony1...),
they did reconnect.

>I have to call this behaviour "not unexpected."

>An interesting retry would be to have one or more cascaded
>subscribers.

I will try to set this up on the test system, as the above happened
on the live system.

>Expected result there: If you restart the slons for the direct
>subscribers, that should suffice to get all the subscribers back
>going.  The cascaded subscribers should pick up once the direct
>subscribers have their slons restarted.

Restarting the slons on each slave did restart the actual
replication without any delay.

I really want to investigate this more, but I need to know what to
check, and where, in order to provide more detailed information.  Any
suggestions?

-Deon
_____________________________________________________
TruTeq Wireless (Pty) Ltd.  | Tel: +27 (0)12 667 1530
http://www.truteq.co.za     | Fax: +27 (0)12 667 1531
Wireless communications for remote machine management