[Slony1-general] Please HELP - URGENT

Sun Oct 30 18:52:17 PST 2005

Andew

You're right, I realised after, that they're not full vacuums.

There was another database (mail_lxtreme) that was unused (as far as I 
can tell) and which was not being vacuumed:

  SELECT datname, age(datfrozenxid) FROM pg_database;
    datname    |     age
--------------+-------------
  mail_lxtreme | -2074187459
  bp_live      |  1079895636
  template1    |  1076578064
  template0    | -2074187459
(4 rows)

In the end, I did a moveset on all 6 sets from the (damaged) master to 
the slave. Then I shutdown slon and postgress on the old master and 
deleted its data dir, and then re-initdb'd it.

I removed the replication info on the surviving node by doing an uninstall.

I created a new cluster and subscribed to get all the data back onto the 
rebuilt server.

Later when I'm feeling less tired, I'll switchover to reinstate the 
former master (as it is a much faster server).

I'm going to be a lot more careful when I add databases to ensure that 
they always vacuumed periodically.

I'l also going to add a new nagios script to scan serverlog for any WARN 
or ERROR messages for the current day - this way I should get notice of 
a problem before it becomes a disaster!

Thanks for your feedback.

John

Andrew Sullivan wrote:
> On Sun, Oct 30, 2005 at 09:00:12AM +0000, John Sidney-Woollett wrote:
> 
>>over 2 billion transactions
>>DETAIL:  You may have already suffered transaction-wraparound data loss.
>>
>>We have cronscripts that perform FULL vacuums
> 
> 
> Not on all your your databases.  And anyway
> 
> 
>># vacuum template1 every sunday
>>35 2 * * 7 /usr/local/pgsql/bin/vacuumdb --analyze --verbose template1
>>
>># vacuum live DB every day
>>35 5 * * * /usr/local/bin/psql -c "vacuum verbose analyze" -d bp_live -U
>>postgres --output /home/postgres/cronscripts/live/vacuumfull.log
> 
> 
> Those aren't fill vacuums.  There must be some database in there that
> you're not telling us about.  Do you have anything other than
> template0, template1, and bp_live?  Also, has template0 always been
> frozen?
> 
> 
>>2) What can I do to recover the data?
> 
> 
> Nothing, save for restoring from old backups.
> 
> 
>>I can failover to the slave server, but what do I need to do to rebuild
>>the original database?
> 
> 
> You'll need to rebuild it from scratch.  You could do a switchover
> instead, but I think that's risky in this case.
> 
> 
>>Should I failover now?!! And then start rebuilding the old master
>>database (using slon, I presume)?
> 
> 
> That's what I'd do.  It's just like adding a new node.
> 
> 
>>How do I stop this EVER happening again??!!!
> 
> 
> Well, _something_ didn't get vacuumed in time.  Better find out what
> that was.  I'm also extremely surprised you didn't see the warnings
> in time -- are you sure you're not overlooking something important in
> your logs?
> 
> A
>