
When a Backup is Not a Backup September 8, 2009

Posted by mwidlake in Architecture.

I had a call from an old client today who had found themselves in a bit of a pickle. They were facing what I’d judge is the third most common “avoidable disaster” for Oracle databases. And probably the most devastating.

They had a damaged database, just one of those things. Media failure had corrupted a couple of datafiles. This is not the disaster; most DBAs who have been around for a few years have experienced this. Disks are mechanical devices, they go wrong, controllers develop faults, people accidentally pull the plugs out of switches.

The disaster was that they had never tested their backup. Guess what? It turned out that their backup was not a valid backup. It was just a bunch of data on tape {or disk or wherever they had put the backup}.

A backup is not a backup until you have proven you can recover from it.

If your backup has not been tested, and by that I mean used to recreate the working system which you then test, then it is just a bunch of data and a hopeful wish. You will not know if it works until you are forced to resort to it, and that means (a) you already have a broken system and an impact on your business, (b) time is going to be short and pressure is high, and (c) if you are the DBA, you could be about to fail in your job in a most visible and spectacular way.

Oddly enough, only two or three weeks ago I had another client in exactly the same position. The database was corrupted, the backup had not been tested, and the backup turned out to be a waste of storage. In fact, in this case, I think the backup had not been taken for a considerable period of time as a standby database was being relied on as the backup. Which would have been fine if the standby had been tested as fulfilling the purpose of “working backup”.

The standby had not, as far as I could deduce, ever been opened and tested as a working and complete database since the last major upgrade to the host system. When tested for real, it proved not to be a working and complete database. It was an expensive “hopeful wish”.

The client from a few weeks ago finally got their system back, but it took a lot of effort for the DBAs and the developers to sort it out and they were a tad lucky. The jury is out on the client who called me today.

I can’t think of anything at all that the DBA function does that is more important than ensuring backups are taken and work. {Maybe an argument could be made that creating the system is more important, as nothing can be done without it, but then you could argue that the most important thing you do in the morning is get out of bed}. Admittedly, setting up, running and testing backups is not a very exciting job. In fact it often seems to be a task passed on to a less experienced member of the team {just like creating databases in the first place}. But most of us are not paid to have fun, we are paid to do a job.

I’ll just make a handful more comments before giving up for today.

  • The database backup nearly always needs to be part of a more comprehensive backup solution. You need a way to recreate the Oracle install, either a backup of the binaries and auxiliary files (SQL*Net configuration, initialization files, password files etc) or at least the installation media and all patches so you can recreate the installation. You need a method to recover or recreate your application and also your monitoring. You might need to be able to recreate your batch control. The O/S? Have you covered all components of your system? Could you delete any given file off your system and still fix it?
  • Despite the potential complexity resulting from the previous point, you should keep your backup as simple as you can. For example, if you only have a handful of small databases, you might just need old-style hot backups via a shell script, rather than RMAN {a minimal sketch follows this list}. Only get complex {like backing up from a standby} if you have a very, very compelling business need to do so.
  • Testing the backup once is a whole world better than never testing it. However, regular, repeated recovery tests not only allow the DBA/Sys Admin teams to become very comfortable with the recovery process and help ensure it is swift and painless, but by trying different scenarios, you may well discover issues that come under the first two points.
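
To make that second point a little more concrete, here is a minimal sketch of the sort of old-style hot backup script I mean. The tablespace name, datafile path and backup destination are invented for the example; a real script would loop over every datafile of every tablespace and also keep the controlfile and archived redo logs safe:

    #!/bin/sh
    # Minimal old-style hot backup sketch: one tablespace shown.
    # BACKUP_DIR, the USERS tablespace and its datafile path are
    # illustrative only.
    BACKUP_DIR=/backup/hot_$(date +%Y%m%d)
    mkdir -p "$BACKUP_DIR"
    sqlplus -s "/ as sysdba" <<EOF
    ALTER TABLESPACE users BEGIN BACKUP;
    HOST cp /u01/oradata/PROD/users01.dbf $BACKUP_DIR/
    ALTER TABLESPACE users END BACKUP;
    ALTER SYSTEM ARCHIVE LOG CURRENT;
    EOF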

I’ve not even touched on the whole nightmare of Disaster Recovery 🙂

Comments»

1. PdV - September 8, 2009

Excellent Warning.

(and do keep it Simple,
Simple recoveries are easier to understand, easier to test and easier to do under pressure)

mwidlake - September 8, 2009

Ahhh, simplicity, it makes life easier.
I knew one site which decided to make their backups smaller and quicker by flipping tablespaces to read-only, backing them up once and then excluding them from the regular backups.
Absolutely fine – except they regularly could not find a tape more than a few weeks old.
These read-only tablespace backups? Some were many months old…

PdV - September 8, 2009

That trick, R/O tablespaces to make it faster, is not, in my opinion, a simplification. It makes the backup a hybrid of at least two methods.

Simple, in my opinion, would be a begin-backup + snapshot + end-backup (and disk-change-log). Or an RMAN to disk.

Simple = I can explain it to anyone on the back of a ciggy.
nb: simple does not mean: hidden behind a GUI.
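
For the “RMAN to disk” flavour, a minimal sketch might look like this (the destination path is invented for the example and the database is assumed to be in archivelog mode):

    #!/bin/sh
    # Minimal "RMAN to disk" sketch; /backup/rman is an invented path.
    rman target / <<EOF
    CONFIGURE CONTROLFILE AUTOBACKUP ON;
    CONFIGURE CHANNEL DEVICE TYPE DISK FORMAT '/backup/rman/%U';
    BACKUP DATABASE PLUS ARCHIVELOG;
    DELETE NOPROMPT OBSOLETE;
    EOF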

mwidlake - September 8, 2009

Oh, I absolutely agree Piet, I was making the point that these “back up the R/O Tablespace once” solutions are NOT a simplification as you then have to be so sure you can find that once-only backup months or years down the line. It is made even worse when tablespaces are flipped to R/O ad-hoc and “backed up once”. People seem to forget that finding all those tapes is a complication (and in the case of the client I was thinking of, would be a minor miracle!). It is, however, worth the complication if it makes your backups manageable.

I also agree that vanilla hot backups (usually where a SQL script selects each tablespace from the database and wraps it with a begin-backup, copy-file, end-backup set of statements) are a pleasantly simple and generally-not-mentioned-in-10g-plus-documentation solution.

BTW I find if I can explain something to you Piet, it is indeed simple 🙂 {Do I owe you another pint now?}

PdV - September 9, 2009

I did and Do Agree Martin!

And I am so grateful for “SQL>ALTER DATABASE BEGIN BACKUP;” in 10g(R2?). Of course, Oracle did not give too much airtime to that simple command, as it might lure people away into simple DIY jobs using a disk-copy system.

BTW: By insisting on Simplicity, I turn my intellectual disadvantage into an asset.
A pint or a (double-)Glenlivet is always welcome (no ice, and a still water on the side). But only After I double-checked my recoverability. Cheers.

mwidlake - September 9, 2009

It is a nice new feature to come under the covers, Piet. It can result in problems for systems with lots of redo, as tablespaces in backup mode create more redo than normal (quite a bit more, it seems), so by swapping all tablespaces into backup mode for the whole period of the backup, you can generate a terrific number of archived redo logs.
Oracle really don’t want to mention anything other than RMAN any more, do they?
As for whiskey, I might introduce you to a new one…

2. Marko Sutic - September 8, 2009

I definitely agree with you that managing backups is the most important obligation for a DBA.

Doing frequent recoveries on another test machine is the best thing to do because you’ll become proficient in the whole process of recovering databases.

But besides that, what do you think about RMAN validation?
Could you rely enough on RMAN validation and be certain that you have valid and useful backups without doing recovery on another machine?

Regards,
Marko

3. mwidlake - September 8, 2009

Marko,

I’m far from an RMAN expert, so I don’t know if the latest versions include a logical check that you have backed up everything you need to do a recovery of the database, or just verify that what you have written to your archive is readable and valid.

I suspect the latter. Which is of course very similar to getting your tape robot or device to verify that the files on tape can be read. It gives confidence that what you have written to your archive can be read. It does NOT, however, ensure that what you decided to back up is a complete set of what you need to recover.

There is a middle ground. Once you have done a couple of test recoveries to find issues and ensure that what you back up is what you needed to back up to allow recovery, then you can probably use the verification of the backup set to assure yourself that you have avoided media failure.
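
For what it is worth, the sort of checks I have in mind look something like this (a sketch only; the exact options vary by version). Neither command proves that your backup strategy covers everything a full recovery needs, which is why the test recovery remains essential:

    #!/bin/sh
    # BACKUP VALIDATE reads the live datafiles and archived logs looking
    # for corruption; RESTORE ... VALIDATE reads the backup pieces a
    # restore would use and checks they are usable. Physical checks only.
    rman target / <<EOF
    BACKUP VALIDATE DATABASE ARCHIVELOG ALL;
    RESTORE DATABASE VALIDATE;
    EOF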

Neil Chandler - September 8, 2009

We’re fortunate enough to have the resources to have yesterday’s database open for the support teams to do support against near-live data. The recovery of this database validates the RMAN backup. It’s surprising just how often the backup goes wrong – we have to re-run the backup occasionally when the recovery fails. Several times a year. Then work out why the backup didn’t work in the first place – usually some space issue, but not always. Sometimes it’s something weird, unexpected and unpredictable. Our backups run off the Max Availability standby, just to keep things a little more complex.

Whoever invented incremental backups was just a sucker for a long and difficult recovery path. R/O tablespaces. Just back it up once? NO! Back it up every night like everything else. Or you’ll lose it. Guaranteed. After 3 years in storage, tapes have a nasty habit of decaying – if the warehouse hasn’t burned down (ref: a large media storage company a few years ago in East London)…

mwidlake - September 8, 2009

I love the sound of your backup/recovery/use solution Neil. I of course know you are an experienced DBA/Architect, and the fact that you find regular gotchas and issues just shows that test recoveries really are not optional.

I don’t agree with you on “back it up every night” though Neil. Because of VLDBs. If you have several tens of terabytes of database, you simply can’t back it up every night as it takes a couple of days to do the backup. Or you have to spend a lot of money on what are usually complex disk solutions. Even backing up once a week can be tricky as the longer the backup takes, the more likely an issue will arise.

Having large sections of the database as read-only and “backing up once” is also a valid solution. Though the “back it up once” is a bit of a lie. You back it up twice (2 copies) and repeat the backup regularly, say once every 6 months. It is not as simple as Piet, you and I would like, but it makes the backups manageable.

So long as you test the recovery regularly
🙂
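
A rough sketch of that “back up the R/O sections once” pattern, assuming RMAN and an invented tablespace name:

    #!/bin/sh
    # Illustrative only: HIST_2008 is an invented read-only tablespace.
    rman target / <<EOF
    # Two copies of the read-only data, repeated every six months or so
    BACKUP COPIES 2 TABLESPACE hist_2008;
    # Regular backups can then skip files that have not changed since
    # their last backup (read-only tablespaces included)
    CONFIGURE BACKUP OPTIMIZATION ON;
    BACKUP DATABASE PLUS ARCHIVELOG;
    EOF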

Neil Chandler - September 8, 2009

Test recoveries are NOT optional.

Reasons to lose your job #1 – can’t recover database.

Accidentally shut down the Production Trading system at a large bank and your bonus will be diminished, but you shouldn’t get sacked (well, not the first time 🙂 ). Lose the data? You’re Fired.

mwidlake - September 8, 2009

🙂 You are quite… strong in your opinion on this one Neil 🙂

And I utterly agree. If you are a DBA and you have not tested your backup, being Set Free Upon The Oceans Of Opportunity is and should be a distinct possibility.

So how many times have you accidentally shut down the production system, huh?

4. PdV - September 13, 2009

Oracle definitely herds us towards RMAN. At least for the moment.

I understand the risk of too much redo during “alter database begin/end backup”, but that is manageable. Use quiet periods, and limit the time “in backup mode”.
On larger systems I would only recommend it if you have a disk system with snapshot abilities. In that case, you put the burden of change-logging on the disk system.

Ditto for virtual machines: a snapshot of a virtual machine amounts to the same principle: The VM takes care of the change-logging.
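
As a sketch of that begin-backup / snapshot / end-backup pattern (the snapshot command is purely a placeholder for whatever your storage or VM layer provides):

    #!/bin/sh
    # Whole-database backup mode around a storage snapshot (10gR2 onwards).
    # take_storage_snapshot is a placeholder, not a real utility.
    sqlplus -s "/ as sysdba" <<EOF
    ALTER DATABASE BEGIN BACKUP;
    EOF

    take_storage_snapshot /u01/oradata/PROD

    sqlplus -s "/ as sysdba" <<EOF
    ALTER DATABASE END BACKUP;
    ALTER SYSTEM ARCHIVE LOG CURRENT;
    EOF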

