When a Backup is Not a Backup September 8, 2009

Posted by mwidlake in Architecture.

I had a call from an old client today who had found themselves in a bit of a pickle. They were facing what I’d judge is the third most common “avoidable disaster” for Oracle databases. And probably the most devastating.

They had a damaged database, just one of those things. Media failure had corrupted a couple of datafiles. This is not the disaster; most DBAs who have been around for a few years have experienced this. Disks are mechanical devices and they go wrong, controllers develop faults, people accidentally pull the plugs out of switches.

The disaster was that they had never tested their backup. Guess what? It turned out that their backup was not a valid backup. It was just a bunch of data on tape {or disk or wherever they had put the backup}.

A backup is not a backup until you have proven you can recover from it.

If your backup has not been tested, and by that I mean used to recreate the working system which you then test, then it is just a bunch of data and a hopeful wish. You will not know if it works until you are forced to resort to it, and that means (a) you already have a broken system and an impact on your business (b) time is going to be short and pressure is high and (c) if you are the DBA, you could be about to fail in your job in a most visible and spectacular way.

Oddly enough, only two or three weeks ago I had another client in exactly the same position. The database was corrupted, the backup had not been tested, and the backup turned out to be a waste of storage. In fact, in this case, I think no backup had been taken for a considerable period of time, as a standby database was being relied on as the backup. Which would have been fine if the standby had been tested as fulfilling the purpose of “working backup”.

The standby had not, as far as I could deduce, ever been opened and tested as a working and complete database since the last major upgrade to the host system. When tested for real, it proved not to be a working and complete database. It was an expensive “hopeful wish”.

The client from a few weeks ago finally got their system back, but it took a lot of effort for the DBAs and the developers to sort it out and they were a tad lucky. The jury is out on the client who called me today.

I can’t think of anything at all that the DBA function does that is more important than ensuring backups are taken and work. {Maybe an argument could be made that creating the system is more important as nothing can be done about that, but then you could argue that the most important thing you do in the morning is get out of bed}. Admittedly, setting up, running and testing backups is not a very exciting job. In fact it often seems to be a task passed on to a less experienced member of the team {just like creating databases in the first place}. But most of us are not paid to have fun, we are paid to do a job.

I’ll just make a handful more comments before giving up for today.

  • The database backup nearly always needs to be part of a more comprehensive backup solution. You need a way to recreate the Oracle install: either a backup of the binaries and auxiliary files (SQL*Net, initialization files, password files etc) or at least the installation media and all patches, so you can recreate the installation. You need a method to recover or recreate your application and also your monitoring. You might need to be able to recreate your batch control. The O/S? Have you covered all components of your system? Can you delete any given file off your system and fix it?
  • Despite the potential complexity resulting from the previous point, you should keep your backup as simple as you can. For example, if you only have a handful of small databases, you might just need old-style hot backups via a shell script, rather than RMAN. Only get complex {like backing up from a standby} if you have a very, very compelling business need to do so.
  • Testing the backup once is a whole world better than never testing it. However, regular, repeated recovery tests not only allow the DBA/Sys Admin teams to become very comfortable with the recovery process and help ensure it is swift and painless, but by trying different scenarios, you may well discover issues that come under the first two points.
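
To make the "have you covered all components?" question a bit more concrete, here is a minimal sketch of a recovery inventory in Python. It is only an illustration of the idea, not a tool I am claiming anyone uses: the `Component` class, the `unproven_components` function, the component names, the dates and the 90-day threshold are all made up for the example. The rule it encodes is the one above: a component only counts as backed up if a restore of it has actually been proven, and recently.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional


@dataclass
class Component:
    """One piece of the system that must be recoverable (hypothetical structure)."""
    name: str
    last_backup: Optional[date]        # when a backup was last taken, if ever
    last_restore_test: Optional[date]  # when a restore was last proven to work, if ever


def unproven_components(components: List[Component], today: date,
                        max_age_days: int = 90) -> List[str]:
    """Return the names of components whose backup is not proven by a recent restore test.

    max_age_days is an assumed policy value; pick whatever suits your business.
    """
    cutoff = today - timedelta(days=max_age_days)
    unproven = []
    for c in components:
        if c.last_backup is None:
            unproven.append(c.name)   # nothing to restore from at all
        elif c.last_restore_test is None:
            unproven.append(c.name)   # data exists, but it is only a hopeful wish
        elif c.last_restore_test < cutoff:
            unproven.append(c.name)   # tested once, but too long ago to trust
    return unproven


# Illustrative inventory; every name and date here is invented.
inventory = [
    Component("database", date(2009, 9, 7), date(2009, 8, 20)),
    Component("oracle binaries and config", date(2009, 6, 1), None),
    Component("application code", date(2009, 9, 1), date(2009, 1, 15)),
    Component("batch control", None, None),
]

result = unproven_components(inventory, today=date(2009, 9, 8))
for name in result:
    print("NOT a proven backup:", name)
```

In this made-up inventory only the database passes. The binaries have a backup that was never restored, the application's restore test is months old, and batch control has no backup at all, so all three get flagged.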

I’ve not even touched on the whole nightmare of Disaster Recovery :-)
