
VLDB Backups October 13, 2009

Posted by mwidlake in Architecture, VLDB.

One of the indications that your database classes as a VLDB is that your backups are giving you problems simply due to the size of the database.

As some of you will know from previous posts of mine, like this one about how vital it is to prove your backup and this one about how you maybe need to back up more than just the database, I have a thing about backups. Or, more specifically, recovery.

My focus on backups may surprise some people who have worked with me as I often state “I am not a production DBA”. {And no Dave, this is not me saying I am too good to be a proddy DBA, it is me saying I do not have current, strong skills in many of the daily proddy DBA tasks}. However, I am an architect. Whenever I work on a new system or heavily modify an existing system, I try and keep the need for backup and recovery at the front of my mind.

The most common issue encountered with backing up a VLDB is the time it takes to run the backup: it simply cannot be completed in the window between backup cycles, usually a day.

The second most common issue is the impact on the live system of running the backup. Sometimes this impact is overstated; after all, if the backup is one thread running on the server, it is only going to consume as much resource as one thread can, which may well leave enough over for the daily processing requirements. Usually, though, for large systems steps have been taken to run the backup in parallel, which creates considerable load on the system.
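For RMAN at least, that degree of parallelism is a single configuration setting. A minimal sketch (the channel count here is just for illustration):

  # More channels mean a faster backup but more load on the live system
  CONFIGURE DEVICE TYPE sbt PARALLELISM 4;
  BACKUP DATABASE;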

A third issue, which is related to the first, is that the backup takes so long and uses so many tapes (or so much space) that it rarely completes – a network glitch, a failure of the backup suite, or running out of media can all stop the backup from finishing. I’ve been in the situation of attempting 4 or 5 backups for each one that succeeds, as something crops up in the 2 or 3 days it takes to run the backup. {In our case it was the flaky backup software, grrrrr}.

The final issue I’ll mention is one that is often overlooked: you can’t afford the time to recover the backup if it is ever needed. I’ve seen this especially with export or expdp-based backups. Some sites still use export, and it has its place with smaller systems; it often seems to be used with OLTP systems that have more than 75% of the database volume as indexes. The export runs fine overnight, as it is only processing the 25% of the system that is data. But when you ask the client if they can wait 5 days to import the export, they go pale. This time-to-recover can also be a problem with RMAN backups; you need to read in everything you wrote out.

I’ve said it before but I’m going to say it again – a backup is not a backup until you have done a successful test full recovery. This would certainly highlight how long your recovery takes.
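A cheap first check, though it is no substitute for a genuine restore to another host, is RMAN’s validate option, which reads the backup pieces back without writing anything:

  # Reads the backups that a restore would use, checking they are usable
  RESTORE DATABASE VALIDATE;
  RESTORE ARCHIVELOG ALL VALIDATE;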

So, how do you solve the problem of backing up a VLDB?

Well, one solution is to not bother. I know of a couple of sites that have two physical copies of the database, at different locations, and write all data to both. If they lose one copy, they can keep running on the other copy whilst the lost version is rebuilt. Your swap-over could be almost instant.
Drawbacks here are:

  • If you lose one copy you have no redundancy until the second system is rebuilt. This is like losing a disk out of a RAID5 array, another failure is disaster. As databases get bigger, this period of zero redundancy gets longer and thus the chance of a second failure increases (which again is just like the RAID5 array – yet another argument against massive discs).
  • As you write to both systems, if the damage is caused by the feed (eg accidentally deleting data) then both copies are damaged, unless you have a delay on one system, in which case you now have issues with catching up on that delay if you have to swap to the second system. Flashback may save you from damage caused by the feed (see the sketch after this list).
  • The cost of the second system and the complexity of the double-writes can both be issues.
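To expand on the flashback point above, a sketch of undoing feed damage on one copy. This needs flashback logging to have been enabled beforehand, and how far back you can go is bounded by db_flashback_retention_target; the one-hour window is invented for illustration:

  -- On the damaged copy, from SQL*Plus
  SHUTDOWN IMMEDIATE;
  STARTUP MOUNT;
  FLASHBACK DATABASE TO TIMESTAMP SYSTIMESTAMP - INTERVAL '1' HOUR;
  ALTER DATABASE OPEN RESETLOGS;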

Another solution is physical DataGuard. I see this as slightly different from the double-system approach above, as you have more options, such as replicating to more than one other system, opening and reading the DataGuard copy, opening and using the copy before flashing it back and re-recovering, or even Active DataGuard, where the standby database is open and being used even whilst it is kept up-to-date. Again, you can set things up so that the gap between primary system failure and bringing up a new production system is small. A few issues to be mindful of are:

  • You have to ensure that your primary database is running in force logging mode, or else be extremely, and I mean extremely, careful about which unrecoverable operations you run against the database. The latter option is just asking for trouble, frankly. Which is a shame, as all those performance tricks of doing direct IO, append operations and nologging activities to help you process all the data in your VLDB are no longer available to you. This might be a show-stopper. {Checking and enabling force logging is a one-liner; see the sketch after this list.}
  • You have to take care in setting it all up and may need an extra licence.
  • You still have the issue of damage being transmitted to your “backup” before you spot it.
  • The main issue? Someone will get clever and use your DataGuard systems for other things {Such as opening the standby to changing it and then flashing the data back, or use active data guard for reporting which becomes critical to your business} and now you actually have a production critical system split across the DataGuard architecture. It has stopped being a backup, or at least not a dedicated backup. Ooops.
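On that first point, checking and enforcing force logging is only a couple of statements in SQL*Plus on the primary (a sketch):

  -- Is the primary already protecting itself from nologging operations?
  SELECT force_logging FROM v$database;

  -- If that returns NO, switch it on so direct-path and nologging work
  -- still generates redo and reaches the standby
  ALTER DATABASE FORCE LOGGING;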

There is actually no need to back up the whole database every night, though some sites seem fixated on achieving this. Or even every week. There is nothing wrong in taking an RMAN level 0 {zero} backup that is a copy of everything and then just backing up the archived redo logs for, say, 2 weeks before doing another level 0. This works so long as you thoroughly test the recovery and ensure you can restore the level 0, get hold of all those redo logs and apply them in a manner timely enough to support your business. I’ve recovered a level 0 backup over a month old and then run through all the archived redo logs to recreate the system; it worked fine, as the volume of redo was pretty small compared to the database. Some considerations with this method (a minimal RMAN sketch follows the list) are:

  • If you ever have trouble getting files from different days out of your backup area, or occasionally find files from your backup system are corrupt, do not even think of using this method. One missing archived redo file from 13 days back and you are in serious trouble.
  • You still need to do those level zero backups and they take a while. Remember what I said about issues during a long backup?
  • It can get complex.
  • There is going to be a fairly significant delay in recovering your system.
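A minimal RMAN sketch of this cycle (tag names invented, channels assumed already configured):

  # Every couple of weeks: the full baseline
  BACKUP INCREMENTAL LEVEL 0 DATABASE TAG 'fortnightly_l0';

  # Every night in between: just the archived redo generated since,
  # removing the logs from disk once they are safely backed up
  BACKUP ARCHIVELOG ALL TAG 'nightly_arch' DELETE INPUT;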

There are several options with RMAN for doing differential and cumulative incremental level 1 backups against a level 0 baseline backup. They have the same pros and cons as above, often trading more complexity for shorter recovery times. All good, so long as you practice the recovery.
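The two flavours are one keyword apart; sketched in RMAN:

  # Differential (the default): changes since the most recent level 0 or 1
  BACKUP INCREMENTAL LEVEL 1 DATABASE;

  # Cumulative: everything changed since the last level 0 - bigger backups,
  # but fewer pieces to apply when you recover
  BACKUP INCREMENTAL LEVEL 1 CUMULATIVE DATABASE;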

Physical copy at the storage level. These solutions seem to come and go every few years, but the principle is usually either (a) splitting mirrors – you have eg 3 copies of the data duplicated across the storage; you can un-couple one copy and do what you want with it, like copying it to tape, and then reintroduce the copy and catch up on changes, ie “resilver” the mirror – or (b) using fancy logging within the storage layer to create a logical copy of the whole live DB at a point in time by tracking and storing changes. You can then take your time copying that logical version to your backup destination. Taking the initial copy is usually near-instantaneous and with (b) can take up a surprisingly small amount of space (a sketch of the database-side steps for a mirror split follows the list below). Disadvantages?

  • Cost. The clever IO units that can do this tend to be eye-wateringly expensive.
  • Tie-in. If you move storage provider, you need to re-plan and implement a new backup strategy.
  • Probably a personal one this, but can you trust it? I saw it go horribly wrong in about 1998 and even now I kind of wince internally thinking about it.
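For what it is worth, the database side of a mirror split or snapshot is a small dance; a sketch (10g or later, where BEGIN BACKUP works at database rather than tablespace level):

  -- Put the datafiles into hot backup mode just before the split
  ALTER DATABASE BEGIN BACKUP;

  -- (storage layer does its thing here: split the mirror / take the snapshot)

  -- Take them out of backup mode the moment the split completes
  ALTER DATABASE END BACKUP;

  -- Make sure the redo covering the split gets archived too
  ALTER SYSTEM ARCHIVE LOG CURRENT;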

Export and Import. OK, I kind of rubbished this approach earlier, and who in their right mind would try to export/import a VLDB of 20TB? You don’t. You export the critical few percent of the database that you need to recreate some sort of temporary production-capable system. Many applications can actually get by with all the reference/lookup data and the latest month or two of active business data. It gets a workable system up and running to keep your business process ticking over whilst you sort out recovering the rest of the system. The beauty of an export is that it can be imported into any working Oracle database of a high enough release level.
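As a rough sketch of such a partial export with Data Pump – the schema, table, and directory names are entirely made up, and the QUERY clause is best kept in a parameter file to avoid shell quoting pain:

  # critical_subset.par - hypothetical expdp parameter file
  DIRECTORY=exp_dir
  DUMPFILE=critical_%U.dmp
  LOGFILE=critical_exp.log
  PARALLEL=4
  TABLES=APP.ORDERS, APP.CUSTOMERS
  QUERY=APP.ORDERS:"WHERE order_date > ADD_MONTHS(SYSDATE, -2)"

  # invoked as: expdp backup_admin PARFILE=critical_subset.par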

3 months ago I would have said this consideration needed to be designed into your system architecture from the start to stand any reasonable chance of working, but I know of one site that managed just this technique recently. Only because they absolutely had to, but they managed it.

My final backup methodology I’m going to mention here is this: you do not need to back up all of your database in the same way. If you can move a large percentage of your database into readonly tablespaces, you can back up that section of the database once {disclaimer: by once I mean two or three times to two or three places, checking you can read what you wrote and padlocking the door to the vault it is in, and repeating said once-only backup every 6-12 months} and drop that section out of your backup cycle. Now you only need to back up the remaining, hopefully small, active section of the database with whatever method you want. You can tie this in with the previous approach of only needing to recover a critical subset of the system to get going again, ie what is not readonly; the two approaches complement each other. A few issues (the mechanics are sketched after this list):

  • It only really works when you design this into the system from the start.
  • Potentially complex recovery spread over a lot of tapes. Can you rely on being able to get at them in a timely manner?
  • People have a habit of wanting to update some of the stuff you made readonly. Sometimes only a few records but spread over a lot of readonly tablespaces.
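The mechanics, sketched with an invented tablespace name (the ALTER is SQL*Plus, the rest is RMAN):

  -- Freeze the historic data (SQL*Plus)
  ALTER TABLESPACE sales_2008 READ ONLY;

  # One-off backup of the frozen section (RMAN)
  BACKUP TABLESPACE sales_2008 TAG 'ro_once';

  # Regular cycle: leave anything readonly out of it (RMAN)
  CONFIGURE BACKUP OPTIMIZATION ON;
  BACKUP DATABASE SKIP READONLY;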

All the above can be mixed and matched to come up with an overall solution. Personally I like having a physical standby database for immediate failover and an incremental backup held off site for any situation that takes out both the primary and standby databases.

Comments»

1. Niall Litchfield - October 13, 2009

Hey Martin

fantastic post, and really good to see someone responsible for design articulating it. Pet peeve, you keep using architect as a verb rather than a noun. I’m pretty sure you meant design every place you used it (except in job titles which are meaningless).

mwidlake - October 13, 2009

Thanks for that Niall.

“you keep using architect as a verb”…Well, I read my post back to myself and I would like to disagree with you – but I can’t, you are right. So I have updated the post :-).

I could not think of a way to replace calling myself an “Architect” without using some sentence including phrases like “hardware infrastructure” and “ongoing maintenance”, so I stuck with Architect. Sorry. It’s a better title than I usually get called…

2. coskan - October 21, 2009

Martin,

At one of my old companies we were suffering from backup timings, and one thing that came to mind was backing up only the data tablespaces + system tablespaces (the index tablespaces were already separated).
The idea was that most of the time you will only need the data, and indexes can be rebuilt or recreated at any time.
Do you think this could be added as a backup strategy as well, at least as a different kind of partial backup?

3. mwidlake - October 21, 2009

It is certainly a valid consideration Coskan. It would make the backup smaller but extend the recovery time, as you would need to rebuild all those indexes. I think you would need to be in a situation where the speed and size of the backup are very important and the speed of recovery is much less so – maybe you have a standby as your primary method of providing continuity and the backup is for disasters. Say this particular system is not top of the pile in importance: you would be recovering more critical systems first, so you could leave this one rebuilding its indexes in a corner whilst actively working on the more critical systems.
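{For what it’s worth, a sketch of pulling out the index DDL ahead of time, so any such rebuild is scripted rather than improvised; the tablespace name is invented:}

  -- Capture the CREATE INDEX statements for everything in the index tablespace
  SELECT dbms_metadata.get_ddl('INDEX', index_name, owner)
  FROM   dba_indexes
  WHERE  tablespace_name = 'INDEX_TS';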

I’m actually still a proponent of “indexes in different tablespaces” for many reasons, but I should post a blog dedicated to that.

coskan - October 21, 2009

Thanks for the quick reply Martin. IMHO, as a DBA, as long as I have the data restored and recovered I feel much more secure than when the data is not restored or backed up at all.
I have started to believe that unless there is a catastrophic failure, most of the time DBAs do point-in-time recovery, so having just the data can save huge amounts of time compared to a full recovery – and you are absolutely right that this is not a decision you can take without hesitation.

Again, at the end of the day, most of the time the answer to the perfect model is “It depends” 🙂

It will be very good to read your reasons for being a proponent of separate index tablespaces, so fingers crossed again.

Neil Chandler - October 22, 2009

Coskan,

I once had to do a recovery for a large German bank under this scenario. Disk failure with the index tablespaces missing, and they had never backed them up (by error, rather than by design). Oracle 7.3.4 (it was 1996 IIRC). It took 2 days, with some rather unusual init.ora parameters set to allow all kinds of transactional corruption, before I got the database open again. The database was then unstable, so I needed to export it and import it into a fresh database. I got it all back, but it was touch and go and it took a long time. I would think long and hard about not backing up index tablespaces. There is also the possibility that someone might put a table into the index tablespace by accident, too…

4. Blogroll Report 09/10/2009-16/10/2009 « Coskan’s Approach to Oracle - October 22, 2009

[…] Martin Widlake-VLDB Backups […]

5. dombrooks - October 22, 2009

Backup and recovery is not something I tend to get involved in, not so much through choice but as a result of strictly compartmentalised IT departments.
However, I was reading with great interest these presentations from OOW:

“Oracle Database Backup-and-Recovery Best Practices and New Features” Material

Using a Fast Recovery Area, an initial backup, incremental backups, and then using those incrementals to roll forward the full backup seemed pretty neat (whilst backing up the whole FRA to tape etc).
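For anyone curious, that roll-forward recipe boils down to a short daily RMAN run, something like this sketch (tag name invented; it assumes the image copy lives in the FRA):

  RUN {
    # Apply yesterday's level 1 to the image copy, rolling it forward
    RECOVER COPY OF DATABASE WITH TAG 'incr_fwd';
    # Take today's level 1, ready for tomorrow's roll-forward
    BACKUP INCREMENTAL LEVEL 1 FOR RECOVER OF COPY WITH TAG 'incr_fwd' DATABASE;
  }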

6. kocakahin - October 24, 2009

I am happy to hear that you like it.

