jump to navigation

What is a VLDB? September 18, 2009

Posted by mwidlake in Architecture, VLDB.
Tags: ,
trackback

In a post just a couple of days ago on testing, I complained that VLDBs break at the edges. Coskan posted a comment asking I blog on what I consider to be a VLDB and thus what a VLDB DBA is, and I am more than happy to oblige, especially as Coskan not only comments a lot but provides a really good summary of blogs on his own blog. {in fact, I need to add Coskan to my blog roll, something I have been meaning to do for ages}.

Hopefully, this will link to the comments section of that post as Piet deVisser added a wonderful comment answering the question for me. Go and read, I suspect it is better than my attempt here!

VLDB stands for Very Large DataBase. It is not an acronym I like as it sounds suspiciously like a sexually transmitted disease, but maybe that is just a problem with my mind. The term ULDB appeared for a while but seems to have failed to gain traction. U stands for “Ultra” of course.

So what is a Very Large DataBase?

A VLDB is a database who’s very size gives you, as a DBA or database architect, extra work.

Maybe a simpler rule that you can apply is “you can’t back up the database in 24 hours using standard tools”. You can chuck more tape drives and IO channels at a DB but you will quickly hit a limit where you infrastructure or budget can’t cope.

Another, and one I pinch outrageously from Piet is “you can’t afford to duplicate the database for test/QA purposes”. That leads to a whole raft of issues.

I put forward another definition of a VLDB in a comment on the original blog. “Any database who’s size makes 8 out of 10 DBAs comment ‘that is a big database’.” That definition takes into account whether a database is generally beyond the experience of most DBAs /Designers. Why do I think that is significant? Because it means most DBAs/Designers will not have worked with a database that size and thus dealt with the associated problems. The database engine may {or may NOT, as I have complained about} cope with the database size, but you need staff to design it and look after it who know how to do so.

The definitive size of a VLDB, of course, goes up year by year. A few weeks ago I found a document I have mentioned in presentations a couple of times, an internal White Paper by Oracle Corp on what a VLDB is, written around 1994. Next time I am at home I’ll scan it. If I remember correctly, at that time 30GB and beyond on a VMS or Mainframe system was considered a VLDB and, in Unix (the new kid on the block back then), 20GB was the threshold.

Right now, as of September 2009, I would judge any database over 10TB of used space is a VLDB. In 12 months, that will be 20TB. In another 12 months, 40 or maybe 50TB.

“Moore’s Law” traditionally states that compute power doubles every 18 months, but I have just suggested that the VLDB limit doubles every 12 months. I say that as, over the last 10 years, I have worked on several systems, systems most DBAs would consider as “challengingly large”, which double in 12 months or less. Data is exploding. More and more of us are dealing with VLDBs.
This “doubling in 12 months” was not the case (I think) back in 1995, it started in 2000 or so. Before then, database size was doubling about in line or less than with Moore’s law I would say, but that is only my opinion.

What changed? People swapped from thinking you could store only what you really needed to thinking you could store “everything”. Which is “everything” your Moore’s-law expanding CPUs can process PLUS all the summary and metadata you extract from that data.

I could be wrong in my figures though. If you took size as the determining factor and doubled 20GB every 18 months from 1994, you would now class a VLDB, in 2009, as approx 20TB.

What main issues do you face with a VLDB?

  • Backing up the database. With a VLDB, a daily backup of everything via RMAN or Hot Backup is simply not possible, as you can’t run the backup in 24 hours. You need to: Backup less often; backup only part of the DB; use hardware such as mirror splitting or deltas; some other trick like, say, never backing it up but having 3 standbys. I’ve seen it done.
  • Performance. You need to consider radical changes such as removing RI or designing around full table scans and ignoring the block buffer cache for the largest tables.
  • The number or size of objects starts causing bits of Oracle to break or work less efficiently (so many tables it takes 2 minutes to select them all or you hit an unexpected limit like th 2TB disk size in ASM, because you need to use bigger disc sizes as otherwise you need more discs than you can easily manage).
  • Maintenance tasks become a challenge in their own right. This could be stats gathering, it could be adding columns to a table, it could be recreating global indexes, all of which now take more time than you can schedule in the maintenance windows {so part of the definition of a VLDB could be down to how active a database is and how small you maintenance windows are – 1TB could be a VLDB if you can never spend more than an hour doing anything!}
  • GUIs are no use to you. Listing all the tablespaces in your database with OEM is a pain in the proverbial when you have 962 tablespaces. You can’t keep track of all of them, visually.
  • You can’t properly test or prototype as you cannot afford to create a full sized test system

I’d like to pick up that last point. With A VLDB, you often end up doing things on the live system that you have been unable to test or prove because you simply lack a test system that is even within an order of magnitude the size of your live system. RAC is a particular issue, it seems many sites are happy to have the live system as a RAC system but not the test or development systems. When you raise the issue, the response is often “well, there is not that much difference between RAC and non-RAC systems is there?”. You are not allowed to get violent with the client,or even deeply sarcastic. Yes, there is a huge difference.

A VLDB DBA is someone who has had to consider the above for more than a few months, or on more than one system. Or who simply cries when you mention Oracle breaking when it exceeds size limits.

How do you know when you are dealing with a ULDB? When you can find no one else who will speak publically about a database bigger than yours. When I was working with the Sanger Institute on my pet “it is really quite huge” database I would often have these frustrating conversations with Oracle Corp:

“X is giving me trouble as the database is just so large”

“It is not that large, we have customers with bigger database”

“Can you introduce me, so we can talk about these issues?”

“Errr, no , they don’t like to talk about their systems”.

Thanks. Thanks a bunch.

Remember, there is always someone with a bigger DB than you. But they probably won’t talk about it.

 

Enough for tonight….

About these ads

Comments»

1. Dion Cho - September 18, 2009

Very interesting and insightful post.

Thanks for sharing your thoughts.

mwidlake - September 18, 2009

Hi Dion,

Thanks for that. I’m glad to see you stopping by, I look at your blog from time to time, or land on there when googling issues, it is very good.

Martin

2. coskan - September 18, 2009

Thank you thank you thank you. Clear as day

I think this post can be a kick off for possible VLDB problems and solutions series (I start to wish again :) )

The best definition to me is VLDB is a DB that “you can’t back up the database in 24 hours using standard tools”

Thank you again

mwidlake - September 18, 2009

I’m glad to make you happy Coskan :-)

3. Greg Rahn - September 20, 2009

Some comments on VLDB performance:

I would not recommend RI in a VLDB DW. There are better and faster ways validate the data than to use slow row-by-row RI. Same goes for primary keys. Parallel table scans and joins can be used to validate relationships in the ELT phase very fast and effectively.

Indexes, as most OLTP DBAs know them, are becoming a thing of the past in a DW. Data elimination has become the primary objective in a VLDB. Things like partitioning and Exadata Storage Indexes are much better technologies for returning the data for a query on very large fact tables. Indexes take up space, slow down loads and simply are not really useful on fact tables for other than (maybe) the single key lookup. Besides, who really needs indexes when your full rack Sun Oracle Database Machine can blaze through table/partition scans with up to 21 GB/s of uncompressed raw disk bandwidth. Add on some compression and the logical scan rate goes to 60GB/s (3x compression) to over 200GB/s (10x compression).

mwidlake - September 20, 2009

The role of RI in DW VLDBs is an interesting one Greg, thanks for posting.

My knee-jerk response to removing Referential Integrity from a relational database is always “I’m not comfortable with that”. Which is a step down from what it was 10 years ago of “are you kidding?”. I suspect most DBAs who do not deal with terabytes of data and/or work primarily on OLTP databases would still say “are you kidding?” and rightly so in my opinion.

But the performance improvements, as you say, can be considerable and you can save storage (though I see that as secondary). These benefits from removing RI are more significant the larger the VLDB, as you need to be able to process all that data in a timely way to justify gathering and keeping it.

If you turn off RI, there is a likelihood (if not a certainty) that bad data will start to accumulate, especially as different people and teams bring in data from more sources or if data in the VLDB gets re-processed (columns updated, data removed etc). I’ve seen RI turned off, I’ve done it myself, but it seems rare that anyone then puts in place an alternative system to check for bad data accumulating.

If your source systems have RI and you verify the cross-system RI during the ELT phase you can be pretty sure that the data in the DW has integrity. Further, most DW systems are pulling out summarised data or marketing/trend analysis data where the odd mistake is probably not significant. You can afford to let some bad data get through for the sake of the performance gains, unlike on an OLTP system dealing with hospital records or finance, where even one RI issues is one too many.

This should be a blog thread all of it’s own I guess.

Oh, and one last comment. If you don’t need RI, why use a relational database engine in the first place? MySQL with ISAM tables can be incredibly fast and it is hard to beat flat files and ‘C’ {is this a red rag to an Oracle Corp real-world performance “bull” eh Greg? :-) }

Greg Rahn - September 20, 2009

Allow me to clarify my comments on RI. When I said “I would not recommend RI in a VLDB DW”, I should have qualified that with “RI…enforced by constraints”. I encourage people to build in to their data flow a validation step for data before it is published to the final table. One can use simple joins as filters to check for the existence or absence of key values. In our “The Terabyte Hour with the Real-World Performance Group” session at OOW we will demonstrate how to do these validations with joins. I think folks will be amazed how fast this can be done using Parallel Execution on a Oracle Database Machine.

Bottom line: We can validate the RI of the data with parallel execution and joins much faster than using PK/FK constraints. This gives is “clean” data fast!

mwidlake - September 21, 2009

I agree, validating the data with large scans and hash joins/anti-joins at the ETL phase is powerful, and certainly fast with nice shiny new kit like a Database Machie or Exadata….If you have the spending power to obtain it :-)

As an extra element to the potential speed for doing large scans to validate data, which works with lowly 10.2 (and I think earlier) and does not need extra technology, you can get a significant performance boost on full segment scans by invoking Parallel Execution with Asynchronous IO enabled as the scans are direct, bypassing the SQL layer. It’s the direct element, not the parallelisation that really helps. It can be almost an order of magnitude faster. I know this feature is extend and enhanced in 11.2 but I have not had a chance to look at it myself. As I understand it, now non-parallel IO con be direct and Oracle can identify automatically when to use it?

But I had a timely reminder this very weekend about the problems with growing data damage I cautioned about. A client of mine has found data issues in a DW VLDB and this is taken from their email:

“Not sure how the duplicate rows have come about – The API to generate the charge checks if a charge has been published and will only allow it to be regenerated if a force option is used . The first thing it does is delete the data. I guessing maybe it can be invoked twice concurrently, but as far as I know no one has been charged twice.”

They had removed RI, are relying on the ETL process {which seemed rigorous} and only spotted the slow data damage very recently. You have to watch for these things.

I’d love to see the presentation at OOW this year but I’m having to miss the event again. It’s too many years since I came along. I hope it goes well Greg.

4. Juan Perez. - September 22, 2009
5. Kamran Agayev A. - October 6, 2009

Hi Martin
Very interesting article on VLDB. Thank you very much


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

Join 160 other followers

%d bloggers like this: