
How Fast for £1,000 – Architecture August 5, 2010

Posted by mwidlake in Architecture, performance, Testing.
7 comments

My previous post proposed the creation of “the fastest Oracle server for a grand”, or at least an investigation into what might be the fastest server. I’ve had some really good feedback {which I very much appreciate and am open to even more of}, so I think I’ll explore this further.

My initial ideas for the hardware configuration, written at the same time as the original post, were:

  • A single-chip, quad-core Intel Core i5 or i7 processor (I would like two chips but the cost of multi-chip motherboards seems too high for my budget)
  • 8GB of memory as the best price point at present, but maybe push to 16GB
  • Multiple small, fast internal disks for storage, maybe expand via eSATA
  • Backup to an external drive (cost not included in the budget).
  • USB3 and use of memory sticks for temp and online redo.
  • If the budget will stretch, an SSD disc for the core database components, like core tables and index tablespaces (who does that any more!).
  • ASM or no ASM?
  • If I run out of internal motherboard connections for storage, can I mix and match with USB3, external e-SATA or even GB ethernet?

As for the Oracle database considerations, I have a good few things I want to try out also. In the past (both distant and recent) I have had a lot of success in placing components of the database in specific locations. I refer to this as “Physical Implementation” {Physical Implementation, if I remember my old DB Design courses correctly, also includes things like partitioning, extent management, tablespace attributes – how you actually implement the tables, indexes and constraints that came from logical data design}.

Physically placing components like undo and redo logs on your fastest storage is old-hat but I think it gets overlooked a lot these days.
Placing indexes and tables in different tablespaces on different storage is again an old and partially discredited practice, but I’d like to go back and have a new look at it. Again, I had some success in improving performance with this approach as recently as 8 years ago but never got to rigorously test and document it. {As an aside, one benefit I have been (un)fortunate enough to gain twice from putting tables and indexes in separate tablespaces is when a tablespace has been lost through file corruption – only for it to be an index tablespace, so I was able to just drop the tablespace and recreate the indexes.}
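Just to be clear about what I mean by placement, it is nothing more exotic than this sort of thing – a minimal sketch, with made-up mount points and names standing in for whatever fast and slow storage the box ends up with:

    -- Made-up mount points: /u01 standing in for one internal device,
    -- /u02 for a second, separate one
    CREATE TABLESPACE app_data
      DATAFILE '/u01/oradata/ofast/app_data01.dbf' SIZE 10G
      EXTENT MANAGEMENT LOCAL SEGMENT SPACE MANAGEMENT AUTO;

    CREATE TABLESPACE app_index
      DATAFILE '/u02/oradata/ofast/app_index01.dbf' SIZE 10G
      EXTENT MANAGEMENT LOCAL SEGMENT SPACE MANAGEMENT AUTO;

    -- Table on one device, its index on the other
    CREATE TABLE orders
    ( order_id   NUMBER       NOT NULL
    , cust_id    NUMBER       NOT NULL
    , order_date DATE         NOT NULL
    , status     VARCHAR2(10)
    ) TABLESPACE app_data;

    CREATE UNIQUE INDEX orders_pk ON orders (order_id) TABLESPACE app_index;

The interesting part is not the DDL, it is measuring whether splitting the segments across devices actually buys anything on cheap hardware.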

Then there is the use of clusters, IOTs, Bitmap indexes and Single Table Hash Clusters (are you reading this Piet?) which I want to explore again under 11.
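For reference, the sort of structures I mean are below – a sketch with made-up names; the SIZE and HASHKEYS figures are just placeholders you would have to work out for the real data:

    -- An IOT: the table IS its primary key index, no separate table segment
    CREATE TABLE order_lines_iot
    ( order_id NUMBER NOT NULL
    , line_no  NUMBER NOT NULL
    , item_id  NUMBER
    , qty      NUMBER
    , CONSTRAINT oli_pk PRIMARY KEY (order_id, line_no)
    ) ORGANIZATION INDEX;

    -- A single table hash cluster: a row is found from its key with one IO, no index at all
    CREATE CLUSTER cust_hash_clu (cust_id NUMBER)
      SIZE 200 SINGLE TABLE HASHKEYS 100000;

    CREATE TABLE customers_hashed
    ( cust_id NUMBER NOT NULL
    , name    VARCHAR2(100)
    , created DATE
    ) CLUSTER cust_hash_clu (cust_id);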

I don’t think I am going to bother with mixed block sizes in one DB. I think you need a very specialist case to make it worth the overhead of managing the various caches, plus the CBO is not so great at accurately costing operations in non-standard block sizes {issues with the MBRC fudge factor being one}. But I think I will re-visit the use of “keep” and “recycle” caches. For one thing, I want to show that they are just caches with a name and not special, by using the “recycle” cache as the keep and the “keep” cache as the recycle.
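The experiment itself needs nothing more than the standard parameters and storage clause – a sketch, assuming an spfile and with made-up sizes and table names:

    -- Carve out the two named pools (sizes are placeholders)
    ALTER SYSTEM SET db_keep_cache_size    = 512M SCOPE=BOTH;
    ALTER SYSTEM SET db_recycle_cache_size = 512M SCOPE=BOTH;

    -- The swap: the table you would normally "keep" goes in RECYCLE and vice versa,
    -- to show the pools are just named caches with no special behaviour
    ALTER TABLE hot_lookup_tab  STORAGE (BUFFER_POOL RECYCLE);
    ALTER TABLE big_scanned_tab STORAGE (BUFFER_POOL KEEP);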

Should I be using RAT for testing all of this? I said I was not going to use any special features beyond Enterprise edition but RAT could be jolly useful. But then I would need two servers. Is anyone willing to give me the other £1000 for it? I’d be ever so grateful! :-)

An Oracle server – How Fast for £1,000 July 27, 2010

Posted by mwidlake in Architecture, performance, Testing.
17 comments

Question: how fast an Oracle server can you create for £1,000?

{I’d really appreciate feedback and suggestions on this particular post}

The power of domestic PCs continues to grow, with four-core chips becoming pretty much standard and starting RAM looking more like 4GB than 2GB, with 8GB quite reasonable. So, how quick an Oracle server can you make based on a domestic PC? After all, those of us who play with Oracle in our spare time tend to use such machines and, in fact, they are often not far off what our smaller servers at the office really are. When I worked at the Wellcome Trust Sanger Institute, we had to make our IT budget pounds go a long way. We were, after all, a charity with a limited budget but also a scientific organisation with a huge demand for data and processing. So we used a lot of cheap kit.

I’m seriously thinking of giving this a go. I need a new PC anyway and so I am willing to use it, at least initially, to see what can be done.

If I do this, I’m going to need to set some boundaries on the exercise. How about:

  • The Oracle licence is being ignored in the cost {and please, I don’t need to be told how the licence can be more than the hardware costs!}. The OS cost is included, though.
  • I am not aiming for enterprise-level resilience, so I am not going to consider hot-swappable components, dual redundant power supplies or things like that.
  • I am going to use new kit, so no scavenging or buying second-hand. It must all be easily available and repeatable.
  • I will use local storage in the server or connected to ports available on the server.
  • It will support a database of 1TB in size {yet to be designed}.
  • Oracle v11 Enterprise Edition, but nothing special like TimesTen or Exadata (unless Oracle are willing to sell me an Exadata box for a grand, in which case I’ll consider it).
  • I’m not considering backup and recovery performance {and this would be a serious oversight if this was a real system, but most places have central backup/recovery facilities}.

I would also have a few other things to decide.

The main one is “Do I use Linux or Windows?” Yes, you are all probably shouting “Linux!!!” but I have never been a Linux sys admin (I was an incredibly poor HP-UX system admin for 3 months though) so it will take me more time to deal with issues under Linux – in work situations I have always had access to people who know all this stuff to sort out issues, but in this case I will be doing it on my own. On the other hand, you can just chuck Oracle on a standard Windows box and it works, and as a rule hardware just works under Windows. If I decide to use USB3 ports, for example, is it going to be a major pain getting drivers under Linux? But then, if I want the fastest Oracle box under a grand, why would I slow it down with Windows and spend money on the licence? I just want the box to run Oracle and a workload.

The second “software” decision is, how do I measure performance? I think I could be getting to grips with Dom Giles’ excellent Swingbench {BTW, nice tag line on that page, Dom :-)}. But it runs on Java and guess what boys and girls? I’ve never been a Java developer. How limited are my skills! So that would take some of my precious spare time up too.

I’d love feedback on this, I’d love to know what hardware suggestions you would make, what you think about the overall idea, what else I need to consider to make the tests valid… I have a few ideas already for the hardware architecture and the intention would be to try lots of things but I’ll save that for a second post. After all, if I get no feedback I might just spend the money on a gaming machine and a week’s walking in the Lake District instead.

And if anyone wants to help with the cost, please send cheques to….

More Memory Meanderings – IOPS and Form Factors July 19, 2010

Posted by mwidlake in Architecture, Management, performance.
8 comments

I had a few comments when I posted on solid state memory last week and I also had a couple of interesting email discussions with people.

I seriously failed to make much of one of the key advantages of solid-state storage over disk storage, which is the far greater capacity for input/output operations per second (IOPS); this was picked up by Neil Chandler. Like many people, I have had discussions with the storage guys about why I think the storage is terribly slow and they think it is fast. They look at the total throughput from the storage to the server and tell me it is fine. It is not great, they say, but it is {let’s say for this example} passing 440MB a second over to the server. That is respectable and I should stop complaining.

The problem is, they are just looking at throughput, which seems to be the main metric they are concerned about after acreage. This is probably not really their fault, it is the way the vendors approach things too. However, my database is just concerned with creating, fetching, and altering records, and it does it as input/output operations. Let us say a disk can manage 80 IOPS (which allows an average 12.5ms to both seek to the record and read the data; even many modern 7,200rpm discs struggle to average less than 12ms seek time). We have 130 disks in this example storage array and there is no overhead from any sort of RAID or any bottleneck in passing the data back to the server. {This is of course utterly unbelievable, but if I have been a little harsh in not allowing that the discs can manage an 8ms seek time, ignoring the RAID/HBA/network cost covers that}. Each disc is a “small” one of 500GB. They bought cheap disk to give us as many MB/£ as they could {10,000 and 15,000rpm disks will manage 120 and 160 IOPS respectively but cost more per MB}.

Four sessions on my theoretical database are doing full table scans at 1MB of data per IO {Oracle’s usual max on 10.2}, each session receiving 100MB of data a second, so 400MB in total. 5 discs {5 * 80 IOPS * 1MB} could supply that level of IO. It is a perfect database world and there are no blocks in the cache already for these scans to interrupt the multi-block reads.

However, my system is primarily an OLTP system and the other IO is records being read via index lookups and single block reads or writes.

Each IOP reads the minimum for the database, which is a block. A block is 4k. Oracle can’t read a bit of a block.

Thus the 40MB of other data being transferred from (or to) the storage is single block reads of 4k. 10,000 of them. I will need 10,000/80 disks to support that level of IO. That is 125 discs, running flat out.

So, I am using all my 130 discs and 96% of them are serving 40MB of requests and 4% are serving 400MB of requests. As you can see, as an OLTP database I do not care about acreage or throughput. I want IOPS. I need all those spindles to give me the IOPS I need.
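If you want to see the same split on your own system, Oracle exposes both the request rate and the byte rate – a sketch against v$sysmetric (the metric names and the 60-second group id are as I remember them for 10g/11g, so check v$metricname on your release):

    -- Recent IO rates as the instance sees them: requests (IOPS) versus bytes.
    -- group_id = 2 should be the 60-second interval in 10g/11g.
    SELECT metric_name, ROUND(value) AS value, metric_unit
    FROM   v$sysmetric
    WHERE  group_id = 2
    AND    metric_name IN ('Physical Read Total IO Requests Per Sec'
                          ,'Physical Write Total IO Requests Per Sec'
                          ,'Physical Read Total Bytes Per Sec'
                          ,'Physical Write Total Bytes Per Sec');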

What does the 40MB of requests actually equate to? Let us say our indexes are small and efficient and have a height of 3 (a B-level of 2), so a root node, one level of branch nodes and then the leaf nodes. To get a row you need to read the root node, a branch node, a leaf node and then the table block. 4 IOs. So those 10,000 IOPS are allowing us to read or write 10,000/4 records a second, or 2,500 records.
You can read 2,500 records a second.
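You can check that arithmetic against your own schema easily enough: BLEVEL in the dictionary is the number of index blocks above the leaf level, so a single-row lookup costs roughly BLEVEL + 1 index block reads plus one table block read. A minimal sketch, with a made-up table name:

    -- BLEVEL is the number of index blocks read above the leaf level,
    -- so a single-row lookup is roughly BLEVEL + 1 index reads, plus the table block
    SELECT index_name, blevel, leaf_blocks, clustering_factor
    FROM   user_indexes
    WHERE  table_name = 'CUSTOMER'   -- made-up table name
    ORDER  BY blevel DESC;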

Sounds a lot? Well, let us say you are pulling up customer records onto a screen and the main page pulls data from 3 main tables (customer, address, account_summary) and translates 6 fields via lookups. I’ll be kind and say the lookups are tiny and Oracle just reads the block or blocks of the table with one IO. So that is 9 IOs for the customer screen, and if our 40MB of OLTP IO was all for looking up customers then you could show just under 280 customers a second, across all users of your database. If you want to pull up the first screen of the orders summary, each screen record is derived from 2 underlying main tables and again half a dozen lookups, but now with 10 records per summary page – that is 80 IOs for the page. Looking at a customer and their order summary, you are down to under thirty a second across your whole organisation, doing nothing else.

You get the idea. 2,500 records a second is tiny. Especially as those 130 500GB disks give you 65TB of space to host your database on. Yes, it is potentially a big database.

The only way any of this works is due to the buffer cache. If you have a very healthy buffer cache hit ratio of 99% then you can see that your 2,500 records a second of physical IO coming in and out of the storage sub-system is actually supporting around 250,000 logical-plus-physical record accesses. {And in reality, many sites now buffer at the application layer too}.
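For what it is worth {and with all the usual health warnings about treating hit ratios as a tuning tool}, the crude since-startup version of that figure can be pulled from v$sysstat – a sketch:

    -- The crude, since-startup buffer cache hit ratio
    SELECT ROUND( (1 - phy.value / (db.value + con.value)) * 100, 2) AS bchr_pct
    FROM   v$sysstat phy, v$sysstat db, v$sysstat con
    WHERE  phy.name = 'physical reads'
    AND    db.name  = 'db block gets'
    AND    con.name = 'consistent gets';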

Using Solid State Storage would potentially give you a huge boost in performance for your OLTP system, even if the new technology was used to simply replicate disk storage.

I think you can tell that storage vendors are very aware of this issue, as seek time and IOPS are not metrics that tend to jump out of the literature for disk storage. In fact, often they are not mentioned at all. I have just been looking at some modern sales literature and white papers on storage from a couple of vendors and they do not even mention IOPS – but they happily quote acreage and maximum transfer rates. That is, until you get to information on Solid State Discs. Now, because the vendor can say good things about the situation, the information is there. In one HP white paper the figures given are:

                        Modern super-fast      Top-end
                        SAS disk drive         Solid State Disk
    Sustained write     150MB/s                180MB/s
    Sustained read      90MB/s                 180MB/s
    Random write (IOPS) 285                    5,000+
    Random read (IOPS)  340                    20,000+

More and more these days, as a DBA you do not need or want to state your storage requirements in terms of acreage or maximum throughput – you will get those for free, so long as you state your IOPS requirements. Just say “I need 5,000 IOPS” and let the storage expert find the cheapest, smallest disks they can to provide it. You will have TBs of space.

With solid-state storage you would not need to over-specify storage acreage to get the IOPS, and this is why I said last week that you do not need solid state storage to match the capacity of current disks for this storage to take over. We would be back to the old situation where you buy so many cheap, small units to get the volume, IOPS are almost an accidental by-product. With 1GB discs you were always getting a bulk-buy discount :-)

I said that SSD would boost performance even if you used the technology to replicate the current disk storage. By this I mean that you get a chunk of solid-state disk with a SATA or SAS interface in a 3.5 inch format block and plug it in where a physical disk was plugged in, still sending chunks of 4k or 8k over the network to the block buffer cache. But does Oracle want to stick with the current block paradigm for requesting information and holding data in the block buffer cache? After all, why pass over and hold in memory a block of data when all the user wanted was a specific record? It might be better to hold specific records. I suspect that Oracle will stick with the block-based structure for a while yet as it is so established and key to the kernel, but I would not be at all surprised if something is being developed with Exadata in mind where data sets/records are buffered, and this could be used for data coming from solid state memory. A second cache which, if you are using Exadata or solid-state memory, holds single records. {I might come back to this in a later blog, this one is already getting bloated}.

This leads on to the physical side of solid-state discs. They currently conform to the 3.5” or 2.5” hard disc form factor but there is no need for them to do so. One friend commented that, with USB memory sticks, you could stick a female port on the back of a memory stick and a joint, and you could just daisy-chain the USB sticks into each other, as a long snake. And then decorate your desk with them. Your storage could be looped around the ceiling as bunting. Being serious, though, with solid state storage you could have racks or rows of chips anywhere in the server box. In something like a laptop the storage could be an array 2mm high across the bottom of the chassis. For the server room you could have a 1U “server” and inside it a forest of chips mounted vertically, like row after row of teeth, with a simple fan at front and back to cool the teeth (if needed at all). And, as I said last time, with solid state being so much smaller and no need to keep to the old hard disk format, you could squeeze a hell of a lot of storage into a standard server box.

If you pulled the storage locally into your server, you would be back to the world of localised storage, but then LANs and WANs are so much faster now that if you had 10TB of storage local to your server, you could probably share it with other machines in the network relatively easily and yet have it available to the local server with as many and as fat a set of internal interfaces as you could get your provider to manage.

I’m going to, at long last, wrap up this current instalment of my thoughts with a business one. I am convinced that solid-state storage is soon going to be so far superior a proposition to traditional disks that demand will explode. And so it won’t get cheaper. I’m wondering if manufacturers will hit a point where they can sell as much as they can easily make and so hold the price higher. After all, what was the argument for Compact Discs costing twice as much as old cassette tapes, even when they had been available for 5 years and cost no more to produce? What you can get away with charging.

Memory Changes Everything July 12, 2010

Posted by mwidlake in Architecture, performance.
9 comments

I’ve got this USB memory stick which I use to carry around my scripts, documents, presentations, Oracle manuals and enough music to keep me going for a few days. It is an 8GB Gizzmo Junior and it is tiny. By tiny I mean as wide as my little finger, the length of a matchstick and about the same thickness as said matchstick. So small that I did indeed lose the damn thing for 6 months before I realised it had got trapped behind a credit card in my wallet.

It cost me ten British pounds about 15 months ago (less than most 4GB USB sticks seem to cost now, but then it is nothing more than the memory chip and connectors wrapped in plastic) and it highlights how cheap solid-state “storage” is becoming.

Connected to this, I was looking at buying a new PC this week and this machine comes with 10 USB slots, if you include the ones on the supplied monitor and stubs on the motherboard.
10 USB slots, 8GB gizzmo memory sticks… That would be 80GB of cheap and fast storage. Now get a few USB hubs and bulk-buy a few dozen cheap USB2 sticks and you could soon have a solid-state database of a few hundred GB for a thousand pounds. Then of course you can have fun seeing where the pinch-points in the system are (USB2 has a maximum speed per port and going USB3 right now is going to break that 1 grand barrier. But give it a year…).

This really started me thinking about when memory-based storage will take over from spinning disk as the best option for enterprise-level storage, and my gut feeling is in about 5 years. I think it will be both technically possible and financially viable in much less than that, say as little as 2 years; the cost of solid-state storage per MB will still be higher than disk by then, but it will be potentially much faster. A few considerations going through my mind were:-

  • Disk is getting a lot slower in relation to acreage. By this I mean that, for a single disc drive, capacity is doubling about every 18 months but seek time has hardly reduced in a decade and transfer rate (reading from the physical platters to the unit’s buffer) is again almost stationary, at about 120MB/s for 10,000rpm disks and up towards 180MB/s for those very expensive and noisy 15,000rpm disks. Being a tad ridiculous to make the point, with modern 3TB disks you could build most Oracle databases on one disc. Let’s make it two in a RAID 10 configuration for redundancy. My point is, your 3TB database could well be being run right now, for real, across say 5 physical disks with a total sustainable physical throughput of around 500MB a second.
  • Solid state storage seems to be halving in price in more like 8-10 months.
  • IO subsystems are made faster by using RAID so that several physical discs can contribute to get towards the 300MB or so speed of the interface – but solid state is already that fast.
  • IO subsystems are made faster by building big caches into them and pre-fetching data that “might” be requested next. Oh, that is kind of solid state storage already.
  • Solid state storage, at least the cheap stuff in your USB stick, has the problem that you can only write to each bit a thousand or so times before it starts to get unreliable. But physical disk has exactly the same issue.
  • There are new methods of solid-state memory storage coming along – “New Scientist” had a nice article on it a few months ago, and these versions will be even higher density and more long-term reliable.
  • Seek time on solid-state memory is virtually zero, so random IO is going to be particularly fast compared to spinning disk.

Solid state memory needs less power, and thus less cooling, is silent, is potentially denser and is less vulnerable to temperature and humidity fluctuations. I can see it not needing to be kept in a specialist server room with all that air con and the ear defenders you need when you go in the room.
Just somewhere with normal air con and a lock on the door should suffice.
We do not need solid state storage to match the size of current disks, or even be as cheap, for it to take over. As I have already pointed out, it is not acreage you need with physical disks but enough spindles and caches to make it fast enough in relation to the space. Further, we can afford to pay more for solid state if we do not need to keep it in such expensive clean-room-like environments.

I can see that in a couple of years, for a given computer system – say a mixed-workload order processing system – the storage needs will be met by maybe a dozen solid-state chunks of storage, perhaps themselves consisting of several small units of memory in some sort of RAID for resilience, all able to flood the IO channels into our processing server; the issue will be getting the network and IO channels into the server to go fast enough. So don’t bother – stick all the storage directly into the server. You just got rid of half your SAN considerations.

I’m going to stop there. Partly because I have run out of time and partly because, in checking out what I am writing, I’ve just spotted that someone did a better job of this before me. Over to James Morle, who did a fantastic post on this very topic back in May. Stupid me for not checking out his blog more often. James also mentions that often it is not total throughput you are interested in at all but IOPS. That near-zero latency of solid-state memory is going to be great for supporting very high IOPS.

Friday Philosophy – Madness demands Attention May 28, 2010

Posted by mwidlake in Architecture, Friday Philosophy.
2 comments

Many years ago I had a good friend who was a psychiatric nurse. We were talking about his job once and he was saying how some patients just took up much more time than others. These were generally the ones who would be deemed “the most mad” in a non-clinical manner {which is pretty much how my friend the psychiatric nurse put it}. These patients’ actions or need for intervention would put demands on the staff far more than other patients. As a result, all the nurses tended to get to know (or know of) such patients better than others.

I thought of this the other day when a few of us were talking about some awful bit of application we were concerned about. This thing inserts rows from a MSSQL database into a table in an Oracle database. Triggers on the initial table fire and populate another table, in a 1-to-many relationship. This second table also has a trigger on it that further inserts into another set of tables. A regular process then aggregates this data – and sticks it back into the MSSQL database it came from.

Said process is done as a single transaction for all rows inserted for the day. Irrespective of the growth in rows. Or the fact that one source “application” has grown to 10 and soon will grow to 50. All rows in one transaction. The intermediate tables are never cleared out and get bigger and bigger. No one else needs any of this data in the Oracle side of the system.

There are several “madnesses” to this process – why put it into Oracle only to put it back into the source system, why use a busy production system to hold and process transient data, why no clean up, why are the records not processed in batches, the cascading triggers magnify transaction data volumes…

This process is well-known in our group. I’ve been involved. Both the guys I was talking to have been involved. I can see from my desk 4 or 5 other people who have been roped into bullying this process through before now. In fact, I reckon half the department have had to work on this damned thing at some point.

Can you see why I was reminded of my conversation with my old nurse friend?

The application is simply mad. And as a result it is demanding not only on our database but on all of the team, as so many of us have had to get involved working on it. We have all got to know it so well.

I’m glad to say that treatment for the application is planned and, hopefully, it will soon be a lot happier.

So will we.

Relax with Good Security :-) April 14, 2010

Posted by mwidlake in Architecture.
1 comment so far

It’s nice to know that your computer system is very secure. It allows you to relax.

Is this because you feel confident that your data is protected and that your systems cannot be compromised by malicious actions?

Well maybe, but in my case, today (*), this is not why I can relax. Security is allowing me to relax as I’ve been locked out of my account at my client’s site and it is hard to tune SQL/review design documents if you can’t log on. It is a little bizarre that my account expired today as I know the date my account is set to self destruct – I filled in the relevant forms myself and simply had my boss sign them before I popped them over to the security team. Unfortunately, I can’t pop back over to them to jolly things along as I am now in a remote location (really, very, very remote!).

I just have to wait patiently for HR, Security and Management to complete the complex little Morris Dance they have to do to confirm I am a real person, need the access and have not recently been asked to leave {I hope I am not going to be asked to leave!}.

I’m not really complaining. Having worked for the NHS back in the late 80’s and early 90’s, I have shocking tales to tell, such as taking home confidential patient clinical reports in order to play with “SQL*report writer” on my personal desktop, and similar stories about modern organisations in the private sector {well what do you know, person X owes HOW MUCH?}. So I am glad security is an order of magnitude better now. {No, honestly, it really is! No matter how bad you think things are where you are now, you should see the stuff I saw back then. These days, you at least have the ability to lock things down, even if the real-world reality lags behind.}

But, like all process-oriented things, it has to be designed to work well to be, well, effective. I got locked out of the same systems in a similar way just before Christmas. How many people are being paid a day’s wage to ring up the helpdesk and drink endless tea?

(*) I wrote this blog a while back. It might even have been before I had a blog. But a similar incident reminded me. Oh, and what was my reward for an enforced day being paid to drink tea and read a copy of “Oracle Scene”? Three days of working like a madman to catch up on that lost day. *sigh*. I did not get paid overtime.

Partitions are Not for Performance October 29, 2009

Posted by mwidlake in Architecture, performance, VLDB.
16 comments

There is a myth that Partitions in Oracle magically aid SQL Query performance, even a general assumption that the main role of partitioning is to aid such query performance. This is not so. There are certainly many cases where Partitioning can aid performance but in my experience they just as often hinder rather than help overall performance.

The myth is so strong that when asked at interview, most DBAs {and even some performance specialists} will cite query performance as the main (and occasionally only) benefit of partitioning.

Why can partitions hinder SQL query performance? Let’s say for example that you have a 10 million row table and an index on that table. That index has a B-Level of 3, which means it has a root block, one level of branch nodes, a second layer of branch nodes and then the leaf-node level. To access a row in the table via the index Oracle needs to read the root block, two branch blocks and then a leaf block to get the rowid of the record. This allows the table block {and from that the row} to be read. This is depicted in the below diagram, the numbered orange squares represent the blocks as selected in turn from the index and then table:

{Diagram: index_access_global}

That is 5 I/O operations to access that row.

Now we partition the table into 5 partitions, 2 million rows each. The index is locally partitioned. If you are lucky, you may, just may, end up with local index partitions with a B-level 1 less than that of the original index, so in our case a B-level of 2. The often-suggested scenario is that only one partition is considered, and the CBO will read one root node, a branch-level block, a leaf block and then the block from the table partition.

{Diagram: single_index_partition_access}

4 I/Os and a saving of 20% {I’m ignoring for now caching and whether it is physical or logical IO}.

A saving of 20% IF {and only if} you have local index partitions with a lower B-Level than the equivalent non-partitioned index. And the access is to one partition alone.

However, I keep seeing situations where the index look up does not include the partition key. So you get the below situation:

{Diagram: index_lookup_partition}

Lacking the partition key, the CBO cannot exclude any partitions – so it has to scan each one. For most partitions, maybe for all but one, no records are found, but each index partition still has to be checked, at 3 IO operations each. So in my example 16 I/Os are done to get the one record of interest.

16 I/O operations instead of 5.
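A minimal sketch of the scenario {table and column names are made up}: a locally partitioned index probed without the partition key has to be probed once per partition, which shows up as PARTITION RANGE ALL over the index in the plan and as inflated consistent gets.

    CREATE TABLE big_orders
    ( order_id   NUMBER       NOT NULL
    , order_date DATE         NOT NULL
    , cust_ref   VARCHAR2(20) NOT NULL
    )
    PARTITION BY RANGE (order_date)
    ( PARTITION p_2009_q1 VALUES LESS THAN (TO_DATE('01-04-2009','DD-MM-YYYY'))
    , PARTITION p_2009_q2 VALUES LESS THAN (TO_DATE('01-07-2009','DD-MM-YYYY'))
    , PARTITION p_2009_q3 VALUES LESS THAN (TO_DATE('01-10-2009','DD-MM-YYYY'))
    , PARTITION p_2009_q4 VALUES LESS THAN (TO_DATE('01-01-2010','DD-MM-YYYY'))
    , PARTITION p_max     VALUES LESS THAN (MAXVALUE)
    );

    CREATE INDEX big_orders_custref_ix ON big_orders (cust_ref) LOCAL;

    -- No order_date in the predicate, so every index partition has to be probed:
    -- the plan shows PARTITION RANGE ALL and the consistent gets climb with
    -- the number of partitions
    SET AUTOTRACE TRACEONLY
    SELECT * FROM big_orders WHERE cust_ref = 'AB123456';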

The situation is often not spotted, at least initially, as the time taken to carry out the extra local index partition scans is “small”, especially for specific lookups. Usually any testing is done on a table with only a small number of partitions.

I remember well the first time I came across this {on an Oracle 9.0 database I think}; there were well over 100 partitions and a process that checked many thousands of individual records had slowed down significantly, taking 4 or 5 times as long as before.

An indicator that the problem is occurring is when a single record lookup against the index and then table is taking perhaps several dozen to several hundred consistent gets rather than the 5 or 6 it should. Also, you will see the partition iterator in the explain plan. In that first case where I came across the issue, consistent gets of about 380 per record fetched were being seen for a select that returned 1 record 99% of the time from a single table lookup. I’ve since seen the same problem on tables with over a thousand partitions, each local index being scanned for a total of almost 5000 consistent gets per record.
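One quick way to spot candidates {a sketch – the thresholds are arbitrary} is to look for single-row lookup statements with a suspiciously high gets-per-execution figure:

    -- Single-row lookups that are burning far more buffer gets than they should
    SELECT sql_id
    ,      executions
    ,      ROUND(buffer_gets    / NULLIF(executions,0))   AS gets_per_exec
    ,      ROUND(rows_processed / NULLIF(executions,0),1) AS rows_per_exec
    FROM   v$sql
    WHERE  executions > 100
    AND    buffer_gets / NULLIF(executions,0) > 100   -- arbitrary threshold
    ORDER  BY gets_per_exec DESC;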

You may think that this would be an unusual situation, as access against very large tables is either full/partial table scans or lookups on the PK or with a WHERE clause including the partitioning key – but it is actually very common. Partitioned tables are being used more and more in large but generally OLTP applications, or sets of records are identified in a data warehouse and then checked more specifically with generally row-specific logic. With VLDBs, which have many, many partitioned tables with many, many partitions each, the problem is common and often not recognized in the fog of other issues and massive I/O levels.

I’ve only covered a general performance issue with partitions here, I’ll expand on the theme and this simplistic example in the next post.

And yes, there are many ways partitioning CAN aid performance. I aim to get to those too. I really love partitioning.

VLDB Backups October 13, 2009

Posted by mwidlake in Architecture, VLDB.
9 comments

One of the indications that your database classes as a VLDB is that your backups are giving you problems simply due to the size of the database.

As some of you will know from previous posts of mine, like this one about how vital it is to prove your backup and this one about how you maybe need to back up more than just the database, I have a thing about backups. Or, more specifically, recovery.

My focus on backups may surprise some people who have worked with me as I often state “I am not a production DBA”. {And no Dave, this is not me saying I am too good to be a proddy DBA, it is me saying I do not have current, strong skills in many of the daily proddy DBA tasks}. However, I am an architect. Whenever I work on a new system or heavily modify an existing system, I try and keep the need for backup and recovery at the front of my mind.

The most common issue encountered with backing up a VLDB is the time it takes to run the backup – it can’t be done in the time between backup cycles, usually a day.

The second most common issue is the impact on the live system of running the backup. Sometimes this impact is overstated, after all if the backup is one thread running on the server it is only going to consume as much resource as one thread can, which may well leave enough over for the daily processing requirements, but usually for large systems steps have been taken to run the backup in parallel, thus creating considerable load on the system.

A third issue, which is related to the first, is that the backup takes so long and uses so many tapes (or space) that it rarely completes – a network glitch, a failure of the backup suite, running out of media, all stop the backup finishing. I’ve been in the situation of attempting 4 or 5 backups for each one that succeeds as something crops up in the 2 or 3 days it takes to run the backup. {In our case it was the flaky backup software, grrrrr}.

The final issue I’ll mention is one that is often overlooked. You can’t afford the time to recover the backup if it was ever needed. I’ve seen this especially with export or expdp-based backups – some sites still use export and it has its place with smaller systems – often it seems to be used with OLTP systems that have more than 75% of the database volume as indexes. The export runs fine overnight; it is only processing that 25% of the system that is data. But when you ask the client if they can wait 5 days to import the export, they go pale. This time-to-recover can also be a problem with RMAN backups – you need to read in everything you wrote out.

I’ve said it before but I’m going to say it again – a backup is not a backup until you have done a successful test full recovery. This would certainly highlight how long your recovery takes.

So, how do you solve the problem of backing up a VLDB?

Well, one solution is to not bother. I know of a couple of sites that have two physical copies of the database, at different locations, and write all data to both. If they lose one copy, they can keep running on the other copy whilst the lost version is rebuilt. Your swap-over could be almost instant.
Drawbacks here are:

  • If you lose one copy you have no redundancy until the second system is rebuilt. This is like losing a disk out of a RAID5 array, another failure is disaster. As databases get bigger, this period of zero redundancy gets longer and thus the chance of a second failure increases (which again is just like the RAID5 array – yet another argument against massive discs).
  • As you write to both systems, if the damage is caused by the feed (eg accidentally deleting data) then both are damaged, unless you have a delay on one system, in which case you now have issues with catching up on that delay if you have to swap to the second system. Flashback may save you from damage caused by the feed.
  • The cost of the second system and the complexity of the double-writes can both be issues.

Another solution is physical DataGuard. I see this as slightly different from the double-system approach as above as you have more options, such as replicating to more than one other system, opening and reading  the DataGuard copy, opening and using the copy before flashing it back and re-recovering, even Active DataGuard, where you can have the standby database open and being used, even whilst it is kept up-to-date. Again, you can set things up so that the gap between primary system failure and bringing up a new production system is small. A few issues to be mindful of are:

  • You have to ensure that your primary database is running in forced logging mode or you are extremely, and I mean extremely, careful about what you do against the database that is unrecoverable. The latter option is just asking for trouble actually. Which is a shame, as all those performance tricks of doing direct IO, append operations and nologging activities to help you process all the data in your VLDB are no longer available to you. This might be a show-stopper.
  • You have to take care in setting it all up and may need an extra licence.
  • You still have the issue of damage being transmitted to your “backup” before you spot it.
  • The main issue? Someone will get clever and use your DataGuard systems for other things {such as opening the standby to change it and then flashing the data back, or using Active DataGuard for reporting that becomes critical to your business} and now you actually have a production-critical system split across the DataGuard architecture. It has stopped being a backup, or at least not a dedicated backup. Ooops.

There is actually no need to back up the whole database every night, though some sites seem fixated on achieving this. Or even every week. There is nothing wrong in having an RMAN level 0 {zero} backup that is a copy of everything and then just backing up the archived redo logs for, say, 2 weeks before doing another level 0. So long as you thoroughly test the recovery and ensure you can recover the level 0, get hold of all those redo logs and apply them in a manner timely enough to support your business. I’ve recovered a level 0 backup over a month old and then run through all the archived redo logs to recreate the system; it worked fine as the volume of redo was pretty small compared to the database. Some considerations with this method are:

  • If you ever have trouble getting files from different days out of your backup area, or occasionally find files from your backup system are corrupt, do not even think of using this method. One missed archived redo file from 13 days back and you are in serious trouble.
  • You need to do those level zero backups and they take a while. Remember what I said about issues during a long backup?
  • It can get complex.
  • There is going to be a fairly significant delay in recovering your system.

There are several options with RMAN of doing incremental and cumulative incremental level 1 backups against a level 0 baseline backup. They have the same pros and cons as above, often trading more complexity with shorter recovery times. All good so long as you practice the recovery.
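As a sketch of what one such cycle might look like in RMAN {the commands are standard, the tags and the schedule are just examples}:

    # Weekly baseline
    BACKUP INCREMENTAL LEVEL 0 DATABASE TAG 'weekly_level0';
    BACKUP ARCHIVELOG ALL NOT BACKED UP 1 TIMES;

    # Nightly in between (cumulative, so each night rolls up everything since the level 0)
    BACKUP INCREMENTAL LEVEL 1 CUMULATIVE DATABASE TAG 'nightly_level1';
    BACKUP ARCHIVELOG ALL NOT BACKED UP 1 TIMES;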

Physical copy at the storage level. These solutions seem to come and go every few years, but the principle is usually either (a) splitting mirrors – you have, say, 3 copies of the data duplicated across the storage, you can un-couple one copy and do with it what you want, like copy it to tape, and then reintroduce the copy and catch up on changes, ie “resilver” the mirror; or (b) use fancy logging within the storage layer to create a logical copy of the whole live DB at a point in time by tracking and storing changes. You can then take your time copying that logical version to your backup destination. Taking the initial copy is usually instantaneous and with (b) can take up a surprisingly small amount of space. Disadvantages?

  • Cost. These clever IO units that can do this seem to be eye-wateringly expensive.
  • Tie-in. If you move storage provider, you need to re-plan and implement a new backup strategy.
  • Probably a personal one this, but can you trust it? I saw it go horribly wrong in about 1998 and even now I kind of wince internally thinking about it.

Export and Import. OK, I kind of rubbished this approach earlier and who in their right minds would try and export/import a VLDB of 20TB? You don’t. You export the critical few percent of the database that you need to recreate some sort of temporary production-capable system. Many applications can actually get by with all the reference/lookup data and the latest month or two of active business data. It gets a workable system up and running to keep your business process ticking over whilst you sort out recovering the rest of the system. The beauty of an export is that it can be imported to any working Oracle database of a high enough release level.

3 months ago I would have said this consideration needed to have been designed into your system architecture from the start to stand any reasonable chance of working, but I know of one site that managed just this technique recently. Only because they absolutely had to, but they managed it.

My final backup methodology I’m going to mention here is – you do not need to back up all of your database in the same way. If you can move a large percentage of your database into readonly tablespaces, you can back up that section of the database once {disclaimer: by once I mean two or three times, to two or three places, checking you can read what you wrote, padlocking the door to the vault it is in, and repeating said once-only backup every 6-12 months} and drop that section out of your backup. Now you only need to back up the remaining, hopefully small, active section of the database with whatever method you want. You can tie this in with the previous idea of only needing to recover a critical subset of the system to get going again, ie what is not readonly – the two approaches complement each other. A few issues (there is a small sketch of the idea after this list):

  • It only really works when you design this into the system from the start.
  • Potentially complex recovery spread over a lot of tapes. Can you rely on being able to get at them in a timely manner?
  • People have a habit of wanting to update some of the stuff you made readonly. Sometimes only a few records but spread over a lot of readonly tablespaces.
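The sketch mentioned above – nothing clever, and the tablespace name is made up:

    -- Freeze a chunk of history so it cannot change (made-up tablespace name)
    ALTER TABLESPACE orders_2008 READ ONLY;

    -- Back that section up once (well, a few times, to a few places), then let the
    -- routine backups skip anything that cannot have changed:
    -- RMAN> BACKUP DATABASE SKIP READONLY;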

All the above can be mixed and matched to come up with an overall solution. Personally I like having a physical standby database for immediate failover and an incremental backup off site for recovery beyond some situation that gets both primary and standby database.

What is a VLDB? September 18, 2009

Posted by mwidlake in Architecture, VLDB.
10 comments

In a post just a couple of days ago on testing, I complained that VLDBs break at the edges. Coskan posted a comment asking I blog on what I consider to be a VLDB and thus what a VLDB DBA is, and I am more than happy to oblige, especially as Coskan not only comments a lot but provides a really good summary of blogs on his own blog. {in fact, I need to add Coskan to my blog roll, something I have been meaning to do for ages}.

Hopefully, this will link to the comments section of that post as Piet deVisser added a wonderful comment answering the question for me. Go and read, I suspect it is better than my attempt here!

VLDB stands for Very Large DataBase. It is not an acronym I like as it sounds suspiciously like a sexually transmitted disease, but maybe that is just a problem with my mind. The term ULDB appeared for a while but seems to have failed to gain traction. U stands for “Ultra” of course.

So what is a Very Large DataBase?

A VLDB is a database whose very size gives you, as a DBA or database architect, extra work.

Maybe a simpler rule that you can apply is “you can’t back up the database in 24 hours using standard tools”. You can chuck more tape drives and IO channels at a DB but you will quickly hit a limit where your infrastructure or budget can’t cope.

Another, and one I pinch outrageously from Piet is “you can’t afford to duplicate the database for test/QA purposes”. That leads to a whole raft of issues.

I put forward another definition of a VLDB in a comment on the original blog: “Any database whose size makes 8 out of 10 DBAs comment ‘that is a big database’.” That definition takes into account whether a database is generally beyond the experience of most DBAs/designers. Why do I think that is significant? Because it means most DBAs/designers will not have worked with a database that size and thus dealt with the associated problems. The database engine may {or may NOT, as I have complained about} cope with the database size, but you need staff to design it and look after it who know how to do so.

The definitive size of a VLDB, of course, goes up year by year. A few weeks ago I found a document I have mentioned in presentations a couple of times, an internal White Paper by Oracle Corp on what a VLDB is, written around 1994. Next time I am at home I’ll scan it. If I remember correctly, at that time 30GB and beyond on a VMS or Mainframe system was considered a VLDB and, in Unix (the new kid on the block back then), 20GB was the threshold.

Right now, as of September 2009, I would judge any database over 10TB of used space is a VLDB. In 12 months, that will be 20TB. In another 12 months, 40 or maybe 50TB.

“Moore’s Law” traditionally states that compute power doubles every 18 months, but I have just suggested that the VLDB limit doubles every 12 months. I say that as, over the last 10 years, I have worked on several systems, systems most DBAs would consider “challengingly large”, which doubled in size in 12 months or less. Data is exploding. More and more of us are dealing with VLDBs.
This “doubling in 12 months” was not the case (I think) back in 1995; it started in 2000 or so. Before then, database size was doubling about in line with, or slower than, Moore’s Law I would say, but that is only my opinion.

What changed? People swapped from thinking you could store only what you really needed to thinking you could store “everything”. Which is “everything” your Moore’s-law expanding CPUs can process PLUS all the summary and metadata you extract from that data.

I could be wrong in my figures though. If you took size as the determining factor and doubled 20GB every 18 months from 1994, you would now class a VLDB, in 2009, as approx 20TB.

What main issues do you face with a VLDB?

  • Backing up the database. With a VLDB, a daily backup of everything via RMAN or Hot Backup is simply not possible, as you can’t run the backup in 24 hours. You need to: Backup less often; backup only part of the DB; use hardware such as mirror splitting or deltas; some other trick like, say, never backing it up but having 3 standbys. I’ve seen it done.
  • Performance. You need to consider radical changes such as removing RI or designing around full table scans and ignoring the block buffer cache for the largest tables.
  • The number or size of objects starts causing bits of Oracle to break or work less efficiently (so many tables it takes 2 minutes to select them all, or you hit an unexpected limit like the 2TB disk size in ASM, because you need to use bigger disc sizes as otherwise you need more discs than you can easily manage).
  • Maintenance tasks become a challenge in their own right. This could be stats gathering, it could be adding columns to a table, it could be recreating global indexes, all of which now take more time than you can schedule in the maintenance windows {so part of the definition of a VLDB could be down to how active a database is and how small your maintenance windows are – 1TB could be a VLDB if you can never spend more than an hour doing anything!}
  • GUIs are no use to you. Listing all the tablespaces in your database with OEM is a pain in the proverbial when you have 962 tablespaces. You can’t keep track of all of them, visually.
  • You can’t properly test or prototype as you cannot afford to create a full sized test system

I’d like to pick up that last point. With a VLDB, you often end up doing things on the live system that you have been unable to test or prove because you simply lack a test system that is even within an order of magnitude the size of your live system. RAC is a particular issue; it seems many sites are happy to have the live system as a RAC system but not the test or development systems. When you raise the issue, the response is often “well, there is not that much difference between RAC and non-RAC systems, is there?”. You are not allowed to get violent with the client, or even deeply sarcastic. Yes, there is a huge difference.

A VLDB DBA is someone who has had to consider the above for more than a few months, or on more than one system. Or who simply cries when you mention Oracle breaking when it exceeds size limits.

How do you know when you are dealing with a ULDB? When you can find no one else who will speak publically about a database bigger than yours. When I was working with the Sanger Institute on my pet “it is really quite huge” database I would often have these frustrating conversations with Oracle Corp:

“X is giving me trouble as the database is just so large”

“It is not that large, we have customers with bigger databases”

“Can you introduce me, so we can talk about these issues?”

“Errr, no, they don’t like to talk about their systems”.

Thanks. Thanks a bunch.

Remember, there is always someone with a bigger DB than you. But they probably won’t talk about it.

 

Enough for tonight….

Testing is Not Just for Code. September 16, 2009

Posted by mwidlake in Architecture, VLDB.
7 comments

Someone I am currently working with has a wonderful tag line in her emails:

Next time we want to release untested, why don’t we just release undeveloped?

Testing is not limited to testing code of course. I have recently posted about how a backup is not a backup until you have tested it with a practice recovery.  How you think the database will work by looking at the data dictionary is just a nice theory until you run some actual tests to see how the database responds, as I have been doing with Histograms lately. Sadly, you could even say an Oracle feature is not an Oracle feature until you have tested it.

In my experience, this is particularly true when you test the edges of Oracle, when you are working on VLDBs {Very Large DataBases}.

Last month Jonathan Lewis posted about a 2TB ASM disc size bug, where if you allocated a disc over 2TB to ASM, it would fill it up, wrap around and write over the beginning of the file. This week I heard from some past colleagues of mine that they hit this very same bug.
With these very same colleagues we hit a bug in 10.1 where you could not back up a tablespace over 8TB in size with RMAN {I can’t give you a bug number for it as we were working with HP/Oracle direct at the time and they “handled it internally”. But when I mentioned it to him, Jonathan found a similar one, bug 5448714, which stated a 4TB limit on backups. It could be the same bug}.

Yet another VLDB issue was when we wanted to move just under one thousand tablespaces from one database to another {again, 10.1}, using transportable tablespaces. We tried to use the utility for checking you are working on a consistent set of tablespaces, but it could not cope with that many. And to plug them into the new database you have to export the metadata, where we found a 4000 character limit on the variable stating the tablespaces to transport. That’s 2.3 characters per tablespace, as you need commas to delimit them… Yes, you could manage if you renamed all tablespaces to AA, AB, AC… BA, BB, BC etc. If memory serves, the problem was with Data Pump export and we reverted to old-style export, which did not have the problem.
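For anyone who has not driven transportable tablespaces by hand, the checking utility and the metadata export in question look roughly like this {a sketch with made-up tablespace names; it was the long comma-separated list handed to Data Pump that hit the limit, and old-style exp that got us round it}:

    -- Check the candidate set is self-contained (no objects straddling other tablespaces)
    EXEC DBMS_TTS.TRANSPORT_SET_CHECK('TS_AA,TS_AB,TS_AC', TRUE);
    SELECT * FROM transport_set_violations;

    -- Old-style export of the tablespace metadata; the Data Pump equivalent takes
    -- transport_tablespaces=TS_AA,TS_AB,TS_AC and that list is where the limit bit
    -- exp userid=... transport_tablespace=y tablespaces=TS_AA,TS_AB,TS_AC file=tts_meta.dmp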

Another limit I’ve blogged on is that the automated stats job chokes on very large objects.

Some data dictionary views can become very slow if you have several tens of thousands of tables/extents/tablespaces/indexes.

I can appreciate the issues and problems Oracle has with testing their code base, it is vast and people use the software in odd ways and it has to run on many platforms. You might also feel I am being picky by saying Oracle breaks a little when you have 8TB tablespaces or a thousand tablespaces. But

  • Oracle will say in big, glossy presentations, you can build Petabyte and Exabyte databases with Oracle {and have a product called Exadata, don’t forget}.
  • More and more customers are reaching these sizes as data continues to grow, for many sites faster than Moore’s Law.
  • Some of these limits appear with databases well below a Petabyte (say a tiddly small 50TB one :-) ).

I’ve been running into these issues with VLDBs since Oracle 7 and they are often with pretty fundamental parts of the system, like creating and backing up tablespaces! I think it is a poor show that it is so obvious Oracle has been weak in testing with VLDB-sized databases before release.

I wonder whether, with 11gR2, Oracle actually tested some petabyte data sizes to see if it all works? After all, as is often said, disk is cheap now, I’m sure they could knock one up quite quickly…
