With Modern Storage the Oracle Buffer Cache is Not So Important. May 27, 2015

Posted by mwidlake in Architecture, Hardware, performance.

With Oracle’s move towards engineered systems we all know that “more” is being done down at the storage layer and modern storage arrays have hundreds of spindles and massive caches. Does it really matter if data is kept in the Database Buffer Cache anymore?

Yes. Yes it does.

Time for a cool beer

With much larger data sets and the still-real issue of fewer disc spindles per GB of data, the Oracle database buffer cache is not less important than it was. It is even more important.

I could give you some figures but let’s put this in a context most of us can easily understand.

You are sitting in the living room and you want a beer. You are the Oracle database, the beer is the block you want. Going to the fridge in the kitchen to get your beer is like going to the Buffer Cache to get your block.

It takes 5 seconds to get to the fridge, 2 seconds to pop it open with the always-to-hand bottle opener and 5 seconds to get back to your chair. 12 seconds in total. Ahhhhh, beer!!!!

But – what if there is no beer in the fridge? The block is not in the cache. So now you have to get your car keys, open the garage, get the car out and drive to the shop to get your beer. And then come back, pop the beer in the fridge for half an hour and now you can drink it. That is like going to storage to get your block. It is that much slower.

Well, it is only that much slower if you live a 6-hour drive from your beer shop. Think taking the scenic route from New York to Washington DC.

The difference in speed really is that large. If your data happens to be in the memory cache in the storage array, that’s like the beer already being in a fridge – in that shop 6 hours away. Your storage is SSD-based? OK, you’ve moved house to Philadelphia, 2 hours closer.

Let’s go get beer from the shop

To back this up, some rough (and I mean really rough) figures. Access time to memory is measured in microseconds (“us”, millionths of a second) down to hundreds of nanoseconds (“ns”, billionths of a second); somewhere around 500ns seems to be an acceptable figure. Access to disc storage is more like milliseconds (“ms”, thousandths of a second). Go check an AWR report, Statspack, OEM or whatever you use and you will see that db file scattered reads take anywhere from the low teens of ms down to 2 or 3 ms, depending on your storage and network.

For most sites that speed has hardly altered in years: though hard discs get bigger, they have not got much faster, and you often end up with fewer spindles holding your data because storage allocates you space, not spindles (the total sustainable speed of hard disc storage is limited to the combined speed of all the spindles involved). Oh, the storage guys tell you that your data is spread over all those spindles? So is the data for every other system, so you have maximum contention.
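
If you want to pull the figure from your own system rather than an AWR report, a quick sketch like this against the standard v$system_event view gives the instance-wide averages since startup. Treat it as illustrative only; your event timings will vary.

-- Average physical read times since instance startup.
-- time_waited_micro is in microseconds; divide by waits for an average.
select event,
       total_waits,
       round(time_waited_micro / total_waits) as avg_wait_us
from   v$system_event
where  event in ('db file sequential read', 'db file scattered read')
and    total_waits > 0;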

However, memory speed has increased over that time, and so has CPU speed (though individual CPUs have largely stopped getting faster; the gains now come from having more of them).

Even allowing for latching and pinning and messing around, accessing a block in memory is going to be at the very least 1,000 times faster than going to disc, maybe 10,000 times. Sticking to a conservative 2,000 times faster for memory than disc, that 12-second trip to the fridge equates to 24,000 seconds of driving. That’s 6.66 hours.

This is why you want to avoid physical IO in your database if you possibly can. You want to maximise the use of the database buffer cache as much as you can, even with all the new Exadata-like tricks. If you can’t keep all your working data in memory, in the database buffer cache (or use the In-Memory option or the result cache), then you will have to do that achingly slow physical IO, and that is where the intelligence-at-the-hardware comes into its own: true Data Warehouse territory.
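
As an aside, if you are wondering whether your buffer cache is big enough, Oracle will even predict the effect of resizing it for you. A rough sketch, assuming an 8KB default block size and that the cache advice is enabled (the DB_CACHE_ADVICE parameter):

-- Estimated physical reads at candidate buffer cache sizes.
-- size_factor = 1 is the current size; estd_physical_read_factor
-- shows how physical reads would scale at each candidate size.
select size_for_estimate  as cache_mb,
       size_factor,
       estd_physical_read_factor,
       estd_physical_reads
from   v$db_cache_advice
where  name          = 'DEFAULT'
and    block_size    = 8192   -- assumption: 8KB default block size
and    advice_status = 'ON'
order  by size_for_estimate;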

So the take-home message is – avoid physical IO, design your database and apps to keep as much as you can in the database buffer cache. That way your beer is always to hand.

Cheers.

Update. Kevin Fries commented to mention this wonderful little latency table. Thanks Kevin.

“Here’s something I’ve used before in a presentation. It’s from Brendan Gregg’s book – Systems Performance: Enterprise and the Cloud”


Comments»

1. aa - May 27, 2015

I’m going to test your beer examples today … never believe beer experts’ articles without test cases 🙂

mwidlake - May 27, 2015

Just as you should. Always check out the experts, ALWAYS check out the beer 🙂

2. Bos - May 27, 2015

bitter(ly) disappointed you are keeping Elgood’s Golden Newt in the fridge, surely that should be outside the fridge or as an In-memory option?

mwidlake - May 27, 2015

Real beer gets popped in for 20-30 mins before consuming in this weather – the cool room I keep beer in is not quite cool enough. If it gets left in the fridge too long, it’s removed and left to warm a little.

Bos - May 27, 2015

spoken like a real (beer) expert!!

3. Efstathios Efstathiou - May 27, 2015

There are a few other factors that matter: platform architecture, offloading capabilities and OS memory management. E.g. on a mainframe you have a way more sophisticated I/O stack with dedicated I/O processors, something that is still missing on number-cruncher architectures like RISC or x86. With x86 you have to “compensate” for this lack with a lot of ASICs, e.g. your LSI SAS HBA or FusionIO card, to offload your main processor, reduce latency and tune interrupts to perform optimally on the NUMA architecture, aka “pseudo SMP with a lot of interconnect overhead”. Also there’s a sweet spot, where too much memory/buffer cache can lead to memory management overhead “eating” valuable CPU cycles on your database server hosts. So per design you want to offload every task that does not need to run on your database server to another box, and cache wherever you can. Also you should reduce concurrency wherever you can. Oracle Exadata “addresses” most of these problems by simplifying the I/O stack to be more “mainframe like”. Storage Cells are like your “dedicated” SAN controllers, where you can burn CPU cycles and “offload” work to. The InfiniBand network solves another design problem with RAC: there are a lot more messages in a 2+ node setup 😉

mwidlake - May 28, 2015

Thanks Efstathios, all good points. I refer to your comment in my reply to Nikolay.

4. savvinov - May 28, 2015

Hi Martin,

while I agree with your message, I think that 0.5 microseconds is way too optimistic an estimate for a buffer cache get. It’s a complex activity which cannot be reduced to just a memory access (e.g. it includes acquisition and release of latches, walking hash chains etc.). Even in the simplest case it takes microseconds or even tens of microseconds.

Considering that nowadays we have all-flash arrays with sub-millisecond access times, I would say the performance gap between physical I/O and logical I/O isn’t widening, it’s closing. However, that is not a good reason to discard the buffer cache, because the difference is still quite significant.

Another thing to consider is that, like the database, storage also has a cache. But even this doesn’t make the buffer cache less important, because even aside from better performance, the buffer cache can be configured, monitored and controlled directly by the DBA — but he cannot do any of those things with the SAN or filesystem cache.

Best regards,
Nikolay

mwidlake - May 28, 2015

Thanks for that Nikolay. You are correct of course, simple memory access is not like the work that goes into reading a block buffer from the database buffer cache. In my defence, I do make a brief mention of the latching and “messing around”, and I actually base myself on the memory read being 0.1% of the physical read time, so about 5us not 500ns. I should devise a little test on my laptop to see how quickly Oracle accesses a block.
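
Something along these lines would do as a first stab (just a sketch: the table T, row count and loop count are made up for illustration, and it times whole statement executions from PL/SQL, not the raw buffer get):

-- Build a tiny table whose block(s) will stay in the buffer cache,
-- then time a million reads of it from memory (run in SQL*Plus).
create table t as select rownum as id from dual connect by level <= 100;

set serveroutput on
declare
  l_iterations constant pls_integer := 1000000;
  l_start      number := dbms_utility.get_time;  -- centiseconds
  l_dummy      number;
begin
  for i in 1 .. l_iterations loop
    select max(id) into l_dummy from t;          -- cached after first read
  end loop;
  dbms_output.put_line('avg microseconds per execution: '
      || round((dbms_utility.get_time - l_start) * 10000 / l_iterations, 2));
end;
/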

SSDs are a game changer. Some top-end systems are seriously fast, and one thing I don’t even mention is that it is the IOPS they can manage, as opposed to all-out read and write speeds, that makes the most difference.

Soon they will be moving us to the suburbs of Washington.

I wonder how things will change further when memristors become a commercial reality?

The ability to “control” the storage is of course one of the advantages of Exadata and other engineered systems, as Efstathios says in his comment – and of course these systems also avoid the contention you often suffer from with corporate, shared storage buckets. Every damned system could ask the storage for some data at the same time your database does, but with dedicated storage that problem goes away.

5. Kevin Fries - May 28, 2015

Here’s something I’ve used before in a presentation. It’s from Brendan Gregg’s book – Systems Performance: Enterprise and the Cloud

If you can’t quite get to the link, it’s also searchable as “Latency Time Scale” in Google Images.

mwidlake - May 28, 2015

Ohh, I like that, thanks Kevin. I’ve seen similar before (James Morle sometimes puts up a very similar slide when he presents on storage). I’ll put it in the main post too.

