
Buffer Cache Hit Ratio – my “guilty” Confession November 1, 2009

Posted by mwidlake in Perceptions, performance.

My Friday Philosophy this week was on Rules of Thumb on buffer gets per row returned.

Piet de Visser responded with a nice posting of his own, confessing to using ratios to help tuning {We seem to be playing some sort of blog-comment tag team game at the moment}.

Well, I have a confession so “guilty” or “dirty” that I feel I cannot inflict it on someone else’s blog as a comment.

I use the Buffer Cache Hit Ratio.

And the Library Cache Hit Ratio and the other Ratios.

As has been blogged and forum’d extensively, using these ratios is bad and stupid and anyone doing so does not know what they are doing as they do not help you solve performance problems. I mean, hell, you can download Connor McDonald’s/Jonathan Lewis’s script to set it to what you want, so it must be rubbish {go to the link and choose “tuning” and pick “Custom Hit Ratio” – it’s a rather neat little script}.

The point I am trying to make is that the Buffer Cache Hit Ratio (BCHR) was once wrongly elevated to the level of being regarded as a vital piece of key information, but the reaction against that silly situation has been that it is now viewed by many (I feel) as the worst piece of misleading rubbish. Again, a silly situation.

I think of the BCHR as similar to a heart rate. Is a heart rate of 120 good or bad? It’s bad if it is an adult’s resting heart rate, but pretty good if it is a kitten’s resting heart rate. It’s also probably pretty good if it is your heart rate as you walk briskly. Like the BCHR it can be fudged. I can go for a run to get mine higher, I can drain a couple of pints of my blood from my body and it will go up {I reserve the right not to prove that last one}. I can go to sleep and it will drop. Comparing my resting heart rate to yours (so like comparing BCHRs between systems) is pretty pointless, as I am a different size, age and metabolism to you {probably} but looking at mine over a year of dieting and exercising is very useful. If only I could keep up dieting and exercising for a year…

So what do I think the much-maligned Buffer Cache Hit Ratio gives me? It gives me what percentage of SQL block access, across the whole database activity, is satisfied from memory as opposed to disc. Or, put another way, the percentage of occurrences on which a block has to be fetched from the I/O subsystem. Not how many blocks are read from storage or memory though, but you can get that information easily enough. As physical IO is several orders of magnitude slower than memory access {ignoring I/O caches, I should add}, it gives me an immediate feel for where I can and can’t look for things to improve.
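
For reference, a minimal sketch of the classic calculation from v$sysstat (the usual “1 minus physical over logical” formula – there are several variants, this is just the common one):

-- Rough system-wide BCHR since instance startup, from v$sysstat.
SELECT ROUND((1 - phy.value / NULLIF(db.value + con.value, 0)) * 100, 2) AS bchr_pct
FROM   v$sysstat phy, v$sysstat db, v$sysstat con
WHERE  phy.name = 'physical reads'
AND    db.name  = 'db block gets'
AND    con.name = 'consistent gets';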

If I am looking at a system that is overall very slow (e.g. high process wait queues under Linux/Unix, or the client has said the system is generally slow) and I see that the BCHR is low, say below 90%, this tells me I can probably get some performance increase by reducing physical access. I’ll go and look for the statements with the highest physical IO and the hottest tablespaces/objects in the DB.
If the BCHR is already up at the 99% level, I need to look at other things, such as tuning sorts, removing unnecessary activity in the database, or being very mindful of nested loop access where it may not be the best access method (very likely due to old stats on tables).

When I have got to know a system and what its BCHR generally sits at, a sudden change, especially a drop, means there is some unusual physical IO going on. If the phones start ringing and someone is complaining “it’s all slow”, the BCHR is one of the first things to look at – especially as it is available from so many places.

Another thing the BCHR gives me is that, if I am looking at a given SQL statement or part of an application, its specific BCHR can be compared to the system BCHR. This does not help me tune the statement itself, but I know that if its specific BCHR is low then it has unusually high IO demands compared to the rest of the system. Further, reducing it might help the whole system, so I might want to keep an eye on overall system throughput. If I reduce the statement’s execution time by 75% and the whole system IO by 1%, the client is likely to be more happy, especially if that 1% equates to other programs running a little faster “for free”.
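
As an illustration (a sketch only, not one of the scripts mentioned in this post), the statement-level figure can be pulled from v$sql using buffer_gets and disk_reads – ‘&sql_id’ is just a placeholder for the statement of interest:

-- Per-statement "BCHR" for a single statement, to compare against the system figure.
SELECT sql_id, buffer_gets, disk_reads,
       ROUND((1 - disk_reads / NULLIF(buffer_gets, 0)) * 100, 2) AS stmt_bchr_pct
FROM   v$sql
WHERE  sql_id = '&sql_id';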

So, I don’t use the BCHR to tune individual statements but I feel confident using it to track the general health of my database, so long as I am mindful of the impact of new functionality or upgrades. It’s a rule of thumb. It’s a database heart rate (and so are redo generation and half a dozen other things).

Comments»

1. Noons - November 1, 2009

Not in the same “ratio” mood, but I use a query on v$sql pulling in the sql text and counters for the 20 busiest statements by logical reads.
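
A sketch of that sort of top-N query (illustrative only, not the exact one – in v$sql the buffer_gets column is the logical-read counter):

-- Illustrative only: the 20 busiest statements by logical reads.
SELECT *
FROM  (SELECT sql_id, SUBSTR(sql_text, 1, 100) AS sql_text_start,
              buffer_gets, disk_reads, executions
       FROM   v$sql
       ORDER  BY buffer_gets DESC)
WHERE  rownum <= 20;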

If two or three of them use orders of magnitude more buffers than all the others, it’s a pretty fair bet they’d be prime candidates for a second detailed look.

That might entail extracting the plan, checking the duration/waits, talking to the developers/application people, etc. etc.

It’s not a catch-all, but it pretty much narrows down bad sql in one simple statement. And bad sql is by far the biggest contributor to bad performance I’ve ever seen.

2. mwidlake - November 1, 2009

Hi Noons,

I have a similar script, chk_sga.sql {it’s on my web site if anyone wants it}, which has several WHERE clauses I can comment in and out. It pulls out SQL from the SGA by most buffer gets per execution, most disk IOs per execution, or most buffer gets/disk IOs irrespective of executions.
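
The script itself is on the web site; purely as a sketch of the idea (not the actual chk_sga.sql), the per-execution flavour looks something like this, with the ORDER BY swapped depending on what you are hunting for:

-- Sketch only: SQL in the SGA by buffer gets (or disk reads) per execution.
SELECT sql_id, executions,
       ROUND(buffer_gets / NULLIF(executions, 0)) AS gets_per_exec,
       ROUND(disk_reads  / NULLIF(executions, 0)) AS reads_per_exec
FROM   v$sqlarea
ORDER  BY buffer_gets / NULLIF(executions, 0) DESC NULLS LAST;
-- ORDER BY disk_reads / NULLIF(executions, 0) DESC NULLS LAST;
-- ORDER BY buffer_gets DESC;  -- irrespective of executions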

It works really well for OLTP systems but not so well for DW systems – as the SQL it pulls out tends to be several pages of A4 long.

At that point I don’t tend to look at ratios or try to tune the code. I ask how in heck anyone knows if the SQL is bringing back the correct data!

3. oraculix - November 1, 2009

Good to see someone has the guts to post such a confession! If this was a manifesto, this comment would be my signature, as I regard hit ratios in almost the same manner as you do. At least for OLTP systems.
A quick glance at the ratios in an ADDM or Statspack report, maybe combined with OS performance metrics, can often give you a good first hint on where to dig deeper.
If you know that your goal isn’t “make that ratio better” but “make that LIO lower” or “use more binds” (OLTP, again), I don’t see any heresy in that approach.
Best regards,
Uwe

mwidlake - November 2, 2009

Thanks Uwe, it is good to know I am not alone 🙂 I know these ratios have their flaws, but so does almost every rule/best practice/key principle that comes up.

4. oakesgr - November 2, 2009

Nice post Martin. If they were completely useless then they wouldn’t have been part of Statspack etc. It’s all about the context they’re used in as you rightly point out.

5. Log Buffer #168: a Carnival of the Vanities for DBAs | Pythian Group Blog - November 6, 2009

[…] Chen! But do you have guilty feelings like Martin Widlake does? He makes a guilty confession. The sin? “I use the Buffer Cache Hit […]

6. Jonathan Lewis - November 9, 2009

Martin,

There is a lot to be said for your human/kitten analogy. And I think the most important thing to be said is that when it comes to databases some DBAs can’t tell the difference between a human and a kitten, and others have a huge zoo full of databases and can’t remember which database is supposed to be what animal.

Oakesgr,

Two reasons why the ratios are on the AWR report: inertia and backwards compatibility; they were there in bstat/estat. Think about the huge amount of rubbish on the support system formerly known as Metalink – because it’s there it gets repeated by support analysts, and because it has been repeated recently people think it must be true.

I think all numbers can be abused, misused and misunderstood, but the hit ratios seem to be most open to the problem. Perhaps it’s because they are one extra step away from the level of human comprehension – the difference between user_indexes.clustering_factor and user_indexes.avg_data_blocks_per_key makes for an interesting analogy.

mwidlake - November 9, 2009

Certainly the “species” of your database is going to be a major factor in what you would expect the various ratios to be, Jonathan.

Thinking a little about your comment, about how some people don’t know what sensible ratios would be for their system (either through lack of experience/understanding, or through having too many systems to look at to recognise the individuals in the crowd), reliance on a given rule on a ratio is of course dangerous.
But then you could look at it this way – the BCHR is an indicator of how your database is working, how much information is found in the database cache. The ratio is not the problem; how people have been taught to interpret it is. Correct use of the ratio is good.
Which is a big part of why medicine (and biology as a whole) is a very complex and difficult topic. Very few fixed rules, lots of ratios, often contradictory!

Jonathan Lewis - November 11, 2009

“The BCHR is an indicator of how your database is working, how much information is found in the database cache”

But the amount of work your database is doing can change enormously without the BCHR changing at all. Conversely, how much of a change in the BCHR would you have to see before you were alerted to the fact that your database might be doing more work than usual?

To me, one of the nicest things about the BCHR is that it appears below the Load Profile in the statspack/AWR report so I’ve already seen the “Logical reads” and “Physical reads” before I get to the BCHR – so I don’t need to look at it.

Regards
Jonathan Lewis

7. Blogroll Report 30/10/2009-06/11/2009 « Coskan’s Approach to Oracle - November 10, 2009

[…] Martin Widlake-Buffer Cache Hit Ratio my guilty Confession […]

8. mwidlake - November 11, 2009

Hi Jonathan,

“seen the “Logical reads” and “Physical reads” before I get to the BCHR – so I don’t need to look at it.”
So you have looked at the Logical and Physical reads, done the quick calculation in your head and come up with the BCHR…. Or you can just read the BCHR and save doing the mathematics in your head 🙂

But flippancy aside, I agree, yes, knowing the actual values for Logical and Physical reads is more information than the ratio itself. As you point out, the ratio is not how much work your database is doing {my comment was sloppy, I think I made the point correctly in the posting itself}. If my BCHR stays the same when the actual numbers of logical reads and physical reads go up, could that give me some indication of how well the database was designed to scale {if no new applications had been added}, or that new application code is as “I/O efficient” as the existing code?

I used the BCHR yesterday. We are looking at compressing data. In our testing, we saw that for queries visiting the I/O subsystem the performance boost was 55% {query times roughly halved}, but from cache there was a 10% or so overhead. Is it worth doing? Well, our BCHR is 98-99.4%, so for the whole database probably not {98% slowed down by 10%, 2% sped up by 55%}. For older data not in the cache, then yes, so we will probably only compress the older data (via partitions).

If only the AVG_CACHE_HIT_RATIO in DBA_TABLES was populated, we could use that information to really help us know which segments to compress…

Jonathan Lewis - November 12, 2009

Martin,

You’ll have to explain what you mean by “from buffer cache there was 10% overhead”, and how you turn your 55% performance boost into a figure that connects to the BCHR. I can make a couple of guesses – but can’t make your description fit the way I would use the statistics available from testing.

If you want to experiment with object level cache hit ratios you could look at v$segstat (or wrh$_seg_stat / dba_hist_seg_stat) to compare physical reads with logical reads.
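
A minimal sketch of that sort of comparison (illustrative only – v$segstat joined to dba_objects just to get readable names):

-- Sketch: logical vs physical reads per segment, from v$segstat.
SELECT o.owner, o.object_name, s.statistic_name, s.value
FROM   v$segstat s
       JOIN dba_objects o ON o.object_id = s.obj#
WHERE  s.statistic_name IN ('logical reads', 'physical reads')
ORDER  BY o.owner, o.object_name, s.statistic_name;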

Regards
Jonathan Lewis

mwidlake - November 12, 2009

By “10% overhead” I mean we created two tables, one compressed and the other not, but with exactly the same data. When we selected the same data from the two tables several times (so the data was in the buffer cache, and we ignored the first iteration) the compressed data took longer to come back and 10% more CPU was consumed doing so {information was obtained from v$sqlarea}.
The time difference seemed to be less significant – it was on a busy system so the variance was a little high, but around 6-8% slower. I was focused on the extra CPU as one of our RAC nodes is running with very high CPU usage.
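
For what it is worth, the shape of that sort of test is roughly as follows (a sketch only – the table and source names are made up, and the real test was rather more careful):

-- Illustrative only: two copies of the same data, one compressed, one not.
CREATE TABLE test_nocomp          AS SELECT * FROM source_table;
CREATE TABLE test_comp   COMPRESS AS SELECT * FROM source_table;

-- Run the same query against each table several times (ignoring the first,
-- cold-cache run), then compare the work recorded in v$sqlarea:
SELECT sql_id, executions, buffer_gets, disk_reads, cpu_time, elapsed_time
FROM   v$sqlarea
WHERE  sql_text LIKE '%test_nocomp%'
OR     sql_text LIKE '%test_comp%';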

As for the 55%, that was the reduction in physical gets and elapsed time when fetching data from the compressed table compared to the normal table, when the data was not in the buffer cache.
So the simplistic (very simplistic) logic following from that is “if we compressed everything, would we gain overall or not?”
For that to be true, the work sped up by 55% would have to outweigh the work slowed down by 10%.
e.g. with a BCHR of 80%, 80% of the work slowed down by 10% versus 20% of the work sped up by 55% would still be an overall gain.
And yes, I know you are way ahead of me Jonathan – as I/O is so much slower than CPU activity, you need to factor in the contribution to overall database time of the I/O workload compared to the CPU workload…
However, our BCHR on this system is around 98-99.4%, even though it is a DW and it would be reasonable to expect our elephant to have a slower heart rate than an OLTP mouse – the guys before me have put the database through a pretty good fitness program already 🙂
So speeding up the 0.6-2% of the select activity by 55% is going to be beneficial, but we need to be mindful to avoid compressing the data that makes up 98%-plus of the requests from our applications.
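
As a very rough worked check of that, counting block accesses only and ignoring the fact that each physical read costs far more time than a cached read:

0.98 x 10% = 9.8% (cached accesses slowed) versus 0.02 x 55% = 1.1% (uncached accesses sped up)

so on raw access counts compressing everything would be a net loss here; it is the much higher per-access cost of the physical reads, and the option of compressing only the older, rarely-cached partitions, that swings the decision the other way.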

You know, I am coming around to the point you made a few comments back – the BCHR only really helps when you know to factor in a load of other things and consider it in context….

9. Donald K. Burleson - December 28, 2009

These claims about the BCHR being useless are solely because it’s a ratio, nothing more!

I’ve never understood why anybody who knows what a ratio is would not innately realize the limitations of the BCHR.

Now that Oracle 11g has embraced the idea of holistic, workload-based tuning (e.g. the SPA), the BCHR is a great way to tell when the characteristics of your SQL workload have changed . . .

