
Friday Philosophy – Is the CBO becoming way too complex? October 19, 2012

Posted by mwidlake in Friday Philosophy, performance.

I was at the SIOUG annual conference in Slovenia this week (and a very good conference it was too) and I was watching a presentation by Christian Antognini about how the CBO learns from its mistakes. This was of course mostly about adaptive cursor sharing and cardinality feedback. Chris was also able to share a few tidbits about 12c enhancements in this area. I can’t go into details, but basically it looks like the CBO is going to not only enhance those two features but also gain a new one where the CBO can change the plan on the fly, as the same query progresses.

As I watched the presentation I found I was in two minds about this. Part of me was thinking “Yes, these features do help in the CBO realising it has chosen a poor plan and improving things for the next execution – and it will get better at it”. The other part was thinking “Oh dear, yet another source of confusion and of the performance of queries changing when ‘nothing has changed’”.

It also solidified an idea I have for a new presentation, about how what you see in the execution plan may well not be what is actually executed – i.e. there are optional bits in there that do not get run at execution time, depending on the results of prior steps.
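One easy way to see that for yourself – just a sketch, the table and column names below are made up purely for illustration – is to compare the plan with the row source statistics that were actually collected:

select /*+ gather_plan_statistics */ count(*)
from   orders o
where  o.status = 'SHIPPED'
and    exists (select null from order_lines l where l.order_id = o.order_id);

select * from table(dbms_xplan.display_cursor(format => 'ALLSTATS LAST'));

Plan steps that never actually ran show Starts = 0 in the output, even though they are sat there in the plan, and comparing E-Rows with A-Rows shows where the estimates drifted.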

But I digress. What I ended up wondering at the end of Chris’s talk was this. Is the CBO getting just too complex? It is all very well that extra abilities are being added so that there is a better chance of a good plan being chosen in the first place, and that the number of ways we can control the choice of plan grows and grows. But we are long past the point when you could write even a simple SQL statement and know what the plan will be before you explain it – let alone run it, check the plan and then run it again a few minutes later and be sure the plan will be the same.

Why does this bother me? For three reasons:

1) Performance will not be stable, so the user experience will not be consistent – and so they will be unhappy. Users will generally be happy if they run a report and it takes one minute, if it takes one minute each time. If it takes one minute and then it takes 5 seconds, they want it to always run in 5 seconds, else they feel cheated or that the system is broken. In fact, I am pretty sure a user will be happier if a report always takes 2 minutes rather than it taking 5 seconds or 40 seconds but they never know which! (So long, that is, as they never, ever see it run in less than 2 minutes.)

2) If performance of a particular SQL statement is absolutely crucial, I need to know how the CBO can and probably will satisfy it so that I can work that into my solution. Otherwise I am reduced to trying out informed options (or maybe random changes πŸ™‚ ) in order to find the optimal solution – and I cannot be sure that the plan won’t suddenly swap later unless I fix the plan. OK, it might swap to be faster, but I only notice when it swaps to be a lot slower.

3) I’ve said this before, but I am starting to really wonder how anyone new to this starts understanding Oracle performance tuning any more. Reading blogs and books about how Oracle works and what impacts the CBO teaches you a lot about what is possible, but it only makes sense when you do little tests and proofs of concept – i.e. you actually run simple examples and see what happens. But with all these options, considerations and now on-the-fly alterations to plans by the CBO, it is very confusing. Even I, after 20 years of doing performance tuning on and off, am constantly finding myself looking at relatively simple SQL and having to work out why the plan is as it is and satisfying myself that it has not changed. I started with the CBO when it was a lot simpler and I’ve been introduced to the complexities gradually, as they have been introduced, so learning about it has been easier for me.

Perhaps I should not worry about this. I mean, the CBO is getting better at doing what it does, that is good. It is getting hard for those younger and smarter than me to learn about it, so my career is protected. And there is more stuff to talk about so I can keep going to conferences and talking about them.

And with that happy idea, I’m off to start the weekend. Where was that cork screw?

Comments»

1. oraclebase - October 19, 2012

To paraphrase a notable Oracle employee, the optimizer team are always looking for the “best” plan, when the rest of us just want a consistent plan!

On the positive side, tuning has now become, “Just run the query 10 times in a row and the plan will be perfect!” πŸ™‚

Note. You can turn off adaptive execution plans and I guess you can also prevent plan changes using SQL Plan Management, although not being on the beta, I’ve not tested either in the context of 12c.
Cheers

Tim…

2. Steve C - October 19, 2012

There is always SPM if you are looking for stability. This change appears to be directly aimed at OLAP use cases. I personally am looking forward to solutions for what I call the “Nested Loop Ascension of Death”, where nested loops are done and a hash join should have been used.
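For anyone who has not played with SPM, pinning a plan you already like from the cursor cache is roughly this (a sketch only – the sql_id and plan hash value are placeholders):

declare
  l_plans pls_integer;
begin
  l_plans := dbms_spm.load_plans_from_cursor_cache(
               sql_id          => 'placeholder13ch',  -- placeholder sql_id of the statement
               plan_hash_value => 1234567890);        -- placeholder plan hash value to pin
  dbms_output.put_line(l_plans || ' plan(s) loaded as accepted baselines');
end;
/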

3. Martin Preiss - October 19, 2012

Hi Martin,

when I first heard about this “real-time optimization” in 12c I remembered a sentence from Cost-Based Oracle: “Of course, an optimizer that changes its mind every five minutes because of the ongoing activity could be more of a threat than a benefit – I think I might favor predictability and stability to intermittently failing perfection” (p. 3). I think that in current versions even the most simple join can bring strange costing effects if you hit some special conditions. Oracle has done a great job creating instrumentation for the analysis of SQL operations (I love the SQL Monitor in 11g). Since I have worked a little bit with MDX and SQL Server Analysis Services over the last few years – where you have no execution plan and the path to optimization is trial-and-error – I value these options extremely highly: it’s a lot of fun to be able to understand the work that is done by the CBO and the runtime engine and to be able to correct some errors. But with “real-time optimization” things could become quite strange – “any sufficiently advanced technology is indistinguishable from magic”, as Arthur C. Clarke wrote. Perhaps it will be a feature that can be deactivated – but of course we all need to take a look at it before there is any judgement.

Regards

Martin

4. jgarry - October 19, 2012

It’s always been too complex. It makes unwarranted assumptions about the problem domain it covers. It’s not sufficiently advanced to be good magic, and sufficiently obfuscated to require acolytes to learn incantations.

None of that would matter if it “just worked.” But its capabilities seem to outrun its own ability to do that.

Look inside the bottle for that corkscrew.

mwidlake - October 22, 2012

“It’s not sufficiently advanced to be good magic, and sufficiently obfuscated to require acolytes to learn incantations.”

I’ll tell Maria Colgan you said that, and where you live πŸ™‚
I think there is a large danger that people will just “learn incantations”, e.g. a set of hints they always apply or just re-gathering stats on certain tables with certain settings (for all columns size 1), rather than understand why the optimizer made the decision it did.
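For the uninitiated, that particular incantation usually looks something like this – the schema and table names are made up for illustration:

begin
  dbms_stats.gather_table_stats(
    ownname    => 'APP_OWNER',               -- illustrative schema
    tabname    => 'ORDERS',                  -- illustrative table
    method_opt => 'FOR ALL COLUMNS SIZE 1'); -- i.e. no histograms on any column
end;
/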

Maybe I should be glad that I know enough to at least make a stab at working out the whys and wherefores and thus stay ahead of the mystic pack.

5. Andreas Buckenhofer - October 22, 2012

Regularly I’m confronted with “DBA, make the integration environment the same as the production environment so we can reproduce performance issues”. It’s difficult to explain to “non-Oracle people” why there will still be different plans in integration (or different statistics as soon as the production and integration environments continue to gather statistics, or any other differences). It’s now getting even more difficult.

mwidlake - October 22, 2012

Hi Andreas, That is a very good aspect that I did not even consider. Yes, it is getting to the point that even if you copy stats from Prod to Integration, the CBO will spot that things are not acting as the database’s stats suggest and change plans.
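For completeness, copying stats between environments is usually done along these lines – just a sketch, the names are made up and the staging table still has to be shipped between the databases (Data Pump, db link, etc.):

-- On production: stage the schema stats in a table
exec dbms_stats.create_stat_table(ownname => 'APP_OWNER', stattab => 'STATS_STAGE')
exec dbms_stats.export_schema_stats(ownname => 'APP_OWNER', stattab => 'STATS_STAGE')

-- Once APP_OWNER.STATS_STAGE has been copied to integration:
exec dbms_stats.import_schema_stats(ownname => 'APP_OWNER', stattab => 'STATS_STAGE')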

6. Timur Akhmadeev - October 22, 2012

Hi Martin

Is the CBO getting just too complex?

absolutely it is. However this is natural code evolution driven by Oracle customers, their requirements and bug reports. Nothing is done in this area just to improve the CBO for its own sake – the improvement is done on purpose, to cover some specific SQL types (like cost-based transformations added to cover cases from the TPC-H benchmark). The more time goes by, the more cases are on the development list and inevitably the CBO becomes more and more complex. That’s how it will be until, at some point in time, Oracle probably decides to do something completely different and re-work the CBO into something else, i.e. make a fundamental change in the costing approach. I hope it won’t happen in the near future though. I like the way it is now. New features are usually controllable and they can be turned off to stabilize the system.

7. mwidlake - October 22, 2012

Thanks all for your comments.

I don’t think I would agree with Joel that the CBO has always been too complex – in Oracle 8 it was not really complex enough to actually work to a level at which it could be used! But yes, as several of you say, its complexity has been driven by solving particular issues.
One of my main problems with the CBO, as I see it, is how histograms can save your bacon 5% of the time but give you serious issues 50% of the time, especially before Oracle 11.
And yes Steve, the main times the CBO seems to give performance issues are when a Nested Loop rather than a Hash Join is prompted and it is the wrong choice. I’ve been wondering about that, as it seems to me that NL is only best when there really is only a small set of records to fetch, and that if HJ is good enough for the “worst case” it is also often good enough for the “usual case”. If I was coding the logic I would get the CBO to either check after the first execution whether the number of rows that was expected, and which produced an NL plan, really WAS low and change the plan if not, or, if I was really on the ball, to do a little check before executing.
I am aware that I have simply re-invented the existing complexity of Cardinality Feedback and forced sampling! I do wonder if the optimiser should just check whether a table whose stats say it has 0 rows really is empty (look at the first block of data), but then Oracle would have to keep track of which block really held the first row(s) of data for that to be efficient. Maybe just checking the first block would be enough.
The big saving grace is of course, as several of you say, that plans can be fixed and the various options turned off. But choosing which ones to turn off and fixing plans is sooo complex! πŸ™‚

8. Noons - October 22, 2012

The best thing Oracle could do to improve the optimiser is to remove DISTINCT from SQL.
The number of times I’ve seen that keyword being used as a substitute for good join technique – thereby activating the consequent performance disaster – defies any quantification or description!
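To sketch the classic pattern (made-up tables, of course): a join that fans the rows out, then DISTINCT to squash the duplicates back down, when a semi-join was what was actually wanted:

-- The "DISTINCT as a band-aid" version: the join to order_lines multiplies
-- the rows, then a big sort/hash unique squashes them back down again.
select distinct c.customer_id, c.customer_name
from   customers c
       join order_lines l on l.customer_id = c.customer_id;

-- What was actually meant - a semi-join, so no duplicates to remove:
select c.customer_id, c.customer_name
from   customers c
where  exists (select null from order_lines l
               where  l.customer_id = c.customer_id);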

mwidlake - October 22, 2012

Nods. I know what you mean. Whenever I see DISTINCT a little alarm bell rings in my head. I have a post half-written about an example of that…

9. Christian Antognini - October 23, 2012

Hi Martin

First of all I’m pleased that my talk gave you something to think about πŸ˜‰

About your question… IMO the CBO is not getting too complex. Really. If you look back at its first 20 years, the key concepts didn’t change so much. In fact, for costing an execution plan, only a handful of features were introduced (star transformation, bind peeking, dynamic sampling, extended stats, feedback-based optimization). All other features are related to either plan stability or new objects.

IMO other components were improved much more than the CBO.

Best,
Chris

CBO Days 2012 – http://www.trivadis.com/cbo

mwidlake - October 23, 2012

Hi Chris,

Your talk gave me a lot more to think about than just this one topic of course πŸ™‚

I’m glad to see someone arguing the other side – I’m posing the question in my post and I really am in two minds. On the one hand the CBO is doing a better and better job each release, but there is, I think, a lot of complexity – what happens to the costs if you gather system stats? How is the multi-block-read-count adjusted? Automated stats gathering was stealth-introduced in 10.1 and it took ages to find out (or work out) what it really did.

Don’t forget Chris, you have been working with the CBO for more years than you probably want to think about and know it inside out. To you it all makes sense.

At the lowest level it is just doing one thing though – estimating the equivalent number of single-block-reads to answer the query in the various ways it can.

10. Dominic Brooks - October 26, 2012

It seems to me that much of the complexity is caused by two (incompatible) features – bind variable peeking and histograms – and the constant ebb and flow of enhancements to deal with that.

Even in 11gR2 the auto stats job will occasionally come up with a histogram on a sequence-generated primary key… Why would I want such a thing?
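If it really bothers you, one way to stop the auto job doing that is a table-level preference – the names below are just for illustration:

begin
  dbms_stats.set_table_prefs(
    ownname => 'APP_OWNER',   -- illustrative schema
    tabname => 'ORDERS',      -- illustrative table
    pname   => 'METHOD_OPT',
    pvalue  => 'FOR ALL COLUMNS SIZE AUTO FOR COLUMNS ORDER_ID SIZE 1');
end;
/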

And bind variable peeking… whoever thought that was a good idea? Especially as default behaviour?

Once upon a time the world was simple.
If you wanted reusable SQL and estimates/plans suitable for all supplied values, you used bind variables.
If you wanted specific estimates/plans for specific values, you used literals.
Simples.

But BVP comes along and now all your carefully crafted, deliberate shareable SQL behaves as if you’ve passed it literals.
And then you need ACS to sort it all out on the nth execution, etc.

And frankly, isn’t this all just bollocks? πŸ™‚

So all our environments start looking like:
– bind variable peeking off
– auto stats job off
– adaptive cursor sharing off
– cardinality feedback off
– any new feature off
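In 11gR2 terms that tends to translate into something like the following – most of these are undocumented underscore parameters, so treat it as a sketch and check with Oracle Support before touching any of them:

-- Bind variable peeking off (undocumented parameter)
alter system set "_optim_peek_user_binds" = false;

-- Cardinality feedback off (undocumented parameter)
alter system set "_optimizer_use_feedback" = false;

-- Adaptive cursor sharing off (undocumented parameter)
alter system set "_optimizer_adaptive_cursor_sharing" = false;

-- The automatic optimizer stats collection job off
begin
  dbms_auto_task_admin.disable(
    client_name => 'auto optimizer stats collection',
    operation   => null,
    window_name => null);
end;
/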

And have you looked at the sheer number of wrong results bugs in Metalink? It’s a horrific testament to this complexity.
It’s as if Oracle cannot be trusted to give you the right data – the bread and butter of what it’s there to do.

mwidlake - October 27, 2012

You are an Angry Young {young?} Man Dom.

As you state though, at least Oracle let you turn off anything that you decide does not help your system. But you need to know about the feature and you need to know how to turn it off and you have to decide whether to turn it back on again for any specific statements. Did I mention testing to see if it helps your system or not?

Noons - October 29, 2012

That’s all very well, Martin. But I can’t help be in agreement with Dom.
Thing is: these changes (and changed defaults) are sprung on us, without any warning. It may come as a surprise to Oracle but not everyone is called Jonathan or Chris and spends all their time investigating the inner workings of the very last detail of a query.
We simply cannot afford to run databases in that fashion, when there are dozens or even hundreds of the beasts around!
Another one that still rattles my cage to this day is the change to the optimizer that assumed a high value of a column (like a date set to 31/12/4000) implied a linear population up to that value! There is simply no reason whatsoever an optimizer should EVER assume something like that!
Assumptions about value ranges are completely wrong in the context of a CBO. Of course the problem is easily solved by using histograms on that column! But it shouldn’t have been there in the first place: a CBO must NOT make ANY assumptions about data distribution unless we ask it to via histograms, end of story, period!
If said distribution becomes important for good performance then make a histogram of it, not an assumption!
(off the soap box and back to checking table growth…)

mwidlake - October 29, 2012

I understand your point, Noons, about changes just coming in and us not all having the spare time to learn everything – but Oracle can’t ask all its customers what they should add in. If they did not add new functionality the product would fall behind (and you CAN opt not to upgrade – to a point πŸ™‚ )

However, on the issue of learning what is new I am not so sure. I admit, I do struggle to keep up and that is where people called Chris, Jonathan, and I hope to a little degree Martin, help. We read their stuff. What gets my goat is when a new feature comes in and it is either never mentioned or is not explained in the documentation. Grrr. And of course it leads back to my original point. I can keep up as I already know a lot. How does someone new to Oracle get to that point?

I’m curious – what would you do about the CBO’s assumption (or not) about data distribution? The initial situation was that the CBO “saw” the data as linearly spread, now it also allows Histograms. If you are not going to make an assumption, are you going to use a fixed percentage or what? The CBO has to do something to decide on a cardinality for column comparison so that calculations can be made.
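To make that linear assumption concrete with a made-up example: say a date column has a low value of 01/01/2000, the real data is clustered in 2010–2012, but a few “special” rows are set to 31/12/4000. A predicate like date_col between 01/01/2012 and 31/12/2012 then gets a linear estimate of roughly (2013 – 2012) / (4000 – 2000), i.e. about 0.05% of the rows, when in reality it might be most of the table – and out pops a Nested Loop plan.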

Dom Brooks - October 29, 2012

I’m not angry… you wouldn’t like it when I’m angry.

Interestingly, in relation to the subsequent comments from yourself and Noons about assumptions of linear distribution, my comment was partly motivated by being wrong (so maybe I was a bit cross but only with myself) in a conversation with a fellow professional down the pub about out-of-range predicates.

He had a problem with statements using bind variables where the initial peeked bind was out-of-range and therefore got a selectivity estimate that was not suitable for general reuse of the execution plan.
The implemented solution – turn off bind variable peeking.

I think this is wrong – not the solution necessarily but the default behaviour.

More than 90% of the time, I think I do want the optimizer to use 1/num_distinct, i.e. density, because I have supplied a bind variable and I expect the execution plan to be suitable for all values supplied for that parameter.
Otherwise I’ll use a literal, etc, etc, histograms, etc.
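To make that concrete with made-up numbers: 1,000,000 rows and 200 distinct values in the column, with no histogram, gives an equality-on-a-bind estimate of 1,000,000 × 1/200 = 5,000 rows – the same estimate, and so the same plan, whatever value actually turns up, which is precisely the point of binding in the first place.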

On the other hand, a feature that would be interesting (but only occasionally useful) would be a hint that was the complete opposite of cursor_sharing = force. i.e. it says “just ignore the bind variables and treat the statement as if literals were used, peek all the binds, don’t reuse it for other bind values, don’t do the adaptive sharing thing” – much less subtle than ACS and much more of a sledgehammer but actually much simpler. Or similarly, even a hint that allowed you, in a single statement to say treat this bind as a literal but don’t peek that one.

Is it not just a bit weird that after all the years of banging on about the importance of bind variables that so many problems are caused by these built-in, default features peeking and trying to adapt to the specific values of these binds?

Features that should be “the exception” are supplied as “the rule”.

But the bottom line is that I don’t have an issue at all with the complexity or the multitude of changes every release – after all, being aware to varying degrees of these specific features and complexities is how many of us a) earn our living and b) get our kicks. I love it – I wouldn’t have it any other way.

jgarry - October 29, 2012

Testing to see if it helps your system? That has become so difficult and expensive as to be a worthless suggestion.

Histograms? Defaults? What the customer would want as defaults varies so much it can’t be predicted by Oracle, so any that are there are arbitrary at best. In the case of histograms, how many people have said the default stats gathering is wrong for the general case? Dom’s point that there have been conflicting requirements is highly relevant.

I think the situation of all data being added and updated beyond the high point of the statistics is so common it makes one wonder why that isn’t a default. Having to play games with the histograms to fix such things is silly in my opinion. Having to avoid special values to keep from confusing the optimizer is back assward, it should serve us, not the other way round.

I was going to say something about “too complex” depending on whether you are talking about the point of view of users (us) or of the CBO, but I’ve forgotten exactly what. I had gone looking back 10-15 years at what people were saying about it, but found too many interesting things and not whatever I had been looking for. However, it’s still a reasonable thing for people to want a simple set of rules for a set of simple data access patterns, especially ones long in use.

Too bad the “sufficiently advanced” comment wouldn’t fit on a bumper sticker! πŸ˜€

11. Noons - October 29, 2012

I think Joel has it in a nutshell: “sufficiently advanced”. Really like that one – I might get a wider bumper!
Not sure about Oracle asking customers on new features to add. AFAIK, it never did – it’s always been what the marketing folks think is the “new black” that gets added!
That’s why we got that monstrosity called OO in Rel 8. Then later on native Java and XML.
Obviously – there are sooooooo many customers out there coding Java *inside* the db to access SQL! (facepalm)…

Look, I’m all for progress and new features. But for once ASK the correct people what those should be: the DBAs, the SQL and PL/SQL coders and designers. NOT the “kewl” brigade – they couldn’t care less!
No, *new* is not a synonym for *good*, last time I looked in the dictionary.

As for the data distribution: the CBO should either do nothing – carry on with *no* assumptions about linear distribution – or (*if* the DBA or designer ask it to) then use histograms. The killer here is the “assumption”. Not the result.
The number of apps and dbs I’ve seen with broken performance because the CBO assumed the wrong thing when looking at “special values” is staggering. And quite frankly, the old “fix the app” nonsense just does not gel at all.
I’m all for facilitating the life of those who have to run applications on dbs. Not hindering it. As such, I fail to see the value of blaming the dba for an app that he/she is forced to run and over the design of which they can’t possibly have a say…

