Friday Philosophy – Explaining How Performance Tuning Is Not Magic? March 9, 2018
Posted by mwidlake in Friday Philosophy, performance, SQL. Tags: cost, explain plan, perception, performance, SQL
Solving performance issues is not magic. Oh, I’m not saying it is always easy and I am not saying that you do not need both a lot of knowledge and also creativity. But it is not a dark art, at least not on Oracle systems where we have a wealth of tools and instrumentation to help us. But it can feel like that, especially when you lack experience of systematically solving performance issues.
I recently spent a couple of days with a client discussing how you solve performance issues and I am preparing a talk on exactly that for the up-coming OUG Ireland conference on the 22nd-23rd March. The guys I was talking to at the client are very, very capable production DBAs, but traditionally the developers at their site initially look at any performance issues. If they think it is the database but not something they can fix, they throw it over the fence at the DBAs. It’s not always the database of course and, if it was a simple fix (missing index, obviously inefficient query), then the developers fixed it. So these two guys are expected to only solve the more difficult issues. That’s not really fair as, if you are not practising on the simple problems, how are you supposed to gain the experience and confidence to solve the harder ones?
Anyway, a part of the discussion was about Explain Plans. What does the COST mean in the plan, they asked? They saw it as some number that in an undefined way gave an indication of how expensive the step in the plan was, but they could not link it back to anything solid that made sense to them. It looked like a magically produced number that was sometimes very, very wrong. Like most (good) technical people, they want to know the details of things and how they work; they don’t want to simply accept something as working.
So I took them through some simple examples of plans and of how the COST is just a result of simple maths estimating the number of IOs needed to satisfy the step.
I won’t go into the full details here, but have a look at the example below; this is how I started:
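Something along these lines (the index name, the alias and the exact query shape are made up for illustration; the 20 rows and the costs of 3 and 23 are the figures we were looking at):

-- (table, column and index names here are illustrative)
select count(pers.surname)
from   person pers
where  pers.dob = to_date('08-OCT-1934','DD-MON-YYYY');

-----------------------------------------------------------------------
| Id  | Operation                     | Name         | Rows  | Cost  |
-----------------------------------------------------------------------
|   0 | SELECT STATEMENT              |              |     1 |    23 |
|   1 |  SORT AGGREGATE               |              |     1 |       |
|   2 |   TABLE ACCESS BY INDEX ROWID | PERSON       |    20 |    23 |
|*  3 |    INDEX RANGE SCAN           | PERS_DOB_IDX |    20 |     3 |
-----------------------------------------------------------------------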
I explained how you read “down” the slope of the plan to the end (so step 3) and then worked back up the slope. So the first thing Oracle does is the index range scan. I showed them the BLEVEL of the index, the number of blocks per indexed value and why Oracle knew it would, on average, need 3 IOs to get the leaf block entries for the provided DOB = to_date('08-OCT-1934','DD-MON-YYYY'). Each DOB matched, on average, 20 rows. So the cost of step 3 was passed up to step 2 of accessing the table rows. This would be done 20 times, so the cost was 20+3 = 23.
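Where do those figures come from? They are sitting in the data dictionary, so you can look at them yourself. A quick sketch, assuming the same (made-up) table and index names as above:

-- the index statistics the optimizer uses for the range scan cost
-- (roughly, the range scan cost is BLEVEL plus the leaf blocks visited for the key)
select index_name, blevel, leaf_blocks,
       avg_leaf_blocks_per_key, avg_data_blocks_per_key, clustering_factor
from   user_indexes
where  index_name = 'PERS_DOB_IDX';

-- and the column statistics behind the expected 20 rows per DOB value
select num_distinct, density, num_nulls, histogram
from   user_tab_col_statistics
where  table_name  = 'PERSON'
and    column_name = 'DOB';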
OK, they could accept that, it made sense. So let’s extend it…
I took the original query against PERSON for a given DOB and now joined it to a second table PERSON_NAME. Why is not important; it’s just a demonstration of a simple table join:
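Again, something like this (the join column and the index names are made up; the rows and costs are the ones discussed below):

-- (table, column and index names here are illustrative)
select count(pena.surname)
from   person      pers
,      person_name pena
where  pers.dob     = to_date('08-OCT-1934','DD-MON-YYYY')
and    pena.pers_id = pers.pers_id;

--------------------------------------------------------------------------
| Id  | Operation                      | Name            | Rows  | Cost  |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT               |                 |     1 |   163 |
|   1 |  SORT AGGREGATE                |                 |     1 |       |
|   2 |   NESTED LOOPS                 |                 |   100 |   163 |
|   3 |    NESTED LOOPS                |                 |   100 |   163 |
|   4 |     TABLE ACCESS BY INDEX ROWID| PERSON          |    20 |    23 |
|*  5 |      INDEX RANGE SCAN          | PERS_DOB_IDX    |    20 |     3 |
|*  6 |     INDEX RANGE SCAN           | PENA_PERS_ID_IX |     5 |     2 |
|   7 |    TABLE ACCESS BY INDEX ROWID | PERSON_NAME     |     5 |     7 |
--------------------------------------------------------------------------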
Now I explained that, as you work “back up the slope of the plan” from the first, most indented step (so from step 5 to 4 to 3), at step 3 there is a nested loop joining the rows passed up from step 4 to the step in line below it, i.e. step 6. They had already seen steps 5 and 4 in our first example: Oracle is expecting to get 20 rows for a cost of 23. Look at line 4. And for each of those 20 rows, it will do a range scan of the index in step 6 and, for each row it finds in the index, collect rows from the joined table in step 7.
So for each of the 20 original rows it does a scan of an index on the joined table for a cost of 2 (I showed them the stats and how this cost is calculated) and expects on average to find 5 matching rows, so it needs to do 5 IOs to the PERSON_NAME table to get those rows. Add that together and that cost of 7 is done 20 times. 7*20 is 140, plus the 23 from the original scan of the PERSON table, so the whole COST is… 163.
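Written out, the sums behind that plan are simply (these are just the figures from the example above):

   cost of getting the 20 PERSON rows via the DOB index     =  23
   per PERSON row: range scan of the PERSON_NAME index      =   2
                   fetch the 5 matching PERSON_NAME rows    =   5
   per PERSON row total                                     =   7
   for all 20 PERSON rows: 20 * 7                           = 140
   whole statement: 23 + 140                                = 163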
Light bulbs came on and they got it! My job here is done.
But it was not. I then went on to explain how it is now hard to get such a simple example. This one is, I think, from an early version of Oracle 11. I told them how histograms on a column will make the estimated cardinality (number of records per given value for a column) more accurate, but harder to work out. I showed them how the cost of a unique index scan is reduced by 1. I explained how Oracle was blind to the correlation of column values unless you took steps to tell the optimiser about it (you know, how for a given value of car manufacturer there will be only a few values of car model, the two values are related)…
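How do you tell it? From 11g onwards you can create a column group on the two columns so the optimizer gathers statistics on the combination and stops multiplying the two selectivities together blindly. A quick sketch, using a made-up CARS table:

-- create a column group so MANUFACTURER and MODEL are treated as correlated
-- (CARS and its columns are made-up names, just for illustration)
select dbms_stats.create_extended_stats(
         ownname   => user,
         tabname   => 'CARS',
         extension => '(MANUFACTURER, MODEL)')
from   dual;

-- gather stats so the new column group gets statistics of its own
exec dbms_stats.gather_table_stats(user, 'CARS', method_opt => 'for all columns size auto')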
Worry was creeping back into their faces. “So it is simple mathematics – but the rules are complex? It’s complex simple mathematics?!?” Err, yes. And with 11 and 12 Oracle will use various methods to spot when the simple, complex mathematics does not match reality and will swap plans as a result…
I think I lost them at that point. Especially when they asked about the SQL Profiles and how they modified Costs… Baselines controlling how plans are used… Bind variables…
That is a real problem with Oracle Performance tuning now. Even something as simple as COST is based on a lot of rules, factors and clever tricks. And they are not the only things controlling which plan is executed anymore.
So I took a step back.
I told them to accept that the COST is really, honestly based on maths, and the expected number of ROWS is too. But the actual, specific values could be hard to totally justify. And it is when the estimated COST and (possibly more importantly) the estimated ROWS go wrong that you have problems. So look out for ROWS of 1 (or at least very low) in a plan for a statement that takes more than a few milliseconds. And for very, very large COSTS/ROWS in the millions or more. And what really helps is if you get the ACTUAL ROWS as opposed to the ESTIMATED ROWS. Where there is a significant difference, concentrate your focus there. Of course, getting the ACTUAL ROWS is not always easy and is for a later (and properly technical) post.
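Just as a pointer for now, one common way to get them is the GATHER_PLAN_STATISTICS hint plus DBMS_XPLAN.DISPLAY_CURSOR, along these lines (using the illustrative join from earlier):

select /*+ gather_plan_statistics */ count(pena.surname)
from   person      pers
,      person_name pena
where  pers.dob     = to_date('08-OCT-1934','DD-MON-YYYY')
and    pena.pers_id = pers.pers_id;

-- then, in the same session, compare estimated (E-Rows) with actual (A-Rows)
select * from table(dbms_xplan.display_cursor(format => 'ALLSTATS LAST'));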
So, they asked, what if they could not get the actual ROWS and there were no 1s or millions in the plan ROWS/COSTS? How did they know where to concentrate? “Well, you just get a feel for it… do the costs feel reasonable?…”
Damn – I’m not sure I really delivered on my aim of proving Performance Tuning is science and not magic.
Any pointers anyone?
(Update – Vbarun made a comment that made me realise I had messed up the end of this post: I was talking about estimated ROWS but still had the word COST there. Now fixed. And the other thing I was hoping someone was going to suggest as a pointer was to split the query down to individual tables & two-table joins and *check* how many rows you get back with the where predicates being used. It takes some time but it shows you where the estimates are going wrong.)
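By way of illustration, those checks are just simple counts using the same predicates, for example (against the made-up demo tables from above):

-- does the single-table estimate of 20 rows match reality?
select count(*)
from   person pers
where  pers.dob = to_date('08-OCT-1934','DD-MON-YYYY');

-- and does the two-table join really return roughly the estimated number of rows?
select count(*)
from   person      pers
,      person_name pena
where  pers.dob     = to_date('08-OCT-1934','DD-MON-YYYY')
and    pena.pers_id = pers.pers_id;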