jump to navigation

If you Really Can’t Solve a “Simple” Problem.. March 11, 2011

Posted by mwidlake in Friday Philosophy, Perceptions, Testing.
Tags: , ,
trackback

Sometimes it can be very hard to solve what looks like a simple problem. Here I am going to cover a method that I almost guarantee will help you in such situations.

I recently had a performance issue with an Oracle database that had just gone live. This database is designed to scale to a few billion rows in two key tables, plus some “small” lookup tables of a few dozen to a couple of million rows. Designing a system of this scale with theory only is very dangerous, you need to test at something like the expected volumes. I was lucky, I was on a project where they were willing to put the effort and resource in and we did indeed create a test system with a few billion rows. Data structure and patterns were created to match the expected system, code was tested and we found issues. Root causes were identified, the code was altered and tested, fine work was done. Pleasingly soon the test system worked to SLAs and confidence was high. We had done this all the right way.

We went live. We ramped up the system to a million records. Performance was awful. Eyes swung my way… This was going to be easy, it would be the statistics, the database was 2 days old and I’d warned the client we would need to manage the object statistics. Stats were gathered.
The problem remained. Ohhh dear, that was not expected. Eyes stayed fixed upon me.

I looked at the plan and I quickly spotted what I knew was the problem. The below code is from the test system and line 15 is the key one, it is an index range scan on the primary key, within a nested loop:

   9 |          NESTED LOOPS                       |                           |     1 |   139 |    37   (3)| 00:00:01 |       
* 10 |           HASH JOIN SEMI                    |                           |     1 |    50 |    11  (10)| 00:00:01 |       
* 11 |            TABLE ACCESS BY INDEX ROWID      | PARTY_ABCDEFGHIJ          |     3 |   144 |     4   (0)| 00:00
* 12 |             INDEX RANGE SCAN                | PA_PK                     |     3 |       |     3   (0)| 00:00:01 |       
  13 |            COLLECTION ITERATOR PICKLER FETCH|                           |       |       |            |          |       
  14 |           PARTITION RANGE ITERATOR          |                           |    77 |  6853 |    26   (0)| 00:00:01 | 
* 15 |            INDEX RANGE SCAN                 | EVEN_PK                   |    77 |  6853 |    26   (0)| 00:00:01 | 

On the live system we had an index fast full scan (To be clear, the below is from when I had tried a few things already to fix the problem, but that index_fast_full_scan was the thing I was trying to avoid. Oh and, yes, the index has a different name).

|   9 |          NESTED LOOPS                 |                           |     1 |   125 |  1828   (3)| 00:00:16 |
|  10 |           NESTED LOOPS                |                           |     1 |    63 |     2   (0)| 00:00:01 |
|* 11 |            TABLE ACCESS BY INDEX ROWID| PARTY_ABCDEFGHIJ          |     1 |    45 |     2   (0)| 00:00:01 |
|* 12 |             INDEX UNIQUE SCAN         | PA_PK                     |     1 |       |     1   (0)| 00:00:01 |
|* 13 |            INDEX UNIQUE SCAN          | AGR_PK                    |     1 |    18 |     0   (0)| 00:00:01 |
|  14 |           PARTITION RANGE ITERATOR    |                           |     1 |    62 |  1826   (3)| 00:00:16 |
|* 15 |            INDEX FAST FULL SCAN       | EVE_PK                    |     1 |    62 |  1826   (3)| 00:00:16 |

Now I knew that Oracle would maybe pick that plan if it could get the data it wanted from the index and it felt that the cost was lower than doing multiple range scans. Many reasons could lead to that and I could fix them. This would not take long.

But I could not force the step I wanted. I could not get a set of hints that would force it. I could not get the stats gathered in a way that forced the nested loop range scan. I managed to alter the plan in many ways, fix the order of tables, the types of joins, but kept getting to the point where the access was via the index fast full scan but not by range scan. I thought I had it cracked when I came across a hint I had not known about before, namely the INDEX_RS_ASC {and INDEX_RS_DESC} hint to state do an ascending range scan. Nope, no joy.

By now, 8 hours had passed trying several things and we had a few other people looking at the problem, including Oracle Corp. Oracle Corp came up with a good idea – if the code on test runs fine, copy the stats over. Not as simple as it should be as the test system was not quite as-live but we did that. You guessed, it did not work.

So what now? I knew it was a simple problem but I could not fix it. So I tried a technique I knew had worked before. I’d long passed the point where I was concerned about my pride – I emailed friends and contacts and begged help.

Now, that is not the method of solving problems I am writing about – but it is a damned fine method and I have used it several times. I highly recommend it but only after you have put a lot of effort into doing your own work, if you are willing to give proper details of what you are trying to do – and, utterly crucially, if you are willing to put yourself out and help those you just asked for help on another day.

So, what is the silver bullet method? Well, it is what the person who mailed me back did and which I try to do myself – but struggle with.

Ask yourself, what are the most basic and fundamental things that could be wrong. What is so obvious you completely missed it? You’ve tried complex, you’ve been at this for ages, you are missing something. Sometimes it is that you are on the wrong system or you are changing code that is not the code being executed {I’ve done that a few times over the last 20 years}.

In this case, it was this:

Here is my primary key:

EVEN_PK EABCDE 1 AGR_EXT_SYS_ID
EVEN_PK EABCDE 2 EXT_PRD_HLD_ID_TX
EVEN_PK EABCDE 3 AAAMND_DT
EVEN_PK EABCDE 4 EVT_EXT_SYS_ID
EVEN_PK EABCDE 5 EABCDE_ID

Except, here is what it is on Live

EVE_PK EABCDE 1 EVT_EXT_SYS_ID
EVE_PK EABCDE 2 EABCDE
EVE_PK EABCDE 3 AGR_EXT_SYS_ID
EVE_PK EABCDE 4 EXT_PRD_HLD_ID_TX
EVE_PK EABCDE 5 AAAMND_DT

Ignore the difference in name, that was an artifact of the test environment creation, the key thing is the primary key has a different column order. The DBAs had implemented the table wrong {I’m not blaming them, sometimes stuff just happens OK?}.
Now, it did not alter logical functionality as the Primary Key is on the same columns, but as the access to the table is on only the “leading” three columns of the primary key, if the columns are indexed in the wrong order then Oracle cannot access the index via range scans on those values! Unit testing on the obligatory 6 records had worked fine, but any volume of data revealed the issue.

I could not force my access plan as it was not possible – I had missed the screaming obvious.

So, next time you just “know” you should be able to get your database (or code, or whatever) to do something and it won’t do it, go have a cup of tea, think about your last holiday for 5 minutes and then go back to the desk and ask yourself – did I check that the most fundamental and obvious things are correct.

That is what I think is the key to solving what look like simple problems where you just can’t work it out. Try and think even simpler.

About these ads

Comments»

1. Tim Hall - March 11, 2011

Forgive me for trying to teach my grandmother to suck eggs, but you could do with using “SET TABS OFF” before doing a copy/paste from SQL*Plus. It makes the output a lot easier to read. :)

Cheers

Tim…

mwidlake - March 11, 2011

How exactly does one suck an egg? :-)

Cheers Tim

Tim Hall - March 11, 2011

I suspect the improvements in dental care of the last few decades and left the number of grandmothers possessing no teeth at an all time low, so maybe there is a pressing need for someone to teach grandmothers to suck eggs these days… :)

2. Tim Hall - March 11, 2011

And as if by magic, I can read it properly. I guess you noticed and corrected it as I was typing my comment. :)

mwidlake - March 11, 2011

Yeah, sorry about that Tim. It was not the fault of SQL*Plus, it was the fault of WordPress. Due to my inability to spell, I try and remember to do a spellcheck on my stuff just before I publish. Only, the spellcheck on wordpress also has a habit of chewing up sections of code and spitting them out all garbled. Thankfully I am so old-hat that I often pre-edit the whole thing in some sort of from-the-ark editor like notepad and so had the original to re-paste.

3. Tim Hall - March 11, 2011

Don’t you just love it when tools are so intelligent they actually make your job harder. :)

4. joel garry - March 12, 2011

How to suck eggs.

Somehow that reminds me of many blogs (not yours or Tim’s, of course).

mwidlake - March 12, 2011

Nice one Joel :-). I have to confess though, I came unstuck at point 5 – I had failed to position myself in front of a desk and the egg fell on the floor…

Deciding at what level to aim your blog is a bit tricky, actually. Part of me wants to explain the most esoteric and tricky things I have got into my skull, to show how much I “know” and to appeal to fellow teckies. Another part of me worries about leaving out details that the uninitiated might need…

joel garry - March 18, 2011

That is tricky, I think the best answer is to write for yourself, what you would like to see, maybe put some links in for the newbies. Tim gets it right for what he does. The thing about esoteric and tricky on a blog is, someone eventually is likely to run into it and think you are great. A problem with blogs is they have an artificial expiration date, great blogging shouldn’t. Even weird little obscure things can be great.

5. Oracle Musings » It doesn’t make a difference… - March 12, 2011

[…] thoroughly enjoyed Martin Widlake’s post about a “simple” problem – and how you should never take anything for granted when trying to […]

6. Tony Sleight - March 14, 2011

This is another good post. It’s very easy when the pressure is on to go deeper into complex solutions, and hard to take a step back and look at a problem starting with the basics. When all eyes are on you, it takes a brave person to say, ‘I’ll just have a coffee, and a think, I’ll be back in 15 minutes!’.

7. Bernard Polarski - March 17, 2011

Nice article, thanks for sharing.

Meantime, I am trying to suck an egg but I still do have all my teeth so what can id do? Please help me! In despair, I swallowed WordPress manual but it did not help.


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

Join 171 other followers

%d bloggers like this: