
I am Neo off the Matrix (apparently) March 30, 2011

Posted by mwidlake in AWR, performance.
8 comments

I know I have mentioned it before, but I am a big fan of the OEM performance screens that are derived from the ASH/AWR information. One of the things I really like about it is the immediate information it gives you, in one glance, that things are “not normal”. Once you notice that things are not normal you can then, within a few seconds, get a feel for what is still probably OK and where you have something that has changed.

As an example of the immediate information, I recently came back to my desk and glanced at my OEM performance screen. It was showing the below:

This will not be an interesting picture to you, but it tells me a lot about my system

“data load has just run” I said to my comrade-in-arms. “which one?” he asked. “The Delta – it ran the quick plan. But it started a bit late, 12:15. Oh, and it looks like the transaction view code has swapped back to the full table scan plan and the summary code is not playing up at the moment.”

“you’re turning into Neo you are – can you see a lady in a red dress???” he asked.

That was of course a reference to the “Matrix” films, where at times you see the virtual world displayed as streams of characters running down a screen – but once you get used to it you can apparently “see” what is going on.

The screen shot above is not actually a very good example of what the performance screens can show you. One of my minor complaints about them is that the graph scales to the greater of the largest peak of activity or the number of sessions matching the number of CPUs (real or virtual) available to you. So if you have more CPU available than you need, you can’t see much detail in the graph. And if you have had a nasty peak of activity, again, all detail is squeezed out. In my case, the box is sized to cope with the load expected in 12 months’ time and the system is new, so activity is scuttling along the bottom of the graph.

However, “poor” though the example is, it told me at a glance what was going on across my system: something about the major tasks we are running, that one known problem is currently occurring, and that several of the other issues I need to keep an eye out for are not.

That is why I love these screens – I recognise “my” activity patterns from the graph and I now recognise the SQL IDs for my key statements. If I see a pattern in the graph I don’t recognise, I need to check things out immediately. Three or four times over the last 2 weeks I have spotted an issue, started investigating and found the cause before the Operations desk had even noticed a problem.

Oh, and what is SQL type 189? It is a merge statement. Our implementation of OEM is a little old and does not correctly interpret that SQL command type. It might be a little old, but it is still very useful.
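If you want to translate a command type number yourself and your database is new enough (11.2 onwards, I believe), V$SQLCOMMAND will do it for you. A minimal sketch:

-- Translate a numeric SQL command type into its name.
-- Relies on V$SQLCOMMAND, which I believe appeared in 11.2.
SELECT command_type, command_name
FROM   v$sqlcommand
WHERE  command_type = 189;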

Telling the Truth in IT March 17, 2011

Posted by mwidlake in Meeting notes.
8 comments

I’ve been doing presentations for many years, mostly on Oracle Technology, occasionally on management topics. However, my favourite presentation to give is one about when things go wrong.

The title is usually something like “Surviving Survivable Disasters” or “5 ways to Seriously Screw up a Project” and, though the specific examples and flow may vary, the general content is the same. I talk about IT situations that have gone wrong or things that strike me as daft/silly/mindless in IT. My aim is to be entertaining and have a laugh at the situations but I also want to explore what causes disasters and how we might go about avoiding at least some of them.

When doing the presentation I have a couple of ground rules:

  • I must have witnessed the situation myself or know personally, and trust, the individual who is my source.
  • I do not name organisations or individuals unless I am specifically given permission {by individuals that is, organisations never get named. Except one}.
  • I try to resist the temptation to embellish. It’s not hard to resist; a good disaster usually stands on its own merits.

It’s a great talk for introducing some light relief into a series of very technical presentations or for opening up a day of talks, to get people relaxed. It’s also the only talk I get seriously nervous about doing – if you are aiming to be entertaining and you miss, you stand to die on stage. The first time I did the talk I was physically sweating. However, it went down a storm. I did it 4 or 5 more times over as many years and it always went down well.

However, about 4 years ago I did the presentation just as I was about to go back to being self employed. After the talk a very good friend came over and said something like “Really entertaining talk but…maybe you should tone it down? A lot. Potential employers are going to take a dim view of you doing this, they will worry they will appear in the next talk”. I protested that I never mention companies or people and, surely, all organisations are able to admit that things go wrong and it is to everyone’s benefit if we all learn from them? My friend was adamant that though companies want to benefit from other people’s disasters, they never, ever want to in any way be the source of that benefit. He was sure it would be very damaging to my potential career. Hmmmm…. I could see his point.

I was already scheduled to do the talk again in a couple of months and I took heed of his advice for it. I toned down the material, I removed some of the best stories and I added several disclaimers. I also died on stage. It went from an amusing 45 minutes to a preachy and stodgy affair.

I have not done it since.

The question is, should I have pulled back from doing that talk? Is it really going to harm my potential employability? (After all, no work has ever come my way from presenting). Why can’t we be honest that issues occur and that learning from them is far more valuable than covering them up? After all, do we believe a person who claims never to have made mistakes?

What prompted this post is that I have been asked to do the talk again – and I have agreed to do so. I’ll be doing it next week, with the title “5 ways to advance your career through IT Disasters”, for the UK Oracle User Group Back to Basics event. This is a day of introductory talks for people who are fairly new to Oracle, the brain-child of Lisa Dobson. Lisa realised a few years ago that there were not enough intro-type presentations; most technical talks are by experts for fellow experts {or, at least, people wanting to become experts}.

I’m very happy to support helping those who are new to Oracle and I think it is important that people who are new to IT are exposed to what can go wrong – and any advice that might help them avoid it. After all, it’s better they learn from our mistakes than just repeat them again. OK, they’ll just repeat them again anyway, but they might spot that they are doing so just in time :-)

Is this a good idea? What the hell, I want more free time to do things like this blog – and get on top of the garden.

There is life in the old dog yet… March 14, 2011

Posted by mwidlake in Blogging.
11 comments

My Blog has been unforgivably quiet of late. All I can say in my defence is “work”. I’ve blogged about this before, and mentioned it in presentations, but most of us are too tied up in the day job to test things properly, let alone blog or in other ways present on how things work. We fix the problem in front of us and move on to the next “critical, absolute priority 1A plus, must-be-fixed-today” issue. So like most of you, that has been my life for…ohh, months.
{It makes me even more thankful to those who continually find the time to test properly and blog about what they have found, on top of their “earning” work}

However, my working life is looking like it will return to a more reasonable balance soon.

So, this blog post is a statement of intent that I WILL be blogging again in the next week or so. OK, so this post is not of any use to anyone reading, but it means I now feel morally obliged to follow up on the statement (and that is actually the intent of this blog post).

I’m going to blog on two technical areas.

One is gathering stats, which I am utterly sick and tired of doing in my working life. So I figure if I tell anyone who will listen all I know about gathering system, fixed object, dictionary and object stats, and give some hints as to what I have come to think of as good methods and techniques for doing so, I might not have to think about it any more. I can tell people to read my blog and not hire me to do it for them. Maybe a career-limiting move, but I was never that bright :-)

The second topic will be Index Organised Tables (IOTs). They are great. Ohhh, they have drawbacks and concerns, after all no tuning trick is For Free, and I know one person (who I shall refer to only as Dave) whose career was almost destroyed by an Oracle 9 to 10 “feature” on IOTs involving corruption. But IOTs are {in my opinion} a vastly under-used feature of Oracle and could be useful to many Oracle sites. If anyone wants help with them, I’ll let you hire me for that and I will come and help gladly. So long as no bl00dy stats are involved :-)

So, having drawn my own line in the sand to do some Technical Blogs (I actually have enough waffle-based Friday Philosophy topics to last 2 years but have promised myself to balance them with decent technical posts) I better go and write them.

If you Really Can’t Solve a “Simple” Problem.. March 11, 2011

Posted by mwidlake in Friday Philosophy, Perceptions, Testing.
12 comments

Sometimes it can be very hard to solve what looks like a simple problem. Here I am going to cover a method that I almost guarantee will help you in such situations.

I recently had a performance issue with an Oracle database that had just gone live. This database is designed to scale to a few billion rows in two key tables, plus some “small” lookup tables of a few dozen to a couple of million rows. Designing a system of this scale on theory alone is very dangerous; you need to test at something like the expected volumes. I was lucky, I was on a project where they were willing to put the effort and resource in and we did indeed create a test system with a few billion rows. Data structures and data patterns were created to match the expected system, code was tested and we found issues. Root causes were identified, the code was altered and tested, fine work was done. Pleasingly soon the test system worked to SLAs and confidence was high. We had done this all the right way.

We went live. We ramped up the system to a million records. Performance was awful. Eyes swung my way… This was going to be easy; it would be the statistics – the database was 2 days old and I’d warned the client we would need to manage the object statistics. Stats were gathered.
The problem remained. Ohhh dear, that was not expected. Eyes stayed fixed upon me.
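{For what it is worth, the gather itself was nothing clever – a minimal sketch of the sort of call I mean, with a made-up schema name, is below.}

-- Gather table and index statistics for one of the key tables.
-- The schema name is made up for illustration.
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname          => 'APP_OWNER',
    tabname          => 'EABCDE',
    estimate_percent => DBMS_STATS.AUTO_SAMPLE_SIZE,
    cascade          => TRUE);   -- gather the index statistics too
END;
/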

I looked at the plan and I quickly spotted what I knew was the problem. The plan below is from the test system and line 15 is the key one: an index range scan on the primary key, within a nested loop:

|   9 |          NESTED LOOPS                       |                           |     1 |   139 |    37   (3)| 00:00:01 |
|* 10 |           HASH JOIN SEMI                    |                           |     1 |    50 |    11  (10)| 00:00:01 |
|* 11 |            TABLE ACCESS BY INDEX ROWID      | PARTY_ABCDEFGHIJ          |     3 |   144 |     4   (0)| 00:00:01 |
|* 12 |             INDEX RANGE SCAN                | PA_PK                     |     3 |       |     3   (0)| 00:00:01 |
|  13 |            COLLECTION ITERATOR PICKLER FETCH|                           |       |       |            |          |
|  14 |           PARTITION RANGE ITERATOR          |                           |    77 |  6853 |    26   (0)| 00:00:01 |
|* 15 |            INDEX RANGE SCAN                 | EVEN_PK                   |    77 |  6853 |    26   (0)| 00:00:01 |

On the live system we had an index fast full scan. (To be clear, the plan below is from after I had already tried a few things to fix the problem, but that index fast full scan was the thing I was trying to avoid. Oh, and yes, the index has a different name.)

|   9 |          NESTED LOOPS                 |                           |     1 |   125 |  1828   (3)| 00:00:16 |
|  10 |           NESTED LOOPS                |                           |     1 |    63 |     2   (0)| 00:00:01 |
|* 11 |            TABLE ACCESS BY INDEX ROWID| PARTY_ABCDEFGHIJ          |     1 |    45 |     2   (0)| 00:00:01 |
|* 12 |             INDEX UNIQUE SCAN         | PA_PK                     |     1 |       |     1   (0)| 00:00:01 |
|* 13 |            INDEX UNIQUE SCAN          | AGR_PK                    |     1 |    18 |     0   (0)| 00:00:01 |
|  14 |           PARTITION RANGE ITERATOR    |                           |     1 |    62 |  1826   (3)| 00:00:16 |
|* 15 |            INDEX FAST FULL SCAN       | EVE_PK                    |     1 |    62 |  1826   (3)| 00:00:16 |

Now, I knew that Oracle might pick that plan if it could get all the data it wanted from the index and it felt the cost was lower than doing multiple range scans. Many reasons could lead to that and I could fix them. This would not take long.

But I could not force the step I wanted. I could not find a set of hints that would force it. I could not get the stats gathered in a way that forced the nested loop range scan. I managed to alter the plan in many ways, fixing the order of the tables and the types of joins, but I kept getting to the point where the access was via the index fast full scan and not the range scan. I thought I had it cracked when I came across a hint I had not known about before, namely the INDEX_RS_ASC {and INDEX_RS_DESC} hint, which asks for an ascending index range scan. Nope, no joy.
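For anyone who has not met that hint, it takes the same shape as the ordinary INDEX hint. A minimal sketch of the sort of thing I was trying – the alias, bind names and predicate list here are illustrative, not the real code:

-- Request an ascending index range scan on the primary key index.
-- Alias, bind variables and predicates are illustrative only.
SELECT /*+ INDEX_RS_ASC(ev EVEN_PK) */
       ev.*
FROM   eabcde ev
WHERE  ev.agr_ext_sys_id    = :b1
AND    ev.ext_prd_hld_id_tx = :b2
AND    ev.aaamnd_dt         = :b3;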

By now 8 hours had passed trying several things, and we had a few other people looking at the problem, including Oracle Corp, who came up with a good idea – if the code on test runs fine, copy the stats over. It was not as simple as it should have been, as the test system was not quite as-live, but we did it. You guessed it, it did not work.
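If you have not moved stats between databases before, the mechanics are roughly as below – a sketch only, with made-up schema and table names, using a statistics table to carry the stats across (the table itself travels by datapump, export or a database link):

-- On the test database: create a stats table and export the schema stats into it.
-- Schema and stats-table names are made up for illustration.
EXEC DBMS_STATS.CREATE_STAT_TABLE (ownname => 'APP_OWNER', stattab => 'STATS_TRANSFER');
EXEC DBMS_STATS.EXPORT_SCHEMA_STATS(ownname => 'APP_OWNER', stattab => 'STATS_TRANSFER');

-- On the live database, once the stats table has been copied over:
EXEC DBMS_STATS.IMPORT_SCHEMA_STATS(ownname => 'APP_OWNER', stattab => 'STATS_TRANSFER');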

So what now? I knew it was a simple problem but I could not fix it. So I tried a technique I knew had worked before. I’d long passed the point where I was concerned about my pride – I emailed friends and contacts and begged for help.

Now, that is not the method of solving problems I am writing about – but it is a damned fine method and I have used it several times. I highly recommend it but only after you have put a lot of effort into doing your own work, if you are willing to give proper details of what you are trying to do – and, utterly crucially, if you are willing to put yourself out and help those you just asked for help on another day.

So, what is the silver bullet method? Well, it is what the person who mailed me back did and which I try to do myself – but struggle with.

Ask yourself, what are the most basic and fundamental things that could be wrong. What is so obvious you completely missed it? You’ve tried complex, you’ve been at this for ages, you are missing something. Sometimes it is that you are on the wrong system or you are changing code that is not the code being executed {I’ve done that a few times over the last 20 years}.

In this case, it was this:

Here is my primary key:

EVEN_PK EABCDE 1 AGR_EXT_SYS_ID
EVEN_PK EABCDE 2 EXT_PRD_HLD_ID_TX
EVEN_PK EABCDE 3 AAAMND_DT
EVEN_PK EABCDE 4 EVT_EXT_SYS_ID
EVEN_PK EABCDE 5 EABCDE_ID

Except, here is what it is on live:

EVE_PK EABCDE 1 EVT_EXT_SYS_ID
EVE_PK EABCDE 2 EABCDE_ID
EVE_PK EABCDE 3 AGR_EXT_SYS_ID
EVE_PK EABCDE 4 EXT_PRD_HLD_ID_TX
EVE_PK EABCDE 5 AAAMND_DT

Ignore the difference in name – that was an artifact of how the test environment was created. The key thing is that the primary key has a different column order. The DBAs had implemented the table wrongly {I’m not blaming them, sometimes stuff just happens, OK?}.
Now, it did not alter logical functionality, as the primary key is on the same columns. But the access to the table is on only the “leading” three columns of the primary key, and if those columns are not the leading columns of the index then Oracle cannot access the index via range scans on those values! Unit testing on the obligatory 6 records had worked fine, but any volume of data revealed the issue.
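Painfully, a thirty-second check against the data dictionary would have shown the mismatch straight away. Something like the sketch below, run on both test and live and compared, is all it takes (in real life you would filter on the owner as well):

-- List the columns of each index on the table, in column order,
-- so the definitions on test and live can be compared.
SELECT index_name, column_position, column_name
FROM   dba_ind_columns
WHERE  table_name = 'EABCDE'
ORDER  BY index_name, column_position;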

I could not force my access plan because it was not possible – I had missed the screamingly obvious.

So, next time you just “know” you should be able to get your database (or code, or whatever) to do something and it won’t do it, go and have a cup of tea, think about your last holiday for 5 minutes, then go back to your desk and ask yourself: did I check that the most fundamental and obvious things are correct?

That, I think, is the key to solving what look like simple problems that you just can’t work out. Try to think even simpler.
