
If you Really Can’t Solve a “Simple” Problem… March 11, 2011

Posted by mwidlake in Friday Philosophy, Perceptions, Testing.
12 comments

Sometimes it can be very hard to solve what looks like a simple problem. Here I am going to cover a method that I almost guarantee will help you in such situations.

I recently had a performance issue with an Oracle database that had just gone live. This database is designed to scale to a few billion rows in two key tables, plus some “small” lookup tables of a few dozen to a couple of million rows. Designing a system of this scale on theory alone is very dangerous; you need to test at something like the expected volumes. I was lucky, I was on a project where they were willing to put the effort and resource in and we did indeed create a test system with a few billion rows. Data structures and patterns were created to match the expected system, code was tested and we found issues. Root causes were identified, the code was altered and tested, fine work was done. Pleasingly soon, the test system worked to SLAs and confidence was high. We had done this all the right way.

We went live. We ramped up the system to a million records. Performance was awful. Eyes swung my way… This was going to be easy, it would be the statistics, the database was 2 days old and I’d warned the client we would need to manage the object statistics. Stats were gathered.
The problem remained. Ohhh dear, that was not expected. Eyes stayed fixed upon me.

I looked at the plan and I quickly spotted what I knew was the problem. The below plan is from the test system and line 15 is the key one: an index range scan on the primary key, within a nested loop:

|   9 |          NESTED LOOPS                       |                           |     1 |   139 |    37   (3)| 00:00:01 |
|* 10 |           HASH JOIN SEMI                    |                           |     1 |    50 |    11  (10)| 00:00:01 |
|* 11 |            TABLE ACCESS BY INDEX ROWID      | PARTY_ABCDEFGHIJ          |     3 |   144 |     4   (0)| 00:00:01 |
|* 12 |             INDEX RANGE SCAN                | PA_PK                     |     3 |       |     3   (0)| 00:00:01 |
|  13 |            COLLECTION ITERATOR PICKLER FETCH|                           |       |       |            |          |
|  14 |           PARTITION RANGE ITERATOR          |                           |    77 |  6853 |    26   (0)| 00:00:01 |
|* 15 |            INDEX RANGE SCAN                 | EVEN_PK                   |    77 |  6853 |    26   (0)| 00:00:01 |

On the live system we had an index fast full scan (To be clear, the below is from when I had tried a few things already to fix the problem, but that index_fast_full_scan was the thing I was trying to avoid. Oh and, yes, the index has a different name).

|   9 |          NESTED LOOPS                 |                           |     1 |   125 |  1828   (3)| 00:00:16 |
|  10 |           NESTED LOOPS                |                           |     1 |    63 |     2   (0)| 00:00:01 |
|* 11 |            TABLE ACCESS BY INDEX ROWID| PARTY_ABCDEFGHIJ          |     1 |    45 |     2   (0)| 00:00:01 |
|* 12 |             INDEX UNIQUE SCAN         | PA_PK                     |     1 |       |     1   (0)| 00:00:01 |
|* 13 |            INDEX UNIQUE SCAN          | AGR_PK                    |     1 |    18 |     0   (0)| 00:00:01 |
|  14 |           PARTITION RANGE ITERATOR    |                           |     1 |    62 |  1826   (3)| 00:00:16 |
|* 15 |            INDEX FAST FULL SCAN       | EVE_PK                    |     1 |    62 |  1826   (3)| 00:00:16 |

Now I knew that Oracle would maybe pick that plan if it could get the data it wanted from the index and it felt that the cost was lower than doing multiple range scans. Many reasons could lead to that and I could fix them. This would not take long.

But I could not force the step I wanted. I could not get a set of hints that would force it. I could not get the stats gathered in a way that forced the nested loop range scan. I managed to alter the plan in many ways, fixing the order of tables and the types of joins, but I kept getting to the point where the access was via the index fast full scan and not by range scan. I thought I had it cracked when I came across a hint I had not known about before, namely INDEX_RS_ASC {and INDEX_RS_DESC}, which tells the optimizer to do an ascending {or descending} index range scan. Nope, no joy.
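For reference, this is roughly the shape of statement I was hinting. The table, column and alias names here are simplified and hypothetical, but the hints are the real ones:

-- a hypothetical sketch of the sort of hinting I was attempting
select /*+ leading(p) use_nl(e) index_rs_asc(e even_pk) */
       count(*)
  from party_abcdefghij p
      ,eabcde           e
 where e.agr_ext_sys_id    = p.agr_ext_sys_id
   and e.ext_prd_hld_id_tx = p.ext_prd_hld_id_tx
   and e.aaamnd_dt         > sysdate - 30;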

By now, 8 hours had passed trying several things and we had a few other people looking at the problem, including Oracle Corp. Oracle Corp came up with a good idea – if the code on test runs fine, copy the stats over. Not as simple as it should be, as the test system was not quite as-live, but we did that. You guessed it: it did not work.

So what now? I knew it was a simple problem but I could not fix it. So I tried a technique I knew had worked before. I’d long passed the point where I was concerned about my pride – I emailed friends and contacts and begged help.

Now, that is not the method of solving problems I am writing about – but it is a damned fine method and I have used it several times. I highly recommend it, but only after you have put a lot of effort into doing your own work, only if you are willing to give proper details of what you are trying to do and – utterly crucially – only if you are willing to put yourself out and help those you just asked for help on another day.

So, what is the silver bullet method? Well, it is what the person who mailed me back did and which I try to do myself – but struggle with.

Ask yourself, what are the most basic and fundamental things that could be wrong. What is so obvious you completely missed it? You’ve tried complex, you’ve been at this for ages, you are missing something. Sometimes it is that you are on the wrong system or you are changing code that is not the code being executed {I’ve done that a few times over the last 20 years}.

In this case, it was this:

Here is my primary key:

EVEN_PK EABCDE 1 AGR_EXT_SYS_ID
EVEN_PK EABCDE 2 EXT_PRD_HLD_ID_TX
EVEN_PK EABCDE 3 AAAMND_DT
EVEN_PK EABCDE 4 EVT_EXT_SYS_ID
EVEN_PK EABCDE 5 EABCDE_ID

Except, here is what it is on Live:

EVE_PK EABCDE 1 EVT_EXT_SYS_ID
EVE_PK EABCDE 2 EABCDE_ID
EVE_PK EABCDE 3 AGR_EXT_SYS_ID
EVE_PK EABCDE 4 EXT_PRD_HLD_ID_TX
EVE_PK EABCDE 5 AAAMND_DT

Ignore the difference in index name – that was an artifact of the test environment creation. The key thing is that the primary key has a different column order. The DBAs had implemented the table wrong {I’m not blaming them, sometimes stuff just happens OK?}.
Now, it did not alter logical functionality, as the Primary Key is on the same columns. But the access to the table is on only the “leading” three columns of the primary key as designed; if the columns are indexed in the wrong order then Oracle cannot access the index via range scans on those values! Unit testing on the obligatory 6 records had worked fine, but any volume of data revealed the issue.
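If you want to see the effect for yourself, here is a minimal sketch of the principle {the table and data are made up, reusing the column names from above}:

-- a made-up table with the columns from above
create table eabcde
( agr_ext_sys_id    number       not null
, ext_prd_hld_id_tx varchar2(30) not null
, aaamnd_dt         date         not null
, evt_ext_sys_id    number       not null
, eabcde_id         number       not null
, constraint even_pk primary key
    (agr_ext_sys_id, ext_prd_hld_id_tx, aaamnd_dt, evt_ext_sys_id, eabcde_id)
);

-- with the intended column order, predicates on the leading three
-- columns can drive an index range scan:
select count(*) from eabcde
 where agr_ext_sys_id    = 123
   and ext_prd_hld_id_tx = 'ABC123'
   and aaamnd_dt         = trunc(sysdate);

-- with the "live" order {evt_ext_sys_id and eabcde_id first} those same
-- predicates no longer cover the leading columns, so no range scan.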

I could not force my access plan as it was not possible – I had missed the screaming obvious.

So, next time you just “know” you should be able to get your database (or code, or whatever) to do something and it won’t do it, go have a cup of tea, think about your last holiday for 5 minutes and then go back to the desk and ask yourself: did I check that the most fundamental and obvious things are correct?

That is what I think is the key to solving what look like simple problems where you just can’t work it out. Try and think even simpler.

How Fast for £1,000 – Architecture August 5, 2010

Posted by mwidlake in Architecture, performance, Testing.
7 comments

My previous post proposed the creation of “the fastest Oracle server for a grand”, or at least an investigation into what might be the fastest server. I’ve had some really good feedback {which I very much appreciate and am open to even more of}, so I think I’ll explore this further.

My initial ideas for the hardware configuration, written at the same time as the original post, were:

  • A single-chip, quad-core Intel Core i5 or i7 processor (I would like two chips but the cost of multi-chip motherboards seems too high for my budget)
  • 8GB of memory as the best price point at present, but maybe push to 16GB
  • Multiple small, fast internal disks for storage, maybe expanded via eSATA
  • Backup to an external drive (cost not included in the budget)
  • USB3 and use of memory sticks for temp and online redo
  • If the budget will stretch, SSD disc for the core database components, like core tables and index tablespaces (who does that any more!)
  • ASM or no ASM?
  • If I run out of internal motherboard connections for storage, can I mix and match with USB3, external eSATA or even GB Ethernet?

As for the Oracle database considerations, I have a good few things I want to try out also. In the past (both distant and recent) I have had a lot of success in placing components of the database in specific locations. I refer to this as “Physical Implementation” {Physical Implementation, if I remember my old DB Design courses correctly, also includes things like partitioning, extent management, tablespace attributes – how you actually implement the tables, indexes and constraints that came from logical data design}.

Physically placing components like undo and redo logs on your fastest storage is old-hat but I think it gets overlooked a lot these days.
Placing indexes and tables on different tablespaces on different storage is again an old and partially discredited practice, but I’d like to go back and have a new look at it. Again, I had some success in improving performance with this approach as recently as 8 years ago, but never got to rigorously test and document it. {As an aside, one benefit I have twice been (un)fortunate enough to gain from putting tables and indexes in separate tablespaces is when a tablespace has been lost through file corruption – only for it to be an index tablespace, so I was able to just drop the tablespace and recreate the indexes.}
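As a sketch of the sort of thing I mean by physical placement {all the paths, names and sizes here are made up}:

-- redo on the fastest device available
alter database add logfile group 4 ('/mnt/ssd1/redo04a.log') size 512m;

-- tables and indexes in tablespaces on separate storage
create tablespace app_data  datafile '/mnt/disk1/app_data01.dbf'  size 10g;
create tablespace app_index datafile '/mnt/disk2/app_index01.dbf' size 5g;

create table person (pers_id number, surname varchar2(50)) tablespace app_data;
create index pers_pk on person (pers_id) tablespace app_index;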

Then there is the use of clusters, IOTs, Bitmap indexes and Single Table Hash Clusters (are you reading this Piet?) which I want to explore again under 11.

I don’t think I am going to bother with mixed block sizes in one DB, I think you need very specialist needs to make it worth the overhead of managing the various caches and the fact that the CBO is not so great at accurately costing operations in non-standard block sizes {issues with the MBRC fudge factor being one}. But I think I will re-visit use of “keep” and “recycle” caches. For one thing, I want to show that they are just caches with a name and not special, by using the “Recycle” cache as the keep and the “keep” as a recycle cache.
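A sketch of how I expect to run that cache-swap test {sizes and table names made up}:

-- carve out the two named caches
alter system set db_keep_cache_size    = 512m;
alter system set db_recycle_cache_size = 256m;

-- the swap test: use RECYCLE as if it were the keep cache...
alter table road_type storage (buffer_pool recycle); -- small, hot lookup table
-- ...and KEEP as the recycle cache
alter table person    storage (buffer_pool keep);    -- big table we do not want cached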

Should I be using RAT for testing all of this? I said I was not going to use any special features beyond Enterprise edition but RAT could be jolly useful. But then I would need two servers. Is anyone willing to give me the other £1000 for it? I’d be ever so grateful!:-)

Friday Philosophy – Alternatives to Testing February 19, 2010

Posted by mwidlake in Friday Philosophy, Testing.
5 comments

We all know that we should test everything before we do it on a production system.

Or should we? Well yes, you should, but sometimes there is no suitable test system (or even any test system, which seems to be becoming a more common problem) or the testing regime takes weeks and you need a fix right now.

So given that there really is a proper business reason why doing the testing is expensive {in effort or time} or impossible, what can you do? I believe there is an alternative.

To be able to reduce or remove the need for proper testing you need to be absolutely sure the following three things are true, in order of importance:

  1. Be able to monitor the live system for negative impact of your changes.
  2. Have a proven and quick rollout.
  3. Be confident that the chance of negative impact on your system is low likelihood and low impact.

Note that, in my opinion, the least important {yet still very important} consideration is that the change be almost certain to work and won’t cause too much grief if something unexpected happens.  However, this seems to be the main concern when people are considering a “Quick Fix” on a production system. “Is it likely to go wrong? Is it going to matter if it does not help?”. At this point all eyes tend to swivel to the DBA or the Lead Developer who has to go on what I call DBA Instinct. Given your experience to date and what you know about how the database works, give it your best guess.

The problem with DBA Instinct is that if you are not very experienced you are unlikely to get it right. If you are very experienced, most of that experience was on earlier versions of Oracle and you probably do little hands-on work now as you are a manager. So you are still unlikely to get it right:-)

So, I argue that point 1 (spotting things going wrong) and 2 (Getting back to where you started) are most important.

Let us take the classic example of just adding an index as our untested change (as we have no system capable of generating a realistic workload, and thus no way of assessing the impact across the board).

3) Most of us would agree that adding an index is easy and the impact is low, and the chance of plans going terribly slow by using the new index unsuitably is low…

2) Rollback is relatively simple, you drop the index again. Question, can you just drop an index on a table if a SQL statement is in-flight and using that index? You know, a SQL statement that the CBO has decided is going to be useful and, despite your DBA Instinct that nothing is likely to go wrong, now takes 20 seconds to run rather than 0.2 seconds. And it is being run all the time on your OLTP system so now there is no time when there are not 2 or 3 queries in flight.

1) Verifying impact. This point is vital and can be tricky. You can just rely on staring at your monitoring tool and hoping any serious impact shows up so you can catch it and do something about it. Or people ring up and complain. That would not be so good, you have impacted the business.

I would pull out code from the SGA that is referencing the table to be indexed before you start (I’ll stick a bit of code at the bottom for doing this). If the list is not too large, you can do the same check after the code has gone in and compare changes to buffer gets and disk reads per execution.

A more targeted check is to pull out of V$SQL_PLAN anything that starts using the new index and check to make sure it is not now a problem.
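Something like this does that check {the owner and index names here are hypothetical}:

-- find cached statements whose plans now reference the new index
select distinct p.sql_id
      ,s.executions
      ,s.buffer_gets
      ,s.disk_reads
  from v$sql_plan p
      ,v$sqlarea  s
 where p.object_owner = 'MWPERF'   -- hypothetical owner
   and p.object_name  = 'PER_DOB'  -- hypothetical new index
   and s.sql_id       = p.sql_id
 order by s.buffer_gets desc;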

A second example: I want to gather system statistics, as I am sure the system does not appreciate how blindingly fast my disks are for multi-block reads.

3) Impact? Well, I would suggest the potential impact is wide and across the board, else why do it?

2) Rollback? Delete the system statistics. What happens when you delete the system statistics? Are they all set to null? Are the defaults put back in place? Could I use DBMS_STATS.RESTORE_SYSTEM_STATS? What is the impact of any of them? {RESTORE appears to work reliably and, if you delete the system stats, you get the “out of the box” values again, which could be very different to what you had before your spur-of-the-moment gather}
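A sketch of those rollback options {assuming the spur-of-the-moment gather was within the last hour}:

-- roll system stats back to how they were an hour ago...
exec dbms_stats.restore_system_stats(systimestamp - interval '1' hour)

-- ...or delete them and fall back to the "out of the box" values
exec dbms_stats.delete_system_stats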

1) How to assess negative impact? You are going to have to monitor the whole system and hope to pick up any code that starts running slowly before it impacts the business too much.

I would probably add the index but I would need to do at least some testing and proof of rollback before I gather system statistics. And I would only do so without proper testing if said proper testing was impossible due to a lack of test systems.

So, I do think you can implement changes without testing or with reduced testing, but it is not always the easy option.

I also think it is a valid (and often quicker, in terms of elapsed time) way of changing the production system without going through full unit/integration/regression/NFR testing.

Now you just need to persuade the Change Advisory Board that it is OK to do the change. Good luck:-)

Oh, that code:-

-- chk_sga_txt.sql
-- Martin Widlake - look for sql in sga containing the provided txt
--
set pause on
set pages 36
col first_load_time form A20
spool chk_sga_txt
select
first_load_time
,parse_calls prse
,executions  excs
,buffer_gets buffs
,disk_reads   discs
,rows_processed rws
--,address        address
,hash_value     hash_value
,sql_id
--,plan_hash_value
,sql_text
from v$sqlarea
--where parsing_schema_id !='0'
where upper(sql_text) like upper('%'||nvl('&sql_txt','whoops')||'%')
and rownum <50
--order by first_load_time desc
order by (greatest(buffer_gets,1)/greatest(rows_processed,1)) desc
/
spool off

Turning on SQL Audit February 2, 2010

Posted by mwidlake in development, performance, Testing.
6 comments

<Previous post…

I want to test out using the SQL AUDIT functionality.

The first thing you have to do (having read up on it a bit) is to enable it for the database. You do this by setting the initialization parameter AUDIT_TRAIL. On version 10.2 you have the options to write the audit entries to:

  • The DB, by setting it to DB or DB_EXTENDED {not DB,EXTENDED as the manual says, that is the old format}. This puts the entries into the table SYS.AUD$
  • The operating system, by setting it to OS, XML or XML_EXTENDED. If you set it to XML or XML_EXTENDED then the data is written out in XML format. You also optionally set AUDIT_FILE_DEST to the location where you want the data to be written.

Writing the audit trail to the OS is potentially more secure, as you can stop those cunning and devious DBAs messing with the audit trail. {I’m not so sure that this really helps that much – if anyone knows of any DBAs caught out being naughty solely as a result of using the OS to store the SQL AUDIT records, it would be a fascinating comment}

I want to write to the DB as I want to be able to get at the audit data easily and I am not sure how I want to interrogate it. I’m faster with SQL than SED and AWK.

I also decided up front I wanted to use DB_EXTENDED so that the triggering SQL statement and all bind variables are caught, so I can see more about what is triggering the audit record. I am cautious of the impact of storing CLOBs though, which these two values are stored as. I’ve had performance issues moving lots of CLOBs around and I know from some old colleagues that Secure Files are a lot faster. If Secure Files are faster, that means CLOBs are slower:-). If the audit trail seems to add too much burden on my system, swapping back to just DB will be my first step.

Now for the bad news. You can’t just turn on AUDIT. That initialization parameter is not dynamic. You can’t even enable it for your session. It will need a restart of your database.
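So the change is a sketch like this {assuming 10.2 and an spfile}:

-- AUDIT_TRAIL is static: set it in the spfile, then bounce the database
alter system set audit_trail = 'DB_EXTENDED' scope = spfile;
shutdown immediate
startup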

This tells me something. Oracle needs to do some setting up for SQL AUDIT when it starts the instance. Either start a new process, enable functionality in one of its regular processes or set up structures in memory to cope. Or a mixture thereof. I strongly suspect the need for memory structures {but this is only because, in reality, I have done some testing and I am writing this up afterwards}.

I should not really need to say this but DON’T go turning this on for a production system without extensive testing somewhere else first. There is not a lot “Out There” about the details of the performance impact of AUDIT but the general opinion is there is some; and that is reasonable given it is going to write database records for every action audited. Also, you have no idea yet of any knock-on effects. You know, things you did not expect that cause your database to lock or crash and you to get fired.

{Question, what happens if you alter the initialization file and restart only one node of a RAC database? I don’t know and so I should test that. My current test system is not RAC, but the final destination for this stuff is RAC}.

You probably also want to check that no one has gone and tried turning on SQL AUDIT on things already. You never know if someone else decided to have a play with this and issued a load of AUDIT statements only to find nothing happened – and left what they did in place “as nothing happened”. I already know of one example of this happening…

Here is a little script I knocked up to see what is currently set to be audited:

-- what_is_audited.sql
-- Martin Widlake 11/01/10
-- simple listing of what auditing is currently set
set pages 100
set pause on
spool what_is_audited.lst
select * from dba_priv_audit_opts
order by user_name,privilege
/
select * from sys.dba_stmt_audit_opts
order by user_name,audit_option
/
select * from DBA_OBJ_AUDIT_OPTS
order by owner,object_name
/
spool off
clear col
--
-- EOF
--

And some sample output. I’m not going to explain it in this post, but you can have a look through it.

DEV3> @what_is_audited
USER_NAME                      PROXY_NAME
------------------------------ ------------------------------
PRIVILEGE                                SUCCESS    FAILURE
---------------------------------------- ---------- ----------
MDW1
ALTER SYSTEM                             BY ACCESS  BY ACCESS
MDW1
AUDIT SYSTEM                             BY ACCESS  BY ACCESS
MDW1
CREATE SESSION                           BY ACCESS  BY ACCESS

ALTER SYSTEM                             BY ACCESS  BY ACCESS

AUDIT SYSTEM                             BY ACCESS  BY ACCESS

CREATE SESSION                           BY ACCESS  BY ACCESS

6 rows selected.

USER_NAME                      PROXY_NAME
------------------------------ ------------------------------
AUDIT_OPTION                             SUCCESS    FAILURE
---------------------------------------- ---------- ----------
MDW1
ALTER SYSTEM                             BY ACCESS  BY ACCESS
MDW1
CLUSTER                                  BY ACCESS  BY ACCESS
MDW1
CONTEXT                                  BY ACCESS  BY ACCESS
MDW1
CREATE SESSION                           BY ACCESS  BY ACCESS
MDW1
DATABASE LINK                            BY ACCESS  BY ACCESS
MDW1
DELETE TABLE                             BY ACCESS  BY ACCESS
MDW1
...
TYPE                                     BY ACCESS  BY ACCESS
MDW1
UPDATE TABLE                             BY ACCESS  BY ACCESS
MDW1
USER                                     BY ACCESS  BY ACCESS
MDW1
VIEW                                     BY ACCESS  BY ACCESS

ALTER SYSTEM                             BY ACCESS  BY ACCESS

CLUSTER                                  BY ACCESS  BY ACCESS

CONTEXT                                  BY ACCESS  BY ACCESS

CREATE SESSION                           BY ACCESS  BY ACCESS

DATABASE LINK                            BY ACCESS  BY ACCESS

DIMENSION                                BY ACCESS  BY ACCESS

DIRECTORY                                BY ACCESS  BY ACCESS

INDEX                                    BY ACCESS  BY ACCESS

MATERIALIZED VIEW                        BY ACCESS  BY ACCESS
...
USER                                     BY ACCESS  BY ACCESS

VIEW                                     BY ACCESS  BY ACCESS

56 rows selected.

OWNER                          OBJECT_NAME                    OBJECT_TYPE
------------------------------ ------------------------------ --------------
ALT     AUD     COM     DEL     GRA     IND     INS     LOC     REN     SEL
------- ------- ------- ------- ------- ------- ------- ------- ------- ----
UPD     REF EXE     CRE     REA     WRI     FBK
------- --- ------- ------- ------- ------- -------
MWPERF                         FORN_M_SEQ                     SEQUENCE
-/-     -/-     -/-     -/-     -/-     -/-     -/-     -/-     -/-     A/A
-/-     -/- -/-     -/-     -/-     -/-     -/-
MWPERF                         PERSON                         TABLE
A/A     A/A     A/A     A/A     A/A     A/A     A/A     A/A     A/A     A/A
A/A     -/- -/-     -/-     -/-     -/-     A/A
MWPERF                         ROAD_TYPE                      TABLE
-/-     -/-     -/-     A/A     -/-     -/-     A/A     -/-     -/-     A/A
A/A     -/- -/-     -/-     -/-     -/-     -/-

If you discover you have a lot of things set to be audited, ESPECIALLY if they are auditing select access, think about turning some or all of it off before you enable AUDITING by setting that initialization parameter.
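Turning existing auditing off is a NOAUDIT statement per setting, something along these lines {the user and table names are from my test system}:

-- turn off all statement auditing set against a user
noaudit all by mdw1;
-- turn off object-level auditing on a given table
noaudit all on mwperf.person;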

Once you have turned on the feature, you can start testing it…

Testing Methodology – How to Test #2 January 26, 2010

Posted by mwidlake in performance, Testing.
1 comment so far

<previous post …

Last post (see link above) I said something about how I like timing how long things take {usually SQL queries, but you can apply this to more complex situations} as it is simple. But I highlighted that it is an approach that misses a lot of information and also can catch you out if you do not appreciate that the first run of any test script is likely to be slow (due to parsing and data caching overheads, which are reduced for subsequent iterations).

I want to continue on that theme now, of simple testing with just how long things take. And how to cope with variation.

Hands up all of you who have Oracle on a PC at home? Great, you can test what you want on a dedicated machine. No one else is running anything on your machine, no one else is using your storage, no one else is using your network (if you have a network at home) and no one is running the SQL statement from hell on your database.

BUT. You might have some download of the latest security patch set running in the background, your virus scanner may kick in (or you might just have Norton full-security-everything-running-on-your-cpu-all-the-time turned on). Your miles per gallon may vary, especially on Windows but also on Linux.

For many of us we are testing on a server at work, where developers are working, where the storage is shared, where the network is dealing with that lady behind you watching “Celebrity Strictly come roller-skate dancing impersonations” {well, she sat behind me for 4 months and she was in the Quality team, what does that tell you}. Bottom line: the performance of your server will vary, and your testing needs to cope with that.

How do you cope with a varying workload on the system you are running tests on? Well, you run the new and old versions of your test several times, in succession, interleaved. This is vital in the very un-ideal situation of having to test on a Live system. Testing on “Live” is wrong, in the same way that spelling “initialisation” with a Z is wrong, but it is often forced upon us.

As an example, I want to check into the questionable suggestion someone made that selecting count(*) from dba_tables where owner=user is faster than selecting count(*) from user_tables.

So, I knock up a quick sql*plus script that runs the two alternatives in quick succession, interleaved.

--testa.sql
set timi on pause off
spool testa.lst
--
prompt run 1
select count(*) from user_tables;
select count(*) from dba_tables where owner=user;
--
prompt run 2
select count(*) from user_tables;
select count(*) from dba_tables where owner=user;
--
prompt run 3
select count(*) from user_tables;
select count(*) from dba_tables where owner=user;
--
prompt run 4
select count(*) from user_tables;
select count(*) from dba_tables where owner=user;
--
prompt run 5
select count(*) from user_tables;
select count(*) from dba_tables where owner=user;
--
spool off
-- eof

Just a couple of things to mention. I always spool out to a file called the same thing as the script, with .lst at the end (which is the default file extension for spool commands but I like to state it) and I like to use “prompt” to record in the output what is going on. Here it is obvious, sometimes it is not. Oh, and I like to put --EOF at the end of my files, so I can spot cut-and-paste errors. I know, I am very “command line” centric. I love pictures too Doug. Finally, I should not run the two versions with the same one first each time; I should swap them over after every two runs (not after each run, as that gives you A,B,B,A,A,B,B,A; the pattern from swapping every two runs, A,B,A,B,B,A,B,A, is better). But the results look messy, so I didn’t.

Here is the output (trimmed a little, for the sake of brevity):-

run 1

  COUNT(*)                                                                      
----------                                                                      
        35                                                                      
1 row selected.
Elapsed: 00:00:00.58

  COUNT(*)                                                                      
----------                                                                      
        35                                                                      
1 row selected.
Elapsed: 00:00:00.73

run 2

  COUNT(*)                                                                      
----------                                                                      
        35                                                                      
1 row selected.
Elapsed: 00:00:00.28

  COUNT(*)                                                                      
----------                                                                      
        35                                                                      
1 row selected.
Elapsed: 00:00:00.10

run 3

  COUNT(*)                                                                      
----------                                                                      
        35                                                                      
1 row selected.
Elapsed: 00:00:00.28

  COUNT(*)                                                                      
----------                                                                      
        35                                                                      
1 row selected.
Elapsed: 00:00:00.10

run 4

  COUNT(*)                                                                      
----------                                                                      
        35                                                                      
1 row selected.
Elapsed: 00:00:00.28

  COUNT(*)                                                                      
----------                                                                      
        35                                                                      
1 row selected.
Elapsed: 00:00:00.10

run 5

  COUNT(*)                                                                      
----------                                                                      
        35                                                                      
1 row selected.
Elapsed: 00:00:00.29

  COUNT(*)                                                                      
----------                                                                      
        35                                                                      
1 row selected.
Elapsed: 00:00:00.09

At this point I might pull the results into Excel, as it is so good at drawing pretty graphs, but I think we can interpret the results “by eye” in this case.

Run    User Tabs   DBA_Tabs
1        0.58          0.73
2        0.28          0.10
3        0.28          0.10
4        0.28          0.10
5        0.29          0.09

AVG      0.28          0.10

My average ignores the first run for each version of the code, as previously explained {parse overhead/warming caches}.

You can see that version B is considerably faster than A. In this case, there is no overlap between the two result sets, by which I mean every run of version B is faster than every run of version A. Also, there is little variation in the performance of each version.

What if you got a result set like:-

Run    User Tabs   DBA_Tabs
1        0.77          0.68
2        0.35          0.25
3        0.42          0.34
4        0.48          0.38
5        0.48          0.30
6        0.41          0.29
7        0.40          0.28

AVG      0.42          0.31

If you look at the figures, both A and B slow down and then start to speed up. Something happened to slow down the system in the middle of the test and it was recovering but not back to “normal” by the end. If we had done 7 runs of A followed by 7 runs of B, B would have been hard hit by the temporary system slow-down {I opened a WORD document:-) }.

Also, there is some overlap in the results for A and B. The slowest result for B, 0.38, is slower than the fastest result for A, 0.35. When the spread of times for test A and test B overlap like this, you need to do a few more tests to ensure the results are consistent. I did 7 but I should really have done 9 or maybe 15.
In any case, B is consistently 25% faster than A. It is Faster.

How about this result set:

Run    User Tabs   DBA_Tabs
1        0.87          0.71
2        0.41          0.42
3        0.47          0.43
4        0.37          0.42
5        0.39          0.34
6        0.51          0.38
7        0.42          0.40

AVG      0.43          0.40

The variation in the result sets is higher and, in fact, if you only looked at the results for paired runs 2, 3 and 4 you would come to the conclusion that A was very slightly faster than B. In fact, B is slightly faster than A.

Or is it? There are only 6 tests {remember, ignore run 1}, the variation within the sets is high (.37 to .51 in A, .34 to .42 in B) and the overall difference is low. You could run a statistical analysis on this, a Student’s t-test I think, to see if the difference was significant. But unless I was looking at something utterly business critical at this point where the smallest fraction of improvement was important, I would not bother. I would be far better off looking for solution C or doing something else completely. If you ever find yourself doing scientific statistical analysis to decide which of two options is faster, it is probably time to go home and consider what else you could be doing to help your employer more…
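{For the curious, you could even get the database to do the arithmetic. A sketch, assuming the timings were loaded into a hypothetical table TEST_RESULTS with columns VERSION and ELAPSED, and ignoring the discard-the-first-run rule for brevity:}

-- two-sample (Welch) t statistic, computed directly in SQL
-- (Oracle 10g also has STATS_T_TEST_* aggregate functions for this)
select ( avg(case when version = 'A' then elapsed end)
       - avg(case when version = 'B' then elapsed end) )
     / sqrt( variance(case when version = 'A' then elapsed end)
             / count(case when version = 'A' then 1 end)
           + variance(case when version = 'B' then elapsed end)
             / count(case when version = 'B' then 1 end) ) as t_statistic
  from test_results;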

Enough for tonight and HOPEFULLY the WordPress pixies will not eat the end of this post (yesterday’s effort was better than tonight’s, I think…)

Testing Methodology – How To Test January 26, 2010

Posted by mwidlake in performance, Testing.
1 comment so far

<Previous Post…

On those rare but pleasant occasions when I find myself running a training course on performance, something I always want to press home is “How to Test”.

I believe you can learn everything in every manual on Oracle technology and internals and still be very poor at performance tuning. Similarly, I think you can be lacking an awful lot of knowledge and still be good at performance tuning. It comes down to how you test. Not just how to test, but how you as an individual design and run your tests.

I joke with whoever I am talking to that what you need most of all to test is a watch and someone calling out “start” and “stop”, i.e. you need to be able to time things. It is a bit of a throw-away statement, but actually most of us will do exactly this on a regular basis. We will be working on something, run it and it will come back in a few seconds. Then we will tweak it and run it again and mentally {or in my case, just very quietly under my breath} count. We even discuss it like that: “How long did it take that time?” “Ohh, a couple of seconds faster than last time”.

I like tuning by timing very, very, very much.

Firstly, it is simple – if a SQL query runs faster, it runs faster. If you tune by looking at the explain plan and you see a full table scan being replaced with a nested loop and an index lookup, is it faster? It depends. If you tune by looking at the buffer gets and other statistics from the SGA (or wherever), if the “buffer gets” go down, is it faster? This depends again. If the disk reads went up, even by a relatively small amount, maybe not. If the memory usage went through the roof because of a sort that spills down to disc, well quite probably not. Obviously if all of buffer gets, disk reads and CPU usage went down, you can be pretty certain you are having a positive impact in speeding up the statement. But timing the statement from start to finish gives you a nice, simple answer on whether it runs faster.

Secondly, it is simple. You do not have to look at plans, at runtime statistics, at deltas of session statistics, at trace files. All those things are good but you need more knowledge and experience to use them and time to set up, collect and study them. This is not to say you do not get a lot more out of tuning if you understand all of the stuff about trace files, explain plans etc, I hasten to add – but you do not need that to get going with performance tuning.

Thirdly, it is simple. Tell your manager’s manager that you use 47% less buffer gets, 12% less disk reads but 9% more CPU and they will ask you what all that means. Tell them that version A runs in 2 minutes and 9 seconds and version B in 34 seconds, and they will understand. Better still, graph it and give them a PowerPoint…

Fourthly, it is simple. You can see that statement (a) is faster or slower than statement (b). Now you can look at the plan, at the statistics for the statement, at what wait events are occurring and start tying up book knowledge with real-world results.

Of course, timing with a watch is very crude. You have a very capable piece of computing power underlying that database of yours, so let’s get it to do the timing.

As a simple start, in SQL*Plus use “set timing on” (it can be abbreviated to “set timi on”). Most GUI SQL tools will either default to telling you the elapsed time or have an obvious option to switch on the functionality.

{Oh, I’ll just mention that if you cannot get timing in SQL*Plus or wherever to work, you might want to check what the initialisation parameter TIMED_STATISTICS is set to. Some ancient memory is telling me that if it is set to FALSE, you may not be able to use TIMING in SQL*Plus, but another memory is telling me that stopped being true a while back, version 8 maybe. I tried setting TIMED_STATISTICS to false in my session on 10.2 but TIMING still worked, but then I left STATISTICS_LEVEL alone and that causes TIMED_STATISTICS to be set. It is so long ago that I worked on a system that had TIMED_STATISTICS set to false! Even a quick google did not throw up an immediate answer}.

DB10.2> select count(*) from sn
  2  /
any key> 

  COUNT(*)
----------
       116

1 row selected.

DB10.2> set timi on
DB10.2> select count(*) from sn
  2  /
any key> 

  COUNT(*)
----------
       116

1 row selected.

Elapsed: 00:00:01.09

There you go, the count of SN records, all 116, took 1.09 seconds.

Yes, I had “set pause on” and the timing includes how long it takes me to spot that the query has finished and press a key. We all do it. However, sometimes the need for user input is still missed {something I see quite often is, for example, not pressing the button in PL/SQL Developer to get ALL the rows for the SQL statement, rather than just the first screenful}. A key thing is to try and remove from testing all and any waiting for user response, as we humans are slow and erratic.

So SET PAUSE OFF.

Now you have TIMING on and pause off. Time to run something and see how long it takes, then try and speed it up and run again:

DB10.2> select count(*) from person where to_char(dob,'YYYY') = 1988
  2  /

  COUNT(*)
----------
        29
Elapsed: 00:00:01.15

DB10.2> create index per_dob on person(dob);

Index created.

Elapsed: 00:00:01.31
DB10.2> select count(*) from person where to_char(dob,'YYYY') = 1988;

  COUNT(*)
----------
        29
Elapsed: 00:00:00.14

There you go, I added an index and the query went down from 1.15 seconds to 0.14, that is 8 times faster. Timing is timing.
Well, no, and this is something you need to be careful of if you are new to tuning.

The second iteration is nearly always faster.

Why? Well, the first time you run a piece of SQL it has to be parsed for a start, and that takes some time. More importantly, the first time you select the data it is going to be on disk. Oracle reads it in and puts it into memory. The second time you query the same data, it will be found in memory. That {nearly always} makes the next access to the data a lot faster. As for the index, it was not being used at all: the function on the column in the WHERE clause stops the index from being used.
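{As an aside, and not the point of this post, if I actually wanted that query to use an index there are a couple of obvious routes. A sketch:}

-- option 1: rewrite the predicate to leave the indexed column bare
select count(*) from person
 where dob >= to_date('01-JAN-1988','DD-MON-YYYY')
   and dob <  to_date('01-JAN-1989','DD-MON-YYYY');

-- option 2: index the expression itself (a function-based index) and
-- compare against a string, not a number, so the expression matches
create index per_dob_yyyy on person (to_char(dob,'YYYY'));
select count(*) from person where to_char(dob,'YYYY') = '1988';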

So having said I love testing by timing, you need to be cautious about one-off tests. Oh, and here below is proof that the index I created is making no real difference to the speed of the SQL query:

DB10.2> set autotrace on
DB10.2> select count(*) from person where to_char(dob,'YYYY') = 1988;

  COUNT(*)
----------
        29
Elapsed: 00:00:00.12

Execution Plan
-----------------------------------------------------------
| Id  | Operation          | Name   | Rows  | Bytes | Cost 
-----------------------------------------------------------
|   0 | SELECT STATEMENT   |        |     1 |     9 |   335
|   1 |  SORT AGGREGATE    |        |     1 |     9 |      
|*  2 |   TABLE ACCESS FULL| PERSON |    36 |   324 |   335
-----------------------------------------------------------

Statistics
----------------------------------------------------------
         20  recursive calls
         19  db block gets
        727  consistent gets
          0  physical reads
       1992  redo size

DB10.2> drop index per_dob
  2  /

Index dropped.

Elapsed: 00:00:00.03
DB10.2> select count(*) from person where to_char(dob,'YYYY') = 1988
  2  /

  COUNT(*)
----------
        29
Elapsed: 00:00:00.15

Execution Plan
----------------------------------------------------------
| Id  | Operation          | Name   | Rows  | Bytes | Cost
----------------------------------------------------------
|   0 | SELECT STATEMENT   |        |     1 |     9 |  335
|   1 |  SORT AGGREGATE    |        |     1 |     9 |      
|*  2 |   TABLE ACCESS FULL| PERSON |    36 |   324 |  335
----------------------------------------------------------


Statistics
----------------------------------------------------------
        261  recursive calls
         19  db block gets
        841  consistent gets
          0  physical reads
       1992  redo size

We see 0.12 seconds go to 0.15 seconds, exactly the same explain plan for both statements, a small increase in consistent gets and no physical reads by either statement (so the data is cached).
Why is the second statement a little slower and has a large increase in recursive calls and db block gets? Because I dropped the index and the statement had to be reparsed. Now, if you are new to tuning you would almost certainly not have appreciated what the recursive calls and DB block gets were all about, it could distract you from the question of “does it run faster”. It is certainly good to know all about that, but when you are starting off, you want to keep things simple and learn in stages.

What the above demonstrates, I hope, is that the second thing you run will have an advantage and could probably run faster even though, in reality, it is using more resource. And we tend to run old code first and new code second. So swap them over, give the old code the advantage of being run second.

Do not test things once, as you can easily be caught out. Test each version several times. And ignore the first run of each version. This is not perfect advice, code in production may well be being parsed and gathering data from disk, but unless you can allow for this in your testing, I think it is generally better, for simple testing, to run each version 6 times. Of the six runs, ignore the first run of each and average the results of the other 5. Whichever one runs faster on average is, well, fastest. IF the difference is significant.

Oh. Where is the SQL AUDIT in all this? Well, ponder on why I am generating REDO for a simple select…

Testing Methodology – Getting Community Information January 22, 2010

Posted by mwidlake in Testing, Uncategorized.
4 comments

Last post I tried to highlight that knowing specifically what you want to understand before you start testing is important and that it is beneficial to start with the boring official documentation. This is so that you can filter what is Out There on the web, as well as get a good grounding in what is officially possible.

There is no doubt that you can get a lot more information on Oracle features from the community than you can from the manuals and that it is often presented in a more easily understood and relevant way. But some of it is not right for your version or is just plain wrong. I can do no better than to refer to this recent posting by Jonathan Lewis on the topic. In this case, it is the persistence of old and now irrelevant information, perpetuated by the kind but flawed habit people have of pulling information together – but not testing it.

Treat everything on the net with caution if it has no proof. Even then be cautious:-) – I know I have “proved” a couple of things that turned out to be misunderstandings…

You can of course post to forums to ask for information and help, and people will often come to your aid. But you are often not really understanding the subject by doing that, you are just learning facts. It’s like learning the dates of the two World Wars and not understanding why the wars occurred. I would point you to this posting by Lisa Dobson from a couple of years back, which makes the point about what you stand to learn, and also says something about how to ask for help. If it helps you to do the “right thing”, think of it selfishly. You will get better and more specific help if you have already looked into your issue and can ask specific questions. Also, if you properly learn something rather than just get a couple of facts about it, you are less likely to look foolish when you repeat those facts in a context they are not valid in. And you will.

Sorry, side tracked a little into one of my pet annoyances. Back to topic.

I want to know about SQL AUDIT. So I google SQL AUDIT ORACLE. Lots of hits. Let’s start filtering. First off, anything by “experts-exchange” and the other pay-for forums I can forget. Sorry guys, but if I am going to spend money I’ll buy a manual. I will also ignore any stuff by sites that constantly suggest you go on training courses on boats with them or are “Team America” {everyone seen the film by Trey Parker? Puts the claim “We are the American Team” into a different light…}.

Now I go and look at the sites and I try making my search more intelligent, maybe adding in the word DBA_AUDIT_TRAIL and the like. I try and think what words and phrases someone explaining the topic would use which would not be common for other topics. That is why I often use database object names and commands in my searches.

In the case of SQL AUDIT, there are plenty of sites that generally pull together the basics from the manuals and tell you how to audit a few things and see the results and give some nice examples. It gets you further forward, but it’s mostly not very detailed. Just cookery instructions on what to do but not how it works. Thankfully, there is an excellent article by one of the experts in the field, Pete Finnigan.
Maybe I could have replaced this post by simply suggesting you just go to Pete’s web site, read what is there and follow the link from it to the ones he likes. It would work for the topic of SQL AUDIT. However, although it is always good to go to the sites of people you know and trust, it is always worth doing a google and going beyond the first page or two of results. Those first pages are mostly for a handful of very popular sites and sources. A new article or someone who is relatively unknown but has the answers you need may be further down the list, like pages 3 or 4. It is worth 5 mins just checking.

However, you are not likely to find everything you need. Certainly with SQL AUDIT you won’t find much about performance impact other than bland and not-very-helpful generic warnings that “Use of SQL AUDIT can carry a significant performance overhead”. How much overhead? for what sort of audit? And how do you know? In fact, when searching I found this, admittedly old, comment by Pete about there being relatively little “out there” about performance impact of Auditing. That made me smile, as that information was exactly what I was after.

*sigh*. The information I want does not seem to be out there. I need to do some testing.

That is for the next post (“at LAST”, I hear you all cry).

Testing Methodology – The Groundwork January 20, 2010

Posted by mwidlake in Perceptions, Testing, Uncategorized.
add a comment

<Previous Post… Next Post…>

I want to start learning about using SQL AUDIT, as I mentioned a couple of days ago.

First question. What do I want to know about SQL AUDIT? I might just have thought “well, I do not know much about this area and I think I should – so I want to know everything”. That is OK, but personally I find it helps to make up a specific aim. Otherwise you can just flounder about {well, I do}. In this case I have decided I want to know:

  1. The options for auditing who is selecting key tables.
  2. How to audit when those key tables get changed.
  3. The impact on performance of that audit, and if it could be an issue.
  4. Following on from (3), how to control that performance impact.

For any of you who have been or are code developers, you hopefully appreciate test-driven coding. That is, you decide up front what the new code must achieve and design tests to ensure the code does it. You do not write any code until you have at least thought of the tests and written them down in a document. {Ideally you have written the tests and built some test data before you start, but then in an ideal world you would get paid more, have more holidays and the newspapers would tell the truth rather than sensational rubbish, but there you go:-) }

I do not think that learning stuff/testing is much different from developing code, thus the list above. I now know what I want to understand.

What next? I’m going to go and check the Oracle Documentation for the feature. And I am very careful to check the documentation for the version I will use. This is 10.2 for me. I know, the Oracle documentation can be pretty dry, it can be rather unhelpful and it is not very good at examples. But you can expect it to be 90% accurate in what it tells you. You can also expect it to be not-very-forthcoming about the issues, gotchas and bits that don’t work. {I have this pet theory that the official documentation only mentions how things do not work once a feature has been out for a version and people have complained that the documentation should let on about the limitations}.

So, for SQL AUDIT I suggest you go and read:

  • Concepts Manual, chapter 20 Database Security. If I am not rushed I would read the whole chapter, I might realise that what I want to do is better done with some other tool (If I wanted to see who had changed records months down the line, I think I would pick up that database triggers were a better bet, for example).
  • SQL Reference, chapter 13 , the section on AUDIT (no surprises there). I do not do much more than read through the SQL manual once though, as frankly I find it pretty useless for explaining stuff, but it puts into your head what the parts of the command there are and pointers to other parts of the documentation. I’ll read the concepts manual with more attention. In this case, the manual will lead me to:
  • Database Security Guide chapter 8. Which is pretty good, actually.
  • My next source of information, may not immediately spring to mind but I find it very valuable, is to find out which data dictionary objects are involved in the feature. In this case, the previous sources would lead me to go to the Database Reference and check out:
  • DBA_AUDIT_TRAIL, DBA_OBJ_AUDIT_OPTS, DBA_PRIV_AUDIT_OPTS, DBA_STMT_AUDIT_OPTS. And of course, SYS.AUD$. {I actually queried DBA_OBJECTS for anything with the word “AUDIT” in it, checked out all the tables and also had a quick peek at the text for any views, which would have led me to SYS.AUD$ if I did not already know about it – see the query below}.
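That trawl through DBA_OBJECTS is just something like:

-- what data dictionary objects are involved in auditing?
select owner, object_name, object_type
  from dba_objects
 where object_name like '%AUDIT%'
 order by owner, object_type, object_name;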

Why do I go and look at the data dictionary objects and the reference guide? After all, is it not nerdy enough to have put myself through reading the SQL Reference manual? {and let us be honest, it is rarely enjoyable reading the SQL Reference manual}. Because I want to know how it works, not just how to use it. Seeing the tables gives me a lot of information and the description of the columns may well tell me a lot more. First thing, SYS.AUD$ only has one index, on the column SESSIONID# (I know, there is another column in the index, I need to get to a DB to check this). Any queries not via this index are going to scan the whole damned thing. Right off, I suspect this will be an issue.

I will DESC the tables, see if there is already any information in them. In this case, there was not. A clean sheet.

Why did I not just go to Google (Bing, Yahoo, whatever) and search? Because I can trust the manuals to tell me stuff which is mostly true and is valid for my release. Not stuff about older versions, about later versions, urban myths or just outright fallacies. The manuals certainly will not tell me everything, far from it, but what it does say is solid stuff. With this reliable start I can now go to other sources and have some chance of filtering  all the other stuff that is Out There. Once filtered, it is probably a lot more informative and easier to digest than the manuals.  

I’ll ramble on about that next posting.

Testing Methodology January 19, 2010

Posted by mwidlake in Testing.
3 comments

This is my 101st posting and {almost} my start for 2010. I was going to kick off 2010 on something I knew well, so I could look all smart – specifically, handling stats for big databases. But I have decided to kick off on something I know very poorly. Security. Namely, use of SQL AUDIT. Why? Why blog about something I know little about?

Because I want to blog about the process of finding out about stuff and testing that stuff. It’s actually going to be a thread about testing – the SQL AUDIT is just an excuse {and matches my work life, which is jolly important when time is limited and you need to keep the employer happy}. I started out my whole blog (before almost anyone paid any attention to my meanderings) on what you could find out just from doing a “select count(*)” from a table. You can find out a lot from simple beginnings.

How do you test something? Well, that question is actually a step or two down the line from “how do I understand something”. Sorry, I was trained as a scientist at college and you get sold this line as a trainee scientist about Baconian Science whereby you record all the data about something and then come to conclusions. It does not work really. You need an idea first and then you need to test that idea.

So, in the case of IT, you come across something (the initial idea) and you want to know how it works. It is usually a feature of the software new to you. With software (and the Oracle Database is really just a huge pile of software) there is a source of starter information and it is called the “documentation”. If I was a Baconian scientist I would just use the feature and record everything I could think of about it and record what I discovered. But this is software and some human (or humans) designed it and implemented it. And documented it.

So, first thing, don’t google it. Go and read the official manual on it. It is not going to tell you everything (it is, after all, written by people desperate to put it in the best light possible – and yet leave space for consultancy and training dollars to be earned) and remember that the manual can be wrong, but it is likely to be more accurate than 80% of what you come across on web searches. {If I was a tabloid newspaper I would go and ask a bunch of people who knew little about the subject what they thought and then publish an article on it, emphasising any shocking or outrageous parts. Remember that when you Google or Bing or Yahoo anything, Anything.}

Having read (and I would suggest you start with) the Concepts Manual on the feature and then any specific manual covering the feature, now do your web search and…
(a) look for sources you have grown to trust. Not popular sources necessarily, sources where you have rarely found them wrong or they publish examples and worked proofs of their utterances.
(b) look for repeated claims about that feature and keep them in mind. They could be urban myths, they could be accurate. Just keep them in mind.

Now go and do some tests. Yourself. Otherwise, you just never know…

And this is the nub, I think, of testing. Get some basic information about how it should work, throw in there your experience to date (eg, this AUDIT stuff is putting data into an oracle table, so it is inserting and updating rows in a table, it should react like data going into a normal table. What do I know already about insert and updates to tables?) and ask yourself a few simple questions:-
– How do I think this should work?
– How could this go wrong?
– What scenarios can I think of for using this?
– What are the simplest examples I can think of for this process?
– How can I reduce tests down to trying only this feature or even one aspect of this feature?
– How could this go wrong?
Yep, I ask that last question twice and it is no oversight. I want to try and understand where this new feature might break. I need to ask the “ahh but” questions.

Testing is Not Just for Code. September 16, 2009

Posted by mwidlake in Architecture, VLDB.
7 comments

Someone I am currently working with has a wonderful tag line in her emails:

Next time we want to release untested, why don’t we just release undeveloped?

Testing is not limited to testing code of course. I have recently posted about how a backup is not a backup until you have tested it with a practice recovery.  How you think the database will work by looking at the data dictionary is just a nice theory until you run some actual tests to see how the database responds, as I have been doing with Histograms lately. Sadly, you could even say an Oracle feature is not an Oracle feature until you have tested it.

In my experience, this is particularly true when you test the edges of Oracle, when you are working on VLDBs {Very Large DataBases}.

Last month Jonathan Lewis posted about a 2TB ASM disc size bug, where if you allocated a disc over 2TB to ASM, it would fill it up, wrap around and write over the beginning of the file. This week I heard from some past colleagues of mine that they hit this very same bug.
With these very same colleagues we hit a bug in 10.1 where you could not back up a tablespace over 8TB in size with RMAN {I can’t give you a bug number for it as we were working with HP/Oracle direct at the time and they “handled it internally”. But when I mentioned it to him, Jonathan found a similar one, bug 5448714, which stated a 4TB limit on backups. It could be the same bug}.

Yet another VLDB issue was that we wanted to move just under one thousand tablespaces from one database to another {again, 10.1}, using transportable tablespaces. We tried to use the utility for checking you are working on a consistent set of tablespaces, but it could not cope with that many. And to plug them into the new database you have to export the metadata, where we found a 4000 character limit on the variable stating the tablespaces to transport. That’s 2.3 characters per tablespace, as you need commas to delimit them… Yes, you could manage if you renamed all tablespaces to AA, AB, AC… BA, BB, BC etc. If memory serves, the problem was with Data Pump export and we reverted to old-style export, which did not have the problem.

Another limit I’ve blogged on is that the automated stats job chokes on very large objects.

Some data dictionary views can become very slow if you have several tens of thousands of tables/extents/tablespaces/indexes.

I can appreciate the issues and problems Oracle has with testing their code base, it is vast and people use the software in odd ways and it has to run on many platforms. You might also feel I am being picky by saying Oracle breaks a little when you have 8TB tablespaces or a thousand tablespaces. But

  • Oracle will say in big, glossy presentations, you can build Petabyte and Exabyte databases with Oracle {and have a product called Exadata, don’t forget}.
  • More and more customers are reaching these sizes as data continues to grow, for many sites faster than Moore’s Law.
  • Some of these limits appear with databases well below a Petabyte (say a tiddly small 50TB one:-) ).

I’ve been running into these issues with VLDBs since Oracle 7 and they are often with pretty fundamental parts of the system, like creating and backing up tablespaces! I think it is a poor show that it is so obvious Oracle has been weak in testing with VLDB-sized databases before release.

I wonder whether, with 11gR2, Oracle actually tested some petabyte data sizes to see if it all works? After all, as is often said, disk is cheap now, I’m sure they could knock one up quite quickly…
