jump to navigation

Exclusion of Unioned SQL in Views? August 16, 2012

Posted by mwidlake in performance, SQL.
Tags: , , ,
13 comments

Question – you have a view definition of the following format:

select 1 as d_type, col_a,col_b,col_c
from TAB_X, TAB_Y, TAB_Z
where {your predicates}
UNION
select 2 as d_type, col_a,col_b,col_c
from TAB_P, TAB_Q, TAB_R
where {your predicates}
UNION
select 3 as d_type, col_a,col_b,col_c
from TAB_X X, TAB_Y, TAB_Z
where {your predicates}

You now select from the view and your code includes a filter predicate such as:

“WHERE D_TYPE = 1″

What will the CBO do? Is it intelligent enough to do what a human would do and exclude the parts of the query that could not possibly return data you want? Ie exclude the SQL statements where d_type is set to 2 and 3. Or is that funcitonality not built into the CBO? After all, it is only software and someone would have to decide that there could be benefit of looking into set of unioned SQL selects to see if any could be excluded and then code that into the CBO.

After all, the CBO is not smart enough to realise it could easily satisfy “select min(col_a), max(col_a) from tab x” via two fast lookups on a supporting index. It scans the index.

So – what is your guess?

I asked 3 people. Not a massive straw poll but I got one vote for it will not exclude, one for it will and one smart answer of “I would expect it to exclude- but as you asked the question I am thinking it does not”.

Let’s do the test.

DB_TEST> @demo_union_view_2
DB_TEST> drop table mdw_source purge;
Table dropped.

  1  create table mdw_source
  2  tablespace users
  3  as select * from dba_objects
  4  /
Table created.

  1  create index mdso_idx on mdw_source(owner,object_name)
  2  tablespace users
  3  /
Index created.

DB_TEST> exec dbms_stats.gather_table_stats(ownname=>USER,tabname=>'MDW_SOURCE',method_opt=>'for all columns size 1')
PL/SQL procedure successfully completed.

  1  select table_name,num_rows from dba_tab_statistics
  2  where owner=USER and table_name='MDW_SOURCE';

TABLE_NAME      NUM_ROWS
--------------- ----------
MDW_SOURCE      55895

  1  create or replace view mdw_v1 as
  2  select
  3    '1'              src_type
  4   ,ms.OWNER
  5   ,ms.OBJECT_NAME
  6   ,ms.SUBOBJECT_NAME
  7   ,ms.CREATED
  8  from mdw_source ms
  9  where ms.object_name like 'A%'
 10  union all
 11  select
 12    '2'              src_type
 13   ,ms.OWNER
 14   ,ms.OBJECT_NAME
 15   ,ms.SUBOBJECT_NAME
 16   ,ms.CREATED
 17  from mdw_source ms
 18  where ms.object_name like 'B%'
 19  UNION ALL
 20  select
 21    '3'              src_type
 22   ,ms.OWNER
 23   ,ms.OBJECT_NAME
 24   ,ms.SUBOBJECT_NAME
 25   ,ms.CREATED
 26  from mdw_source ms
 27  where ms.object_name like 'C%'
 28  /
View created.

So, a simple table is created from DBA_OBJECTS and a view over the top of the table consisting on 3 simple selects unioned together, with a static value (1, 2 or 3) generated for each of the 3 parts as SRC_TYPE

Now let’s run some tests:

DB_TEST> set autotrace on
DB_TEST> --
DB_TEST> -- below will do a select of a non-indexed column across the view
DB_TEST> select count(created) from mdw_v1
  2  /

COUNT(CREATED)
--------------
          1284

Execution Plan
----------------------------------------------------------
Plan hash value: 4259489107
-----------------------------------------------------------------------------------
| Id  | Operation            | Name       | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |            |     1 |     9 |   551   (1)| 00:00:07 |
|   1 |  SORT AGGREGATE      |            |     1 |     9 |            |          |
|   2 |   VIEW               | MDW_V1     |  2463 | 22167 |   551   (1)| 00:00:07 |
|   3 |    UNION-ALL         |            |       |       |            |          |
|*  4 |     TABLE ACCESS FULL| MDW_SOURCE |   821 | 27914 |   184   (1)| 00:00:03 |
|*  5 |     TABLE ACCESS FULL| MDW_SOURCE |   821 | 27914 |   184   (1)| 00:00:03 |
|*  6 |     TABLE ACCESS FULL| MDW_SOURCE |   821 | 27914 |   184   (1)| 00:00:03 |
-----------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   4 - filter("MS"."OBJECT_NAME" LIKE 'A%')
   5 - filter("MS"."OBJECT_NAME" LIKE 'B%')
   6 - filter("MS"."OBJECT_NAME" LIKE 'C%')

Statistics
----------------------------------------------------------
          8  recursive calls
          0  db block gets
       2417  consistent gets


DB_TEST> -- below will do a select of a non-indexed column across the view
DB_TEST> -- limiting it to "one" of the types
DB_TEST> select count(created)
  2  from mdw_v1
  3  where src_type='3'
  4  /

COUNT(CREATED)
--------------
           129


Execution Plan
----------------------------------------------------------
Plan hash value: 1188403912
------------------------------------------------------------------------------------
| Id  | Operation             | Name       | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |            |     1 |    12 |   184   (1)| 00:00:03 |
|   1 |  SORT AGGREGATE       |            |     1 |    12 |            |          |
|   2 |   VIEW                | MDW_V1     |   823 |  9876 |   184   (1)| 00:00:03 |
|   3 |    UNION-ALL          |            |       |       |            |          |
|*  4 |     FILTER            |            |       |       |            |          |
|*  5 |      TABLE ACCESS FULL| MDW_SOURCE |   821 | 27914 |   184   (1)| 00:00:03 |
|*  6 |     FILTER            |            |       |       |            |          |
|*  7 |      TABLE ACCESS FULL| MDW_SOURCE |   821 | 27914 |   184   (1)| 00:00:03 |
|*  8 |     TABLE ACCESS FULL | MDW_SOURCE |   821 | 27914 |   184   (1)| 00:00:03 |
------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   4 - filter(NULL IS NOT NULL)
   5 - filter("MS"."OBJECT_NAME" LIKE 'A%')
   6 - filter(NULL IS NOT NULL)
   7 - filter("MS"."OBJECT_NAME" LIKE 'B%')
   8 - filter("MS"."OBJECT_NAME" LIKE 'C%')

Statistics
----------------------------------------------------------
          8  recursive calls
          0  db block gets
        807  consistent gets


DB_TEST> -- below will do a select of an indexed column across the view
DB_TEST> select count(owner) from mdw_v1
  2  /

COUNT(OWNER)
------------
        1284


Execution Plan
----------------------------------------------------------
Plan hash value: 4219520050
------------------------------------------------------------------------------------
| Id  | Operation               | Name     | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT        |          |     1 |    17 |   221   (1)| 00:00:03 |
|   1 |  SORT AGGREGATE         |          |     1 |    17 |            |          |
|   2 |   VIEW                  | MDW_V1   |  2463 | 41871 |   221   (1)| 00:00:03 |
|   3 |    UNION-ALL            |          |       |       |            |          |
|*  4 |     INDEX FAST FULL SCAN| MDSO_IDX |   821 | 26272 |    74   (2)| 00:00:01 |
|*  5 |     INDEX FAST FULL SCAN| MDSO_IDX |   821 | 26272 |    74   (2)| 00:00:01 |
|*  6 |     INDEX FAST FULL SCAN| MDSO_IDX |   821 | 26272 |    74   (2)| 00:00:01 |
------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   4 - filter("MS"."OBJECT_NAME" LIKE 'A%')
   5 - filter("MS"."OBJECT_NAME" LIKE 'B%')
   6 - filter("MS"."OBJECT_NAME" LIKE 'C%')

Statistics
----------------------------------------------------------
          8  recursive calls
          0  db block gets
        989  consistent gets


DB_TEST> --
DB_TEST> -- below will do a select of an indexed column across the view
DB_TEST> -- limiting it to "one" of the types
DB_TEST> select count(owner)
  2  from mdw_v1
  3  where src_type='3'
  4  /

COUNT(OWNER)
------------
         129


Execution Plan
----------------------------------------------------------
Plan hash value: 815942527
-------------------------------------------------------------------------------------
| Id  | Operation                | Name     | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT         |          |     1 |    20 |    74   (2)| 00:00:01 |
|   1 |  SORT AGGREGATE          |          |     1 |    20 |            |          |
|   2 |   VIEW                   | MDW_V1   |   823 | 16460 |    74   (2)| 00:00:01 |
|   3 |    UNION-ALL             |          |       |       |            |          |
|*  4 |     FILTER               |          |       |       |            |          |
|*  5 |      INDEX FAST FULL SCAN| MDSO_IDX |   821 | 26272 |    74   (2)| 00:00:01 |
|*  6 |     FILTER               |          |       |       |            |          |
|*  7 |      INDEX FAST FULL SCAN| MDSO_IDX |   821 | 26272 |    74   (2)| 00:00:01 |
|*  8 |     INDEX FAST FULL SCAN | MDSO_IDX |   821 | 26272 |    74   (2)| 00:00:01 |
-------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   4 - filter(NULL IS NOT NULL)
   5 - filter("MS"."OBJECT_NAME" LIKE 'A%')
   6 - filter(NULL IS NOT NULL)
   7 - filter("MS"."OBJECT_NAME" LIKE 'B%')
   8 - filter("MS"."OBJECT_NAME" LIKE 'C%')

Statistics
----------------------------------------------------------
          8  recursive calls
          0  db block gets
        331  consistent gets


DB_TEST> -- below will select against the undelying object we are interested in
DB_TEST> select count(created)
  2  from mdw_source ms
  3  where ms.object_name like 'C%'
  4  /

COUNT(CREATED)
--------------
           129


Execution Plan
----------------------------------------------------------
Plan hash value: 152094671

---------------------------------------------------------------------------------
| Id  | Operation          | Name       | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |            |     1 |    34 |   184   (1)| 00:00:03 |
|   1 |  SORT AGGREGATE    |            |     1 |    34 |            |          |
|*  2 |   TABLE ACCESS FULL| MDW_SOURCE |   821 | 27914 |   184   (1)| 00:00:03 |
---------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("MS"."OBJECT_NAME" LIKE 'C%')

Statistics
----------------------------------------------------------
          1  recursive calls
          0  db block gets
        805  consistent gets

Look at the plans at the start.

When my query does not include SRC_TYPE,ie the first SQL statement, the plan is simple – it scans the table three times, estimates 821 rows per scan and a cost of 184 for each step. When I include the SRC_TYPE = ’3′ in the next step we see that the plan still includes all three full table scans, each still estimated at a cost of 184.
When you provide a predicate that identifies the one sql statement that you want to run, it still runs all three. I tried a few further tests and could not force the exclusion of the “unwanted” queries from the plan.

I showed my friend above (the one who suggested it would not work as I had asked the question) the results and we agreed that it was a little unfortunate that the Oracle Optimizer Pixies had not coded in the ability to exclude unwanted Unioned statements, but we could understand it.

However, for those of you thinking “you are missing something” – you are right. {And, to be fair to my friend, my example was not quite so simplified as what I have posted here}.

I decided to quantify the level of the impact and so shifted my concentration to the buffer gets. Look at the first and second statements again. The cost of the total statement drops, from 551 to 184. The actual consistent gets resulting from executing the SQL is dropping from 2417 to 807. They are both dropping dramatically to approximately 1/3.

Now check out the filter predicates at step 4 and 6 – “NULL IS NOT NULL”. That will always be false. Everything “after” that statement in the plan will not be executed. The step in the plan is there and, as a step, has the same cost – but it is is not being executed.

The pattern is repeated when the simple SQL statement is on an indexed column and the index is being used (the next two statements) {and honesty forces me to admit I have included more output than I need for this example, as I wanted to confuse the issue slightly and maybe encourage you to not read the plans too carefully}.

So yes, the CBO IS excluding the individual unioned SQL statements based on the stated SRC_TYPE and the fixed value stated in each part of the union.

I’m quite impressed by that. Clever Oracle Optimizer Pixies.

To tie this back to reality, I initially looked at this as I am trying to tune some code that hides all the complexity of the application under one massive unioned uber-view over a set of 9 other complex, unioned views. The explain plan runs to close to 700 lines. I had utterly overlooked the filters and filter predicates in that – heck I missed it initially in my simple test :-) {which was more complex than in this post, in my defense}. Trying to work with a very slow statement based on nested, complex views that is used by pretty much the whole application is “fun”.

Also back in reality, the filter clause I see in my real code is along the lines of:

:B17 is null or is not null

This is as bind variable is being considerd – and that is actually one of the considerations back in this real world of the original issue. The level that the master query needs to run in is decided by this bind variable – otherwise the application developer would (should have?) just pointed at the sub-view they needed for that level. So the CBO does not know which chunk of the union’d code it can ignore until it sees that bind variable, so the plan shows what could be executed.

Of course, it could be argued that if you know the type at the time you query the database, rather than use the above union filtering trick to cut down your massively complex view to the part that is required, just query the sub view you need directly. A bit more code but a lot simpler system to look after.

It’s a nice SQL trick but I would keep it to simple examples!

Finaly, just for fun, here is the plan I’ve been trying to tune. As I said, close to 700 steps and only the highlighted grey section on the left is shown in more detail on the right.

Friday Philosophy – Lead or Lag (When to Upgrade)? January 20, 2012

Posted by mwidlake in development, Friday Philosophy, Testing.
Tags: , , ,
10 comments

I was involved in a discussion recently with Debra Lilley which version of Oracle to use. You can see her blog about it here (and she would love any further feedback from others). Oracle now has a policy that it will release the quarterly PSUs for a given point release for 12 months once that point release is superseded. ie once 11.2.0.3 came out, Oracle will only guarantee to provide PSUs for 11.2.0.2 for 12 months. See “My Oracle Support” note ID 742060.1. However, an older Terminal release such as 11.1.0.7 is not superseded and is supported until 2015 – and will get the quarterly PSU updates. This left the customer with an issue. Should they start doing their development on the latest and theoretically greatest version of Oracle and be forced to do a point upgrade “soon” to keep getting the PSUs, or use an older version of Oracle and avoid the need to upgrade?

This is in many ways a special case of the perennial issue of should you use the latest version of Oracle (or in fact any complex software solution) or go with the version you know and trust? Plus, should you patch up to the latest version which in theory gives you protection against bugs and vulnerabilities (along with the CPUs). Yes, they are two separate issues but people tend to sit on the same side of both points, for the same reasons.

The arguments to stay using an older version are that it is working, it is stable, you do not need the new features and upgrading is a lot of work and effort. Plus the new version will have new bugs that come along with the new features you do not need and things might be turned on by default that you could do without (like stats collecting or not creating the actual segments when a new table or partition is created). If you remain on your favourite version long enough, you get another issue which is that the latest version of Oracle might not be compatible with your ancient version of the OS or another package or programming language critical to your system (I got caught in a terrible web with old perl, old O/S and old DB that resulted in a need to upgrade all three together – ouch!).

The arguments to moving forward are that you get access to the latest features, that over all older features will have more bugs fixed in newer version, performance will be better {again, overall, exceptions allowing}. Also, if you do hit bugs and problems there are no issues in having to first upgrade to a fully supported version. Plus, fixes are made for current versions first and then back-ported to older ones. Those pack-ported fixes can cause real problems when you DO decide to upgrade.

The big sticking points are the effort involved in upgrading and living with the bugs that you find that Oracle Testing didn’t.

I’ve got a few of other considerations to throw into the pot.

Firstly, if you are developing something new, it is not a lot more effort to use the latest version. This allows you to learn the new version and eases the transition of older systems to it.

Secondly, Oracle like you if you use the latest version, especially if it is the latest-latest version or even beta. Yeah, the helpdesk will not have a clue about some of your issues but in my experience you get access to those really smart guys and gals in Oracle who do the third-line support or even the development work.

Thirdly, if you are on the latest version, if you do decide to freeze on that version for a while, for stability and a quiet life, you have a lot longer before your version (at least at a major level) drops out of support.

Fourthly, dynamic, inquisitive, flexible staff like new things. In my experience, environments that freeze on an old version have a higher percentage of staff who either like it dull and repetitive, or hate it being dull and repetitive – and itch to get out. If I’m in charge, I know which type of staff I like to have more of {NB there are some very good arguments for having some staff who like it dull and repetitive}.

As you can guess, I am in the “be on the latest version” side of the argument. I was ambivalent about it until a few years ago when I noticed a trend:

Sites that like to move forward tend to (a) do it in a controlled manner and (b) have the infrastructure to do proper regression testing.
Site that like to stay still lack the ability to do regression testing and move forward only when forced – and in a pressured, unplanned and frankly chaotic manner.

That was it, that was the real key thing for me. The further you lag behind the more likely you are to eventually be forced to upgrade and it won’t be a nice time doing it. I know, there are exceptions, systems still running Oracle 6 absolutely fine on an old DOS6.1 box. In the same way you also get the odd 95-year-old life-long smokers – and thousands of 45-year-old smokers with emphysema.

When I have any sway over the situation I now always strive to be on modern versions of Oracle {OS, language, whatever} and to patch small and regular. To support all this, have very good regression testing. I’ve only a couple of times been able to get the regression testing sorted out as well as I would like, but when you do the pain of patching and upgrading, as well as developing and integrating, is so much reduced that not patching seems madness.

So to sum up:

  • If it is a new development, go for the very latest version, play with the latest features if potentially beneficial and see if you can get Oracle to be interested in your attempts. ie (B)lead.
  • If you have good regression testing, plan and carry out patch and version upgrades as they come available and stay current. ie Lead
  • If you have a complex solution in place and no/poor regression testing, do not move to a new major release, leave it a while for the worst new bugs to be found and fixed. Then move. ie Lag
  • If your system is old AND critical and all the guys and gals who implemented it are long gone, stay on that version for ever. ie stagnate.

Oh, and if that last one applies to many of your systems – dust off the CV and start reading technical manuals. One day you will need a new job in a hurry.

Skipped Initial Sequence Value on Oracle 11.2 November 17, 2011

Posted by mwidlake in bug, Testing.
Tags: , ,
4 comments

I’m seeing an oddity with newly created sequences skipping the initial value. ie I create the sequence and the first use of it returns the value 2, not 1.

{update, see the post comments and this post by Niall – this is a side effect of delayed segment creation – the first insert silently fails, but gets the sequence, and is repeated behind the scenes once the segment is there. It is fixed in 11.2.0.3, my db is at 11.2.0.1 as I got sick of the download of the latest version failing on my poor broadband connection and got it off a mate – turns out it is not the latest version.
Thanks for the enlightenment Niall.}

This is on Oracle 11.2.0.1 on Windows (XP, Vista and now Windows 7 64 bit). I am sure I did not see this on Oracle 10.1 and 10.2 on linux, Tru64 and windows.

I create a set of test tables for when I run courses or just want to test some things, a few tables holding people, names, addresses, phone calls and monthly call summaries. I do this with a bunch of pretty crude scripts that create the data in steps. Only on Oracle 11.2 it all stopped working. I tracked down the problem to the skipping of the first value for the first sequence I create and use. I’ve just confirmed this on a new server I have created.

I’ve boiled it down to the following test case. I ensure my test table and two sequences are not there, create them and then insert 10 records for men and 10 for women. For the women the records are numbered 1 to 10. For the men they are numbered 2 to 11!!!

My code:

-- seq_oddity.sql
-- BUG ON 11.2.0.1 ??
-- though I create both sequences afresh, the first use of seq_m is getting 2, yes two!
-- cannot work out why, so fixed by setting the row with forn_id of 130 to 1.
set timi off
spool seq_oddity.lst
--
drop table test_fn purge;
drop sequence seq_m;
drop sequence seq_f;
create table test_fn
(forn_id number(5) not null
,forname varchar2(30) not null
,sex_ind char(1) not null)
tablespace users
/
create sequence seq_m;
create sequence seq_f;
insert into test_fn values (seq_m.nextval,'ALAN','M');
exec dbms_output.put_line ('I have just created male name number '||seq_m.currval);
insert into test_fn values (seq_m.nextval,'BARRY','M');
insert into test_fn values (seq_m.nextval,'CHRIS','M');
insert into test_fn values (seq_m.nextval,'DAVID','M');
insert into test_fn values (seq_m.nextval,'EDWARD','M');
insert into test_fn values (seq_m.nextval,'JANG','M');
insert into test_fn values (seq_m.nextval,'GARY','M');
insert into test_fn values (seq_m.nextval,'HARRY','M');
insert into test_fn values (seq_m.nextval,'IAN','M');
insert into test_fn values (seq_m.nextval,'JAMES','M');
exec dbms_output.put_line ('I created 10 men and last value was '||seq_m.currval);
--
--
--
insert into test_fn values (seq_f.nextval,'ALISON','F');
exec dbms_output.put_line ('I have just created female name number '||seq_f.currval);
insert into test_fn values (seq_f.nextval,'BARBARA','F');
insert into test_fn values (seq_f.nextval,'CHERYL','F');
insert into test_fn values (seq_f.nextval,'DAWN','F');
insert into test_fn values (seq_f.nextval,'ELAINE','F');
insert into test_fn values (seq_f.nextval,'FRANCIS','F');
insert into test_fn values (seq_f.nextval,'GILLIAN','F');
insert into test_fn values (seq_f.nextval,'CHERRY','F');
insert into test_fn values (seq_f.nextval,'INGRID','F');
insert into test_fn values (seq_f.nextval,'JANET','F');
exec dbms_output.put_line ('I created 10 women and last value was '||seq_f.currval);
--
select sex_ind,min(forn_id),max(forn_id)
from test_fn
group by sex_ind/
--
spool off

The results are:

mdw11> @seq_oddity
drop table test_fn purge
           *
ERROR at line 1:
ORA-00942: table or view does not exist

drop sequence seq_m
              *
ERROR at line 1:
ORA-02289: sequence does not exist

drop sequence seq_f
              *
ERROR at line 1:
ORA-02289: sequence does not exist

Table created.

Sequence created.

Sequence created.

1 row created.

I have just created male name number 2

1 row created.
1 row created.
...
1 row created.

I created 10 men and last value was 11


1 row created.

I have just created female name number 1

1 row created.
1 row created.
...
1 row created.
I created 10 women and last value was 10


S MIN(FORN_ID) MAX(FORN_ID)
- ------------ ------------
M            2           11
F            1           10
2 rows selected.

mdw11>

I suppose I should trace this and check for known oracle bugs but I wondered if anyone else had seen it.

IOT Part 6(B) – OLTP Inserts into an IOT November 10, 2011

Posted by mwidlake in development, performance, Testing.
Tags: , , , , , ,
14 comments

<..IOT1 – the basics
<….IOT2 – Examples and proofs
<……IOT3 – Significantly reducing IO
<……..IOT4 – Boosting Buffer Cache efficiency
<……….IOT5 – Primary Key Drawback
<…………IOT6(A) – Bulk Insert slowed down

IOT 6 part A covered the sort of insert work used to create an IOT or push a large number of rows into it via an “insert into…select from” type format. ie Bulk Load.
In this, part B, I am going to test inserting a small number of rows (10,000) into a table already holding a million rows. Something more like an hourly load or the inserts that would occur over a working day – a more OLTP-like test

I’m going to do the tests against four things. A “non-insert” test to ensure my data creation is not taking a lot of resource {the Null test}, then against my three versions of test table – a normal heap table, my IOT and my Partitioned IOT. I run each test on each segment type three times, the minimum number needed to help identify variation that could invalidate the test.

You will probably be surprised by the results….

All three tables have the same column structure, the heap has indexes on the primary key and a unique constraint, the IOT and partitioned IOT are organized on a modified version of the Primary key and have the second index for the unique key. Below is the creation statement for the partitioned IOT (the IOT is the same without the partition clause). You can get full creation details from IOT part 2:

create table transaction_IOT_P
(tran_type number(2)     not null
,tran_id   number(10)    not null
,ACCO_type number(2)     not null
,ACCO_id   number(10)    not null
,cre_date  date          not null
,vc_1      varchar2(1000) not null
,vc_2      varchar2(1000)
,date_1    date
,num_1     number(2)
,num_2     number(2)
,constraint tip_pk primary key(ACCO_type,ACCO_id,cre_date,tran_type,tran_id) 
--  using index tablespace index_01
,constraint tip_tran_uq unique (tran_type,tran_id)
  using index tablespace index_01
)
organization index
tablespace data_01
partition by range  (cre_date)
(partition rm20110901 values less than (to_date('01-09-2011','DD-MM-YYYY')) 
  tablespace data_01
,partition rm20111001 values less than (to_date('01-10-2011','DD-MM-YYYY'))
  tablespace data_01
,partition rm20111101 values less than (to_date('01-11-2011','DD-MM-YYYY'))
  tablespace data_01
,PARTITION RMTOP  VALUES LESS THAN (MAXVALUE)
  tablespace USERS
)

The three tests are:

  • Insert 10,000 records for random accounts, in time order for the same day. This would be like processing data for or during a “day”
  • Generating the 10,000 records in order of account, to better match the IOT structure, again for a single day. For any given account 0,1,2 or 3 records are created, to be a little more “real life”. This would be like loading a pre-processed set of data.
  • All 10,000 records are created for the one account, in datetime order, as an extreme example of inserting into an IOT data that is ordered to perfectly match the order of the IOT.

The code is written ‘badly’, ie a single record is prepared and inserted into the table at a time, within a loop of 10000 iterations. This is to avoid the bulk-processing that may benefit the heap table and also to better match the reality of an OLTP system. I do however only commit after all records are created for any given test, which is maybe a mistake but that’s what I did.

Now, before I go any further, does anyone want to guess by how much inserting into the heap table will be faster than inserting into the IOT (partitioned or otherwise)? Also, whether the partitioned IOT will be faster or slower than the IOT and what, if any, impact there will be by the ordering of data?

To allow you to think about that and not see the results by accident, I’ll stick in one of the insert harnesses here so you can see how I am doing it. Skip over if you just want the results

--
execute s_snap.my_snap(' finished non-insert test1')
--
-- Transaction_heap random data test
declare
v_num         number :=10000; -- number of accounts
v_offset      number :=-4;
v_str1        varchar2(60);
v_str2        varchar2(60);
v_tran_type   number(2);
v_tran_id     number(10);
v_acco_type   number(2);
v_acco_id     number(10);
v_cre_date    date;
v_vc_1        varchar2(1000);
v_vc_2        varchar2(1000);
v_date_1      date :=sysdate;
v_num_1       number(2) :=10;
v_num_2       number(2) :=15;
begin
v_str1:=dbms_random.string('U',60);
v_str2:=dbms_random.string('U',60);
dbms_output.put_line (to_char(SYSTIMESTAMP,'HH24:MI:SS.FF'));
for i in 1..10000 loop
    v_tran_type :=mod(1,3)+1;
    v_tran_id   := 1000000+i;
    v_acco_type := 5+(trunc(dbms_random.value(1,3))*5); -- 10 or 15
    v_acco_id   :=trunc(dbms_random.value(1,v_num/2)); -- 1 to 5000
    v_cre_date  :=trunc(sysdate-v_offset)+(i/(60*60*24));
    v_vc_1      :=substr(v_str1,1,51+mod(i,10));
    v_vc_2      :=substr(v_str2,1,46+mod(i,15));
    insert into transaction_heap
      (tran_type,tran_id,ACCO_type,ACCO_id,cre_date,vc_1,vc_2,date_1,num_1,num_2)
    values
      (v_tran_type,v_tran_id,v_ACCO_type,v_ACCO_id,v_cre_date
      ,V_vc_1,v_vc_2,v_date_1,v_num_1,v_num_2);
end loop;
commit;
dbms_output.put_line (to_char(SYSTIMESTAMP,'HH24:MI:SS.FF'));
end;
/
--
execute s_snap.my_snap(' finished th insert test1')
--

Here are the results of three runs for all tests:

Elapsed time in seconds:

                    random insert             ordered-by-account       perfect IOT order match
Segment type    Test 1    2    3            Test 1    2    3           Test 1    2     3    
------------    ------ ---- ----            ------ ---- ----            ----- ---- ----
No insert          2.1  0.1  0.1               0.1  0.1  0.1              0.1  0.0  0.1   

                                    AVG                         AVG                        AVG
                                   ----                        ----                       ----
Heap              52.3 20.5 11.4   28.1       33.4 13.8 10.7   19.3       9.1 13.0  9.8   10.6
IOT              137.1 83.7 26.4   82.4      110.8 63.6 39.0   71.2       6.4  6.7  4.0    5.8
Partn'd IOT       33.2  9.5  8.7   17.1       17.1 10.4 14.3   13.9       8.1  5.3  4.5    6.0
                                   ----                        ----                       ----

Allow me to summarise that.

First, the preparation of the data and cycling around the loop, the no-insert tests, are negligible. {The first no_insert test is taking 2 seconds due to the parsing and collecting dictionary information overhead of the first thing done}. We can ignore the overhead of the test harness.

Much more significantly:

  • Each iteration of each test, things get faster {with one odd exception}
  • For Random and semi-ordered inserts, inserting into the IOT is slower than the Heap by a factor of 300% to 400% (3 times to 4 times slower) – what we all probably expected.
  • Inserting into the IOT is Faster than the heap when the insert order matches the structure of the IOT in the perfect_IOT test. Not just the same but 40% faster
  • The partitioned IOT is as Fast or Faster than the the Heap table, never mind the IOT. Over all the partitioned IOT takes 50% TO 75% THE TIME OF THE HEAP to be populated.

Hands up everyone expecting this? NB My hand would be down if I had not done similar tests before.

Why do we see these results?

I’ll just mention that the reason all the tests get quicker with each iteration is that the computer is doing the same thing over and over again and gets used to it, the relevant circuits in the chip are “oiling in”. Of course not, how silly – but it seems to be how non-IT people think of it :-) Basically more and more of each segment is getting cached and that reduces physical IO and speed things up. You see this a lot in testing and you need to be constantly vigilant about it. You might think I should really run the tests until they are steady-state, but that is a very false situation. On real system you don’t tend to get the same thing run over and over and over again on a system with nothing else running.

Much more interestingly, why is the partitioned IOT winning? Well, for the Heap table data is being poured into the leading edge of the table, which takes a little processing and some blocks to be created. However, there are also two indexes being maintained, the Primary Key and that Unique key. For each insert oracle has to find the right spot in each index (root block ->branch block->leaf block->check for row) and insert the new record, occasionally splitting a full leaf block.

For the IOT and the Partitioned IOT there is no table to maintain. In effect there are just two indexes to maintain. For the normal IOT this benefit is outweighed by the cost of working with a less space efficient Primary Key index. As there is the whole of the row in each index entry there are fewer index entries possible per leaf block and so more block need to be read in from disc and more splits are needed. Thus the inserts into the IOT are slower.
But for the partitioned IOT not only is there no table to maintain but the table (really the Primary Key) is partitioned – on date. We are creating data for a single day at a time. All the relevant blocks will be in the one partition (and it’s the latest and smallest partition) and quickly all cached in memory and processed in memory. It’s the efficient use of the cache again. For both the IOT and the Heap, the primary key segment being processed is larger and the inserts are spread across the whole index.

Ordering data to match the index order (ordered_by_account tests) helps for all three segment types – heap, IOT and partitioned IOT. We have two indexes and the data is being ordered for both in effect, as one is on the columns we are ordering the data creation for and the other, the unique key, is generated by an ascending sequence. The gain is about 25, 30% and is due to working through the indexes in order as opposed to skipping about. If the indexes were significantly larger than the buffer cache or on a highly active system where the turnover in the cache was very high (indeicated by a low buffer cache hit ratio :-) ) then the benefit would be higher.
The perfect ordering test (perfect_iot_order_match) is a special case. Not only is the data being processed in order of the indexes but only one part of the index is being updated – for a single account. In this situation the maintaining of the index is so efficient that the time saved not processing the heap table means both the IOT and IOT_P win out over the heap. So in this special case, inserts into IOTs are faster than inserts into a heap table.

And finally a quick summary:

  • Inserting records into an IOT carries about a 200-400% overhead, depending on the number of indexes
  • Ordering the data to match the IOT gives a 25-30% advantage, but it similarly aids update of the primary key on the heap
  • Partitioning the IOT such that a smaller chunk of the index is processed can be highly beneficial. I don’t cover it specifically in the above but you can probably appreciate that partitioning on a heap can be beneficial to processing indexes on the heap also.
  • If you can organise to create your IOT such that data is created only in one or a few distinct points in the structure (you might have to tweak you data load too of course), insert performance might match or even beat that for a heap table with the same indexes
  • Using methods to maximise efficient use of your buffer cache and reduce IO is a very powerful performance tuning aid, not just for select

BTW I am conscious of the fact that the above may surprise many and you might wonder if my data or tests are flawed. I have stripped out most of the code from this post as it was just becoming massive and confusing, but contact me and I will happily supply all the example scripts and data.

IOT Part 6 – Inserts and Updates Slowed Down (part A) November 1, 2011

Posted by mwidlake in Architecture, performance, Testing.
Tags: , , , ,
12 comments

<..IOT1 – the basics
<….IOT2 – Examples and proofs
<……IOT3 – Significantly reducing IO
<……..IOT4 – Boosting Buffer Cache efficiency
<……….IOT5 – Primary Key Drawback
…………>IOT6(B) – OLTP Inserts

A negative impact of using Index Organized Tables is that inserts are and updates can be significantly slowed down. This post covers the former and the reasons why – and the need to always run tests on a suitable system. (I’m ignoring deletes for now – many systems never actually delete data and I plan to cover IOTs and delete later)

Using an IOT can slow down insert by something like 100% to 1000%. If the insert of data to the table is only part of a load process, this might result in a much smaller overall impact on load, such as 25%. I’m going to highlight a few important contributing factors to this wide impact spread below.

If you think about it for a moment, you can appreciate there is a performance impact on data creation and modification with IOTs. When you create a new record in a normal table it gets inserted at the end of the table (or perhaps in a block marked as having space). There is no juggling of other data.
With an IOT, the correct point in the index has to be found and the row has to be inserted at the right point. This takes more “work”. The inserting of the new record may also lead to an index block being split and the extra work this entails. Similar extra work has to be carried out if you make updates to data that causes the record to move within the IOT.
Remember, though, that an IOT is almost certainly replacing an index on the heap table which, unless you are removing indexes before loading data and recreating them after, would have to be maintained when inserting into the Heap table. So some of the “overhead” of the IOT would still occur for the heap table in maintaining the Primary Key index. Comparing inserts or updates between a heap table with no indexes and an IOT is not a fair test.

For most database applications data is generally written once, modified occasionally and read many times – so the impact an IOT has on insert/update is often acceptable. However, to make that judgement call you need to know

  • what the update activity is on the data you are thinking of putting into an IOT
  • the magnitude of the impact on insert and update for your system
  • the ratio of read to write.

There is probably little point putting data into an IOT if you constantly update the primary key values (NB see IOT-5 as to why an IOT’s PK columns might not be parts of a true Primary Key) or populate previously empty columns or hardly ever read the data.

There is also no point in using an IOT if you cannot load the data fast enough to support the business need. I regularly encounter situations where people have tested the response of a system once populated but fail to test the performance of population.

Now to get down to the details. If you remember the previous posts in this thread (I know, it has been a while) then you will remember that I create three “tables” with the same columns. One is a normal heap table, one is an Index Organized Table and one is a partitioned Index Organized Table, partitioned into four monthly partitions. All tables have two indexes on them, the Primary Key index (which is the table in the case of the IOTs) and another, roughly similar index, pre-created on the table. I then populate the tables with one million records each.

These are the times, in seconds, to create 1 million records in the the HEAP and IOT tables:

                  Time in Seconds
Object type         Run_Normal
------------------  ----------
Normal Heap table        171.9  
IOT table               1483.8

This is the average of three runs to ensure the times were consistent. I am using Oracle V11.1 on a machine with an Intel T7500 core 2 Duo 2.2GHz, 2GB memory and a standard 250GB 5000RPM disk. The SGA is 256MB and Oracle has allocated around 100MB-120MB to the buffer cache.

We can see that inserting the 1 million rows into the IOT takes 860% the time it does with a heap table. That is a significant impact on speed. We now know how large the impact is on Insert of using an IOT and presumably it’s all to do with juggling the index blocks. Or do we?

This proof-of-concept (POC) on my laptop {which you can also run on your own machine at home} did not match with a proof-of-concept I did for a client. That was done on V10.2.0.3 on AIX, on a machine with 2 dual-core CPUS with hyper-threading (so 8 virtual cores), 2GB SGA and approx 1.5GB buffer cache, with enterprise-level storage somewhere in the bowels of the server room. The results on that machine to create a similar number of records were:

                  Time in Seconds
Object type         Run_Normal
------------------  ----------
Normal Heap table        152.0  
IOT table                205.9

In this case the IOT inserts required 135% the time of the Heap table. This was consistent with other tests I did with a more complex indexing strategy in place, the IOT overhead was around 25-35%. I can’t go into too much more detail as the information belongs to the client but the data creation was more complex and so the actual inserts were only part of the process – this is how it normally is in real life. Even so, the difference in overhead between my local-machine POC and the client hardware POC is significant, which highlights the impact your platform can have on your testing.

So where does that leave us? What is the true usual overhead? Below are my more full results from the laptop POC.

                        Time in Seconds
Object type         Run_Normal    Run_quiet    Run_wrong_p
------------------  ----------    ---------    -----------
Normal Heap table        171.9        81.83         188.27  
IOT table               1483.8      1055.35        1442.82
Partitioned IOT          341.1       267.83         841.22 

Note that with the partitioned IOT the creation took 341 second, the performance ratio to a heap table is only 198% and is much better than the normal IOT. Hopefully you are wondering why!

I’m running this test on a windows laptop and other things are going on. The timings for Run_Quiet are where I took steps to shut down all non-essential services and applications. This yielded a significant increase for all three object types but the biggest impact was on the already-fastest Heap table.

The final set of figures is for a “mistake”. I created the partitions wrong such that half the data went into one partition and the rest into another and a tiny fraction into a third, rather than being spread over 4 partitions evenly. You can see that the Heap and normal IOT times are very similar to the Run_Normal results (as you would expect as these test are the same) but for the partitioned IOT the time taken is half way towards the IOT figure.

We need to dig into what is going on a little further to see where the effort is being spent, and it turns out to be very interesting. During my proof-of-concept on the laptop I grabbed the information from v$sesstat for the session before and after each object creation so I could get the figures just for the loads. I then compared the stats between each object population and show some of them below {IOT_P means Partitioned IOT}.

STAT_NAME                            Heap    	IOT	        IOT P
------------------------------------ ---------- -------------  -----------
CPU used by this session                  5,716         7,222        6,241
DB time                                  17,311       148,866       34,120
Heap Segment Array Inserts               25,538            10           10

branch node splits                           25            76           65
leaf node 90-10 splits                      752         1,463        1,466
leaf node splits                          8,127        24,870       28,841

consistent gets                          57,655       129,717      150,835
cleanout - number of ktugct calls        32,437        75,201       88,701
enqueue requests                         10,936        28,550       33,265

file io wait time                     4,652,146 1,395,970,993  225,511,491
session logical reads                 6,065,365     6,422,071    6,430,281
physical read IO requests                   123        81,458        3,068
physical read bytes                   2,097,152   668,491,776   25,133,056
user I/O wait time                          454       139,585       22,253
hot buffers moved to head of LRU         13,077       198,214       48,915
free buffer requested                    64,887       179,653      117,316

The first section shows that all three used similar amounts of CPU, the IOT and partitioned IOT being a little higher. Much of the CPU consumed was probably in generating the fake data.The DB Time of course pretty much matches the elapsed time well as the DB was doing little else.
It is interesting to see that the Heap insert uses array inserts which of course are not available to the IOT and IOT_P as the data has to be inserted in order. {I think Oracle inserts the data into the heap table as an array and then updates the indexes for all the entries in the array – and I am only getting this array processing as I create the data as an array from a “insert into as select” type load. But don’t hold me to any of that}.

In all three cases there are two indexes being maintained but in the case of the IOT and IOT_P, the primary key index holds the whole row. This means there has to be more information per key, less keys per block and thus more blocks to hold the same data {and more branch blocks to reference them all}. So more block splits will be needed. The second section shows this increase in branch node and leaf block splits. Double the branch blocks and triple the leaf block splits. This is probably the extra work you would expect for an IOT. Why are there more leaf block splits for the partitioned IOT? The same data of volume ends up taking up more blocks in the partitioned IOT – 200MB for the IOT_P in four partitions of 40-60MB as opposed to a single 170MB for the IOT. The larger overall size of the partition is just due to a small overhead incurred by using partitions and also a touch of random fluctuation.

So for the IOT and IOT_P there is about three times the index-specific work being done and a similar increase in related statistics such as enqueues, but not three times as it is not just index processing that contribute to these other statistics. However, the elapsed time is much more than three times as much. Also, the IOT_P is doing more index work than the IOT but it’s elapsed time is less. Why?

The fourth section shows why. Look at the file io wait times. This is the total time spent waiting on IO {in millionths of a second} and it is significantly elevated for the IOT and to a lesser degree for the IOT_P. Physical IO is generally responsible for the vast majority of time in any computer system where it has not been completely avoided.
Session logical reads are only slightly elevated, almost negligably so but the number of physical reads to support it increases from 123 for the Heap table insert to 81,458 for the IOT and 3,068 for the IOT_P. A clue as to why comes from the hot buffers moved to head of LRU and free buffer requested statistics. There is a lot more activity in moving blocks around in the buffer cache for the IOT and IOT_P.

Basically, for the IOT, all the blocks in the primary key segment are constantly being updated but eventually they won’t all fit in the block buffer cache – remember I said the IOT is eventually 170MB and the buffer cache on my laptop is about 100MB – so they are flushed down to disk and then have to be read back when altered again. This is less of a problem for the IOT_P as only one partition is being worked on at a time (the IOT_P is partitioned on date and the data is created day by day) and so more of it (pretty much all) will stay in memory between alterations. The largest partition only grows to 60MB and so can be worked on in memory.
For the heap, the table is simply appended to and only the indexes have to be constantly updated and they are small enough to stay in the block buffer cache as they are worked on.

This is why when I got my partitioning “wrong” the load took so much longer. More physical IO was needed as the larger partition would not fit into the cache as it was worked on – A quick check shows that logical reads and in fact almost all statistics were very similar but 26,000 IO requests were made (compared to 81,458 for the IOT and 3,068 for the correct IOT_P).

Of course, I set my SGA size and thus the buffer cache to highlight the issue on my laptop and I have to say even I was surprised by the magnitude of the impact. On the enterprise-level system I did my client’s proof of concept on, the impact on insert was less because the buffer cache could hold the whole working set, I suspect the SAN had a considerable cache on it, there was ample CPU resource to cope with the added latching effort and the time taken to actually create the data inserted was a significant part of the workload, reducing the overall impact of the slowness caused by the IOT.

{Update, in This little update I increase my block buffer cache and show that physical IO plummets and the IOT insert performance increases dramatically}.

This demonstrates that a POC, especially one for what will become a real system, has to be a realistic volume on realistic hardware.
For my client’s POC, I still did have to bear in mind the eventual size of the live working set and the probably size of the live block buffer cache and make some educated guesses.

It also explains why my “run_quiet” timings showed a greater benefit for the heap table than the IOT and IOT_P. A windows machine has lots of pretty pointless things running that take up cpu and a bit of memory, not really IO so much. I reduced the CPU load and it benefits activity that is not IO, so it has more impact on the heap table load. Much of the time for the IOT and IOT_P is taken hammering the disk and that just takes time.

So, in summary:

  • Using an IOT increases the index block splitting and, in turn, enqueues and general workload. The increase is in proportion to the size of the IOT compared to the size of the replaced PK.
  • The performance degredation across the whole load process may well be less than 50% but the only way to really find out is to test
  • You may lose the array processing load that may benefit a heap table load if you do the load via an intermediate table.
  • With an IOT you may run into issues with physical IO if the segment (or part of the segment) you are loading into cannot fit into the buffer cache (This may be an important consideration for partitioning or ordering of the data loaded)
  • If you do a proof of concept, do it on a system that is as similar to the real one as you can
  • Just seeing the elapsed time difference between test is sometimes not enough. You need to find out where that extra time is being spent

I’ve thrown an awful lot at you in this one post, so I think I will stop there. I’ve not added the script to create the test tables here, they are in IOT-5 {lacking only the grabbing of the v$sesstat information}.

When to Fix & When to Find Out – Friday Philosophy July 15, 2011

Posted by mwidlake in AWR, Friday Philosophy, Testing.
Tags: ,
8 comments

{warning, this is one of my long, rambling Friday Philosophy blogs. Technical blogs are on the way – though there are some nice AWR screen shots in this one :-) }

As a DBA (or System Administrator or Network admin or any other role that involves fixing things that are broken before the shouting starts) there is often a tension between two contending “best practices”:

- getting the system working again as soon as possible.
or
- understanding exactly what the problem is.

Some experts point out {and I generally agree} that unless you actually find out exactly what the problem was, what you did to fix it and via a test case demonstrate why the fix worked, you are not really solving the problem – You are just making it go away. You (hopefully!) know what you changed or the action you took so you have something that you can repeat which might fix it again next time. (NB this does not apply to the classic “turn it off, turn it back on again”, that nearly always is just an aversion therapy).

But it might not fix it next time.

Or you might be going to way more trouble to fix it than is needed {I’m remembering how flushing the shared pool used to get touted as a way to resolve performance issues, but it is a pretty brutal thing to do to a production database}.

You might even be making other problems worse {like slowing down everything on the production database as the caches warm up again, having flushed out all that data in the SGA, or that good-old-standard of rebuilding indexes that simply do not benefit from the rebuild}.

There is another issue with “just fixing it by trying things” in that you are not really learning about the cause of the issue or about how the technology you are looking after works. A big part of what makes an “expert” an expert is the drive, desire and opportunity to take the time to work this out. It’s that last part I sometimes get grumpy about, the opportunity.
For many of us, we do not have the luxury of investigating the root cause. Because we are under so much pressure to “get it fixed right now and worry about the detail later”. But we do not get to the detail as there is then the next “fix this problem right now” and so it goes on. {Kellyn Pot’Vin does a nice post about what she calls the Superman Conundrum on this topic}.

I’ve had exactly this dilema just a couple of months ago. Below are the details, it’s a bit long so skip to the end of the post if you like…

I was creating test data and I decided to use parallel processing to speed it up. I created a month’s worth of data with PL/SQL and then decided to copy it with a simple “insert into …select” statement, updating dates and a couple of other incrementing columns as part of the insert, using parallel. The creation of the second month’s data took longer than the PL/SQL code for the first month took. What the!!!??? I pulled up the AWR information and I could see that the problem was (possibly) with inter process communication between the parallel processes, as shown by the PX DEQ CREDIT:send blkd wait event.

The below screenshot shows the overall instance workload, the green is CPU and the Pink is “Other”. Only one SQL statement is responsible for all of this effort, via 5 sessions (four of which are parallel threads) You can see that the issue had been going on for over an hour {oh, and to a certain extent these pretty pictures are pointless – I am not looking for the exact solution now, but having loaded the pictures up to the blog site I am damn well going to pretty-up my blog with them}:

Drilling into that one session shows that the bulk of the waits by far is for PX DEq Credit: Send blkd:

By selecting that wait event, I got the histogram of wait times since the system started {I love those little histograms of wait times}:

Note that these waits are for long periods of time, around 1/10 of a second on average and some are for up to 2 or 4 seconds.

The thing is, I had anticipated this problem and increased my PARALLEL_EXECUTION_MESSAGE_SIZE to 16K from the default of 2K already, as I knew from experience that the default is way to small and has slowed down parallel execution for me before. So why was I seeing poorer performane now than I anticipated? I’m not understanding stuff. So I needed to change one thing and see what impact it has and repeat until I got to the bottom of it.

Except I could not – the next team in line was waiting for this data and they already had massive pressure to deliver. My job, what my employer was giving me money to do, was to fix this problem and move on. So, in this instance I just pragmatically got on with getting the task done as quickly as I could.

I did what we all do {don’t we?} under times of accute time pressure. I made a handful of changes, using the knowledge I already have and guessing a little, hoping that one of them would do the trick. This included reducing the degree of parallelism, adding the /*+ append */ hint (I simply forgot the first time around), pre-allocating the required space to the tablespace, muttering “pleaseopleaseoplease” under my breath….

It worked:

The job ran in less than 20 minutes and used less resource during that time. Well, it waited less anyway.
The wait histograms show lots and lots of shorter duration waits:

The duplication took 20 minutes when the previous attempt had been terminated after 2 hours when other factors forced it to be curtailed. Job done.

But the thing is, the problem was not fixed. I got the task done in a timescale that was required, I satisfied the need of the client, but I was and am not sure exactly why.

If I was a permanent employee I would consider pressing for being allowed to spend some time working this out, as my employer benefits from me extending my knowledge and skills. This is not always a successful strategy :-) {but it does indicate to me that my employer has a Problem and I need to address that}. In any case, as I was a consultant on this job, I was being paid to already know stuff. So it is now down to me, some personal time, the internet and people more knowledgeble than me who I can ask for help to sort this out.

And that was my main point. Often at work you have to get over the issue. Later on, you work out what the issue was. I wish I could get paid for the latter as well as the former. The real blow for me is that I no longer have access to that site and the information. My job was complete and, whether they have really shut down my access or not, I am not allowed to go back into the systems to dig around. I think I now know the issue, but I can’t prove it.

If you Really Can’t Solve a “Simple” Problem.. March 11, 2011

Posted by mwidlake in Friday Philosophy, Perceptions, Testing.
Tags: , ,
12 comments

Sometimes it can be very hard to solve what looks like a simple problem. Here I am going to cover a method that I almost guarantee will help you in such situations.

I recently had a performance issue with an Oracle database that had just gone live. This database is designed to scale to a few billion rows in two key tables, plus some “small” lookup tables of a few dozen to a couple of million rows. Designing a system of this scale with theory only is very dangerous, you need to test at something like the expected volumes. I was lucky, I was on a project where they were willing to put the effort and resource in and we did indeed create a test system with a few billion rows. Data structure and patterns were created to match the expected system, code was tested and we found issues. Root causes were identified, the code was altered and tested, fine work was done. Pleasingly soon the test system worked to SLAs and confidence was high. We had done this all the right way.

We went live. We ramped up the system to a million records. Performance was awful. Eyes swung my way… This was going to be easy, it would be the statistics, the database was 2 days old and I’d warned the client we would need to manage the object statistics. Stats were gathered.
The problem remained. Ohhh dear, that was not expected. Eyes stayed fixed upon me.

I looked at the plan and I quickly spotted what I knew was the problem. The below code is from the test system and line 15 is the key one, it is an index range scan on the primary key, within a nested loop:

   9 |          NESTED LOOPS                       |                           |     1 |   139 |    37   (3)| 00:00:01 |       
* 10 |           HASH JOIN SEMI                    |                           |     1 |    50 |    11  (10)| 00:00:01 |       
* 11 |            TABLE ACCESS BY INDEX ROWID      | PARTY_ABCDEFGHIJ          |     3 |   144 |     4   (0)| 00:00
* 12 |             INDEX RANGE SCAN                | PA_PK                     |     3 |       |     3   (0)| 00:00:01 |       
  13 |            COLLECTION ITERATOR PICKLER FETCH|                           |       |       |            |          |       
  14 |           PARTITION RANGE ITERATOR          |                           |    77 |  6853 |    26   (0)| 00:00:01 | 
* 15 |            INDEX RANGE SCAN                 | EVEN_PK                   |    77 |  6853 |    26   (0)| 00:00:01 | 

On the live system we had an index fast full scan (To be clear, the below is from when I had tried a few things already to fix the problem, but that index_fast_full_scan was the thing I was trying to avoid. Oh and, yes, the index has a different name).

|   9 |          NESTED LOOPS                 |                           |     1 |   125 |  1828   (3)| 00:00:16 |
|  10 |           NESTED LOOPS                |                           |     1 |    63 |     2   (0)| 00:00:01 |
|* 11 |            TABLE ACCESS BY INDEX ROWID| PARTY_ABCDEFGHIJ          |     1 |    45 |     2   (0)| 00:00:01 |
|* 12 |             INDEX UNIQUE SCAN         | PA_PK                     |     1 |       |     1   (0)| 00:00:01 |
|* 13 |            INDEX UNIQUE SCAN          | AGR_PK                    |     1 |    18 |     0   (0)| 00:00:01 |
|  14 |           PARTITION RANGE ITERATOR    |                           |     1 |    62 |  1826   (3)| 00:00:16 |
|* 15 |            INDEX FAST FULL SCAN       | EVE_PK                    |     1 |    62 |  1826   (3)| 00:00:16 |

Now I knew that Oracle would maybe pick that plan if it could get the data it wanted from the index and it felt that the cost was lower than doing multiple range scans. Many reasons could lead to that and I could fix them. This would not take long.

But I could not force the step I wanted. I could not get a set of hints that would force it. I could not get the stats gathered in a way that forced the nested loop range scan. I managed to alter the plan in many ways, fix the order of tables, the types of joins, but kept getting to the point where the access was via the index fast full scan but not by range scan. I thought I had it cracked when I came across a hint I had not known about before, namely the INDEX_RS_ASC {and INDEX_RS_DESC} hint to state do an ascending range scan. Nope, no joy.

By now, 8 hours had passed trying several things and we had a few other people looking at the problem, including Oracle Corp. Oracle Corp came up with a good idea – if the code on test runs fine, copy the stats over. Not as simple as it should be as the test system was not quite as-live but we did that. You guessed, it did not work.

So what now? I knew it was a simple problem but I could not fix it. So I tried a technique I knew had worked before. I’d long passed the point where I was concerned about my pride – I emailed friends and contacts and begged help.

Now, that is not the method of solving problems I am writing about – but it is a damned fine method and I have used it several times. I highly recommend it but only after you have put a lot of effort into doing your own work, if you are willing to give proper details of what you are trying to do – and, utterly crucially, if you are willing to put yourself out and help those you just asked for help on another day.

So, what is the silver bullet method? Well, it is what the person who mailed me back did and which I try to do myself – but struggle with.

Ask yourself, what are the most basic and fundamental things that could be wrong. What is so obvious you completely missed it? You’ve tried complex, you’ve been at this for ages, you are missing something. Sometimes it is that you are on the wrong system or you are changing code that is not the code being executed {I’ve done that a few times over the last 20 years}.

In this case, it was this:

Here is my primary key:

EVEN_PK EABCDE 1 AGR_EXT_SYS_ID
EVEN_PK EABCDE 2 EXT_PRD_HLD_ID_TX
EVEN_PK EABCDE 3 AAAMND_DT
EVEN_PK EABCDE 4 EVT_EXT_SYS_ID
EVEN_PK EABCDE 5 EABCDE_ID

Except, here is what it is on Live

EVE_PK EABCDE 1 EVT_EXT_SYS_ID
EVE_PK EABCDE 2 EABCDE
EVE_PK EABCDE 3 AGR_EXT_SYS_ID
EVE_PK EABCDE 4 EXT_PRD_HLD_ID_TX
EVE_PK EABCDE 5 AAAMND_DT

Ignore the difference in name, that was an artifact of the test environment creation, the key thing is the primary key has a different column order. The DBAs had implemented the table wrong {I’m not blaming them, sometimes stuff just happens OK?}.
Now, it did not alter logical functionality as the Primary Key is on the same columns, but as the access to the table is on only the “leading” three columns of the primary key, if the columns are indexed in the wrong order then Oracle cannot access the index via range scans on those values! Unit testing on the obligatory 6 records had worked fine, but any volume of data revealed the issue.

I could not force my access plan as it was not possible – I had missed the screaming obvious.

So, next time you just “know” you should be able to get your database (or code, or whatever) to do something and it won’t do it, go have a cup of tea, think about your last holiday for 5 minutes and then go back to the desk and ask yourself – did I check that the most fundamental and obvious things are correct.

That is what I think is the key to solving what look like simple problems where you just can’t work it out. Try and think even simpler.

How Fast for £1,000 – Architecture August 5, 2010

Posted by mwidlake in Architecture, performance, Testing.
Tags: , , ,
7 comments

My previous post proposed the creation of “the fastest Oracle server for a grand”, or at least an investigation into what might be the fastest server. I’ve had some really good feedback {which I very much appreciate and am open to even more of}, so I think I’ll explore this further.

My initial ideas for the hardware configuration, written at the same time as the original post, were:

  • A single-chip, quad core intel core i5 or i7 processor (I would like two chips but the cost of multi-chip motherboards seems too high for my budget)
  • 8GB of memory as the best price point at present, but maybe push to 16GB
  • Multiple small, fast internal disks for storage, maybe expand via eSATA
  • backup to an external drive (cost not included in the budget).
  • USB3 and use of memory sticks for temp and online redo.
  • If budget will stretch, SSD disc for the core database components. like core tables, index tablespaces (who does that any more!).
    ASM or no ASM?
    If I run out of internal motherboard connections for storage, can I mix and match with USB3, external e-SATA or even GB ethernet?

As for the Oracle database considerations, I have a good few things I want to try out also. In the past (both distant and recent) I have had a lot of success in placing components of the database in specific locations. I refer to this as “Physical Implementation” {Physical Implementation, if I remember my old DB Design courses correctly, also includes things like partitioning, extent management, tablespace attributes – how you actually implement the tables, indexes and constraints that came from logical data design}.

Physically placing components like undo and redo logs on your fastest storage is old-hat but I think it gets overlooked a lot these days.
Placing of indexes and tables on different tablespaces on different storage is again an old and partially discredited practice, but I’d like to go back and have a new look at it. Again, I had some success with improved performance with this approach as little as 8 years ago but never got to rigorously test and document it. { As an aside, one benefit I have been (un)fortunate to gain from twice through putting tables and indexes in separate tablespaces is when a tablespace has been lost through file corruption – only for it to be an index tablespace, so I was able to just drop the tablespace and recreate the indexes.}

Then there is the use of clusters, IOTs, Bitmap indexes and Single Table Hash Clusters (are you reading this Piet?) which I want to explore again under 11.

I don’t think I am going to bother with mixed block sizes in one DB, I think you need very specialist needs to make it worth the overhead of managing the various caches and the fact that the CBO is not so great at accurately costing operations in non-standard block sizes {issues with the MBRC fudge factor being one}. But I think I will re-visit use of “keep” and “recycle” caches. For one thing, I want to show that they are just caches with a name and not special, by using the “Recycle” cache as the keep and the “keep” as a recycle cache.

Should I be using RAT for testing all of this? I said I was not going to use any special features beyond Enterprise edition but RAT could be jolly useful. But then I would need two servers. Is anyone willing to give me the other £1000 for it? I’d be ever so grateful! :-)

Friday Philosophy – Alternatives to Testing February 19, 2010

Posted by mwidlake in Friday Philosophy, Testing.
Tags: ,
5 comments

We all know that we should test everything before we do it on a production system.

Or should we? Well yes, you should, but sometimes there is no suitable test system (or even any test system, which seems to be becoming a more common problem) or the testing regime takes weeks and you need a fix right now.

So given that there really is a proper business reason why doing the testing is expensive {in effort or time} or impossible, what can you do? I believe there is an alternative.

To be able to reduce or remove the need for proper testing you need to be absolutely sure the following three things are true, in order of importance:

  1. Be able to monitor the live system for negative impact of your changes.
  2. Have a proven and quick rollout.
  3. Be confident that the chance of negative impact on your system is low likelihood and low impact.

Note that, in my opinion, the least important {yet still very important} consideration is that the change be almost certain to work and won’t cause too much grief if something unexpected happens.  However, this seems to be the main concern when people are considering a “Quick Fix” on a production system. “Is it likely to go wrong? Is it going to matter if it does not help?”. At this point all eyes tend to swivel to the DBA or the Lead Developer who has to go on what I call DBA Instinct. Given your experience to date and what you know about how the database works, give it your best guess.

The problem with DBA Instinct is that if you are not very experienced you are unlikely to get it right. If you are very experienced, most of that experience was on earlier version of Oracle and you probably do little hands-on work now as you are a manger. So you are still likely unlikely to get it right :-)

So, I argue that point 1 (spotting things going wrong) and 2 (Getting back to where you started) are most important.

Let us take the classic example of just adding an index as our untested change (as we have no system capable of generating a realistic workload and thus the impact across the board).

3) Most of us would agree that adding an index is easy and the impact is low and the chance of plans going terribly slow by using the new index unsuitably are low…

2) Rollback is relatively simple, you drop the index again. Question, can you just drop an index on a table if a SQL statement is in-flight and using that index? You know, a SQL statement that the CBO has decided is going to be useful and, despite your DBA Instinct that nothing is likely to go wrong, now takes 20 seconds to run rather than 0.2 seconds. And it is being run all the time on your OLTP system so now there is no time when there are not 2 or 3 queries in flight.

3) Verifying impact. This point is vital and can be tricky. You can just rely on staring at your monitoring tool and hoping any serious impact shows up so you can catch it and do something about it. Or people ring up and complain. That would not be so good, you have impacted the business. 

I would pull out code from the SGA that is referencing the table to be indexed before you start (I’ll stick a bit of code at the bottom for doing this). If the list is not too large, you can do the same check after the code has gone in and compare changes to buffer gets and disk reads per execution.

A more targeted check is to pull out of v$SQL_PLAN anything that starts using the new index and check to make sure it is not now a problem.

A second example, I want to gather system statistics as I am sure the system does not appreciate how blindingly fast my disks are for multi block reads.

3) Impact? Well, I would suggest the potential impact is wide and across the board, else why do it?

2) Rollback? Delete the system statistics. What happens when you delete the system statistics? Are they all set to null? Are the defaults put back in place? Could I use DBMS_STATS.RESTORE_SYSTEM_STATS? What is the impact of any of them? {RESTORE appears to work reliably and, if you delete the system stats, you get the “out of the box” values again, which could be very different to what you had before your spur-of-the-moment gather}

1) How to assess negative impact? You are going to have to monitor the whole system and hope to pick up any code that starts running slowly before it impacts the business too much.

I would probably add the index but I would need to do at least some testing and proof of rollback before I gather system statistics. And I would only do so without proper testing if said proper testing was impossible due to a lack of test systems.

So, I do think you can implement changes without testing or with reduced testing, but it is not always the easy option.

I also think it is a valid (and often quicker, in terms of elapsed time) way of changing the productions system without going through full unit/integration/regression/NFR testing.

Now you just need to persuade the Change Advisory Board that it is OK to do the change. Good luck :-)

Oh, that code:-

-- chk_sga_txt.sql
-- Martin Widlake - look for sql in sga containing the provided txt
--
set pause on
set pages 36
col first_load_time form A20
spool chk_sga_txt
select
first_load_time
,parse_calls prse
,executions  excs
,buffer_gets buffs
,disk_reads   discs
,rows_processed rws
--,address        address
,hash_value     hash_value
,sql_id
--,plan_hash_value
,sql_text
from v$sqlarea
--where parsing_schema_id !='0'
where upper(sql_text) like upper('%'||nvl('&sql_txt','whoops')||'%')
and rownum <50
--order by first_load_time desc
order by (greatest(buffer_gets,1)/greatest(rows_processed,1)) desc
/
spool off

Turning on SQL Audit February 2, 2010

Posted by mwidlake in development, performance, Testing.
Tags: , , ,
6 comments

<Previous post…

I want to test out using the SQL AUDIT functionality.

The first thing you have to do (having read up on it a bit) is to enable it for the database. You do this by setting the initialization parameter AUDIT_TRAIL. On version 10.2 you have the options to write the audit entires to:

  •  The DB, by setting it to DB or DB_EXTENDED {not DB,EXTENDED as the manual says, that is the old format}. This puts the entries into the table  SYS.AUD$
  • The operating system, by setting it to OS, XML or XML_EXTENDED. If you set it to XML or XML_EXTENDED then the data is written out in XML format. You also optionally set AUDIT_FILE_DEST to where you want to data to be written to.

Writing the audit trail to the OS is potentially more secure, as you can stop those cunning and devious DBAs messing with the audit trail. {I’m not so sure that this really helps that much – if anyone knows of any DBAs caught out being naughty solely as a result of using the OS to store the SQL AUDIT records, it would be a fascinating comment}

I want to write to the DB as I want to be able to get at the audit data easily and I am not sure how I want to interrogate it. I’m faster with SQL than SED and AWK.

I also decided up front I wanted to use DB_EXTENDED so that the triggering SQL statement and all bind variables are caught, so I can see more about what it triggering the audit record. I am cautious of the impact of storing CLOBs though, which these two values are stored as. I’ve had performance issues moving lots of CLOBS around and I know from some old colleagues that Secure Files are a lot faster. If Secure Files are faster, that means CLOBs are slower :-). If the audit trail seems to add too much burden on my system, swapping back to just DB will be my first step.

Now for the bad news. You can’t just turn on AUDIT. That initialization parameter is not dynamic. You can’t even enable it for your session. It will need a restart of your database.

This tells me something. Oracle needs to do some setting up for SQL AUDIT when it starts the instance. Either start a new process, enable functionality in one of it’s regular processes or set up structures in memory to cope. Or a mixture thereof. I strongly suspect the need for memory structures {but this is only because, in reality, I have done some testing and I am writing this up afterwards}.

I should not really need to say this but DON’T go turning this on for a production system without extensive testing somewhere else first. There is not a lot “Out There” about the details of the performance impact of AUDIT but the general opinion is there is some; and that is reasonable given it is going to write database records for every action audited. Also, you have no idea yet of any knock-on effects. You know, things you did not expect that causes your database to lock or crash and you to get fired.

{Question, what happens if you alter the initialization file and restart only one node of a RAC database? I don’t know and so I should test that. My current test system is not RAC, but the final destination for this stuff is RAC}.

You probably also want to check that no one has gone and tried turning on SQL AUDIT on things already. You never know if someone else decided to have a play with this and issued a load of AUDIT statements only to find nothing happened – and left what they did in place “as nothing happened”. I already know of one example of this happening…

Here is a little script I knocked up to see what is currently set to be audited:

-- what_is_audited.sql
-- Martin Widlake 11/01/10
-- simple listing of what auditing is currently set
set pages 100
set pause on
spool what_is_audited.lst
select * from dba_priv_audit_opts
order by user_name,privilege
/
select * from sys.dba_stmt_audit_opts
order by user_name,audit_option
/
select * from DBA_OBJ_AUDIT_OPTS
order by owner,object_name
/
spool off
clear col
--
-- EOF
--

And some sample output. I’m not going to explain it in this post, but you can have a look though it.

DEV3&gt; @what_is_audited
USER_NAME                      PROXY_NAME
------------------------------ ------------------------------
PRIVILEGE                                SUCCESS    FAILURE
---------------------------------------- ---------- ----------
MDW1
ALTER SYSTEM                             BY ACCESS  BY ACCESS
MDW1
AUDIT SYSTEM                             BY ACCESS  BY ACCESS
MDW1
CREATE SESSION                           BY ACCESS  BY ACCESS

ALTER SYSTEM                             BY ACCESS  BY ACCESS

AUDIT SYSTEM                             BY ACCESS  BY ACCESS

CREATE SESSION                           BY ACCESS  BY ACCESS

6 rows selected.

USER_NAME                      PROXY_NAME
------------------------------ ------------------------------
AUDIT_OPTION                             SUCCESS    FAILURE
---------------------------------------- ---------- ----------
MDW1
ALTER SYSTEM                             BY ACCESS  BY ACCESS
MDW1
CLUSTER                                  BY ACCESS  BY ACCESS
MDW1
CONTEXT                                  BY ACCESS  BY ACCESS
MDW1
CREATE SESSION                           BY ACCESS  BY ACCESS
MDW1
DATABASE LINK                            BY ACCESS  BY ACCESS
MDW1
DELETE TABLE                             BY ACCESS  BY ACCESS
MDW1
...
TYPE                                     BY ACCESS  BY ACCESS
MDW1
UPDATE TABLE                             BY ACCESS  BY ACCESS
MDW1
USER                                     BY ACCESS  BY ACCESS
MDW1
VIEW                                     BY ACCESS  BY ACCESS

ALTER SYSTEM                             BY ACCESS  BY ACCESS

CLUSTER                                  BY ACCESS  BY ACCESS

CONTEXT                                  BY ACCESS  BY ACCESS

CREATE SESSION                           BY ACCESS  BY ACCESS

DATABASE LINK                            BY ACCESS  BY ACCESS

DIMENSION                                BY ACCESS  BY ACCESS

DIRECTORY                                BY ACCESS  BY ACCESS

INDEX                                    BY ACCESS  BY ACCESS

MATERIALIZED VIEW                        BY ACCESS  BY ACCESS
...
USER                                     BY ACCESS  BY ACCESS

VIEW                                     BY ACCESS  BY ACCESS

56 rows selected.

OWNER                          OBJECT_NAME                    OBJECT_TYPE
------------------------------ ------------------------------ --------------
ALT     AUD     COM     DEL     GRA     IND     INS     LOC     REN     SEL
------- ------- ------- ------- ------- ------- ------- ------- ------- ----
UPD     REF EXE     CRE     REA     WRI     FBK
------- --- ------- ------- ------- ------- -------
MWPERF                         FORN_M_SEQ                     SEQUENCE
-/-     -/-     -/-     -/-     -/-     -/-     -/-     -/-     -/-     A/A
-/-     -/- -/-     -/-     -/-     -/-     -/-
MWPERF                         PERSON                         TABLE
A/A     A/A     A/A     A/A     A/A     A/A     A/A     A/A     A/A     A/A
A/A     -/- -/-     -/-     -/-     -/-     A/A
MWPERF                         ROAD_TYPE                      TABLE
-/-     -/-     -/-     A/A     -/-     -/-     A/A     -/-     -/-     A/A
A/A     -/- -/-     -/-     -/-     -/-     -/-

If you discover you have a lot of things set to be audited, ESPECIALLY if they are auditing select access, think about turning some or all of it off before you enable AUDITING by setting that initialization parameter.

Once you have turned on the feature, you can start testing it…

Follow

Get every new post delivered to your Inbox.

Join 156 other followers