jump to navigation

Exclusion of Unioned SQL in Views – Followup August 20, 2012

Posted by mwidlake in database design, performance, SQL.
Tags: , ,
add a comment

Last week I put up a post about how Oracle can filter out sections of a union view..

Within the comments I put up another example where the CBO did not filter out all but one of the Union views despite my replicating the exact WHERE clause of one of the unioned statements. Jonathan Lewis posted a followup to say “Oracle can be very touchy about how it plays this game” and made a prediction of how the CBO would handle a slightly different scenario.

This was the extra scenario and I include brief details on creating the unioned view too. NB all on Oracle 11.2.0.2. {non-Exadata:-) }

  1  create table mdw_source
  2  tablespace users
  3  as select * from dba_objects

  1  create or replace view mdw_v1 as
  2  select
  3    '1'              src_type
  4   ,ms.OWNER ,ms.OBJECT_NAME,ms.SUBOBJECT_NAME ,ms.CREATED
  5  from mdw_source ms
  6  where ms.object_name like 'A%'
  7  union all
  8  select
  9    '2'              src_type
 10   ,ms.OWNER ,ms.OBJECT_NAME,ms.SUBOBJECT_NAME ,ms.CREATED
 11  from mdw_source ms
 12  where ms.object_name like 'B%'
 13  UNION ALL
 14  select
 15    '3'              src_type
 16   ,ms.OWNER ,ms.OBJECT_NAME,ms.SUBOBJECT_NAME ,ms.CREATED
 17  from mdw_source ms
 18  where ms.object_name like 'C%'

1 select count(created)
2 from mdw_v1 mv
3 where mv.object_name like ‘C%’

Execution Plan
———————————————————-
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
———————————————————————————–
| 0 | SELECT STATEMENT | | 1 | 75 | 551 (1)| 00:00:07 |
| 1 | SORT AGGREGATE | | 1 | 75 | | |
| 2 | VIEW | MDW_V1 | 824 | 61800 | 551 (1)| 00:00:07 |
| 3 | UNION-ALL | | | | | |
|* 4 | TABLE ACCESS FULL| MDW_SOURCE | 1 | 34 | 184 (1)| 00:00:03 |
|* 5 | TABLE ACCESS FULL| MDW_SOURCE | 2 | 68 | 184 (1)| 00:00:03 |
|* 6 | TABLE ACCESS FULL| MDW_SOURCE | 821 | 27914 | 184 (1)| 00:00:03 |
———————————————————————————–

Predicate Information (identified by operation id):
—————————————————
4 – filter(“MS”.”OBJECT_NAME” LIKE ‘A%’ AND “MS”.”OBJECT_NAME” LIKE ‘C%’)
5 – filter(“MS”.”OBJECT_NAME” LIKE ‘B%’ AND “MS”.”OBJECT_NAME” LIKE ‘C%’)
6 – filter(“MS”.”OBJECT_NAME” LIKE ‘C%’)

Statistics
———————————————————-
8 recursive calls
0 db block gets
2417 consistent gets 

So, as you can see, despite me providing the WHERE clause in my final select as “where mv.object_name like ‘C%’” and the WHERE clauses of the unioned view are all similar and one is “ms.object_name like ‘C%’” the CBO fails to realise it can exclude all but one of the unioned SQL statements – despite the mutally exclusive filter predicates:

4 – filter(“MS”.”OBJECT_NAME” LIKE ‘A%’ AND “MS”.”OBJECT_NAME” LIKE ‘C%’)
5 – filter(“MS”.”OBJECT_NAME” LIKE ‘B%’ AND “MS”.”OBJECT_NAME” LIKE ‘C%’)

I have to confess, in light of recent experience I would have wondered if these filter predicates would prevent the other two SQL statements being fired at execution time but the number of consistent gets confirms that it is not – 2417 compared to 805 or so for when the query exclusion occurs. You can check back to the prior post to verify that.

So, let’s do the tests Jonathan suggested. He suggested altering the view to use equality predicates so I have recreated the view as such:

TEST_DB> create or replace view mdw_v1 as
  2  select
  3    '1'              src_type
  4    ,ms.OWNER ,ms.OBJECT_NAME ,ms.SUBOBJECT_NAME ,ms.CREATED
  5  from mdw_source ms
  6  where substr(ms.object_name,1,1) ='A'
  7  union all
  8  select
  9    '2'              src_type
 10   ,ms.OWNER ,ms.OBJECT_NAME ,ms.SUBOBJECT_NAME ,ms.CREATED
 11  from mdw_source ms
 12  where substr(ms.object_name,1,1) ='B'
 13  UNION ALL
 14  select
 15    '3'              src_type
 16   ,ms.OWNER ,ms.OBJECT_NAME ,ms.SUBOBJECT_NAME ,ms.CREATED
 17  from mdw_source ms
 18  where substr(ms.object_name,1,1) ='C'

View created.

TEST_DB> set autotrace on
TEST_DB> --
TEST_DB> -- below will do a select of a non-indexed column across the view
TEST_DB> select count(created) from mdw_v1

COUNT(CREATED)
--------------
          1288

Execution Plan
-----------------------------------------------------------------------------------
| Id  | Operation            | Name       | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |            |     1 |     9 |   552   (1)| 00:00:07 |
|   1 |  SORT AGGREGATE      |            |     1 |     9 |            |          |
|   2 |   VIEW               | MDW_V1     |  1677 | 15093 |   552   (1)| 00:00:07 |
|   3 |    UNION-ALL         |            |       |       |            |          |
|*  4 |     TABLE ACCESS FULL| MDW_SOURCE |   559 | 19006 |   184   (1)| 00:00:03 |
|*  5 |     TABLE ACCESS FULL| MDW_SOURCE |   559 | 19006 |   184   (1)| 00:00:03 |
|*  6 |     TABLE ACCESS FULL| MDW_SOURCE |   559 | 19006 |   184   (1)| 00:00:03 |
-----------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
   4 - filter(SUBSTR("MS"."OBJECT_NAME",1,1)='A')
   5 - filter(SUBSTR("MS"."OBJECT_NAME",1,1)='B')
   6 - filter(SUBSTR("MS"."OBJECT_NAME",1,1)='C')

Statistics
----------------------------------------------------------
          8  recursive calls
          0  db block gets
       2417  consistent gets


TEST_DB> -- below will do a select of a non-indexed column
TEST_DB> -- specific to one of the Union's in the view via the src_type
TEST_DB> select count(created)
  2  from mdw_v1
  3  where src_type='3'

COUNT(CREATED)
--------------
           129

Execution Plan
------------------------------------------------------------------------------------
| Id  | Operation             | Name       | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |            |     1 |    12 |   184   (1)| 00:00:03 |
|   1 |  SORT AGGREGATE       |            |     1 |    12 |            |          |
|   2 |   VIEW                | MDW_V1     |   561 |  6732 |   184   (1)| 00:00:03 |
|   3 |    UNION-ALL          |            |       |       |            |          |
|*  4 |     FILTER            |            |       |       |            |          |
|*  5 |      TABLE ACCESS FULL| MDW_SOURCE |   559 | 19006 |   184   (1)| 00:00:03 |
|*  6 |     FILTER            |            |       |       |            |          |
|*  7 |      TABLE ACCESS FULL| MDW_SOURCE |   559 | 19006 |   184   (1)| 00:00:03 |
|*  8 |     TABLE ACCESS FULL | MDW_SOURCE |   559 | 19006 |   184   (1)| 00:00:03 |
------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
   4 - filter(NULL IS NOT NULL)
   5 - filter(SUBSTR("MS"."OBJECT_NAME",1,1)='A')
   6 - filter(NULL IS NOT NULL)
   7 - filter(SUBSTR("MS"."OBJECT_NAME",1,1)='B')
   8 - filter(SUBSTR("MS"."OBJECT_NAME",1,1)='C')

Statistics
----------------------------------------------------------
          8  recursive calls
          0  db block gets
        807  consistent gets


TEST_DB> -- below will do a select of a non-indexed column
TEST_DB> -- specific to one of the Union's in the view via matching the where clause
TEST_DB> select count(created)
  2  from mdw_v1
  3  where substr(object_name,1,1) ='C'

COUNT(CREATED)
--------------
           129

Execution Plan
------------------------------------------------------------------------------------
| Id  | Operation             | Name       | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |            |     1 |    75 |   184   (1)| 00:00:03 |
|   1 |  SORT AGGREGATE       |            |     1 |    75 |            |          |
|   2 |   VIEW                | MDW_V1     |   561 | 42075 |   184   (1)| 00:00:03 |
|   3 |    UNION-ALL          |            |       |       |            |          |
|*  4 |     FILTER            |            |       |       |            |          |
|*  5 |      TABLE ACCESS FULL| MDW_SOURCE |   559 | 19006 |   184   (1)| 00:00:03 |
|*  6 |     FILTER            |            |       |       |            |          |
|*  7 |      TABLE ACCESS FULL| MDW_SOURCE |   559 | 19006 |   184   (1)| 00:00:03 |
|*  8 |     TABLE ACCESS FULL | MDW_SOURCE |   559 | 19006 |   184   (1)| 00:00:03 |
------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
   4 - filter(NULL IS NOT NULL)
   5 - filter(SUBSTR("MS"."OBJECT_NAME",1,1)='A')
   6 - filter(NULL IS NOT NULL)
   7 - filter(SUBSTR("MS"."OBJECT_NAME",1,1)='B')
   8 - filter(SUBSTR("MS"."OBJECT_NAME",1,1)='C')

Statistics
----------------------------------------------------------
          8  recursive calls
          0  db block gets
        807  consistent gets


TEST_DB> -- below will do a select of a non-indexed column
TEST_DB> -- specific to one of the Union's in the view via NOT matching the where clause
TEST_DB> -- but maybe logical exclusion can be managed
TEST_DB> select count(created)
  2  from mdw_v1
  3  where object_name ='C'

COUNT(CREATED)
--------------
             0

Execution Plan
-----------------------------------------------------------------------------------
| Id  | Operation            | Name       | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |            |     1 |    75 |   551   (1)| 00:00:07 |
|   1 |  SORT AGGREGATE      |            |     1 |    75 |            |          |
|   2 |   VIEW               | MDW_V1     |     3 |   225 |   551   (1)| 00:00:07 |
|   3 |    UNION-ALL         |            |       |       |            |          |
|*  4 |     TABLE ACCESS FULL| MDW_SOURCE |     1 |    34 |   184   (1)| 00:00:03 |
|*  5 |     TABLE ACCESS FULL| MDW_SOURCE |     1 |    34 |   184   (1)| 00:00:03 |
|*  6 |     TABLE ACCESS FULL| MDW_SOURCE |     1 |    34 |   184   (1)| 00:00:03 |
-----------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
   4 - filter("MS"."OBJECT_NAME"='C' AND
              SUBSTR("MS"."OBJECT_NAME",1,1)='A')
   5 - filter("MS"."OBJECT_NAME"='C' AND
              SUBSTR("MS"."OBJECT_NAME",1,1)='B')
   6 - filter("MS"."OBJECT_NAME"='C' AND
              SUBSTR("MS"."OBJECT_NAME",1,1)='C')

Statistics
----------------------------------------------------------
          8  recursive calls
          0  db block gets
       2417  consistent gets

TEST_DB> -- below will do a select of a non-indexed column
TEST_DB> -- specific to one of the Union's in the view via NOT matching the where clause
TEST_DB> -- but logically it is the same.
TEST_DB> select count(created)
  2  from mdw_v1
  3  where object_name like 'C%'

COUNT(CREATED)
--------------
           129

Execution Plan
-----------------------------------------------------------------------------------
| Id  | Operation            | Name       | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |            |     1 |    75 |   551   (1)| 00:00:07 |
|   1 |  SORT AGGREGATE      |            |     1 |    75 |            |          |
|   2 |   VIEW               | MDW_V1     |    24 |  1800 |   551   (1)| 00:00:07 |
|   3 |    UNION-ALL         |            |       |       |            |          |
|*  4 |     TABLE ACCESS FULL| MDW_SOURCE |     8 |   272 |   184   (1)| 00:00:03 |
|*  5 |     TABLE ACCESS FULL| MDW_SOURCE |     8 |   272 |   184   (1)| 00:00:03 |
|*  6 |     TABLE ACCESS FULL| MDW_SOURCE |     8 |   272 |   184   (1)| 00:00:03 |
-----------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
   4 - filter("MS"."OBJECT_NAME" LIKE 'C%' AND
              SUBSTR("MS"."OBJECT_NAME",1,1)='A')
   5 - filter("MS"."OBJECT_NAME" LIKE 'C%' AND
              SUBSTR("MS"."OBJECT_NAME",1,1)='B')
   6 - filter("MS"."OBJECT_NAME" LIKE 'C%' AND
              SUBSTR("MS"."OBJECT_NAME",1,1)='C')

Statistics
----------------------------------------------------------
          8  recursive calls
          0  db block gets
       2417  consistent gets

Above, I create the view, do a quick select count(*) to cause a full scan of all three section in the union view, then do selects using various WHERE predicates to see what sections of the unioned SQl statements are excluded.

The results are very intersting. The filtering on sub_type still works fine, but now the WHERE predicate “where substr(object_name,1,1) =’C’” allows the CBO to exclude sections of the union, the “Null is not null” filters appear and the consistent gets is 807. The WHERE predicate “where object_name =’C'” which at first glance you might think to be logically the same but is not (we selected object_names where only the first character is checked but the whole object_name is selected) acts does not do any exclusion and I thought it might. There are no OBJECT_NAMES of ‘C’ but logically as a human you can see that such an object name could not be found by the first two unioned SQL statements in any case.

The last test is most interesting. The WHERE clause of the SQL select over the view is “where object_name like ‘C%’” and is logically the same as the one WHERE clause in the unioned view “where substr(object_name,1,1) =’C’“. So no exclusion occurs and again we see 2417 consistent gets.

It would seem the CBO Pixies either have not coded for that or, as Jonathan says Oracle can be “touchy” about this feature. Bottom line, if you are expecting this sort of unioned view SQL exclusion to occur – TEST!

Exclusion of Unioned SQL in Views? August 16, 2012

Posted by mwidlake in performance, SQL.
Tags: , , ,
13 comments

Question – you have a view definition of the following format:

select 1 as d_type, col_a,col_b,col_c
from TAB_X, TAB_Y, TAB_Z
where {your predicates}
UNION
select 2 as d_type, col_a,col_b,col_c
from TAB_P, TAB_Q, TAB_R
where {your predicates}
UNION
select 3 as d_type, col_a,col_b,col_c
from TAB_X X, TAB_Y, TAB_Z
where {your predicates}

You now select from the view and your code includes a filter predicate such as:

“WHERE D_TYPE = 1”

What will the CBO do? Is it intelligent enough to do what a human would do and exclude the parts of the query that could not possibly return data you want? Ie exclude the SQL statements where d_type is set to 2 and 3. Or is that funcitonality not built into the CBO? After all, it is only software and someone would have to decide that there could be benefit of looking into set of unioned SQL selects to see if any could be excluded and then code that into the CBO.

After all, the CBO is not smart enough to realise it could easily satisfy “select min(col_a), max(col_a) from tab x” via two fast lookups on a supporting index. It scans the index.

So – what is your guess?

I asked 3 people. Not a massive straw poll but I got one vote for it will not exclude, one for it will and one smart answer of “I would expect it to exclude- but as you asked the question I am thinking it does not”.

Let’s do the test.

DB_TEST> @demo_union_view_2
DB_TEST> drop table mdw_source purge;
Table dropped.

  1  create table mdw_source
  2  tablespace users
  3  as select * from dba_objects
  4  /
Table created.

  1  create index mdso_idx on mdw_source(owner,object_name)
  2  tablespace users
  3  /
Index created.

DB_TEST> exec dbms_stats.gather_table_stats(ownname=>USER,tabname=>'MDW_SOURCE',method_opt=>'for all columns size 1')
PL/SQL procedure successfully completed.

  1  select table_name,num_rows from dba_tab_statistics
  2  where owner=USER and table_name='MDW_SOURCE';

TABLE_NAME      NUM_ROWS
--------------- ----------
MDW_SOURCE      55895

  1  create or replace view mdw_v1 as
  2  select
  3    '1'              src_type
  4   ,ms.OWNER
  5   ,ms.OBJECT_NAME
  6   ,ms.SUBOBJECT_NAME
  7   ,ms.CREATED
  8  from mdw_source ms
  9  where ms.object_name like 'A%'
 10  union all
 11  select
 12    '2'              src_type
 13   ,ms.OWNER
 14   ,ms.OBJECT_NAME
 15   ,ms.SUBOBJECT_NAME
 16   ,ms.CREATED
 17  from mdw_source ms
 18  where ms.object_name like 'B%'
 19  UNION ALL
 20  select
 21    '3'              src_type
 22   ,ms.OWNER
 23   ,ms.OBJECT_NAME
 24   ,ms.SUBOBJECT_NAME
 25   ,ms.CREATED
 26  from mdw_source ms
 27  where ms.object_name like 'C%'
 28  /
View created.

So, a simple table is created from DBA_OBJECTS and a view over the top of the table consisting on 3 simple selects unioned together, with a static value (1, 2 or 3) generated for each of the 3 parts as SRC_TYPE

Now let’s run some tests:

DB_TEST> set autotrace on
DB_TEST> --
DB_TEST> -- below will do a select of a non-indexed column across the view
DB_TEST> select count(created) from mdw_v1
  2  /

COUNT(CREATED)
--------------
          1284

Execution Plan
----------------------------------------------------------
Plan hash value: 4259489107
-----------------------------------------------------------------------------------
| Id  | Operation            | Name       | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |            |     1 |     9 |   551   (1)| 00:00:07 |
|   1 |  SORT AGGREGATE      |            |     1 |     9 |            |          |
|   2 |   VIEW               | MDW_V1     |  2463 | 22167 |   551   (1)| 00:00:07 |
|   3 |    UNION-ALL         |            |       |       |            |          |
|*  4 |     TABLE ACCESS FULL| MDW_SOURCE |   821 | 27914 |   184   (1)| 00:00:03 |
|*  5 |     TABLE ACCESS FULL| MDW_SOURCE |   821 | 27914 |   184   (1)| 00:00:03 |
|*  6 |     TABLE ACCESS FULL| MDW_SOURCE |   821 | 27914 |   184   (1)| 00:00:03 |
-----------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   4 - filter("MS"."OBJECT_NAME" LIKE 'A%')
   5 - filter("MS"."OBJECT_NAME" LIKE 'B%')
   6 - filter("MS"."OBJECT_NAME" LIKE 'C%')

Statistics
----------------------------------------------------------
          8  recursive calls
          0  db block gets
       2417  consistent gets


DB_TEST> -- below will do a select of a non-indexed column across the view
DB_TEST> -- limiting it to "one" of the types
DB_TEST> select count(created)
  2  from mdw_v1
  3  where src_type='3'
  4  /

COUNT(CREATED)
--------------
           129


Execution Plan
----------------------------------------------------------
Plan hash value: 1188403912
------------------------------------------------------------------------------------
| Id  | Operation             | Name       | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |            |     1 |    12 |   184   (1)| 00:00:03 |
|   1 |  SORT AGGREGATE       |            |     1 |    12 |            |          |
|   2 |   VIEW                | MDW_V1     |   823 |  9876 |   184   (1)| 00:00:03 |
|   3 |    UNION-ALL          |            |       |       |            |          |
|*  4 |     FILTER            |            |       |       |            |          |
|*  5 |      TABLE ACCESS FULL| MDW_SOURCE |   821 | 27914 |   184   (1)| 00:00:03 |
|*  6 |     FILTER            |            |       |       |            |          |
|*  7 |      TABLE ACCESS FULL| MDW_SOURCE |   821 | 27914 |   184   (1)| 00:00:03 |
|*  8 |     TABLE ACCESS FULL | MDW_SOURCE |   821 | 27914 |   184   (1)| 00:00:03 |
------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   4 - filter(NULL IS NOT NULL)
   5 - filter("MS"."OBJECT_NAME" LIKE 'A%')
   6 - filter(NULL IS NOT NULL)
   7 - filter("MS"."OBJECT_NAME" LIKE 'B%')
   8 - filter("MS"."OBJECT_NAME" LIKE 'C%')

Statistics
----------------------------------------------------------
          8  recursive calls
          0  db block gets
        807  consistent gets


DB_TEST> -- below will do a select of an indexed column across the view
DB_TEST> select count(owner) from mdw_v1
  2  /

COUNT(OWNER)
------------
        1284


Execution Plan
----------------------------------------------------------
Plan hash value: 4219520050
------------------------------------------------------------------------------------
| Id  | Operation               | Name     | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT        |          |     1 |    17 |   221   (1)| 00:00:03 |
|   1 |  SORT AGGREGATE         |          |     1 |    17 |            |          |
|   2 |   VIEW                  | MDW_V1   |  2463 | 41871 |   221   (1)| 00:00:03 |
|   3 |    UNION-ALL            |          |       |       |            |          |
|*  4 |     INDEX FAST FULL SCAN| MDSO_IDX |   821 | 26272 |    74   (2)| 00:00:01 |
|*  5 |     INDEX FAST FULL SCAN| MDSO_IDX |   821 | 26272 |    74   (2)| 00:00:01 |
|*  6 |     INDEX FAST FULL SCAN| MDSO_IDX |   821 | 26272 |    74   (2)| 00:00:01 |
------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   4 - filter("MS"."OBJECT_NAME" LIKE 'A%')
   5 - filter("MS"."OBJECT_NAME" LIKE 'B%')
   6 - filter("MS"."OBJECT_NAME" LIKE 'C%')

Statistics
----------------------------------------------------------
          8  recursive calls
          0  db block gets
        989  consistent gets


DB_TEST> --
DB_TEST> -- below will do a select of an indexed column across the view
DB_TEST> -- limiting it to "one" of the types
DB_TEST> select count(owner)
  2  from mdw_v1
  3  where src_type='3'
  4  /

COUNT(OWNER)
------------
         129


Execution Plan
----------------------------------------------------------
Plan hash value: 815942527
-------------------------------------------------------------------------------------
| Id  | Operation                | Name     | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT         |          |     1 |    20 |    74   (2)| 00:00:01 |
|   1 |  SORT AGGREGATE          |          |     1 |    20 |            |          |
|   2 |   VIEW                   | MDW_V1   |   823 | 16460 |    74   (2)| 00:00:01 |
|   3 |    UNION-ALL             |          |       |       |            |          |
|*  4 |     FILTER               |          |       |       |            |          |
|*  5 |      INDEX FAST FULL SCAN| MDSO_IDX |   821 | 26272 |    74   (2)| 00:00:01 |
|*  6 |     FILTER               |          |       |       |            |          |
|*  7 |      INDEX FAST FULL SCAN| MDSO_IDX |   821 | 26272 |    74   (2)| 00:00:01 |
|*  8 |     INDEX FAST FULL SCAN | MDSO_IDX |   821 | 26272 |    74   (2)| 00:00:01 |
-------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   4 - filter(NULL IS NOT NULL)
   5 - filter("MS"."OBJECT_NAME" LIKE 'A%')
   6 - filter(NULL IS NOT NULL)
   7 - filter("MS"."OBJECT_NAME" LIKE 'B%')
   8 - filter("MS"."OBJECT_NAME" LIKE 'C%')

Statistics
----------------------------------------------------------
          8  recursive calls
          0  db block gets
        331  consistent gets


DB_TEST> -- below will select against the undelying object we are interested in
DB_TEST> select count(created)
  2  from mdw_source ms
  3  where ms.object_name like 'C%'
  4  /

COUNT(CREATED)
--------------
           129


Execution Plan
----------------------------------------------------------
Plan hash value: 152094671

---------------------------------------------------------------------------------
| Id  | Operation          | Name       | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |            |     1 |    34 |   184   (1)| 00:00:03 |
|   1 |  SORT AGGREGATE    |            |     1 |    34 |            |          |
|*  2 |   TABLE ACCESS FULL| MDW_SOURCE |   821 | 27914 |   184   (1)| 00:00:03 |
---------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("MS"."OBJECT_NAME" LIKE 'C%')

Statistics
----------------------------------------------------------
          1  recursive calls
          0  db block gets
        805  consistent gets

Look at the plans at the start.

When my query does not include SRC_TYPE,ie the first SQL statement, the plan is simple – it scans the table three times, estimates 821 rows per scan and a cost of 184 for each step. When I include the SRC_TYPE = ‘3’ in the next step we see that the plan still includes all three full table scans, each still estimated at a cost of 184.
When you provide a predicate that identifies the one sql statement that you want to run, it still runs all three. I tried a few further tests and could not force the exclusion of the “unwanted” queries from the plan.

I showed my friend above (the one who suggested it would not work as I had asked the question) the results and we agreed that it was a little unfortunate that the Oracle Optimizer Pixies had not coded in the ability to exclude unwanted Unioned statements, but we could understand it.

However, for those of you thinking “you are missing something” – you are right. {And, to be fair to my friend, my example was not quite so simplified as what I have posted here}.

I decided to quantify the level of the impact and so shifted my concentration to the buffer gets. Look at the first and second statements again. The cost of the total statement drops, from 551 to 184. The actual consistent gets resulting from executing the SQL is dropping from 2417 to 807. They are both dropping dramatically to approximately 1/3.

Now check out the filter predicates at step 4 and 6 – “NULL IS NOT NULL”. That will always be false. Everything “after” that statement in the plan will not be executed. The step in the plan is there and, as a step, has the same cost – but it is is not being executed.

The pattern is repeated when the simple SQL statement is on an indexed column and the index is being used (the next two statements) {and honesty forces me to admit I have included more output than I need for this example, as I wanted to confuse the issue slightly and maybe encourage you to not read the plans too carefully}.

So yes, the CBO IS excluding the individual unioned SQL statements based on the stated SRC_TYPE and the fixed value stated in each part of the union.

I’m quite impressed by that. Clever Oracle Optimizer Pixies.

To tie this back to reality, I initially looked at this as I am trying to tune some code that hides all the complexity of the application under one massive unioned uber-view over a set of 9 other complex, unioned views. The explain plan runs to close to 700 lines. I had utterly overlooked the filters and filter predicates in that – heck I missed it initially in my simple test:-) {which was more complex than in this post, in my defense}. Trying to work with a very slow statement based on nested, complex views that is used by pretty much the whole application is “fun”.

Also back in reality, the filter clause I see in my real code is along the lines of:

:B17 is null or is not null

This is as bind variable is being considerd – and that is actually one of the considerations back in this real world of the original issue. The level that the master query needs to run in is decided by this bind variable – otherwise the application developer would (should have?) just pointed at the sub-view they needed for that level. So the CBO does not know which chunk of the union’d code it can ignore until it sees that bind variable, so the plan shows what could be executed.

Of course, it could be argued that if you know the type at the time you query the database, rather than use the above union filtering trick to cut down your massively complex view to the part that is required, just query the sub view you need directly. A bit more code but a lot simpler system to look after.

It’s a nice SQL trick but I would keep it to simple examples!

Finaly, just for fun, here is the plan I’ve been trying to tune. As I said, close to 700 steps and only the highlighted grey section on the left is shown in more detail on the right.

Table High Water Mark and How Empty the Table Is April 30, 2012

Posted by mwidlake in performance, statistics.
Tags: , ,
1 comment so far

On a recent post about reducing the high water mark in a table by shrinking it someone asked if you could tell how much “space” there was in a table that could be recovered by compressing it.

Yes. As pseudo-code:

select number_of_rows*average_row_length. That is how much data there is.
Divided by the block size, to give how many blocks-worth of data you have.
Divided this by a “overhead” fudge factor to allow for wasted space (see later) of about 20%
Compare to the number of blocks used by the table.

Here is some code to do it, showing some extra data and how we calculate the effective percentage of space used (pct_spc_used). It runs against an example table ‘HWM’

select uts.blocks                                     blks_used
      ,uts.avg_space
      ,uts.num_rows
      ,uts.avg_row_len
      ,uts.empty_blocks                               empty_blks
      ,usse.blocks                                    alloc_blks
      ,greatest(uts.blocks,1)/greatest(usse.blocks,1) pct_hwm
      ,uts.num_rows*uts.avg_row_len                   data_in_bytes
      ,(uts.num_rows*uts.avg_row_len)/8192            data_in_blks
      ,((uts.num_rows*uts.avg_row_len)/8192)*1.25     mod_data_in_blks
      ,(((uts.num_rows*uts.avg_row_len)/8192)*1.25)/usse.blocks pct_spc_used
from user_tab_statistics uts
    ,user_segments       usse
where uts.table_name='HWM'
and   uts.table_name=usse.segment_name
/
--
 BLKS_USED  AVG_SPACE   NUM_ROWS AVG_ROW_LEN EMPTY_BLKS ALLOC_BLKS  PCT_HWM
---------- ---------- ---------- ----------- ---------- ---------- --------
DATA_IN_BYTES DATA_IN_BLKS MOD_DATA_IN_BLKS PCT_SPC_USED
------------- ------------ ---------------- ------------
      1630       1221      50000         213         34       1664   .97957
     10650000   1300.04883       1625.06104    .97659918

I am collecting the data from USER_TAB_STATISTICS and USER_SEGMENTS. For this code to work you must have reasonably good stats against the table.

I’d like to run through some of the columns I have selected:
First of all, columns AVG_SPACE and EMPTY_BLKS are not populated by dbms_stats.gather_table_stats.. They are populated by the deprecated ANALYZE command that you should NOT use to gather table stats since V10 came along. These columns are populated as I did an ANALYZE to get the data in there, as well as a dbms_stats.
Next, I collect BLOCKS_ALLOCATED from DBA_SEGMENTS {and for this demo I just ignored the potential for partitioned tables) and I compare this to the BLOCKS_USED to get the High Water Mark, as a percentage of the table. I do this as EMPTY_BLOCKS is set to zero if you have never used ANALYZE and, even if you did, unless you use this deprecated command all the time, the value will not change.
On the second line of output I calculate the DATA_IN_BYTES as a simple num_rows*avg_row_len, convert it into blocks {for simplicity I do not collect the block size, I know it is 8k}. I then apply my “Overhead” fudge factor. A block has a header, using around 100 bytes {I’ve not checked the exact figure for years}, pctfree can be varied but defaults to 10% and as only whole rows fit, then an average of half a row of space is empty in each “full” block. Thus I reduce the space available by 20-25%. In this case, 25% as my rows are large.
Finally, I compare this modified data volume to the used blocks to get the actual space

Below I run through creating some test data, looking at the stats and my calculated High Water Mark and pct_space_used and finally shrink my table to see if my guesstimate is a reasonable guesstimate:

populate table

drop table hwm purge;
prompt populate table
set feed on
create table hwm
(id   number(10)
,num1 number(2)
,vc1  varchar2(100)
,vc2  varchar2(100)
)
/
insert into hwm 
select rownum
,trunc(dbms_random.value(1,100))
,lpad('A',100,'A')
,lpad('B',100,'B')
from dual connect by level < 50001
/
50000 rows created.

exec dbms_stats.gather_table_stats(ownname=>user,tabname=>'HWM')

--where is the HWM compared to total segment size

 BLKS_USED  AVG_SPACE   NUM_ROWS AVG_ROW_LEN EMPTY_BLKS ALLOC_BLKS  PCT_HWM
---------- ---------- ---------- ----------- ---------- ---------- --------
DATA_IN_BYTES DATA_IN_BLKS MOD_DATA_IN_BLKS PCT_SPC_USED
------------- ------------ ---------------- ------------
      1630          0      50000         210          0       1664   .97957
     10500000   1281.73828       1602.17285   .962844262

NB AVG_SPC and EMPTY_BLKS are NULL. 
The high water mark is 1630 blocks out of 1664 in the segment
My calculated PCT_SPC_USED is 96%. That is probably close enough.
{remember, the last used block will be only partly used, accounting for a bit of the difference}


-- I will use ANALYZE to fill the missing columns
analyze table hwm compute statistics;

 BLKS_USED  AVG_SPACE   NUM_ROWS AVG_ROW_LEN EMPTY_BLKS ALLOC_BLKS  PCT_HWM
---------- ---------- ---------- ----------- ---------- ---------- --------
DATA_IN_BYTES DATA_IN_BLKS MOD_DATA_IN_BLKS PCT_SPC_USED
------------- ------------ ---------------- ------------
      1630       1150      50000         213         34       1664   .97957
     10650000   1300.04883       1625.06104    .97659918

Now those two columns are populated. 
Not the slightly different AVG_ROW_LEN even though dbms_stats used 100% (as the table is so small)
and ANALYZE was compute 


-- clear 90% of the data randomly

45461 rows deleted.

 BLKS_USED  AVG_SPACE   NUM_ROWS AVG_ROW_LEN EMPTY_BLKS ALLOC_BLKS  PCT_HWM
---------- ---------- ---------- ----------- ---------- ---------- --------
DATA_IN_BYTES DATA_IN_BLKS MOD_DATA_IN_BLKS PCT_SPC_USED
------------- ------------ ---------------- ------------
      1630       1150       4539         210         34       1664   .97957
       953190   116.356201       145.445251   .087407002

PCT_HWM is not altered of course but PCT_SPC_USED has dropped dramatically. 
The table is now only 8.7% used, according to my calculations (compared to
90% empty) 

The BLKS_USED does not change. The AVG_SPACE and EMPTY_BLOCKS are the same as I 
used dbms_stats to update the statistics and it DOES NOT ALTER the columns that it does not
populate. Thus you have no idea how recent those columns are if you use a mixture of commands.


-- clear some blocks completely by deleting a range

2181 rows deleted.


 BLKS_USED  AVG_SPACE   NUM_ROWS AVG_ROW_LEN EMPTY_BLKS ALLOC_BLKS  PCT_HWM
---------- ---------- ---------- ----------- ---------- ---------- --------
DATA_IN_BYTES DATA_IN_BLKS MOD_DATA_IN_BLKS PCT_SPC_USED
------------- ------------ ---------------- ------------
      1630       1150       2358        210         34       1664   .97957
       495180   60.4467773       75.5584717   .045407735

Now the PCT_SPC_USED is down to 4.5%


-- has EMPTY_BLOCKS changed if I use ANALYZE?
analyze table hwm compute statistics;

 BLKS_USED  AVG_SPACE   NUM_ROWS AVG_ROW_LEN EMPTY_BLKS ALLOC_BLKS  PCT_HWM
---------- ---------- ---------- ----------- ---------- ---------- --------
DATA_IN_BYTES DATA_IN_BLKS MOD_DATA_IN_BLKS PCT_SPC_USED
------------- ------------ ---------------- ------------
      1630       7682       2358         213         34       1664   .97957
       502254   61.3103027       76.6378784   .046056417

As you can see, if I use ANALYZE AVG_SPACE alters. But EMPTY_BLOCKS does not, even though I cleared
a chunk of the table. So there are blocks that can be reused but not listed as empty.

I'll just take a quick side-step and show a quick "oddity" about dbms_stats
--deleting the stats (using dbms_stats)

exec dbms_stats.delete_table_stats(ownname=>user,tabname=>'HWM')

 BLKS_USED  AVG_SPACE   NUM_ROWS AVG_ROW_LEN EMPTY_BLKS ALLOC_BLKS  PCT_HWM
---------- ---------- ---------- ----------- ---------- ---------- --------
DATA_IN_BYTES DATA_IN_BLKS MOD_DATA_IN_BLKS PCT_SPC_USED
------------- ------------ ---------------- ------------
                                                              1664

dbms_stats.delete_table_statistics clears ALL statistics, even the ones it does not populate

--and now collect them via dbms_stats again

 BLKS_USED  AVG_SPACE   NUM_ROWS AVG_ROW_LEN EMPTY_BLKS ALLOC_BLKS  PCT_HWM
---------- ---------- ---------- ----------- ---------- ---------- --------
DATA_IN_BYTES DATA_IN_BLKS MOD_DATA_IN_BLKS PCT_SPC_USED
------------- ------------ ---------------- ------------
      1630          0       2358         210          0       1664   .97957
       495180   60.4467773       75.5584717   .045407735


--now to shrink the table
alter table hwm enable row movement;
alter table hwm shrink space;


 BLKS_USED  AVG_SPACE   NUM_ROWS AVG_ROW_LEN EMPTY_BLKS ALLOC_BLKS  PCT_HWM
---------- ---------- ---------- ----------- ---------- ---------- --------
DATA_IN_BYTES DATA_IN_BLKS MOD_DATA_IN_BLKS PCT_SPC_USED
------------- ------------ ---------------- ------------
        72          0       2358         210          0         80   .90000
       495180   60.4467773       75.5584717   .944480896

So I calculated that there was about 75 blocks of data in that table. having shrunk it, I was
a little bit out.

Having run through those examples we can see that the accuracy of the PCT_SPC_USED is down to the fudge factor employed but is probably close enough at 25%. After all, you are only likely to shrink a table that very clearly would benefit from it.

Shrinking Tables to Aid Full Scans April 18, 2012

Posted by mwidlake in performance, SQL.
Tags: ,
14 comments

{This blog is about shrinking tables where the High Water Mark is higher than it needs to be. Prompted by one of the comments, I wrote a follow-up post on finding the High Water Mark and tables that consist mostly of empty space, which would be candidates for shrinking.}

This blog is about one of those things I do as almost an autonomous “not thinking about it” performance housekeeping task, one which I have been meaning to mention for ages.

There can be quite a lot to gain by checking out full scans on “small” tables and seeing if it is as efficient as it can be. Sometimes it is, sometimes it is not. Often it is “good enough”. Occasionally it is awful.

Recently I was just casting an eye over the “top 20” SQL on a system for any code generating a lot of consistent gets. I came across the below:

FRST_LOAD_TIME SQL_ID PRSE EXCS BUFFS DISCS RWS
-------------------- ------------- --------- ----------- -------------- ---------- ------------
CPU_MS ELAPSD_MS SORTS DIR_W OPT_COST
-------------- -------------- ---------- ---------- --------------
SQL_TEXT
----------------------------------------------------------------------------------------------------
2-03-14/18:00:10 bk9b5u6zyvy59 36,262 36,262 90,634,158 7 36,261
320,102.3 326,920.9 0 0 546
SELECT count(*) RUNNING_SESSIONS from SNP_SESSION WHERE SESS_STATUS = :1 and AGENT_NAME=:2

It caught my eye as it struck me as a lot of buffer gets for what I thought was quite a small table and it is executed quite often. In this case, a few thousand times a day. It takes 2500 buffer gets per count(*). All buffered, but the CPU has to chunk through it.

I check, it is a small table:

ABCD01> select count(*) from dev_ODI_XXXX.snp_session

COUNT(*)
----------
73

ABCD01> @tab_lst
Enter value for tab_name: snp_session

TABLE_NAME OWNER NUM_ROWS
------------------------------ ------------------------------ -------------
SNP_SESSION DEV_ODI_XXXX 49

-- and a quick check on those key columns

OWNER COLUMN_NAME NUM_DISTINCT N_NULLS LOW_V HI_V BKTS AVG_L
-------- -------------------- ------------ ---------- --------------- --------------- ---- -----
DEV_ODI_ SESS_STATUS 4 0 D W 4 2
DEV_ODI_ AGENT_NAME 4 6 AAA_ODI_AGENT BBB_ODI_AGENT 4 13

The table IS small, only 73 rows. It is also interesting that the stats are out by quite a percentage, but the table is still seen as small according to the stats. This indicates the table is quite volatile, but consistently small. Another thing to notice is that the number of distinct values for each column in the WHERE clause is only 4 and thus an index lookup would not be very specific. With such a small number of records in the table and not very specific index I “know” that an index lookup would not be of much benefit {BTW Everything that you “know” in this way – do check it in reality every now and again…}. I don’t look at adding an index, I look at why such a small table needs so many buffer gets.

So, why so many buffer gets?

First of all, I need to be careful. The original source of information I was looking as was V$SQL and that holds cumulative stats for all executions of the same SQL statement. Thus the buffer gets could have been very high a while ago (say, yesterday) and now are small, and the average is 2500 BGs per select but right now it is running fine. So I verify that the code is still doing a lot of work for such a small number of records:

set autotrace on
SELECT count(*) RUNNING_SESSIONS from dev_ODI_XXXX.SNP_SESSION 
WHERE SESS_STATUS = 'D' --:1 
and AGENT_NAME= 'DEV_ODI_AGENT'--:2

RUNNING_SESSIONS
----------------
1
Elapsed: 00:00:00.99

Execution Plan
----------------------------------------------------------
Plan hash value: 1837310352
------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 15 | 546 (1)| 00:00:01 |
| 1 | SORT AGGREGATE | | 1 | 15 | | |
|* 2 | TABLE ACCESS STORAGE FULL| SNP_SESSION | 1 | 15 | 546 (1)| 00:00:01 |
------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
2 - storage("SESS_STATUS"='D' AND "AGENT_NAME"='DEV_ODI_AGENT')
filter("SESS_STATUS"='D' AND "AGENT_NAME"='DEV_ODI_AGENT')

Statistics
----------------------------------------------------------
1 recursive calls
0 db block gets
2013 consistent gets
0 physical reads

That’s still a lot of consistent gets to scan all of 75 rows and do a count (which is the sort aggregate step). A second to process and 2013 consistent gets to look at 75 rows in a skinny table – just trust me it is a skinny table. And the optimiser cost the full scan step at 546 “equivalent single IOs”.
{Oh, if anyone is wondering what the “table access storage full” and predicate step “storage(“SESS_STATUS”=’D’ AND “AGENT_NAME”=’DEV_ODI_AGENT’)” it’s because this system is an exadata box. Just ignore this for now, it is not relevant to this issue. I’ll blog about it *next week*}.

What is the problem?

Well, let’s see how big the table is.

ABCD01> @seg_dets
Enter value for seg_name: snp_session
Enter value for owner: dev_odi%
Any Key>

OWNER SEG_NAME SEG TS_NAME BYTES_K BLOCKS exts INI_K NXT_K
-------- --------------- --- -------- ---------- --------- ---- ------- -------
DEV_ODI_ SNP_SESSION TAB DEV_ODI_ 16,384 2,048 31 64 1024
REPO USER

The table is 2048 blocks in size, the consistent gets to look at the table is 2013. That’s close.

The problem is when you full scan a table, it scans up to the high water mark of that table, ie the highest block that Oracle knows there has been data in. As Oracle inserts new data into a table and extends it, it keeps a note of the highest block data has been put into, the High Water Mark (HWM). When you delete data from a table, Oracle marks individual blocks to be reused as they empty (drop below a certain percentage full). But Oracle does NOT check to see if many blocks are being fully emptied and it can reduce the high water mark. In fact, there could well be a few records in the first block of the table and then nothing until a few records in the last block that has been used in the table, the High Water Mark. Oracle just scans all blocks and counts relevant records up to the high water mark. This is actually a very long-standing performance tweek, you can allocate GBs of data to a table but Oracle knows the very highest block it has put data into for that segment and only scans up to the High Water Mark. Usually, the HWM is useful.

I happen to know that this table used to have several thousand records in it but is now used far less extensively and holds only a few records. They are constantly deleted and new ones created. I have seen {very regularly, at least once a year} some extreme situations where a table has had many millions of rows in it at one point, setting the HWM to several hundred thousand blocks into the table segment, but the table now only holds a few hundred records. But a Full Table Scan scans all the way up to the HWM, taking several minutes to access a few hundred records. Often that table is being inefficiently cleared with a “delete from table_x” statement and then re-populated.

What is the answer? Well, I could truncate the table and set the HWM to zero, but this table actually holds some real data. I can’t truncate it in this situation. I could MOVE the table, ie get Oracle to recreate the table and move the existing data into it. That would set the HWM of the new segment to block one and then raise it as it inserts the data neatly into the first few blocks. The third option, and the one I took, is to SHRINK the table. This basically tells oracle to shuffle the data down to the start of the table and reset the HWM. You can do this on the fly and I have done it on relatively active systems, but I must give the usual advice to test this yourself and you may have locking issues on a busy system hitting the table you shrink very hard. Also, you do need to enable row movement on the table for Shrink to work, as is demonstrated below:

ABCD01> alter table dev_ODI_XXXX.snp_session shrink space cascade;
alter table dev_ODI_XXXX.snp_session shrink space cascade
*
ERROR at line 1:
ORA-10636: ROW MOVEMENT is not enabled

ABCD01> alter table dev_ODI_XXXX.snp_session enable row movement;

Table altered.

ABCD01> alter table dev_ODI_XXXX.snp_session shrink space cascade;

Table altered.

Elapsed: 00:00:01.98

So, how does my little select perform now?

set autotrace on
SELECT count(*) RUNNING_SESSIONS from dev_ODI_XXXX.SNP_SESSION 
WHERE SESS_STATUS = 'D' --:1 
and AGENT_NAME= 'DEV_ODI_AGENT'--:2

RUNNING_SESSIONS
----------------
1
Elapsed: 00:00:00.04

Execution Plan
----------------------------------------------------------
Plan hash value: 1837310352
------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 15 | 546 (1)| 00:00:01 |
| 1 | SORT AGGREGATE | | 1 | 15 | | |
|* 2 | TABLE ACCESS STORAGE FULL| SNP_SESSION | 1 | 15 | 546 (1)| 00:00:01 |
------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
2 - storage("SESS_STATUS"='D' AND "AGENT_NAME"='DEV_ODI_AGENT')
filter("SESS_STATUS"='D' AND "AGENT_NAME"='DEV_ODI_AGENT')

Statistics
----------------------------------------------------------
1 recursive calls
0 db block gets
8 consistent gets
0 physical reads

Fabulous. The run time is very short and the number of consistent gets is tiny. You might notice that the Explain Plan still thinks the cost of the full scan will be 546 (ie the equivalent of 546 single block reads) but that is as I have not re-gathered stats. I left the system to do this automatically that night.

I’ll just mention one more thing. Shrinking a table is a bit like the old trick of rebuilding indexes to compact them. People misunderstood what rebuilding indexes was achieving and why the indexes were so much larger than they “needed” to be and would spend CPU and IO rebuilding indexes overnight – only for them to quickly explode back to the original size (creating a lot of overhead as they did). Shrinking oversized tables can have a great benefit but keep in mind that if some process pushes the HWM right up every day by shoving lots of data into the table and then clearing it down later, shrinking the table is just a waste of time. It gives benefit when the table really has shrunk down in size for a while. Or it only occasionally explodes to a massive size. I have in the past written “regular shrink” scripts for tables I know explode once in a while (eg once a week) but are usually small, but you need to know what the pattern is. Don’t just blindly shrink tables.

For me in this situation, that code scanning that table dropped out of my “top 20 most demanding SQL statements” and has stayed out of it for weeks. The code now completed in considerably under 0.1 seconds as opposed to a whole second – but no one has noticed. I’d removed about 5% of the workload of the system – but no one noticed. But it took me 1/2 an hour to investigate and fix and I’d freed up a non-trivial chunk of processing resource. In my Performance Tuning heart I know I did a good job on that day.

Dropped Tables, Hiding Extents and Slow DBA_FREE_SPACE Queries February 16, 2012

Posted by mwidlake in database design, internals, performance.
Tags: , , ,
10 comments

My last post was on slow dictionary queries caused by having many, many thousands of extents in your database. This post is about a special case of this problem, which I encountered recently. It was on an Exadata box – Exadata is not particularly relevant to the issue, but I’m curious to see if mentioning Exadata will make this post more popular:-)

I was querying the used and free space on a very small database on the client’s X2-2 1/4 rack Exadata machine. The query was really slow, taking about 30 seconds. This is a FAST box, what is going on?

I quickly realised that the problem was specific to one tablespace:

>@SPC_SUM
Enter the tablespace (or leave null)&gt; DATA_01

TS_NAME              ORD      SUM_BLKS        SUM_K  MAX_CHNK_K NUM_CHNK
-------------------- ----- ----------- ------------ ----------- --------
DATA_01              alloc     262,144    2,097,152   2,097,152        1
                     free       63,128      505,024     504,384       11
2 rows selected.
Elapsed: 00:00:00.21

&gt; @SPC_SUM
Enter the tablespace (or leave null)&gt; USERS

TS_NAME              ORD      SUM_BLKS        SUM_K  MAX_CHNK_K NUM_CHNK
-------------------- ----- ----------- ------------ ----------- --------
USERS                alloc     748,320    5,986,560   5,372,160        2
                     free      127,904    1,023,232       6,144    3,058
2 rows selected.

Elapsed: 00:00:26.05

We can see it takes 0.21 seconds for tablespace DATA_01, 26.05 seconds for the USERS. Yes, USERS is larger but not 150 times larger. What is going on? Well, as the title of this post suggests, it is down to dropping tables. The below shows my route to that conclusion.

What I could see was that for the USERS tablespace the maximum chunk of free space was relatively small compared to the sum of free space – 6MB out of 1GB, pretty much – and that there was a lot of individual pieces of free space, 3,058. This tablespace was shattered into a lot of bits. So, what sort of extent size management do we have? How big are the tables and indexes in this tablespace {NB I already knew we had no partitions so I did not have to worry about that}.

select tablespace_name,initial_extent,next_extent
      ,extent_management,min_extlen,allocation_type
from dba_tablespaces where tablespace_name ='USERS'

TABLESPACE_NAME                INITIAL_EXTENT NEXT_EXTENT EXTENT_MAN MIN_EXTLEN ALLOCATIO
------------------------------ -------------- ----------- ---------- ---------- ---------
USERS                                   65536             LOCAL           65536 SYSTEM

select tablespace_name,blocks,count(*) from dba_extents
where tablespace_name = 'USERS'
group by tablespace_name,blocks
having count(*) >1
order by blocks desc,tablespace_name

TABLESPACE_NAME                    BLOCKS   COUNT(*)
------------------------------ ---------- ----------
USERS                                2560          2
USERS                                2048          3
USERS                                1536          7
USERS                                1408          5
USERS                                1280          2
USERS                                1248          2
USERS                                1152          2
USERS                                1024        229
USERS                                 896         15
USERS                                 768         21
USERS                                 736          3
USERS                                 720          3
USERS                                 704          2
USERS                                 672          2
USERS                                 640         25
USERS                                 624          2
USERS                                 576          2
USERS                                 512        117
USERS                                 400          2
USERS                                 384         34
USERS                                 360          2
USERS                                 312          2
USERS                                 288          4
USERS                                 256         49
USERS                                 248          2
USERS                                 240          2
USERS                                 192          5
USERS                                 160          4
USERS                                 128       1165
USERS                                   8       1788

30 rows selected.

So we have system controlled extent allocation, this should be fine. I did a quick check of the table contents – 300 or so tables and 200 or so indexes. And, as you can see from the check of extents above, very few larger extents and a lot of small ones. {As a little aside, note the relatively large number of 8-block, 128-block, 256-block and 1024-block extents; These are the sizes that LMTs with system managed extents tend to use unless a similar sized chunk is found to be available, in which case it is used}.

I did some other checks on indexes and segment sizes and it all boiled down to one thing. There were not a lot of things in this tablespace, what was there was small and… The total of all those table and index blocks was way less than the used space in the tablespace.

That is the major indicator of what is going on here. The difference in space used in the tablespace and the total of the visible tables and indexes.

The last link in the chain is the recycle bin.

select owner,ts_name,count(*) from dba_recyclebin group by owner,ts_name

OWNER                          TS_NAME                          COUNT(*)
------------------------------ ------------------------------ ----------
USER1                          USERS                              542356
USER1                                                                  2
WEGWEGWEG                      USERS                                  97
KKKUKUYLLX                     USERS                                 149
USOVFPKEKS                     USERS                                   3
....
ERHJTRTTTURT                   USERS                                   4

11 rows selected.

That’s 542,356 objects in the recyclebin for one user, in the one tablespace. My problem tablespace. The penny clicked, something I had already noticed and was a little uncomfortable about fell into place.

The client is using Oracle Data Integrator (ODI) to pull data together and put it into the database. The process they are using basically creates a table, uses it to load some data into and then pushes the data into the target tables. Then drops the table. The drop does not do a “DROP TABLE … PURGE;”. This is done many, many times per load cycle, which is intended to run several times a day.

Something you should always keep in mind with the recyclebin is that the tables, indexes and their extents that go into the recycle bin do not get reported in several of the key data dictionary views. I did mention this in a blog post way back but maybe a whole post on it is called for.

So, the tablespace was being shattered by the constant creating and dropping of small tables. It was hidden from easy view due to how the recyclebin is (not) exposed in data dictionary views.

It is not good practice to constantly create and drop lots of tables. As well as the admittedly rather odd impact that this posting is all about, there is a lot of internal work involved for the oracle database in creating and dropping tables. It is an overhead best avoided. The client had good reasons for this approach but now they are going to look at the alternatives.

I’ve not actually proven in this posting that all those dropped tables is the actual cause of the slow performance in querying the free space in that tablespace {though the corroborative evidence is very strong}. Neither have I said how we cleared up the mess. I’ll cover both of those in the next post on this mini-thread.

Slow querying of DBA_EXTENTS, DBA_FREE_SPACE and dropping tables February 14, 2012

Posted by mwidlake in performance.
Tags: ,
3 comments

Are you finding queries to identify free space, the size of segments or details of extents in your database are very slow? You could have a database with a very, very large number of extents. If you are on version 9 or up, I would suggest you check the extent management settings for any tablespace holding large segments (see later).

{Caveat – I’m writing this posting based on what I remember of some issues I had on two V10.2 databases 2 and 3 years ago respectively, and those systems are no longer available to me. So please treat everything I say here with some caution – I’ve not checked and verified it to the level I normally would. But I wanted to move this information from another posting I am writing on a recent experience I will verify…}

First, what this article is NOT about. It has been known for a very long time that creating tables with a large number of extents can have a negative impact on SQL performance on that table. Except it never really did and it was a myth. Way back prior to Oracle 7.3 you could only create so many extents per segment, depending on block size, but that was another issue. It used to be argued that SQL select against such tables with many extents was slower. I did some tests and it was not – unless you were in the strange situation where your extent size was less than your multi-block read count, and even then the impact was not huge, it was a slowdown of a few percent to maybe 25%.

However, dropping such tables, truncating such tables and queries against the dictionary objects that deal with extents and free space could and still can become very slow. As we have progressed through the Oracle versions from 8 ,9 and 10 this problem has become less common and the impact has become less, mostly due to Locally Managed Tablespaces (LMTs) and Automatic Segment Space Management {though that is more in respect of concurrent DML than select}.

LMTs in particular have generally removed the issue. If you do not use LMTS and have no very,very pressing reason to not {like 3rd party support for applications}, then swap to LMTs. LMTs have been the default for user defined tablespaces since oracle 9 and have two options. Uniform (Fixed) extent sizes or Automatic, where oracle uses 8 block extents for a new segment to start, then 64 block extents, then 1024 block extents as the segment grows {I might be wrong on the exact size details but don’t worry about it, I certainly don’t}. You can check the settings for tablespaces as is demonstrated below. I create two tablespaces, one with uniform extent sizes and then one with automanaged extent sizes, and check the relevant information (this is on 11.2.0.3):

create tablespace mdw_uni_1m
datafile 'C:\APP\MARTIN\ORADATA\MDW1123\MDW_UNI_1M'
size 100m
extent management local uniform size 1M;

create tablespace mdw_auto
datafile 'C:\APP\MARTIN\ORADATA\MDW1123\MDW_AUTO'
size 100m
extent management local autoallocate;

select tablespace_name,initial_extent,next_extent
      ,extent_management,min_extlen,allocation_type
from dba_tablespaces where tablespace_name like 'MDW%';

TABLESPACE_NAME    INITIAL_EXTENT NEXT_EXTENT EXTENT_MAN MIN_EXTLEN ALLOCATIO
------------------ -------------- ----------- ---------- ---------- ---------
MDW_UNI_1M                1048576     1048576 LOCAL         1048576 UNIFORM
MDW_AUTO                    65536             LOCAL           65536 SYSTEM

As you can see, tablespace MDW_UNI_1M uses uniform extents of 1M and MDW_AUTO has system managed allocation and starts with 64K extents – 8 blocks with my 8k block size.

As a quick demo, I’ll just create and populate two simple tables and see what extent sizes are created for them:

create table mdw_big
tablespace mdw_auto
as select rownum id,lpad('A',1000,'A') pad
from dual connect by level <10000

create table mdw_big_uni
tablespace mdw_uni_1m
as select rownum id,lpad('A',1000,'A') pad
from dual connect by level <10000

select owner,segment_name,blocks,count(*)
from dba_extents
where segment_name like 'MDW%'
group by owner,segment_name,blocks

OWNER           SEGMENT_NAME        BLOCKS   COUNT(*)
--------------- --------------- ---------- ----------
ERIC            MDW_BIG_UNI            128         12
ERIC            MDW_BIG                  8         16
ERIC            MDW_BIG                128         11

3 rows selected.

So, how do issues with large numbers of extents still arise with modern oracle systems? Well, the two situations I’ve seen on Oracle 10 had the same cause {and, as a teaser for later this week, I’ve seen a variation of this issue on a nice, shiny Exadata X2-2 Oracle 11.2.0.3 box:-) You can readabout that here }. What alerted me was slow performance querying the data dictionary, in particular my scripts for checking free space, the size of segments and how many extents they consisted of.

If you create a tablespace with fixed extent sizes of eg 64K and then create a table in there that is 100GB in size, you will get an awful lot of extents. Now make it worse and have a partitioned table that ends up being several hundreds of GB in size with all those segments in that tablespace (or similarly defined tablespaces).

Since I hit the problem twice myself, I’ve chatted to others who have had the same issue. The above usually happens because of a mistake. The tablespace(s) in question are set up to hold small tables and then get used for large tables, either by a mistake in stating the exact tablespace to use or having the tablespace default to your default tablespace – which just happens to be a tablespace with fixed but small extent sizes.

The end result is a massive number of small extents in these tablespaces, usually with extents for different objects mixed in. Some dictionary queries slow down and, in particular, anything to do with looking at extents. For one site, I was trying to use my own code to gather statistics on tables that replaced the standard automated job. It’s fairly “smart” code and chooses a sample size based on the size of the segments. Only, the data dictionary was performing so slowly for the check on segment size that it was taking over 5 seconds to get the information – longer than some of the stats gathers.

You can logically understand why dropping or truncating the table is slow. Oracle has to sort out all those extents, remove the information from the data dictionary. This is not helped by the fact that part of the data dictionary is being slowed down due to all those pesky records…

You MAY be able to get some relief from this situation by gathering fixed object statistics. I did so at one site, where the queries against free_space and segment size sped up by around 80%. I have no demonstrated proof of this, it is just what I saw in one situation, so feel free to try it but don’t sue me if it does not help. Also, it took over 3 hours to gather the fixed object stats and you only do this sort of thing, untested, on a production system if you are already in a bad place.

{update – I just tested this on a private 11.2 db that was taking 10.3 seconds to count all extents, all 12,742 of them. Gathering fixed object stats made no difference at all.}

However, the real answer is to laboriously rebuild those segments in tablespaces with correctly specified uniform extent sizes. Which we did, over several weeks, and it made a difference.

If I was doing this task today, if I could get the outage to do it, I would create COPIES of those segments that were in the wrong tablespaces, re-name and re-apply any constraints and move the other other, smaller tables and indexes to a new tablespace – and then drop the tablespaces including contents. Why? As dropping a table with lots and lots of small extents seemed to take a very long time (many minutes per partition and we had thousands of them). Again, my proof is lost in the mists of time, but that’s what I would aim to do.

IOTs by the Oracle Indexing Expert January 10, 2012

Posted by mwidlake in Blogging, performance.
Tags: , ,
add a comment

I’m really pleased to see that Richard Foote has started a series on Index Organized Tables. You can see his introductory post on the topic here. As ever with Richard, he puts in lots of detail and explanation and I’ve been a fan of his blogging style for a long time.

I’ve got a few posts on the topic left to do myself, maybe this competition will spur me to get on and write them!

What I will also be very interested to see is the different way we will cover the same topic. Richard has already put in a block dump and dug into the details a little at a low level about how the data is stored, which I have not done. He has also shown how using an IOT instead of a fully overloaded index (where you create an index covering all the columns of the table, to avoid visiting the table for other columns) results in not only saving the space taken up by the redundant heap table but that the IOT index is smaller than the fully overloaded index. This is due to the lack of a rowid.

I put in occasional pictures and maybe write more about how the example matches real world situations. If you want, you can look back at my own introduction to the topic.

I’m sure this is going to be an excellent series and I’ll be following it myself.

IOT Part 6(B) – OLTP Inserts into an IOT November 10, 2011

Posted by mwidlake in development, performance, Testing.
Tags: , , , , , ,
14 comments

<..IOT1 – the basics
<….IOT2 – Examples and proofs
<……IOT3 – Significantly reducing IO
<……..IOT4 – Boosting Buffer Cache efficiency
<……….IOT5 – Primary Key Drawback
<…………IOT6(A) – Bulk Insert slowed down

IOT 6 part A covered the sort of insert work used to create an IOT or push a large number of rows into it via an “insert into…select from” type format. ie Bulk Load.
In this, part B, I am going to test inserting a small number of rows (10,000) into a table already holding a million rows. Something more like an hourly load or the inserts that would occur over a working day – a more OLTP-like test

I’m going to do the tests against four things. A “non-insert” test to ensure my data creation is not taking a lot of resource {the Null test}, then against my three versions of test table – a normal heap table, my IOT and my Partitioned IOT. I run each test on each segment type three times, the minimum number needed to help identify variation that could invalidate the test.

You will probably be surprised by the results….

All three tables have the same column structure, the heap has indexes on the primary key and a unique constraint, the IOT and partitioned IOT are organized on a modified version of the Primary key and have the second index for the unique key. Below is the creation statement for the partitioned IOT (the IOT is the same without the partition clause). You can get full creation details from IOT part 2:

create table transaction_IOT_P
(tran_type number(2)     not null
,tran_id   number(10)    not null
,ACCO_type number(2)     not null
,ACCO_id   number(10)    not null
,cre_date  date          not null
,vc_1      varchar2(1000) not null
,vc_2      varchar2(1000)
,date_1    date
,num_1     number(2)
,num_2     number(2)
,constraint tip_pk primary key(ACCO_type,ACCO_id,cre_date,tran_type,tran_id) 
--  using index tablespace index_01
,constraint tip_tran_uq unique (tran_type,tran_id)
  using index tablespace index_01
)
organization index
tablespace data_01
partition by range  (cre_date)
(partition rm20110901 values less than (to_date('01-09-2011','DD-MM-YYYY')) 
  tablespace data_01
,partition rm20111001 values less than (to_date('01-10-2011','DD-MM-YYYY'))
  tablespace data_01
,partition rm20111101 values less than (to_date('01-11-2011','DD-MM-YYYY'))
  tablespace data_01
,PARTITION RMTOP  VALUES LESS THAN (MAXVALUE)
  tablespace USERS
)

The three tests are:

  • Insert 10,000 records for random accounts, in time order for the same day. This would be like processing data for or during a “day”
  • Generating the 10,000 records in order of account, to better match the IOT structure, again for a single day. For any given account 0,1,2 or 3 records are created, to be a little more “real life”. This would be like loading a pre-processed set of data.
  • All 10,000 records are created for the one account, in datetime order, as an extreme example of inserting into an IOT data that is ordered to perfectly match the order of the IOT.

The code is written ‘badly’, ie a single record is prepared and inserted into the table at a time, within a loop of 10000 iterations. This is to avoid the bulk-processing that may benefit the heap table and also to better match the reality of an OLTP system. I do however only commit after all records are created for any given test, which is maybe a mistake but that’s what I did.

Now, before I go any further, does anyone want to guess by how much inserting into the heap table will be faster than inserting into the IOT (partitioned or otherwise)? Also, whether the partitioned IOT will be faster or slower than the IOT and what, if any, impact there will be by the ordering of data?

To allow you to think about that and not see the results by accident, I’ll stick in one of the insert harnesses here so you can see how I am doing it. Skip over if you just want the results

--
execute s_snap.my_snap(' finished non-insert test1')
--
-- Transaction_heap random data test
declare
v_num         number :=10000; -- number of accounts
v_offset      number :=-4;
v_str1        varchar2(60);
v_str2        varchar2(60);
v_tran_type   number(2);
v_tran_id     number(10);
v_acco_type   number(2);
v_acco_id     number(10);
v_cre_date    date;
v_vc_1        varchar2(1000);
v_vc_2        varchar2(1000);
v_date_1      date :=sysdate;
v_num_1       number(2) :=10;
v_num_2       number(2) :=15;
begin
v_str1:=dbms_random.string('U',60);
v_str2:=dbms_random.string('U',60);
dbms_output.put_line (to_char(SYSTIMESTAMP,'HH24:MI:SS.FF'));
for i in 1..10000 loop
    v_tran_type :=mod(1,3)+1;
    v_tran_id   := 1000000+i;
    v_acco_type := 5+(trunc(dbms_random.value(1,3))*5); -- 10 or 15
    v_acco_id   :=trunc(dbms_random.value(1,v_num/2)); -- 1 to 5000
    v_cre_date  :=trunc(sysdate-v_offset)+(i/(60*60*24));
    v_vc_1      :=substr(v_str1,1,51+mod(i,10));
    v_vc_2      :=substr(v_str2,1,46+mod(i,15));
    insert into transaction_heap
      (tran_type,tran_id,ACCO_type,ACCO_id,cre_date,vc_1,vc_2,date_1,num_1,num_2)
    values
      (v_tran_type,v_tran_id,v_ACCO_type,v_ACCO_id,v_cre_date
      ,V_vc_1,v_vc_2,v_date_1,v_num_1,v_num_2);
end loop;
commit;
dbms_output.put_line (to_char(SYSTIMESTAMP,'HH24:MI:SS.FF'));
end;
/
--
execute s_snap.my_snap(' finished th insert test1')
--

Here are the results of three runs for all tests:

Elapsed time in seconds:

                    random insert             ordered-by-account       perfect IOT order match
Segment type    Test 1    2    3            Test 1    2    3           Test 1    2     3    
------------    ------ ---- ----            ------ ---- ----            ----- ---- ----
No insert          2.1  0.1  0.1               0.1  0.1  0.1              0.1  0.0  0.1   

                                    AVG                         AVG                        AVG
                                   ----                        ----                       ----
Heap              52.3 20.5 11.4   28.1       33.4 13.8 10.7   19.3       9.1 13.0  9.8   10.6
IOT              137.1 83.7 26.4   82.4      110.8 63.6 39.0   71.2       6.4  6.7  4.0    5.8
Partn'd IOT       33.2  9.5  8.7   17.1       17.1 10.4 14.3   13.9       8.1  5.3  4.5    6.0
                                   ----                        ----                       ----

Allow me to summarise that.

First, the preparation of the data and cycling around the loop, the no-insert tests, are negligible. {The first no_insert test is taking 2 seconds due to the parsing and collecting dictionary information overhead of the first thing done}. We can ignore the overhead of the test harness.

Much more significantly:

  • Each iteration of each test, things get faster {with one odd exception}
  • For Random and semi-ordered inserts, inserting into the IOT is slower than the Heap by a factor of 300% to 400% (3 times to 4 times slower) – what we all probably expected.
  • Inserting into the IOT is Faster than the heap when the insert order matches the structure of the IOT in the perfect_IOT test. Not just the same but 40% faster
  • The partitioned IOT is as Fast or Faster than the the Heap table, never mind the IOT. Over all the partitioned IOT takes 50% TO 75% THE TIME OF THE HEAP to be populated.

Hands up everyone expecting this? NB My hand would be down if I had not done similar tests before.

Why do we see these results?

I’ll just mention that the reason all the tests get quicker with each iteration is that the computer is doing the same thing over and over again and gets used to it, the relevant circuits in the chip are “oiling in”. Of course not, how silly – but it seems to be how non-IT people think of it:-) Basically more and more of each segment is getting cached and that reduces physical IO and speed things up. You see this a lot in testing and you need to be constantly vigilant about it. You might think I should really run the tests until they are steady-state, but that is a very false situation. On real system you don’t tend to get the same thing run over and over and over again on a system with nothing else running.

Much more interestingly, why is the partitioned IOT winning? Well, for the Heap table data is being poured into the leading edge of the table, which takes a little processing and some blocks to be created. However, there are also two indexes being maintained, the Primary Key and that Unique key. For each insert oracle has to find the right spot in each index (root block ->branch block->leaf block->check for row) and insert the new record, occasionally splitting a full leaf block.

For the IOT and the Partitioned IOT there is no table to maintain. In effect there are just two indexes to maintain. For the normal IOT this benefit is outweighed by the cost of working with a less space efficient Primary Key index. As there is the whole of the row in each index entry there are fewer index entries possible per leaf block and so more block need to be read in from disc and more splits are needed. Thus the inserts into the IOT are slower.
But for the partitioned IOT not only is there no table to maintain but the table (really the Primary Key) is partitioned – on date. We are creating data for a single day at a time. All the relevant blocks will be in the one partition (and it’s the latest and smallest partition) and quickly all cached in memory and processed in memory. It’s the efficient use of the cache again. For both the IOT and the Heap, the primary key segment being processed is larger and the inserts are spread across the whole index.

Ordering data to match the index order (ordered_by_account tests) helps for all three segment types – heap, IOT and partitioned IOT. We have two indexes and the data is being ordered for both in effect, as one is on the columns we are ordering the data creation for and the other, the unique key, is generated by an ascending sequence. The gain is about 25, 30% and is due to working through the indexes in order as opposed to skipping about. If the indexes were significantly larger than the buffer cache or on a highly active system where the turnover in the cache was very high (indeicated by a low buffer cache hit ratio:-) ) then the benefit would be higher.
The perfect ordering test (perfect_iot_order_match) is a special case. Not only is the data being processed in order of the indexes but only one part of the index is being updated – for a single account. In this situation the maintaining of the index is so efficient that the time saved not processing the heap table means both the IOT and IOT_P win out over the heap. So in this special case, inserts into IOTs are faster than inserts into a heap table.

And finally a quick summary:

  • Inserting records into an IOT carries about a 200-400% overhead, depending on the number of indexes
  • Ordering the data to match the IOT gives a 25-30% advantage, but it similarly aids update of the primary key on the heap
  • Partitioning the IOT such that a smaller chunk of the index is processed can be highly beneficial. I don’t cover it specifically in the above but you can probably appreciate that partitioning on a heap can be beneficial to processing indexes on the heap also.
  • If you can organise to create your IOT such that data is created only in one or a few distinct points in the structure (you might have to tweak you data load too of course), insert performance might match or even beat that for a heap table with the same indexes
  • Using methods to maximise efficient use of your buffer cache and reduce IO is a very powerful performance tuning aid, not just for select

BTW I am conscious of the fact that the above may surprise many and you might wonder if my data or tests are flawed. I have stripped out most of the code from this post as it was just becoming massive and confusing, but contact me and I will happily supply all the example scripts and data.

IOT P6(a) Update November 8, 2011

Posted by mwidlake in Architecture, development, performance, Testing.
Tags: , , , ,
3 comments

In my last post, IOT part 6, inserts and updates slowed down, I made the point that IOT insert performance on a relatively small Oracle system was very slow, much slower than on a larger system I had used for professional testing. A major contributing factor was that the insert was working on the whole of the IOT as data was created. The block buffer cache was not large enough to hold the whole working set (in this case the whole IOT) once it grew beyond a certain size. Once it no longer fitted in memory, Oracle had to push blocks out of the cache and then read them back in next time they were needed, resulting in escalating physical IO.

I’ve just done another test which backs up this claim. I altered my test database so that the block buffer cache was larger, 232MB compared to 100MB in my first tests. The full IOT is around 200MB

Bottom line, the creation of the IOT was greatly sped up (almost by a factor of 4) and the physical IO dropped significantly, by a factor of 20. As a result, the creation of the IOT was almost as fast as the partitioned IOT. It also shows that the true overhead on insert of using an IOT is more like a factor of 2 to 4 as opposed 6 to 8.

You can see some of the details below. Just to help you understand them, it is worth noting that I had added one new, larger column to the test tables (to help future tests) so the final segments were a little larger (the IOT now being 210MB as opposed to 180MB in the first tests) and there was a little more block splitting.

                        Time in Seconds
Object type           Run with       Run with
                     100MB cache    232MB cache
------------------  ------------    -----------   
Normal Heap table          171.9          119.4   
IOT table                1,483.8          451.4     
Partitioned IOT            341.1          422.6 

-- First reading 100MB cache
-- second reading 232MB cache 
STAT_NAME                            Heap    	IOT	      IOT P
-------------------------------- ---------- -----------  ----------
CPU used by this session            5,716         7,222       6,241
                                    5,498         5,967       6,207

DB time                            17,311       148,866      34,120
                                   11,991        45,459      42,320

branch node splits                     25            76          65
                                       25            82         107

leaf node 90-10 splits                752         1,463       1,466
                                      774         1,465       1,465

leaf node splits                    8,127        24,870      28,841
                                    8,162        30,175      40,678

session logical reads           6,065,365     6,422,071   6,430,281
                                6,150,371     6,544,295   6,709.679

physical read IO requests             123        81,458       3,068
                                      36          4,012       1,959

physical read bytes             2,097,152   668,491,776  25,133,056
                                1,400,832    34,037,760  16,048,128

user I/O wait time                    454       139,585      22,253
                                       39        34,510      19,293

The heap table creation was faster with more memory available. I’m not really sure why, the cpu effort was about the same as before and though there was some reduction in physical IO with the larger cache, I suspect it might be more to do with both the DB and the machine having been recently restarted.

All three tests are doing a little more “work” in the second run due to that extra column and thus slightly fewer rows fitting in each block (more branch node and leaf node splits), but this just highlights even more how much the IOT performance has improved, which correlates with a massive drop in physical IO for the IOT creation. If you check the session logical reads they are increased by a very small, consistent amount. Physical read IO requests have dropped significantly and, in the case of the IOT, plummeted.

I believe the 90:10 leaf node splits are consistent as that will be the maintaining of the secondary index on ACCO_TYPE and ACCO_ID, which are populated in order as the data is created (derived from rownum).

What this second test really shows is that the efficiency with which you are able to make use of the database cache is incredibly significant. Efficiently accessing data via good indexes or tricks like IOTs and hash tables is important but it really helps to also try and consider how data is going to be recycled within the cache or used, pushed out and then reused. A general principle for batch-type work seems to me to be that if you can process it in chunks that can sit in memory, rather than the whole working set, there are benefits to be gained. Of course, partitioning can really help with this.

{If anyone is wondering why, for the heap table, the number of physical IO requests has dropped by 70% but the actual number of bytes has dropped by only 30%, I’m going to point the finger to some multi-block read scan going on, either in recursive code or, more likely, my code that actually gathers those stats! That would also help explain the drop in user IO wait time for the heap run.}

Just for completeness, here is a quick check of my SGA components for the latest tests, just to show I am using the cache size I claim. All of this is on Oracle 11.1 enterprise edition, on a tired old Windows laptop. {NB new laptop arrived today – you have no idea how hard it has been to keep doing this blog and not play with the new toy!!!}. If anyone wants the test scripts in full, send me a quick email and I’ll provide them.:

-- sga_info.sql
-- Martin Widlake /08
-- summary
set pages 32
set pause on
col bytes form 999,999,999,999,999 head byts___g___m___k___b
spool sga_info.lst
select * 
from v$sgainfo
order by name
/
spool off
clear col
--
NAME                             byts___g___m___k___b RES
-------------------------------- -------------------- ---
Buffer Cache Size                         243,269,632 Yes
Fixed SGA Size                              1,374,892 No
Free SGA Memory Available                           0
Granule Size                                4,194,304 No
Java Pool Size                              4,194,304 Yes
Large Pool Size                             4,194,304 Yes
Maximum SGA Size                          401,743,872 No
Redo Buffers                                6,103,040 No
Shared IO Pool Size                                 0 Yes
Shared Pool Size                          142,606,336 Yes
Startup overhead in Shared Pool            50,331,648 No
Streams Pool Size                                   0 Yes

IOT Part 6 – Inserts and Updates Slowed Down (part A) November 1, 2011

Posted by mwidlake in Architecture, performance, Testing.
Tags: , , , ,
14 comments

<..IOT1 – the basics
<….IOT2 – Examples and proofs
<……IOT3 – Significantly reducing IO
<……..IOT4 – Boosting Buffer Cache efficiency
<……….IOT5 – Primary Key Drawback
…………>IOT6(B) – OLTP Inserts

A negative impact of using Index Organized Tables is that inserts are and updates can be significantly slowed down. This post covers the former and the reasons why – and the need to always run tests on a suitable system. (I’m ignoring deletes for now – many systems never actually delete data and I plan to cover IOTs and delete later)

Using an IOT can slow down insert by something like 100% to 1000%. If the insert of data to the table is only part of a load process, this might result in a much smaller overall impact on load, such as 25%. I’m going to highlight a few important contributing factors to this wide impact spread below.

If you think about it for a moment, you can appreciate there is a performance impact on data creation and modification with IOTs. When you create a new record in a normal table it gets inserted at the end of the table (or perhaps in a block marked as having space). There is no juggling of other data.
With an IOT, the correct point in the index has to be found and the row has to be inserted at the right point. This takes more “work”. The inserting of the new record may also lead to an index block being split and the extra work this entails. Similar extra work has to be carried out if you make updates to data that causes the record to move within the IOT.
Remember, though, that an IOT is almost certainly replacing an index on the heap table which, unless you are removing indexes before loading data and recreating them after, would have to be maintained when inserting into the Heap table. So some of the “overhead” of the IOT would still occur for the heap table in maintaining the Primary Key index. Comparing inserts or updates between a heap table with no indexes and an IOT is not a fair test.

For most database applications data is generally written once, modified occasionally and read many times – so the impact an IOT has on insert/update is often acceptable. However, to make that judgement call you need to know

  • what the update activity is on the data you are thinking of putting into an IOT
  • the magnitude of the impact on insert and update for your system
  • the ratio of read to write.

There is probably little point putting data into an IOT if you constantly update the primary key values (NB see IOT-5 as to why an IOT’s PK columns might not be parts of a true Primary Key) or populate previously empty columns or hardly ever read the data.

There is also no point in using an IOT if you cannot load the data fast enough to support the business need. I regularly encounter situations where people have tested the response of a system once populated but fail to test the performance of population.

Now to get down to the details. If you remember the previous posts in this thread (I know, it has been a while) then you will remember that I create three “tables” with the same columns. One is a normal heap table, one is an Index Organized Table and one is a partitioned Index Organized Table, partitioned into four monthly partitions. All tables have two indexes on them, the Primary Key index (which is the table in the case of the IOTs) and another, roughly similar index, pre-created on the table. I then populate the tables with one million records each.

These are the times, in seconds, to create 1 million records in the the HEAP and IOT tables:

                  Time in Seconds
Object type         Run_Normal
------------------  ----------
Normal Heap table        171.9  
IOT table               1483.8

This is the average of three runs to ensure the times were consistent. I am using Oracle V11.1 on a machine with an Intel T7500 core 2 Duo 2.2GHz, 2GB memory and a standard 250GB 5000RPM disk. The SGA is 256MB and Oracle has allocated around 100MB-120MB to the buffer cache.

We can see that inserting the 1 million rows into the IOT takes 860% the time it does with a heap table. That is a significant impact on speed. We now know how large the impact is on Insert of using an IOT and presumably it’s all to do with juggling the index blocks. Or do we?

This proof-of-concept (POC) on my laptop {which you can also run on your own machine at home} did not match with a proof-of-concept I did for a client. That was done on V10.2.0.3 on AIX, on a machine with 2 dual-core CPUS with hyper-threading (so 8 virtual cores), 2GB SGA and approx 1.5GB buffer cache, with enterprise-level storage somewhere in the bowels of the server room. The results on that machine to create a similar number of records were:

                  Time in Seconds
Object type         Run_Normal
------------------  ----------
Normal Heap table        152.0  
IOT table                205.9

In this case the IOT inserts required 135% the time of the Heap table. This was consistent with other tests I did with a more complex indexing strategy in place, the IOT overhead was around 25-35%. I can’t go into too much more detail as the information belongs to the client but the data creation was more complex and so the actual inserts were only part of the process – this is how it normally is in real life. Even so, the difference in overhead between my local-machine POC and the client hardware POC is significant, which highlights the impact your platform can have on your testing.

So where does that leave us? What is the true usual overhead? Below are my more full results from the laptop POC.

                        Time in Seconds
Object type         Run_Normal    Run_quiet    Run_wrong_p
------------------  ----------    ---------    -----------
Normal Heap table        171.9        81.83         188.27  
IOT table               1483.8      1055.35        1442.82
Partitioned IOT          341.1       267.83         841.22 

Note that with the partitioned IOT the creation took 341 second, the performance ratio to a heap table is only 198% and is much better than the normal IOT. Hopefully you are wondering why!

I’m running this test on a windows laptop and other things are going on. The timings for Run_Quiet are where I took steps to shut down all non-essential services and applications. This yielded a significant increase for all three object types but the biggest impact was on the already-fastest Heap table.

The final set of figures is for a “mistake”. I created the partitions wrong such that half the data went into one partition and the rest into another and a tiny fraction into a third, rather than being spread over 4 partitions evenly. You can see that the Heap and normal IOT times are very similar to the Run_Normal results (as you would expect as these test are the same) but for the partitioned IOT the time taken is half way towards the IOT figure.

We need to dig into what is going on a little further to see where the effort is being spent, and it turns out to be very interesting. During my proof-of-concept on the laptop I grabbed the information from v$sesstat for the session before and after each object creation so I could get the figures just for the loads. I then compared the stats between each object population and show some of them below {IOT_P means Partitioned IOT}.

STAT_NAME                            Heap    	IOT	        IOT P
------------------------------------ ---------- -------------  -----------
CPU used by this session                  5,716         7,222        6,241
DB time                                  17,311       148,866       34,120
Heap Segment Array Inserts               25,538            10           10

branch node splits                           25            76           65
leaf node 90-10 splits                      752         1,463        1,466
leaf node splits                          8,127        24,870       28,841

consistent gets                          57,655       129,717      150,835
cleanout - number of ktugct calls        32,437        75,201       88,701
enqueue requests                         10,936        28,550       33,265

file io wait time                     4,652,146 1,395,970,993  225,511,491
session logical reads                 6,065,365     6,422,071    6,430,281
physical read IO requests                   123        81,458        3,068
physical read bytes                   2,097,152   668,491,776   25,133,056
user I/O wait time                          454       139,585       22,253
hot buffers moved to head of LRU         13,077       198,214       48,915
free buffer requested                    64,887       179,653      117,316

The first section shows that all three used similar amounts of CPU, the IOT and partitioned IOT being a little higher. Much of the CPU consumed was probably in generating the fake data.The DB Time of course pretty much matches the elapsed time well as the DB was doing little else.
It is interesting to see that the Heap insert uses array inserts which of course are not available to the IOT and IOT_P as the data has to be inserted in order. {I think Oracle inserts the data into the heap table as an array and then updates the indexes for all the entries in the array – and I am only getting this array processing as I create the data as an array from a “insert into as select” type load. But don’t hold me to any of that}.

In all three cases there are two indexes being maintained but in the case of the IOT and IOT_P, the primary key index holds the whole row. This means there has to be more information per key, less keys per block and thus more blocks to hold the same data {and more branch blocks to reference them all}. So more block splits will be needed. The second section shows this increase in branch node and leaf block splits. Double the branch blocks and triple the leaf block splits. This is probably the extra work you would expect for an IOT. Why are there more leaf block splits for the partitioned IOT? The same data of volume ends up taking up more blocks in the partitioned IOT – 200MB for the IOT_P in four partitions of 40-60MB as opposed to a single 170MB for the IOT. The larger overall size of the partition is just due to a small overhead incurred by using partitions and also a touch of random fluctuation.

So for the IOT and IOT_P there is about three times the index-specific work being done and a similar increase in related statistics such as enqueues, but not three times as it is not just index processing that contribute to these other statistics. However, the elapsed time is much more than three times as much. Also, the IOT_P is doing more index work than the IOT but it’s elapsed time is less. Why?

The fourth section shows why. Look at the file io wait times. This is the total time spent waiting on IO {in millionths of a second} and it is significantly elevated for the IOT and to a lesser degree for the IOT_P. Physical IO is generally responsible for the vast majority of time in any computer system where it has not been completely avoided.
Session logical reads are only slightly elevated, almost negligably so but the number of physical reads to support it increases from 123 for the Heap table insert to 81,458 for the IOT and 3,068 for the IOT_P. A clue as to why comes from the hot buffers moved to head of LRU and free buffer requested statistics. There is a lot more activity in moving blocks around in the buffer cache for the IOT and IOT_P.

Basically, for the IOT, all the blocks in the primary key segment are constantly being updated but eventually they won’t all fit in the block buffer cache – remember I said the IOT is eventually 170MB and the buffer cache on my laptop is about 100MB – so they are flushed down to disk and then have to be read back when altered again. This is less of a problem for the IOT_P as only one partition is being worked on at a time (the IOT_P is partitioned on date and the data is created day by day) and so more of it (pretty much all) will stay in memory between alterations. The largest partition only grows to 60MB and so can be worked on in memory.
For the heap, the table is simply appended to and only the indexes have to be constantly updated and they are small enough to stay in the block buffer cache as they are worked on.

This is why when I got my partitioning “wrong” the load took so much longer. More physical IO was needed as the larger partition would not fit into the cache as it was worked on – A quick check shows that logical reads and in fact almost all statistics were very similar but 26,000 IO requests were made (compared to 81,458 for the IOT and 3,068 for the correct IOT_P).

Of course, I set my SGA size and thus the buffer cache to highlight the issue on my laptop and I have to say even I was surprised by the magnitude of the impact. On the enterprise-level system I did my client’s proof of concept on, the impact on insert was less because the buffer cache could hold the whole working set, I suspect the SAN had a considerable cache on it, there was ample CPU resource to cope with the added latching effort and the time taken to actually create the data inserted was a significant part of the workload, reducing the overall impact of the slowness caused by the IOT.

{Update, in This little update I increase my block buffer cache and show that physical IO plummets and the IOT insert performance increases dramatically}.

This demonstrates that a POC, especially one for what will become a real system, has to be a realistic volume on realistic hardware.
For my client’s POC, I still did have to bear in mind the eventual size of the live working set and the probably size of the live block buffer cache and make some educated guesses.

It also explains why my “run_quiet” timings showed a greater benefit for the heap table than the IOT and IOT_P. A windows machine has lots of pretty pointless things running that take up cpu and a bit of memory, not really IO so much. I reduced the CPU load and it benefits activity that is not IO, so it has more impact on the heap table load. Much of the time for the IOT and IOT_P is taken hammering the disk and that just takes time.

So, in summary:

  • Using an IOT increases the index block splitting and, in turn, enqueues and general workload. The increase is in proportion to the size of the IOT compared to the size of the replaced PK.
  • The performance degredation across the whole load process may well be less than 50% but the only way to really find out is to test
  • You may lose the array processing load that may benefit a heap table load if you do the load via an intermediate table.
  • With an IOT you may run into issues with physical IO if the segment (or part of the segment) you are loading into cannot fit into the buffer cache (This may be an important consideration for partitioning or ordering of the data loaded)
  • If you do a proof of concept, do it on a system that is as similar to the real one as you can
  • Just seeing the elapsed time difference between test is sometimes not enough. You need to find out where that extra time is being spent

I’ve thrown an awful lot at you in this one post, so I think I will stop there. I’ve not added the script to create the test tables here, they are in IOT-5 {lacking only the grabbing of the v$sesstat information}.

Follow

Get every new post delivered to your Inbox.

Join 218 other followers