
IOT part 3 – Significantly Reducing IO August 2, 2011

Posted by mwidlake in development, performance.

<..IOT1 – the basics
<….IOT2 – Examples and proofs
……>IOT4 – Boosting Buffer Cache Efficiency
……..>IOT5 – Primary Key issues
……….>IOT6a – Slowing Down Insert
…………>IOT6(B) – OLTP Inserts

In the previous two posts I covered the basics of Index Organized Tables (IOTs) and then created some test tables to demonstrate the benefit of IOTs that is most often covered – reducing the IO needed to get a single record by one IO, say from 5 to 4. {Whether this is a buffer get from memory or a disc IO depends on whether the block is cached, of course}.

In this post I am going to show how IOTs can far more significantly reduce the IO when several related rows are required.

Below is one of my test tables, the one that is a normal heap table and has a primary key, CHHE_PK on PARE_ID and CRE_DATE:

mdw11> desc child_heap
 Name                                      Null?    Type
 ----------------------------------------- -------- --------------
 PARE_ID                                   NOT NULL NUMBER(10)
 CRE_DATE                                  NOT NULL DATE
 VC_1                                      NOT NULL VARCHAR2(100)
 DATE_1                                             DATE
 NUM_1                                              NUMBER(2)
 NUM_2                                              NUMBER(2)

--
mdw11> select count(*),count(distinct(pare_id)) from child_heap

  COUNT(*) COUNT(DISTINCT(PARE_ID))
---------- ------------------------
   1000000                     9999

As you can see, the table has 1 million records and 9,999 distinct values for PARE_ID, so there are approximately 100 records per parent. The data was created to match a common situation – a bit of data coming in for each parent every day. See post 2 for details.

The result of this is that the data for any given parent is scattered through the table. As the data comes in for a given day, the data for the first parent is added to the end of the table, followed by all the data for all the other parents who have data that day. The next day this is repeated, so the child records for a given parent are interspersed with the child records for many other parents.

The diagram below demonstrates what happens if you issue a statement like
select *
from CHILD_HEAP
where PARE_ID=12

Oracle quickly works down the index to the leaf block containing the first key that matches the range. This takes, in my example, 4 block reads. Oracle now works through the index entries and, via the rowid, identifies the exact block to visit in the table for each key. For each key it has to visit a new block – because the data is scattered through the table. This is what the clustering_factor in the index statistics is measuring: how often contiguous rows in the index point to the same table block. In our case, almost never.
I do not number those table reads in my diagram, but in this simplistic example it would be 10 further reads.
If Oracle reaches the end of the leaf block before it reaches the end of the range of key values, Oracle follows the pointer in the leaf block (not shown) to the next leaf block (which is another block read) and continues working through the keys until the range scan is completed.

In my simplified diagram I only have 6 entries per leaf block. In reality, and in my example tables, this is more like a few hundred. 247 in the case of CHHE_PK.

Now let’s consider my Index Organized Table, CHILD_IOT. It has exactly the same columns as CHILD_HEAP and the data was created in the same way. However, because it is an IOT, as the data came in it was inserted into the primary key index and is thus in an ordered state.

The diagram below demonstrates what happens if you issue a statement like
select *
from CHILD_IOT
where PARE_ID=12

Oracle works down the index to the leaf block where the range scan begins and now simply works along the leaf blocks. There is no need to go and visit the table as there is no table.

In my IOT diagram the leaf entries are longer and there are fewer in each leaf block, i.e. 5. So my scan has to visit 3 leaf blocks rather than 2. In reality the difference is more pronounced: in my example table there are actually 56 rows per leaf block, compared to the 247 in the index on the heap table. As such, my scan on the IOT will cover more leaf blocks, but this is insignificant compared to the reduction in block visits gained by not having to hunt down records scattered over the table. Even in the unlikely event of my IOT being deeper by 1 level (an extra layer of branch blocks) due to the reduced number of entries per leaf block, I would still be winning for range scans.

That is all nice theory and pictures. As ever, we need to back this up with some real tests. Firstly, I am using SQL*Plus and I need to set my arraysize large enough that I do not introduce extra consistent gets by fetching small sets of rows between client and server. You will need to do the same to see similar results to me.
{I keep meaning to do a dedicated post on arraysize but H.Tonguç Yılmaz has a nice post on it already.}

set arraysize 200
set autotrace on
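
{If you want to see the fetch-size effect for yourself, here is a rough sketch, not a benchmark: with a small arraysize the 82 rows fetched below need several fetch calls, and each call re-visits the block it stopped in, so you should see a handful more consistent gets than with arraysize 200, where one fetch returns everything.}

-- a hedged illustration of the arraysize effect
set autotrace traceonly statistics
set arraysize 15   -- the SQL*Plus default; 82 rows take ~6 fetch calls
select pare_id,cre_date,vc_1 from child_heap where pare_id =10;
set arraysize 200  -- one fetch covers all 82 rows
select pare_id,cre_date,vc_1 from child_heap where pare_id =10;
set autotrace on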

Now I will select all the records for PARE_ID=10, including a column not in the Primary Key, so that the table needs to be visited. I did this twice to remove the parsing overhead:

select pare_id,cre_date,vc_1
from child_heap
where pare_id =10
order by cre_date

   PARE_ID CRE_DATE  VC_1
---------- --------- -----------------------------------------------------------------------
        10 17-APR-11 LDOBKMLCYCSQYBDFIUISJWQAHNYSQOSUQJKIGCSEJHDPOFFLHHXYSMDSQNUB
        10 18-APR-11 LBGDNOYQFQMTMJQRAUWSRNBTHQSKBEUVLZSFWEGULOPDXQSVXOIC
        10 18-APR-11 LBGDNOYQFQMTMJQRAUWSRNBTHQSKBEUVLZSFWEGULOPDXQSVXOICOSFTSYNO
        10 19-APR-11 IBVTIGYBXJLMZQKRPJZEPXLMQLNOYNWLQOYVVGARNSITZWULVBYLEJKZNII
        10 19-APR-11 IBVTIGYBXJLMZQKRPJZEPXLMQLNOYNWLQOYVVGARNSITZWULVBYLEJ
        10 19-APR-11 IBVTIGYBXJLMZQKRPJZEPXLMQLNOYNWLQOYVVGARNSITZWULVBYLEJ
        10 20-APR-11 USIGVSPPIUUXEIRBMPFNBTTMDUJTVITHKQWZAKZOMJEDZCUPQAEFQQEYM
        10 20-APR-11 USIGVSPPIUUXEIRBMPFNBTTMDUJTVITHKQWZAKZOMJEDZCUPQAEF
...
        10 19-JUL-11 BNOYCIDTFJHPPOYPSVAVKJSYUNVPGPHLJXUOIKYKASKHYGZNVHVFFGPVAKN
        10 25-JUL-11 HDFGAQWTYZBSVYVXTFFRDIAKRYWFUPFCNDCETHUWHSQUITHHVUEJTJ

82 rows selected.


Execution Plan
------------------------------------------------------------------------------------------
| Id  | Operation                   | Name       | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |            |   100 |  6900 |   103   (0)| 00:00:02 |
|   1 |  TABLE ACCESS BY INDEX ROWID| CHILD_HEAP |   100 |  6900 |   103   (0)| 00:00:02 |
|*  2 |   INDEX RANGE SCAN          | CHHE_PK    |   100 |       |     3   (0)| 00:00:01 |
------------------------------------------------------------------------------------------

Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
         86  consistent gets
          0  physical reads

82 rows returned for 86 consistent gets. That will be 4 consistent gets to process the index blocks and 82 to fetch the table blocks, one get per row.

Now let’s repeat that on the IOT:

select pare_id,cre_date,vc_1
from child_IOT
where pare_id =10
order by cre_date
mdw11> /

   PARE_ID CRE_DATE  VC_1
---------- --------- ------------------------------------------------------------
        10 17-APR-11 QJHQXTQAYEUICPNDQTYMMFZPWJSIDLBKOXYTHLEHKTVWUPKQMWUUX
        10 18-APR-11 BUTUEWDCDQVPLTPPRFGBBEDOZYRPERPRROVUQPTSRZLHKVBSBUEAMZYAS
        10 18-APR-11 BUTUEWDCDQVPLTPPRFGBBEDOZYRPERPRROVUQPTSRZLHKVBSBUEAMZY
        10 19-APR-11 DEGNPALVLMIDYCYIQIIQJJVZFTNIMEULMAGDEWVTOAKBNHOPUQJE
        10 19-APR-11 DEGNPALVLMIDYCYIQIIQJJVZFTNIMEULMAGDEWVTOAKBNHOPUQJ
...
        10 24-JUL-11 TJGLOEITTVXQTQPHSKGVERSGJDREYSKKCDUFMQXQVXMHMMDWPLJNSNK
        10 24-JUL-11 TJGLOEITTVXQTQPHSKGVERSGJDREYSKKCDUFMQXQVXMHMMDWPLJNSNKCN
        10 25-JUL-11 BCLLVPYMWAAQOVLILXARQZXEGAQAARPURIFKFKHROUSFORRYYXQZUAJHDBL

108 rows selected.


Execution Plan
----------------------------------------------------------------------------
| Id  | Operation        | Name    | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------
|   0 | SELECT STATEMENT |         |   100 |  6900 |     4   (0)| 00:00:01 |
|*  1 |  INDEX RANGE SCAN| CHIO_PK |   100 |  6900 |     4   (0)| 00:00:01 |
----------------------------------------------------------------------------

Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
          6  consistent gets
          0  physical reads

We actually gathered more data, 108 rows compared to 82, all for 6 consistent gets compared to 86 consistent gets. That is a reduction to less than 10% of the original effort.

Now for a more extreme test. I am going to select a single-row summary of the data for eleven parents (PARE_ID between 50 and 60), flushing the buffer cache between each run to show the impact when you have to do real IO to support those consistent gets. This is on a fairly old {4 years} laptop with a rather tired hard disc:

alter system flush buffer_cache

System altered.

Elapsed: 00:00:00.18

--
--

select count(*),sum (num_1)
from child_heap
where pare_id between 50 and 60

  COUNT(*) SUM(NUM_1)
---------- ----------
      1155      12031

Elapsed: 00:00:06.39

Execution Plan
-------------------------------------------------------------------------------------------
| Id  | Operation                    | Name       | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |            |     1 |     7 |  1203   (0)| 00:00:18 |
|   1 |  SORT AGGREGATE              |            |     1 |     7 |            |          |
|   2 |   TABLE ACCESS BY INDEX ROWID| CHILD_HEAP |  1200 |  8400 |  1203   (0)| 00:00:18 |
|*  3 |    INDEX RANGE SCAN          | CHHE_PK    |  1200 |       |     7   (0)| 00:00:01 |
-------------------------------------------------------------------------------------------

Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
       1157  consistent gets
       1112  physical reads

--
--

alter system flush buffer_cache

System altered.

Elapsed: 00:00:00.18

--
--

select count(*),sum (num_1)
from child_iot
where pare_id between 50 and 60

  COUNT(*) SUM(NUM_1)
---------- ----------
      1111      11528

Elapsed: 00:00:00.29

Execution Plan
-----------------------------------------------------------------------------
| Id  | Operation         | Name    | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |         |     1 |     7 |    24   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE   |         |     1 |     7 |            |          |
|*  2 |   INDEX RANGE SCAN| CHIO_PK |  1200 |  8400 |    24   (0)| 00:00:01 |
-----------------------------------------------------------------------------

Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
         25  consistent gets
         25  physical reads

The Heap took 6.39 seconds and required 1112 physical reads to support 1157 consistent gets.
The IOT took 0.29 seconds and required 25 physical reads to support 25 consistent gets.

I think we can all see that IOTs have the potential to greatly reduce physical and logical IO. Perhaps we should all be using IOTs more.

Final point. The Heap version needed fewer physical reads than consistent gets because some blocks read into the buffer cache held data required again later in the query.

The impact of IOTs on the buffer cache will be the topic of my next post on IOTs. I think { hope:-) } that many of you will be very interested and impressed by what you could gain…

IOT 2 – First examples and proofs July 26, 2011

Posted by mwidlake in development, performance.

<.. IOT1 – Basics
..>IOT3 – Great reductions in IO for IOTs
….>IOT4 – Boosting Buffer Cache Efficiency
……>IOT5 – Primary Key issues
……….>IOT6a – Slowing Down Insert
…………>IOT6(B) – OLTP Inserts

In my first post on IOTs I ran through the basics of what they are. Here I am going to create some test tables and show you a few things.

I am going to create a simple PARENT table with 9,999 records and then two CHILD tables. CHILD_HEAP, a normal table, and CHILD_IOT, an Index Organized Table. They have the same columns and will hold very similar data.

All of this is on Oracle 11.1 but is exactly the same on 10.2. 8K block size, tablespaces are auto segment space managed.

Here are the creation statements:

--first create the parent table, keyed by ID.
-- The other columns are not significant, they just represent "information"
create table mdw.parent
(id       number(10)    not null 
,name     varchar2(100) not null
,date_1   date          not null
,num_1    number(2)
,num_2    number(2) 
,constraint pare_pk primary key(id) 
 using index tablespace index_01
)
tablespace data_01
/
--
--Now put my 9999 parents into the table.
insert into parent
select rownum
,dbms_random.string('U',mod(rownum,10)+50)
,sysdate-(mod(rownum,500)+1000)
,mod(rownum,99)+1
,trunc(dbms_random.value(0,100))
from dual connect by level < 10000
/
--
-- create the table to hold the children as a heap table
create table child_heap
(pare_id   number(10)    not null
,cre_date  date          not null
,vc_1      varchar2(100) not null
,date_1    date
,num_1     number(2)
,num_2     number(2)
,constraint chhe_pk primary key(pare_id,cre_date) 
 using index tablespace index_01
)
tablespace data_01
/
--
-- create the table to hold the children as an IOT table
create table child_iot
(pare_id   number(10)    not null
,cre_date  date          not null
,vc_1      varchar2(100) not null
,date_1    date
,num_1     number(2)
,num_2     number(2)
,constraint chio_pk primary key(pare_id,cre_date) 
-- using index tablespace index_01 -- CANNOT STATE for IOT. State in table definition
)
ORGANIZATION INDEX -- This is it. This makes the table an IOT
tablespace data_01
/

There are only two differences between the statements creating the CHILD_HEAP and the CHILD_IOT tables.

The main one is the inclusion of the line ORGANIZATION INDEX, which is what instructs Oracle to create the table as an IOT. Note that it does not state the index and you cannot state the index; the IOT is created based on the Primary Key.
The other change is that you cannot state the tablespace for the Primary Key index. I’ve not played with this at all, but I don’t think you can state anything with the “using index” clause, as the table storage clauses are used for the Primary Key index. I personally find this a little illogical, as it is the index segment that is created, but I guess others would find it more natural that you still state this at the table level.

When I create IOTs on a real system, I put the IOT in a table tablespace {I still maintain table and index tablespaces, for reasons I won’t go into here}. I put it there as it holds the actual data. If I lose that Primary Key index I am losing real data, not duplicated data.

I then populated the two CHILD tables with data. The method of creating this test data is very important.

I am simulating a very common situation, where data is coming in for a set of Parents (think customers, accounts, scientific instruments, financial entities) and the data is coming in as a record or set of records each day. i.e. not where the parent and all of its child records are created at one time, like an order and its order lines. I am simulating where the child data is created a few records at a time, not all in one go.

The code is simple. It loops for one hundred days and for each day it creates 10,000 records for random parents. On each day any given parent will have none, one or several records. On average each parent will end up with 100 records, but some will have more and some less. The key thing is that the data for any given parent is created a record at a time, with lots of records created for other parents before the next record for that given parent.

The two tables will have the same pattern of data but not identical data. {I could have seeded the random number generator to make the two data sets the same, but this will do}. Below is the statement for one table; you just change the table name to populate each table. {BTW I like using the from dual connect by level <=x method of getting the number of rows desired – it is fast and neat, once you have seen it once}.
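
{If the connect by level trick is new to you, here is a trivial, standalone illustration of it before the real population code:}

-- DUAL has one row; CONNECT BY LEVEL <= n turns it into n rows
select rownum from dual connect by level <= 5;

    ROWNUM
----------
         1
         2
         3
         4
         5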

declare
v_num number :=10000; -- rows per day, and also the number of parents
v_str varchar2(60);
begin
dbms_output.put_line (to_char(SYSTIMESTAMP,'HH24:MI:SS.FF')); -- start time
for i in 1..100 loop -- days to do
  v_str:=dbms_random.string('U',60); -- a fresh random string each day
  insert into CHILD_HEAP
    (pare_id,cre_date,vc_1,date_1,num_1,num_2)
  select
    trunc(dbms_random.value(1,v_num))         -- a random parent for each row
   ,sysdate-(100-i) + (rownum/(60*60*24) )    -- day i, advancing a second per row
   ,substr(v_str,1,51+mod(rownum,10))         -- 51 to 60 chars of the day's string
   ,sysdate-(100-i) + ((mod(rownum,30)+1)/3)  -- a date up to 10 days after cre_date
   ,mod(rownum,20)+1
   ,mod(rownum,99)+1
  from dual connect by level <=v_num;
end loop;
dbms_output.put_line (to_char(SYSTIMESTAMP,'HH24:MI:SS.FF')); -- end time
end;
/

I then gathered object statistics on the tables.
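{The gathering itself was nothing special; something along these lines, with default options:}

exec dbms_stats.gather_table_stats(ownname=>USER,tabname=>'PARENT')
exec dbms_stats.gather_table_stats(ownname=>USER,tabname=>'CHILD_HEAP')
exec dbms_stats.gather_table_stats(ownname=>USER,tabname=>'CHILD_IOT')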
Let’s check the size of the tables:

select segment_name, segment_type,tablespace_name,blocks
from dba_segments where owner=USER and segment_name like 'CHILD%';

SEGMENT_NAME    SEGMENT_TYPE    TABLESPACE_NAME     BLOCKS
--------------- --------------- --------------- ----------
CHILD_HEAP      TABLE           DATA_01              12288

1 row selected.

ONE row? Where is the other table, where is CHILD_IOT? It does not exist.

Remember from my first post that I made the comment I would have preferred it if Index Organized Tables had been called something like ‘Table Containing Indexes’? The table data has been placed in the Primary Key index and the table segment does not even exist. If you start using IOTs this will catch you out periodically – it does me anyway, and I’ve been using them on and off for years :-).
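
{If you are ever unsure what you are looking at, the data dictionary flags IOTs for you; a quick sketch:}

select table_name, iot_type
from user_tables
where table_name like 'CHILD%';

TABLE_NAME                     IOT_TYPE
------------------------------ ------------
CHILD_HEAP
CHILD_IOT                      IOT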

Let’s look at the size of the primary key indexes:

select segment_name, segment_type,tablespace_name,blocks
from dba_segments where owner=USER and segment_name like 'CH%PK'
and segment_name not like '%ORD%'

SEGMENT_NAME    SEGMENT_TYPE    TABLESPACE_NAME     BLOCKS
--------------- --------------- --------------- ----------
CHHE_PK         INDEX           INDEX_01              4224
CHIO_PK         INDEX           DATA_01              19456

2 rows selected.

Note that the Primary Key index for CHILD_HEAP, CHHE_PK, is there and is 4,224 blocks in size, and the CHILD_IOT Primary Key, CHIO_PK, is a lot larger at 19,456 blocks. In fact, not only is the CHIO_PK index larger than the CHILD_HEAP table, it is larger than the combined size of the CHILD_HEAP table and the CHHE_PK index. So much for me saying last post that IOTs can save disk space? I’ll come back to that in a later post…

Here are some other stats from one of my scripts:

mdw11> @tab_sci_own
owner for Table: mdw
Name for Table: child_heap

OWNER    TABLE_NAME          NUM_ROWS      BLOCKS AVG_L GLS ULS LST_ANL      PRT  SAMP_SIZE
-------- -------------- ------------- ----------- ----- --- --- ------------ --- ----------
MDW      CHILD_HEAP          1000,000      12,137    83 YES NO  250711 22:01 NO     1000000


INDEX_NAME      TYP PRT UNQ BL     L_BLKS   DIST_KEYS       CLUSTF     LB_KEY     DB_KEY LST_ANL
--------------- --- --- --- -- ---------- ----------- ------------ ---------- ---------- ------------
CHHE_PK         NOR NO  UNI  2      4,034    1000,000      995,857          1          1 250711 22:02

INDEX_NAME                   TABLE_NAME       PSN COL_NAME
---------------------------- ---------------- --- ------------------------------------------------
CHHE_PK                      CHILD_HEAP       1   PARE_ID
CHHE_PK                      CHILD_HEAP       2   CRE_DATE

--
--
owner for Table: mdw
Name for Table: child_iot

OWNER    TABLE_NAME          NUM_ROWS      BLOCKS AVG_L GLS ULS LST_ANL      PRT  SAMP_SIZE
-------- -------------- ------------- ----------- ----- --- --- ------------ --- ----------
MDW      CHILD_IOT           1000,000                83 YES NO  250711 22:03 NO     1000000

INDEX_NAME      TYP PRT UNQ BL     L_BLKS   DIST_KEYS       CLUSTF     LB_KEY     DB_KEY LST_ANL
--------------- --- --- --- -- ---------- ----------- ------------ ---------- ---------- ------------
CHIO_PK         IOT NO  UNI  2     17,855     910,881            0          1          1 250711 22:03

INDEX_NAME                   TABLE_NAME       PSN COL_NAME
---------------------------- ---------------- --- ------------------------------------------------
CHIO_PK                      CHILD_IOT        1   PARE_ID
CHIO_PK                      CHILD_IOT        2   CRE_DATE

Note the lack of BLOCKS for the CHILD_IOT table and the CLUSTERING_FACTOR of 0 for the CHIO_PK.

The clustering factor is the number of times Oracle, when scanning the whole index in order, would have to swap to a different table block to look up the table record for each index entry. If it is close to the number of blocks in the table, then the clustering factor is low and the order of records in the table matches the order of entries in the index. This would make index range scans that need to visit the table reasonably efficient.

If the clustering factor is close to the number of records in the table, it means there is no correlation between index order and table row order, and index range scans that have to visit the table would be inefficient. Again, this is significant and will be the major topic of the next post.
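
{Both figures are sitting in the dictionary if you want to compare them directly; a minimal sketch rather than my full script:}

select index_name, leaf_blocks, clustering_factor, num_rows
from user_indexes
where index_name in ('CHHE_PK','CHIO_PK');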

The depth of the index does not change, being 3 in each case (a BL, or blevel, of 2).

So, can we see evidence of the theoretical efficiency of looking up single records via the IOT that I mentioned in the first post? Here we go {oh, usual disclaimer, I run the code twice and show the second run, to remove the parsing overhead}:

-- First the Heap table
select * from child_HEAP where PARE_ID=1234
AND cre_date=to_date('24-JUN-11 20:13:21','DD-MON-YY HH24:MI:SS')

   PARE_ID CRE_DATE  VC_1
---------- --------- ------------------------------------------------------
DATE_1         NUM_1      NUM_2
--------- ---------- ----------
      1234 24-JUN-11  LUTFHOCIJNYREYICQNORREAJOVBRIHFVLXNIGIVZDMFJCTGYFWC
25-JUN-11         11         16
1 row selected.

Execution Plan
------------------------------------------------------------------------------------------
| Id  | Operation                   | Name       | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |            |     1 |    83 |     3   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID| CHILD_HEAP |     1 |    83 |     3   (0)| 00:00:01 |
|*  2 |   INDEX UNIQUE SCAN         | CHHE_PK    |     1 |       |     2   (0)| 00:00:01 |
------------------------------------------------------------------------------------------

Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
          4  consistent gets


--and now the IOT table

select * from child_IOT where PARE_ID=1234
AND cre_date=to_date('24-JUN-11 21:23:41','DD-MON-YY HH24:MI:SS')

   PARE_ID CRE_DATE  VC_1
---------- --------- -------------------------------------------------------
DATE_1         NUM_1      NUM_2
--------- ---------- ----------
      1234 24-JUN-11
CSIGBHSXWNDDTCFRCNWYPRNLEQWPCRYTXQQZHACDEXHOBEYXLNYBHRUHJ
27-JUN-11          7         52
1 row selected.

Execution Plan
-----------------------------------------------------------------------------
| Id  | Operation         | Name    | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |         |     1 |    83 |     2   (0)| 00:00:01 |
|*  1 |  INDEX UNIQUE SCAN| CHIO_PK |     1 |    83 |     2   (0)| 00:00:01 |
-----------------------------------------------------------------------------

Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
          3  consistent gets

{I had to look up the exact values of CRE_DATE of a couple of records to do the above queries}

To look up a single row with the heap table you can see that the explain plan was to carry out a unique scan on the primary key and then look up the row via the rowid, taking 4 consistent gets: 3 to walk down the index and get the rowid, and one to read the table block.

For the IOT table the explain plan reveals that there was simply an index unique scan of the Primary Key, nothing more. All the data for the row was there in the index entry, rather than a rowid pointing to it. Thus only 3 consistent gets were required.

For single row lookups on the Primary Key, IOTs are more efficient than traditional Heap tables with a Primary Key index. {Please, no one point out that if all the columns you need are in the index you also do not need to go to the table, that is a different topic}.

Quite a few people have shown this efficiency before but the next step is far, far more interesting and shows a much more significant impact of IOTs. That is the topic of the next post :-).

For now, I am going to finish off with what happens with range scans as I suggested they could slow down with an IOT.
Below, I select count(*) for just one of the parent values.

select count(*) from child_heap where pare_id = 2

  COUNT(*)
----------
        98

Execution Plan
-----------------------------------------------------------------------------
| Id  | Operation         | Name    | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |         |     1 |     4 |     3   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE   |         |     1 |     4 |            |          |
|*  2 |   INDEX RANGE SCAN| CHHE_PK |   100 |   400 |     3   (0)| 00:00:01 |
-----------------------------------------------------------------------------

Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
          3  consistent gets

--
--

select count(*) from child_iot where pare_id = 2

  COUNT(*)
----------
        93

Execution Plan
-----------------------------------------------------------------------------
| Id  | Operation         | Name    | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |         |     1 |     4 |     4   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE   |         |     1 |     4 |            |          |
|*  2 |   INDEX RANGE SCAN| CHIO_PK |   100 |   400 |     4   (0)| 00:00:01 |
-----------------------------------------------------------------------------

Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
          4  consistent gets

Both statements carry out a range scan on the Primary Key of the table. For the normal HEAP table this takes 3 consistent gets, which is no surprise as we have an 8k block size and only around 100 rows for a given parent, which happen to fit into one block of the index. So Oracle works down the depth of the index and looks at just one leaf block.

For the IOT the scan works down the index in the same way but then has to scan along two leaf blocks. Even though there are fewer entries, 93 compared to 98, they span two leaf blocks and thus the total number of consistent gets is 4.

Admittedly I was a little lucky in my example above. Sometimes the entries for one parent will span 2 blocks of the heap table’s Primary Key index, and sometimes the entries for the IOT will span 3 blocks. But if you look at the number of leaf blocks in the earlier stats (4,034 for the normal index and 17,855 for the IOT, both for the same million rows), usually the 100 or so entries for a single parent will all fall into one block of the normal index and into 2 or 3 blocks of the IOT.

A select count(*) will full scan the smallest segment that can satisfy the query. Let’s try it:

mdw11> select count(*) from child_heap

  COUNT(*)
----------
   1000000

Execution Plan
-------------------------------------------------------------------------
| Id  | Operation             | Name    | Rows  | Cost (%CPU)| Time     |
-------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |         |     1 |   989   (1)| 00:00:15 |
|   1 |  SORT AGGREGATE       |         |     1 |            |          |
|   2 |   INDEX FAST FULL SCAN| CHHE_PK |  1000K|   989   (1)| 00:00:15 |
-------------------------------------------------------------------------

Statistics
----------------------------------------------------------
          1  recursive calls
          2  db block gets
       4109  consistent gets
       4088  physical reads

mdw11> select count(*) from child_iot

  COUNT(*)
----------
   1000000

Execution Plan
-------------------------------------------------------------------------
| Id  | Operation             | Name    | Rows  | Cost (%CPU)| Time     |
-------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |         |     1 |  4359   (1)| 00:01:05 |
|   1 |  SORT AGGREGATE       |         |     1 |            |          |
|   2 |   INDEX FAST FULL SCAN| CHIO_PK |  1000K|  4359   (1)| 00:01:05 |
-------------------------------------------------------------------------

Statistics
----------------------------------------------------------
          1  recursive calls
          0  db block gets
      19298  consistent gets
      19246  physical reads

The numbers of consistent gets (and physical reads) are close to the numbers of leaf blocks in the two segments, though a little higher. This is because Oracle is scanning the whole index, branch blocks as well as leaf blocks. The scan is far more expensive for the IOT, simply because the index is so much larger. I’ve not shown timings, but on my little laptop the count(*) takes about 3 seconds on CHILD_HEAP and about 5 seconds on CHILD_IOT.

That is enough for one post.

Index Organized Tables – the Basics. July 18, 2011

Posted by mwidlake in development, internals, performance.

..>IOT2 – Examples and proofs
….>IOT3 – Greatly reducing IO with IOTs
……>IOT4 – Boosting Buffer Cache Efficiency
……..>IOT5 – Primary Key issues
……….>IOT6a – Slowing Down Insert
…………>IOT6(B) – OLTP Inserts

I think Index Organized Tables (IOTs) are a much under-used and yet very useful feature of Oracle. Over the next few postings I’m going to cover some aspects of Index Organized Tables, both good and not-so-good. I am going to cover some benefits of IOTs that I think many people are unaware of. In this first post I am just going to run through the basics of IOTs.

The idea behind an IOT is simple. You hold all the data for the table in the ordered structure of an index. Why would you want to do that? Let us consider a very common requirement, accessing a row in a “large” table via a known, unique key.

Traditionally you have a heap table holding the data you want to access and a standard index to support access to that table. See the first diagram below. The 4-layer triangle represents the index, with a root block, two levels of branch blocks and then the leaf blocks at the “bottom”. The blue rectangle represents the table with the squares being individual rows. Of course, in a large table there would be thousands or millions of “squares”, this is just a simple diagram to show the idea.

When you issue a SQL statement to select the row via the indexed column(s), Oracle will read the root block (1), find the relevant block in the first level of branch blocks (2), then the relevant block in the second level of branch blocks (3) and finally (as far as the index is concerned) the relevant leaf block (4) for the unique key. The leaf block holds the indexed column(s) and also the rowid. The rowid is the fastest way to look up a record; it states the file, block and row offset for the row. This allows Oracle to go straight to the block and get the row. That is read number (5).
The number of branch blocks {and thus the number of blocks that need to be read to find a row} will vary depending on how much data is indexed, the number and size of the columns in the index, how efficiently the space has been used in the blocks and one or two other factors. In my experience most indexes for tables with thousands or millions of rows have one, two or three levels of branch blocks.
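
{You can check this for your own indexes rather than guess: BLEVEL in the dictionary is the number of branch levels walked before the leaf, so the blocks read to reach a leaf is BLEVEL+1. A sketch, with a made-up table name:}

select index_name, blevel, leaf_blocks
from user_indexes
where table_name = 'MY_BIG_TABLE';  -- hypothetical table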

The second diagram shows a representation of the Index Organized Table. The table has in effect disappeared as a distinct object and the information has been moved into the leaf blocks of the index {part of me feels Index Organized Tables should really be called Table Organized Indexes or Table Containing Indexes as that would better indicate what is physically done}:

So with the IOT Oracle reads the root block (1), the two branch level blocks (2 and 3) and finally the leaf block (4). The leaf block does not hold the rowid but rather the rest of the columns for the table {this can be changed: a more advanced feature allows you to store some or all of the extra columns in an overflow segment}. Thus to access the same data, Oracle has to read only 4 blocks, not 5. Using an IOT saves one block read per unique lookup.
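
{A hedged sketch of that overflow feature, with made-up names: the key columns and VC_1 stay in the leaf entry, the trailing columns are pushed to a separate overflow segment:}

create table child_iot_ovf
(pare_id   number(10)    not null
,cre_date  date          not null
,vc_1      varchar2(100) not null
,date_1    date
,num_1     number(2)
,constraint chov_pk primary key(pare_id,cre_date)
)
organization index
tablespace data_01
including vc_1               -- columns up to and including VC_1 stay in the leaf
overflow tablespace data_01  -- DATE_1 and NUM_1 go to the overflow segment
/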

This saving of block reads is probably the main feature that IOTs are known for, but there are others which I will cover in later posts. Two things I will mention now are that, firstly, the use of IOTs potentially saves disc space. An index is in effect a duplication of data held in the table. When you create an index no new information is created, but space is used up holding some of the table information in a structure suitable for fast lookup. Secondly, the index and the table both have to be maintained whenever a change is made to the columns that are indexed. IOTs reduce this maintenance overhead as there is only one thing to maintain.

Now for some drawbacks.

  • The IOT has to be indexed on the primary key. There is no option to create an IOT based on other indexes. As such you have to either be accessing the table via the primary key to get the benefit – or you have to be a little cunning.
  • The index is going to be larger than it was and very often larger than the original table. This can slow down range scans or full scans of the index and a “full table scan” will now be a full index scan on this large object, so that can also negatively impact performance. However, if a range scan would then have resulted in access to the table to get extra columns, the IOT gives a similar benefit in reducing IO to that for single row lookups.
  • I just want to highlight that you now have no rowid for the rows.
  • Secondary indexes are supported but will potentially be less efficient due to this lack of rowid.

So, a brief summary is that Index Organized Tables effectively move the table data into the Primary Key index, reduce the number of block lookups needed to select one row, and can save some disc space. But you can only organize the table via the Primary Key, and it can make full or partial table scans and lookups via other indexes slower.

There are several more important benefits to IOTs {in my opinion} which I will come to over the next week or two.

Fastest £1,000 server – what happened? July 12, 2011

Posted by mwidlake in One Grand Server, performance.

A couple of people have asked me recently what happened to that “fastest Oracle server for a grand” idea I had last year, after all I did announce I had bought the machine.

{Update – it came back.}
Well, a couple of things happened. Firstly, what was a small job for a client turned into a much more demanding job for a client – not so much mentally harder as time-consuming harder and very time consuming it was. So the playing had to go on hold, the client comes first. The server sat in the corner of the study, nagging me to play with it, but it remained powered down.
Secondly, when the work life quietened down last month and I decided to spend a weekend getting that server set up, I hit an issue. I turned on the server and it turned itself straight off. It then rested for 5 seconds and turned itself back on for half a second – and then straight off. It would cycle like that for as long as I was willing to let it.

OK, duff power switch, mother board fault, something not plugged in right, PSU not reaching stable voltage… I opened the case and checked everything was plugged in OK and found the manufacturer had covered everything with that soft resin to hold things in place. I pressed on all the cards etc in hope but no, it was probably going to have to go back. It is still in warranty, the manufacturer can fix it.

So I rang the manufacturer and had the conversation. They were not willing to try and diagnose over the phone so I had to agree to ship it back to them to be fixed {I did not go for on-site support as the only time I did, with Evesham Micros, they utterly refused to come out to fix the problem. Mind you, it turns out they were counting down the last week or two before going bust and, I suspect, knew this}. I shipped it back and the waiting began. Emails ignored, hard to get in touch over the phone. Over three weeks on, they only started looking at the machine last Friday (they claim).

On the positive side, this delay means that solid state storage is becoming very affordable and I might be able to do some more interesting things within my budget.
On the bad side the technology has moved on and I could get a better server for the same money now, but that is always the case. Mine does not have the latest Sandy Bridge Intel processor for example. Also, I have time now to work on it, I hope not to have time next month as I’d like to find some clients to employ me for a bit!

I better go chase the manufacturer. If it is not fixed and on its way back very, very soon then they will be off my list of suppliers and I’ll be letting everyone know how good their support isn’t.

Why is my SYSAUX Tablespace so Big? Statistics_level=ALL June 2, 2011

Posted by mwidlake in AWR, performance.

One of my most popular postings is about why your SYSTEM tablespace could be rather large. Recently I’ve had issues with a SYSAUX tablespace being considerably larger than I expected, so I thought I would do a sister posting on the reason.

The client I was working with at the time was about to go live with a new application and database. For various reasons I was a little anxious about how the Java application (the User Interface) would actually call the stored PL/SQL code I had helped develop. Initial workloads would be low and so I asked that the STATISTICS_LEVEL be set to ALL, so that bind variables (amongst other things) would be gathered. This is on version 10.2.0.4, btw, enterprise edition and 4-node RAC.
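
{For reference, a sketch of the setting involved; TYPICAL is the default and, as you will see below, ALL is not something to leave on casually:}

show parameter statistics_level
alter system set statistics_level = ALL scope = both;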

We went live, issues were encountered and resolved, the usual way these things work. Then, a few weeks in, when everything was still very “exciting” from a problem resolution perspective, I got an odd email from the DBA team: would we like them to add another datafile to the SYSAUX tablespace? Huh? I checked. I’d been watching the size of our application’s tablespaces but not the others {well, I was not supposed to be a DBA and I was watching an awful lot of other things}. Our SYSAUX tablespace was around 160GB in size, having pretty much filled its 5th datafile. Why? I checked to see what was taking up the space in the tablespace:

 select * from
 (select owner,segment_name||'~'||partition_name segment_name,bytes/(1024*1024) size_m
 from dba_segments
 where tablespace_name = 'SYSAUX'
 ORDER BY BLOCKS desc)
 where rownum < 40

OWNER              SEGMENT_NAME                                             SIZE_M
------------------ -------------------------------------------------- ------------
SYS                WRH$_LATCH_CHILDREN~WRH$_LATCH__14459270_3911            27,648
SYS                WRH$_LATCH_CHILDREN_PK~WRH$_LATCH__14459270_3911         26,491
SYS                WRH$_LATCH_CHILDREN~WRH$_LATCH__14459270_3537            23,798
SYS                WRH$_LATCH_CHILDREN_PK~WRH$_LATCH__14459270_3537         22,122
SYS                WRH$_LATCH_CHILDREN~WRH$_LATCH__14459270_4296            17,378
SYS                WRH$_LATCH_CHILDREN_PK~WRH$_LATCH__14459270_4296         16,818
SYS                WRH$_ACTIVE_SESSION_HISTORY~WRH$_ACTIVE_14459270_3911       136
SYS                WRH$_SQLSTAT~WRH$_SQLSTA_14459270_3911                       96
SYS                WRH$_SQLSTAT~WRH$_SQLSTA_14459270_3537                       72
SYS                WRH$_SQLSTAT~WRH$_SQLSTA_14459270_4296                       47
SYS                WRH$_LATCH_MISSES_SUMMARY_PK~WRH$_LATCH__14459270_3537       45
SYS                I_WRI$_OPTSTAT_H_OBJ#_ICOL#_ST~                              41
SYS                WRH$_SYSMETRIC_SUMMARY~                                      40
SYS                WRH$_LATCH_MISSES_SUMMARY_PK~WRH$_LATCH__14459270_           37

As you can see, almost all the space is being taken up by WRH$_LATCH_CHILDREN and WRH$_LATCH_CHILDREN_PK partitions. They are massive compared to other objects. A quick google did not come up with much, other than many hits just listing what is in SYSAUX, and the odd person also seeing SYSAUX being filled up by these objects with suggested ways to clear down space, but nothing about the cause.

I had a chat with the DBAs and we quickly decided that this was going to be something to do with AWR, given the names of the objects – “WRH$_” objects are the things underlying AWR. The DBA suggested my settings of 15 minute intervals and 35 day retention were too aggressive. I knew this was not the case; I’ve had more aggressive snapshot intervals and longer retention periods on far busier systems than this. I did not have access to Metalink at that point so I asked the DBAs to raise a ticket, which they duly did.
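
{For reference, the snapshot interval and retention mentioned above are set via DBMS_WORKLOAD_REPOSITORY, both arguments in minutes; a sketch of the settings I described:}

begin
  dbms_workload_repository.modify_snapshot_settings(
     interval  => 15        -- minutes between AWR snapshots
    ,retention => 35*24*60  -- keep 35 days of snapshots
  );
end;
/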

Oracle support cogitated for a couple of days and came back with the advice to reduce the retention period. Hmmmm. Via the DBA I asked Oracle support to explain why those objects were so large when I had not seen this issue on several other systems. Was it a bug? I had by now compared notes with a friend from a previous site, with 5 minute snapshot intervals and a two month retention period, where the SYSAUX tablespace was about 10GB all in. I did not want to go changing things if we did not know it would fix the issue, as we really wanted to stop the growth of SYSAUX as soon as possible, not just try a hunch.

As you probably realise from the title of this blog post, the issue was not the snapshot interval or retention period but the STATISTICS_LEVEL=ALL. The one and only hit you get in Metalink if you search on WRH$_LATCH_CHILDREN is note 874518.1. From V10.1.0.2 to V11.1.0.7, setting this parameter to ALL is known to create a lot of data about latch children and not clear it down when the AWR data is purged (Bug 8289729). The advice was to change STATISTICS_LEVEL and make the snapshot interval larger. I’d suggest you just need to alter the STATISTICS_LEVEL, unless you really, really need that extra information gathered. It seemed to take Oracle Support an extra day or two to find that note for us. {I’ve since checked out Metalink directly to confirm all this}.

So with a known issue we felt confident that altering the initialisation parameter would solve the issue. It took a while for us to change the STATISTICS_LEVEL on the production system – Change Control for that site is rather robust. This allowed us to see some other impacts of this issue.

The mmon process, which looks after AWR data, was becoming a top session in our OEM performance screens. In particular, a statement with SQL id 2prbzh4qfms7u that inserted into the WRH$_LATCH_CHILDREN table was taking several seconds to run each time and was running quite often {I include the SQL ID as it may be the same on many Oracle V10 systems, as it is internal code}:

The internal SQL inserting into wrh$_latch_children was becoming demanding

This was doing a lot of IO, by far the majority of the IO on our system at the time – it was a new system and we had been able to tune out a lot of the physical IO.

The physical IO requirements and 15-20 second elapsed time made this our most demanding statement on the system

We also now started to have issues with mmon running out of undo space when it ran at the same time as our daily load. This was particularly unfortunate as it coincided with a period of “intense management interest” in the daily load…

What was happening to the size of the SYSAUX tablespace?

Enter the tablespace (or leave null)> sys

TS_NAME              ORD      SUM_BLKS        SUM_K  MAX_CHNK_K NUM_CHNK
-------------------- ----- ----------- ------------ ----------- --------
SYSAUX               alloc  58,187,904  465,503,232  33,553,408       14
                     free       10,728       85,824      21,504       20
SYSTEM               alloc     128,000    1,024,000   1,024,000        1
                     free       68,360      546,880     546,752        3

4 rows selected.

select * from
(select owner,segment_name||'~'||partition_name segment_name,bytes/(1024*1024) size_m
 from dba_segments
 where tablespace_name = 'SYSAUX'
 ORDER BY BLOCKS desc)
where rownum < 40

OWNER    SEGMENT_NAME                                                     SIZE_M
-------- ------------------------------------------------------------ ----------
SYS      WRH$_LATCH_CHILDREN~WRH$_LATCH__14459270_6201                     30262
         WRH$_LATCH_CHILDREN~WRH$_LATCH__14459270_5817                     29948
         WRH$_LATCH_CHILDREN~WRH$_LATCH__14459270_5435                     28597
         WRH$_LATCH_CHILDREN~WRH$_LATCH__14459270_4675                     28198
         WRH$_LATCH_CHILDREN~WRH$_LATCH__14459270_3911                     27648
         WRH$_LATCH_CHILDREN_PK~WRH$_LATCH__14459270_5817                  27144
         WRH$_LATCH_CHILDREN~WRH$_LATCH__14459270_6585                     26965
         WRH$_LATCH_CHILDREN_PK~WRH$_LATCH__14459270_6201                  26832
         WRH$_LATCH_CHILDREN_PK~WRH$_LATCH__14459270_4675                  26741
         WRH$_LATCH_CHILDREN_PK~WRH$_LATCH__14459270_3911                  26491
         WRH$_LATCH_CHILDREN~WRH$_LATCH__14459270_4296                     26307
         WRH$_LATCH_CHILDREN_PK~WRH$_LATCH__14459270_5435                  26248
         WRH$_LATCH_CHILDREN_PK~WRH$_LATCH__14459270_4296                  25430
         WRH$_LATCH_CHILDREN_PK~WRH$_LATCH__14459270_6585                  25064
         WRH$_LATCH_CHILDREN~WRH$_LATCH__14459270_5058                     24611
         WRH$_LATCH_CHILDREN_PK~WRH$_LATCH__14459270_5058                  23161
         WRH$_LATCH_CHILDREN~WRH$_LATCH__14459270_6966                      9209
         WRH$_LATCH_CHILDREN_PK~WRH$_LATCH__14459270_6966                   8462
         WRH$_SYSMETRIC_SUMMARY~                                             152
         WRH$_ACTIVE_SESSION_HISTORY~WRH$_ACTIVE_14459270_3911               136
         WRH$_SQLSTAT~WRH$_SQLSTA_14459270_3911                               96

@sysaux_conts

OWNER                          OBJ_PART_NAME                                SIZE_M
------------------------------ ---------------------------------------- ----------
SYS                            WRH$_LATCH_CHILDREN-WRH                  231745.063
SYS                            WRH$_LATCH_CHILDREN_PK-WRH               215573.063
SYS                            WRH$_SQLSTAT-WRH                           711.0625
SYS                            WRH$_LATCH_MISSES_SUMMARY_PK-WRH           439.0625
SYS                            WRH$_ACTIVE_SESSION_HISTORY-WRH            437.0625
SYS                            WRH$_LATCH_PARENT-WRH                      292.0625
SYS                            WRH$_LATCH-WRH                             276.0625
SYS                            WRH$_LATCH_MISSES_SUMMARY-WRH              273.0625
SYS                            WRH$_SEG_STAT-WRH                          268.0625
SYS                            WRH$_LATCH_PARENT_PK-WRH                   239.0625
SYS                            WRH$_SYSSTAT_PK-WRH                        237.0625

Yes, that is close to half a terabyte of SYSAUX and it is all used. More partitions have appeared, and the sizes of the largest segments in SYSAUX show how WRH$_LATCH_CHILDREN and WRH$_LATCH_CHILDREN_PK make up the vast majority of the space used.

Shortly after, we finally got permission to change the live system. The impact was immediate: mmon dropped from being the most demanding session, that SQL code dropped down the rankings, and the issues with running out of undo ceased.

I was anxious to see if the old data got purged, as the Metalink note had suggested the data would not be purged. Thankfully, that was not the case. The space was slowly released as normal purging of data outside the retention period took place and after just over a month, the SYSAUX tablespace contained a lot less information and was mostly free space:

OWNER      OBJ_PART_NAME                                SIZE_M
---------- ---------------------------------------- ----------
SYS        WRH$_LATCH_MISSES_SUMMARY_PK-WRH           512.0625
SYS        WRH$_LATCH_MISSES_SUMMARY-WRH              350.0625
SYS        WRH$_LATCH-WRH                             304.0625
SYS        WRH$_SQLSTAT-WRH                           280.0625
SYS        WRH$_LATCH_PK-WRH                          259.0625
SYS        WRH$_SYSSTAT_PK-WRH                        247.0625
SYS        WRH$_SERVICE_STAT_PK-WRH                   228.0625
SYS        WRH$_PARAMETER_PK-WRH                      201.0625
SYS        WRH$_PARAMETER-WRH                         169.0625
SYS        WRH$_SYSSTAT-WRH                           169.0625
SYS        WRH$_SEG_STAT-WRH                          161.0625
SYS        WRH$_SYSTEM_EVENT_PK-WRH                   156.0625
SYS        WRH$_SYSMETRIC_SUMMARY-                         152
SYS        WRH$_SYSTEM_EVENT-WRH                      133.0625
SYS        WRH$_SERVICE_STAT-WRH                      123.0625
SYS        WRH$_ACTIVE_SESSION_HISTORY-WRH            115.0625

TS_NAME              ORD      SUM_BLKS        SUM_K  MAX_CHNK_K NUM_CHNK
-------------------- ----- ----------- ------------ ----------- --------
SYSAUX               alloc  58,251,904  466,015,232  33,553,408       15
                     free   57,479,400  459,835,200   4,063,232    1,208
SYSTEM               alloc     128,000    1,024,000   1,024,000        1
                     free       68,048      544,384     544,320        2

Now, how do we get that space back? I left that with the DBA team to resolve.

Oh, one last thing. I mentioned the above to a couple of the Oaktable lot in the pub a few weeks back. Their instant response was to say “You set STATISTICS_LEVEL to ALL on a live system?!? You are mad!”
:-)

{Update, I’ve just spotted this posting by Colbran which is related. Last time I googled this I just got a stub with no information}

I am Neo off the Matrix (apparently) March 30, 2011

Posted by mwidlake in AWR, performance.

I know I have mentioned it before, but I am a big fan of the OEM performance screens that are derived from the ASH/AWR information. One of the things I really like about it is the immediate information it gives you, in one glance, that things are “not normal”. Once you notice that things are not normal you can then, within a few seconds, get a feel for what is still probably OK and where you have something that has changed.

As an example of the immediate information, I recently came back to my desk and glanced at my OEM performance screen. It was showing the below:

This will not be an interesting picture to you, but to me it tells me a lot about my system

“data load has just ran” I said to my comrade-in-arms. “which one?” he asked. “The Delta – It ran the quick plan. But it started a bit late, 12:15. Oh, and looks like the transaction view code has swapped back to the full table scan plan and the summary code is not playing up at the moment.”

“you’re turning into Neo you are – can you see a lady in a red dress???” he asked.

That was of course a reference to the “Matrix” films where at times you see the virtual world displayed on a screen as a stream of characters running down the screen – but once you get used to it you can apparently “see” what is going on.

The screen shot above is not actually a very good example of what the performance screens can show you. One of my minor complaints about the performance screens is that the graph scales to the greater of the largest peak of activity shown or the number of CPUs (real or fake) available to you. So if you have more CPU available than you need, you can’t see much detail in the graph. And if you have had a nasty peak of activity, again, all detail is squeezed out. In my case, the box is sized to cope with the load expected in 12 months’ time and the system is new, so activity is scuttling along the bottom of the graph.

However, “poor” though the example is, it told me what was going on across my system at a glance, something about the major tasks we are running, that one problem is currently occurring and that several of the other issues I need to keep an eye out for are not occurring.

That is why I love these screens – I recognise “my” activity patterns from the graph, and I now recognise the SQL IDs for my key statements. If I see a pattern in the graph I don’t recognise, I need to check things out immediately. Three or four times over the last 2 weeks I have spotted an issue, started investigating and found the cause before the Operations desk had even noticed it.

Oh, and what is SQL type 189? It is a merge statement. Our implementation of OEM is a little old, it does not correctly interpret that SQL command type. It might be a little old, it is still a lot useful.

Server Bought for the 1 Grand Challenge December 6, 2010

Posted by mwidlake in Architecture, One Grand Server, performance.

What seems like a couple of months ago I suggested the idea of The Fastest Oracle Server for a Grand. It turns out this was actually over 1/3 of a year ago! {such is the rapid passing of normal time}. Anyway, I’ve decided to give this a go.

The intention is that I am going to build a server based on PC technology which costs less than £1,000 and see how fast I can make it go. Of course “how fast” is a “piece of string” question – it depends on what you put into the Oracle database, how you want to use or manipulate the data and how business-ready the system is. I’m intending to build something that looks very, very un-business-ready. That is, not a lot of redundancy. Before anyone wants to shoot me down for that: (a) I am not running a bank or anything to do with finance; (b) why are banks’ systems, which only deal with cash, so much more regulated and governed than medical systems that are only relied on to keep you alive?; (c) some of the biggest systems I know of are actually running on something close to PC kit.

I’m quietly confident I can build something that out-performs systems costing 100 times as much. Now, that is a massive claim and I won’t be too sad if I fall short, but you can do a lot with modest kit. I worked for a charity for 6 years and boy did I see some clever stuff done on the sort of budget many organisations spend on office stationery.

So, what have I got so far? I confess I held off until I saw some new technology appear in a price band I could squeeze in. Namely USB3 and SATA3. There is always something just around the corner but I wanted those as I want to maximise the impact of solid state storage. So, my base server is:

  • Asus P7P55D-E motherboard supporting DDR3, USB3 and SATA3
  • Intel i5 760 2.8GHz chip
  • 8GB memory
  • 1TB Samsung 7200rpm SATAII disk
  • AZCool Infinity 800W PSU
  • Cooler Master Elite RC-335 case

I chose the motherboard as it was getting good reviews and had the SATA3 and USB3 ports. I chose the case as it was large enough to take many hard drives, small enough to lug about, and a nice case. I stuck to 8GB RAM as RAM is expensive at the moment, but as it is in 2GB chunks I might regret that choice, as all my slots are full. Many people forget the PSU, but it’s like the tyres on your car: those tyres keep you stuck to the road, and a PSU keeps you powered. It might be utilitarian but it is vital and often overlooked. The hard disc is pretty good, but very likely to be swapped out (I don’t mind sticking it in another system). The CPU is a proper quad core CPU. I had plenty of scope to go bigger and better on the CPU, but for grunt for cash it seems presently to be the sweet spot.

The basic unit is not overclocked. I will increase the cooling and overclocking will be an option. It comes with 64-bit Windows but Linux is almost certainly going to be the faster option. No monitor is included but hey, it’s a database server, you don’t need fancy graphics. That old CRT in the corner will do! The server does have a rather nice nVidia GeForce GTX 460 in it but I am cutting out the cost of that. The server is currently the best gaming machine I have, but that will end when I get time to start working on the Oracle side.

Total cost, £615 or so. That is like $615 seeing as we get so ripped off in the UK for IT kit. I can now go spend money on more fast hard discs, SSDs, even fast USB memory sticks. Any suggestions, I am happy to listen.

The biggest question is – When am I going to get time to work on this damn thing?

How Fast for £1,000 – Architecture August 5, 2010

Posted by mwidlake in Architecture, performance, Testing.
Tags: , , ,
7 comments

My previous post proposed the creation of “the fastest Oracle server for a grand”, or at least an investigation into what might be the fastest server. I’ve had some really good feedback {which I very much appreciate and am open to even more of}, so I think I’ll explore this further.

My initial ideas for the hardware configuration, written at the same time as the original post, were:

  • A single-chip, quad core intel core i5 or i7 processor (I would like two chips but the cost of multi-chip motherboards seems too high for my budget)
  • 8GB of memory as the best price point at present, but maybe push to 16GB
  • Multiple small, fast internal disks for storage, maybe expand via eSATA
  • backup to an external drive (cost not included in the budget).
  • USB3 and use of memory sticks for temp and online redo.
  • If budget will stretch, SSD disc for the core database components, like core tables and index tablespaces (who does that any more!)
  • ASM or no ASM?
  • If I run out of internal motherboard connections for storage, can I mix and match with USB3, external e-SATA or even GB Ethernet?

As for the Oracle database considerations, I have a good few things I want to try out also. In the past (both distant and recent) I have had a lot of success in placing components of the database in specific locations. I refer to this as “Physical Implementation” {Physical Implementation, if I remember my old DB Design courses correctly, also includes things like partitioning, extent management, tablespace attributes – how you actually implement the tables, indexes and constraints that came from logical data design}.

Physically placing components like undo and redo logs on your fastest storage is old-hat but I think it gets overlooked a lot these days.
Placing indexes and tables in different tablespaces on different storage is again an old and partially discredited practice, but I’d like to go back and have a new look at it. Again, I had some success improving performance with this approach as recently as 8 years ago but never got to rigorously test and document it. {As an aside, one benefit I have been (un)fortunate enough to gain twice from putting tables and indexes in separate tablespaces is when a tablespace was lost through file corruption – only for it to be an index tablespace, so I was able to just drop the tablespace and recreate the indexes.}
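
For the record, the physical placement I mean is nothing more exotic than the below {a minimal sketch – the tablespace names, file paths, sizes and the table itself are all made up for illustration}:

-- two tablespaces whose datafiles sit on different storage
create tablespace data_ts
  datafile '/u01/oradata/ogs/data_ts01.dbf' size 10g
  extent management local autoallocate;

create tablespace index_ts
  datafile '/u02/oradata/ogs/index_ts01.dbf' size 5g
  extent management local autoallocate;

-- table data on one set of discs...
create table child_acc
 (acc_id   number(10)    not null
 ,cre_date date          not null
 ,vc_1     varchar2(100) not null)
tablespace data_ts;

-- ...and the index on the other. Lose index_ts to corruption and you
-- can simply drop the tablespace and recreate the index, as above
create index chac_idx on child_acc (acc_id, cre_date)
tablespace index_ts;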

Then there is the use of clusters, IOTs, Bitmap indexes and Single Table Hash Clusters (are you reading this Piet?) which I want to explore again under 11.
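
As a reminder of what one of those looks like, you create a single table hash cluster something like the below {a sketch only – the SIZE and HASHKEYS figures are plucked from the air and would need proper sizing against the real data}:

-- rows are located by hashing the key, so a lookup on acc_id
-- can be a single block read, with no index involved at all
create cluster acc_hc (acc_id number(10))
  size 256
  single table hashkeys 100000;

create table account_sth
 (acc_id   number(10)    not null
 ,cre_date date
 ,vc_1     varchar2(100))
cluster acc_hc (acc_id);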

I don’t think I am going to bother with mixed block sizes in one DB; I think you need very specialist needs to make it worth the overhead of managing the various caches, plus the CBO is not so great at accurately costing operations in non-standard block sizes {issues with the MBRC fudge factor being one}. But I think I will re-visit use of “keep” and “recycle” caches. For one thing, I want to show that they are just caches with a name and not special, by using the “Recycle” cache as the keep and the “keep” as a recycle cache.
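
Swapping them over takes nothing more than the below {a sketch with made-up sizes and table names – the named caches need memory assigned before objects can use them}:

-- give the two named caches some memory (sizes illustrative)
alter system set db_keep_cache_size    = 512m scope=both;
alter system set db_recycle_cache_size = 512m scope=both;

-- then use them the "wrong" way round, to show they are just caches:
-- the small, hot table goes in RECYCLE...
alter table hot_lookup  storage (buffer_pool recycle);
-- ...and the table we want aged out quickly goes in KEEP
alter table big_scanned storage (buffer_pool keep);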

Should I be using RAT for testing all of this? I said I was not going to use any special features beyond Enterprise edition but RAT could be jolly useful. But then I would need two servers. Is anyone willing to give me the other £1000 for it? I’d be ever so grateful! :-)

An Oracle server – How Fast for £1,000 July 27, 2010

Posted by mwidlake in Architecture, performance, Testing.
17 comments

Question: how fast an Oracle server can you create for £1,000?

{I’d really appreciate feedback and suggestions on this particular post}

The power of domestic PCs continues to grow, with four-core chips becoming pretty much standard and starting RAM looking more like 4GB than 2GB, with 8GB quite reasonable. So, how quick an Oracle server can you make based on a domestic PC? After all, those of us who play with Oracle in our spare time tend to use such machines and, in fact, they are often not far off what our smaller servers at the office really are. When I worked at the Wellcome Trust Sanger Institute, we had to make our IT budget pounds go a long way. We were, after all, a charity with a limited budget but also a scientific organisation with a huge demand for data and processing. So we used a lot of cheap kit.

I’m seriously thinking of giving this a go. I need a new PC anyway and so I am willing to use it, at least initially, to see what can be done.

If I do this, I’m going to need to set some boundaries on the exercise. How about:

  • The Oracle licence is being ignored in the cost {and please, I don’t need to be told how the licence can cost more than the hardware!}. The OS cost is included though.
  • I am not aiming for enterprise-level resilience, so I am not going to consider hot-swappable components, dual redundant power supplies or things like that.
  • I am going to use new kit, so no scavenging or buying second-hand. It must all be easily available and repeatable.
  • I will use local storage in the server or connected to ports available on the server.
  • It will support a database of 1TB in size {yet to be designed}.
  • Oracle v11. Enterprise edition but nothing special like TimesTen or Exadata (unless Oracle are willing to sell me an Exadata box for a grand, then I’ll consider it).
  • I’m not considering backup and recovery performance {and this would be a serious oversight if this was a real system, but most places have central backup/recovery facilities}.

I would also have a few other things to decide.

The main one is “Do I use Linux or Windows?” Yes, you are all probably shouting “Linux!!!” but I have never been a Linux sys admin (I was an incredibly poor HP-UX system admin for 3 months though) so it will take me more time to deal with issues under Linux – in work situations I have always had access to people who know all this stuff to sort out issues, but in this case I will be doing it on my own. On the other hand, you can just chuck Oracle on a standard Windows box and it works, and as a rule hardware just works under Windows. If I decide to use USB3 ports, for example, is it going to be a major pain getting drivers under Linux? But then, if I want the fastest Oracle box under a grand, why would I slow it down with Windows and spend money on the licence? I just want the box to run Oracle and a workload.

The second “software” decision is, how do I measure performance? I think I could be getting to grips with Dom Giles’ excellent Swingbench {BTW, nice tag line on that page, Dom :-)}. But it runs on Java and guess what, boys and girls? I’ve never been a Java developer. How limited are my skills! So that would take some of my precious spare time up too.

I’d love feedback on this, I’d love to know what hardware suggestions you would make, what you think about the overall idea, what else I need to consider to make the tests valid… I have a few ideas already for the hardware architecture and the intention would be to try lots of things but I’ll save that for a second post. After all, if I get no feedback I might just spend the money on a gaming machine and a week’s walking in the Lake District instead.

And if anyone wants to help with the cost, please send cheques to….

More Memory Meanderings – IOPS and Form Factors July 19, 2010

Posted by mwidlake in Architecture, Management, performance.
Tags: ,
8 comments

I had a few comments when I posted on solid state memory last week and I also had a couple of interesting email discussions with people.

I seriously failed to make enough of one of the key advantages of solid-state storage over disk storage, which is the far greater capacity for input/output operations per second (IOPS); this was picked up by Neil Chandler. Like many people, I have had discussions with the storage guys about why I think the storage is terribly slow and they think it is fast. They look at the total throughput from the storage to the server and tell me it is fine. It is not great, they say, but it is {let’s say, for this example} passing 440MB a second over to the server. That is respectable and I should stop complaining.

The problem is, they are just looking at throughput, which seems to be the main metric they are concerned about after acreage. This is probably not really their fault; it is the way the vendors approach things too. However, my database is concerned with creating, fetching and altering records, and it does this as input/output operations. Let us say a disk can manage 80 IOPS (which allows an average of 12.5ms both to seek to the record and to read the data – even many modern 7,200rpm discs struggle to average less than 12ms seek time). We have 130 disks in this example storage array and there is no overhead from any sort of RAID or any bottleneck in passing the data back to the server. {This is of course utterly unbelievable, but since I have been a little harsh in not stating the discs can manage 8ms seek times, ignoring the RAID/HBA/network cost covers that}. Each disc is a “small” one of 500GB. They bought cheap disk to give us as many MB/£ as they could {10,000 and 15,000rpm disks will manage 120 and 160 IOPS respectively, but cost more per MB}.
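
The per-disc and whole-array arithmetic, if you fancy letting the database do it {a sketch using the figures above}:

-- IOPS per disc = 1,000ms divided by the average time per IO
select round(1000/12.5)       as iops_per_disc
,      130 * round(1000/12.5) as array_iops
from   dual;
-- 80 IOPS a disc, so the 130-disc array tops out at 10,400 IOPS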

Four sessions on my theoretical database are doing full table scans at 1MB of data per IO {Oracle’s usual maximum on 10.2}, each session receiving 100MB of data a second, so 400MB in total. Five discs {5 * 80 IOPS * 1MB} could supply that level of IO. It is a perfect database world and there are no blocks in the cache already for these scans to interrupt the multi-block reads.

However, my system is primarily an OLTP system and the other IO is records being read via index lookups and single block reads or writes.

Each IO reads the minimum unit for the database, which is a block. A block here is 4k. Oracle can’t read just part of a block.

Thus the 40MB of other data being transferred from (or to) the storage is single block reads of 4k – 10,000 of them a second. I will need 10,000/80 discs to support that level of IO. That is 125 discs, running flat out.

So, I am using all my 130 discs and 96% of them are serving 40MB of requests and 4% are serving 400MB of requests. As you can see, as an OLTP database I do not care about acreage or throughput. I want IOPS. I need all those spindles to give me the IOPS I need.
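
As a sanity check on those numbers {using 1MB = 1,000KB, to stay with the kind of round figures storage vendors like}:

select (40 * 1000) / 4        as single_block_iops -- 40MB/s of 4k reads
,      ((40 * 1000) / 4) / 80 as discs_for_oltp    -- at 80 IOPS a disc
,      400 / 80               as discs_for_scans   -- 400MB/s of 1MB reads
from   dual;
-- 10,000 IOPS needing 125 discs, plus the 5 discs for the scans: all 130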

What does the 40MB of requests actually equate to? Let us say our indexes are small and efficient and have a height of 3 (a b-level of 2), so a root node, one level of branch nodes and then the leaf nodes. To get a row you need to read the root node, a branch node, a leaf node and then the table block. 4 IOs. So those 10,000 IOPS allow us to read or write 10,000/4 records a second, or 2,500 records.
You can read 2,500 records a second.
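
You can see the height of your own indexes easily enough – BLEVEL is the depth from the root block down to (but not counting) the leaf blocks, so an index of height 3 shows a BLEVEL of 2:

select index_name, blevel, leaf_blocks
from   user_indexes
order  by blevel desc;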

Sounds a lot? Well, let us say you are pulling up customer records onto a screen and the main page pulls data from 3 main tables (customer, address, account_summary) and translates 6 fields via lookups. I’ll be kind and say the lookups are tiny and Oracle just reads the block or blocks of each with one IO. That is 9 record-fetches for the customer screen, so if our 40MB of OLTP IO was all for looking up customers then you could show just under 280 customers a second (2,500/9), across all users of your database. If you want to pull up the first screen of the orders summary, each screen record derived from 2 underlying main tables and again half a dozen lookups, but now with 10 records per summary page – that is 80 IOs for the page. Looking at a customer and their order summary, you are down to under thirty a second across your whole organisation, doing nothing else.

You get the idea. 2,500 records a second is tiny. Especially as those 130 500GB discs give you 65TB of space to host your database on. Yes, it is potentially a big database.

The only way any of this works is the buffer cache. If you have a very healthy buffer cache hit ratio of 99% then you can see that your 2,500 records a second of physical IO in and out of the storage sub-system is actually supporting 250,000 logical-and-physical IOPS. {And in reality, many sites buffer at the application layer too}.
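
For what buffer cache hit ratios are worth {and they can mislead badly}, the classic calculation is along these lines:

select round((1 - phy.value / (db.value + con.value)) * 100, 2)
         as buffer_cache_hit_pct
from   v$sysstat phy
,      v$sysstat db
,      v$sysstat con
where  phy.name = 'physical reads'
and    db.name  = 'db block gets'
and    con.name = 'consistent gets';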

Using Solid State Storage would potentially give you a huge boost in performance for your OLTP system, even if the new technology was used to simply replicate disk storage.

I think you can tell that storage vendors are very aware of this issue, as seek times and IOPS are not metrics that tend to jump out of the literature for disk storage. In fact, often they are not mentioned at all. I have just been looking at some modern sales literature and white papers on storage from a couple of vendors and they do not even mention IOPS – but they happily quote acreage and maximum transfer rates. That is, until you get to information on Solid State Discs. Now, because the vendor can say good things about the situation, the information is there. On one HP white paper the figures given are:

                     Modern super-fast   Top-end
                     SAS disk drive      Solid State Disk
Sustained write      150MB/s             180MB/s
Sustained read       90MB/s              180MB/s
Random write (IOPS)  285                 5,000+
Random read (IOPS)   340                 20,000+

More and more these days, as a DBA you do not need or want to state your storage requirements in terms of acreage or maximum throughput – you will get those for free, so long as you state your IOPS requirement. Just say “I need 5,000 IOPS” and let the storage expert find the cheapest, smallest discs they can to provide it. You will have TBs of space.

With solid-state storage you would not need to over-specify storage acreage to get the IOPS, and this is why I said last week that solid state storage does not need to match the capacity of current disks before it takes over. We would be back to the old situation where you bought so many cheap, small units to get the volume that the IOPS were almost an accidental by-product. With 1GB discs you were always getting a bulk-buy discount :-)

I said that SSD would boost performance even if you used the technology to replicate the current disk storage. By this I mean that you get a chunk of solid-state disk with a SATA or SAS interface in a 3.5 inch format block and plug it in where a physical disk was plugged in, still sending chunks of 4k or 8k over the network to the block buffer cache. But does Oracle want to stick with the current block paradigm for requesting information and holding data in the block buffer cache? After all, why pass over and hold in memory a whole block of data when all the user wanted was a specific record? It might be better to hold specific records. I suspect that Oracle will stick with the block-based structure for a while yet as it is so established and key to the kernel, but I would not be at all surprised if something is being developed with Exadata in mind where data sets/records are buffered, and that could be used for data coming from solid state memory – a second cache which, when using Exadata or solid-state memory, holds single records. {I might come back to this in a later blog; this one is already getting bloated}.

This leads on to the physical side of solid-state discs. They currently conform to the 3.5” or 2.5” hard disc form factor but there is no need for them to do so. One friend commented that, with USB memory sticks, you could stick a female port on the back of a memory stick and a joint, and you could just daisy-chain the USB sticks into each other, like a long snake. And then decorate your desk with them. Your storage could be looped around the ceiling as bunting. Being serious, though, with solid state storage you could have racks or rows of chips anywhere in the server box. In something like a laptop the storage could be an array 2mm high across the bottom of the chassis. For the server room you could have a 1U “server” and inside it a forest of chips mounted vertically, like row after row of teeth, with a simple fan at front and back to cool the teeth (if needed at all). And, as I said last time, with solid state being so much smaller and no need to keep to the old hard disk format, you could squeeze a hell of a lot of storage into a standard server box.

If you pulled the storage locally into your server, you would be back to the world of localised storage, but then LANs and WANs are so much faster now that if you had 10TB of storage local to your server, you could probably share it with other machines in the network relatively easily and yet have it available to the local server with as many and as fat a set of internal interfaces as you could get your provider to manage.

I’m going to, at long last, wrap up this current instalment on my thoughts with a business one. I am convinced that solid-state storage is soon going to be so far superior a proposition to traditional disks that demand will explode. And so it won’t get cheaper. I’m wondering if manufacturers will hit a point where they can sell as much as they can easily make, and so hold the price higher. After all, what was the argument for Compact Discs costing twice as much as old cassette tapes, even when they had been available for 5 years? What you can get away with charging for it.
