IOT 2 – First examples and proofs July 26, 2011
Posted by mwidlake in development, performance.Tags: design, index organized tables, IOT, performance, system development
7 comments
<.. IOT1 – Basics
..>IOT3 – Great reductions in IO for IOTs
….>IOT4 – Boosting Buffer Cache Efficiency
……>IOT5 – Primary Key issues
……….>IOT6a – Slowing Down Insert
…………>IOT6(B) – OLTP Inserts
In my first post on IOTs I ran through the basics of what they are. Here I am going to create some test tables and show you a few things.
I am going to create a simple PARENT table with 9,999 records and then two CHILD tables. CHILD_HEAP, a normal table, and CHILD_IOT, an Index Organized Table. They have the same columns and will hold very similar data.
All of this is on Oracle 11.1 but is exactly the same on 10.2. 8K block size, tablespaces are auto segment space managed.
Here are the creation statements:
--first create the parent table, keyed by ID.
-- The other columns are not significant, they just represent "information"
create table mdw.parent
(id number(10) not null
,name varchar2(100) not null
,date_1 date not null
,num_1 number(2)
,num_2 number(2)
,constraint pare_pk primary key(id)
using index tablespace index_01
)
tablespace data_01
/
--
--Now put my 9999 parents into the table.
insert into parent
select rownum
,dbms_random.string('U',mod(rownum,10)+50)
,sysdate-(mod(rownum,500)+1000)
,mod(rownum,99)+1
,trunc(dbms_random.value(0,100))
from dual connect by level < 10000
/
--
-- create the table to hold the children as a heap table
create table child_heap
(pare_id number(10) not null
,cre_date date not null
,vc_1 varchar2(100) not null
,date_1 date
,num_1 number(2)
,num_2 number(2)
,constraint chhe_pk primary key(pare_id,cre_date)
using index tablespace index_01
)
tablespace data_01
/
--
-- create the table to hold the children as an IOT table
create table child_iot
(pare_id number(10) not null
,cre_date date not null
,vc_1 varchar2(100) not null
,date_1 date
,num_1 number(2)
,num_2 number(2)
,constraint chio_pk primary key(pare_id,cre_date)
-- using index tablespace index_01 -- CANNOT STATE for IOT. State in table definition
)
ORGANIZATION INDEX -- This is it. This makes the table an IOT
tablespace data_01
/
There are only two differences between the statements creating the CHILD_HEAP and the CHILD_IOT tables.
The main one is the inclusion of the line ORGANIZATION INDEX and is what instructs Oracle to create the table as an IOT. Note that it does not state the index and you cannot state the index. The IOT is created based on the Primary Key.
The other change is that you now cannot state the tablespace for the Primary Key index. I’ve not played with this at all but I don’t think you can state anything with the “using index” as the table storage clauses are used for the Primary Key index. I personally find this a little illogical as it is the index segment that is created, but I guess others would find it more natural that you still state this at the table level.
When I create IOTs on a real system, I put the IOT in a table tablespace {I still maintain table and index tablespaces, for reasons I won’t go into here}. I put it there as it holds the actual data. If I lose that Primary Key index I am losing real data, not duplicated data.
I then populated the two CHILD tables with data. The method of creating this test data is very important.
I am simulating a very common situation, where data is coming in for a set of Parents (think customers, accounts, scientific instruments, financial entities) and the data is coming in as a record or set of records each day. ie not where the parent and all of it’s child records are created at one time, like an order and it’s order lines. I am simulating where the child data is created a few records at a time, not all in one go.
The code is simple. it loops for one hundred days and for each day it creates 10,000 records for random parents. On each day any given parent will have none, one or several records. On average, each parent will end up with 100 records, but some will have more and some less. The key thing is that the data for any given parent is created a record at a time, with lots of records created for other parents before the next record for that given parent.
The two tables will have the same pattern of data but not identical data. {I could have seeded the random number generator to make the two data sets the same but this will do}. Below is the statement for one table, you just change the table name to populate each table. {BTW I like using the from dual connect by level <=x method of getting the number of rows desired – it is fast and is neat, once you have seen it once}.
declare
v_num number :=10000; -- number of people
v_str varchar2(60);
begin
dbms_output.put_line (to_char(SYSTIMESTAMP,'HH24:MI:SS.FF'));
for i in 1..100 loop --days to do
v_str:=dbms_random.string('U',60);
insert into CHILD_HEAP
(pare_id,cre_date,vc_1,date_1,num_1,num_2)
select
trunc(dbms_random.value(1,v_num))
,sysdate-(100-i) + (rownum/(60*60*24) )
,substr(v_str,1,51+mod(rownum,10))
,sysdate-(100-i) + ((mod(rownum,30)+1)/3)
,mod(rownum,20)+1
,mod(rownum,99)+1
from dual connect by level <=v_num;
end loop;
dbms_output.put_line (to_char(SYSTIMESTAMP,'HH24:MI:SS.FF'));
end;
/
I then gathered objects stats on the tables.
Let’s check the size of the tables:
select segment_name, segment_type,tablespace_name,blocks from dba_segments where owner=USER and segment_name like 'CHILD%'; SEGMENT_NAME SEGMENT_TYPE TABLESPACE_NAME BLOCKS --------------- --------------- --------------- ---------- CHILD_HEAP TABLE DATA_01 12288 1 row selected.
ONE row? Where is the other table, where is CHILD_IOT? It does not exists.
Remember from my first post that I made the comment I would have prefered it if Index Organized Tables had been called something like ‘Table Containing Indexes’? The table data has been placed in the Primary Key index and the table segment does not even exist. If you start using IOTs this will catch you out periodically – it does me anyway and I’ve been using them on and off for years
.
Let’s look at the size of the primary key indexes:
select segment_name, segment_type,tablespace_name,blocks from dba_segments where owner=USER and segment_name like 'CH%PK' and segment_name not like '%ORD%' SEGMENT_NAME SEGMENT_TYPE TABLESPACE_NAME BLOCKS --------------- --------------- --------------- ---------- CHHE_PK INDEX INDEX_01 4224 CHIO_PK INDEX DATA_01 19456 2 rows selected.
Note that the Primary Key index for CHILD_HEAP, CHHE_PK, is there and is 4,224 blocks in size, and the CHILD_IOT Primary Key, CHIO_PK, is a lot larger at 19,456 blocks. In fact, not only is the CHIO_PK index larger than the CHILD_HEAP table, it is larger than the combined size of the CHILD_HEAP table and CHHE_PK index combines. So much for me saying last post that IOTs can save disk space? I’ll come back to that in a later post…
Here are some other stats from one of my scripts:
mdw11> @tab_sci_own owner for Table: mdw Name for Table: child_heap OWNER TABLE_NAME NUM_ROWS BLOCKS AVG_L GLS ULS LST_ANL PRT SAMP_SIZE -------- -------------- ------------- ----------- ----- --- --- ------------ --- ---------- MDW CHILD_HEAP 1000,000 12,137 83 YES NO 250711 22:01 NO 1000000 INDEX_NAME TYP PRT UNQ BL L_BLKS DIST_KEYS CLUSTF LB_KEY DB_KEY LST_ANL --------------- --- --- --- -- ---------- ----------- ------------ ---------- ---------- ------------ CHHE_PK NOR NO UNI 2 4,034 1000,000 995,857 1 1 250711 22:02 INDEX_NAME TABLE_NAME PSN COL_NAME ---------------------------- ---------------- --- ------------------------------------------------ CHHE_PK CHILD_HEAP 1 PARE_ID CHHE_PK CHILD_HEAP 2 CRE_DATE -- -- owner for Table: mdw Name for Table: child_iot OWNER TABLE_NAME NUM_ROWS BLOCKS AVG_L GLS ULS LST_ANL PRT SAMP_SIZE -------- -------------- ------------- ----------- ----- --- --- ------------ --- ---------- MDW CHILD_IOT 1000,000 83 YES NO 250711 22:03 NO 1000000 INDEX_NAME TYP PRT UNQ BL L_BLKS DIST_KEYS CLUSTF LB_KEY DB_KEY LST_ANL --------------- --- --- --- -- ---------- ----------- ------------ ---------- ---------- ------------ CHIO_PK IOT NO UNI 2 17,855 910,881 0 1 1 250711 22:03 INDEX_NAME TABLE_NAME PSN COL_NAME ---------------------------- ---------------- --- ------------------------------------------------ CHIO_PK CHILD_IOT 1 PARE_ID CHIO_PK CHILD_IOT 2 CRE_DATE
Note the lack of BLOCKS for the CHILD_IOT table and the CLUSTERING_FACTOR of 0 for the CHIO_PK.
The clustering factor is the number of times Oracle, when scanning the whole index in order, would have to swap to a different Table block to look up the table record for each index entry. If it is close to the number of blocks in the table, then the clustering factor is low and the order of records in the table matches the order of entries in the index. This would make index range scans that need to visit the table reasonably efficient.
If the clustering factor is close to the number of records in the table then it means there is no correlation between index order and table row order and such index ranges scans that have to visit the table would be inefficient. Again, this is significant and will be the major topic of the next post.
The depth of the index does not change, being 3 in each case (BL or blevel 2)
So, can we see evidence of the theoretical efficiency of looking up single records via the IOT that I mentioned in the fist post? Here we go {oh, usual disclaimer, I run the code twice and show the second run, to remove the parsing overhead}:
-- First the Heap table
select * from child_HEAP where PARE_ID=1234
AND cre_date=to_date('24-JUN-11 20:13:21','DD-MON-YY HH24:MI:SS')
PARE_ID CRE_DATE VC_1
---------- --------- ------------------------------------------------------
DATE_1 NUM_1 NUM_2
--------- ---------- ----------
1234 24-JUN-11 LUTFHOCIJNYREYICQNORREAJOVBRIHFVLXNIGIVZDMFJCTGYFWC
25-JUN-11 11 16
1 row selected.
Execution Plan
------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 83 | 3 (0)| 00:00:01 |
| 1 | TABLE ACCESS BY INDEX ROWID| CHILD_HEAP | 1 | 83 | 3 (0)| 00:00:01 |
|* 2 | INDEX UNIQUE SCAN | CHHE_PK | 1 | | 2 (0)| 00:00:01 |
------------------------------------------------------------------------------------------
Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
4 consistent gets
--and now the IOT table
select * from child_IOT where PARE_ID=1234
AND cre_date=to_date('24-JUN-11 21:23:41','DD-MON-YY HH24:MI:SS')
PARE_ID CRE_DATE VC_1
---------- --------- -------------------------------------------------------
DATE_1 NUM_1 NUM_2
--------- ---------- ----------
1234 24-JUN-11
CSIGBHSXWNDDTCFRCNWYPRNLEQWPCRYTXQQZHACDEXHOBEYXLNYBHRUHJ
27-JUN-11 7 52
1 row selected.
Execution Plan
-----------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-----------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 83 | 2 (0)| 00:00:01 |
|* 1 | INDEX UNIQUE SCAN| CHIO_PK | 1 | 83 | 2 (0)| 00:00:01 |
-----------------------------------------------------------------------------
Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
3 consistent gets
{I had to look up the exact values of CRE_DATE of a couple of records to do the above queries}
To look up a single row with the heap table you can see that the explain plan was to carry out a unique scan on the primary key and then look up the row via the rowid and took 4 consistent gets. 3 to walk down the index and get the rowid, one to look up the row block.
For the IOT table the explain plan reveals that there was simply an index unique scan of the Primary Key, nothing more. All data for the row was there in the index entry rather than the rowid. Thus only 3 consistent gets were required.
For single row lookups on the Primary Key, IOTS are more efficient than traditional Heap tables with a Primary Key index. {Please, no one point out that if all the columns you need are in the index you also do not need to go to the table, that is a different topic}.
Quite a few people have shown this efficiency before but the next step is far, far more interesting and shows a much more significant impact of IOTs. That is the topic of the next post
.
For now, I am going to finish off with what happens with range scans as I suggested they could slow down with an IOT.
Below, I select count(*) for just one of the parent values.
select count(*) from child_heap where pare_id = 2
COUNT(*)
----------
98
Execution Plan
-----------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-----------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 4 | 3 (0)| 00:00:01 |
| 1 | SORT AGGREGATE | | 1 | 4 | | |
|* 2 | INDEX RANGE SCAN| CHHE_PK | 100 | 400 | 3 (0)| 00:00:01 |
-----------------------------------------------------------------------------
Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
3 consistent gets
--
--
select count(*) from child_iot where pare_id = 2
COUNT(*)
----------
93
Execution Plan
-----------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-----------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 4 | 4 (0)| 00:00:01 |
| 1 | SORT AGGREGATE | | 1 | 4 | | |
|* 2 | INDEX RANGE SCAN| CHIO_PK | 100 | 400 | 4 (0)| 00:00:01 |
-----------------------------------------------------------------------------
Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
4 consistent gets
Both statements carry out a range scan on the Primary Key of the table. For the normal HEAP table this takes 3 consistent gets, which is no suprise as we have an 8k block size and only 100 rows for a given parent, they happen to fit into one block of the index. So Oracle works down the depth of the index and looks at one block.
For the IOT the scan works down the index but has to scan three blocks. Even though there are fewer entries, 93 compared to 98, they span three blocks and thus the total number of consistent gets is 5.
Admittedly I was a little lucky in my example above. Sometimes the entries for one parent will scan 2 blocks for the heap table’s Primary Key and occasionally the entries for the IOT will fit into 2 blocks. But if you look at the number of leaf blocks in the earlier stats (4,034 for the normal and 17,855 for the IOT, both for 10,000 entries) usually the 100 or so entries for single parent in the normal index will all fall into one block and the entries for the IOT will fall into between 2 and 3 blocks.
A select count(*) will full scan the smallest segment that can satisfy the query. Let’s try it:
mdw11> select count(*) from child_heap
COUNT(*)
----------
1000000
Execution Plan
-------------------------------------------------------------------------
| Id | Operation | Name | Rows | Cost (%CPU)| Time |
-------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 989 (1)| 00:00:15 |
| 1 | SORT AGGREGATE | | 1 | | |
| 2 | INDEX FAST FULL SCAN| CHHE_PK | 1000K| 989 (1)| 00:00:15 |
-------------------------------------------------------------------------
Statistics
----------------------------------------------------------
1 recursive calls
2 db block gets
4109 consistent gets
4088 physical reads
mdw11> select count(*) from child_iot
COUNT(*)
----------
1000000
Execution Plan
-------------------------------------------------------------------------
| Id | Operation | Name | Rows | Cost (%CPU)| Time |
-------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 4359 (1)| 00:01:05 |
| 1 | SORT AGGREGATE | | 1 | | |
| 2 | INDEX FAST FULL SCAN| CHIO_PK | 1000K| 4359 (1)| 00:01:05 |
-------------------------------------------------------------------------
Statistics
----------------------------------------------------------
1 recursive calls
0 db block gets
19298 consistent gets
19246 physical reads
The number of consistent gets (and physical reads) are close to the number of leaf blocks in the two segments, though higher. This is because Oracle is scanning the whole index, leaf blocks and branch blocks. The scan is far more expensive for the IOT, simply as the index is so much larger. I’ve not shown timings but on my little laptop, the count(*) takes about 3 seconds on CHILD_HEAP and about 5 seconds on the CHILD_IOT.
That is enough for one post.
Index Organized Tables – the Basics. July 18, 2011
Posted by mwidlake in development, internals, performance.Tags: Database Design, index organized tables, IOT, performance, system development
32 comments
..>IOT2 – Examples and proofs
….>IOT3 – Greatly reducing IO with IOTs
……>IOT4 – Boosting Buffer Cache Efficiency
……..>IOT5 – Primary Key issues
……….>IOT6a – Slowing Down Insert
…………>IOT6(B) – OLTP Inserts
I think Index Organized Tables(IOTs) are a much under-used and yet very useful feature of Oracle. Over the next few postings I’m going to cover some aspect of Index Organised Tables, both good and not-so-good. I am going to cover some benefits of IOTs that I think many people are unaware of. In this first post I am just going to run through the basics of IOTs.
The idea behind an IOT is simple. You hold all the data for the table in the ordered structure of an index. Why would you want to do that? Let us consider a very common requirement, accessing a row in a “large” table via a known, unique key.
Traditionally you have a heap table holding the data you want to access and a standard index to support access to that table. See the first diagram below. The 4-layer triangle represents the index, with a root block, two levels of branch blocks and then the leaf blocks at the “bottom”. The blue rectangle represents the table with the squares being individual rows. Of course, in a large table there would be thousands or millions of “squares”, this is just a simple diagram to show the idea.
When you issue a SQL statement to select the row via the indexed column(s) then oracle will read the root block (1), find the relevent block in the first level of branch blocks (2), then the relevant block in the second level of branch blocks (3) and finally (as far as the index is concerned) the relevant Leaf Block for the unique key. The leaf block holds the indexed column(s) and also the rowid. The rowid is the fastest way to look up a record, it states the file, block and row offset for the row. This allows oracle to go straight to the block and get the row. That is read number (5).
The number of branch blocks {and thus the number of blocks that need to be read to find a row} will vary depending on how much data is indexed, the number and size of the columns in the index, how efficiently the space has been used in the blocks and one or two other factors. In my experience most indexes for tables with thousands or millions of rows have one, two or three levels of branch blocks.
The second diagram shows a representation of the Index Organized Table. The table has in effect disappeared as a distinct object and the information has been moved into the leaf blocks of the index {part of me feels Index Organized Tables should really be called Table Organized Indexes or Table Containing Indexes as that would better indicate what is physically done}:

So with the IOT oracle reads the root block (1), the two branch level blocks (2 and 3) and finally the leaf block (4). The leaf block does not hold the rowid but rather the rest of the columns for the table {this can be changed, a more advanced feature allows you to store some or all the extra columns in an overflow segment}. Thus to access the same data, Oracle has to read only 4 blocks, not 5. Using an IOT saves one block read per unique lookup.
This saving of block reads is probably the main feature that IOTs are known for, but there are others which I will cover in later posts. Two things I will mention now is that, firstly, the use of IOTs is potentially saving disc space. An index is in effect duplication of data held in the table. When you create an index no new information is created but space is used up holding some of the table information in a structure suitable for fast lookup. Secondly, the index and table have to be maintained whenever a change is made to the columns that are indexed. IOTs reduce this maintenance overhead as there is only one thing to maintain.
Now for some drawbacks.
- The IOT has to be indexed on the primary key. There is no option to create an IOT based on other indexes. As such you have to either be accessing the table via the primary key to get the benefit – or you have to be a little cunning.
- The index is going to be larger than it was and very often larger than the original table. This can slow down range scans or full scans of the index and a “full table scan” will now be a full index scan on this large object, so that can also negatively impact performance. However, if a range scan would then have resulted in access to the table to get extra columns, the IOT gives a similar benefit in reducing IO to that for single row lookups.
- I just want to highlight that you now have no rowid for the rows.
- Secondary indexes are supported but will potentially be less efficient due to this lack of rowid.
So, a brief summary is that Index Organised Tables effectively move the table data into the Primary Key index, reduce the number of block lookups needed to select one row, can save some disc space. But you can only organize the table via the Primary Key and it can make full or partial table scans and lookups via other indexes slower.
There are several more important benefits to IOTs {in my opinion} which I will come to over the next week or two.
Why is my SYSAUX Tablespace so Big? Statistics_level=ALL June 2, 2011
Posted by mwidlake in AWR, performance.Tags: AWR, data dictionary, performance
10 comments
One of my most popular postings is about why your SYSTEM tablespace could be rather large. Recently I’ve had issues with a SYSAUX tablespace being considerably larger than I expected, so I thought I would do a sister posting on the reason.
The client I was working with at the time was about to go live with a new application and database. For various reasons I was a little anxious about how the Java application (the User Interface) would actually call the stored PL/SQL code I had helped develop. Initial workloads would be low and so I asked that the STATISTICS_LEVEL be set to ALL, so that bind variables (amongst other things) would be gathered. This is on version 10.2.0.4, btw, enterprise edition and 4-node RAC.
We went live, issues were encountered and resolved, the usual way these things work. Then, a few weeks in and when everything was still very “exciting” from a problem resolution perspective, I got an odd email from the DBA team. Would they like us to add another datafile to the SYSAUX tablespace. Huh? I checked. I’d been watching the size of our application’s tablespaces but not the others {well, I was not supposed to be a DBA and I was watching an awful lot of other things}. Our SYSAUX tablespace was around 160GB in size, having pretty much filled it’s 5th datafile. Why? I checked to see what was taking up the space in the tablespace:
select * from
(select owner,segment_name||'~'||partition_name segment_name,bytes/(1024*1024) size_m
from dba_segments
where tablespace_name = 'SYSAUX'
ORDER BY BLOCKS desc)
where rownum < 40
OWNER SEGMENT_NAME SIZE_M
------------------ -------------------------------------------------- ------------
SYS WRH$_LATCH_CHILDREN~WRH$_LATCH__14459270_3911 27,648
SYS WRH$_LATCH_CHILDREN_PK~WRH$_LATCH__14459270_3911 26,491
SYS WRH$_LATCH_CHILDREN~WRH$_LATCH__14459270_3537 23,798
SYS WRH$_LATCH_CHILDREN_PK~WRH$_LATCH__14459270_3537 22,122
SYS WRH$_LATCH_CHILDREN~WRH$_LATCH__14459270_4296 17,378
SYS WRH$_LATCH_CHILDREN_PK~WRH$_LATCH__14459270_4296 16,818
SYS WRH$_ACTIVE_SESSION_HISTORY~WRH$_ACTIVE_14459270_3 136
911
SYS WRH$_SQLSTAT~WRH$_SQLSTA_14459270_3911 96
SYS WRH$_SQLSTAT~WRH$_SQLSTA_14459270_3537 72
SYS WRH$_SQLSTAT~WRH$_SQLSTA_14459270_4296 47
SYS WRH$_LATCH_MISSES_SUMMARY_PK~WRH$_LATCH__14459270_ 45
3537
SYS I_WRI$_OPTSTAT_H_OBJ#_ICOL#_ST~ 41
SYS WRH$_SYSMETRIC_SUMMARY~ 40
SYS WRH$_LATCH_MISSES_SUMMARY_PK~WRH$_LATCH__14459270_ 37
As you can see, almost all the space is being taken up by WRH$_LATCH_CHILDREN and WRH$_LATCH_CHILDREN_PK partitions. They are massive compared to other objects. A quick goggle did not come up with much other than many hits just listing what is in SYSAUX and the odd person also seeing SYSAUX being filled up with these objects and suggested ways to clear down space, nothing about the cause.
I had a chat with the DBAs and we quickly decided that this was going to be something to do with AWR given the name of objects – “WRH$_” objects are the things underlying AWR. The DBA suggested my settings of 15 minute intervals and 35 day retention was too aggressive. I knew this was not the case, I’ve had more aggressive snapshot intervals and longer retention periods on far busier systems than this. I did not have access to Metalink at that point so I asked the DBAs to raise a ticket, which they duly did.
Oracle support cogitated for a couple of days and came back with the advice to reduce the retention period. Hmmmm. Via the DBA I asked Oracle support to explain why those objects were so large when I had not seen this issue on several other systems. Was it a bug? I had by now corroborated with a friend from a previous site with 5 minute snapshot intervals and two months retention period and their SYSAUX tablespace was about 10GB all in. I did not want to go changing things if we did not know it would fix the issue as we really wanted to stop the growth of SYSAUX as soon as possible, not just try a hunch.
As you probably realise from the title of this blog, the issue was not the snapshot interval or retention period but the STATISTICS_LEVEL=ALL. The one and only hit you get in metalink if you search on WRH$_LATCH_CHILDREN is note 874518.1. From V10.1.0.2 to V11.1.0.7 setting this parameter to ALL is known to create a lot of data about Latch children and not clear it down when the AWR data is purged (Bug 8289729). The advice was to change STATISTICS_LEVEL and make the snapshot interval larger. I’d suggest you just need to alter the STATISTICS_LEVEL, unless you really, really need that extra information gathered. It seemed to take Oracle Support an extra day or two to find that note for us. {I’ve since checked out Metalink directly to confirm all this}.
So with a known issue we felt confident that altering the initialisation parameter would solve the issue. It took a while for us to change the STATISTICS_LEVEL on the production system – Change Control for that site is rather robust. This allowed us to see some other impacts of this issue.
The mmon process which looks after AWR data was becoming a top session in our OEM performance screens. In particular, a statement with SQL id 2prbzh4qfms7u that inserted into the WRH$_LATCH_CHILDREN table was taking several seconds to run each time and was running quite often {I include the SQL ID as it may be the same on many oracle V10 systems as it is internal code}:
This was doing a lot of IO, by far the majority of the IO on our system at the time – it was a new system and we had been able to tune out a lot of the physical IO.

The physical IO requirements and 15-20 second elapsed time made this out most demanding statement on the system
We also now started to have issues with mmon running out of undo space when it ran at the same time as our daily load. This was particularly unfortunate as it coincided in a period of “intense management interest” in the daily load…
What was happening to the size of the SYSAUX tablespace?
Enter the tablespace (or leave null)> sys
TS_NAME ORD SUM_BLKS SUM_K MAX_CHNK_K NUM_CHNK
-------------------- ----- ----------- ------------ ----------- --------
SYSAUX alloc 58,187,904 465,503,232 33,553,408 14
free 10,728 85,824 21,504 20
SYSTEM alloc 128,000 1,024,000 1,024,000 1
free 68,360 546,880 546,752 3
4 rows selected.
select * from
(select owner,segment_name||'~'||partition_name segment_name,bytes/(1024*1024) size_m
from dba_segments
where tablespace_name = 'SYSAUX'
ORDER BY BLOCKS desc)
where rownum < 40
OWNER SEGMENT_NAME SIZE_M
-------- ------------------------------------------------------------ ----------
SYS WRH$_LATCH_CHILDREN~WRH$_LATCH__14459270_6201 30262
WRH$_LATCH_CHILDREN~WRH$_LATCH__14459270_5817 29948
WRH$_LATCH_CHILDREN~WRH$_LATCH__14459270_5435 28597
WRH$_LATCH_CHILDREN~WRH$_LATCH__14459270_4675 28198
WRH$_LATCH_CHILDREN~WRH$_LATCH__14459270_3911 27648
WRH$_LATCH_CHILDREN_PK~WRH$_LATCH__14459270_5817 27144
WRH$_LATCH_CHILDREN~WRH$_LATCH__14459270_6585 26965
WRH$_LATCH_CHILDREN_PK~WRH$_LATCH__14459270_6201 26832
WRH$_LATCH_CHILDREN_PK~WRH$_LATCH__14459270_4675 26741
WRH$_LATCH_CHILDREN_PK~WRH$_LATCH__14459270_3911 26491
WRH$_LATCH_CHILDREN~WRH$_LATCH__14459270_4296 26307
WRH$_LATCH_CHILDREN_PK~WRH$_LATCH__14459270_5435 26248
WRH$_LATCH_CHILDREN_PK~WRH$_LATCH__14459270_4296 25430
WRH$_LATCH_CHILDREN_PK~WRH$_LATCH__14459270_6585 25064
WRH$_LATCH_CHILDREN~WRH$_LATCH__14459270_5058 24611
WRH$_LATCH_CHILDREN_PK~WRH$_LATCH__14459270_5058 23161
WRH$_LATCH_CHILDREN~WRH$_LATCH__14459270_6966 9209
WRH$_LATCH_CHILDREN_PK~WRH$_LATCH__14459270_6966 8462
WRH$_SYSMETRIC_SUMMARY~ 152
WRH$_ACTIVE_SESSION_HISTORY~WRH$_ACTIVE_14459270_3911 136
WRH$_SQLSTAT~WRH$_SQLSTA_14459270_3911 96
@sysaux_conts
OWNER OBJ_PART_NAME SIZE_M
------------------------------ ---------------------------------------- ----------
SYS WRH$_LATCH_CHILDREN-WRH 231745.063
SYS WRH$_LATCH_CHILDREN_PK-WRH 215573.063
SYS WRH$_SQLSTAT-WRH 711.0625
SYS WRH$_LATCH_MISSES_SUMMARY_PK-WRH 439.0625
SYS WRH$_ACTIVE_SESSION_HISTORY-WRH 437.0625
SYS WRH$_LATCH_PARENT-WRH 292.0625
SYS WRH$_LATCH-WRH 276.0625
SYS WRH$_LATCH_MISSES_SUMMARY-WRH 273.0625
SYS WRH$_SEG_STAT-WRH 268.0625
SYS WRH$_LATCH_PARENT_PK-WRH 239.0625
SYS WRH$_SYSSTAT_PK-WRH 237.0625
Yes, that is close to half a terabyte of SYSAUX and it is all used, more partitions have appeared and the total size of the largest segments in SYSAUX show how WRH$_LATCH_CHILDREN and WRH$_LATCH_CHILDREN_PK make up the vast majority of the space used.
Shortly after, we finally got permission to change the live system. The impact was immediate, mmon dropped from being the most demanding session, that SQL code dropped down the rankings and the issues with running out of undo ceased.
I was anxious to see if the old data got purged, as the Metalink note had suggested the data would not be purged. Thankfully, that was not the case. The space was slowly released as normal purging of data outside the retention period took place and after just over a month, the SYSAUX tablespace contained a lot less information and was mostly free space:
OWNER OBJ_PART_NAME SIZE_M
---------- ---------------------------------------- ----------
SYS WRH$_LATCH_MISSES_SUMMARY_PK-WRH 512.0625
SYS WRH$_LATCH_MISSES_SUMMARY-WRH 350.0625
SYS WRH$_LATCH-WRH 304.0625
SYS WRH$_SQLSTAT-WRH 280.0625
SYS WRH$_LATCH_PK-WRH 259.0625
SYS WRH$_SYSSTAT_PK-WRH 247.0625
SYS WRH$_SERVICE_STAT_PK-WRH 228.0625
SYS WRH$_PARAMETER_PK-WRH 201.0625
SYS WRH$_PARAMETER-WRH 169.0625
SYS WRH$_SYSSTAT-WRH 169.0625
SYS WRH$_SEG_STAT-WRH 161.0625
SYS WRH$_SYSTEM_EVENT_PK-WRH 156.0625
SYS WRH$_SYSMETRIC_SUMMARY- 152
SYS WRH$_SYSTEM_EVENT-WRH 133.0625
SYS WRH$_SERVICE_STAT-WRH 123.0625
SYS WRH$_ACTIVE_SESSION_HISTORY-WRH 115.0625
TS_NAME ORD SUM_BLKS SUM_K MAX_CHNK_K NUM_CHNK
-------------------- ----- ----------- ------------ ----------- --------
SYSAUX alloc 58,251,904 466,015,232 33,553,408 15
free 57,479,400 459,835,200 4,063,232 1,208
SYSTEM alloc 128,000 1,024,000 1,024,000 1
free 68,048 544,384 544,320 2
Now, how do we get that space back? I left that with the DBA team to resolve.
Oh, one last thing. I mentioned the above to a couple of the Oaktable lot in the pub a few weeks back. Their instant response was to say “You set STATISTICS_LEVEL to ALL on a live system?!? You are mad!”
{Update, I’ve just spotted this posting by Colbran which is related. Last time I googled this I just got a stub with no information}
I am Neo off the Matrix (apparently) March 30, 2011
Posted by mwidlake in AWR, performance.Tags: AWR, perception, performance
8 comments
I know I have mentioned it before, but I am a big fan of the OEM performance screens that are derived from the ASH/AWR information. One of the things I really like about it is the immediate information it gives you, in one glance, that things are “not normal”. Once you notice that things are not normal you can then, within a few seconds, get a feel for what is still probably OK and where you have something that has changed.
As an example of the immediate information, I recently came back to my desk and glanced at my OEM performance screen. It was showing the below:
“data load has just ran” I said to my comrade-in-arms. “which one?” he asked. “The Delta – It ran the quick plan. But it started a bit late, 12:15. Oh, and looks like the transaction view code has swapped back to the full table scan plan and the summary code is not playing up at the moment.”
“you’re turning into Neo you are – can you see a lady in a red dress???” he asked.
That was of course a reference to the “Matrix” films where at times you see the virtual world displayed on a screen as a stream of characters running down the screen – but once you get used to it you can apparently “see” what is going.
The screen shot above is not even actually a very good example of what the performance screens can show you. One of my minor complaints about the performance screens is that it scales to show the greatest of the largest peak or a number of sessions to match the number of CPUs (real or fake) that are available to you. So if you have more CPU available than you need, you can’t see much detail in the graph. And if you have had a nasty peak of activity, again, all detail is squeezed out. In my case, the box is sized to cope in 12 months and the system is new, so activity is scuttling along the bottom of the graph.
However, “poor” though the example is, it told me what was going on across my system at a glance, something about the major tasks we are running, that one problem is currently occurring and that several of the other issues I need to keep an eye out for are not occurring.
That is why I love these screens – I recognise “my” activity patterns from the graph, I now recognise the SQL IDs for my key statements. If I see a pattern in the graph I don’t recognise, I need to check things out immediately. Three or four times over the last 2 weeks I have spotted an issues, started investigating and found out the cause before the Operations desk has even noticed an issue.
Oh, and what is SQL type 189? It is a merge statement. Our implementation of OEM is a little old, it does not correctly interpret that SQL command type. It might be a little old, it is still a lot useful.
If you Really Can’t Solve a “Simple” Problem.. March 11, 2011
Posted by mwidlake in Friday Philosophy, Perceptions, Testing.Tags: behaviour, performance, Testing
12 comments
Sometimes it can be very hard to solve what looks like a simple problem. Here I am going to cover a method that I almost guarantee will help you in such situations.
I recently had a performance issue with an Oracle database that had just gone live. This database is designed to scale to a few billion rows in two key tables, plus some “small” lookup tables of a few dozen to a couple of million rows. Designing a system of this scale with theory only is very dangerous, you need to test at something like the expected volumes. I was lucky, I was on a project where they were willing to put the effort and resource in and we did indeed create a test system with a few billion rows. Data structure and patterns were created to match the expected system, code was tested and we found issues. Root causes were identified, the code was altered and tested, fine work was done. Pleasingly soon the test system worked to SLAs and confidence was high. We had done this all the right way.
We went live. We ramped up the system to a million records. Performance was awful. Eyes swung my way… This was going to be easy, it would be the statistics, the database was 2 days old and I’d warned the client we would need to manage the object statistics. Stats were gathered.
The problem remained. Ohhh dear, that was not expected. Eyes stayed fixed upon me.
I looked at the plan and I quickly spotted what I knew was the problem. The below code is from the test system and line 15 is the key one, it is an index range scan on the primary key, within a nested loop:
9 | NESTED LOOPS | | 1 | 139 | 37 (3)| 00:00:01 | * 10 | HASH JOIN SEMI | | 1 | 50 | 11 (10)| 00:00:01 | * 11 | TABLE ACCESS BY INDEX ROWID | PARTY_ABCDEFGHIJ | 3 | 144 | 4 (0)| 00:00 * 12 | INDEX RANGE SCAN | PA_PK | 3 | | 3 (0)| 00:00:01 | 13 | COLLECTION ITERATOR PICKLER FETCH| | | | | | 14 | PARTITION RANGE ITERATOR | | 77 | 6853 | 26 (0)| 00:00:01 | * 15 | INDEX RANGE SCAN | EVEN_PK | 77 | 6853 | 26 (0)| 00:00:01 |
On the live system we had an index fast full scan (To be clear, the below is from when I had tried a few things already to fix the problem, but that index_fast_full_scan was the thing I was trying to avoid. Oh and, yes, the index has a different name).
| 9 | NESTED LOOPS | | 1 | 125 | 1828 (3)| 00:00:16 | | 10 | NESTED LOOPS | | 1 | 63 | 2 (0)| 00:00:01 | |* 11 | TABLE ACCESS BY INDEX ROWID| PARTY_ABCDEFGHIJ | 1 | 45 | 2 (0)| 00:00:01 | |* 12 | INDEX UNIQUE SCAN | PA_PK | 1 | | 1 (0)| 00:00:01 | |* 13 | INDEX UNIQUE SCAN | AGR_PK | 1 | 18 | 0 (0)| 00:00:01 | | 14 | PARTITION RANGE ITERATOR | | 1 | 62 | 1826 (3)| 00:00:16 | |* 15 | INDEX FAST FULL SCAN | EVE_PK | 1 | 62 | 1826 (3)| 00:00:16 |
Now I knew that Oracle would maybe pick that plan if it could get the data it wanted from the index and it felt that the cost was lower than doing multiple range scans. Many reasons could lead to that and I could fix them. This would not take long.
But I could not force the step I wanted. I could not get a set of hints that would force it. I could not get the stats gathered in a way that forced the nested loop range scan. I managed to alter the plan in many ways, fix the order of tables, the types of joins, but kept getting to the point where the access was via the index fast full scan but not by range scan. I thought I had it cracked when I came across a hint I had not known about before, namely the INDEX_RS_ASC {and INDEX_RS_DESC} hint to state do an ascending range scan. Nope, no joy.
By now, 8 hours had passed trying several things and we had a few other people looking at the problem, including Oracle Corp. Oracle Corp came up with a good idea – if the code on test runs fine, copy the stats over. Not as simple as it should be as the test system was not quite as-live but we did that. You guessed, it did not work.
So what now? I knew it was a simple problem but I could not fix it. So I tried a technique I knew had worked before. I’d long passed the point where I was concerned about my pride – I emailed friends and contacts and begged help.
Now, that is not the method of solving problems I am writing about – but it is a damned fine method and I have used it several times. I highly recommend it but only after you have put a lot of effort into doing your own work, if you are willing to give proper details of what you are trying to do – and, utterly crucially, if you are willing to put yourself out and help those you just asked for help on another day.
So, what is the silver bullet method? Well, it is what the person who mailed me back did and which I try to do myself – but struggle with.
Ask yourself, what are the most basic and fundamental things that could be wrong. What is so obvious you completely missed it? You’ve tried complex, you’ve been at this for ages, you are missing something. Sometimes it is that you are on the wrong system or you are changing code that is not the code being executed {I’ve done that a few times over the last 20 years}.
In this case, it was this:
Here is my primary key:
EVEN_PK EABCDE 1 AGR_EXT_SYS_ID EVEN_PK EABCDE 2 EXT_PRD_HLD_ID_TX EVEN_PK EABCDE 3 AAAMND_DT EVEN_PK EABCDE 4 EVT_EXT_SYS_ID EVEN_PK EABCDE 5 EABCDE_ID
Except, here is what it is on Live
EVE_PK EABCDE 1 EVT_EXT_SYS_ID EVE_PK EABCDE 2 EABCDE EVE_PK EABCDE 3 AGR_EXT_SYS_ID EVE_PK EABCDE 4 EXT_PRD_HLD_ID_TX EVE_PK EABCDE 5 AAAMND_DT
Ignore the difference in name, that was an artifact of the test environment creation, the key thing is the primary key has a different column order. The DBAs had implemented the table wrong {I’m not blaming them, sometimes stuff just happens OK?}.
Now, it did not alter logical functionality as the Primary Key is on the same columns, but as the access to the table is on only the “leading” three columns of the primary key, if the columns are indexed in the wrong order then Oracle cannot access the index via range scans on those values! Unit testing on the obligatory 6 records had worked fine, but any volume of data revealed the issue.
I could not force my access plan as it was not possible – I had missed the screaming obvious.
So, next time you just “know” you should be able to get your database (or code, or whatever) to do something and it won’t do it, go have a cup of tea, think about your last holiday for 5 minutes and then go back to the desk and ask yourself – did I check that the most fundamental and obvious things are correct.
That is what I think is the key to solving what look like simple problems where you just can’t work it out. Try and think even simpler.
Server Bought for the 1 Grand Challenge December 6, 2010
Posted by mwidlake in Architecture, One Grand Server, performance.Tags: performance, system development
3 comments
What seems like a couple of months ago I suggested the idea of The Fastest Oracle Server for a Grand. It turns out this was actually over 1/3 of a year ago! {such is the rapid passing of normal time}. Anyway, I’ve decided to give this a go.
The intention is that I am going to build a server based on PC technology which costs less than £1,000 and see how fast I can make it go. Of course “how fast” is a “piece of string” question – it depends on what you put into the Oracle database, how you want to use or manipulate the data and how business-ready the system is. I’m intending to build something that looks very, very un-business ready. That is, not a lot of redundancy. Before anyone wants to shoot me down for that (a) I am not running a bank or anything to do with finance (b) why are banks systems that only deal with cash so much more regulated and goverend than medical systems that are only relied on to keep you alive? (c) some of the biggest systems I know of are actually running on something close to PC kit.
I’m quietly confident I can build something that out-performs systems consisting 100 times as much. Now, that is a massive claim and I won’t be too sad if I fall short, but you can do a lot with modest kit. I worked for a charity for 6 years and boy did I see some clever stuff done on the sort of budget many organisation spend on office stationary.
So, what have I got so far? I confess I held off until I saw some new technology appear in a price band I could squeeze in. Namely USB3 and SATA3. There is always something just around the corner but I wanted those as I want to maximise the impact of solid state storage. So, my base server is:
- Asus P7P55D-E motherboard supporting DDR3, USB3 and SATA3
- Intel i5 760 2.8HHz chip
- 8GB memory
- 1TB samsung 7200rpm SATAII disk
- AZCool Infinity 800W PSU
- Coolmaster Elite RC-335 case
I chose the motherboard as it was getting good reviews and had the SATA3 and USB3 ports. I chose the case as it was large enough to take many hard drives, small enough to lug about and was a nice case. I stuck to 8GB RAM as RAM is expensive at the moment, but as it is in 2GB chunks I might regret that choice as all my slots are full. Many people forget the PSU but it’s like the tyers on your car. Those tyers keep you stuck to the road, a PSU keeps you powered. It might be utilitarian but they are vital and often overlooked. The hard disc is pretty good, but very likely to be swapped out (I don’t mind sticking it in another system). The CPU is a proper quad core CPU. I had plenty of scope to go bigger and better on the CPU but for grunt for cash, it seems presently to be the sweet spot.
The basic unit is not overclocked. I will increase the cooling and overclocking will be an option. It comes with 64 bit windows but linux is almost certainly going to be the faster option. No monitor is included but hey, it’s a database server, you don’t need fancy graphics. That old CRT in the corner will do! The server does have a rather nice nVidia GeForce GTX 460 in it but I am cutting out the cost of that. The server is currently the best gaming machine I have but that will end when I get time to start working on the Oracle side.
Total cost, £615 or so. That is like $615 seeing as we get so ripped off in the UK for IT kit. I can now go spend money on more fast hard discs, SSDs, even fast USB memory sticks. Any suggestions, I am happy to listen.
The biggest question is – When am I going to get time to work on this damn thing?
How Fast for £1,000 – Architecture August 5, 2010
Posted by mwidlake in Architecture, performance, Testing.Tags: Architecture, performance, system development, Testing
7 comments
My previous post proposed the creation of “the fastest Oracle server for a grand”, or at least an investigation into what might be the fastest server. I’ve had some really good feedback {which I very much appreciate and am open to even more of}, so I think I’ll explore this further.
My initial ideas for the hardware configuration, written at the same time as the original post, were:
- A single-chip, quad core intel core i5 or i7 processor (I would like two chips but the cost of multi-chip motherboards seems too high for my budget)
- 8GB of memory as the best price point at present, but maybe push to 16GB
- Multiple small, fast internal disks for storage, maybe expand via eSATA
- backup to an external drive (cost not included in the budget).
- USB3 and use of memory sticks for temp and online redo.
- If budget will stretch, SSD disc for the core database components. like core tables, index tablespaces (who does that any more!).
ASM or no ASM?
If I run out of internal motherboard connections for storage, can I mix and match with USB3, external e-SATA or even GB ethernet?
As for the Oracle database considerations, I have a good few things I want to try out also. In the past (both distant and recent) I have had a lot of success in placing components of the database in specific locations. I refer to this as “Physical Implementation” {Physical Implementation, if I remember my old DB Design courses correctly, also includes things like partitioning, extent management, tablespace attributes – how you actually implement the tables, indexes and constraints that came from logical data design}.
Physically placing components like undo and redo logs on your fastest storage is old-hat but I think it gets overlooked a lot these days.
Placing of indexes and tables on different tablespaces on different storage is again an old and partially discredited practice, but I’d like to go back and have a new look at it. Again, I had some success with improved performance with this approach as little as 8 years ago but never got to rigorously test and document it. { As an aside, one benefit I have been (un)fortunate to gain from twice through putting tables and indexes in separate tablespaces is when a tablespace has been lost through file corruption – only for it to be an index tablespace, so I was able to just drop the tablespace and recreate the indexes.}
Then there is the use of clusters, IOTs, Bitmap indexes and Single Table Hash Clusters (are you reading this Piet?) which I want to explore again under 11.
I don’t think I am going to bother with mixed block sizes in one DB, I think you need very specialist needs to make it worth the overhead of managing the various caches and the fact that the CBO is not so great at accurately costing operations in non-standard block sizes {issues with the MBRC fudge factor being one}. But I think I will re-visit use of “keep” and “recycle” caches. For one thing, I want to show that they are just caches with a name and not special, by using the “Recycle” cache as the keep and the “keep” as a recycle cache.
Should I be using RAT for testing all of this? I said I was not going to use any special features beyond Enterprise edition but RAT could be jolly useful. But then I would need two servers. Is anyone willing to give me the other £1000 for it? I’d be ever so grateful!
How often is v$sys_time_model updated? July 14, 2010
Posted by mwidlake in internals, performance.Tags: data dictionary, performance
6 comments
I think this posting might go down as one of my more pointless contributions to the Oracle knowledge sphere
I was looking at V$SYS_TIME_MODEL and V$SESS_TIME_MODEL and I just happened to run “select * from V$SYS_TIME_MODEL” several times in very quick succession. And I noticed the values for the various counters stayed the same between a couple of the runs.
“Hmmm, interesting” I thought “The values are only flushed down to the view ‘periodically’. I wonder how periodically?”… and thus I wasted a lunch time.
I used the below sql*plus-PL/SQL script to investigate the refreshing of v$sess_time_model. Yes, I know the title says v$sys_time_model but the numbers are smaller and easier to check for the session version of the view and they are virtually the same, I can bet on the results being very similar. This is my script (and it is on 10.2.0.3 on linux):
--test_vstm_upd
-- how often is v$sessstat updated
set trims on
set term off
set serveroutput on size unli
spool test_vstm_upd
begin
for a in 1..1000 loop
for vrec in
(select rpad(rpad(to_char(a),4)||' '||to_char(systimestamp,'ss.ff')||' '|| stat_name||' ',45)||value rec
from v$sess_time_model
-- replace with your current session ID
where sid=1989
and stat_id in (3649082374,2748282437,2821698184,2643905994)
)
loop
dbms_output.put_line(vrec.rec);
end loop;
dbms_output.put_line(chr(9));
end loop;
end;
/
set term on
spool off
As you can see, it simply loops around selecting four of the values from v$sess_time_model, including the loop counter and current timestamp. Timetamp is evaluated afresh for each executed sql statement.
Here is the output for the first three iterations;
1 53.389576 DB time 475860419 1 53.389576 DB CPU 402642660 1 53.389576 sql execute elapsed time 209780319 1 53.389576 PL/SQL execution elapsed time 52290858 2 53.408944 DB time 475860419 2 53.408944 DB CPU 402642660 2 53.408944 sql execute elapsed time 209780319 2 53.408944 PL/SQL execution elapsed time 52290858 3 53.429159 DB time 475860419 3 53.429159 DB CPU 402642660 3 53.429159 sql execute elapsed time 209780319 3 53.429159 PL/SQL execution elapsed time 52290858
As you can see, the timetamp is increasing by 2/100s of a second or so per loop. Which is not as quick as I hoped but it is a test box. Note that the counters for DB Time, CPU time, SQL execute elapsed time and PL/SQL execution elapsed time are constant.
A few iterations later we see the v$sess_time_model counters increment:
7 53.509351 DB time 475860419 7 53.509351 DB CPU 402642660 7 53.509351 sql execute elapsed time 209780319 7 53.509351 PL/SQL execution elapsed time 52291610 --all change! 8 53.531378 DB time 475871716 8 53.531378 DB CPU 402653957 8 53.531378 sql execute elapsed time 209786745 8 53.531378 PL/SQL execution elapsed time 52292793 -- and stable 9 53.555889 DB time 475871716 9 53.555889 DB CPU 402653957 9 53.555889 sql execute elapsed time 209786745 9 53.555889 PL/SQL execution elapsed time 52292793
The counters all increment between iteration 7 and 8 and then stay the same. I can’t tell how long the counters had been the same, I need to wait and see when they change again. How long until they increment again? Well, not very long, in fact just around 0.12 seconds:
14 53.650154 DB time 475871716 14 53.650154 DB CPU 402653957 14 53.650154 sql execute elapsed time 209786745 14 53.650154 PL/SQL execution elapsed time 52293064 -- change 15 53.670358 DB time 475881268 15 53.670358 DB CPU 402663509 15 53.670358 sql execute elapsed time 209792803 15 53.670358 PL/SQL execution elapsed time 52294180 -- still changing 16 53.689011 DB time 475887530 16 53.689011 DB CPU 402669771 16 53.689011 sql execute elapsed time 209794387 16 53.689011 PL/SQL execution elapsed time 52294180 -- and still changing 17 53.710875 DB time 475889549 17 53.710875 DB CPU 402671790 17 53.710875 sql execute elapsed time 209796393 17 53.710875 PL/SQL execution elapsed time 52295342 -- ...still changing... 18 53.728168 DB time 475893032 18 53.728168 DB CPU 402675273 18 53.728168 sql execute elapsed time 209797665 18 53.728168 PL/SQL execution elapsed time 52295342 -- and stable 19 53.744725 DB time 475893032 19 53.744725 DB CPU 402675273 19 53.744725 sql execute elapsed time 209797665 19 53.744725 PL/SQL execution elapsed time 52295342
This time, the increment occurs over several iterations of the loop before becoming stable again. All four values I am pulling out increment over these iterations.
The next increment comes four iterations or 0.1 seconds later and happens swiftly, between two iterations:
22 53.802486 DB time 475893032 22 53.802486 DB CPU 402675273 22 53.802486 sql execute elapsed time 209797665 22 53.802486 PL/SQL execution elapsed time 52295342 -- change 23 53.822231 DB time 475897963 23 53.822231 DB CPU 402680204 23 53.822231 sql execute elapsed time 209800369 23 53.822231 PL/SQL execution elapsed time 52296904 -- stable 24 53.840085 DB time 475905724 24 53.840085 DB CPU 402687965 24 53.840085 sql execute elapsed time 209803330 24 53.840085 PL/SQL execution elapsed time 52296904
So it seem that v$sess_time_model is incremented in steps, not constantly, and does so every 0.10 to 0.13 seconds or so. My work here is done.
Or is it?
No, it is not, as there is now a “massive” gap where the counters do not increment for almost 3/4 of a second, until iteration 127:
126 55.530398 DB time 475905724 126 55.530398 DB CPU 402687965 126 55.530398 sql execute elapsed time 209803775 126 55.530398 PL/SQL execution elapsed time 52297583 -- change 127 55.545085 DB time 475914013 127 55.545085 DB CPU 402696254 127 55.545085 sql execute elapsed time 209809518 127 55.545085 PL/SQL execution elapsed time 52298886 -- still changing 128 55.560141 DB time 475921342 128 55.560141 DB CPU 402703583 128 55.560141 sql execute elapsed time 209812345 128 55.560141 PL/SQL execution elapsed time 52299359 -- still changing 129 55.574806 DB time 475922705 129 55.574806 DB CPU 402704946 129 55.574806 sql execute elapsed time 209812345 129 55.574806 PL/SQL execution elapsed time 52299359 -- stable 130 55.589541 DB time 475922705 130 55.589541 DB CPU 402704946 130 55.589541 sql execute elapsed time 209812345 130 55.589541 PL/SQL execution elapsed time 52299359
Again, the incrementing ran over a small number of iterations of my loop.
I think I have shown that all the values increment together so I will reduce my output to just the one counter and see when it increments and over how many iterations and see if a pattern appears:
25 53.860550 DB time 475905724 -- 53.84 to 55.54 0 1.7 seconds of stability 126 55.530398 DB time 475905724 127 55.545085 DB time 475914013 128 55.560141 DB time 475921342 129 55.574806 DB time 475922705 136 55.682402 DB time 475922705 137 55.697191 DB time 475956738 138 55.712266 DB time 475969859 139 55.727820 DB time 475974350 140 55.743315 DB time 475982356 141 55.758749 DB time 475994069 142 55.773602 DB time 476004596 143 55.788472 DB time 476004596 144 55.803295 DB time 476007541 145 55.818136 DB time 476011172 146 55.832886 DB time 476020336 147 55.847772 DB time 476025376 148 55.865303 DB time 476036347 -- incrementd with a couple of brief pauses over 0.34 seconds 149 55.881480 DB time 476041481 150 55.896735 DB time 476041481 ... 200 56.664783 DB time 476041481 -- 55.88 to 56.67 0.8 seconds of stability 201 56.679455 DB time 476049162 -- increments over two iterations, 0.03 seconds 202 56.694092 DB time 476052385 203 56.708733 DB time 476052385 ... 261 57.566902 DB time 476052385 -- 56.69 to 57.59 0.9 seconds of stability 262 57.581582 DB time 476052842 263 57.596218 DB time 476058537 ... 270 57.700212 DB time 476058537 271 57.715371 DB time 476060552 272 57.730797 DB time 476063551 273 57.745700 DB time 476074383 274 57.760351 DB time 476079741 ... 279 57.835162 DB time 476079741 280 57.849966 DB time 476080090 281 57.864782 DB time 476090799 282 57.879446 DB time 476100404 283 57.894553 DB time 476103222 -- stable again after 0.3 seconds and a couple of mini-pauses 284 57.910592 DB time 476103222 ... 335 58.677438 DB time 476103222 -- 57.91 to 58.69 0.8 seconds of stability 336 58.694704 DB time 476113168 337 58.709995 DB time 476113909 338 58.724782 DB time 476119452 339 58.740756 DB time 476119795 340 58.758659 DB time 476129752 341 58.776040 DB time 476132036 ... 345 58.854895 DB time 476132036 346 58.869516 DB time 476138982 347 58.884100 DB time 476145880 348 58.898772 DB time 476160301 349 58.913401 DB time 476178139 350 58.935391 DB time 476183281 -- stable again after 0.27 seconds 351 58.955195 DB time 476183281 ... 395 59.608368 DB time 476183281 -- 57.93 to 59.60 0.68 seconds of stability 396 59.623062 DB time 476187188 ... 402 59.713566 DB time 476187188 403 59.728220 DB time 476194591 404 59.742900 DB time 476204006 405 59.757544 DB time 476210666 406 59.774934 DB time 476216338 407 59.796595 DB time 476228874 ... 413 59.890172 DB time 476228874 414 59.908436 DB time 476238680 415 59.923166 DB time 476251316 416 59.937805 DB time 476259466 417 59.952540 DB time 476261228 418 59.967215 DB time 476277094 419 59.981914 DB time 476282108 -- stable again after 0.29 seconds 420 00.000358 DB time 476298216 ... 529 01.684500 DB time 476298216 -- 00.00 to 01.69 1.69 seconds of stability 530 01.699165 DB time 476301888 531 01.714307 DB time 476312510
I would say that we can draw a few conclusions from the above
- It is dangerous to look at a little bit of data and draw a firm conclusion, as I nearly did
- The data in v$sess_time_model is only maintained in near-time not real-time
- The counters in v$sess_time_model increment together
- The counters seem to increment in a slightly messy way over part of a second and then are stable for 3/4 of a second to a second or two
I wonder how many of you went “Oh dear” when I said I could derive what is true for v$sys_time_model from v$sess_time_model? Could I? well, here is the modified script for v$sys_time_model:
--test_vstm_upd2
-- how often is v$sysstat updated
set trims on
set term off
set serveroutput on size unli
spool test_vstm_upd2
begin
for a in 1..1000 loop
for vrec in
(select rpad(rpad(to_char(a),4)||' '||to_char(systimestamp,'ss.ff')||' '|| stat_name||' ',45)||value rec
from v$sys_time_model
--where sid=1989
where stat_id in (3649082374)
)
loop
dbms_output.put_line(vrec.rec);
end loop;
-- dbms_output.put_line(chr(9));
end loop;
end;
/
set term on
spool off
And a sample of my output:
1 43.187666 DB time 14429733395433 2 43.188523 DB time 14429733395755 3 43.188642 DB time 14429733395905 4 43.188733 DB time 14429733395905 5 43.188822 DB time 14429733395905 6 43.188909 DB time 14429733395905 -- 7 43.188995 DB time 14429733396491 8 43.189080 DB time 14429733396491 9 43.189164 DB time 14429733396491 10 43.189258 DB time 14429733396491 11 43.189345 DB time 14429733396491 12 43.189430 DB time 14429733396491 13 43.189515 DB time 14429733396491 14 43.189600 DB time 14429733396491 15 43.189687 DB time 14429733396491 16 43.189774 DB time 14429733396491 17 43.189858 DB time 14429733396491 18 43.189942 DB time 14429733396491 19 43.190026 DB time 14429733396491 20 43.190111 DB time 14429733396491 -- 21 43.190200 DB time 14429733397436 22 43.190287 DB time 14429733397436 23 43.190371 DB time 14429733397436 24 43.190454 DB time 14429733397436 25 43.190540 DB time 14429733397436 26 43.190624 DB time 14429733397436 27 43.190708 DB time 14429733397436 -- 28 43.190793 DB time 14429733397764 29 43.190877 DB time 14429733397764 30 43.190961 DB time 14429733397764 31 43.191045 DB time 14429733397764 32 43.191132 DB time 14429733397764 33 43.191221 DB time 14429733397764 34 43.191309 DB time 14429733397764 35 43.191392 DB time 14429733397764 -- 36 43.191475 DB time 14429733402416 37 43.191558 DB time 14429733402416 -- 38 43.191641 DB time 14429733403070 39 43.191725 DB time 14429733403070 40 43.191809 DB time 14429733403070 41 43.191893 DB time 14429733403070 42 43.191976 DB time 14429733403070 43 43.192060 DB time 14429733403070 44 43.192144 DB time 14429733403070 45 43.192230 DB time 14429733403070 46 43.192315 DB time 14429733403070 47 43.192400 DB time 14429733403070 48 43.192484 DB time 14429733403070 49 43.192569 DB time 14429733403070 50 43.192654 DB time 14429733403070 -- 51 43.192737 DB time 14429733407045 52 43.192821 DB time 14429733407045 53 43.192904 DB time 14429733407045 54 43.192985 DB time 14429733407045 55 43.193069 DB time 14429733407045 56 43.193152 DB time 14429733407045 57 43.193237 DB time 14429733407045 58 43.193321 DB time 14429733407045 59 43.193404 DB time 14429733407045 60 43.193488 DB time 14429733407045 61 43.193574 DB time 14429733407045 -- 62 43.193660 DB time 14429733408897 63 43.193743 DB time 14429733408897 64 43.193828 DB time 14429733408897 65 43.193912 DB time 14429733408897 66 43.193994 DB time 14429733408897 67 43.194076 DB time 14429733408897 -- 68 43.194160 DB time 14429733409208 69 43.194283 DB time 14429733409208 70 43.194378 DB time 14429733409208 -- 71 43.194465 DB time 14429733409267 72 43.194551 DB time 14429733409267 73 43.194635 DB time 14429733409267 74 43.194719 DB time 14429733409267 75 43.194801 DB time 14429733409267 76 43.194884 DB time 14429733409267 -- 77 43.194967 DB time 14429733409863 78 43.195052 DB time 14429733409863 -- 79 43.195136 DB time 14429733410499 80 43.195245 DB time 14429733410499 81 43.195329 DB time 14429733410499 82 43.195412 DB time 14429733410499 83 43.195495 DB time 14429733410499 84 43.195577 DB time 14429733410499 85 43.195660 DB time 14429733410499 86 43.195743 DB time 14429733410499 87 43.195825 DB time 14429733410499 88 43.195909 DB time 14429733410499 89 43.195991 DB time 14429733410499 90 43.196074 DB time 14429733410499 91 43.196156 DB time 14429733410499 92 43.196244 DB time 14429733410499 93 43.196326 DB time 14429733410499 94 43.196409 DB time 14429733410499 -- 95 43.196493 DB time 14429733411732 96 43.196577 DB time 14429733411732 97 43.196661 DB time 14429733411732 98 43.196745 DB time 14429733411732 99 43.196826 DB time 14429733411732 -- 100 43.196910 DB time 14429733412107 101 43.196992 DB time 14429733412410 102 43.197076 DB time 14429733412410 103 43.197158 DB time 14429733412410 104 43.197245 DB time 14429733412410 105 43.197327 DB time 14429733412410 106 43.197410 DB time 14429733412410 107 43.197493 DB time 14429733412410 108 43.197575 DB time 14429733412410 109 43.197658 DB time 14429733412410 -- 110 43.197741 DB time 14429733412981 111 43.197824 DB time 14429733412981 112 43.197907 DB time 14429733412981 113 43.197990 DB time 14429733412981 114 43.198072 DB time 14429733413001 115 43.198156 DB time 14429733413001 116 43.198247 DB time 14429733413001 117 43.198330 DB time 14429733413001 -- 118 43.198414 DB time 14429733413300 119 43.198499 DB time 14429733413300 120 43.198581 DB time 14429733413300 121 43.198665 DB time 14429733413300 122 43.198748 DB time 14429733413300 123 43.198830 DB time 14429733413300 124 43.198913 DB time 14429733413300 -- 125 43.198997 DB time 14429733414262 126 43.199081 DB time 14429733414262 127 43.199165 DB time 14429733414262 128 43.199252 DB time 14429733414262 129 43.199336 DB time 14429733414262 130 43.199419 DB time 14429733414262 131 43.199503 DB time 14429733414262 -- 132 43.199586 DB time 14429733414569 133 43.199669 DB time 14429733414569 134 43.199752 DB time 14429733414569 135 43.199834 DB time 14429733414569 136 43.199918 DB time 14429733414569 137 43.200000 DB time 14429733414569 138 43.200083 DB time 14429733414569 139 43.200166 DB time 14429733414569 140 43.200252 DB time 14429733414569 -- 141 43.200334 DB time 14429733415145 142 43.200418 DB time 14429733415145 -- 143 43.200504 DB time 14429733415335 144 43.200588 DB time 14429733415335 145 43.200672 DB time 14429733415335 146 43.200756 DB time 14429733415335 147 43.200838 DB time 14429733415335 148 43.200921 DB time 14429733415335 149 43.201003 DB time 14429733415335 150 43.201086 DB time 14429733415335 151 43.201169 DB time 14429733415335 152 43.201259 DB time 14429733415335
I would say that we can draw a few conclusions from this latest test above!
- It is dangerous to look at one thing and assume something closely related will be the same!
- The data in v$sys_time_model is also being updated in bursts
- The data in v$sys_time_model is actually updated very, very frequently, at around 1/1000 of a second intervals
- It might be that v$sess_time_model is being updated for sessions in some sort of round-robin fashion and v$sys_time_model each time the v$sess version is updated
- You can spend a lot of time looking at really quite obscure and possibly pointless stuff
- The reason I am losing weight is I keep skipping lunch.
Memory Changes Everything July 12, 2010
Posted by mwidlake in Architecture, performance.Tags: Architecture, performance, rant, Storage
9 comments
I’ve got this USB memory stick which I use to carry around my scripts, documents, presentations, Oracle manuals and enough music to keep me going for a few days. It is on an 8GB Gizzmo Junior and it is tiny. By tiny I mean as wide as my little finger, the length of a matchstick and about the same thickness of said matchstick. So small that I did indeed lose the damn thing for 6 months before I realised it had got trapped behind a credit card in my wallet.
It cost me ten British pounds about 15 months ago (less than most 4GB USB sticks seem to cost now, but then it is nothing more than the memory chip and connectors wrapped in plastic) and it highlights how cheap solid-state “storage” is becoming.
Connected to this, I was looking at buying a new PC this week and this machine comes with 10 USB slots, if you include the ones on the supplied monitor and stubs on the motherboard.
10 USB slots, 8GB gizzmo memory sticks… That would be 80GB of cheap and fast storage. Now get a few USB hubs and bulk-buy a few dozen cheap USB2 sticks and you could soon have a solid-state database of a few hundred GB for a thousand pounds. Then of course you can have fun seeing where the pinch-points in the system are (USB2 has a maximum speed per port and going USB3 right now is going to break that 1 grand barrier. But give it a year…).
This really started me thinking about when memory-based storage would take over from spinning disk as the best option for enterprise-level storage and my gut feeling is in about 5 years. I think it will be both technically possible and financially viable in much less than that, say as little as 2 years, but the cost of solid-state storage per MB will still be higher than disk by then but potentially much faster. A few considerations going through my mind were:-
- Disk is getting a lot slower in relation to acreage. By this I mean that, for a single disc drive, capacity is doubling about every 18 months but seek time has hardly reduced in a decade and transfer rate (reading from the physical platters to the units buffer) is again almost stationary, at about 120MB/s for 10,000rpm disk and up towards 180 for those very expensive and noisy 15,000 rpm disks. Being a tad ridiculous to make the point, with modern 3TB disks you could build most Oracle database on one disc. Let’s make it two in a raid 10 configuration for redundancy. My point is, your 3TB database could well be being run right now, for real, across say 5 physical disks with a total sustainable physical throughput of around 500MB a second.
- Solid state storage seems to be halving in price in more like 8-10 months.
- IO subsystems are made faster by using RAID so that several physical discs can contribute to get towards the 300MB or so speed of the interface – but solid state is already that fast.
- IO subsystems are made faster by building big caches into them and pre-fetching data that “might” be requested next. Oh, that is kind of solid state storage already.
- Solid state storage, at least the cheap stuff in your USB stick, has the problem that you can only write to each bit a thousand or so times before it starts to get unreliable. But physical disk has exactly the same issue.
- There are new methods of solid-state memory storage coming along – “New Scientist” had a nice article on it a few months ago, and these versions will be even higher density and more long-term reliable.
- Seek time on solid-state memory is virtually zero, so random IO is going to be particularly fast compared to spinning disk.
Solid state memory needs less power, and thus less cooling, is silent, is potentially denser and is less vulnerable to temperature and humidity fluctuations. I can see it not needing to be kept in a specialist server room with the need for all that air con and ear defenders when you go in the room.
Just somewhere with normal air con and a lock on the door should suffice.
We do not need Solid State storage to match the size of current disks or even be as cheap to take over. As I have already pointed out, it is not acreage you need with physical disks but enough spindles and caches to make it fast enough in relation to the space. Further, we can afford to pay more for solid state if we do not need to keep it in such expensive clean-room like environments.
I can see that in a couple of years for a given computer system, say a mixed-workload order processing system, to support the storage needs we will have maybe a dozen solid-state chunks of storage, perhaps themselves consisting of several small units of memory in some sort of raid for resilience, all able to flood the IO channels into our processing server and the issue will be getting the network and io channels into the server to go fast enough. So don’t, stick all the storage directly into the server. You just got rid of half your SAN considerations.
I’m going to stop there. Partly because I have run out of time and partly because, in checking out what I am writing, I’ve just spotted someone did a better job of this before me. Over to James Morle who did a fantastic post on this very topic back in May. Stupid me for not checking out his blog more often. Jame also mentions that often it is not total throughput you are interested in at all but IOPS. That zero latency of solid-state memory is going to be great for supporting very high IOPS.
Friday Philosophy – The power of cooperation June 27, 2010
Posted by mwidlake in Friday Philosophy, Perceptions, performance.Tags: behaviour, perception, performance
3 comments
Being the person responsible for the performance of an Oracle-based system can be an oddly lonely working experience. It can also be a very gregarious position as you get to meet a lot of people and discuss lots of different aspects of many systems – all systems seem to have performance issues and so people come along to you, hoping you can identify a “work faster” database setting for them.
But you are often the only person who specialises in Oracle performance. You generally need to be in an organisation that is very large or where performance is key to success for there to be justification for dedicating a person to the topic. To have more than one person dedicated to performance your organisations has to have a very strong focus on getting the best performance out of the Oracle and related systems {or really, really atrocious performance issues
}. So usually there is no one else around who is as experienced (or more so) as yourself to discuss such things over with or ask for a second opinion.
Which is why I am very lucky at the moment. I’m working in a team of oracle performance people. There are 2.5 of us (one is a manager with other responsibilities, so he only counts as half). Being able to say “Hey, Dave, What do you think of the wait times on scattered reads?” or “how in heck do I force this nested subquery on a view to use a hash join?” and get some other ideas is very valuable.
What is also interesting is how opinions and preferred techniques on tuning can be different and just as valid. As an example, last week I was working on a poorly performing statement. I was at home and it was the evening, so I was not communicating with the rest of the team. I managed to get the code down from 40 minutes to just under 20 by using a combination of a LEADING and USE_HASH hint. I sent the code back to the user. Only to see that within thirty seconds of each other my colleague Graeme had also sent the user a response, again getting the code down to around 20 minutes. Graeme had pulled a chunk of the code into a subquery factoring “WITH” clause and added cardinality hints. Totally different changes.
So Graeme and I then had a “philosophical” discussion about the different changes {“Mine is best” – “No! Mine is, yours is all bloated and complex”- “Your hint is less future-flexible!!!”}. Only joking, we actually discussed the changes and why we each chose what we did. Graeme commented that is showed that tuning was an art and not a science and I countered that it was a science, as we had both identified where we felt the execution plan could be improved but used different techniques to get there. The thing is, Oracle is so complex and has so many options to influence the optimiser that you have flexibility to chose different tools and techniques.
We had both identified the same main improvement but had each come up with different tweaks for later in the plan.
The end result was that we went with Graeme’s main plan {he is bigger than me} but we pulled in my tweak. That bought the execution time down to around 10 minutes, so about four times faster over all and twice as fast of either of us alone. That is one of the advantages of not working alone.
We also then discussed how we could get this code down to seconds with the use of either Materialized views or changing the process that generated the report to do so incrementally and store the daily results. Until one of us realised we had reached the boundary of compulsive tuning disorder. The report running in 10 minutes was enough improvement to satisfy the business, the report was only going to be run over the next couple of months, so spending a day or two re-working it further was going to be fun – but of no advantage to the business. We told each other to finish for the day. So another advantage of not working alone is that not only do you get more technical input but your help prevent each other losing sight of the overall aim.
It really does help to have two people working on the same area.
{There is a sneaky way of getting beyond being a lone performance specialist. If you are in an organisation long enough you can usually find some other idiot who is silly enough to want to learn more about performance and you can train them up. It does not take long before they know enough to start coming up with things you never thought of. Oracle is, after all, full of many ways to do the same thing and you can’t know it all}.



