
Table High Water Mark and How Empty the Table Is April 30, 2012

Posted by mwidlake in performance, statistics.

On a recent post about reducing the high water mark in a table by shrinking it, someone asked if you could tell how much “space” there was in a table that could be recovered by shrinking it.

Yes. As pseudo-code:

Select number_of_rows*average_row_length. That is how much data there is.
Divide by the block size, to give how many blocks-worth of data you have.
Adjust this by an “overhead” fudge factor of about 20-25% to allow for wasted space (see later) – in the code below I simply multiply by 1.25.
Compare to the number of blocks used by the table.

Here is some code to do it, showing some extra data and how we calculate the effective percentage of space used (pct_spc_used). It runs against an example table ‘HWM’

select uts.blocks                                     blks_used
      ,uts.avg_space                                  avg_space
      ,uts.num_rows                                   num_rows
      ,uts.avg_row_len                                avg_row_len
      ,uts.empty_blocks                               empty_blks
      ,usse.blocks                                    alloc_blks
      ,greatest(uts.blocks,1)/greatest(usse.blocks,1) pct_hwm
      ,uts.num_rows*uts.avg_row_len                   data_in_bytes
      ,(uts.num_rows*uts.avg_row_len)/8192            data_in_blks
      ,((uts.num_rows*uts.avg_row_len)/8192)*1.25     mod_data_in_blks
      ,(((uts.num_rows*uts.avg_row_len)/8192)*1.25)/usse.blocks pct_spc_used
from user_tab_statistics uts
    ,user_segments       usse
where uts.table_name='HWM'
and   uts.table_name=usse.segment_name;

 BLKS_USED  AVG_SPACE   NUM_ROWS AVG_ROW_LEN EMPTY_BLKS ALLOC_BLKS  PCT_HWM
DATA_IN_BYTES DATA_IN_BLKS MOD_DATA_IN_BLKS PCT_SPC_USED
---------- ---------- ---------- ----------- ---------- ---------- --------
------------- ------------ ---------------- ------------
      1630       1221      50000         213         34       1664   .97957
     10650000   1300.04883       1625.06104    .97659918

I am collecting the data from USER_TAB_STATISTICS and USER_SEGMENTS. For this code to work you must have reasonably good stats against the table.

I’d like to run through some of the columns I have selected:
First of all, the columns AVG_SPACE and EMPTY_BLOCKS are not populated by dbms_stats.gather_table_stats. They are populated by the deprecated ANALYZE command, which you should NOT use to gather table stats since V10 came along. These columns hold values here only because I ran an ANALYZE against the table as well as a dbms_stats gather.
Next, I collect the allocated BLOCKS from USER_SEGMENTS {and for this demo I just ignored the potential for partitioned tables} and I compare this to the used BLOCKS from the table stats to get the High Water Mark, as a percentage of the segment. I do this as EMPTY_BLOCKS is set to zero if you have never used ANALYZE and, even if you did, unless you use this deprecated command all the time the value will not change.
On the second line of output I calculate DATA_IN_BYTES as a simple num_rows*avg_row_len and convert it into blocks {for simplicity I do not collect the block size, I know it is 8k}. I then apply my “overhead” fudge factor. A block has a header, using around 100 bytes {I’ve not checked the exact figure for years}, PCTFREE can be varied but defaults to 10% and, as only whole rows fit, an average of half a row of space is empty in each “full” block. Thus I allow for 20-25% overhead – in this case 25%, as my rows are large – by multiplying the data volume by 1.25.
Finally, I compare this modified data volume to the used blocks to get the effective percentage of space used.
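{As a worked example using the figures above: 50,000 rows x 213 bytes is 10,650,000 bytes, which is 10,650,000/8192 = roughly 1,300 blocks of data; multiplied by 1.25 that is about 1,625 blocks, and 1,625 out of the 1,664 allocated blocks gives the PCT_SPC_USED of roughly 0.977.}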

Below I run through creating some test data, looking at the stats and my calculated High Water Mark and pct_space_used and finally shrink my table to see if my guesstimate is a reasonable guesstimate:

populate table

drop table hwm purge;
prompt populate table
set feed on
create table hwm
(id   number(10)
,num1 number(2)
,vc1  varchar2(100)
,vc2  varchar2(100)
);

insert into hwm
select rownum
      ,mod(rownum,99)
      ,lpad('A',100,'A')  -- the padding expressions were lost from the original listing;
      ,lpad('B',100,'B')  -- anything giving rows of roughly 200 bytes reproduces the demo
from dual connect by level < 50001;
50000 rows created.

exec dbms_stats.gather_table_stats(ownname=>user,tabname=>'HWM')

--where is the HWM compared to total segment size

 BLKS_USED  AVG_SPACE   NUM_ROWS AVG_ROW_LEN EMPTY_BLKS ALLOC_BLKS  PCT_HWM
DATA_IN_BYTES DATA_IN_BLKS MOD_DATA_IN_BLKS PCT_SPC_USED
---------- ---------- ---------- ----------- ---------- ---------- --------
------------- ------------ ---------------- ------------
      1630          0      50000         210          0       1664   .97957
     10500000   1281.73828       1602.17285   .962844262

The high water mark is 1630 blocks out of 1664 in the segment
My calculated PCT_SPC_USED is 96%. That is probably close enough.
{remember, the last used block will be only partly used, accounting for a bit of the difference}

-- I will use ANALYZE to fill the missing columns
analyze table hwm compute statistics;

 BLKS_USED  AVG_SPACE   NUM_ROWS AVG_ROW_LEN EMPTY_BLKS ALLOC_BLKS  PCT_HWM
DATA_IN_BYTES DATA_IN_BLKS MOD_DATA_IN_BLKS PCT_SPC_USED
---------- ---------- ---------- ----------- ---------- ---------- --------
------------- ------------ ---------------- ------------
      1630       1150      50000         213         34       1664   .97957
     10650000   1300.04883       1625.06104    .97659918

Now those two columns are populated. 
Note the slightly different AVG_ROW_LEN even though dbms_stats used a 100% sample (as the table is so small)
and ANALYZE was a COMPUTE.

-- clear 90% of the data randomly
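{The delete statement itself is not shown in the output I kept; a minimal sketch of the sort of thing used, removing roughly 90% of the rows at random, would be:}

delete from hwm where dbms_random.value < 0.9;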

45461 rows deleted.

 BLKS_USED  AVG_SPACE   NUM_ROWS AVG_ROW_LEN EMPTY_BLKS ALLOC_BLKS  PCT_HWM
DATA_IN_BYTES DATA_IN_BLKS MOD_DATA_IN_BLKS PCT_SPC_USED
---------- ---------- ---------- ----------- ---------- ---------- --------
------------- ------------ ---------------- ------------
      1630       1150       4539         210         34       1664   .97957
       953190   116.356201       145.445251   .087407002

PCT_HWM is not altered of course, but PCT_SPC_USED has dropped dramatically. The table is now only
8.7% used according to my calculations – which tallies with my having just deleted a little over 90%
of the data.

The BLKS_USED does not change. The AVG_SPACE and EMPTY_BLOCKS are the same as I 
used dbms_stats to update the statistics and it DOES NOT ALTER the columns that it does not
populate. Thus you have no idea how recent those columns are if you use a mixture of commands.

-- clear some blocks completely by deleting a range
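{Again the actual delete is not shown; it was something along the lines of the below, removing a contiguous range of the surviving IDs – the exact range is a guess on my part:}

delete from hwm where id between 20000 and 44000;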

2181 rows deleted.

 BLKS_USED  AVG_SPACE   NUM_ROWS AVG_ROW_LEN EMPTY_BLKS ALLOC_BLKS  PCT_HWM
DATA_IN_BYTES DATA_IN_BLKS MOD_DATA_IN_BLKS PCT_SPC_USED
---------- ---------- ---------- ----------- ---------- ---------- --------
------------- ------------ ---------------- ------------
      1630       1150       2358        210         34       1664   .97957
       495180   60.4467773       75.5584717   .045407735

Now the PCT_SPC_USED is down to 4.5%

-- has EMPTY_BLOCKS changed if I use ANALYZE?
analyze table hwm compute statistics;

 BLKS_USED  AVG_SPACE   NUM_ROWS AVG_ROW_LEN EMPTY_BLKS ALLOC_BLKS  PCT_HWM
DATA_IN_BYTES DATA_IN_BLKS MOD_DATA_IN_BLKS PCT_SPC_USED
---------- ---------- ---------- ----------- ---------- ---------- --------
------------- ------------ ---------------- ------------
      1630       7682       2358         213         34       1664   .97957
       502254   61.3103027       76.6378784   .046056417

As you can see, if I use ANALYZE then AVG_SPACE alters, but EMPTY_BLOCKS does not, even though I cleared
a chunk of the table. So there are blocks that can be reused but they are not listed as empty.

I'll just take a quick side-step and show an "oddity" about dbms_stats
--deleting the stats (using dbms_stats)

exec dbms_stats.delete_table_stats(ownname=>user,tabname=>'HWM')

---------- ---------- ---------- ----------- ---------- ---------- --------
------------- ------------ ---------------- ------------

dbms_stats.delete_table_stats clears ALL the statistics, even the ones it does not populate

--and now collect them via dbms_stats again
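{The gather itself was not shown; it is just the same call as earlier:}

exec dbms_stats.gather_table_stats(ownname=>user,tabname=>'HWM')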

 BLKS_USED  AVG_SPACE   NUM_ROWS AVG_ROW_LEN EMPTY_BLKS ALLOC_BLKS  PCT_HWM
DATA_IN_BYTES DATA_IN_BLKS MOD_DATA_IN_BLKS PCT_SPC_USED
---------- ---------- ---------- ----------- ---------- ---------- --------
------------- ------------ ---------------- ------------
      1630          0       2358         210          0       1664   .97957
       495180   60.4467773       75.5584717   .045407735

--now to shrink the table
alter table hwm enable row movement;
alter table hwm shrink space;

 BLKS_USED  AVG_SPACE   NUM_ROWS AVG_ROW_LEN EMPTY_BLKS ALLOC_BLKS  PCT_HWM
DATA_IN_BYTES DATA_IN_BLKS MOD_DATA_IN_BLKS PCT_SPC_USED
---------- ---------- ---------- ----------- ---------- ---------- --------
------------- ------------ ---------------- ------------
        72          0       2358         210          0         80   .90000
       495180   60.4467773       75.5584717   .944480896

So I calculated that there were about 75 blocks of data in that table. Having shrunk it down to 72
used blocks, I was only a little bit out.

Having run through those examples we can see that the accuracy of the PCT_SPC_USED is down to the fudge factor employed but is probably close enough at 25%. After all, you are only likely to shrink a table that very clearly would benefit from it.

Shrinking Tables to Aid Full Scans April 18, 2012

Posted by mwidlake in performance, SQL.

{This blog is about shrinking tables where the High Water Mark is higher than it needs to be. Prompted by one of the comments, I wrote a follow-up post on finding the High Water Mark and tables that consist mostly of empty space, which would be candidates for shrinking.}

This blog is about one of those things I do as almost an autonomous “not thinking about it” performance housekeeping task, one which I have been meaning to mention for ages.

There can be quite a lot to gain by checking out full scans on “small” tables and seeing if they are as efficient as they can be. Sometimes they are, sometimes they are not. Often they are “good enough”. Occasionally one is awful.

Recently I was just casting an eye over the “top 20” SQL on a system for any code generating a lot of consistent gets. I came across the below:

-------------------- ------------- --------- ----------- -------------- ---------- ------------
-------------- -------------- ---------- ---------- --------------
2-03-14/18:00:10 bk9b5u6zyvy59 36,262 36,262 90,634,158 7 36,261
320,102.3 326,920.9 0 0 546

It caught my eye as it struck me as a lot of buffer gets for what I thought was quite a small table and it is executed quite often. In this case, a few thousand times a day. It takes 2500 buffer gets per count(*). All buffered, but the CPU has to chunk through it.

I check, it is a small table:

ABCD01> select count(*) from dev_ODI_XXXX.snp_session;

  COUNT(*)
----------
        73
ABCD01> @tab_lst
Enter value for tab_name: snp_session

------------------------------ ------------------------------ -------------

-- and a quick check on those key columns

-------- -------------------- ------------ ---------- --------------- --------------- ---- -----

The table IS small, only 73 rows. It is also interesting that the stats are out by quite a percentage, but the table is still seen as small according to the stats. This indicates the table is quite volatile, but consistently small. Another thing to notice is that the number of distinct values for each column in the WHERE clause is only 4 and thus an index lookup would not be very specific. With such a small number of records in the table and not very specific index I “know” that an index lookup would not be of much benefit {BTW Everything that you “know” in this way – do check it in reality every now and again…}. I don’t look at adding an index, I look at why such a small table needs so many buffer gets.

So, why so many buffer gets?

First of all, I need to be careful. The original source of information I was looking at was V$SQL and that holds cumulative stats for all executions of the same SQL statement. Thus the buffer gets could have been very high a while ago (say, yesterday) and now be small, so the average is 2500 BGs per select but right now it is running fine. So I verify that the code is still doing a lot of work for such a small number of records:

set autotrace on
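{The statement itself is missing from the output I kept; based on the predicate information shown below, it would have been something along the lines of:}

select count(*) from dev_ODI_XXXX.snp_session
where  sess_status = 'D'
and    agent_name  = 'DEV_ODI_AGENT';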

Elapsed: 00:00:00.99

Execution Plan
Plan hash value: 1837310352
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
| 0 | SELECT STATEMENT | | 1 | 15 | 546 (1)| 00:00:01 |
| 1 | SORT AGGREGATE | | 1 | 15 | | |
|* 2 | TABLE ACCESS STORAGE FULL| SNP_SESSION | 1 | 15 | 546 (1)| 00:00:01 |

Predicate Information (identified by operation id):

   2 - storage("SESS_STATUS"='D' AND "AGENT_NAME"='DEV_ODI_AGENT')

Statistics
----------------------------------------------------------
          1  recursive calls
          0  db block gets
       2013  consistent gets
          0  physical reads

That’s still a lot of consistent gets to scan all of 75 rows and do a count (which is the sort aggregate step). A second to process and 2013 consistent gets to look at 75 rows in a skinny table – just trust me it is a skinny table. And the optimiser cost the full scan step at 546 “equivalent single IOs”.
{Oh, if anyone is wondering what the “TABLE ACCESS STORAGE FULL” operation and the storage(“SESS_STATUS”=’D’ AND “AGENT_NAME”=’DEV_ODI_AGENT’) predicate step are about, it’s because this system is an Exadata box. Just ignore this for now, it is not relevant to this issue. I’ll blog about it *next week*}.

What is the problem?

Well, let’s see how big the table is.

ABCD01> @seg_dets
Enter value for seg_name: snp_session
Enter value for owner: dev_odi%
Any Key>

-------- --------------- --- -------- ---------- --------- ---- ------- -------
DEV_ODI_ SNP_SESSION TAB DEV_ODI_ 16,384 2,048 31 64 1024

The table is 2048 blocks in size, the consistent gets to look at the table is 2013. That’s close.

The problem is that when you full scan a table, it scans up to the high water mark of that table, ie the highest block that Oracle knows there has been data in. As Oracle inserts new data into a table and extends it, it keeps a note of the highest block data has been put into, the High Water Mark (HWM). When you delete data from a table, Oracle marks individual blocks to be reused as they empty (drop below a certain percentage full), but it does NOT check whether whole ranges of blocks have been emptied and whether it could therefore lower the high water mark. In fact, there could well be a few records in the first block of the table and then nothing until a few records in the last block that has been used, the High Water Mark. Oracle just scans all blocks and counts relevant records up to the high water mark. This is actually a very long-standing performance tweak: you can allocate GBs of space to a table but Oracle knows the very highest block it has put data into for that segment and only scans up to that High Water Mark. Usually, the HWM is useful.

I happen to know that this table used to have several thousand records in it but is now used far less extensively and holds only a few records. They are constantly deleted and new ones created. I have seen {very regularly, at least once a year} some extreme situations where a table has had many millions of rows in it at one point, setting the HWM to several hundred thousand blocks into the table segment, but the table now only holds a few hundred records. But a Full Table Scan scans all the way up to the HWM, taking several minutes to access a few hundred records. Often that table is being inefficiently cleared with a “delete from table_x” statement and then re-populated.

What is the answer? Well, I could truncate the table and set the HWM to zero, but this table actually holds some real data, so I can’t truncate it in this situation. I could MOVE the table, ie get Oracle to recreate the table and move the existing data into it. That would set the HWM of the new segment to block one and then raise it as it inserts the data neatly into the first few blocks. The third option, and the one I took, is to SHRINK the table. This basically tells Oracle to shuffle the data down to the start of the table and reset the HWM. You can do this on the fly and I have done it on relatively active systems, but I must give the usual advice to test this yourself; you may have locking issues on a busy system that is hitting the table you shrink very hard. Also, you do need to enable row movement on the table for shrink to work, as is demonstrated below:

ABCD01> alter table dev_ODI_XXXX.snp_session shrink space cascade;
alter table dev_ODI_XXXX.snp_session shrink space cascade
ERROR at line 1:
ORA-10636: ROW MOVEMENT is not enabled

ABCD01> alter table dev_ODI_XXXX.snp_session enable row movement;

Table altered.

ABCD01> alter table dev_ODI_XXXX.snp_session shrink space cascade;

Table altered.

Elapsed: 00:00:01.98

So, how does my little select perform now?

set autotrace on

Elapsed: 00:00:00.04

Execution Plan
Plan hash value: 1837310352
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
| 0 | SELECT STATEMENT | | 1 | 15 | 546 (1)| 00:00:01 |
| 1 | SORT AGGREGATE | | 1 | 15 | | |
|* 2 | TABLE ACCESS STORAGE FULL| SNP_SESSION | 1 | 15 | 546 (1)| 00:00:01 |

Predicate Information (identified by operation id):

   2 - storage("SESS_STATUS"='D' AND "AGENT_NAME"='DEV_ODI_AGENT')

Statistics
----------------------------------------------------------
          1  recursive calls
          0  db block gets
          8  consistent gets
          0  physical reads

Fabulous. The run time is very short and the number of consistent gets is tiny. You might notice that the Explain Plan still thinks the cost of the full scan will be 546 (ie the equivalent of 546 single block reads) but that is because I have not re-gathered stats. I left the system to do that automatically that night.
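{If you did not want to wait for the overnight job to correct the optimizer’s view of the table, a quick manual gather would do it – something like:}

exec dbms_stats.gather_table_stats(ownname=>'DEV_ODI_XXXX',tabname=>'SNP_SESSION')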

I’ll just mention one more thing. Shrinking a table is a bit like the old trick of rebuilding indexes to compact them. People misunderstood what rebuilding indexes was achieving and why the indexes were so much larger than they “needed” to be and would spend CPU and IO rebuilding indexes overnight – only for them to quickly explode back to the original size (creating a lot of overhead as they did). Shrinking oversized tables can have a great benefit but keep in mind that if some process pushes the HWM right up every day by shoving lots of data into the table and then clearing it down later, shrinking the table is just a waste of time. It gives benefit when the table really has shrunk down in size for a while. Or it only occasionally explodes to a massive size. I have in the past written “regular shrink” scripts for tables I know explode once in a while (eg once a week) but are usually small, but you need to know what the pattern is. Don’t just blindly shrink tables.

For me in this situation, that code scanning that table dropped out of my “top 20 most demanding SQL statements” and has stayed out of it for weeks. The code now completed in considerably under 0.1 seconds as opposed to a whole second – but no one has noticed. I’d removed about 5% of the workload of the system – but no one noticed. But it took me 1/2 an hour to investigate and fix and I’d freed up a non-trivial chunk of processing resource. In my Performance Tuning heart I know I did a good job on that day.

Dropped Tables, Hiding Extents and Slow DBA_FREE_SPACE Queries February 16, 2012

Posted by mwidlake in database design, internals, performance.

My last post was on slow dictionary queries caused by having many, many thousands of extents in your database. This post is about a special case of this problem, which I encountered recently. It was on an Exadata box – Exadata is not particularly relevant to the issue, but I’m curious to see if mentioning Exadata will make this post more popular :-)

I was querying the used and free space on a very small database on the client’s X2-2 1/4 rack Exadata machine. The query was really slow, taking about 30 seconds. This is a FAST box, what is going on?

I quickly realised that the problem was specific to one tablespace:

Enter the tablespace (or leave null)> DATA_01

TS_NAME              ORD      SUM_BLKS        SUM_K  MAX_CHNK_K NUM_CHNK
-------------------- ----- ----------- ------------ ----------- --------
DATA_01              alloc     262,144    2,097,152   2,097,152        1
                     free       63,128      505,024     504,384       11
2 rows selected.
Elapsed: 00:00:00.21

> @SPC_SUM
Enter the tablespace (or leave null)> USERS

TS_NAME              ORD      SUM_BLKS        SUM_K  MAX_CHNK_K NUM_CHNK
-------------------- ----- ----------- ------------ ----------- --------
USERS                alloc     748,320    5,986,560   5,372,160        2
                     free      127,904    1,023,232       6,144    3,058
2 rows selected.

Elapsed: 00:00:26.05

We can see it takes 0.21 seconds for tablespace DATA_01, 26.05 seconds for the USERS. Yes, USERS is larger but not 150 times larger. What is going on? Well, as the title of this post suggests, it is down to dropping tables. The below shows my route to that conclusion.

What I could see was that for the USERS tablespace the maximum chunk of free space was relatively small compared to the sum of free space – 6MB out of 1GB, pretty much – and that there were a lot of individual pieces of free space, 3,058 of them. This tablespace was shattered into a lot of bits. So, what sort of extent size management do we have? How big are the tables and indexes in this tablespace? {NB I already knew we had no partitions so I did not have to worry about that}.

select tablespace_name,initial_extent,next_extent
      ,extent_management,min_extlen,allocation_type
from dba_tablespaces where tablespace_name ='USERS';

TABLESPACE_NAME                INITIAL_EXTENT NEXT_EXTENT EXTENT_MAN MIN_EXTLEN ALLOCATIO
------------------------------ -------------- ----------- ---------- ---------- ---------
USERS                                   65536             LOCAL           65536 SYSTEM

select tablespace_name,blocks,count(*) from dba_extents
where tablespace_name = 'USERS'
group by tablespace_name,blocks
having count(*) >1
order by blocks desc,tablespace_name

TABLESPACE_NAME                    BLOCKS   COUNT(*)
------------------------------ ---------- ----------
USERS                                2560          2
USERS                                2048          3
USERS                                1536          7
USERS                                1408          5
USERS                                1280          2
USERS                                1248          2
USERS                                1152          2
USERS                                1024        229
USERS                                 896         15
USERS                                 768         21
USERS                                 736          3
USERS                                 720          3
USERS                                 704          2
USERS                                 672          2
USERS                                 640         25
USERS                                 624          2
USERS                                 576          2
USERS                                 512        117
USERS                                 400          2
USERS                                 384         34
USERS                                 360          2
USERS                                 312          2
USERS                                 288          4
USERS                                 256         49
USERS                                 248          2
USERS                                 240          2
USERS                                 192          5
USERS                                 160          4
USERS                                 128       1165
USERS                                   8       1788

30 rows selected.

So we have system controlled extent allocation, this should be fine. I did a quick check of the table contents – 300 or so tables and 200 or so indexes. And, as you can see from the check of extents above, very few larger extents and a lot of small ones. {As a little aside, note the relatively large number of 8-block, 128-block, 256-block and 1024-block extents; These are the sizes that LMTs with system managed extents tend to use unless a similar sized chunk is found to be available, in which case it is used}.

I did some other checks on indexes and segment sizes and it all boiled down to one thing. There were not a lot of things in this tablespace, what was there was small and… The total of all those table and index blocks was way less than the used space in the tablespace.

That is the major indicator of what is going on here: the difference between the space used in the tablespace and the total size of the visible tables and indexes.

The last link in the chain is the recycle bin.

select owner,ts_name,count(*) from dba_recyclebin group by owner,ts_name

OWNER                          TS_NAME                          COUNT(*)
------------------------------ ------------------------------ ----------
USER1                          USERS                              542356
USER1                                                                  2
WEGWEGWEG                      USERS                                  97
KKKUKUYLLX                     USERS                                 149
USOVFPKEKS                     USERS                                   3
ERHJTRTTTURT                   USERS                                   4

11 rows selected.

That’s 542,356 objects in the recyclebin for one user, in the one tablespace. My problem tablespace. The penny dropped; something I had already noticed and been a little uncomfortable about fell into place.

The client is using Oracle Data Integrator (ODI) to pull data together and put it into the database. The process they are using basically creates a table, loads some data into it and then pushes the data on into the target tables. Then it drops the table. The drop does not do a “DROP TABLE … PURGE;”. This is done many, many times per load cycle, which is intended to run several times a day.
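{To illustrate the pattern – this is not the client’s actual code, just a sketch with made-up names:}

create table odi_tmp_load as select * from some_source where 1=0;
-- ...load the work table, push the rows on into the target...
drop table odi_tmp_load;        -- the table goes into the recyclebin as a BIN$ object
-- whereas this version would have released the space immediately:
-- drop table odi_tmp_load purge;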

Something you should always keep in mind with the recyclebin is that the tables, indexes and their extents that go into the recycle bin do not get reported in several of the key data dictionary views. I did mention this in a blog post way back but maybe a whole post on it is called for.
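{As a quick illustration of where the “hidden” space does show up, you can see what the recyclebin is holding in a given tablespace with something like:}

select ts_name, count(*) objects, sum(space) blocks
from   dba_recyclebin
where  ts_name = 'USERS'
group  by ts_name;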

So, the tablespace was being shattered by the constant creating and dropping of small tables. It was hidden from easy view due to how the recyclebin is (not) exposed in data dictionary views.

It is not good practice to constantly create and drop lots of tables. As well as the admittedly rather odd impact that this posting is all about, there is a lot of internal work involved for the Oracle database in creating and dropping tables. It is an overhead best avoided. The client had good reasons for this approach but now they are going to look at the alternatives.

I’ve not actually proven in this posting that all those dropped tables are the actual cause of the slow performance in querying the free space in that tablespace {though the corroborative evidence is very strong}. Neither have I said how we cleared up the mess. I’ll cover both of those in the next post on this mini-thread.

Slow querying of DBA_EXTENTS, DBA_FREE_SPACE and dropping tables February 14, 2012

Posted by mwidlake in performance.

Are you finding queries to identify free space, the size of segments or details of extents in your database are very slow? You could have a database with a very, very large number of extents. If you are on version 9 or up, I would suggest you check the extent management settings for any tablespace holding large segments (see later).

{Caveat – I’m writing this posting based on what I remember of some issues I had on two V10.2 databases 2 and 3 years ago respectively, and those systems are no longer available to me. So please treat everything I say here with some caution – I’ve not checked and verified it to the level I normally would. But I wanted to move this information from another posting I am writing on a recent experience I will verify…}

First, what this article is NOT about. It has been known for a very long time that creating tables with a large number of extents can have a negative impact on SQL performance on that table. Except it never really did and it was a myth. Way back prior to Oracle 7.3 you could only create so many extents per segment, depending on block size, but that was another issue. It used to be argued that SQL select against such tables with many extents was slower. I did some tests and it was not – unless you were in the strange situation where your extent size was less than your multi-block read count, and even then the impact was not huge, it was a slowdown of a few percent to maybe 25%.

However, dropping such tables, truncating such tables and queries against the dictionary objects that deal with extents and free space could, and still can, become very slow. As we have progressed through the Oracle versions from 8, 9 and 10 this problem has become less common and the impact has become less, mostly due to Locally Managed Tablespaces (LMTs) and Automatic Segment Space Management {though that is more in respect of concurrent DML than select}.

LMTs in particular have generally removed the issue. If you do not use LMTs and have no very, very pressing reason not to {like 3rd party support for applications}, then swap to LMTs. LMTs have been the default for user defined tablespaces since Oracle 9 and have two options: Uniform (fixed) extent sizes, or Automatic, where Oracle uses 8 block extents for a new segment to start, then 64 block extents, then 1024 block extents as the segment grows {I might be wrong on the exact size details but don’t worry about it, I certainly don’t}. You can check the settings for tablespaces as is demonstrated below. I create two tablespaces, one with uniform extent sizes and then one with auto-managed extent sizes, and check the relevant information:

create tablespace mdw_uni_1m
datafile size 100m    -- assumes OMF; otherwise name the datafile explicitly
extent management local uniform size 1M;

create tablespace mdw_auto
datafile size 100m
extent management local autoallocate;

select tablespace_name,initial_extent,next_extent
      ,extent_management,min_extlen,allocation_type
from dba_tablespaces where tablespace_name like 'MDW%';

TABLESPACE_NAME    INITIAL_EXTENT NEXT_EXTENT EXTENT_MAN MIN_EXTLEN ALLOCATIO
------------------ -------------- ----------- ---------- ---------- ---------
MDW_UNI_1M                1048576     1048576 LOCAL         1048576 UNIFORM
MDW_AUTO                    65536             LOCAL           65536 SYSTEM

As you can see, tablespace MDW_UNI_1M uses uniform extents of 1M and MDW_AUTO has system managed allocation and starts with 64K extents – 8 blocks with my 8k block size.

As a quick demo, I’ll just create and populate two simple tables and see what extent sizes are created for them:

create table mdw_big
tablespace mdw_auto
as select rownum id,lpad('A',1000,'A') pad
from dual connect by level <10000;

create table mdw_big_uni
tablespace mdw_uni_1m
as select rownum id,lpad('A',1000,'A') pad
from dual connect by level <10000;

select owner,segment_name,blocks,count(*)
from dba_extents
where segment_name like 'MDW%'
group by owner,segment_name,blocks;

OWNER           SEGMENT_NAME        BLOCKS   COUNT(*)
--------------- --------------- ---------- ----------
ERIC            MDW_BIG_UNI            128         12
ERIC            MDW_BIG                  8         16
ERIC            MDW_BIG                128         11
3 rows selected.

So, how do issues with large numbers of extents still arise with modern Oracle systems? Well, the two situations I’ve seen on Oracle 10 had the same cause {and, as a teaser for later this week, I’ve seen a variation of this issue on a nice, shiny Exadata X2-2 Oracle box :-) You can read about that here}. What alerted me was slow performance querying the data dictionary, in particular my scripts for checking free space, the size of segments and how many extents they consisted of.

If you create a tablespace with fixed extent sizes of eg 64K and then create a table in there that is 100GB in size, you will get an awful lot of extents. Now make it worse and have a partitioned table that ends up being several hundreds of GB in size with all those segments in that tablespace (or similarly defined tablespaces).
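{To put a number on that: a 100GB segment in a tablespace with uniform 64KB extents needs 100GB / 64KB = 1,638,400 extents – for just the one segment.}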

Since I hit the problem twice myself, I’ve chatted to others who have had the same issue. The above usually happens because of a mistake. The tablespace(s) in question are set up to hold small tables and then get used for large tables, either by a mistake in stating the exact tablespace to use or having the tablespace default to your default tablespace – which just happens to be a tablespace with fixed but small extent sizes.

The end result is a massive number of small extents in these tablespaces, usually with extents for different objects mixed in. Some dictionary queries slow down and, in particular, anything to do with looking at extents. For one site, I was trying to use my own code to gather statistics on tables that replaced the standard automated job. It’s fairly “smart” code and chooses a sample size based on the size of the segments. Only, the data dictionary was performing so slowly for the check on segment size that it was taking over 5 seconds to get the information – longer than some of the stats gathers.

You can logically understand why dropping or truncating the table is slow. Oracle has to sort out all those extents, remove the information from the data dictionary. This is not helped by the fact that part of the data dictionary is being slowed down due to all those pesky records…

You MAY be able to get some relief from this situation by gathering fixed object statistics. I did so at one site, where the queries against free_space and segment size sped up by around 80%. I have no demonstrated proof of this, it is just what I saw in one situation, so feel free to try it but don’t sue me if it does not help. Also, it took over 3 hours to gather the fixed object stats and you only do this sort of thing, untested, on a production system if you are already in a bad place.
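{For reference, the call is simply the below – it can take hours, as I said, so pick your moment:}

exec dbms_stats.gather_fixed_objects_stats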

{update – I just tested this on a private 11.2 db that was taking 10.3 seconds to count all extents, all 12,742 of them. Gathering fixed object stats made no difference at all.}

However, the real answer is to laboriously rebuild those segments in tablespaces with correctly specified uniform extent sizes. Which we did, over several weeks, and it made a difference.

If I was doing this task today, if I could get the outage to do it, I would create COPIES of those segments that were in the wrong tablespaces, re-name them and re-apply any constraints, move the other, smaller tables and indexes to a new tablespace – and then drop the old tablespaces including contents. Why? Because dropping a table with lots and lots of small extents seemed to take a very long time (many minutes per partition, and we had thousands of them). Again, my proof is lost in the mists of time, but that’s what I would aim to do.
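{A sketch of the sort of steps I mean, with made-up names – and of course you would test all of this carefully before touching a real system:}

create table big_tab_new tablespace good_uniform_ts
as select * from big_tab;
-- re-create indexes and constraints on big_tab_new, then swap the names over
alter table big_tab     rename to big_tab_old;
alter table big_tab_new rename to big_tab;
-- relocate the small segments that should stay
alter table small_tab move tablespace small_uniform_ts;
alter index small_tab_pk rebuild tablespace small_uniform_ts;
-- and finally get rid of the shattered tablespace in one go
drop tablespace old_shattered_ts including contents and datafiles;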

IOTs by the Oracle Indexing Expert January 10, 2012

Posted by mwidlake in Blogging, performance.

I’m really pleased to see that Richard Foote has started a series on Index Organized Tables. You can see his introductory post on the topic here. As ever with Richard, he puts in lots of detail and explanation and I’ve been a fan of his blogging style for a long time.

I’ve got a few posts on the topic left to do myself, maybe this competition will spur me to get on and write them!

What I will also be very interested to see is the different way we will cover the same topic. Richard has already put in a block dump and dug into the details a little at a low level about how the data is stored, which I have not done. He has also shown how using an IOT instead of a fully overloaded index (where you create an index covering all the columns of the table, to avoid visiting the table for other columns) results in not only saving the space taken up by the redundant heap table but that the IOT index is smaller than the fully overloaded index. This is due to the lack of a rowid.

I put in occasional pictures and maybe write more about how the example matches real world situations. If you want, you can look back at my own introduction to the topic.

I’m sure this is going to be an excellent series and I’ll be following it myself.

IOT Part 6(B) – OLTP Inserts into an IOT November 10, 2011

Posted by mwidlake in development, performance, Testing.

<..IOT1 – the basics
<….IOT2 – Examples and proofs
<……IOT3 – Significantly reducing IO
<……..IOT4 – Boosting Buffer Cache efficiency
<……….IOT5 – Primary Key Drawback
<…………IOT6(A) – Bulk Insert slowed down

IOT 6 part A covered the sort of insert work used to create an IOT or push a large number of rows into it via an “insert into…select from” type format. ie Bulk Load.
In this, part B, I am going to test inserting a small number of rows (10,000) into a table already holding a million rows. Something more like an hourly load or the inserts that would occur over a working day – a more OLTP-like test

I’m going to do the tests against four things. A “non-insert” test to ensure my data creation is not taking a lot of resource {the Null test}, then against my three versions of test table – a normal heap table, my IOT and my Partitioned IOT. I run each test on each segment type three times, the minimum number needed to help identify variation that could invalidate the test.

You will probably be surprised by the results….

All three tables have the same column structure, the heap has indexes on the primary key and a unique constraint, the IOT and partitioned IOT are organized on a modified version of the Primary key and have the second index for the unique key. Below is the creation statement for the partitioned IOT (the IOT is the same without the partition clause). You can get full creation details from IOT part 2:

create table transaction_IOT_P
(tran_type number(2)     not null
,tran_id   number(10)    not null
,ACCO_type number(2)     not null
,ACCO_id   number(10)    not null
,cre_date  date          not null
,vc_1      varchar2(1000) not null
,vc_2      varchar2(1000)
,date_1    date
,num_1     number(2)
,num_2     number(2)
,constraint tip_pk primary key(ACCO_type,ACCO_id,cre_date,tran_type,tran_id) 
--  using index tablespace index_01
,constraint tip_tran_uq unique (tran_type,tran_id)
  using index tablespace index_01
)
organization index
tablespace data_01
partition by range  (cre_date)
(partition rm20110901 values less than (to_date('01-09-2011','DD-MM-YYYY')) 
  tablespace data_01
,partition rm20111001 values less than (to_date('01-10-2011','DD-MM-YYYY'))
  tablespace data_01
,partition rm20111101 values less than (to_date('01-11-2011','DD-MM-YYYY'))
  tablespace data_01
,partition rm20111201 values less than (to_date('01-12-2011','DD-MM-YYYY')) -- this final partition line was truncated in the original listing and is reconstructed
  tablespace USERS
);

The three tests are:

  • Insert 10,000 records for random accounts, in time order for the same day. This would be like processing data for or during a “day”
  • Generating the 10,000 records in order of account, to better match the IOT structure, again for a single day. For any given account 0,1,2 or 3 records are created, to be a little more “real life”. This would be like loading a pre-processed set of data.
  • All 10,000 records are created for the one account, in datetime order, as an extreme example of inserting into an IOT data that is ordered to perfectly match the order of the IOT.

The code is written ‘badly’, ie a single record is prepared and inserted into the table at a time, within a loop of 10000 iterations. This is to avoid the bulk-processing that may benefit the heap table and also to better match the reality of an OLTP system. I do however only commit after all records are created for any given test, which is maybe a mistake but that’s what I did.

Now, before I go any further, does anyone want to guess by how much inserting into the heap table will be faster than inserting into the IOT (partitioned or otherwise)? Also, whether the partitioned IOT will be faster or slower than the IOT and what, if any, impact there will be by the ordering of data?

To allow you to think about that and not see the results by accident, I’ll stick in one of the insert harnesses here so you can see how I am doing it. Skip over if you just want the results

execute s_snap.my_snap(' finished non-insert test1')
-- Transaction_heap random data test
declare
v_num         number :=10000; -- number of accounts
v_offset      number :=-4;
v_str1        varchar2(60) :=lpad('A',60,'A'); -- source padding strings; the original
v_str2        varchar2(60) :=lpad('B',60,'B'); -- initialisation was lost, any text will do
v_tran_type   number(2);
v_tran_id     number(10);
v_acco_type   number(2);
v_acco_id     number(10);
v_cre_date    date;
v_vc_1        varchar2(1000);
v_vc_2        varchar2(1000);
v_date_1      date :=sysdate;
v_num_1       number(2) :=10;
v_num_2       number(2) :=15;
begin
dbms_output.put_line (to_char(SYSTIMESTAMP,'HH24:MI:SS.FF'));
for i in 1..10000 loop
    v_tran_type :=mod(1,3)+1;
    v_tran_id   := 1000000+i;
    v_acco_type := 5+(trunc(dbms_random.value(1,3))*5); -- 10 or 15
    v_acco_id   :=trunc(dbms_random.value(1,v_num/2)); -- 1 to 5000
    v_cre_date  :=trunc(sysdate-v_offset)+(i/(60*60*24));
    v_vc_1      :=substr(v_str1,1,51+mod(i,10));
    v_vc_2      :=substr(v_str2,1,46+mod(i,15));
    insert into transaction_heap
      (tran_type,tran_id,acco_type,acco_id,cre_date,vc_1,vc_2,date_1,num_1,num_2)
    values
      (v_tran_type,v_tran_id,v_acco_type,v_acco_id,v_cre_date
      ,v_vc_1,v_vc_2,v_date_1,v_num_1,v_num_2);
end loop;
commit;
dbms_output.put_line (to_char(SYSTIMESTAMP,'HH24:MI:SS.FF'));
end;
/
execute s_snap.my_snap(' finished th insert test1')

Here are the results of three runs for all tests:

Elapsed time in seconds:

                    random insert             ordered-by-account       perfect IOT order match
Segment type    Test 1    2    3            Test 1    2    3           Test 1    2     3    
------------    ------ ---- ----            ------ ---- ----            ----- ---- ----
No insert          2.1  0.1  0.1               0.1  0.1  0.1              0.1  0.0  0.1   

                                    AVG                         AVG                        AVG
                                   ----                        ----                       ----
Heap              52.3 20.5 11.4   28.1       33.4 13.8 10.7   19.3       9.1 13.0  9.8   10.6
IOT              137.1 83.7 26.4   82.4      110.8 63.6 39.0   71.2       6.4  6.7  4.0    5.8
Partn'd IOT       33.2  9.5  8.7   17.1       17.1 10.4 14.3   13.9       8.1  5.3  4.5    6.0
                                   ----                        ----                       ----

Allow me to summarise that.

First, the preparation of the data and cycling around the loop, the no-insert tests, are negligible. {The first no-insert test takes 2 seconds due to the parsing and dictionary-information-gathering overhead of being the first thing run}. We can ignore the overhead of the test harness.

Much more significantly:

  • Each iteration of each test, things get faster {with one odd exception}
  • For Random and semi-ordered inserts, inserting into the IOT is slower than the Heap by a factor of 300% to 400% (3 times to 4 times slower) – what we all probably expected.
  • Inserting into the IOT is Faster than the heap when the insert order matches the structure of the IOT in the perfect_IOT test. Not just the same but 40% faster
  • The partitioned IOT is as Fast or Faster than the Heap table, never mind the IOT. Over all the partitioned IOT takes 50% TO 75% OF THE TIME OF THE HEAP to be populated.

Hands up everyone expecting this? NB My hand would be down if I had not done similar tests before.

Why do we see these results?

I’ll just mention that the reason all the tests get quicker with each iteration is that the computer is doing the same thing over and over again and gets used to it, the relevant circuits in the chip are “oiling in”. Of course not, how silly – but it seems to be how non-IT people think of it :-) Basically more and more of each segment is getting cached and that reduces physical IO and speeds things up. You see this a lot in testing and you need to be constantly vigilant about it. You might think I should really run the tests until they reach a steady state, but that is a very false situation. On a real system you don’t tend to get the same thing run over and over and over again with nothing else running.

Much more interestingly, why is the partitioned IOT winning? Well, for the Heap table data is being poured into the leading edge of the table, which takes a little processing and some blocks to be created. However, there are also two indexes being maintained, the Primary Key and that Unique key. For each insert oracle has to find the right spot in each index (root block ->branch block->leaf block->check for row) and insert the new record, occasionally splitting a full leaf block.

For the IOT and the Partitioned IOT there is no table to maintain. In effect there are just two indexes to maintain. For the normal IOT this benefit is outweighed by the cost of working with a less space efficient Primary Key index. As the whole of the row is in each index entry there are fewer index entries possible per leaf block and so more blocks need to be read in from disk and more splits are needed. Thus the inserts into the IOT are slower.
But for the partitioned IOT not only is there no table to maintain but the table (really the Primary Key) is partitioned – on date. We are creating data for a single day at a time. All the relevant blocks will be in the one partition (and it’s the latest and smallest partition) and quickly all cached in memory and processed in memory. It’s the efficient use of the cache again. For both the IOT and the Heap, the primary key segment being processed is larger and the inserts are spread across the whole index.

Ordering data to match the index order (ordered_by_account tests) helps for all three segment types – heap, IOT and partitioned IOT. We have two indexes and the data is being ordered for both in effect, as one is on the columns we are ordering the data creation for and the other, the unique key, is generated by an ascending sequence. The gain is about 25-30% and is due to working through the indexes in order as opposed to skipping about. If the indexes were significantly larger than the buffer cache, or on a highly active system where the turnover in the cache was very high (indicated by a low buffer cache hit ratio :-) ), then the benefit would be higher.
The perfect ordering test (perfect_iot_order_match) is a special case. Not only is the data being processed in order of the indexes but only one part of the index is being updated – for a single account. In this situation the maintaining of the index is so efficient that the time saved not processing the heap table means both the IOT and IOT_P win out over the heap. So in this special case, inserts into IOTs are faster than inserts into a heap table.

And finally a quick summary:

  • Inserting records into an IOT carries about a 200-400% overhead, depending on the number of indexes
  • Ordering the data to match the IOT gives a 25-30% advantage, but it similarly aids update of the primary key on the heap
  • Partitioning the IOT such that a smaller chunk of the index is processed can be highly beneficial. I don’t cover it specifically in the above but you can probably appreciate that partitioning on a heap can be beneficial to processing indexes on the heap also.
  • If you can organise to create your IOT such that data is created only in one or a few distinct points in the structure (you might have to tweak your data load too of course), insert performance might match or even beat that for a heap table with the same indexes
  • Using methods to maximise efficient use of your buffer cache and reduce IO is a very powerful performance tuning aid, not just for select

BTW I am conscious of the fact that the above may surprise many and you might wonder if my data or tests are flawed. I have stripped out most of the code from this post as it was just becoming massive and confusing, but contact me and I will happily supply all the example scripts and data.

IOT P6(a) Update November 8, 2011

Posted by mwidlake in Architecture, development, performance, Testing.

In my last post, IOT part 6, inserts and updates slowed down, I made the point that IOT insert performance on a relatively small Oracle system was very slow, much slower than on a larger system I had used for professional testing. A major contributing factor was that the insert was working on the whole of the IOT as data was created. The block buffer cache was not large enough to hold the whole working set (in this case the whole IOT) once it grew beyond a certain size. Once it no longer fitted in memory, Oracle had to push blocks out of the cache and then read them back in next time they were needed, resulting in escalating physical IO.

I’ve just done another test which backs up this claim. I altered my test database so that the block buffer cache was larger, 232MB compared to 100MB in my first tests. The full IOT is around 200MB.
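{I don’t show the resize itself; assuming the buffer cache is sized manually rather than via automatic SGA management, it is just something along the lines of:}

alter system set db_cache_size=232M scope=both;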

Bottom line, the creation of the IOT was greatly sped up (almost by a factor of 4) and the physical IO dropped significantly, by a factor of 20. As a result, the creation of the IOT was almost as fast as the partitioned IOT. It also shows that the true overhead on insert of using an IOT is more like a factor of 2 to 4 as opposed 6 to 8.

You can see some of the details below. Just to help you understand them, it is worth noting that I had added one new, larger column to the test tables (to help future tests) so the final segments were a little larger (the IOT now being 210MB as opposed to 180MB in the first tests) and there was a little more block splitting.

                        Time in Seconds
Object type           Run with       Run with
                     100MB cache    232MB cache
------------------  ------------    -----------   
Normal Heap table          171.9          119.4   
IOT table                1,483.8          451.4     
Partitioned IOT            341.1          422.6 

-- First reading 100MB cache
-- second reading 232MB cache 
STAT_NAME                            Heap    	IOT	      IOT P
-------------------------------- ---------- -----------  ----------
CPU used by this session            5,716         7,222       6,241
                                    5,498         5,967       6,207

DB time                            17,311       148,866      34,120
                                   11,991        45,459      42,320

branch node splits                     25            76          65
                                       25            82         107

leaf node 90-10 splits                752         1,463       1,466
                                      774         1,465       1,465

leaf node splits                    8,127        24,870      28,841
                                    8,162        30,175      40,678

session logical reads           6,065,365     6,422,071   6,430,281
                                6,150,371     6,544,295   6,709,679

physical read IO requests             123        81,458       3,068
                                      36          4,012       1,959

physical read bytes             2,097,152   668,491,776  25,133,056
                                1,400,832    34,037,760  16,048,128

user I/O wait time                    454       139,585      22,253
                                       39        34,510      19,293

The heap table creation was faster with more memory available. I’m not really sure why: the CPU effort was about the same as before and, though there was some reduction in physical IO with the larger cache, I suspect it might be more to do with both the DB and the machine having been recently restarted.

All three tests are doing a little more “work” in the second run due to that extra column and thus slightly fewer rows fitting in each block (more branch node and leaf node splits), but this just highlights even more how much the IOT performance has improved, which correlates with a massive drop in physical IO for the IOT creation. If you check the session logical reads they are increased by a very small, consistent amount. Physical read IO requests have dropped significantly and, in the case of the IOT, plummeted.

I believe the 90:10 leaf node splits are consistent as that will be the maintaining of the secondary index on ACCO_TYPE and ACCO_ID, which are populated in order as the data is created (derived from rownum).

What this second test really shows is that the efficiency with which you are able to make use of the database cache is incredibly significant. Efficiently accessing data via good indexes or tricks like IOTs and hash tables is important but it really helps to also try and consider how data is going to be recycled within the cache or used, pushed out and then reused. A general principle for batch-type work seems to me to be that if you can process it in chunks that can sit in memory, rather than the whole working set, there are benefits to be gained. Of course, partitioning can really help with this.

{If anyone is wondering why, for the heap table, the number of physical IO requests has dropped by 70% but the actual number of bytes has dropped by only 30%, I’m going to point the finger to some multi-block read scan going on, either in recursive code or, more likely, my code that actually gathers those stats! That would also help explain the drop in user IO wait time for the heap run.}

Just for completeness, here is a quick check of my SGA components for the latest tests, just to show I am using the cache size I claim. All of this is on Oracle 11.1 enterprise edition, on a tired old Windows laptop. {NB new laptop arrived today – you have no idea how hard it has been to keep doing this blog and not play with the new toy!!!} If anyone wants the test scripts in full, send me a quick email and I’ll provide them:

-- sga_info.sql
-- Martin Widlake /08
-- summary
set pages 32
set pause on
col bytes form 999,999,999,999,999 head byts___g___m___k___b
spool sga_info.lst
select * 
from v$sgainfo
order by name;
spool off
clear col
NAME                             byts___g___m___k___b RES
-------------------------------- -------------------- ---
Buffer Cache Size                         243,269,632 Yes
Fixed SGA Size                              1,374,892 No
Free SGA Memory Available                           0
Granule Size                                4,194,304 No
Java Pool Size                              4,194,304 Yes
Large Pool Size                             4,194,304 Yes
Maximum SGA Size                          401,743,872 No
Redo Buffers                                6,103,040 No
Shared IO Pool Size                                 0 Yes
Shared Pool Size                          142,606,336 Yes
Startup overhead in Shared Pool            50,331,648 No
Streams Pool Size                                   0 Yes

IOT Part 6 – Inserts and Updates Slowed Down (part A) November 1, 2011

Posted by mwidlake in Architecture, performance, Testing.

<..IOT1 – the basics
<….IOT2 – Examples and proofs
<……IOT3 – Significantly reducing IO
<……..IOT4 – Boosting Buffer Cache efficiency
<……….IOT5 – Primary Key Drawback
…………>IOT6(B) – OLTP Inserts

A negative impact of using Index Organized Tables is that inserts are, and updates can be, significantly slowed down. This post covers the former and the reasons why – and the need to always run tests on a suitable system. (I’m ignoring deletes for now – many systems never actually delete data and I plan to cover IOTs and delete later)

Using an IOT can slow down insert by something like 100% to 1000%. If the insert of data to the table is only part of a load process, this might result in a much smaller overall impact on load, such as 25%. I’m going to highlight a few important contributing factors to this wide impact spread below.

If you think about it for a moment, you can appreciate there is a performance impact on data creation and modification with IOTs. When you create a new record in a normal table it gets inserted at the end of the table (or perhaps in a block marked as having space). There is no juggling of other data.
With an IOT, the correct point in the index has to be found and the row has to be inserted at the right point. This takes more “work”. The inserting of the new record may also lead to an index block being split and the extra work this entails. Similar extra work has to be carried out if you make updates to data that causes the record to move within the IOT.
Remember, though, that an IOT is almost certainly replacing an index on the heap table which, unless you are removing indexes before loading data and recreating them after, would have to be maintained when inserting into the Heap table. So some of the “overhead” of the IOT would still occur for the heap table in maintaining the Primary Key index. Comparing inserts or updates between a heap table with no indexes and an IOT is not a fair test.

For most database applications data is generally written once, modified occasionally and read many times – so the impact an IOT has on insert/update is often acceptable. However, to make that judgement call you need to know

  • what the update activity is on the data you are thinking of putting into an IOT
  • the magnitude of the impact on insert and update for your system
  • the ratio of read to write (a sketch of one way to get a feel for this follows this list).
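{That last one can be tricky to pin down. A rough way to get a feel for it on an existing segment – the table name below is one of my test tables, just as an example – is to look at the segment-level statistics:}

select statistic_name, value
from   v$segment_statistics
where  owner = user
and    object_name = 'TRANSACTION_HEAP'
and    statistic_name in ('logical reads','physical reads','db block changes');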

There is probably little point putting data into an IOT if you constantly update the primary key values (NB see IOT-5 as to why an IOT’s PK columns might not be parts of a true Primary Key) or populate previously empty columns or hardly ever read the data.

There is also no point in using an IOT if you cannot load the data fast enough to support the business need. I regularly encounter situations where people have tested the response of a system once populated but fail to test the performance of population.

Now to get down to the details. If you remember the previous posts in this thread (I know, it has been a while) then you will remember that I create three “tables” with the same columns. One is a normal heap table, one is an Index Organized Table and one is a partitioned Index Organized Table, partitioned into four monthly partitions. All tables have two indexes on them, the Primary Key index (which is the table in the case of the IOTs) and another, roughly similar index, pre-created on the table. I then populate the tables with one million records each.

These are the times, in seconds, to create 1 million records in the HEAP and IOT tables:

                  Time in Seconds
Object type         Run_Normal
------------------  ----------
Normal Heap table        171.9  
IOT table               1483.8

This is the average of three runs to ensure the times were consistent. I am using Oracle V11.1 on a machine with an Intel Core 2 Duo T7500 at 2.2GHz, 2GB memory and a standard 250GB 5000RPM disk. The SGA is 256MB and Oracle has allocated around 100MB-120MB to the buffer cache.

We can see that inserting the 1 million rows into the IOT takes about 860% of the time it takes for the heap table. That is a significant impact on speed. We now know how large the impact of using an IOT is on insert, and presumably it’s all to do with juggling the index blocks. Or do we?

This proof-of-concept (POC) on my laptop {which you can also run on your own machine at home} did not match a proof-of-concept I did for a client. That was done on V10.2.0.3 on AIX, on a machine with 2 dual-core CPUs with hyper-threading (so 8 virtual cores), a 2GB SGA and approx 1.5GB buffer cache, with enterprise-level storage somewhere in the bowels of the server room. The results on that machine for creating a similar number of records were:

                  Time in Seconds
Object type         Run_Normal
------------------  ----------
Normal Heap table        152.0  
IOT table                205.9

In this case the IOT inserts required 135% of the time of the Heap table. This was consistent with other tests I did with a more complex indexing strategy in place, where the IOT overhead was around 25-35%. I can’t go into much more detail as the information belongs to the client, but the data creation was more complex and so the actual inserts were only part of the process – this is how it normally is in real life. Even so, the difference in overhead between my local-machine POC and the client hardware POC is significant, which highlights the impact your platform can have on your testing.

So where does that leave us? What is the true usual overhead? Below are my fuller results from the laptop POC.

                        Time in Seconds
Object type         Run_Normal    Run_quiet    Run_wrong_p
------------------  ----------    ---------    -----------
Normal Heap table        171.9        81.83         188.27  
IOT table               1483.8      1055.35        1442.82
Partitioned IOT          341.1       267.83         841.22 

Note that with the partitioned IOT the creation took 341 seconds, so the performance ratio to a heap table is only 198% – much better than the normal IOT. Hopefully you are wondering why!

I’m running this test on a Windows laptop and other things are going on. The timings for Run_Quiet are where I took steps to shut down all non-essential services and applications. This yielded a significant improvement for all three object types, but the biggest impact was on the already-fastest Heap table.

The final set of figures is for a “mistake”. I created the partitions wrongly, such that half the data went into one partition, the rest into another and a tiny fraction into a third, rather than being spread evenly over 4 partitions. You can see that the Heap and normal IOT times are very similar to the Run_Normal results (as you would expect, as these tests are the same) but for the partitioned IOT the time taken is half way towards the IOT figure.

We need to dig into what is going on a little further to see where the effort is being spent, and it turns out to be very interesting. During my proof-of-concept on the laptop I grabbed the information from v$sesstat for the session before and after each object creation so I could get the figures just for the loads. I then compared the stats between each object population and show some of them below {IOT_P means Partitioned IOT}.
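For those who want to do the same, the approach was along these lines – a sketch only, as the snapshot table name and the exact filtering here are illustrative rather than my actual script:

-- snapshot the session statistics before the load (illustrative only)
create table sess_stat_snap as
select sn.name, ss.value
from   v$sesstat  ss
      ,v$statname sn
where  ss.statistic# = sn.statistic#
and    ss.sid = sys_context('USERENV','SID');

-- ...now populate one of the test tables...

-- then report the deltas for this session
select sn.name, ss.value - snap.value  delta
from   v$sesstat      ss
      ,v$statname     sn
      ,sess_stat_snap snap
where  ss.statistic# = sn.statistic#
and    ss.sid  = sys_context('USERENV','SID')
and    sn.name = snap.name
and    ss.value - snap.value > 0
order by delta desc;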

STAT_NAME                                  Heap           IOT        IOT_P
------------------------------------ ---------- -------------  -----------
CPU used by this session                  5,716         7,222        6,241
DB time                                  17,311       148,866       34,120
Heap Segment Array Inserts               25,538            10           10

branch node splits                           25            76           65
leaf node 90-10 splits                      752         1,463        1,466
leaf node splits                          8,127        24,870       28,841

consistent gets                          57,655       129,717      150,835
cleanout - number of ktugct calls        32,437        75,201       88,701
enqueue requests                         10,936        28,550       33,265

file io wait time                     4,652,146 1,395,970,993  225,511,491
session logical reads                 6,065,365     6,422,071    6,430,281
physical read IO requests                   123        81,458        3,068
physical read bytes                   2,097,152   668,491,776   25,133,056
user I/O wait time                          454       139,585       22,253
hot buffers moved to head of LRU         13,077       198,214       48,915
free buffer requested                    64,887       179,653      117,316

The first section shows that all three used similar amounts of CPU, the IOT and partitioned IOT being a little higher. Much of the CPU consumed was probably spent generating the fake data. The DB Time of course pretty much matches the elapsed time, as the DB was doing little else.
It is interesting to see that the Heap insert uses array inserts, which of course are not available to the IOT and IOT_P as the data has to be inserted in order. {I think Oracle inserts the data into the heap table as an array and then updates the indexes for all the entries in the array – and I am only getting this array processing as I create the data as an array from an “insert into ... select” type load. But don’t hold me to any of that}.
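For reference, the load was a single bulk statement of the sort sketched below – the data generator here is illustrative, not my actual creation code. A row-by-row PL/SQL loop would not show the heap segment array insert behaviour at all:

-- illustrative bulk load; a single insert..select lets Oracle use
-- heap segment array inserts for the heap table version
insert into child_heap (pare_id, cre_date, vc_1)
select trunc(dbms_random.value(1,1000))
      ,to_date('01-01-2011','DD-MM-YYYY') + (rownum/10000)
      ,rpad('x',100,'x')
from   dual
connect by level <= 1000000;
commit;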

In all three cases there are two indexes being maintained, but in the case of the IOT and IOT_P the primary key index holds the whole row. This means there is more information per key, fewer keys per block and thus more blocks to hold the same data {and more branch blocks to reference them all}. So more block splits will be needed. The second section shows this increase in branch node and leaf block splits – roughly three times the branch node and leaf block splits. This is probably the extra work you would expect for an IOT. Why are there more leaf block splits for the partitioned IOT? The same volume of data ends up taking up more blocks in the partitioned IOT – 200MB for the IOT_P in four partitions of 40-60MB, as opposed to a single 170MB segment for the IOT. The larger overall size of the partitioned IOT is just due to a small overhead incurred by using partitions and also a touch of random fluctuation.

So for the IOT and IOT_P there is about three times the index-specific work being done, and a similar increase in related statistics such as enqueues – though not a full three times, as it is not just index processing that contributes to those other statistics. However, the elapsed time is much more than three times that of the heap load. Also, the IOT_P is doing more index work than the IOT but its elapsed time is less. Why?

The fourth section shows why. Look at the file io wait times. This is the total time spent waiting on IO {in millionths of a second} and it is significantly elevated for the IOT and to a lesser degree for the IOT_P. Physical IO is generally responsible for the vast majority of time in any computer system where it has not been completely avoided.
Session logical reads are only slightly elevated, almost negligibly so, but the number of physical reads to support them increases from 123 for the Heap table insert to 81,458 for the IOT and 3,068 for the IOT_P. A clue as to why comes from the “hot buffers moved to head of LRU” and “free buffer requested” statistics. There is a lot more activity moving blocks around in the buffer cache for the IOT and IOT_P.

Basically, for the IOT, all the blocks in the primary key segment are constantly being updated but eventually they won’t all fit in the block buffer cache – remember I said the IOT is eventually 170MB and the buffer cache on my laptop is about 100MB – so they are flushed down to disk and then have to be read back when altered again. This is less of a problem for the IOT_P as only one partition is being worked on at a time (the IOT_P is partitioned on date and the data is created day by day) and so more of it (pretty much all) will stay in memory between alterations. The largest partition only grows to 60MB and so can be worked on in memory.
For the heap, the table is simply appended to and only the indexes have to be constantly updated and they are small enough to stay in the block buffer cache as they are worked on.
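If you want to check this balance on your own system, a rough sanity check is to compare the segment sizes against the current buffer cache size – something like the sketch below (remember that an IOT’s data lives in its primary key index segment, so there is no table segment to look for):

-- how big are the segments? (an IOT shows up as its PK index segment)
select segment_name, segment_type, round(bytes/1024/1024) size_mb
from   user_segments
order  by bytes desc;

-- how big is the buffer cache right now?
select component, round(current_size/1024/1024) size_mb
from   v$sga_dynamic_components
where  component = 'DEFAULT buffer cache';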

This is why, when I got my partitioning “wrong”, the load took so much longer. More physical IO was needed as the larger partition would not fit into the cache as it was worked on – a quick check shows that logical reads and in fact almost all statistics were very similar, but 26,000 IO requests were made (compared to 81,458 for the IOT and 3,068 for the correct IOT_P).

Of course, I set my SGA size, and thus the buffer cache, to highlight the issue on my laptop and I have to say even I was surprised by the magnitude of the impact. On the enterprise-level system I did my client’s proof of concept on, the impact on insert was less because the buffer cache could hold the whole working set, I suspect the SAN had a considerable cache on it, there was ample CPU resource to cope with the added latching effort, and the time taken to actually create the data being inserted was a significant part of the workload – all of which reduced the overall impact of the slowness caused by the IOT.

{Update, in This little update I increase my block buffer cache and show that physical IO plummets and the IOT insert performance increases dramatically}.
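If you want to try that follow-up yourself, the change is simply to give the buffer cache more room, something like the below. This assumes an spfile; whether you set db_cache_size directly or raise the overall target depends on how your SGA is managed, and the sizes here are just examples:

-- illustrative only: enlarge a manually sized buffer cache
alter system set db_cache_size = 512M scope=both;

-- or, with automatic SGA management, raise the overall target instead
-- alter system set sga_target = 1G scope=both;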

This demonstrates that a POC, especially one for what will become a real system, has to be a realistic volume on realistic hardware.
For my client’s POC, I still had to bear in mind the eventual size of the live working set and the probable size of the live block buffer cache, and make some educated guesses.

It also explains why my “run_quiet” timings showed a greater benefit for the heap table than for the IOT and IOT_P. A Windows machine has lots of pretty pointless things running that take up CPU and a bit of memory, but not really much IO. I reduced the CPU load and that benefits activity that is not IO-bound, so it has more impact on the heap table load. Much of the time for the IOT and IOT_P is taken hammering the disk and that just takes time.

So, in summary:

  • Using an IOT increases the index block splitting and, in turn, enqueues and general workload. The increase is in proportion to the size of the IOT compared to the size of the PK index it replaces.
  • The performance degradation across the whole load process may well be less than 50%, but the only way to really find out is to test
  • With an IOT you may lose the array processing that can benefit a heap table load done via an intermediate table.
  • With an IOT you may run into issues with physical IO if the segment (or part of the segment) you are loading into cannot fit into the buffer cache (This may be an important consideration for partitioning or ordering of the data loaded)
  • If you do a proof of concept, do it on a system that is as similar to the real one as you can
  • Just seeing the elapsed time difference between tests is sometimes not enough. You need to find out where that extra time is being spent

I’ve thrown an awful lot at you in this one post, so I think I will stop there. I’ve not added the scripts to create the test tables here; they are in IOT-5 {lacking only the grabbing of the v$sesstat information}.

Want to Know More about Oracle’s Core? October 19, 2011

Posted by mwidlake in performance, Private Life, publications.
Tags: , , ,

I had a real treat this summer during my “time off” in that I got to review Jonathan Lewis’s upcoming new book. I think it’s going to be a great book. If you want to know how Oracle actually holds its data in memory, how it finds records already in the cache and how it manages to control everything so that all that committing and read consistency really works, then it will be the book for you.

{Update, Jonathan has confirmed that, unexpected hiccups aside, Oracle Core: Essential Internals for DBAs and Developers should be available from October 24, 2011}

{Thanks to Mike Cox, who let me know it is already available to be reserved at Amazon}

Jonathan got in touch with me around mid-May to say he was working on the draft of his new book, one that would cover “how does Oracle work”, the core mechanics. Would I be willing to be one of his reviewers? Before anyone comments that there is not likely to be much about core Oracle that I know and Jonathan does not, he did point out that he had already lined up someone to be his technical reviewer, ie someone he expected to know as much as he does and help spot actual errors. The technical reviewer is the most excellent Tanel Poder, who posted a little mention of it a couple of months back.

I was to act more like a typical reader – someone who knew the basics and wanted to learn more. I would be more likely to spot things he had assumed we all know but don’t, or bits that did not clearly explain the point if you did not already know the answer – ie an incomplete geek. I figured I could manage that :-).

It was a lot harder work than I expected and I have to confess I struggled to supply feedback as quickly as Jonathan wanted it – I was not working but I was very busy {and he maybe did not poke me with a sharp stick for feedback soon enough}. As anybody who has had to review code specifications or design documents will probably appreciate, you don’t just read stuff when you review it: you try to consider whether all the information is there and whether it can be misunderstood and, if you find that you don’t understand a section, you need to work out if the fault is with you, with the way it is written or with what is written. When I read a technical {or scientific} document and I do not fully understand it, I usually leave it a day, re-read it and if it still seems opaque, I just move on. In this case I could not do that; I had to ensure I understood it or else tell Jonathan why I thought I did not understand it. If there are sections in the end book that people find confusing, I’ll feel I let Jonathan down.

Just as tricky, on the one hand, as I’ve been using Oracle for so long and I do know quite a lot about Oracle {although clearly not enough in the eyes of the author :-) }, I had to try and “not know” stuff to be able to decide if something was missing. On the other hand, when I wanted to know more about something, was I just being a bit too nerdy? I swung more towards the opinion that if I wanted to know more, others would too.

I have to say that I really enjoyed the experience and I learnt a lot. I think it might change how I read technical books a little. I would run through each chapter once to get the feel of it all and then re-read it properly, constantly checking things in both version 11 and 10 of Oracle as I read the drafts and would not let myself skip over anything until I felt I really understood it. As an example, I’ve never dug into internal locks, latches and mutexes much before and now that I’ve had to learn more to review the book, I have a much better appreciation of some issues I’ve seen in the wild.

Keep an eye out for the book, it should be available by the end of this year and be called something like “Oracle Core” {I’ll check with Jonathan and update this}. I won’t say it will be an easy read – though hopefully a little easier as a result of my input – as understanding things always takes some skull work. But it will certainly be a rewarding read and packed full of information and knowledge.

Friday Philosophy – Human Tuning Issues September 23, 2011

Posted by mwidlake in Friday Philosophy, humour, Perceptions, performance.
Tags: , ,

Oracle Tuning is all about technical stuff. It’s perhaps the most detail-focused and technical aspect of Oracle Administration there is. Explain Plans, Statistics, the CBO, database design, Physical implementation, the impact of initialisation variables, subquery factoring, sql profiles, pipeline functions,… To really get to grips with things you need to do some work with 10046 and 10053 traces, block dumps, looking at latching and queueing…

But I realised a good few years ago that there is another, very important aspect and one that is very often overlooked. People and their perception. The longer I am on an individual site, the more significant the People side of my role is likely to become.

Here is a little story for you. You’ll probably recognise it, it’s one that has been told (in many guises) before, by several people – it’s almost an IT Urban Myth.

When I was but a youth, not long out of college, I got a job with Oracle UK (who had a nice, blue logo back then) as a developer on a complex and large hospital system. We used Pyramid hardware if I remember correctly. When the servers were put in place, only half the memory boards and half the CPU boards were initiated. We went live with the system like that. Six months later, the users had seen the system was running quite a bit slower than before and started complaining. An engineer came in and initiated those other CPU boards and Memory boards. Things went faster and all the users were happy. OK, they did not throw a party but they stopped complaining. Some even smiled.

I told you that you would recognise the story. Of course, I’m now going to go on about the dishonest vendor and what was paid for this outrageous “tuning work”. But I’m not. This hobbling of the new system was done on purpose and it was done at the request of “us”, the application developers. Not the hardware supplier. It was done because some smart chap knew that as more people used the system and more parts of it were rolled out, things would slow down and people would complain. So some hardware was held in reserve so that the whole system could be given a performance boost once the workload had ramped up, and people would be happy. Of course, the system was now only as fast as if it had been using all the hardware from day one – but the key difference was that rather than having unhappy users as things “were slower than 6 months ago”, everything was performing faster than it had done just a week or two ago, and users were happy due to the recent improvement in response time. Same end point from a performance perspective, much happier end point for the users.

Another aspect of this Human side of Tuning is unstable performance. People get really unhappy about varying response times. You get this sometimes with Parallel Query when you allow Oracle to reduce the number of parallel threads used depending on the workload on the server {there are other causes of the phenomenon, such as clashes with when stats are gathered or just random variation in data volumes}. So sometimes a report comes back in 30 minutes, sometimes it comes back in 2 hours. If you go from many parallel threads to single-threaded execution it might be 4 hours. That really upsets people. In this situation you probably need to look at whether you can fix the degree of parallelism at a level that gives a response time that is good enough for business reasons and can always be achieved. OK, you might be able to get that report out quicker 2 days out of 5, but you won’t have a user who is happy on 3 days and ecstatic with joy on the 2 days the report is early. You will have a user who is really annoyed 3 days and grumbling about “what about yesterday!” on the other 2 days.
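As a sketch of what I mean by fixing the degree (the table name and the degree of 8 below are just examples, not a recommendation):

-- pin the degree of parallelism rather than letting it vary with load
alter table big_report_table parallel (degree 8);

-- or fix it for an individual statement with a hint
select /*+ parallel(brt, 8) */ count(*)
from   big_report_table brt;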

Of course this applies to screens as well. If humans are going to be using what I am tuning and would be aware of changes in performance (ie the total run time is above about 0.2 seconds) I try to aim for stable and good performance, not “outright fastest but might vary” performance. Because we are all basically grumpy creatures. We accept what we think cannot be changed but if we see something could be better, we want it!

People are happiest with consistency. So long as performance is good enough to satisfy the business requirements, generally speaking you just want to strive to maintain that level of performance. {There is one strong counter-argument in that ALL work on the system takes resource, so reducing a very common query or update by 75% frees up general resource to aid the whole system}.

One other aspect of Human Tuning I’ll mention is one that UI developers tend to be very attuned to. Users want to see something happening. Like a little icon or a message saying “processing” followed soon by another saying “verifying” or something like that. It does not matter what the messages are {though spinning hour glasses are no longer acceptable}, they just like to see that stuff is happening. So, if a screen can’t be made to come back in less than a small number of seconds, stick up a message or two as it progresses. Better still, give them some information up front whilst the system scrapes the rest together. It won’t be faster, it might even be slower over all, but if the users are happier, that is fine. Of course, Oracle’s CBO implements this sort of idea when you specify “first_rows_n” as the optimizer goal as opposed to “all_rows”. You want to get some data onto an interactive screen as soon as possible, for the users to look at, rather than aim for the fastest overall response time.

After all, the defining criterion of IT system success is that the users “are happy” – ie they accept the system.

This has an interesting impact on my technical work as a tuning “expert”. I might not tune up a troublesome report or SQL statement as much as I possibly can. I had a recent example of this where I had to make some batch work run faster. I identified 3 or 4 things I could try and using 2 of them I got it to comfortably run in the window it had to run in {I’m being slightly inaccurate, it was now not the slowest step and upper management focused elsewhere}. There was a third step I was pretty sure would also help. It would have taken a little more testing and implementing and it was not needed right now. I documented it and let the client know about it, that there was more that could be got. But hold it in reserve because you have other things to do and, heck, it’s fast enough. {I should make it clear that the system as a whole was not stressed at all, so we did not need to reduce system load to aid all other things running}. In six months the step in the batch might not be fast enough or, more significantly, might once more be the slowest step and the target for a random management demand for improvement – in which case take the time to test and implement item 3. (For those curious people, it was to replace a single merge statement with an insert and an update, both of which could use different indexes).
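To sketch the idea (the table, column and index assumptions below are invented for illustration – this is not the client’s SQL), a single merge doing both jobs can be split into an update of existing rows and an insert of new ones, each of which can then drive off a different index:

-- one merge doing both the update and the insert (illustrative names only)
merge into daily_totals dt
using staging_totals st
on    (dt.account_id = st.account_id and dt.total_date = st.total_date)
when matched then
  update set dt.total_amount = st.total_amount
when not matched then
  insert (account_id, total_date, total_amount)
  values (st.account_id, st.total_date, st.total_amount);

-- split version: the update can drive off an index on daily_totals
-- (account_id, total_date) while the insert's not-exists check can
-- use a different access path
update daily_totals dt
set    dt.total_amount = (select st.total_amount
                          from   staging_totals st
                          where  st.account_id = dt.account_id
                          and    st.total_date = dt.total_date)
where  exists (select null
               from   staging_totals st
               where  st.account_id = dt.account_id
               and    st.total_date = dt.total_date);

insert into daily_totals (account_id, total_date, total_amount)
select st.account_id, st.total_date, st.total_amount
from   staging_totals st
where  not exists (select null
                   from   daily_totals dt
                   where  dt.account_id = st.account_id
                   and    dt.total_date = st.total_date);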

I said it earlier. Often you do not want absolute performance. You want good-enough, stable performance. That makes people happy.

