jump to navigation

IOT Part 6 – Inserts and Updates Slowed Down (part A) November 1, 2011

Posted by mwidlake in Architecture, performance, Testing.
Tags: , , , ,
trackback

<..IOT1 – the basics
<….IOT2 – Examples and proofs
<……IOT3 – Significantly reducing IO
<……..IOT4 – Boosting Buffer Cache efficiency
<……….IOT5 – Primary Key Drawback
…………>IOT6(B) – OLTP Inserts

A negative impact of using Index Organized Tables is that inserts are and updates can be significantly slowed down. This post covers the former and the reasons why – and the need to always run tests on a suitable system. (I’m ignoring deletes for now – many systems never actually delete data and I plan to cover IOTs and delete later)

Using an IOT can slow down insert by something like 100% to 1000%. If the insert of data to the table is only part of a load process, this might result in a much smaller overall impact on load, such as 25%. I’m going to highlight a few important contributing factors to this wide impact spread below.

If you think about it for a moment, you can appreciate there is a performance impact on data creation and modification with IOTs. When you create a new record in a normal table it gets inserted at the end of the table (or perhaps in a block marked as having space). There is no juggling of other data.
With an IOT, the correct point in the index has to be found and the row has to be inserted at the right point. This takes more “work”. The inserting of the new record may also lead to an index block being split and the extra work this entails. Similar extra work has to be carried out if you make updates to data that causes the record to move within the IOT.
Remember, though, that an IOT is almost certainly replacing an index on the heap table which, unless you are removing indexes before loading data and recreating them after, would have to be maintained when inserting into the Heap table. So some of the “overhead” of the IOT would still occur for the heap table in maintaining the Primary Key index. Comparing inserts or updates between a heap table with no indexes and an IOT is not a fair test.

For most database applications data is generally written once, modified occasionally and read many times – so the impact an IOT has on insert/update is often acceptable. However, to make that judgement call you need to know

  • what the update activity is on the data you are thinking of putting into an IOT
  • the magnitude of the impact on insert and update for your system
  • the ratio of read to write.

There is probably little point putting data into an IOT if you constantly update the primary key values (NB see IOT-5 as to why an IOT’s PK columns might not be parts of a true Primary Key) or populate previously empty columns or hardly ever read the data.

There is also no point in using an IOT if you cannot load the data fast enough to support the business need. I regularly encounter situations where people have tested the response of a system once populated but fail to test the performance of population.

Now to get down to the details. If you remember the previous posts in this thread (I know, it has been a while) then you will remember that I create three “tables” with the same columns. One is a normal heap table, one is an Index Organized Table and one is a partitioned Index Organized Table, partitioned into four monthly partitions. All tables have two indexes on them, the Primary Key index (which is the table in the case of the IOTs) and another, roughly similar index, pre-created on the table. I then populate the tables with one million records each.

These are the times, in seconds, to create 1 million records in the the HEAP and IOT tables:

                  Time in Seconds
Object type         Run_Normal
------------------  ----------
Normal Heap table        171.9  
IOT table               1483.8

This is the average of three runs to ensure the times were consistent. I am using Oracle V11.1 on a machine with an Intel T7500 core 2 Duo 2.2GHz, 2GB memory and a standard 250GB 5000RPM disk. The SGA is 256MB and Oracle has allocated around 100MB-120MB to the buffer cache.

We can see that inserting the 1 million rows into the IOT takes 860% the time it does with a heap table. That is a significant impact on speed. We now know how large the impact is on Insert of using an IOT and presumably it’s all to do with juggling the index blocks. Or do we?

This proof-of-concept (POC) on my laptop {which you can also run on your own machine at home} did not match with a proof-of-concept I did for a client. That was done on V10.2.0.3 on AIX, on a machine with 2 dual-core CPUS with hyper-threading (so 8 virtual cores), 2GB SGA and approx 1.5GB buffer cache, with enterprise-level storage somewhere in the bowels of the server room. The results on that machine to create a similar number of records were:

                  Time in Seconds
Object type         Run_Normal
------------------  ----------
Normal Heap table        152.0  
IOT table                205.9

In this case the IOT inserts required 135% the time of the Heap table. This was consistent with other tests I did with a more complex indexing strategy in place, the IOT overhead was around 25-35%. I can’t go into too much more detail as the information belongs to the client but the data creation was more complex and so the actual inserts were only part of the process – this is how it normally is in real life. Even so, the difference in overhead between my local-machine POC and the client hardware POC is significant, which highlights the impact your platform can have on your testing.

So where does that leave us? What is the true usual overhead? Below are my more full results from the laptop POC.

                        Time in Seconds
Object type         Run_Normal    Run_quiet    Run_wrong_p
------------------  ----------    ---------    -----------
Normal Heap table        171.9        81.83         188.27  
IOT table               1483.8      1055.35        1442.82
Partitioned IOT          341.1       267.83         841.22 

Note that with the partitioned IOT the creation took 341 second, the performance ratio to a heap table is only 198% and is much better than the normal IOT. Hopefully you are wondering why!

I’m running this test on a windows laptop and other things are going on. The timings for Run_Quiet are where I took steps to shut down all non-essential services and applications. This yielded a significant increase for all three object types but the biggest impact was on the already-fastest Heap table.

The final set of figures is for a “mistake”. I created the partitions wrong such that half the data went into one partition and the rest into another and a tiny fraction into a third, rather than being spread over 4 partitions evenly. You can see that the Heap and normal IOT times are very similar to the Run_Normal results (as you would expect as these test are the same) but for the partitioned IOT the time taken is half way towards the IOT figure.

We need to dig into what is going on a little further to see where the effort is being spent, and it turns out to be very interesting. During my proof-of-concept on the laptop I grabbed the information from v$sesstat for the session before and after each object creation so I could get the figures just for the loads. I then compared the stats between each object population and show some of them below {IOT_P means Partitioned IOT}.

STAT_NAME                            Heap    	IOT	        IOT P
------------------------------------ ---------- -------------  -----------
CPU used by this session                  5,716         7,222        6,241
DB time                                  17,311       148,866       34,120
Heap Segment Array Inserts               25,538            10           10

branch node splits                           25            76           65
leaf node 90-10 splits                      752         1,463        1,466
leaf node splits                          8,127        24,870       28,841

consistent gets                          57,655       129,717      150,835
cleanout - number of ktugct calls        32,437        75,201       88,701
enqueue requests                         10,936        28,550       33,265

file io wait time                     4,652,146 1,395,970,993  225,511,491
session logical reads                 6,065,365     6,422,071    6,430,281
physical read IO requests                   123        81,458        3,068
physical read bytes                   2,097,152   668,491,776   25,133,056
user I/O wait time                          454       139,585       22,253
hot buffers moved to head of LRU         13,077       198,214       48,915
free buffer requested                    64,887       179,653      117,316

The first section shows that all three used similar amounts of CPU, the IOT and partitioned IOT being a little higher. Much of the CPU consumed was probably in generating the fake data.The DB Time of course pretty much matches the elapsed time well as the DB was doing little else.
It is interesting to see that the Heap insert uses array inserts which of course are not available to the IOT and IOT_P as the data has to be inserted in order. {I think Oracle inserts the data into the heap table as an array and then updates the indexes for all the entries in the array – and I am only getting this array processing as I create the data as an array from a “insert into as select” type load. But don’t hold me to any of that}.

In all three cases there are two indexes being maintained but in the case of the IOT and IOT_P, the primary key index holds the whole row. This means there has to be more information per key, less keys per block and thus more blocks to hold the same data {and more branch blocks to reference them all}. So more block splits will be needed. The second section shows this increase in branch node and leaf block splits. Double the branch blocks and triple the leaf block splits. This is probably the extra work you would expect for an IOT. Why are there more leaf block splits for the partitioned IOT? The same data of volume ends up taking up more blocks in the partitioned IOT – 200MB for the IOT_P in four partitions of 40-60MB as opposed to a single 170MB for the IOT. The larger overall size of the partition is just due to a small overhead incurred by using partitions and also a touch of random fluctuation.

So for the IOT and IOT_P there is about three times the index-specific work being done and a similar increase in related statistics such as enqueues, but not three times as it is not just index processing that contribute to these other statistics. However, the elapsed time is much more than three times as much. Also, the IOT_P is doing more index work than the IOT but it’s elapsed time is less. Why?

The fourth section shows why. Look at the file io wait times. This is the total time spent waiting on IO {in millionths of a second} and it is significantly elevated for the IOT and to a lesser degree for the IOT_P. Physical IO is generally responsible for the vast majority of time in any computer system where it has not been completely avoided.
Session logical reads are only slightly elevated, almost negligably so but the number of physical reads to support it increases from 123 for the Heap table insert to 81,458 for the IOT and 3,068 for the IOT_P. A clue as to why comes from the hot buffers moved to head of LRU and free buffer requested statistics. There is a lot more activity in moving blocks around in the buffer cache for the IOT and IOT_P.

Basically, for the IOT, all the blocks in the primary key segment are constantly being updated but eventually they won’t all fit in the block buffer cache – remember I said the IOT is eventually 170MB and the buffer cache on my laptop is about 100MB – so they are flushed down to disk and then have to be read back when altered again. This is less of a problem for the IOT_P as only one partition is being worked on at a time (the IOT_P is partitioned on date and the data is created day by day) and so more of it (pretty much all) will stay in memory between alterations. The largest partition only grows to 60MB and so can be worked on in memory.
For the heap, the table is simply appended to and only the indexes have to be constantly updated and they are small enough to stay in the block buffer cache as they are worked on.

This is why when I got my partitioning “wrong” the load took so much longer. More physical IO was needed as the larger partition would not fit into the cache as it was worked on – A quick check shows that logical reads and in fact almost all statistics were very similar but 26,000 IO requests were made (compared to 81,458 for the IOT and 3,068 for the correct IOT_P).

Of course, I set my SGA size and thus the buffer cache to highlight the issue on my laptop and I have to say even I was surprised by the magnitude of the impact. On the enterprise-level system I did my client’s proof of concept on, the impact on insert was less because the buffer cache could hold the whole working set, I suspect the SAN had a considerable cache on it, there was ample CPU resource to cope with the added latching effort and the time taken to actually create the data inserted was a significant part of the workload, reducing the overall impact of the slowness caused by the IOT.

{Update, in This little update I increase my block buffer cache and show that physical IO plummets and the IOT insert performance increases dramatically}.

This demonstrates that a POC, especially one for what will become a real system, has to be a realistic volume on realistic hardware.
For my client’s POC, I still did have to bear in mind the eventual size of the live working set and the probably size of the live block buffer cache and make some educated guesses.

It also explains why my “run_quiet” timings showed a greater benefit for the heap table than the IOT and IOT_P. A windows machine has lots of pretty pointless things running that take up cpu and a bit of memory, not really IO so much. I reduced the CPU load and it benefits activity that is not IO, so it has more impact on the heap table load. Much of the time for the IOT and IOT_P is taken hammering the disk and that just takes time.

So, in summary:

  • Using an IOT increases the index block splitting and, in turn, enqueues and general workload. The increase is in proportion to the size of the IOT compared to the size of the replaced PK.
  • The performance degredation across the whole load process may well be less than 50% but the only way to really find out is to test
  • You may lose the array processing load that may benefit a heap table load if you do the load via an intermediate table.
  • With an IOT you may run into issues with physical IO if the segment (or part of the segment) you are loading into cannot fit into the buffer cache (This may be an important consideration for partitioning or ordering of the data loaded)
  • If you do a proof of concept, do it on a system that is as similar to the real one as you can
  • Just seeing the elapsed time difference between test is sometimes not enough. You need to find out where that extra time is being spent

I’ve thrown an awful lot at you in this one post, so I think I will stop there. I’ve not added the script to create the test tables here, they are in IOT-5 {lacking only the grabbing of the v$sesstat information}.

Comments»

1. Index Organized Tables – the Basics. « Martin Widlake's Yet Another Oracle Blog - November 1, 2011

[…] ..>IOT2 – Examples and proofs ….>IOT3 – Greatly reducing IO with IOTs ……>IOT4 – Boosting Buffer Cache Efficiency ……..>IOT5 – Primary Key issues ……….>IOT6a – Slowing Down Insert […]

2. IOT 2 – First examples and proofs « Martin Widlake's Yet Another Oracle Blog - November 1, 2011

[…] <.. IOT1 – Basics ..>IOT3 – Great reductions in IO for IOTs ….>IOT4 – Boosting Buffer Cache Efficiency ……>IOT5 – Primary Key issues ……….>IOT6a – Slowing Down Insert […]

3. IOT part 3 – Significantly Reducing IO « Martin Widlake's Yet Another Oracle Blog - November 1, 2011

[…] <..IOT1 – the basics <….IOT2 – Examples and proofs ……>IOT4 – Boosting Buffer Cache Efficiency ……..>IOT5 – Primary Key issues ……….>IOT6a – Slowing Down Insert […]

4. IOT Part 4 – Greatly Boosting Buffer Cache Efficiency « Martin Widlake's Yet Another Oracle Blog - November 1, 2011

[…] IOT Part 6 – Inserts and Updates Slowed Down (part A) « Martin Widlake's Yet Anothe… – November 1, […]

5. IOT Part 5 – Primary Key Drawback – and Workaround « Martin Widlake's Yet Another Oracle Blog - November 1, 2011

[…] IOT Part 6 – Inserts and Updates Slowed Down (part A) « Martin Widlake's Yet Anothe… – November 1, […]

6. Randolf Geist - November 1, 2011

Hi Martin,

regarding the bulk load and the block splitting: The space efficiency of an index or IOT for that matter heavily relies on the pattern of data arrival, so for “bulk” loads at least the index efficiency can be influenced by sorting the data before inserting. Of course this sorting adds some overhead that needs to be evaluated.

Another interesting aspect with bulk manipulations is the point that with partitioned IOTs you can also use the “scale to infinity” approach by creating a copy of the to be loaded / manipulated data and exchanging with the target partition rather than applying DML directly to the target table.

Whereas with heap organized tables it doesn’t matter for the space efficiency if the copy is created via bulk inserts into an existing table or Create Table As Select (CTAS), it does make quite a difference for IOTs. Only with a CTAS operation the index structure will be as efficiently packed as possible and also adhere to any PCTFREE specified (if any subsequent direct manipulations are anticipated).

So potentially the most efficient way to bulk load data into an IOT is using partitioning exchange in combination with CTAS to create a copy of the data to be exchanged.

Randolf

mwidlake - November 1, 2011

Hi Randolf,

Thanks for those points, excellent as always. As you say, you would have to load data ordered by the PK and via a bulk operation (CTAS though maybe also via a pre-created SQL array – I should try that) in order to get “90:10” block splits as the index is populated and thus higher row density. However, you can achieve gains at the buffer caching level by ordering the data to be loaded in a manner that means it will be processed into one partition at a time or chunk of the IOT at a time, so the same blocks are being processed. For example, if you had an IOT with a primary key of Surname, Forename, date of birth and something to make it unique, if you loaded data ordered by first letter of Surname alone, then all the A’s would be processed into one contiguous chunk of the IOT and then B’s into the next chunk…. No space efficiency gains but it would only be a subset of the IOT blocks being shuffled around in the buffer cache at a time.

Using partition swap with an IOT is tricky. The more partitions you have, the more you potentially reduce the grouping-in-blocks benefit of the IOT – I will be doing a full post about that. Thus what you gain in loading data could seriously impact select performance. However, if you can combine the two then of course partition swap is a brilliant way to prepare and pull in new data.

There is a way to reduce the space inefficiency of IOTs – I’m sure you know it Randolf, so don’t go spoiling my future post on it 🙂

7. IOT P6(a) Update « Martin Widlake's Yet Another Oracle Blog - November 8, 2011

[…] my last post, IOT part 6, inserts and updates slowed down, I made the point that IOT insert performance on a relatively small Oracle system was very slow, […]

8. IOTs « Oracle Scratchpad - November 22, 2011

[…] Part 6a: Inserts and upates slowed down […]

9. Construyendo un Arbol B+ (B+Tree) « El bienestar de la mayoría, supera al bienestar de la minoría. O de uno solo. - August 20, 2012
10. newbie01.oracle@gmail.com - June 10, 2014

Hi,
We have an IOT that we’ve done a large number of deletion on, do we have to rebuild it, i.e ALTER TABLE MOVE? This table used to have 100million rows, we’ve run some deletion that has removed about 50million rows. Used to be the count on the table is 5min, now it is 10min. We’ve run DBMS_STATS but that does not help. I can’t find anything any doc on MOS or otherwise with regards to rebuilding IOTs. There is a section on the doc about maintaining IOTs but not the reason or logic behind it. I’ve also found a MOS note about possible CORRUPTION when doing IOT move so a bit hesitant. Although the note is for Oracle9 and we are currently on Oracle11, not overly confident that it will not happen.

Any feedback much appreciated. Or maybe you can run another PART of the IOT series with regards of IOT rebuild?

mwidlake - June 10, 2014

Hi Newbie01

I thought I had done a post on shrinking IOTs but it would appear that I have not – I cover it in the presentations I used to do on IOTs though. So, to answer your question…
Firstly, Gathering stats will not reduce the time it takes to count the rows in your IOT as Oracle will be visiting every block in the segment (or segments if you use partitions) to do the count. You have not reduced the number of blocks oracle believes are used by gathering stats again.
Secondly, it is only worth rebuilding the IOT or shrinking it if you are not going to add back enough rows “soon” to push the table back up to 100M rows or beyond. You do not have to get back the space unless space usage itself or the time to do full scans is an issue.
You can do an ALTER TABLE MOVE or you might want to look into using shrink:

alter table mdw.child_iot shrink space;

or for partitions

alter table mdw.child_iot
modify partition eve_d_2010_04 shrink space;

This will recompact the IOT segment, it can be done online and queries against the table will continue BUT I strongly recommend you do not do this on a live system where the table is active without lots of testing on how it slows down inserts, deletes etc. You cannot use parallel to SHRINK a segment but you can to MOVE it.

Secondary indexes will continue to work after the shrink but will be less efficient (the row guess method, see earlier in this series) will be incorrect, so you may wish to rebuild the indexes as a second step.

Be wary of the *coalesce* option, it is similar to SHRINK but does not alter the high water mark.

I could add more but that would really need a separate blog post.

I hope that helps.

11. Matias - October 15, 2015

Hello, good explanation, it has helped me a lot and I learned a lot, I would like to ask you a question. We have a table partitioned by day where calls are saved, the calls are coming in batches during the day, and validate duplicity. To do this, instead of using the unique of the main table, we create an IOT apart that is also partitioned by day, defining its pk with fields that identify a call. Then we inserted massively and catch the error if pk is violated. Then, the IOT has no other purpose, only they perform insert, delete or select not perform. Does it seem right? Is there any other strategy to validate duplicate records more efficient? From already thank you very much. Matias.

mwidlake - October 15, 2015

Hi Matias

I am unclear why your temporary holding table is organized as an IOT – are you trying to enforce that order on the final target table? If so you could do an ordered insert from a heap table to get the same effect. Also, I am not clear if you use the holding table for purposes other than loading the final target table. If it is being used solely to hold the data temporarily, then I do not think an IOT is your best choice. If you are doing lots of DML on the IOT and it is not mostly via the IOT key you could well see poorer performance than a heap table, but without more information I have no idea what the cause of the slow performance really is. Again, an IOT may not be the correct structure.

To bulk load data from one table to another and allow for primary key violations you have several options:
1) If the unique key constraint is failing due to rows already existing in the target table you could do an INTERSECT between the load table and the target before you do the load (if you can limit the data you compare via partitioning) and not proceed if you find duplicates
2) if the unique key constraint is failing due to duplicate rows in your load table, you can check that before load or add a unique index on the load table.
3) you could do a bulk load via PL/SQL FORALL and SQL%BULKEXCEPTIONS (just search for it on the web, lots and lots of examples).
4) you could check out the hint ignore_row_on_dupkey_index

Your choice of solution will be influenced by how often you get the duplicates. If it often then option 1 or 2 could be best and you may need to look at what is causing the dirty data and fix that first. If it is infrequent then option 3 or 4 may be more suitable.

I hope that helps


Leave a reply to Randolf Geist Cancel reply