
Dropped Partitions do not go in the Recycle Bin January 24, 2012

Posted by mwidlake in SQL.

If you alter table TEST drop partition Q1, does it go in the recycle bin?

That is the question I was asked today. “Of course it….” Pause. More pause.

No, I did not know. I knew I’d seen partitions in the recyclebin on Oracle but I thought there was no option to state PURGE when you dropped a partition.

So, a quick test was needed.

First I tried a drop of a partition I knew I could live without {NB this is on version 11.2.0.3, I am 99.8% sure this is the same on 10}

mdw1123> alter table person_call drop partition d_20111205 purge
  2  /
alter table person_call drop partition d_20111205 purge
                                                  *
ERROR at line 1:
ORA-14048: a partition maintenance operation may not be combined with other operations


mdw1123> alter table person_call drop partition d_20111205
  2  /

Table altered.

mdw1123> select count(*) from dba_recyclebin
  2  /
Any Key>

  COUNT(*)
----------
         0

1 row selected.

That’s pretty conclusive then, Oracle does not like the PURGE option when dropping a partition, which suggests it does not go into the recyclebin, and in fact my recyclebin remains empty when I correctly drop the partition. {NB I had purged the recyclebin before this test}.
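{The purge was simply:

purge recyclebin;

And if you want to check the bin is actually enabled, rather than demonstrate it as I do next, then the SQL*Plus command

show parameter recyclebin   -- VALUE should come back as "on"

will tell you. PURGE RECYCLEBIN empties your own schema’s bin; PURGE DBA_RECYCLEBIN does the lot, if you have the privilege.}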

However, I want to make sure that it is indeed just the partition operation that does not go to the recyclebin (you can of course turn off the recyclebin and I have not proven that I have not done this). So:


mdw1123> create table mdw (id number,vc1 varchar2(10))
  2  partition by range (id)
  3  (partition p1 values less than (10)
  4  ,partition p2 values less than (20)
  5  ,partition p3 values less than (30)
  6  ,partition pm values less than (maxvalue)
  7  )
  8
mdw1123> /
Table created.

mdw1123> insert into mdw
  2  select rownum,'AAAAAAAA'
  3  from dual
  4  connect by level <40
  5  /
39 rows created.

mdw1123> select * from dba_recyclebin
  2  /
no rows selected

mdw1123> alter table mdw drop partition p3
  2  /
Table altered.

mdw1123> select * from dba_recyclebin
  2  /
no rows selected

mdw1123> drop table mdw
  2  /
Table dropped.

mdw1123> select * from dba_recyclebin;
Any Key>
OWNER                          OBJECT_NAME                    ORIGINAL_NAME
------------------------------ ------------------------------ --------------------------------
OPERATION TYPE                      TS_NAME                        CREATETIME
--------- ------------------------- ------------------------------ -------------------
DROPTIME               DROPSCN PARTITION_NAME                   CAN CAN    RELATED BASE_OBJECT
------------------- ---------- -------------------------------- --- --- ---------- -----------
PURGE_OBJECT      SPACE
------------ ----------
MDW                            BIN$DI/3ZwiMRneOxLrJ4jidyg==$0 MDW
DROP      Table Partition           INDEX_01                       2012-01-24:16:13:55
2012-01-24:16:15:33    2787392                                  NO  NO       77672       77672
       77672       1024
MDW                            BIN$DI/3ZwiMRneOxLrJ4jidyg==$0 MDW
DROP      Table Partition           INDEX_01                       2012-01-24:16:13:55
2012-01-24:16:15:33    2787392                                  NO  NO       77672       77672
       77672       1024
MDW                            BIN$DI/3ZwiMRneOxLrJ4jidyg==$0 MDW
DROP      Table Partition           INDEX_01                       2012-01-24:16:13:55
2012-01-24:16:15:33    2787392                                  NO  NO       77672       77672
       77672       1024
MDW                            BIN$DI/3ZwiMRneOxLrJ4jidyg==$0 MDW
DROP      TABLE                                                    2012-01-24:16:13:55
2012-01-24:16:15:33    2787393                                  YES YES      77672       77672
       77672

4 rows selected.

So, if you go and drop the wrong partition, you can’t go getting it back without some sort of restore or recovery (as there is nothing to get it back from) but you can if you drop the whole table. And I was not going mad, I had seen partitions in the Recyclebin (there is, after all, a PARTITION_NAME column in the table).

This seems a little odd to me. I would imagine that dropping the wrong partition is exactly the sort of thing you would want to fix with Flashback Drop (as you can with an accidentally dropped table) but there you go. You can’t.
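To spell out the recovery options (a sketch rather than a demo, but FLASHBACK TABLE…TO BEFORE DROP is standard from 10g on):

-- the whole dropped table can be pulled back out of the recyclebin
flashback table mdw to before drop;
-- there is no partition-level equivalent, so a dropped partition means
-- a restore/recover or a reload of that data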

Back to the day job….

IOT Part 6(B) – OLTP Inserts into an IOT November 10, 2011

Posted by mwidlake in development, performance, Testing.

<..IOT1 – the basics
<….IOT2 – Examples and proofs
<……IOT3 – Significantly reducing IO
<……..IOT4 – Boosting Buffer Cache efficiency
<……….IOT5 – Primary Key Drawback
<…………IOT6(A) – Bulk Insert slowed down

IOT 6 part A covered the sort of insert work used to create an IOT or to push a large number of rows into it via an “insert into…select from” type format, ie bulk load.
In this, part B, I am going to test inserting a small number of rows (10,000) into a table already holding a million rows. Something more like an hourly load or the inserts that would occur over a working day – a more OLTP-like test.

I’m going to do the tests against four things. A “non-insert” test to ensure my data creation is not taking a lot of resource {the Null test}, then against my three versions of test table – a normal heap table, my IOT and my Partitioned IOT. I run each test on each segment type three times, the minimum number needed to help identify variation that could invalidate the test.

You will probably be surprised by the results….

All three tables have the same column structure, the heap has indexes on the primary key and a unique constraint, the IOT and partitioned IOT are organized on a modified version of the Primary key and have the second index for the unique key. Below is the creation statement for the partitioned IOT (the IOT is the same without the partition clause). You can get full creation details from IOT part 2:

create table transaction_IOT_P
(tran_type number(2)     not null
,tran_id   number(10)    not null
,ACCO_type number(2)     not null
,ACCO_id   number(10)    not null
,cre_date  date          not null
,vc_1      varchar2(1000) not null
,vc_2      varchar2(1000)
,date_1    date
,num_1     number(2)
,num_2     number(2)
,constraint tip_pk primary key(ACCO_type,ACCO_id,cre_date,tran_type,tran_id) 
--  using index tablespace index_01
,constraint tip_tran_uq unique (tran_type,tran_id)
  using index tablespace index_01
)
organization index
tablespace data_01
partition by range  (cre_date)
(partition rm20110901 values less than (to_date('01-09-2011','DD-MM-YYYY')) 
  tablespace data_01
,partition rm20111001 values less than (to_date('01-10-2011','DD-MM-YYYY'))
  tablespace data_01
,partition rm20111101 values less than (to_date('01-11-2011','DD-MM-YYYY'))
  tablespace data_01
,PARTITION RMTOP  VALUES LESS THAN (MAXVALUE)
  tablespace USERS
)

The three tests are:

  • Insert 10,000 records for random accounts, in time order for the same day. This would be like processing data for or during a “day”
  • Generating the 10,000 records in order of account, to better match the IOT structure, again for a single day. For any given account 0,1,2 or 3 records are created, to be a little more “real life”. This would be like loading a pre-processed set of data.
  • All 10,000 records are created for the one account, in datetime order, as an extreme example of inserting into an IOT data that is ordered to perfectly match the order of the IOT.

The code is written ‘badly’, ie a single record is prepared and inserted into the table at a time, within a loop of 10000 iterations. This is to avoid the bulk-processing that may benefit the heap table and also to better match the reality of an OLTP system. I do however only commit after all records are created for any given test, which is maybe a mistake but that’s what I did.

Now, before I go any further, does anyone want to guess by how much inserting into the heap table will be faster than inserting into the IOT (partitioned or otherwise)? Also, whether the partitioned IOT will be faster or slower than the IOT and what, if any, impact there will be by the ordering of data?

To allow you to think about that and not see the results by accident, I’ll stick in one of the insert harnesses here so you can see how I am doing it. Skip over if you just want the results

--
execute s_snap.my_snap(' finished non-insert test1')
--
-- Transaction_heap random data test
declare
v_num         number :=10000; -- number of accounts
v_offset      number :=-4;
v_str1        varchar2(60);
v_str2        varchar2(60);
v_tran_type   number(2);
v_tran_id     number(10);
v_acco_type   number(2);
v_acco_id     number(10);
v_cre_date    date;
v_vc_1        varchar2(1000);
v_vc_2        varchar2(1000);
v_date_1      date :=sysdate;
v_num_1       number(2) :=10;
v_num_2       number(2) :=15;
begin
v_str1:=dbms_random.string('U',60);
v_str2:=dbms_random.string('U',60);
dbms_output.put_line (to_char(SYSTIMESTAMP,'HH24:MI:SS.FF'));
for i in 1..10000 loop
    v_tran_type :=mod(i,3)+1; -- 1 to 3
    v_tran_id   := 1000000+i;
    v_acco_type := 5+(trunc(dbms_random.value(1,3))*5); -- 10 or 15
    v_acco_id   :=trunc(dbms_random.value(1,v_num/2)); -- 1 to 5000
    v_cre_date  :=trunc(sysdate-v_offset)+(i/(60*60*24));
    v_vc_1      :=substr(v_str1,1,51+mod(i,10));
    v_vc_2      :=substr(v_str2,1,46+mod(i,15));
    insert into transaction_heap
      (tran_type,tran_id,ACCO_type,ACCO_id,cre_date,vc_1,vc_2,date_1,num_1,num_2)
    values
      (v_tran_type,v_tran_id,v_ACCO_type,v_ACCO_id,v_cre_date
      ,V_vc_1,v_vc_2,v_date_1,v_num_1,v_num_2);
end loop;
commit;
dbms_output.put_line (to_char(SYSTIMESTAMP,'HH24:MI:SS.FF'));
end;
/
--
execute s_snap.my_snap(' finished th insert test1')
--

Here are the results of three runs for all tests:

Elapsed time in seconds:

                    random insert             ordered-by-account       perfect IOT order match
Segment type    Test 1    2    3            Test 1    2    3           Test 1    2     3    
------------    ------ ---- ----            ------ ---- ----            ----- ---- ----
No insert          2.1  0.1  0.1               0.1  0.1  0.1              0.1  0.0  0.1   

                                    AVG                         AVG                        AVG
                                   ----                        ----                       ----
Heap              52.3 20.5 11.4   28.1       33.4 13.8 10.7   19.3       9.1 13.0  9.8   10.6
IOT              137.1 83.7 26.4   82.4      110.8 63.6 39.0   71.2       6.4  6.7  4.0    5.8
Partn'd IOT       33.2  9.5  8.7   17.1       17.1 10.4 14.3   13.9       8.1  5.3  4.5    6.0
                                   ----                        ----                       ----

Allow me to summarise that.

First, the preparation of the data and cycling around the loop – the no-insert tests – take a negligible amount of time. {The first no_insert test takes 2 seconds due to the overhead of parsing and collecting dictionary information for the first thing done}. We can ignore the overhead of the test harness.

Much more significantly:

  • With each iteration of each test, things get faster {with one odd exception}
  • For random and semi-ordered inserts, inserting into the IOT is slower than the heap by a factor of 300% to 400% (3 to 4 times slower) – what we all probably expected.
  • Inserting into the IOT is Faster than the heap when the insert order matches the structure of the IOT, in the perfect_IOT test. Not just the same speed but 40% faster.
  • The partitioned IOT is as Fast or Faster than the heap table, never mind the IOT. Overall the partitioned IOT takes 50% to 75% of the time the heap takes to be populated.

Hands up everyone expecting this? NB My hand would be down if I had not done similar tests before.

Why do we see these results?

I’ll just mention that the reason all the tests get quicker with each iteration is that the computer is doing the same thing over and over again and gets used to it, the relevant circuits in the chip are “oiling in”. Of course not, how silly – but it seems to be how non-IT people think of it :-) Basically, more and more of each segment is getting cached and that reduces physical IO and speeds things up. You see this a lot in testing and you need to be constantly vigilant about it. You might think I should really run the tests until they reach a steady state, but that is a very false situation. On a real system you don’t tend to get the same thing run over and over and over again with nothing else running.

Much more interestingly, why is the partitioned IOT winning? Well, for the Heap table data is being poured into the leading edge of the table, which takes a little processing and some blocks to be created. However, there are also two indexes being maintained, the Primary Key and that Unique key. For each insert Oracle has to find the right spot in each index (root block -> branch block -> leaf block -> check for row) and insert the new record, occasionally splitting a full leaf block.

For the IOT and the Partitioned IOT there is no table to maintain. In effect there are just two indexes to maintain. For the normal IOT this benefit is outweighed by the cost of working with a less space efficient Primary Key index. As there is the whole of the row in each index entry there are fewer index entries possible per leaf block and so more blocks need to be read in from disc and more splits are needed. Thus the inserts into the IOT are slower.
But for the partitioned IOT not only is there no table to maintain but the table (really the Primary Key) is partitioned – on date. We are creating data for a single day at a time. All the relevant blocks will be in the one partition (and it’s the latest and smallest partition) and quickly all cached in memory and processed in memory. It’s the efficient use of the cache again. For both the IOT and the Heap, the primary key segment being processed is larger and the inserts are spread across the whole index.
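You can watch this caching effect directly if you have access to the V$ views. A quick sketch (the object names are from my tests; the join on DATA_OBJECT_ID is the important bit):

-- count the cached buffers per segment/partition
select o.object_name, o.subobject_name, count(*) cached_blocks
from   v$bh b
       join dba_objects o on o.data_object_id = b.objd
where  o.object_name in ('TRANSACTION_IOT_P','TRANSACTION_HEAP')
group by o.object_name, o.subobject_name
order by 1,2;

Run it between iterations and you can see the latest partition of the partitioned IOT become almost fully cached while the heap and the non-partitioned IOT segments stay only partly cached.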

Ordering data to match the index order (ordered_by_account tests) helps for all three segment types – heap, IOT and partitioned IOT. We have two indexes and the data is being ordered for both in effect, as one is on the columns we are ordering the data creation for and the other, the unique key, is generated by an ascending sequence. The gain is about 25 to 30% and is due to working through the indexes in order as opposed to skipping about. If the indexes were significantly larger than the buffer cache, or on a highly active system where the turnover in the cache was very high (indicated by a low buffer cache hit ratio :-) ), then the benefit would be higher.
The perfect ordering test (perfect_iot_order_match) is a special case. Not only is the data being processed in order of the indexes but only one part of the index is being updated – for a single account. In this situation the maintaining of the index is so efficient that the time saved not processing the heap table means both the IOT and IOT_P win out over the heap. So in this special case, inserts into IOTs are faster than inserts into a heap table.

And finally a quick summary:

  • Inserting records into an IOT carries about a 200-400% overhead, depending on the number of indexes
  • Ordering the data to match the IOT gives a 25-30% advantage, but it similarly aids update of the primary key on the heap
  • Partitioning the IOT such that a smaller chunk of the index is processed can be highly beneficial. I don’t cover it specifically in the above but you can probably appreciate that partitioning on a heap can be beneficial to processing indexes on the heap also.
  • If you can organise to create your IOT such that data is created only in one or a few distinct points in the structure (you might have to tweak your data load too of course), insert performance might match or even beat that for a heap table with the same indexes
  • Using methods to maximise efficient use of your buffer cache and reduce IO is a very powerful performance tuning aid, not just for select

BTW I am conscious of the fact that the above may surprise many and you might wonder if my data or tests are flawed. I have stripped out most of the code from this post as it was just becoming massive and confusing, but contact me and I will happily supply all the example scripts and data.

Performance Tipping Points April 13, 2010

Posted by mwidlake in performance, statistics.

This week I came across a nice example of a performance tipping point. This is where Everything is OK until you reach a point where it all quickly cascades to Not OK.

The below shows the timings for a regularly run “Alert”. Four times an hour we want to know if something quite important has happened on the system.

ID S START_DATE END_DATE DUR TIME_RAN START_PARAM END_PARAM
---- - ------------- ------------- --------- ------------- -------------------- -----------------
292 S 0410 03:06:00 0410 03:06:31 .000359 0410 03:06:00 20100410 02:51:00 20100410 03:06:00
292 S 0410 03:21:00 0410 03:21:35 .000405 0410 03:21:00 20100410 03:06:00 20100410 03:21:00
292 S 0410 03:36:00 0410 03:36:38 .000440 0410 03:36:00 20100410 03:21:00 20100410 03:36:00
292 S 0410 03:51:00 0410 03:51:33 .000382 0410 03:51:00 20100410 03:36:00 20100410 03:51:00
292 S 0410 04:06:00 0410 04:06:28 .000324 0410 04:06:00 20100410 03:51:00 20100410 04:06:00
292 S 0410 04:21:00 0410 04:21:44 .000509 0410 04:21:00 20100410 04:06:00 20100410 04:21:00
292 S 0410 04:36:00 0410 04:36:27 .000313 0410 04:36:00 20100410 04:21:00 20100410 04:36:00
292 S 0410 04:51:00 0410 04:51:34 .000394 0410 04:51:00 20100410 04:36:00 20100410 04:51:00
292 S 0410 05:06:00 0410 05:06:44 .000509 0410 05:06:00 20100410 04:51:00 20100410 05:06:00
292 S 0410 05:21:00 0410 05:21:43 .000498 0410 05:21:00 20100410 05:06:00 20100410 05:21:00
292 S 0410 05:36:00 0410 05:37:01 .000706 0410 05:36:00 20100410 05:21:00 20100410 05:36:00
292 S 0410 05:51:00 0410 05:52:21 .000938 0410 05:51:00 20100410 05:36:00 20100410 05:51:00
292 S 0410 06:06:00 0410 06:08:09 .001493 0410 06:06:00 20100410 05:51:00 20100410 06:06:00
292 S 0410 06:21:01 0410 06:24:00 .002072 0410 06:21:01 20100410 06:06:00 20100410 06:21:01
292 S 0410 06:36:00 0410 06:40:12 .002917 0410 06:36:00 20100410 06:21:01 20100410 06:36:00
292 S 0410 06:51:00 0410 06:56:54 .004097 0410 06:51:00 20100410 06:36:00 20100410 06:51:00
292 S 0410 07:06:00 0410 07:13:17 .005058 0410 07:06:00 20100410 06:51:00 20100410 07:06:00
292 S 0410 07:21:00 0410 07:29:42 .006042 0410 07:21:00 20100410 07:06:00 20100410 07:21:00
292 S 0410 07:36:00 0410 07:47:48 .008194 0410 07:36:00 20100410 07:21:00 20100410 07:36:00
292 S 0410 07:51:00 0410 08:08:07 .011887 0410 07:51:00 20100410 07:36:00 20100410 07:51:00
292 S 0410 08:08:07 0410 08:29:43 .015000 0410 08:08:07 20100410 07:51:00 20100410 08:08:07
292 S 0410 08:29:43 0410 08:50:10 .014201 0410 08:29:43 20100410 08:08:07 20100410 08:29:43
292 S 0410 08:50:10 0410 09:22:28 .022431 0410 08:50:10 20100410 08:29:43 20100410 08:50:10
292 S 0410 09:22:28 0410 10:27:11 .044942 0410 09:22:28 20100410 08:50:10 20100410 09:22:28
292 S 0410 10:27:11 0410 12:57:16 .104225 0410 10:27:11 20100410 09:22:28 20100410 10:27:11
292 F 0410 12:57:16 0410 14:50:26 .078588 0410 12:57:16 20100410 10:27:11 20100410 12:57:16
292 F 0410 14:50:26 0410 16:49:42 .082824 0410 14:50:26 20100410 10:27:11 20100410 14:50:26
292 F 0410 16:49:42 0410 19:06:10 .094769 0410 16:49:42 20100410 10:27:11 20100410 16:49:42

The first half dozen records show the program ID 292 running in half a minute or so. It varies a little, from just under 30 seconds to 45 seconds. Each run kicks off 15 minutes after the previous one and, if you check the START_PARAM and END_PARAM on the right of the listing, you can see that the report runs for the previous 15 minutes, ie the period since the last run (including the running time of the last iteration).

Then, at 05:36 something happens. The execution takes a full minute. The next run takes 1 minute and 20 seconds. The next run takes over two minutes and each following execution takes a little longer and longer.

At 07:51, 1.5 hours later, something significant happens. The execution takes more than 15 minutes to run. This report is run every 15 minutes. I’m sure you can appreciate that this is a problem. What happens now depends on your architecture.

If you have a scheduler that simply kicks off the report every 15 minutes, at the next scheduled time (08:06 in my case) a new execution will start and you will have two versions of the report running for a couple of minutes. As time progresses this overlap increases. When the run time reaches 30 minutes you will start having 3 reports running at the same time. And soon you will get three, four, five etc versions running at the same time. Depending on the number of CPUs and IO bandwidth of your system, how long will it be until it is doing nothing but running this code?
If the code is doing something more than simply selecting data, the chance of the concurrent versions competing for locks or resources is high and can mean that the slow-down of the code escalates rapidly and in a very, very short time, your system is on its knees.

In our case, we have some protection against this. The scheduler detects that the previous version has not finished and waits until it has done so before starting the next execution, so the next execution is delayed. In my example the 08:06 execution is delayed until 08:08, 2 minutes late.

We still have two possible situations. The report could simply continue to slow down at the previous rate and just be delayed longer and longer. In my example it has been slowing down at around 1 minute 20 seconds per run for the last 5 runs.

However, the next execution is further delayed until 08:29 – which is 6 minutes beyond the 15-minute window from 08:08. This latest run of the code is 5 minutes 30 seconds slower than the previous execution, not 1 minute 20 seconds slower. We still have a tipping point. If you remember, this report runs to cover the period since the last execution. As the last execution took more than 15 minutes, the next run has to report on more than 15 minutes – 21 minutes in this case.

The slow-down now rapidly escalates due to not just whatever was slowing the report down initially but also the growing reporting period.

Just 5 runs and 5 hours later, at 12:57:16, the report fails. It now runs for so long that it cannot recreate the data as it looked at the start of the run and we get snapshot-too-old errors. In the listing you just see the second column go from S to F.

It is game over. Thankfully only one copy of the report is running at any time (A version is kicked off as soon as the previous one fails) and so, worst case, is using only one process on the box, one CPU maximum and the IO that one thread can demand.
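Spotting this sort of tipping point coming is simple if you keep a run log like the one above. A sketch, assuming the log lives in a table I will call JOB_RUNS with the ID, START_DATE and END_DATE columns from the listing:

select id
      ,start_date
      ,round((end_date-start_date)*24*60,1) run_mins
      ,round((start_date-lag(start_date) over
              (partition by id order by start_date))*24*60,1) mins_between_starts
from   job_runs
where  id=292
order  by start_date;

Any RUN_MINS creeping towards the schedule interval (15 minutes here) is the early warning, and in this case you would have had it a good two hours before the 07:51 meltdown.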

What was the root cause? Time and statistics and data volume.
Statistics because the job that collects statistics failed earlier.
Time because at 05:50 or so, the run window of 15 minutes was far enough out of the range of known date-times for a partition that the CBO decided only a few rows would be found and swapped to a nested-loop execution.
Data volume as data was coming in quicker as the day progressed and the nested loop plan performance was highly susceptible to increases in data volume.
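Though I do not show it here, you can check what the optimizer believed about the date range from the partition column statistics. A sketch (the table and column names are invented for illustration):

select partition_name
      ,column_name
      ,last_analyzed
      ,high_value   -- raw-encoded; the maximum value the CBO knows about
from   dba_part_col_statistics
where  table_name  = 'MY_ALERT_SOURCE'
and    column_name = 'RUN_DATETIME';

If LAST_ANALYZED is a day or more old then predicates on recent date-times fall outside the known range, the estimated cardinality collapses and you get exactly the sort of plan flip described above.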

For an excellent posting on why code suddenly changes its execution plan when nothing has happened but time passing, see this excellent post and comments on the topic on Richard Foote’s blog

{And isn’t his blog so much skinnier than mine? :-) }

Stats Need Stats to Gather Stats February 16, 2010

Posted by mwidlake in performance.

Did you know that you sometimes need good stats so that the stats-gathering package can gather stats in an efficient way? This is a recent, quite extreme example.

I’ve been forced to learn a lot about gathering Oracle stats using DBMS_STATS over the last 4 or 5 years. But no matter how much I learn about the “challengingly bizarre” way in which it works, it seems every week or two there is a new oddity. I plan a whole series on the topic “soon”.

This particular example is from a 10.2.0.3 system.

I am gathering partition-only table stats as we are using the ability of Oracle to roll up Partition stats to Global stats under certain specific conditions. One of the conditions is that you need stats for every partition. Plus, to get global column stats, each partition must have stats for each column. Some of our partitions lacked or had very bad stats.

So I quickly knocked up a script-generating script to create DBMS_STATS.GATHER_TABLE_STATS statements that collected, for those partitions:

  • ONLY partition stats.
  • NO cascade down to indexes
  • BLOCK sampling so it is fast {and poor, but there you go}
  • ESTIMATE_PERCENT of 2, which is quite low for block sampling
  • collect histograms as we decided the devil of having them was better than the devil of not having them.

The above is not ideal to get “good stats”, but it is quick and gets OK stats, which is what we need right now. An example statement is:

begin                                                                                              
-- part SD_INFO-DY07032004 rows 34554                                                              
dbms_stats.gather_table_stats(ownname => 'ERIC',tabname=> 'SD_INFO'
,partname=> 'DY07032004'
,granularity=> 'PARTITION', estimate_percent => 2,block_sample=> true ,cascade=> FALSE
,method_opt=> 'FOR ALL COLUMNS SIZE AUTO' ,no_invalidate=> false);
END;
/

Not a big partition (34554 rows and not very wide rows) and so did not take long to gather:

PARTITION_NAME                 LAST_ANALYZED
------------------------------ --------------------
DY04012004                     12-FEB-2010 16:26:57
DY05012004                     12-FEB-2010 16:27:00
DY06012004                     12-FEB-2010 16:27:04
DY07012004                     12-FEB-2010 16:27:07
DY08012004                     12-FEB-2010 16:27:11 -- 4 seconds difference
DY09012004                     12-FEB-2010 16:27:15

I’ve collected statistics for a few thousand partitions over the last couple of days and the time taken is anything between just under half a second to 10 seconds per partition, the odd unusually large partition taking a minute or so. {I believe it takes half a second to gather stats on an empty partition, on our system at least, due to the time it takes for the internal housekeeping, including copying the old statistics information to the SYS.WRI$_OPSTAT_… tables to support restoring stats}. Sorry, I drift away from my main point.

This partition took a lot longer than 10 seconds:

begin
-- part W_ACTIVITY_FAILURE-DY02092008 rows 49425
dbms_stats.gather_table_stats(ownname => 'ERIC',tabname=> 'W_ACTIVITY_FAILURE'
,partname=> 'DY02092008'
,granularity=> 'PARTITION', estimate_percent => 2,block_sample=> true ,cascade=> FALSE
,method_opt=> 'FOR ALL COLUMNS SIZE AUTO' ,no_invalidate=> false);
END;
/

After 10 minutes it was still running. WHY? I quickly checked the number of rows in the partition and then the size of the partition segment, in case either was much larger than I expected. Neither was.
select count(*) from eric.W_ACTIVITY_FAILURE partition (DY07092008);

  COUNT(*)
----------
     42182

From dba_segments:

     BYTES     BLOCKS
---------- ----------
   2621440        320

There is one advantage of a DBMS_STATS statement running for a long time – you can grab from the SGA the actual code being executed for the DBMS_STATS statement. I saw this.

select substrb(dump(val,16,0,32),1,120) ep, cnt 
from (select /*+ no_parallel(t) no_parallel_index(t) dbms_stats cursor_sharing_exact
use_weak_name_resl dynamic_sampling(0) no_monitoring */
"ACCOUNT_ID" val,count(*) cnt 
from "ERIC"."W_ACTIVITY_FAILURE" t 
where TBL$OR$IDX$PART$NUM("ERIC"."W_ACTIVITY_FAILURE",0,4,0,"ROWID") = :objn
and "ACCOUNT_ID" is not null 
group by "ACCOUNT_ID") 
order by val
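{If you are wondering how to do the grab, it can be as simple as the below, assuming you can identify the session running DBMS_STATS; the username filter is just whatever picks out your session}:

select q.sql_text
from   v$session s
       join v$sql q on q.sql_id       = s.sql_id
                   and q.child_number = s.sql_child_number
where  s.username = 'ERIC'
and    s.status   = 'ACTIVE';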

DBMS_STATS is scanning the ACCOUNT_ID column on this table and it is taking a long time about it. The only index on the table is on ACCOUNT_ID. It then struck us.

The index is a global index:

INDEX_NAME                   PARTITIONED
---------------------------- -----------
API_XXXXXXXXXX_ACCOUNTS_IND  NO

And that global index actually lacked stats {another feature of DBMS_STATS and rolling up partition stats had led to that}.

INDEX_NAME      TYP UNQ   BL L_BLKS  DIST_KEYS     CLUSTF    LB_KEY    DB_KEY LST_ANL
--------------- --- --- ---- ------ ---------- ---------- --------- --------- --------
API_XXXXXXX_    NOR NON
ACCOUNTS_IND

By this time 3 partitions for this table had been processed by my code, taking around 15 minutes each one. Incredibly slow.

I did a very quick 0.01% sample size DBMS_STATS.GATHER_INDEX_STATS on that index which took about 1 minute. As soon as the partition DBMS_STATS.GATHER_TABLE_STATS statement that was in flight finished, the following similar statements on that table’s partitions took under 3 seconds each. I’ll buy a pint for the first person to guess (within 10 minutes) WHEN I collected the global index stats {You can collect from any pub in central London if you give me a few day’s notice :-) }.
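For reference, that quick gather was of the minimal form below. I have not preserved the exact call, so treat it as a sketch:

begin
  dbms_stats.gather_index_stats(ownname          => 'ERIC'
                               ,indname          => 'API_XXXXXXXXXX_ACCOUNTS_IND'
                               ,estimate_percent => 0.01);
end;
/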

PARTITION_NAME   NUM_ROWS     BLOCKS LAST_ANALYZED
-------------- ---------- ---------- ---------------------
DY01092008          31461        452 16-FEB-2010 09:51:41
DY02092008          49425        686 16-FEB-2010 10:03:44
DY03092008          54472        719 16-FEB-2010 10:16:31
DY04092008          35762        491 16-FEB-2010 10:30:52
DY05092008          42182        587 16-FEB-2010 10:44:24
DY06092008          21186        312 16-FEB-2010 10:56:13
DY07092008          20898        313 16-FEB-2010 11:12:59
DY08092008         469500       1233 16-FEB-2010 11:13:02
DY09092008         480300        741 16-FEB-2010 11:13:04
DY10092008          15724        223 16-FEB-2010 11:31:01
DY11092008          55671        732 16-FEB-2010 11:31:06
DY12092008         820100       1779 16-FEB-2010 11:31:08
DY13092008          80215       1113 16-FEB-2010 11:31:16
DY14092008          10094        155 16-FEB-2010 11:31:18
DY15092008          10268        158 16-FEB-2010 11:31:20
DY16092008          24578        330 16-FEB-2010 11:31:22
DY17092008          21012        290 16-FEB-2010 11:31:24
DY18092008          22755        318 16-FEB-2010 11:31:27
DY19092008          21276        293 16-FEB-2010 11:31:29
DY20092008          20882        281 16-FEB-2010 11:31:31
DY21092008          24131        323 16-FEB-2010 11:31:34

I will investigate this oddity further if I get time this week, but the lesson to myself so far is:

Global indexes lacking stats can result in very long stats gathering times for partitions EVEN WHEN YOU DO NOT CASCADE DOWN TO INDEXES.

Ensuring Correlated Partition Exclusion #2 December 20, 2009

Posted by mwidlake in performance, Uncategorized, VLDB.

<Previous Post

A few days ago {oh dear, it is now two weeks!} I showed how you could better control Correlated Partition Exclusion by creating a lookup table that showed, for each partition, the minimum and maximum ID (the partition key) and the minimum and maximum CRE_DATETIME (what you want to limit your query on). Using this range table in the way I described, you did not have to take an educated guess as to what range of IDs you used to include the partitions for the date range you were interested in, you could check the table.

But this only worked reliably where the ID and CRE_DATETIME increase directly in proportion to each other. It was a special case.

What about the more normal case, where there is a correlation between the two columns but it is not a perfect correlation? This would happen, for example, if the CRE_DATETIME is entered from another system, or from paper records, where the order of the records is not enforced. So some records from today get put into the system before some from yesterday. Or if the correlation is even looser than this. eg you are looking at orders for new customers. You “know” there are going to be no orders for these customers from 5 years ago but you are not sure how far back in the orders table you should go to find what you want.

You can still use the lookup table method. The lookup table in effect becomes a meta-index – an index of the segments where you will find data but not actually the rows.

To demonstrate this, I created a table where the ID and CRE_DATETIME increase in order:

-- create a partitioned table
create table test_p4
(id           number(10) not null
,cre_datetime date not null
,status        number(1) not null
,num_1         number(4) not null -- random 20
,num_2         number(4) -- random 500
,num_3         number(5) -- cycle smoothly
,num_4         number(5) -- Random 10000
,vc_1          varchar2(10)
,vc_2          varchar2(10)
,vc_pad        varchar2(2000))
tablespace users
partition by range (id)
(partition id_01k values less than (1000)
tablespace users
,partition id_02k values less than (2000)
tablespace users
...
,partition id_45k values less than (45000)
tablespace users
,partition id_max values less than (maxvalue)
tablespace users
)
--
-- local partitioned indexes on the table
 IND_NAME           TAB_NAME           PSN       COL_NAME
 ------------------ ------------------ --------- ------------
 TP4_CRE_DT         TEST_P4            1         CRE_DATETIME

 TP4_PK             TEST_P4            1         ID
--
-- populate the table with data
insert into test_p4(id,cre_datetime,status,num_1,num_2,num_3,num_4
                  ,vc_1,vc_2,vc_pad)
select rownum
,to_date('01-JUL-2009','DD-MON-YYYY')+(rownum/360)
,decode(mod(rownum,100),0,1
              ,0)
,trunc(dbms_random.value(1,20)) 
,trunc(dbms_random.value(1,50)) 
,mod(rownum,10)+1
,trunc(dbms_random.value(1,10000))
,dbms_random.string('U',5)
,lpad(chr(mod(rownum,6)+65),5,chr(mod(rownum,6)+65) )
,lpad('A',1000,'A')
from dba_objects
where rownum <43000

I then messed with the data, updating 10% of records and setting the CRE_DATETIME to plus or minus a random amount up to 2 days different, so data in partitions would overlap. I then created a range table in the same way as I did for the previous post.

I ended up with a range table like the below:

MIN_CRE_DATI      MAX_CRE_DATI          MIN_ID     MAX_ID
----------------- ----------------- ---------- ----------
29-JUN-2009 20:35 05-JUL-2009 14:24          1        999
02-JUL-2009 04:50 07-JUL-2009 12:56       1000       1999
05-JUL-2009 02:34 10-JUL-2009 08:31       2000       2999
08-JUL-2009 05:23 13-JUL-2009 03:32       3000       3999
11-JUL-2009 08:07 15-JUL-2009 18:41       4000       4999
14-JUL-2009 00:14 18-JUL-2009 18:27       5000       5999
16-JUL-2009 08:58 21-JUL-2009 14:18       6000       6999
19-JUL-2009 01:28 24-JUL-2009 11:21       7000       7999
22-JUL-2009 08:02 27-JUL-2009 07:01       8000       8999
24-JUL-2009 22:06 30-JUL-2009 05:37       9000       9999
28-JUL-2009 04:59 01-AUG-2009 10:57      10000      10999
...
24-SEP-2009 01:52 28-SEP-2009 18:36      31000      31999
26-SEP-2009 16:49 01-OCT-2009 01:26      32000      32999
29-SEP-2009 13:20 04-OCT-2009 13:43      33000      33999
02-OCT-2009 08:40 07-OCT-2009 10:11      34000      34999
05-OCT-2009 04:29 10-OCT-2009 04:09      35000      35999
08-OCT-2009 02:04 12-OCT-2009 17:34      36000      36999
10-OCT-2009 20:03 15-OCT-2009 08:39      37000      37999
13-OCT-2009 15:09 18-OCT-2009 12:01      38000      38999
16-OCT-2009 06:49 21-OCT-2009 03:53      39000      39999
18-OCT-2009 20:16 23-OCT-2009 21:01      40000      40999
21-OCT-2009 12:10 26-OCT-2009 07:13      41000      41999
25-OCT-2009 01:29 29-OCT-2009 20:56      42000      42999

You can see that the MIN_CRE_DATI-MAX_CRE_DATI from record to record (partition to partition) overlap but generally increase.

How do you find the start and end of the ID range to cover a date period you are interested in? I always have to sit down with a pen and paper to work this out. It is the classic “overlapping ranges” check, but my brain cannot hold on to it. So here goes. I want all records between the 1st of October and the 6th of October, this year.

You want to find the partition with the lowest ID range which has a record that falls into the date range you want. ie the Maximum CRE_DATETIME record is as late or later than the range you are interested in. If the maximum CRE_DATETIME is less than the date range you are interested in, no records will be in that partition.

Here is the code to find the lowest partition

select min(min_id)
from tp4_range 
where MAX_CRE_DATI>=to_date('01-OCT-2009','DD-MON-YYYY')
MIN(MIN_ID)
-----------
      32000

Execution Plan
----------------------------------------------------------
| Id  | Operation                    | Name           | Rows  | Bytes | Cost 
----------------------------------------------------------
|   0 | SELECT STATEMENT             |                |     1 |    12 |     2
|   1 |  SORT AGGREGATE              |                |     1 |    12 |         
|   2 |   TABLE ACCESS BY INDEX ROWID| TP4_RANGE      |    12 |   144 |     2
|*  3 |    INDEX RANGE SCAN          | TP4R_MACD_MIAD |    12 |       |     1
----------------------------------------------------------

To cover the whole of the partition of interest you need to look for records with an ID greater than the minimum in that partition, hence the selection of min(min_id).

Similarly, you only want partitions where the minimum CRE_DATETIME is before the end of the date range you want. If all records in the partition have a CRE_DATETIME beyond the range, you are not interested in it.

select max(max_id)
             from tp4_range where MIN_CRE_DATI<=to_date('06-OCT-2009','DD-MON-YYYY')
MAX(MAX_ID)
-----------
      35999

Execution Plan
----------------------------------------------------------
| Id  | Operation                    | Name           | Rows  | Bytes | Cost 
----------------------------------------------------------
|   0 | SELECT STATEMENT             |                |     1 |    13 |     2
|   1 |  SORT AGGREGATE              |                |     1 |    13 |         
|   2 |   TABLE ACCESS BY INDEX ROWID| TP4_RANGE      |    37 |   481 |     2
|*  3 |    INDEX RANGE SCAN          | TP4R_MICD_MIID |    37 |       |     1
----------------------------------------------------------

Then you put these sub-queries into the query to select the data you want. Below I show the “original” code which does not use the partition range table and then using the partition range table, to prove the same number of records come back and to see the plan and statistics change:

select count(*) from test_p4
where cre_datetime between to_date('01-OCT-2009','DD-MON-YYYY')
                   and     to_date('06-OCT-2009','DD-MON-YYYY')
and num_2 = 5

  COUNT(*)
----------
        48

Execution Plan
------------------------------------------------------------------------------------------------------------------
| Id  | Operation                           | Name       | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |            |     1 |    11 |   356   (1)| 00:00:02 |       |       |
|   1 |  SORT AGGREGATE                     |            |     1 |    11 |            |          |       |       |
|   2 |   PARTITION RANGE ALL               |            |    36 |   396 |   356   (1)| 00:00:02 |     1 |    46 |
|*  3 |    TABLE ACCESS BY LOCAL INDEX ROWID| TEST_P4    |    36 |   396 |   356   (1)| 00:00:02 |     1 |    46 |
|*  4 |     INDEX RANGE SCAN                | TP4_CRE_DT |  1788 |       |    52   (0)| 00:00:01 |     1 |    46 |
------------------------------------------------------------------------------------------------------------------
Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
        676  consistent gets
          0  physical reads

Note the 676 consistent gets and the Pstart/Pstop of 1-46

select count(*) from test_p4
where cre_datetime between to_date('01-OCT-2009','DD-MON-YYYY')
                   and     to_date('06-OCT-2009','DD-MON-YYYY')
and id> (select min(min_id)
             from tp4_range where MAX_CRE_DATI
                 >=to_date('01-OCT-2009','DD-MON-YYYY'))
and id < (select max(max_id)
             from tp4_range where MIN_CRE_DATI
                <=to_date('06-OCT-2009','DD-MON-YYYY'))
and num_2 = 5

  COUNT(*)
----------
        48

Execution Plan
----------------------------------------------------------------------------------------------------------------------
| Id  | Operation                            | Name           | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
----------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                     |                |     1 |    16 |    66   (5)| 00:00:01 |       |       |
|   1 |  SORT AGGREGATE                      |                |     1 |    16 |            |          |       |       |
|   2 |   PARTITION RANGE ITERATOR           |                |     1 |    16 |    62   (5)| 00:00:01 |   KEY |   KEY |
|*  3 |    TABLE ACCESS BY LOCAL INDEX ROWID |  TEST_P4       |     1 |    16 |    62   (5)| 00:00:01 |   KEY |   KEY |
|   4 |     BITMAP CONVERSION TO ROWIDS      |                |       |       |            |          |       |       |
|   5 |      BITMAP AND                      |                |       |       |            |          |       |       |
|   6 |       BITMAP CONVERSION FROM ROWIDS  |                |       |       |            |          |       |       |
|   7 |        SORT ORDER BY                 |                |       |       |            |          |       |       |
|*  8 |         INDEX RANGE SCAN             | TP4_PK         |  1788 |       |     2   (0)| 00:00:01 |   KEY |   KEY |
|   9 |          SORT AGGREGATE              |                |     1 |    12 |            |          |       |       |
|  10 |           TABLE ACCESS BY INDEX ROWID| TP4_RANGE      |    12 |   144 |     2   (0)| 00:00:01 |       |       |
|* 11 |            INDEX RANGE SCAN          | TP4R_MACD_MIAD |    12 |       |     1   (0)| 00:00:01 |       |       |
|  12 |          SORT AGGREGATE              |                |     1 |    13 |            |          |       |       |
|  13 |           TABLE ACCESS BY INDEX ROWID| TP4_RANGE      |    37 |   481 |     2   (0)| 00:00:01 |       |       |
|* 14 |            INDEX RANGE SCAN          | TP4R_MICD_MIID |    37 |       |     1   (0)| 00:00:01 |       |       |
|  15 |       BITMAP CONVERSION FROM ROWIDS  |                |       |       |            |          |       |       |
|  16 |        SORT ORDER BY                 |                |       |       |            |          |       |       |
|* 17 |         INDEX RANGE SCAN             | TP4_CRE_DT     |  1788 |       |    52   (0)| 00:00:01 |   KEY |   KEY |
----------------------------------------------------------------------------------------------------------------------

Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
        350  consistent gets
          0  physical reads

The plan is admittedly more complex and the SQL can be tricky to understand if you are not used to code that looks for overlap of ranges. But consistent gets are down to 350 and the Pstart/Pstop values are KEY-KEY. The CBO cannot tell you WHAT the ranges will be when the code is parsed {I think it should check but in 10.2.0.3 at least, the CBO is not that smart at parse time}, but it knows there will be a start and stop.

I am using tiny partitions for my example and only 45 of them for the table. When the partitions are for millions of rows and there are a couple of thousand of them, excluding partitions in a manner you can rely on is a powerful tool. Which leads onto a final word of caution.

Nothing is enforcing that the range table is maintained.

You could do things with triggers or regular re-calculation of the ranges table but this will be something you need to consider if you use ranges tables to help partition exclusion. Flipping tablespaces to read-only can help the worry go away though… :-)
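As a sketch of the trigger option (row-level, so it taxes every single insert; on a busy system a scheduled re-calculation of just the newest partition would be kinder):

-- keep TP4_RANGE roughly current as rows arrive. This assumes a TP4_RANGE
-- row already exists for the target partition, and it ignores updates
-- and deletes.
create or replace trigger tp4_range_maint
after insert on test_p4
for each row
begin
  update tp4_range
  set    min_cre_dati = least(min_cre_dati, :new.cre_datetime)
        ,max_cre_dati = greatest(max_cre_dati, :new.cre_datetime)
  where  :new.id between min_id and max_id;
end;
/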

As an example of the issue of needing to maintain the ranges table, and also a demonstration that the ranges table does work correctly if maintained, I’ll update a record well outside of the “expected window”, show that it does not appear in my range-check controlled code, then update the ranges table and try again.

update test_p4
set cre_datetime=to_date('02-OCT-2009 11:15','DD-MON-YYYY HH24:MI')
,num_2=5
where id=22100
/

select count(*) from test_p4
where cre_datetime between to_date('01-OCT-2009','DD-MON-YYYY')
                   and     to_date('06-OCT-2009','DD-MON-YYYY')
and id> (select min(min_id)
             from tp4_range where MAX_CRE_DATI>=to_date('01-OCT-2009','DD-MON-YYYY'))
and id < (select max(max_id)
             from tp4_range where MIN_CRE_DATI<=to_date('06-OCT-2009','DD-MON-YYYY'))
and num_2 = 5

  COUNT(*)
----------
        48
-- the altered record is missed

I then refreshed the TP4_RANGE row for the partition holding ID 22100 and ran exactly the same query again:

  COUNT(*)
----------
        49
-- success

I wonder if I will do the next posting on this topic in less than two weeks!

Ensuring Correlated Partition Exclusion December 7, 2009

Posted by mwidlake in performance, Uncategorized, VLDB.

<Previous Post…Next Post>
I’ve posted a couple of times recently about a tuning method with partition exclusion where you infer a relationship between the value you want to limit a query on and the partition key. It takes a while to explain the theory so I am going to give it a name, in the vain hope it catches on {forgive me if someone has already done this – in fact, just flame me with pointers if you already know a name for this}. At the very least, for my own postings, I can use this name from now on and link back to a single posting explaining it.

I’m going to call it Correlated Partition Exclusion. In essence, you have partitioned the table on a key, in my case the primary key ID, which is an ascending numeric probably sourced from a sequence.
You have a second column, in my case CRE_DATETIME, which increases in line with the PK. If you limit your query on the CRE_DATETIME, partition exclusion is not possible as there is no guarantee which CRE_DATETIME values appear in which partition. But, as a human, you understand that if you create 10,000 records a day and you want to look at the last week’s data you can use:

WHERE ID > MAX_ID-(7*10000)

to exclude any partition whose IDs are from more than 7 days’ worth of records ago.
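As a concrete sketch against the test table shown below, that rule of thumb becomes:

select count(*)
from   test_p3
where  id           > (select max(id)-(7*10000) from test_p3) -- 10,000 records/day assumed
and    cre_datetime > sysdate-7;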

So, you have your Correlated Partition Exclusion.

How can you be sure that going back 70,000 IDs is going to safely cover one week of data and how can you maximise your efficiency of including only partitions that cover the date range? {note, I am using a numeric ID as my partition range key and a datetime as my correlated column, this principle works just as well if you partition on datetime and want to correlate to a (generally) ascending numeric key}

Here is my test table :-

create table test_p3
(id number(10) not null
,cre_datetime date not null
,status number(1) not null
,num_1 number(4) not null -- random 20
,num_2 number(4) -- random 500
,num_3 number(5) -- cycle smoothly
,num_4 number(5) -- Random 10000
,vc_1 varchar2(10)
,vc_2 varchar2(10)
,vc_pad varchar2(2000))
tablespace users
partition by range (id)
(partition id_01k values less than (1000)
tablespace users
,partition id_02k values less than (2000)
tablespace users
,partition id_03k values less than (3000)
tablespace users
,partition id_04k values less than (4000)
tablespace users
...
,partition id_45k values less than (45000)
tablespace users
--
,partition id_max values less than (maxvalue)
tablespace users
)
/
--@ind_cols
IND_NAME           TAB_NAME           PSN       COL_NAME
------------------ ------------------ --------- --------------------
TP3_PK             TEST_P3            1         ID

TP_CRE_DT          TEST_P3            1         CRE_DATETIME

If I want to look for all records between two dates I could use code like the below (based on that suggested by Bernard Polarski; any mistakes are mine).

with get_min_id as
(select max(id) min_id from test_p3
where cre_datetime >= TO_DATE('18-OCT_2009','DD-MON-YYYY') )
,get_max_id as
(select min(id) max_id from test_p3
where cre_datetime <= TO_DATE('20-OCT_2009','DD-MON-YYYY') )
select count(*)
from test_p3
,get_min_id
,get_max_id
where id >get_min_id.min_id
and id < get_max_id.max_id
and cre_datetime between TO_DATE('18-OCT_2009','DD-MON-YYYY')
and TO_DATE('20-OCT_2009','DD-MON-YYYY')
/
COUNT(*)
----------
721

1 row selected.

ie find the minimum ID for the start date and the maximum ID for the end date and query between them.

This works fine. Unfortunately, you get a plan like the below

-----------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                             | Name             | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
-----------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                      |                  |     1 |    39 |   887   (4)| 00:00:04 |       |       |
|   1 |  SORT AGGREGATE                       |                  |     1 |    39 |            |          |       |       |
|   2 |   NESTED LOOPS                        |                  |     2 |    78 |   887   (4)| 00:00:04 |       |       |
|   3 |    NESTED LOOPS                       |                  |     1 |    26 |   717   (5)| 00:00:03 |       |       |
|   4 |     VIEW                              |                  |     1 |    13 |   505   (5)| 00:00:02 |       |       |
|*  5 |      FILTER                           |                  |       |       |            |          |       |       |
|   6 |       SORT AGGREGATE                  |                  |     1 |    26 |            |          |       |       |
|*  7 |        VIEW                           | index$_join$_001 | 38393 |   974K|   505   (5)| 00:00:02 |       |       |
|*  8 |         HASH JOIN                     |                  |       |       |            |          |       |       |
|   9 |          PARTITION RANGE ALL          |                  | 38393 |   974K|   418   (6)| 00:00:02 |     1 |    46 |
|* 10 |           INDEX RANGE SCAN            | TP_CRE_DT        | 38393 |   974K|   418   (6)| 00:00:02 |     1 |    46 |
|  11 |          PARTITION RANGE ALL          |                  | 38393 |   974K|   152   (4)| 00:00:01 |     1 |    46 |
|  12 |           INDEX FAST FULL SCAN        | TP3_PK           | 38393 |   974K|   152   (4)| 00:00:01 |     1 |    46 |
|  13 |     VIEW                              |                  |     1 |    13 |   212   (5)| 00:00:01 |       |       |
|  14 |      SORT AGGREGATE                   |                  |     1 |    26 |            |          |       |       |
|* 15 |       VIEW                            | index$_join$_002 |  2972 | 77272 |   212   (5)| 00:00:01 |       |       |
|* 16 |        HASH JOIN                      |                  |       |       |            |          |       |       |
|  17 |         PARTITION RANGE ALL           |                  |  2972 | 77272 |    59   (6)| 00:00:01 |     1 |    46 |
|* 18 |          INDEX RANGE SCAN             | TP_CRE_DT        |  2972 | 77272 |    59   (6)| 00:00:01 |     1 |    46 |
|  19 |         PARTITION RANGE ALL           |                  |  2972 | 77272 |   152   (4)| 00:00:01 |     1 |    46 |
|  20 |          INDEX FAST FULL SCAN         | TP3_PK           |  2972 | 77272 |   152   (4)| 00:00:01 |     1 |    46 |
|  21 |    PARTITION RANGE ITERATOR           |                  |     2 |    26 |   170   (1)| 00:00:01 |   KEY |   KEY |
|* 22 |     TABLE ACCESS BY LOCAL INDEX ROWID | TEST_P3          |     2 |    26 |   170   (1)| 00:00:01 |   KEY |   KEY |
|* 23 |      INDEX RANGE SCAN                 | TP_CRE_DT        |   707 |       |    48   (0)| 00:00:01 |   KEY |   KEY |
-----------------------------------------------------------------------------------------------------------------------------
Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
        892  consistent gets
          0  physical reads

If you look at the plan {towards the end, and sorry about the naff layout, my blog is not wide enough for this sort of printout} there are two checks on the TP_CRE_DT indexes that scan all partitions of the index – Pstart/Pstop are 1-46. This is the CBO looking for the partitions where the stated CRE_DATETIME records occur in the whole table. Cost is 892 consistent gets, much of which is the checking of local index partitions that will hold no relevant entries.

Bernard also spotted that the code was considering all partitions and was not as efficient as it could be, but did not know how to get around it except with a global index, which brings its own issues.

One way is to just say “well, I know how many records per day I get so I will fake up the ID limits based on this”. The problem, and this is why the CBO cannot make the decision for you, is that there is no guarantee, no fixed rule, saying that CRE_DATETIME will always increment with the ID. In fact, there is nothing stopping you altering a record so that it has an ID from yesterday and a CRE_DATETIME from 10 years ago {but forget I said that until tomorrow’s post}. Now, in my example the CRE_DATETIME is going to increment with ID. We will use this special case for now; I know it is flawed and will address that flaw {tomorrow}.

So, to ensure you do not miss data you end up making your ID window pretty large. Say you want all records for the last day: you process 1,500 records a day, so you consider a window covering all samples with an ID within the last 10,000. It will still be more efficient than scanning all partitions and should be safe. Fairly safe.

The way out of this is to have a ranges table. Something that tells you, for each partition, the minimum and maximum CRE_DATETIME and the range of IDs it covers. You can then use that to identify the partitions that cover the date range you are interested in.

Here is an example. I will create a simple table to hold the ranges:

create table tp3_range
(min_cre_dati date
,max_cre_dati date
,min_id number
,max_id number)

Now you have to populate it.
The below code will create the first record for the first partition.

insert into tp3_range
SELECT MIN(CRE_DATETIME)
,MAX(CRE_DATETIME)
,MIN(ID)
,MAX(ID) FROM TEST_P3 PARTITION (ID_01K)
/
Execution Plan
-----------------------------------------------------------------------------------------------------
| Id  | Operation               | Name    | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
-----------------------------------------------------------------------------------------------------
|   0 | INSERT STATEMENT        |         |     1 |    12 |   102   (1)| 00:00:01 |       |       |
|   1 |  SORT AGGREGATE         |         |     1 |    12 |            |          |       |       |
|   2 |   PARTITION RANGE SINGLE|         |   999 | 11988 |   102   (1)| 00:00:01 |     1 |     1 |
|   3 |    TABLE ACCESS FULL    | TEST_P3 |   999 | 11988 |   102   (1)| 00:00:01 |     1 |     1 |
-----------------------------------------------------------------------------------------------------

Statistics
----------------------------------------------------------
          1  recursive calls
          3  db block gets
        195  consistent gets
          0  physical reads

As you can see from the plan and cost, this is not very efficient as it has to scan the whole partition. Maybe not a problem if you do this once, but there are indexes on both these columns; can’t this be done more efficiently? Yes, if you split the code into four in-line selects (if you want more details about it being more performant to do MIN and MAX on their own rather than in one statement then see this post on the topic):

insert into tp3_range
SELECT (SELECT MIN(CRE_DATETIME) FROM TEST_P3 PARTITION (ID_04K))
,(SELECT MAX(CRE_DATETIME) FROM TEST_P3 PARTITION (ID_04K))
,(SELECT MIN(id ) FROM TEST_P3 PARTITION (ID_04K))
,(SELECT MAX(ID ) FROM TEST_P3 PARTITION (ID_04K))
FROM DUAL

PLAN
---------------------------------------------------------
SORT AGGREGATE
PARTITION RANGE SINGLE cst:2 rws:1000
INDEX FULL SCAN (MIN/MAX) TP_CRE_DT cst:2 rws:1000
Statistics
----------------------------------------------------------
          1  recursive calls
          5  db block gets
          9  consistent gets
          0  physical reads

{You may wonder why the Explain Plan section above has a different look. This is because there seems to be a bug in 10.2.0.3 where the autotrace plan for the insert statement comes out wrong, as a FAST DUAL access, so I had to Explain the statement in a different way}

You would run one of the above statements against each partition. Probably, the “Best Practice” way would be to generate a SQL script from a query against DBA_TAB_PARTITIONS.
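A sketch of such a generator (spool the output and run it; I show the lazy single-select body, but you would substitute the four in-line MIN/MAX selects from above for speed):

select 'insert into tp3_range'
     ||' select min(cre_datetime),max(cre_datetime),min(id),max(id)'
     ||' from test_p3 partition ('||partition_name||');'
from   dba_tab_partitions
where  table_owner = user
and    table_name  = 'TEST_P3'
order  by partition_position;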

To populate my table I cheated – I just assumed all my partitions are of the same size (1000 records) and used:-
insert into tp3_range
select min(cre_datetime),max(cre_datetime),min(id),max(id)
from test_p3
-- where id between 10000 and 20000
group by trunc(id/1000)
/

Note I left in a commented line. You could run the above against the whole table and then limit it to just new partitions as you add them. This is a tad risky though, you are relying on the partitioning being perfect and it would not scale to hundreds of very large partitions. You would be better off with the partition-by-partition methods above.

You end up with a table like the below:-

select * from tp3_range order by min_cre_dati


MIN_CRE_DATI MAX_CRE_DATI MIN_ID MAX_ID
----------------- ----------------- ---------- ----------
01-JUL-2009 00:04 03-JUL-2009 18:36 1 999
03-JUL-2009 18:40 06-JUL-2009 13:16 1000 1999
06-JUL-2009 13:20 09-JUL-2009 07:56 2000 2999
09-JUL-2009 08:00 12-JUL-2009 02:36 3000 3999
12-JUL-2009 02:40 14-JUL-2009 21:16 4000 4999
14-JUL-2009 21:20 17-JUL-2009 15:56 5000 5999
17-JUL-2009 16:00 20-JUL-2009 10:36 6000 6999
20-JUL-2009 10:40 23-JUL-2009 05:16 7000 7999
23-JUL-2009 05:20 25-JUL-2009 23:56 8000 8999
26-JUL-2009 00:00 28-JUL-2009 18:36 9000 9999
28-JUL-2009 18:40 31-JUL-2009 13:16 10000 10999
31-JUL-2009 13:20 03-AUG-2009 07:56 11000 11999
...
14-OCT-2009 13:20 17-OCT-2009 07:56 38000 38999
17-OCT-2009 08:00 20-OCT-2009 02:36 39000 39999
20-OCT-2009 02:40 22-OCT-2009 21:16 40000 40999
22-OCT-2009 21:20 25-OCT-2009 15:56 41000 41999
25-OCT-2009 16:00 28-OCT-2009 10:36 42000 42999

add the two below indexes:
create index tp3r_micd_miid on tp3_range(min_cre_dati,min_id);
create index tp3r_macd_miad on tp3_range(max_cre_dati,max_id);

And now you can find the upper and lower ID bounds for a date range with the following code:

select min_id from tp3_range
where min_cre_dati = (select max(min_cre_dati)
from tp3_range
where min_cre_dati < TO_DATE('18-OCT-2009','DD-MON-YYYY')
)
MIN_ID
----------
39000

Execution Plan
---------------------------------------------------------
--------------------------------------------------------------------------------------
| Id  | Operation                     | Name           | Rows  | Bytes | Cost (%CPU)|
--------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT              |                |     1 |    22 |     2   (0)|
|*  1 |  INDEX RANGE SCAN             | TP3R_MICD_MIID |     1 |    22 |     1   (0)|
|   2 |   SORT AGGREGATE              |                |     1 |     9 |            |
|   3 |    FIRST ROW                  |                |    40 |   360 |     1   (0)|
|*  4 |     INDEX RANGE SCAN (MIN/MAX)| TP3R_MICD_MIID |    40 |   360 |     1   (0)|
--------------------------------------------------------------------------------------
Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
3 consistent gets
0 physical reads

select max_id from tp3_range
where max_cre_dati = (select min(max_cre_dati)
from tp3_range
where max_cre_dati > to_date('20-OCT-2009','DD-MON-YYYY')
)

MAX_ID
----------
39999

Execution Plan
----------------------------------------------------------
--------------------------------------------------------------------------------------
| Id  | Operation                     | Name           | Rows  | Bytes | Cost (%CPU)|
--------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT              |                |     1 |    22 |     2   (0)|
|*  1 |  INDEX RANGE SCAN             | TP3R_MACD_MIAD |     1 |    22 |     1   (0)|
|   2 |   SORT AGGREGATE              |                |     1 |     9 |            |
|   3 |    FIRST ROW                  |                |     4 |    36 |     1   (0)|
|*  4 |     INDEX RANGE SCAN (MIN/MAX)| TP3R_MACD_MIAD |     4 |    36 |     1   (0)|
--------------------------------------------------------------------------------------
Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
3 consistent gets
0 physical reads

Put it all together into one statement and let’s see how much it costs:

with get_min_id as
(select min_id from tp3_range
where min_cre_dati = (select max(min_cre_dati)
from tp3_range
where min_cre_dati < TO_DATE('18-OCT-2009','DD-MON-YYYY')
) )
,get_max_id as
(select max_id from tp3_range
where max_cre_dati = (select min(max_cre_dati)
from tp3_range
where max_cre_dati > to_date('20-OCT-2009','DD-MON-YYYY')
) )
select count(*)
from test_p3
,get_min_id
,get_max_id
where id >get_min_id.min_id
and id < get_max_id.max_id
and cre_datetime between TO_DATE('18-OCT-2009','DD-MON-YYYY')
and TO_DATE('20-OCT-2009','DD-MON-YYYY')

COUNT(*)
----------
721
1 row selected.

Execution Plan
----------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                           | Name           | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |                |     1 |    57 |    67   (5)| 00:00:01 |       |       |
|   1 |  SORT AGGREGATE                     |                |     1 |    57 |            |          |       |       |
|   2 |   TABLE ACCESS BY LOCAL INDEX ROWID | TEST_P3        |     2 |    26 |    65   (5)| 00:00:01 |       |       |
|   3 |    NESTED LOOPS                     |                |     2 |   114 |    65   (5)| 00:00:01 |       |       |
|   4 |     MERGE JOIN CARTESIAN            |                |     1 |    44 |     2   (0)| 00:00:01 |       |       |
|*  5 |      INDEX RANGE SCAN               | TP3R_MICD_MIID |     1 |    22 |     1   (0)| 00:00:01 |       |       |
|   6 |       SORT AGGREGATE                |                |     1 |     9 |            |          |       |       |
|   7 |        FIRST ROW                    |                |    40 |   360 |     1   (0)| 00:00:01 |       |       |
|*  8 |         INDEX RANGE SCAN (MIN/MAX)  | TP3R_MICD_MIID |    40 |   360 |     1   (0)| 00:00:01 |       |       |
|   9 |      BUFFER SORT                    |                |     1 |    22 |     1   (0)| 00:00:01 |       |       |
|* 10 |       INDEX RANGE SCAN              | TP3R_MACD_MIAD |     1 |    22 |     1   (0)| 00:00:01 |       |       |
|  11 |        SORT AGGREGATE               |                |     1 |     9 |            |          |       |       |
|  12 |         FIRST ROW                   |                |     4 |    36 |     1   (0)| 00:00:01 |       |       |
|* 13 |          INDEX RANGE SCAN (MIN/MAX) | TP3R_MACD_MIAD |     4 |    36 |     1   (0)| 00:00:01 |       |       |
|  14 |     PARTITION RANGE ITERATOR        |                |       |       |            |          |   KEY |   KEY |
|  15 |      BITMAP CONVERSION TO ROWIDS    |                |       |       |            |          |       |       |
|  16 |       BITMAP AND                    |                |       |       |            |          |       |       |
|  17 |        BITMAP CONVERSION FROM ROWIDS|                |       |       |            |          |       |       |
|  18 |         SORT ORDER BY               |                |       |       |            |          |       |       |
|* 19 |          INDEX RANGE SCAN           | TP3_PK         |   707 |       |     6   (0)| 00:00:01 |   KEY |   KEY |
|  20 |        BITMAP CONVERSION FROM ROWIDS|                |       |       |            |          |       |       |
|  21 |         SORT ORDER BY               |                |       |       |            |          |       |       |
|* 22 |          INDEX RANGE SCAN           | TP_CRE_DT      |   707 |       |    48   (0)| 00:00:01 |   KEY |   KEY |
------------------------------------------------------------------------------------------------------------------------


Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
133 consistent gets
0 physical reads

If you look way back up this post, you will see that the number of records selected by the above is the same as from the code indicated by Bernard (721 records) and for a lot less cost – 133 consistent gets compared to 892.

So I have hopefully explained the issue (having to visit all the index partitions to check the date range), shown how a ranges table can help, given both a simple and a less-simple-but-more-efficient example of code to populate the table, and finally shown that using the range table can be more efficient.

And I arrogantly gave it a name – Correlated Partition Exclusion :-)

Note that this example is a special case, as there is no overlap between the dates; the relationship between CRE_DATETIME and ID is perfect. You are unlikely to have that in real life. Also, some of you may be tired of reading this post by now. So I will cover the general case in the NEXT post.

More on Assisting Partition Exclusion December 6, 2009

Posted by mwidlake in performance, VLDB.
Tags: , ,
2 comments

A couple of weeks ago I posted about how you could help the Oracle CBO carry out partition exclusion and thus aid performance. In essence, you have a table partitioned on ID (an ascending numeric) and you also have a date column (CRE_DATETIME, indexed using a locally partitioned index) which you often want to limit queries on.
The ID and the CRE_DATETIME both increase over time, so as a human you can infer that a date created range will match to an ID range that you can limit any queries on and get partition exclusion. The CBO does not understand this relationship and so can’t help. If you want to look at records for the last week and you know that you generate 10,000 records a day maximum, you can add a where clause of {and this is pseudocode}
WHERE ID BETWEEN (MAX_ID-(10000*7))
AND (MAX_ID)
I’m afraid you will have to look back at the referenced post if this does not make sense.
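To make the pseudocode concrete, here is a sketch against a made-up table called BIG_TABLE {both the table name and the 10,000-records-a-day figure are illustrative only}:

select *
from   big_table
where  cre_datetime > sysdate - 7
and    id between (select max(id) - (10000 * 7) from big_table)
           and    (select max(id) from big_table);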

The issue I want to address is this: how can you be sure that the fake ID range you are using covers the date range you are interested in?

A simple real-world example of this is when someone (a business objects user or helpdesk user seems to be the usual source) wants to look up the details of a fairly new customer on the datawarehouse. They don’t know the ID of the record, but they know the new account was created this week and some other filtering data. So they run a query saying:

WHERE CRE_DATETIME >SYSDATE-7
AND COL_A = ‘XXXX’
AND COL_B = ‘YYYY’

This visits every partition in the table, even if there is a locally partitioned index on CRE_DATETIME. See this post for details of this problem. If you are really unlucky, and this seems to be the usual situation, there is no index on the CRE_DATETIME either, and a full scan of the billion-row table is initiated, tying up one CPU and giving that expansive {sorry, typo, expensive, though both are true} storage sub-system something to work on for a few hours.

However, in this situation, do this. Select max(ID) {the primary key} from the partitioned table, which will take less time than it does to type “select”. Then you say to the user:

 “OK, the max ID is currently 987,000,000. Add this to the WHERE clause:

AND ID > 980000000

If that does not find your record now change that additional clause to:

AND ID BETWEEN 970000000 AND 980000000

and just keep going down by 10 million each time.”
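In SQL terms, using the same made-up BIG_TABLE as an illustration, the steps are:

-- step 1: grab the current top of the ID range (very quick via the PK index)
select max(id) from big_table;

-- step 2: the user's query, now also limited on the partition key
select *
from   big_table
where  cre_datetime > sysdate - 7
and    col_a = 'XXXX'
and    col_b = 'YYYY'
and    id > 980000000;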

By doing this, the user will be limiting the query on the partition key, the ID, and partition exclusion will be possible and each query will come back quickly. The user will usually find the record they want in the first couple of attempts {or will be kept quiet and less demanding of the system until they get bored and come back to tell you “you are an idiot”, having scanned back over several 10’s of millions of records, but then you know they lied about the record being recent so you need to help them in some other way}.

This post is really just to remind you of where I had got to on this thread, my having been unforgivably slow in keeping it running. Tomorrow (HONEST!) I will cover HOW you ensure your inferred ID range covers the CRE_DATETIME range you are interested in.

Assisting Partition Exclusion – Partitions for Performance November 23, 2009

Posted by mwidlake in performance, VLDB.
Tags: , ,
10 comments

Partitions are often used to help increase the performance of SQL select activity via the concept of partition exclusion – the CBO will be able to identify which of the partitions could hold the data of interest and ignore the others. This is their main benefit from a SQL Select performance perspective.

The problem is, the partitioning key is not always included in your SQL statements.

I discussed this potential negative impact of partitioning on SQL select performance in this post and this one also. I did not give any proof then, so the first half of this post provides that.

As an example, I have created a table with 45 range partitions plus a MAXVALUE partition (46 in total), partitioned on the column ID, an ascending numeric primary key (traditionally sourced from a sequence). There is an index on the ID column, locally partitioned. The table also has a column CRE_DATETIME, which is the date the record was created. This column is also indexed with a locally partitioned index.

NAME                           VALUE
------------------------------ -------------
compatible                     10.2.0.3.0
cpu_count                      8
db_block_size                  8192
db_file_multiblock_read_count  16
optimizer_mode                 ALL_ROWS
sga_target                     0
sort_area_size                 65536

create table test_p3
(id           number(10) not null
,cre_datetime date not null
,status        number(1) not null
,num_1         number(4) not null -- random 20
,num_2         number(4) -- random 500
,num_3         number(5) -- cycle smoothly
,num_4         number(5) -- Random 10000
,vc_1          varchar2(10)
,vc_2          varchar2(10)
,vc_pad        varchar2(2000))
tablespace users
partition by range (id)
(partition id_01k values less than (1000)
tablespace users
,partition id_02k values less than (2000)
tablespace users
,partition id_03k values less than (3000)
tablespace users
...
,partition id_45k values less than (45000)
tablespace users
--
,partition id_max values less than (maxvalue)
tablespace users
)
/
select table_name,partitioning_type p_type
,partition_count pt_count
from dba_part_tables
where owner=user and table_name = 'TEST_P3'
TABLE_NAME                     P_TYPE    PT_COUNT
------------------------------ ------- ----------
TEST_P3                        RANGE           46

>@ind_cols
IND_NAME           TAB_NAME           PSN       COL_NAME
------------------ ------------------ --------- ------------
TP3_PK             TEST_P3            1         ID
TP_CRE_DT          TEST_P3            1         CRE_DATETIME

{This is a very typical scenario, as is the opposite situation where the table is partitioned on date but has a numeric column holding e.g. order number or customer ID.}

The below shows a select of a single record by ID.

select id,cre_datetime,num_1,vc_1
from test_p3 where id=37123
/
        ID CRE_DATETIME           NUM_1 VC_1
---------- ----------------- ---------- ----------
     37123 12-OCT-2009 02:52         19 RMZAN
1 row selected.

Execution Plan
----------------------------------------------------------
-------------------------------------------------------------------------------------------------------------
| Id  | Operation                          | Name    | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
-------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                   |         |     1 |    22 |     2   (0)| 00:00:01 |       |       |
|   1 |  PARTITION RANGE SINGLE            |         |     1 |    22 |     2   (0)| 00:00:01 |    38 |    38 |
|   2 |   TABLE ACCESS BY LOCAL INDEX ROWID| TEST_P3 |     1 |    22 |     2   (0)| 00:00:01 |    38 |    38 |
|*  3 |    INDEX UNIQUE SCAN               | TP3_PK  |     1 |       |     1   (0)| 00:00:01 |    38 |    38 |
-------------------------------------------------------------------------------------------------------------
Statistics
----------------------------------------------------------
          1  recursive calls
          0  db block gets
          3  consistent gets
          0  physical reads

Note the PARTITION RANGE SINGLE and that Pstart/Pstop are both 38. Consistent gets are only 3. The query is very efficient as the CBO identified that records matching the WHERE clause could only exist in the one partition. The relevant local index is examined for 2 consistent gets (the BLEVEL of the local index partitions is 1) and then one consistent get is made against the table.

Now I’ll scan for records for a range of IDs:

select id,cre_datetime,num_1,vc_1
from test_p3 where id between 30980 and 31019
/
        ID CRE_DATETIME           NUM_1 VC_1
---------- ----------------- ---------- ----------
     30980 25-SEP-2009 01:20         15 JQLYA
     30981 25-SEP-2009 01:24          5 WYWCR
     30982 25-SEP-2009 01:28         10 QZSHD
     30983 25-SEP-2009 01:32         18 KQBSU
     30984 25-SEP-2009 01:36          2 HRPMA
....
     31018 25-SEP-2009 03:52          3 YOPJO
     31019 25-SEP-2009 03:56         10 GNGKG

40 rows selected.

Execution Plan
----------------------------------------------------------
-------------------------------------------------------------------------------------------------------------
| Id  | Operation                          | Name    | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
-------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                   |         |    40 |   880 |     9   (0)| 00:00:01 |       |       |
|   1 |  PARTITION RANGE ITERATOR          |         |    40 |   880 |     9   (0)| 00:00:01 |    31 |    32 |
|   2 |   TABLE ACCESS BY LOCAL INDEX ROWID| TEST_P3 |    40 |   880 |     9   (0)| 00:00:01 |    31 |    32 |
|*  3 |    INDEX RANGE SCAN                | TP3_PK  |     2 |       |     2   (0)| 00:00:01 |    31 |    32 |
-------------------------------------------------------------------------------------------------------------

Statistics
---------------------------------------------
          0  recursive calls
          0  db block gets
         52  consistent gets
          0  physical reads

Again, you can see partition exclusion, this time as a PARTITION RANGE ITERATOR and Pstart/Pstop of 31 and 32. Consistent gets were higher, but I did fetch several records.

Now a query is issued for a record with a specific value for CRE_DATETIME.

select id,cre_datetime,num_1,vc_1
from test_p3 where cre_datetime = 
   to_date('06-OCT-2009 12:20','DD-MON-YYYY HH24:MI')
/
        ID CRE_DATETIME           NUM_1 VC_1
---------- ----------------- ---------- ----------
     35105 06-OCT-2009 12:20         11 JPQHO
1 row selected.

Execution Plan
----------------------------------------------------------
---------------------------------------------------------------------------------------------------------------
| Id  | Operation                          | Name      | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
---------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                   |           |     1 |    22 |    48   (0)| 00:00:01 |       |       |
|   1 |  PARTITION RANGE ALL               |           |     1 |    22 |    48   (0)| 00:00:01 |     1 |    46 |
|   2 |   TABLE ACCESS BY LOCAL INDEX ROWID| TEST_P3   |     1 |    22 |    48   (0)| 00:00:01 |     1 |    46 |
|*  3 |    INDEX RANGE SCAN                | TP_CRE_DT |     1 |       |    47   (0)| 00:00:01 |     1 |    46 |
---------------------------------------------------------------------------------------------------------------

Statistics
---------------------------------------
          1  recursive calls
          0  db block gets
         90  consistent gets
          0  physical reads

Notice that partition exclusion does not occur: you see PARTITION RANGE ALL and a Pstart/Pstop of 1 - 46. The CBO cannot identify which partitions may or may not hold records with that CRE_DATETIME. Consistent gets are now a lot higher, 90 (as compared to 3 for the situation where partition exclusion can occur), as Oracle has to probe each and every local index partition to identify if any records for the given value exist.

Of course, you probably would not look for a record with a specific datetime, but in the opposite scenario of having partitioned on datetime, it would not be at all unusual to look for a record with a given ID and for partition exclusion not to be possible.

As I mentioned earlier, this need to scan all of the local indexes because no partition exclusion is possible is something I discussed a couple of weeks back.

Now I will select for the range of dates that my range scan in ID identified:

select id,cre_datetime,num_1,vc_1
from test_p3 
where cre_datetime between to_date('25-SEP-2009 01:20','DD-MON-YYYY HH24:MI')
                   and     to_date('25-SEP-2009 03:56','DD-MON-YYYY HH24:MI')

        ID CRE_DATETIME           NUM_1 VC_1
---------- ----------------- ---------- ----------
     30980 25-SEP-2009 01:20          3 BGAFO
     30981 25-SEP-2009 01:24         15 PXGJD
     30982 25-SEP-2009 01:28         12 PGQHJ
     30983 25-SEP-2009 01:32         17 TIBZG
     30984 25-SEP-2009 01:36         11 EQQQV
...
     31018 25-SEP-2009 03:52         18 FSNVI
     31019 25-SEP-2009 03:56         16 LJWNO
40 rows selected.

Execution Plan
----------------------------------------------------------
---------------------------------------------------------------------------------------------------------------
| Id  | Operation                          | Name      | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
---------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                   |           |    42 |   924 |    54   (0)| 00:00:01 |       |       |
|   1 |  PARTITION RANGE ALL               |           |    42 |   924 |    54   (0)| 00:00:01 |     1 |    46 |
|   2 |   TABLE ACCESS BY LOCAL INDEX ROWID| TEST_P3   |    42 |   924 |    54   (0)| 00:00:01 |     1 |    46 |
|*  3 |    INDEX RANGE SCAN                | TP_CRE_DT |    42 |       |    47   (0)| 00:00:01 |     1 |    46 |
---------------------------------------------------------------------------------------------------------------

Statistics
----------------------------------------------------------
          1  recursive calls
          0  db block gets
        136  consistent gets
          0  physical reads

That actually selects back the same records, but now a PARTITION RANGE ALL is invoked, Pstart/Pstop of 1-46, and the number of consistent gets goes up to 136.

I discussed the impact of partitions on range scans in this second post about performance issues with partitions.

Thankfully, you can often alleviate this problem of considering all partitions.

The below shows a scan for records for a range of dates.

select count(*)
from test_p3
where cre_datetime between to_date('18-OCT-2009','DD-MON-YYYY')
                   and     to_date('20-OCT-2009','DD-MON-YYYY')

  COUNT(*)
----------
       721
1 row selected.

Execution Plan
----------------------------------------------------------
-------------------------------------------------------------------------------------------------
| Id  | Operation            | Name      | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
-------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |           |     1 |     8 |    49   (0)| 00:00:01 |       |       |
|   1 |  SORT AGGREGATE      |           |     1 |     8 |            |          |       |       |
|   2 |   PARTITION RANGE ALL|           |   731 |  5848 |    49   (0)| 00:00:01 |     1 |    46 |
|*  3 |    INDEX RANGE SCAN  | TP_CRE_DT |   731 |  5848 |    49   (0)| 00:00:01 |     1 |    46 |
-------------------------------------------------------------------------------------------------

Statistics
----------------------------------------------------------
          1  recursive calls
          0  db block gets
         90  consistent gets
          0  physical reads

As you can see, no partition exclusion is possible; every partition is considered (PARTITION RANGE ALL, Pstart/Pstop of 1-46). 90 consistent gets are required.

The trick is to introduce a second WHERE clause, limiting the query by the partitioning key as well.
As a human, you can recognise that the ID and the CRE_DATETIME columns are going to increase pretty much in synchronisation. Records with a CRE_DATETIME a month ago are going to be found in partitions with a lower ID range than those from a week ago. Oracle does not know this business logic, so you have to tell it.

Let us say you never have more than 1,000 records created a day. So, if you want “today’s” data, you need only consider IDs between the max(ID) and the max(ID) less 1,000. To be safe, you would increase this range substantially. The idea is to exclude most partitions without risking missing data:

select count(*)
from test_p3
where cre_datetime between to_date('18-OCT-2009','DD-MON-YYYY')
                   and     to_date('20-OCT-2009','DD-MON-YYYY')
and  id between 39000 and 41000

  COUNT(*)
----------
       721
1 row selected.

Execution Plan
----------------------------------------------------------
---------------------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name             | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
---------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |                  |     1 |    13 |    14   (8)| 00:00:01 |       |       |
|   1 |  SORT AGGREGATE             |                  |     1 |    13 |            |          |       |       |
|*  2 |   VIEW                      | index$_join$_001 |    34 |   442 |    14   (8)| 00:00:01 |       |       |
|*  3 |    HASH JOIN                |                  |       |       |            |          |       |       |
|   4 |     PARTITION RANGE ITERATOR|                  |    34 |   442 |     7  (15)| 00:00:01 |    40 |    42 |
|*  5 |      INDEX RANGE SCAN       | TP_CRE_DT        |    34 |   442 |     7  (15)| 00:00:01 |    40 |    42 |
|   6 |     PARTITION RANGE ITERATOR|                  |    34 |   442 |     9  (23)| 00:00:01 |    40 |    42 |
|*  7 |      INDEX RANGE SCAN       | TP3_PK           |    34 |   442 |     9  (23)| 00:00:01 |    40 |    42 |
---------------------------------------------------------------------------------------------------------------

Statistics
----------------------------------------------------------
          1  recursive calls
          0  db block gets
         18  consistent gets
          0  physical reads

In the above I have managed to exclude most of the partitions and still get the correct data, by adding the clause and id between 39000 and 41000. There is something a little unusual {if you have not seen it before} about the plan, with two indexes being scanned and the results hash-joined as an index join, but both are PARTITION RANGE ITERATOR scans and the Pstart/Pstop values are 40-42. Consistent gets are down to 18.

You do have to take great care with this method that the extra WHERE clause you add on the partitioning key never excludes records you want to find. But so long as it does not, it is a very powerful technique.
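One sanity check {a sketch, and note that this query itself cannot prune partitions, so treat it as a one-off} is to ask what ID range the date range really maps to and confirm your added clause comfortably brackets it:

select min(id), max(id)
from   test_p3
where  cre_datetime between to_date('18-OCT-2009','DD-MON-YYYY')
                    and     to_date('20-OCT-2009','DD-MON-YYYY');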

I’ll go into that a little more in the next posting {which hopefully will not be as delayed as this one!}

Depth of Indexes on VLDBs November 18, 2009

Posted by mwidlake in performance, Uncategorized, VLDB.
Tags: , ,
3 comments

In a couple of recent posts on partitions, covering single row and range scan access, I have started exploring the performance of partitions. I will continue on the main theme soon, honest.

I feel a little bad that I have not given concrete examples so far {creating very large objects for the sake of demonstrations and proof-of-concepts is a bit of an issue with VLDBs, due to the space and time requirements to do it}. So, here is just some real-world information on index depth, or BLEVEL of indexes on Very Large DataBases.

The BLEVEL is the number of levels in the index. Minus one. See this post for a full description, but in brief, if an index has a BLEVEL of 3, that means there is a root node, a level of branch nodes, a second level of branch nodes and then the leaf nodes. 4 levels. 4 IOs to check if the index holds the value you are looking for. And then one or more IOs against the table to get the actual records from the table, if required (one block if there is only one row, one or more blocks if there are several rows).
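If you want to check your own system, something like the below against DBA_INDEXES {BLEVEL and LEAF_BLOCKS are standard dictionary columns} will show you your deepest indexes:

select owner, index_name, blevel, blevel + 1 as height, leaf_blocks
from   dba_indexes
where  blevel >= 3
order  by blevel desc, leaf_blocks desc;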

The below is from a 25TB datawarehouse system. I am looking for the maximum index depth on the database:-

select max(blevel) from dba_indexes

MAX(BLEVEL)
-----------
          4

OK, how many indexes are this deep?…

select owner,index_name from dba_indexes
where blevel=4

OWNER                          INDEX_NAME
------------------------------ ------------------------------
ERIC                           W_IIIIIIII_URLS_PK

1 row selected.

Oh. I have a pretty massive database and the deepest index I have is BLEVEL 4 and there is one of them. So the index height is 5. And access to that table for a given referral ID is going to take 5 IOs to work down the index and one against the table.

What is more, the index is an IOT

INDEX_NAME                     INDEX_TYPE
------------------------------ ---------------------------
W_REFERRAL_URLS_PK             IOT - TOP

so, in fact, to get to the table record it is just the 5 IOs as the table record is held in the IOT.

So with a 25TB data warehouse, the deepest normal indexes are BLEVEL 3.

select index_name,num_rows,leaf_blocks
from dba_indexes
where BLEVEL =3
order by num_rows desc

INDEX_NAME                             NUM_ROWS  LEAF_BLOCKS
-------------------------------- -------------- ------------
W_WL_SH_API_XIIIIIIIIIII_ID       4,585,108,192   13,406,015
W_WL_SHIIIIIIIIIIIXXXX_ID         4,564,350,002   16,892,898
W_WL_SIIIIIIIIIIIIIIIIIIIIII_ID   4,422,775,820   17,189,536
W_WL_SHIIIIIIIIIIIIIIIIIIIIII_ID  4,182,087,545   12,243,438
W_WL_GX_RESIIIIIIIIIII_LOG_ID1    3,690,374,216   14,613,388
IDX_W_WL_LIIIIIIIIIIIESS_ID         351,507,905    1,958,109
WL_LOGIN_SUIIIIIIIIIIIIE_ID         348,004,861    1,550,180
IND_IIII_LEG_ID                     312,727,000      972,690
IND_TMLR_MARIIIIIIIIIII_ID          306,988,247    1,121,592
PK_TBL_IIIIIIIIIIIUNNER             305,566,240      939,775
PK_IIIIIIIIIIIGS                    284,208,000      863,000

I asked an old friend, Tony Webb, to check out the depth of indexes on the largest databases they have, which range from a few TB to several 10s of TB {one is 100TB plus, but a lot of the data is in LOBs, so the actual number of records is only billions, not 10's of billions :-) }. No index had a BLEVEL greater than 3.

Hi Martin,

I ran the first query on DAVE, DEE(our biggest db), DOZY and MICK. The max depth results were 3,2,3 and 3. Our largest database only had 2. In fact some of the other largish dbs also only have a max depth of 2.

That enough for your blog Martin?

Now, some of you may be smelling a Rat, and you would be right. At both sites, the VLDBs have been designed with partitioning as a key part of the Physical Implementation. All the largest objects are partitioned. What impact does partitioning have on BLEVEL?

The depth of the index segment is dependent on the number of keys per block.

Let us say you have 8k blocks {this is a questionably low block size for a data warehouse, but it is what some of these systems have} and your index is on a couple of columns, totalling 20 bytes per key on average, including rowid. Allowing block overhead, that is approx 8000/20 entries = 400 max….

Blow that, that's wholly theory; I'll look at my real 25TB database for the number of rows per leaf block…

select leaf_blocks, num_rows/leaf_blocks rows_per_leaf
from dba_ind_statistics
where num_rows > 100000000
order by leaf_blocks desc

 LEAF_BLOCKS ROWS_PER_LEAF
------------ -------------
  30,211,949    325.602273
  28,717,447    262.762331
  26,308,608    382.505959
  25,429,862    385.047045
  24,054,626     309.92864
  23,660,747    382.571434
  22,458,058    411.063205
  22,316,976    346.620625
  18,875,827    379.952198
  17,451,901    338.600909
  17,189,536     257.29466

From the above, let us take a low figure of 300 average entries per block. For BLEVEL 0, a one-block index, that would be 300 rows that are referenced. For a BLEVEL 1 index, 300 entries in the root block will point to 300 leaf blocks, so that will be 300*300 rows: 90,000 entries.

Go down another BLEVEL to 2 (and a height of 3) and you have a root node referencing 300 branch nodes, each referencing 300 leaf nodes, each referencing 300 records. 300*300*300 = 27 million.

Another level (BLEVEL 3), another factor of 300 and that is 8.1 billion.

BLEVEL 0= 300
BLEVEL 1 = 90,000
BLEVEL 2 = 27,000,000
BLEVEL 3 = 8,100,000,000
BLEVEL 4 = 2,430,000,000,000 (2,430 billion entries).
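{If you prefer to let the database do the arithmetic, the series is just powers of 300:}

select level - 1 as blevel
      ,power(300, level) as max_entries
from dual
connect by level <= 5;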

You can see that, with an 8k block size and being very pessimistic about the number of entries per block (I used a low average to calculate entries per block, whereas in reality most indexes will fill the root and branch blocks close to full before throwing a new level), you have to increase your data volume by almost 300 times to throw a new level of index.

On average, knowing nothing about the actual number of records in an index, you would have to reduce your number of indexed rows by 150 times (half of 300) to have a 50% chance of lowering the BLEVEL of a locally partitioned index by one, i.e. 150 or more partitions.

With a 32k block size, you would have to reduce your data volume by more like 620 times to drop the BLEVEL (not 600, as you have less wastage from block overhead with larger blocks).

To reduce it by two it would be 300*150 I think, with an 8k block size {can any statisticians, like Mr Lewis, help me out on this one?}. If I am right, that is 45,000 partitions. In that 25TB datawarehouse, no table has more than 4,000 partitions.

That is why I say, when you partition a table and have locally partitioned indexes, you might reduce the index level by 1. Might.

However, none of these systems has more than 4,000 partitions per table. That might sound a lot, but it is only going to reduce an index BLEVEL by 1. Almost certainly it will, but if you have read the previous postings, you know that is not going to make index lookups that much more efficient :-)

I’ll just add a couple of last comments.

  • If you have an index BLEVEL of 4 or 5 or more and do not have a VLDB, you might want to look at why {hints: it could be an IOT, it could be an index on several concatenated VARCHAR2 columns, it could be you are using a 2k block size, it could be that you regularly delete a large percentage of your data and then re-insert more data, it could be you have a one-table database, hehe}.
  • Point one does not mean you should rebuild said index. I almost never rebuild indexes. The need to do so is one of those Oracle myths. There are situations where index rebuilds are advantageous, but they are pretty damned rare. If you regularly rebuild all your indexes or you are thinking of doing so, can I suggest you book a “meeting” for an afternoon and sit down with a cup of tea and google/bing/yahoo or whatever and check out index rebuilding first? Ignore any sites you land on offering database health checks or training courses on cruise liners.

That’s all for now.

Partitions are Still Not for Performance – Sometimes November 5, 2009

Posted by mwidlake in performance.
Tags: ,
6 comments

This is a follow up to my first posting on why Partitions are not for Performance where I discuss how lookups against a partitioned table with a locally partitioned index and no partition pruning can lead to performance problems. Basically, the CBO ends up scanning all local indexes to find the record(s) you want, which with a hundred partitions will probably result in several hundred buffer gets instead of 4 or 5.

The first posting dealt with the simple situation where a single record is being selected but this lack of partition pruning also crops up with range scans of course.

{warning, this is a long post, but if you use partitions you need to understand this}.
{Caveat, it has been pointed out the last comment, and by association this one, appears partition-negative. I am not partition-negative, partitions are great but for reasons generally other than performance and, to improve performance, you need to understand how partitions fit in with the CBO}

Imagine you have an unpartitioned ORDERS table with 10 million rows, where the primary key is the traditional ascending numeric derived from a sequence.
About 5,000 orders a day are received, and the table goes back around 10 years (in the past, fewer orders per day were received).
There is also an ORDER_DATETIME column, a DATE type column holding the date and time the order was placed. This column has a traditional index on it, which has a B-level of 3.

You want to query for all ORDERS placed on a date, say 1st November 2009, so you issue a query with the following WHERE clause:
WHERE ORDER_DATETIME >= TO_DATE('01-NOV-2009','DD-MON-YYYY')
AND ORDER_DATETIME < TO_DATE('02-NOV-2009','DD-MON-YYYY')

That will select around 5,000 out of 10 million rows and so a nested loop lookup on the index is the most suitable {just accept this for now, OK?}

The CBO will work down the levels on the index on ORDER_DATETIME to the first record of interest, taking 4 I/Os. It will then scan the index for all entries in that range, so maybe 10 leaf blocks, and then read the database blocks for each one – 5000 I/Os against the table, but as the data will be clustered well, most reads will be to the same say 100 blocks, so they will be cached in the block buffer cache. {Under Oracle – to 11.1 anyway – all IO is treated as physical, so the default optimizer will calculate the cost based on this, but in reality most IO will be logical IO, not physical IO}.

So the real cost will be 4 IOs to find the start of the index to scan, 10 I/Os to scan the index and 5000 IOs to get the table data – 5014 logical I/O’s, with between 0 and (4+10+100 =114) physical IOs to get the data into the cache, depending on what was cached already.

OK, remember those figures…

Now we partition the table by the ORDER_DATETIME column into 100 partitions and the index on ORDER_DATETIME is locally partitioned of course. Your same WHERE clause on ORDER_DATETIME will now resolve to a single index partition.

If you are lucky the local index partition will have a B-level of 2, 1 less than the standard non-partitioned index. So, with the same WHERE clause, the CBO will work down the local index partition to the first record of interest, for 3 I/Os. Then it will scan the index for the same 10 leaf blocks and find 5,000 records to check in the table partition, with the same 5,000 I/Os. You have saved….Yep, 1 logical IO. Physical IOs are between 0 and (3+10+100 =113). Let’s be kind and assume you saved a physical IO.

In this case, partitioning saved 1 out of 5,014 logical and 1 out of 114 physical IOs.

Partitioning, in this case, is performance agnostic. It matters very, very little.

In reality, if your application is working on the same day of data over and over again, using the one partition, then that set of data will be cached and a nested-loop access to it will find the data in memory, but that would be true with the standard table and index too :-) So still no gain….

I've assumed a nested-loop access path is still used by the CBO when accessing your partition. This is where things can be very different {and it is also where I should swap from theory and start creating some test data to prove this, but I'll hang myself out first and then look like an idiot next week when I am back with my test system :-)}.

Let’s estimate that the unpartitioned table is 100,000 blocks (100 rows per block).
With 5,014 I/Os against a 10 million row table/index, a nested loop lookup is likely to be chosen, as the CBO estimates the number of I/Os to be less than scanning the whole table (100,000 blocks divided by the actual multi-block read count {which is not the db_file_multiblock_read_count, but I'm skipping over that detail}, let's say 12, for say 8,500 I/Os).

With the table split into 100 partitions, there are 1,000 blocks in each partition. A whole partition can be scanned with 1,000/real-multi-block-read-count {12} reads, so 80 or so I/Os.

So in reality, your partitioning MAY be beneficial, as the CBO decides to forget the index and nested loop lookups and simply scans the whole partition into memory for maybe 80-100 I/Os using multi-block reads, burning some CPU as well to process the data in memory and find the records you want.

So, instead of 5,013 IOs, the CBO decides to do 80-100 I/Os and some extra memory/CPU work, which is cheap given the power of modern CPUs.

So, partitions may help performance as the CBO swaps from nested loop lookups with an index to simply full scanning the partition.

You may start to wonder if that index on ORDER_DATETIME helps much… If it is to support range scans of a significant number of records, maybe not, but if it helps single or small-number record lookups, the index still helps.

I don’t know about you, but my brain is full for the night. I’ll leave this for another day….
