
Sail Away, Sail Away, Sail Away March 20, 2012

Posted by mwidlake in Meeting notes, VLDB.
1 comment so far

I’m just doing some last minute preparation for the Norwegian User Group Spring Seminar. This is quite a large conference with 5 concurrent streams and a very good line-up of presenters. You can see the agenda here.

What is also a little unusual about this conference is that it is mostly on a boat, or rather a ship. When I was first asked if I would put forward a talk or two I declined – as I am very prone to being sea-sick. Truls Bergersen, who helps organize the event, got in touch with me a few weeks later and said “we are doing a day on land first, do you want to present then?”. Sure! That would be great! So I am, I’ll be doing my talk on Index Organized Tables once more {and then I think I’ll put it to bed for a couple of years}.

Now, I am not exactly sure what made me agree to this, but then Truls asked if I would consider doing another talk I had suggested, on VLDBs (Very Large DataBases), the following day. Yes, why not?

Only, the next day we are on the ship. I’m a bit nervous about this. I don’t think that a good presentation should include the presenter going green and dashing for the door (though I did have a similar experience at Oracle Open World once, as a result of an all-you-can-eat Chinese meal the day before, but as it affected “the other end of things” you really don’t want details of that).

Thankfully, I’ve not been worrying about sea-sickness for the last couple of weeks. That will be because my laptop hard disk died and I lost everything on my still-new machine. It was still so new that I had not started backing it up properly – after all, there was nothing on there I could not replace easily.

It has been a real trial to replace all those things that I could easily replace. At least I had the presentations and some other critical bits on my temporary USB stick backup…

Database Sizing – How much Disk do I need? (The Easy Way) November 11, 2010

Posted by mwidlake in Architecture, development, VLDB.
7 comments

How much Disk do I need for my new Oracle database? Answer:-

  • 8-10 times the volume of raw data for an OLTP system
  • 2-4 times the raw data volume for a Data Warehouse.
  • The bigger the database, the nearer you will be to the lower multiplication factors.

{Disclaimer. This is of course just my opinion, based on some experience. If you use the above figures for a real project and get the total disc space you need wrong, don’t blame me. If you do and it is right, then of course you now owe me a beer.}

Many of us have probably had to calculate the expected size of a database before, but the actual database is only one component of all the things you need to run the Oracle component of your system. You need to size the other components too – archived redo logs, backup staging area, dataload staging area, external files, the operating system, swap space, the Oracle binaries {which generally get bigger every year but shrink in comparison to the average size of an Oracle DB} etc…
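As a quick sanity check on an existing system, you can total up the main file consumers straight from the dictionary {a minimal sketch – it ignores archived redo, backups and the OS, which you still have to size separately}:

-- what the database files themselves currently take up, in GB
select 'datafiles' what, round(sum(bytes)/1024/1024/1024,1) gb
from dba_data_files
union all
select 'tempfiles', round(sum(bytes)/1024/1024/1024,1)
from dba_temp_files
union all
select 'online redo', round(sum(bytes*members)/1024/1024/1024,1)
from v$log;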

In a similar way to my thoughts on how much database space you need for a person, I also used to check the total disk space taken up by every database I created and by those that I came across. {A friend emailed me after my earlier posting to ask if I had an obsession about size. I think the answer must be “yes”}.

First of all, you need to know how much “raw data” you have. By this I mean what will become the table data. Back in the early 90’s this could be the total size of the flat files the old system was using, or even the size of the data as it was in spreadsheets. An Oracle export file of the system gives a pretty good idea of the raw data volume too. Lacking all of these, you need to roughly size your raw data yourself: do a calculation of “number_of_rows*sum_of_columns” for your biggest 10 tables (I might blog more on this later). Don’t be tempted to overestimate; my multipliers allow for the padding.
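If the tables already exist somewhere (a prototype schema, or the legacy data loaded into a scratch database) and statistics have been gathered, the dictionary will do the “number_of_rows*sum_of_columns” arithmetic for you {again, just a sketch – NUM_ROWS and AVG_ROW_LEN are only as good as your stats}:

-- approximate raw volume of the ten biggest tables, in GB
select * from (
  select table_name
        ,round(num_rows*avg_row_len/1024/1024/1024,2) approx_gb
  from   user_tables
  where  num_rows is not null
  order  by num_rows*avg_row_len desc
)
where rownum <= 10;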

Let us say you have done this and it comes to 60GB of raw data for an OLTP system. Let the storage guys know you will probably want about 500GB of space. They will then mentally put it down as “of no consequence”, as if you have dedicated storage guys you probably have many terabytes of storage. {Oh, I should mention that I am not considering redundancy at all, just the space that is provided. The amount of actual spinning disk is down to the level and type of RAID your storage guys make you use. That is a whole other discussion}.

If you come up with 5TB of raw data for a DW system then you need around 12-15TB of disk storage.

If you come up with more than a Terabyte or so of raw data for an OLTP system, or 10 to 20 Terabytes for a DW, then when you give your figures to the storage guys/procurement people they may well go pale and say something like “you have got to be kidding!”. This is part of why the multiplication factor for Data Warehouses, and larger systems in general, is less: you are forced to be more careful about the space you allocate and how you use it.

The overhead of total disk space over Raw data reduces as the database gets bigger for a number of reasons:

  • The size of the Oracle binaries and the OS does not change as the database gets bigger.
  • The size of swap space does not increase in line with the database as, generally speaking, if you increase the database size from 100GB to 1TB you do not have the luxury of increasing the system memory of your server tenfold. It probably just doubles.
  • Very large databases tend to have something making them big, like images or embedded documents, which are not indexed. Thus the ratio of table segments to index segments increases.
  • If you have a very large database you start removing indexes (often those that support constraints) to aid performance of data load and management, again improving the ratio of table segments to index segments.
  • Backups become partial or incremental to reduce the size and duration of the backup.
  • As mentioned before, the sheer size of the system is such that you just take more care over cleaning up work areas, paring down the archived redo log areas (those files do compress well) and other areas.
  • If things get extreme or you have been doing this for donkey’s years {note to non-UK people, this means many, many years} you start altering PCTFREE and checking over extent sizes.

My best ever ratio of database size to raw data was around 1.6 and it took an awful lot of effort and planning to get there. And an IT manager who made me very, very aware of how much the storage was costing him (it is not the disks, it’s all the other stuff).
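If you want to know the ratio for one of your own databases, a crude first stab is to compare total segment space to table-only segment space {a sketch – segment size overstates “raw data”, as it includes PCTFREE and unused space, so the true ratio is a little higher than this reports}:

-- total segment space vs space in table segments alone, in GB
select round(sum(bytes)/1024/1024/1024,1) total_gb
      ,round(sum(case when segment_type like 'TABLE%' then bytes end)
                /1024/1024/1024,1) table_gb
from   dba_segments;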

I should also just mention that the amount of disk you need is only one consideration. If you want your database to perform well you need to consider the number of spindles. After all, you can create a very large database indeed using a single 2TB disc – but any actual IO will perform terribly.

Saturday Philosophy – The unbelievably small world of VLDBs June 12, 2010

Posted by mwidlake in VLDB.
9 comments

Yesterday I posted about the potential for an Oracle in Science community within the UK Oracle user group {and wider for that matter; there is after all a world Oracle Life Science community, but it is currently less vibrant than it was, sadly}.

My friend and occasional drinking partner Peter Scott replied to say he felt there was “a place for a SIG for stonking great databases” {now wouldn’t SGDB be a better TLA than VLDB? :-) }.

Well, I would agree but for one small detail. An apparent lack of anyone willing to be part of the community.

When I was building a very considerable VLDB {and I’m sorry I keep going on about it, I’ll try and stop soon} back in the early to mid 2000’s, I seemed to be working in a vacuum of information, let alone prior experience. Yes, there was stuff in the Oracle manuals about how big things could theoretically be made and some vague advice on some aspects of it, but there was an absolute lack of any visible Oracle customers with anything even approaching the sizes I was contemplating. 2TB was about the limit and I was already way beyond that. Was this because I really was pushing the boundaries of database size? Well, I have since found out that whilst I was up there just behind the leading edge, there were several databases much, much bigger than mine, and others already envisioned that might hit the Petabyte level, never mind Terabytes.

The thing is, no one would speak about them. At all.

We were left to do it all pretty much from scratch, and it would not have been possible if I had not spent years building up experience with VLDBs as the definition of VLDB size increased – plus, of course, cracking support by the other DBAs and Systems Admins around me. And to be fair, Oracle Corp helped us a lot with our efforts to build these massive databases. Interestingly, one Oracle Consultant would regularly tell me that our systems really were not so unusually big and there were plenty larger. He usually said this when I asked, exasperatedly, as something else failed to scale, whether Oracle had ever tested things at this level :-). But despite constantly asking to meet with these people with massive systems, so we could exchange war stories and share advice, and being promised such contacts by Oracle, they never materialized except for CERN – who we already talked to as a fellow scientific organisation – and Amazon, who it turns out did things in a very different way to us {but it was really good to talk to them and find out how they did do their big databases, thanks guys}. Both were at the same scale or just behind where we were.

This is because most of the people with massive Oracle databases will not talk about them, as they are either run by the largest financial organisations, are to do with defence, or are in some other way just not talked about. In his comment Peter refers to a prior client with an OLTP-type system that is now around the PB scale. I would be pretty sure Peter can’t say who the client is or give any details about how the system was designed.

So although I think there is a real need for a “stonking great databases” forum, I think there is a real problem in getting a user community of such people/organisations together. And if you did, none of the members would be allowed to say much about how they achieved it, so all you could do would be to sit around and brag about who has the biggest. There is an Oracle community about such things, called the Terabyte Club, but last I knew it was invite-only and, when I managed to get invited, it turned out that mine was the biggest by a considerable margin, so I was still not meeting these elusive groups with 500TB databases. Maybe there is an Oracle-supported über database society but, as I never signed the Official Secrets Act, I might not have been eligible to play.

If I am wrong and anyone does form such a user group (or is in one!) I would love to be a member and I would strive to present and help.

I’ll finish with what appears to be a contradiction to what I have just written. There already is a UKOUG User Group that deals with large systems and I chair it – the Management and Infrastructure SIG. {sorry, the info on the web page could do with some updating}. Part of what we cover is VLDBs. But we also cover Very Many DataBases (companies with thousands of instances) and Very Complex DataBases, plus how you go about the technical and management aspects of working in a massive IT Infrastructure. It might be that we could dedicate a meeting to VLDBs and see how it goes, but I know that whilst many who come along are dealing with databases of a few TB, no one is dealing with hundreds-of-TB or PB databases. Either that or they are keeping quiet about it, which takes us back to my main point. The MI SIG is probably the closest to a VLDB SIG we have in Europe though, and is a great bunch of people, so if you have a VLDB and want to meet some fellow sufferers, we have our next meeting on 23rd September in the Oracle City office.

Ensuring Correlated Partition Exclusion #2 December 20, 2009

Posted by mwidlake in performance, Uncategorized, VLDB.
2 comments

<Previous Post

A few days ago {oh dear, it is now two weeks!} I showed how you could better control Correlated Partition Exclusion by creating a lookup table that showed, for each partition, the minimum and maximum ID (the partition key) and the minimum and maximum CRE_DATETIME (what you want to limit your query on). Using this range table in the way I described, you did not have to take an educated guess as to what range of IDs you used to include the partitions for the date range you were interested in, you could check the table.

But this only worked reliably where the ID and CRE_DATETIME increase directly in proportion to each other. It was a special case.

What about the more normal case, where there is a correlation between the two columns but it is not a perfect correlation? This would happen, for example, if the CRE_DATETIME is entered from another system, or from paper records, where the order of the records is not enforced, so some records from today get put into the system before some from yesterday. Or the correlation may be even looser than this, e.g. you are looking at orders for new customers: you “know” there are going to be no orders for these customers from 5 years ago, but you are not sure how far back in the orders table you should go to find what you want.

You can still use the lookup table method. The lookup table in effect becomes a meta-index – an index of the segments where you will find data but not actually the rows.

To demonstrate this, I created a table where the ID and CRE_DATETIME increase in order:

-- create a partitioned table
create table test_p4
(id           number(10) not null
,cre_datetime date not null
,status        number(1) not null
,num_1         number(4) not null -- random 20
,num_2         number(4) -- random 500
,num_3         number(5) -- cycle smoothly
,num_4         number(5) -- Random 10000
,vc_1          varchar2(10)
,vc_2          varchar2(10)
,vc_pad        varchar2(2000))
tablespace users
partition by range (id)
(partition id_01k values less than (1000)
tablespace users
,partition id_02k values less than (2000)
tablespace users
...
,partition id_45k values less than (45000)
tablespace users
,partition id_max values less than (maxvalue)
tablespace users
)
--
-- local partitioned indexes on the table
 IND_NAME           TAB_NAME           PSN       COL_NAME
 ------------------ ------------------ --------- ------------
 TP4_CRE_DT         TEST_P4            1         CRE_DATETIME

 TP4_PK             TEST_P4            1         ID
--
-- populate the table with data
insert into test_p4(id,cre_datetime,status,num_1,num_2,num_3,num_4
                  ,vc_1,vc_2,vc_pad)
select rownum
,to_date('01-JUL-2009','DD-MON-YYYY')+(rownum/360)
,decode(mod(rownum,100),0,1
              ,0)
,trunc(dbms_random.value(1,20)) 
,trunc(dbms_random.value(1,50)) 
,mod(rownum,10)+1
,trunc(dbms_random.value(1,10000))
,dbms_random.string('U',5)
,lpad(chr(mod(rownum,6)+65),5,chr(mod(rownum,6)+65) )
,lpad('A',1000,'A')
from dba_objects
where rownum <43000

I then messed with the data, updating 10% of records and shifting the CRE_DATETIME by a random amount of up to two days either way, so the date ranges in adjacent partitions would overlap. I then created a range table in the same way as I did for the previous post, as shown below.
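For reference, the range table and its indexes were of this form {assumed to mirror the TP3_RANGE versions from the previous post; the index names are the ones the plans below report}:

create table tp4_range
(min_cre_dati date
,max_cre_dati date
,min_id       number
,max_id       number);

create index tp4r_micd_miid on tp4_range(min_cre_dati,min_id);
create index tp4r_macd_miad on tp4_range(max_cre_dati,max_id);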

I ended up with a range table like the below:

MIN_CRE_DATI      MAX_CRE_DATI          MIN_ID     MAX_ID
----------------- ----------------- ---------- ----------
29-JUN-2009 20:35 05-JUL-2009 14:24          1        999
02-JUL-2009 04:50 07-JUL-2009 12:56       1000       1999
05-JUL-2009 02:34 10-JUL-2009 08:31       2000       2999
08-JUL-2009 05:23 13-JUL-2009 03:32       3000       3999
11-JUL-2009 08:07 15-JUL-2009 18:41       4000       4999
14-JUL-2009 00:14 18-JUL-2009 18:27       5000       5999
16-JUL-2009 08:58 21-JUL-2009 14:18       6000       6999
19-JUL-2009 01:28 24-JUL-2009 11:21       7000       7999
22-JUL-2009 08:02 27-JUL-2009 07:01       8000       8999
24-JUL-2009 22:06 30-JUL-2009 05:37       9000       9999
28-JUL-2009 04:59 01-AUG-2009 10:57      10000      10999
...
24-SEP-2009 01:52 28-SEP-2009 18:36      31000      31999
26-SEP-2009 16:49 01-OCT-2009 01:26      32000      32999
29-SEP-2009 13:20 04-OCT-2009 13:43      33000      33999
02-OCT-2009 08:40 07-OCT-2009 10:11      34000      34999
05-OCT-2009 04:29 10-OCT-2009 04:09      35000      35999
08-OCT-2009 02:04 12-OCT-2009 17:34      36000      36999
10-OCT-2009 20:03 15-OCT-2009 08:39      37000      37999
13-OCT-2009 15:09 18-OCT-2009 12:01      38000      38999
16-OCT-2009 06:49 21-OCT-2009 03:53      39000      39999
18-OCT-2009 20:16 23-OCT-2009 21:01      40000      40999
21-OCT-2009 12:10 26-OCT-2009 07:13      41000      41999
25-OCT-2009 01:29 29-OCT-2009 20:56      42000      42999

You can see that the MIN_CRE_DATI-MAX_CRE_DATI from record to record (partition to partition) overlap but generally increase.

How do you find the start and end of the ID range to cover a date period you are interested in? I always have to sit down with pen and paper to work this out. It is the classic “overlapping ranges” check, but my brain cannot hold on to it. So here goes: I want all records between the 1st and the 6th of October this year.

You want to find the partition with the lowest ID range which has a record that falls into the date range you want, i.e. where the maximum CRE_DATETIME is at or later than the start of the range you are interested in. If the maximum CRE_DATETIME is earlier than the date range you are interested in, no relevant records will be in that partition.

Here is the code to find the lowest partition:

select min(min_id)
from tp4_range 
where MAX_CRE_DATI>=to_date('01-OCT-2009','DD-MON-YYYY')
MIN(MIN_ID)
-----------
      32000

Execution Plan
----------------------------------------------------------
| Id  | Operation                    | Name           | Rows  | Bytes | Cost 
----------------------------------------------------------
|   0 | SELECT STATEMENT             |                |     1 |    12 |     2
|   1 |  SORT AGGREGATE              |                |     1 |    12 |         
|   2 |   TABLE ACCESS BY INDEX ROWID| TP4_RANGE      |    12 |   144 |     2
|*  3 |    INDEX RANGE SCAN          | TP4R_MACD_MIAD |    12 |       |     1
----------------------------------------------------------

To cover the whole of the partition of interest you need to look for records with an ID greater than the minimum of that partition, thus the selection of the min(min_id).

Similarly, you only want partitions where the minimum CRE_DATETIME is before the end of the date range you want. If all records in the partition have a CRE_DATETIME beyond the range, you are not interested in it.

select max(max_id)
             from tp4_range where MIN_CRE_DATI<=to_date('06-OCT-2009','DD-MON-YYYY')
MAX(MAX_ID)
-----------
      35999

Execution Plan
----------------------------------------------------------
| Id  | Operation                    | Name           | Rows  | Bytes | Cost 
----------------------------------------------------------
|   0 | SELECT STATEMENT             |                |     1 |    13 |     2
|   1 |  SORT AGGREGATE              |                |     1 |    13 |         
|   2 |   TABLE ACCESS BY INDEX ROWID| TP4_RANGE      |    37 |   481 |     2
|*  3 |    INDEX RANGE SCAN          | TP4R_MICD_MIID |    37 |       |     1
----------------------------------------------------------

Then you put these sub-queries into the query to select the data you want. Below I show the “original” code, which does not use the partition range table, and then the version using it, to prove that the same number of records comes back and to see the plan and statistics change:

select count(*) from test_p4
where cre_datetime between to_date('01-OCT-2009','DD-MON-YYYY')
                   and     to_date('06-OCT-2009','DD-MON-YYYY')
and num_2 = 5

  COUNT(*)
----------
        48

Execution Plan
----------------------------------------------------------------------------------------------------------------
| Id  | Operation                           | Name       | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
----------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |            |     1 |    11 |   356   (1)| 00:00:02 |       |       |
|   1 |  SORT AGGREGATE                     |            |     1 |    11 |            |          |       |       |
|   2 |   PARTITION RANGE ALL               |            |    36 |   396 |   356   (1)| 00:00:02 |     1 |    46 |
|*  3 |    TABLE ACCESS BY LOCAL INDEX ROWID| TEST_P4    |    36 |   396 |   356   (1)| 00:00:02 |     1 |    46 |
|*  4 |     INDEX RANGE SCAN                | TP4_CRE_DT |  1788 |       |    52   (0)| 00:00:01 |     1 |    46 |
----------------------------------------------------------------------------------------------------------------
Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
        676  consistent gets
          0  physical reads

Note the 676 consistent gets and the Pstart/Pstop of 1-46.

select count(*) from test_p4
where cre_datetime between to_date('01-OCT-2009','DD-MON-YYYY')
                   and     to_date('06-OCT-2009','DD-MON-YYYY')
and id> (select min(min_id)
             from tp4_range where MAX_CRE_DATI
                 >=to_date('01-OCT-2009','DD-MON-YYYY'))
and id < (select max(max_id)
             from tp4_range where MIN_CRE_DATI
                <=to_date('06-OCT-2009','DD-MON-YYYY'))
and num_2 = 5

  COUNT(*)
----------
        48

Execution Plan
------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                            | Name           | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                     |                |     1 |    16 |    66   (5)| 00:00:01 |       |       |
|   1 |  SORT AGGREGATE                      |                |     1 |    16 |            |          |       |       |
|   2 |   PARTITION RANGE ITERATOR           |                |     1 |    16 |    62   (5)| 00:00:01 |   KEY |   KEY |
|*  3 |    TABLE ACCESS BY LOCAL INDEX ROWID | TEST_P4        |     1 |    16 |    62   (5)| 00:00:01 |   KEY |   KEY |
|   4 |     BITMAP CONVERSION TO ROWIDS      |                |       |       |            |          |       |       |
|   5 |      BITMAP AND                      |                |       |       |            |          |       |       |
|   6 |       BITMAP CONVERSION FROM ROWIDS  |                |       |       |            |          |       |       |
|   7 |        SORT ORDER BY                 |                |       |       |            |          |       |       |
|*  8 |         INDEX RANGE SCAN             | TP4_PK         |  1788 |       |     2   (0)| 00:00:01 |   KEY |   KEY |
|   9 |          SORT AGGREGATE              |                |     1 |    12 |            |          |       |       |
|  10 |           TABLE ACCESS BY INDEX ROWID| TP4_RANGE      |    12 |   144 |     2   (0)| 00:00:01 |       |       |
|* 11 |            INDEX RANGE SCAN          | TP4R_MACD_MIAD |    12 |       |     1   (0)| 00:00:01 |       |       |
|  12 |          SORT AGGREGATE              |                |     1 |    13 |            |          |       |       |
|  13 |           TABLE ACCESS BY INDEX ROWID| TP4_RANGE      |    37 |   481 |     2   (0)| 00:00:01 |       |       |
|* 14 |            INDEX RANGE SCAN          | TP4R_MICD_MIID |    37 |       |     1   (0)| 00:00:01 |       |       |
|  15 |       BITMAP CONVERSION FROM ROWIDS  |                |       |       |            |          |       |       |
|  16 |        SORT ORDER BY                 |                |       |       |            |          |       |       |
|* 17 |         INDEX RANGE SCAN             | TP4_CRE_DT     |  1788 |       |    52   (0)| 00:00:01 |   KEY |   KEY |
------------------------------------------------------------------------------------------------------------------------

Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
        350  consistent gets
          0  physical reads

The plan is admittedly more complex and the SQL can be tricky to understand if you are not used to code that looks for overlaps of ranges. But consistent gets are down to 350 and the Pstart/Pstop values are KEY-KEY. The CBO cannot tell you WHAT the ranges will be when the code is parsed {I think it should check but, in 10.2.0.3 at least, the CBO is not that smart at parse time}, but it knows there will be a start and a stop.

I am using tiny partitions for my example and only 45 of them for the table. When the partitions are for millions of rows and there are a couple of thousand of them, excluding partitions in a manner you can rely on is a powerful tool. Which leads onto a final word of caution.

Nothing is enforcing that the range table is maintained.

You could do things with triggers or a regular re-calculation of the ranges table, but this is something you need to consider if you use ranges tables to help partition exclusion. Flipping tablespaces to read-only can help the worry go away though… :-)
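As a sketch of the re-calculation option {the WHERE clause limiting it to partitions that may have changed is something you would tune to your own system}:

-- re-derive the true date range for rows of the ranges table
update tp4_range r
set (min_cre_dati,max_cre_dati) =
    (select min(cre_datetime),max(cre_datetime)
     from test_p4 t
     where t.id between r.min_id and r.max_id)
where r.max_id > 40000 -- only refresh rows covering recently active partitions
/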

As an example of the need to maintain the ranges table, and a demonstration that the ranges table does work correctly if maintained, I’ll update a record well outside of the “expected window”, show that it does not appear in my range-check controlled code, then update the ranges table and try again.

update test_p4
set cre_datetime=to_date('02-OCT-2009 11:15','DD-MON-YYYY HH24:MI')
,num_2=5
where id=22100
/

select count(*) from test_p4
where cre_datetime between to_date('01-OCT-2009','DD-MON-YYYY')
                   and     to_date('06-OCT-2009','DD-MON-YYYY')
and id> (select min(min_id)
             from tp4_range where MAX_CRE_DATI>=to_date('01-OCT-2009','DD-MON-YYYY'))
and id < (select max(max_id)
             from tp4_range where MIN_CRE_DATI<=to_date('06-OCT-2009','DD-MON-YYYY'))
and num_2 = 5

  COUNT(*)
----------
        48
-- the updated record is missed, as the ranges table still holds the old values

-- extend the date range recorded for the partition holding ID 22100
update tp4_range
set max_cre_dati = to_date('02-OCT-2009 11:15','DD-MON-YYYY HH24:MI')
where 22100 between min_id and max_id
/

-- re-run the range-controlled count
select count(*) from test_p4
where cre_datetime between to_date('01-OCT-2009','DD-MON-YYYY')
                   and     to_date('06-OCT-2009','DD-MON-YYYY')
and id> (select min(min_id)
             from tp4_range where MAX_CRE_DATI>=to_date('01-OCT-2009','DD-MON-YYYY'))
and id < (select max(max_id)
             from tp4_range where MIN_CRE_DATI<=to_date('06-OCT-2009','DD-MON-YYYY'))
and num_2 = 5

  COUNT(*)
----------
        49
-- success

I wonder if I will do the next posting on this topic in less than two weeks!

Ensuring Correlated Partition Exclusion December 7, 2009

Posted by mwidlake in performance, Uncategorized, VLDB.
9 comments

<Previous Post…Next Post>
I’ve posted a couple of times recently about a tuning method with partition exclusion where you infer a relationship between the value you want to limit a query on and the partition key. It takes a while to explain the theory so I am going to give it a name, in the vain hope it catches on {forgive me if someone has already done this – in fact, just flame me with pointers if you already know a name for this}. At the very least, for my own postings, I can use this name from now on and link back to a single posting explaining it.

I’m going to call it Correlated Partition Exclusion. In essence, you have partitioned the table on a key, in my case the primary key ID, which is an ascending numeric probably sourced from a sequence.
You have a second column, in my case CRE_DATETIME, which increases in line with the PK. If you limit your query on the CRE_DATETIME, partition exclusion is not possible, as there is no guarantee which CRE_DATETIME values appear in which partition. But, as a human, you understand that if you create 10,000 records a day and you want to look at the last week’s data you can use:

WHERE ID > MAX_ID-(7*10000)

to exclude partitions holding IDs from more than 7 days’ worth of records ago.
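{As a concrete version of that pseudocode – the table name and the 10,000-a-day rate are illustrative:}

select count(*)
from orders
where cre_datetime > sysdate-7
and id > (select max(id) from orders) - (7*10000)
/
-- the subquery is a cheap MIN/MAX scan of the PK index, and the predicate
-- on ID, the partition key, is what allows partition exclusion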

So, you have your Correlated Partition Exclusion.

How can you be sure that going back 70,000 IDs is going to safely cover one week of data and how can you maximise your efficiency of including only partitions that cover the date range? {note, I am using a numeric ID as my partition range key and a datetime as my correlated column, this principle works just as well if you partition on datetime and want to correlate to a (generally) ascending numeric key}

Here is my test table :-

create table test_p3
(id number(10) not null
,cre_datetime date not null
,status number(1) not null
,num_1 number(4) not null -- random 20
,num_2 number(4) -- random 500
,num_3 number(5) -- cycle smoothly
,num_4 number(5) -- Random 10000
,vc_1 varchar2(10)
,vc_2 varchar2(10)
,vc_pad varchar2(2000))
tablespace users
partition by range (id)
(partition id_01k values less than (1000)
tablespace users
,partition id_02k values less than (2000)
tablespace users
,partition id_03k values less than (3000)
tablespace users
,partition id_04k values less than (4000)
tablespace users
...
,partition id_45k values less than (45000)
tablespace users
--
,partition id_max values less than (maxvalue)
tablespace users
)
/
--@ind_cols
IND_NAME           TAB_NAME           PSN       COL_NAME
------------------ ------------------ --------- --------------------
TP3_PK             TEST_P3            1         ID
TP_CRE_DT          TEST_P3            1         CRE_DATETIME

If I want to look for all records between two dates I could use code like the below (based on that suggested by Bernard Polarski; any mistakes are mine).

with get_min_id as
(select max(id) min_id from test_p3
 where cre_datetime < TO_DATE('18-OCT-2009','DD-MON-YYYY') )
,get_max_id as
(select min(id) max_id from test_p3
 where cre_datetime > TO_DATE('20-OCT-2009','DD-MON-YYYY') )
select count(*)
from test_p3
,get_min_id
,get_max_id
where id > get_min_id.min_id
and id < get_max_id.max_id
and cre_datetime between TO_DATE('18-OCT-2009','DD-MON-YYYY')
                 and     TO_DATE('20-OCT-2009','DD-MON-YYYY')
/
  COUNT(*)
----------
       721

1 row selected.

i.e. find the highest ID from just before the start date and the lowest ID from just after the end date, and query between them.

This works fine. Unfortunately, you get a plan like the below

------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                           | Name             | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |                  |     1 |    39 |   887   (4)| 00:00:04 |       |       |
|   1 |  SORT AGGREGATE                     |                  |     1 |    39 |            |          |       |       |
|   2 |   NESTED LOOPS                      |                  |     2 |    78 |   887   (4)| 00:00:04 |       |       |
|   3 |    NESTED LOOPS                     |                  |     1 |    26 |   717   (5)| 00:00:03 |       |       |
|   4 |     VIEW                            |                  |     1 |    13 |   505   (5)| 00:00:02 |       |       |
|*  5 |      FILTER                         |                  |       |       |            |          |       |       |
|   6 |       SORT AGGREGATE                |                  |     1 |    26 |            |          |       |       |
|*  7 |        VIEW                         | index$_join$_001 | 38393 |   974K|   505   (5)| 00:00:02 |       |       |
|*  8 |         HASH JOIN                   |                  |       |       |            |          |       |       |
|   9 |          PARTITION RANGE ALL        |                  | 38393 |   974K|   418   (6)| 00:00:02 |     1 |    46 |
|* 10 |           INDEX RANGE SCAN          | TP_CRE_DT        | 38393 |   974K|   418   (6)| 00:00:02 |     1 |    46 |
|  11 |          PARTITION RANGE ALL        |                  | 38393 |   974K|   152   (4)| 00:00:01 |     1 |    46 |
|  12 |           INDEX FAST FULL SCAN      | TP3_PK           | 38393 |   974K|   152   (4)| 00:00:01 |     1 |    46 |
|  13 |     VIEW                            |                  |     1 |    13 |   212   (5)| 00:00:01 |       |       |
|  14 |      SORT AGGREGATE                 |                  |     1 |    26 |            |          |       |       |
|* 15 |       VIEW                          | index$_join$_002 |  2972 | 77272 |   212   (5)| 00:00:01 |       |       |
|* 16 |        HASH JOIN                    |                  |       |       |            |          |       |       |
|  17 |         PARTITION RANGE ALL         |                  |  2972 | 77272 |    59   (6)| 00:00:01 |     1 |    46 |
|* 18 |          INDEX RANGE SCAN           | TP_CRE_DT        |  2972 | 77272 |    59   (6)| 00:00:01 |     1 |    46 |
|  19 |         PARTITION RANGE ALL         |                  |  2972 | 77272 |   152   (4)| 00:00:01 |     1 |    46 |
|  20 |          INDEX FAST FULL SCAN       | TP3_PK           |  2972 | 77272 |   152   (4)| 00:00:01 |     1 |    46 |
|  21 |    PARTITION RANGE ITERATOR         |                  |     2 |    26 |   170   (1)| 00:00:01 |   KEY |   KEY |
|* 22 |     TABLE ACCESS BY LOCAL INDEX ROWID| TEST_P3         |     2 |    26 |   170   (1)| 00:00:01 |   KEY |   KEY |
|* 23 |      INDEX RANGE SCAN               | TP_CRE_DT        |   707 |       |    48   (0)| 00:00:01 |   KEY |   KEY |
------------------------------------------------------------------------------------------------------------------------

Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
        892  consistent gets
          0  physical reads

If you look at the plan {towards the end} there are two checks on the TP_CRE_DT index that scan all partitions of the index – Pstart/Pstop are 1-46. This is the CBO looking across the whole table for the partitions in which the stated CRE_DATETIME records occur. The run took 892 consistent gets, much of which is the checking of local index partitions that will hold no relevant entries.

Bernard also spotted that the code was considering all partitions and was not as efficient as it could be, but did not know how to get around it except with a Global index, which brings its own issues.

One way is to just say “well, I know how many records per day I get, so I will fake up the ID limits based on this”. The problem, and this is why the CBO cannot make the decision for you, is that there is no guarantee, no fixed rule, saying that CRE_DATETIME will always increment with the ID. In fact, there is nothing stopping you altering a record which has an ID from yesterday to have a CRE_DATETIME from 10 years ago {but forget I said that until tomorrow’s post}. Now, in my example the CRE_DATETIME is going to increment with ID. We will use this special case for now; I know it is flawed and will address that flaw {tomorrow}.

So, to ensure you do not miss data, you end up making your ID window pretty large. Say you want all records for the last day and you process 1,500 records a day: you consider a window covering all records with an ID within the last 10,000. It will still be more efficient than scanning all partitions and should be safe. Fairly safe.

The way out of this is to have a ranges table: something that tells you, for each partition, the minimum and maximum CRE_DATETIME and the IDs covered by that range. You can then use that to identify the partitions that cover the date range you are interested in.

Here is an example. I will create a simple table to hold the ranges:

create table tp3_range
(min_cre_dati date
,max_cre_dati date
,min_id number
,max_id number)

Now you have to populate it.
The below code will create the first record for the first partition.

insert into tp3_range
SELECT MIN(CRE_DATETIME)
,MAX(CRE_DATETIME)
,MIN(ID)
,MAX(ID) FROM TEST_P3 PARTITION (ID_01K)
/
Execution Plan
--------------------------------------------------------------------------------------------------
| Id  | Operation               | Name    | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
--------------------------------------------------------------------------------------------------
|   0 | INSERT STATEMENT        |         |     1 |    12 |   102   (1)| 00:00:01 |       |       |
|   1 |  SORT AGGREGATE         |         |     1 |    12 |            |          |       |       |
|   2 |   PARTITION RANGE SINGLE|         |   999 | 11988 |   102   (1)| 00:00:01 |     1 |     1 |
|   3 |    TABLE ACCESS FULL    | TEST_P3 |   999 | 11988 |   102   (1)| 00:00:01 |     1 |     1 |
--------------------------------------------------------------------------------------------------

Statistics
----------------------------------------------------------
          1  recursive calls
          3  db block gets
        195  consistent gets
          0  physical reads

As you can see from the plan and cost, this is not very efficient as it has to scan the whole partition. Maybe not a problem if you do this once, but there are indexes on both these columns – can’t this be done more efficiently? Yes, if you split the code up into four in-line selects (if you want more details about why it is more performant to do MIN and MAX on their own rather than in one statement, see this post on the topic):

insert into tp3_range
SELECT (SELECT MIN(CRE_DATETIME) FROM TEST_P3 PARTITION (ID_04K))
,(SELECT MAX(CRE_DATETIME) FROM TEST_P3 PARTITION (ID_04K))
,(SELECT MIN(id ) FROM TEST_P3 PARTITION (ID_04K))
,(SELECT MAX(ID ) FROM TEST_P3 PARTITION (ID_04K))
FROM DUAL

PLAN
---------------------------------------------------------
SORT AGGREGATE
  PARTITION RANGE SINGLE cst:2 rws:1000
    INDEX FULL SCAN (MIN/MAX) TP_CRE_DT cst:2 rws:1000

Statistics
----------------------------------------------------------
          1  recursive calls
          5  db block gets
          9  consistent gets
          0  physical reads

{You may wonder why the Explain Plan section above has a different look. This is because there seems to be a bug in 10.2.0.3 where the autotrace plan for the insert statement comes out wrong, as a FAST DUAL access, so I had to Explain the statement in a different way}

You would run one of the above statements against each partition. Probably, the “Best Practice” way would be to generate a SQL script from a query against DBA_TAB_PARTITIONS.
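Something like the below would generate the script {a sketch – spool the output and run it}:

select 'insert into tp3_range'
     ||' select (select min(cre_datetime) from test_p3 partition ('||partition_name||'))'
     ||',(select max(cre_datetime) from test_p3 partition ('||partition_name||'))'
     ||',(select min(id) from test_p3 partition ('||partition_name||'))'
     ||',(select max(id) from test_p3 partition ('||partition_name||'))'
     ||' from dual;'
from dba_tab_partitions
where table_owner = user
and table_name = 'TEST_P3'
order by partition_position
/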

To populate my table I cheated – I just assumed all my partitions are of the same size (1000 records) and used:-

insert into tp3_range
select min(cre_datetime),max(cre_datetime),min(id),max(id)
from test_p3
-- where id between 10000 and 20000
group by trunc(id/1000)
/

Note I left in a commented line. You could run the above against the whole table and then limit it to just new partitions as you add them. This is a tad risky though, you are relying on the partitioning being perfect and it would not scale to hundreds of very large partitions. You would be better off with the partition-by-partition methods above.

You end up with a table like the below:-

select * from tp3_range order by min_cre_dati


MIN_CRE_DATI      MAX_CRE_DATI          MIN_ID     MAX_ID
----------------- ----------------- ---------- ----------
01-JUL-2009 00:04 03-JUL-2009 18:36          1        999
03-JUL-2009 18:40 06-JUL-2009 13:16       1000       1999
06-JUL-2009 13:20 09-JUL-2009 07:56       2000       2999
09-JUL-2009 08:00 12-JUL-2009 02:36       3000       3999
12-JUL-2009 02:40 14-JUL-2009 21:16       4000       4999
14-JUL-2009 21:20 17-JUL-2009 15:56       5000       5999
17-JUL-2009 16:00 20-JUL-2009 10:36       6000       6999
20-JUL-2009 10:40 23-JUL-2009 05:16       7000       7999
23-JUL-2009 05:20 25-JUL-2009 23:56       8000       8999
26-JUL-2009 00:00 28-JUL-2009 18:36       9000       9999
28-JUL-2009 18:40 31-JUL-2009 13:16      10000      10999
31-JUL-2009 13:20 03-AUG-2009 07:56      11000      11999
...
14-OCT-2009 13:20 17-OCT-2009 07:56      38000      38999
17-OCT-2009 08:00 20-OCT-2009 02:36      39000      39999
20-OCT-2009 02:40 22-OCT-2009 21:16      40000      40999
22-OCT-2009 21:20 25-OCT-2009 15:56      41000      41999
25-OCT-2009 16:00 28-OCT-2009 10:36      42000      42999

Add the two indexes below:
create index tp3r_micd_miid on tp3_range(min_cre_dati,min_id);
create index tp3r_macd_miad on tp3_range(max_cre_dati,max_id);

And now you can find the upper and lower ID bounds for a date range with the following code:

select min_id from tp3_range
where min_cre_dati = (select max(min_cre_dati)
                      from tp3_range
                      where min_cre_dati < TO_DATE('18-OCT-2009','DD-MON-YYYY')
                     )
/
    MIN_ID
----------
     39000

Execution Plan
---------------------------------------------------------
| Id  | Operation                     | Name           | Rows  | Bytes | Cost 
---------------------------------------------------------
|   0 | SELECT STATEMENT              |                |     1 |    22 |     2
|*  1 |  INDEX RANGE SCAN             | TP3R_MICD_MIID |     1 |    22 |     1
|   2 |   SORT AGGREGATE              |                |     1 |     9 |        
|   3 |    FIRST ROW                  |                |    40 |   360 |     1
|*  4 |     INDEX RANGE SCAN (MIN/MAX)| TP3R_MICD_MIID |    40 |   360 |     1
---------------------------------------------------------
Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
          3  consistent gets
          0  physical reads

select max_id from tp3_range
where max_cre_dati = (select min(max_cre_dati)
                      from tp3_range
                      where max_cre_dati > to_date('20-OCT-2009','DD-MON-YYYY')
                     )
/
    MAX_ID
----------
     39999

Execution Plan
----------------------------------------------------------
| Id  | Operation                     | Name           | Rows  | Bytes | Cost 
----------------------------------------------------------
|   0 | SELECT STATEMENT              |                |     1 |    22 |     2
|*  1 |  INDEX RANGE SCAN             | TP3R_MACD_MIAD |     1 |    22 |     1
|   2 |   SORT AGGREGATE              |                |     1 |     9 |        
|   3 |    FIRST ROW                  |                |     4 |    36 |     1
|*  4 |     INDEX RANGE SCAN (MIN/MAX)| TP3R_MACD_MIAD |     4 |    36 |     1
----------------------------------------------------------
Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
          3  consistent gets
          0  physical reads

Put it all together into one statement and let’s see how much it costs:

with get_min_id as
(select min_id from tp3_range
 where min_cre_dati = (select max(min_cre_dati)
                       from tp3_range
                       where min_cre_dati < TO_DATE('18-OCT-2009','DD-MON-YYYY')
                      ) )
,get_max_id as
(select max_id from tp3_range
 where max_cre_dati = (select min(max_cre_dati)
                       from tp3_range
                       where max_cre_dati > to_date('20-OCT-2009','DD-MON-YYYY')
                      ) )
select count(*)
from test_p3
,get_min_id
,get_max_id
where id > get_min_id.min_id
and id < get_max_id.max_id
and cre_datetime between TO_DATE('18-OCT-2009','DD-MON-YYYY')
                 and     TO_DATE('20-OCT-2009','DD-MON-YYYY')

  COUNT(*)
----------
       721
1 row selected.

Execution Plan
------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                            | Name           | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                     |                |     1 |    57 |    67   (5)| 00:00:01 |       |       |
|   1 |  SORT AGGREGATE                      |                |     1 |    57 |            |          |       |       |
|   2 |   TABLE ACCESS BY LOCAL INDEX ROWID  | TEST_P3        |     2 |    26 |    65   (5)| 00:00:01 |       |       |
|   3 |    NESTED LOOPS                      |                |     2 |   114 |    65   (5)| 00:00:01 |       |       |
|   4 |     MERGE JOIN CARTESIAN             |                |     1 |    44 |     2   (0)| 00:00:01 |       |       |
|*  5 |      INDEX RANGE SCAN                | TP3R_MICD_MIID |     1 |    22 |     1   (0)| 00:00:01 |       |       |
|   6 |       SORT AGGREGATE                 |                |     1 |     9 |            |          |       |       |
|   7 |        FIRST ROW                     |                |    40 |   360 |     1   (0)| 00:00:01 |       |       |
|*  8 |         INDEX RANGE SCAN (MIN/MAX)   | TP3R_MICD_MIID |    40 |   360 |     1   (0)| 00:00:01 |       |       |
|   9 |      BUFFER SORT                     |                |     1 |    22 |     1   (0)| 00:00:01 |       |       |
|* 10 |       INDEX RANGE SCAN               | TP3R_MACD_MIAD |     1 |    22 |     1   (0)| 00:00:01 |       |       |
|  11 |        SORT AGGREGATE                |                |     1 |     9 |            |          |       |       |
|  12 |         FIRST ROW                    |                |     4 |    36 |     1   (0)| 00:00:01 |       |       |
|* 13 |          INDEX RANGE SCAN (MIN/MAX)  | TP3R_MACD_MIAD |     4 |    36 |     1   (0)| 00:00:01 |       |       |
|  14 |     PARTITION RANGE ITERATOR         |                |       |       |            |          |   KEY |   KEY |
|  15 |      BITMAP CONVERSION TO ROWIDS     |                |       |       |            |          |       |       |
|  16 |       BITMAP AND                     |                |       |       |            |          |       |       |
|  17 |        BITMAP CONVERSION FROM ROWIDS |                |       |       |            |          |       |       |
|  18 |         SORT ORDER BY                |                |       |       |            |          |       |       |
|* 19 |          INDEX RANGE SCAN            | TP3_PK         |   707 |       |     6   (0)| 00:00:01 |   KEY |   KEY |
|  20 |        BITMAP CONVERSION FROM ROWIDS |                |       |       |            |          |       |       |
|  21 |         SORT ORDER BY                |                |       |       |            |          |       |       |
|* 22 |          INDEX RANGE SCAN            | TP_CRE_DT      |   707 |       |    48   (0)| 00:00:01 |   KEY |   KEY |
------------------------------------------------------------------------------------------------------------------------

Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
        133  consistent gets
          0  physical reads

If you look way back up this post, you will see that the number of records selected by the above is the same as from the code suggested by Bernard (721 records), and for a lot less cost – 133 consistent gets compared to 892.

So I have hopefully explained the issue (having to visit all the index partitions to check the date range), shown how a ranges table can help, given some simple and a less-simple-but-more-efficient examples of code to populate the table and finally shown that using the range table can be more efficient.

And I arrogantly gave it a name – Correlated Partition Exclusion :-)

Some of you may remember that this example is a special case, as there is no overlap between the dates; the relationship between the CRE_DATETIME and ID is perfect. You are unlikely to have that in real life. Also, some of you may be tired of reading this post by now. So I will cover the general case in the NEXT post.

More on Assisting Partition Exclusion December 6, 2009

Posted by mwidlake in performance, VLDB.
2 comments

A couple of weeks ago I posted about how you could help the Oracle CBO carry out partition exclusion and thus aid performance. In essence, you have a table partitioned on ID (based on an ascending numeric) and you also have a date column (CRE_DATETIME, indexed using a locally partitioned index), which you often want to limit queries on.
The ID and the CRE_DATETIME both increase over time, so as a human you can infer that a date-created range will match to an ID range that you can limit any queries on and get partition exclusion. The CBO does not understand this relationship and so can’t help. If you want to look at records for the last week and you know that you generate 10,000 records a day maximum, you can add a where clause of {and this is pseudocode}:

WHERE ID BETWEEN (MAX_ID-(10000*7))
AND (MAX_ID)

I’m afraid you will have to look back at the referenced post if this does not make sense.

The issue I want to address is: how can you be sure that the inferred ID range you are using covers the date range you want?

A simple real-world example of this is when someone (a business objects user or helpdesk user seems to be the usual source) wants to look up the details of a fairly new customer on the datawarehouse. They don’t know the ID of the record, but they know the new account was created this week and some other filtering data. So they run a query saying:

WHERE CRE_DATETIME > SYSDATE-7
AND COL_A = 'XXXX'
AND COL_B = 'YYYY'

This visits every partition in the table, even if there is a locally partitioned index on CRE_DATETIME. See this post for details of this problem. If you are really unlucky, and this seems to be the usual situation, there is no index on the CRE_DATETIME either, and a full scan of the billion-row table is initiated, tying up one CPU and giving that expansive {sorry, typo, expensive – though both are true} storage sub-system something to work on for a few hours.

However, in this situation, do this. Select max(ID) {the primary key} from the partitioned table, which will take less time than it does to type “select”. Then you say to the user:

 “OK, the max ID is currently 987,000,000. Add this to the WHERE clause:

AND ID > 980000000

If that does not find your record now change that additional clause to:

AND ID BETWEEN 970000000 AND 980000000

and just keep going down by 10 million each time.”

By doing this, the user will be limiting the query on the partition key, the ID, so partition exclusion will be possible and each query will come back quickly. The user will usually find the record they want in the first couple of attempts {or will be kept quiet and less demanding of the system until they get bored and come back to tell you “you are an idiot”, having scanned back over several tens of millions of records – but then you know they lied about the record being recent, so you need to help them in some other way}.
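{The conversation above, as the two statements involved – table and column names are illustrative:}

select max(id) from orders;  -- near-instant MIN/MAX scan of the PK index

select * from orders
where cre_datetime > sysdate-7
and col_a = 'XXXX'
and col_b = 'YYYY'
and id between 970000000 and 980000000;  -- each probe touches only a few partitions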

This post is actually to remind you of where I had got to on this thread, me being so unforgivably slow in keeping it running. Tomorrow (HONEST!) I will cover HOW you ensure your inferred ID range is covering the CRE_DATETIME range you are interested in.

Friday Philosophy – Statistically Significant November 27, 2009

Posted by mwidlake in performance, VLDB.
3 comments

There are very few generalist Oracle DBAs around, and very very very few expert generalist Oracle DBAs. Tom Kyte might count, but I’m sure if you pressed him he would admit to knowing nothing about {some key aspect of being a general DBA}. That is because to be an expert on something you have to spend a lot of time on it. Oh, and learn. {I’ve spent a lot of time around my wife but she still confuses me, I just don’t seem to be able to learn}. Oracle is simply too wide a topic to be able to spend a lot of time on all of it, even if by “Oracle” you mean just the core RDBMS technology, which I do. You have to pick an area.

Pete Finnigan specialises in security, Julian Dyke on RAC and Internals, Doug Burns on the relationship between the database and cuddly toys. Oh and ASH/AWR.

So I ask myself, what is my specialty?

Well, if I go on the last 5 working years of my life, it would probably be Oracle Database Statistics. Which is quite ironic given my woeful attempts with Statistics when I tried that maths ‘A’ level all those years back {for non-UK people, an ‘A’ level is what you do at age 17- 18. It’s that point in maths when most of us are convinced logic is replaced by magic}. I’ll go further and say I specialise in Oracle Database Statistics on VLDBS. Maybe even more specific, Gathering and Maintaining Oracle Database Statistics on VLDBs.

Not a very sexy-sounding specialty is it, even in the context of IT technical specialties. I am sure that if I was to tell a lady I wish to impress that I was “a specialist in gathering and maintaining Oracle database statistics on VLDBs” then I would soon be standing alone as she looked for an accountant to talk to {I refer back to my comment on my wife, I repeatedly try to impress her with this line and she never fails to walk away}. Heck, I could spend all my time at the UKOUG Conference next week and struggle to find someone who would respond positively to such a claim to greatness.

But the situation is that I have had to deal with the failures and idiosyncrasies of this area of Oracle for 4 major clients on a dozen or so systems, and have discussed it with half a dozen other people who have had challenging times with it. And even now it trips me up all the time. Because, frankly, some of it is not very well implemented (choking automated stats gathering job, anyone?), different parts work in different ways (if you state a statid, statown and stattab when you SET_TABLE_STATS, the values go into the stats table; if you state them for GATHER_TABLE_STATS, the gathered values go into the data dictionary and the OLD ones go into the stats table – yeah, I know, if you did not know that you are now going “huh?”), and some of it is wrapped up in a blanket of confusion and secrecy (what exactly DOES the automated stats job do, and when you say “GATHER AUTO” what exactly will Oracle do automatically?).
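{To make that SET/GATHER asymmetry concrete – a sketch from memory of 10g behaviour, so check it on your own version; T1 and MY_STATTAB are made-up names:}

-- the stated values go into the stats table MY_STATTAB, not the dictionary
exec dbms_stats.set_table_stats(ownname=>user, tabname=>'T1', -
  numrows=>1000000, stattab=>'MY_STATTAB', statid=>'RUN1', statown=>user)

-- the gathered values go into the dictionary, and the OLD dictionary
-- stats are what end up in MY_STATTAB
exec dbms_stats.gather_table_stats(ownname=>user, tabname=>'T1', -
  stattab=>'MY_STATTAB', statid=>'RUN1', statown=>user)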

Thankfully the secrecy side is reducing as Oracle (and others) say more about how some of these things are controlled or decided “automatically”, but then the world shifts under your feet with new versions of Oracle. Stats gathering under 11g is far more open to control than under 10, but as my current client is not on 11g I can’t spend too long looking at that.

So currently I am an expert in 10g Gathering and Maintaining Oracle Statistics on VLDBs. Now if that is not pretty damned specialist then I don’t know what is.

I should blog technical details on all of this {sadly I know the answers to the things I mention above}, but I suspect people would go “Hmmm, interesting…..” in that kind of “who in heck needs to know that! I’m off to do block dumps of temporary segments” way. But if you think otherwise, let me know.

Besides, I am being very, very poor at getting on with the Partition stuff I want to do, even though I have half-written about 4 more posts on it. I need to stick to that for my technical blogs for now.

That, and I kind of shy away from being an expert in such an unexciting area – I might get offered {oh dear lord, not more} work in it.

Thankfully I also specialise in beta testing new features of Oracle for clients in the travel and media industries, where on-site work in exotic locations is, at times, required… If anyone has any opening in that field, drop me a line :-)

Assisting Partition Exclusion – Partitions for Performance November 23, 2009

Posted by mwidlake in performance, VLDB.
10 comments

<Previous Post…Next Post>
Partitions are often used to help increase the performance of SQL select activity via the concept of partition exclusion – the CBO will be able to identify which of the partitions could hold the data of interest and ignore the others. This is their main benefit from a SQL Select performance perspective.

The problem is, the partitioning key is not always included in your SQL statements.

I discussed this potential negative impact of partitioning on SQL select performance in this post and this one also. I did not give any proof then, so the first half of this post provides that.

As an example, I have created a table with 45 partitions, partitioned on the column ID, an ascending numeric primary key (traditionally sourced from a sequence). There is an index on the ID column, locally partitioned. The table also has a column CRE_DATETIME, which is the date the record was created. This column is also indexed with a locally partitioned index.

NAME                           VALUE
------------------------------ -------------
compatible                     10.2.0.3.0
cpu_count                      8
db_block_size                  8192
db_file_multiblock_read_count  16
optimizer_mode                 ALL_ROWS
sga_target                     0
sort_area_size                 65536

create table test_p3
(id           number(10) not null
,cre_datetime date not null
,status        number(1) not null
,num_1         number(4) not null -- random 20
,num_2         number(4) -- random 500
,num_3         number(5) -- cycle smoothly
,num_4         number(5) -- Random 10000
,vc_1          varchar2(10)
,vc_2          varchar2(10)
,vc_pad        varchar2(2000))
tablespace users
partition by range (id)
(partition id_01k values less than (1000)
tablespace users
,partition id_02k values less than (2000)
tablespace users
,partition id_03k values less than (3000)
tablespace users
...
,partition id_45k values less than (45000)
tablespace users
--
,partition id_max values less than (maxvalue)
tablespace users
)
/
select table_name,partitioning_type p_type
,partition_count pt_count
from dba_part_tables
where owner=user and table_name = 'TEST_P3'
TABLE_NAME                     P_TYPE    PT_COUNT
------------------------------ ------- ----------
TEST_P3                        RANGE           46

>@ind_cols
IND_NAME           TAB_NAME           PSN       COL_NAME
------------------ ------------------ --------- ------------
TP3_PK             TEST_P3            1         ID
TP_CRE_DT          TEST_P3            1         CRE_DATETIME

{This is a very typical scenario, as is the opposite situation where the table is partitioned on date but has a numeric column holding e.g. order number or customer ID.}

The below shows a select of a single record by ID.

select id,cre_datetime,num_1,vc_1
from test_p3 where id=37123
/
        ID CRE_DATETIME           NUM_1 VC_1
---------- ----------------- ---------- ----------
     37123 12-OCT-2009 02:52         19 RMZAN
1 row selected.

Execution Plan
--------------------------------------------------------------------------------------------------------------
| Id  | Operation                          | Name    | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
--------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                   |         |     1 |    22 |     2   (0)| 00:00:01 |       |       |
|   1 |  PARTITION RANGE SINGLE            |         |     1 |    22 |     2   (0)| 00:00:01 |    38 |    38 |
|   2 |   TABLE ACCESS BY LOCAL INDEX ROWID| TEST_P3 |     1 |    22 |     2   (0)| 00:00:01 |    38 |    38 |
|*  3 |    INDEX UNIQUE SCAN               | TP3_PK  |     1 |       |     1   (0)| 00:00:01 |    38 |    38 |
--------------------------------------------------------------------------------------------------------------
Statistics
----------------------------------------------------------
          1  recursive calls
          0  db block gets
          3  consistent gets
          0  physical reads

Note the PARTITION RANGE SINGLE and that Pstart/Pstop are both 38. Consistent gets are only 3. The query is very efficient as the CBO identified that records matching the WHERE clause could only exist in the one partition. The relevant local index is examined for 2 consistent gets (the BLEVEL of the local index partitions is 1) and then there is one consistent get against the table.

Now I’ll scan for records for a range of IDs:

select id,cre_datetime,num_1,vc_1
from test_p3 where id between 30980 and 31019
/
        ID CRE_DATETIME           NUM_1 VC_1
---------- ----------------- ---------- ----------
     30980 25-SEP-2009 01:20         15 JQLYA
     30981 25-SEP-2009 01:24          5 WYWCR
     30982 25-SEP-2009 01:28         10 QZSHD
     30983 25-SEP-2009 01:32         18 KQBSU
     30984 25-SEP-2009 01:36          2 HRPMA
....
     31018 25-SEP-2009 03:52          3 YOPJO
     31019 25-SEP-2009 03:56         10 GNGKG

40 rows selected.

Execution Plan
--------------------------------------------------------------------------------------------------------------
| Id  | Operation                          | Name    | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
--------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                   |         |    40 |   880 |     9   (0)| 00:00:01 |       |       |
|   1 |  PARTITION RANGE ITERATOR          |         |    40 |   880 |     9   (0)| 00:00:01 |    31 |    32 |
|   2 |   TABLE ACCESS BY LOCAL INDEX ROWID| TEST_P3 |    40 |   880 |     9   (0)| 00:00:01 |    31 |    32 |
|*  3 |    INDEX RANGE SCAN                | TP3_PK  |     2 |       |     2   (0)| 00:00:01 |    31 |    32 |
--------------------------------------------------------------------------------------------------------------

Statistics
---------------------------------------------
          0  recursive calls
          0  db block gets
         52  consistent gets
          0  physical reads

Again you can see partition exclusion, this time as a PARTITION RANGE ITERATOR with Pstart/Pstop of 31 and 32. Consistent gets are higher at 52, but then I did fetch 40 records across two partitions.

Now a query is issued for a record with a specific value for CRE_DATETIME.

select id,cre_datetime,num_1,vc_1
from test_p3 where cre_datetime = 
   to_date('06-OCT-2009 12:20','DD-MON-YYYY HH24:MI')
/
        ID CRE_DATETIME           NUM_1 VC_1
---------- ----------------- ---------- ----------
     35105 06-OCT-2009 12:20         11 JPQHO
1 row selected.

Execution Plan
----------------------------------------------------------
----------------------------------------------------------------------------------------------------------------
| Id  | Operation                          | Name      | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
----------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                   |           |     1 |    22 |    48   (0)| 00:00:01 |       |       |
|   1 |  PARTITION RANGE ALL               |           |     1 |    22 |    48   (0)| 00:00:01 |     1 |    46 |
|   2 |   TABLE ACCESS BY LOCAL INDEX ROWID| TEST_P3   |     1 |    22 |    48   (0)| 00:00:01 |     1 |    46 |
|*  3 |    INDEX RANGE SCAN                | TP_CRE_DT |     1 |       |    47   (0)| 00:00:01 |     1 |    46 |
----------------------------------------------------------------------------------------------------------------

Statistics
---------------------------------------
          1  recursive calls
          0  db block gets
         90  consistent gets
          0  physical reads

Notice that partition exclusion does not occur: you see PARTITION RANGE ALL and Pstart/Pstop of 1 to 46. The CBO cannot identify which partitions may or may not hold records with that CRE_DATETIME. Consistent gets are now a lot higher at 90 (compared to 3 where partition exclusion could occur), roughly two gets for each of the 46 local index partitions, as Oracle has to probe each and every one to identify if any records for the given value exist.

Of course, you probably would not look for a record with a specific datetime, but in the opposite scenario of having partitioned on datetime, it would not be at all unusual to look for a record with a given ID, and again partition exclusion would not be possible.
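{To continue my earlier sketch with the hypothetical date-partitioned table test_p3_dt, the equivalent problem query would be the below; expect a PARTITION RANGE ALL and a probe of every local index partition:}

select id,cre_datetime,num_1,vc_1
from test_p3_dt
where id=37123
/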

As I mentioned earlier, this need to scan all of the local indexes when no partition exclusion is possible is something I discussed a couple of weeks back.

Now I will select for the range of dates that my range scan in ID identified:

select id,cre_datetime,num_1,vc_1
from test_p3 
where cre_datetime between to_date('25-SEP-2009 01:20','DD-MON-YYYY HH24:MI')
                   and     to_date('25-SEP-2009 03:56','DD-MON-YYYY HH24:MI')

        ID CRE_DATETIME           NUM_1 VC_1
---------- ----------------- ---------- ----------
     30980 25-SEP-2009 01:20          3 BGAFO
     30981 25-SEP-2009 01:24         15 PXGJD
     30982 25-SEP-2009 01:28         12 PGQHJ
     30983 25-SEP-2009 01:32         17 TIBZG
     30984 25-SEP-2009 01:36         11 EQQQV
...
     31018 25-SEP-2009 03:52         18 FSNVI
     31019 25-SEP-2009 03:56         16 LJWNO
40 rows selected.

Execution Plan
----------------------------------------------------------
----------------------------------------------------------------------------------------------------------------
| Id  | Operation                          | Name      | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
----------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                   |           |    42 |   924 |    54   (0)| 00:00:01 |       |       |
|   1 |  PARTITION RANGE ALL               |           |    42 |   924 |    54   (0)| 00:00:01 |     1 |    46 |
|   2 |   TABLE ACCESS BY LOCAL INDEX ROWID| TEST_P3   |    42 |   924 |    54   (0)| 00:00:01 |     1 |    46 |
|*  3 |    INDEX RANGE SCAN                | TP_CRE_DT |    42 |       |    47   (0)| 00:00:01 |     1 |    46 |
----------------------------------------------------------------------------------------------------------------

Statistics
----------------------------------------------------------
          1  recursive calls
          0  db block gets
        136  consistent gets
          0  physical reads

That actually selects back the same records, but now a PARTITION RANGE ALL is invoked, Pstart/Pstop is 1-46 and the number of consistent gets goes up to 136.

I discussed the impact of partitions on range scans in this second post about performance issues with partitions.

Thankfully, you can often alleviate this problem of considering all partitions.

The below shows a scan for records for a range of dates.

select count(*)
from test_p3
where cre_datetime between to_date('18-OCT-2009','DD-MON-YYYY')
                   and     to_date('20-OCT-2009','DD-MON-YYYY')

  COUNT(*)
----------
       721
1 row selected.

Execution Plan
----------------------------------------------------------
--------------------------------------------------------------------------------------------------
| Id  | Operation            | Name      | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
--------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |           |     1 |     8 |    49   (0)| 00:00:01 |       |       |
|   1 |  SORT AGGREGATE      |           |     1 |     8 |            |          |       |       |
|   2 |   PARTITION RANGE ALL|           |   731 |  5848 |    49   (0)| 00:00:01 |     1 |    46 |
|*  3 |    INDEX RANGE SCAN  | TP_CRE_DT |   731 |  5848 |    49   (0)| 00:00:01 |     1 |    46 |
--------------------------------------------------------------------------------------------------

Statistics
----------------------------------------------------------
          1  recursive calls
          0  db block gets
         90  consistent gets
          0  physical reads

As you can see, no partition exclusion is possible; every partition is considered (PARTITION RANGE ALL, Pstart/Pstop of 1-46) and 90 consistent gets are required.

The trick is to introduce a second WHERE clause, limiting the query by the partitioning key as well.
As a human, you can recognise that the ID and the CRE_DATETIME columns are going to increase pretty much in synchronisation. Records with a CRE_DATETIME of a month ago are going to be found in partitions with a lower ID range than those from a week ago. Oracle does not know this business logic, so you have to tell it.

Let us say you never have more than 1,000 records created a day. So if you want “today’s” data, you need only consider IDs between the max(ID) and the max(ID) less 1,000. To be safe, you would increase this range substantially. The idea is to exclude most partitions without risking missing data:

select count(*)
from test_p3
where cre_datetime between to_date('18-OCT-2009','DD-MON-YYYY')
                   and     to_date('20-OCT-2009','DD-MON-YYYY')
and  id between 39000 and 41000

  COUNT(*)
----------
       721
1 row selected.

Execution Plan
----------------------------------------------------------
----------------------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name             | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
----------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |                  |     1 |    13 |    14   (8)| 00:00:01 |       |       |
|   1 |  SORT AGGREGATE             |                  |     1 |    13 |            |          |       |       |
|*  2 |   VIEW                      | index$_join$_001 |    34 |   442 |    14   (8)| 00:00:01 |       |       |
|*  3 |    HASH JOIN                |                  |       |       |            |          |       |       |
|   4 |     PARTITION RANGE ITERATOR|                  |    34 |   442 |     7  (15)| 00:00:01 |    40 |    42 |
|*  5 |      INDEX RANGE SCAN       | TP_CRE_DT        |    34 |   442 |     7  (15)| 00:00:01 |    40 |    42 |
|   6 |     PARTITION RANGE ITERATOR|                  |    34 |   442 |     9  (23)| 00:00:01 |    40 |    42 |
|*  7 |      INDEX RANGE SCAN       | TP3_PK           |    34 |   442 |     9  (23)| 00:00:01 |    40 |    42 |
----------------------------------------------------------------------------------------------------------------

Statistics
----------------------------------------------------------
          1  recursive calls
          0  db block gets
         18  consistent gets
          0  physical reads

In the above I have managed to exclude most of the partitions and still get the correct data, by adding the clause and id between 39000 and 41000. There is something a little unusual in the plan {if you have not seen it before}: the two indexes are range scanned and the results hash-joined as an index join, but both scans are PARTITION RANGE ITERATORs with Pstart/Pstop values of 40-42. Consistent gets are down to 18.
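If you would rather not hard-code the ID range, you can derive it at run time. The below is just a sketch of the idea {my addition, assuming IDs only ever ascend with time and that 10,000 is far more records than are ever created over the period of interest}:

select count(*)
from test_p3
where cre_datetime between to_date('18-OCT-2009','DD-MON-YYYY')
                   and     to_date('20-OCT-2009','DD-MON-YYYY')
and  id > (select max(id)-10000 from test_p3)
/

The MAX(ID) subquery against the partitioning key is relatively cheap, and the pruning then happens at run time rather than parse time, so expect to see KEY rather than fixed partition numbers in the Pstart/Pstop columns. Do test this on your own system before relying on it.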

You do have to take great care with this method: the extra WHERE clause you add on the partitioning key must never exclude records you want to find. But so long as you ensure that, it is a very powerful technique.

I’ll go into that a little more in the next posting {which hopefully will not be as delayed as this one!}.

Depth of Indexes on VLDBs November 18, 2009

Posted by mwidlake in performance, Uncategorized, VLDB.
Tags: , ,
3 comments

In a couple of recent posts on partitions, on single row access and on range scan access, I have started exploring the performance of partitions. I will continue on the main theme soon, honest.

I feel a little bad that I have not given concrete examples so far {creating very large objects for the sake of demonstrations and proof-of-concepts is a bit of an issue with VLDBs, due to the space and time requirements to do it}. So, here is just some real-world information on index depth, or BLEVEL of indexes on Very Large DataBases.

The BLEVEL is the number of levels in the index. Minus one. See this post for a full description, but in brief, if an index has a BLEVEL of 3, that means there is a root node, a level of branch nodes, a second level of branch nodes and then the leaf nodes. 4 levels. 4 IOs to check if the index holds the value you are looking for. And then one or more IOs against the table to get the actual records from the table, if required (one block if there is only one row, one or more blocks if there are several rows).

The below is from a 25TB datawarehouse system. I am looking for the maximum index depth on the database:-

select max(blevel) from dba_indexes

MAX(BLEVEL)
-----------
          4

OK, how many indexes are this deep?…

select owner,index_name from dba_indexes
where blevel=4

OWNER                          INDEX_NAME
------------------------------ ----------------------
ERIC                           W_IIIIIIII_URLS_PK

1 row selected.

Oh. I have a pretty massive database and the deepest index I have is BLEVEL 4 and there is one of them. So the index height is 5. And access to that table for a given referral ID is going to take 5 IOs to work down the index and one against the table.

What is more, the index is an IOT:

INDEX_NAME                     INDEX_TYPE
------------------------------ ---------------------------
W_REFERRAL_URLS_PK             IOT - TOP

So, in fact, to get to the table record it is just the 5 IOs, as the table record is held in the IOT itself.

So with a 25TB data warehouse, the deepest normal indexes are BLEVEL 3.

select index_name,num_rows,leaf_blocks
from dba_indexes
where blevel = 3
order by num_rows desc

INDEX_NAME                             NUM_ROWS  LEAF_BLOCKS
-------------------------------- -------------- ------------
W_WL_SH_API_XIIIIIIIIIII_ID       4,585,108,192   13,406,015
W_WL_SHIIIIIIIIIIIXXXX_ID         4,564,350,002   16,892,898
W_WL_SIIIIIIIIIIIIIIIIIIIIII_ID   4,422,775,820   17,189,536
W_WL_SHIIIIIIIIIIIIIIIIIIIIII_ID  4,182,087,545   12,243,438
W_WL_GX_RESIIIIIIIIIII_LOG_ID1    3,690,374,216   14,613,388
IDX_W_WL_LIIIIIIIIIIIESS_ID         351,507,905    1,958,109
WL_LOGIN_SUIIIIIIIIIIIIE_ID         348,004,861    1,550,180
IND_IIII_LEG_ID                     312,727,000      972,690
IND_TMLR_MARIIIIIIIIIII_ID          306,988,247    1,121,592
PK_TBL_IIIIIIIIIIIUNNER             305,566,240      939,775
PK_IIIIIIIIIIIGS                    284,208,000      863,000

I asked an old friend, Tony Webb, to check out the depth of indexes on the largest databases they have, which range from a few TB to several 10s of TB {one is 100TB plus, but a lot of the data is in LOBs, so the actual number of records is only billions, not 10s of billions :-) }. No index had a BLEVEL greater than 3.

Hi Martin,

I ran the first query on DAVE, DEE(our biggest db), DOZY and MICK. The max depth results were 3,2,3 and 3. Our largest database only had 2. In fact some of the other largish dbs also only have a max depth of 2.

That enough for your blog Martin?

Now, some of you may be smelling a Rat, and you would be right. At both sites, the VLDBs have been designed with partitioning as a key part of the Physical Implementation. All the largest objects are partitioned. What impact does partitioning have on BLEVEL?

The depth of the index segment is dependent on the number of keys per block.

Let us say you have 8k blocks {this is a questionably low block size for a data warehouse, but it is what some of these systems have} and your index is on a couple of columns, totalling 20 bytes per key on average, including the rowid. Allowing for block overhead, that is approximately 8000/20 = 400 entries per block at most….

Blow that, that’s wholly theory. I’ll look at my real 25TB database for the actual number of rows per leaf block…

select leaf_blocks, num_rows/leaf_blocks rows_per_leaf
from dba_ind_statistics
where num_rows > 100000000
order by leaf_blocks desc

 LEAF_BLOCKS ROWS_PER_LEAF
------------ -------------
  30,211,949    325.602273
  28,717,447    262.762331
  26,308,608    382.505959
  25,429,862    385.047045
  24,054,626     309.92864
  23,660,747    382.571434
  22,458,058    411.063205
  22,316,976    346.620625
  18,875,827    379.952198
  17,451,901    338.600909
  17,189,536     257.29466

From the above let us take a low figure of 300 average entries per block. For a BLEVEL 0 index, a one-block index, that would be 300 rows referenced. For a BLEVEL 1 index, 300 entries in the root block point to 300 leaf blocks, so that covers 300*300 = 90,000 entries.

Go down another BLEVEL to 2 (a height of 3) and you have a root node referencing 300 branch nodes, each referencing 300 leaf nodes, each referencing 300 records. 300*300*300 = 27 million.

Another level (BLEVEL 3), another factor of 300 and that is 8.1 billion.

BLEVEL 0 = 300
BLEVEL 1 = 90,000
BLEVEL 2 = 27,000,000
BLEVEL 3 = 8,100,000,000
BLEVEL 4 = 2,430,000,000,000 (2,430 billion entries).
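{You can get Oracle to do this arithmetic for you. The below one-liner is my addition and assumes the flat 300 entries per block:}

select level-1 blevel, power(300,level) approx_max_entries
from dual
connect by level <= 5
/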

You can see that with an 8k block size, and being very pessimistic about the number of entries per block (I used a low average, whereas in reality most indexes will fill the root and branch nodes close to full before throwing a new level), you have to increase your data volume by almost 300 times to throw a new level of index.

On average, knowing nothing about the actual number of records in an index, you would have to reduce your number of indexed rows by 150 times (half of 300) to have a 50% chance of lowering the BLEVEL of a locally partitioned index by one; ie you need 150 or more partitions.

With a 32k block size, you would have to reduce your data volume by more like 620 times to drop the BLEVEL (not 600, as you have less wastage from block overhead with larger blocks).

To reduce it by two levels it would be 300*150 = 45,000, I think, with an 8k block size {can any statisticians, like Mr Lewis, help me out on this one?}. If I am right, that is 45,000 partitions. In that 25TB data warehouse, no table has more than 4,000 partitions.

That is why I say, when you partition a table and have locally partitioned indexes, you might reduce the index level by 1. Might.

However, none of these systems has more than 4,000 partitions per table. That might sound a lot, but it is only going to reduce an index BLEVEL by 1. Almost certainly it will do that, but if you have read the previous postings you will know that is not going to make index lookups that much more efficient :-)

I’ll just add a couple of last comments.

  • If you have an index BLEVEL of 4 or 5 or more and do not have a VLDB, you might want to look at why {hints: it could be an IOT, it could be an index on several concatenated VARCHAR2 columns, it could be you are using a 2k block size, it could be that you regularly delete a large percentage of your data and then re-insert more data, it could be that you have a one-table database, hehe}.
  • Point one does not mean you should rebuild said index. I almost never rebuild indexes. The need to do so is one of those Oracle myths. There are situations where index rebuilds are advantageous, but they are pretty damned rare. If you regularly rebuild all your indexes or you are thinking of doing so, can I suggest you book a “meeting” for an afternoon and sit down with a cup of tea and google/bing/yahoo or whatever and check out index rebuilding first? Ignore any sites you land on offering database health checks or training courses on cruise liners.

That’s all for now.

Partitions are Not for Performance October 29, 2009

Posted by mwidlake in Architecture, performance, VLDB.
Tags: , , ,
16 comments

There is a myth that Partitions in Oracle magically aid SQL query performance, even a general assumption that the main role of partitioning is to aid such query performance. This is not so. There are certainly many cases where partitioning can aid performance, but in my experience partitions just as often hinder rather than help overall performance.

The myth is so strong that, when asked at interview, most DBAs {and even some performance specialists} will cite query performance as the main (and occasionally only) benefit of partitioning.

Why can partitions hinder SQL query performance? Let’s say for example that you have a 10 million row table and an index on that table. That index has a B-Level of 3, which means it has a root block, one level of branch nodes, a second layer of branch nodes and then the leaf-node level. To access a row in the table via the index Oracle needs to read the root block, two branch blocks and then a leaf block to get the rowid of the record. This allows the table block {and from that the row} to be read. This is depicted in the below diagram, the numbered orange squares represent the blocks as selected in turn from the index and then table:

{Diagram index_access_global: the five blocks read in turn, down through the non-partitioned index and then to the table block.}

That is 5 I/O operations to access that row.

Now we partition the table into 5 partitions of 2 million rows each, and the index is locally partitioned. If you are lucky, you may, just may, end up with local index partitions with a B-Level 1 less than the original index, so in our case a B-Level of 2. The often-suggested benefit is that only one partition is now considered, and the CBO will read one root node, one branch block, a leaf block and then the block from the table partition.

{Diagram single_index_partition_access: the four blocks read when a single local index partition is used.}

4 I/Os and a saving of 20% {I’m ignoring for now caching and whether it is physical or logical IO}.

A saving of 20% IF {and only if} you have local index partitions with a lower B-Level than the equivalent non-partitioned index. And the access is to one partition alone.

However, I keep seeing situations where the index lookup does not include the partition key. So you get the below situation:

{Diagram index_lookup_partition: every local index partition is probed when the partition key is not in the WHERE clause.}

Lacking the partition key, the CBO cannot exclude any partitions, so it has to scan each one. For most partitions, maybe for all but one, no records are found, but each local index still has to be checked at a cost of 3 IO operations. So in my example that is 5 partitions * 3 index IOs plus 1 IO against the table: 16 IO operations to get the one record of interest.

16 I/O operations instead of 5.

The situation is often not spotted, at least initially, as the time taken to carry out the extra local index partition scans is “small”, especially for specific lookups. Usually any testing is done on a table with only a small number of partitions.

I remember well the first time I came across this {on an Oracle 9.0 database, I think}: there were well over 100 partitions, and a process that checked many thousands of individual records had slowed down significantly, taking 4 or 5 times as long as before.

An indicator that the problem is occurring is when a single record lookup against the index and then table is taking perhaps several dozen to several hundred consistent gets rather than the 5 or 6 it should. Also, you will see the partition iterator in the explain plan. In that first case where I came across the issue, consistent gets of about 380 per record fetched were being seen for a select that returned 1 record 99% of the time from a single table lookup. I’ve since seen the same problem on tables with over a thousand partitions, each local index being scanned for a total of almost 5000 consistent gets per record.
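{A quick way to look for the symptom on your own system is sketched below. This is my addition, with a made-up table and column name:}

set autotrace traceonly explain statistics
select *
from my_big_partitioned_table          -- hypothetical table
where indexed_non_partition_col = :b1; -- indexed, but not by the partition key
-- look for PARTITION RANGE ALL (or an ITERATOR spanning many partitions)
-- and consistent gets of dozens or hundreds rather than the 5 or 6 expected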

You may think this would be an unusual situation, as access against very large tables tends to be either by full or partial table scans, or by lookups on the PK or with a WHERE clause including the partitioning key, but it is actually very common. Partitioned tables are being used more and more in large but generally OLTP applications, or sets of records are identified in a data warehouse and then checked more specifically with row-by-row logic. With VLDBs, which have many, many partitioned tables with many, many partitions each, the problem is common and often not recognised in the fog of other issues and massive IO levels.

I’ve only covered a general performance issue with partitions here, I’ll expand on the theme and this simplistic example in the next post.

And yes, there are many ways partitioning CAN aid performance. I aim to get to those too. I really love partitioning.
