Ensuring Correlated Partition Exclusion #2 December 20, 2009
Posted by mwidlake in performance, Uncategorized, VLDB.Tags: partitions, performance, VLDB
2 comments
A few days ago {oh dear, it is now two weeks!} I showed how you could better control Correlated Partition Exclusion by creating a lookup table that showed, for each partition, the minimum and maximum ID (the partition key) and the minimum and maximum CRE_DATETIME (what you want to limit your query on). Using this range table in the way I described, you did not have to take an educated guess as to what range of IDs you used to include the partitions for the date range you were interested in, you could check the table.
But this only worked reliably where the ID and CRE_DATETIME increase directly in proportion to each other. It was a special case.
What about the more normal case, where there is a correlation between the two columns but it is not a perfect correlation? This would happen, for example, if the CRE_DATETIME is entered from another system, or from paper records, where the order of the records is not enforced. So some records from today get put into the system before some from yesterday. Or if the correlation is even loser than this. eg you are looking at orders for new customers. You “know” there are going to be no orders for these customers from 5 years ago but you are not sure how far back in the orders table you should go to find what you want.
You can still use the lookup table method. The lookup table in effect becomes a meta-index – an index of the segments where you will find data but not actually the rows.
To demonstrate this, I created a table where the ID and CRE_DATETIME increase in order:
create a partitioned table test_p4
(id number(10) not null
,cre_datetime date not null
,status number(1) not null
,num_1 number(4) not null -- random 20
,num_2 number(4) -- random 500
,num_3 number(5) -- cycle smoothly
,num_4 number(5) -- Random 10000
,vc_1 varchar2(10)
,vc_2 varchar2(10)
,vc_pad varchar2(2000))
tablespace users
partition by range (id)
(partition id_01k values less than (1000)
tablespace users
,partition id_02k values less than (2000)
tablespace users
...
,partition id_45k values less than (45000)
tablespace users
,partition id_max values less than (maxvalue)
tablespace users
)
--
-- local partitioned indexes on the table
IND_NAME TAB_NAME PSN COL_NAME
------------------ ------------------ --------- ------------
TP4_CRE_DT TEST_P4 1 CRE_DATETIME
TP4_PK TEST_P4 1 ID
--
-- populate the table with data
insert into test_p4(id,cre_datetime,status,num_1,num_2,num_3,num_4
,vc_1,vc_2,vc_pad)
select rownum
,to_date('01-JUL-2009','DD-MON-YYYY')+(rownum/360)
,decode(mod(rownum,100),0,1
,0)
,trunc(dbms_random.value(1,20))
,trunc(dbms_random.value(1,50))
,mod(rownum,10)+1
,trunc(dbms_random.value(1,10000))
,dbms_random.string('U',5)
,lpad(chr(mod(rownum,6)+65),5,chr(mod(rownum,6)+65) )
,lpad('A',1000,'A')
from dba_objects
where rownum <43000
I then messed with the data, updating 10% of records and setting the CRE_DATETIME to plus or minus a random amount up to 2 days different, so data in partitions would overlap. I then created a range table in the same way as I did for the previous post.
I ended up with a range table like the below:
MIN_CRE_DATI MAX_CRE_DATI MIN_ID MAX_ID ----------------- ----------------- ---------- ---------- 29-JUN-2009 20:35 05-JUL-2009 14:24 1 999 02-JUL-2009 04:50 07-JUL-2009 12:56 1000 1999 05-JUL-2009 02:34 10-JUL-2009 08:31 2000 2999 08-JUL-2009 05:23 13-JUL-2009 03:32 3000 3999 11-JUL-2009 08:07 15-JUL-2009 18:41 4000 4999 14-JUL-2009 00:14 18-JUL-2009 18:27 5000 5999 16-JUL-2009 08:58 21-JUL-2009 14:18 6000 6999 19-JUL-2009 01:28 24-JUL-2009 11:21 7000 7999 22-JUL-2009 08:02 27-JUL-2009 07:01 8000 8999 24-JUL-2009 22:06 30-JUL-2009 05:37 9000 9999 28-JUL-2009 04:59 01-AUG-2009 10:57 10000 10999 ... 24-SEP-2009 01:52 28-SEP-2009 18:36 31000 31999 26-SEP-2009 16:49 01-OCT-2009 01:26 32000 32999 29-SEP-2009 13:20 04-OCT-2009 13:43 33000 33999 02-OCT-2009 08:40 07-OCT-2009 10:11 34000 34999 05-OCT-2009 04:29 10-OCT-2009 04:09 35000 35999 08-OCT-2009 02:04 12-OCT-2009 17:34 36000 36999 10-OCT-2009 20:03 15-OCT-2009 08:39 37000 37999 13-OCT-2009 15:09 18-OCT-2009 12:01 38000 38999 16-OCT-2009 06:49 21-OCT-2009 03:53 39000 39999 18-OCT-2009 20:16 23-OCT-2009 21:01 40000 40999 21-OCT-2009 12:10 26-OCT-2009 07:13 41000 41999 25-OCT-2009 01:29 29-OCT-2009 20:56 42000 42999
You can see that the MIN_CRE_DATI-MAX_CRE_DATI from record to record (partition to partition) overlap but generally increase.
How do you find the start and end of the ID range to cover a date period you are interested in? I always have to sit down with a pen a paer to work this out. It is the classic “overlapping ranges” check, but my brain cannot hold on to it. So here goes. I want all records between the 1st of October and the 6th, this year.
You want to find the partition with the lowest ID range which has a record that falls into the date range you want. ie the Maximum CRE_DATETIME record is as late or later than the range you are interested in. If the maximum CRE_DATETIME is less than the date range you are interested in, no records will be in that partition.
Here is the code to find the lowest partition
select min(min_id)
from tp4_range
where MAX_CRE_DATI>=to_date('01-OCT-2009','DD-MON-YYYY')
MIN(MIN_ID)
-----------
32000
Execution Plan
----------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost
----------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 12 | 2
| 1 | SORT AGGREGATE | | 1 | 12 |
| 2 | TABLE ACCESS BY INDEX ROWID| TP4_RANGE | 12 | 144 | 2
|* 3 | INDEX RANGE SCAN | TP4R_MACD_MIAD | 12 | | 1
----------------------------------------------------------
To cover the whole of the partition of interest you need to look for records with and I greater than the minimum in that partition, thus the selection of the min(min_id)
Similarly, you only want partitions where the minimum CRE_DATETIME is before the end of the date range you want. If all records in the partition have a CRE_DATETIME beyond the range, you are not interested in it.
select max(max_id)
from tp4_range where MIN_CRE_DATI<=to_date('06-OCT-2009','DD-MON-YYYY')
MAX(MAX_ID)
-----------
35999
Execution Plan
----------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost
----------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 13 | 2
| 1 | SORT AGGREGATE | | 1 | 13 |
| 2 | TABLE ACCESS BY INDEX ROWID| TP4_RANGE | 37 | 481 | 2
|* 3 | INDEX RANGE SCAN | TP4R_MICD_MIID | 37 | | 1
----------------------------------------------------------
Then you put these sub-queries into the query to select the data you want. Below I show the “original” code which does not use the partition range table and then using the partition range table, to prove the same number of records come back and to see the plan and statistics change:
select count(*) from test_p4
where cre_datetime between to_date('01-OCT-2009','DD-MON-YYYY')
and to_date('06-OCT-2009','DD-MON-YYYY')
and num_2 = 5
COUNT(*)
----------
48
Execution Plan
------------------------------------------------------
| Id | Operation | Name | R
ows | Bytes | Cost (%CPU)| Time | Pstart| Pstop |
------------------------------------------------------------
------------------------------------------------------
| 0 | SELECT STATEMENT | |
1 | 11 | 356 (1)| 00:00:02 | | |
| 1 | SORT AGGREGATE | |
1 | 11 | | | | |
| 2 | PARTITION RANGE ALL | |
36 | 396 | 356 (1)| 00:00:02 | 1 | 46 |
|* 3 | TABLE ACCESS BY LOCAL INDEX ROWID| TEST_P4 |
36 | 396 | 356 (1)| 00:00:02 | 1 | 46 |
|* 4 | INDEX RANGE SCAN | TP4_CRE_DT |
1788 | | 52 (0)| 00:00:01 | 1 | 46 |
------------------------------------------------------------
Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
676 consistent gets
0 physical reads
Note the 676 consistent gets and the Pstart/Pstop of 1-46
select count(*) from test_p4
where cre_datetime between to_date('01-OCT-2009','DD-MON-YYYY')
and to_date('06-OCT-2009','DD-MON-YYYY')
and id> (select min(min_id)
from tp4_range where MAX_CRE_DATI
>=to_date('01-OCT-2009','DD-MON-YYYY'))
and id < (select max(max_id)
from tp4_range where MIN_CRE_DATI
<=to_date('06-OCT-2009','DD-MON-YYYY'))
and num_2 = 5
COUNT(*)
----------
48
Execution Plan
-----------------------------------------------------------
| Id | Operation | Name
| Rows | Bytes | Cost (%CPU)| Time | Pstart| Pstop |
------------------------------------------------------------
-----------------------------------------------------------
| 0 | SELECT STATEMENT |
| 1 | 16 | 66 (5)| 00:00:01 | | |
| 1 | SORT AGGREGATE |
| 1 | 16 | | | | |
| 2 | PARTITION RANGE ITERATOR |
| 1 | 16 | 62 (5)| 00:00:01 | KEY | KEY |
|* 3 | TABLE ACCESS BY LOCAL INDEX ROWID | TEST_P4
| 1 | 16 | 62 (5)| 00:00:01 | KEY | KEY |
| 4 | BITMAP CONVERSION TO ROWIDS |
| | | | | | |
| 5 | BITMAP AND |
| | | | | | |
| 6 | BITMAP CONVERSION FROM ROWIDS |
| | | | | | |
| 7 | SORT ORDER BY |
| | | | | | |
|* 8 | INDEX RANGE SCAN | TP4_PK
| 1788 | | 2 (0)| 00:00:01 | KEY | KEY |
| 9 | SORT AGGREGATE |
| 1 | 12 | | | | |
| 10 | TABLE ACCESS BY INDEX ROWID| TP4_RANGE
| 12 | 144 | 2 (0)| 00:00:01 | | |
|* 11 | INDEX RANGE SCAN | TP4R_MACD_MIA
D | 12 | | 1 (0)| 00:00:01 | | |
| 12 | SORT AGGREGATE |
| 1 | 13 | | | | |
| 13 | TABLE ACCESS BY INDEX ROWID| TP4_RANGE
| 37 | 481 | 2 (0)| 00:00:01 | | |
|* 14 | INDEX RANGE SCAN | TP4R_MICD_MII
D | 37 | | 1 (0)| 00:00:01 | | |
| 15 | BITMAP CONVERSION FROM ROWIDS |
| | | | | | |
| 16 | SORT ORDER BY |
| | | | | | |
|* 17 | INDEX RANGE SCAN | TP4_CRE_DT
| 1788 | | 52 (0)| 00:00:01 | KEY | KEY |
------------------------------------------------------------
Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
350 consistent gets
0 physical reads
The plan is admittedly more complex and the SQL can be tricky to understand if you are not used to code that looks for overlap of ranges. But Consistent gets is down to 350 and the Pstart/Pstop values are KEY-KEY. The CBO cannot tell you WHAT the ranges will be when the code is parsed {I think it should check but in 10.2.0.3 at least, the CBO is not that smart at parse time}, but it knows there will be a start and stop.
I am using tiny partitions for my example and only 45 of them for the table. When the partitions are for millions of rows and there are a couple of thousand of them, excluding partitions in a manner you can rely on is a powerful tool. Which leads onto a final word of caution.
Nothing is enforcing that the range table is maintained.
You could do things with triggers or regular re-calculation of the ranges table but this will be something you need to consider if you use ranges tables to help partition exclusion. Flipping tablespaces to read-only can help the worry go away though…
As an example of the issue of needing to maintain the ranges table and also a demonstration that the ranges table does work correctly if maintainted, I’ll update a record well outside of the “expected window”, show that it does not appear in my range-check controlled code, then update the ranges table and try again.
update test_p4
set cre_datetime=to_date(’02-OCT-2009 11:15′,’DD-MON-YYYY HH24:MI’)
,NUM_2=5
where id=22100
/
3> select count(*) from test_p4
where cre_datetime between to_date(’01-OCT-2009′,’DD-MON-YYYY’)
and to_date(’06-OCT-2009′,’DD-MON-YYYY’)
and id> (select min(min_id)
from tp4_range where MAX_CRE_DATI>=to_date(’01-OCT-2009′,’DD-MON-YYYY’))
and id < (select max(max_id)
from tp4_range where MIN_CRE_DATI (select min(min_id)
from tp4_range where MAX_CRE_DATI>=to_date(’01-OCT-2009′,’DD-MON-YYYY’))
and id < (select max(max_id)
from tp4_range where MIN_CRE_DATI<=to_date('06-OCT-2009','DD-MON-YYYY'))
and num_2 = 5
COUNT(*)
———-
49
– success
–
I wonder if I will do the next posting on this topic in less than two weeks!
Scalar Functions Not Behaving as Scalar Functions December 8, 2009
Posted by mwidlake in performance, Uncategorized.Tags: performance, PL/SQL
3 comments
My thanks go to Graeme Hobbs, who I currently have the pleasure of working with, for explaining this to me.
This is all about scalar functions that do not act like scalar functions.
Where I am spending much of my time at the moment, we have an issue with a very simple sql statement. It takes on average 2 buffer gets per execution. How could that be an issue? Well, it is being executed a few thousand times an hour. By “a few thousand” read 50,000 times. Even a simple statement being executed 15 times a second is not good if it is not needed. Especially, as in this case, there is a context switch.
I got a look at the code and it is doing something like this:
HAVING SUM( col.vol_gbp ) >=
my_user.pkg_application_settings.get_number_value( ‘rp157_eric’ )
I quickly batted an email to my esteemed colleague Mr Hobbs saying “but this is a scalar. It will be executed once for the statement and the value passed into each execution of the whole query, it can’t be the source of the 50,000 executions an hour”.
Oh woe, for my ignorance has undone me. But it’s OK, Graeme not only took time to explain, but knocked up the following nice demo of WHY this is a problem. And said I could put it on my Blog. What a nice chap.
From Graeme:
******************************************************
Consider package:
CREATE OR REPLACE PACKAGE BODY ghtest IS l_called_count NUMBER := 0; -- Function and procedure implementations FUNCTION f_get_called_count RETURN NUMBER IS BEGIN RETURN( l_called_count ); END; FUNCTION f_call RETURN NUMBER IS BEGIN l_called_count := l_called_count + 1; RETURN 1; END; FUNCTION f_reset RETURN NUMBER IS BEGIN l_called_count := 0; RETURN 1; END; END ghtest;
– nothing up my sleeve
SELECT ghtest.f_get_called_count
FROM DUAL;
F_GET_CALLED_COUNT
——————
0
ie no calls made yet
SQL> SELECT id
, COUNT( * )
FROM ( SELECT *
FROM MY_USR.tbl_SOME_DATA
WHERE ROWNUM < 10 )
GROUP BY id
HAVING COUNT( * ) > ghtest.f_call;
ID COUNT(*)
---------- ----------
1624361594 3
SELECT ghtest.f_get_called_count
FROM DUAL;
F_GET_CALLED_COUNT
------------------
7
NOTE the counter has gone up to 7 as there were 7 IDs found from that code -6 IDs with one record and one with 3.
SELECT ghtest.f_reset
FROM DUAL;
F_RESET
———-
1
That just blanks the counter (see the package).
-- nothing up my sleeve
SELECT ghtest.f_get_called_count
FROM DUAL;
F_GET_CALLED_COUNT
------------------
0
SELECT id
, COUNT( * )
FROM ( SELECT *
FROM my_usr.tbl_some_data
WHERE ROWNUM < 10 )
GROUP BY id
HAVING COUNT( * ) > (SELECT ghtest.f_call FROM dual);
ID COUNT(*)
---------- ----------
1624361594 3
SELECT ghtest.f_get_called_count
FROM DUAL;
F_GET_CALLED_COUNT
------------------
1
You can see that the scalar subquery
HAVING COUNT( * ) > (SELECT ghtest.f_call FROM dual);
calls f_call once.
Whereas
HAVING COUNT( * ) > ghtest.f_call;
results in it being called 7 times – the number of rows returned from the group by
That is precisely the problem.
******************************************************
Apparently our developers should know this, as it is a “known issue”. But not known to me until now (unless I knew it and forgot it)
So, hats of to Graeme.
{apologies, wordpress utterly screwed the layout/truncated text when I pasted in the example, have fixed now}
Ensuring Correlated Partition Exclusion December 7, 2009
Posted by mwidlake in performance, Uncategorized, VLDB.Tags: partitions, performance, VLDB
9 comments
<Previous Post…Next Pos >
I’ve posted a couple of times recently about a tuning method with partition exclusion where you infer a relationship between the value you want to limit a query on and the partition key. It takes a while to explain the theory so I am going to give it a name, in the vain hope it catches on {forgive me if someone has already done this – in fact, just flame me with pointers if you already know a name for this}. At the very least, for my own postings, I can use this name from now on and link back to a single posting explaining it.
I’m going to call it Correlated Partition Exclusion. In essence, you have partitioned the table on a key, in my case the primary key ID, which is an ascending numeric probably sourced from a sequence.
You have a second column, in my case CRE_DATETIME, which increases in line with the PK. If you limit your query on the CRE_DATETIME partition exclusion is not possible as there is no guarantee which CRE_DATETIME values appear in which partition. But, as a human, you understand that if you create 10,000 records a day, if you want to look at the last week’s date you can use:
WHERE ID > MAX_ID-(7*10000)
to exclude partitions with an id more than 7 days worth ago.
So, you have your Correlated Partition Exclusion.
How can you be sure that going back 70,000 IDs is going to safely cover one week of data and how can you maximise your efficiency of including only partitions that cover the date range? {note, I am using a numeric ID as my partition range key and a datetime as my correlated column, this principle works just as well if you partition on datetime and want to correlate to a (generally) ascending numeric key}
Here is my test table :-
create table test_p3 (id number(10) not null ,cre_datetime date not null ,status number(1) not null ,num_1 number(4) not null -- random 20 ,num_2 number(4) -- random 500 ,num_3 number(5) -- cycle smoothly ,num_4 number(5) -- Random 10000 ,vc_1 varchar2(10) ,vc_2 varchar2(10) ,vc_pad varchar2(2000)) tablespace users partition by range (id) (partition id_01k values less than (1000) tablespace users ,partition id_02k values less than (2000) tablespace users ,partition id_03k values less than (3000) tablespace users ,partition id_04k values less than (4000) tablespace users ... ,partition id_45k values less than (45000) tablespace users -- ,partition id_max values less than (maxvalue) tablespace users ) / --@ind_cols IND_NAME TAB_NAME PSN COL_NAME ------------------ ------------------ --------- -------------------- TP3_PK TEST_P3 1 ID TP_CRE_DT TEST_P3 1 CRE_DATETIME
If I want to look for all records between two dates I could use code like the below (based in that suggested by Bernard Polarski, any mistakes are mine).
with get_min_id as
(select max(id) min_id from test_p3
where cre_datetime >= TO_DATE('18-OCT_2009','DD-MON-YYYY') )
,get_max_id as
(select min(id) max_id from test_p3
where cre_datetime <= TO_DATE('20-OCT_2009','DD-MON-YYYY') )
select count(*)
from test_p3
,get_min_id
,get_max_id
where id >get_min_id.min_id
and id < get_max_id.max_id
and cre_datetime between TO_DATE('18-OCT_2009','DD-MON-YYYY')
and TO_DATE('20-OCT_2009','DD-MON-YYYY')
/
COUNT(*)
----------
721
1 row selected.
ie find the minimum ID for the start date and the maximum ID for the end date and query between them.
This works fine. Unfortunately, you get a plan like the below
------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | Pstart| Pstop | ------------------------------------------------------------------ ------------------------------------------------------- | 0 | SELECT STATEMENT | | 1 | 39 | 887 (4)| 00:00:04 | | | | 1 | SORT AGGREGATE | | 1 | 39 | | | | | | 2 | NESTED LOOPS | | 2 | 78 | 887 (4)| 00:00:04 | | | | 3 | NESTED LOOPS | | 1 | 26 | 717 (5)| 00:00:03 | | | | 4 | VIEW | | 1 | 13 | 505 (5)| 00:00:02 | | | |* 5 | FILTER | | | | | | | | | 6 | SORT AGGREGATE | | 1 | 26 | | | | | |* 7 | VIEW | index$_join$_001 | 38393 | 974K| 505 (5)| 00:00:02 | | | |* 8 | HASH JOIN | | | | | | | | | 9 | PARTITION RANGE ALL | | 38393 | 974K| 418 (6)| 00:00:02 | 1 | 46 | |* 10 | INDEX RANGE SCAN | TP_CRE_DT | 38393 | 974K| 418 (6)| 00:00:02 | 1 | 46 | | 11 | PARTITION RANGE ALL | | 38393 | 974K| 152 (4)| 00:00:01 | 1 | 46 | | 12 | INDEX FAST FULL SCAN | TP3_PK | 38393 | 974K| 152 (4)| 00:00:01 | 1 | 46 | | 13 | VIEW | | 1 | 13 | 212 (5)| 00:00:01 | | | | 14 | SORT AGGREGATE | | 1 | 26 | | | | | |* 15 | VIEW | index$_join$_002 | 2972 | 77272 | 212 (5)| 00:00:01 | | | |* 16 | HASH JOIN | | | | | | | | | 17 | PARTITION RANGE ALL | | 2972 | 77272 | 59 (6)| 00:00:01 | 1 | 46 | |* 18 | INDEX RANGE SCAN | TP_CRE_DT | 2972 | 77272 | 59 (6)| 00:00:01 | 1 | 46 | | 19 | PARTITION RANGE ALL | | 2972 | 77272 | 152 (4)| 00:00:01 | 1 | 46 | | 20 | INDEX FAST FULL SCAN | TP3_PK | 2972 | 77272 | 152 (4)| 00:00:01 | 1 | 46 | | 21 | PARTITION RANGE ITERATOR | | 2 | 26 | 170 (1)| 00:00:01 | KEY | KEY | |* 22 | TABLE ACCESS BY LOCAL INDEX ROWID| TEST_P3 | 2 | 26 | 170 (1)| 00:00:01 | KEY | KEY | |* 23 | INDEX RANGE SCAN | TP_CRE_DT | 707 | | 48 (0)| 00:00:01 | KEY | KEY | ------------------------------------------------------------------ Statistics ---------------------------------------------------------- 0 recursive calls 0 db block gets 892 consistent gets 0 physical reads
If you look at the plan {towards the end, and sorry about the naff layout, my blog is not wide enough for this sort of printout} there are two checks on the TP_CRE_DT indexes that scan all partitions of the index – Pstart/Pstop are 1-46. This is the CBO looking for the partitions where the stated CRE_DATETIME records occur in the whole table. Cost is 892 consistent gets, much of which is the checking of local index partitions that will hold no relevant entries.
Bernard also spotted that the code was considering all partitions and was not as efficient as it could be but did not know how to get around it except with a Global index, which brings it’s own issues.
One way is to just say “well, I know how many records per day I get so I will fake up the ID limits based on this”. The problem is, and this is why the CBO cannot make the decision for you, is that there is no guarantee, no fixed rule, saying that CRE_DATETIME will always increment with the ID. In fact, there is nothing stopping you altering a record which has an ID from yesterday having a CRE_DATETIME from 10 years ago {but forget I said that until tomorrow’s post}. Now, in my example the CRE_DATETIME is going to increment with ID. We will use this special case for now, I know it is flawed and will address that flaw {tomorrow}.
So to ensure yo do not miss data you end up making your ID window pretty large to endure you do not miss records. Say you want all records for the last day, you process 1500 records a day, so you consider a window covering all samples with an ID within the last 10,000. It will still be more efficient than scanning all partitions and should be safe. Fairly safe.
The way out of this is to have a ranges table. Something that tells you, for each partition, which is the maximum and minimum CRE_DATE and the IDs covered by that range. You can then use that to identify the partitions that cover the date range you are interested in.
Here is an example. I will create a simple table to hold the ranges:
create table tp3_range (min_cre_dati date ,max_cre_dati date ,min_id number ,max_id number)
Now you have to populate it.
The below code will create the first record for the first partition.
insert into tp3_range SELECT MIN(CRE_DATETIME) ,MAX(CRE_DATETIME) ,MIN(ID) ,MAX(ID) FROM TEST_P3 PARTITION (ID_01K) / Execution Plan ---------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost ( %CPU)| Time | Pstart| Pstop | ------------------------------------------------------------------ | 0 | INSERT STATEMENT | | 1 | 12 | 102 (1)| 00:00:01 | | | | 1 | SORT AGGREGATE | | 1 | 12 | | | | | | 2 | PARTITION RANGE SINGLE| | 999 | 11988 | 102 (1)| 00:00:01 | 1 | 1 | | 3 | TABLE ACCESS FULL | TEST_P3 | 999 | 11988 | 102 (1)| 00:00:01 | 1 | 1 | ------------------------------------------------------------------ Statistics ---------------------------------------------------------- 1 recursive calls 3 db block gets 195 consistent gets 0 physical reads
As you can see from the plan and cost, this is not very efficient as it has to scan the whole partition. Maybe not a problem if you do this once, but there are indexes on both these columns, can’t this be done more efficiently? Yes, if you split up the code into four in-line selects (if you want more details about it being more performant to do MIN and MAX on their own than in one statement then see this post on the topic ):
insert into tp3_range SELECT (SELECT MIN(CRE_DATETIME) FROM TEST_P3 PARTITION (ID_04K)) ,(SELECT MAX(CRE_DATETIME) FROM TEST_P3 PARTITION (ID_04K)) ,(SELECT MIN(id ) FROM TEST_P3 PARTITION (ID_04K)) ,(SELECT MAX(ID ) FROM TEST_P3 PARTITION (ID_04K)) FROM DUAL PLAN --------------------------------------------------------- SORT AGGREGATE PARTITION RANGE SINGLE cst:2 rws:1000 INDEX FULL SCAN (MIN/MAX) TP_CRE_DT cst:2 rws:1000 Statistics ---------------------------------------------------------- 1 recursive calls 5 db block gets 9 consistent gets 0 physical reads
{You may wonder why the Explain Plan section above has a different look. This is because there seems to be a bug in 10.2.0.3 where the autotrace plan for the insert statement comes out wrong, as a FAST DUAL access, so I had to Explain the statement in a different way}
You would run one of the above statements against each partition. Probably, the “Best Practice” way would be to generate a SQL script from a query against DBA_TAB_PARTITIONS.
To populate my table I cheated – I just assumed all my partitions are of the same size (1000 records) and used:-
insert into tp3_range
select min(cre_datetime),max(cre_datetime),min(id),max(id)
from test_p3
– where id between 10000 and 20000
group by trunc(id/1000)
/
Note I left in a commented line. You could run the above against the whole table and then limit it to just new partitions as you add them. This is a tad risky though, you are relying on the partitioning being perfect and it would not scale to hundreds of very large partitions. You would be better off with the partition-by-partition methods above.
You end up with a table like the below:-
select * from tp3_range order by min_cre_dati MIN_CRE_DATI MAX_CRE_DATI MIN_ID MAX_ID ----------------- ----------------- ---------- ---------- 01-JUL-2009 00:04 03-JUL-2009 18:36 1 999 03-JUL-2009 18:40 06-JUL-2009 13:16 1000 1999 06-JUL-2009 13:20 09-JUL-2009 07:56 2000 2999 09-JUL-2009 08:00 12-JUL-2009 02:36 3000 3999 12-JUL-2009 02:40 14-JUL-2009 21:16 4000 4999 14-JUL-2009 21:20 17-JUL-2009 15:56 5000 5999 17-JUL-2009 16:00 20-JUL-2009 10:36 6000 6999 20-JUL-2009 10:40 23-JUL-2009 05:16 7000 7999 23-JUL-2009 05:20 25-JUL-2009 23:56 8000 8999 26-JUL-2009 00:00 28-JUL-2009 18:36 9000 9999 28-JUL-2009 18:40 31-JUL-2009 13:16 10000 10999 31-JUL-2009 13:20 03-AUG-2009 07:56 11000 11999 ... 14-OCT-2009 13:20 17-OCT-2009 07:56 38000 38999 17-OCT-2009 08:00 20-OCT-2009 02:36 39000 39999 20-OCT-2009 02:40 22-OCT-2009 21:16 40000 40999 22-OCT-2009 21:20 25-OCT-2009 15:56 41000 41999 25-OCT-2009 16:00 28-OCT-2009 10:36 42000 42999
add the two below indexes:
create index tp3r_micd_miid on tp3_range(min_cre_dati,min_id);
create index tp3r_macd_miad on tp3_range(max_cre_dati,max_id);
And now you can find the upper and lower ID bounds for a date range with the following code:
select min_id from tp3_range
where min_cre_dati = (select max(min_cre_dati)
from tp3_range
where min_cre_dati <TO_DATE('18-OCT_2009','DD-MON-YYYY')
)
MIN_ID
----------
39000
Execution Plan
---------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost
---------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 22 | 2
|* 1 | INDEX RANGE SCAN | TP3R_MICD_MIID | 1 | 22 | 1
| 2 | SORT AGGREGATE | | 1 | 9 |
| 3 | FIRST ROW | | 40 | 360 | 1
|* 4 | INDEX RANGE SCAN (MIN/MAX)| TP3R_MICD_MIID | 40 | 360 | 1
---------------------------------------------------------
Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
3 consistent gets
0 physical reads
select max_id from tp3_range
where max_cre_dati = (select min(max_cre_dati)
from tp3_range
where max_cre_dati >to_date('20-OCT_2009','DD-MON-YYYY')
)
MAX_ID
----------
39999
Execution Plan
----------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost
----------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 22 | 2
|* 1 | INDEX RANGE SCAN | TP3R_MACD_MIAD | 1 | 22 | 1
| 2 | SORT AGGREGATE | | 1 | 9 |
| 3 | FIRST ROW | | 4 | 36 | 1
|* 4 | INDEX RANGE SCAN (MIN/MAX)| TP3R_MACD_MIAD | 4 | 36 | 1
----------------------------------------------------------
Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
3 consistent gets
0 physical reads
Put it all together into one statement and let’s see how much it costs:
with get_min_id as
(select min_id from tp3_range
where min_cre_dati = (select max(min_cre_dati)
from tp3_range
where min_cre_dati <TO_DATE('18-OCT_2009','DD-MON-YYYY')
) )
,get_max_id as
(select max_id from tp3_range
where max_cre_dati = (select min(max_cre_dati)
from tp3_range
where max_cre_dati >to_date('20-OCT_2009','DD-MON-YYYY')
) )
select count(*)
from test_p3
,get_min_id
,get_max_id
where id >get_min_id.min_id
and id < get_max_id.max_id
and cre_datetime between TO_DATE('18-OCT_2009','DD-MON-YYYY')
and TO_DATE('20-OCT_2009','DD-MON-YYYY')
COUNT(*)
----------
721
1 row selected.
Execution Plan
----------------------------------------------------------
| Id | Operation | Name | Row
s | Bytes | Cost (%CPU)| Time | Pstart| Pstop |
------------------------------------------------------------------
----------------------------------------------------
| 0 | SELECT STATEMENT | |
1 | 57 | 67 (5)| 00:00:01 | | |
| 1 | SORT AGGREGATE | |
1 | 57 | | | | |
| 2 | TABLE ACCESS BY LOCAL INDEX ROWID | TEST_P3 |
2 | 26 | 65 (5)| 00:00:01 | | |
| 3 | NESTED LOOPS | |
2 | 114 | 65 (5)| 00:00:01 | | |
| 4 | MERGE JOIN CARTESIAN | |
1 | 44 | 2 (0)| 00:00:01 | | |
|* 5 | INDEX RANGE SCAN | TP3R_MICD_MIID |
1 | 22 | 1 (0)| 00:00:01 | | |
| 6 | SORT AGGREGATE | |
1 | 9 | | | | |
| 7 | FIRST ROW | |
40 | 360 | 1 (0)| 00:00:01 | | |
|* 8 | INDEX RANGE SCAN (MIN/MAX) | TP3R_MICD_MIID |
40 | 360 | 1 (0)| 00:00:01 | | |
| 9 | BUFFER SORT | |
1 | 22 | 1 (0)| 00:00:01 | | |
|* 10 | INDEX RANGE SCAN | TP3R_MACD_MIAD |
1 | 22 | 1 (0)| 00:00:01 | | |
| 11 | SORT AGGREGATE | |
1 | 9 | | | | |
| 12 | FIRST ROW | |
4 | 36 | 1 (0)| 00:00:01 | | |
|* 13 | INDEX RANGE SCAN (MIN/MAX) | TP3R_MACD_MIAD |
4 | 36 | 1 (0)| 00:00:01 | | |
| 14 | PARTITION RANGE ITERATOR | |
| | | | KEY | KEY |
| 15 | BITMAP CONVERSION TO ROWIDS | |
| | | | | |
| 16 | BITMAP AND | |
| | | | | |
| 17 | BITMAP CONVERSION FROM ROWIDS| |
| | | | | |
| 18 | SORT ORDER BY | |
| | | | | |
|* 19 | INDEX RANGE SCAN | TP3_PK | 7
07 | | 6 (0)| 00:00:01 | KEY | KEY |
| 20 | BITMAP CONVERSION FROM ROWIDS| |
| | | | | |
| 21 | SORT ORDER BY | |
| | | | | |
|* 22 | INDEX RANGE SCAN | TP_CRE_DT | 7
07 | | 48 (0)| 00:00:01 | KEY | KEY |
------------------------------------------------------------------
Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
133 consistent gets
0 physical reads
If you look way back up this post, you will see that the number of records selected by the above is the same as from the code indicated by Bernard (721 records) and for a lot less cost – 133 consistent gets compared to 892.
So I have hopefully explained the issue (having to visit all the index partitions to check the date range), shown how a ranges table can help, given some simple and a less-simple-but-more-efficient examples of code to populate the table and finally shown that using the range table can be more efficient.
And I arrogantly gave it a name – Correlated Partition Exclusion
Some of you may remember tha this example has been a special case, as there is no overlap between the dates; the relationship between the CRE_DATETIME and ID is perfect. You are unlikely to have that in real life. Also, dome of you may also be tired of reading this post. So I will cover the general case in the NEXT post.
Depth of Indexes on VLDBs November 18, 2009
Posted by mwidlake in performance, Uncategorized, VLDB.Tags: Architecture, partitions, statistics
3 comments
In a couple of recent posts on partitions, on single row and range scan access I have started exploring the performance of partitions. I will continue on the main theme soon, honest.
I feel a little bad that I have not given concrete examples so far {creating very large objects for the sake of demonstrations and proof-of-concepts is a bit of an issue with VLDBs, due to the space and time requirements to do it}. So, here is just some real-world information on index depth, or BLEVEL of indexes on Very Large DataBases.
The BLEVEL is the number of levels in the index. Minus one. See this post for a full description, but in brief, if an index has a BLEVEL of 3, that means there is a root node, a level of branch nodes, a second level of branch nodes and then the leaf nodes. 4 levels. 4 IOs to check if the index holds the value you are looking for. And then one or more IOs against the table to get the actual records from the table, if required (one block if there is only one row, one or more blocks if there are several rows).
The below is from a 25TB datawarehouse system. I am looking for the maximum index depth on the database:-
select max(blevel) from dba_indexes
MAX(BLEVEL)
———–
4
OK, how many indexes are this deep?…
select owner,index_name from dba_indexes
where blevel=4
OWNER INDEX_NAME
—————————— ———————-
ERIC W_IIIIIIII_URLS_PK
1 row selected.
Oh. I have a pretty massive database and the deepest index I have is BLEVEL 4 and there is one of them. So the index height is 5. And access to that table for a given referral ID is going to take 5 IOs to work down the index and one against the table.
What is more, the index is an IOT
INDEX_NAME INDEX_TYPE
—————————— —————————
W_REFERRAL_URLS_PK IOT – TOP
so, in fact, to get to the table record it is just the 5 IOs as the table record is held in the IOT.
So with a 25TB data warehouse, the deepest normal indexes are BLEVEL 3.
select index_name,num_rows/leaf_blocks from dba_indexes where BLEVEL =3 order by num_rows desc INDEX_NAME NUM_ROWS LEAF_BLOCKS ------------------------------ ---------------- ------------ W_WL_SH_API_XIIIIIIIIIII_ID 4,585,108,192 13,406,015 W_WL_SHIIIIIIIIIIIXXXX_ID 4,564,350,002 16,892,898 W_WL_SIIIIIIIIIIIIIIIIIIIIII_ID 4,422,775,820 17,189,536 W_WL_SHIIIIIIIIIIIIIIIIIIIIII_ID 4,182,087,545 12,243,438 W_WL_GX_RESIIIIIIIIIII_LOG_ID1 3,690,374,216 14,613,388 IDX_W_WL_LIIIIIIIIIIIESS_ID 351,507,905 1,958,109 WL_LOGIN_SUIIIIIIIIIIIIE_ID 348,004,861 1,550,180 IND_IIII_LEG_ID 312,727,000 972,690 IND_TMLR_MARIIIIIIIIIII_ID 306,988,247 1,121,592 PK_TBL_IIIIIIIIIIIUNNER 305,566,240 939,775 PK_IIIIIIIIIIIGS 284,208,000 863,000
I asked an old friend, Tony Webb, to check out the depth of indexes on the largest databases they have, which range from a few TB to several 10s of TB. {one is 100TB plus, but a lot of the data is in LOBs, so the actual number of records is only billions, not 10′s of billions
} No index had a BLEVEL greater than 3.
Hi Martin,
I ran the first query on DAVE, DEE(our biggest db), DOZY and MICK. The max depth results were 3,2,3 and 3. Our largest database only had 2. In fact some of the other largish dbs also only have a max depth of 2.
That enough for your blog Martin?
Now, some of you may be smelling a Rat, and you would be right. At both sites, the VLDBs have been designed with partitioning as a key part of the Physical Implementation. All the largest objects are partitioned. What impact does partitioning have on BLEVEL?
The depth of the index segment is dependent on the number of keys per block.
Let us say you have 8k blocks {this is a questionably low block size for a data warehouse, but it is what some of these systems have} and your index is on a couple of columns, totalling 20 bytes per key on average, including rowid. Allowing block overhead, that is approx 8000/20 entries = 400 max….
Blow that, that’s whooly theory, I’ll look at my real 25TB database for number of rows per leaf block…
select leaf_blocks, num_rows\leaf_blocks rows_per_leaf from dba_ind_statistics where num_rows > 100000000 order by leaf_blocks desc LEAF_BLOCKS ROWS_PER_LEAF ------------ ------------- 30,211,949 325.602273 28,717,447 262.762331 26,308,608 382.505959 25,429,862 385.047045 24,054,626 309.92864 23,660,747 382.571434 22,458,058 411.063205 22,316,976 346.620625 18,875,827 379.952198 17,451,901 338.600909 17,189,536 257.29466
From the above let us take a low figure of 300 average enteries per block. For a BLEVEL 0 , a one block index, that would be 300 rows that are referenced. For a level 1 index, 300 entries in the root block will point to 300 leaf blocks, so that will be 300*300 rows…90,000 entires.
Go down another BLEVEL to 2 (and a height of 3) and you have a root node referencing 300 Branch nodes, referenceing 300 Leaf nodes each referencing 300 records. 300*300*300 = 27 million.
Another level (BLEVEL 3), another factor of 300 and that is 8.1 billion.
BLEVEL 0= 300
BLEVEL 1 = 90,000
BLEVEL 2 = 27,000,000
BLEVEL 3 = 8,100,000,000
BLEVEL 4 = 6,480,000,000,000 (6,480 billion entries).
You can see that with an 8k block size and being very pessimistic about the number of entries per block (I used a low average to calculate enttries per block where as in reality most indexes will fill the block at the root and branch nodes to close to full before throwing a new level) your have to increase your data volume by amost 300 to throw a new level of index.
On average, knowing nothing about the actual number of records in an index, you would have to reduce your number of indexed rows by 150 times (half of 300) to have a 50% chance of lowering the BLEVEL of a locally partitioned index by one. ie 150 or more partitions.
with a 32k block size, you would have to reduce your data volume by more like 620 times to drop the BLEVEL (not 600 as you have less wastage from block overhead with larger blocks).
To reduce it by two it would be 300*150 I think, wiht 8k block size {can any statisticians, like Mr Lewis, help me out on this one?}. If I am right, that is 45,000 partitions. In that 25TB datawarehouse, no table has more than 4,000 partitions.
That is why I say, when you partition a table and have locally partitioned indexes, you might reduce the index level by 1. Might.
However, none of these systems has more than 4000 partitions per table. That might sound a lot, but it is only going to reduce an index BLEVEL by 1. Almost certainly it will, but if you have read the previous postings, that is not going to really help index lookups be that more efficient
I’ll just add a couple of last comments.
- If you have an index BLEVEL of 4 or 5 or more and do not have a VLDB, you might want to look at why {hints, it could be an IOT, it could be an index on several concatenated VARCHAR2 columns, it could be you are using 2k block size, it could be that you regularly delete a large pecentage of your data and then re-insert more data, it could be you have a one-table database, hehe.}.
- Point one does not mean you should rebuild said index. I almost never rebuild indexes. The need to do so is one of those Oracle myths. There are situations where index rebuilds are advantageous, but they are pretty damned rare. If you regularly rebuild all your indexes or you are thinking of doing so, can I suggest you book a “meeting” for an afternoon and sit down with a cup of tea and google/bing/yahoo or whatever and check out index rebuilding first? Ignore any sites you land on offering database health checks or training courses on cruise liners.
That’s all for now.
Friday Philosophy – Memory (of the Human Kind) November 14, 2009
Posted by mwidlake in Perceptions, Uncategorized.Tags: knowledge
7 comments
One of the Joys of having been around for a while (I’ve hit that point where I occasionally find myself working with people who were not born when I got my first job in IT) is that you start to re-learn things. Well, I do. {I have to point out that, this being a Friday Philosophy, you will learn nothing yourself, especially not about Oracle. from this opost. If you want Oracle knowledge you might want to pop by Dion cho, Tanel Poder or Richard foote}
Several times over the last couple of weeks I have learnt something “new” and realised a little while later that I already knew it, but had forgotton (BLEVEL is one, consistent gets being impacted by array fetch size is another) . This is not something to be proud of – not knowing something you once did is far worse, and slightly worrying, then never having known it…
I like to think that it is all the new stuff I am learning, pushing previous knowledge out of my brain’s buffer cache {if it is not warmed by re-using it}, but I suspect that it might just be an uptime-related memory leak. A bit like having your database open for a few months and doing a lot of work that includes massive PL/SQL in-memory tables and hash joins (pickler fetches). No one seems sure of the exact mechanics, but after a while you have a hell of a lot less spare memory in your serever than you started with
Maybe the memory loss is closer to the “pickler query” analogy than I thought, you can preserve soft tissue for a long time in alcohol. I’ll have another glass of wine and think on that.
This Forgetting Stuff was actually a major factor to my starting a blog. I was in the process of putting a load of notes on Oracle Odditites and things I had learnt from years past onto a web site so I could get at it from wherever I was in the world, and a friend told me I was being stupid – I should put them on a blog. So I did. There are only two little issues with this.
- I can’t remember what I have already put on my blog.
- I’ve forgotten where I put the ZIP disk with the notes on.
So I was contemplating this drift of knowledge and two things struck me.
1) I reckon that the very best people, in any area of expertise, are blessed with excellent memories. There is a comedian in the UK called Stephen Fry, and he is renowned for being stunningly clever. I think he probably is stunningly clever, but he also remembers everything he has learnt (Rory McGrath is another UK comedian with a perfect memory, but somehow he lacks the charm of Stephen Fry, and he has a beard, so not so highly renowned).
2) My poor memory is not maybe such a bad thing. I don’t have to get so upset when a whole chunk of my Oracle knowledge becomes obsolete. I used to be really good at sizing objects and utilizing space, taking into account the 5-block minimum, getting the extents to fit nicely into the datafiles, little scripts to resize segments into a small but sensibele number of extents to reduce wasted space, considering initrans, pctfree, pctused… Pretty much all pointless now
Big Discs are Bad September 27, 2009
Posted by mwidlake in development, performance, Uncategorized.Tags: performance, Storage, system development, VLDB
8 comments
I recently came across this article on large discs for database by Paul Vallee. The article is over 3 years old but is still incredibly valid. It’s a very good description of why big discs are a problem for Oracle Database Performance. {Paul also introduces the BAHD-DB campaign – Battle Against Huge Disks for Databases, which I quite like}.
To summarise the article, and the problem in general, IT managers will buy big discs as they provide more GB per pound sterling. It saves money.
However, less discs is Bad For Performance. As an extreme example, you can now buy a single disc that is a TB in size, so you could put a 1TB Oracle database on one such disc. This one disc can only transfer so much data per second and it takes this one disc say 10ms to search for any piece of data. If you want the index entry from one place and the table row from another, that is at least two seeks. This will not be a fast database {and I am not even touching on the consideration of disc resilience}.
Now spread the data over 10 discs. In theory these 10 discs can transfer 10 times the total data volume and one disc can be looking for information while the others are satisfying IO requests {This is a gross over-simplification, but it is the general idea}.
IT Managers will understand this 1-to-10 argument when you go through it.
Kind of.
But then discussions about how many modern “fast” discs are need to replace the old “slow” discs ensure. It can be very, very hard to get the message through that modern discs are not much faster. A 1TB disc compared to a 4-year-old 100GB disc will not have a transfer speed 10 times faster and it will certainly not have a seek time ten times less, chances are the seek time is the same. And then there are the discussion of how much impact the larger memory caches of modern storage units have. Answer,(a) quite a lot so long as it is caching what you want and (b) even if it is perfectly caching what you want, as soon as you have read a cache-sized set of data, you are back to disc IO speed.
Bottom line. Disc Drives are now slower in proportion to the disc acerage than they used to be.
Anyway, I am certainly not the only person to have had these discussions, though I have had them for longer than most {due to my accidental work history of having worked on VLDBs for so long}. There are certainly practitioners of Oracle Arts who understand all of this far better than I and one of them, James Morle, recently started blogging. It’s looking good so far. If he keeps it up for a month, I’ll put him on my blog roll
There is, however, one aspect of the Big Disc Performance issue that does not seem to get much mention but is something I have suffered from more than a couple of times.
As a Database Performance person you have had the argument about needing spindles not disc acreage and won. The IT manager buys enough spindles to provide the I/O performance your system needs. Success.
However, the success has left a situation behind. You need 10 spindles over a couple of RAID 10 arrays to give you the IO you need. 250GB discs were the smallest you could buy. So you have 1.25TB of available storage (RAID 10 halves the storage) and have a 500GB database sitting on it. There is 750GB of empty storage there…
That 750GB of empty storage will not be left inviolate. Someone will use it. Someone will need “a bit of temporary storage” and that nice chunk of fast storage will be too inviting. Especially if it IS fast storage. It will be used.
Now your database, who’s storage you specified to support said database, is sharing it’s storage with another app. An app that steals some of your IO and potentially {heck, let’s say this straight WILL} impact your database performance. And the galling thing? Twice, I had no idea my storage had become shared until I started getting odd IO latency issues on the database.
You may be able to make a logical argument for the spindles you need at design time. But you have almost no chance of protecting those spindles in the future. But who said working life was easy?
Friday Philosophy – Friends who Know Stuff August 29, 2009
Posted by mwidlake in Uncategorized.Tags: ethics, knowledge
2 comments
Being a consultant/contractor can be hard work. You are expected to turn up on-site and become useful before the second cup of coffee has been drunk. Then, if you manage that, there is an expectation that you will continue to pull IT Rabbits out of computer Hats for as long as you are there. Which is actually reasonable, we are after all usually paid well for being consultants/contractors.
This need to always have an answer can become quite hard, especially as those of us who chose not to be permanent staff have a bit of a disadvantage, one which those who have normal jobs may not appreciate.
Permies have a group around them called “the team” who they can call upon and talk about issues. Permies tend to stay in an organisation for many years and build up strong contacts with people, who they can call upon years after they have moved on. For us nomads, it can be far harder to make strong links to people you can call upon. That is not to say most teams are unfriendly when you go on site, it is just that by the nature of starting off as a temporary member of the group and moving on after a year, 6 months, even a week or two, developing strong ties to people so that you feel able to badger them 2 years later is less likely to happen.
Don’t under estimate the benefit of being able to call on old friends to get a second opinion (or of being called yourself to help assist some old friend who has got to get to grips with some section of IT knowledge that you had to deal with for 2 years). It really helps.
Some of you are probably thinking “well, you Consultant types just ask each other, the other experts that you all know”. Well, sometimes, but we tend not to work with each other much. Contractors rarely get to work with each other repeatedly on different projects unless you find yourself in a position where the client needs someone with X skills and you have a friend with X skills who is looking for a new position.
This is something I have become very aware of, having gone from Contractor for half a decade, to Permy for 6 years and back again to contractor/consultant. I miss having a stable team of collegues to discuss things with.
So, Friends with Skills are important. And it is a two-way thing, if you expect to be able to call on some old collegues for help, then you need to be helpful when they call on you.
Is this a case of “who you know not what you know”? Yes and no. It is not about contacts getting you a leg up. It’s about developing and keeping a group of work-related friends where you all help each other when there is a need. Proper friendship is about sharing, not using.
Learning stuff you know you don’t know July 17, 2009
Posted by mwidlake in Uncategorized.6 comments
A Friday, off-topic post
OK, how many people remember the much maligned speech by Donald Rumsfeld about {and I paraphrase} “things we know, things we don’t know, things we know we don’t know and things we don’t know we don’t know”.
If you want, you can watch the dear chap himself saying it here. {I’m afraid this is a slightly maligning version, but not as bad as most I found}.
Leaving aside it was Mr Rumsfeld who said it and thus accounted for some of the endless scorn poured upon the speech, I think it was a very, very valid observation by his speech writers. I always think that people laughing at the “idiocy” of the speech have missed something. The idea is nothing new, Henry David Therou for example said
“To know that we know what we know, and that we do not know what we do not know, that is true knowledge.”
{an excellent quote, for which I thank Wikipedia and Google – I knew I did not know any good quotes but I knew they existed and I just did not know them}.
I think it is very valid to realise you don’t know stuff in your chosen area of knowledge but that you recognise that lack and that there is thus more to learn. It also means that people who do already know an awful lot in one area of knowledge, they probably have gaps you can help fill. So speak up when you think you know something, even to acknowledged experts. It gives you a really warm glow inside when you do teach somone you regard as an expert something new. And you know what? Most of those experts will be really appreciative for you even trying, let alone succeeding.
I could of course be saying all this to give my utterences about not knowing stuff a veneer of philosphical respectability, but I really do believe and have done for many years now, that we should be able to say we do not know something and it be respected that we acknowledge that gap. I don’t take glory in my stupidity {for even my ego could not hold up under all that amount of glory
} but I like to feel that if I can be honest about my stupidity I can thus be honest about my knowledge, and be quietly happy with what I know and what I can do.
OK, so none of this is specific to Oracle but it is no less valid for being applicable to all knowledge. I guess I’m trying to say that I think it is OK {and should be publically acceptable} to admit not knowing stuff and it is equally OK and good for all of us to try and teach others, even if we feel individually that we maybe do not have that much to add.
What prompted this philosophical outpouring? I found out today that I don’t even know how to peel a banana. And probably most of you don’t either. But Monkeys do.
Go on, watch the video, the boss is in a meeting. It makes you think….
Cost of full table scans #2 June 30, 2009
Posted by mwidlake in performance, Uncategorized.Tags: cost, performance, SQL
2 comments
This is the second post in a series on Full Table Scans. Sorry to anyone waiting for part two, I got distracted {I suspect very few people were actually waiting
} I make up for it by making posting this veeeery long.
Post One – I got several comments, mostly pre-empting what I planned to say next {I am not complaining as such, I like to get feedback and Randolf Geist in particular was being very helpful}, but I’m going to say what I wanted to say anyway, especially as I was being purposefully misleading in my first post…
I’m attempting to show three things in this series of posts.
- How table scans are costed by the CBO under 10g and 11g, as it is quite different from before.
- That simple test cases of what you knew once can fail to work and leave you confused and questioning your knowledge.
- How levels of “truth” help you understand more as you dig deeper.
- As a fourth I might touch on how Oracle’s calculated cost is not actually always reflecting how it executes sql, but I need to work on that a little.
If you thought that db_file_mutliblock_read_count is the number of blocks that oracle reads in one go when scanning tables and indexes and that this improves performance, that will help you understand what is going on inside your Oracle database.
If you think that increasing the value will make full table scans look more efficient to Oracle and thus it will favour them, you are again right and it may well help you tune a system.
However, you may know more – you may know that Oracle modifies the value of the MBRC to make it more realistic. As the MBRC gets larger, things like the size of the segment, hitting the end of individual extents and the physical time taken to read data off disk and chuck it across the network makes the efficiency gains of these larger values less significant. Each layer of knowledge helps a little more. Then the technology underneath shifts and you may not realise this as the older knowledge still kind of helps.
I’m not trying to be smart or “I know better” here, I’ve posted before that none of us knows this stuff until we are told and I certainly did not know all of what is in these posts when I started preparing them 3 or 4 weeks ago. Some of what I plan to say I am still not sure of. But I’ll postpone the philosphy now for a later blog post, though as I have a rough version already written, here it is if you want.
Back to the technology and the costing of full table scans.
Here is my test script again.
It creates a couple of tables, the one of interest is TEST_BIG, holding 40,000 rows. TEST_BIG is 2304 blocks big but only 2267 blocks are occupied. As nothing has been deleted from the table, this is also the high water mark, ie the highest point in the table into which data has been inserted. Oracle keeps track of this high water mark. The table is in a locally managed, autoallocate tablespace with automatic segment space management. All the defaults.
This is my environment:-
NAME VALUE
—————————— ——————–
compatible 11.1.0.0.0
cpu_count 2
db_block_size 8192
db_file_multiblock_read_count 128
optimizer_mode ALL_ROWS
sga_target 0
sort_area_size 65536
Just to re-itterate, I have not gathered system statistics -the system is “out of the box”.
From Oracle 10 onwards, there are some default system statistics set. You can see these by looking at the sys.aux_stats$ table (click on “view plain” to see it properly laid out):-
select sname,pname,pval1,pval2 from sys.aux_stats$ order by sname,pname SNAME PNAME PVAL1 PVAL2 ----------------------------------------------------- SYSSTATS_INFO DSTART 10-15-2007 11:32 SYSSTATS_INFO DSTOP 10-15-2007 11:32 SYSSTATS_INFO FLAGS 1 SYSSTATS_INFO STATUS COMPLETED SYSSTATS_MAIN CPUSPEED SYSSTATS_MAIN CPUSPEEDNW 1845.90945 SYSSTATS_MAIN IOSEEKTIM 10 SYSSTATS_MAIN IOTFRSPEED 4096 SYSSTATS_MAIN MAXTHR SYSSTATS_MAIN MBRC SYSSTATS_MAIN MREADTIM SYSSTATS_MAIN SLAVETHR SYSSTATS_MAIN SREADTIM
There are only 3 values set {on both version 11g and 10g},
- CPUSPEEDNW – an estimate of the speed of the CPU. This value is different on different servers. Maybe. Both my old V10 windows test boxes have a value of 484, even though one is a truely old laptop, yet the linux servers I have access to all have different values, ranging from 713 to 1778.
- IOSEEKTIME – This is the time for the IO subsystem to locate the start of an item. In both v10 and 11, this is 10 milliseconds
- IOTFRSPEED – This is the volume of data that can be read from the io subsystem in 1 millisecond – 4096 bytes or 4k, again in V10 and 11
These default figures for IOSEEKTIME and IOTFRSPEED are fairly conservative, especially as they have not changed since V10, but then seek time has not really improved (especially on single discs).
In my experience, and looking around on the net, it seems that many sites on V10 upwards have not gathered system stats {BTW see the end of this post for some clarification on system statistics}. This seems to be due to one of several factors:
- Some sites do not have a DBA or one who knows about system statistics {I’m not criticising, if you have been labelled a DBA and thrown the whole of the Oracle stack to look after, there is a lot to get your head around}.
- Some DBAs think the system stats are gathered by the automatic stats job that {sort of} runs every night .
- Changing to using system statistics would change execution plans and thus need full regression testing, some re-tuning, which there is no time or money for.
- Hey, it works, why change things {this can also be viewed as “I am fundamentally scared of change”
}.
I can understand all of the above, and as a result of them many of us benefit from understanding how the default system statistics work. I would suggest it is better to gather system statitics than not as you are giving the oracle CBO optimizer more accurate information about the actual speed of your cpu, your IO and the difference between single block reads and multi-block reads, but life is not perfect. {NB I’ve heard on the grapevine that you should not gather system stats if you run Oracle Apps 11i but I am no expert on Oracle Apps!}
So what does Oracle do with the “out of the box” information? It calculate values for SREADTIME {time to read a single block} and MREADTIM {the time taken to do a multi-block read. If you think about how discs work, to read something the head of the disc must locate where the data is and physically move to it – this is the seek time – and then read the data.
So for a single block (8k) the time taken is
IOSEEKTIME+(size of one block/IOTFRSPEED).
ie 10+(8192bytes/4096 bytes) = 12.
For a multi block read, then the disc head will seek once and then read db_file_multiblock_read_count number of blocks at once {just accept this for now, please}. In my case, 128*8k.
IOSEEKTIME+( (MBRC * Size of one block) / IOTFRSPEED)
ie 10+( (128*8192) / 4096)
= 10+(1048576/4096)
=10+(1024k/4k) =10+256 =236
I don’t know about you but I can’t divide 1048576 by 4096 just like that but 1024/4, yep easier.
The CBO can only cost something if it can treat all parts in the same units, so it converts those multiblock read counts to equivalent single block read counts by dividing the time for the multi block read by the time of the single read.
236/12 = 19.6667.
Oracle then rounds the figure to the next whole number, so 20.
{Incidentally, it does this with the CPU cost as well.
Oracle makes a guess at how much cpu effort an activity will take and then divides it by the CPU speed and the time for a single block read – I’m being a little simplistic here, but it will do for now}.
So, a single block read is cost 1
A multi block read of 128 is cost 20, and reads 128 times as much.
In previous versions of Oracle this was all different. In 8i, single block reads and multi block reads were costed the same. In 9i without cpu costing {no cpu costing being the default and I can’t remember seeing it changed on a live system I saw} a “fudge factor” was introduced where values for MBRC were reduced down to better match reality. And in my first post I misleadingly referred to a Fudge Factor and gave a table. Anyone who was used to this fudge factor may have noticed my numbers did not match, but then who of us remembers that the fudge factor for 8 as the multi block read count is 6.19?
{That table is still useful, it shows an effective ratio for the MBRC, IF you have gatherd no stats AND you have an 8k block size. OK, maybe it is not so useful…}
Anyway, as you can now see from the above formulae, Oracle on 10g and 11g now uses a fairly sensible piece of logic to decide on the relative costs of single and multi block reads. The CBO formula is basically saying you save the cost of multiple seek times when reading multiple blocks, but the time to transfer data is directly related to the volumeof data read. This is probably reasonable on a system with a simple IO setup (like single disks or very basic RAID) but will break down with high-end storage which has read-ahead and SAN caches and the like.
OK, I have to address one more issue before I demonstrate that the formula is correct. And it is something that I did not know and confused the hell out of me when I decided to do a “simple test” of what I thought I knew, and prompted me to go on endlessly about how doing a simple test to prove what you have learned can go wrong and send you crazy with confusion. And thanks again to Randolf Geist for pointing me in the right direction.
At the end of the last post I had shown that my multiblock read count was 128 but acted like it was 8.
ie I got a cost of 617 for scanning TEST_BIG. When I manually set my MBRC to 8 I also got a cost of 617. When I manually set a value for MBRC of 128, it gave a cost of 395. Huh?!
Coskan alluded to this bit, but I missed it – I was ignorant about what he was saying.
On a newly created Oralce 11.1 test database, created as part of a simple instal on windows vista, you have db_file_multiblock_read_count set to:-
show parameter db_file_multiblock_read
NAME Value
————————————————————
db_file_multiblock_read_count 128
But there is a hidden parameter:-
NAME VALUE
————————————
DESCRIPTION
—————————- —– —–
_db_file_optimizer_read_count 8
multiblock read count for regular clients
{To see hidden parameters you need to be logged in as sys or use a trick like intermediate views – here is a script you can run as sys to see them. Warning, there are a LOT of them! I’ll do yet another post about hidden parameters sometime soon.}
How the hell are we supposed to know this???! Do Oracle do this on purposed to keep us awake at night worrying about this???.
*sigh*.
On Oracle 10, this hidden initialization parameter defaults to 8 and MBRC defaults to 8 {usually – I just recreated a new test 10.2.0.1 database and both are set to 16 “out of the box”}. If you change MBRC, the hidden parameter alters too, to the same value, so you do not notice it so much.
On 11, MBRC is set to 128 and the hidden parameter is set to 8 initially. Once you change MBRC, the hidden _db_file_optimizer_read_count is also changed to match.
So, as my first paragraph on the first posting said, you go on to a test database to check something and you get confused as the very first “simple” test does not give you the results you expected.
OK, does the formula work?
In the below table I list:
- The MBRC that I set by altering my session
- The cost as reported by Explain Plan
- The calculated MREADTIM = IOSEEKTIM+ (MBRC*blocksize/IOTFRSPEED)
- The calculated SREADTIM = IOSEEKTIM+(blocksize/IOTFRSPEED)
- MREATIME/SREADTIM { I’ll call it MT/ST}
- Blocks in table/MBRC {I’ll call it blks/MBRC
- (blks/MBRC) * (MT/ST) to get a calculated cost in single block reads
MBRC Cost MRT SRT MT/ST blks/ Calc cost
mbrc
2 1325 14 12 1.167 1133.5 1323
3 1010 16 12 1.333 755.667 1008
4 853 18 12 1.5 566.75 851
6 695 22 12 1.833 377.833 693
8 617 26 12 2.167 283.375 615
10 569 30 12 2.5 226.7 567
12 538 34 12 2.833 188.917 536
16 499 42 12 3.5 141.688 496
20 475 50 12 4.167 113.35 473
24 459 58 12 4.833 94.458 457
32 440 74 12 6.167 70.844 437
33 438 76 12 6.333 68.697 436
64 410 138 12 11.5 35.422 408
128 395 266 12 22.167 17.711 393
Yes ![]()
The calculated cost is always 2 less than the cost given by the CBO. I know that under Oracle 10 1 is added to the cost for reading the table header. I am presuming that an additional 1 has been added for some reason under 11.
That will do for now.
Oh, as promised, some comments on system statistics
Oracle System statistics are NOT statistics collected on the tables and indexes owned by users SYS or SYSTEM. These would be “Dictionary Statistics”. These can be collected via specific calls to the DBMS_STATS.GATHER_DICTIONARY_STATISTICS and {if you have not turned off the automatics gathering of object statistics} are gathered automatically during weekday nights and over the weekend by the automated job, which gathers table and index stats when it feels they are needed. I’ll be posting about that {probably at inordinate length} in a month or so.
System Statistics are statistics about how the underlying computer system works, specifically cpu speed and IO subsystem performance.
Even Oracle Support can get confused over this. You might like to check out this unfortunate experience on the topic. I have to say, I have much sympathy with Piet as I had very similar issues trying to discuss automated stats gathering with Oracle a couple of years ago. Having said that, I in the end got good support on the topic.
UNIX SIG May 2009 May 20, 2009
Posted by mwidlake in Uncategorized.2 comments
This is just a quick post today, I was going to continue on my theme of select count(*) but it’s been a long, tiring day. I’ll do it tomorrow.
I was off over to Wolverhampton to present at the oracle user group UNIX SIG (special interest group) today. I felt there was a need for more introductory presentations as they tend to be expert ones these days, people talking about the latest, coolest things they had found out. Great though such talks are, for many users, these can be a little beyond where they are and discussions I have had suggest that a lack of easily accessible talks can turn people off SIGs. So I offered an introductory talk on tuning, how to get going with it. The talk was graceously accepted by Patrick Hurley, currently running that SIG.
The day did not start well. OK, the night before did not start well, I was suffering from trying to put too much into the talk and I was up until 1am stripping things out. I finally got it into a state I felt I could present but it lacked a few diagrams I wanted. It would have to do.
I set of OK in the morning, just past 7am. However, I got stuck on the way, a crash on the A14 had bloked the road and I missed my chance to escape. Thus I was late. I managed to call ahead but several presenters were having trouble today.
I arrived during the morning coffee break and I was up next, so it was a quick job to set up the laptop, grab a coffee and get into the right frame of mind.
The talk went OK I think, could have been a little more clear, a little more polished, but we will see what the feedback says. I was trying to get what I spend at least a whole day (and preferrably 2 or 3 days) running a course on into just under an hour.
After that and on top of the 4 hours in the car I was bushed. I hung around for lunch, had a couple of chats with fellow OUG people, including the deputy chair of the Mangement & Infrastructure SIG, Gordon Brown, which is good as we have our SIG in 2 weeks. However, I had seen Pete Finnigan’s talk at the Northern SIG, did not have much interest in the talk on Oracle Certification (sorry Joel, just not my bag) and so I did what I dislike other people doing, I left early.
A better journey back was had but then I had to get going on my latest assignment, some project management for an established client.
Hmm, this is all a bit “what I did today”. Not very interesting to other people I guess. Maybe I’ll edit it tomorrow.
