Format of Date Histograms February 6, 2010
Posted by mwidlake in internals.Tags: data dictionary, histograms, statistics
9 comments
Oracle annoyingly handles dates internally in several formats. Dates are stored in tables as seven bytes, each byte representing century, year-in-century, month, day, hour, minute and second, shifted by different amounts. For the high/low values seen in DBA_TAB_COLUMNS or DBA_TAB_COL_STATISTICS, it is stored as a RAW value, where a two-digit hex string represents the century, year-in-century,month etc as before – see this post on decoding high and low column values and check out the comments for corrections.
So what about histograms? You might want to know what is in the histogram for a date column in order to better understand the decisions made by the CBO. Below, I pull out the histogram for an example date column {I’ve trimmed the output a little to save space}:
select table_name ,column_name ,endpoint_value end_val ,endpoint_number rowcount from all_tab_histograms where table_name ='TEST_TABLE_X' and owner ='TEST_1' and column_name = 'PLACED_DT' order by endpoint_number TABLE_NAME COLUMN_NAME END_VAL ROWCOUNT --------------- ------------- ------------- --------------- TEST_TABLE_X PLACED_DT 2,452,258 0 TEST_TABLE_X PLACED_DT 2,454,334 1 TEST_TABLE_X PLACED_DT 2,454,647 2 TEST_TABLE_X PLACED_DT 2,454,737 3 TEST_TABLE_X PLACED_DT 2,454,820 4 TEST_TABLE_X PLACED_DT 2,454,867 5 TEST_TABLE_X PLACED_DT 2,454,929 6 TEST_TABLE_X PLACED_DT 2,454,981 7 TEST_TABLE_X PLACED_DT 2,455,006 8 TEST_TABLE_X PLACED_DT 2,455,024 9 TEST_TABLE_X PLACED_DT 2,455,039 10 TEST_TABLE_X PLACED_DT 2,455,050 11 ... TEST_TABLE_X PLACED_DT 2,455,205 42 TEST_TABLE_X PLACED_DT 2,455,207 43 TEST_TABLE_X PLACED_DT 2,455,209 44 TEST_TABLE_X PLACED_DT 2,455,211 45 TEST_TABLE_X PLACED_DT 2,455,213 46 TEST_TABLE_X PLACED_DT 2,455,215 47 TEST_TABLE_X PLACED_DT 2,455,216 48 TEST_TABLE_X PLACED_DT 2,455,218 49 TEST_TABLE_X PLACED_DT 2,455,220 50 TEST_TABLE_X PLACED_DT 2,455,221 51 TEST_TABLE_X PLACED_DT 2,455,222 52 TEST_TABLE_X PLACED_DT 2,455,222 53 TEST_TABLE_X PLACED_DT 2,455,222 54 TEST_TABLE_X PLACED_DT 2,455,222 55 TEST_TABLE_X PLACED_DT 2,455,223 56 TEST_TABLE_X PLACED_DT 2,455,223 57 TEST_TABLE_X PLACED_DT 2,455,223 58 TEST_TABLE_X PLACED_DT 2,455,223 59 ... TEST_TABLE_X PLACED_DT 2,455,223 69 TEST_TABLE_X PLACED_DT 2,455,223 70 TEST_TABLE_X PLACED_DT 2,455,223 71 TEST_TABLE_X PLACED_DT 2,455,223 72 TEST_TABLE_X PLACED_DT 2,455,223 73 TEST_TABLE_X PLACED_DT 2,455,224 74 TEST_TABLE_X PLACED_DT 2,455,226 75
Well, it looks like a seven digit number so it must be representing the date in a similar way to as described above, yes? No. The format is obviously something different {Oh come ON Oracle, some internal standards would be nice once in a while!
}
I pulled out the minimum and maximum values from the DBA_TAB_COLUMNS table and translated them:-
COLUMN_NAME NUM_DIST LOW_V HI_V ------------ ---------- ------------------- ------------------- PLACED_DT 375,428 2001-12-13 17:31:38 2010-01-28 23:51:38
So the above low values and high values will pretty much match the first and last values in the histograms, which are 2452258 and 2455226. Any guesses? Those values from the histogram are very close to each other, only 2968 different to cover around 9 years of values. 9 times 365…
The histogram is representing the dates in Julian format, ie number of days since 1st Jan 4712BC. As a quick proof of this:
select to_date('2452258','J'),to_date('2455226','J')
from dual;
TO_DATE('2452258' TO_DATE('2455226'
----------------- -----------------
14-DEC-2001 00:00 29-JAN-2010 00:00
Well, what do you know. Very close to the 13th Dec 2001 and 28th Jan 2010
This of course makes sense, storing the date as an ascending numeric means it is simple to calculate the width of the range and how many values per day there are.
Imagine how complex it would be to do this if the date was stored in the bizarre way we humans deal with it – an ascending numeric for year, a cycling 12 digit number for month and a varying cyclic number for the day of the month. And that is ignoring other calendars in common use.
However, I’m looking at this method of representing the date and something bothers me. There is no allowance for the time portion. Maybe column histograms can’t cope with time? I feel a couple of tests coming on…
ADDITIONAL.
If you have seen the comments, you will know that Jonathan commented to ask if I might have truncated/been selective with the data in creating my test table as he is sure there is a fractional section representing the time portion.
Well, I did say I needed to check further so maybe I would have spotted my error on my own… ![]()
This is the script I use to pull out histogram data (and yes, I trimed the output from the script before copying it into this post, so it does not quite match the script):
-- chk_hist.sql
-- Martin Widlake 7/4/4
-- pull out the histograms on a table
set pause on pages 32
spool chk_hist.lst
col owner form A8
col table_name form A15
col column_name form a20
col rowcount form 99,999,999,999
col end_val form 99,999,999,999
select owner
,table_name
,column_name
,endpoint_value end_val
,endpoint_number rowcount
from all_tab_histograms
where table_name like upper(nvl('&tab_name','WHOOPS')||'%')
and owner like upper('&tab_own')||'%'
and column_name like upper('&col_name')||'%'
order by table_name,column_name,endpoint_number
/
spool off
clear colu
Hmm wonder what happens if I change the “col end_val form 99,999,999,999″ so that it would show fractions…
col end_val form 999,999,999.99999999
TABLE_NAME COLUMN_NAM END_VAL ROWCOUNT --------------- ---------- --------------------- ------------ TEST_TABLE_X PLACED_DT 2,452,257.73030093 0 TEST_TABLE_X PLACED_DT 2,454,333.76546296 1 TEST_TABLE_X PLACED_DT 2,454,647.32561343 2 TEST_TABLE_X PLACED_DT 2,454,737.25017361 3 TEST_TABLE_X PLACED_DT 2,454,820.02204861 4 TEST_TABLE_X PLACED_DT 2,454,866.98009259 5 TEST_TABLE_X PLACED_DT 2,454,928.66848380 6 TEST_TABLE_X PLACED_DT 2,454,980.94815972 7 TEST_TABLE_X PLACED_DT 2,455,005.68413194 8 TEST_TABLE_X PLACED_DT 2,455,023.67142361 9 TEST_TABLE_X PLACED_DT 2,455,039.03236111 10 TEST_TABLE_X PLACED_DT 2,455,050.39246528 11
Ahhhhh……
Thanks Jonathan ![]()
It is still a Julian date, but with the time as a fraction as he said.
Dealing with Bind Issues December 1, 2009
Posted by mwidlake in Meeting notes, performance.Tags: histograms, Meeting, performance
11 comments
One of the presentations I have seen today was on handling bind values. I can’t say it was, for me, the best I have seen this week, I’ve done a lot of work on binds in the past so I had come across most of the material before. Should I have bothered?
Well, there was one little gem in there which struck me as very simple and potentially very effective.
Bind variables and histograms are not a good mix, as has been commented on many, many times. In essence, if your data in a column is skewed so that some values match very few records and others match a large number, when oracle sees a SQL statement with a bind value being compared to that column, it peeks at the first value being passed in with the bind and uses it to decide on the plan (this is pre 11G, by the way).
That plan is then used for every execution of that sql statement until it is thrown out the SGA. Which is jolly unfortunate if all the values subsequently passed in via the bind do not suit the plan.
The solutions to this usually boil down to one of three approaches; remove the histograms on the column in question so that all values are treated equally;stop it being a bind/prevent bind peeking; force the “bad” plan out of the SGA and hope the next parse gets a better plan.
All have their merits and drawbacks.
Well, in this presentation there was a fourth solution. Something like this:
if :status in (1,2,3) then
select /*+ cardinality (t1 100000) */
from table_1 t1
, table_t t2
where t1.status=:status
and....;
else
select /*+ cardinality (t1 100) */
from table_1 t1
, table_t t2
where t1.status=:status
and....;
end;
I might be missing something blindingly obvious here (and this might be a common solution that has just passed me by), but it seems to me to be a simple and effective solution that could be used in many situations.
I also learnt that it is rare not to find at least one good thing out of any presentation, so long as you keep paying attention.
Decrypting Histogram Data #3 – is 7 or 15 Characters Significant? September 15, 2009
Posted by mwidlake in internals, performance.Tags: data dictionary, histograms, performance, statistics
2 comments
At last I am going to wrap up about looking at Histogram data and endpoint_value. Sorry about the delay, life is a tad complex at present. This is also a very, very long post. You may want to skip to the end to get the conclusions
I first posted about how to translate endpoint_values.
I then posted in the second post a week ago about how endpoint_value is limited to 7 or so characters for varchar2 columns.
{Addition – this later post discusses the content of endpoinit_value for dates}
Coskan contacted me to say he believed I had missed something, and of course he was right – but I’d laid big hints that I was using theory and not practical test and there was more to know. {There is usually more to know, only practical tests confirm whether your theory works or not. And even after that, it seems there is still more to know!}
OK, I have a test table, created with the following {oh, this is under 10.2.0.3, Enterprise Edition on linux}:
create table hist_test
as select rownum id
,lpad(chr(mod(rownum,6)+65),5,chr(mod(rownum,6)+65) )vc_1
,lpad('A',2,'A')||dbms_random.string('U',1) vc_2
,lpad('B',3,'B')||dbms_random.string('U',1) vc_3
,lpad('C',4,'C')||dbms_random.string('U',1) vc_4
,lpad('D',5,'D')||dbms_random.string('U',1) vc_5
,lpad('E',6,'E')||dbms_random.string('U',1) vc_6
,lpad('F',7,'F')||dbms_random.string('U',1) vc_7
,lpad('G',8,'G')||dbms_random.string('U',1) vc_8
,lpad('H',16,'H')||dbms_random.string('U',1)) vc_16
,lpad('I',40,'I')||dbms_random.string('U',1) vc_40
from dba_objects
where rownum < 1001
I need to skew the data in some columns so that different WHERE clauses should see different expected numbers of rows
I now need to add some skew to the data so you can see the histogram in action .
update hist_test set
VC_5=lpad(‘D’,5,’D')||’A’
,VC_7=lpad(‘F’,7,’F')||’A’
,VC_16=lpad(‘H’,16,’H')||’A’
,VC_40=lpad(‘I’,40,’I')||’A’
where mod(id,10)=1
/
Gather histogram stats, 10 buckets:
dbms_stats.gather_table_stats(ownname=>user,tabname=>’HIST_TEST’
,method_opt=>’FOR ALL COLUMNS SIZE 10′)
And looking at ALL_TAB_COLUMNS we can see the general stucture of the data (just the VC_ columns are shown)
COL_NAME NUM_DIST LOW_V HI_V N_BUCK AVG_L ------------------------------------------------------------- ID 1,000 1 1000 10 4 VC_1 6 AAAAA FFFFF 6 6 VC_2 26 AAA AAZ 10 4 VC_3 26 BBBA BBBZ 10 5 VC_4 26 CCCCA CCCCZ 10 6 VC_5 26 DDDDDA DDDDDZ 10 7 VC_6 26 EEEEEEA EEEEEEZ 10 8 VC_7 26 FFFFFFFA FFFFFFFZ 10 9 VC_8 26 GGGGGGGGA GGGGGGGGZ 10 10 VC_16 26 HHHHHHHHHHHHHHHHA HHHHHHHHHHHHHHHHZ 10 18 VC_40 1 IIIIIIIIIIIIIIIIIIII IIIIIIIIIIIIIIIIIIII 1 42
I have created a table with varchar2 columns set to a fixed sting of characters the length indicated by the column name with the last column varying from A to Z. So VC_5 contains EEEEEx, where x is A to Z. 26 distinct values each column.
If you look at the above, all columns are showing as having 26 values and 10 buckets EXCEPT VC_40, but it does have 26 distinct values:
select count(distinct(VC_40)) from hist_test;
COUNT(DISTINCT(VC_40))
———————-
26
I have my function to unpack the endpoint_value for varchar2 columns from the number stored to the character string it represents. My function is not perfect, but it is reasonably good. Go checkout post one for the function.
Looking at the histograms with my function you can see the below {I show only some records}
colname END_VAL ROWCOUNT MOD_REAL
------------------------------------------------------------------
VC_5 354460798875962000000000000000000000 1 DDDDDA
VC_5 354460798875972000000000000000000000 2 DDDDDD
VC_5 354460798875986000000000000000000000 3 DDDDDG
{snip}
VC_5 354460798876071000000000000000000000 9 DDDDDY
VC_5 354460798876080000000000000000000000 10 DDDDDZ
VC_7 364886116489977000000000000000000000 1 FFFFFGK
VC_7 364886116489977000000000000000000000 2 FFFFFGK
VC_7 364886116489977000000000000000000000 3 FFFFFGK
VC_7 364886116489977000000000000000000000 4 FFFFFGK
{snip}
VC_7 364886116489977000000000000000000000 9 FFFFFGK
VC_7 364886116489977000000000000000000000 10 FFFFFGK
VC_16 375311434103976000000000000000000000 1 HHHHHI:
VC_16 375311434103976000000000000000000000 2 HHHHHI:
VC_16 375311434103976000000000000000000000 3 HHHHHI:
{snip}
VC_16 375311434103976000000000000000000000 9 HHHHHI:
VC_16 375311434103976000000000000000000000 10 HHHHHI:
VC_40 380524092910976000000000000000000000 1,000 IIIIIJM
For VC_5, there are 10 buckets, all with different endpoint_values.
The VC_7 has 10 buckets, but most have the same endpoint_value.
The VC_16 has 10 records in the HISTOGRAM table, all with the same endpoint_value.
VC_40 has only one record in the HISTOGRAM table.
Basically, if I am right, WHERE filters on a column holding values that are the same for the first seven or so characters will not be well supported by the histograms. We should see the estimated cost for such a WHERE filter to be wrong.
I have skewed the data so that Histograms should show more expected values for DDDDDA than DDDDDS, for FFFFFFFA than FFFFFFFS etc
update hist_test
set
VC_5=lpad(‘D’,5,’D')||’A’
,VC_7=lpad(‘F’,7,’F')||’A’
,VC_16=lpad(‘H’,16,’H')||’A’
,VC_40=lpad(‘I’,40,’I')||’A’
where mod(id,10)=1
/
And now I’ll see how the CBO estimates the rows coming back for some WHERE clauses.
select count(*) from hist_test
where VC_5 = 'DDDDDA'
COUNT(*)
----------
139
Execution Plan
----------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)|
----------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 7 | 8 (0)|
| 1 | SORT AGGREGATE | | 1 | 7 | |
|* 2 | TABLE ACCESS FULL| HIST_TEST | 100 | 700 | 8 (0)|
----------------------------------------------------------
select count(*) from hist_test
where VC_5 = 'DDDDDS'
COUNT(*)
----------
35
Execution Plan
----------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)|
----------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 7 | 8 (0)|
| 1 | SORT AGGREGATE | | 1 | 7 | |
|* 2 | TABLE ACCESS FULL| HIST_TEST | 49 | 343 | 8 (0)|
---------------------------------------------------------------------
For column VC_5 it all works as we would expect. the CBO estimates 100 values for DDDDDA and 49 for DDDDDS. I’ve shown the actual count(*)’s as well to show there is a real variation in numbers. There are good reasons why the estimates do not match reality, I won’t go into them now but this does highlight that histograms can help but do have flaws.
What about the other rows? We are only interested in the WHERE clause and the estimate for the number of rows from the table access, so I’ll show only those.
where VC_7 = 'FFFFFFFA'
|* 2 | TABLE ACCESS FULL| HIST_TEST | 100 | 900 | 8 (0)|
where VC_7 = 'FFFFFFFS'
|* 2 | TABLE ACCESS FULL| HIST_TEST | 49 | 441 | 8 (0)|
where VC_16 = 'HHHHHHHHHHHHHHHHA'
|* 2 | TABLE ACCESS FULL| HIST_TEST | 100 | 7200 | 8 (0)|
where VC_16 = 'HHHHHHHHHHHHHHHHS'
|* 2 | TABLE ACCESS FULL| HIST_TEST | 47 | 216 | 8 (0)|
where VC_40 = 'IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIA'
|* 2 | TABLE ACCESS FULL| HIST_TEST | 1000 | 42000 | 8 (0)|
where VC_40 = 'IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIS'
|* 2 | TABLE ACCESS FULL| HIST_TEST | 1000 | 42000 | 8 (0)|
{and just to show there is no typo in the string of 'I's above...
where VC_40 = lpad('I',40,'I')||'S'
|* 2 | TABLE ACCESS FULL| HIST_TEST | 1000 | 42000 | 8 (0)|
In the above you can see that the CBO estimates the costs for VC_7 exactly the same as it did for VC_5. Despite the endpoint_values no longer distinguishing the buckets.
Even more telling is tha for VC_16 the CBO also is able to predict there are more values ending in A than S.
As for VC_40, the CBO utterly fails to spot the skew. In fact it utterly fails to spot that there is more than one value in the column (look back at the section showing min and max values and number of distinct values), and there is only one histogram bucket. so I supposed it is no surprise.
How does the CBO still detect the skew for VC_7 and VC_16, even though the endpoint_valuess are the same?
Because it swaps to using column ENDPOINT_ACTUAL_VALUE.
ENDPOINT_ACTUAL_VALUE is only populated for varchar2 columns when the precision of endpoint_value is exceeded:
colname ROWCOUNT MOD_REAL ENDPOINT_ACTUAL_VALUE ------- --------------------------------------------- VC_5 7 DDDDDR VC_5 8 DDDDDV VC_5 9 DDDDDY VC_5 10 DDDDDZ VC_6 0 EEEEEF8 EEEEEEA VC_6 1 EEEEEF8 EEEEEEC VC_6 2 EEEEEF8 EEEEEEF VC_6 3 EEEEEF8 EEEEEEH VC_6 4 EEEEEF8 EEEEEEK VC_6 5 EEEEEF8 EEEEEEN VC_6 6 EEEEEF8 EEEEEEP VC_6 7 EEEEEFn EEEEEES VC_6 8 EEEEEFn EEEEEEU VC_6 9 EEEEEFn EEEEEEX VC_6 10 EEEEEFn EEEEEEZ VC_7 1 FFFFFGK FFFFFFFA VC_7 2 FFFFFGK FFFFFFFC VC_7 3 FFFFFGK FFFFFFFF VC_7 4 FFFFFGK FFFFFFFI VC_7 5 FFFFFGK FFFFFFFK VC_7 6 FFFFFGK FFFFFFFO VC_7 7 FFFFFGK FFFFFFFR VC_7 8 FFFFFGK FFFFFFFT VC_7 9 FFFFFGK FFFFFFFW VC_7 10 FFFFFGK FFFFFFFZ VC_16 1 HHHHHI: HHHHHHHHHHHHHHHHA VC_16 2 HHHHHI: HHHHHHHHHHHHHHHHD VC_16 3 HHHHHI: HHHHHHHHHHHHHHHHG VC_16 4 HHHHHI: HHHHHHHHHHHHHHHHJ VC_16 5 HHHHHI: HHHHHHHHHHHHHHHHL VC_16 6 HHHHHI: HHHHHHHHHHHHHHHHO VC_16 7 HHHHHI: HHHHHHHHHHHHHHHHS VC_16 8 HHHHHI: HHHHHHHHHHHHHHHHU VC_16 9 HHHHHI: HHHHHHHHHHHHHHHHX VC_16 10 HHHHHI: HHHHHHHHHHHHHHHHZ VC_40 1,000 IIIIIJM
You can see that VC_5 columns have no ENDPOINT_ACTUAL_VALUEs but VC_6, VC_7 and VC_16 do. VC_40 does not.
So, at what point does the CBO stop storing values in ENDPOINT_ACTUAL_VALUES? 32 characters. I created another test table and these are the low/high values, number of distinct values and number of histogram buckets:
COL_NAME NUM_DIST LOW_V HI_V N_BUCK AVG_L
————————————————————–
ID 1,000 1 1000 10 4
NUM_1 10 0 9 10 3
VC_1 6 AAAAA FFFFF 6 6
VC_30 26 AAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAA 10 32
VC_31 26 BBBBBBBBBBBBBBBBBBBB BBBBBBBBBBBBBBBBBBBB 10 33
VC_32 1 CCCCCCCCCCCCCCCCCCCC CCCCCCCCCCCCCCCCCCCC 1 34
VC_33 1 DDDDDDDDDDDDDDDDDDDD DDDDDDDDDDDDDDDDDDDD 1 35
VC_34 1 EEEEEEEEEEEEEEEEEEEE EEEEEEEEEEEEEEEEEEEE 1 36
VC_35 1 FFFFFFFFFFFFFFFFFFFF FFFFFFFFFFFFFFFFFFFF 1 37[/sourcecode ]
The name of the column is the length of the fixed string, so the columns are actually one character longer, as can be seen by the “AVG_L” column.
And just to check {as after all this is just a nice theory until you test}, I skew the number of records ending in ‘A’ for VC_31 and VC_32:-
update hist_test2
set
VC_31=lpad(‘B’,31,’B')||’A’
,VC_32=lpad(‘C’,32,’C')||’A’
where mod(id,10)=1
And see what the CBO thinks the rows to be identified are:-
where VC_31 = lpad(‘B’,31,’B')||’A’
|* 2 | TABLE ACCESS FULL| HIST_TEST2 | 100 | 3300 |
where VC_31 = lpad(‘B’,31,’B')||’S’
|* 2 | TABLE ACCESS FULL| HIST_TEST2 | 49 | 1617 |
where VC_32 = lpad(‘C’,32,’C')||’A’
|* 2 | TABLE ACCESS FULL| HIST_TEST2 | 1000 | 34000 |
where VC_32 = lpad(‘C’,32,’C')||’S’
|* 2 | TABLE ACCESS FULL| HIST_TEST2 | 1000 | 34000 | [/sourcecode ]
Yes. The CBO can see the skew for the 32 characters of VC_31 and not for the 33 characters of VC_32
So in conclusion
- endpoint_value is only accurate for varchar2 columns up to around 7 characters.
- If the varchar2 exceeds the precision of the endpoint_value column then endpoint_actual_value is used.
- endpoint_actual_value is used for varchar2 columns up to 32 characters and not for columns less than 7 characters.
- Histograms will not help with columns having many values with the same leading 32 or more characters.
- If you column has the same leading 32 characters or more, even the number of distinct values is not gathered by DBMS_STATS.
- You can’t tell how SQL/CBO will work from theory, you have to test. And still consider you knowledge as open to correction after that
Decrypting Histogram Data #2 September 3, 2009
Posted by mwidlake in internals, performance.Tags: data dictionary, histograms, performance, SQL
7 comments
Hint – if you have a column where most or all of the entries have the same 15 plus characters, don’t bother gathering histograms on it. They will not help you a lot. Or at all. eg you have a column that holds “CUSTOMER IDENTIFIER xxxxxxxxx”, “CUSTOMER IDENTIFIER yyyyyyyy”, etc. Of course, good design suggests that the “CUSTOMER IDENTIFIER” bit is probably redundant and can be got rid of, but we live in the real world and may not have the power or ability to enforce that change, or we might have 3 or 4 such “prefix” strings.
Further, histograms on dnumerics longer than 15 significant digits will also potentially behave not as you would expect.
I better justify my assertion.
In the previous post on decrypting histogram data I covered how Oracle turns a varchar2 value into a numeric value that is held in the ENDPOINT_VALUE of DBA_TAB_HISTOGRAMS and I also gave you a cunning (but slightly flawed) function to convert it back. I use it in this post so you might want to go look at the prior post. Sorry it is long, I can’t stop rambling.
First, I’ll create some test data {Oh, this is on 10.2.0.3 on Linux}. The below script create a table HIST_TEST with columns NUM_1 through to NUM7, which hold numbers padded out to greater lengths and then 0-9 as the last value. Similarlry columns VC_2 to VC_8 are padded out and have a random character added. VC_1 is a random 5-character string.
create table hist_test
as select rownum id
,mod(rownum,10) num_1
,trunc(dbms_random.value(1,10)) num_2
,1000+trunc(dbms_random.value(1,10)) num_3
,1000000+trunc(dbms_random.value(1,10)) num_4
,1000000000+trunc(dbms_random.value(1,10)) num_5
,1000000000000+trunc(dbms_random.value(1,10)) num_6
,1000000000000000+trunc(dbms_random.value(1,10)) num_7
,lpad(chr(mod(rownum,6)+65),5,chr(mod(rownum,6)+65) )vc_1
,lpad('A',2,'A')||dbms_random.string('U',1) vc_2
,lpad('B',3,'B')||dbms_random.string('U',1) vc_3
,lpad('C',4,'C')||dbms_random.string('U',1) vc_4
,lpad('D',5,'D')||dbms_random.string('U',1) vc_5
,lpad('E',6,'E')||dbms_random.string('U',1) vc_6
,lpad('F',7,'F')||dbms_random.string('U',1) vc_7
,lpad('G',8,'G')||dbms_random.string('U',1) vc_8
--below skews the table to AB.
,lpad('H',16,'H')||decode(mod(rownum,3),0,'A'
,1,'B'
, dbms_random.string('U',1)) vc_16
,lpad('I',40,'I')||dbms_random.string('U',1) vc_40
from dba_objects
where rownum < 1001
/
begin
dbms_stats.gather_table_stats(ownname=>user,tabname=>'HIST_TEST'
,method_opt=>'FOR ALL COLUMNS SIZE 10');
END;
/
The below is a simple select against DBA_TAB_COLUMNS to see the information for the column {Oh, I say simple, but you have to use the functions utl_raw.cat_to_number and utl_raw.cast_to_varchar2 to turn the raw values held in the columns LOW_VALUE and HIGH_VALUE to something we humans can read. Why does Oracle Corp have to make life so difficult? *sigh*.
Oh, click on “view plain” on the box below to get a better layout.
select
column_name
,num_distinct num_dist
,decode(data_type,'NUMBER',to_char(utl_raw.cast_to_number(low_value))
,'VARCHAR2',to_char(utl_raw.cast_to_varchar2(low_value))
, low_value
) low_v
,decode(data_type,'NUMBER',to_char(utl_raw.cast_to_number(high_value))
,'VARCHAR2',to_char(utl_raw.cast_to_varchar2(high_value))
, high_value
) hi_v
,num_nulls n_nulls
,num_buckets n_buck
,avg_col_len avg_l
from dba_tab_columns
where table_name ='HIST_TEST'
and owner=USER
order by owner,column_id
/
COLUMN_NAME NUM_DIST LOW_V HI_V N_NULLS N_BUCK AVG_L
---------------------------------------------------------------------
ID 1,000 1 1000 0 10 4
NUM_1 10 0 9 0 10 3
NUM_2 9 1 9 0 9 3
NUM_3 9 1001 1009 0 9 4
NUM_4 9 1000001 1000009 0 9 6
NUM_5 9 1000000001 1000000009 0 9 7
NUM_6 9 1000000000001 1000000000009 0 9 9
NUM_7 9 1000000000000001 1000000000000009 0 9 10
VC_1 6 AAAAA FFFFF 0 6 6
VC_2 26 AAA AAZ 0 10 4
VC_3 26 BBBA BBBZ 0 10 5
VC_4 26 CCCCA CCCCZ 0 10 6
VC_5 26 DDDDDA DDDDDZ 0 10 7
VC_6 26 EEEEEEA EEEEEEZ 0 10 8
VC_7 26 FFFFFFFA FFFFFFFZ 0 10 9
VC_8 26 GGGGGGGGA GGGGGGGGZ 0 10 10
VC_16 26 HHHHHHHHHHHHHHHHA HHHHHHHHHHHHHHHHZ 0 10 18
VC_40 1 IIIIIIIIIIIIIIIIII IIIIIIIIIIIIIIIIII 0 1 42/
I now select the data out of the DBA_TAB_HISTOGRAMS table to see what is really being stored. For the sake of brevity {which means, “so I can finish this post tonight”} I’ll just show bits, but if you want, download the script right at the end and, if you created the functions from the prior post, you can play with this yourself.
First, here you can see some of the values actually stored in ENDPOINT_VALUE and what they translate into:-
TABLE_NAME colname ENDPOINT_VALUE ROWCOUNT REAL_VAL ------------------------------------------------------------ Numbers HIST_TEST NUM_5 1000000007 760 1000000007 HIST_TEST NUM_5 1000000008 870 1000000008 HIST_TEST NUM_5 1000000009 1,000 1000000009 HIST_TEST NUM_6 1000000000001 103 1000000000001 HIST_TEST NUM_6 1000000000002 219 1000000000002 HIST_TEST NUM_6 1000000000003 328 1000000000003 HIST_TEST NUM_6 1000000000004 427 1000000000004 Varchars HIST_TEST VC_2 338824386822815000000000000000000000 8 AAU HIST_TEST VC_2 338824624507302000000000000000000000 9 AAW HIST_TEST VC_2 338824782963627000000000000000000000 10 AAY HIST_TEST VC_3 344035480872391000000000000000000000 0 BBB@ HIST_TEST VC_3 344035481491361000000000000000000000 1 BBBB HIST_TEST VC_3 344035482419816000000000000000000000 2 BBBE HIST_TEST VC_3 344035483348271000000000000000000000 3 BBBH HIST_TEST VC_3 344035483967241000000000000000000000 4 BBBJ
Note that for numerics the number itself is stored and I do not need to translate it.
For VARCHAR2 columns the value held is the 15-digit number padded with zeros.
Also note, for VC_2 the range covered seems to end at AAY not AAZ and column VC_3 starts at BBB@ not BBBA {I am creating values with the last character set to A-Z}. Also, bucket 8 for VC_2 ends in a control character.
To reduce this I add a fudge to my function {again, see previous post}. It helps:
TABLE_NAME colname ROWCOUNT REAL_VAL MOD_REAL --------------------------------------------------- HIST_TEST VC_2 9 AAW AAX HIST_TEST VC_2 10 AAY AAZ HIST_TEST VC_3 0 BBB@ BBBA HIST_TEST VC_3 1 BBBB BBBC HIST_TEST VC_3 2 BBBE BBBF HIST_TEST VC_3 3 BBBH BBBI HIST_TEST VC_3 4 BBBJ BBBK HIST_TEST VC_3 5 BBBM BBBN HIST_TEST VC_3 6 BBBO BBBP HIST_TEST VC_3 7 BBBR BBBS HIST_TEST VC_3 8 BBBT BBBU HIST_TEST VC_3 9 BBBW BBBX HIST_TEST VC_3 10 BBBY BBBZ HIST_TEST VC_4 0 CCCC@ CCCCA HIST_TEST VC_4 1 CCCCB CCCCC HIST_TEST VC_4 2 CCCCD CCCCE HIST_TEST VC_4 3 CCCCH CCCCH HIST_TEST VC_4 4 CCCCJ CCCCJ
As you can see, I now get a better translation, but it still goes wrong sometimes (see last couple of rows). So, feel free to take my functions and use them, but be aware even the modified version is not perfect. If YOU perfect it, can I have a copy please
The below shows that Numeric value histograms break when you exceed 15 digits:
colname END_VAL ROWCOUNT REAL_VAL ------- ------------------------ -------- -------- NUM_5 1000000009 1,000 1000000009 NUM_6 1000000000001 103 1000000000001 NUM_6 1000000000002 219 1000000000002 NUM_6 1000000000003 328 1000000000003 NUM_6 1000000000004 427 1000000000004 NUM_6 1000000000005 542 1000000000005 NUM_6 1000000000006 651 1000000000006 NUM_6 1000000000007 771 1000000000007 NUM_6 1000000000008 881 1000000000008 NUM_6 1000000000009 1,000 1000000000009 NUM_7 1000000000000000 133 1000000000000000 NUM_7 1000000000000000 256 1000000000000000 NUM_7 1000000000000000 367 1000000000000000 NUM_7 1000000000000000 467 1000000000000000 NUM_7 1000000000000010 567 1000000000000010 NUM_7 1000000000000010 665 1000000000000010 NUM_7 1000000000000010 784 1000000000000010 NUM_7 1000000000000010 896 1000000000000010 NUM_7 1000000000000010 1,000 1000000000000010
This is the point at which storage of numbers for histograms breaks. You can see that NUM_6 is fine but NUM_7 is not. That is because NUM_6 is below 15 significant digits and NUM_7 is over 15 significant digits and effectively gets truncated.
Histograms on numeric values with more than 15 significant digits will not work as you expect, possible not at all.
With Varchar(2) values, histogram END_VALUES break even sooner, at around 7 digits:
colname END_VAL ROWCOUNT MOD_REAL ------- ------------------------------------- -------- -------- VC_5 354460798876024000000000000000000000 5 DDDDDO VC_5 354460798876038000000000000000000000 6 DDDDDR VC_5 354460798876047000000000000000000000 7 DDDDDS VC_5 354460798876061000000000000000000000 8 DDDDDV VC_5 354460798876071000000000000000000000 9 DDDDDY VC_5 354460798876080000000000000000000000 10 DDDDDZ VC_6 359673457682977000000000000000000000 0 EEEEEF8 VC_6 359673457682977000000000000000000000 1 EEEEEF8 VC_6 359673457682977000000000000000000000 2 EEEEEF8 VC_6 359673457682977000000000000000000000 3 EEEEEF8 VC_6 359673457682977000000000000000000000 4 EEEEEF8 VC_6 359673457682977000000000000000000000 5 EEEEEF8 VC_6 359673457682977000000000000000000000 6 EEEEEF8 VC_6 359673457682977000000000000000000000 7 EEEEEF8 VC_6 359673457682978000000000000000000000 8 EEEEEFn VC_6 359673457682978000000000000000000000 9 EEEEEFn VC_6 359673457682978000000000000000000000 10 EEEEEFn VC_7 364886116489977000000000000000000000 0 FFFFFGK VC_7 364886116489977000000000000000000000 1 FFFFFGK VC_7 364886116489977000000000000000000000 2 FFFFFGK VC_7 364886116489977000000000000000000000 3 FFFFFGK VC_7 364886116489977000000000000000000000 4 FFFFFGK VC_7 364886116489977000000000000000000000 5 FFFFFGK VC_7 364886116489977000000000000000000000 6 FFFFFGK VC_7 364886116489977000000000000000000000 7 FFFFFGK VC_7 364886116489977000000000000000000000 8 FFFFFGK VC_7 364886116489977000000000000000000000 9 FFFFFGK VC_7 364886116489977000000000000000000000 10 FFFFFGK
You can see that for column VC_5 the actual ENDPOINT_VALUE is varying at the 14th and 15th significant digit and my translated value changes. But for VC_6 the numeric ENDPOINT_VALUE is hardly changing. Each row translated to one of two values.
For VC_7 the ENDPOINT_VALUE is static. All histogram END_VALUES are the same.
This means Histograms will not work properly for any VARCHAR(2) columns which do not vary for the first 7 or more characters and any characters after the 7th will be ignored.
Or does it? My logic is correct but tomorrow (or soon after) I’ll try some actual tests over theory… {So don’t ping me with corrections just yet}
I’ve not mentioned VC_16 and VC_40 have I? I will tomorrow. Maybe
.
Finally, Here is the selecting script, as promised.
col column_name form a7 head colname
col rowcount form 99,999
col real_val form a17
col end_val form 999999999999999999999999999999999999
col end_string form a10
col endpoint_actual_value form a40
col mod_real form a17
select --owner
dth.table_name
,dth.column_name
,dth.endpoint_value end_val
,dth.endpoint_number rowcount
,decode(dtc.data_type,'NUMBER',to_char(dth.endpoint_value)
,'VARCHAR2',hist_numtochar(dth.endpoint_value+1)
, dth.endpoint_value
) real_val
,decode(dtc.data_type,'NUMBER',to_char(dth.endpoint_value)
,'VARCHAR2',hist_numtochar2(dth.endpoint_value+1)
, dth.endpoint_value
) mod_real
,endpoint_actual_value
from dba_tab_histograms dth
,dba_tab_columns dtc
where dth.table_name ='HIST_TEST'
and dth.owner=USER
and dth.owner=dtc.owner
and dth.table_name=dtc.table_name
and dth.column_name=dtc.column_name
order by dth.table_name,dth.column_name,dth.endpoint_number
/
Decrypting Histogram Data August 11, 2009
Posted by mwidlake in internals, performance.Tags: data dictionary, histograms, performance, statistics
3 comments
I’m looking at Histograms at the moment and I’ve hit a problem I always hit when looking at histograms. So I’ve decided once and for all to see if I can find a solution.
The value held in DBA_TAB_HISTOGRAMS.ENDPOINT_VALUE for VARCHAR2 columns is a number with no apparent link to reality.
{Before anyone who has read my previous posts suggests it, I’ve already looked at the view and the relevant column resolves to data in either sys.histgrm$ or sys.hist_head$ and the underlying values held are still numbers and not processed in the view, so they are numbers. I think}.
{{I have a sneaking suspicion this could turn into a set of increasingly obscure postings…}}
{{{THIS one might be my most obscure so far!}}}
Let’s have a look:
select --owner table_name ,column_name ,endpoint_value end_val ,endpoint_number rowcount from all_tab_histograms where table_name ='TEST_P' and owner='WIDLAKEM' order by table_name,column_name,endpoint_number / TABLE_NAME COLUMN_NAME END_VAL ROWCOUNT --------------- --------------- ---------- --------------- TEST_P STATUS 0 41,523 TEST_P STATUS 1 42,065 TEST_P STATUS 2 42,099 TEST_P VC_1 3.3882E+35 0 TEST_P VC_1 3.4951E+35 1 TEST_P VC_1 3.6487E+35 2 TEST_P VC_1 3.7558E+35 3 TEST_P VC_1 3.9095E+35 4 TEST_P VC_1 4.0162E+35 5 TEST_P VC_1 4.1697E+35 6 TEST_P VC_1 4.3235E+35 7 TEST_P VC_1 4.4305E+35 8 TEST_P VC_1 4.5841E+35 9 TEST_P VC_1 4.6914E+35 10
So what in heck is 4.6914E+35 {last value listed}?
Well, it is a number of course. If I set my column heading appropriately I can see this:-
col end_val form 999,999,999,999,999,999,999,999,999,999,999,999 col column_name form a10 select column_name ,endpoint_value end_val ,endpoint_number rowcount from all_tab_histograms where table_name ='TEST_P' and owner='WIDLAKEM' and column_name='VC_1' order by table_name,column_name,endpoint_number / COLUMN_NAM END_VAL ROWCOUNT ---------- ------------------------------------------------ ---------- VC_1 338,822,823,410,931,000,000,000,000,000,000,000 0 VC_1 349,512,448,932,628,000,000,000,000,000,000,000 1 VC_1 364,867,102,368,954,000,000,000,000,000,000,000 2 VC_1 375,575,265,424,979,000,000,000,000,000,000,000 3 VC_1 390,949,334,699,583,000,000,000,000,000,000,000 4 VC_1 401,618,352,253,998,000,000,000,000,000,000,000 5 VC_1 416,972,612,321,579,000,000,000,000,000,000,000 6 VC_1 432,345,727,450,272,000,000,000,000,000,000,000 7 VC_1 443,054,364,035,286,000,000,000,000,000,000,000 8 VC_1 458,409,658,691,912,000,000,000,000,000,000,000 9 VC_1 469,139,289,515,351,000,000,000,000,000,000,000 10
The ENDPOINT_VALUE is an encoded representation of the varchar2 value, based on the trick of turning each character into it’s ascii equivalent (0-255) and then multiplying it by 256*no-of-characters-in to the string and adding it all together. Let’s use “Cat” as an example.
- Last character is ‘t’, ascii value is 116=116;
- Next-to-last character is ‘b’, ascii value 97, *(256)= 24832;
- Chr 3 in is ‘C’, ascii value 67, *(256*256) =4390912
- Total is 4390912+24832+116=4415860
What Oracle actually does is slightly different {and I could have this slightly wrong, my brain waved a white flag a couple of hours back} it takes the first 15 characters and multiplies the first character ascii value by power(256*15), second character ascii value by power(256*14) etc until it runs out of characters or gets to character 15.
How do I know this? I decoded the ‘C’ in the kernel
. No, I didn’t, I found a link to an encode script Jonathan Lewis had written {it’s also in his latest book, or at least in the set of scripts the book tells you how to download}. I’ve lost the original link I found, and thus a nice little article on the whole topic, but this link will do until I can re-find the original. {NB this referenced article mentions using the hexstr function which may be neater but, as I said earlier, my brain has given up}.
I’ve outrageously stolen Jonathan’s script which encodes a varchar2 string into the relevant numeric and used it as a template to create a decode version too. Just a couple of notes:
- I can imagine this might not work if you have multibyte characters.
- Oracle confuse things by doing something like translating the 15 character into a 36-digit string – and then taking only the first 15 significant digits. This kind of messes up the untranslate.
So, I create three functions:
- hist_chartonum that converts a varchar2 into something very similar to what is stored in the histogram views. Jonathan will find it incredibly familiar. An optional second paramter turns off the truncation that oracle does for the histogram data, so you can get the full glorious 36 significant digits for 15 characters if you wish.
- hist_numtochar which converts the numeric back to a string. It gets it slightly wrong due to the truncation of the least significant 20 or so characters. Again, an optional second paramater allows it to not replicate the trunaction and work with all 15 characters.
- hist_numtochar2 which is an attempt to allow for the truncation errors. I add power(256,9) to the value if it has been truncated. It seems to make eg the numeric representation of AAAAA translate back to AAAAA not AAAA@. Yes, it’s a fudge.
This is the script:-
-- cre_hist_funcs.sql
-- Heavily borrowed from Jonathan Lewis, sep 2003
-- MDW 11/8/09 - all mistakes mine.
-- hist_chartonum converts a string to a number in a very similar way
-- to how Oracle does so for storing in column ENDPOINT_VALUE in XXX_TAB_HISTOGRAMS
-- Optionally get it to not truncate the value in the same way to only 15 digits.
-- hist_numtochar converts the ENDPOINT_VALUE back to human-readable format.
-- Optinally get it to go beyond 15 numerics (7 or 8 characters)
-- hist_numtochar2 attempts to allow for truncation errors.
-- JLs version has been checked against 8.1.7.4, 9.2.0.4 and 10.1.0.2
-- I've used on 10.2.0.3
set timing off
create or replace function hist_chartonum(p_vc varchar2
,p_trunc varchar2 :='Y') return number
is
m_vc varchar2(15) := substr(rpad(p_vc,15,chr(0)),1,15);
m_n number := 0;
begin
for i in 1..15 loop
/* dbms_output.put(ascii(substr(m_vc,i,1)));
dbms_output.put(chr(9));
dbms_output.put_Line(to_char( power(256,15-i) * ascii(substr(m_vc,i,1)),
'999,999,999,999,999,999,999,999,999,999,999,999'
)
); */
m_n := m_n + power(256,15-i) * ascii(substr(m_vc,i,1));
end loop;
-- this converts it from a 36 digit number to the 15-digit number used
-- in the ENDPOINT_VALUE Column.
If p_trunc = 'Y' then
m_n := round(m_n, -21);
end if;
-- dbms_output.put_line(to_char(m_n,'999,999,999,999,999,999,999,999,999,999,999,999'));
return m_n;
end;
/
create or replace function hist_numtochar(p_num number
,p_trunc varchar2 :='Y') return varchar2
is
m_vc varchar2(15);
m_n number :=0;
m_n1 number;
m_loop number :=7;
begin
m_n :=p_num;
if length(to_char(m_n))<36 then
--dbms_output.put_line ('input too short');
m_vc:='num format err';
else
if p_trunc !='Y' then
m_loop :=15;
end if;
--dbms_output.put_line(to_char(m_N,'999,999,999,999,999,999,999,999,999,999,999,999'));
for i in 1..m_loop loop
m_n1:=trunc(m_n/(power(256,15-i)));
-- dbms_output.put_line(to_char(m_n1));
if m_n1!=0 then m_vc:=m_vc||chr(m_n1);
end if;
dbms_output.put_line(m_vc);
m_n:=m_n-(m_n1*power(256,15-i));
end loop;
end if;
return m_vc;
end;
/
create or replace function hist_numtochar2(p_num number
,p_trunc varchar2 :='Y') return varchar2
is
m_vc varchar2(15);
m_n number :=0;
m_n1 number;
m_loop number :=7;
begin
m_n :=p_num;
if length(to_char(m_n))<36 then
--dbms_output.put_line ('input too short');
m_vc:='num format err';
else
if p_trunc !='Y' then
m_loop :=15;
else
m_n:=m_n+power(256,9);
end if;
--dbms_output.put_line(to_char(m_N,'999,999,999,999,999,999,999,999,999,999,999,999'));
for i in 1..m_loop loop
m_n1:=trunc(m_n/(power(256,15-i)));
-- dbms_output.put_line(to_char(m_n1));
if m_n1!=0 then m_vc:=m_vc||chr(m_n1);
end if;
dbms_output.put_line(m_vc);
m_n:=m_n-(m_n1*power(256,15-i));
end loop;
end if;
return m_vc;
end;
/
rem
rem Sample of use:
rem
col id1 format 999,999,999,999,999,999,999,999,999,999,999,999
col id3 format 999,999,999,999,999,999,999,999,999,999,999,999
select
'short' text
,hist_chartonum('short') id1
,hist_numtochar(hist_chartonum('short')) id2
,hist_chartonum('short','N') id3
,hist_numtochar(hist_chartonum('short','N'),'N') id4
from dual;
select
'alongteststring' text
,hist_chartonum('alongteststring') id1
,hist_numtochar(hist_chartonum('alongteststring')) id2
,hist_chartonum('alongteststring','N') id3
,hist_numtochar(hist_chartonum('alongteststring','N'),'N') id4
from dual;
select
'amuchlongerteststring' text
,hist_chartonum('amuchlongerteststring') id1
,hist_numtochar(hist_chartonum('amuchlongerteststring')) id2
,hist_chartonum('amuchlongerteststring','N') id3
,hist_numtochar(hist_chartonum('amuchlongerteststring','N'),'N') id4
from dual;
spool off
The test at the end produces the following output:-
strng histgrm_translation ------------------------------------------------ translate_back ------------------------------------------------ untrimmed_translation ------------------------------------------------ untrimmed_translate_back ------------------------------------------------ short 599,232,339,077,851,000,000,000,000,000,000,000 short 599,232,339,077,850,549,055,205,294,263,500,800 short alongteststring 505,852,124,009,532,000,000,000,000,000,000,000 alongtm 505,852,124,009,531,849,435,547,374,314,942,055 alongteststring amuchlongerteststring 505,872,878,384,947,000,000,000,000,000,000,000 amuchly 505,872,878,384,946,809,648,430,989,359,543,156 amuchlongertest
Final comment. Oracle trims the value stored in sys.histgrm$ and knackers up the untranslate. Why? Well, this is only me thinking out aloud, but this reduces the number from 36 to 15 significant digits and numerics are stored as one byte per two digits plus an offset byte {am I missing a length byte?}. So 10 bytes or so. This will keep the size of data stored down. I mean, as I posted here, the sys.histgrm table gets big enough as it is!
Tomorrow {or whenever my brain recovers} I’ll show how I use it and one or two oddities I have found or come across from other people’s blogs.
