Decrypting Histogram Data #3 – is 7 or 15 Characters Significant? September 15, 2009
Posted by mwidlake in internals, performance.Tags: data dictionary, histograms, performance, statistics
At last I am going to wrap up looking at histogram data and endpoint_value. Sorry about the delay; life is a tad complex at present. This is also a very, very long post, so you may want to skip to the end for the conclusions.
I first posted about how to translate endpoint_values.
I then posted, in the second post a week ago, about how endpoint_value is limited to 7 or so characters for varchar2 columns.
{Addition – this later post discusses the content of endpoint_value for dates}
Coskan contacted me to say he believed I had missed something, and of course he was right – but I'd laid big hints that I was using theory rather than practical tests, and that there was more to know. {There is usually more to know; only practical tests confirm whether your theory works or not. And even after that, it seems there is still more to know!}
OK, I have a test table, created with the following {oh, this is under 10.2.0.3, Enterprise Edition on linux}:
create table hist_test
as select rownum id
,lpad(chr(mod(rownum,6)+65),5,chr(mod(rownum,6)+65)) vc_1
,lpad('A',2,'A')||dbms_random.string('U',1) vc_2
,lpad('B',3,'B')||dbms_random.string('U',1) vc_3
,lpad('C',4,'C')||dbms_random.string('U',1) vc_4
,lpad('D',5,'D')||dbms_random.string('U',1) vc_5
,lpad('E',6,'E')||dbms_random.string('U',1) vc_6
,lpad('F',7,'F')||dbms_random.string('U',1) vc_7
,lpad('G',8,'G')||dbms_random.string('U',1) vc_8
,lpad('H',16,'H')||dbms_random.string('U',1) vc_16
,lpad('I',40,'I')||dbms_random.string('U',1) vc_40
from dba_objects
where rownum < 1001
/
I now need to add some skew to the data, so that different WHERE clauses will see different expected numbers of rows and you can see the histograms in action.
update hist_test set
 VC_5 =lpad('D',5,'D')||'A'
,VC_7 =lpad('F',7,'F')||'A'
,VC_16=lpad('H',16,'H')||'A'
,VC_40=lpad('I',40,'I')||'A'
where mod(id,10)=1
/
Gather histogram stats, 10 buckets:
dbms_stats.gather_table_stats(ownname=>user,tabname=>'HIST_TEST'
,method_opt=>'FOR ALL COLUMNS SIZE 10')
And looking at ALL_TAB_COLUMNS we can see the general structure of the data (just the VC_ columns are shown):
COL_NAME NUM_DIST LOW_V                HI_V                 N_BUCK AVG_L
-------- -------- -------------------- -------------------- ------ -----
ID          1,000 1                    1000                     10     4
VC_1            6 AAAAA                FFFFF                     6     6
VC_2           26 AAA                  AAZ                      10     4
VC_3           26 BBBA                 BBBZ                     10     5
VC_4           26 CCCCA                CCCCZ                    10     6
VC_5           26 DDDDDA               DDDDDZ                   10     7
VC_6           26 EEEEEEA              EEEEEEZ                  10     8
VC_7           26 FFFFFFFA             FFFFFFFZ                 10     9
VC_8           26 GGGGGGGGA            GGGGGGGGZ                10    10
VC_16          26 HHHHHHHHHHHHHHHHA    HHHHHHHHHHHHHHHHZ        10    18
VC_40           1 IIIIIIIIIIIIIIIIIIII IIIIIIIIIIIIIIIIIIII      1    42
I have created a table with varchar2 columns, each holding a fixed string of characters of the length indicated by the column name, with the last character varying from A to Z. So VC_5 contains DDDDDx, where x is A to Z – 26 distinct values in each column.
If you look at the above, all the random-letter VC_ columns are showing 26 distinct values and 10 buckets EXCEPT VC_40 – but it really does have 26 distinct values:
select count(distinct(VC_40)) from hist_test;
COUNT(DISTINCT(VC_40))
----------------------
                    26
I have my function to unpack the endpoint_value for varchar2 columns from the number stored to the character string it represents. My function is not perfect, but it is reasonably good. Go check out the first post for the function.
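For illustration, the same decoding can be sketched outside the database. This Python snippet rests on my assumption about the encoding (from the first post, not an official formula): the stored number looks like the first 15 bytes of the string read as one big base-256 number. Because the stored number only keeps around 15 significant decimal digits, only the first 6 or 7 characters come back cleanly; the rest is rounding noise, which is why the MOD_REAL values in the listing below include junk like ':'.

```python
def decode_endpoint_value(endpoint_value: int) -> str:
    """Turn a varchar2 ENDPOINT_VALUE number back into (roughly) the
    leading characters of the original string.

    Assumption: the stored number is the first 15 bytes of the string,
    treated as a big-endian base-256 number, zero-padded to 15 bytes.
    """
    raw = int(endpoint_value).to_bytes(15, byteorder='big')
    # Rounding in the stored number garbles the low-order bytes, so show
    # unprintable bytes as ':' and trim them from the end.
    return ''.join(chr(b) if 32 <= b <= 126 else ':' for b in raw).rstrip(':')

# The VC_5 endpoint for DDDDDA decodes back to a recognisable prefix:
print(decode_endpoint_value(354460798875962000000000000000000000)[:5])  # DDDDD
```

Only the first few characters are trustworthy; anything after that is an artefact of the lost precision.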
Looking at the histograms with my function, you can see the following {I show only some records}:
COLNAME END_VAL                              ROWCOUNT MOD_REAL
------- ------------------------------------ -------- --------
VC_5    354460798875962000000000000000000000        1 DDDDDA
VC_5    354460798875972000000000000000000000        2 DDDDDD
VC_5    354460798875986000000000000000000000        3 DDDDDG
{snip}
VC_5    354460798876071000000000000000000000        9 DDDDDY
VC_5    354460798876080000000000000000000000       10 DDDDDZ
VC_7    364886116489977000000000000000000000        1 FFFFFGK
VC_7    364886116489977000000000000000000000        2 FFFFFGK
VC_7    364886116489977000000000000000000000        3 FFFFFGK
VC_7    364886116489977000000000000000000000        4 FFFFFGK
{snip}
VC_7    364886116489977000000000000000000000        9 FFFFFGK
VC_7    364886116489977000000000000000000000       10 FFFFFGK
VC_16   375311434103976000000000000000000000        1 HHHHHI:
VC_16   375311434103976000000000000000000000        2 HHHHHI:
VC_16   375311434103976000000000000000000000        3 HHHHHI:
{snip}
VC_16   375311434103976000000000000000000000        9 HHHHHI:
VC_16   375311434103976000000000000000000000       10 HHHHHI:
VC_40   380524092910976000000000000000000000    1,000 IIIIIJM
For VC_5, there are 10 buckets, all with different endpoint_values.
VC_7 has 10 buckets, but most share the same endpoint_value.
VC_16 has 10 records in the HISTOGRAM table, all with the same endpoint_value.
VC_40 has only one record in the HISTOGRAM table.
Basically, if I am right, WHERE filters on a column holding values that are the same for the first seven or so characters will not be well supported by the histograms – we should see the estimated cardinality for such WHERE filters be wrong.
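That prediction can be tested arithmetically as well. Under my assumed model (first 15 bytes of the string as a big-endian number, limited to roughly 15 significant decimal digits by the stored value – an inference from the numbers above, not documented behaviour), the collisions fall out exactly where the listing shows them:

```python
def endpoint_value(s: str) -> int:
    """Approximate the number stored for a varchar2 histogram endpoint.
    Assumed model, inferred from observed values: first 15 bytes,
    big-endian, rounded to ~15 significant decimal digits."""
    raw = int.from_bytes(s.encode('ascii')[:15].ljust(15, b'\x00'), 'big')
    excess = len(str(raw)) - 15   # decimal digits beyond the precision
    return round(raw, -excess) if excess > 0 else raw

# Differences at character 6 survive the rounding...
print(endpoint_value('DDDDDA') != endpoint_value('DDDDDS'))          # True
# ...differences at character 8 are rounded away...
print(endpoint_value('FFFFFFFA') == endpoint_value('FFFFFFFS'))      # True
# ...and characters past byte 15 never reach the number at all.
print(endpoint_value('H'*16 + 'A') == endpoint_value('H'*16 + 'Z'))  # True
```

So under this model VC_5's endpoints stay distinct while VC_7's and VC_16's collapse, which matches the histogram listing.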
Remember, I skewed the data so that the histograms should show more expected rows for DDDDDA than DDDDDS, for FFFFFFFA than FFFFFFFS, etc.
And now I’ll see how the CBO estimates the rows coming back for some WHERE clauses.
select count(*) from hist_test where VC_5 = 'DDDDDA'

  COUNT(*)
----------
       139

Execution Plan
----------------------------------------------------------------------
| Id  | Operation          | Name      | Rows  | Bytes | Cost (%CPU)|
----------------------------------------------------------------------
|   0 | SELECT STATEMENT   |           |     1 |     7 |     8   (0)|
|   1 |  SORT AGGREGATE    |           |     1 |     7 |            |
|*  2 |   TABLE ACCESS FULL| HIST_TEST |   100 |   700 |     8   (0)|
----------------------------------------------------------------------

select count(*) from hist_test where VC_5 = 'DDDDDS'

  COUNT(*)
----------
        35

Execution Plan
----------------------------------------------------------------------
| Id  | Operation          | Name      | Rows  | Bytes | Cost (%CPU)|
----------------------------------------------------------------------
|   0 | SELECT STATEMENT   |           |     1 |     7 |     8   (0)|
|   1 |  SORT AGGREGATE    |           |     1 |     7 |            |
|*  2 |   TABLE ACCESS FULL| HIST_TEST |    49 |   343 |     8   (0)|
----------------------------------------------------------------------
For column VC_5 it all works as we would expect: the CBO estimates 100 rows for DDDDDA and 49 for DDDDDS. I've shown the actual count(*)s as well, to show there is real variation in the numbers. There are good reasons why the estimates do not match reality, which I won't go into now, but this does highlight that histograms can help yet do have flaws.
What about the other rows? We are only interested in the WHERE clause and the estimate for the number of rows from the table access, so I’ll show only those.
where VC_7 = 'FFFFFFFA'
|*  2 |   TABLE ACCESS FULL| HIST_TEST |   100 |   900 |     8   (0)|

where VC_7 = 'FFFFFFFS'
|*  2 |   TABLE ACCESS FULL| HIST_TEST |    49 |   441 |     8   (0)|

where VC_16 = 'HHHHHHHHHHHHHHHHA'
|*  2 |   TABLE ACCESS FULL| HIST_TEST |   100 |  7200 |     8   (0)|

where VC_16 = 'HHHHHHHHHHHHHHHHS'
|*  2 |   TABLE ACCESS FULL| HIST_TEST |    47 |   216 |     8   (0)|

where VC_40 = 'IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIA'
|*  2 |   TABLE ACCESS FULL| HIST_TEST |  1000 | 42000 |     8   (0)|

where VC_40 = 'IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIS'
|*  2 |   TABLE ACCESS FULL| HIST_TEST |  1000 | 42000 |     8   (0)|

{and just to show there is no typo in the string of 'I's above...}
where VC_40 = lpad('I',40,'I')||'S'
|*  2 |   TABLE ACCESS FULL| HIST_TEST |  1000 | 42000 |     8   (0)|
In the above you can see that the CBO estimates the cardinalities for VC_7 exactly as it did for VC_5, despite the endpoint_values no longer distinguishing the buckets.
Even more telling is that for VC_16 the CBO is also able to predict there are more values ending in A than in S.
As for VC_40, the CBO utterly fails to spot the skew. In fact, it utterly fails to spot that there is more than one value in the column (look back at the section showing min and max values and number of distinct values), and there is only one histogram bucket, so I suppose it is no surprise.
How does the CBO still detect the skew for VC_7 and VC_16, even though the endpoint_values are the same?
Because it swaps to using column ENDPOINT_ACTUAL_VALUE.
ENDPOINT_ACTUAL_VALUE is only populated for varchar2 columns when the precision of endpoint_value is exceeded:
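A rough way to model that rule (my inference from the data in this post, not documented behaviour): actual values have to be stored once distinct column values stop mapping to distinct endpoint numbers. Sketched in Python, using an assumed model of the numeric encoding (first 15 bytes of the string, about 15 significant decimal digits of precision):

```python
import string

def endpoint_value(s: str) -> int:
    # Assumed model of the stored endpoint number: first 15 bytes,
    # big-endian, rounded to ~15 significant decimal digits.
    raw = int.from_bytes(s.encode('ascii')[:15].ljust(15, b'\x00'), 'big')
    excess = len(str(raw)) - 15
    return round(raw, -excess) if excess > 0 else raw

def needs_actual_values(column_values) -> bool:
    """Predict whether ENDPOINT_ACTUAL_VALUE would be needed: it is,
    once distinct values no longer give distinct endpoint numbers."""
    distinct = set(column_values)
    return len({endpoint_value(v) for v in distinct}) < len(distinct)

# VC_5-style values stay distinguishable, VC_6-style values do not:
print(needs_actual_values(['DDDDD' + c for c in string.ascii_uppercase]))   # False
print(needs_actual_values(['EEEEEE' + c for c in string.ascii_uppercase]))  # True
```

This matches the listing below: VC_5 has no actual values stored, while VC_6 onwards do.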
COLNAME ROWCOUNT MOD_REAL ENDPOINT_ACTUAL_VALUE
------- -------- -------- ---------------------
VC_5           7 DDDDDR
VC_5           8 DDDDDV
VC_5           9 DDDDDY
VC_5          10 DDDDDZ
VC_6           0 EEEEEF8  EEEEEEA
VC_6           1 EEEEEF8  EEEEEEC
VC_6           2 EEEEEF8  EEEEEEF
VC_6           3 EEEEEF8  EEEEEEH
VC_6           4 EEEEEF8  EEEEEEK
VC_6           5 EEEEEF8  EEEEEEN
VC_6           6 EEEEEF8  EEEEEEP
VC_6           7 EEEEEFn  EEEEEES
VC_6           8 EEEEEFn  EEEEEEU
VC_6           9 EEEEEFn  EEEEEEX
VC_6          10 EEEEEFn  EEEEEEZ
VC_7           1 FFFFFGK  FFFFFFFA
VC_7           2 FFFFFGK  FFFFFFFC
VC_7           3 FFFFFGK  FFFFFFFF
VC_7           4 FFFFFGK  FFFFFFFI
VC_7           5 FFFFFGK  FFFFFFFK
VC_7           6 FFFFFGK  FFFFFFFO
VC_7           7 FFFFFGK  FFFFFFFR
VC_7           8 FFFFFGK  FFFFFFFT
VC_7           9 FFFFFGK  FFFFFFFW
VC_7          10 FFFFFGK  FFFFFFFZ
VC_16          1 HHHHHI:  HHHHHHHHHHHHHHHHA
VC_16          2 HHHHHI:  HHHHHHHHHHHHHHHHD
VC_16          3 HHHHHI:  HHHHHHHHHHHHHHHHG
VC_16          4 HHHHHI:  HHHHHHHHHHHHHHHHJ
VC_16          5 HHHHHI:  HHHHHHHHHHHHHHHHL
VC_16          6 HHHHHI:  HHHHHHHHHHHHHHHHO
VC_16          7 HHHHHI:  HHHHHHHHHHHHHHHHS
VC_16          8 HHHHHI:  HHHHHHHHHHHHHHHHU
VC_16          9 HHHHHI:  HHHHHHHHHHHHHHHHX
VC_16         10 HHHHHI:  HHHHHHHHHHHHHHHHZ
VC_40      1,000 IIIIIJM
You can see that VC_5 columns have no ENDPOINT_ACTUAL_VALUEs but VC_6, VC_7 and VC_16 do. VC_40 does not.
So, at what point does Oracle stop storing values in ENDPOINT_ACTUAL_VALUE? At 32 characters. I created another test table, and these are the low/high values, number of distinct values and number of histogram buckets:
COL_NAME NUM_DIST LOW_V                HI_V                 N_BUCK AVG_L
-------- -------- -------------------- -------------------- ------ -----
ID          1,000 1                    1000                     10     4
NUM_1          10 0                    9                        10     3
VC_1            6 AAAAA                FFFFF                     6     6
VC_30          26 AAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAA     10    32
VC_31          26 BBBBBBBBBBBBBBBBBBBB BBBBBBBBBBBBBBBBBBBB     10    33
VC_32           1 CCCCCCCCCCCCCCCCCCCC CCCCCCCCCCCCCCCCCCCC      1    34
VC_33           1 DDDDDDDDDDDDDDDDDDDD DDDDDDDDDDDDDDDDDDDD      1    35
VC_34           1 EEEEEEEEEEEEEEEEEEEE EEEEEEEEEEEEEEEEEEEE      1    36
VC_35           1 FFFFFFFFFFFFFFFFFFFF FFFFFFFFFFFFFFFFFFFF      1    37
The name of the column is the length of the fixed string, so the columns are actually one character longer, as can be seen by the “AVG_L” column.
And just to check {as, after all, this is just a nice theory until you test it}, I skew the number of records ending in 'A' for VC_31 and VC_32:
update hist_test2
set
 VC_31=lpad('B',31,'B')||'A'
,VC_32=lpad('C',32,'C')||'A'
where mod(id,10)=1
And see what the CBO thinks the rows to be identified are:
where VC_31 = lpad('B',31,'B')||'A'
|*  2 |   TABLE ACCESS FULL| HIST_TEST2 |   100 |  3300 |

where VC_31 = lpad('B',31,'B')||'S'
|*  2 |   TABLE ACCESS FULL| HIST_TEST2 |    49 |  1617 |

where VC_32 = lpad('C',32,'C')||'A'
|*  2 |   TABLE ACCESS FULL| HIST_TEST2 |  1000 | 34000 |

where VC_32 = lpad('C',32,'C')||'S'
|*  2 |   TABLE ACCESS FULL| HIST_TEST2 |  1000 | 34000 |
Yes: the CBO can see the skew for the 32 characters of VC_31 but not for the 33 characters of VC_32.
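So the second table behaves as if stats gathering sees only (roughly) the first 32 characters of a varchar2, with any variation beyond that invisible. A small sketch of that effect – the 32 here is the empirical limit found above, not a documented constant:

```python
import string

def visible_distinct(values, prefix_limit=32):
    """Model of the effect seen above: stats gathering appears to
    consider only about the first 32 characters of a varchar2, so
    values that agree on those look like a single value."""
    return len({v[:prefix_limit] for v in values})

# VC_31: 31 'B's + A..Z varies within the first 32 chars, so the skew is visible
print(visible_distinct(['B'*31 + c for c in string.ascii_uppercase]))  # 26
# VC_32: 32 'C's + A..Z is identical in the first 32 chars, so it is invisible
print(visible_distinct(['C'*32 + c for c in string.ascii_uppercase]))  # 1
```

This reproduces the NUM_DIST column of the second table: 26 for VC_31, 1 for VC_32 and longer.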
So, in conclusion:
- endpoint_value is only accurate for varchar2 columns up to around 7 characters.
- If the varchar2 values exceed the precision of the endpoint_value column, then endpoint_actual_value is used.
- endpoint_actual_value is used for varchar2 values of up to 32 characters; it is not populated when the first 7 or so characters are enough to tell values apart.
- Histograms will not help with columns whose values share the same leading 32 or more characters.
- If your column's values share the same leading 32 characters or more, even the number of distinct values is not gathered by DBMS_STATS.
- You can't tell how SQL/the CBO will work from theory; you have to test. And even after that, consider your knowledge as open to correction 🙂