jump to navigation

Decrypting Histogram Data #3 – is 7 or 15 Characters Significant? September 15, 2009

Posted by mwidlake in internals, performance.
Tags: , , ,
2 comments

At last I am going to wrap up about looking at Histogram data and endpoint_value. Sorry about the delay, life is a tad complex at present. This is also a very, very long post. You may want to skip to the end to get the conclusions

I first posted about how to translate endpoint_values.
I then posted in the second post a week ago about how endpoint_value is limited to 7 or so characters for varchar2 columns.
{Addition – this later post discusses the content of endpoinit_value for dates}

Coskan contacted me to say he believed I had missed something, and of course he was right – but I’d laid big hints that I was using theory and not practical test and there was more to know. {There is usually more to know, only practical tests confirm whether your theory works or not. And even after that, it seems there is still more to know!}

OK, I have a test table, created with the following {oh, this is under 10.2.0.3, Enterprise Edition on linux}:

create table hist_test
as select rownum  id
,lpad(chr(mod(rownum,6)+65),5,chr(mod(rownum,6)+65) )vc_1
,lpad('A',2,'A')||dbms_random.string('U',1) vc_2
,lpad('B',3,'B')||dbms_random.string('U',1) vc_3
,lpad('C',4,'C')||dbms_random.string('U',1) vc_4
,lpad('D',5,'D')||dbms_random.string('U',1) vc_5
,lpad('E',6,'E')||dbms_random.string('U',1) vc_6
,lpad('F',7,'F')||dbms_random.string('U',1) vc_7
,lpad('G',8,'G')||dbms_random.string('U',1) vc_8
,lpad('H',16,'H')||dbms_random.string('U',1)) vc_16
,lpad('I',40,'I')||dbms_random.string('U',1) vc_40
from dba_objects
where rownum < 1001

I need to skew the data in some columns so that different WHERE clauses should see different expected numbers of rows
I now need to add some skew to the data so you can see the histogram in action .

update hist_test set
VC_5=lpad(‘D’,5,’D’)||’A’
,VC_7=lpad(‘F’,7,’F’)||’A’
,VC_16=lpad(‘H’,16,’H’)||’A’
,VC_40=lpad(‘I’,40,’I’)||’A’
where mod(id,10)=1
/

Gather histogram stats, 10 buckets:
dbms_stats.gather_table_stats(ownname=>user,tabname=>’HIST_TEST’
,method_opt=>’FOR ALL COLUMNS SIZE 10′)

And looking at ALL_TAB_COLUMNS we can see the general stucture of the data (just the VC_ columns are shown)

COL_NAME NUM_DIST LOW_V                HI_V                 N_BUCK  AVG_L
-------------------------------------------------------------
ID          1,000 1                    1000                     10      4
VC_1            6 AAAAA                FFFFF                     6      6
VC_2           26 AAA                  AAZ                      10      4
VC_3           26 BBBA                 BBBZ                     10      5
VC_4           26 CCCCA                CCCCZ                    10      6
VC_5           26 DDDDDA               DDDDDZ                   10      7
VC_6           26 EEEEEEA              EEEEEEZ                  10      8
VC_7           26 FFFFFFFA             FFFFFFFZ                 10      9
VC_8           26 GGGGGGGGA            GGGGGGGGZ                10     10
VC_16          26 HHHHHHHHHHHHHHHHA    HHHHHHHHHHHHHHHHZ        10     18
VC_40           1 IIIIIIIIIIIIIIIIIIII IIIIIIIIIIIIIIIIIIII      1     42

I have created a table with varchar2 columns set to a fixed sting of characters the length indicated by the column name with the last column varying from A to Z. So VC_5 contains EEEEEx, where x is A to Z. 26 distinct values each column.

If you look at the above, all columns are showing as having 26 values and 10 buckets EXCEPT VC_40, but it does have 26 distinct values:

select count(distinct(VC_40)) from hist_test;
COUNT(DISTINCT(VC_40))
———————-
26

I have my function to unpack the endpoint_value for varchar2 columns from the number stored to the character string it represents. My function is not perfect, but it is reasonably good. Go checkout post one for the function.
Looking at the histograms with my function you can see the below {I show only some records}

colname                               END_VAL ROWCOUNT MOD_REAL
------------------------------------------------------------------
VC_5     354460798875962000000000000000000000        1 DDDDDA
VC_5     354460798875972000000000000000000000        2 DDDDDD
VC_5     354460798875986000000000000000000000        3 DDDDDG
{snip}
VC_5     354460798876071000000000000000000000        9 DDDDDY
VC_5     354460798876080000000000000000000000       10 DDDDDZ
VC_7     364886116489977000000000000000000000        1 FFFFFGK
VC_7     364886116489977000000000000000000000        2 FFFFFGK
VC_7     364886116489977000000000000000000000        3 FFFFFGK
VC_7     364886116489977000000000000000000000        4 FFFFFGK
{snip}
VC_7     364886116489977000000000000000000000        9 FFFFFGK
VC_7     364886116489977000000000000000000000       10 FFFFFGK
VC_16    375311434103976000000000000000000000        1 HHHHHI:
VC_16    375311434103976000000000000000000000        2 HHHHHI:
VC_16    375311434103976000000000000000000000        3 HHHHHI:
{snip}
VC_16    375311434103976000000000000000000000        9 HHHHHI:
VC_16    375311434103976000000000000000000000       10 HHHHHI:
VC_40    380524092910976000000000000000000000    1,000 IIIIIJM

For VC_5, there are 10 buckets, all with different endpoint_values.
The VC_7 has 10 buckets, but most have the same endpoint_value.
The VC_16 has 10 records in the HISTOGRAM table, all with the same endpoint_value.
VC_40 has only one record in the HISTOGRAM table.

Basically, if I am right, WHERE filters on a column holding values that are the same for the first seven or so characters will not be well supported by the histograms. We should see the estimated cost for such a WHERE filter to be wrong.

I have skewed the data so that Histograms should show more expected values for DDDDDA than DDDDDS, for FFFFFFFA than FFFFFFFS etc

update hist_test
set
 VC_5=lpad(‘D’,5,’D’)||’A’
,VC_7=lpad(‘F’,7,’F’)||’A’
,VC_16=lpad(‘H’,16,’H’)||’A’
,VC_40=lpad(‘I’,40,’I’)||’A’
where mod(id,10)=1
/

And now I’ll see how the CBO estimates the rows coming back for some WHERE clauses.

select count(*) from hist_test
where VC_5 = 'DDDDDA'
  COUNT(*)
----------
       139

Execution Plan
----------------------------------------------------------
| Id  | Operation          | Name      | Rows  | Bytes | Cost (%CPU)|
----------------------------------------------------------
|   0 | SELECT STATEMENT   |           |     1 |     7 |     8   (0)|
|   1 |  SORT AGGREGATE    |           |     1 |     7 |            |
|*  2 |   TABLE ACCESS FULL| HIST_TEST |   100 |   700 |     8   (0)|
----------------------------------------------------------

select count(*) from hist_test
where VC_5 = 'DDDDDS'
  COUNT(*)
----------
        35

Execution Plan
----------------------------------------------------------
| Id  | Operation          | Name      | Rows  | Bytes | Cost (%CPU)|
----------------------------------------------------------
|   0 | SELECT STATEMENT   |           |     1 |     7 |     8   (0)|
|   1 |  SORT AGGREGATE    |           |     1 |     7 |            |
|*  2 |   TABLE ACCESS FULL| HIST_TEST |    49 |   343 |     8   (0)|
---------------------------------------------------------------------

For column VC_5 it all works as we would expect. the CBO estimates 100 values for DDDDDA and 49 for DDDDDS. I’ve shown the actual count(*)’s as well to show there is a real variation in numbers. There are good reasons why the estimates do not match reality, I won’t go into them now but this does highlight that histograms can help but do have flaws.

What about the other rows? We are only interested in the WHERE clause and the estimate for the number of rows from the table access, so I’ll show only those.

where VC_7 = 'FFFFFFFA'
|*  2 |   TABLE ACCESS FULL| HIST_TEST |   100 |   900 |     8   (0)|

where VC_7 = 'FFFFFFFS'
|*  2 |   TABLE ACCESS FULL| HIST_TEST |    49 |   441 |     8   (0)|

where VC_16 = 'HHHHHHHHHHHHHHHHA'
|*  2 |   TABLE ACCESS FULL| HIST_TEST |   100 |  7200 |     8   (0)|

where VC_16 = 'HHHHHHHHHHHHHHHHS'
|*  2 |   TABLE ACCESS FULL| HIST_TEST |    47 |   216 |     8   (0)|

where VC_40 = 'IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIA'
|*  2 |   TABLE ACCESS FULL| HIST_TEST |  1000 | 42000 |     8   (0)|

where VC_40 = 'IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIS'
|*  2 |   TABLE ACCESS FULL| HIST_TEST |  1000 | 42000 |     8   (0)|

{and just to show there is no typo in the string of 'I's above...
where VC_40 = lpad('I',40,'I')||'S'
|*  2 |   TABLE ACCESS FULL| HIST_TEST |  1000 | 42000 |     8   (0)|

In the above you can see that the CBO estimates the costs for VC_7 exactly the same as it did for VC_5. Despite the endpoint_values no longer distinguishing the buckets.

Even more telling is tha for VC_16 the CBO also is able to predict there are more values ending in A than S.

As for VC_40, the CBO utterly fails to spot the skew. In fact it utterly fails to spot that there is more than one value in the column (look back at the section showing min and max values and number of distinct values), and there is only one histogram bucket. so I supposed it is no surprise.

How does the CBO still detect the skew for VC_7 and VC_16, even though the endpoint_valuess are the same?
Because it swaps to using column ENDPOINT_ACTUAL_VALUE.

ENDPOINT_ACTUAL_VALUE is only populated for varchar2 columns when the precision of endpoint_value is exceeded:

colname ROWCOUNT MOD_REAL          ENDPOINT_ACTUAL_VALUE
------- ---------------------------------------------
VC_5           7 DDDDDR
VC_5           8 DDDDDV
VC_5           9 DDDDDY
VC_5          10 DDDDDZ
VC_6           0 EEEEEF8           EEEEEEA
VC_6           1 EEEEEF8           EEEEEEC
VC_6           2 EEEEEF8           EEEEEEF
VC_6           3 EEEEEF8           EEEEEEH
VC_6           4 EEEEEF8           EEEEEEK
VC_6           5 EEEEEF8           EEEEEEN
VC_6           6 EEEEEF8           EEEEEEP
VC_6           7 EEEEEFn           EEEEEES
VC_6           8 EEEEEFn           EEEEEEU
VC_6           9 EEEEEFn           EEEEEEX
VC_6          10 EEEEEFn           EEEEEEZ
VC_7           1 FFFFFGK           FFFFFFFA
VC_7           2 FFFFFGK           FFFFFFFC
VC_7           3 FFFFFGK           FFFFFFFF
VC_7           4 FFFFFGK           FFFFFFFI
VC_7           5 FFFFFGK           FFFFFFFK
VC_7           6 FFFFFGK           FFFFFFFO
VC_7           7 FFFFFGK           FFFFFFFR
VC_7           8 FFFFFGK           FFFFFFFT
VC_7           9 FFFFFGK           FFFFFFFW
VC_7          10 FFFFFGK           FFFFFFFZ
VC_16          1 HHHHHI:           HHHHHHHHHHHHHHHHA
VC_16          2 HHHHHI:           HHHHHHHHHHHHHHHHD
VC_16          3 HHHHHI:           HHHHHHHHHHHHHHHHG
VC_16          4 HHHHHI:           HHHHHHHHHHHHHHHHJ
VC_16          5 HHHHHI:           HHHHHHHHHHHHHHHHL
VC_16          6 HHHHHI:           HHHHHHHHHHHHHHHHO
VC_16          7 HHHHHI:           HHHHHHHHHHHHHHHHS
VC_16          8 HHHHHI:           HHHHHHHHHHHHHHHHU
VC_16          9 HHHHHI:           HHHHHHHHHHHHHHHHX
VC_16         10 HHHHHI:           HHHHHHHHHHHHHHHHZ
VC_40      1,000 IIIIIJM

You can see that VC_5 columns have no ENDPOINT_ACTUAL_VALUEs but VC_6, VC_7 and VC_16 do. VC_40 does not.

So, at what point does the CBO stop storing values in ENDPOINT_ACTUAL_VALUES? 32 characters. I created  another test table and these are the low/high values, number of distinct values and number of histogram buckets:

COL_NAME NUM_DIST LOW_V HI_V N_BUCK AVG_L
————————————————————–
ID 1,000 1 1000 10 4
NUM_1 10 0 9 10 3
VC_1 6 AAAAA FFFFF 6 6
VC_30 26 AAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAA 10 32
VC_31 26 BBBBBBBBBBBBBBBBBBBB BBBBBBBBBBBBBBBBBBBB 10 33
VC_32 1 CCCCCCCCCCCCCCCCCCCC CCCCCCCCCCCCCCCCCCCC 1 34
VC_33 1 DDDDDDDDDDDDDDDDDDDD DDDDDDDDDDDDDDDDDDDD 1 35
VC_34 1 EEEEEEEEEEEEEEEEEEEE EEEEEEEEEEEEEEEEEEEE 1 36
VC_35 1 FFFFFFFFFFFFFFFFFFFF FFFFFFFFFFFFFFFFFFFF 1 37[/sourcecode ]

The name of the column is the length of the fixed string, so the columns are actually one character longer, as can be seen by the “AVG_L” column.

And just to check {as after all this is just a nice theory until you test}, I skew the number of records ending in ‘A’ for VC_31 and VC_32:-

update hist_test2
set
VC_31=lpad(‘B’,31,’B’)||’A’
,VC_32=lpad(‘C’,32,’C’)||’A’
where mod(id,10)=1

And see what the CBO thinks the rows to be identified are:-

where VC_31 = lpad(‘B’,31,’B’)||’A’
|* 2 | TABLE ACCESS FULL| HIST_TEST2 | 100 | 3300 |

where VC_31 = lpad(‘B’,31,’B’)||’S’
|* 2 | TABLE ACCESS FULL| HIST_TEST2 | 49 | 1617 |

where VC_32 = lpad(‘C’,32,’C’)||’A’
|* 2 | TABLE ACCESS FULL| HIST_TEST2 | 1000 | 34000 |

where VC_32 = lpad(‘C’,32,’C’)||’S’
|* 2 | TABLE ACCESS FULL| HIST_TEST2 | 1000 | 34000 | [/sourcecode ]

Yes. The CBO can see the skew for the 32 characters of VC_31 and not for the 33 characters of VC_32

So in conclusion

  • endpoint_value is only accurate for varchar2 columns up to around 7 characters.
  • If the varchar2 exceeds the precision of the endpoint_value column then endpoint_actual_value is used.
  • endpoint_actual_value is used for varchar2 columns up to 32 characters and not for columns less than 7 characters.
  • Histograms will not help with columns having many values with the same leading 32 or more characters.
  • If you column has the same leading 32 characters or more, even the number of distinct values is not gathered by DBMS_STATS.
  • You can’t tell how SQL/CBO will work from theory, you have to test. And still consider you knowledge as open to correction after that 🙂
Advertisement