jump to navigation

Decrypting Histogram Data #2 September 3, 2009

Posted by mwidlake in internals, performance.
Tags: , , ,
7 comments

Hint – if you have a column where most or all of the entries have the same 15 plus characters, don’t bother gathering histograms on it. They will not help you a lot. Or at all. eg you have a column that holds “CUSTOMER IDENTIFIER xxxxxxxxx”, “CUSTOMER IDENTIFIER yyyyyyyy”, etc. Of course, good design suggests that the “CUSTOMER IDENTIFIER” bit is probably redundant and can be got rid of, but we live in the real world and may not have the power or ability to enforce that change, or we might have 3 or 4 such “prefix” strings.

Further, histograms on dnumerics longer than 15 significant digits will also potentially behave not as you would expect.

I better justify my assertion.

In the previous post on decrypting histogram data I covered how Oracle turns a varchar2 value into a numeric value that is held in the ENDPOINT_VALUE of DBA_TAB_HISTOGRAMS and I also gave you a cunning (but slightly flawed) function to convert it back. I use it in this post so you might want to go look at the prior post. Sorry it is long, I can’t stop rambling.

First, I’ll create some test data {Oh, this is on 10.2.0.3 on Linux}. The below script create a table HIST_TEST with columns NUM_1 through to NUM7, which hold numbers padded out to greater lengths and then 0-9 as the last value. Similarlry columns VC_2 to VC_8 are padded out and have a random character added. VC_1 is a random 5-character string.

create table hist_test
as select rownum  id
,mod(rownum,10)  num_1
,trunc(dbms_random.value(1,10)) num_2
,1000+trunc(dbms_random.value(1,10)) num_3
,1000000+trunc(dbms_random.value(1,10)) num_4
,1000000000+trunc(dbms_random.value(1,10)) num_5
,1000000000000+trunc(dbms_random.value(1,10)) num_6
,1000000000000000+trunc(dbms_random.value(1,10)) num_7
,lpad(chr(mod(rownum,6)+65),5,chr(mod(rownum,6)+65) )vc_1
,lpad('A',2,'A')||dbms_random.string('U',1) vc_2
,lpad('B',3,'B')||dbms_random.string('U',1) vc_3
,lpad('C',4,'C')||dbms_random.string('U',1) vc_4
,lpad('D',5,'D')||dbms_random.string('U',1) vc_5
,lpad('E',6,'E')||dbms_random.string('U',1) vc_6
,lpad('F',7,'F')||dbms_random.string('U',1) vc_7
,lpad('G',8,'G')||dbms_random.string('U',1) vc_8
--below skews the table to AB.
,lpad('H',16,'H')||decode(mod(rownum,3),0,'A'
                                       ,1,'B'
                                       ,   dbms_random.string('U',1)) vc_16
,lpad('I',40,'I')||dbms_random.string('U',1) vc_40
from dba_objects
where rownum < 1001
/
begin
  dbms_stats.gather_table_stats(ownname=>user,tabname=>'HIST_TEST'
  ,method_opt=>'FOR ALL COLUMNS SIZE 10');
END;
/

The below is a simple select against DBA_TAB_COLUMNS to see the information for the column {Oh, I say simple, but you have to use the functions utl_raw.cat_to_number and utl_raw.cast_to_varchar2 to turn the raw values held in the columns LOW_VALUE and HIGH_VALUE to something we humans can read. Why does Oracle Corp have to make life so difficult? *sigh*.

Oh, click on “view plain” on the box below to get a better layout.

select
 column_name
,num_distinct num_dist
,decode(data_type,'NUMBER',to_char(utl_raw.cast_to_number(low_value))
                 ,'VARCHAR2',to_char(utl_raw.cast_to_varchar2(low_value))
                 ,          low_value
       ) low_v
,decode(data_type,'NUMBER',to_char(utl_raw.cast_to_number(high_value))
                 ,'VARCHAR2',to_char(utl_raw.cast_to_varchar2(high_value))
                 ,          high_value
       ) hi_v
,num_nulls    n_nulls
,num_buckets  n_buck
,avg_col_len  avg_l
from dba_tab_columns
where table_name ='HIST_TEST'
and owner=USER
order by owner,column_id
/

COLUMN_NAME       NUM_DIST LOW_V              HI_V                    N_NULLS N_BUCK  AVG_L
---------------------------------------------------------------------
ID                   1,000 1                  1000                          0     10      4
NUM_1                   10 0                  9                             0     10      3
NUM_2                    9 1                  9                             0      9      3
NUM_3                    9 1001               1009                          0      9      4
NUM_4                    9 1000001            1000009                       0      9      6
NUM_5                    9 1000000001         1000000009                    0      9      7
NUM_6                    9 1000000000001      1000000000009                 0      9      9
NUM_7                    9 1000000000000001   1000000000000009              0      9     10
VC_1                     6 AAAAA              FFFFF                         0      6      6
VC_2                    26 AAA                AAZ                           0     10      4
VC_3                    26 BBBA               BBBZ                          0     10      5
VC_4                    26 CCCCA              CCCCZ                         0     10      6
VC_5                    26 DDDDDA             DDDDDZ                        0     10      7
VC_6                    26 EEEEEEA            EEEEEEZ                       0     10      8
VC_7                    26 FFFFFFFA           FFFFFFFZ                      0     10      9
VC_8                    26 GGGGGGGGA          GGGGGGGGZ                     0     10     10
VC_16                   26 HHHHHHHHHHHHHHHHA  HHHHHHHHHHHHHHHHZ             0     10     18
VC_40                    1 IIIIIIIIIIIIIIIIII IIIIIIIIIIIIIIIIII            0      1     42/

I now select the data out of the DBA_TAB_HISTOGRAMS table to see what is really being stored. For the sake of brevity {which means, “so I can finish this post tonight”} I’ll just show bits, but if you want, download the script right at the end and, if you created the functions from the prior post, you can play with this yourself.

First, here you can see some of the values actually stored in ENDPOINT_VALUE and what they translate into:-

TABLE_NAME   colname                        ENDPOINT_VALUE ROWCOUNT REAL_VAL
------------------------------------------------------------
Numbers
HIST_TEST    NUM_5                              1000000007      760 1000000007
HIST_TEST    NUM_5                              1000000008      870 1000000008
HIST_TEST    NUM_5                              1000000009    1,000 1000000009
HIST_TEST    NUM_6                           1000000000001      103 1000000000001
HIST_TEST    NUM_6                           1000000000002      219 1000000000002
HIST_TEST    NUM_6                           1000000000003      328 1000000000003
HIST_TEST    NUM_6                           1000000000004      427 1000000000004
Varchars
HIST_TEST    VC_2     338824386822815000000000000000000000        8 AAU
HIST_TEST    VC_2     338824624507302000000000000000000000        9 AAW
HIST_TEST    VC_2     338824782963627000000000000000000000       10 AAY
HIST_TEST    VC_3     344035480872391000000000000000000000        0 BBB@
HIST_TEST    VC_3     344035481491361000000000000000000000        1 BBBB
HIST_TEST    VC_3     344035482419816000000000000000000000        2 BBBE
HIST_TEST    VC_3     344035483348271000000000000000000000        3 BBBH
HIST_TEST    VC_3     344035483967241000000000000000000000        4 BBBJ

Note that for numerics the number itself is stored and I do not need to translate it.
For VARCHAR2 columns the value held is the 15-digit number padded with zeros.
Also note, for VC_2 the range covered seems to end at AAY not AAZ and column VC_3 starts at BBB@ not BBBA {I am creating values with the last character set to A-Z}. Also, bucket 8 for VC_2 ends in a control character.

To reduce this I add a fudge to my function {again, see previous post}. It helps:

TABLE_NAME   colname ROWCOUNT REAL_VAL          MOD_REAL
---------------------------------------------------
HIST_TEST    VC_2           9 AAW               AAX
HIST_TEST    VC_2          10 AAY               AAZ
HIST_TEST    VC_3           0 BBB@              BBBA
HIST_TEST    VC_3           1 BBBB              BBBC
HIST_TEST    VC_3           2 BBBE              BBBF
HIST_TEST    VC_3           3 BBBH              BBBI
HIST_TEST    VC_3           4 BBBJ              BBBK
HIST_TEST    VC_3           5 BBBM              BBBN
HIST_TEST    VC_3           6 BBBO              BBBP
HIST_TEST    VC_3           7 BBBR              BBBS
HIST_TEST    VC_3           8 BBBT              BBBU
HIST_TEST    VC_3           9 BBBW              BBBX
HIST_TEST    VC_3          10 BBBY              BBBZ
HIST_TEST    VC_4           0 CCCC@             CCCCA
HIST_TEST    VC_4           1 CCCCB             CCCCC
HIST_TEST    VC_4           2 CCCCD             CCCCE
HIST_TEST    VC_4           3 CCCCH            CCCCH
HIST_TEST    VC_4           4 CCCCJ            CCCCJ

As you can see, I now get a better translation, but it still goes wrong sometimes (see last couple of rows). So, feel free to take my functions and use them, but be aware even the modified version is not perfect. If YOU perfect it, can I have a copy please :-)

The below shows that Numeric value histograms break when you exceed 15 digits:

colname                  END_VAL ROWCOUNT REAL_VAL
------- ------------------------ -------- --------
NUM_5                 1000000009    1,000 1000000009
NUM_6              1000000000001      103 1000000000001
NUM_6              1000000000002      219 1000000000002
NUM_6              1000000000003      328 1000000000003
NUM_6              1000000000004      427 1000000000004
NUM_6              1000000000005      542 1000000000005
NUM_6              1000000000006      651 1000000000006
NUM_6              1000000000007      771 1000000000007
NUM_6              1000000000008      881 1000000000008
NUM_6              1000000000009    1,000 1000000000009
NUM_7           1000000000000000      133 1000000000000000
NUM_7           1000000000000000      256 1000000000000000
NUM_7           1000000000000000      367 1000000000000000
NUM_7           1000000000000000      467 1000000000000000
NUM_7           1000000000000010      567 1000000000000010
NUM_7           1000000000000010      665 1000000000000010
NUM_7           1000000000000010      784 1000000000000010
NUM_7           1000000000000010      896 1000000000000010
NUM_7           1000000000000010    1,000 1000000000000010

This is the point at which storage of numbers for histograms breaks. You can see that NUM_6 is fine but NUM_7 is not. That is because NUM_6 is below 15 significant digits and NUM_7 is over 15 significant digits and effectively gets truncated.

Histograms on numeric values with more than 15 significant digits will not work as you expect, possible not at all.

With Varchar(2) values, histogram END_VALUES break even sooner, at around 7 digits:

colname                               END_VAL ROWCOUNT MOD_REAL
------- ------------------------------------- -------- --------
VC_5     354460798876024000000000000000000000        5 DDDDDO
VC_5     354460798876038000000000000000000000        6 DDDDDR
VC_5     354460798876047000000000000000000000        7 DDDDDS
VC_5     354460798876061000000000000000000000        8 DDDDDV
VC_5     354460798876071000000000000000000000        9 DDDDDY
VC_5     354460798876080000000000000000000000       10 DDDDDZ
VC_6     359673457682977000000000000000000000        0 EEEEEF8
VC_6     359673457682977000000000000000000000        1 EEEEEF8
VC_6     359673457682977000000000000000000000        2 EEEEEF8
VC_6     359673457682977000000000000000000000        3 EEEEEF8
VC_6     359673457682977000000000000000000000        4 EEEEEF8
VC_6     359673457682977000000000000000000000        5 EEEEEF8
VC_6     359673457682977000000000000000000000        6 EEEEEF8
VC_6     359673457682977000000000000000000000        7 EEEEEF8
VC_6     359673457682978000000000000000000000        8 EEEEEFn
VC_6     359673457682978000000000000000000000        9 EEEEEFn
VC_6     359673457682978000000000000000000000       10 EEEEEFn
VC_7     364886116489977000000000000000000000        0 FFFFFGK
VC_7     364886116489977000000000000000000000        1 FFFFFGK
VC_7     364886116489977000000000000000000000        2 FFFFFGK
VC_7     364886116489977000000000000000000000        3 FFFFFGK
VC_7     364886116489977000000000000000000000        4 FFFFFGK
VC_7     364886116489977000000000000000000000        5 FFFFFGK
VC_7     364886116489977000000000000000000000        6 FFFFFGK
VC_7     364886116489977000000000000000000000        7 FFFFFGK
VC_7     364886116489977000000000000000000000        8 FFFFFGK
VC_7     364886116489977000000000000000000000        9 FFFFFGK
VC_7     364886116489977000000000000000000000       10 FFFFFGK

You can see that for column VC_5 the actual ENDPOINT_VALUE is varying at the 14th and 15th significant digit and my translated value changes. But for VC_6 the numeric ENDPOINT_VALUE is hardly changing. Each row translated to one of two values.
For VC_7 the ENDPOINT_VALUE is static. All histogram END_VALUES are the same.

This means Histograms will not work properly for any VARCHAR(2) columns which do not vary for the first 7 or more characters and any characters after the 7th will be ignored.

Or does it? My logic is correct but tomorrow (or soon after) I’ll try some actual tests over theory… {So don’t ping me with corrections just yet}

I’ve not mentioned VC_16 and VC_40 have I? I will tomorrow. Maybe :-|.

Finally, Here is the selecting script, as promised.

col column_name form a7 head colname
col rowcount form 99,999
col real_val form a17
col end_val form 999999999999999999999999999999999999
col end_string form a10
col endpoint_actual_value form a40
col mod_real form a17
select --owner
 dth.table_name
,dth.column_name
,dth.endpoint_value  end_val
,dth.endpoint_number rowcount
,decode(dtc.data_type,'NUMBER',to_char(dth.endpoint_value)
            ,'VARCHAR2',hist_numtochar(dth.endpoint_value+1)
                 ,          dth.endpoint_value
       ) real_val
,decode(dtc.data_type,'NUMBER',to_char(dth.endpoint_value)

            ,'VARCHAR2',hist_numtochar2(dth.endpoint_value+1)
                 ,          dth.endpoint_value
       ) mod_real
,endpoint_actual_value
from dba_tab_histograms dth
,dba_tab_columns dtc
where dth.table_name ='HIST_TEST'
and  dth.owner=USER
and dth.owner=dtc.owner
and dth.table_name=dtc.table_name
and dth.column_name=dtc.column_name
order by dth.table_name,dth.column_name,dth.endpoint_number
/
Follow

Get every new post delivered to your Inbox.

Join 156 other followers