Decrypting Histogram Data #2 September 3, 2009
Posted by mwidlake in internals, performance.Tags: data dictionary, histograms, performance, SQL
7 comments
Hint – if you have a column where most or all of the entries have the same 15 plus characters, don’t bother gathering histograms on it. They will not help you a lot. Or at all. eg you have a column that holds “CUSTOMER IDENTIFIER xxxxxxxxx”, “CUSTOMER IDENTIFIER yyyyyyyy”, etc. Of course, good design suggests that the “CUSTOMER IDENTIFIER” bit is probably redundant and can be got rid of, but we live in the real world and may not have the power or ability to enforce that change, or we might have 3 or 4 such “prefix” strings.
Further, histograms on dnumerics longer than 15 significant digits will also potentially behave not as you would expect.
I better justify my assertion.
In the previous post on decrypting histogram data I covered how Oracle turns a varchar2 value into a numeric value that is held in the ENDPOINT_VALUE of DBA_TAB_HISTOGRAMS and I also gave you a cunning (but slightly flawed) function to convert it back. I use it in this post so you might want to go look at the prior post. Sorry it is long, I can’t stop rambling.
First, I’ll create some test data {Oh, this is on 10.2.0.3 on Linux}. The below script create a table HIST_TEST with columns NUM_1 through to NUM7, which hold numbers padded out to greater lengths and then 0-9 as the last value. Similarlry columns VC_2 to VC_8 are padded out and have a random character added. VC_1 is a random 5-character string.
create table hist_test as select rownum id ,mod(rownum,10) num_1 ,trunc(dbms_random.value(1,10)) num_2 ,1000+trunc(dbms_random.value(1,10)) num_3 ,1000000+trunc(dbms_random.value(1,10)) num_4 ,1000000000+trunc(dbms_random.value(1,10)) num_5 ,1000000000000+trunc(dbms_random.value(1,10)) num_6 ,1000000000000000+trunc(dbms_random.value(1,10)) num_7 ,lpad(chr(mod(rownum,6)+65),5,chr(mod(rownum,6)+65) )vc_1 ,lpad('A',2,'A')||dbms_random.string('U',1) vc_2 ,lpad('B',3,'B')||dbms_random.string('U',1) vc_3 ,lpad('C',4,'C')||dbms_random.string('U',1) vc_4 ,lpad('D',5,'D')||dbms_random.string('U',1) vc_5 ,lpad('E',6,'E')||dbms_random.string('U',1) vc_6 ,lpad('F',7,'F')||dbms_random.string('U',1) vc_7 ,lpad('G',8,'G')||dbms_random.string('U',1) vc_8 --below skews the table to AB. ,lpad('H',16,'H')||decode(mod(rownum,3),0,'A' ,1,'B' , dbms_random.string('U',1)) vc_16 ,lpad('I',40,'I')||dbms_random.string('U',1) vc_40 from dba_objects where rownum < 1001 / begin dbms_stats.gather_table_stats(ownname=>user,tabname=>'HIST_TEST' ,method_opt=>'FOR ALL COLUMNS SIZE 10'); END; /
The below is a simple select against DBA_TAB_COLUMNS to see the information for the column {Oh, I say simple, but you have to use the functions utl_raw.cat_to_number and utl_raw.cast_to_varchar2 to turn the raw values held in the columns LOW_VALUE and HIGH_VALUE to something we humans can read. Why does Oracle Corp have to make life so difficult? *sigh*.
Oh, click on “view plain” on the box below to get a better layout.
select column_name ,num_distinct num_dist ,decode(data_type,'NUMBER',to_char(utl_raw.cast_to_number(low_value)) ,'VARCHAR2',to_char(utl_raw.cast_to_varchar2(low_value)) , low_value ) low_v ,decode(data_type,'NUMBER',to_char(utl_raw.cast_to_number(high_value)) ,'VARCHAR2',to_char(utl_raw.cast_to_varchar2(high_value)) , high_value ) hi_v ,num_nulls n_nulls ,num_buckets n_buck ,avg_col_len avg_l from dba_tab_columns where table_name ='HIST_TEST' and owner=USER order by owner,column_id / COLUMN_NAME NUM_DIST LOW_V HI_V N_NULLS N_BUCK AVG_L --------------------------------------------------------------------- ID 1,000 1 1000 0 10 4 NUM_1 10 0 9 0 10 3 NUM_2 9 1 9 0 9 3 NUM_3 9 1001 1009 0 9 4 NUM_4 9 1000001 1000009 0 9 6 NUM_5 9 1000000001 1000000009 0 9 7 NUM_6 9 1000000000001 1000000000009 0 9 9 NUM_7 9 1000000000000001 1000000000000009 0 9 10 VC_1 6 AAAAA FFFFF 0 6 6 VC_2 26 AAA AAZ 0 10 4 VC_3 26 BBBA BBBZ 0 10 5 VC_4 26 CCCCA CCCCZ 0 10 6 VC_5 26 DDDDDA DDDDDZ 0 10 7 VC_6 26 EEEEEEA EEEEEEZ 0 10 8 VC_7 26 FFFFFFFA FFFFFFFZ 0 10 9 VC_8 26 GGGGGGGGA GGGGGGGGZ 0 10 10 VC_16 26 HHHHHHHHHHHHHHHHA HHHHHHHHHHHHHHHHZ 0 10 18 VC_40 1 IIIIIIIIIIIIIIIIII IIIIIIIIIIIIIIIIII 0 1 42/
I now select the data out of the DBA_TAB_HISTOGRAMS table to see what is really being stored. For the sake of brevity {which means, “so I can finish this post tonight”} I’ll just show bits, but if you want, download the script right at the end and, if you created the functions from the prior post, you can play with this yourself.
First, here you can see some of the values actually stored in ENDPOINT_VALUE and what they translate into:-
TABLE_NAME colname ENDPOINT_VALUE ROWCOUNT REAL_VAL ------------------------------------------------------------ Numbers HIST_TEST NUM_5 1000000007 760 1000000007 HIST_TEST NUM_5 1000000008 870 1000000008 HIST_TEST NUM_5 1000000009 1,000 1000000009 HIST_TEST NUM_6 1000000000001 103 1000000000001 HIST_TEST NUM_6 1000000000002 219 1000000000002 HIST_TEST NUM_6 1000000000003 328 1000000000003 HIST_TEST NUM_6 1000000000004 427 1000000000004 Varchars HIST_TEST VC_2 338824386822815000000000000000000000 8 AAU HIST_TEST VC_2 338824624507302000000000000000000000 9 AAW HIST_TEST VC_2 338824782963627000000000000000000000 10 AAY HIST_TEST VC_3 344035480872391000000000000000000000 0 BBB@ HIST_TEST VC_3 344035481491361000000000000000000000 1 BBBB HIST_TEST VC_3 344035482419816000000000000000000000 2 BBBE HIST_TEST VC_3 344035483348271000000000000000000000 3 BBBH HIST_TEST VC_3 344035483967241000000000000000000000 4 BBBJ
Note that for numerics the number itself is stored and I do not need to translate it.
For VARCHAR2 columns the value held is the 15-digit number padded with zeros.
Also note, for VC_2 the range covered seems to end at AAY not AAZ and column VC_3 starts at BBB@ not BBBA {I am creating values with the last character set to A-Z}. Also, bucket 8 for VC_2 ends in a control character.
To reduce this I add a fudge to my function {again, see previous post}. It helps:
TABLE_NAME colname ROWCOUNT REAL_VAL MOD_REAL --------------------------------------------------- HIST_TEST VC_2 9 AAW AAX HIST_TEST VC_2 10 AAY AAZ HIST_TEST VC_3 0 BBB@ BBBA HIST_TEST VC_3 1 BBBB BBBC HIST_TEST VC_3 2 BBBE BBBF HIST_TEST VC_3 3 BBBH BBBI HIST_TEST VC_3 4 BBBJ BBBK HIST_TEST VC_3 5 BBBM BBBN HIST_TEST VC_3 6 BBBO BBBP HIST_TEST VC_3 7 BBBR BBBS HIST_TEST VC_3 8 BBBT BBBU HIST_TEST VC_3 9 BBBW BBBX HIST_TEST VC_3 10 BBBY BBBZ HIST_TEST VC_4 0 CCCC@ CCCCA HIST_TEST VC_4 1 CCCCB CCCCC HIST_TEST VC_4 2 CCCCD CCCCE HIST_TEST VC_4 3 CCCCH CCCCH HIST_TEST VC_4 4 CCCCJ CCCCJ
As you can see, I now get a better translation, but it still goes wrong sometimes (see last couple of rows). So, feel free to take my functions and use them, but be aware even the modified version is not perfect. If YOU perfect it, can I have a copy please 🙂
The below shows that Numeric value histograms break when you exceed 15 digits:
colname END_VAL ROWCOUNT REAL_VAL ------- ------------------------ -------- -------- NUM_5 1000000009 1,000 1000000009 NUM_6 1000000000001 103 1000000000001 NUM_6 1000000000002 219 1000000000002 NUM_6 1000000000003 328 1000000000003 NUM_6 1000000000004 427 1000000000004 NUM_6 1000000000005 542 1000000000005 NUM_6 1000000000006 651 1000000000006 NUM_6 1000000000007 771 1000000000007 NUM_6 1000000000008 881 1000000000008 NUM_6 1000000000009 1,000 1000000000009 NUM_7 1000000000000000 133 1000000000000000 NUM_7 1000000000000000 256 1000000000000000 NUM_7 1000000000000000 367 1000000000000000 NUM_7 1000000000000000 467 1000000000000000 NUM_7 1000000000000010 567 1000000000000010 NUM_7 1000000000000010 665 1000000000000010 NUM_7 1000000000000010 784 1000000000000010 NUM_7 1000000000000010 896 1000000000000010 NUM_7 1000000000000010 1,000 1000000000000010
This is the point at which storage of numbers for histograms breaks. You can see that NUM_6 is fine but NUM_7 is not. That is because NUM_6 is below 15 significant digits and NUM_7 is over 15 significant digits and effectively gets truncated.
Histograms on numeric values with more than 15 significant digits will not work as you expect, possible not at all.
With Varchar(2) values, histogram END_VALUES break even sooner, at around 7 digits:
colname END_VAL ROWCOUNT MOD_REAL ------- ------------------------------------- -------- -------- VC_5 354460798876024000000000000000000000 5 DDDDDO VC_5 354460798876038000000000000000000000 6 DDDDDR VC_5 354460798876047000000000000000000000 7 DDDDDS VC_5 354460798876061000000000000000000000 8 DDDDDV VC_5 354460798876071000000000000000000000 9 DDDDDY VC_5 354460798876080000000000000000000000 10 DDDDDZ VC_6 359673457682977000000000000000000000 0 EEEEEF8 VC_6 359673457682977000000000000000000000 1 EEEEEF8 VC_6 359673457682977000000000000000000000 2 EEEEEF8 VC_6 359673457682977000000000000000000000 3 EEEEEF8 VC_6 359673457682977000000000000000000000 4 EEEEEF8 VC_6 359673457682977000000000000000000000 5 EEEEEF8 VC_6 359673457682977000000000000000000000 6 EEEEEF8 VC_6 359673457682977000000000000000000000 7 EEEEEF8 VC_6 359673457682978000000000000000000000 8 EEEEEFn VC_6 359673457682978000000000000000000000 9 EEEEEFn VC_6 359673457682978000000000000000000000 10 EEEEEFn VC_7 364886116489977000000000000000000000 0 FFFFFGK VC_7 364886116489977000000000000000000000 1 FFFFFGK VC_7 364886116489977000000000000000000000 2 FFFFFGK VC_7 364886116489977000000000000000000000 3 FFFFFGK VC_7 364886116489977000000000000000000000 4 FFFFFGK VC_7 364886116489977000000000000000000000 5 FFFFFGK VC_7 364886116489977000000000000000000000 6 FFFFFGK VC_7 364886116489977000000000000000000000 7 FFFFFGK VC_7 364886116489977000000000000000000000 8 FFFFFGK VC_7 364886116489977000000000000000000000 9 FFFFFGK VC_7 364886116489977000000000000000000000 10 FFFFFGK
You can see that for column VC_5 the actual ENDPOINT_VALUE is varying at the 14th and 15th significant digit and my translated value changes. But for VC_6 the numeric ENDPOINT_VALUE is hardly changing. Each row translated to one of two values.
For VC_7 the ENDPOINT_VALUE is static. All histogram END_VALUES are the same.
This means Histograms will not work properly for any VARCHAR(2) columns which do not vary for the first 7 or more characters and any characters after the 7th will be ignored.
Or does it? My logic is correct but tomorrow (or soon after) I’ll try some actual tests over theory… {So don’t ping me with corrections just yet}
I’ve not mentioned VC_16 and VC_40 have I? I will tomorrow. Maybe :-|.
Finally, Here is the selecting script, as promised.
col column_name form a7 head colname col rowcount form 99,999 col real_val form a17 col end_val form 999999999999999999999999999999999999 col end_string form a10 col endpoint_actual_value form a40 col mod_real form a17 select --owner dth.table_name ,dth.column_name ,dth.endpoint_value end_val ,dth.endpoint_number rowcount ,decode(dtc.data_type,'NUMBER',to_char(dth.endpoint_value) ,'VARCHAR2',hist_numtochar(dth.endpoint_value+1) , dth.endpoint_value ) real_val ,decode(dtc.data_type,'NUMBER',to_char(dth.endpoint_value) ,'VARCHAR2',hist_numtochar2(dth.endpoint_value+1) , dth.endpoint_value ) mod_real ,endpoint_actual_value from dba_tab_histograms dth ,dba_tab_columns dtc where dth.table_name ='HIST_TEST' and dth.owner=USER and dth.owner=dtc.owner and dth.table_name=dtc.table_name and dth.column_name=dtc.column_name order by dth.table_name,dth.column_name,dth.endpoint_number /