Decrypting Histogram Data August 11, 2009
Posted by mwidlake in internals, performance.Tags: data dictionary, histograms, performance, statistics
5 comments
I’m looking at Histograms at the moment and I’ve hit a problem I always hit when looking at histograms. So I’ve decided once and for all to see if I can find a solution.
The value held in DBA_TAB_HISTOGRAMS.ENDPOINT_VALUE for VARCHAR2 columns is a number with no apparent link to reality.
{Before anyone who has read my previous posts suggests it, I’ve already looked at the view and the relevant column resolves to data in either sys.histgrm$ or sys.hist_head$ and the underlying values held are still numbers and not processed in the view, so they are numbers. I think}.
{{I have a sneaking suspicion this could turn into a set of increasingly obscure postings…}}
{{{THIS one might be my most obscure so far!}}}
Let’s have a look:
select --owner table_name ,column_name ,endpoint_value end_val ,endpoint_number rowcount from all_tab_histograms where table_name ='TEST_P' and owner='WIDLAKEM' order by table_name,column_name,endpoint_number / TABLE_NAME COLUMN_NAME END_VAL ROWCOUNT --------------- --------------- ---------- --------------- TEST_P STATUS 0 41,523 TEST_P STATUS 1 42,065 TEST_P STATUS 2 42,099 TEST_P VC_1 3.3882E+35 0 TEST_P VC_1 3.4951E+35 1 TEST_P VC_1 3.6487E+35 2 TEST_P VC_1 3.7558E+35 3 TEST_P VC_1 3.9095E+35 4 TEST_P VC_1 4.0162E+35 5 TEST_P VC_1 4.1697E+35 6 TEST_P VC_1 4.3235E+35 7 TEST_P VC_1 4.4305E+35 8 TEST_P VC_1 4.5841E+35 9 TEST_P VC_1 4.6914E+35 10
So what in heck is 4.6914E+35 {last value listed}?
Well, it is a number of course. If I set my column heading appropriately I can see this:-
col end_val form 999,999,999,999,999,999,999,999,999,999,999,999 col column_name form a10 select column_name ,endpoint_value end_val ,endpoint_number rowcount from all_tab_histograms where table_name ='TEST_P' and owner='WIDLAKEM' and column_name='VC_1' order by table_name,column_name,endpoint_number / COLUMN_NAM END_VAL ROWCOUNT ---------- ------------------------------------------------ ---------- VC_1 338,822,823,410,931,000,000,000,000,000,000,000 0 VC_1 349,512,448,932,628,000,000,000,000,000,000,000 1 VC_1 364,867,102,368,954,000,000,000,000,000,000,000 2 VC_1 375,575,265,424,979,000,000,000,000,000,000,000 3 VC_1 390,949,334,699,583,000,000,000,000,000,000,000 4 VC_1 401,618,352,253,998,000,000,000,000,000,000,000 5 VC_1 416,972,612,321,579,000,000,000,000,000,000,000 6 VC_1 432,345,727,450,272,000,000,000,000,000,000,000 7 VC_1 443,054,364,035,286,000,000,000,000,000,000,000 8 VC_1 458,409,658,691,912,000,000,000,000,000,000,000 9 VC_1 469,139,289,515,351,000,000,000,000,000,000,000 10
The ENDPOINT_VALUE is an encoded representation of the varchar2 value, based on the trick of turning each character into it’s ascii equivalent (0-255) and then multiplying it by 256*no-of-characters-in to the string and adding it all together. Let’s use “Cat” as an example.
- Last character is ‘t’, ascii value is 116=116;
- Next-to-last character is ‘b’, ascii value 97, *(256)= 24832;
- Chr 3 in is ‘C’, ascii value 67, *(256*256) =4390912
- Total is 4390912+24832+116=4415860
What Oracle actually does is slightly different {and I could have this slightly wrong, my brain waved a white flag a couple of hours back} it takes the first 15 characters and multiplies the first character ascii value by power(256*15), second character ascii value by power(256*14) etc until it runs out of characters or gets to character 15.
How do I know this? I decoded the ‘C’ in the kernel :-). No, I didn’t, I found a link to an encode script Jonathan Lewis had written {it’s also in his latest book, or at least in the set of scripts the book tells you how to download}. I’ve lost the original link I found, and thus a nice little article on the whole topic, but this link will do until I can re-find the original. {NB this referenced article mentions using the hexstr function which may be neater but, as I said earlier, my brain has given up}.
I’ve outrageously stolen Jonathan’s script which encodes a varchar2 string into the relevant numeric and used it as a template to create a decode version too. Just a couple of notes:
– I can imagine this might not work if you have multibyte characters.
– Oracle confuse things by doing something like translating the 15 character into a 36-digit string – and then taking only the first 15 significant digits. This kind of messes up the untranslate.
So, I create three functions:
- hist_chartonum that converts a varchar2 into something very similar to what is stored in the histogram views. Jonathan will find it incredibly familiar. An optional second paramter turns off the truncation that oracle does for the histogram data, so you can get the full glorious 36 significant digits for 15 characters if you wish.
- hist_numtochar which converts the numeric back to a string. It gets it slightly wrong due to the truncation of the least significant 20 or so characters. Again, an optional second paramater allows it to not replicate the trunaction and work with all 15 characters.
- hist_numtochar2 which is an attempt to allow for the truncation errors. I add power(256,9) to the value if it has been truncated. It seems to make eg the numeric representation of AAAAA translate back to AAAAA not AAAA@. Yes, it’s a fudge.
This is the script:-
-- cre_hist_funcs.sql -- Heavily borrowed from Jonathan Lewis, sep 2003 -- MDW 11/8/09 - all mistakes mine. -- hist_chartonum converts a string to a number in a very similar way -- to how Oracle does so for storing in column ENDPOINT_VALUE in XXX_TAB_HISTOGRAMS -- Optionally get it to not truncate the value in the same way to only 15 digits. -- hist_numtochar converts the ENDPOINT_VALUE back to human-readable format. -- Optinally get it to go beyond 15 numerics (7 or 8 characters) -- hist_numtochar2 attempts to allow for truncation errors. -- JLs version has been checked against 8.1.7.4, 9.2.0.4 and 10.1.0.2 -- I've used on 10.2.0.3 set timing off create or replace function hist_chartonum(p_vc varchar2 ,p_trunc varchar2 :='Y') return number is m_vc varchar2(15) := substr(rpad(p_vc,15,chr(0)),1,15); m_n number := 0; begin for i in 1..15 loop /* dbms_output.put(ascii(substr(m_vc,i,1))); dbms_output.put(chr(9)); dbms_output.put_Line(to_char( power(256,15-i) * ascii(substr(m_vc,i,1)), '999,999,999,999,999,999,999,999,999,999,999,999' ) ); */ m_n := m_n + power(256,15-i) * ascii(substr(m_vc,i,1)); end loop; -- this converts it from a 36 digit number to the 15-digit number used -- in the ENDPOINT_VALUE Column. If p_trunc = 'Y' then m_n := round(m_n, -21); end if; -- dbms_output.put_line(to_char(m_n,'999,999,999,999,999,999,999,999,999,999,999,999')); return m_n; end; / create or replace function hist_numtochar(p_num number ,p_trunc varchar2 :='Y') return varchar2 is m_vc varchar2(15); m_n number :=0; m_n1 number; m_loop number :=7; begin m_n :=p_num; if length(to_char(m_n))<36 then --dbms_output.put_line ('input too short'); m_vc:='num format err'; else if p_trunc !='Y' then m_loop :=15; end if; --dbms_output.put_line(to_char(m_N,'999,999,999,999,999,999,999,999,999,999,999,999')); for i in 1..m_loop loop m_n1:=trunc(m_n/(power(256,15-i))); -- dbms_output.put_line(to_char(m_n1)); if m_n1!=0 then m_vc:=m_vc||chr(m_n1); end if; dbms_output.put_line(m_vc); m_n:=m_n-(m_n1*power(256,15-i)); end loop; end if; return m_vc; end; / create or replace function hist_numtochar2(p_num number ,p_trunc varchar2 :='Y') return varchar2 is m_vc varchar2(15); m_n number :=0; m_n1 number; m_loop number :=7; begin m_n :=p_num; if length(to_char(m_n))<36 then --dbms_output.put_line ('input too short'); m_vc:='num format err'; else if p_trunc !='Y' then m_loop :=15; else m_n:=m_n+power(256,9); end if; --dbms_output.put_line(to_char(m_N,'999,999,999,999,999,999,999,999,999,999,999,999')); for i in 1..m_loop loop m_n1:=trunc(m_n/(power(256,15-i))); -- dbms_output.put_line(to_char(m_n1)); if m_n1!=0 then m_vc:=m_vc||chr(m_n1); end if; dbms_output.put_line(m_vc); m_n:=m_n-(m_n1*power(256,15-i)); end loop; end if; return m_vc; end; / rem rem Sample of use: rem col id1 format 999,999,999,999,999,999,999,999,999,999,999,999 col id3 format 999,999,999,999,999,999,999,999,999,999,999,999 select 'short' text ,hist_chartonum('short') id1 ,hist_numtochar(hist_chartonum('short')) id2 ,hist_chartonum('short','N') id3 ,hist_numtochar(hist_chartonum('short','N'),'N') id4 from dual; select 'alongteststring' text ,hist_chartonum('alongteststring') id1 ,hist_numtochar(hist_chartonum('alongteststring')) id2 ,hist_chartonum('alongteststring','N') id3 ,hist_numtochar(hist_chartonum('alongteststring','N'),'N') id4 from dual; select 'amuchlongerteststring' text ,hist_chartonum('amuchlongerteststring') id1 ,hist_numtochar(hist_chartonum('amuchlongerteststring')) id2 ,hist_chartonum('amuchlongerteststring','N') id3 ,hist_numtochar(hist_chartonum('amuchlongerteststring','N'),'N') id4 from dual; spool off
The test at the end produces the following output:-
strng histgrm_translation ------------------------------------------------ translate_back ------------------------------------------------ untrimmed_translation ------------------------------------------------ untrimmed_translate_back ------------------------------------------------ short 599,232,339,077,851,000,000,000,000,000,000,000 short 599,232,339,077,850,549,055,205,294,263,500,800 short alongteststring 505,852,124,009,532,000,000,000,000,000,000,000 alongtm 505,852,124,009,531,849,435,547,374,314,942,055 alongteststring amuchlongerteststring 505,872,878,384,947,000,000,000,000,000,000,000 amuchly 505,872,878,384,946,809,648,430,989,359,543,156 amuchlongertest
Final comment. Oracle trims the value stored in sys.histgrm$ and knackers up the untranslate. Why? Well, this is only me thinking out aloud, but this reduces the number from 36 to 15 significant digits and numerics are stored as one byte per two digits plus an offset byte {am I missing a length byte?}. So 10 bytes or so. This will keep the size of data stored down. I mean, as I posted here, the sys.histgrm table gets big enough as it is!
Tomorrow {or whenever my brain recovers} I’ll show how I use it and one or two oddities I have found or come across from other people’s blogs.