jump to navigation

Decrypting Histogram Data August 11, 2009

Posted by mwidlake in internals, performance.
Tags: , , ,
trackback

{part two is here}

I’m looking at Histograms at the moment and I’ve hit a problem I always hit when looking at histograms. So I’ve decided once and for all to see if I can find a solution.

The value held in DBA_TAB_HISTOGRAMS.ENDPOINT_VALUE for VARCHAR2 columns is a number with no apparent link to reality.

{Before anyone who has read my previous posts suggests it, I’ve already looked at the view and the relevant column resolves to data in either sys.histgrm$ or sys.hist_head$ and the underlying values held are still numbers and not processed in the view, so they are numbers. I think}.
{{I have a sneaking suspicion this could turn into a set of increasingly obscure postings…}}
{{{THIS one might be my most obscure so far!}}}

Let’s have a look:

select --owner
 table_name
,column_name
,endpoint_value  end_val
,endpoint_number rowcount
from all_tab_histograms
where table_name ='TEST_P'
and  owner='WIDLAKEM'
order by table_name,column_name,endpoint_number
/
TABLE_NAME      COLUMN_NAME        END_VAL        ROWCOUNT
--------------- --------------- ---------- ---------------
TEST_P          STATUS                   0          41,523
TEST_P          STATUS                   1          42,065
TEST_P          STATUS                   2          42,099
TEST_P          VC_1            3.3882E+35               0
TEST_P          VC_1            3.4951E+35               1
TEST_P          VC_1            3.6487E+35               2
TEST_P          VC_1            3.7558E+35               3
TEST_P          VC_1            3.9095E+35               4
TEST_P          VC_1            4.0162E+35               5
TEST_P          VC_1            4.1697E+35               6
TEST_P          VC_1            4.3235E+35               7
TEST_P          VC_1            4.4305E+35               8
TEST_P          VC_1            4.5841E+35               9
TEST_P          VC_1            4.6914E+35              10

So what in heck is 4.6914E+35 {last value listed}?

Well, it is a number of course. If I set my column heading appropriately I can see this:-

col end_val form 999,999,999,999,999,999,999,999,999,999,999,999
col column_name form a10
select column_name
,endpoint_value  end_val
,endpoint_number rowcount
from all_tab_histograms
where table_name ='TEST_P'
and  owner='WIDLAKEM'
and column_name='VC_1'
order by table_name,column_name,endpoint_number
/
COLUMN_NAM                                          END_VAL   ROWCOUNT
---------- ------------------------------------------------ ----------
VC_1        338,822,823,410,931,000,000,000,000,000,000,000          0
VC_1        349,512,448,932,628,000,000,000,000,000,000,000          1
VC_1        364,867,102,368,954,000,000,000,000,000,000,000          2
VC_1        375,575,265,424,979,000,000,000,000,000,000,000          3
VC_1        390,949,334,699,583,000,000,000,000,000,000,000          4
VC_1        401,618,352,253,998,000,000,000,000,000,000,000          5
VC_1        416,972,612,321,579,000,000,000,000,000,000,000          6
VC_1        432,345,727,450,272,000,000,000,000,000,000,000          7
VC_1        443,054,364,035,286,000,000,000,000,000,000,000          8
VC_1        458,409,658,691,912,000,000,000,000,000,000,000          9
VC_1        469,139,289,515,351,000,000,000,000,000,000,000         10

The ENDPOINT_VALUE is an encoded representation of the varchar2 value, based on the trick of turning each character into it’s ascii equivalent (0-255) and then multiplying it by 256*no-of-characters-in to the string and adding it all together. Let’s use “Cat” as an example.

  • Last character is ‘t’, ascii value is 116=116;
  • Next-to-last character is ‘b’, ascii value 97, *(256)= 24832;
  • Chr 3 in is ‘C’, ascii value 67, *(256*256) =4390912
  • Total is 4390912+24832+116=4415860

What Oracle actually does is slightly different {and I could have this slightly wrong, my brain waved a white flag a couple of hours back} it takes the first 15 characters and multiplies the first character ascii value by power(256*15), second character ascii value by power(256*14) etc until it runs out of characters or gets to character 15.

How do I know this? I decoded the ‘C’ in the kernel :-). No, I didn’t, I found a link to an encode script Jonathan Lewis had written {it’s also in his latest book, or at least in the set of scripts the book tells you how to download}. I’ve lost the original link I found, and thus a nice little article on the whole topic, but this link will do until I can re-find the original. {NB this referenced article mentions using the hexstr function which may be neater but, as I said earlier, my brain has given up}.

I’ve outrageously stolen Jonathan’s script which encodes a varchar2 string into the relevant numeric and used it as a template to create a decode version too. Just a couple of notes:

– I can imagine this might not work if you have multibyte characters.
– Oracle confuse things by doing something like translating the 15 character into a 36-digit string – and then taking only the first 15 significant digits. This kind of messes up the untranslate.

So, I create three functions:

  • hist_chartonum that converts a varchar2 into something very similar to what is stored in the histogram views. Jonathan will find it incredibly familiar. An optional second paramter turns off the truncation that oracle does for the histogram data, so you can get the full glorious 36 significant digits for 15 characters if you wish.
  • hist_numtochar which converts the numeric back to a string. It gets it slightly wrong due to the truncation of the least significant 20 or so characters. Again, an optional second paramater allows it to not replicate the trunaction and work with all 15 characters.
  • hist_numtochar2 which is an attempt to allow for the truncation errors. I add power(256,9) to the value if it has been truncated. It seems to make eg the numeric representation of AAAAA translate back to AAAAA not AAAA@. Yes, it’s a fudge.

This is the script:-

-- cre_hist_funcs.sql
-- Heavily borrowed from Jonathan Lewis, sep 2003
-- MDW 11/8/09 - all mistakes mine.
-- hist_chartonum converts a string to a number in a very similar way
-- to how Oracle does so for storing in column ENDPOINT_VALUE in XXX_TAB_HISTOGRAMS
-- Optionally get it to not truncate the value in the same way to only 15 digits.
-- hist_numtochar converts the ENDPOINT_VALUE back to human-readable format.
-- Optinally get it to go beyond 15 numerics (7 or 8 characters)
-- hist_numtochar2 attempts to allow for truncation errors.
-- JLs version has been checked against 8.1.7.4, 9.2.0.4 and 10.1.0.2
-- I've used on 10.2.0.3
set timing off
create or replace function hist_chartonum(p_vc varchar2
                                         ,p_trunc varchar2 :='Y') return number
is
m_vc varchar2(15) := substr(rpad(p_vc,15,chr(0)),1,15);
m_n number := 0;
begin
  for i in 1..15 loop
/*  dbms_output.put(ascii(substr(m_vc,i,1)));
    dbms_output.put(chr(9));
    dbms_output.put_Line(to_char( power(256,15-i) * ascii(substr(m_vc,i,1)),
                                '999,999,999,999,999,999,999,999,999,999,999,999'
                                 )
                        ); */
    m_n := m_n + power(256,15-i) * ascii(substr(m_vc,i,1));
  end loop;
-- this converts it from a 36 digit number to the 15-digit number used
-- in the ENDPOINT_VALUE Column.
  If p_trunc = 'Y' then
    m_n := round(m_n, -21);
  end if;
-- dbms_output.put_line(to_char(m_n,'999,999,999,999,999,999,999,999,999,999,999,999'));
return m_n;
end;
/

create or replace function hist_numtochar(p_num number
                                         ,p_trunc varchar2 :='Y') return varchar2
is
  m_vc varchar2(15);
  m_n number :=0;
  m_n1 number;
  m_loop number :=7;
begin
m_n :=p_num;
if length(to_char(m_n))<36 then
--dbms_output.put_line ('input too short');
  m_vc:='num format err';
else
  if p_trunc !='Y' then
    m_loop :=15;
  end if;
--dbms_output.put_line(to_char(m_N,'999,999,999,999,999,999,999,999,999,999,999,999'));
  for i in 1..m_loop loop
    m_n1:=trunc(m_n/(power(256,15-i)));
--    dbms_output.put_line(to_char(m_n1));
    if m_n1!=0 then m_vc:=m_vc||chr(m_n1);
    end if;
    dbms_output.put_line(m_vc);
    m_n:=m_n-(m_n1*power(256,15-i));  
  end loop;
end if;
return m_vc;
end;
/
create or replace function hist_numtochar2(p_num number
                                         ,p_trunc varchar2 :='Y') return varchar2
is
  m_vc varchar2(15);
  m_n number :=0;
  m_n1 number;
  m_loop number :=7;
begin
m_n :=p_num;
if length(to_char(m_n))<36 then
--dbms_output.put_line ('input too short');
  m_vc:='num format err';
else
  if p_trunc !='Y' then
    m_loop :=15;
  else
   m_n:=m_n+power(256,9);
  end if;
--dbms_output.put_line(to_char(m_N,'999,999,999,999,999,999,999,999,999,999,999,999'));
  for i in 1..m_loop loop
    m_n1:=trunc(m_n/(power(256,15-i)));
--    dbms_output.put_line(to_char(m_n1));
    if m_n1!=0 then m_vc:=m_vc||chr(m_n1);
    end if;
    dbms_output.put_line(m_vc);
    m_n:=m_n-(m_n1*power(256,15-i));  
  end loop;
end if;
return m_vc;
end;
/
rem
rem Sample of use:
rem 

col id1 format 999,999,999,999,999,999,999,999,999,999,999,999
col id3 format 999,999,999,999,999,999,999,999,999,999,999,999
select
  'short'  text
 ,hist_chartonum('short') id1
 ,hist_numtochar(hist_chartonum('short')) id2
 ,hist_chartonum('short','N') id3
 ,hist_numtochar(hist_chartonum('short','N'),'N') id4      
from dual;
select
  'alongteststring'  text
 ,hist_chartonum('alongteststring') id1
 ,hist_numtochar(hist_chartonum('alongteststring')) id2
 ,hist_chartonum('alongteststring','N') id3
 ,hist_numtochar(hist_chartonum('alongteststring','N'),'N') id4      
from dual;
select
  'amuchlongerteststring'  text
 ,hist_chartonum('amuchlongerteststring') id1
 ,hist_numtochar(hist_chartonum('amuchlongerteststring')) id2
 ,hist_chartonum('amuchlongerteststring','N') id3
 ,hist_numtochar(hist_chartonum('amuchlongerteststring','N'),'N') id4      
from dual;
spool off

The test at the end produces the following output:-

strng  histgrm_translation
------------------------------------------------
translate_back
------------------------------------------------
untrimmed_translation
------------------------------------------------
untrimmed_translate_back
------------------------------------------------
short  599,232,339,077,851,000,000,000,000,000,000,000
short
 599,232,339,077,850,549,055,205,294,263,500,800
short

alongteststring  505,852,124,009,532,000,000,000,000,000,000,000
alongtm
 505,852,124,009,531,849,435,547,374,314,942,055
alongteststring

amuchlongerteststring  505,872,878,384,947,000,000,000,000,000,000,000
amuchly
 505,872,878,384,946,809,648,430,989,359,543,156
amuchlongertest

Final comment. Oracle trims the value stored in sys.histgrm$ and knackers up the untranslate. Why? Well, this is only me thinking out aloud, but this reduces the number from 36 to 15 significant digits and numerics are stored as one byte per two digits plus an offset byte {am I missing a length byte?}. So 10 bytes or so. This will keep the size of data stored down. I mean, as I posted here, the sys.histgrm table gets big enough as it is!

Tomorrow {or whenever my brain recovers} I’ll show how I use it and one or two oddities I have found or come across from other people’s blogs.

Comments»

1. B. Polarski - August 12, 2009

Will wait on the second article, this one is a bit technical and it is summer in Belgium. Just curious as of what can be derived from all this. At least the 3 functions may integrate an anonymous block (avoid creating objects in DB) in order to launch system queries.

Nice would be to report count of buckets on columns for a given index with high and low values translated in a human readable format.

ie) Report on historgram index(empno, custname) :

empno Name bucket no bucket count
—— —— ——— ————-
1 AAAAA 1 23
1 BBBBBB 2 66
.
.

mwidlake - August 13, 2009

Hi Bernard,

Sorry, life outside of work is seriously hampering life in work at the moment. Seem in danger of running out of relatives…

Might not get to post part 2 (or 3, 4 etc) on this topic of a week or two.
However, I have a scriot that uses converting RAW values for min/max values from DBA-??? column views and my home-created functions for Histograms. Why Oracle Corp can’t stick to one format or another, hell, I don’t know, sticking to their own internal standards never seemed to me to be much of a priotrity for Oracle Corp. Bunch of amatures….

2. Blogroll Report 07/08/2009 – 14/08/2009 « Coskan’s Approach to Oracle - August 14, 2009

[…] Martin Widlake – Decrypting Histogram Data […]

3. Marie Hoffmann - September 17, 2013

Hi,
thanks for the precise description about endpoint_value conversions. I currently need to reconvert the numeric bin values into character strings.
I tested your script (or Johnathan’s) on Oracle 11.2 g and loaded it via sqlplus> @{pathtoscript}/script.sql;
I am not sure if I understand the function of variable “lt”. During the call of the script I am asked to set a value for lt (e.g. tried 8). After having given values for “lt”, the script then compiles with errors:

NAME TYPE SEQUENCE LINE POSITION TEXT ATTRIBUTE MESSAGE_NUMBER
—————————— ———— ———- ———- ——————————————————————————————
HIST_NUMTOCHAR FUNCTION 1 11 24 PLS-00103: Encountered the symbol “8” when expecting one of the following:
. ( * % & = – + at in is mod remainder not rem then
or != or ~= >= <= and or like like2
like4 likec between || multiset member submultiset
ERROR 103

Hence, the script replaces “lt” with my input 8 and complains about it like it were a syntax error. What are appropriate settings for “lt”?

Jaco de Graaf - January 6, 2014

Marie,
For some reason the ‘<'-sign in Martin's script was replaced by '<'.
The 9th line in function hist_numto_char should read: if length(to_char(m_n)) < 36
Hope this helps.


Leave a comment