
Format of Date Histograms February 6, 2010

Posted by mwidlake in internals.

Oracle, annoyingly, handles dates internally in several formats. Dates are stored in tables as seven bytes, each byte representing century, year-in-century, month, day, hour, minute and second, shifted by different amounts. For the high/low values seen in DBA_TAB_COLUMNS or DBA_TAB_COL_STATISTICS, the date is stored as a RAW value, where a two-digit hex string represents the century, year-in-century, month etc. as before – see this post on decoding high and low column values and check out the comments for corrections.
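As a quick illustration of that seven-byte format {my own sketch, the table and values are made up – note you have to dump a *stored* column value; dump(sysdate) shows the different in-memory Typ=13 format}:

```sql
create table date_dump_demo (d date);
insert into date_dump_demo
values (to_date('2010-02-06 17:31:38','YYYY-MM-DD HH24:MI:SS'));

-- Typ=12 Len=7: 120,110,2,6,18,32,39
-- = century 20+100, year-in-century 10+100, month 2, day 6,
--   hour 17+1, minute 31+1, second 38+1
select dump(d) from date_dump_demo;
```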

So what about histograms? You might want to know what is in the histogram for a date column in order to better understand the decisions made by the CBO. Below, I pull out the histogram for an example date column {I’ve trimmed the output a little to save space}:

select table_name
,column_name
,endpoint_value  end_val
,endpoint_number rowcount
from all_tab_histograms
where table_name ='TEST_TABLE_X'
and owner ='TEST_1'
and column_name = 'PLACED_DT'
order by endpoint_number;

TABLE_NAME      COLUMN_NAME         END_VAL      ROWCOUNT
--------------- ------------- ------------- ---------------
TEST_TABLE_X     PLACED_DT   2,452,258         0
TEST_TABLE_X     PLACED_DT   2,454,334         1
TEST_TABLE_X     PLACED_DT   2,454,647         2
TEST_TABLE_X     PLACED_DT   2,454,737         3
TEST_TABLE_X     PLACED_DT   2,454,820         4
TEST_TABLE_X     PLACED_DT   2,454,867         5
TEST_TABLE_X     PLACED_DT   2,454,929         6
TEST_TABLE_X     PLACED_DT   2,454,981         7
TEST_TABLE_X     PLACED_DT   2,455,006         8
TEST_TABLE_X     PLACED_DT   2,455,024         9
TEST_TABLE_X     PLACED_DT   2,455,039         10
TEST_TABLE_X     PLACED_DT   2,455,050         11
TEST_TABLE_X     PLACED_DT   2,455,205         42
TEST_TABLE_X     PLACED_DT   2,455,207         43
TEST_TABLE_X     PLACED_DT   2,455,209         44
TEST_TABLE_X     PLACED_DT   2,455,211         45
TEST_TABLE_X     PLACED_DT   2,455,213         46
TEST_TABLE_X     PLACED_DT   2,455,215         47
TEST_TABLE_X     PLACED_DT   2,455,216         48
TEST_TABLE_X     PLACED_DT   2,455,218         49
TEST_TABLE_X     PLACED_DT   2,455,220         50
TEST_TABLE_X     PLACED_DT   2,455,221         51
TEST_TABLE_X     PLACED_DT   2,455,222         52
TEST_TABLE_X     PLACED_DT   2,455,222         53
TEST_TABLE_X     PLACED_DT   2,455,222         54
TEST_TABLE_X     PLACED_DT   2,455,222         55
TEST_TABLE_X     PLACED_DT   2,455,223         56
TEST_TABLE_X     PLACED_DT   2,455,223         57
TEST_TABLE_X     PLACED_DT   2,455,223         58
TEST_TABLE_X     PLACED_DT   2,455,223         59
TEST_TABLE_X     PLACED_DT   2,455,223         69
TEST_TABLE_X     PLACED_DT   2,455,223         70
TEST_TABLE_X     PLACED_DT   2,455,223         71
TEST_TABLE_X     PLACED_DT   2,455,223         72
TEST_TABLE_X     PLACED_DT   2,455,223         73
TEST_TABLE_X     PLACED_DT   2,455,224         74
TEST_TABLE_X     PLACED_DT   2,455,226         75

Well, it looks like a seven-digit number, so it must be representing the date in a similar way to that described above, yes? No. The format is obviously something different {Oh come ON Oracle, some internal standards would be nice once in a while! :-) }

I pulled out the minimum and maximum values from the DBA_TAB_COLUMNS table and translated them:-

COLUMN_NAME  NUM_DIST   LOW_V               HI_V 
------------ ---------- ------------------- ------------------- 
PLACED_DT    375,428    2001-12-13 17:31:38 2010-01-28 23:51:38 

So the above low and high values pretty much match the first and last values in the histogram, which are 2452258 and 2455226. Any guesses? Those two histogram values are very close to each other, only 2968 apart to cover around 9 years of data. 9 times 365…

The histogram is representing the dates in Julian format, i.e. the number of days since 1st Jan 4712BC. As a quick proof of this:

select to_date('2452258','J'),to_date('2455226','J')
from dual;

TO_DATE('2452258' TO_DATE('2455226'
----------------- -----------------
14-DEC-2001 00:00 29-JAN-2010 00:00

Well, what do you know. Very close to the 13th Dec 2001 and 28th Jan 2010 seen above.

This of course makes sense: storing the date as an ascending numeric means it is simple to calculate the width of a range and how many values per day there are.
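For example, with the endpoints and NUM_DISTINCT above {my arithmetic, not from the original output}:

```sql
-- width of the range in days, straight from the Julian endpoints,
-- and the average number of distinct values per day
select 2455226 - 2452258                          as days_covered
      ,round(375428 / (2455226 - 2452258),1)      as distinct_values_per_day
from dual;
-- 2968 days, roughly 126.5 distinct values per day
```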

Imagine how complex it would be to do this if the date was stored in the bizarre way we humans deal with it – an ascending numeric for the year, a number cycling through 12 values for the month and a varying cyclic number for the day of the month. And that is ignoring other calendars in common use.

However, I’m looking at this method of representing the date and something bothers me. There is no allowance for the time portion. Maybe column histograms can’t cope with time? I feel a couple of tests coming on…

If you have seen the comments, you will know that Jonathan commented to ask if I might have truncated/been selective with the data in creating my test table as he is sure there is a fractional section representing the time portion.

Well, I did say I needed to check further so maybe I would have spotted my error on my own… :-)
This is the script I use to pull out histogram data (and yes, I trimmed the output from the script before copying it into this post, so it does not quite match the script):

-- chk_hist.sql
-- Martin Widlake 7/4/4
-- pull out the histograms on a table
set pause on pages 32
spool chk_hist.lst
col owner form A8
col table_name form A15
col column_name form a20
col rowcount form 99,999,999,999
col end_val form 99,999,999,999
select owner
,endpoint_value  end_val
,endpoint_number rowcount
from all_tab_histograms
where table_name like upper(nvl('&tab_name','WHOOPS')||'%')
and  owner like upper('&tab_own')||'%'
and  column_name like upper('&col_name')||'%'
order by table_name,column_name,endpoint_number
spool off
clear colu

Hmm, wonder what happens if I change the "col end_val form 99,999,999,999" so that it would show fractions…

col end_val form 999,999,999.99999999

TABLE_NAME      COLUMN_NAME             END_VAL     ROWCOUNT
--------------- ---------- --------------------- ------------
TEST_TABLE_X    PLACED_DT     2,452,257.73030093            0
TEST_TABLE_X    PLACED_DT     2,454,333.76546296            1
TEST_TABLE_X    PLACED_DT     2,454,647.32561343            2
TEST_TABLE_X    PLACED_DT     2,454,737.25017361            3
TEST_TABLE_X    PLACED_DT     2,454,820.02204861            4
TEST_TABLE_X    PLACED_DT     2,454,866.98009259            5
TEST_TABLE_X    PLACED_DT     2,454,928.66848380            6
TEST_TABLE_X    PLACED_DT     2,454,980.94815972            7
TEST_TABLE_X    PLACED_DT     2,455,005.68413194            8
TEST_TABLE_X    PLACED_DT     2,455,023.67142361            9
TEST_TABLE_X    PLACED_DT     2,455,039.03236111           10
TEST_TABLE_X    PLACED_DT     2,455,050.39246528           11


Thanks Jonathan :-)
It is still a Julian date, but with the time held as a fraction of a day, as he said.
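So the full translation of a date endpoint is the whole-number part converted with the 'J' format, plus the fraction added back as a portion of a day. Taking the first endpoint above as an example:

```sql
-- whole days via the Julian format mask, fraction of a day added back on
select to_date(trunc(2452257.73030093),'J')
       + (2452257.73030093 - trunc(2452257.73030093)) as low_dt
from dual;
-- 13-DEC-2001 17:31:38, matching the LOW_V seen in DBA_TAB_COLUMNS
```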

Dealing with Bind Issues December 1, 2009

Posted by mwidlake in Meeting notes, performance.

One of the presentations I saw today was on handling bind values. I can't say it was, for me, the best I have seen this week; I've done a lot of work on binds in the past, so I had come across most of the material before. Should I have bothered?

Well, there was one little gem in there which struck me as very simple and potentially very effective.

Bind variables and histograms are not a good mix, as has been commented on many, many times. In essence, if the data in a column is skewed so that some values match very few records and others match a large number, when Oracle sees a SQL statement with a bind value being compared to that column, it peeks at the first value passed in for the bind and uses it to decide on the plan (this is pre-11g, by the way).
That plan is then used for every execution of that SQL statement until it is thrown out of the SGA. Which is jolly unfortunate if the values subsequently passed in via the bind do not suit the plan.

The solutions to this usually boil down to one of three approaches: remove the histograms on the column in question so that all values are treated equally; stop it being a bind/prevent bind peeking; or force the "bad" plan out of the SGA and hope the next parse gets a better plan.

All have their merits and drawbacks.
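The first approach, for instance, is a single DBMS_STATS call with SIZE 1 for the offending column {a sketch only – the table and column names here are invented to match the snippet below}:

```sql
-- regather stats for just the STATUS column with a single bucket,
-- i.e. no histogram, so all values get the same cardinality estimate
exec dbms_stats.gather_table_stats(user,'TABLE_1',method_opt=>'FOR COLUMNS STATUS SIZE 1')
```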

Well, in this presentation there was a fourth solution. Something like this:

if :status in (1,2,3) then
  select /*+ cardinality (t1 100000) */ ...
  from table_1 t1
      ,table_t t2
  where t1.status=:status
  ...
else
  select /*+ cardinality (t1 100) */ ...
  from table_1 t1
      ,table_t t2
  where t1.status=:status
  ...
end if;

I might be missing something blindingly obvious here (and this might be a common solution that has just passed me by), but it seems to me to be a simple and effective solution that could be used in many situations.

I also learnt that it is rare not to find at least one good thing out of any presentation, so long as you keep paying attention.

Decrypting Histogram Data #3 – is 7 or 15 Characters Significant? September 15, 2009

Posted by mwidlake in internals, performance.

At last I am going to wrap up looking at histogram data and endpoint_value. Sorry about the delay, life is a tad complex at present. This is also a very, very long post, so you may want to skip to the end to get the conclusions.

I first posted about how to translate endpoint_values.
I then posted in the second post a week ago about how endpoint_value is limited to 7 or so characters for varchar2 columns.
{Addition – this later post discusses the content of endpoint_value for dates}

Coskan contacted me to say he believed I had missed something, and of course he was right – but I'd laid big hints that I was using theory rather than practical tests and that there was more to know. {There is usually more to know; only practical tests confirm whether your theory works or not. And even after that, it seems there is still more to know!}

OK, I have a test table, created with the following {oh, this is under Enterprise Edition on Linux}:

create table hist_test
as select rownum  id
,lpad(chr(mod(rownum,6)+65),5,chr(mod(rownum,6)+65)) vc_1
,lpad('A',2,'A')||dbms_random.string('U',1) vc_2
,lpad('B',3,'B')||dbms_random.string('U',1) vc_3
,lpad('C',4,'C')||dbms_random.string('U',1) vc_4
,lpad('D',5,'D')||dbms_random.string('U',1) vc_5
,lpad('E',6,'E')||dbms_random.string('U',1) vc_6
,lpad('F',7,'F')||dbms_random.string('U',1) vc_7
,lpad('G',8,'G')||dbms_random.string('U',1) vc_8
,lpad('H',16,'H')||dbms_random.string('U',1) vc_16
,lpad('I',40,'I')||dbms_random.string('U',1) vc_40
from dba_objects
where rownum < 1001;

I now need to add some skew to the data, so that different WHERE clauses should see different expected numbers of rows and you can see the histogram in action.

update hist_test set
where mod(id,10)=1
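The SET clause of that update got lost on its way into this post; it would have been something along these lines {my reconstruction – the exact values set are a guess, based on the WHERE clauses tested further down}:

```sql
-- guessed reconstruction: force roughly 10% of rows to end in 'A'
update hist_test set
       vc_5  = 'DDDDDA'
      ,vc_7  = 'FFFFFFFA'
      ,vc_16 = lpad('H',16,'H')||'A'
      ,vc_40 = lpad('I',40,'I')||'A'
where mod(id,10)=1;
commit;
```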

Gather histogram stats with 10 buckets:

exec dbms_stats.gather_table_stats(ownname=>user,tabname=>'HIST_TEST' -
    ,method_opt=>'FOR ALL COLUMNS SIZE 10')

And looking at ALL_TAB_COLUMNS we can see the general structure of the data (just the ID and VC_ columns are shown):

COL_NAME NUM_DIST LOW_V                HI_V                 N_BUCK  AVG_L
ID          1,000 1                    1000                     10      4
VC_1            6 AAAAA                FFFFF                     6      6
VC_2           26 AAA                  AAZ                      10      4
VC_3           26 BBBA                 BBBZ                     10      5
VC_4           26 CCCCA                CCCCZ                    10      6
VC_5           26 DDDDDA               DDDDDZ                   10      7
VC_6           26 EEEEEEA              EEEEEEZ                  10      8
VC_7           26 FFFFFFFA             FFFFFFFZ                 10      9
VC_8           26 GGGGGGGGA            GGGGGGGGZ                10     10
VC_16          26 HHHHHHHHHHHHHHHHA    HHHHHHHHHHHHHHHHZ        10     18

I have created a table with varchar2 columns set to a fixed string of characters of the length indicated by the column name, with the last character varying from A to Z. So VC_5 contains DDDDDx, where x is A to Z. 26 distinct values in each column.

If you look at the above, all columns are showing as having 26 values and 10 buckets EXCEPT VC_40, but it does have 26 distinct values:

select count(distinct(VC_40)) from hist_test;

I have my function to unpack the endpoint_value for varchar2 columns from the number stored to the character string it represents. My function is not perfect, but it is reasonably good. Go check out post one for the function.
Looking at the histograms with my function you can see the below {I show only some records}:

colname                               END_VAL ROWCOUNT MOD_REAL
VC_5     354460798875962000000000000000000000        1 DDDDDA
VC_5     354460798875972000000000000000000000        2 DDDDDD
VC_5     354460798875986000000000000000000000        3 DDDDDG
VC_5     354460798876071000000000000000000000        9 DDDDDY
VC_5     354460798876080000000000000000000000       10 DDDDDZ
VC_7     364886116489977000000000000000000000        1 FFFFFGK
VC_7     364886116489977000000000000000000000        2 FFFFFGK
VC_7     364886116489977000000000000000000000        3 FFFFFGK
VC_7     364886116489977000000000000000000000        4 FFFFFGK
VC_7     364886116489977000000000000000000000        9 FFFFFGK
VC_7     364886116489977000000000000000000000       10 FFFFFGK
VC_16    375311434103976000000000000000000000        1 HHHHHI:
VC_16    375311434103976000000000000000000000        2 HHHHHI:
VC_16    375311434103976000000000000000000000        3 HHHHHI:
VC_16    375311434103976000000000000000000000        9 HHHHHI:
VC_16    375311434103976000000000000000000000       10 HHHHHI:
VC_40    380524092910976000000000000000000000    1,000 IIIIIJM

For VC_5, there are 10 buckets, all with different endpoint_values.
The VC_7 has 10 buckets, but most have the same endpoint_value.
The VC_16 has 10 records in the HISTOGRAM table, all with the same endpoint_value.
VC_40 has only one record in the HISTOGRAM table.

Basically, if I am right, WHERE filters on a column holding values that are the same for the first seven or so characters will not be well supported by the histograms. We should see the estimated cost for such a WHERE filter to be wrong.

I have skewed the data so that Histograms should show more expected values for DDDDDA than DDDDDS, for FFFFFFFA than FFFFFFFS etc

update hist_test
where mod(id,10)=1

And now I’ll see how the CBO estimates the rows coming back for some WHERE clauses.

select count(*) from hist_test
where VC_5 = 'DDDDDA'

Execution Plan
| Id  | Operation          | Name      | Rows  | Bytes | Cost (%CPU)|
|   0 | SELECT STATEMENT   |           |     1 |     7 |     8   (0)|
|   1 |  SORT AGGREGATE    |           |     1 |     7 |            |
|*  2 |   TABLE ACCESS FULL| HIST_TEST |   100 |   700 |     8   (0)|

select count(*) from hist_test
where VC_5 = 'DDDDDS'

Execution Plan
| Id  | Operation          | Name      | Rows  | Bytes | Cost (%CPU)|
|   0 | SELECT STATEMENT   |           |     1 |     7 |     8   (0)|
|   1 |  SORT AGGREGATE    |           |     1 |     7 |            |
|*  2 |   TABLE ACCESS FULL| HIST_TEST |    49 |   343 |     8   (0)|

For column VC_5 it all works as we would expect. The CBO estimates 100 rows for DDDDDA and 49 for DDDDDS. I've shown the actual count(*)s as well to show there is a real variation in numbers. There are good reasons why the estimates do not match reality; I won't go into them now, but this does highlight that histograms can help but do have flaws.

What about the other rows? We are only interested in the WHERE clause and the estimate for the number of rows from the table access, so I’ll show only those.

where VC_7 = 'FFFFFFFA'
|*  2 |   TABLE ACCESS FULL| HIST_TEST |   100 |   900 |     8   (0)|

where VC_7 = 'FFFFFFFS'
|*  2 |   TABLE ACCESS FULL| HIST_TEST |    49 |   441 |     8   (0)|

where VC_16 = lpad('H',16,'H')||'A'
|*  2 |   TABLE ACCESS FULL| HIST_TEST |   100 |  7200 |     8   (0)|

where VC_16 = lpad('H',16,'H')||'S'
|*  2 |   TABLE ACCESS FULL| HIST_TEST |    47 |   216 |     8   (0)|

where VC_40 = 'IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIA'
|*  2 |   TABLE ACCESS FULL| HIST_TEST |  1000 | 42000 |     8   (0)|

where VC_40 = 'IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIS'
|*  2 |   TABLE ACCESS FULL| HIST_TEST |  1000 | 42000 |     8   (0)|

{and just to show there is no typo in the string of 'I's above...}
where VC_40 = lpad('I',40,'I')||'S'
|*  2 |   TABLE ACCESS FULL| HIST_TEST |  1000 | 42000 |     8   (0)|

In the above you can see that the CBO estimates the cardinalities for VC_7 exactly as it did for VC_5, despite the endpoint_values no longer distinguishing the buckets.

Even more telling is that for VC_16 the CBO is also able to predict there are more values ending in A than S.

As for VC_40, the CBO utterly fails to spot the skew. In fact it utterly fails to spot that there is more than one value in the column (look back at the section showing min and max values and number of distinct values), and there is only one histogram bucket, so I suppose it is no surprise.

How does the CBO still detect the skew for VC_7 and VC_16, even though the endpoint_values are the same?
Because it swaps to using column ENDPOINT_ACTUAL_VALUE.

ENDPOINT_ACTUAL_VALUE is only populated for varchar2 columns when the precision of endpoint_value is exceeded:

colname ROWCOUNT MOD_REAL          END_ACT_VALUE
------- -------- ----------------- ---------------------------------------------
VC_5           7 DDDDDR
VC_5           8 DDDDDV
VC_5           9 DDDDDY
VC_5          10 DDDDDZ
VC_6           0 EEEEEF8           EEEEEEA
VC_6           1 EEEEEF8           EEEEEEC
VC_6           2 EEEEEF8           EEEEEEF
VC_6           3 EEEEEF8           EEEEEEH
VC_6           4 EEEEEF8           EEEEEEK
VC_6           5 EEEEEF8           EEEEEEN
VC_6           6 EEEEEF8           EEEEEEP
VC_6           7 EEEEEFn           EEEEEES
VC_6           8 EEEEEFn           EEEEEEU
VC_6           9 EEEEEFn           EEEEEEX
VC_6          10 EEEEEFn           EEEEEEZ
VC_7           1 FFFFFGK           FFFFFFFA
VC_7           2 FFFFFGK           FFFFFFFC
VC_7           3 FFFFFGK           FFFFFFFF
VC_7           4 FFFFFGK           FFFFFFFI
VC_7           5 FFFFFGK           FFFFFFFK
VC_7           6 FFFFFGK           FFFFFFFO
VC_7           7 FFFFFGK           FFFFFFFR
VC_7           8 FFFFFGK           FFFFFFFT
VC_7           9 FFFFFGK           FFFFFFFW
VC_7          10 FFFFFGK           FFFFFFFZ
VC_16          1 HHHHHI:           HHHHHHHHHHHHHHHHA
VC_16          2 HHHHHI:           HHHHHHHHHHHHHHHHD
VC_16          3 HHHHHI:           HHHHHHHHHHHHHHHHG
VC_16          4 HHHHHI:           HHHHHHHHHHHHHHHHJ
VC_16          5 HHHHHI:           HHHHHHHHHHHHHHHHL
VC_16          6 HHHHHI:           HHHHHHHHHHHHHHHHO
VC_16          7 HHHHHI:           HHHHHHHHHHHHHHHHS
VC_16          8 HHHHHI:           HHHHHHHHHHHHHHHHU
VC_16          9 HHHHHI:           HHHHHHHHHHHHHHHHX
VC_16         10 HHHHHI:           HHHHHHHHHHHHHHHHZ
VC_40      1,000 IIIIIJM

You can see that VC_5 columns have no ENDPOINT_ACTUAL_VALUEs but VC_6, VC_7 and VC_16 do. VC_40 does not.
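You can check which columns have ENDPOINT_ACTUAL_VALUE populated with a straight query against the same view the scripts above use:

```sql
-- count(endpoint_actual_value) only counts the non-null entries,
-- so it shows how many buckets per column carry an actual value
select column_name
      ,count(*)                     buckets
      ,count(endpoint_actual_value) actual_values_stored
from all_tab_histograms
where table_name = 'HIST_TEST'
and   owner = user
group by column_name
order by column_name;
```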

So, at what point does the CBO stop storing values in ENDPOINT_ACTUAL_VALUE? 32 characters. I created another test table and these are the low/high values, number of distinct values and number of histogram buckets:

COL_NAME NUM_DIST LOW_V                HI_V                 N_BUCK  AVG_L
ID          1,000 1                    1000                     10      4
NUM_1          10 0                    9                        10      3

The name of the column is the length of the fixed string, so the columns are actually one character longer, as can be seen by the “AVG_L” column.
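The DDL for HIST_TEST2 was trimmed from this post; it would have followed the same pattern as HIST_TEST, something like this {my sketch – the column names and lengths are inferred from the tests below}:

```sql
-- fixed-prefix strings of 30/31/32 characters plus one random character
create table hist_test2
as select rownum id
,mod(rownum,10) num_1
,lpad('A',30,'A')||dbms_random.string('U',1) vc_30
,lpad('B',31,'B')||dbms_random.string('U',1) vc_31
,lpad('C',32,'C')||dbms_random.string('U',1) vc_32
from dba_objects
where rownum < 1001;
```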

And just to check {as, after all, this is just a nice theory until you test}, I skew the number of records ending in 'A' for VC_31 and VC_32:-

update hist_test2
where mod(id,10)=1

And see what the CBO thinks the rows to be identified are:-

where VC_31 = lpad('B',31,'B')||'A'
|* 2 | TABLE ACCESS FULL| HIST_TEST2 | 100 | 3300 |

where VC_31 = lpad('B',31,'B')||'S'
|* 2 | TABLE ACCESS FULL| HIST_TEST2 | 49 | 1617 |

where VC_32 = lpad('C',32,'C')||'A'
|* 2 | TABLE ACCESS FULL| HIST_TEST2 | 1000 | 34000 |

where VC_32 = lpad('C',32,'C')||'S'
|* 2 | TABLE ACCESS FULL| HIST_TEST2 | 1000 | 34000 |

Yes. The CBO can see the skew for the 32 characters of VC_31 and not for the 33 characters of VC_32.

So in conclusion

  • endpoint_value is only accurate for varchar2 columns up to around 7 characters.
  • If the varchar2 exceeds the precision of the endpoint_value column then endpoint_actual_value is used.
  • endpoint_actual_value is used for varchar2 columns up to 32 characters and is not populated for columns of less than about 7 characters.
  • Histograms will not help with columns where many values share the same leading 32 (or more) characters.
  • If your column's values share the same leading 32 characters or more, even the number of distinct values is not gathered by DBMS_STATS.
  • You can't tell how SQL/the CBO will work from theory, you have to test. And still consider your knowledge as open to correction after that :-)

Decrypting Histogram Data #2 September 3, 2009

Posted by mwidlake in internals, performance.

Hint – if you have a column where most or all of the entries share the same 15-plus leading characters, don't bother gathering histograms on it. They will not help you a lot. Or at all. E.g. you have a column that holds "CUSTOMER IDENTIFIER xxxxxxxxx", "CUSTOMER IDENTIFIER yyyyyyyy", etc. Of course, good design suggests that the "CUSTOMER IDENTIFIER" bit is probably redundant and can be got rid of, but we live in the real world and may not have the power or ability to enforce that change, or we might have 3 or 4 such "prefix" strings.

Further, histograms on numerics with more than 15 significant digits will also potentially not behave as you would expect.

I had better justify my assertion.

In the previous post on decrypting histogram data I covered how Oracle turns a varchar2 value into a numeric value that is held in the ENDPOINT_VALUE of DBA_TAB_HISTOGRAMS and I also gave you a cunning (but slightly flawed) function to convert it back. I use it in this post so you might want to go look at the prior post. Sorry it is long, I can’t stop rambling.

First, I'll create some test data {oh, this is on Linux}. The below script creates a table HIST_TEST with columns NUM_1 through to NUM_7, which hold numbers padded out to greater lengths with a varying final digit. Similarly, columns VC_2 to VC_8 are padded out and have a random character added. VC_1 is a 5-character string cycling from AAAAA to FFFFF.

create table hist_test
as select rownum  id
,mod(rownum,10)  num_1
,trunc(dbms_random.value(1,10)) num_2
,1000+trunc(dbms_random.value(1,10)) num_3
,1000000+trunc(dbms_random.value(1,10)) num_4
,1000000000+trunc(dbms_random.value(1,10)) num_5
,1000000000000+trunc(dbms_random.value(1,10)) num_6
,1000000000000000+trunc(dbms_random.value(1,10)) num_7
,lpad(chr(mod(rownum,6)+65),5,chr(mod(rownum,6)+65)) vc_1
,lpad('A',2,'A')||dbms_random.string('U',1) vc_2
,lpad('B',3,'B')||dbms_random.string('U',1) vc_3
,lpad('C',4,'C')||dbms_random.string('U',1) vc_4
,lpad('D',5,'D')||dbms_random.string('U',1) vc_5
,lpad('E',6,'E')||dbms_random.string('U',1) vc_6
,lpad('F',7,'F')||dbms_random.string('U',1) vc_7
,lpad('G',8,'G')||dbms_random.string('U',1) vc_8
,lpad('H',16,'H')||dbms_random.string('U',1) vc_16
,lpad('I',40,'I')||dbms_random.string('U',1) vc_40
from dba_objects
where rownum < 1001;

exec dbms_stats.gather_table_stats(ownname=>user,tabname=>'HIST_TEST' -
    ,method_opt=>'FOR ALL COLUMNS SIZE 10')

The below is a simple select against DBA_TAB_COLUMNS to see the information for the columns. {Oh, I say simple, but you have to use the functions utl_raw.cast_to_number and utl_raw.cast_to_varchar2 to turn the raw values held in the columns LOW_VALUE and HIGH_VALUE into something we humans can read. Why does Oracle Corp have to make life so difficult? *sigh*}


select column_name
,num_distinct num_dist
,decode(data_type
       ,'NUMBER',to_char(utl_raw.cast_to_number(low_value))
       ,utl_raw.cast_to_varchar2(low_value)
       ) low_v
,decode(data_type
       ,'NUMBER',to_char(utl_raw.cast_to_number(high_value))
       ,utl_raw.cast_to_varchar2(high_value)
       ) hi_v
,num_nulls    n_nulls
,num_buckets  n_buck
,avg_col_len  avg_l
from dba_tab_columns
where table_name ='HIST_TEST'
and owner=USER
order by owner,column_id;

COLUMN_NAME       NUM_DIST LOW_V              HI_V                    N_NULLS N_BUCK  AVG_L
ID                   1,000 1                  1000                          0     10      4
NUM_1                   10 0                  9                             0     10      3
NUM_2                    9 1                  9                             0      9      3
NUM_3                    9 1001               1009                          0      9      4
NUM_4                    9 1000001            1000009                       0      9      6
NUM_5                    9 1000000001         1000000009                    0      9      7
NUM_6                    9 1000000000001      1000000000009                 0      9      9
NUM_7                    9 1000000000000001   1000000000000009              0      9     10
VC_1                     6 AAAAA              FFFFF                         0      6      6
VC_2                    26 AAA                AAZ                           0     10      4
VC_3                    26 BBBA               BBBZ                          0     10      5
VC_4                    26 CCCCA              CCCCZ                         0     10      6
VC_5                    26 DDDDDA             DDDDDZ                        0     10      7
VC_6                    26 EEEEEEA            EEEEEEZ                       0     10      8
VC_7                    26 FFFFFFFA           FFFFFFFZ                      0     10      9
VC_8                    26 GGGGGGGGA          GGGGGGGGZ                     0     10     10
VC_16                   26 HHHHHHHHHHHHHHHHA  HHHHHHHHHHHHHHHHZ             0     10     18
VC_40                    1 IIIIIIIIIIIIIIIIII IIIIIIIIIIIIIIIIII            0      1     42

I now select the data out of the DBA_TAB_HISTOGRAMS table to see what is really being stored. For the sake of brevity {which means, “so I can finish this post tonight”} I’ll just show bits, but if you want, download the script right at the end and, if you created the functions from the prior post, you can play with this yourself.

First, here you can see some of the values actually stored in ENDPOINT_VALUE and what they translate into:-

TABLE_NAME   colname                        ENDPOINT_VALUE ROWCOUNT REAL_VAL
HIST_TEST    NUM_5                              1000000007      760 1000000007
HIST_TEST    NUM_5                              1000000008      870 1000000008
HIST_TEST    NUM_5                              1000000009    1,000 1000000009
HIST_TEST    NUM_6                           1000000000001      103 1000000000001
HIST_TEST    NUM_6                           1000000000002      219 1000000000002
HIST_TEST    NUM_6                           1000000000003      328 1000000000003
HIST_TEST    NUM_6                           1000000000004      427 1000000000004
HIST_TEST    VC_2     338824386822815000000000000000000000        8 AAU
HIST_TEST    VC_2     338824624507302000000000000000000000        9 AAW
HIST_TEST    VC_2     338824782963627000000000000000000000       10 AAY
HIST_TEST    VC_3     344035480872391000000000000000000000        0 BBB@
HIST_TEST    VC_3     344035481491361000000000000000000000        1 BBBB
HIST_TEST    VC_3     344035482419816000000000000000000000        2 BBBE
HIST_TEST    VC_3     344035483348271000000000000000000000        3 BBBH
HIST_TEST    VC_3     344035483967241000000000000000000000        4 BBBJ

Note that for numerics the number itself is stored and I do not need to translate it.
For VARCHAR2 columns the value held is a 15-significant-digit number, padded out with zeros to 36 digits.
Also note, for VC_2 the range covered seems to end at AAY not AAZ and column VC_3 starts at BBB@ not BBBA {I am creating values with the last character set to A-Z}. Also, bucket 8 for VC_2 ends in a control character.

To reduce this error I add a fudge to my function {again, see previous post}. It helps:

TABLE_NAME   colname ROWCOUNT REAL_VAL          MOD_REAL
------------ ------- -------- ----------------- -----------------
HIST_TEST    VC_2           9 AAW               AAX
HIST_TEST    VC_2          10 AAY               AAZ
HIST_TEST    VC_3           0 BBB@              BBBA
HIST_TEST    VC_3           1 BBBB              BBBC
HIST_TEST    VC_3           2 BBBE              BBBF
HIST_TEST    VC_3           3 BBBH              BBBI
HIST_TEST    VC_3           4 BBBJ              BBBK
HIST_TEST    VC_3           5 BBBM              BBBN
HIST_TEST    VC_3           6 BBBO              BBBP
HIST_TEST    VC_3           7 BBBR              BBBS
HIST_TEST    VC_3           8 BBBT              BBBU
HIST_TEST    VC_3           9 BBBW              BBBX
HIST_TEST    VC_3          10 BBBY              BBBZ
HIST_TEST    VC_4           0 CCCC@             CCCCA
HIST_TEST    VC_4           1 CCCCB             CCCCC
HIST_TEST    VC_4           2 CCCCD             CCCCE
HIST_TEST    VC_4           3 CCCCH            CCCCH
HIST_TEST    VC_4           4 CCCCJ            CCCCJ

As you can see, I now get a better translation, but it still goes wrong sometimes (see last couple of rows). So, feel free to take my functions and use them, but be aware even the modified version is not perfect. If YOU perfect it, can I have a copy please :-)

The below shows that histograms on numeric values break when you exceed 15 significant digits:

colname                  END_VAL ROWCOUNT REAL_VAL
------- ------------------------ -------- --------
NUM_5                 1000000009    1,000 1000000009
NUM_6              1000000000001      103 1000000000001
NUM_6              1000000000002      219 1000000000002
NUM_6              1000000000003      328 1000000000003
NUM_6              1000000000004      427 1000000000004
NUM_6              1000000000005      542 1000000000005
NUM_6              1000000000006      651 1000000000006
NUM_6              1000000000007      771 1000000000007
NUM_6              1000000000008      881 1000000000008
NUM_6              1000000000009    1,000 1000000000009
NUM_7           1000000000000000      133 1000000000000000
NUM_7           1000000000000000      256 1000000000000000
NUM_7           1000000000000000      367 1000000000000000
NUM_7           1000000000000000      467 1000000000000000
NUM_7           1000000000000010      567 1000000000000010
NUM_7           1000000000000010      665 1000000000000010
NUM_7           1000000000000010      784 1000000000000010
NUM_7           1000000000000010      896 1000000000000010
NUM_7           1000000000000010    1,000 1000000000000010

This is the point at which storage of numbers for histograms breaks. You can see that NUM_6 is fine but NUM_7 is not. That is because NUM_6 is below 15 significant digits and NUM_7 is over 15 significant digits and effectively gets truncated.
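Whether it is truncation or rounding internally I have not proven, but simple rounding to 15 significant digits reproduces the NUM_7 endpoints above {my illustration}:

```sql
-- NUM_7 holds 16-digit values 1000000000000001..1000000000000009;
-- rounded to 15 significant digits they collapse to just two endpoints
select round(1000000000000001,-1) a
      ,round(1000000000000004,-1) b
      ,round(1000000000000009,-1) c
from dual;
-- a = b = 1000000000000000, c = 1000000000000010
```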

Histograms on numeric values with more than 15 significant digits will not work as you expect, possibly not at all.

With VARCHAR2 values, histogram END_VALUEs break even sooner, at around 7 characters:

colname                               END_VAL ROWCOUNT MOD_REAL
------- ------------------------------------- -------- --------
VC_5     354460798876024000000000000000000000        5 DDDDDO
VC_5     354460798876038000000000000000000000        6 DDDDDR
VC_5     354460798876047000000000000000000000        7 DDDDDS
VC_5     354460798876061000000000000000000000        8 DDDDDV
VC_5     354460798876071000000000000000000000        9 DDDDDY
VC_5     354460798876080000000000000000000000       10 DDDDDZ
VC_6     359673457682977000000000000000000000        0 EEEEEF8
VC_6     359673457682977000000000000000000000        1 EEEEEF8
VC_6     359673457682977000000000000000000000        2 EEEEEF8
VC_6     359673457682977000000000000000000000        3 EEEEEF8
VC_6     359673457682977000000000000000000000        4 EEEEEF8
VC_6     359673457682977000000000000000000000        5 EEEEEF8
VC_6     359673457682977000000000000000000000        6 EEEEEF8
VC_6     359673457682977000000000000000000000        7 EEEEEF8
VC_6     359673457682978000000000000000000000        8 EEEEEFn
VC_6     359673457682978000000000000000000000        9 EEEEEFn
VC_6     359673457682978000000000000000000000       10 EEEEEFn
VC_7     364886116489977000000000000000000000        0 FFFFFGK
VC_7     364886116489977000000000000000000000        1 FFFFFGK
VC_7     364886116489977000000000000000000000        2 FFFFFGK
VC_7     364886116489977000000000000000000000        3 FFFFFGK
VC_7     364886116489977000000000000000000000        4 FFFFFGK
VC_7     364886116489977000000000000000000000        5 FFFFFGK
VC_7     364886116489977000000000000000000000        6 FFFFFGK
VC_7     364886116489977000000000000000000000        7 FFFFFGK
VC_7     364886116489977000000000000000000000        8 FFFFFGK
VC_7     364886116489977000000000000000000000        9 FFFFFGK
VC_7     364886116489977000000000000000000000       10 FFFFFGK

You can see that for column VC_5 the actual ENDPOINT_VALUE varies at the 14th and 15th significant digits, and my translated value changes with it. But for VC_6 the numeric ENDPOINT_VALUE is hardly changing; each row translates to one of only two values.
For VC_7 the ENDPOINT_VALUE is static: all histogram END_VALUES are the same.

This means histograms will not work properly for any VARCHAR2 column whose values do not vary within the first 7 characters; any characters after the 7th are effectively ignored.
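A back-of-envelope check of where that limit comes from (my own arithmetic, based on the rounding to roughly 15 significant digits, i.e. to the nearest 10^21, described above):

```python
# Character i (1-based, of the 15 used) contributes ascii * 256**(15-i)
# to the encoded value. Compare those contributions to the 10**21
# rounding grain that truncation to ~15 significant digits implies.

# Everything characters 8..15 could possibly add is below half a grain,
# so it is almost always rounded away entirely:
max_tail = sum(255 * 256 ** (15 - i) for i in range(8, 16))
print(max_tail < 10 ** 21 // 2)    # True

# The 7th character's contribution can still exceed a whole grain,
# so character 7 partially survives the rounding:
print(255 * 256 ** 8 > 10 ** 21)   # True
```

That matches the output above: VC_6 values differing at the 6th/7th character still produce slightly different END_VALUES, while VC_7 strings sharing their first characters collapse to one value.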

Or does it? The logic seems sound, but tomorrow (or soon after) I’ll back the theory up with some actual tests… {So don’t ping me with corrections just yet}

I’ve not mentioned VC_16 and VC_40 have I? I will tomorrow. Maybe :-|.

Finally, here is the selecting script, as promised.

col column_name form a7 head colname
col rowcount form 99,999
col real_val form a17
col end_val form 999999999999999999999999999999999999
col end_string form a10
col endpoint_actual_value form a40
col mod_real form a17
select --owner
 dth.column_name
,dth.endpoint_value  end_val
,dth.endpoint_number rowcount
,hist_numtochar(dth.endpoint_value)  real_val
,hist_numtochar2(dth.endpoint_value) mod_real
from dba_tab_histograms dth
,dba_tab_columns dtc
where dth.table_name ='HIST_TEST'
and  dth.owner=USER
and dth.owner=dtc.owner
and dth.table_name=dtc.table_name
and dth.column_name=dtc.column_name
order by dth.table_name,dth.column_name,dth.endpoint_number
/

Decrypting Histogram Data August 11, 2009

Posted by mwidlake in internals, performance.
Tags: , , ,

{part two is here}

I’m looking at Histograms at the moment and I’ve hit a problem I always hit when looking at histograms. So I’ve decided once and for all to see if I can find a solution.

The value held in DBA_TAB_HISTOGRAMS.ENDPOINT_VALUE for VARCHAR2 columns is a number with no apparent link to reality.

{Before anyone who has read my previous posts suggests it, I’ve already looked at the view and the relevant column resolves to data in either sys.histgrm$ or sys.hist_head$ and the underlying values held are still numbers and not processed in the view, so they are numbers. I think}.
{{I have a sneaking suspicion this could turn into a set of increasingly obscure postings…}}
{{{THIS one might be my most obscure so far!}}}

Let’s have a look:

select --owner
 table_name
,column_name
,endpoint_value  end_val
,endpoint_number rowcount
from all_tab_histograms
where table_name ='TEST_P'
and  owner='WIDLAKEM'
order by table_name,column_name,endpoint_number
/
TABLE_NAME      COLUMN_NAME        END_VAL        ROWCOUNT
--------------- --------------- ---------- ---------------
TEST_P          STATUS                   0          41,523
TEST_P          STATUS                   1          42,065
TEST_P          STATUS                   2          42,099
TEST_P          VC_1            3.3882E+35               0
TEST_P          VC_1            3.4951E+35               1
TEST_P          VC_1            3.6487E+35               2
TEST_P          VC_1            3.7558E+35               3
TEST_P          VC_1            3.9095E+35               4
TEST_P          VC_1            4.0162E+35               5
TEST_P          VC_1            4.1697E+35               6
TEST_P          VC_1            4.3235E+35               7
TEST_P          VC_1            4.4305E+35               8
TEST_P          VC_1            4.5841E+35               9
TEST_P          VC_1            4.6914E+35              10

So what in heck is 4.6914E+35 {last value listed}?

Well, it is a number of course. If I set my column format appropriately I can see this:-

col end_val form 999,999,999,999,999,999,999,999,999,999,999,999
col column_name form a10
select column_name
,endpoint_value  end_val
,endpoint_number rowcount
from all_tab_histograms
where table_name ='TEST_P'
and  owner='WIDLAKEM'
and column_name='VC_1'
order by table_name,column_name,endpoint_number
COLUMN_NAM                                          END_VAL   ROWCOUNT
---------- ------------------------------------------------ ----------
VC_1        338,822,823,410,931,000,000,000,000,000,000,000          0
VC_1        349,512,448,932,628,000,000,000,000,000,000,000          1
VC_1        364,867,102,368,954,000,000,000,000,000,000,000          2
VC_1        375,575,265,424,979,000,000,000,000,000,000,000          3
VC_1        390,949,334,699,583,000,000,000,000,000,000,000          4
VC_1        401,618,352,253,998,000,000,000,000,000,000,000          5
VC_1        416,972,612,321,579,000,000,000,000,000,000,000          6
VC_1        432,345,727,450,272,000,000,000,000,000,000,000          7
VC_1        443,054,364,035,286,000,000,000,000,000,000,000          8
VC_1        458,409,658,691,912,000,000,000,000,000,000,000          9
VC_1        469,139,289,515,351,000,000,000,000,000,000,000         10

The ENDPOINT_VALUE is an encoded representation of the varchar2 value, based on the trick of turning each character into its ASCII value (0-255), multiplying it by a power of 256 that depends on how far into the string the character sits, and adding it all together. Let’s use “Cat” as an example.

  • Last character is ‘t’, ascii value 116 = 116;
  • Next-to-last character is ‘a’, ascii value 97, *(256) = 24832;
  • Chr 3 in is ‘C’, ascii value 67, *(256*256) = 4390912;
  • Total is 4390912+24832+116 = 4415860.
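That little sum can be checked mechanically; a throwaway Python version of the simple scheme just described (illustration only):

```python
# Encode "Cat" positionally: last char * 1, next * 256, next * 256*256
value = ord('t') + ord('a') * 256 + ord('C') * 256 * 256
print(value)  # 4415860
```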

What Oracle actually does is slightly different {and I could have this slightly wrong, my brain waved a white flag a couple of hours back}: it takes the first 15 characters and multiplies the first character’s ascii value by power(256,14), the second character’s ascii value by power(256,13) and so on, until it runs out of characters or reaches character 15.
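In Python the scheme, as I understand it, looks something like this. This is a sketch of my understanding for illustration, not Oracle's actual routine; the 15-character cap and the trim to roughly 15 significant digits (round to the nearest 10^21) are assumptions taken from the behaviour described above:

```python
def hist_encode(s: str, truncate: bool = True) -> int:
    """Base-256 encode the first 15 characters of s; optionally round
    the 36-digit result to ~15 significant digits, as the stored
    histogram endpoint values appear to be."""
    padded = (s + '\x00' * 15)[:15]   # pad/cut to exactly 15 characters
    n = sum(ord(c) * 256 ** (14 - i) for i, c in enumerate(padded))
    if truncate:
        n = round(n, -21)             # drop the bottom ~21 digits
    return n

# Only the first 15 characters take part at all:
print(hist_encode('amuchlongerteststring') ==
      hist_encode('amuchlongertest_something_else'))   # True
# And the truncated value always ends in 21 zeros:
print(hist_encode('short') % 10 ** 21)                 # 0
```

Note that the encoding preserves lexicographic order for untruncated values, which is what makes range predicates on the histogram meaningful.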

How do I know this? I decoded the ‘C’ in the kernel :-). No, I didn’t, I found a link to an encode script Jonathan Lewis had written {it’s also in his latest book, or at least in the set of scripts the book tells you how to download}. I’ve lost the original link I found, and thus a nice little article on the whole topic, but this link will do until I can re-find the original. {NB this referenced article mentions using the hexstr function which may be neater but, as I said earlier, my brain has given up}.

I’ve outrageously stolen Jonathan’s script which encodes a varchar2 string into the relevant numeric and used it as a template to create a decode version too. Just a couple of notes:

- I can imagine this might not work if you have multibyte characters.
- Oracle confuses things by translating the 15 characters into a 36-digit number and then keeping only the first 15 significant digits. This rather messes up the untranslate.

So, I create three functions:

  • hist_chartonum converts a varchar2 into something very similar to what is stored in the histogram views. Jonathan will find it incredibly familiar. An optional second parameter turns off the truncation that Oracle does for the histogram data, so you can get the full glorious 36 significant digits for 15 characters if you wish.
  • hist_numtochar converts the numeric back to a string. It gets it slightly wrong due to the truncation of the least significant 20 or so digits. Again, an optional second parameter allows it to skip the truncation and work with all 15 characters.
  • hist_numtochar2 is an attempt to allow for the truncation errors. I add power(256,9) to the value if it has been truncated. It seems to make e.g. the numeric representation of AAAAA translate back to AAAAA, not AAAA@. Yes, it’s a fudge.
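The decode direction is just the reverse: peel one base-256 “digit” off at a time. Here is a Python sketch of the idea behind hist_numtochar (my own illustration, not the PL/SQL itself; the 7-character default for truncated input mirrors the discussion above):

```python
def hist_decode(n: int, truncate: bool = True) -> str:
    """Recover characters from an encoded value by repeated division.
    With a truncated endpoint value only roughly the first 7
    characters are meaningful, so stop there by default."""
    chars, limit = [], 7 if truncate else 15
    for i in range(limit):
        d, n = divmod(n, 256 ** (14 - i))   # peel off one character
        if d:                                # skip chr(0) padding
            chars.append(chr(d))
    return ''.join(chars)

# Round-trips exactly when nothing has been truncated, e.g. for 'Hi':
encoded = ord('H') * 256 ** 14 + ord('i') * 256 ** 13
print(hist_decode(encoded, truncate=False))   # Hi
```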

This is the script:-

-- cre_hist_funcs.sql
-- Heavily borrowed from Jonathan Lewis, sep 2003
-- MDW 11/8/09 - all mistakes mine.
-- hist_chartonum converts a string to a number in a very similar way
-- to how Oracle does so for storing in column ENDPOINT_VALUE in XXX_TAB_HISTOGRAMS
-- Optionally get it to not truncate the value in the same way to only 15 digits.
-- hist_numtochar converts the ENDPOINT_VALUE back to human-readable format.
-- Optionally get it to go beyond 15 digits (7 or 8 characters)
-- hist_numtochar2 attempts to allow for truncation errors.
-- JLs version has been checked against, and
-- I've used on
set timing off
create or replace function hist_chartonum(p_vc varchar2
                                         ,p_trunc varchar2 :='Y') return number
is
  m_vc varchar2(15) := substr(rpad(p_vc,15,chr(0)),1,15);
  m_n number := 0;
begin
  for i in 1..15 loop
/*  dbms_output.put(ascii(substr(m_vc,i,1)));
    dbms_output.put_line(to_char( power(256,15-i) * ascii(substr(m_vc,i,1)))
                        ); */
    m_n := m_n + power(256,15-i) * ascii(substr(m_vc,i,1));
  end loop;
-- this converts it from a 36-digit number to the 15-significant-digit
-- number used in the ENDPOINT_VALUE column.
  if p_trunc = 'Y' then
    m_n := round(m_n,-21);
  end if;
-- dbms_output.put_line(to_char(m_n,'999,999,999,999,999,999,999,999,999,999,999,999'));
  return m_n;
end;
/
create or replace function hist_numtochar(p_num number
                                         ,p_trunc varchar2 :='Y') return varchar2
is
  m_vc varchar2(15);
  m_n number :=0;
  m_n1 number;
  m_loop number :=7;
begin
  m_n := p_num;
  if length(to_char(m_n)) < 36 then
    --dbms_output.put_line ('input too short');
    m_vc := 'num format err';
  else
    if p_trunc != 'Y' then
      m_loop := 15;
    end if;
    for i in 1..m_loop loop
      -- peel off the next character's contribution
      m_n1 := trunc(m_n/power(256,15-i));
      --dbms_output.put_line(to_char(m_n1));
      if m_n1 != 0 then m_vc := m_vc||chr(m_n1);
      end if;
      m_n := m_n - (m_n1*power(256,15-i));
    end loop;
  end if;
  return m_vc;
end;
/
create or replace function hist_numtochar2(p_num number
                                         ,p_trunc varchar2 :='Y') return varchar2
is
  m_vc varchar2(15);
  m_n number :=0;
  m_n1 number;
  m_loop number :=7;
begin
  m_n := p_num;
  if p_trunc = 'Y' then
    -- the fudge: add back power(256,9) to allow for the truncation
    m_n := m_n + power(256,9);
  end if;
  if length(to_char(m_n)) < 36 then
    --dbms_output.put_line ('input too short');
    m_vc := 'num format err';
  else
    if p_trunc != 'Y' then
      m_loop := 15;
    end if;
    for i in 1..m_loop loop
      m_n1 := trunc(m_n/power(256,15-i));
      --dbms_output.put_line(to_char(m_n1));
      if m_n1 != 0 then m_vc := m_vc||chr(m_n1);
      end if;
      m_n := m_n - (m_n1*power(256,15-i));
    end loop;
  end if;
  return m_vc;
end;
/
rem Sample of use:

col id1 format 999,999,999,999,999,999,999,999,999,999,999,999
col id3 format 999,999,999,999,999,999,999,999,999,999,999,999
select
  'short'  text
 ,hist_chartonum('short') id1
 ,hist_numtochar(hist_chartonum('short')) id2
 ,hist_chartonum('short','N') id3
 ,hist_numtochar(hist_chartonum('short','N'),'N') id4      
from dual;
select
  'alongteststring'  text
 ,hist_chartonum('alongteststring') id1
 ,hist_numtochar(hist_chartonum('alongteststring')) id2
 ,hist_chartonum('alongteststring','N') id3
 ,hist_numtochar(hist_chartonum('alongteststring','N'),'N') id4      
from dual;
select
  'amuchlongerteststring'  text
 ,hist_chartonum('amuchlongerteststring') id1
 ,hist_numtochar(hist_chartonum('amuchlongerteststring')) id2
 ,hist_chartonum('amuchlongerteststring','N') id3
 ,hist_numtochar(hist_chartonum('amuchlongerteststring','N'),'N') id4      
from dual;
spool off

The test at the end produces the following output:-

strng  histgrm_translation
short  599,232,339,077,851,000,000,000,000,000,000,000

alongteststring  505,852,124,009,532,000,000,000,000,000,000,000

amuchlongerteststring  505,872,878,384,947,000,000,000,000,000,000,000

Final comment. Oracle trims the value stored in sys.histgrm$ and knackers up the untranslate. Why? Well, this is only me thinking out loud, but trimming reduces the number from 36 to 15 significant digits, and numerics are stored as one byte per two digits plus an offset byte {am I missing a length byte?}. So 10 bytes or so instead of nearly 20. This will keep the size of the stored data down. I mean, as I posted here, the sys.histgrm$ table gets big enough as it is!
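To put a rough number on that saving, here are my own back-of-envelope figures using the one-byte-per-two-digits-plus-one-extra-byte rule of thumb from the paragraph above (a simplification of the real NUMBER format):

```python
import math

def approx_number_bytes(sig_digits: int) -> int:
    """Very rough Oracle NUMBER size: one byte per two significant
    digits plus one exponent/length byte (a simplification)."""
    return math.ceil(sig_digits / 2) + 1

print(approx_number_bytes(15))   # 9  - the truncated endpoint value
print(approx_number_bytes(36))   # 19 - what full precision would cost
```

So the truncation roughly halves the storage per endpoint value, which adds up over a dictionary full of histograms.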

Tomorrow {or whenever my brain recovers} I’ll show how I use it and one or two oddities I have found or come across from other people’s blogs.

