jump to navigation

Update to Decoding High and Low values February 24, 2010

Posted by mwidlake in internals, performance.
Tags: ,

After what seems an age, I finally got a little time to enhance my col_stats code with the extra info people provided as comments to
this original posting on it. I also now handle the translation of dates correctly and also timestamps.

I had a real struggle with timestamps. I knew I could find the answer in Tom Kyte’s “Expert Oracle Database Architecture” as one of the comments said so, but I tried to work it out myself as my copy of the book was at my Mothers {no, she is not learning Oracle Databases, I left it there by accident}. As the other elements of timestamps are held as numerical elements stored as hex (number of days, number of months) I tried to interpret it like that. I was being too complex, it is just an 8-digit hex number – to_number(value,’XXXXXXXX’).

Anyway, this is new-improved, cleanes whiter-than-white script {I’ve increased the width of myt blog to better show code like the belwo, but try clicking on “show source” if you want a plain text version}:

-- col_stats
-- Martin Widlake mdw 21/03/2003
-- MDW 11/12/09 enhanced to include more translations of low_value/high_value
-- pilfered from Gary Myers blog
-- MDW 20/02/10 added in the handling of timestamps.
col owner        form a6 word wrap
col table_name   form a15 word wrap
col column_name  form a22 word wrap
col data_type    form a12
col M            form a1
col num_vals     form 99999,999
col dnsty        form 0.9999
col num_nulls    form 99999,999
col low_v        form a30
col low_v2       form a18
col hi_v         form a30
col data_type    form a10
col low_value    form a25
col high_value   form a25
set lines 110
break on owner nodup on table_name nodup
spool col_stats.lst
select --owner
--      ,table_name
      ,decode (nullable,'N','Y','N')  M
      ,num_distinct num_vals
      ,density dnsty
,decode(substr(data_type,1,9) -- as there are several timestamp types
  ,'NUMBER'       ,to_char(utl_raw.cast_to_number(low_value))
  ,'VARCHAR2'     ,to_char(utl_raw.cast_to_varchar2(low_value))
  ,'NVARCHAR2'    ,to_char(utl_raw.cast_to_nvarchar2(low_value))
  ,'BINARY_FL' ,to_char(utl_raw.cast_to_binary_float(low_value))
                      + (to_number(substr(low_value,3,2),'XX')-100),'fm0000')||'-'||
               to_char(to_number(substr(low_value,7,2),'XX'),'fm00')||' '||
                      + (to_number(substr(low_value,3,2),'XX')-100),'fm0000')||'-'||
               to_char(to_number(substr(low_value,7,2),'XX'),'fm00')||' '||
              ||'.'||to_number(substr(low_value,15,8),'XXXXXXXX')  )
       ) low_v
,decode(substr(data_type,1,9) -- as there are several timestamp types
  ,'NUMBER'       ,to_char(utl_raw.cast_to_number(high_value))
  ,'VARCHAR2'     ,to_char(utl_raw.cast_to_varchar2(high_value))
  ,'NVARCHAR2'    ,to_char(utl_raw.cast_to_nvarchar2(high_value))
  ,'BINARY_FL' ,to_char(utl_raw.cast_to_binary_float(high_value))
                      + (to_number(substr(high_value,3,2),'XX')-100),'fm0000')||'-'||
               to_char(to_number(substr(high_value,7,2),'XX'),'fm00')||' '||
                      + (to_number(substr(high_value,3,2),'XX')-100),'fm0000')||'-'||
               to_char(to_number(substr(high_value,7,2),'XX'),'fm00')||' '||
  ,  high_value
       ) hi_v
from dba_tab_columns
where owner      like upper('&tab_own')
and   table_name like upper(nvl('&tab_name','WHOOPS')||'%')
ORDER BY owner,table_name,COLUMN_ID
clear colu
spool off
clear breaks

Yes, I know, it is a considerable chunk of code just to get the high and low values for columns, as identified by the last stats gather to be run on the table. {it is important to remember that these high and low values will only be as accurate as the stats gather that was run and when. An “estimate_percent=> 100” run a minute ago will give probaby give accurate values, an “estimate_percent=>0.1′” run last month is not – which is usually what I am checking for}. I could make the script a lot neater by writing a set of functions to call to get the translation, as was suggested as a comment on the original posting and is in fact somethign I have done in the past, but one is often not allowed to put new stored PL/SQL code onto production systems, so a script which does it is, for me, best.

I should show some proof of it working, I guess. The below will create a table with a nice spread of column type and three records with a reasonable spread of data:

-- test_col_hilo
-- M Widlake 11/12/09
--testing script to show high and low values for various column types
set lines 90 pages 1000 trims on pause off
--set sqlpro 'test102>'
set sqlpro'>'
set echo on
set autotrace off
spool test_col_hilo.lst
-- basic setup info
col name form a30
col value form a20
select substr(name,1,30) name,substr(value,1,20) value
from v$parameter
where name in ('db_block_size','compatible','cpu_count','db_file_multiblock_read_count'
order by name
drop table test_hilo purge;
create table test_hilo
(id          number(8) not null
,numsmall    number(4)
,numlarge    number
,datesmall   date
,datelarge   date
,ts          timestamp
,ts0        timestamp(0)
,ts3	     timestamp(3)
,ts6	     timestamp(6)
,ts9	     timestamp(9)
,vcsmall     varchar2(10)
,vclarge     varchar2(100)
,vcodd       varchar2(20)
,nvcsmall    nvarchar2(10)
,nvclarge    nvarchar2(100)
,bfsmall     binary_float -- NB precision is only 6 or 7 digits
,bflarge     binary_float
,bdsmall     binary_double
,bdlarge     binary_double
tablespace users
-- insert low values
 insert into test_hilo
,to_timestamp('01-FEB-2010 12:34:56.123','DD-MON-YYYY HH24:MI:SS.FF')
,to_timestamp('01-FEB-2010 12:34:56.1234567','DD-MON-YYYY HH24:MI:SS.FF')
,to_timestamp('01-FEB-2010 12:34:56.987654321','DD-MON-YYYY HH24:MI:SS.FF')
-- insert mid values
 insert into test_hilo
,to_timestamp('05-FEB-2010 12:34:56.123','DD-MON-YYYY HH24:MI:SS.FF')
,to_timestamp('05-FEB-2010 12:34:56.1234567','DD-MON-YYYY HH24:MI:SS.FF')
,to_timestamp('05-FEB-2010 12:34:56.987654321','DD-MON-YYYY HH24:MI:SS.FF')
--insert high values
 insert into test_hilo
,to_timestamp('20-FEB-2010 18:17:16.876','DD-MON-YYYY HH24:MI:SS.FF')
,to_timestamp('21-FEB-2010 17:16:15.555555','DD-MON-YYYY HH24:MI:SS.FF')
,to_timestamp('22-FEB-2010 16:15:14.123456789','DD-MON-YYYY HH24:MI:SS.FF')
set autotrace off
set echo off
spool off
clear colu

And the below is the translation of the values – I used a cut-down version of my script, so it woud fit on the blog better:

COLUMN_NAME  DATA_TYPE     LOW_V                         HI_V
------------ ------------- ----------------------------- -----------------------------
LOW_VALUE                 HIGH_VALUE
------------------------- -------------------------
ID           NUMBER        1                             3
C102                      C104
NUMSMALL     NUMBER        -10                           99
3E5B66                    C164
NUMLARGE     NUMBER        -123456789.12345              9898989898989
3A644E38220C59433366      C70A5A5A5A5A5A5A
DATESMALL    DATE          2007-05-31 00:00:00           2010-02-23 00:00:00
786B051F010101            786E0217010101
DATELARGE    DATE          0641-03-10 17:36:47           2012-11-20 17:36:47
6A8D030A122530            78700B14122530
TS           TIMESTAMP(6)  2010-02-24 17:36:47.255015000 2010-03-06 17:36:47.255015000
786E02181225300F333858    786E0306122530
TS0          TIMESTAMP(0)  2010-02-24 17:36:47.          2010-03-06 17:36:47.
786E0218122530            786E0306122530
TS3          TIMESTAMP(3)  2010-02-01 12:34:56.123000000 2010-02-20 18:17:16.123000000
786E02010D23390754D4C0    786E02141312113436B300
TS6          TIMESTAMP(6)  2010-02-01 12:34:56.123457000 2010-02-21 17:16:15.123457000
786E02010D2339075BCDE8    786E0215121110211D18B8
TS9          TIMESTAMP(9)  2010-02-01 12:34:56.987654321 2010-02-22 16:15:14.987654321
786E02010D23393ADE68B1    786E021611100F075BCD15
VCSMALL      VARCHAR2      ABCDE                         ZYXWV
4142434445                5A59585756
4141424341424344454142434 5A5A5A5A5A595A5A585A5A575
4454647414243444546474849 A5A565A5A555A5A545A5A535A
VCODD        VARCHAR2      Abc!"£$%^&*()deF              Z;#[]{}~@:
4162632122C2A324255E262A2 5A3B235B5D7B7D7E403A
NVCSMALL     NVARCHAR2     ABCDE                         VWXYZ
00410042004300440045      0056005700580059005A
0041004100420043004100420 005A005900580057005600550
0430044004500410042004300 0540053005200510050004F00
44004500460047            4E004D004C004B
BFSMALL      BINARY_FLOAT  3.53999972E+000               -9.86999989E+000
40628F5B                  C11DEB85
BFLARGE      BINARY_FLOAT  3.64494306E-004               -9.87653984E+004
39BF1999                  C7C0E6B3
BDSMALL      BINARY_DOUBLE -1.2345E+004                  -9.87654321E+008
C0C81C8000000000          C1CD6F3458800000
BDLARGE      BINARY_DOUBLE -9.8765432112345672E+008      -5.432112345E+009
C1CD6F34588FCD6E          C1F43C774D900000

I could go on and look at handling intervals and time zones but I leave this to you as an exercise. In other words, I have no real need for those data types right now and can’t justify the time! I hope the above is enough to answer whatever questions you may have about the high and low values of your columns…


Missing SQL in AWR/ASH February 23, 2010

Posted by mwidlake in AWR, performance.
Tags: , , ,

I’ve been working on some demonstrations of AWR/ASH within OEM for a client and I’ve come across a situation where you can be missing the SQL (and thus a lot of the useful information) for code run via DBMS_JOB or DBMS_SCHEDULER.

This is due to bugs 5928612.8 for DBMS_JOB and 5140631.8 for DBMS_SCHEDULER. If you can’t access metalink, then check out the end of this webpage. The SQL_ID/SQL_HASH_VALUE is not populated in V$SESSION for those jobs running under DBMS_JOB or DBMS_SCHEDULER. I am presuming from the fact that V$SESSION is not updated with these values, it is not available to V$ACTIVE_SESSION_HISTORY.

I did check that the SQL_ID was null in V$ACTIVE_SESSION_HISTORY for the records that connected to the USER_ID I was using, except for the handful of entries for when I was testing my code manually before running it via DBMS_JOB. The SQL_ID was captured fine for these interactive sessions (and the code executed was identical).

The bugs appear in and both are fixed in and 11. Both have patches available.
I am using

As soon as I mentioned my woes about not being able to see the SQL for code I was running via DBMS_JOB I knew of the existence of the bugs, as colleagues mentioned being aware of them as the patches have been applied to a couple of the other systems. However, I could find no trace of them on Metalink or Google.These bugs are listed as against v$session, not AWR or ASH, which is why I could not find them via searches for “missing sql awr” or missing sql ash” etc. So hopefully this posting will allow the connection to be made.

I find it very vexing when you know there is a fix for an oracle problem but you can’t find it in metalink (which seems even more true with the new, all-singing all-dancing all-fancy-graphics little help metalink). Sometimes it is because the bug number is not being made publically available and in other cases, such as this one, it is because no one has made the connection between the initial problem and other things it could impact. I guess it is asking a lot of Oracle Corp to do this for us and it would be unreasonable of us to hold it against them – But they are supposed to be a blooming data, information and knowledge expertise company! Tsch.

As an example of what I see in AWR with my workload, generated via DBMS_JOB {and I’m sorry it is so small, you should be able to click on it and get a full-size version}:

Notice that the top five jobs are via user MDW_2. I know that none of the top SQL on the left is from my workload (you will have to just take my work, but it is, in fact, all the “OEM/AWR code 🙂 ). Also, the top code shown is heave on green for CPU, but you can see most of the workload in the Top Activity and the Top Sessions is blue, for User IO.

If I pick out the top session, 545, and click on it I get the below screen:

Note the entry for SQL_ID is blank and SQL Command is “UNKNOWN”. I can see what it is doing (db_file_scattered read and some CPU) but not what on or why. 😦

Friday Philosophy – Alternatives to Testing February 19, 2010

Posted by mwidlake in Friday Philosophy, Testing.
Tags: ,

We all know that we should test everything before we do it on a production system.

Or should we? Well yes, you should, but sometimes there is no suitable test system (or even any test system, which seems to be becoming a more common problem) or the testing regime takes weeks and you need a fix right now.

So given that there really is a proper business reason why doing the testing is expensive {in effort or time} or impossible, what can you do? I believe there is an alternative.

To be able to reduce or remove the need for proper testing you need to be absolutely sure the following three things are true, in order of importance:

  1. Be able to monitor the live system for negative impact of your changes.
  2. Have a proven and quick rollout.
  3. Be confident that the chance of negative impact on your system is low likelihood and low impact.

Note that, in my opinion, the least important {yet still very important} consideration is that the change be almost certain to work and won’t cause too much grief if something unexpected happens.  However, this seems to be the main concern when people are considering a “Quick Fix” on a production system. “Is it likely to go wrong? Is it going to matter if it does not help?”. At this point all eyes tend to swivel to the DBA or the Lead Developer who has to go on what I call DBA Instinct. Given your experience to date and what you know about how the database works, give it your best guess.

The problem with DBA Instinct is that if you are not very experienced you are unlikely to get it right. If you are very experienced, most of that experience was on earlier version of Oracle and you probably do little hands-on work now as you are a manger. So you are still likely unlikely to get it right 🙂

So, I argue that point 1 (spotting things going wrong) and 2 (Getting back to where you started) are most important.

Let us take the classic example of just adding an index as our untested change (as we have no system capable of generating a realistic workload and thus the impact across the board).

3) Most of us would agree that adding an index is easy and the impact is low and the chance of plans going terribly slow by using the new index unsuitably are low…

2) Rollback is relatively simple, you drop the index again. Question, can you just drop an index on a table if a SQL statement is in-flight and using that index? You know, a SQL statement that the CBO has decided is going to be useful and, despite your DBA Instinct that nothing is likely to go wrong, now takes 20 seconds to run rather than 0.2 seconds. And it is being run all the time on your OLTP system so now there is no time when there are not 2 or 3 queries in flight.

3) Verifying impact. This point is vital and can be tricky. You can just rely on staring at your monitoring tool and hoping any serious impact shows up so you can catch it and do something about it. Or people ring up and complain. That would not be so good, you have impacted the business. 

I would pull out code from the SGA that is referencing the table to be indexed before you start (I’ll stick a bit of code at the bottom for doing this). If the list is not too large, you can do the same check after the code has gone in and compare changes to buffer gets and disk reads per execution.

A more targeted check is to pull out of v$SQL_PLAN anything that starts using the new index and check to make sure it is not now a problem.

A second example, I want to gather system statistics as I am sure the system does not appreciate how blindingly fast my disks are for multi block reads.

3) Impact? Well, I would suggest the potential impact is wide and across the board, else why do it?

2) Rollback? Delete the system statistics. What happens when you delete the system statistics? Are they all set to null? Are the defaults put back in place? Could I use DBMS_STATS.RESTORE_SYSTEM_STATS? What is the impact of any of them? {RESTORE appears to work reliably and, if you delete the system stats, you get the “out of the box” values again, which could be very different to what you had before your spur-of-the-moment gather}

1) How to assess negative impact? You are going to have to monitor the whole system and hope to pick up any code that starts running slowly before it impacts the business too much.

I would probably add the index but I would need to do at least some testing and proof of rollback before I gather system statistics. And I would only do so without proper testing if said proper testing was impossible due to a lack of test systems.

So, I do think you can implement changes without testing or with reduced testing, but it is not always the easy option.

I also think it is a valid (and often quicker, in terms of elapsed time) way of changing the productions system without going through full unit/integration/regression/NFR testing.

Now you just need to persuade the Change Advisory Board that it is OK to do the change. Good luck 🙂

Oh, that code:-

-- chk_sga_txt.sql
-- Martin Widlake - look for sql in sga containing the provided txt
set pause on
set pages 36
col first_load_time form A20
spool chk_sga_txt
,parse_calls prse
,executions  excs
,buffer_gets buffs
,disk_reads   discs
,rows_processed rws
--,address        address
,hash_value     hash_value
from v$sqlarea
--where parsing_schema_id !='0'
where upper(sql_text) like upper('%'||nvl('&sql_txt','whoops')||'%')
and rownum <50
--order by first_load_time desc
order by (greatest(buffer_gets,1)/greatest(rows_processed,1)) desc
spool off

Command Line or GUI – Which is Best? February 18, 2010

Posted by mwidlake in performance.
Tags: , ,

At present I am suffering ever so slightly from “split personality disorder”* in respect of my liking for Command Line and GUI interfaces.

On the one hand, much to my colleagues mild reproach, I use SQL*PLus and not PL/SQL Developer for my day-to-day work. Even worse, I got sick of using notepad to hack around scripts {I am working in a windows client environment and you simply can’t use MS Word with SQL files!} so I have retrograded recently and downloaded a windows-complient ‘vi’ interface and it is soooooo nice to be able to use powerful ‘vi’ commands on my files once more. “:.,$s/^ /,/”. Ahhh, it is so much easier. I can do stuff in 3 seconds in ‘vi’ that would take me 10 minutes in Notepad in a large, complex file. That and, I’m sorry, but notepad seems to be unable to manage a 100MB file, despite me having 2GB of real memory and a decent machine, but ‘vi’ has no problem with it at all.
Even more retrograde, I have direct telnet access to my linux servers and I am getting back to that as it makes me so much more productive. “ls -alrt ^d” for all directories anyone? “df -k .” to see how many data files I can add? Yep, it’s all arcane and means so little to many modern IT “Java/Struts/CDE” people but boy it is direct and fast. I might even dig out that book on SED and AWK.

On the other hand, I have finally (after much very painful discussions back and forth) got agreement that my site probably has access to AWR, ASH and all that good performance repository stuff. So I am hacking around with the OEM screens that focus on performance and snapshots and stuff. Now, I am traditionally not a big fan of GUI DBA tools. Partly it is because I am a bit old and stuck in my ways and partly it is because GUIs are really just “menus of options”. You are limited to what options are available in your DBA GUI tool and you have a harder time learning all the options available or what is really going on “under the covers”.

But with AWR and all those graphs, links and utilities, you can drill down into real problems real time or in the past so effectively that, well, once they start using this tool properly they will not need me at all. It is a fantastic boon to performance management and problem resolution, as well as proactive performance management.

So there you are, I am with Doug Burns on this one, in that I have Learned to Love Pictures. When the Pictures are well thought out and well connected and simple enough to help make sense of a complex system {and Oh Boy Oracle performance has become sooo Complex!!!!}

So right now, I spend half my day in vi/linux/command line world and half of it in pretty picture/GUI world. I think what really makes me happy is to leave behind the half-way-house of text-like Windows World {Windows SQL*Plus, Notepad}.

Just to finish, you can’t mention AWR without someone raising the ugly issue of licence cost and how Evil Oracle Corp were to charge for it. Well, I think it has been established that the guys and gals who developed AWR/ASH did not expect it to become a cost option but it did. And I suspect that what kept it a cost option was the community’s OutRage at it being a cost option. Anything that popular, hey, a commercial company is going to charge for. I still reckon Oracle Corp ballsed up as making it free and helping people use it a bit would have made 90% of customers’ lives easier and would have translated into user happiness and a certain % of sales for training courses to learn more, but heck my day job is to make things work, not maintain sales percentages, so my opinion counts for nowt. *sigh*

(*apologies to real sufferers of Dissociative Identity Disorder, I am using the term in the loose, non-scientific, “common usage” term of “not sure of my opinion” rather than having truly disparate personalities and memories.** And note, I certainly do not mean schizophrenia which, despite the on-going public-opinion misunderstanding, is rarely anything to do with multiple personality disorders or “spit minds” AT ALL, and is more to do with a difficulty in determining between reality and hallucination. ).

Richard Foote on the Impact of stats staying the same. February 16, 2010

Posted by mwidlake in performance.
Tags: , ,
add a comment

I just wanted to highlight this very good posting by Richard Foote. He is talking about how SQL execution plans can change when “nothing else does”. Not the table structures, not the code, note the initialisation parameters, not even the table and index stats.

But something does change, which is what day it is (or what hour it is or what week it is). Time moves on and our data does too. If the statistics on the tables does NOT move on, then the CBO thinks that the range of data in the table does not change. So, the CBO thinks your queries are getting further and further “out of range” and so would expect to find less and less data in the tables to bring back. That will lead to plan changes.

If you have noticed my preoccupation with identifying the contents of histograms and high/low values in column stats, you may appreciate that this topic is one I have been finding is a large part of my day job.

Richard explains the point very well, as always, so go have a look.

Stats Need Stats to Gather Stats February 16, 2010

Posted by mwidlake in performance.
Tags: , ,

Did you know that you sometimes need good stats so that the stats-gathering package can gather stats in an efficient way? This is a recent, quite extreme example.

I’ve been forced to learn a lot about gathering Oracle stats using DBMS_STATS over the last 4 or 5 years. But no matter how much I learn about the “challengingly bizarre” way in which it works, it seems every week or two there is a new oddity. I plan a whole series on the topic “soon”.

This particular example is from a system.

I am gathering partition-only table stats as we are using the ability of Oracle to roll up Partition stats to Global stats under certain specific conditions. One of the conditions is that you need stats for every partition. Plus, to get global column stats, each partition must have stats for each column. Some of our partitions lacked or had very bad stats.

So I quickly knocked up a script-generating script to create DBMST_STATS.GATHER_TABLE_STATS statements that collected, for those partitions:

  • ONLY partition stats.
  • NO cascade down to indexes
  • BLOCK sampling so it is fast {and poor, but there you go}
  • ESTIMATE_PERCENT of 2, which is quite low for block sampling
  • collect histograms as we decided the devil of having them was better than the devil of not having them.

the above is not ideal to get “good stats”, but it is quick and gets OK stats which is what we need right now. An example statement is:

-- part SD_INFO-DY07032004 rows 34554                                                              
dbms_stats.gather_table_stats(ownname => 'ERIC',tabname=> 'SD_INFO'
,partname=> 'DY07032004'
,granularity=> 'PARTITION', estimate_percent => 2,block_sample=> true ,cascade=> FALSE
,method_opt=> 'FOR ALL COLUMNS SIZE AUTO' ,no_invalidate=> false);

Not a big partition (34554 rows and not very wide rows) and so did not take long to gather:

------------------------------ --------------------
DY04012004                     12-FEB-2010 16:26:57
DY05012004                     12-FEB-2010 16:27:00
DY06012004                     12-FEB-2010 16:27:04
DY07012004                     12-FEB-2010 16:27:07
DY08012004                     12-FEB-2010 16:27:11 -- 4 seconds difference
DY09012004                     12-FEB-2010 16:27:15

I’ve collected statistics for a few thousand partitions over the last couple of days and the time taken is anything between just under half a second to 10 seconds per partition, the odd unusually large partition taking a minute or so. {I believe it takes half a second to gather stats on an empty partition, on our system at least, due to the time it takes for the internal housekeeping, including copying the old statistics information to the SYS.WRI$_OPSTAT_… tables to support restoring stats}. Sorry, I drift away from my main point.

This partition took a lot longer than 10 seconds:

-- part W_ACTIVITY_FAILURE-DY02092008 rows 49425
dbms_stats.gather_table_stats(ownname => 'ERIC',tabname=> 'W_ACTIVITY_FAILURE'
,partname=> 'DY02092008'
,granularity=> 'PARTITION', estimate_percent => 2,block_sample=> true ,cascade=> FALSE
,method_opt=> 'FOR ALL COLUMNS SIZE AUTO' ,no_invalidate=> false);

After 10 minutes it was still running. WHY? I quickly checked the number of rows in the partition and then the size of the partition segment, incase either was much larger than I expected. Neither were.
select count(*) from eric.W_ACTIVITY_FAILURE partition (DY07092008)

From dba_segments.
———- ———-
2621440 320

There is one advantage of a DBMS_STATS statement running for a long time – you can grab from the SGA the actual code being executed for the DBMS_STATS statement. I saw this.

select substrb(dump(val,16,0,32),1,120) ep, cnt 
from (select /*+ no_parallel(t) no_parallel_index(t) dbms_stats cursor_sharing_exact
use_weak_name_resl dynamic_sampling(0) no_monitoring */
"ACCOUNT_ID" val,count(*) cnt 
and "ACCOUNT_ID" is not null 
group by "ACCOUNT_ID") 
order by val

DBMS_STATS is scanning the ACCOUNT_ID column on this table and it is taking a long time about it. The only index on the table is on ACCOUNT_ID. It then struck us.

The index is a global index.
—————————— ————-

And that global index actually lacked stats {another feature of DBMS_STATS and rolling up partition stats had led to that}.

————— ——— — —- ———- ———– ———– ———- ———- ——–

By this time 3 partitions for this table had been processed by my code, taking around 15 minutes each one. Incredibly slow.

I did a very quick 0.01% sample size DBMS_STATS.GATHER_INDEX_STATS on that index which took about 1 minute. As soon as the partition DBMS_STATS.GATHER_TABLE_STATS statement that was in flight finished, the following similar statements on that table’s partitions took under 3 seconds each. I’ll buy a pint for the first person to guess (within 10 minutes) WHEN I collected the global index stats {You can collect from any pub in central London if you give me a few day’s notice 🙂 }.

-------------- ---------- ---------- ---------------------
DY01092008          31461        452 16-FEB-2010 09:51:41
DY02092008          49425        686 16-FEB-2010 10:03:44
DY03092008          54472        719 16-FEB-2010 10:16:31
DY04092008          35762        491 16-FEB-2010 10:30:52
DY05092008          42182        587 16-FEB-2010 10:44:24
DY06092008          21186        312 16-FEB-2010 10:56:13
DY07092008          20898        313 16-FEB-2010 11:12:59
DY08092008         469500       1233 16-FEB-2010 11:13:02
DY09092008         480300        741 16-FEB-2010 11:13:04
DY10092008          15724        223 16-FEB-2010 11:31:01
DY11092008          55671        732 16-FEB-2010 11:31:06
DY12092008         820100       1779 16-FEB-2010 11:31:08
DY13092008          80215       1113 16-FEB-2010 11:31:16
DY14092008          10094        155 16-FEB-2010 11:31:18
DY15092008          10268        158 16-FEB-2010 11:31:20
DY16092008          24578        330 16-FEB-2010 11:31:22
DY17092008          21012        290 16-FEB-2010 11:31:24
DY18092008          22755        318 16-FEB-2010 11:31:27
DY19092008          21276        293 16-FEB-2010 11:31:29
DY20092008          20882        281 16-FEB-2010 11:31:31
DY21092008          24131        323 16-FEB-2010 11:31:34

I will investigate this oddity further if I get time this week, but the lesson to myself so far is:

Global indexes lacking stats can result in very long stats gathering times for partitions EVEN WHEN YOU DO NOT CASCADE DOWN TO INDEXES.

Friday Philosophy – Are short Posts Better Than Long Ones? February 12, 2010

Posted by mwidlake in Friday Philosophy.
Tags: ,


Making Things Better Makes Things Worse February 11, 2010

Posted by mwidlake in development, Management, Perceptions.
Tags: , ,

This could be a Friday Philosophy but I’ve got a couple of those lined up already. Anyway, I am suffering at work at the moment. I’m encountering a phenomenon that I have talked about with Dennis Adams a couple of times. It probably has a proper term, but basically it is the odd situation that when you make things better, you get more complaints. {Dennis, do you know the proper term?}

{Update. Dennis was good enough to link to this paper he wrote on customer feedback}

Anyway, Let me explain. You have an application sitting on a database. The screens are slow, the reports take an age to come out, your might even have considerable system instability and unplanned outages. The users are not unhappy. They were unhappy last year. Now they are just cynical and they just expect the system to be slow, unresponsive, flaky. So they do not report any problems.

Then things change. The system gets some much-needed care and attention and now the slowest reports get tuned up, the screens come back faster and less spontaneous department-wide coffee breaks are caused by the system crashing. Everything gets better. But not for the help desk, now they start getting calls. “This report is too slow”. “Why can’t I jump straight from the customer screen to the pending orders screen?”. This happens because the users now realise that something can be done. There is a point in complaining as there is a chance their piece of misery could be made better. I certainly went through this many years ago when I inherited a system that crashed every week. No one mentioned it, they just went to tea and complained about it. The first time it crashed after I arrived I could not believe that no one had called before I had realised it had died. Once it had been up solidly for a couple of months, then when it crashed boy did we hear about it!

Also, when you improve a system and things generally get better, occasionally something will improve and then fall back a bit. Hardly anyone says “thanks” for the initial improvement but they will say something if it improves and then drops back.

That is what is happening for my main client at the moment. The system was not that bad, but it needed some help. Why else would I be there? I’ve been beavering away with the rest of the team and we are making things better, so far mostly at an underlying “getting the overall system straight” level. A new chap has joined to help concentrate on performance and he is really making inroads into specific things, individual reports and processes that need a good sorting out.

So now that things are getting better and performance is generally improving, anything that is still slow is being brought up by the development and support teams. Also, we’ve made a few things slower (I’m sorry, it just happens like that) and they certainly get mentioned.

So, I’m busy. And I could get annoyed at people asking why X is slower when Y and Z are faster. But I don’t, because Dennis explained this counter intuitive theory to me.

I know things are getting better as people are annoyed as opposed to apathetic 🙂

Format of Date Histograms February 6, 2010

Posted by mwidlake in internals.
Tags: , ,

Oracle annoyingly handles dates internally in several formats. Dates are stored in tables as seven bytes, each byte representing century, year-in-century, month, day, hour, minute and second, shifted by different amounts. For the high/low values seen in DBA_TAB_COLUMNS or DBA_TAB_COL_STATISTICS, it is stored as a RAW value, where a two-digit hex string represents the century, year-in-century,month etc as before – see this post on decoding high and low column values and check out the comments for corrections.

So what about histograms? You might want to know what is in the histogram for a date column in order to better understand the decisions made by the CBO. Below, I pull out the histogram for an example date column {I’ve trimmed the output a little to save space}:

select table_name
,endpoint_value end_val
,endpoint_number rowcount
from all_tab_histograms
where table_name ='TEST_TABLE_X'
and owner ='TEST_1'
and column_name = 'PLACED_DT'
order by endpoint_number

--------------- ------------- ------------- ---------------
TEST_TABLE_X     PLACED_DT   2,452,258         0
TEST_TABLE_X     PLACED_DT   2,454,334         1
TEST_TABLE_X     PLACED_DT   2,454,647         2
TEST_TABLE_X     PLACED_DT   2,454,737         3
TEST_TABLE_X     PLACED_DT   2,454,820         4
TEST_TABLE_X     PLACED_DT   2,454,867         5
TEST_TABLE_X     PLACED_DT   2,454,929         6
TEST_TABLE_X     PLACED_DT   2,454,981         7
TEST_TABLE_X     PLACED_DT   2,455,006         8
TEST_TABLE_X     PLACED_DT   2,455,024         9
TEST_TABLE_X     PLACED_DT   2,455,039         10
TEST_TABLE_X     PLACED_DT   2,455,050         11
TEST_TABLE_X     PLACED_DT   2,455,205         42
TEST_TABLE_X     PLACED_DT   2,455,207         43
TEST_TABLE_X     PLACED_DT   2,455,209         44
TEST_TABLE_X     PLACED_DT   2,455,211         45
TEST_TABLE_X     PLACED_DT   2,455,213         46
TEST_TABLE_X     PLACED_DT   2,455,215         47
TEST_TABLE_X     PLACED_DT   2,455,216         48
TEST_TABLE_X     PLACED_DT   2,455,218         49
TEST_TABLE_X     PLACED_DT   2,455,220         50
TEST_TABLE_X     PLACED_DT   2,455,221         51
TEST_TABLE_X     PLACED_DT   2,455,222         52
TEST_TABLE_X     PLACED_DT   2,455,222         53
TEST_TABLE_X     PLACED_DT   2,455,222         54
TEST_TABLE_X     PLACED_DT   2,455,222         55
TEST_TABLE_X     PLACED_DT   2,455,223         56
TEST_TABLE_X     PLACED_DT   2,455,223         57
TEST_TABLE_X     PLACED_DT   2,455,223         58
TEST_TABLE_X     PLACED_DT   2,455,223         59
TEST_TABLE_X     PLACED_DT   2,455,223         69
TEST_TABLE_X     PLACED_DT   2,455,223         70
TEST_TABLE_X     PLACED_DT   2,455,223         71
TEST_TABLE_X     PLACED_DT   2,455,223         72
TEST_TABLE_X     PLACED_DT   2,455,223         73
TEST_TABLE_X     PLACED_DT   2,455,224         74
TEST_TABLE_X     PLACED_DT   2,455,226         75

Well, it looks like a seven digit number so it must be representing the date in a similar way to as described above, yes? No. The format is obviously something different {Oh come ON Oracle, some internal standards would be nice once in a while! 🙂 }

I pulled out the minimum and maximum values from the DBA_TAB_COLUMNS table and translated them:-

COLUMN_NAME  NUM_DIST   LOW_V               HI_V 
------------ ---------- ------------------- ------------------- 
PLACED_DT    375,428    2001-12-13 17:31:38 2010-01-28 23:51:38 

So the above low values and high values will pretty much match the first and last values in the histograms, which are 2452258 and 2455226. Any guesses? Those values from the histogram are very close to each other, only 2968 different to cover around 9 years of values. 9 times 365…

The histogram is representing the dates in Julian format, ie number of days since 1st Jan 4712BC. As a quick proof of this:

select to_date('2452258','J'),to_date('2455226','J')
from dual;

TO_DATE('2452258' TO_DATE('2455226'
----------------- -----------------
14-DEC-2001 00:00 29-JAN-2010 00:00

Well, what do you know. Very close to the 13th Dec 2001 and 28th Jan 2010

This of course makes sense, storing the date as an ascending numeric means it is simple to calculate the width of the range and how many values per day there are.

Imagine how complex it would be to do this if the date was stored in the bizarre way we humans deal with it – an ascending numeric for year, a cycling 12 digit number for month and a varying cyclic number for the day of the month. And that is ignoring other calendars in common use.

However, I’m looking at this method of representing the date and something bothers me. There is no allowance for the time portion. Maybe column histograms can’t cope with time? I feel a couple of tests coming on…

If you have seen the comments, you will know that Jonathan commented to ask if I might have truncated/been selective with the data in creating my test table as he is sure there is a fractional section representing the time portion.

Well, I did say I needed to check further so maybe I would have spotted my error on my own… 🙂
This is the script I use to pull out histogram data (and yes, I trimed the output from the script before copying it into this post, so it does not quite match the script):

-- chk_hist.sql
-- Martin Widlake 7/4/4
-- pull out the histograms on a table
set pause on pages 32
spool chk_hist.lst
col owner form A8
col table_name form A15
col column_name form a20
col rowcount form 99,999,999,999
col end_val form 99,999,999,999
select owner
,endpoint_value  end_val
,endpoint_number rowcount
from all_tab_histograms
where table_name like upper(nvl('&tab_name','WHOOPS')||'%')
and  owner like upper('&tab_own')||'%'
and  column_name like upper('&col_name')||'%'
order by table_name,column_name,endpoint_number
spool off
clear colu

Hmm wonder what happens if I change the “col end_val form 99,999,999,999” so that it would show fractions…

col end_val form 999,999,999.99999999

--------------- ---------- --------------------- ------------
TEST_TABLE_X    PLACED_DT     2,452,257.73030093            0
TEST_TABLE_X    PLACED_DT     2,454,333.76546296            1
TEST_TABLE_X    PLACED_DT     2,454,647.32561343            2
TEST_TABLE_X    PLACED_DT     2,454,737.25017361            3
TEST_TABLE_X    PLACED_DT     2,454,820.02204861            4
TEST_TABLE_X    PLACED_DT     2,454,866.98009259            5
TEST_TABLE_X    PLACED_DT     2,454,928.66848380            6
TEST_TABLE_X    PLACED_DT     2,454,980.94815972            7
TEST_TABLE_X    PLACED_DT     2,455,005.68413194            8
TEST_TABLE_X    PLACED_DT     2,455,023.67142361            9
TEST_TABLE_X    PLACED_DT     2,455,039.03236111           10
TEST_TABLE_X    PLACED_DT     2,455,050.39246528           11


Thanks Jonathan 🙂
It is still a Julian date, but with the time as a fraction as he said.

Friday Philosophy – Software being Good is not Good Enough February 5, 2010

Posted by mwidlake in Friday Philosophy, Uncategorized.
Tags: ,

In a previous Friday Philosophy on In Case of Emergency I mention that something being simply a “Good Idea” with technology is not good enough. Even being a GREAT idea is not enough. It also has to be:

  1. Easy and simple to use. After all, using that funny stick thing behind your steering wheel in a car, to indicate which direction you are turning, seems to be too much of an effort for many people. If installing some bit of softare or running a web page is more than a little bit of effort, most people will not bother.
  2. Quick. No one has patience anymore, or spare time. This will probably be the fourth year in a row I do not plant any vegetables in our garden as I need to spend a day or two clearing and digging over the spot for said veg. You can’t beat home-grown veg. Similarly, I won’t use a web pages that takes as long to load as it does to plant a carrot seed.
  3. Known about. There could be a really fantastic little program Out There that allows you to take a screen shot, add a title and comment and pastes it straight into a document for you, converting to half a dozen common formats on the fly. But I do not know about it. { ScreenHunter is pretty good, I have to say, and when I have shown it to people a lot of them like it}.
  4. Popular. This is not actually the same as “known about”. For a stand-alone application to be good for you, you just need to know where it exists. Like maybe a free building architecture package. Whether thousands of people use it is moot, so long as you can get your extension drawings done in it with ease, that makes it great. But something that relies on the community, like a service to rate local eataries, unless lots of people use it and add ratings, well who cares. There are dozens (if not hundreds) of such community “good ideas” started every day but unless enough people start to use it, it will fizzle out, as the vast majority of them do.

Point 4 is highly relevant to “In Case Of Emergency” as it is simple, quick and relativley known about. It just needs to be ubiquitous.

I became very aware of point 3 a few years ago and also of the ability for very clever people to be sometimes very stupid when it comes to dealing with their fellow humans.

I was working on a database holding vast quantities of DNA information. If you don’t know, DNA information is basically represented by huge long strings of A, C, T and G. So something like AACTCGTAGGTACGGGTAGGGGTAGAGTTTGAGATTGACTGAGAGGGGGAAAAATGTGTAGTGA…etc, etc, etc. These strings are hundreds, thousand, hundreds of thousands of letters long. And Scientists like to search against these strings. Of which there are millions and millions. Not for exact match mind, but kind-of-similar, fuzzy matches, where for example 95% of the ACTGs match but some do not. It’s called a BLAST match.

Anway, suffice to say, it takes a lot of compute power to do this and a fair amount of time to run. There was a service in America which would allow you to submit a BLAST query and get the answer in 20 minutes or so {I have no idea how fast it is now}. 

Some extremely clever chaps I had the pleasure of working with came up with a faster solution. Same search, under 5 seconds. Now that is GREAT. We put together the relevant hadware and software and started the service.  Now I thought it went beyond Good or even Great. It was Amazing (and I mean it, I was amazed we could do a fuzzy search against a billion such strings in 2, 3 seconds using a dozen or so PC-type servers).

No one used it. This was because almost no one knew about it and there was already this slow service people were used to using. People who used the old service never really thought to look for a new one and the chances were they would not have found ours anyway.

I pushed for more to be made of this new, faster service, that it should be advertised to the community, that it should be “sold” to people (it was free to use, by “sold” I mean an attempt made to persuade the scientific community it was worth their while investigating). The response I was given?

“If the service is worth using, people will come and use it”.

No they won’t. And indeed they didn’t. It was, I felt, a stupid position to take by an incredibly inteligent person. How were people to know it existed? Were they just supposed to just wake up one morning knowing a better solution was out there? The internet pixies would come along in the night and whisper about it in your ear? In the unlikely event of someone who would be interested in it just coming across it, were they then going to swap to using it? After all no one else seemed to know about it and it was 2 orders of magnitude faster, suspiciously fast, how could it be any good?

The service got shut down as it was just humming in the corner consuming electricity. No one knew it existed, no one found it, no one came. I can’t but help wonder how much it could have helped the scientific community.

There must be thousands of other “failed” systems across the IT world that never took off just because the people who could use it never knew it existed. Depressing huh?