jump to navigation

Automated Statistics Gathering Silently Fails #2 July 23, 2009

Posted by mwidlake in performance.
Tags: ,

<previous postnext post>

My first posting about the automated stats gathering job on 10/11g failing is here. I show how it can fail for really large segments and can go on failing night after night, the little so-and-so.

Today I am going to give you some pointers in spotting what the job choked on and how to deal with it {damn, ran out of time – see next post}.

How can you tell what it is failing on? Well, you can use OEM to look at the latest run of the automated stats gathering job and what was processed. It probably won’t tell you what it failed on, it did not in the version I tried (early 10.2) and no one has pinged me to say it now does. Look back at the previous posting for a screen shot if you want help finding the page in OEM.

However, you can check what was analyzed last night and compare to the list of objects shown. Concentrate on tables or partitions you know to be big, the job is not going to choke on a 14-row EMP table, more like a half-billion row ORDERS table.

select owner,table_name,last_analyzed,sample_size
from dba_tables
where last_analyzed between sysdate-.5 and sysdate -.4
order by table_name
dw> /
OWNER          TABLE_NAME                     LAST_ANALYZED     SAMPLE_SIZE
-------------- ------------------------------ ----------------- -----------
ODS XXXX_TIM_RF5 16-JUL-2009 00:09 19
ODS            H_XXXXXXXXXXXXXXX_TYPES        16-JUL-2009 00:10      272838
ODS            H_TXXXXXXXXXXXXXXXNER          16-JUL-2009 00:10       85184
ODS            H_XXXXXXXXXXXXXXXCTIONS        16-JUL-2009 00:04      165999
ODS            STGXXXXXXXXXXXXXXXCTIONS       16-JUL-2009 00:51          10
ODS            TBL_XXXXXXXX                   16-JUL-2009 01:06     3499693
ODS WRK_XXXXXX_FLAG 16-JUL-2009 01:09 173340

If it shows as being analysed by the automated job last night and it’s last_analyzed date is prior to the run {so is not in the list you just pulled out}, it choked on that.
It is often quicker to make an educated guess and check those objects that you know are big and what their LAST_ANALYZED date is.

If you have no access to OEM, you could try looking at objects that should have been analysed. These are tables/partitions with more than 10% difference. You do this via DBA_TAB_MODIFICATIONS. I’ve already posted about this table {look towards the end of the post}, I am strangely attached to it, I think it is jolly useful.

The table looks like this

desc all_tab_modifications
 Name                                 Null?    Type
 ---------------- -------------------------------
 TABLE_OWNER                           VARCHAR2(30)
 TABLE_NAME                             VARCHAR2(30)
 PARTITION_NAME                        VARCHAR2(30)
 INSERTS                                   NUMBER
 UPDATES                                   NUMBER
 DELETES                                   NUMBER
 TRUNCATED                               VARCHAR2(3)

You can interogate it with something like the below {which only looks at tables, not partitions} You can get my script here – this script here

-- tab_count
-- mdw 11/05/03
-- mdw 17/01/08 Modified to look at dba_tab_modifications
set pause on pages 24 pause 'Any Key>'
colu anlyzd_rows form 99999,999,999
colu tot_rows form 99999,999,999
colu tab_name form a30
colu chngs form 99,999,999
colu pct_c form 999.999
set lines 110
spool tab_count.lst
select dbta.owner||'.'||dbta.table_name tab_name
,dbta.num_rows anlyzd_rows
,to_char(dbta.last_analyzed,'yymmdd hh24:mi:ss') last_anlzd
-nvl(dtm.deletes,0) tot_rows
,nvl(dtm.inserts,0)+nvl(dtm.deletes,0)+nvl(dtm.updates,0) chngs
     /greatest(nvl(dbta.num_rows,0),1)                          pct_c
,dtm.truncated trn
from dba_tables dbta
left outer join sys.dba_tab_modifications dtm
on dbta.owner = dtm.table_owner
and dbta.table_name = dtm.table_name
  and dtm.partition_name is null
where dbta.table_name like upper(nvl('&Tab_name'||'%','WHOOPS'))
and dbta.owner        not in ('SYS','SYSTEM')
spool off
clear colu

And it comes out with something like {view plain}:-

------------------------ -------------- --------------- -------------- ----------- -------- --
O.ADGHERHEHES                       384 060726 09:49:22            391           7     .018 NO
O.AUEHERHERHERATUS               15,578 060825 15:39:38         15,578       2,185     .140 NO
O.ADRTJRTJRENT                1,865,595 060823 07:32:53      1,892,301      26,706     .014 NO
I.APP_ARTHRTHRT_EXTR                  0 060726 22:10:21              0           0     .000 YES
I.APP_ARTJHRJHTJH_EXTR                0 060726 22:10:21              0           0     .000 YES
O.AD_RHRTHRTHRS                 143,435 061110 15:49:46        186,646      86,899     .606 NO
O.AD_RHRTHHORHUMNS              599,066 061110 15:49:57        885,304     286,238     .478 NO
O.AD_UNMTABRTHLES                    25 061110 15:49:41             31           7     .280 NO
O.AD_FTD_SRCORTHHOLD                 23 060726 09:49:25             23           1     .043 NO

Anything with a PCT_C {percentage change} of over 0.1 should have stats gathered next time the job runs, if the table does not have it’s statistics locked {phew, remembered to mention that caveat}.

You could of course write something yourself that only selects objects that have 10% or more change – it will give you a heads up of what is likely to be analyzed tonight and those objects who’s dependent SQL could act differently come the morning 🙂

There is a final way to spot what got changed and it involves Oracle’s internal tables for storing stats, but I’ve run out of time. I’ll have to post that next time.


Automated statistics gathering silently fails July 20, 2009

Posted by mwidlake in performance.
Tags: ,

…Next Post>
Or “The worst Oracle Performance Feature.”

I posted a few days ago about what I feel is the best performance feature in Oracle and cited the Automated Stats Gathering job. A few people did not agree… 🙂 So, in the best traditions of politicians and lawyers the world over, I would now like to claim I was misunderstood and that is not what I said. I said the opposite…(* see later).

Of the many problems that can occur with the automated statistics job, and the one I am going to pick up on today, is the fact that it can choke on large table or table partition segments {and does so silently at that}. What do I mean by this?

The automatic stats job runs every weekday night, by default between 22:00 hours and 06:00 hours. It appears to have some logic where by it looks at the number of table segments (full tables or partitions within a table) that need to be analysed. It then calculates the amount of time it can spend on each segment. If it has 32 objects to analyse in the 8 hours available, it seems to decide it can afford to spend not 8/32 (1/4) hours on any given segment but something more. It looks to be a couple of hours.

The job also tries to prioritise the most “in need” segments to be considered first. I am not sure what it goes on but I think it is those with the largest percentage change, but it is not just as simple as that.

As a result of the above, this is what seems to happen. The stats job starts up and identifies a massive segment to analyse, say a 200GB table. It starts analysing it, say allowing 2 hours maximum, but after 2 hours it has not analysed the massive table so it gives up.

The next night, it identifies the same massive segment to analyze (now 201GB) and maybe gives it an even higher priority. The job tries to analyse it IN EXACTLY THE SAME WAY as before. Oddly enough {this is sarcasm} it usually fails. The only reason it WILL succeed is if the night before the automated job managed to analyse other segments and fewer new ones need considering tonight, so it can spend more time on each segment. In my experience this is unlikely.

As a result, the object will probably fail to be analysed every night until the weekend, when the automated job gets to spend 2 whole days analysing segments and so can give each one more time. Unless you spot the error yourself and do a manual dbms_stats.gather_xxx statement yourself, with a low sample size that is likely to succeed.

How can you tell this is happening, or has happened? Look at the history of schema/database wide statsistics gathering operations in the table DBA_OPTSTAT_OPERATIONS . If a job ran all night, it probably choked {in the below, I look at table sys.WRI_OPSTAT_OPR – it is the underlying object}:

The highlighted record is an example of where the automated stats job ran all night. It did this because it choked on a large segment. There are 3 other examples of this in this screen shot. Once the job starts choking, it becomes a constant problem!

The highlighted record is an example of where the automated stats job ran all night. It did this because it choked on a large segment. There are 3 other examples of this in this screen shot. Once the job starts choking, it becomes a constant problem!

{You may ask why this is not failing the next night, like I claimed. It was because we were sorting out the failure the next morning, but had not yet properly fixed the problem}

So, for very large segments, the automated stats job can fail to analyse them.


No error appears in the alert log that I have seen, OEM or grid control does not seem to flag it up anywhere on the system summary or performance monitoring pages.

In OEM you can look at the latest stats run and see how it did. It lies. {Well, last time I checked it lies, I do not currently have access to OEM on a site suffering from the problem and, as you can imagine, clients do not want me to create massive tables just to fake this up for my blog 🙂 }. OEM has a screen to show the activity of the latest automated stats gathering run but, even though it has the option to look for objects for which there is a timeout, it does not show them.

Even though there appears to be the option to show objects for which the stats gather timed out, at least in this version it does not.

Even though there appears to be the option to show objects for which the stats gather timed out, at least in this version it does not.

This situation can get completly out of hand. In the below, pulled out with this script here from one site I have seen, the stats was choking every night. Not only at night, it was now choking every weekend {click on “view plain” to see a better layout} .

-- Martin Widlake
-- show when and what type of multi=object stats gathering runs have occured
set lines 100
col operation form a40 wrap head 'operation(on)'
col target form a1
spool show_auto_stat_runs.lst
select operation||decode(target,null,null,'-'||target) operation
      ,to_char(start_time,'YYMMDD HH24:MI:SS.FF4') start_time
      ,to_char(  end_time,'YYMMDD HH24:MI:SS.FF4') end_time
from dba_optstat_operations
order by start_time desc
clear colu
spool off

operation(on)                            START_TIME                END_TIME
----------------------------------- -------------------- --------------------
gather_database_stats(auto)         090518 06:09:43.3429 090520 00:00:52.7399
gather_database_stats(auto)         090517 22:00:01.0653 090518 06:01:02.1637
gather_database_stats(auto)         090516 22:00:00.5150 090517 06:00:39.3076
gather_database_stats(auto)         090515 22:00:00.4157 090516 06:01:10.6544
gather_database_stats(auto)         090514 22:00:01.6624 090515 06:01:10.3767
gather_database_stats(auto)         090513 22:00:01.0706 090514 06:01:07.7145
gather_database_stats(auto)         090511 06:01:44.2381 090513 00:01:23.3091
gather_database_stats(auto)         090510 22:00:00.3218 090511 06:00:53.7735
gather_database_stats(auto)         090509 22:00:02.4334 090510 06:01:11.2443
gather_database_stats(auto)         090508 22:00:03.2080 090509 06:01:22.6858
gather_database_stats(auto)         090507 22:00:01.4859 090508 06:01:19.1966
gather_database_stats(auto)         090506 22:00:02.1927 090507 06:03:55.9747
gather_database_stats(auto)         090504 06:00:47.3064 090506 00:00:29.0017
gather_database_stats(auto)         090503 22:00:01.0015 090504 06:00:45.7018
gather_schema_stats-BIG_SCHEMA      090503 10:41:16.2605 090503 12:32:33.2488
gather_database_stats(auto)         090502 22:00:00.9199 090503 06:00:47.9868
gather_database_stats(auto)         090501 22:00:02.3382 090502 06:01:24.6070
gather_database_stats(auto)         090530 22:00:01.4914 090501 06:01:31.4369
gather_database_stats(auto)         090529 22:00:01.1076 090530 06:01:14.7856
gather_database_stats(auto)         090527 06:01:01.0286 090529 00:01:16.2678
gather_database_stats(auto)         090526 22:00:00.3628 090527 06:00:52.0203

The Automated stats job was virtually as a standstill as every night and every weekend it tried to do the same things in the same way and failed. So,unlike a human who finds they choke if they try and swallow a pork pie in one go and so stop doing it, Oracle keeps on trying to swallow it whole.


Worse still, it was burning CPU and IO doing a pointless set of partial table analyzes all night and all weekend.

Double Tsch!


(*) This claim that I never said what I said about the stats job is of course a big weasly lie. I did say it was the best feature. I also stick to this, though I think I am clinging on by my fingertips. I think it is the best step forward for smaller, average, not-too-demanding databases, those which tend to have less DBA support. But not the big, complex, highly active ones. I did say that it has problems with them right at the beginning, honest! 🙂

Oracle’s greatest performance feature July 14, 2009

Posted by mwidlake in performance.
Tags: ,

With oracle 10 onwards, you get what I feel is the greatest single step forward for general oracle database performance since…Ohhhh, I give up. I think it is more significant than anything else I’ve seen and I started with Oracle 6.

What is it? It’s the automated statistics gathering job. If you don’t know what this is, I’ll give you a brief summary in a few moments.
If you don’t agree with me, I’d love to know what you would nominate as the best single performance improvement since Oracle 6. {Or even before, for those of you even greyer and more world-weary than I}. This is after all, just an opinion and I’m open to changing my mind.

Now for that promised and brief {fairly brief} description of the automated stats gathering and why I love it {but please do not interpret this as an attempt to stop you telling me you alternatives, I really would like to know}.

By default, every work-day night, an Oracle database that is 10g or higher will spend from 10pm collecting table, index and column stats for you. This stats gathering job will keep going until it has collected all that it thinks it needs or the clock hits 6am, at which point it stops. At the weekend, your busy little databases will work even harder, from midnight Saturday {ie 1 second after Friday ended} to the end of Sunday gathering stats, if need be. Again, the job will stop once it has gathered the stats it thinks your system needs.

The job gathers stats on tables/indexes in all schemas, including SYSTEM, SYS and those other odd internal ones like DBSNMP and WMSYS, as well as all your own schemas. It gathers stats only on tables that have changed by 10% or more since last stats were gathered, or that have been truncated or created since the last run. For each such table, this job will sample a proportion of said table that it thinks is needed to give reliable stats. For each table it also gathers stats on each index {I’ll skim over a slight issue in respect of sample size for indexes}. When a table has it’s stats gathered, Oracle will even make a stab at gathering the correct level a column statistics for each column, based on whether you ever use the column in joins or where clause {ie there is some sense to collecting detailed histogram stats only for those columns it would help}.

Finally in my brief description, and something that not all DBAs and developers appreciate, is that this automatic job stores the stats as they were before it gathered new ones, so you can go back to previous stats if you so wish {and it was not more than a month ago}.

This process, this nightly job, has flaws. Some of it’s decisions can be poor. It can gather stats you might not want it it and it can mess up. BUT! It does run regularly and it does gather generally beneficial stats for all tables, indexes, partitions, sub-partitions, columns. All of them. {mostly} .

Prior to this automatic job, many sites’ object stats were in very poor health.

  • Some sites did not gather table/index stats at all.
  • Many sites gathered stats only occasionally.
  • Some sites would gather stats on a few tables and none on most.

This last situation, stats on some tables none on many others, condemned the cost based optimiser to having to make decisions based on a very poor information and was/is very common under 9. I like to think of it as taking a map of the UK and removing all the roads and then trying to drive from London to Manchester. So long as you keep going generally North and west, you will get there. Eventually. 

{I have a private theory that the number of hints in code is in proportion to the three situations above, the most being on systems with “stats on only a few tables”.}

This is why I think that job is the biggest step forward for performance. It ensures there is at least some information for every table and every index and that for most of them the data is correct to within 10%. Thus it allows the CBO to work with pretty accurate information and all that clever maths has a good chance of working.

Yes, if you know more about stats and performance and your system, you can do better than the automated job alone, but for the majority of sites it is a step forward. Especially for those sites that lack strong DBA/Developer expertise. Ie, the majority :-).

I know from talking to some Oracle support people that their lives have got quieter. I’m told that they get a lot fewer calls about SQL performance from Oracle 10 and 11 systems. Mind you, the usual “fix” of getting the customer to just gather some stats has gone out the window, which is a shame as it fixed most issues. All in all, it has been a massive boon to them.

So, I think it is the greatest step forward.

I also hate it. I hate that automated stats job. It makes silly decisions, it breaks, it is poorly documented and it makes a dog’s dinner of very big, very active or very odd systems. In fact it is rubbish. But very, very, very beneficial rubbish. I love that pile of rubbish.