jump to navigation

Automated statistics gathering silently fails July 20, 2009

Posted by mwidlake in performance.
Tags: ,
16 comments

…Next Post>
Or “The worst Oracle Performance Feature.”

I posted a few days ago about what I feel is the best performance feature in Oracle and cited the Automated Stats Gathering job. A few people did not agree… :-) So, in the best traditions of politicians and lawyers the world over, I would now like to claim I was misunderstood and that is not what I said. I said the opposite…(* see later).

Of the many problems that can occur with the automated statistics job, and the one I am going to pick up on today, is the fact that it can choke on large table or table partition segments {and does so silently at that}. What do I mean by this?

The automatic stats job runs every weekday night, by default between 22:00 hours and 06:00 hours. It appears to have some logic where by it looks at the number of table segments (full tables or partitions within a table) that need to be analysed. It then calculates the amount of time it can spend on each segment. If it has 32 objects to analyse in the 8 hours available, it seems to decide it can afford to spend not 8/32 (1/4) hours on any given segment but something more. It looks to be a couple of hours.

The job also tries to prioritise the most “in need” segments to be considered first. I am not sure what it goes on but I think it is those with the largest percentage change, but it is not just as simple as that.

As a result of the above, this is what seems to happen. The stats job starts up and identifies a massive segment to analyse, say a 200GB table. It starts analysing it, say allowing 2 hours maximum, but after 2 hours it has not analysed the massive table so it gives up.

The next night, it identifies the same massive segment to analyze (now 201GB) and maybe gives it an even higher priority. The job tries to analyse it IN EXACTLY THE SAME WAY as before. Oddly enough {this is sarcasm} it usually fails. The only reason it WILL succeed is if the night before the automated job managed to analyse other segments and fewer new ones need considering tonight, so it can spend more time on each segment. In my experience this is unlikely.

As a result, the object will probably fail to be analysed every night until the weekend, when the automated job gets to spend 2 whole days analysing segments and so can give each one more time. Unless you spot the error yourself and do a manual dbms_stats.gather_xxx statement yourself, with a low sample size that is likely to succeed.

How can you tell this is happening, or has happened? Look at the history of schema/database wide statsistics gathering operations in the table DBA_OPTSTAT_OPERATIONS . If a job ran all night, it probably choked {in the below, I look at table sys.WRI_OPSTAT_OPR – it is the underlying object}:

The highlighted record is an example of where the automated stats job ran all night. It did this because it choked on a large segment. There are 3 other examples of this in this screen shot. Once the job starts choking, it becomes a constant problem!

The highlighted record is an example of where the automated stats job ran all night. It did this because it choked on a large segment. There are 3 other examples of this in this screen shot. Once the job starts choking, it becomes a constant problem!

{You may ask why this is not failing the next night, like I claimed. It was because we were sorting out the failure the next morning, but had not yet properly fixed the problem}

So, for very large segments, the automated stats job can fail to analyse them.

THIS FAILURE IS SILENT.

No error appears in the alert log that I have seen, OEM or grid control does not seem to flag it up anywhere on the system summary or performance monitoring pages.

In OEM you can look at the latest stats run and see how it did. It lies. {Well, last time I checked it lies, I do not currently have access to OEM on a site suffering from the problem and, as you can imagine, clients do not want me to create massive tables just to fake this up for my blog :-) }. OEM has a screen to show the activity of the latest automated stats gathering run but, even though it has the option to look for objects for which there is a timeout, it does not show them.

Even though there appears to be the option to show objects for which the stats gather timed out, at least in this version it does not.

Even though there appears to be the option to show objects for which the stats gather timed out, at least in this version it does not.

This situation can get completly out of hand. In the below, pulled out with this script here from one site I have seen, the stats was choking every night. Not only at night, it was now choking every weekend {click on “view plain” to see a better layout} .

--show_auto_stat_runs
-- Martin Widlake
-- show when and what type of multi=object stats gathering runs have occured
set lines 100
col operation form a40 wrap head 'operation(on)'
col target form a1
spool show_auto_stat_runs.lst
select operation||decode(target,null,null,'-'||target) operation
      ,to_char(start_time,'YYMMDD HH24:MI:SS.FF4') start_time
      ,to_char(  end_time,'YYMMDD HH24:MI:SS.FF4') end_time
from dba_optstat_operations
order by start_time desc
/
clear colu
spool off

operation(on)                            START_TIME                END_TIME
----------------------------------- -------------------- --------------------
gather_database_stats(auto)         090518 06:09:43.3429 090520 00:00:52.7399
gather_database_stats(auto)         090517 22:00:01.0653 090518 06:01:02.1637
gather_database_stats(auto)         090516 22:00:00.5150 090517 06:00:39.3076
gather_database_stats(auto)         090515 22:00:00.4157 090516 06:01:10.6544
gather_database_stats(auto)         090514 22:00:01.6624 090515 06:01:10.3767
gather_database_stats(auto)         090513 22:00:01.0706 090514 06:01:07.7145
gather_database_stats(auto)         090511 06:01:44.2381 090513 00:01:23.3091
gather_database_stats(auto)         090510 22:00:00.3218 090511 06:00:53.7735
gather_database_stats(auto)         090509 22:00:02.4334 090510 06:01:11.2443
gather_database_stats(auto)         090508 22:00:03.2080 090509 06:01:22.6858
gather_database_stats(auto)         090507 22:00:01.4859 090508 06:01:19.1966
gather_database_stats(auto)         090506 22:00:02.1927 090507 06:03:55.9747
gather_database_stats(auto)         090504 06:00:47.3064 090506 00:00:29.0017
gather_database_stats(auto)         090503 22:00:01.0015 090504 06:00:45.7018
gather_schema_stats-BIG_SCHEMA      090503 10:41:16.2605 090503 12:32:33.2488
gather_database_stats(auto)         090502 22:00:00.9199 090503 06:00:47.9868
gather_database_stats(auto)         090501 22:00:02.3382 090502 06:01:24.6070
gather_database_stats(auto)         090530 22:00:01.4914 090501 06:01:31.4369
gather_database_stats(auto)         090529 22:00:01.1076 090530 06:01:14.7856
gather_database_stats(auto)         090527 06:01:01.0286 090529 00:01:16.2678
gather_database_stats(auto)         090526 22:00:00.3628 090527 06:00:52.0203

The Automated stats job was virtually as a standstill as every night and every weekend it tried to do the same things in the same way and failed. So,unlike a human who finds they choke if they try and swallow a pork pie in one go and so stop doing it, Oracle keeps on trying to swallow it whole.

Tsch.

Worse still, it was burning CPU and IO doing a pointless set of partial table analyzes all night and all weekend.

Double Tsch!

 

(*) This claim that I never said what I said about the stats job is of course a big weasly lie. I did say it was the best feature. I also stick to this, though I think I am clinging on by my fingertips. I think it is the best step forward for smaller, average, not-too-demanding databases, those which tend to have less DBA support. But not the big, complex, highly active ones. I did say that it has problems with them right at the beginning, honest! :-)

Follow

Get every new post delivered to your Inbox.

Join 161 other followers