Automated statistics gathering silently fails

Automated statistics gathering silently fails July 20, 2009

Posted by mwidlake in performance.
Tags: performance, statistics
trackback

…Next Post>
Or “The worst Oracle Performance Feature.”

I posted a few days ago about what I feel is the best performance feature in Oracle and cited the Automated Stats Gathering job. A few people did not agree… 🙂 So, in the best traditions of politicians and lawyers the world over, I would now like to claim I was misunderstood and that is not what I said. I said the opposite…(* see later).

Of the many problems that can occur with the automated statistics job, and the one I am going to pick up on today, is the fact that it can choke on large table or table partition segments {and does so silently at that}. What do I mean by this?

The automatic stats job runs every weekday night, by default between 22:00 hours and 06:00 hours. It appears to have some logic where by it looks at the number of table segments (full tables or partitions within a table) that need to be analysed. It then calculates the amount of time it can spend on each segment. If it has 32 objects to analyse in the 8 hours available, it seems to decide it can afford to spend not 8/32 (1/4) hours on any given segment but something more. It looks to be a couple of hours.

The job also tries to prioritise the most “in need” segments to be considered first. I am not sure what it goes on but I think it is those with the largest percentage change, but it is not just as simple as that.

As a result of the above, this is what seems to happen. The stats job starts up and identifies a massive segment to analyse, say a 200GB table. It starts analysing it, say allowing 2 hours maximum, but after 2 hours it has not analysed the massive table so it gives up.

The next night, it identifies the same massive segment to analyze (now 201GB) and maybe gives it an even higher priority. The job tries to analyse it IN EXACTLY THE SAME WAY as before. Oddly enough {this is sarcasm} it usually fails. The only reason it WILL succeed is if the night before the automated job managed to analyse other segments and fewer new ones need considering tonight, so it can spend more time on each segment. In my experience this is unlikely.

As a result, the object will probably fail to be analysed every night until the weekend, when the automated job gets to spend 2 whole days analysing segments and so can give each one more time. Unless you spot the error yourself and do a manual dbms_stats.gather_xxx statement yourself, with a low sample size that is likely to succeed.

How can you tell this is happening, or has happened? Look at the history of schema/database wide statsistics gathering operations in the table DBA_OPTSTAT_OPERATIONS . If a job ran all night, it probably choked {in the below, I look at table sys.WRI_OPSTAT_OPR – it is the underlying object}:

The highlighted record is an example of where the automated stats job ran all night. It did this because it choked on a large segment. There are 3 other examples of this in this screen shot. Once the job starts choking, it becomes a constant problem!

{You may ask why this is not failing the next night, like I claimed. It was because we were sorting out the failure the next morning, but had not yet properly fixed the problem}

So, for very large segments, the automated stats job can fail to analyse them.

THIS FAILURE IS SILENT.

No error appears in the alert log that I have seen, OEM or grid control does not seem to flag it up anywhere on the system summary or performance monitoring pages.

In OEM you can look at the latest stats run and see how it did. It lies. {Well, last time I checked it lies, I do not currently have access to OEM on a site suffering from the problem and, as you can imagine, clients do not want me to create massive tables just to fake this up for my blog 🙂 }. OEM has a screen to show the activity of the latest automated stats gathering run but, even though it has the option to look for objects for which there is a timeout, it does not show them.

Even though there appears to be the option to show objects for which the stats gather timed out, at least in this version it does not.

This situation can get completly out of hand. In the below, pulled out with this script here from one site I have seen, the stats was choking every night. Not only at night, it was now choking every weekend {click on “view plain” to see a better layout} .

--show_auto_stat_runs
-- Martin Widlake
-- show when and what type of multi=object stats gathering runs have occured
set lines 100
col operation form a40 wrap head 'operation(on)'
col target form a1
spool show_auto_stat_runs.lst
select operation||decode(target,null,null,'-'||target) operation
      ,to_char(start_time,'YYMMDD HH24:MI:SS.FF4') start_time
      ,to_char(  end_time,'YYMMDD HH24:MI:SS.FF4') end_time
from dba_optstat_operations
order by start_time desc
/
clear colu
spool off

operation(on)                            START_TIME                END_TIME
----------------------------------- -------------------- --------------------
gather_database_stats(auto)         090518 06:09:43.3429 090520 00:00:52.7399
gather_database_stats(auto)         090517 22:00:01.0653 090518 06:01:02.1637
gather_database_stats(auto)         090516 22:00:00.5150 090517 06:00:39.3076
gather_database_stats(auto)         090515 22:00:00.4157 090516 06:01:10.6544
gather_database_stats(auto)         090514 22:00:01.6624 090515 06:01:10.3767
gather_database_stats(auto)         090513 22:00:01.0706 090514 06:01:07.7145
gather_database_stats(auto)         090511 06:01:44.2381 090513 00:01:23.3091
gather_database_stats(auto)         090510 22:00:00.3218 090511 06:00:53.7735
gather_database_stats(auto)         090509 22:00:02.4334 090510 06:01:11.2443
gather_database_stats(auto)         090508 22:00:03.2080 090509 06:01:22.6858
gather_database_stats(auto)         090507 22:00:01.4859 090508 06:01:19.1966
gather_database_stats(auto)         090506 22:00:02.1927 090507 06:03:55.9747
gather_database_stats(auto)         090504 06:00:47.3064 090506 00:00:29.0017
gather_database_stats(auto)         090503 22:00:01.0015 090504 06:00:45.7018
gather_schema_stats-BIG_SCHEMA      090503 10:41:16.2605 090503 12:32:33.2488
gather_database_stats(auto)         090502 22:00:00.9199 090503 06:00:47.9868
gather_database_stats(auto)         090501 22:00:02.3382 090502 06:01:24.6070
gather_database_stats(auto)         090530 22:00:01.4914 090501 06:01:31.4369
gather_database_stats(auto)         090529 22:00:01.1076 090530 06:01:14.7856
gather_database_stats(auto)         090527 06:01:01.0286 090529 00:01:16.2678
gather_database_stats(auto)         090526 22:00:00.3628 090527 06:00:52.0203

The Automated stats job was virtually as a standstill as every night and every weekend it tried to do the same things in the same way and failed. So,unlike a human who finds they choke if they try and swallow a pork pie in one go and so stop doing it, Oracle keeps on trying to swallow it whole.

Tsch.

Worse still, it was burning CPU and IO doing a pointless set of partial table analyzes all night and all weekend.

Double Tsch!

(*) This claim that I never said what I said about the stats job is of course a big weasly lie. I did say it was the best feature. I also stick to this, though I think I am clinging on by my fingertips. I think it is the best step forward for smaller, average, not-too-demanding databases, those which tend to have less DBA support. But not the big, complex, highly active ones. I did say that it has problems with them right at the beginning, honest! 🙂

Comments»

1. coskan - July 20, 2009

Very handy finding. I never thought it worked like this (Probably I never needed to investigate)

Thanks for sharing Martin.

2. coskan - July 20, 2009

Do you think that timeout indicator is in FLAGS columns of select * from “SYS”.”WRI$_OPTSTAT_OPR”;

Maybe you can check it at your clients site. What I see is always null but maybe you can see something different

mwidlake - July 21, 2009

Hi Coskan,

I don’t think I’ve ever seen anything in column FLAGS or SPARE1 to SPARE6 (and I just checked on a site which had the problem). But I like the way you are thinking 🙂

There IS a way to spot what the stats job choked on, I’ll post today or tomorrow, when I get chance from the day job

3. Noons - July 20, 2009

“So,unlike a human who finds they choke if they try and swallow a pork pie in one go and so stop doing it, Oracle keeps on trying to swallow it whole”

One of the first descriptions I heard of computers and what they do, many years ago, went like this: they are very good at repeating the same mistake, very very fast.
They still do!

4. Doug Burns - July 21, 2009

Trust me, Martin, everyone noticed your caveats right at the start.

As for this post, nice illustration of exactly what hit us the other week (I think) and screwed up an entire overnight batch. Happens once in a while, apparently 😉

5. mwidlake - July 21, 2009

Hi Noons,

“Nothing screws things up as quickly as a computer can” 🙂

Doug,
do you think I get a bit defensive at times? 😉
I’d be interested to know if the stats job choked the next night and whether you had stepped in with manual stats gathering or not.

6. Doug Burns - July 21, 2009

We used this query to identify the problem

select to_char(log_date, ‘DD/MM/YY HH24:MI’), status, additional_info
from dba_scheduler_job_log
where job_name=’GATHER_STATS_JOB’
order by log_date desc

but the situation is complicated in our case because we have both the auto stats gather job enabled (which people were unaware of) and the ETL batch gathers stats too, so the ETL batch took care of it. We set the AUTOSTATS_TARGET to ORACLE before the next night.

7. mwidlake - July 21, 2009

Nice little query Doug.
In a bizzare coincidence, the most recent case I looked at was also complicated in exactly the same way – DBA team collecting stats with their own jobs and unaware that the automated stats was running. They were also gathering stats on external tables like crazy, to support data loads, so the entires from that into dba_tab_stats history confused the picture somewhat.

8. B. Polarski - July 22, 2009

There is also dba_scheduler_job_run_details whic offer an additional column error# whose value I am sure exists somewhere yet to fe found. I don’t have faulty execution so I am curious if you have something in error# and addtional_info.

Also and you could add a second query to identify user defined gather stats to complete the picture

col operation for a40
col target for a26
col start_time for a26
col end_time for a26
set head on lines 190
col job_name for a25
col cpu_used for a20
col run_duration for a20
col additional_info for a40
col status for a16
col error# for 9999999
select job_name, to_char(actual_START_DATE,’YYYY-MM-DD HH24:MI:SS’) req_start_date
,CPU_USED,RUN_DURATION,STATUS,error#, additional_info
from DBA_SCHEDULER_JOB_RUN_DETAILS
where job_name = ‘GATHER_STATS_JOB’ order by 2 desc
/
prompt Additional gather statistics run
prompt
select operation, target,
to_char(start_time,’YYYY-MM-DD HH24:MI:SS.FF4′) start_time,
to_char( end_time,’YYYY-MM-DD HH24:MI:SS.FF4′) end_time
from dba_optstat_operations where operation != ‘gather_database_stats(auto)’
order by start_time desc ;

mwidlake - July 22, 2009

Nice idea. However, the error# is set ot 0 and the reason is set to “stop job called because associated window was closed”.
My original script on DBA_OPSTATS_OPERATIONS does already pull out any user-executed schema or database stats gathering runs, becasue I simply don’t exclude them. There is one shown in the example above 🙂

9. Automatic Statistics Gathering Fails #3 « Martin Widlake’s Yet Another Oracle Blog - July 29, 2009

[…] minutes from my lunch time to finish off this thread on the automated statistics gathering failing. <<first post … <second […]

10. Automated Statistics Gathering Silently Fails #2 « Martin Widlake’s Yet Another Oracle Blog - February 20, 2010

[…] July 23, 2009 Posted by mwidlake in performance. Tags: performance, statistics trackback <previous post…next […]

11. Mani - May 7, 2011

What if I do a manual stats for all objects onetime and then enable the job?

mwidlake - May 15, 2011

Hi Mani. Unfortunately that will not cure the issue but would give you a breathing space. By doing a manual gather you will avoid the automated job from gathering stats on any table segment and it’s indexes until the segment or whole table has grown by 10% (NB this is under V10, under 11 you can alter this percentage). However, as soon as a segment or table that is large enough to cause the job to choke has changed by 10% or more, the job WILL choke – as the method_opt and sample size is determined by the automated job in the same way each time.
Of course, it could take a long time to manually gather stats on all objects at a level that is suitable for the CBO to make good decisions…

12. Victor - September 10, 2011

Joining this very late but I’m having problems with this but cannot identify th table causing th problem. I see above you say “There IS a way to spot what the stats job choked on, I’ll post today or tomorrow, when I get chance from the day job” but I don’t think you ever did! Be grateful if you could

mwidlake - September 10, 2011

Hi Victor,

I’m always happy to see comments on old postings, it lets me know people still find them useful.

Hmmm, now what was I going to say about spotting the table you are choking on?

13. John Hallas - September 30, 2015

Martin, I know the blog is quite old now so this is probably a feature that has appeared since but the following script indicates when the auto gather stats has not completed it’s work by the end of the maintenance window.
There still doesn’t seem to be an easy way of finding where it has got to. That is something I am looking at currently.

select client_name,window_name,  job_info from DBA_AUTOTASK_JOB_HISTORY order by window_start_time

CLIENT_NAME          WINDOW_NAME          JOB_INFO
-------------------- -------------------- ---------------------------------------------------------------------------

auto optimizer stats WEDNESDAY_WINDOW     REASON="Stop job called because associated window was closed"
 collection

auto optimizer stats THURSDAY_WINDOW      REASON="Stop job called because associated window was closed"
 collection

auto optimizer stats FRIDAY_WINDOW        REASON="Stop job called because associated window was closed"
 collection
auto optimizer stats SATURDAY_WINDOW      REASON="Stop job called because associated window was closed"
 collection

auto optimizer stats SUNDAY_WINDOW        REASON="Stop job called because associated window was closed"

14. mwidlake - September 30, 2015

Thanks for that John.
It’s a while now since I had to track down the object the stats job was choking on, but the next posts in the series show how you can list what was analyzed during the run via LAST_ANALYZED on DBA_TABLES and also look at DBA_TAB_MODIFICATIONS to find the segments that are large and have over 10% change (the default percantage to be counted as STALE, but Oracle now allows you to state that at and object and system level so you can be caught out). Anything large that is staying stale is probably your problem.

I have also just logged in during the stats gather to see what is being analyzed and caught it that way, but that involves being awake at the time!

A final thing you can do is call DBMS_STATS.GATHER_DATABASE_STATS with OPTIONS set to to LIST AUTO and OBJLIST set to an array of the right type OBJECTTAB. This will give you a list of the segments that will be analysed, in the order they will be processed (I THINK – I might be wrong). The first big one that has been stale for a while is probably the one to fix.

15. Rene Jeruschkat - October 14, 2016

I am a little late here, but the content is still good. This is a third way on how to found out what the stats job spent its time. (If you have the tuning/diagnostics pack):

–1. see whether we are hit by this problem
select client_name,window_name,job_start_time,job_duration,job_name,job_status from DBA_AUTOTASK_JOB_HISTORY
where client_name = ‘auto optimizer stats collection’
and job_duration > INTERVAL ‘3:50’ HOUR TO MINUTE
and job_status = ‘STOPPED’
order by window_start_time desc;

–2. get the session and serial id from the run details
select log_date,job_name,actual_start_date,run_duration,session_id,cpu_used from SYS.DBA_SCHEDULER_JOB_RUN_DETAILS where job_name in (
select job_name from DBA_AUTOTASK_JOB_HISTORY where client_name = ‘auto optimizer stats collection’ and job_status = ‘STOPPED’ and job_duration > INTERVAL ‘3:50′ HOUR TO MINUTE
) order by log_date desc;

–3. get the sql statements this session ran from ASH, needs diagnostics/tuning pack
select sql_id,count(*) from v$active_session_history /*DBA_HIST_ACTIVE_SESS_HISTORY*/ where session_id = 466 and session_serial# = 64717
and sample_time
between trunc(sysdate) -1 + 22/24 and trunc(sysdate) + 2/24 –last night 22:00 to 02:00
and session_id||’,’||session_serial# = (
select session_id from (
select actual_start_date,session_id,row_number() over (order by actual_start_date desc) rn from SYS.DBA_SCHEDULER_JOB_RUN_DETAILS where job_name in (
select job_name from DBA_AUTOTASK_JOB_HISTORY where client_name = ‘auto optimizer stats collection’ and job_status = ‘STOPPED’ and job_duration > INTERVAL ‘3:50′ HOUR TO MINUTE
) and actual_start_date > sysdate -2 –assumes that the job runs each day and looks at the last 2 days
) where rn = 1
)
group by sql_id
order by 2 desc;

–4. check the sqltext the session ran, it probably contains the tablename
select * from dba_hist_sqltext where sql_id = ’98p2az1f8k9u4’;

mwidlake - October 14, 2016

Thanks Rene, always good to add new info

Martin Widlake's Yet Another Oracle Blog