jump to navigation

Automated statistics gathering silently fails July 20, 2009

Posted by mwidlake in performance.
Tags: ,
trackback

…Next Post>
Or “The worst Oracle Performance Feature.”

I posted a few days ago about what I feel is the best performance feature in Oracle and cited the Automated Stats Gathering job. A few people did not agree… :-) So, in the best traditions of politicians and lawyers the world over, I would now like to claim I was misunderstood and that is not what I said. I said the opposite…(* see later).

Of the many problems that can occur with the automated statistics job, and the one I am going to pick up on today, is the fact that it can choke on large table or table partition segments {and does so silently at that}. What do I mean by this?

The automatic stats job runs every weekday night, by default between 22:00 hours and 06:00 hours. It appears to have some logic where by it looks at the number of table segments (full tables or partitions within a table) that need to be analysed. It then calculates the amount of time it can spend on each segment. If it has 32 objects to analyse in the 8 hours available, it seems to decide it can afford to spend not 8/32 (1/4) hours on any given segment but something more. It looks to be a couple of hours.

The job also tries to prioritise the most “in need” segments to be considered first. I am not sure what it goes on but I think it is those with the largest percentage change, but it is not just as simple as that.

As a result of the above, this is what seems to happen. The stats job starts up and identifies a massive segment to analyse, say a 200GB table. It starts analysing it, say allowing 2 hours maximum, but after 2 hours it has not analysed the massive table so it gives up.

The next night, it identifies the same massive segment to analyze (now 201GB) and maybe gives it an even higher priority. The job tries to analyse it IN EXACTLY THE SAME WAY as before. Oddly enough {this is sarcasm} it usually fails. The only reason it WILL succeed is if the night before the automated job managed to analyse other segments and fewer new ones need considering tonight, so it can spend more time on each segment. In my experience this is unlikely.

As a result, the object will probably fail to be analysed every night until the weekend, when the automated job gets to spend 2 whole days analysing segments and so can give each one more time. Unless you spot the error yourself and do a manual dbms_stats.gather_xxx statement yourself, with a low sample size that is likely to succeed.

How can you tell this is happening, or has happened? Look at the history of schema/database wide statsistics gathering operations in the table DBA_OPTSTAT_OPERATIONS . If a job ran all night, it probably choked {in the below, I look at table sys.WRI_OPSTAT_OPR – it is the underlying object}:

The highlighted record is an example of where the automated stats job ran all night. It did this because it choked on a large segment. There are 3 other examples of this in this screen shot. Once the job starts choking, it becomes a constant problem!

The highlighted record is an example of where the automated stats job ran all night. It did this because it choked on a large segment. There are 3 other examples of this in this screen shot. Once the job starts choking, it becomes a constant problem!

{You may ask why this is not failing the next night, like I claimed. It was because we were sorting out the failure the next morning, but had not yet properly fixed the problem}

So, for very large segments, the automated stats job can fail to analyse them.

THIS FAILURE IS SILENT.

No error appears in the alert log that I have seen, OEM or grid control does not seem to flag it up anywhere on the system summary or performance monitoring pages.

In OEM you can look at the latest stats run and see how it did. It lies. {Well, last time I checked it lies, I do not currently have access to OEM on a site suffering from the problem and, as you can imagine, clients do not want me to create massive tables just to fake this up for my blog :-) }. OEM has a screen to show the activity of the latest automated stats gathering run but, even though it has the option to look for objects for which there is a timeout, it does not show them.

Even though there appears to be the option to show objects for which the stats gather timed out, at least in this version it does not.

Even though there appears to be the option to show objects for which the stats gather timed out, at least in this version it does not.

This situation can get completly out of hand. In the below, pulled out with this script here from one site I have seen, the stats was choking every night. Not only at night, it was now choking every weekend {click on “view plain” to see a better layout} .

--show_auto_stat_runs
-- Martin Widlake
-- show when and what type of multi=object stats gathering runs have occured
set lines 100
col operation form a40 wrap head 'operation(on)'
col target form a1
spool show_auto_stat_runs.lst
select operation||decode(target,null,null,'-'||target) operation
      ,to_char(start_time,'YYMMDD HH24:MI:SS.FF4') start_time
      ,to_char(  end_time,'YYMMDD HH24:MI:SS.FF4') end_time
from dba_optstat_operations
order by start_time desc
/
clear colu
spool off

operation(on)                            START_TIME                END_TIME
----------------------------------- -------------------- --------------------
gather_database_stats(auto)         090518 06:09:43.3429 090520 00:00:52.7399
gather_database_stats(auto)         090517 22:00:01.0653 090518 06:01:02.1637
gather_database_stats(auto)         090516 22:00:00.5150 090517 06:00:39.3076
gather_database_stats(auto)         090515 22:00:00.4157 090516 06:01:10.6544
gather_database_stats(auto)         090514 22:00:01.6624 090515 06:01:10.3767
gather_database_stats(auto)         090513 22:00:01.0706 090514 06:01:07.7145
gather_database_stats(auto)         090511 06:01:44.2381 090513 00:01:23.3091
gather_database_stats(auto)         090510 22:00:00.3218 090511 06:00:53.7735
gather_database_stats(auto)         090509 22:00:02.4334 090510 06:01:11.2443
gather_database_stats(auto)         090508 22:00:03.2080 090509 06:01:22.6858
gather_database_stats(auto)         090507 22:00:01.4859 090508 06:01:19.1966
gather_database_stats(auto)         090506 22:00:02.1927 090507 06:03:55.9747
gather_database_stats(auto)         090504 06:00:47.3064 090506 00:00:29.0017
gather_database_stats(auto)         090503 22:00:01.0015 090504 06:00:45.7018
gather_schema_stats-BIG_SCHEMA      090503 10:41:16.2605 090503 12:32:33.2488
gather_database_stats(auto)         090502 22:00:00.9199 090503 06:00:47.9868
gather_database_stats(auto)         090501 22:00:02.3382 090502 06:01:24.6070
gather_database_stats(auto)         090530 22:00:01.4914 090501 06:01:31.4369
gather_database_stats(auto)         090529 22:00:01.1076 090530 06:01:14.7856
gather_database_stats(auto)         090527 06:01:01.0286 090529 00:01:16.2678
gather_database_stats(auto)         090526 22:00:00.3628 090527 06:00:52.0203

The Automated stats job was virtually as a standstill as every night and every weekend it tried to do the same things in the same way and failed. So,unlike a human who finds they choke if they try and swallow a pork pie in one go and so stop doing it, Oracle keeps on trying to swallow it whole.

Tsch.

Worse still, it was burning CPU and IO doing a pointless set of partial table analyzes all night and all weekend.

Double Tsch!

 

(*) This claim that I never said what I said about the stats job is of course a big weasly lie. I did say it was the best feature. I also stick to this, though I think I am clinging on by my fingertips. I think it is the best step forward for smaller, average, not-too-demanding databases, those which tend to have less DBA support. But not the big, complex, highly active ones. I did say that it has problems with them right at the beginning, honest! :-)

About these ads

Comments»

1. coskan - July 20, 2009

Very handy finding. I never thought it worked like this (Probably I never needed to investigate)

Thanks for sharing Martin.

2. coskan - July 20, 2009

Do you think that timeout indicator is in FLAGS columns of select * from “SYS”.”WRI$_OPTSTAT_OPR”;

Maybe you can check it at your clients site. What I see is always null but maybe you can see something different

mwidlake - July 21, 2009

Hi Coskan,

I don’t think I’ve ever seen anything in column FLAGS or SPARE1 to SPARE6 (and I just checked on a site which had the problem). But I like the way you are thinking :-)

There IS a way to spot what the stats job choked on, I’ll post today or tomorrow, when I get chance from the day job

3. Noons - July 20, 2009

“So,unlike a human who finds they choke if they try and swallow a pork pie in one go and so stop doing it, Oracle keeps on trying to swallow it whole”

One of the first descriptions I heard of computers and what they do, many years ago, went like this: they are very good at repeating the same mistake, very very fast.
They still do!

4. Doug Burns - July 21, 2009

Trust me, Martin, everyone noticed your caveats right at the start.

As for this post, nice illustration of exactly what hit us the other week (I think) and screwed up an entire overnight batch. Happens once in a while, apparently ;-)

5. mwidlake - July 21, 2009

Hi Noons,

“Nothing screws things up as quickly as a computer can” :-)

Doug,
do you think I get a bit defensive at times? ;-)
I’d be interested to know if the stats job choked the next night and whether you had stepped in with manual stats gathering or not.

6. Doug Burns - July 21, 2009

We used this query to identify the problem

select to_char(log_date, ‘DD/MM/YY HH24:MI’), status, additional_info
from dba_scheduler_job_log
where job_name=’GATHER_STATS_JOB’
order by log_date desc

but the situation is complicated in our case because we have both the auto stats gather job enabled (which people were unaware of) and the ETL batch gathers stats too, so the ETL batch took care of it. We set the AUTOSTATS_TARGET to ORACLE before the next night.

7. mwidlake - July 21, 2009

Nice little query Doug.
In a bizzare coincidence, the most recent case I looked at was also complicated in exactly the same way – DBA team collecting stats with their own jobs and unaware that the automated stats was running. They were also gathering stats on external tables like crazy, to support data loads, so the entires from that into dba_tab_stats history confused the picture somewhat.

8. B. Polarski - July 22, 2009

There is also dba_scheduler_job_run_details whic offer an additional column error# whose value I am sure exists somewhere yet to fe found. I don’t have faulty execution so I am curious if you have something in error# and addtional_info.

Also and you could add a second query to identify user defined gather stats to complete the picture

col operation for a40
col target for a26
col start_time for a26
col end_time for a26
set head on lines 190
col job_name for a25
col cpu_used for a20
col run_duration for a20
col additional_info for a40
col status for a16
col error# for 9999999
select job_name, to_char(actual_START_DATE,’YYYY-MM-DD HH24:MI:SS’) req_start_date
,CPU_USED,RUN_DURATION,STATUS,error#, additional_info
from DBA_SCHEDULER_JOB_RUN_DETAILS
where job_name = ‘GATHER_STATS_JOB’ order by 2 desc
/
prompt Additional gather statistics run
prompt
select operation, target,
to_char(start_time,’YYYY-MM-DD HH24:MI:SS.FF4′) start_time,
to_char( end_time,’YYYY-MM-DD HH24:MI:SS.FF4′) end_time
from dba_optstat_operations where operation != ‘gather_database_stats(auto)’
order by start_time desc ;

mwidlake - July 22, 2009

Nice idea. However, the error# is set ot 0 and the reason is set to “stop job called because associated window was closed”.
My original script on DBA_OPSTATS_OPERATIONS does already pull out any user-executed schema or database stats gathering runs, becasue I simply don’t exclude them. There is one shown in the example above :-)

9. Automatic Statistics Gathering Fails #3 « Martin Widlake’s Yet Another Oracle Blog - July 29, 2009

[…] minutes from my lunch time to finish off this thread on the automated statistics gathering failing. <<first post … <second […]

10. Automated Statistics Gathering Silently Fails #2 « Martin Widlake’s Yet Another Oracle Blog - February 20, 2010

[…] July 23, 2009 Posted by mwidlake in performance. Tags: performance, statistics trackback <previous post…next […]

11. Mani - May 7, 2011

What if I do a manual stats for all objects onetime and then enable the job?

mwidlake - May 15, 2011

Hi Mani. Unfortunately that will not cure the issue but would give you a breathing space. By doing a manual gather you will avoid the automated job from gathering stats on any table segment and it’s indexes until the segment or whole table has grown by 10% (NB this is under V10, under 11 you can alter this percentage). However, as soon as a segment or table that is large enough to cause the job to choke has changed by 10% or more, the job WILL choke – as the method_opt and sample size is determined by the automated job in the same way each time.
Of course, it could take a long time to manually gather stats on all objects at a level that is suitable for the CBO to make good decisions…

12. Victor - September 10, 2011

Joining this very late but I’m having problems with this but cannot identify th table causing th problem. I see above you say “There IS a way to spot what the stats job choked on, I’ll post today or tomorrow, when I get chance from the day job” but I don’t think you ever did! Be grateful if you could

mwidlake - September 10, 2011

Hi Victor,

I’m always happy to see comments on old postings, it lets me know people still find them useful.

Hmmm, now what was I going to say about spotting the table you are choking on?


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

Join 171 other followers

%d bloggers like this: