Automatic Statistics Gathering Fails #3 July 29, 2009
Posted by mwidlake in internals, performance.Tags: data dictionary, performance, statistics
trackback
I’m steeling a few minutes from my lunch time to finish off this thread on the automated statistics gathering failing.
<<first post … <second post
There is another way to spot what the automated stats gathering choked on but it is not that reliable. But it does involve digging a little into Oracle’s internals, which is fun.
Under 10.2 onwards, when you gather stats on a segment, either directly or via a schema, database or the automatic stats collection, Oracle stores the stats that are replaced for you automatically {I say 10.2 as I am not sure that segment-level gather_table/index_stats are automatically backed up like this. I have some old notes saying it seemed not but I am not sure if this was 9i, 10.1 or me just getting confused at the time. If anyone knows or can check back, let me know 🙂 }
This means you can get the stats back using the dbma_stats.restore_xxxxxxxx_stats procedures. eg dbms_stats.restore_table_stats(ownname=>user,tabname=>’TEST_P’,as_of_timestamp=>systimestamp-1);
This will restore the stats of my table TEST_P to what they were at this time yesterday. You do not need to create a stattab table and store the prevous stats manually.
I’m not going to blog any more right now about this handy feature, the “PL/SQL packages and types” manual will tell you what you need, but I will comment that by default you can only go back 31 days.
Oracle gives you a table to see the history of stats stored, DBA_TAB_STATS_HIST
desc dba_tab_stats_history Name Null? Type ----------------------------------------- -------- ---------------------------- OWNER VARCHAR2(30) TABLE_NAME VARCHAR2(30) PARTITION_NAME VARCHAR2(30) SUBPARTITION_NAME VARCHAR2(30) STATS_UPDATE_TIME TIMESTAMP(6) WITH TIME ZONE select owner,table_name, partition_name,stats_update_time from dba_tab_stats_history where table_name='TEST_P' and owner=user OWNER TABLE_NAME PARTITION_NAME STATS_UPDATE_TIME -------- ---------- --------------- -------------------------------- WIDLAKEM TEST_P 28-JUL-09 13.59.43.647785 +00:00 WIDLAKEM TEST_P ID_10K 28-JUL-09 13.59.43.647785 +00:00 WIDLAKEM TEST_P ID_30K 28-JUL-09 13.59.43.647785 +00:00 WIDLAKEM TEST_P ID_MAX 28-JUL-09 13.59.43.647785 +00:00 WIDLAKEM TEST_P ID_40K 28-JUL-09 13.59.43.647785 +00:00 WIDLAKEM TEST_P ID_20K 28-JUL-09 13.59.43.647785 +00:00 WIDLAKEM TEST_P 28-JUL-09 13.37.42.043557 +00:00 WIDLAKEM TEST_P ID_MAX 28-JUL-09 13.37.42.043557 +00:00 WIDLAKEM TEST_P ID_40K 28-JUL-09 13.37.42.043557 +00:00 WIDLAKEM TEST_P ID_30K 28-JUL-09 13.37.42.043557 +00:00 WIDLAKEM TEST_P ID_20K 28-JUL-09 13.37.42.043557 +00:00 WIDLAKEM TEST_P ID_10K 28-JUL-09 13.37.42.043557 +00:00 WIDLAKEM TEST_P 28-JUL-09 09.11.40.098445 +00:00 WIDLAKEM TEST_P ID_MAX 28-JUL-09 09.11.40.098445 +00:00 WIDLAKEM TEST_P ID_40K 28-JUL-09 09.11.40.098445 +00:00 WIDLAKEM TEST_P ID_30K 28-JUL-09 09.11.40.098445 +00:00 WIDLAKEM TEST_P ID_20K 28-JUL-09 09.11.40.098445 +00:00 WIDLAKEM TEST_P ID_10K 28-JUL-09 09.11.40.098445 +00:00
You can see that I probably gathered stats on my test table 3 times yesterday, each time I gathered at a granularity of all, ie global and partition stats. Note also, all partitions and the table get the same timestamp. I think this is because oracle records the timestamp as when the stats for the set of segments was swapped into the data dictionary {note, not started}.
That’s quite useful. However, there is no similar view for indexes. That’s a shame. But you can recover index stats and this view is saying nothing about the stats as they were?…Is there more to be found?… Let’s go see what that view is looking it
@vw_txt Enter value for vw_name: dba_tab_stats_history OWNER VIEW_NAME TEXT_LENGTH -------- ------------------------------ ----------- TEXT -------------------------------------------------------------------------- SYS DBA_TAB_STATS_HISTORY 876 select u.name, o.name, null, null, h.savtime from sys.user$ u, sys.obj$ o, sys.wri$_optstat_tab_history h where h.obj# = o.obj# and o.type# = 2 and o.owner# = u.user# union all -- partitions select u.name, o.name, o.subname, null, h.savtime from sys.user$ u, sys.obj$ o, sys.wri$_optstat_tab_history h where h.obj# = o.obj# and o.type# = 19 and o.owner# = u.user# union all -- sub partitions select u.name, osp.name, ocp.subname, osp.subname, h.savtime from sys.user$ u, sys.obj$ osp, obj$ ocp, sys.tabsubpart$ tsp, sys.wri$_optstat_tab_history h where h.obj# = osp.obj# and osp.type# = 34 and osp.obj# = tsp.obj# and tsp.pobj# = ocp.obj# and osp.owner# = u.user# union all -- fixed tables select 'SYS', t.kqftanam, null, null, h.savtime from sys.x$kqfta t, sys.wri$_optstat_tab_history h where t.kqftaobj = h.obj#
Now that is interesting. Let’s go look at that table:
desc sys.wri$_optstat_tab_history Name Null? Type ----------------------------------------- -------- --------------------------- OBJ# NOT NULL NUMBER SAVTIME TIMESTAMP(6) WITH TIME ZONE FLAGS NUMBER ROWCNT NUMBER BLKCNT NUMBER AVGRLN NUMBER SAMPLESIZE NUMBER ANALYZETIME DATE CACHEDBLK NUMBER CACHEHIT NUMBER LOGICALREAD NUMBER SPARE1 NUMBER SPARE2 NUMBER SPARE3 NUMBER SPARE4 VARCHAR2(1000) SPARE5 VARCHAR2(1000) SPARE6 TIMESTAMP(6) WITH TIME ZONE
Some of those columns look interesting. Unforturnately CACHEDBLK onwards are empty, even in version 11 {but it shows that Oracle have built in the ability to use those yet-to-be used columns about average cached blocks and cache hit ratios for segments you might have spotted in eg DBA_TAB_STATISTICS}.
Could there be an index version? Of course there is :-
desc sys.wri$_optstat_ind_history Name Null? Type ----------------------------------------- -------- --------------------------- OBJ# NOT NULL NUMBER SAVTIME TIMESTAMP(6) WITH TIME ZONE FLAGS NUMBER ROWCNT NUMBER BLEVEL NUMBER LEAFCNT NUMBER DISTKEY NUMBER LBLKKEY NUMBER DBLKKEY NUMBER CLUFAC NUMBER SAMPLESIZE NUMBER ANALYZETIME DATE GUESSQ NUMBER CACHEDBLK NUMBER CACHEHIT NUMBER LOGICALREAD NUMBER SPARE1 NUMBER SPARE2 NUMBER SPARE3 NUMBER SPARE4 VARCHAR2(1000) SPARE5 VARCHAR2(1000) SPARE6 TIMESTAMP(6) WITH TIME ZONE
I’ve written a couple of scripts to extract data from these underlying tables, with the extra data you can grab. Use these links to download the table script and the index script.
Here below is the index script though:-
-- show_ish_full -- Martin Widlake 14/4/08 -- this gets the index stats history - there is no index equiv of dba_tab_stats_hist -- that I can see in 10.2 (or 11) -- col owner form a12 col ind_full_name form a40 col stat_upd_time form a15 col blevel form 99 head bl select owner,ind_full_name, to_char(stat_upd_time,'MMDD HH24:MI:SS') stat_upd_time ,numrows,numleafs,blevel,dist_kys,lf_p_ky, dbl_p_ky,clufac,samp_size,obj_hash ,greatest(nvl(samp_size,1),1)/greatest(nvl(numrows,1),1) samp_pct from ( select u.name owner, o.name ind_name,o.name ind_full_name , h.savtime stat_upd_time ,h.rowcnt numrows,h.leafcnt numleafs, h.blevel blevel,h.distkey dist_kys ,h.lblkkey lf_p_ky,h.dblkkey dbl_p_ky,h.clufac clufac, h.samplesize samp_size,h.obj# obj_hash from sys.user$ u, sys.obj$ o, sys.wri$_optstat_ind_history h where h.obj# = o.obj# and o.type# = 1 and o.owner# = u.user# union all -- partitions select u.name owner, o.name ind_name, o.name||'-'||o.subname ind_full_name , h.savtime stat_upd_time ,h.rowcnt numrows,h.leafcnt numleafs, h.blevel blevel,h.distkey dist_kys ,h.lblkkey lf_p_ky,h.dblkkey dbl_p_ky,h.clufac clufac, h.samplesize samp_size,h.obj# obj_hash from sys.user$ u, sys.obj$ o, sys.wri$_optstat_ind_history h where h.obj# = o.obj# and o.type# = 20 and o.owner# = u.user# union all -- sub partitions select u.name owner, osp.name ind_name, osp.name||'-'||ocp.subname||'='|| osp.subname ind_full_name , h.savtime stat_upd_time ,h.rowcnt numrows,h.leafcnt numleafs, h.blevel blevel,h.distkey dist_kys ,h.lblkkey lf_p_ky,h.dblkkey dbl_p_ky,h.clufac clufac, h.samplesize samp_size,h.obj# obj_hash from sys.user$ u, sys.obj$ osp, obj$ ocp, sys.indsubpart$ tsp, sys.wri$_optstat_ind_history h where h.obj# = osp.obj# and osp.type# = 35 and osp.obj# = tsp.obj# and tsp.pobj# = ocp.obj# and osp.owner# = u.user# ) where ind_name like upper(nvl('&indname','whoops')||'%') and owner like upper('&indown')||'%' order by owner,ind_name,stat_upd_time desc / clear colu
I’ll leave you to play with these as you wish, especially as I ran out of time to finish this blog an hour ago and now owe my current employer some of my weekend in return, but I’ll finish off with a use for the scripts.
One thing to bear in mind, these tables are holding the stats that were replaced, not the stats that were created. So they hold historical information.
If you read the earlier posts, you may remember that the automatic stats job can choke and fail on extremely large segments. Well, sometimes {and I have to admit, only sometimes} you can spot what segments it failed on. That is because, at the end of the window when the job gets stopped, it attempts to clean up after itself. It appears to check for stats that have changed during the run but it failed to process correctly and resets them. It copes the stats back. So you sometimes see this:-
@show_tsh_full OWNER TAB_FULL_NAME STAT_UPD_TIME NUMROWS NUMBLKS AVG_RL SAMP_SIZE OBJ_HASH SAMP_PCT ------------ ---------------------------------------- -------------------- ---------- ---------- ------ ---------- -------- -------- TABOWNER1 HUGETABLE4 0522 06:00:01 1473990266 27319245 128 5660375 54714 .004 TABOWNER1 HUGETABLE4 0522 05:48:22 1327119100 24544873 128 13271191 54714 .010
In the above case, you can see that this massive table had stats set at 5:48am and then again at 6:00.01am. That second record is the automated stats swapping the last known good stats back in place. That is the segment it choked on.
How do you stop the automatic stats job chocking on this extent? You do a manual gather of course. You will see the above job attempted a 0.1% sample size that almost worked at 05:48. The value it swapped back is 0.04 – which was the sample size of the manual gather I did a few days earlier to get over the last choke. {I also tweaked the histograms I gathered to save time}. I had not yet finished converting the manual fix to an automated one when I recorded the above.
Why do I think that the automatic job sets the timestamp in the WRI$_OPTSTAT_xxx_HIST tables when it swaps in the new stats and not when it starts? Because of the time of the entries after a large object has been analysed. Why do I think it checks for stats that have changed since the start of the run and replaces them rather than keeping track of the information as it goes? Because, when I first came across this choking issue, I was veryifying the problem after 10pm at night. When I realised there was an issue at around 1am, I started manually gathering stats. It took me until 4am, I checked they were in place and went to bed. Imagine my bad temper the next day when I found out that the automatic job had gone and re-wound most of the stats I had gathered, resetting them to what they had been at 10am the previous night. It was a seriously bad temper.
Addition – I’ve had a comment that may or may not be spam, saying they do not understand “the last bit” (If you are a real person molamola, I apologise). Reading it, it is maybe not clear, so this is what happened and why I was angry 🙂
1) I had realised the auto stats job was failing overnight.
2) I logged on from home at just before 10pm and watched the system
3) at 10pm the auto stats job started and collected stats on some small tables and indexes
4) around 11,11:30pm I realised is was not progressing,
5) from 11:30 to 3am in the morning, I created my own gather_table_stats and gather_index_stats statements and ran them.
6) at around 4am I went to bed, having checked stats were gathered and execution plans were good.
7) Got to work around 9am, a bit tired.
8) Found lots of performance issues due to old stats.
9) Shouted randomly at people and got angry, started re-gathering stats.
10) Worked out that at 6am, when the auto stats job finished, all the tables/indexes IT had intended to gather stats on but had failed to do so, it reset the stats to what they had been set to at 10pm the previous night, when it started.
ie very old and bad stats.
Excellent post Martin..I feel that if we know that there are big tables to be analyzed then we should lock the table stats so that gather_stats_job doesn’t spend time on these and have a separate job with force option to gather stats on these objects..
Cheers
Amit
http://askdba.org/weblog/
You are spot-on Amit, A very valid response is to lock the statistics on the large object, handle gathering stats on the table/partitions manually and also the dependent indexes {don’t forget the indexes, and for large tables, don’t use the cascade option, it under-gathers for indexes – I need to post about that!}.
Many site alter the automatic job to gather stats on only the internal oracle segments and gather stats on user segments with a custom job, but that needs some thought in developing. Locking table stats and either gathering your own stats with the FORCE =>true option or allowing dynamic sampling are both useful alternatices.
[…] Automated Statistics Gathering Silently Fails #2 July 23, 2009 Posted by mwidlake in performance. Tags: performance, statistics trackback <previous post…next post> […]
Martin,
I knew this blog post would come in handy for me down the line. Just used your table script to get some stats history. Worked like a dream.
Cheers
Graham
Hey, someone used one of my scripts! 🙂
I’m glad it was of use Graham.
The script IS copyright though. You now owe me 0.0000001p (which covers you for a lifetime of use)
Sloppy kisses Martin. Its late I’m in the office and your query has been a lifesaver. I started writing my own, then google and you came to the rescue and, you know, copy and paste is so much easier.
Being a bloke you might not want to take me up on the first offer. In fact I retract it. But thanks for the query anyway.
Well my personal preferences for kisses is cute brunette ladies, but I appreciate the sentiments. I’m very glad I could help