
Buffer Cache Hit Ratio – my “guilty” Confession November 1, 2009

Posted by mwidlake in Perceptions, performance.
14 comments

My Friday Philosophy this week was on Rules of Thumb on buffer gets per row returned.

Piet de Visser responded with a nice posting of his own, confessing to using ratios to help tuning {We seem to be playing some sort of blog-comment tag team game at the moment}.

Well, I have a confession so “guilty” or “dirty” that I feel I cannot inflict it on someone else’s blog as a comment.

I use the Buffer Cache Hit Ratio.

And the Library Cache Hit Ratio and the other Ratios.

As has been blogged and forum’d extensively, using these ratios is bad and stupid, and anyone doing so does not know what they are doing, as the ratios do not help you solve performance problems. I mean, hell, you can download Connor McDonald’s/Jonathan Lewis’s script to set it to what you want, so it must be rubbish {go to the link and choose “tuning” and pick “Custom Hit Ratio” – it’s a rather neat little script}.

The point I am trying to make is that the Buffer Cache Hit Ratio (BCHR) was once wrongly elevated to the status of a vital piece of key information, but the reaction against that silly situation has been that it is now viewed by many (I feel) as the worst piece of misleading rubbish. Again a silly situation.

I think of the BCHR as similar to a heart rate. Is a heart rate of 120 good or bad? It’s bad if it is an adult’s resting heart rate, but pretty good if it is a kitten’s resting heart rate. It’s also probably pretty good if it is your heart rate as you walk briskly. Like the BCHR it can be fudged. I can go for a run to get mine higher, I can drain a couple of pints of my blood from my body and it will go up {I reserve the right not to prove that last one}. I can go to sleep and it will drop. Comparing my resting heart rate to yours (so like comparing BCHRs between systems) is pretty pointless, as I am a different size, age and metabolism to you {probably} but looking at mine over a year of dieting and exercising is very useful. If only I could keep up dieting and exercising for a year…

So what do I think the much-maligned Buffer Cache Hit Ratio gives me? It gives me the percentage of SQL block accesses, across the whole database activity, that are satisfied from memory as opposed to disc. Or, put another way, the percentage of occurrences on which a block has to be fetched from the I/O subsystem. Not how many blocks are read from storage or memory, though you can get that information easily enough. As physical IO is several orders of magnitude slower than memory access {ignoring I/O caches, I should add}, it gives me an immediate feel for where I can and can’t look for things to improve.
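
If you want to see where the number comes from, the classic calculation uses three counters from v$sysstat. A minimal sketch {there are variants – some people also subtract direct path reads – so treat this as illustrative}:

SELECT ROUND((1 - (phy.value / (db.value + con.value))) * 100, 2) AS bchr
FROM   v$sysstat phy, v$sysstat db, v$sysstat con
WHERE  phy.name = 'physical reads'
AND    db.name  = 'db block gets'
AND    con.name = 'consistent gets';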

If I am looking at a system that is overall very slow (e.g. high process wait queues under Linux/Unix, or the client has said the system is generally slow) and I see that the BCHR is low, say below 90%, this tells me I can probably get some performance increase by reducing physical access. I’ll go and look for the statements with the highest physical IO and the hottest tablespaces/objects in the DB.
If the BCHR is already up at the 99% level, I need to look at other things, such as tuning sorts, removing unnecessary activity in the database, or being mindful of nested loop access where it may not be the best access method (very likely due to old stats on tables).
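
Finding those high-IO statements is simple enough. Something like this against v$sqlarea does the job {only a sketch – it misses statements that have already aged out of the shared pool}:

SELECT * FROM (
       SELECT sql_id, disk_reads, buffer_gets, executions
       FROM   v$sqlarea
       ORDER  BY disk_reads DESC)
WHERE  rownum <= 10;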

When I have got to know a system and where its BCHR generally sits, a sudden change, especially a drop, means there is some unusual physical IO going on. If the phones start going and someone is complaining “it’s all slow”, the BCHR is one of the first things to look at – especially as it is available from so many places.

Another thing the BCHR gives me: if I am looking at a given SQL statement or part of an application, its specific BCHR can be compared to the system BCHR. This does not help me tune the statement itself, but I know that if its specific BCHR is low then it has unusually high IO demands compared to the rest of the system. Further, reducing it might help the whole system, so I might want to keep an eye on overall system throughput. If I reduce the statement’s execution time by 75% and the whole system IO by 1%, the client is likely to be more happy, especially if that 1% equates to other programs running a little faster “for free”.
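
Getting a statement’s specific BCHR is easy enough while it is still in the shared pool. A sketch, assuming you know the SQL_ID:

SELECT sql_id, buffer_gets, disk_reads,
       ROUND((1 - (disk_reads / GREATEST(buffer_gets, 1))) * 100, 2) AS stmt_bchr
FROM   v$sql
WHERE  sql_id = '&sql_id';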

So, I don’t use the BCHR to tune individual statements, but I feel confident using it to track the general health of my database, so long as I am mindful of the impact of new functionality or upgrades. It’s a rule of thumb. It’s a database heart rate (and so are redo generation and half a dozen other things).

A Tale of Two Meetings – 11GR2 and MI SIG October 5, 2009

Posted by mwidlake in Meeting notes, Perceptions.
7 comments

Last week I attended two Oracle events, each very different from the other.

The first was an Oracle Corp event, giving details of the new 11GR2 release and what it was introducing. It was in a nice hotel in London with maybe 250, 300 attendees and all quite swish.

The other was a UK Oracle User Group meeting, the last Management and Infrastructure SIG for 2009. 30 people in the Oracle City office and far more unassuming {And note, as I chair the MI SIG, anything I say about the day is liable to bias…}.

Both events were useful to attend and I learnt things at both, but I also found the difference between the two quite interesting.

Oracle 11G Release 2

The official Oracle 11GR2 presentation was where you went for the definitive information on what Oracle Corp feel are the new features of 11G R2 that are of interest (though some of it was not R2-specific but general 11G).

Chris Baker started off by telling us “there has never been a better time” to move to the latest technology, or a greater need to gain business advantage through using said latest technology. You know, it would be really nice, just once, to go to such a corporate event and not be given this same thread of pointless posturing. I know it is probably just me being old and grumpy and contrary, but after 20 years in the business I am sick to the hind teeth of Keynotes or Announcements that say the same empty “Raa-Raa” stuff as the previous 19 years – the need “now” to get the best out of your technology has been the same need since the first computers were sold to businesses, so give it a rest. Just tell us about the damned technology; we are smart enough to make our own decision as to whether it is a big enough improvement to warrant the investment in time and effort to take on. If we are not smart enough to know this, we will probably not be in business too long.

Sorry, I had not realised how much the Corporate Fluff about constantly claiming “Now is the time”, “Now things are critical” gets to me these days. Anyway, after that there were some good overviews of the latest bits of technology and, following on from them, some dedicated sessions in two streams on specific areas, split between semi-technical and management-oriented talks, which was nice.

There was plenty of talk about the Oracle Database Machine, which appears to be Exadata version 2 and sits on top of Sun hardware, no surprise given the latest Oracle acquisition. I have to say, it looks good: all the hardware components have taken a step up (so now 40Gb InfiniBand interconnect, more powerful processors, even more memory), plus a great chunk of flash memory in the form of Sun’s “FlashFire” technology to help cache data and thus help OLTP work. More importantly, you can get a 1/4 machine now, which will probably make it of interest to more sites with less money to splash out on a dedicated Oracle system. I’ll save further details for another post, as this is getting too long.

The other interesting thing about the new Oracle Database Machine was the striking absence of the two letters ‘H’ and ‘P’. HP was not mentioned once. I cannot but wonder how those who bought into the original Exadata on HP hardware feel about their investment, given that V2 seems only to be available on Sun kit. If you want the latest V2 features, such as the much-touted two-level disc compression, is Oracle porting them over to the older HP systems, is Oracle offering a mighty nice deal to upgrade to the Sun systems, or are there some customers with the HP kit currently sticking needles into a clay model of top Oracle personnel?

The other new feature I’ll mention is RAT – Real Application Testing. You can google for the details but, in a nutshell, you can record the activity on the live database and play it back against an 11g copy of the database. The target needs to be logically identical to the source {so same tables, data, users etc} but you can alter initialisation parameters, physical implementation, patch set, OS, RAC… RAT will tell you what will change.

For me, as a tuning/architecture guy, this is very, very interesting. I might want to see the impact of implementing a system-wide change, but currently that would involve either only partial testing and releasing on a wing and a prayer, or a full regression test on an expensive and invariably over-utilised full test stack – which often does not exist. There was no dedicated talk on RAT though; it was mentioned in parts of more general “all the great new stuff” presentations.
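
For those who like to see the shape of a thing, Database Replay is driven through two PL/SQL packages. A minimal sketch {the directory, capture and replay names are made up, and you also need one or more wrc replay clients running against the target}:

-- On the source system:
exec DBMS_WORKLOAD_CAPTURE.START_CAPTURE(name => 'PRE_CHANGE_CAP', dir => 'CAPTURE_DIR')
-- ...let the workload run, then...
exec DBMS_WORKLOAD_CAPTURE.FINISH_CAPTURE
-- On the 11g target, after copying the capture files across:
exec DBMS_WORKLOAD_REPLAY.PROCESS_CAPTURE(capture_dir => 'CAPTURE_DIR')
exec DBMS_WORKLOAD_REPLAY.INITIALIZE_REPLAY(replay_name => 'PRE_CHANGE_REP', replay_dir => 'CAPTURE_DIR')
exec DBMS_WORKLOAD_REPLAY.PREPARE_REPLAY
exec DBMS_WORKLOAD_REPLAY.START_REPLAY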

Management and Infrastructure SIG

RAT leads me on to the MI SIG meeting. We had a talk on RAT by Chris Jones from Oracle, which made it clearer that there are two elements to Real Application Testing. One is Database Replay and the other is SQL Performance Analyzer, SPA. Check out this Oracle datasheet for details.

SPA captures the SQL from a source system but then replays only the SELECT statements, one by one, against a target database. The idea is that you can detect plan changes or performance variations in just the SELECT SQL. Obviously, if the SELECTs are against data created by other statements that are not replayed then the figures will be different, but I can see this being of use in regression testing and giving some level of assurance. SPA has another advantage in that it can be run against a 10g database, as opposed to RAT, which can only be replayed against 11g (though captured from a terminal-release 10g or 9i system – that is a new trick).
There are no plans at all to backport RAT to 10, it just ain’t gonna happen guys.
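
SPA itself is driven through the DBMS_SQLPA package against a SQL Tuning Set. A sketch {the tuning set and execution names are hypothetical}:

DECLARE
  l_task VARCHAR2(64);
  l_exec VARCHAR2(64);
BEGIN
  -- MY_STS is a SQL Tuning Set captured on the source system
  l_task := DBMS_SQLPA.CREATE_ANALYSIS_TASK(sqlset_name => 'MY_STS');
  l_exec := DBMS_SQLPA.EXECUTE_ANALYSIS_TASK(task_name      => l_task,
                                             execution_type => 'TEST EXECUTE',
                                             execution_name => 'before_change');
  -- make the change being assessed, run again as 'after_change',
  -- then finish with execution_type => 'COMPARE PERFORMANCE'
END;
/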

The SIG also had an excellent presentation on GRID for large sites (that is, many Oracle instances) and how to manage it all. The presentation came about as a result of requests for a talk on this topic by people who come to this SIG, and Oracle {in the form of Andrew Bulloch} were good enough to oblige.

The two Oracle Corp talks were balanced by technical talks by James Ball and Doug Burns, on flexible GRID architectures and using OEM/ASH/AWR respectively. These were User presentations, mentioning Warts as well as Wins. Not that many Warts though; some issues with licence daftness were about it, as the technology had been found to work and do its job well. Both talks were excellent.

The fifth talk was actually an open-forum discussion, on Hiring Staff, chaired by Gordon Brown {No, not THAT Gordon Brown, as Gordon points out}. Many people joined in and shared opinions on or methods used in getting new technical staff. I found it useful, as I think did many. These open sessions are not to everyone’s taste and they can go wrong, but Gordon kept it flowing and all went very well.

 

The difference between the two meetings was striking. Both had strong support from Oracle {which I really appreciate}. Both included talks about the latest technology. However, the smaller, less swish event gave more information and better access to ask questions and get honest answers. There was also almost no Fluff at the SIG; it was all information or discussion, no “Raa-Raa”. But then, the lunch was very nice and there were free drinks after the Corporate event {we shared rounds at a local pub after the SIG event – maybe one round too much}.

I guess I am saying that whilst I appreciate the Big Corporate event, I get a lot more out of the smaller, user group event. Less fluff, more info. Thankfully, Oracle support both, so I am not complaining {except about the “there has never been a better time” bit, I really AM sick of that :-( }.

So if you don’t support your local Oracle user group, I’d suggest you consider doing so. And if, like so many sites seem to, you have membership but don’t go along to the smaller events, heck, get down there! Some of the best stuff is at these SIG meetings.

Friday Philosophy – Cats and Dogs October 2, 2009

Posted by mwidlake in Perceptions.
2 comments

I like cats. Cats are great. I don’t like dogs. I’ve been attacked by a nasty bitie dog and that is my reason. And dogs growl at you. And woof.

This is of course unfair. I have been bitten by cats lots more than by dogs (seeing as I own cats and have never owned a dog, this is to be expected); cats scratch, cats hiss at you and yowl, and they have been known to leave “presents” in my slippers.

My animal preference comes down to personal, even personality, reasons as opposed to logic. A dog needs attention and a walk twice a day; they follow you around, always want attention and tend to be unquestioning in their affection. Cats can often take you or leave you, will come when called only if they had already decided to come over, and the issue of who owns whom is certainly not clear. If you do not keep your cat happy, there is always Mrs Williams down the road who Tiddles can up and go and live with instead.

These same illogical preferences riddle IT, I think. People make decisions for what I sometimes term “religious” reasons. As an example, I’ve worked with a lot of people who are strongly for or against Open Source. There are logical and business reasons for and against Open Source, but it seems to me that many people have decided which they prefer for personal reasons {often, Open Source people tend towards anti-establishment and anti-corporation views; Open Source detractors tend towards supporting business and personal wealth}. They will then argue their corner with the various pros and cons, but you know there is no swaying their opinion, as it was not derived from logic.

In the same way I will not stop preferring cats to dogs. And I know I personally have a couple of Religious decisions about IT that are not based on cold logic {And I am not changing them, OK!}.

I think it helps to realise that people do make decisions this way (some make most of them this way, most make some decisions this way) and it’s not worth getting that angry or annoyed when someone seems to be intractable in their stance against your ideas. After all, you might have made a “religious” decision about which side you are on, and they can’t understand why you don’t agree with them :-)

Opinions formed in this manner are difficult to change. They can and do change, but usually only over time and in a gradual way – certainly not from someone telling them they are an idiot for preferring Sybase to Ingres and verbally berating them with various arguments for and against.

So, if it is only a work thing {and heck, computers and software really are not that important} be passionate, but try and be a little flexible too.

This post was, of course, just a shallow excuse to include a link to a Cat thing – my favorite cat animation. Sorry, Dog lovers {it’s your own fault for liking nasty, smelly dogs}.

Testing is Not Just for Code. September 16, 2009

Posted by mwidlake in Architecture, VLDB.
7 comments

Someone I am currently working with has a wonderful tag line in her emails:

Next time we want to release untested, why don’t we just release undeveloped?

Testing is not limited to testing code, of course. I have recently posted about how a backup is not a backup until you have tested it with a practice recovery. How you think the database will work by looking at the data dictionary is just a nice theory until you run some actual tests to see how the database responds, as I have been doing with Histograms lately. Sadly, you could even say an Oracle feature is not an Oracle feature until you have tested it.

In my experience, this is particularly true when you test the edges of Oracle, when you are working on VLDBs {Very Large DataBases}.

Last month Jonathan Lewis posted about a 2TB ASM disc size bug, where if you allocated a disc over 2TB to ASM, it would fill it up, wrap around and write over the beginning of the file. This week I heard from some past colleagues of mine that they hit this very same bug.
With these very same colleagues we hit a bug in 10.1 where you could not back up a tablespace over 8TB in size with RMAN {I can’t give you a bug number for it as we were working with HP/Oracle direct at the time and they “handled it internally”. But when I mentioned it to him, Jonathan found a similar one, bug 5448714, which stated a 4TB limit on backups. It could be the same bug}.

Yet another VLDB issue: we wanted to move just under one thousand tablespaces from one database to another {again, 10.1}, using transportable tablespaces. We tried to use the utility for checking you are working on a consistent set of tablespaces, but it could not cope with that many. And to plug them into the new database you have to export the metadata, where we found a 4,000-character limit on the variable stating the tablespaces to transport. That is about three characters per tablespace name, once you allow for the commas to delimit them… Yes, you could manage if you renamed all tablespaces to AA, AB, AC… BA, BB, BC etc. If memory serves, the problem was with Data Pump export, and we reverted to old-style export, which did not have the problem.
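
The checking utility I mean is DBMS_TTS. A minimal sketch with two made-up tablespace names – fine until the list gets silly-long:

exec DBMS_TTS.TRANSPORT_SET_CHECK('TS_ONE,TS_TWO', TRUE)
SELECT * FROM transport_set_violations;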

Another limit I’ve blogged on is that the automated stats job chokes on very large objects.

Some data dictionary views can become very slow if you have several tens of thousands of tables/extents/tablespaces/indexes.

I can appreciate the issues and problems Oracle has with testing their code base; it is vast, people use the software in odd ways and it has to run on many platforms. You might also feel I am being picky by saying Oracle breaks a little when you have 8TB tablespaces or a thousand tablespaces. But:

  • Oracle will say, in big glossy presentations, that you can build Petabyte and Exabyte databases with Oracle {and they have a product called Exadata, don’t forget}.
  • More and more customers are reaching these sizes as data continues to grow – for many sites, faster than Moore’s Law.
  • Some of these limits appear with databases well below a Petabyte (say a tiddly small 50TB one :-) ).

I’ve been running into these issues with VLDBs since Oracle 7, and they are often with pretty fundamental parts of the system, like creating and backing up tablespaces! I think it is a poor show that it is so obvious Oracle has been weak in testing with VLDB-sized databases before release.

I wonder whether, with 11gR2, Oracle actually tested some petabyte data sizes to see if it all works? After all, as is often said, disk is cheap now, I’m sure they could knock one up quite quickly…

Friday Philosophy – What ever happened to System Design? September 14, 2009

Posted by mwidlake in Architecture, performance.
8 comments

{OK, so it is not Friday, I’ve got a lot on, OK?! Sheesh…}

I’m a Contractor/Consultant. As a Contractor I am expected to arrive On Site and within a week be making progress.  As a Consultant the week becomes a day.

That’s fine, it’s the territory. However, there is a question I always ask and I know other people ask it when they arrive on site:-

“Can I please have a look at the database design”.

We normally mean an Entity Relationship diagram, or at least a schema design, by this. The usual answer these days?

“We keep meaning to do one, but it never seems to quite get done – here are some misleading and poor notes on a few of the schemas; if you get an ER diagram together, we would love to see it”.

Why does no one do an ER diagram these days and why is there never a picture of the tables and relationships that you can look at for a system? Despite the fact that anyone new to the system hopes one will exist? What Happened to the skill of logical design?!

When did the ethos of having an overall schema layout just go out of the window??? And how in hell do You (yes, You, the system owner) expect to have a clue about the overall system without one?

Oh, and what is that I hear? You want me to improve the performance of your system without having a description of said system!? You have no clue of the overall database design but you want it to run faster?!? I better set the hidden “_RUN_DATABASE_FASTER” parameter to TRUE immediately then.

Is this just me or WHAT?

Another Day, Another Obscure Oracle Error September 11, 2009

Posted by mwidlake in internals, performance.
8 comments

I seem to be spending my life hitting odd bugs or errors in Oracle’s performance-related areas at the moment. I guess part of it is because I am using features not widely touched, part is that I’m not working as the SYS or SYSTEM user {in my experience a lot of people working on such things do it when connected as sysdba or at least as SYSTEM}, and part is that Larry Ellison personally hates me {he fills in for God when it comes to Oracle}.

I’m seeing this a lot today, and it seems virtually unknown:-

ERROR at line 1:
ORA-20003: Unable to gather system statistics
ORA-06512: at "SYS.DBMS_STATS", line 15882
ORA-06512: at line 1

I’m also seeing this, but this error is documented:-

ERROR at line 1:
ORA-20000: Unable to gather system statistics : insufficient privileges
ORA-06512: at "SYS.DBMS_STATS", line 15822
ORA-06512: at line 1

The rest of this post will explain my path to only partial enlightenment. If you have hit the above errors and found this page via Google, this might help you out and you might like to skip towards the end.
If you just read my blog, regard this as a very long “tweet” or whatever the hell they are. Again, feel free to skip to the end. Or just skip.

The task which is causing me to hit these issues is that I’m trying to assure myself of the ability to roll back gathering system statistics before getting the DBA team to do it on a production system {I am not in the DBA team on this one}.

No system statistics have been gathered yet so I have the default set, I’m sure many of you will recognise them:

SNAME           PNAME                     PVAL1 PVAL2
---------------------------------------------------
SYSSTATS_MAIN   CPUSPEED
SYSSTATS_MAIN   CPUSPEEDNW         1,567.798516
SYSSTATS_MAIN   IOSEEKTIM             10.000000
SYSSTATS_MAIN   IOTFRSPEED         4,096.000000
SYSSTATS_MAIN   MAXTHR
SYSSTATS_MAIN   MBRC
SYSSTATS_MAIN   MREADTIM
SYSSTATS_MAIN   SLAVETHR
SYSSTATS_MAIN   SREADTIM
SYSSTATS_INFO   DSTART                          07-25-2006 12:39
SYSSTATS_INFO   DSTOP                           07-25-2006 12:39
SYSSTATS_INFO   FLAGS                  1.000000
SYSSTATS_INFO   STATUS                          COMPLETED
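
{For the record, that listing comes straight out of sys.aux_stats$, queried with something like:}

SELECT sname, pname, pval1, pval2 FROM sys.aux_stats$;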

I decided to pull the current system statistics into a statistics table:-

exec dbms_stats.create_stat_table('DWPERF','AD_BF_STATS','DW_COMMON_MG')

EXEC DBMS_STATS.EXPORT_SYSTEM_STATS('AD_BF_STATS','SYSSTAT'||TO_CHAR(SYSDATE,'YYMMDDHH24MI'),'DWPERF')

SELECT * FROM AD_BF_STATS;

SYSSTAT0909101309 S 4 1 COMPLETED
07-25-2006 12:39 07-25-2006 12:39 CPU_SERIO

SYSSTAT0909101309 S 4 1
PARIO
10 4096 1567.79852

Two records created: the first is the header record describing what this STATID is all about; the second contains the stored information – you can see the default values (10, 4096 and 1567.79852) in the PARIO record.

All well and good, I can save my stats. Why do I want to? Because we may gather system statistics on an infrequent but regular basis, and I want a record of them. The automatically stored history only goes back a month {by default}.
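
As an aside, while that month of history lasts there is a built-in way to roll back without a stats table at all. A sketch, with a made-up “seven days ago” point:

exec DBMS_STATS.RESTORE_SYSTEM_STATS(as_of_timestamp => SYSTIMESTAMP - INTERVAL '7' DAY)

You can check the retention window with DBMS_STATS.GET_STATS_HISTORY_RETENTION and extend it with DBMS_STATS.ALTER_STATS_HISTORY_RETENTION.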

Of course, there is no need for me to explicitly export the stats to the table, it can be done as part of the gathering process, by stating the statistics table, owner and an ID when I do the gather, so I did:-

exec dbms_stats.gather_system_stats(gathering_mode => 'INTERVAL', interval => 15, -
  stattab => 'AD_BF_STATS', statown => 'DWPERF', statid => 'SYSSTAT'||TO_CHAR(SYSDATE,'YYMMDDHH24MI'))

I went off for a cup of coffee, came back, checked my system stats.

SNAME           PNAME                     PVAL1 PVAL2
---------------------------------------------------
SYSSTATS_MAIN   CPUSPEED
SYSSTATS_MAIN   CPUSPEEDNW         1,567.798516
SYSSTATS_MAIN   IOSEEKTIM             10.000000
SYSSTATS_MAIN   IOTFRSPEED         4,096.000000
SYSSTATS_MAIN   MAXTHR
SYSSTATS_MAIN   MBRC
SYSSTATS_MAIN   MREADTIM
SYSSTATS_MAIN   SLAVETHR
SYSSTATS_MAIN   SREADTIM
SYSSTATS_INFO   DSTART                          07-25-2006 12:39
SYSSTATS_INFO   DSTOP                           07-25-2006 12:39
SYSSTATS_INFO   FLAGS                  1.000000
SYSSTATS_INFO   STATUS                          COMPLETED

Oh. No change. No system stats gathered. Damn.

Well, it’s a test box; there may be too little workload to detect anything – like none! I checked the SGA for recent code and the only SQL was that for recording the stats gathering :-).
So, repeated with me running some workload in another window for 15 minutes.

Still no change, still no system stats. Double Damn!

I remembered from reading around that if MREADTIM is less than SREADTIM the stats gathered can be ignored, so I tried again.
And again.
And then decided it would not magically work, there was a problem.

I’ll gather the system stats with START and STOP, really hammering the box for a while and checking with autotrace that real physical IO is being done:

exec dbms_stats.gather_system_stats(gathering_mode => 'START', stattab => 'AD_BF_STATS', -
  statown => 'DWPERF', statid => 'SYSSTAT'||TO_CHAR(SYSDATE,'YYMMDDHH24MI'))

work-work-work-work-work

exec dbms_stats.gather_system_stats(gathering_mode => 'STOP');
BEGIN dbms_stats.gather_system_stats(gathering_mode => 'STOP'); END;
*
ERROR at line 1:
ORA-20003: Unable to gather system statistics
ORA-06512: at "SYS.DBMS_STATS", line 15882
ORA-06512: at line 1

Oh.

Oh Hell. Well, at least I now know that the chances are my INTERVAL system stats collection failed with this error, but silently.

I bet if I dug around I would find a few failed jobs recorded where the INTERVAL version of the stats collection failed with this. I have not checked, I am just up to my eyes right now.

I found nothing at all via Google or Metalink about this exact error for gathering system stats. BUT maybe I lacked the correct privileges on my non-SYS account. So I granted ANALYZE ANY DICTIONARY and ANALYZE ANY to my user, as those are the privileges mentioned for many of the DBMS_STATS management procedures.

Nope, did not help.

Blow it. If DBMS_STATS won’t let me keep the stats in my table, I’ll just rely on the default store of them, and not state my stats table in the call:

DWD2_DWPERF> exec dbms_stats.gather_system_stats(gathering_mode => 'START')
BEGIN dbms_stats.gather_system_stats(gathering_mode => 'START'); END;
*
ERROR at line 1:
ORA-20000: Unable to gather system statistics : insufficient privileges
ORA-06512: at "SYS.DBMS_STATS", line 15822
ORA-06512: at line 1

Oh. Well, that’s interesting.
If I don’t state a stats table the gather fails when initiated.
If I state a stats table the gather fails when I stop.
How frustrating.

I decided to give up. Then I decided I won’t be beaten, granted myself the DBA role, and tried again to see if it works.

DWDEV2> exec dbms_stats.gather_system_stats(gathering_mode => 'START')
PL/SQL procedure successfully completed.

YEEEESSSSSS!!!

But will it now error when I stop the gather…

DWDEV2> exec dbms_stats.gather_system_stats(gathering_mode => 'STOP')
PL/SQL procedure successfully completed.

Deep Joy.

And it gathered stats.

SNAME           PNAME                     PVAL1 PVAL2
------------------------------------------
SYSSTATS_MAIN   CPUSPEED           1,545.000000
SYSSTATS_MAIN   CPUSPEEDNW         1,567.798516
SYSSTATS_MAIN   IOSEEKTIM             10.000000
SYSSTATS_MAIN   IOTFRSPEED         4,096.000000
SYSSTATS_MAIN   MAXTHR
SYSSTATS_MAIN   MBRC
SYSSTATS_MAIN   MREADTIM
SYSSTATS_MAIN   SLAVETHR
SYSSTATS_MAIN   SREADTIM               5.000000
SYSSTATS_INFO   DSTART                          09-11-2009 13:02
SYSSTATS_INFO   DSTOP                           09-11-2009 13:02
SYSSTATS_INFO   FLAGS                  1.000000
SYSSTATS_INFO   STATUS                          COMPLETED

{ignore the lack of MREADTIM and MBRC; nothing happened on the box in the few seconds I was collecting for}

So, the lack of the DBA role was a problem, but it had been hidden by my attempts to use a stats table. Now surely I can do what I wanted and store the stats in my stats table…

exec dbms_stats.gather_system_stats(gathering_mode => 'START', stattab => 'AD_BF_STATS', -
  statown => 'DWPERF', statid => 'SYSSTAT'||TO_CHAR(SYSDATE,'YYMMDDHH24MI'))

PL/SQL procedure successfully completed.

Wehay.
work-work-work-work-work

exec dbms_stats.gather_system_stats(gathering_mode => 'STOP')
BEGIN dbms_stats.gather_system_stats(gathering_mode => 'STOP'); END;
*
ERROR at line 1:
ORA-20003: Unable to gather system statistics
ORA-06512: at "SYS.DBMS_STATS", line 15882
ORA-06512: at line 1

AAAAAGGGGHHHHHH!!!!!!!!!

Let’s check back.

exec dbms_stats.gather_system_stats(gathering_mode => 'START')
PL/SQL procedure successfully completed.

work-work-work-work-work

exec dbms_stats.gather_system_stats(gathering_mode => 'STOP')
PL/SQL procedure successfully completed.

That works, because I am not involving a stats table. If you look back at the two errors, they come from different (but very similar) line numbers. I’d missed that for a couple of hours, I blame dyslexia.

Conclusion:

Gathering and storing system stats is as buggy and random as the rest of the DBMS_STATS functionality {yes, I am peeved about this}. It seems…

  • If you gather system stats with the INTERVAL method, it can silently fail. This might only be if you involve a stats table.
  • You need the DBA role to gather system stats, otherwise you get ORA-20000, via ORA-06512 at line 15822.
  • ANALYZE ANY DICTIONARY and ANALYZE ANY are not enough to allow a non-DBA account to gather system stats.
  • If you try to store the previous system stats in your stats table as part of the gather, you get ORA-20003, via ORA-06512 at line 15882 {a different line}, even if you have the DBA role.
  • If you just rely on Oracle preserving the previous version of the stats, you can gather system stats fine.
  • You can work around the issue by exporting the system stats to your stats table as a single step first, then gathering system stats without involving your stats table – see the sketch after this list.
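
A sketch of that workaround, reusing the stats table and user from earlier:

-- Export the current system stats as a separate, single step:
exec dbms_stats.export_system_stats(stattab => 'AD_BF_STATS', statown => 'DWPERF', -
  statid => 'SYSSTAT'||TO_CHAR(SYSDATE,'YYMMDDHH24MI'))
-- Then gather without mentioning the stats table at all:
exec dbms_stats.gather_system_stats(gathering_mode => 'START')
-- work-work-work
exec dbms_stats.gather_system_stats(gathering_mode => 'STOP')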

There is no Metalink note for the ORA-20003 at line 15882 and no Google hit for it, so I reckon it is a damned rare event. And since the last time I blogged about a damned rare error, I have seen one or two hits on that blog page every day :-).

When a Backup is Not a Backup September 8, 2009

Posted by mwidlake in Architecture.
13 comments

I had a call from an old client today who had found themselves in a bit of a pickle. They were facing what I’d judge is the third most common “avoidable disaster” for Oracle databases. And probably the most devastating.

They had a damaged database – just one of those things. Media failure had corrupted a couple of datafiles. This is not the disaster; most DBAs who have been around for a few years have experienced this. Disks are mechanical devices: they go wrong, controllers develop faults, people accidentally pull the plugs out of switches.

The disaster was that they had never tested their backup. Guess what? It turned out that their backup was not a valid backup. It was just a bunch of data on tape {or disk, or wherever they had put it}.

A backup is not a backup until you have proven you can recover from it.

If your backup has not been tested, and by that I mean used to recreate the working system which you then test, then it is just a bunch of data and a hopeful wish. You will not know if it works until you are forced to resort to it, and that means (a) you already have a broken system and an impact on your business (b) time is going to be short and pressure is high and (c) if you are the DBA, you could be about to fail in your job in a most visible and spectacular way.
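
At the very least you can ask RMAN to prove it can read and restore from the backup pieces. A sketch – and note that validation only shows the pieces are usable; it is not a substitute for a full practice recovery onto another box:

RMAN> RESTORE DATABASE VALIDATE;
RMAN> RESTORE ARCHIVELOG ALL VALIDATE;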

Oddly enough, only two or three weeks ago I had another client in exactly the same position. Database is corrupted, the backup had not been tested, the backup turned out to be a waste of storage. In fact, in this case, I think the backup had not been taken for a considerable period of time as a standby database was being relied on as the backup. Which would have been fine if the Standby had been tested as fulfilling the purpose of “working backup”.

The standby had not, as far as I could deduce, ever been opened and tested as a working and complete database since the last major upgrade to the host system. When tested for real, it proved not to be a working and complete database. It was an expensive “hopeful wish”.

The client from a few weeks ago finally got their system back, but it took a lot of effort for the DBAs and the developers to sort it out and they were a tad lucky. The jury is out on the client who called me today.

I can’t think of anything at all that the DBA function does that is more important than ensuring backups are taken and work. {Maybe an argument could be made that creating the system is more important, as nothing can be done without that, but then you could argue that the most important thing you do in the morning is get out of bed}. Admittedly, setting up, running and testing backups is not a very exciting job. In fact it often seems to be a task passed on to a less experienced member of the team {just like creating databases in the first place}. But most of us are not paid to have fun, we are paid to do a job.

I’ll just make a handful more comments before giving up for today.

  • The database backup nearly always needs to be part of a more comprehensive backup solution. You need a way to recreate the oracle install, either a backup of the binaries and auxiliary files (sql*net, initialization files, password files etc) or at least installation media and all patches so you can recreate the installation. You need a method to recover or recreate your application and also your monitoring. You might need to be able to recreate your batch control. O/S? Have you covered all components of your system, can you delete any given file off your system and fix it?
  • Despite the potential complexity resulting from the previous point, you should keep your backup as simple as you can. For example, if you only have a handful of small databases, you might just need old-style hot backups via a shell script, rather than RMAN. Only get complex {like backing up from a standby} if you have a very, very compelling business need to do so.
  • Testing the backup once is a whole world better than never testing it. However, regular, repeated recovery tests not only allow the DBA/Sys Admin teams to become very comfortable with the recovery process and help ensure it is swift and painless, but, by trying different scenarios, you may well discover issues that come under the first two points.

I’ve not even touched on the whole nightmare of Disaster Recovery :-)

Friday Philosophy – Should the Software or the User be the Stupid One? August 7, 2009

Posted by mwidlake in internals, performance.
5 comments

Oracle’s performance engine is complex and copes with a lot of database situations automatically – or to be more precise, it tries to cope with lots of database situations automatically.

Over the last few versions, Oracle has added many, many things to allow the database to cope automatically with all sorts of different data volumes, spreads of data, relationships between tables and use of different Oracle technologies (by this I mean bitmap indexes, index-organized tables, partitions, clusters, external tables). All of these things aim to allow the database to just “cope” with whatever you need it to do, with less and less effort by the users {by users, I mean technical users: DBAs and Developers}. Thus it allows for “stupid” users. {I mean no offence, maybe read “inexperienced” instead of stupid}.

As an example, you can now have some very large tables consisting of several partitions and some status look-ups. You query against them. Oracle’s CBO will automatically ignore partitions it can ignore, use indexes or full table scans to incur the least amount of IO, use histograms to spot where clauses on low-cardinality values, use hash joins rather than nested loops as appropriate depending on memory availability, use bitmap indexes when it thinks it can and merge the results from several bitmap indexes, use function-based indexes to support functions in where clauses…
It even self-gathers the information to look after all this. Column usage and table modifications are tracked, statistics are gathered when needed and in ways that support data skew, PGA and SGA can be self-monitoring and self-managing…

It all sounds great. In fact, most of the time, for most people, it is great. {I know, most people reading this post are probably people who have encountered the problem systems and so know it goes wrong, and so you need more knowledge to cope – you are a biased set of people. In the nicest way, I should add :-) } The idea is, I believe, that you do not need to be smart to look after Oracle.

If it is not great, if this highly complex system gets it wrong and tries to satisfy SQL statements in sub-optimal ways, then the User has to step in and fix things. ie You.

It is now horrifically complex for us technical users to understand what is going on. You have to be not only “not stupid” but also “not average”. Sometimes even “great” is not enough; you have to be brilliant.

In my example, we need to look at whether the SQL is constructed to allow the indexes to be used, are functions correctly laid out to use function-based indexes, are partitions being maintained correctly, when were stats last gathered, did they include histograms and do they help, has Oracle missed the need for histograms, are the indexes analyzed at a high enough sample size, are the bitmaps greatly slowing down inserts, have hints been used in the code, are initialisation parameters set to override default functionality…

You get the idea, I won’t drone on further. I didn’t even mention memory considerations though {OK, I’ll shut up}.

My point is, the more complex the software, the more “intelligent” it is, the more it is designed to allow for “stupid” users, then the more super-intelligent the user has to be to cope when it breaks.

How about an alternative?

How would it be if we went back to the Rule Based Optimizer and no automatic management of complex situations?

Oracle would maybe need to add a few rules to the RBO for it to cope with later developments, so it would be slightly more complex than V6 but not a lot.
Everything else, the User decides. You only gather stats you decide to gather, on objects you decide need them. No you don’t, it’s a Rule Based Optimizer – no stats gathering! {But see below}.

No automatic memory management. No automatic anything.

The User {the technical user, the DBA and Developer} would have to be smart. Not brilliant, just smart. You would probably have to do more, but most of it would be easier as the levels of complexity and interdependence are reduced. All those tweaks and tricks in the CBO and all the monitoring to cope with “complex” would not exist to go wrong.

Plus it might solve another concern I have. I think there is a chasm growing: there is no need to solve simple problems, as Oracle copes, but you then have to solve complex problems when Oracle does not cope. If you don’t develop skills and experience solving the simple problems, how do you solve the complex ones? I think this is why most Oracle performance and architecture experts are old {sorry, pleasantly middle-aged}. Young people new to the arena have a massive learning mountain to climb.

So, if we have stupid software, maybe we can get away with more stupid “smart” expert users, i.e. ALL of us can cope. You cut your teeth on smaller, simpler systems and learn how to cope with the stupid software beast. As you learn more, you learn to cope with more complex situations, and they never get that complex, as the database is not so “clever”.

I’d actually still argue that all the intelligence gathering the Oracle database does should continue – stats gathered on objects, the ability to gather information on memory usage and thus advise on changes, tracking column usage and table changes. But We, the Stupid Users, get to look at it and use it as we see fit for our systems.

I’m sure many systems would not work quite so fast in my scenario, but I’d rather have a system working at 75% of its theoretical fastest all the time than one working at 95% and breaking regularly, and in ways so complex it needs weeks to work out and fix.

I now await all the comments telling me how stupid I am {I can be blindingly stupid, especially on Fridays}.

Unhelpful “helpful” people June 9, 2009

Posted by mwidlake in Management, Perceptions.
11 comments

I keep meaning to get back to more technical blogs, but I need to spend some time sorting that out first, and something has just bugged the hell out of me, so another “wordy” one today.

Richard Foote has gone back to explaining the basics of indexes and the CBO; if you are new to the CBO or indexes, hot-foote {sorry} it over there immediately and check it out. He is a brilliant teacher.

Near the start, he has a link to someone on one of the OTN forums who is stating the Cost Based Optimiser sucks. I won’t repeat the link. No, sod it, I WILL repeat the link. Here it is. It might elevate the page in google’s scoring but what the heck.

This person’s rather poor outlook on the CBO did not bother me, nor the fact that he is, in my opinion, wrong {he suggests using the rule hint on oracle 10, which is an option but is not, in my opinion, a mighty good idea as (a) who now knows the rules of the Rule Based Optimizer and (b) the hint will be ignored if you are using most new features added since 8, such as bitmap indexes or IOTs or partitioning. It might {and I have no proof for this} cause the feature not understood by the RBO to not be used, so maybe ignoring a nice bitmap index or function based index. Oh, and (c), if you have good stats the CBO usually wins.}.

What irritated me was his/her high-handed and abusive posting. That really annoys me. Then I thought “no, we all lose our temper sometimes and the person they are having a pop at did kind of ask for it”. But because I think that being abusive or condescending on forums is such bad behaviour, I dug a little into the other postings by this person.

Some were helpful. Many were simply links back to other pages or to some front end to Google the person’s question. And several, many, were abusive. Along the lines of “Why are you so stupid”; “If you can’t be bothered reading the manual you don’t deserve help”; “I would not employ you as you are a moron”. You get the idea?

It is a big problem with forums, and actually also in the workplace (and occasionally, sadly, at meetings and conferences): people being condescending, antagonistic and demeaning to others who do not know what they, the Mighty Brain, know, who seem to Mighty Brain not to be trying quite as hard as they could, or who are seemingly asking something obvious.

OK, if it is obvious, give the answer. It might be that you, oh Mighty Brain, did not understand the question. OK, maybe they are not trying hard enough. Suggest to them where they could look; maybe this person has three managers breathing down their necks and they just really, really want an expert opinion now, as they are not sure which of the seven opinions in Google to trust. And Mighty Brain, unless you were born with your knowledge placed in your head by God, as a special force on this earth, you didn’t know what this “moron” does not know at one time. Someone told you or, very rarely, you worked it out for yourself.

The truly, blood-boilingly, unjust thing about Mighty Brain? They are so sure of their own towering knowledge that they can’t see that they are often wrong. I can’t think of any Oracle Expert who is widely accepted by their peers who is a “Mighty Brain”. In fact, a common trait of the very best practitioners and teachers (of any subject, not just IT) is that they are always willing to admit they do not know and to learn.

I did look for a way to ask for this particular Mighty Brain to be barred from the forum, but then I just decided to vent my spleen on my blog and have a glass of wine.

I hope that person trips over and really cracks their shins or something. Nothing permanent, just something incredibly painful. Grrrrrr.

Update – Jonathan Lewis has commented to let me know that this “unskilled and unaware” {what a brilliant phrase} effect is the Dunning-Kruger effect. The link got filtered out by the comment mechanism, so I’ve posted it {well, maybe a similar one} here. Thanks Jonathan.

Fear of Databases May 29, 2009

Posted by mwidlake in Management, Perceptions.
4 comments

“It’s all in the database!”

I’m sure most of you (if you are in the UK – I must remember that the web is a world-spanning medium) have seen the adverts by the wonderful TV Licensing authority or the DVLA. If not, they go something like:

“We keep records” {background music}
“We know if you have paid…or not.” {Music become more sinister}
“We will find you, you cannot hide” {more affirmative music}
“It’s all in the database” {doom-laden musical flourish}

OK, maybe I lay it on a bit with the music.

Now, as a database professional, I see “it’s all in the database” as a good thing. With luck it will be a well designed database with referential integrity and all nicely validated.

Nearly all news media stories about actual or perceived threats to electronic privacy also cite “The Database” as the core.
“They {who?} will hold all your web searches in a vast Database”.
“A laptop holding a Database of 1 million double glazing customers has been stolen”. I bet it was actually 10 thousand, and in a spreadsheet.

It’s getting to the point where I don’t feel comfortable telling people I meet outside of the IT world that I am a database expert. Databases are hardly ever seen in a good light now; they seem to be linked only to things bad and Orwellian.

The Database is also often cited when companies get things wrong for their customers. You ring up to complain about some aspect of non-service and are often told “Oh, it doesn’t agree with you in the Database” or “the Database has got it wrong”. No it hasn’t, the person putting the information in the database got it wrong. I’ve been in the unusual situation of being told a lie where the database was given as the cause but I had access to that database. So I checked and the database was fine. It was being used as a convenient and much maligned excuse.

Very little is mentioned of the beneficial uses of databases.
For most of us our salary is processed via databases and it is a lot cheaper and more reliable than having half a hundred pay clerks doing it manually in pen and ink.
Databases are used to hold or index much of that vast quantity of stuff that you can search for on the net. Even the useful stuff on Klingons.
I for one would welcome a UK-wide database holding my basic medical details so that when I go to my GP or hospital, they do not need my memory (and in fact my consciousness) to tell them my medical past. If I have an allergy to a common drug I damned well want all medical people to know that before they put 10cc of the stuff into my veins.

And to wrap up my bad-tempered tirade: I now find it particularly tricky to talk about what I still feel is my most significant achievement in IT, namely an 80TB database of genetic information. Without getting into the topic of Bioethics, which is beyond the scope of this blog, Genetics and a lot of biological stuff is now painted grey, if not deep, murky, scary Red by the media. I tell John down the pub that I created a huge genetics database; he is sure I am either working on a secret government project to know all about his inner workings, or for some evil company combining tomatoes and monkeys into some awful new thing that {and he has seen the movies to prove this} in all likelihood will turn into a zombie killer, escape and do for mankind.

Maybe I’ll just tell people I shoplift for a living, it might be more socially acceptable than being involved in Databases or Genetics.
