jump to navigation

Stuck execution plans – purging a cursor July 3, 2009

Posted by mwidlake in performance.
Tags:
7 comments

I was at a UKOUG RAC/HA special interest group meeting last month. My favorite presentation of the day {out of what was a very good selection} was by Harald Van Breederode, an Oracle University DBA trainer from the Netherlands.  Harald’s presentation was on Checkpoints, a feature of Oracle that most DBAs tend to know a little about but never look into extensively. If you are a member of the UKOUG you can get the slides from here {If you are not a member, and you are in the UK {{or even Europe}}, nag your boss, get a membership – the conferences and meetings are a great source of info}.

Anyway, that is not the topic of this Blog. I finally downloaded a copy of the slides today and I checked out Harald’s blog. I immediately learnt something, which isthe topic of this blog.

In Oracle 10.2.0.4 (and upwards I assume) you can now flush a specific cursor out of the library cache using dbms_shared_pool.purge. You need to create the package first, it is not installed by default:-

@?/rdbms/admin/dbmspool

Package created.
Grant succeeded.
View created.
Package body created.

You also need to set an event to allow the purge to work. See metalink note 751876.1:
“The fix has gone into 10.2.0.4. However, the fix is event protected. You need to set the event 5614566, to make use of purge. Unless the event is set, purge will have no effect.”

alter session set events ‘5614566 trace name context forever';

Now that you have the feature installed you can find the address and hash_value of a cursor you want to purge with eg:

select address, hash_value from v$sqlarea
where sql_text = ‘select count(c2) from skew where c1 = :bind';
ADDRESS HASH_VALUE
——– ———-
27308318 2934790721

And purge it.

exec sys.dbms_shared_pool.purge(‘&address, &hash_value’,’c’)
PL/SQL procedure successfully completed.

For more comprehensive details check out Haralds original post.

Why would you want to flush a specific cursor out of the library cache? Because it is performing poorly and you want the CBO to have another go at coming up with a plan. Before this nice sys.dbms_shared_pool.purge function you had three other options to flush the cursor out.

  • You can flush the whole shared pool
    alter system flush shared_pool;
    You really want to avoid doing this on a live system, as it flushes all SQL, PL/SQL (functions, packages, procedures} and data dictionary information, which will now have to be read back in when next required. It’s like unloading a whole delivery truck just to throw out one letter and then reloading the truck.
  • Alter one of the objects referenced by the SQL statement, with some DDL. Some suggest doing a grant or revoke on a table, I used to try adding a table comment {to what it already was}. I’m still not that comfortable with that option on a live system as you are doing something to a live object.
  • Another option is to gather new stats on one of the objects referenced by the cursor, which is fine so long as there is a small object you can quickly execute a dbms_stats.gather_xxxx_statistics on.

So I’ll look more into this package as I think it is cleaner way to do it. Mind you, there is that setting of an event on a live system…

How does a statement get a poor plan that is going to be fixed simply by asking the CBO to parse it again?

In Harald’s posting he also covers a main reason as to why you would want to flush a sql cursor out of the shared pool. His worked example shows how a SQL statement with a bind variable is open to getting an execution plan the first time it is parsed which is suitable for that first value of the bind – but NOT suitable for most other executions. eg the first value passed to the bind is maybe a very common one matching a large percentage of the table and so an index is not used, but most executions of the statement are with a bind value that matches very few records, and thus the index lookup is the best plan. But once the sql statement is parsed, the plan will be used until the statement disappears out of the  shared pool.

Another cause of a poor plan is when the statistics for an object referenced by the SQL statement changes. The CBO sometimes just chooses a plan that is not good. You may be thinking that, in this case, the optimizer will simply come to the same plan if asked again. Maybe not.

An issue I kept encountering at one site was very interesting. One of a small handful of simple SQL statements would go rouge overnight. Usually about 2 or 3am in the morning. Instead of a plan using a a highly specific index and a couple of table joins, a very, very poor plan was used instead. It only ever occurred early in the morning and only when the automated statistics gathering job had gathered stats on one of the tables involved. It took a while to spot this as the SQL would usually go rogue a while after the stats on the relevant tables had been gathered. This is because SQL statements are not invalidated when the underlying segments have their stats re-gathered by the automated job, they are invalidated “a little later”. It seems on Oracle 10.2 to be within an hour of the gather but not always. {To be totally accurate, this delayed invalidation is due to the DBMS_STATS parameter “no invalidate” defaulting to the value DBMS_STATS.AUTO_INVALIDATE but it can be overridden if you wish}

What seemed to be happening, though I never 100% proved it {so I am guessing, do not take this as a fact} is that one table would have new stats and the range of values for a column would include recent data {let’s say values 1 to 10,000,000}. Another table had the old information and so the range of known values was different {1 to 9,200,000}. This discrepancy would cause the wrong plan to be chosen. {I have a wilder theory which is that the indexes for a table had a different range of values for a column as the table stats had, but there are problems with my theory}.

By the time I logged in to the system in the morning to fix the overnight performance problem, stats on all relevant tables had finished being gathered and prompting the code to re-parse was the solution.

That leads me to the last point {sorry, a very long and wordy post again}.

Sometimes you can’t purge the cursor. The execution plan is stuck. Why?

Let’s say you have a cursor that is executed say 100 times a minute. Normally each execution runs in less than 50 milliseconds. All is fine. It has now gone rogue and it is taking 5 seconds to run, as the execution plan is poor. Each execution completes but at any time there is always at least one session running the code, usually several.

A SQL cursor will not be flushed from the shared pool if it is in use. Thus this rogue cursor gets stuck in the SGA. You can’t alter any of the objects as you can never get an exclusive lock on them. You flush the shared pool in desperation {even on your live system} and the cursor stays there, as even alter system flush shared_pool will not flush out an in-flight cursor. You could try locking the table, but like the DML, you are likely never to get that lock.

In this situation you have to identify the application running the code, stop it and wait for all executions to finish. Not maybe an issue in my example of 5 seconds to complete, but I’ve had the same problem with code run every few minutes now taking over an hour, so we could not wait an hour for it to sort out. In that situation we also had to kill sessions.

Of course, stopping and starting the database will cure the problem but the business may not be too happy about the whole database being shut down. {This prompts a thought – I’ve never noticed this but I wonder if you get the same problem on RAC but only impacting one node?}

This new package may help with stuck execution plans in that you can just keep trying over and over again to flush the one cursor until you hit a moment when no current execution is running.

Follow

Get every new post delivered to your Inbox.

Join 156 other followers