
OUGN 2012 Third Day March 26, 2012

Posted by mwidlake in Meeting notes, Private Life.
7 comments

The last day of the three (and second on the ferry and of the conference proper) had a lot of talks I wanted to see, especially Dan Morgan talking about “Insanely large Databases”. It was a good talk, with an interesting interlude when a very loud announcement let us know we had docked at Kiel. Dan handled it with aplomb. Dan was talking about one specific experience he had suffered through recently and he covered quite a few things I did and some I had planned to but never got to – but he had more technical detail about the issues he had encountered, so all in all his talk was probably of more immediate use to the audience than mine. It was a very good session. I have to confess, there were times I laughed out loud as memories flooded in, prompted by his wise words – I fear others may have interpreted this differently but, honestly, I was laughing tears of shared pain.

I was also looking forward to seeing Uwe Hesse talk about Dataguard. I’d had the pleasure of spending a lot of time and a few beers chatting with Uwe over the last few days. His presentation was very well done (as it should be, he’s a professional trainer! He exceeded my expectations, though). And I loved the last bit, where he demonstrated how, under 11G R2 (R1 as well???), if you have a physical standby, a block corruption can be fixed “on the fly” and invisibly to the end user. I just love that feature and, though I knew about it already, seeing it demonstrated and the errors appearing in the alert log – though the user query runs fine – was sweet.

The rest of the sessions I saw were also good {Maria Colgan on preventing sub-optimal plans, which was mostly about avoiding implicit data conversions – something I think all developers and designers should have drummed into their heads with rubber hammers; Doug Burns on those OEM performance graphs which continue to get better and better} – but I had to give in and go for a sleep. These anti-seasickness pills seem to work but make me dozy. I’d love it if those anti-travel-sickness pills were really placebos and I had a placebo side effect :-)

The last day was rounded off with a nice meal and one or two (or three, or four) beers in a bar and some excellent times. I of course spent time with the Other Martins (we could not disband the cluster too easily), Doug, Holger, Maria, our Norwegian hosts and many more of the other people there. If only I had managed to fit in the other 10, 15 people I wanted to see but I’m getting old and I was very, very, very tired.

I have to say, it was one of the best conferences I have ever been to. OUGN 2013? Get yourself on that boat.

Sail Away, Sail Away, Sail Away March 20, 2012

Posted by mwidlake in Meeting notes, VLDB.
1 comment so far

I’m just doing some last minute preparation for the Norwegian User Group Spring Seminar. This is quite a large conference with 5 concurrent streams and a very good line-up of presenters. You can see the agenda here.

What is also a little unusual about this conference is that it is mostly on a boat, or rather a ship. When I was first asked if I would put forward a talk or two I declined – as I am very prone to being sea-sick. Truls Bergersen, who helps organize the event, got in touch with me a few weeks later and said “we are doing a day on land first, do you want to present then?”. Sure! That would be great! So I am, I’ll be doing my talk on Index Organized Tables once more {and then I think I’ll put it to bed for a couple of years}.

Now, I am not exactly sure what made me agree to this, but then Truls asked if I would consider doing another talk I had suggested, on VLDBs (Very Large DataBases), the following day. Yes, why not?

Only, the next day we are on the ship. I’m a bit nervous about this. I don’t think that a good presentation should include the presenter going green and dashing for the door (though I did have a similar experience at Oracle Open World once, as a result of an all-you-can-eat Chinese meal the day before, but as it affected “the other end of things” you really don’t want details of that).

Thankfully, I’ve not been worrying about sea-sickness for the last couple of weeks. That will be because my laptop hard disk died and I lost everything on my still-new machine. It was still so new that I had not started backing it up properly – after all, there was nothing on there I could not replace easily.

It has been a real trial to replace all those things that I could easily replace. At least I had the presentations and some other critical bits on my temporary USB stick backup…

Database Sizing – How much Disk do I need? (The Easy Way) November 11, 2010

Posted by mwidlake in Architecture, development, VLDB.
7 comments

How much Disk do I need for my new Oracle database? Answer:-

  • 8-10 times the volume of raw data for an OLTP system
  • 2-4 times the raw data volume for a Data Warehouse.
  • The bigger the database, the nearer you will be to the lower multiplication factors.

{Disclaimer. This is of course just my opinion, based on some experience. If you use the above figures for a real project and get the total disc space you need wrong, don’t blame me. If you do and it is right, then of course you now owe me a beer.}

Many of us have probably had to calculate the expected size of a database before, but the actual database is only one component of all the things you need to run the Oracle component of your system. You need to size the other components too – archived redo logs, backup staging area, data load staging area, external files, the operating system, swap space, the Oracle binaries {which generally get bigger every year but shrink in comparison to the average size of an Oracle DB} etc…

In a similar way to my thoughts on how much database space you need for a person, I also used to check the total disk space taken up by every database I created and every one I came across. {A friend emailed me after my earlier posting to ask if I had an obsession about size. I think the answer must be “yes”}.

First of all, you need to know how much “raw data” you have. By this I mean what will become the table data. Back in the early 90’s this could be the total size of the flat files the old system was using, or even the size of the data as it sat in spreadsheets. An Oracle export file of the system gives a pretty good idea of the raw data volume too. Lacking all of these, you need to roughly size your raw data yourself. Do a calculation of “number_of_rows*sum_of_columns” for your biggest 10 tables (I might blog more on this later). Don’t be tempted to overestimate, my multipliers allow for the padding.
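{As an aside, if the old system happens to already be an Oracle database with reasonably fresh statistics, a quick dictionary query gets you to much the same number. This is only a sketch of the rows-times-columns idea above, and the output is a rough estimate, nothing more:}

select * from (
  select table_name
        ,num_rows
        ,avg_row_len
        ,round((num_rows*avg_row_len)/1024/1024/1024,1) approx_gb
  from   user_tables
  where  num_rows    is not null
  and    avg_row_len is not null
  order  by num_rows*avg_row_len desc
)
where rownum <= 10;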

Let us say you have done this and it is 60GB of raw data for an OLTP system. Let the storage guys know you will probably want about 500GB of space. They will then mentally put it down as “of no consequence” as, if you have dedicated storage guys, you probably have many terabytes of storage. {Oh, I should mention that I am not considering redundancy at all, just the space that is provided. The amount of actual spinning disk is down to the level and type of RAID your storage guys make you use. That is a whole other discussion}.

If you come up with 5TB of raw data for a DW system then you need around 12-15TB of disk storage.

If you come up with more than a Terabyte or so of raw data for an OLTP system, or 10 to 20 Terabytes for a DW, then when you give your figures to the storage guys/procurement people they may well go pale and say something like “you have got to be kidding!”. This is part of why the multiplication factor for Data Warehouses, and larger systems in general, is less, as you are forced to be more careful about the space you allocate and how you use it.

The overhead of total disk space over Raw data reduces as the database gets bigger for a number of reasons:

  • The size of the Oracle binaries and the OS does not change as the database gets bigger.
  • The size of swap space does not increase in line with the database as, generally speaking, if you increase the database size from 100GB to 1TB you do not have the luxury of increasing the system memory of your server tenfold. It probably just doubles.
  • Very large databases tend to have something making them big, like images or embedded documents, which are not indexed. Thus the ratio of table segments to index segments increases.
  • If you have a very large database you start removing indexes (often those that support constraints) to aid performance of data load and management, again improving the ratio of table segments to index segments.
  • Backups become partial or incremental to reduce the size and duration of the backup.
  • As mentioned before, the sheer size of the system is such that you just take more care over cleaning up work areas, paring down the archived redo log areas (those files do compress well) and other areas.
  • If things get extreme, or you have been doing this for donkey’s years {note to non-UK people, this means many, many years}, you start altering PCTFREE and checking over extent sizes.

My best ever ratio of database size to raw data was around 1.6 and it took an awful lot of effort and planning to get there. And an IT manager who made me very, very aware of how much the storage was costing him (it is not the disks, it’s all the other stuff).

I should also just mention that the amount of disk you need is only one consideration. If you want your database to perform well you need to consider the number of spindles. After all, you can create a very large database indeed using a single 2TB disc – but any actual IO will perform terribly.

Saturday Philosophy – The unbelievably small world of VLDBs June 12, 2010

Posted by mwidlake in VLDB.
9 comments

Yesterday I posted about the potential for an Oracle in Science community within the UK Oracle user group {and wider for that matter, there is after all a world Oracle Life Science community but it is currently less vibrant than it was, sadly}.

My friend and occasional drinking partner Peter Scott replied to say he felt there was “a place for a SIG for stonking great databases” {now wouldn’t SGDB be a better TLA than VLDB? :-) }.

Well, I would agree but for one small detail. An apparent lack of anyone willing to be part of the community.

When I was building a very considerable VLDB {and I’m sorry I keep going on about it, I’ll try and stop soon} back in the early to mid 2000’s I seemed to be working in a vacuum of information, let alone prior experience. Yes, there was stuff in the Oracle manuals about how big things could theoretically be made and some vague advice on some aspects of it, but an absolute lack of any visible Oracle customers with anything even approaching the sizes I was contemplating. 2TB was about the limit and I was already way beyond that. Was this because I really was pushing the boundaries of database size? Well, I have since found out that whilst I was up there just behind the leading edge, there were several databases much, much bigger than mine, and others already envisioned that might hit the Petabyte level, never mind mere Terabytes.

The thing is, no one would speak about them. At all.

We were left to do it all pretty much from scratch and it would not have been possible if I had not spent years building up VLDBs as the definition of VLDB size increased, plus of course cracking support from the other DBAs and Systems Admins around me. And to be fair, Oracle Corp helped us a lot with our efforts to build these massive databases. Interestingly, one Oracle Consultant would regularly tell me that our systems really were not so unusually big and there were plenty larger. He usually said this when I asked, exasperatedly as something else failed to scale, if Oracle had ever tested things at this level :-). But despite constantly asking to meet these people with massive systems, so we could exchange war stories and share advice, and being promised such contacts by Oracle, they never materialized – except for CERN, who we already talked to as a fellow scientific organisation, and Amazon, who it turns out did things in a very different way to us {but it was really good to talk to them and find out how they did do their big databases, thanks guys}. Both were at the same scale or just behind where we were.

This is because most of the people with massive Oracle databases will not talk about them, as they are either run by the largest financial organisations, are to do with defence, or are in some other way just not talked about. In his comment Peter refers to a prior client with an OLTP-type system that is now around the PB scale. I would be pretty sure Peter can’t say who the client is or give any details about how the system was designed.

So although I think there is a real need for a “stonking great databases” forum, I think there is a real problem in getting a user community of such people/organisations together. And if you did, none of the members would be allowed to say much about how they achieved it, so all you could do would be to sit around and brag about who has the biggest. There is an Oracle community about such things, called the Terabyte Club, but last I knew it was invite-only and, when I managed to get invited, it turned out that mine was the biggest by a considerable margin, so I was still not meeting these elusive groups with 500TB databases. Maybe there is an Oracle-supported über database society but, as I never signed the official secrets act, I might not have been eligible to play.

If I am wrong and anyone does form such a user group (or is in one!) I would love to be a member and I would strive to present and help.

I’ll finish with what appears to be a contradiction to what I have just written. There already is a UKOUG User Group that deals with large systems and I chair it – the Management and Infrastructure SIG {sorry, the info on the web page could do with some updating}. Part of what we cover is VLDBs. But we also cover Very Many DataBases (companies with thousands of instances) and Very Complex DataBases, plus how you go about the technical and management aspects of working in a massive IT infrastructure. It might be that we could dedicate a meeting to VLDBs and see how it goes, but I know that whilst many who come along are dealing with databases of a few TB, no one is dealing with hundreds-of-TB or PB databases. Either that or they are keeping quiet about it, which takes us back to my main point. The MI SIG is probably the closest thing to a VLDB SIG we have in Europe though, and is a great bunch of people, so if you have a VLDB and want to meet some fellow sufferers, we have our next meeting on 23rd September in the Oracle City office.

DBMS SIG March 10, 2010

Posted by mwidlake in internals, Meeting notes, performance.
5 comments

I went to the DBMS SIG today {DBMS Special Interest Group meeting of the UK Oracle User Group}. Don’t worry, I am not going to run through the set of presentations and make comments on them – although I like reading such entries by others on their blogs, I generally like them due to the style of writing as opposed to getting information out of them. But there is the odd tidbit that can be worth disseminating to as wide an audience as possible and I like to do my bit.

Having said I won’t go through all the talks… :-). No, I just want to comment that all the talks had merit at this meeting and, quite rightly, the meeting was full. This is nice to see as last year SIG attendance fell across the board, due to the “economic climate”. Very daft, in my opinion, as I see SIGs as free training, plus networking opportunities (sorry to use the “networking” word), plus different viewpoints, all of which are invaluable at any time and especially valuable when things are hard.

Go to SIGs (or whatever is available in your country) as it is always worth the day invested. Where else can you see half a dozen experts give opinions and also corner them and ask them more stuff as well?

Anyway, the tidbits. First, Jonathan Lewis demonstrated how he goes about solving issues with complex SQL statements. It was terribly fiendish and extremely clever and needs incredible in-depth knowledge of the Oracle RDBMS… He draws a pseudo Entity Relationship Diagram of the tables involved, adds in some figures on what filtering will achieve and what ratio of records will be involved in table joins and asks the question “what will probably happen” a couple of dozen times. Yes, I lied, it does not need vast knowledge. It needs a clear, simple approach to solving the problem. And an ERD. I love ERDs and rue their general demise. I use exactly the same method myself to investigate complex SQL performance issues {I think I could be accused of trying to ride on shirt-tails here, but honestly, the step I take if an Explain Plan does not help me is to ERD the statement and look at the indexes and ratios between tables to see how I, as a human, would solve the query. Chatting to a few other old lags, it is a relatively universal approach among those of us who have used ERDs}. If you are a member of the UKOUG I strongly recommend downloading the slides to Jonathan’s talk. If you are not a member, maybe Jonathan will present it again at another venue, or you could get him to come along and do a course on tuning. {Jonathan, if you get a gig as a result of this, I want a pint of Marston’s Pedigree, OK?}

{And thanks to Sean Malloy for commenting to provide a link to a published version of Jonathan’s method – Jonathan did mention this, highlighting the fact that it is his first real foray into SQL*Server. However, the method is database agnostic. This is the article}

Second tidbit. Adrian Dodman and Owen Ireland (who both look remarkably like Hollywood heartthrobs in their pictures, though they are rather different in the flesh – very decent chaps they are too) did an excellent talk on VLDB physical standbys, a topic that has particular resonance for me. They mentioned parallel_execution_message_size. This defaults to 2k, on 10g at least. It is a rubbish setting. No, let me not beat about the bush, it is an utterly rubbish setting. If you use parallel query, parallel recovery or parallel anything-at-all, check out this parameter and, due diligence allowing, increase it. Try 8k as opposed to 2k and even 16k. The manual on it says the default of 2k/4k is fine. It ain’t. Increasing the value just takes some memory out of the shared pool and, these days, if you can’t afford a few extra KB out of your shared pool, you need to replace your server with something costing about twice as much as a top-end desktop PC. {Why am I so vigorous in my opinion on this parameter? Well, I had a situation a few months back of trying to migrate a database to a new geographic location for a client in Germany. We did a backup/recovery type operation to do this. Applying the redo logs was proving to be a performance issue so Oracle Corp suggested parallel redo log application. It ran a LOT slower than single thread, about 300% slower. However, increasing the parallel_execution_message_size from 2k to 8k made the parallel log application about 400% faster than single thread. ie a dozen times faster. I know from presentations by Christian Antognini and discussions with others that it is a key parameter to getting parallel query to perform well too.}
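{Checking and changing the parameter is simple enough. The below is just an illustrative sketch – the 16k value is not a recommendation, and as the parameter is static you need to set it in the spfile and restart the instance:}

show parameter parallel_execution_message_size

alter system set parallel_execution_message_size=16384 scope=spfile;
-- static parameter, so bounce the instance for the new value to take effect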

Last tidbit. Don’t drop the OUTLN user. Yes, I know, why would you? Just don’t, OK? Especially on Oracle 11. If you do, for whatever reason, DO NOT SHUT DOWN THE DATABASE. Call Oracle Support and pray. Thanks go to Peter Mahaffey for that one. Yes he did. It all went terribly wrong for him.

Ensuring Correlated Partition Exclusion #2 December 20, 2009

Posted by mwidlake in performance, Uncategorized, VLDB.
2 comments


A few days ago {oh dear, it is now two weeks!} I showed how you could better control Correlated Partition Exclusion by creating a lookup table that showed, for each partition, the minimum and maximum ID (the partition key) and the minimum and maximum CRE_DATETIME (what you want to limit your query on). Using this range table in the way I described, you did not have to take an educated guess at the range of IDs that would include the partitions for the date range you were interested in – you could check the table.

But this only worked reliably where the ID and CRE_DATETIME increase directly in proportion to each other. It was a special case.

What about the more normal case, where there is a correlation between the two columns but it is not a perfect correlation? This would happen, for example, if the CRE_DATETIME is entered from another system, or from paper records, where the order of the records is not enforced. So some records from today get put into the system before some from yesterday. Or if the correlation is even looser than this, e.g. you are looking at orders for new customers. You “know” there are going to be no orders for these customers from 5 years ago but you are not sure how far back in the orders table you should go to find what you want.

You can still use the lookup table method. The lookup table in effect becomes a meta-index – an index of the segments where you will find the data, but not of the actual rows.

To demonstrate this, I created a table where the ID and CRE_DATETIME increase in order:

-- create a partitioned table
create table test_p4
(id           number(10) not null
,cre_datetime date not null
,status        number(1) not null
,num_1         number(4) not null -- random 20
,num_2         number(4) -- random 500
,num_3         number(5) -- cycle smoothly
,num_4         number(5) -- Random 10000
,vc_1          varchar2(10)
,vc_2          varchar2(10)
,vc_pad        varchar2(2000))
tablespace users
partition by range (id)
(partition id_01k values less than (1000)
tablespace users
,partition id_02k values less than (2000)
tablespace users
...
,partition id_45k values less than (45000)
tablespace users
,partition id_max values less than (maxvalue)
tablespace users
)
--
-- local partitioned indexes on the table
 IND_NAME           TAB_NAME           PSN       COL_NAME
 ------------------ ------------------ --------- ------------
 TP4_CRE_DT         TEST_P4            1         CRE_DATETIME

 TP4_PK             TEST_P4            1         ID
--
-- populate the table with data
insert into test_p4(id,cre_datetime,status,num_1,num_2,num_3,num_4
                  ,vc_1,vc_2,vc_pad)
select rownum
,to_date('01-JUL-2009','DD-MON-YYYY')+(rownum/360)
,decode(mod(rownum,100),0,1
              ,0)
,trunc(dbms_random.value(1,20)) 
,trunc(dbms_random.value(1,50)) 
,mod(rownum,10)+1
,trunc(dbms_random.value(1,10000))
,dbms_random.string('U',5)
,lpad(chr(mod(rownum,6)+65),5,chr(mod(rownum,6)+65) )
,lpad('A',1000,'A')
from dba_objects
where rownum <43000

I then messed with the data, updating 10% of records and setting the CRE_DATETIME to plus or minus a random amount of up to 2 days, so the date ranges held in adjacent partitions would overlap. I then created a range table in the same way as I did for the previous post.
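{The messing about was along these lines – a sketch rather than the exact statement, nudging roughly one row in ten by up to two days either way:}

update test_p4
set    cre_datetime = cre_datetime + dbms_random.value(-2,2)
where  mod(id,10) = 0
/
commit;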

I ended up with a range table like the below:

MIN_CRE_DATI      MAX_CRE_DATI          MIN_ID     MAX_ID
----------------- ----------------- ---------- ----------
29-JUN-2009 20:35 05-JUL-2009 14:24          1        999
02-JUL-2009 04:50 07-JUL-2009 12:56       1000       1999
05-JUL-2009 02:34 10-JUL-2009 08:31       2000       2999
08-JUL-2009 05:23 13-JUL-2009 03:32       3000       3999
11-JUL-2009 08:07 15-JUL-2009 18:41       4000       4999
14-JUL-2009 00:14 18-JUL-2009 18:27       5000       5999
16-JUL-2009 08:58 21-JUL-2009 14:18       6000       6999
19-JUL-2009 01:28 24-JUL-2009 11:21       7000       7999
22-JUL-2009 08:02 27-JUL-2009 07:01       8000       8999
24-JUL-2009 22:06 30-JUL-2009 05:37       9000       9999
28-JUL-2009 04:59 01-AUG-2009 10:57      10000      10999
...
24-SEP-2009 01:52 28-SEP-2009 18:36      31000      31999
26-SEP-2009 16:49 01-OCT-2009 01:26      32000      32999
29-SEP-2009 13:20 04-OCT-2009 13:43      33000      33999
02-OCT-2009 08:40 07-OCT-2009 10:11      34000      34999
05-OCT-2009 04:29 10-OCT-2009 04:09      35000      35999
08-OCT-2009 02:04 12-OCT-2009 17:34      36000      36999
10-OCT-2009 20:03 15-OCT-2009 08:39      37000      37999
13-OCT-2009 15:09 18-OCT-2009 12:01      38000      38999
16-OCT-2009 06:49 21-OCT-2009 03:53      39000      39999
18-OCT-2009 20:16 23-OCT-2009 21:01      40000      40999
21-OCT-2009 12:10 26-OCT-2009 07:13      41000      41999
25-OCT-2009 01:29 29-OCT-2009 20:56      42000      42999

You can see that the MIN_CRE_DATI-MAX_CRE_DATI from record to record (partition to partition) overlap but generally increase.

How do you find the start and end of the ID range to cover a date period you are interested in? I always have to sit down with pen and paper to work this out. It is the classic “overlapping ranges” check, but my brain cannot hold on to it. So here goes. I want all records between the 1st of October and the 6th, this year.

You want to find the partition with the lowest ID range that has a record falling into the date range you want, ie its maximum CRE_DATETIME is as late as or later than the start of the range you are interested in. If the maximum CRE_DATETIME is earlier than the date range you are interested in, no records of interest will be in that partition.

Here is the code to find the lowest candidate partition:

select min(min_id)
from tp4_range 
where MAX_CRE_DATI>=to_date('01-OCT-2009','DD-MON-YYYY')
MIN(MIN_ID)
-----------
      32000

Execution Plan
----------------------------------------------------------
| Id  | Operation                    | Name           | Rows  | Bytes | Cost 
----------------------------------------------------------
|   0 | SELECT STATEMENT             |                |     1 |    12 |     2
|   1 |  SORT AGGREGATE              |                |     1 |    12 |         
|   2 |   TABLE ACCESS BY INDEX ROWID| TP4_RANGE      |    12 |   144 |     2
|*  3 |    INDEX RANGE SCAN          | TP4R_MACD_MIAD |    12 |       |     1
----------------------------------------------------------

To cover the whole of the partition of interest you need to look for records with an ID greater than the minimum in that partition, thus the selection of the min(min_id).

Similarly, you only want partitions where the minimum CRE_DATETIME is before the end of the date range you want. If all records in the partition have a CRE_DATETIME beyond the range, you are not interested in it.

select max(max_id)
             from tp4_range where MIN_CRE_DATI<=to_date('06-OCT-2009','DD-MON-YYYY')
MAX(MAX_ID)
-----------
      35999

Execution Plan
----------------------------------------------------------
| Id  | Operation                    | Name           | Rows  | Bytes | Cost 
----------------------------------------------------------
|   0 | SELECT STATEMENT             |                |     1 |    13 |     2
|   1 |  SORT AGGREGATE              |                |     1 |    13 |         
|   2 |   TABLE ACCESS BY INDEX ROWID| TP4_RANGE      |    37 |   481 |     2
|*  3 |    INDEX RANGE SCAN          | TP4R_MICD_MIID |    37 |       |     1
----------------------------------------------------------

Then you put these sub-queries into the query to select the data you want. Below I show the “original” code, which does not use the partition range table, and then the version that does, to prove the same number of records come back and to see how the plan and statistics change:

select count(*) from test_p4
where cre_datetime between to_date('01-OCT-2009','DD-MON-YYYY')
                   and     to_date('06-OCT-2009','DD-MON-YYYY')
and num_2 = 5

  COUNT(*)
----------
        48

Execution Plan
------------------------------------------------------------------------------------------------------------------
| Id  | Operation                           | Name       | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |            |     1 |    11 |   356   (1)| 00:00:02 |       |       |
|   1 |  SORT AGGREGATE                     |            |     1 |    11 |            |          |       |       |
|   2 |   PARTITION RANGE ALL               |            |    36 |   396 |   356   (1)| 00:00:02 |     1 |    46 |
|*  3 |    TABLE ACCESS BY LOCAL INDEX ROWID| TEST_P4    |    36 |   396 |   356   (1)| 00:00:02 |     1 |    46 |
|*  4 |     INDEX RANGE SCAN                | TP4_CRE_DT |  1788 |       |    52   (0)| 00:00:01 |     1 |    46 |
------------------------------------------------------------------------------------------------------------------
Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
        676  consistent gets
          0  physical reads

Note the 676 consistent gets and the Pstart/Pstop of 1-46

select count(*) from test_p4
where cre_datetime between to_date('01-OCT-2009','DD-MON-YYYY')
                   and     to_date('06-OCT-2009','DD-MON-YYYY')
and id> (select min(min_id)
             from tp4_range where MAX_CRE_DATI
                 >=to_date('01-OCT-2009','DD-MON-YYYY'))
and id < (select max(max_id)
             from tp4_range where MIN_CRE_DATI
                <=to_date('06-OCT-2009','DD-MON-YYYY'))
and num_2 = 5

  COUNT(*)
----------
        48

Execution Plan
--------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                            | Name           | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
--------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                     |                |     1 |    16 |    66   (5)| 00:00:01 |       |       |
|   1 |  SORT AGGREGATE                      |                |     1 |    16 |            |          |       |       |
|   2 |   PARTITION RANGE ITERATOR           |                |     1 |    16 |    62   (5)| 00:00:01 |   KEY |   KEY |
|*  3 |    TABLE ACCESS BY LOCAL INDEX ROWID |TEST_P4         |     1 |    16 |    62   (5)| 00:00:01 |   KEY |   KEY |
|   4 |     BITMAP CONVERSION TO ROWIDS      |                |       |       |            |          |       |       |
|   5 |      BITMAP AND                      |                |       |       |            |          |       |       |
|   6 |       BITMAP CONVERSION FROM ROWIDS  |                |       |       |            |          |       |       |
|   7 |        SORT ORDER BY                 |                |       |       |            |          |       |       |
|*  8 |         INDEX RANGE SCAN             | TP4_PK         |  1788 |       |     2   (0)| 00:00:01 |   KEY |   KEY |
|   9 |          SORT AGGREGATE              |                |     1 |    12 |            |          |       |       |
|  10 |           TABLE ACCESS BY INDEX ROWID| TP4_RANGE      |    12 |   144 |     2   (0)| 00:00:01 |       |       |
|* 11 |            INDEX RANGE SCAN          | TP4R_MACD_MIAD |    12 |       |     1   (0)| 00:00:01 |       |       |
|  12 |          SORT AGGREGATE              |                |     1 |    13 |            |          |       |       |
|  13 |           TABLE ACCESS BY INDEX ROWID| TP4_RANGE      |    37 |   481 |     2   (0)| 00:00:01 |       |       |
|* 14 |            INDEX RANGE SCAN          | TP4R_MICD_MIID |    37 |       |     1   (0)| 00:00:01 |       |       |
|  15 |       BITMAP CONVERSION FROM ROWIDS  |                |       |       |            |          |       |       |
|  16 |        SORT ORDER BY                 |                |       |       |            |          |       |       |
|* 17 |         INDEX RANGE SCAN             | TP4_CRE_DT     |  1788 |       |    52   (0)| 00:00:01 |   KEY |   KEY |
--------------------------------------------------------------------------------------------------------------------------

Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
        350  consistent gets
          0  physical reads

The plan is admittedly more complex and the SQL can be tricky to understand if you are not used to code that looks for overlapping ranges. But consistent gets are down to 350 and the Pstart/Pstop values are KEY-KEY. The CBO cannot tell you WHAT the ranges will be when the code is parsed {I think it should check but, in 10.2.0.3 at least, the CBO is not that smart at parse time}, but it knows there will be a start and a stop.

I am using tiny partitions for my example and only 45 of them for the table. When the partitions are for millions of rows and there are a couple of thousand of them, excluding partitions in a manner you can rely on is a powerful tool. Which leads onto a final word of caution.

Nothing is enforcing that the range table is maintained.

You could do things with triggers or regular re-calculation of the ranges table but this will be something you need to consider if you use ranges tables to help partition exclusion. Flipping tablespaces to read-only can help the worry go away though… :-)
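{As a flavour of what such a re-calculation could look like – a sketch only, which assumes the 1,000-IDs-per-partition layout used in this example and a bind variable :affected_id holding the ID of the row you have just changed:}

merge into tp4_range r
using (select min(cre_datetime) min_cre_dati
             ,max(cre_datetime) max_cre_dati
             ,min(id)           min_id
             ,max(id)           max_id
       from   test_p4
       where  id >= trunc(:affected_id/1000)*1000
       and    id <  (trunc(:affected_id/1000)*1000)+1000) n
on (r.min_id = n.min_id)
when matched then update set
       r.min_cre_dati = n.min_cre_dati
      ,r.max_cre_dati = n.max_cre_dati
      ,r.max_id       = n.max_id
when not matched then insert
      (min_cre_dati,max_cre_dati,min_id,max_id)
      values (n.min_cre_dati,n.max_cre_dati,n.min_id,n.max_id);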

As an example of the issue of needing to maintain the ranges table, and also a demonstration that the ranges table does work correctly if maintained, I’ll update a record well outside of the “expected window”, show that it does not appear in my range-check controlled code, then update the ranges table and try again.

update test_p4
set cre_datetime=to_date('02-OCT-2009 11:15','DD-MON-YYYY HH24:MI')
   ,num_2=5
where id=22100
/

-- run at this point, the range-controlled query still returns 48: the modified
-- record sits outside the ID window derived from TP4_RANGE.
-- After re-computing the TP4_RANGE row for the partition holding ID 22100,
-- the same query picks it up:

select count(*) from test_p4
where cre_datetime between to_date('01-OCT-2009','DD-MON-YYYY')
                   and     to_date('06-OCT-2009','DD-MON-YYYY')
and id> (select min(min_id)
             from tp4_range where MAX_CRE_DATI>=to_date('01-OCT-2009','DD-MON-YYYY'))
and id < (select max(max_id)
             from tp4_range where MIN_CRE_DATI<=to_date('06-OCT-2009','DD-MON-YYYY'))
and num_2 = 5

  COUNT(*)
----------
        49
-- success

I wonder if I will do the next posting on this topic in less than two weeks!

Ensuring Correlated Partition Exclusion December 7, 2009

Posted by mwidlake in performance, Uncategorized, VLDB.
9 comments

I’ve posted a couple of times recently about a tuning method with partition exclusion where you infer a relationship between the value you want to limit a query on and the partition key. It takes a while to explain the theory so I am going to give it a name, in the vain hope it catches on {forgive me if someone has already done this – in fact, just flame me with pointers if you already know a name for this}. At the very least, for my own postings, I can use this name from now on and link back to a single posting explaining it.

I’m going to call it Correlated Partition Exclusion. In essence, you have partitioned the table on a key, in my case the primary key ID, which is an ascending numeric probably sourced from a sequence.
You have a second column, in my case CRE_DATETIME, which increases in line with the PK. If you limit your query on CRE_DATETIME, partition exclusion is not possible as there is no guarantee which CRE_DATETIME values appear in which partition. But, as a human, you understand that if you create 10,000 records a day and you want to look at the last week’s data, you can use:

WHERE ID > MAX_ID-(7*10000)

to exclude partitions holding IDs from more than 7 days’ worth of records ago.
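As a concrete sketch of that pseudocode {the ORDERS table and its columns are made-up names, and the 10,000-rows-a-day figure is the assumption doing all the work}:

-- at roughly 10,000 new rows a day, only partitions holding the last
-- 70,000 IDs need to be visited
select count(*)
from   orders
where  cre_datetime > sysdate-7
and    id > (select max(id)-(7*10000) from orders);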

So, you have your Correlated Partition Exclusion.

How can you be sure that going back 70,000 IDs is going to safely cover one week of data and how can you maximise your efficiency of including only partitions that cover the date range? {note, I am using a numeric ID as my partition range key and a datetime as my correlated column, this principle works just as well if you partition on datetime and want to correlate to a (generally) ascending numeric key}

Here is my test table :-

create table test_p3
(id number(10) not null
,cre_datetime date not null
,status number(1) not null
,num_1 number(4) not null -- random 20
,num_2 number(4) -- random 500
,num_3 number(5) -- cycle smoothly
,num_4 number(5) -- Random 10000
,vc_1 varchar2(10)
,vc_2 varchar2(10)
,vc_pad varchar2(2000))
tablespace users
partition by range (id)
(partition id_01k values less than (1000)
tablespace users
,partition id_02k values less than (2000)
tablespace users
,partition id_03k values less than (3000)
tablespace users
,partition id_04k values less than (4000)
tablespace users
...
,partition id_45k values less than (45000)
tablespace users
--
,partition id_max values less than (maxvalue)
tablespace users
)
/
--@ind_cols
IND_NAME           TAB_NAME           PSN       COL_NAME
------------------ ------------------ --------- --------------------
TP3_PK             TEST_P3            1         ID

TP_CRE_DT          TEST_P3            1         CRE_DATETIME

If I want to look for all records between two dates I could use code like the below (based on that suggested by Bernard Polarski; any mistakes are mine).

with get_min_id as
(select max(id) min_id from test_p3
where cre_datetime < TO_DATE('18-OCT-2009','DD-MON-YYYY') )
,get_max_id as
(select min(id) max_id from test_p3
where cre_datetime > TO_DATE('20-OCT-2009','DD-MON-YYYY') )
select count(*)
from test_p3
,get_min_id
,get_max_id
where id >get_min_id.min_id
and id < get_max_id.max_id
and cre_datetime between TO_DATE('18-OCT_2009','DD-MON-YYYY')
and TO_DATE('20-OCT_2009','DD-MON-YYYY')
/
COUNT(*)
----------
721

1 row selected.

ie find the highest ID from just before the start date and the lowest ID from just after the end date, and query for IDs between them.

This works fine. Unfortunately, you get a plan like the below

----------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                            | Name             | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
----------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                     |                  |     1 |    39 |   887   (4)| 00:00:04 |       |       |
|   1 |  SORT AGGREGATE                      |                  |     1 |    39 |            |          |       |       |
|   2 |   NESTED LOOPS                       |                  |     2 |    78 |   887   (4)| 00:00:04 |       |       |
|   3 |    NESTED LOOPS                      |                  |     1 |    26 |   717   (5)| 00:00:03 |       |       |
|   4 |     VIEW                             |                  |     1 |    13 |   505   (5)| 00:00:02 |       |       |
|*  5 |      FILTER                          |                  |       |       |            |          |       |       |
|   6 |       SORT AGGREGATE                 |                  |     1 |    26 |            |          |       |       |
|*  7 |        VIEW                          | index$_join$_001 | 38393 |   974K|   505   (5)| 00:00:02 |       |       |
|*  8 |         HASH JOIN                    |                  |       |       |            |          |       |       |
|   9 |          PARTITION RANGE ALL         |                  | 38393 |   974K|   418   (6)| 00:00:02 |     1 |    46 |
|* 10 |           INDEX RANGE SCAN           | TP_CRE_DT        | 38393 |   974K|   418   (6)| 00:00:02 |     1 |    46 |
|  11 |          PARTITION RANGE ALL         |                  | 38393 |   974K|   152   (4)| 00:00:01 |     1 |    46 |
|  12 |           INDEX FAST FULL SCAN       | TP3_PK           | 38393 |   974K|   152   (4)| 00:00:01 |     1 |    46 |
|  13 |     VIEW                             |                  |     1 |    13 |   212   (5)| 00:00:01 |       |       |
|  14 |      SORT AGGREGATE                  |                  |     1 |    26 |            |          |       |       |
|* 15 |       VIEW                           | index$_join$_002 |  2972 | 77272 |   212   (5)| 00:00:01 |       |       |
|* 16 |        HASH JOIN                     |                  |       |       |            |          |       |       |
|  17 |         PARTITION RANGE ALL          |                  |  2972 | 77272 |    59   (6)| 00:00:01 |     1 |    46 |
|* 18 |          INDEX RANGE SCAN            | TP_CRE_DT        |  2972 | 77272 |    59   (6)| 00:00:01 |     1 |    46 |
|  19 |         PARTITION RANGE ALL          |                  |  2972 | 77272 |   152   (4)| 00:00:01 |     1 |    46 |
|  20 |          INDEX FAST FULL SCAN        | TP3_PK           |  2972 | 77272 |   152   (4)| 00:00:01 |     1 |    46 |
|  21 |    PARTITION RANGE ITERATOR          |                  |     2 |    26 |   170   (1)| 00:00:01 |   KEY |   KEY |
|* 22 |     TABLE ACCESS BY LOCAL INDEX ROWID| TEST_P3          |     2 |    26 |   170   (1)| 00:00:01 |   KEY |   KEY |
|* 23 |      INDEX RANGE SCAN                | TP_CRE_DT        |   707 |       |    48   (0)| 00:00:01 |   KEY |   KEY |
----------------------------------------------------------------------------------------------------------------------------
Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
892 consistent gets
0 physical reads

If you look at the plan {towards the end, and sorry about the naff layout, my blog is not wide enough for this sort of printout} there are two checks on the TP_CRE_DT index that scan all partitions of the index – Pstart/Pstop are 1-46. This is the CBO looking for the partitions in which the stated CRE_DATETIME records occur, across the whole table. The cost is 892 consistent gets, much of which is spent checking local index partitions that will hold no relevant entries.

Bernard also spotted that the code was considering all partitions and was not as efficient as it could be, but did not know how to get around it except with a Global index, which brings its own issues.

One way is to just say “well, I know how many records per day I get so I will fake up the ID limits based on this”. The problem – and this is why the CBO cannot make the decision for you – is that there is no guarantee, no fixed rule, saying that CRE_DATETIME will always increment with the ID. In fact, there is nothing stopping you altering a record which has an ID from yesterday so that it has a CRE_DATETIME from 10 years ago {but forget I said that until tomorrow’s post}. Now, in my example the CRE_DATETIME is going to increment with ID. We will use this special case for now; I know it is flawed and will address that flaw {tomorrow}.

So, to ensure you do not miss data, you end up making your ID window pretty large. Say you want all records for the last day and you process 1,500 records a day; you consider a window covering all records with an ID within the last 10,000. It will still be more efficient than scanning all partitions and should be safe. Fairly safe.

The way out of this is to have a ranges table. Something that tells you, for each partition, the minimum and maximum CRE_DATETIME and the range of IDs covered. You can then use that to identify the partitions that cover the date range you are interested in.

Here is an example. I will create a simple table to hold the ranges:

create table tp3_range
(min_cre_dati date
,max_cre_dati date
,min_id number
,max_id number)

Now you have to populate it.
The below code will create the first record for the first partition.

insert into tp3_range
SELECT MIN(CRE_DATETIME)
,MAX(CRE_DATETIME)
,MIN(ID)
,MAX(ID) FROM TEST_P3 PARTITION (ID_01K)
/
Execution Plan
----------------------------------------------------------------------------------------------------
| Id  | Operation               | Name    | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
----------------------------------------------------------------------------------------------------
|   0 | INSERT STATEMENT        |         |     1 |    12 |   102   (1)| 00:00:01 |       |       |
|   1 |  SORT AGGREGATE         |         |     1 |    12 |            |          |       |       |
|   2 |   PARTITION RANGE SINGLE|         |   999 | 11988 |   102   (1)| 00:00:01 |     1 |     1 |
|   3 |    TABLE ACCESS FULL    | TEST_P3 |   999 | 11988 |   102   (1)| 00:00:01 |     1 |     1 |
----------------------------------------------------------------------------------------------------

Statistics
----------------------------------------------------------
1 recursive calls
3 db block gets
195 consistent gets
0 physical reads

As you can see from the plan and cost, this is not very efficient as it has to scan the whole partition. Maybe not a problem if you do this once, but there are indexes on both these columns, can’t this be done more efficiently? Yes, if you split up the code into four in-line selects (if you want more details about it being more performant to do MIN and MAX on their own than in one statement then see this post on the topic ):

insert into tp3_range
SELECT (SELECT MIN(CRE_DATETIME) FROM TEST_P3 PARTITION (ID_04K))
,(SELECT MAX(CRE_DATETIME) FROM TEST_P3 PARTITION (ID_04K))
,(SELECT MIN(id ) FROM TEST_P3 PARTITION (ID_04K))
,(SELECT MAX(ID ) FROM TEST_P3 PARTITION (ID_04K))
FROM DUAL

PLAN
---------------------------------------------------------
SORT AGGREGATE
PARTITION RANGE SINGLE cst:2 rws:1000
INDEX FULL SCAN (MIN/MAX) TP_CRE_DT cst:2 rws:1000
Statistics
----------------------------------------------------------
1 recursive calls
5 db block gets
9 consistent gets
0 physical reads

{You may wonder why the Explain Plan section above has a different look. This is because there seems to be a bug in 10.2.0.3 where the autotrace plan for the insert statement comes out wrong, as a FAST DUAL access, so I had to Explain the statement in a different way}

You would run one of the above statements against each partition. Probably, the “Best Practice” way would be to generate a SQL script from a query against DBA_TAB_PARTITIONS.
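{Something along these lines would do it – a sketch, so check the generated script before running it and add an owner predicate if more than one TEST_P3 is knocking about:}

select 'insert into tp3_range'||chr(10)||
       'select (select min(cre_datetime) from test_p3 partition ('||partition_name||'))'||chr(10)||
       '      ,(select max(cre_datetime) from test_p3 partition ('||partition_name||'))'||chr(10)||
       '      ,(select min(id)           from test_p3 partition ('||partition_name||'))'||chr(10)||
       '      ,(select max(id)           from test_p3 partition ('||partition_name||'))'||chr(10)||
       'from dual;'
from   dba_tab_partitions
where  table_name = 'TEST_P3'
order  by partition_position;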

To populate my table I cheated – I just assumed all my partitions are of the same size (1000 records) and used:-
insert into tp3_range
select min(cre_datetime),max(cre_datetime),min(id),max(id)
from test_p3
-- where id between 10000 and 20000
group by trunc(id/1000)
/

Note I left in a commented line. You could run the above against the whole table and then limit it to just new partitions as you add them. This is a tad risky though, you are relying on the partitioning being perfect and it would not scale to hundreds of very large partitions. You would be better off with the partition-by-partition methods above.

You end up with a table like the below:-

select * from tp3_range order by min_cre_dati


MIN_CRE_DATI      MAX_CRE_DATI          MIN_ID     MAX_ID
----------------- ----------------- ---------- ----------
01-JUL-2009 00:04 03-JUL-2009 18:36          1        999
03-JUL-2009 18:40 06-JUL-2009 13:16       1000       1999
06-JUL-2009 13:20 09-JUL-2009 07:56       2000       2999
09-JUL-2009 08:00 12-JUL-2009 02:36       3000       3999
12-JUL-2009 02:40 14-JUL-2009 21:16       4000       4999
14-JUL-2009 21:20 17-JUL-2009 15:56       5000       5999
17-JUL-2009 16:00 20-JUL-2009 10:36       6000       6999
20-JUL-2009 10:40 23-JUL-2009 05:16       7000       7999
23-JUL-2009 05:20 25-JUL-2009 23:56       8000       8999
26-JUL-2009 00:00 28-JUL-2009 18:36       9000       9999
28-JUL-2009 18:40 31-JUL-2009 13:16      10000      10999
31-JUL-2009 13:20 03-AUG-2009 07:56      11000      11999
...
14-OCT-2009 13:20 17-OCT-2009 07:56      38000      38999
17-OCT-2009 08:00 20-OCT-2009 02:36      39000      39999
20-OCT-2009 02:40 22-OCT-2009 21:16      40000      40999
22-OCT-2009 21:20 25-OCT-2009 15:56      41000      41999
25-OCT-2009 16:00 28-OCT-2009 10:36      42000      42999

Add the two indexes below:
create index tp3r_micd_miid on tp3_range(min_cre_dati,min_id);
create index tp3r_macd_miad on tp3_range(max_cre_dati,max_id);

And now you can find the upper and lower ID bounds for a date range with the following code:

select min_id from tp3_range
where min_cre_dati = (select max(min_cre_dati)
from tp3_range
where min_cre_dati <TO_DATE('18-OCT_2009','DD-MON-YYYY')
)
MIN_ID
----------
39000

Execution Plan
---------------------------------------------------------
| Id  | Operation                     | Name           | Rows  | Bytes | Cost 
---------------------------------------------------------
|   0 | SELECT STATEMENT              |                |     1 |    22 |     2
|*  1 |  INDEX RANGE SCAN             | TP3R_MICD_MIID |     1 |    22 |     1
|   2 |   SORT AGGREGATE              |                |     1 |     9 |        
|   3 |    FIRST ROW                  |                |    40 |   360 |     1
|*  4 |     INDEX RANGE SCAN (MIN/MAX)| TP3R_MICD_MIID |    40 |   360 |     1
---------------------------------------------------------
Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
3 consistent gets
0 physical reads

select max_id from tp3_range
where max_cre_dati = (select min(max_cre_dati)
from tp3_range
where max_cre_dati >to_date('20-OCT_2009','DD-MON-YYYY')
)

MAX_ID
----------
39999

Execution Plan
----------------------------------------------------------
| Id  | Operation                     | Name           | Rows  | Bytes | Cost 
----------------------------------------------------------
|   0 | SELECT STATEMENT              |                |     1 |    22 |     2
|*  1 |  INDEX RANGE SCAN             | TP3R_MACD_MIAD |     1 |    22 |     1
|   2 |   SORT AGGREGATE              |                |     1 |     9 |        
|   3 |    FIRST ROW                  |                |     4 |    36 |     1
|*  4 |     INDEX RANGE SCAN (MIN/MAX)| TP3R_MACD_MIAD |     4 |    36 |     1
----------------------------------------------------------
Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
3 consistent gets
0 physical reads

Put it all together into one statement and let’s see how much it costs:

with get_min_id as
(select min_id from tp3_range
where min_cre_dati = (select max(min_cre_dati)
from tp3_range
where min_cre_dati <TO_DATE('18-OCT_2009','DD-MON-YYYY')
) )
,get_max_id as
(select max_id from tp3_range
where max_cre_dati = (select min(max_cre_dati)
from tp3_range
where max_cre_dati >to_date('20-OCT_2009','DD-MON-YYYY')
) )
select count(*)
from test_p3
,get_min_id
,get_max_id
where id >get_min_id.min_id
and id < get_max_id.max_id
and cre_datetime between TO_DATE('18-OCT_2009','DD-MON-YYYY')
and TO_DATE('20-OCT_2009','DD-MON-YYYY')

COUNT(*)
----------
721
1 row selected.

Execution Plan
---------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                             | Name           | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
---------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                      |                |     1 |    57 |    67   (5)| 00:00:01 |       |       |
|   1 |  SORT AGGREGATE                       |                |     1 |    57 |            |          |       |       |
|   2 |   TABLE ACCESS BY LOCAL INDEX ROWID   | TEST_P3        |     2 |    26 |    65   (5)| 00:00:01 |       |       |
|   3 |    NESTED LOOPS                       |                |     2 |   114 |    65   (5)| 00:00:01 |       |       |
|   4 |     MERGE JOIN CARTESIAN              |                |     1 |    44 |     2   (0)| 00:00:01 |       |       |
|*  5 |      INDEX RANGE SCAN                 | TP3R_MICD_MIID |     1 |    22 |     1   (0)| 00:00:01 |       |       |
|   6 |       SORT AGGREGATE                  |                |     1 |     9 |            |          |       |       |
|   7 |        FIRST ROW                      |                |    40 |   360 |     1   (0)| 00:00:01 |       |       |
|*  8 |         INDEX RANGE SCAN (MIN/MAX)    | TP3R_MICD_MIID |    40 |   360 |     1   (0)| 00:00:01 |       |       |
|   9 |      BUFFER SORT                      |                |     1 |    22 |     1   (0)| 00:00:01 |       |       |
|* 10 |       INDEX RANGE SCAN                | TP3R_MACD_MIAD |     1 |    22 |     1   (0)| 00:00:01 |       |       |
|  11 |        SORT AGGREGATE                 |                |     1 |     9 |            |          |       |       |
|  12 |         FIRST ROW                     |                |     4 |    36 |     1   (0)| 00:00:01 |       |       |
|* 13 |          INDEX RANGE SCAN (MIN/MAX)   | TP3R_MACD_MIAD |     4 |    36 |     1   (0)| 00:00:01 |       |       |
|  14 |     PARTITION RANGE ITERATOR          |                |       |       |            |          |   KEY |   KEY |
|  15 |      BITMAP CONVERSION TO ROWIDS      |                |       |       |            |          |       |       |
|  16 |       BITMAP AND                      |                |       |       |            |          |       |       |
|  17 |        BITMAP CONVERSION FROM ROWIDS  |                |       |       |            |          |       |       |
|  18 |         SORT ORDER BY                 |                |       |       |            |          |       |       |
|* 19 |          INDEX RANGE SCAN             | TP3_PK         |   707 |       |     6   (0)| 00:00:01 |   KEY |   KEY |
|  20 |        BITMAP CONVERSION FROM ROWIDS  |                |       |       |            |          |       |       |
|  21 |         SORT ORDER BY                 |                |       |       |            |          |       |       |
|* 22 |          INDEX RANGE SCAN             | TP_CRE_DT      |   707 |       |    48   (0)| 00:00:01 |   KEY |   KEY |
---------------------------------------------------------------------------------------------------------------------------


Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
133 consistent gets
0 physical reads

If you look way back up this post, you will see that the number of records selected by the above is the same as from the code indicated by Bernard (721 records) and for a lot less cost – 133 consistent gets compared to 892.

So I have hopefully explained the issue (having to visit all the index partitions to check the date range), shown how a ranges table can help, given some simple and a less-simple-but-more-efficient examples of code to populate the table and finally shown that using the range table can be more efficient.

And I arrogantly gave it a name – Correlated Partition Exclusion :-)

Some of you may remember that this example has been a special case, as there is no overlap between the dates; the relationship between the CRE_DATETIME and ID is perfect. You are unlikely to have that in real life. Also, some of you may be tired of reading this post. So I will cover the general case in the NEXT post.

More on Assisting Partition Exclusion December 6, 2009

Posted by mwidlake in performance, VLDB.
2 comments

A couple of weeks ago I posted about how you could help the Oracle CBO carry out partition exclusion and thus aid performance. In essence, you have a table partitioned on ID (based on an ascending numeric) and you also have a date column (CRE_DATETIME, indexed using a locally partitioned index), which you often want to limit queries on.
The ID and the CRE_DATETIME both increase over time, so as a human you can infer that a created-date range will map to an ID range on which you can limit any queries and get partition exclusion. The CBO does not understand this relationship and so can’t help. If you want to look at records for the last week and you know that you generate 10,000 records a day maximum, you can add a where clause of {and this is pseudocode}
WHERE ID BETWEEN (MAX_ID-(10000*7))
AND (MAX_ID)
I’m afraid you will have to look back at the referenced post if this does not make sense.

The issue I want to address is: how can you be sure that the fake ID range you are using covers the date range?

A simple real-world example of this is when someone (a Business Objects user or helpdesk user seems to be the usual source) wants to look up the details of a fairly new customer on the data warehouse. They don’t know the ID of the record, but they know the new account was created this week and some other filtering data. So they run a query saying:

WHERE CRE_DATETIME > SYSDATE-7
AND COL_A = 'XXXX'
AND COL_B = 'YYYY'

This visits every partition in the table, even if there is a locally partitioned index on CRE_DATETIME. See this post for details of this problem. If you are really unlucky, and this seems to be the usual situation, there is no index on the CRE_DATETIME either, and a full scan of the billion-row table is initiated, tying up one CPU and giving that expansive {sorry, typo, expensive – though both are true} storage sub-system something to work on for a few hours.

However, in this situation, do this: select the MAX(ID) {the primary key} from the partitioned table, which will take less time than it does to type “select”. Then you say to the user:

 “OK, the max ID is currently 987,000,000. Add this to the WHERE clause:

AND ID > 980000000

If that does not find your record now change that additional clause to:

AND ID BETWEEN 970000000 AND 980000000

and just keep going down by 10 million each time.”

By doing this, the user will be limiting the query on the partition key, the ID, and partition exclusion will be possible and each query will come back quickly. The user will usually find the record they want in the first couple of attempts {or will be kept quiet and less demanding of the system until they get bored and come back to tell you “you are an idiot”, having scanned back over several 10’s of millions of records, but then you know they lied about the record being recent so you need to help them in some other way}.
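{As a sketch of what the user ends up running – the table and filter columns are made-up names, the point is the extra ID window:}

select max(id) from big_partitioned_table;   -- say it comes back as 987000000

select *
from   big_partitioned_table
where  cre_datetime > sysdate-7
and    col_a = 'XXXX'
and    col_b = 'YYYY'
and    id > 980000000;
-- not found? change the last line to
--   and id between 970000000 and 980000000
-- and keep walking the window down by 10 million at a time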

This post is actually to remind you of where I had got to on this thread, me being so unforgivably slow in keeping it running. Tomorrow (HONEST!) I will cover HOW you ensure your inferred ID range covers the CRE_DATETIME range you are interested in.

Partitions are Not for Performance October 29, 2009

Posted by mwidlake in Architecture, performance, VLDB.
16 comments

There is a myth that Partitions in Oracle magically aid SQL Query performance, even a general assumption that the main role of partitioning is to aid such query performance. This is not so. There are certainly many cases where Partitioning can aid performance but in my experience they just as often hinder rather than help overall performance.

The myth is so strong that, when asked at interview, most DBAs {and even some performance specialists} will cite query performance as the main (and occasionally only) benefit of partitioning.

Why can partitions hinder SQL query performance? Let’s say for example that you have a 10 million row table and an index on that table. That index has a B-Level of 3, which means it has a root block, one level of branch nodes, a second layer of branch nodes and then the leaf-node level. To access a row in the table via the index Oracle needs to read the root block, two branch blocks and then a leaf block to get the rowid of the record. This allows the table block {and from that the row} to be read. This is depicted in the below diagram, the numbered orange squares represent the blocks as selected in turn from the index and then table:

{Figure: index_access_global – the five block reads via the non-partitioned index, down through root, branch, branch and leaf to the table block}

That is 5 I/O operations to access that row.
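{If you want to see the B-Level this arithmetic relies on, it is sitting in the dictionary once statistics have been gathered – the index name below is just an illustration:}

select index_name, blevel, leaf_blocks
from   user_indexes
where  index_name = 'MY_BIG_TABLE_PK';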

Now we partition the table into 5 partitions, 2 million rows each. The index is locally partitioned. If you are lucky, you may, just may, end up with local index partitions with a B-level 1 less than the original index, so in our case a B-level of 2. The oft-suggested scenario is that now only one partition is considered and the CBO will read one root block, one branch block, a leaf block and then the block from the table partition.

{Figure: single_index_partition_access – the four block reads when a single local index partition is used}

4 I/Os and a saving of 20% {I’m ignoring for now caching and whether it is physical or logical IO}.

A saving of 20% IF {and only if} you have local index partitions with a lower B-Level than the equivalent non-partitioned index. And the access is to one partition alone.

However, I keep seeing situations where the index look up does not include the partition key. So you get the below situation:

{Figure: index_lookup_partition – without the partition key, every local index partition has to be probed}

Lacking the partition key, the CBO cannot exclude any partitions – so it has to scan each one. For most partitions, maybe for all but one, no records are found, but each local index partition still has to be checked, at 3 I/O operations each. So in my example 16 I/Os are done to get the one record of interest (5 partitions times 3 index I/Os, plus the table block).

16 I/O operations instead of 5.
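As a sketch {invented names again}, this is the shape of statement and plan that bites you – an indexed lookup on a column that is not the partitioning key, with no predicate on the key:

-- ORDERS is partitioned on ID, but the lookup is on ACCOUNT_REF alone
SELECT *
FROM   orders
WHERE  account_ref = 'AB12345';

-- Typical plan shape (illustrative, not real output):
-- PARTITION RANGE ALL
--   TABLE ACCESS BY LOCAL INDEX ROWID   ORDERS
--     INDEX RANGE SCAN                  ORDERS_ACC_REF_IDX

Every local index partition gets probed – roughly 3 consistent gets each – even though all but one of them return nothing.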

The situation is often not spotted, at least initially, as the time taken to carry out the extra local index partition scans is “small”, especially for specific lookups. Usually any testing is done on a table with only a small number of partitions.

I remember well the first time I came across this {on an Oracle 9.0 database, I think}; there were well over 100 partitions, and a process that checked many thousands of individual records had slowed down significantly, taking 4 or 5 times as long as before.

An indicator that the problem is occurring is when a single record lookup against the index and then the table takes perhaps several dozen to several hundred consistent gets rather than the 5 or 6 it should take. Also, you will see the partition iterator in the explain plan. In that first case where I came across the issue, consistent gets of about 380 per record fetched were being seen for a select that returned 1 record 99% of the time from a single table lookup. I’ve since seen the same problem on tables with over a thousand partitions, each local index partition being scanned for a total of almost 5,000 consistent gets per record.
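One rough-and-ready way of hunting for the problem {a sketch only, the thresholds are plucked out of the air} is to look in V$SQL for frequently executed statements doing far more buffer gets per execution than a single-row lookup should:

SELECT sql_id,
       executions,
       buffer_gets,
       ROUND(buffer_gets / NULLIF(executions, 0)) AS gets_per_exec,
       SUBSTR(sql_text, 1, 60)                    AS sql_text
FROM   v$sql
WHERE  executions > 100
AND    buffer_gets / NULLIF(executions, 0) > 100
ORDER  BY gets_per_exec DESC;

Anything in that list that is a simple indexed lookup against a partitioned table is a prime suspect.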

You may think this would be an unusual situation, as access against very large tables is either by full/partial table scans or by lookups on the PK/with a WHERE clause including the partitioning key – but it is actually very common. Partitioned tables are being used more and more in large but essentially OLTP applications, or sets of records are identified in a data warehouse and then checked more specifically with row-by-row logic. With VLDBs, which have many, many partitioned tables with many, many partitions each, the problem is common and often not recognized in the fog of other issues and massive I/O levels.

I’ve only covered a general performance issue with partitions here; I’ll expand on the theme and this simplistic example in the next post.

And yes, there are many ways partitioning CAN aid performance. I aim to get to those too. I really love partitioning.

VLDB Backups October 13, 2009

Posted by mwidlake in Architecture, VLDB.
Tags: , , ,
9 comments

One of the indications that your database classes as a VLDB is that your backups are giving you problems simply due to the size of the database.

As some of you will know from previous posts of mine, like this one about how vital it is to prove your backup and this one about how you maybe need to back up more than just the database, I have a thing about backups. Or, more specifically, recovery.

My focus on backups may surprise some people who have worked with me as I often state “I am not a production DBA”. {And no Dave, this is not me saying I am too good to be a proddy DBA, it is me saying I do not have current, strong skills in many of the daily proddy DBA tasks}. However, I am an architect. Whenever I work on a new system or heavily modify an existing system, I try and keep the need for backup and recovery at the front of my mind.

The most common issue encountered with backing up a VLDB is the time it takes to run the backup – it simply cannot be completed in the gap between backup cycles, which is usually a day. {As a rough illustration, streaming a 20TB database at a sustained 200MB/s takes around 28 hours, before anything goes wrong.}

The second most common issue is the impact on the live system of running the backup. Sometimes this impact is overstated – after all, if the backup is a single thread running on the server it can only consume as much resource as one thread can, which may well leave enough over for the daily processing requirements – but usually, for large systems, steps have been taken to run the backup in parallel, thus creating considerable load on the system.

A third issue, which is related to the first, is that the backup takes so long and uses so many tapes (or space) that it rarely completes – a network glitch, a failure of the backup suite, running out of media, all stop the backup finishing. I’ve been in the situation of attempting 4 or 5 backups for each one that succeeds as something crops up in the 2 or 3 days it takes to run the backup. {In our case it was the flaky backup software, grrrrr}.

The final issue I’ll mention is one that is often overlooked: you can’t afford the time it would take to recover the backup if it was ever needed. I’ve seen this especially with export or expdp-based backups – some sites still use export, and it has its place with smaller systems – often it seems to be used with OLTP systems that have more than 75% of the database volume as indexes. The export runs fine overnight, as it is only processing the 25% of the system that is data. But when you ask the client if they can wait 5 days to import the export, they go pale. This time-to-recover can also be a problem with RMAN backups – you need to read back in everything you wrote out.

I’ve said it before but I’m going to say it again – a backup is not a backup until you have done a successful test full recovery. This would certainly highlight how long your recovery takes.

So, how do you solve the problem of backing up a VLDB?

Well, one solution is to not bother. I know of a couple of sites that have two physical copies of the database, at different locations, and write all data to both. If they lose one copy, they can keep running on the other copy whilst the lost version is rebuilt. Your swap-over could be almost instant.
Drawbacks here are:

  • If you lose one copy you have no redundancy until the second system is rebuilt. This is like losing a disk out of a RAID5 array, another failure is disaster. As databases get bigger, this period of zero redundancy gets longer and thus the chance of a second failure increases (which again is just like the RAID5 array – yet another argument against massive discs).
  • As you write to both systems, if the damage is caused by the feed (eg accidentally deleting data) then both are damaged, unless you have a delay on one system, in which case you now have issues with catching up on that delay if you have to swap to the second system. Flashback may save you from damage caused by the feed.
  • The cost of the second system and the complexity of the double-writes can both be issues.

Another solution is physical DataGuard. I see this as slightly different from the double-system approach above, as you have more options, such as replicating to more than one other system, opening and reading the DataGuard copy, opening and using the copy before flashing it back and re-recovering, or even Active DataGuard, where you can have the standby database open and being used whilst it is kept up-to-date. Again, you can set things up so that the gap between primary system failure and bringing up a new production system is small. A few issues to be mindful of are:

  • You have to ensure that your primary database is running in forced logging mode, or else be extremely, and I mean extremely, careful about what you do against the database that is unrecoverable. The latter option is just asking for trouble, frankly. Which is a shame, as all those performance tricks of direct IO, append operations and nologging activities that help you process all the data in your VLDB are no longer available to you. This might be a show-stopper. {Checking and setting forced logging is sketched just after this list.}
  • You have to take care in setting it all up and may need extra licences.
  • You still have the issue of damage being transmitted to your “backup” before you spot it.
  • The main issue? Someone will get clever and use your DataGuard systems for other things {such as opening the standby, changing it and then flashing the data back, or using Active DataGuard for reporting that becomes critical to your business} and now you actually have a production-critical system split across the DataGuard architecture. It has stopped being a backup, or at least not a dedicated backup. Ooops.
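Checking and fixing the forced-logging point from the first bullet is at least straightforward {standard commands, shown here as a sketch}:

-- Is the database protected against nologging operations?
SELECT force_logging FROM v$database;

-- If not, and you are relying on a physical standby, turn it on
ALTER DATABASE FORCE LOGGING;

Just be aware that the moment you do this, all those nologging tricks start generating redo.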

There is actually no need to back up the whole database every night, though some sites seem fixated on achieving this. Or even every week. There is nothing wrong with taking an RMAN level 0 {zero} backup that is a copy of everything and then just backing up the archived redo logs for, say, 2 weeks before taking another level 0 – so long as you thoroughly test the recovery and make sure you can restore the level 0, get hold of all those redo logs and apply them in a manner timely enough to support your business. I’ve recovered a level 0 backup over a month old and then rolled forward through all the archived redo logs to recreate the system; it worked fine as the volume of redo was pretty small compared to the database. Some considerations with this method are:

  • If you ever have trouble getting files from different days out of your backup area, or occasionally find files from your backup system are corrupt, do not even think of using this method. One missed archived redo file from 13 days back and you are in serious trouble. {A quick check on what archived redo the database still knows about is sketched after this list.}
  • You need to do those level zero backups and they take a while. Remember what I said about issues during a long backup?
  • It can get complex.
  • There is going to be a fairly significant delay in recovering your system.
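As a partial sanity check on that first point, you can at least ask the controlfile what archived redo it still knows about since the last level 0 {a sketch only – it says nothing about whether the copies on tape are actually readable}:

-- Archived redo logs recorded in the controlfile, by day
SELECT TRUNC(first_time) AS log_day,
       COUNT(*)          AS logs,
       SUM(CASE WHEN deleted = 'YES' THEN 1 ELSE 0 END) AS gone_from_disk
FROM   v$archived_log
WHERE  first_time > SYSDATE - 14   -- since the last level 0, say
GROUP  BY TRUNC(first_time)
ORDER  BY log_day;

The only real check, of course, is the one I keep banging on about – a test recovery.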

There are several options with RMAN for doing incremental and cumulative incremental level 1 backups against a level 0 baseline backup. They have the same pros and cons as above, often trading more complexity for shorter recovery times. All good, so long as you practice the recovery.

Physical copy at the storage level. These solutions seem to come and go every few years, but the principle is usually either (a) splitting mirrors – you have, say, 3 copies of the data duplicated across the storage; you un-couple one copy, do with it what you want, such as copy it to tape, and then reintroduce the copy and catch up on changes, ie “resilver” the mirror – or (b) using fancy logging within the storage layer to create a logical copy of the whole live DB at a point in time by tracking and storing changes. You can then take your time copying that logical version to your backup destination. Taking the initial copy is usually near-instantaneous and with (b) can take up a surprisingly small amount of space. Disadvantages?

  • Cost. These clever IO units that can do this seem to be eye-wateringly expensive.
  • Tie-in. If you move storage provider, you need to re-plan and implement a new backup strategy.
  • Probably personal this one, but can you trust it? I saw it go horribly wrong in about 1998 and even now I kind of wince internally thinking about it.

Export and Import. OK, I kind of rubbished this approach earlier and who in their right minds would try and export/import a VLDB of 20TB? You don’t. You export the critical few percent of the database that you need to recreate some sort of temporary production-capable system. Many applications can actually get by with all the reference/lookup data and the latest month or two of active business data. It gets a workable system up and running to keep your business process ticking over whilst you sort out recovering the rest of the system. The beauty of an export is that it can be imported to any working Oracle database of a high enough release level.

3 months ago I would have said this consideration needed to have been designed into your system architecture from the start to stand any reasonable chance of working, but I know of one site that managed just this technique recently. Only because they absolutely had to, but they managed it.

The final backup methodology I’m going to mention here is this – you do not need to back up all of your database in the same way. If you can move a large percentage of your database into readonly tablespaces, you can back up that section of the database once {disclaimer: by once I mean two or three times, to two or three places, and check you can read what you wrote and padlock the door to the vault it is in, and repeat said once-only backup every 6-12 months} and drop that section out of your regular backup. Now you only need to back up the remaining, hopefully small, active section of the database with whatever method you want. You can tie this in with the previous point about only needing to recover a critical subset of the system to get going again, ie what is not readonly – the two approaches complement each other. A few issues {and a small sketch of the mechanics after this list}:

  • It only really works when you design this into the system from the start.
  • Potentially complex recovery spread over a lot of tapes. Can you rely on being able to get at them in a timely manner?
  • People have a habit of wanting to update some of the stuff you made readonly. Sometimes only a few records but spread over a lot of readonly tablespaces.
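The mechanics of the readonly part are the easy bit – it is designing the data placement and resisting the updates that is hard. A sketch, with a made-up tablespace name:

-- Freeze a historic tablespace; it only needs backing up again
-- after the next (rare) change to it
ALTER TABLESPACE orders_2008 READ ONLY;

-- And see how much of the database is already frozen
SELECT status, COUNT(*) AS tablespaces
FROM   dba_tablespaces
GROUP  BY status;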

All the above can be mixed and matched to come up with an overall solution. Personally, I like having a physical standby database for immediate failover and an incremental backup off site for recovery beyond any situation that takes out both the primary and the standby database.
