jump to navigation

Friday Philosophy – Oracle Performance Silver Bullet August 5, 2011

Posted by mwidlake in Architecture, Friday Philosophy, performance.
Tags: , , ,

Silver Cartridge and Bullet

For as long as I have been working with Oracle technology {which is now getting towards 2 decades and isn’t that pause for thought} there has been a constant search for Performance Silver Bullets – some trick or change or special init.ora parameter {alter system set go_faster_flag=’Y’} you can set to give you a guaranteed boost in performance. For all that time there has been only one.

There are a few performance Bronze Bullets…maybe Copper Bullets. The problem is, though, that the Oracle database is a complex piece of software and what is good for one situation is terrible for another. Often this is not even a case of “good 90% of the time, indifferent 9% of the time and tragic 1% of the time”. Usually it is more like 50%:30%:20%.

Cartridge with copper bullet &spent round

I’ve just been unfair to Oracle software actually, a lot of the problem is not with the complexity of Oracle, it is with the complexity of what you are doing with Oracle. There are the two extremes of OnLine Transaction Processing (lots of short running, concurrent, simple transactions you want to run very quickly by many users) and Data Warehouse where you want to process a vast amount of data by only a small number of users. You may well want to set certain initialisation parameters to favour quick response time (OLTP) or fastest processing time to completion (DW). Favouring one usually means a negative impact on the other. Many systems have both requirements in one… In between that there are the dozens and dozens of special cases and extremes that I have seen and I am just one guy. People get their database applications to do some weird stuff.

Partitioning is a bronze bullet. For many systems, partitioning the biggest tables makes them easier to manage, allows some queries to run faster and aids parallel activity. But sometimes (more often than you might think) Partitioning can drop rather than increase query or DML performance. In earlier versions of Oracle setting optimizer_index_caching and optimizer_index_cost_adj was often beneficial and in Oracle 9/8/7 setting db_file_multiblock_read_count “higher” was good for DWs….Go back to Oracle 7 and doing stuff to increase the buffer cache hit ratio towards 98% was generally good {and I will not respond to any comments citing Connors magnificent “choose your BCHR and I’ll achieve it” script}.
You know what? There was an old trick in Oracle 7 you could maybe still look at as a bronze bullet. Put your online redo logs and key index tablespaces on the fastest storage you have and split your indexes/tables/partitions across the faster/slower storage as is fit. Is all your storage the same speed? Go buy some SSD and now it isn’t….

Cartridge with Wooden Bullet

Then there are bronze bullets that you can use that very often improve performance but the impact can be catastrophic {Let’s call them wooden bullets :-) }. Like running your database in noarchivelog mode. That can speed up a lot of things, but if you find yourself in the situation of needing to do a recovery and you last cold backup is not recent enough – catastrophe. A less serious but more common version of this is doing things nologging. “oh, we can just re-do that after a recovery”. Have you done a test recovery that involved that “oh, we can just do it” step? And will you remember it when you have a real recovery situation and the pressure is on? Once you have one of these steps, you often end up with many of them. Will you remember them all?

How many of you have looked at ALTER SYSTEM SET COMMIT_WRITE=’BATCH,NOWAIT’? It could speed up response times and general performance on your busy OLTP system. And go lose you data on crash recovery. Don’t even think about using this one unless you have read up on the feature, tested it, tested it again and then sat and worried about could possibly go wrong for a good while.

That last point is maybe at the core of all these Performance Bronze Bullets. Each of these things may or may not work but you have to understand why and you have to understand what the payback is. What could now take longer or what functionality have I now lost? {hint, it is often recovery or scalability}.

So, what was that one Silver Bullet I tantalizingly left hanging out for all you people to wait for? You are not going to like this…

Look at what your application is doing and look at the very best that your hardware can do. Do you want 10,000 IOPS a second and your storage consists of less than 56 spindles? Forget it, your hardware cannot do it. No matter what you tune or tweak or fiddle with. The one and only Performance Silver Bullet is to look at your system and your hardware configuration and work out what is being asked and what can possibly be delivered. Now you can look at:

  • What is being asked of it. Do you need to do all of that (and that might involve turning some functionality off, if it is a massive drain and does very little to support your business).
  • Are you doing stuff that really is not needed, like management reports that no one has looked at in the last 12 months?
  • Is your system doing a heck of a lot to achieve a remarkably small amount? Like several hundred buffer gets for a single indexed row? That could be a failure to do partition exclusion.
  • Could you do something with physical data positioning to speed things up, like my current blogging obsession with IOTs?
  • You can also look at what part of your hardware is slowing things down. Usually it is spindle count/RAID level, ie something dropping your IOPS. Ignore all sales blurb from vendors and do some real-world tests that match what you app is or wants to do.

It’s hard work but it is possibly the only Silver Bullet out there. Time to roll up our sleeves and get cracking…

{Many Thanks to Kevin Closson for providing all the pictures – except the Silver Bullet, which he only went and identified in his comment!}

IOT part 3 – Significantly Reducing IO August 2, 2011

Posted by mwidlake in development, performance.
Tags: , , , ,

<..IOT1 – the basics
<….IOT2 – Examples and proofs
……>IOT4 – Boosting Buffer Cache Efficiency
……..>IOT5 – Primary Key issues
……….>IOT6a – Slowing Down Insert
…………>IOT6(B) – OLTP Inserts

In the previous two posts I covered the basics of Index Organized Tables (IOTs) and then created some test tables to demonstrate the benefit of IOTs that is most often covered – reducing the IO needed to get a single record by one IO, say from 5 to 4. {Whether this is a buffer get from memory or a disc IO depends on if the block is cached, of course}.

In this post I am going to show how IOTs can far more significantly reduce the IO when several related rows are required.

Below is one of my test tables, the one that is a normal heap table and has a primary key, CHHE_PK on PARE_ID and CRE_DATE:

mdw11> desc child_heap
 Name                                      Null?    Type
 ----------------------------------------- -------- --------------
 PARE_ID                                   NOT NULL NUMBER(10)
 CRE_DATE                                  NOT NULL DATE
 VC_1                                      NOT NULL VARCHAR2(100)
 DATE_1                                             DATE
 NUM_1                                              NUMBER(2)
 NUM_2                                              NUMBER(2)

mdw11> select count(*),count(distinct(pare_id)) from child_heap

---------- ------------------------
   1000000                     9999

As you can see, the table has 1 million records and 9,999 values for PARE_ID, there are approx 100 records per parent. The data was created to match a common situation – that of a bit of data coming in for each parent every day. See post 2 for details.

The result of this is that the data for any given parent is scattered through the table. As the data comes in for a given day, the data for the first parent is added to the end of the table, followed by all the data for all the other parents who have data that day. The next day this is repeated, so the child records for a given parent are interspersed with the child records for many other parents.

The below diagram demonstrate what will now happen if you issue a statement like
select *
where PARE_ID=12

Oracle quickly works down the index to the leaf block containing the first key that matches the range. This takes, in my example, 4 block reads. Oracle now works through the index entries and, via the rowid, identifies the exact block to visit in the table for each key. For each key it has to visit a new block – because the data is scattered through the table. This is what the clustering_factor in the index statistics is measuring, how often contiguous rows in the index are for the same block. In our case, almost never.
In my diagram I do not number those table reads but in my simplistic diagram it would be 10 further reads.
If Oracle reaches the end of the leaf block before it reaches the end of the range of key values, oracle follows the pointer in the leaf block (not shown) to the next leaf block (whcih is another block read) and continues working through the keys until the range scan is completed.

In my simplified diagram I only have 6 entries per leaf block. In reality, and in my example tables, this is more like a few hundred. 247 in the case of CHHE_PK.

Now let’s consider my Index Organized Table, CHILD_IOT. It has exactly the same columns as CHILD_HEAP and the data was created in the same way. However, because it is an IOT, as the data came in it was inserted into the primary key index and is thus in an ordered state.

The below diagram demonstrate what will now happen if you issue a statement like
select *
where PARE_ID=12

Oracle works down the index to the leaf block where the range scan begins and now simply works along the leaf blocks. There is no need to go and visit the table as there is no table.

In my IOT diagram the leaf entries are longer and there are fewer in each leaf block, ie 5. So my scan has to visit 3 leaf blocks rather than 2. In reality the difference is more pronounced, in my example table there are actually 56 rows per leaf block, compared to the 247 in the index on the heap table. As such, my scan on the IOT will cover more leaf blocks but this is insignificant compared to the reduction in block visits caused by not having to go hunt down records scattered over the table. Even in the unlikely event of my IOT being deeper by 1 level (an extra layer of branch blocks) due to the reduces entries per leaf block, I would still be winning for range scans.

That is all nice theory and pictures. As ever, we need to back this up with some real tests. Firstly, I am using SQL*Plus and I need to set my arraysize large enough so that I do not introduce extra consistent gets through selecting small sets of rows between client and server. You will need to do the same to see similar results to me.
{I keep meaning to do a dedicated post on arraysize but H.Tonguç YIlmaz has a nice post already on it.}

set arraysize 200
set autotrace on

Now I will select all the records for PARE_ID=10, including a column not in the Primary Key, so that the table needs to be visited. I did this twice to remove the parsing overhead:

select pare_id,cre_date,vc_1
from child_heap
where pare_id =10
order by cre_date

---------- --------- -----------------------------------------------------------------------

82 rows selected.

Execution Plan
| Id  | Operation                   | Name       | Rows  | Bytes | Cost (%CPU)| Time     |
|   0 | SELECT STATEMENT            |            |   100 |  6900 |   103   (0)| 00:00:02 |
|   1 |  TABLE ACCESS BY INDEX ROWID| CHILD_HEAP |   100 |  6900 |   103   (0)| 00:00:02 |
|*  2 |   INDEX RANGE SCAN          | CHHE_PK    |   100 |       |     3   (0)| 00:00:01 |

          0  recursive calls
          0  db block gets
         86  consistent gets
          0  physical reads

82 rows collected and 86 consistent gets. That will be 4 consistent gets to process the index blocks and 82 for the table blocks.

Now let’s repeat that on the IOT:

select pare_id,cre_date,vc_1
from child_IOT
where pare_id =10
order by cre_date
mdw11> /
any key>

---------- --------- ------------------------------------------------------------

108 rows selected.

Execution Plan
| Id  | Operation        | Name    | Rows  | Bytes | Cost (%CPU)| Time     |
|   0 | SELECT STATEMENT |         |   100 |  6900 |     4   (0)| 00:00:01 |
|*  1 |  INDEX RANGE SCAN| CHIO_PK |   100 |  6900 |     4   (0)| 00:00:01 |

          0  recursive calls
          0  db block gets
          6  consistent gets
          0  physical reads

We actually gathered more data, 108 rows compared to 82, all for 6 consistent gets compared to 86 consistent gets. That is a reduction to less than 10% of the original effort.

Now for a more extreme test. I am going to select a single row summary of data for 10 parents, flushing the cache between each run to show the impact when you have to do real IO to support those consistent gets. This is on a fairly old {4 years} laptop with a rather tired hard disc

alter system flush buffer_cache

System altered.

Elapsed: 00:00:00.18


select count(*),sum (num_1)
from child_heap
where pare_id between 50 and 60

---------- ----------
      1155      12031

Elapsed: 00:00:06.39

Execution Plan
| Id  | Operation                    | Name       | Rows  | Bytes | Cost (%CPU)| Time     |
|   0 | SELECT STATEMENT             |            |     1 |     7 |  1203   (0)| 00:00:18 |
|   1 |  SORT AGGREGATE              |            |     1 |     7 |            |          |
|   2 |   TABLE ACCESS BY INDEX ROWID| CHILD_HEAP |  1200 |  8400 |  1203   (0)| 00:00:18 |
|*  3 |    INDEX RANGE SCAN          | CHHE_PK    |  1200 |       |     7   (0)| 00:00:01 |

          0  recursive calls
          0  db block gets
       1157  consistent gets
       1112  physical reads


alter system flush buffer_cache

System altered.

Elapsed: 00:00:00.18


select count(*),sum (num_1)
from child_iot
where pare_id between 50 and 60

---------- ----------
      1111      11528

Elapsed: 00:00:00.29

Execution Plan
| Id  | Operation         | Name    | Rows  | Bytes | Cost (%CPU)| Time     |
|   0 | SELECT STATEMENT  |         |     1 |     7 |    24   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE   |         |     1 |     7 |            |          |
|*  2 |   INDEX RANGE SCAN| CHIO_PK |  1200 |  8400 |    24   (0)| 00:00:01 |

          0  recursive calls
          0  db block gets
         25  consistent gets
         25  physical reads

The Heap took 6.39 seconds and required 1112 physical reads to support 1157 consistent gets.
The IOT took 0.29 seconds and required 25 physical reads to support 25 consistent gets.

I think we can all see that IOTs have the potential to greatly reduce physical and logical IO. Perhaps we should all be using IOTs more.

Final point. The Heap version took less physical reads than consistent gets as some blocks read into the block buffer cache held data required later in the query.

The impact of IOTs on the buffer cache will be the topic of my next post on IOTs. I think { hope:-) } that many of you will be very interested and impressed by what you could gain…

Database Sizing – How much Disk do I need? (The Easy Way) November 11, 2010

Posted by mwidlake in Architecture, development, VLDB.
Tags: , , , ,

How much Disk do I need for my new Oracle database? Answer:-

  • 8-10 times the volume of raw data for an OLTP system
  • 2-4 times the raw data volume for a Data Warehouse.
  • The bigger the database, the nearer you will be to the lower multiplication factors.

{Disclaimer. This is of course just my opinion, based on some experience. If you use the above figures for a real project and get the total disc space you need wrong, don’t blame me. If you do and it is right, then of course you now owe me a beer.}

Many of us have probably had to calculate the expected size a database before, but the actual database is only one component of all the things you need to run the Oracle component of your system. You need to size the other components too – Archived redo logs, backup staging area, dataload staging area, external files, the operating system, swap space, the oracle binaries {which generally gets bigger every year but shrink in comparison to the average size of an Oracle DB} etc…

In a similar way to my thoughts on how much database space you need for a person, I also used to check out the total disk space every database I created and those that I came across took up. {A friend emailed me after my earlier posting to ask if I had an obsession about size. I think the answer must be “yes”}.

First of all, you need to know how much “raw data” you have. By this I mean what will become the table data. Back in the early 90’s this could be the total size of the flat files the old system was using, even the size of the data as it was in spreadsheets. An Oracle export file of the system gives a pretty good idea of the raw data volume too. Lacking all these then you need to roughly size your raw data. Do a calculation of “number_of_rows*sum_of_columns” for your biggest 10 tables (I might blog more on this later). Don’t be tempted to overestimate, my multipliers allow for the padding.

Let us say you have done this and it is 60GB of raw data for an OLTP system. Let the storage guys know you will probably want about 500GB of space. They will then mentally put it down as “of no consequence” as if you have dedicated storage guys you probably have many terabytes of storage. {Oh, I should mention that I am not considering redundancy at all but space that is provided. The amount of actual spinning disk is down to the level and type of RAID you storage guys make you use. That is a whole other discussion}.

If you come up with 5TB of raw data for a DW system then you need around 12-15TB of disk storage.

If you come up with more than a Terabyte or so of raw data for an OLTP system or 10 to 20 Terabytes for a DW, when you give you figures to the storage guys/procurement people then they may well go pale and say something like “you have got to be kidding!”. This is part of why the multiplication factor for Data Warehouses and larger systems in general is less, as you are forced to be more careful about the space you allocate and how you use it.

The overhead of total disk space over Raw data reduces as the database gets bigger for a number of reasons:

  • The size of the Oracle binaries and the OS does not change as the database gets bigger.
  • The size of swap space does not increase in line wiht the database as, generally speaking, if you increase the database size from 100GB to 1TB you do not have the luxury of increasing the system memory of your server. It probably doubles.
  • Very large databases tend to have something making them big, like images or embedded documents, which are not indexed. Thus the ratio of table segments to index segments increases.
  • If you have a very large database you start removing indexes (often those that support constraints) to aid performance of data load and management, again improving the ratio of table segments to index segments.
  • Backups become partial or incremental to reduce the size and duration of the backup.
  • As mentioned before, the sheer size of system is such that you just take more care over cleaning up work areas, paring down the archived redo log areas (those files to compress well) and other areas.
  • If things get extreme or you have been doing this for donkeys years {note to none-UK people, this means many, many years} you start altering PCTFREE and checking over extent sizes.

My best ever ratio of database size to raw data was around 1.6 and it took an awful lot of effort and planning to get there. And an IT manager who made me very, very aware of how much the storage was costing him (it is not the disks, it’s all the other stuff).

I should also just mention that the amount of disk you need is only one consideration. If you want your database to perform well you need to consider the number of spindles. After all, you can create a very large database indeed using a single 2TB disc – but any actual IO will perform terribly.

How Big is a Person? November 5, 2010

Posted by mwidlake in Architecture.
Tags: , ,

How big are you in the digital world?

By this, I mean how much space do you (as in, a random person) take up in a database? If it is a reasonably well designed OLTP-type database a person takes up 4K. OK, around 4K.

If your database is holding information about people and something about them, then you will have about 4K of combined table and index data per person. So if your database holds 100,000 customers, then your database is between 200MB and 800MB, but probably close to 400MB. There are a couple of situations I know of where I am very wrong, but I’ll come to that.

How do I know this? It is an accident of the projects and places I have worked at for 20 years and the fact that I became strangely curious about this. My first job was with the NHS and back then disk was very, very expensive. So knowing how much you needed was important. Back then, it was pretty much 1.5K per patient. This covered personal details (names, addresses, personal characteristics), GP information, stays at hospitals, visits to outpatient clinics etc,. It also included the “reference “ data, ie the information about consultants, wards and departments, lookups etc. If you included the module for lab tests it went up to just over 2K. You can probably tell that doing this sizing was a job I handled. This was not Oracle, this was a database called MUMPS and we were pretty efficient in how we held that data.

When I moved to work on Oracle-based hospital systems, probably because I had done the data sizing in my previous job and partly because I was junior and lacked any real talent, I got the job to do the table sizings again, and a laborious job it was too. I did it very conscientiously, getting average lengths for columns, taking into account the length bytes, row overhead, block overhead, indexes etc etc etc. When we had built the database I added up the size of all the tables and indexes, divided by the number of patients and… it was 2K. This was when I got curious. Had I wasted my time doing the detailed sizings?

Another role and once again I get the database sizing job, only this time I wrote a little app for it. This company did utilities systems, water, gas, electricity. My app took into account everything I could think of in respect of data sizing, from the fact that the last extent would on average be 50% empty to the tablespace header. It was great. And pointless. Sum up all the tables and indexes on one of the live systems and divide by the number of customers and it came out at 2-3K per customer. Across a lot of systems. It had gone up a little, due to more data being held in your average computer system.

I’ve worked on a few more person-based systems since and for years I could not help myself, I would check the size of the data compared to the number of people. The size of the database is remarkably consistent. It is slowly going up because we hold more and more data, mostly because it is easier to suck up now as all the feeds are electronic and there is no real cost in taking in that data and holding it. Going back to the hospital systems example, back in 1990 it used to be that you would hold the fact a lab test had been requested and the key results information – like the various cell counts for a blood test. This was because sometimes you had to manually enter the results. Now the test results come off another computer and you get everything.

I said there were exceptions. There are three main ones:

  • You are holding a very large number of transaction records for the person. Telephony systems are one of the worst examples of this. Banking, credit cards and other utility systems match the 4K rule.
  • You hold images or other “unstructured” chunks of data for people. In hospital systems this would cover x-rays, ultrasound scans etc. But if you drop them out of the equation (and this is easy as they often are held in separate sub-systems) it remains a few K per person. CVs push it up as they are often in that wonderfully bloaty Word format.
  • You are holding mostly pointers to another system, in which case it can be a lot less than 4K per person. I had to size a system recently and I arrogantly said “4K per person”. It turned out to be less than 1K, but then this system turned out to actually hold most person data in one key data store and “my” system only held transaction information. I bet that datastore was about 4K per person

I have to confess that I have not done this little trick of adding up the size of all the tables and indexes and dividing by the number of people so often over the last couple of years, but the last few times I checked it was still 3-4K – though a couple of times I had to ignore a table or two holding unstructured data.
{The massive explosion in the size of database is at least partly down to holding pictures – scanned forms, photos of products, etc, but when it comes down to the core part of the app for handling people, it seems to have stayed at 4K. The other two main aspects driving up database size seem to me to be the move from regional companies and IT systems to national and international ones, and that fact that people collect and keep all and every piece of information, be it any good for anything or not}.

I’d love to know if your person-based systems come out at around 4K per person but I doubt if many of you would be curious enough to check – I think my affliction is a rare one.

How Fast for £1,000 – Architecture August 5, 2010

Posted by mwidlake in Architecture, performance, Testing.
Tags: , , ,

My previous post proposed the creation of “the fastest Oracle server for a grand”, or at least an investigation into what might be the fastest server. I’ve had some really good feedback {which I very much appreciate and am open to even more of}, so I think I’ll explore this further.

My initial ideas for the hardware configuration, written at the same time as the original post, were:

  • A single-chip, quad core intel core i5 or i7 processor (I would like two chips but the cost of multi-chip motherboards seems too high for my budget)
  • 8GB of memory as the best price point at present, but maybe push to 16GB
  • Multiple small, fast internal disks for storage, maybe expand via eSATA
  • backup to an external drive (cost not included in the budget).
  • USB3 and use of memory sticks for temp and online redo.
  • If budget will stretch, SSD disc for the core database components. like core tables, index tablespaces (who does that any more!).
    ASM or no ASM?
    If I run out of internal motherboard connections for storage, can I mix and match with USB3, external e-SATA or even GB ethernet?

As for the Oracle database considerations, I have a good few things I want to try out also. In the past (both distant and recent) I have had a lot of success in placing components of the database in specific locations. I refer to this as “Physical Implementation” {Physical Implementation, if I remember my old DB Design courses correctly, also includes things like partitioning, extent management, tablespace attributes – how you actually implement the tables, indexes and constraints that came from logical data design}.

Physically placing components like undo and redo logs on your fastest storage is old-hat but I think it gets overlooked a lot these days.
Placing of indexes and tables on different tablespaces on different storage is again an old and partially discredited practice, but I’d like to go back and have a new look at it. Again, I had some success with improved performance with this approach as little as 8 years ago but never got to rigorously test and document it. { As an aside, one benefit I have been (un)fortunate to gain from twice through putting tables and indexes in separate tablespaces is when a tablespace has been lost through file corruption – only for it to be an index tablespace, so I was able to just drop the tablespace and recreate the indexes.}

Then there is the use of clusters, IOTs, Bitmap indexes and Single Table Hash Clusters (are you reading this Piet?) which I want to explore again under 11.

I don’t think I am going to bother with mixed block sizes in one DB, I think you need very specialist needs to make it worth the overhead of managing the various caches and the fact that the CBO is not so great at accurately costing operations in non-standard block sizes {issues with the MBRC fudge factor being one}. But I think I will re-visit use of “keep” and “recycle” caches. For one thing, I want to show that they are just caches with a name and not special, by using the “Recycle” cache as the keep and the “keep” as a recycle cache.

Should I be using RAT for testing all of this? I said I was not going to use any special features beyond Enterprise edition but RAT could be jolly useful. But then I would need two servers. Is anyone willing to give me the other £1000 for it? I’d be ever so grateful! :-)

More Memory Meanderings – IOPS and Form Factors July 19, 2010

Posted by mwidlake in Architecture, Management, performance.
Tags: ,

I had a few comments when I posted on solid state memory last week and I also had a couple of interesting email discussions with people.

I seriously failed to make much of one of the key advantages of solid-state storage over disk storage, which is the far greater capacity of Input/output operations per second (IOPS), which was picked up by Neil Chandler. Like many people, I have had discussions with the storage guys about why I think the storage is terribly slow and they think it is fast. They look at the total throughput from the storage to the server and tell me it is fine. It is not great ,they say, but it is {let’s say for this example} passing 440MB a second over to the server. That is respectable and I should stop complaining.

The problem is, they are just looking at throughput, which seems to be the main metric they are concerned about after acreage. This is probably not really their fault, it is the way the vendors approach things too. However, my database is just concerned in creating, fetching, and altering records and it does it as input/output operations. Let us say a disk can manage 80 IOPS per second (which allows an average 12.5 ms to both seek to the record and also read the data. Even many modern 7,200 rpm discs struggle to average less than 12ms seek time). We have 130 disks in this example storage array and there is no overhead from any sort of raid or any bottleneck in passing the data back to the server. {This is of course utterly unbelievable, but if i have been a little harsh not stating the discs can manage 8ms seek time, ignoring the raid/hba/network cost covers that}. Each disc is a “small” one of 500GB. They bought cheap disk to give us as many MB/£ as they could {10,000 and 15,0000 rpm disks will manage 120 and 160 IOPS per second but cost more per MB}.

Four sessions on my theoretical database are doing full table scans, 1MB of data per IO {Oracle’s usual max on 10.2}, Each session receiving 100MB of data a second, so 400MB in total. 5 discs {5*80 IOPS*1MB} could supply that level of IOPS. It is a perfect database world and there are no blocks in the cache already for these scans to interrupt the multi-block reads.

However, my system is primarily an OLTP system and the other IO is records being read via index lookups and single block reads or writes.

Each IOP reads the minimum for the database, which is a block. A block is 4k. Oracle can’t read a bit of a block.

Thus the 40MB of other data being transferred from (or to) the storage is single block reads of 4k. 10,000 of them. I will need 10,000/80 disks to support that level of IO. That is 125 discs, running flat out.

So, I am using all my 130 discs and 96% of them are serving 40MB of requests and 4% are serving 400MB of requests. As you can see, as an OLTP database I do not care about acreage or throughput. I want IOPS. I need all those spindles to give me the IOPS I need.

What does the 40MB of requests actually equate to? Let us say our indexes are small and efficient and have a height of 3 (b-level of 2), so root node, one level of branch nodes and then the leaf nodes. To get a row you need to read the root node, branch node, lead node and then the table block. 4 IOs. So those 10,000 IOPS are allowing us to read or write 10,000/4 records a second or 2,500 records.
You can read 2,500 records a second.

Sounds a lot? Well, let us say you are pulling up customer records onto a screen and the main page pulls data from 3 main tables (customer, address, account_summary) and translates 6 fields via lookups. I’ll be kind and say the lookups are tiny and oracle just reads the block or blocks of the table with one IO. So that is 9IOs for the customer screen, so if our 40MB OLTP IO was all for looking up customers then you could show just under 280 customers a second, across all users of your database. If you want to pull up the first screen of the orders summary, each screen record derived from 2 underlying main tables and again half a dozen lookups, but now with 10 records per summary page – that is 80 IOs for the page. Looking at a customer and their order summary you are down to under thirty a second across your whole organisation and doing nothing else.

You get the idea. 2,500 IOPS per second is tiny. Especially as those 130 500GB disks give you 65TB of space to host your database on. Yes, it is potentially a big database.

The only way any of this works is due to the buffer cache. If you have a very healthy buffer cache hit ratio of 99% then you can see that your 2500 records of physical IO coming in and out of the storage sub-system is actually supporting 250,000 logical-and-physical IOPS. {And in reality, many sites not buffer at the application layer too}.

Using Solid State Storage would potentially give you a huge boost in performance for your OLTP system, even if the new technology was used to simply replicate disk storage.

I think you can tell that storage vendors are very aware of this issue as seek time and IOPS is not metrics that tend to jump out of the literature for disk storage. In fact, often it is not mentioned at all. I have just been looking at some modern sales literature and white papers on storage from a couple of vendors and they do not even mention IOPS – but they happily quote acreage and maximum transfer rates. That is, until you get to information on Solid State Discs. NOw, because the vendor can say good things bout the situation then the information is there. On one HP white paper the figures given are:

				Modern super-fast		Top-end 
				SAS disk drive Top-end 	Solid State Disk
Sustained write     	150MB/s			180MB/s
Sustained read			90MB/s			180MB/s
Random write			285				5,000+
Random read				340				20,000+ 

More and more these days, as a DBA you do not need or want to state your storage requirements in terms of acreage or maximum throughput, you will get those for free, so long as you state your IOPS requirements. Just say “I need 5000 IOPS a second” and let the storage expert find the cheapest, smallest disks they can to provide it. You will have TBs of space.

With solid-state storage you would not need to over-specify storage acreage to get the IOPS, and this is why I said last week that you do not need solid state storage to match the capacity of current disks for this storage to take over. We would be back to the old situation where you buy so many cheap, small units to get the volume, IOPS are almost an accidental by-product. With 1GB discs you were always getting a bulk-buy discount :-)

I said that SSD would boost performance even if you used the technology to replicate the current disk storage. By this I mean that you get a chunk of solid-state disk with a SATA or SAS interface in a 3.5 inch format block and plug it in where a physical disk was plugged in, still sending chunks of 4k or 8k over the network to the Block Buffer Cache. But does Oracle want to stick with the current block paradigm for requesting information and holding data in the block buffer cache? After all, why pass over and hold in memory a block of data when all the user wanted was a specific record? It might be better to hold specific records. I suspect that Oracle will stick with the block-based structure for a while yet as it is so established and key to the kernel, but I would not be at all surprised if something is being developed with exadata in mind where data sets/records are buffered and this could be used for data coming from solid state memory. A second cache where, if using exadata or solid-state memory, holding single records. {I might come back to this in a later blog, this one is already getting bloated}.

This leads on to the physical side of solid-state discs. They currently conform to the 3.5” or 2.5” hard disc form factor but there is no need for them to do so. One friend commented that, with USB memory sticks, you could stick a female port on the back of a memory stick and a joint and you could just daisy-chain the USB sticks into each other, as a long snake. And then decorate your desk with them. Your storage could be looped around the ceiling as bunting. Being serious, though, with solid state storage then you could have racks or rows of chips anywhere in the server box. In something like a laptop the storage could be an array 2mm high across the bottom the chasis. For the server room you could have a 1u “server” and inside it a forest of chips mounted vertically, like row after row of teeth, with a simple fan at front and back to cool the teeth (if needed at all). And, as I said last time, with the solid state being so much smaller and no need to keep to the old hard disk format, you could squeeze a hell of a lot of storage into a standard server box.

If you pulled the storage locally into your server, you would be back to the world of localised storage, but then LANs and WANs are so much faster now that if you had 10TB of storage local to your server, you could probably share it with other machines in the network relatively easily and yet have it available to the local server with as many and as fat a set of internal interfaces as you could get your provider to manage.

I’m going to, at long last, wrap up this current instalment on my thoughts with a business one. I am convinced that soon solid-state storage is going to be so far superior a proposition to traditional disks that demand will explode. And so it won’t get cheaper. I’m wondering if manufacturers will hit a point where they can sell as much as they can easily make and so hold the price higher. After all, what was the argument for Compact Discs to cost twice as much to produce as old cassette tapes, even when they had been available for 5 years? What you can get away with charging for it.

Memory Changes Everything July 12, 2010

Posted by mwidlake in Architecture, performance.
Tags: , , ,

I’ve got this USB memory stick which I use to carry around my scripts, documents, presentations, Oracle manuals and enough music to keep me going for a few days. It is on an 8GB Gizzmo Junior and it is tiny. By tiny I mean as wide as my little finger, the length of a matchstick and about the same thickness of said matchstick. So small that I did indeed lose the damn thing for 6 months before I realised it had got trapped behind a credit card in my wallet.

It cost me ten British pounds about 15 months ago (less than most 4GB USB sticks seem to cost now, but then it is nothing more than the memory chip and connectors wrapped in plastic) and it highlights how cheap solid-state “storage” is becoming.

Connected to this, I was looking at buying a new PC this week and this machine comes with 10 USB slots, if you include the ones on the supplied monitor and stubs on the motherboard.
10 USB slots, 8GB gizzmo memory sticks… That would be 80GB of cheap and fast storage. Now get a few USB hubs and bulk-buy a few dozen cheap USB2 sticks and you could soon have a solid-state database of a few hundred GB for a thousand pounds. Then of course you can have fun seeing where the pinch-points in the system are (USB2 has a maximum speed per port and going USB3 right now is going to break that 1 grand barrier. But give it a year…).

This really started me thinking about when memory-based storage would take over from spinning disk as the best option for enterprise-level storage and my gut feeling is in about 5 years. I think it will be both technically possible and financially viable in much less than that, say as little as 2 years, but the cost of solid-state storage per MB will still be higher than disk by then but potentially much faster. A few considerations going through my mind were:-

  • Disk is getting a lot slower in relation to acreage. By this I mean that, for a single disc drive, capacity is doubling about every 18 months but seek time has hardly reduced in a decade and transfer rate (reading from the physical platters to the units buffer) is again almost stationary, at about 120MB/s for 10,000rpm disk and up towards 180 for those very expensive and noisy 15,000 rpm disks. Being a tad ridiculous to make the point, with modern 3TB disks you could build most Oracle database on one disc. Let’s make it two in a raid 10 configuration for redundancy. My point is, your 3TB database could well be being run right now, for real, across say 5 physical disks with a total sustainable physical throughput of around 500MB a second.
  • Solid state storage seems to be halving in price in more like 8-10 months.
  • IO subsystems are made faster by using RAID so that several physical discs can contribute to get towards the 300MB or so speed of the interface – but solid state is already that fast.
  • IO subsystems are made faster by building big caches into them and pre-fetching data that “might” be requested next. Oh, that is kind of solid state storage already.
  • Solid state storage, at least the cheap stuff in your USB stick, has the problem that you can only write to each bit a thousand or so times before it starts to get unreliable. But physical disk has exactly the same issue.
  • There are new methods of solid-state memory storage coming along – “New Scientist” had a nice article on it a few months ago, and these versions will be even higher density and more long-term reliable.
  • Seek time on solid-state memory is virtually zero, so random IO is going to be particularly fast compared to spinning disk.

Solid state memory needs less power, and thus less cooling, is silent, is potentially denser and is less vulnerable to temperature and humidity fluctuations. I can see it not needing to be kept in a specialist server room with the need for all that air con and ear defenders when you go in the room.
Just somewhere with normal air con and a lock on the door should suffice.
We do not need Solid State storage to match the size of current disks or even be as cheap to take over. As I have already pointed out, it is not acreage you need with physical disks but enough spindles and caches to make it fast enough in relation to the space. Further, we can afford to pay more for solid state if we do not need to keep it in such expensive clean-room like environments.

I can see that in a couple of years for a given computer system, say a mixed-workload order processing system, to support the storage needs we will have maybe a dozen solid-state chunks of storage, perhaps themselves consisting of several small units of memory in some sort of raid for resilience, all able to flood the IO channels into our processing server and the issue will be getting the network and io channels into the server to go fast enough. So don’t, stick all the storage directly into the server. You just got rid of half your SAN considerations.

I’m going to stop there. Partly because I have run out of time and partly because, in checking out what I am writing, I’ve just spotted someone did a better job of this before me. Over to James Morle who did a fantastic post on this very topic back in May. Stupid me for not checking out his blog more often. Jame also mentions that often it is not total throughput you are interested in at all but IOPS. That zero latency of solid-state memory is going to be great for supporting very high IOPS.

Friday Philosophy – Madness demands Attention May 28, 2010

Posted by mwidlake in Architecture, Friday Philosophy.
Tags: , ,

Many years ago I had a good friend who was a psychiatric nurse. We were talking about his job once and he was saying how some patients just took up much more time than others. These were generally the ones who would be deemed “the most mad” in a non-clinical manner {and is pretty much how my friend the psychiatric nurse put it}. These patient’s actions or need for intervention would put demands on the staff far more than other patients. As a result, all the nurses tended to get to know (or know of) such patients better than others.

I thought of this the other day when a few of us were talking about some awful bit of application we were concerned about. This thing inserts rows from a MSSQL database into a table in an Oracle database. Triggers on the initial table fire and populate another table, in a 1-to-many relationship. This second table also has a trigger on it that further inserts into another set of tables. A regular process then aggregates this data – and sticks it back into the MSSQL database it came from.

Said process is done as a single transaction for all rows inserted for the day. Irrespective of the growth in rows. Or the fact that one source “application” has grown to 10 and soon will grow to 50. All rows in one transaction. The intermediate tables are never cleared out and get bigger and bigger. No one else needs any of this data in the Oracle side of the system.

There are several “madnesses” to this process – why put it into Oracle only to put it back in the source system, why use a busy production system to hold and process transient data, why no clean up, why are the records not processed in batches, the cascading triggers magnifies transaction data volumes…

This process is well-known in our group. I’ve been involved. Both the guys I was talking to have been involved. I can see from my desk 4 or 5 other people who have been roped into bullying this process thought before now. In fact, I reckon half the department have had to work on this damned thing at some point.

Can you see why I was reminded of my conversation with my old nurse friend?

The application is simply mad. And as a result it is demanding not only on our database but on all of the team, as so many of us have had to get involved working on it. We have all got to know it so well.

I’m glad to say that treatment for the application is planned and, hopefully, it will soon be a lot happier.

So will we.

Depth of Indexes on VLDBs November 18, 2009

Posted by mwidlake in performance, Uncategorized, VLDB.
Tags: , ,

In a couple of recent posts on partitions, on single row and range scan access I have started exploring the performance of partitions. I will continue on the main theme soon, honest.

I feel a little bad that I have not given concrete examples so far {creating very large objects for the sake of demonstrations and proof-of-concepts is a bit of an issue with VLDBs, due to the space and time requirements to do it}. So, here is just some real-world information on index depth, or BLEVEL of indexes on Very Large DataBases.

The BLEVEL is the number of levels in the index. Minus one. See this post for a full description, but in brief, if an index has a BLEVEL of 3, that means there is a root node, a level of branch nodes, a second level of branch nodes and then the leaf nodes. 4 levels. 4 IOs to check if the index holds the value you are looking for. And then one or more IOs against the table to get the actual records from the table, if required (one block if there is only one row, one or more blocks if there are several rows).

The below is from a 25TB datawarehouse system. I am looking for the maximum index depth on the database:-

select max(blevel) from dba_indexes


OK, how many indexes are this deep?…

select owner,index_name from dba_indexes
where blevel=4

—————————— ———————-

1 row selected.

Oh. I have a pretty massive database and the deepest index I have is BLEVEL 4 and there is one of them. So the index height is 5. And access to that table for a given referral ID is going to take 5 IOs to work down the index and one against the table.

What is more, the index is an IOT

—————————— —————————

so, in fact, to get to the table record it is just the 5 IOs as the table record is held in the IOT.

So with a 25TB data warehouse, the deepest normal indexes are BLEVEL 3.

select index_name,num_rows/leaf_blocks
from dba_indexes
where BLEVEL =3
order by num_rows desc

INDEX_NAME                             NUM_ROWS  LEAF_BLOCKS
------------------------------ ---------------- ------------
W_WL_SH_API_XIIIIIIIIIII_ID     4,585,108,192   13,406,015
W_WL_SHIIIIIIIIIIIXXXX_ID    4,564,350,002   16,892,898
W_WL_SIIIIIIIIIIIIIIIIIIIIII_ID     4,422,775,820   17,189,536
W_WL_SHIIIIIIIIIIIIIIIIIIIIII_ID    4,182,087,545   12,243,438
W_WL_GX_RESIIIIIIIIIII_LOG_ID1     3,690,374,216   14,613,388
IDX_W_WL_LIIIIIIIIIIIESS_ID          351,507,905    1,958,109
WL_LOGIN_SUIIIIIIIIIIIIE_ID          348,004,861    1,550,180
IND_IIII_LEG_ID                     312,727,000      972,690
IND_TMLR_MARIIIIIIIIIII_ID           306,988,247    1,121,592
PK_TBL_IIIIIIIIIIIUNNER              305,566,240      939,775
PK_IIIIIIIIIIIGS                    284,208,000      863,000

I asked an old friend, Tony Webb, to check out the depth of indexes on the largest databases they have, which range from a few TB to several 10s of TB. {one  is 100TB plus, but a lot of the data is in LOBs, so the actual number of records is only billions, not 10’s of billions :-) } No index had a BLEVEL greater than 3.

Hi Martin,

I ran the first query on DAVE, DEE(our biggest db), DOZY and MICK. The max depth results were 3,2,3 and 3. Our largest database only had 2. In fact some of the other largish dbs also only have a max depth of 2.

That enough for your blog Martin?

Now, some of you may be smelling a Rat, and you would be right. At both sites, the VLDBs have been designed with partitioning as a key part of the Physical Implementation. All the largest objects are partitioned. What impact does partitioning have on BLEVEL?

The depth of the index segment is dependent on the number of keys per block.

Let us say you have 8k blocks {this is a questionably low block size for a data warehouse, but it is what some of these systems have} and your index is on a couple of columns, totalling 20 bytes per key on average, including rowid. Allowing block overhead, that is approx 8000/20 entries = 400 max….

Blow that, that’s whooly theory, I’ll look at my real 25TB database for number of rows per leaf block… 

select leaf_blocks, num_rows\leaf_blocks rows_per_leaf
from dba_ind_statistics
where num_rows &gt; 100000000
order by leaf_blocks desc

------------ -------------
  30,211,949    325.602273
  28,717,447    262.762331
  26,308,608    382.505959
  25,429,862    385.047045
  24,054,626     309.92864
  23,660,747    382.571434
  22,458,058    411.063205
  22,316,976    346.620625
  18,875,827    379.952198
  17,451,901    338.600909
  17,189,536     257.29466

From the above let us take a low figure of 300 average enteries per block. For a BLEVEL 0 , a one block index, that would be 300 rows that are referenced. For a level 1 index, 300 entries in the root block will point to 300 leaf blocks, so that will be 300*300 rows…90,000 entires.

Go down another BLEVEL to 2 (and a height of 3) and you have a root node referencing 300 Branch nodes, referenceing 300 Leaf nodes each referencing 300 records. 300*300*300 = 27 million.

Another level (BLEVEL 3), another factor of 300 and that is 8.1 billion.

BLEVEL 0= 300
BLEVEL 1 = 90,000
BLEVEL 2 = 27,000,000
BLEVEL 3 = 8,100,000,000
BLEVEL 4 = 6,480,000,000,000 (6,480 billion entries).

You can see that with an 8k block size and being very pessimistic about the number of entries per block (I used a low average to calculate enttries per block where as in reality most indexes will fill the block at the root and branch nodes to close to full before throwing a new level) your have to increase your data volume by amost 300 to throw a new level of index.

On average, knowing nothing about the actual number of records in an index, you would have to reduce your number of indexed rows by 150 times (half of 300) to have a 50% chance of lowering the BLEVEL of a locally partitioned index by one. ie 150 or more partitions.

with a 32k block size, you would have to reduce your data volume by more like 620 times to drop the BLEVEL (not 600 as you have less wastage from block overhead with larger blocks).

To reduce it by two it would be 300*150 I think, wiht 8k block size {can any statisticians, like Mr Lewis, help me out on this one?}. If I am right, that is 45,000 partitions. In that 25TB datawarehouse, no table has more than 4,000 partitions.

That is why I say, when you partition a table and have locally partitioned indexes, you might reduce the index level by 1. Might.

However, none of these systems has more than 4000 partitions per table. That might sound a lot, but it is only going to reduce an index BLEVEL by 1. Almost certainly it will, but if you have read the previous postings, that is not going to really help index lookups be that more efficient :-)

I’ll just add a couple of last comments.

  • If you have an index BLEVEL of 4 or 5 or more and do not have a VLDB, you might want to look at why {hints, it could be an IOT, it could be an index on several concatenated VARCHAR2 columns, it could be you are using 2k block size, it could be that you regularly delete a large pecentage of your data and then re-insert more data, it could be you have a one-table database, hehe.}.
  • Point one does not mean you should rebuild said index. I almost never rebuild indexes. The need to do so is one of those Oracle myths. There are situations where index rebuilds are advantageous, but they are pretty damned rare. If you regularly rebuild all your indexes or you are thinking of doing so, can I suggest you book a “meeting” for an afternoon and sit down with a cup of tea and google/bing/yahoo or whatever and check out index rebuilding first? Ignore any sites you land on offering database health checks or training courses on cruise liners.

That’s all for now.

Partitions are Not for Performance October 29, 2009

Posted by mwidlake in Architecture, performance, VLDB.
Tags: , , ,

There is a myth that Partitions in Oracle magically aid SQL Query performance, even a general assumption that the main role of partitioning is to aid such query performance. This is not so. There are certainly many cases where Partitioning can aid performance but in my experience they just as often hinder rather than help overall performance.

The myth is so strong that when asked at interview, most DBAs {and even some performance specialists} will site query performance as the main (and occasionally only) benefit of partitioning.

Why can partitions hinder SQL query performance? Let’s say for example that you have a 10 million row table and an index on that table. That index has a B-Level of 3, which means it has a root block, one level of branch nodes, a second layer of branch nodes and then the leaf-node level. To access a row in the table via the index Oracle needs to read the root block, two branch blocks and then a leaf block to get the rowid of the record. This allows the table block {and from that the row} to be read. This is depicted in the below diagram, the numbered orange squares represent the blocks as selected in turn from the index and then table:


That is 5 I/O operations to access that row.

Now we partition the table into 5 partitions, 2 million rows each. The index is locally partitioned. If you are lucky, you may, just may, end up with local index partitions with a B-level 1 less then the original table, so in our case a B-level of 2. The often suggest process is now that one partition is considered and thet the CBO will read one root node, a branch level block, a leaf block and then the block from the partition.


4 I/Os and a saving of 20% {I’m ignoring for now caching and whether it is physical or logical IO}.

A saving of 20% IF {and only if} you have local index partitions with a lower B-Level than the equivalent non-partitioned index. And the access is to one partition alone.

However, I keep seeing situations where the index look up does not include the partition key. So you get the below situation:


Lacking the partition key, the CBO cannot exclude any partitions – so it has to scan each one. For most partitions, maybe for all but one, no records are found, but the index has to be checked with 3 IO operations each. so in my example 16 I/Os are done to get the one record of interest.

16 I/O operations instead of 5.

The situation is often not spotted, at least initially, as the time taken to carry out the extra local index partition scans is “small”, especially for specific lookups. Usually any testing is done on a table with only a small numer of partitions.

I remember well the first time I came across this {on an Oracle 9.0 database I think}, there was well over 100 partitions and a process that checked many thousands of individual records had slowed down significantly, taking 4 or 5 times as long as before.

An indicator that the problem is occurring is when a single record lookup against the index and then table is taking perhaps several dozen to several hundred consistent gets rather than the 5 or 6 it should. Also, you will see the partition iterator in the explain plan. In that first case where I came across the issue, consistent gets of about 380 per record fetched were being seen for a select that returned 1 record 99% of the time from a single table lookup. I’ve since seen the same problem on tables with over a thousand partitions, each local index being scanned for a total of almost 5000 consistent gets per record.

You may think that this would be an unusual situation as access against very large tables is either full/partial table scans or lookups on the PK/with a WHERE clause including the partitioning key – but it is actually very common. Partitioned tables are being used more and more in large but generally OLTP applications or sets of records are identified in a datawarehouse that are then checked more specifically with generally row-specific logic. With VLDBs which have many, many partitioned tables with many, many partitions each, the problem is common and often not recognized in the fog of other issues and massive I/O levels.

I’ve only covered a general performance issue with partitions here, I’ll expand on the theme and this simplistic example in the next post.

And yes, there are many ways partitioning CAN aid performance. I aim to get to those too. I really love partitioning.


Get every new post delivered to your Inbox.

Join 185 other followers