jump to navigation

Will the Single Box System make a Comeback? December 8, 2011

Posted by mwidlake in Architecture, future, Hardware.
Tags: , ,
15 comments

For about 12 months now I’ve been saying to people(*) that I think the single box server is going to make a comeback and nearly all businesses won’t need the awful complexity that comes with the current clustered/exadata/RAC/SAN solutions.

Now, this blog post is more a line-in-the-sand and not a well researched or even thought out white paper – so forgive me the obvious mistakes that everyone makes when they make a first draft of their argument and before they check their basic facts, it’s the principle that I want to lay down.

I think we should be able to build incredible powerful machines based on PC-type components, machines capable of satisfying the database server requirements of anything but the most demanding or unusual business systems. And possibly even them. Heck, I’ve helped build a few pretty serious systems where the CPU, memory and inter-box communication is PC-like already. If you take the storage component out of needing to be centralise (and this shared), I think that is a major change is just over the horizon.

At one of his talks at the UKOUG conference this year, Julian Dyke showed a few tables of CPU performance, based on a very simple PL/SQL loop test he has been using for a couple of years now. The current winner is 8 seconds by a… Core i7 2600K. ie a PC chip and one that is popular with gamers. It has 4 cores and runs two threads per core, at 3.4GHz and can boost a single core to 3.8 GHz. These modern chips are very powerful. However, chips are no longer getting faster so much as wider – more cores. More ability to do lots of the same thing at the same speed.

Memory prices continue to tumble, especially with smart devices and SSD demands pushing up the production of memory of all types. Memory has fairly low energy demands so you can shove a lot of it in one box.

Another bit of key hardware for gamers is the graphics card – if you buy a top-of-the-range graphics card for a PC that is a couple of years old, the graphics card probably has more pure compute grunt than your CPU and a massive amount of data is pushed too and fro across the PCIe interface. I was saying something about this to some friends a couple of days ago but James Morle brought it back to mind when he tweeted about this attempt at a standard about using PCI-e for SSD. A PCI-e 16X interface has a theoretical throughput of 4000MB per second – each way. This compares to 600MB for SATA III, which is enough for a modern SSD. A single modern SSD. {what I am not aware of is the latency for PCI-e but I’d be surprised if it was not pretty low}. I think you can see where I am going here.

Gamers and image editors have probably been most responsible for pushing along this increase in performance and intra-system communication.

SSD storage is being produced in packages with a form factor and interface to enable an easy swap into the place of spinning rust, with for example a SATA3 interface and 3.5inch hard disk chassis shape. There is no reason that SSD (or other memory-based) storage cannot be manufactured in all sorts of different form factors, there is no physical constraint of having to house a spinning disc. Density per dollar of course keeps heading towards the basement. TB units will soon be here but maybe we need cheap 256GB units more than anything. So, storage is going to be compact and able to be in form factors like long, thin slabs or even odd shapes.

So when will we start to see cheap machines something like this: Four sockets for 8/16/32 core CPUs, 128GB main memory (which will soon be pretty standard for servers), memory-based storage units that clip to the external housing (to provide all the heat loss they require) that combine many chips to give 1Gb IO rates, interfaced via the PCIe 16X or 32X interface. You don’t need a HBA, your storage is internal. You will have multipath 10GbE going in and out of the box to allow for normal network connectivity and backup, plus remote access of local files if need be.

That should be enough CPU, memory and IO capacity for most business systems {though some quote from the 1960′s about how many companies could possible need a computer spring to mind}. You don’t need shared storage for this, in fact I am of the opinion that shared storage is a royal pain in the behind as you are constantly having to deal with the complexity of shared access and maximising contention on the flimsy excuse of “sweating your assets”. And paying for the benefit of that overly complex, shared, contended solution.

You don’t need a cluster as you have all the cpu, working memory and storage you need in a 1U server. “What about resilience, what if you have a failure?”. Well, I am swapping back my opinion on RAC to where I was in 2002 – it is so damned complex it causes way more outage than it saves. Especially when it comes to upgrades. Talking to my fellow DBA-types, the pain of migration and the number of bugs that come and go from version to version, mix of CRS, RDBMS and ASM versions, that is taking up massive amounts of their time. Dataguard is way simpler and I am willing to bet that for 99.9% of businesses other IT factors cause costly system outages an order of magnitude more times than the difference between what a good MAA dataguard solution can provide you compared to a good stretched RAC one can.

I think we are already almost at the point where most “big” systems that use SAN or similar storage don’t need to be big. If you need hundreds of systems, you can virtualize them onto a small number of “everything local”
boxes.

A reason I can see it not happening is cost. The solution would just be too cheap, hardware suppliers will resist it because, hell, how can you charge hundreds of thousands of USD for what is in effect a PC on steroids? But desktop games machines will soon have everything 99% of business systems need except component redundancy and, if your backups are on fast SSD and you a way simpler Active/Passive/MAA dataguard type configuration (or the equivalent for your RDBMS technology) rather than RAC and clustering, you don’t need that total redundancy. Dual power supply and a spare chunk of solid-state you can swap in for a failed raid 10 element is enough.

Lack of Index and Constraint Comments November 24, 2011

Posted by mwidlake in Architecture, database design, development.
Tags: , , , ,
10 comments

Something I’ve just reminded myself of is that under Oracle you cannot add a comment on an index or a constraint. You can only add comments on tables, views, materialized views, columns of those object types and a couple of esoteric things like Operators, Editions and Indextypes.

Here is an example of adding comments to tables and columns:

set pause off feed off
drop table mdw purge;
create table mdw(id number,vc1 varchar2(10));
comment on table mdw is 'Martin Widlake''s simple test table';
comment on column mdw.id is 'simple numeric PK sourced from sequence mdw_seq';
comment on column mdw.vc1 is'allow some random text up to 10 characters';
--
desc user_tab_comments

 Name                                                  Null?    Type
 ----------------------------------------------------- -------- ------------------------------------
 TABLE_NAME                                            NOT NULL VARCHAR2(30)
 TABLE_TYPE                                                     VARCHAR2(11)
 COMMENTS                                                       VARCHAR2(4000)

--
select * from dba_tab_comments where table_name='MDW'
/
OWNER                          TABLE_NAME                     TABLE_TYPE
------------------------------ ------------------------------ -----------
COMMENTS
----------------------------------------------------------------------------------------------------
MDW                            MDW                            TABLE
Martin Widlake's simple test table

select * from dba_col_comments where table_name='MDW'
order by column_name
/
OWNER                          TABLE_NAME                     COLUMN_NAME
------------------------------ ------------------------------ --------------
COMMENTS
----------------------------------------------------------------------------------------------------
MDW                            MDW                            ID
simple numeric PK sourced from sequence mdw_seq
MDW                            MDW                            VC1
allow some random text up to 10 characters
-- now to add a big comment so need to use the '-' line continuation character in sqlplus
--
comment on table mdw is 'this is my standard test table.-
 As you can see it is a simple table and has only two columns.-
 It will be populated with 42 rows as that is the solution to everything.'
/
select * from dba_tab_comments where table_name='MDW'
OWNER                          TABLE_NAME                     TABLE_TYPE
------------------------------ ------------------------------ -----------
COMMENTS
----------------------------------------------------------------------------------------------------
MDW                            MDW                            TABLE
this is my standard test table.  As you can see it is a simple table and has only two columns.  It w
ill be populated with 42 rows as that is the solution to everything.
--
/

Adding comments on tables, views and columns seems to have dropped out of fashion over the years but I think it is still a very useful feature of oracle and I still do add them (though I am getting a little slack about it myself over the last 3 or 4 years, which I must stop).

Comments are great, you can put 4000 characters of information into the database about each table, view and column. This can be a brief description of the object, a full explanation of what a column is to hold or even a list of typical entries for a column or table.

But you can’t add a comment on indexes or constraints. Why would I want to? Well, constraints and indexes should only be there for a reason and the reason is not always obvious from either the names of the columns or the name of the constraint or index, especially where you have a naming standard that forces you to name indexes and constraints after the columns they reference.

When you design a database, do a schema diagram or an ERD, you label your relationships between entities/tables. It tells you exactly what the relationship is. You might create an index to support a specific method of access or particular business function. You might alter the index in a way not immediately obvious to the casual observer, such as to allow queries that use the index to avoid having to visit the table. All of those things will, of course, be fully documented in the maintained project documentation in the central repository, available and used by all…

If I was able to add comments to constraints and indexes within the database then they would there. You move the system from one platform to the other, they are there. If for any wildly unlikely reason the central documentation lets you down, the information is always there in the database and easy to check. You may not be able to track down the original design documents but you have the database in front of you, so comments in that will persist and be very easy to find.

Lacking the ability to add comments on indexes and constraints, I have to put them at the table level, which I always feel is a kludge. I might actually raise an enhancement request for this, but as Oracle 12 is already nailed down, it will have to wait until Oracle 14. (A little bird told me Larry said there would be no Oracle 13…).

IOT P6(a) Update November 8, 2011

Posted by mwidlake in Architecture, development, performance, Testing.
Tags: , , , ,
3 comments

In my last post, IOT part 6, inserts and updates slowed down, I made the point that IOT insert performance on a relatively small Oracle system was very slow, much slower than on a larger system I had used for professional testing. A major contributing factor was that the insert was working on the whole of the IOT as data was created. The block buffer cache was not large enough to hold the whole working set (in this case the whole IOT) once it grew beyond a certain size. Once it no longer fitted in memory, Oracle had to push blocks out of the cache and then read them back in next time they were needed, resulting in escalating physical IO.

I’ve just done another test which backs up this claim. I altered my test database so that the block buffer cache was larger, 232MB compared to 100MB in my first tests. The full IOT is around 200MB

Bottom line, the creation of the IOT was greatly sped up (almost by a factor of 4) and the physical IO dropped significantly, by a factor of 20. As a result, the creation of the IOT was almost as fast as the partitioned IOT. It also shows that the true overhead on insert of using an IOT is more like a factor of 2 to 4 as opposed 6 to 8.

You can see some of the details below. Just to help you understand them, it is worth noting that I had added one new, larger column to the test tables (to help future tests) so the final segments were a little larger (the IOT now being 210MB as opposed to 180MB in the first tests) and there was a little more block splitting.

                        Time in Seconds
Object type           Run with       Run with
                     100MB cache    232MB cache
------------------  ------------    -----------   
Normal Heap table          171.9          119.4   
IOT table                1,483.8          451.4     
Partitioned IOT            341.1          422.6 

-- First reading 100MB cache
-- second reading 232MB cache 
STAT_NAME                            Heap    	IOT	      IOT P
-------------------------------- ---------- -----------  ----------
CPU used by this session            5,716         7,222       6,241
                                    5,498         5,967       6,207

DB time                            17,311       148,866      34,120
                                   11,991        45,459      42,320

branch node splits                     25            76          65
                                       25            82         107

leaf node 90-10 splits                752         1,463       1,466
                                      774         1,465       1,465

leaf node splits                    8,127        24,870      28,841
                                    8,162        30,175      40,678

session logical reads           6,065,365     6,422,071   6,430,281
                                6,150,371     6,544,295   6,709.679

physical read IO requests             123        81,458       3,068
                                      36          4,012       1,959

physical read bytes             2,097,152   668,491,776  25,133,056
                                1,400,832    34,037,760  16,048,128

user I/O wait time                    454       139,585      22,253
                                       39        34,510      19,293

The heap table creation was faster with more memory available. I’m not really sure why, the cpu effort was about the same as before and though there was some reduction in physical IO with the larger cache, I suspect it might be more to do with both the DB and the machine having been recently restarted.

All three tests are doing a little more “work” in the second run due to that extra column and thus slightly fewer rows fitting in each block (more branch node and leaf node splits), but this just highlights even more how much the IOT performance has improved, which correlates with a massive drop in physical IO for the IOT creation. If you check the session logical reads they are increased by a very small, consistent amount. Physical read IO requests have dropped significantly and, in the case of the IOT, plummeted.

I believe the 90:10 leaf node splits are consistent as that will be the maintaining of the secondary index on ACCO_TYPE and ACCO_ID, which are populated in order as the data is created (derived from rownum).

What this second test really shows is that the efficiency with which you are able to make use of the database cache is incredibly significant. Efficiently accessing data via good indexes or tricks like IOTs and hash tables is important but it really helps to also try and consider how data is going to be recycled within the cache or used, pushed out and then reused. A general principle for batch-type work seems to me to be that if you can process it in chunks that can sit in memory, rather than the whole working set, there are benefits to be gained. Of course, partitioning can really help with this.

{If anyone is wondering why, for the heap table, the number of physical IO requests has dropped by 70% but the actual number of bytes has dropped by only 30%, I’m going to point the finger to some multi-block read scan going on, either in recursive code or, more likely, my code that actually gathers those stats! That would also help explain the drop in user IO wait time for the heap run.}

Just for completeness, here is a quick check of my SGA components for the latest tests, just to show I am using the cache size I claim. All of this is on Oracle 11.1 enterprise edition, on a tired old Windows laptop. {NB new laptop arrived today – you have no idea how hard it has been to keep doing this blog and not play with the new toy!!!}. If anyone wants the test scripts in full, send me a quick email and I’ll provide them.:

-- sga_info.sql
-- Martin Widlake /08
-- summary
set pages 32
set pause on
col bytes form 999,999,999,999,999 head byts___g___m___k___b
spool sga_info.lst
select * 
from v$sgainfo
order by name
/
spool off
clear col
--
NAME                             byts___g___m___k___b RES
-------------------------------- -------------------- ---
Buffer Cache Size                         243,269,632 Yes
Fixed SGA Size                              1,374,892 No
Free SGA Memory Available                           0
Granule Size                                4,194,304 No
Java Pool Size                              4,194,304 Yes
Large Pool Size                             4,194,304 Yes
Maximum SGA Size                          401,743,872 No
Redo Buffers                                6,103,040 No
Shared IO Pool Size                                 0 Yes
Shared Pool Size                          142,606,336 Yes
Startup overhead in Shared Pool            50,331,648 No
Streams Pool Size                                   0 Yes

IOT Part 6 – Inserts and Updates Slowed Down (part A) November 1, 2011

Posted by mwidlake in Architecture, performance, Testing.
Tags: , , , ,
12 comments

<..IOT1 – the basics
<….IOT2 – Examples and proofs
<……IOT3 – Significantly reducing IO
<……..IOT4 – Boosting Buffer Cache efficiency
<……….IOT5 – Primary Key Drawback
…………>IOT6(B) – OLTP Inserts

A negative impact of using Index Organized Tables is that inserts are and updates can be significantly slowed down. This post covers the former and the reasons why – and the need to always run tests on a suitable system. (I’m ignoring deletes for now – many systems never actually delete data and I plan to cover IOTs and delete later)

Using an IOT can slow down insert by something like 100% to 1000%. If the insert of data to the table is only part of a load process, this might result in a much smaller overall impact on load, such as 25%. I’m going to highlight a few important contributing factors to this wide impact spread below.

If you think about it for a moment, you can appreciate there is a performance impact on data creation and modification with IOTs. When you create a new record in a normal table it gets inserted at the end of the table (or perhaps in a block marked as having space). There is no juggling of other data.
With an IOT, the correct point in the index has to be found and the row has to be inserted at the right point. This takes more “work”. The inserting of the new record may also lead to an index block being split and the extra work this entails. Similar extra work has to be carried out if you make updates to data that causes the record to move within the IOT.
Remember, though, that an IOT is almost certainly replacing an index on the heap table which, unless you are removing indexes before loading data and recreating them after, would have to be maintained when inserting into the Heap table. So some of the “overhead” of the IOT would still occur for the heap table in maintaining the Primary Key index. Comparing inserts or updates between a heap table with no indexes and an IOT is not a fair test.

For most database applications data is generally written once, modified occasionally and read many times – so the impact an IOT has on insert/update is often acceptable. However, to make that judgement call you need to know

  • what the update activity is on the data you are thinking of putting into an IOT
  • the magnitude of the impact on insert and update for your system
  • the ratio of read to write.

There is probably little point putting data into an IOT if you constantly update the primary key values (NB see IOT-5 as to why an IOT’s PK columns might not be parts of a true Primary Key) or populate previously empty columns or hardly ever read the data.

There is also no point in using an IOT if you cannot load the data fast enough to support the business need. I regularly encounter situations where people have tested the response of a system once populated but fail to test the performance of population.

Now to get down to the details. If you remember the previous posts in this thread (I know, it has been a while) then you will remember that I create three “tables” with the same columns. One is a normal heap table, one is an Index Organized Table and one is a partitioned Index Organized Table, partitioned into four monthly partitions. All tables have two indexes on them, the Primary Key index (which is the table in the case of the IOTs) and another, roughly similar index, pre-created on the table. I then populate the tables with one million records each.

These are the times, in seconds, to create 1 million records in the the HEAP and IOT tables:

                  Time in Seconds
Object type         Run_Normal
------------------  ----------
Normal Heap table        171.9  
IOT table               1483.8

This is the average of three runs to ensure the times were consistent. I am using Oracle V11.1 on a machine with an Intel T7500 core 2 Duo 2.2GHz, 2GB memory and a standard 250GB 5000RPM disk. The SGA is 256MB and Oracle has allocated around 100MB-120MB to the buffer cache.

We can see that inserting the 1 million rows into the IOT takes 860% the time it does with a heap table. That is a significant impact on speed. We now know how large the impact is on Insert of using an IOT and presumably it’s all to do with juggling the index blocks. Or do we?

This proof-of-concept (POC) on my laptop {which you can also run on your own machine at home} did not match with a proof-of-concept I did for a client. That was done on V10.2.0.3 on AIX, on a machine with 2 dual-core CPUS with hyper-threading (so 8 virtual cores), 2GB SGA and approx 1.5GB buffer cache, with enterprise-level storage somewhere in the bowels of the server room. The results on that machine to create a similar number of records were:

                  Time in Seconds
Object type         Run_Normal
------------------  ----------
Normal Heap table        152.0  
IOT table                205.9

In this case the IOT inserts required 135% the time of the Heap table. This was consistent with other tests I did with a more complex indexing strategy in place, the IOT overhead was around 25-35%. I can’t go into too much more detail as the information belongs to the client but the data creation was more complex and so the actual inserts were only part of the process – this is how it normally is in real life. Even so, the difference in overhead between my local-machine POC and the client hardware POC is significant, which highlights the impact your platform can have on your testing.

So where does that leave us? What is the true usual overhead? Below are my more full results from the laptop POC.

                        Time in Seconds
Object type         Run_Normal    Run_quiet    Run_wrong_p
------------------  ----------    ---------    -----------
Normal Heap table        171.9        81.83         188.27  
IOT table               1483.8      1055.35        1442.82
Partitioned IOT          341.1       267.83         841.22 

Note that with the partitioned IOT the creation took 341 second, the performance ratio to a heap table is only 198% and is much better than the normal IOT. Hopefully you are wondering why!

I’m running this test on a windows laptop and other things are going on. The timings for Run_Quiet are where I took steps to shut down all non-essential services and applications. This yielded a significant increase for all three object types but the biggest impact was on the already-fastest Heap table.

The final set of figures is for a “mistake”. I created the partitions wrong such that half the data went into one partition and the rest into another and a tiny fraction into a third, rather than being spread over 4 partitions evenly. You can see that the Heap and normal IOT times are very similar to the Run_Normal results (as you would expect as these test are the same) but for the partitioned IOT the time taken is half way towards the IOT figure.

We need to dig into what is going on a little further to see where the effort is being spent, and it turns out to be very interesting. During my proof-of-concept on the laptop I grabbed the information from v$sesstat for the session before and after each object creation so I could get the figures just for the loads. I then compared the stats between each object population and show some of them below {IOT_P means Partitioned IOT}.

STAT_NAME                            Heap    	IOT	        IOT P
------------------------------------ ---------- -------------  -----------
CPU used by this session                  5,716         7,222        6,241
DB time                                  17,311       148,866       34,120
Heap Segment Array Inserts               25,538            10           10

branch node splits                           25            76           65
leaf node 90-10 splits                      752         1,463        1,466
leaf node splits                          8,127        24,870       28,841

consistent gets                          57,655       129,717      150,835
cleanout - number of ktugct calls        32,437        75,201       88,701
enqueue requests                         10,936        28,550       33,265

file io wait time                     4,652,146 1,395,970,993  225,511,491
session logical reads                 6,065,365     6,422,071    6,430,281
physical read IO requests                   123        81,458        3,068
physical read bytes                   2,097,152   668,491,776   25,133,056
user I/O wait time                          454       139,585       22,253
hot buffers moved to head of LRU         13,077       198,214       48,915
free buffer requested                    64,887       179,653      117,316

The first section shows that all three used similar amounts of CPU, the IOT and partitioned IOT being a little higher. Much of the CPU consumed was probably in generating the fake data.The DB Time of course pretty much matches the elapsed time well as the DB was doing little else.
It is interesting to see that the Heap insert uses array inserts which of course are not available to the IOT and IOT_P as the data has to be inserted in order. {I think Oracle inserts the data into the heap table as an array and then updates the indexes for all the entries in the array – and I am only getting this array processing as I create the data as an array from a “insert into as select” type load. But don’t hold me to any of that}.

In all three cases there are two indexes being maintained but in the case of the IOT and IOT_P, the primary key index holds the whole row. This means there has to be more information per key, less keys per block and thus more blocks to hold the same data {and more branch blocks to reference them all}. So more block splits will be needed. The second section shows this increase in branch node and leaf block splits. Double the branch blocks and triple the leaf block splits. This is probably the extra work you would expect for an IOT. Why are there more leaf block splits for the partitioned IOT? The same data of volume ends up taking up more blocks in the partitioned IOT – 200MB for the IOT_P in four partitions of 40-60MB as opposed to a single 170MB for the IOT. The larger overall size of the partition is just due to a small overhead incurred by using partitions and also a touch of random fluctuation.

So for the IOT and IOT_P there is about three times the index-specific work being done and a similar increase in related statistics such as enqueues, but not three times as it is not just index processing that contribute to these other statistics. However, the elapsed time is much more than three times as much. Also, the IOT_P is doing more index work than the IOT but it’s elapsed time is less. Why?

The fourth section shows why. Look at the file io wait times. This is the total time spent waiting on IO {in millionths of a second} and it is significantly elevated for the IOT and to a lesser degree for the IOT_P. Physical IO is generally responsible for the vast majority of time in any computer system where it has not been completely avoided.
Session logical reads are only slightly elevated, almost negligably so but the number of physical reads to support it increases from 123 for the Heap table insert to 81,458 for the IOT and 3,068 for the IOT_P. A clue as to why comes from the hot buffers moved to head of LRU and free buffer requested statistics. There is a lot more activity in moving blocks around in the buffer cache for the IOT and IOT_P.

Basically, for the IOT, all the blocks in the primary key segment are constantly being updated but eventually they won’t all fit in the block buffer cache – remember I said the IOT is eventually 170MB and the buffer cache on my laptop is about 100MB – so they are flushed down to disk and then have to be read back when altered again. This is less of a problem for the IOT_P as only one partition is being worked on at a time (the IOT_P is partitioned on date and the data is created day by day) and so more of it (pretty much all) will stay in memory between alterations. The largest partition only grows to 60MB and so can be worked on in memory.
For the heap, the table is simply appended to and only the indexes have to be constantly updated and they are small enough to stay in the block buffer cache as they are worked on.

This is why when I got my partitioning “wrong” the load took so much longer. More physical IO was needed as the larger partition would not fit into the cache as it was worked on – A quick check shows that logical reads and in fact almost all statistics were very similar but 26,000 IO requests were made (compared to 81,458 for the IOT and 3,068 for the correct IOT_P).

Of course, I set my SGA size and thus the buffer cache to highlight the issue on my laptop and I have to say even I was surprised by the magnitude of the impact. On the enterprise-level system I did my client’s proof of concept on, the impact on insert was less because the buffer cache could hold the whole working set, I suspect the SAN had a considerable cache on it, there was ample CPU resource to cope with the added latching effort and the time taken to actually create the data inserted was a significant part of the workload, reducing the overall impact of the slowness caused by the IOT.

{Update, in This little update I increase my block buffer cache and show that physical IO plummets and the IOT insert performance increases dramatically}.

This demonstrates that a POC, especially one for what will become a real system, has to be a realistic volume on realistic hardware.
For my client’s POC, I still did have to bear in mind the eventual size of the live working set and the probably size of the live block buffer cache and make some educated guesses.

It also explains why my “run_quiet” timings showed a greater benefit for the heap table than the IOT and IOT_P. A windows machine has lots of pretty pointless things running that take up cpu and a bit of memory, not really IO so much. I reduced the CPU load and it benefits activity that is not IO, so it has more impact on the heap table load. Much of the time for the IOT and IOT_P is taken hammering the disk and that just takes time.

So, in summary:

  • Using an IOT increases the index block splitting and, in turn, enqueues and general workload. The increase is in proportion to the size of the IOT compared to the size of the replaced PK.
  • The performance degredation across the whole load process may well be less than 50% but the only way to really find out is to test
  • You may lose the array processing load that may benefit a heap table load if you do the load via an intermediate table.
  • With an IOT you may run into issues with physical IO if the segment (or part of the segment) you are loading into cannot fit into the buffer cache (This may be an important consideration for partitioning or ordering of the data loaded)
  • If you do a proof of concept, do it on a system that is as similar to the real one as you can
  • Just seeing the elapsed time difference between test is sometimes not enough. You need to find out where that extra time is being spent

I’ve thrown an awful lot at you in this one post, so I think I will stop there. I’ve not added the script to create the test tables here, they are in IOT-5 {lacking only the grabbing of the v$sesstat information}.

Friday Philosophy – The Dying Art of Database Design? September 9, 2011

Posted by mwidlake in Architecture, development, Friday Philosophy, rant.
Tags: , , ,
35 comments

How many people under the age of {Martin checks his age and takes a decade or so off} ohh, mid 30′s does any database design these days? You know, asks the business community what they want the system to do, how the information flows through their business, what information they need to report on. And then construct a logical model of that information? Judging by some of the comments I’ve had on my blog in the last couple of years and also the meandering diatribes of bitter, vitriolic complaints uttered by fellow old(er) hacks in the pub in the evening, it seems to be coming a very uncommon practice – and thus a rare and possibly dying skill.

{update – this topic has obviously been eating at my soul for many years. Andrew Clark and I had a discussion about it in 2008 and he posted a really good article on it and many, many good comments followed}

Everything seems to have turned into “Ready, Fire, Aim”. Ie, you get the guys doing the work in a room, develop some rough idea of what you want to develop (like, look at the system you are replacing), start knocking together the application and then {on more enlightened projects} ask the users what they think. The key points are the that development kicks off before you really know what you need to produce, there is no clear idea of how the stored data will be structured and you steer the ongoing development towards the final, undefined, target. I keep coming across applications where the screen layouts for the end users seem to almost be the design document and then someone comes up with the database – as the database is just this bucket to chuck the data into and scrape it out of again.

The functionality is the important thing, “we can get ‘someone’ to make the database run faster if and when we have a problem”.

Maybe I should not complain as sometimes I am that ‘someone’ making the database run faster. But I am complaining – I’m mad as hell and I ain’t gonna take it anymore! Oh, OK, in reality I’m mildly peeved and I’m going to let off steam about it. But it’s just wrong, it’s wasting people’s time and it results in poorer systems.

Now, if you have to develop a simple system with a couple of screens and a handful of reports, it might be a waste of time doing formal design. You and Dave can whack it together in a week or two, Chi will make the screens nice, it will be used by a handful of happy people and the job is done. It’s like building a wall around a flower bed. Go to the local builders merchants, get a pallet of bricks, some cement and sand (Ready), dig a bit of a trench where you want to start(Aim) and put the wall up, extending it as you see fit (Fire). This approach won’t work when you decide to build an office block and only a fool from the school of stupid would attempt it that way.

You see, as far as I am concerned, most IT systems are all about managing data. Think about it. You want to get your initial information (like the products you sell), present it to the users (those customers), get the new (orders) data, pass it to the next business process (warehouse team) and then mine the data for extra knowledge (sales patterns). It’s a hospital system? You want information about the patients, the staff, the beds and departments, tests that need doing, results, diagnoses, 15,000 reports for the regulators… It’s all moving data. Yes, a well design front end is important (sometimes very important) but the data is everything. If the database can’t represent the data you need, you are going to have to patch an alteration in. If you can’t get the data in quick enough or out quick enough, your screens and reports are not going to be any use. If you can’t link the data together as needed you may well not be able to DO your reports and screens. If the data is wrong (loses integrity) you will make mistakes. Faster CPUS are not going to help either, data at some point has to flow onto and off disks. Those slow spinning chunks of rust. CPUS have got faster and faster, rust-busting has not. So data flow is even more important than it was.

Also, once you have built your application on top of an inadequate database design, you not only have to redesign it, you have to:

  • do some quick, hacky  fixes to get by for now
  • migrate the existing data
  • transform some of it (do some data duplication or splitting maybe)
  • alter the application to cope
  • schedule all of the above to be done together
  • tie it in with the ongoing development of the system as hey, if you are not going to take time to design you are not going to take time to assess things before promising phase 2.

I’m utterly convinced, and experience backs this up, that when you take X weeks up front doing the database design, you save 5*X weeks later on in trying to rework the system, applying emergency hacks and having meetings about what went wrong. I know this is an idea out of the 80′s guys, but database design worked.

*sigh* I’m off to the pub for a pint and to reminisce about the good-old-days.

IOT Part 5 – Primary Key Drawback – and Workaround August 17, 2011

Posted by mwidlake in Architecture, development, performance, Uncategorized.
Tags: , , , ,
18 comments

<..IOT1 – the basics
<….IOT2 – Examples and proofs
<……IOT3 – Significantly reducing IO
<……..IOT4 – Boosting Buffer Cache efficiency
……….>IOT6a – Slowing Down Insert
…………>IOT6(B) – OLTP Inserts

One of the drawbacks of IOTs is that they have to be organised by the primary key of the table. If your table does not have a primary key, it cannot be Index Organized.

I would argue that any table that holds persistent data (ie it is not transient data about to be loaded into the database proper or a temporary working set) should have a Primary Key. If I am working on a system and come across a table without a Primary Key I immediately challenge it. {There are occasional, valid reasons for a persistent table to lack a PK, but I confess I am struggling right now to come up with one – but I digress}. I’m a big fan of database-enforced referential integrity.

The problem is, if you you are making a table into an Index Organized Table so that the records are clustered to match how you process the data, it could well be that the primary key is not related to how you want to order the data. Let me give you an example. {Oh, and for brevity, I’ll put the SQL statements to create the examples at the end of this post}.

mdw11> desc ACCOUNT
 Name                                                  Null?    Type
 ----------------------------------------------------- -------- ----------------------
 ACCO_TYPE                                             NOT NULL NUMBER(2)  ---PKK
 ACCO_ID                                               NOT NULL NUMBER(10) ---PK
 NAME                                                  NOT NULL VARCHAR2(100)
 DATE_1                                                NOT NULL DATE
 NUM_1                                                          NUMBER(2)
 NUM_2                                                          NUMBER(2)

mdw11> desc TRANSACTION_HEAP
 Name                                                  Null?    Type
 ----------------------------------------------------- -------- ----------------------
 TRAN_TYPE                                             NOT NULL NUMBER(2)  ---PK
 TRAN_ID                                               NOT NULL NUMBER(10) ---PK
 ACCO_TYPE                                             NOT NULL NUMBER(2)
 ACCO_ID                                               NOT NULL NUMBER(10)
 CRE_DATE                                              NOT NULL DATE
 VC_1                                                  NOT NULL VARCHAR2(100)
 DATE_1                                                         DATE
 NUM_1                                                          NUMBER(2)
 NUM_2                                                          NUMBER(2)

This is a classic parent-child relationship, each account has a set of transactions. I’ve expanded on my prior example by:

  • changing the parent to be called ACCOUNT and giving it a two-part Primary Key, ACCO_TYPE and ACCO_ID.
  • Changing the child to be called TRANSACTION and given it a Primary Key of TRAN_TYPE and TRAN_ID.
  • In a real system I would create a foreign key from TRANSACTION.ACCO_TYPE,ACCO_ID to the ACCOUNT table primary key.

Note that the Primary Key on the TRANSACTION table is NOT based on the account columns. Maybe in theory the primary key on the transaction table would be the account columns and the cre_date – if the cre_date held a datetime AND two records could not be created on the same second.  If we used a timestamp then you might be able to argue no record would be created in the same fraction of a second – except that often transactions get given a fixed time. Midnight springs to mind (consider when you would add the accrued interest on a savings account). So, a new surrogate Primary Key is intoduced, a transaction type and ID. TRAN_TYPE and TRAN_ID are the primary key of the TRANSACTION table.

I’d say that I see such two-part primary keys more often then single column primary keys these days. Possibly because so many databases receive information from other systems or even applications on the same database.

As before, I create 10,000 parent records (ACCOUNT) and 10,000 random child records (TRANSACTION_HEAP) each day for 100 days. 

Also as before, I want to select information grouped by account. I want all the transactions for an account, not all transactions on a day or for a range of transaction IDs. Hopefully this is a scenario most of you will recognise. 

Selecting a sum of one of the non-indexed columns and a count of records for a given account takes quite a bit of effort on the part of the HEAP table:

select sum(num_1), count(*) from transaction_heap th where acco_type=10 and acco_id=123

SUM(NUM_1)   COUNT(*)
---------- ----------
      1201        116
Elapsed: 00:00:02.68

Execution Plan
---------------------------------------------------------------------------------------
| Id  | Operation          | Name             | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |                  |     1 |    10 |  3466   (1)| 00:00:52 |
|   1 |  SORT AGGREGATE    |                  |     1 |    10 |            |          |
|*  2 |   TABLE ACCESS FULL| TRANSACTION_HEAP |   100 |  1000 |  3466   (1)| 00:00:52 |
---------------------------------------------------------------------------------------

Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
      13929  consistent gets
      13921  physical reads

Of course, it has to do a full table scan as my Primary Key is on two columns that have nothing to do with the query. I can repeat this statement as often as I like, it takes the same number of physical reads and consistent gets as it is not caching the information.

I add an index on the ACCO_TYPE, ACCO_ID and CRE_DATE columns and re-run the query:

select sum(num_1),count(*) from transaction_heap th where acco_type=10 and acco_id=123

SUM(NUM_1)   COUNT(*)
---------- ----------
      1201        116
Elapsed: 00:00:00.01

Execution Plan
---------------------------------------------------------------------------------------------------
| Id  | Operation                    | Name               | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |                    |     1 |    10 |   103   (0)| 00:00:02 |
|   1 |  SORT AGGREGATE              |                    |     1 |    10 |            |          |
|   2 |   TABLE ACCESS BY INDEX ROWID| TRANSACTION_HEAP   |   100 |  1000 |   103   (0)| 00:00:02 |
|*  3 |    INDEX RANGE SCAN          | TRHE_ACCO_CRDA_IDX |   100 |       |     3   (0)| 00:00:01 |
---------------------------------------------------------------------------------------------------

Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
        120  consistent gets
          0  physical reads

I ran it twice to get rid of the parse overhead, but the first time it did a load of physical reads to support those 120 consistent gets.

I could recreate the TRANSACTION_HEAP table as an IOT of course – but it will be organized by the TRAN_TYPE and TRAN_ID columns. That is useless to me. Even if I add a secondary index on the ACCO_TYPE, ACCO_ID and CRE_DATE columns it will at best be no better than the above HEAP table and, because the secondary index will hold rowid guesses and will sometimes have to use the primary key information to walk down the index, it will be worse. {I am not sure I have explained that bit yet about row guesses. Post 6?}

So, if you want the information organized in an order that is not helped by the Primary Key of the table, an IOT is useless to you. You cannot achieve that physical record grouping by the IOT method.

I am going to do something else though. I’m going to sort of change the rules to work around the issue.

As far as the physical implementation is concerned, a Primary Key is in effect just a unique index and two rules. The rules are that all the columns in the Primary Key must be mandatory and there can only be one PK on a table. I can have as many unique indexes as I like, so long as the key combinations lead to no duplicate rows. I can alter my Primary Key – it is not set in stone.

Before I go any further I am going to stress that I am about to abuse the concept of the Primary Key. I’d need to do a seperate blog to fully justify saying what a Primary Key is, but part of the concept is that no column must be derivable from other columns in the PK and it must be the minimum number of columns required to make the key unique.

We want to group the data by the account columns and the creation date. So let’s define a Primary Key that is ACCO_TYPE, ACCO_ID, CRE_DATE and whatever else we need to guarantee the key is unique. In our case that would be TRAN_TYPE and TRAN_ID – the current Primary Key! If I knew I would always want all records for the account, I could drop the CRE_DATE out of my fake Primary Key, but I know that the creation date is very often important. You may want activity for the last month, last quarter, a stated date or even an exact datetime. For all those cases, including the CRE_DATE column is highly beneficial.

So, I create TRANSACTION_IOT below and populate it with data.

desc transaction_iot
 Name                                                        Null?    Type
 ----------------------------------------------------------- -------- --------------
 TRAN_TYPE                                                   NOT NULL NUMBER(2)
 TRAN_ID                                                     NOT NULL NUMBER(10)
 ACCO_TYPE                                                   NOT NULL NUMBER(2)
 ACCO_ID                                                     NOT NULL NUMBER(10)
 CRE_DATE                                                    NOT NULL DATE
 VC_1                                                        NOT NULL VARCHAR2(100)
 DATE_1                                                               DATE
 NUM_1                                                                NUMBER(2)
 NUM_2                                                                NUMBER(2)

--
--

OWNER    TABLE_NAME          NUM_ROWS      BLOCKS AVG_L GLS ULS LST_ANL      PRT  SAMP_SIZE
-------- -------------- ------------- ----------- ----- --- --- ------------ --- ----------
MDW      TRANSACTION_IO      1000,000                94 YES NO  160811 23:05 NO     1000000
         T
INDEX_NAME      TYP PRT UNQ BL     L_BLKS   DIST_KEYS       CLUSTF     LB_KEY     DB_KEY LST_ANL
--------------- --- --- --- -- ---------- ----------- ------------ ---------- ---------- ------------
TRIO_PK         IOT NO  UNI  2     21,433    1058,381            0          1          1 160811 23:05
TRIO_TRAN_UQ    NOR NO  UNI  2      4,386    1000,000      999,405          1          1 160811 23:05

INDEX_NAME                   TABLE_NAME       PSN COL_NAME
---------------------------- ---------------- --- ------------------------------------------------
TRIO_PK                      TRANSACTION_IOT  1   ACCO_TYPE
TRIO_PK                      TRANSACTION_IOT  2   ACCO_ID
TRIO_PK                      TRANSACTION_IOT  3   CRE_DATE
TRIO_PK                      TRANSACTION_IOT  4   TRAN_TYPE
TRIO_PK                      TRANSACTION_IOT  5   TRAN_ID
TRIO_TRAN_UQ                 TRANSACTION_IOT  1   TRAN_TYPE
TRIO_TRAN_UQ                 TRANSACTION_IOT  2   TRAN_ID

Now let’s select our data from that IOT.

select sum(num_1),count(*) from transaction_IOT th where acco_type=10 and acco_id=123

SUM(NUM_1)   COUNT(*)
---------- ----------
      1030         97
Elapsed: 00:00:00.00

Execution Plan
-----------------------------------------------------------------------------
| Id  | Operation         | Name    | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |         |     1 |    10 |     5   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE   |         |     1 |    10 |            |          |
|*  2 |   INDEX RANGE SCAN| TRIO_PK |   100 |  1000 |     5   (0)| 00:00:01 |
-----------------------------------------------------------------------------

Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
          5  consistent gets
          0  physical reads

5 consistent gets. It has walked down the IOT and scanned 3 blocks to collect that data. Our IOT based on an abused Primary Key does the job of supporting range scans efficiently, with the benefits to the Block Buffer Cache I refered to in IOT4

That “Primary Key” I created is NOT a real Primary key. It is not the minimum number of columns I need to uniquely identify a column. My Primary key is on ACCO_TYPE, ACCO_ID, CRE_DATE,TRAN_TYPE and TRAN_ID – the account, the datetime of the transaction and the transaction. What if I was to alter the datetime by a second? I could create a record with the same account, the same transaction_id as an existing record but a second into the future. That is just wrong. After all, the whole point of the TRAN_TYPE and TRAN_ID is to uniquely identify a record. If created the new record I stated above, there would be two records for the one TRAN_TYPE/TRAN_ID.

I protect against this ability to create incorrect records by creating a UNIQUE KEY against the table also, against columns TRAN_TYPE and TRAN_ID. This is unique index TRIO_TRAN_UQ as displayed in the information above. A Primary Key is usually the referenced parent of any referential integrity, ie foreign keys, between this table and any children. However, a Unique Key can also be the target of Referential Integrity. I cannot create a record in TRANSACTION_IOT with the same TRAN_TYPE/TRAN_ID as already exists due to this unique constraint:

insert into transaction_iot_p
values
(2,163 -- existing transaction type and id
,10,11111
,sysdate,'ASCAFWEWEHGWSHERJH',SYSDATE,7,7)
/

insert into transaction_iot_p
*
ERROR at line 1:
ORA-00001: unique constraint (MDW.TIP_TRAN_UQ) violated

Elapsed: 00:00:00.34

So, I have my IOT to support querying code and I have my Unique Constraint to police my original Primary Key and be used as the target for any Foreign Key requirements I might need. This is not a perfect solution – the design will look a little strange to anyone who looks at this database and the Unique Key is supported by a secondary index on an IOT which can have some issues. But it does work.

My “primary key” is no longer a true Primary Key. It is just a tool for allowing me to organise the data physically in a way that will support my application. That is what I meant about changing the rules.

I am willing to abuse a Primary Key in this way because of the performance benefits. It is a solution for a system where most of the query access is against a set of records which would be scatter-gunned across a table if you did not use some sort of physical grouping. If you are reading this and thinking “oh, I am not sure about you doing that to a Primary Key Martin” then you are probably OK to consider this solution. If you can’t see a problem with it then you are either very used to turning off referential integrity and understand the consequences – or you simply do not understand what RI does for your database. If you are in the latter camp, do not even consider doing this. If you are one of those people who works on data warehouse and for whom is it just part of the DW process to turn off RI as that is what you do for data warehouses – DON’T do this!

OK, I’m nearly at the end of this topic but I want to touch on partitioning. You can range partitition an Index Organized Table from 9i I think. It is certainly supported in Oracle 10 upwards. Partitioning is important in this technique because a unique index must contain the partition key if the index is to be locally partitioned – otherwise the index must be global, ie the one index object references all the partitions across the table.

Below is my table creation statement for the IOT organized by the account, creation date and transaction. The table is ranged partitioned by CRE_DATE, into months.

create table transaction_IOT_P
(tran_type number(2)     not null
,tran_id   number(10)    not null
,acco_type number(2)     not null
,acco_id   number(10)    not null
,cre_date  date          not null
,vc_1      varchar2(100) not null
,date_1    date
,num_1     number(2)
,num_2     number(2)
,constraint tip_pk primary key(ACCO_TYPE,ACCO_ID,CRE_DATE,TRAN_TYPE,TRAN_ID) 
--  using index tablespace index_01
,constraint tip_tran_uq unique (TRAN_TYPE,TRAN_ID)
  using index tablespace index_01
)
organization index
tablespace data_01
partition by range  (cre_date)
(partition rm20110601 values less than (to_date('01-06-2011','DD-MM-YYYY')) 
  tablespace data_01
,partition rm20110701 values less than (to_date('01-07-2011','DD-MM-YYYY'))
  tablespace data_01
,partition rm20110801 values less than (to_date('01-08-2011','DD-MM-YYYY'))
  tablespace data_01
,PARTITION RMTOP  VALUES LESS THAN (MAXVALUE)
  tablespace USERS
)
/

You can see the definition of my fake Primary Key and the fact that it does not have a tablespace defined for it – as the ‘organization index’ statement lower down causes the table to be an IOT and the segment will go into the “table” tablespace.
I then state my Unique Index to police the integrity of my table – TIP_TRAN_UQ
I then state the partition clause, ‘partition by range (cre_date)’ followed by my initial partition definitions. It’s as simple as that to partition an IOT.

What gets created? A set of four segments for the IOT, which are primary key index segments of course, not table segments:

@seg_dets
Enter value for seg_name: tip_pk
Enter value for owner: mdw

OWNER    SEG_NAME        SEG TS_NAME     BYTES_K    BLOCKS exts   INI_K   NXT_K
-------- --------------- --- -------- ---------- --------- ---- ------- -------
MDW      TIP_PK RM201106 IP  DATA_01      45,056     5,632   59      64    1024
         01
MDW      TIP_PK RM201107 IP  DATA_01      60,416     7,552   74      64    1024
         01
MDW      TIP_PK RM201108 IP  DATA_01      61,440     7,680   75      64    1024
         01
MDW      TIP_PK RMTOP    IP  USERS        34,816     4,352   49      64    1024

Note that the SEG (type) is “IP” – my script decodes the type into a short mnemonic and IP is Index Partition. You can see the tablespaces those segments are in and the size of the segments. What about that unique index I created?

@seg_dets
Enter value for seg_name: tip_tran_uq
Enter value for owner: mdw

OWNER    SEG_NAME        SEG TS_NAME     BYTES_K    BLOCKS exts   INI_K   NXT_K
-------- --------------- --- -------- ---------- --------- ---- ------- -------
MDW      TIP_TRAN_UQ     IND INDEX_01     35,840     4,480   50      64    1024

It is a single segment, a normal index. I cannot have it as a locally partitioned index as it is a unique index and lacks the partitioning key in it’s definition.

This could be a problem. The usual reason you partition a table is because it is too large to comfortably be held as a single segment {and also for the benefit of partition exclusion, but you don’t usually need that on small tables!}. This means that the global index to support that primary key is going to be large. Now, I made a “mistake” when I created my partitioned IOT – I did not create a partition for this month, some data has gone into the MAXVALUE partition (see the size of the segment above, 34K and 49 extents). If I split that last partition to create a new partition for this month and a new MAXVALUE partition, I will invalidate the global index and I will have to rebuild it. Very large indexes can take a long time and a heck of a lot of temporary space to gather and sort the data. That could be an ongoing maintenance nightmare.

In a recent implementation I did using IOTs I did not create a global unique index to replace the original foreign key. I create a non-unique, locally partitioned index to support some queries using those columns and the table had no children so no Foreign Keys were needed. But there was something else I needed to do as I had removed the referential integrity rules for that table. Remember I sad I am a fan of database enforced referential integrity? Now I “know” the application will not create data that will break the removed Primary Key rule, I “know” I documented what I had done. And I know that in 12 months time there will almost certainly be data that will have duplicate values for that Primary Key if it is not enforced somehow, because it always happends. I need to implement a little script to regularly check for duplicate TRAN_TYPE/TRAN_ID conmbinations being created. If you remove RI from a relational database, you should replace it in some way. Otherwise, you will pretty soon have a non-relational database.

That’s it for this topic. The below is my example script for creating most of the above, in case anyone wants it or wants to verify what I have said.

-- test_iot2.sql
-- create test tables to show how you can work around the PK issue and
-- partition an IOt - and the possible impact on my PK workaround.
spool test_iot2.lst
--
set feed on timi on pause off
--
drop table account purge;
drop table transaction_heap purge;
drop table transaction_iot purge;
drop table transaction_iot_p purge;
--
-- create 10,000 parent records
create table mdw.account
(ACCO_type  number(2)     not null 
,ACCO_id       number(10)    not null 
,name     varchar2(100) not null
,date_1   date          not null
,num_1    number(2)
,num_2    number(2) 
,constraint ACCO_pk primary key(ACCO_type,ACCO_id) 
 using index tablespace index_01
)
tablespace data_01
/
insert into account
select 10 
,rownum
,dbms_random.string('U',mod(rownum,10)+50)
,sysdate-(mod(rownum,500)+1000)
,mod(rownum,99)+1
,trunc(dbms_random.value(0,100))
from dual connect by level <= 5000
/
insert into account
select 15 
,rownum
,dbms_random.string('U',mod(rownum,10)+50)
,sysdate-(mod(rownum,500)+1000)
,mod(rownum,99)+1
,trunc(dbms_random.value(0,100))
from dual connect by level <= 5000
/
--
-- create the table to hold the children as a heap table
create table transaction_heap
(tran_type number(2)     not null
,tran_id   number(10)    not null
,ACCO_type number(2)     not null
,ACCO_id   number(10)    not null
,cre_date  date          not null
,vc_1      varchar2(100) not null
,date_1    date
,num_1     number(2)
,num_2     number(2)
,constraint trhe_pk primary key(tran_type,tran_id) 
 using index tablespace index_01
)
tablespace data_01
/
--
create index trhe_ACCO_crda_idx
on transaction_heap(ACCO_type,ACCO_id,cre_date)
tablespace index_01
/
-- populate the Heap table
-- 100 days, 10000 people
declare
v_num number :=10000; -- number of people
v_str varchar2(60);
begin
dbms_output.put_line (to_char(SYSTIMESTAMP,'HH24:MI:SS.FF'));
for i in 1..100 loop --days to do
  v_str:=dbms_random.string('U',60);
  insert into transaction_heap
    (tran_type,tran_id,ACCO_type,ACCO_id,cre_date,vc_1,date_1,num_1,num_2)
  select mod(rownum,3)+1 
   ,((i-1)*v_num)+rownum
   , 5+(trunc(dbms_random.value(1,3))*5)
   ,trunc(dbms_random.value(1,v_num/2))
   ,sysdate-(100-i) + (rownum/(60*60*24) )
   ,substr(v_str,1,51+mod(rownum,10))
   ,sysdate-(100-i) + ((mod(rownum,30)+1)/3)
   ,mod(rownum,20)+1
   ,mod(rownum,99)+1
  from dual connect by level <=v_num;
end loop;
dbms_output.put_line (to_char(SYSTIMESTAMP,'HH24:MI:SS.FF'));
end;
/
--
--
--
create table transaction_IOT
(tran_type number(2)     not null
,tran_id   number(10)    not null
,ACCO_type number(2)     not null
,ACCO_id   number(10)    not null
,cre_date  date          not null
,vc_1      varchar2(100) not null
,date_1    date
,num_1     number(2)
,num_2     number(2)
,constraint trio_pk primary key(ACCO_type,ACCO_id,cre_date,tran_type,tran_id) 
--  using index tablespace index_01
,constraint trio_tran_uq unique (tran_type,tran_id)
  using index tablespace index_01
)
organization index
tablespace data_01
/
--
-- populate the IOT table
-- 100 days, 10000 people
declare
v_num number :=10000; -- number of people
v_str varchar2(60);
begin
dbms_output.put_line (to_char(SYSTIMESTAMP,'HH24:MI:SS.FF'));
for i in 1..100 loop --days to do
  v_str:=dbms_random.string('U',60);
  insert into transaction_IOT
    (tran_type,tran_id,ACCO_type,ACCO_id,cre_date,vc_1,date_1,num_1,num_2)
  select mod(rownum,3)+1 
   ,((i-1)*v_num)+rownum
   , 5+(trunc(dbms_random.value(1,3))*5)
   ,trunc(dbms_random.value(1,v_num/2))
   ,sysdate-(100-i) + (rownum/(60*60*24) )
   ,substr(v_str,1,51+mod(rownum,10))
   ,sysdate-(100-i) + ((mod(rownum,30)+1)/3)
   ,mod(rownum,20)+1
   ,mod(rownum,99)+1
  from dual connect by level <=v_num;
end loop;
dbms_output.put_line (to_char(SYSTIMESTAMP,'HH24:MI:SS.FF'));
end;
/
create table transaction_IOT_P
(tran_type number(2)     not null
,tran_id   number(10)    not null
,ACCO_type number(2)     not null
,ACCO_id   number(10)    not null
,cre_date  date          not null
,vc_1      varchar2(100) not null
,date_1    date
,num_1     number(2)
,num_2     number(2)
,constraint tip_pk primary key(ACCO_type,ACCO_id,cre_date,tran_type,tran_id) 
--  using index tablespace index_01
,constraint tip_tran_uq unique (tran_type,tran_id)
  using index tablespace index_01
)
organization index
tablespace data_01
partition by range  (cre_date)
(partition rm20110601 values less than (to_date('01-06-2011','DD-MM-YYYY')) 
  tablespace data_01
,partition rm20110701 values less than (to_date('01-07-2011','DD-MM-YYYY'))
  tablespace data_01
,partition rm20110801 values less than (to_date('01-08-2011','DD-MM-YYYY'))
  tablespace data_01
,PARTITION RMTOP  VALUES LESS THAN (MAXVALUE)
  tablespace USERS
)
/
-- populate the IOT_P table
-- 100 days, 10000 people
declare
v_num number :=10000; -- number of people
v_str varchar2(60);
begin
dbms_output.put_line (to_char(SYSTIMESTAMP,'HH24:MI:SS.FF'));
for i in 1..100 loop --days to do
  v_str:=dbms_random.string('U',60);
  insert into transaction_IOT_P
    (tran_type,tran_id,ACCO_type,ACCO_id,cre_date,vc_1,date_1,num_1,num_2)
  select mod(rownum,3)+1 
   ,((i-1)*v_num)+rownum
   , 5+(trunc(dbms_random.value(1,3))*5)
   ,trunc(dbms_random.value(1,v_num/2))
   ,sysdate-(100-i) + (rownum/(60*60*24) )
   ,substr(v_str,1,51+mod(rownum,10))
   ,sysdate-(100-i) + ((mod(rownum,30)+1)/3)
   ,mod(rownum,20)+1
   ,mod(rownum,99)+1
  from dual connect by level <=v_num;
end loop;
dbms_output.put_line (to_char(SYSTIMESTAMP,'HH24:MI:SS.FF'));
end;
/
commit;
--
exec dbms_stats.gather_table_stats(ownname=>USER,tabname=>'ACCOUNT')
exec dbms_stats.gather_table_stats(ownname=>USER,tabname=>'TRANSACTION_HEAP')
exec dbms_stats.gather_table_stats(ownname=>USER,tabname=>'TRANSACTION_IOT')
exec dbms_stats.gather_table_stats(ownname=>USER,tabname=>'TRANSACTION_IOT_P')
--
select * from transaction_iot_p
where rownum < 10
/
insert into transaction_iot_p
values
(2,163 -- existing transaction type and id
,1,11111
,sysdate,'ASCAFWEWEHGWSHERJH',SYSDATE,7,7)
/
insert into transaction_iot_p
values
(3,163 -- new transaction type and id
,1,11111 -- but the whole of the rest of the record is the same.
,sysdate,'ASCAFWEWEHGWSHERJH',SYSDATE,7,7)
/
--
BEGIN
dbms_output.put_line (to_char(SYSTIMESTAMP,'HH24:MI:SS.FF'));
END;
/
--
spool off

Friday Philosophy – Oracle Performance Silver Bullet August 5, 2011

Posted by mwidlake in Architecture, Friday Philosophy, performance.
Tags: , , ,
15 comments

Silver Cartridge and Bullet

For as long as I have been working with Oracle technology {which is now getting towards 2 decades and isn’t that pause for thought} there has been a constant search for Performance Silver Bullets – some trick or change or special init.ora parameter {alter system set go_faster_flag=’Y’} you can set to give you a guaranteed boost in performance. For all that time there has been only one.

There are a few performance Bronze Bullets…maybe Copper Bullets. The problem is, though, that the Oracle database is a complex piece of software and what is good for one situation is terrible for another. Often this is not even a case of “good 90% of the time, indifferent 9% of the time and tragic 1% of the time”. Usually it is more like 50%:30%:20%.

Cartridge with copper bullet &spent round

I’ve just been unfair to Oracle software actually, a lot of the problem is not with the complexity of Oracle, it is with the complexity of what you are doing with Oracle. There are the two extremes of OnLine Transaction Processing (lots of short running, concurrent, simple transactions you want to run very quickly by many users) and Data Warehouse where you want to process a vast amount of data by only a small number of users. You may well want to set certain initialisation parameters to favour quick response time (OLTP) or fastest processing time to completion (DW). Favouring one usually means a negative impact on the other. Many systems have both requirements in one… In between that there are the dozens and dozens of special cases and extremes that I have seen and I am just one guy. People get their database applications to do some weird stuff.

Partitioning is a bronze bullet. For many systems, partitioning the biggest tables makes them easier to manage, allows some queries to run faster and aids parallel activity. But sometimes (more often than you might think) Partitioning can drop rather than increase query or DML performance. In earlier versions of Oracle setting optimizer_index_caching and optimizer_index_cost_adj was often beneficial and in Oracle 9/8/7 setting db_file_multiblock_read_count “higher” was good for DWs….Go back to Oracle 7 and doing stuff to increase the buffer cache hit ratio towards 98% was generally good {and I will not respond to any comments citing Connors magnificent “choose your BCHR and I’ll achieve it” script}.
You know what? There was an old trick in Oracle 7 you could maybe still look at as a bronze bullet. Put your online redo logs and key index tablespaces on the fastest storage you have and split your indexes/tables/partitions across the faster/slower storage as is fit. Is all your storage the same speed? Go buy some SSD and now it isn’t….

Cartridge with Wooden Bullet

Then there are bronze bullets that you can use that very often improve performance but the impact can be catastrophic {Let’s call them wooden bullets :-) }. Like running your database in noarchivelog mode. That can speed up a lot of things, but if you find yourself in the situation of needing to do a recovery and you last cold backup is not recent enough – catastrophe. A less serious but more common version of this is doing things nologging. “oh, we can just re-do that after a recovery”. Have you done a test recovery that involved that “oh, we can just do it” step? And will you remember it when you have a real recovery situation and the pressure is on? Once you have one of these steps, you often end up with many of them. Will you remember them all?

How many of you have looked at ALTER SYSTEM SET COMMIT_WRITE=’BATCH,NOWAIT’? It could speed up response times and general performance on your busy OLTP system. And go lose you data on crash recovery. Don’t even think about using this one unless you have read up on the feature, tested it, tested it again and then sat and worried about could possibly go wrong for a good while.

That last point is maybe at the core of all these Performance Bronze Bullets. Each of these things may or may not work but you have to understand why and you have to understand what the payback is. What could now take longer or what functionality have I now lost? {hint, it is often recovery or scalability}.

So, what was that one Silver Bullet I tantalizingly left hanging out for all you people to wait for? You are not going to like this…

Look at what your application is doing and look at the very best that your hardware can do. Do you want 10,000 IOPS a second and your storage consists of less than 56 spindles? Forget it, your hardware cannot do it. No matter what you tune or tweak or fiddle with. The one and only Performance Silver Bullet is to look at your system and your hardware configuration and work out what is being asked and what can possibly be delivered. Now you can look at:

  • What is being asked of it. Do you need to do all of that (and that might involve turning some functionality off, if it is a massive drain and does very little to support your business).
  • Are you doing stuff that really is not needed, like management reports that no one has looked at in the last 12 months?
  • Is your system doing a heck of a lot to achieve a remarkably small amount? Like several hundred buffer gets for a single indexed row? That could be a failure to do partition exclusion.
  • Could you do something with physical data positioning to speed things up, like my current blogging obsession with IOTs?
  • You can also look at what part of your hardware is slowing things down. Usually it is spindle count/RAID level, ie something dropping your IOPS. Ignore all sales blurb from vendors and do some real-world tests that match what you app is or wants to do.

It’s hard work but it is possibly the only Silver Bullet out there. Time to roll up our sleeves and get cracking…

{Many Thanks to Kevin Closson for providing all the pictures – except the Silver Bullet, which he only went and identified in his comment!}

Server Bought for the 1 Grand Challenge December 6, 2010

Posted by mwidlake in Architecture, One Grand Server, performance.
Tags: ,
3 comments

What seems like a couple of months ago I suggested the idea of The Fastest Oracle Server for a Grand. It turns out this was actually over 1/3 of a year ago! {such is the rapid passing of normal time}. Anyway, I’ve decided to give this a go.

The intention is that I am going to build a server based on PC technology which costs less than £1,000 and see how fast I can make it go. Of course “how fast” is a “piece of string” question – it depends on what you put into the Oracle database, how you want to use or manipulate the data and how business-ready the system is. I’m intending to build something that looks very, very un-business ready. That is, not a lot of redundancy. Before anyone wants to shoot me down for that (a) I am not running a bank or anything to do with finance (b) why are banks systems that only deal with cash so much more regulated and goverend than medical systems that are only relied on to keep you alive? (c) some of the biggest systems I know of are actually running on something close to PC kit.

I’m quietly confident I can build something that out-performs systems consisting 100 times as much. Now, that is a massive claim and I won’t be too sad if I fall short, but you can do a lot with modest kit. I worked for a charity for 6 years and boy did I see some clever stuff done on the sort of budget many organisation spend on office stationary.

So, what have I got so far? I confess I held off until I saw some new technology appear in a price band I could squeeze in. Namely USB3 and SATA3. There is always something just around the corner but I wanted those as I want to maximise the impact of solid state storage. So, my base server is:

  • Asus P7P55D-E motherboard supporting DDR3, USB3 and SATA3
  • Intel i5 760 2.8HHz chip
  • 8GB memory
  • 1TB samsung 7200rpm SATAII disk
  • AZCool Infinity 800W PSU
  • Coolmaster Elite RC-335 case

I chose the motherboard as it was getting good reviews and had the SATA3 and USB3 ports. I chose the case as it was large enough to take many hard drives, small enough to lug about and was a nice case. I stuck to 8GB RAM as RAM is expensive at the moment, but as it is in 2GB chunks I might regret that choice as all my slots are full. Many people forget the PSU but it’s like the tyers on your car. Those tyers keep you stuck to the road, a PSU keeps you powered. It might be utilitarian but they are vital and often overlooked. The hard disc is pretty good, but very likely to be swapped out (I don’t mind sticking it in another system). The CPU is a proper quad core CPU. I had plenty of scope to go bigger and better on the CPU but for grunt for cash, it seems presently to be the sweet spot.

The basic unit is not overclocked. I will increase the cooling and overclocking will be an option. It comes with 64 bit windows but linux is almost certainly going to be the faster option. No monitor is included but hey, it’s a database server, you don’t need fancy graphics. That old CRT in the corner will do! The server does have a rather nice nVidia GeForce GTX 460 in it but I am cutting out the cost of that. The server is currently the best gaming machine I have but that will end when I get time to start working on the Oracle side.

Total cost, £615 or so. That is like $615 seeing as we get so ripped off in the UK for IT kit. I can now go spend money on more fast hard discs, SSDs, even fast USB memory sticks. Any suggestions, I am happy to listen.

The biggest question is – When am I going to get time to work on this damn thing?

Database Sizing – How much Disk do I need? (The Easy Way) November 11, 2010

Posted by mwidlake in Architecture, development, VLDB.
Tags: , , , ,
7 comments

How much Disk do I need for my new Oracle database? Answer:-

  • 8-10 times the volume of raw data for an OLTP system
  • 2-4 times the raw data volume for a Data Warehouse.
  • The bigger the database, the nearer you will be to the lower multiplication factors.

{Disclaimer. This is of course just my opinion, based on some experience. If you use the above figures for a real project and get the total disc space you need wrong, don’t blame me. If you do and it is right, then of course you now owe me a beer.}

Many of us have probably had to calculate the expected size a database before, but the actual database is only one component of all the things you need to run the Oracle component of your system. You need to size the other components too – Archived redo logs, backup staging area, dataload staging area, external files, the operating system, swap space, the oracle binaries {which generally gets bigger every year but shrink in comparison to the average size of an Oracle DB} etc…

In a similar way to my thoughts on how much database space you need for a person, I also used to check out the total disk space every database I created and those that I came across took up. {A friend emailed me after my earlier posting to ask if I had an obsession about size. I think the answer must be “yes”}.

First of all, you need to know how much “raw data” you have. By this I mean what will become the table data. Back in the early 90’s this could be the total size of the flat files the old system was using, even the size of the data as it was in spreadsheets. An Oracle export file of the system gives a pretty good idea of the raw data volume too. Lacking all these then you need to roughly size your raw data. Do a calculation of “number_of_rows*sum_of_columns” for your biggest 10 tables (I might blog more on this later). Don’t be tempted to overestimate, my multipliers allow for the padding.

Let us say you have done this and it is 60GB of raw data for an OLTP system. Let the storage guys know you will probably want about 500GB of space. They will then mentally put it down as “of no consequence” as if you have dedicated storage guys you probably have many terabytes of storage. {Oh, I should mention that I am not considering redundancy at all but space that is provided. The amount of actual spinning disk is down to the level and type of RAID you storage guys make you use. That is a whole other discussion}.

If you come up with 5TB of raw data for a DW system then you need around 12-15TB of disk storage.

If you come up with more than a Terabyte or so of raw data for an OLTP system or 10 to 20 Terabytes for a DW, when you give you figures to the storage guys/procurement people then they may well go pale and say something like “you have got to be kidding!”. This is part of why the multiplication factor for Data Warehouses and larger systems in general is less, as you are forced to be more careful about the space you allocate and how you use it.

The overhead of total disk space over Raw data reduces as the database gets bigger for a number of reasons:

  • The size of the Oracle binaries and the OS does not change as the database gets bigger.
  • The size of swap space does not increase in line wiht the database as, generally speaking, if you increase the database size from 100GB to 1TB you do not have the luxury of increasing the system memory of your server. It probably doubles.
  • Very large databases tend to have something making them big, like images or embedded documents, which are not indexed. Thus the ratio of table segments to index segments increases.
  • If you have a very large database you start removing indexes (often those that support constraints) to aid performance of data load and management, again improving the ratio of table segments to index segments.
  • Backups become partial or incremental to reduce the size and duration of the backup.
  • As mentioned before, the sheer size of system is such that you just take more care over cleaning up work areas, paring down the archived redo log areas (those files to compress well) and other areas.
  • If things get extreme or you have been doing this for donkeys years {note to none-UK people, this means many, many years} you start altering PCTFREE and checking over extent sizes.

My best ever ratio of database size to raw data was around 1.6 and it took an awful lot of effort and planning to get there. And an IT manager who made me very, very aware of how much the storage was costing him (it is not the disks, it’s all the other stuff).

I should also just mention that the amount of disk you need is only one consideration. If you want your database to perform well you need to consider the number of spindles. After all, you can create a very large database indeed using a single 2TB disc – but any actual IO will perform terribly.

How Big is a Person? November 5, 2010

Posted by mwidlake in Architecture.
Tags: , ,
5 comments

How big are you in the digital world?

By this, I mean how much space do you (as in, a random person) take up in a database? If it is a reasonably well designed OLTP-type database a person takes up 4K. OK, around 4K.

If your database is holding information about people and something about them, then you will have about 4K of combined table and index data per person. So if your database holds 100,000 customers, then your database is between 200MB and 800MB, but probably close to 400MB. There are a couple of situations I know of where I am very wrong, but I’ll come to that.

How do I know this? It is an accident of the projects and places I have worked at for 20 years and the fact that I became strangely curious about this. My first job was with the NHS and back then disk was very, very expensive. So knowing how much you needed was important. Back then, it was pretty much 1.5K per patient. This covered personal details (names, addresses, personal characteristics), GP information, stays at hospitals, visits to outpatient clinics etc,. It also included the “reference “ data, ie the information about consultants, wards and departments, lookups etc. If you included the module for lab tests it went up to just over 2K. You can probably tell that doing this sizing was a job I handled. This was not Oracle, this was a database called MUMPS and we were pretty efficient in how we held that data.

When I moved to work on Oracle-based hospital systems, probably because I had done the data sizing in my previous job and partly because I was junior and lacked any real talent, I got the job to do the table sizings again, and a laborious job it was too. I did it very conscientiously, getting average lengths for columns, taking into account the length bytes, row overhead, block overhead, indexes etc etc etc. When we had built the database I added up the size of all the tables and indexes, divided by the number of patients and… it was 2K. This was when I got curious. Had I wasted my time doing the detailed sizings?

Another role and once again I get the database sizing job, only this time I wrote a little app for it. This company did utilities systems, water, gas, electricity. My app took into account everything I could think of in respect of data sizing, from the fact that the last extent would on average be 50% empty to the tablespace header. It was great. And pointless. Sum up all the tables and indexes on one of the live systems and divide by the number of customers and it came out at 2-3K per customer. Across a lot of systems. It had gone up a little, due to more data being held in your average computer system.

I’ve worked on a few more person-based systems since and for years I could not help myself, I would check the size of the data compared to the number of people. The size of the database is remarkably consistent. It is slowly going up because we hold more and more data, mostly because it is easier to suck up now as all the feeds are electronic and there is no real cost in taking in that data and holding it. Going back to the hospital systems example, back in 1990 it used to be that you would hold the fact a lab test had been requested and the key results information – like the various cell counts for a blood test. This was because sometimes you had to manually enter the results. Now the test results come off another computer and you get everything.

I said there were exceptions. There are three main ones:

  • You are holding a very large number of transaction records for the person. Telephony systems are one of the worst examples of this. Banking, credit cards and other utility systems match the 4K rule.
  • You hold images or other “unstructured” chunks of data for people. In hospital systems this would cover x-rays, ultrasound scans etc. But if you drop them out of the equation (and this is easy as they often are held in separate sub-systems) it remains a few K per person. CVs push it up as they are often in that wonderfully bloaty Word format.
  • You are holding mostly pointers to another system, in which case it can be a lot less than 4K per person. I had to size a system recently and I arrogantly said “4K per person”. It turned out to be less than 1K, but then this system turned out to actually hold most person data in one key data store and “my” system only held transaction information. I bet that datastore was about 4K per person

I have to confess that I have not done this little trick of adding up the size of all the tables and indexes and dividing by the number of people so often over the last couple of years, but the last few times I checked it was still 3-4K – though a couple of times I had to ignore a table or two holding unstructured data.
{The massive explosion in the size of database is at least partly down to holding pictures – scanned forms, photos of products, etc, but when it comes down to the core part of the app for handling people, it seems to have stayed at 4K. The other two main aspects driving up database size seem to me to be the move from regional companies and IT systems to national and international ones, and that fact that people collect and keep all and every piece of information, be it any good for anything or not}.

I’d love to know if your person-based systems come out at around 4K per person but I doubt if many of you would be curious enough to check – I think my affliction is a rare one.

Follow

Get every new post delivered to your Inbox.

Join 156 other followers