jump to navigation

Will the Single Box System make a Comeback? December 8, 2011

Posted by mwidlake in Architecture, future, Hardware.
Tags: , ,
15 comments

For about 12 months now I’ve been saying to people(*) that I think the single box server is going to make a comeback and nearly all businesses won’t need the awful complexity that comes with the current clustered/exadata/RAC/SAN solutions.

Now, this blog post is more a line-in-the-sand and not a well researched or even thought out white paper – so forgive me the obvious mistakes that everyone makes when they make a first draft of their argument and before they check their basic facts, it’s the principle that I want to lay down.

I think we should be able to build incredible powerful machines based on PC-type components, machines capable of satisfying the database server requirements of anything but the most demanding or unusual business systems. And possibly even them. Heck, I’ve helped build a few pretty serious systems where the CPU, memory and inter-box communication is PC-like already. If you take the storage component out of needing to be centralise (and this shared), I think that is a major change is just over the horizon.

At one of his talks at the UKOUG conference this year, Julian Dyke showed a few tables of CPU performance, based on a very simple PL/SQL loop test he has been using for a couple of years now. The current winner is 8 seconds by a… Core i7 2600K. ie a PC chip and one that is popular with gamers. It has 4 cores and runs two threads per core, at 3.4GHz and can boost a single core to 3.8 GHz. These modern chips are very powerful. However, chips are no longer getting faster so much as wider – more cores. More ability to do lots of the same thing at the same speed.

Memory prices continue to tumble, especially with smart devices and SSD demands pushing up the production of memory of all types. Memory has fairly low energy demands so you can shove a lot of it in one box.

Another bit of key hardware for gamers is the graphics card – if you buy a top-of-the-range graphics card for a PC that is a couple of years old, the graphics card probably has more pure compute grunt than your CPU and a massive amount of data is pushed too and fro across the PCIe interface. I was saying something about this to some friends a couple of days ago but James Morle brought it back to mind when he tweeted about this attempt at a standard about using PCI-e for SSD. A PCI-e 16X interface has a theoretical throughput of 4000MB per second – each way. This compares to 600MB for SATA III, which is enough for a modern SSD. A single modern SSD. {what I am not aware of is the latency for PCI-e but I’d be surprised if it was not pretty low}. I think you can see where I am going here.

Gamers and image editors have probably been most responsible for pushing along this increase in performance and intra-system communication.

SSD storage is being produced in packages with a form factor and interface to enable an easy swap into the place of spinning rust, with for example a SATA3 interface and 3.5inch hard disk chassis shape. There is no reason that SSD (or other memory-based) storage cannot be manufactured in all sorts of different form factors, there is no physical constraint of having to house a spinning disc. Density per dollar of course keeps heading towards the basement. TB units will soon be here but maybe we need cheap 256GB units more than anything. So, storage is going to be compact and able to be in form factors like long, thin slabs or even odd shapes.

So when will we start to see cheap machines something like this: Four sockets for 8/16/32 core CPUs, 128GB main memory (which will soon be pretty standard for servers), memory-based storage units that clip to the external housing (to provide all the heat loss they require) that combine many chips to give 1Gb IO rates, interfaced via the PCIe 16X or 32X interface. You don’t need a HBA, your storage is internal. You will have multipath 10GbE going in and out of the box to allow for normal network connectivity and backup, plus remote access of local files if need be.

That should be enough CPU, memory and IO capacity for most business systems {though some quote from the 1960’s about how many companies could possible need a computer spring to mind}. You don’t need shared storage for this, in fact I am of the opinion that shared storage is a royal pain in the behind as you are constantly having to deal with the complexity of shared access and maximising contention on the flimsy excuse of “sweating your assets”. And paying for the benefit of that overly complex, shared, contended solution.

You don’t need a cluster as you have all the cpu, working memory and storage you need in a 1U server. “What about resilience, what if you have a failure?”. Well, I am swapping back my opinion on RAC to where I was in 2002 – it is so damned complex it causes way more outage than it saves. Especially when it comes to upgrades. Talking to my fellow DBA-types, the pain of migration and the number of bugs that come and go from version to version, mix of CRS, RDBMS and ASM versions, that is taking up massive amounts of their time. Dataguard is way simpler and I am willing to bet that for 99.9% of businesses other IT factors cause costly system outages an order of magnitude more times than the difference between what a good MAA dataguard solution can provide you compared to a good stretched RAC one can.

I think we are already almost at the point where most “big” systems that use SAN or similar storage don’t need to be big. If you need hundreds of systems, you can virtualize them onto a small number of “everything local”
boxes.

A reason I can see it not happening is cost. The solution would just be too cheap, hardware suppliers will resist it because, hell, how can you charge hundreds of thousands of USD for what is in effect a PC on steroids? But desktop games machines will soon have everything 99% of business systems need except component redundancy and, if your backups are on fast SSD and you a way simpler Active/Passive/MAA dataguard type configuration (or the equivalent for your RDBMS technology) rather than RAC and clustering, you don’t need that total redundancy. Dual power supply and a spare chunk of solid-state you can swap in for a failed raid 10 element is enough.

IOT Part 6 – Inserts and Updates Slowed Down (part A) November 1, 2011

Posted by mwidlake in Architecture, performance, Testing.
Tags: , , , ,
12 comments

<..IOT1 – the basics
<….IOT2 – Examples and proofs
<……IOT3 – Significantly reducing IO
<……..IOT4 – Boosting Buffer Cache efficiency
<……….IOT5 – Primary Key Drawback
…………>IOT6(B) – OLTP Inserts

A negative impact of using Index Organized Tables is that inserts are and updates can be significantly slowed down. This post covers the former and the reasons why – and the need to always run tests on a suitable system. (I’m ignoring deletes for now – many systems never actually delete data and I plan to cover IOTs and delete later)

Using an IOT can slow down insert by something like 100% to 1000%. If the insert of data to the table is only part of a load process, this might result in a much smaller overall impact on load, such as 25%. I’m going to highlight a few important contributing factors to this wide impact spread below.

If you think about it for a moment, you can appreciate there is a performance impact on data creation and modification with IOTs. When you create a new record in a normal table it gets inserted at the end of the table (or perhaps in a block marked as having space). There is no juggling of other data.
With an IOT, the correct point in the index has to be found and the row has to be inserted at the right point. This takes more “work”. The inserting of the new record may also lead to an index block being split and the extra work this entails. Similar extra work has to be carried out if you make updates to data that causes the record to move within the IOT.
Remember, though, that an IOT is almost certainly replacing an index on the heap table which, unless you are removing indexes before loading data and recreating them after, would have to be maintained when inserting into the Heap table. So some of the “overhead” of the IOT would still occur for the heap table in maintaining the Primary Key index. Comparing inserts or updates between a heap table with no indexes and an IOT is not a fair test.

For most database applications data is generally written once, modified occasionally and read many times – so the impact an IOT has on insert/update is often acceptable. However, to make that judgement call you need to know

  • what the update activity is on the data you are thinking of putting into an IOT
  • the magnitude of the impact on insert and update for your system
  • the ratio of read to write.

There is probably little point putting data into an IOT if you constantly update the primary key values (NB see IOT-5 as to why an IOT’s PK columns might not be parts of a true Primary Key) or populate previously empty columns or hardly ever read the data.

There is also no point in using an IOT if you cannot load the data fast enough to support the business need. I regularly encounter situations where people have tested the response of a system once populated but fail to test the performance of population.

Now to get down to the details. If you remember the previous posts in this thread (I know, it has been a while) then you will remember that I create three “tables” with the same columns. One is a normal heap table, one is an Index Organized Table and one is a partitioned Index Organized Table, partitioned into four monthly partitions. All tables have two indexes on them, the Primary Key index (which is the table in the case of the IOTs) and another, roughly similar index, pre-created on the table. I then populate the tables with one million records each.

These are the times, in seconds, to create 1 million records in the the HEAP and IOT tables:

                  Time in Seconds
Object type         Run_Normal
------------------  ----------
Normal Heap table        171.9  
IOT table               1483.8

This is the average of three runs to ensure the times were consistent. I am using Oracle V11.1 on a machine with an Intel T7500 core 2 Duo 2.2GHz, 2GB memory and a standard 250GB 5000RPM disk. The SGA is 256MB and Oracle has allocated around 100MB-120MB to the buffer cache.

We can see that inserting the 1 million rows into the IOT takes 860% the time it does with a heap table. That is a significant impact on speed. We now know how large the impact is on Insert of using an IOT and presumably it’s all to do with juggling the index blocks. Or do we?

This proof-of-concept (POC) on my laptop {which you can also run on your own machine at home} did not match with a proof-of-concept I did for a client. That was done on V10.2.0.3 on AIX, on a machine with 2 dual-core CPUS with hyper-threading (so 8 virtual cores), 2GB SGA and approx 1.5GB buffer cache, with enterprise-level storage somewhere in the bowels of the server room. The results on that machine to create a similar number of records were:

                  Time in Seconds
Object type         Run_Normal
------------------  ----------
Normal Heap table        152.0  
IOT table                205.9

In this case the IOT inserts required 135% the time of the Heap table. This was consistent with other tests I did with a more complex indexing strategy in place, the IOT overhead was around 25-35%. I can’t go into too much more detail as the information belongs to the client but the data creation was more complex and so the actual inserts were only part of the process – this is how it normally is in real life. Even so, the difference in overhead between my local-machine POC and the client hardware POC is significant, which highlights the impact your platform can have on your testing.

So where does that leave us? What is the true usual overhead? Below are my more full results from the laptop POC.

                        Time in Seconds
Object type         Run_Normal    Run_quiet    Run_wrong_p
------------------  ----------    ---------    -----------
Normal Heap table        171.9        81.83         188.27  
IOT table               1483.8      1055.35        1442.82
Partitioned IOT          341.1       267.83         841.22 

Note that with the partitioned IOT the creation took 341 second, the performance ratio to a heap table is only 198% and is much better than the normal IOT. Hopefully you are wondering why!

I’m running this test on a windows laptop and other things are going on. The timings for Run_Quiet are where I took steps to shut down all non-essential services and applications. This yielded a significant increase for all three object types but the biggest impact was on the already-fastest Heap table.

The final set of figures is for a “mistake”. I created the partitions wrong such that half the data went into one partition and the rest into another and a tiny fraction into a third, rather than being spread over 4 partitions evenly. You can see that the Heap and normal IOT times are very similar to the Run_Normal results (as you would expect as these test are the same) but for the partitioned IOT the time taken is half way towards the IOT figure.

We need to dig into what is going on a little further to see where the effort is being spent, and it turns out to be very interesting. During my proof-of-concept on the laptop I grabbed the information from v$sesstat for the session before and after each object creation so I could get the figures just for the loads. I then compared the stats between each object population and show some of them below {IOT_P means Partitioned IOT}.

STAT_NAME                            Heap    	IOT	        IOT P
------------------------------------ ---------- -------------  -----------
CPU used by this session                  5,716         7,222        6,241
DB time                                  17,311       148,866       34,120
Heap Segment Array Inserts               25,538            10           10

branch node splits                           25            76           65
leaf node 90-10 splits                      752         1,463        1,466
leaf node splits                          8,127        24,870       28,841

consistent gets                          57,655       129,717      150,835
cleanout - number of ktugct calls        32,437        75,201       88,701
enqueue requests                         10,936        28,550       33,265

file io wait time                     4,652,146 1,395,970,993  225,511,491
session logical reads                 6,065,365     6,422,071    6,430,281
physical read IO requests                   123        81,458        3,068
physical read bytes                   2,097,152   668,491,776   25,133,056
user I/O wait time                          454       139,585       22,253
hot buffers moved to head of LRU         13,077       198,214       48,915
free buffer requested                    64,887       179,653      117,316

The first section shows that all three used similar amounts of CPU, the IOT and partitioned IOT being a little higher. Much of the CPU consumed was probably in generating the fake data.The DB Time of course pretty much matches the elapsed time well as the DB was doing little else.
It is interesting to see that the Heap insert uses array inserts which of course are not available to the IOT and IOT_P as the data has to be inserted in order. {I think Oracle inserts the data into the heap table as an array and then updates the indexes for all the entries in the array – and I am only getting this array processing as I create the data as an array from a “insert into as select” type load. But don’t hold me to any of that}.

In all three cases there are two indexes being maintained but in the case of the IOT and IOT_P, the primary key index holds the whole row. This means there has to be more information per key, less keys per block and thus more blocks to hold the same data {and more branch blocks to reference them all}. So more block splits will be needed. The second section shows this increase in branch node and leaf block splits. Double the branch blocks and triple the leaf block splits. This is probably the extra work you would expect for an IOT. Why are there more leaf block splits for the partitioned IOT? The same data of volume ends up taking up more blocks in the partitioned IOT – 200MB for the IOT_P in four partitions of 40-60MB as opposed to a single 170MB for the IOT. The larger overall size of the partition is just due to a small overhead incurred by using partitions and also a touch of random fluctuation.

So for the IOT and IOT_P there is about three times the index-specific work being done and a similar increase in related statistics such as enqueues, but not three times as it is not just index processing that contribute to these other statistics. However, the elapsed time is much more than three times as much. Also, the IOT_P is doing more index work than the IOT but it’s elapsed time is less. Why?

The fourth section shows why. Look at the file io wait times. This is the total time spent waiting on IO {in millionths of a second} and it is significantly elevated for the IOT and to a lesser degree for the IOT_P. Physical IO is generally responsible for the vast majority of time in any computer system where it has not been completely avoided.
Session logical reads are only slightly elevated, almost negligably so but the number of physical reads to support it increases from 123 for the Heap table insert to 81,458 for the IOT and 3,068 for the IOT_P. A clue as to why comes from the hot buffers moved to head of LRU and free buffer requested statistics. There is a lot more activity in moving blocks around in the buffer cache for the IOT and IOT_P.

Basically, for the IOT, all the blocks in the primary key segment are constantly being updated but eventually they won’t all fit in the block buffer cache – remember I said the IOT is eventually 170MB and the buffer cache on my laptop is about 100MB – so they are flushed down to disk and then have to be read back when altered again. This is less of a problem for the IOT_P as only one partition is being worked on at a time (the IOT_P is partitioned on date and the data is created day by day) and so more of it (pretty much all) will stay in memory between alterations. The largest partition only grows to 60MB and so can be worked on in memory.
For the heap, the table is simply appended to and only the indexes have to be constantly updated and they are small enough to stay in the block buffer cache as they are worked on.

This is why when I got my partitioning “wrong” the load took so much longer. More physical IO was needed as the larger partition would not fit into the cache as it was worked on – A quick check shows that logical reads and in fact almost all statistics were very similar but 26,000 IO requests were made (compared to 81,458 for the IOT and 3,068 for the correct IOT_P).

Of course, I set my SGA size and thus the buffer cache to highlight the issue on my laptop and I have to say even I was surprised by the magnitude of the impact. On the enterprise-level system I did my client’s proof of concept on, the impact on insert was less because the buffer cache could hold the whole working set, I suspect the SAN had a considerable cache on it, there was ample CPU resource to cope with the added latching effort and the time taken to actually create the data inserted was a significant part of the workload, reducing the overall impact of the slowness caused by the IOT.

{Update, in This little update I increase my block buffer cache and show that physical IO plummets and the IOT insert performance increases dramatically}.

This demonstrates that a POC, especially one for what will become a real system, has to be a realistic volume on realistic hardware.
For my client’s POC, I still did have to bear in mind the eventual size of the live working set and the probably size of the live block buffer cache and make some educated guesses.

It also explains why my “run_quiet” timings showed a greater benefit for the heap table than the IOT and IOT_P. A windows machine has lots of pretty pointless things running that take up cpu and a bit of memory, not really IO so much. I reduced the CPU load and it benefits activity that is not IO, so it has more impact on the heap table load. Much of the time for the IOT and IOT_P is taken hammering the disk and that just takes time.

So, in summary:

  • Using an IOT increases the index block splitting and, in turn, enqueues and general workload. The increase is in proportion to the size of the IOT compared to the size of the replaced PK.
  • The performance degredation across the whole load process may well be less than 50% but the only way to really find out is to test
  • You may lose the array processing load that may benefit a heap table load if you do the load via an intermediate table.
  • With an IOT you may run into issues with physical IO if the segment (or part of the segment) you are loading into cannot fit into the buffer cache (This may be an important consideration for partitioning or ordering of the data loaded)
  • If you do a proof of concept, do it on a system that is as similar to the real one as you can
  • Just seeing the elapsed time difference between test is sometimes not enough. You need to find out where that extra time is being spent

I’ve thrown an awful lot at you in this one post, so I think I will stop there. I’ve not added the script to create the test tables here, they are in IOT-5 {lacking only the grabbing of the v$sesstat information}.

Friday Philosophy – The One Absolute Requirement for System Success October 14, 2011

Posted by mwidlake in development, Friday Philosophy, Perceptions.
Tags: , , , ,
5 comments

What must you always achieve for an IT system to be a success?

  • Bug free? Never happens.
  • Within budget/time frame? That would be nice.
  • Includes critical business functionality? Please define critical.
  • Secure? Well, it’s important for many systems but then it is often lacking (even when it is important).
  • That it is to specification? Well we all know that’s wrong.

There is only one thing that an IT system must always achieve to be a success.

User Acceptance.

For an individual system other considerations may well be very important, but the user acceptance is, I think, non-negotiable.

{Slight update – This blog was written in 2011 with corporate IT systems in mind, where the use of the system is almost if not explicitly mandated. With the explosion of Apps and web-based systems over the last 5 years, for a lot of IT systems the users, the customers, have to be attracted to the system. We now call User Acceptance User Experience and it is, if anything, even more vital. If the user does not find the system easy to use and of some benefit, they will click off somewhere else.}

The user must get enough out of using the system for it to be worth their while, otherwise at best they will resent using it and at worst… Well, at worst they will use it but will put in any old rubbish to fulfill the dictate that it be used. In this worst case you would be better off if they did not use the system at all. Here are a couple of examples from my working past.

In the first one, I was involved in extending a hospital management system so that it kept track of the expected arrival and departure times for ward patients, allowing a predication of when beds would become available and calculation of expected occupancy rates. Yes, this was a while ago (maybe 1990) and on an a system that was old then. The information was needed by someone with the title “bed sister” {or something similar – and would now be titled “head of occupancy management!…} so that they could better manage the flow of patients and keep a higher bed usage ratio. Was this to make the hospital more efficient? No, it was to satisfy a politically demanded report to the NHS executive. Oh, the overall intention was to increase efficiency – but the report soon became more important than any true efficiency. So, we added columns in tables and field on screens and prompts for the ward staff to fill in the required new information.

And they didn’t.

The nurses were busy, they were pretty demoralized {due to having recently been used by the government as a way to control public sector pay} and they had more nursing duties to do than they could manage. They were not going to waste a couple of minutes trying to check when Mrs Jenkins was going to be sent home when Mrs Leonard needed a bed pan. The nursing staff were given a hospital-wide telling off, this information had to be entered. So they put in the data – but guessed wildly, just to get management off their backs. Thus the design of the system enhancement was fine, the report was logically accurate, only the correct staff could run it, but we failed to achieve User Acceptance and thus the system was a failure.

So I added something else. It was a very crude screen (this was 1990) that showed a “diagram” of a ward – Down the left and right side of a VT220 screen you saw little oblong boxes labelled with a bed number and in it the patient name, the consultant’s initials, a medical specialty code and the arrival and departure date-time. This was based on some information we already had plus the new information we wanted and the screen was quite basic, limited and slow to draw. But it was useful to the ward staff. They could find any patient, they knew who to call if there was an emergency {not the actual consultant of course, but their secretary :-)}, they could check when they expected a patient to leave, they could even see when a new patient was expected. They used it. They put in the expected departure times {sobering thought, this might not be expected leaving alive}, patients would be booked into empty beds when they became available (often before they actually arrived) and the bed nurse could plan new arrivals and the oh-so-important report could be run. We had achieved User Acceptance and the system became a success.

The second example is also from healthcare but from a different system in a different hospital. We were putting together a system to schedule outpatient clinics. We knew what we were doing, it’s pretty simple. You have some medical staff holding the clinic (a consultant and probably a senior house officer), a period for the clinic (3 or 4 hours) and a set of people to be seen, say 40. We gave some flexibility in slot lengths (some people need 5 minutes, some 15), checked for and stopped double booking and verified against other data such as correct speciality for the outpatient’s clinical data. We did not go and ask the patient admin staff about the spec, we knocked up the design and the screens and asked them to test. After all, I was very experienced now, I’d been doing these systems for 3, 4 years…

They very quickly came back to us and said it was rubbish. Oh dear.

We went and saw them. “What’s the problem?” There were a few – but the main one was that you could not double book a slot. Why would you want to do that? Do two patients really want to be consulted at the same time with the same doctor? {NB this can happen in some special cases – such as fertility treatment, psychology and a couple of others, but not Urology or Orthopedics!}
“Err, maybe, it might happen, can we just be able to double book?” OK, we could maybe alter things to allow two patients to be seen at the same time… The patient admin staff are not looking happy. The hospital liaison is looking confused & concerned and interrupts – “You can’t do that! Patient confidentiality can’t be broken!” he says. It got worse. “We need to be able to book… several people in one slot”. How many? “We need to book all the patients into the first slot, with the consultant, so the letters go out to them saying come to see Mr Winders at 1pm”. The admin staff are now looking very shifty.

If any of you have worked in the health service you are probably way ahead of me. The admin staff needed to book all the patients in at this first slot so that the ill and suffering would all turn up at the start of the session. The consultant would then run through his notes and see the two or three he was interested in – and then go and play golf. The Senior House Officer (or whoever) would then work through the rest of the patients for the following three or four hours. If you have ever had to turn up at the start of a consultancy session and sat there for three hours, now you know why. You see, back then, the consultant was only a very small step away from deity level (and I leave it to you to decide if it was a step up or down). What they said went and if they wanted to go and play golf or store 200 medical records in the boot of their car or refuse to speak to “that stupid idiot in renal medicine” then you worked around it. {I’m assured that things are a lot better now, but I’d love to know how it really is}.

We had designed a sensible solution. The users of the system needed a non-sensible {to our mind} solution. Even the NHS liaison chap had never appreciated exactly how much the consultants abused the system, he thought they just booked the people s(he) wanted at the start of the session, but no. The consultant decided that day who was interesting and as a result every patient had to be there at the start.

I count myself lucky that I learnt from direct experience so soon in my working life that (a) you have to deliver what the user will accept and (b) the only way to know what they want is to show them the system and talk with them.

Friday Philosophy – The start of Computing October 7, 2011

Posted by mwidlake in Friday Philosophy, history.
Tags: , , ,
add a comment

This week I finally made a visit to Bletchley Park in the middle of England. Sue and I have been meaning to go there for several years, it is the site of the British code-breaking efforts during the second world war and, despite difficulties getting any funding, there has been a growing museum there for a number of years. {Hopefully, a grant from the Heritage Lottery Fund, granted only this month, will secure it’s future}.

Why is Bletchley Park so significant? Well, for us IT-types it is significant because Alan Turing did a lot of work there and it was the home of Colossus, one of the very first electrical, programmable computers. More generally of interrest, their efforts and success in cracking enemy ciphers during WW2 were incredibly important and beneficial to the UK and the rest of the allies.

In this post, I am not going to touch on Colossus or Alan Turing, but rather a machine called the “Bombe”. The Bombe was used to help discover the daily settings of the German Enigma machines, used for decrypting nearly all German and Italian radio messages. All the Bombes were destroyed after the war (at least, all the UK ones were) to help keep secret the work done to crack the cyphers – but at Bletchley Park the volunteers have recreated one. Just like the working model of Babbage’s Difference Engine, it looks more like a work of art than a machine. Here is a slightly rough video I took of it in action:

My slightly rough video of the bombe

{OK, if you want a better video try a clearer video by someone else.}

I had a chat with the gentleman you see in both videos about the machine and he explained something that the tour we had just been on did not make clear – the Bombe is a parallel processing unit. Enigma machines have three wheels. There are banks of three coloured disks in the bombe (see the picture below). eg, in the middle bank the top row of disks are black, middle are yellow and bottom are red. Each vertical set of three disks, black-yellow-red, is the equivalent of a single “enigma machine”. Each trio of disks is set to different starting positions, based on educated guesses as to what the likely start positions for a given message might be. The colour of the disk matches, I think, one of the known sets of wheels the enigma machines could be set up with. The machine is then set to run the encrypted message through up to 36 “Enigmas” at once. If the output exceeds a certain level of sense (in this case quite crucially, no letter is every encrypted back to itself) then the settings might be correct and are worth further investigation. This machine has been set up with the top set of “Enigmas” not in place, either to demonstrate the workings or because the machine is set up for one of the more complex deciphering attempts where only some of the banks can be used.

This is the bombe seen from the front

The reason the chap I was talking to really became fascinated with this machine is that, back in about 1999, a home PC programmed to do this work was no faster than the original electro-mechanical machines from 1944 were supposed to have taken. So as an engineer he wanted to help build one and find out why it was so fast. This struck a chord with me because back in the late 1990’s I came across several examples of bespoke computers designed to do specific jobs (either stuff to do with natural gas calorific value, DNA matching or protein folding), but by 2000, 2002 they had all been abandoned as a general PC could be programmed to be just as fast as these bespoke machines – because bespoke means specialist means longer and more costly development time means less bangs for your buck.

Admittedly the Bombe is only doing one task, but it did it incredibly fast, in parallel, and as a part of the whole deciphering process that some of the best minds of their time had come up with (part of the reason the Bletchley Park site was chosen was that it was equidistant between Oxford and Cambridge and, at that time, there were direct train links. {Thanks, Dr Beeching}. ).

Tuning and reliability was as important then as it is now. In the below picture of the back of the machine (sorry about the poor quality, it was dim in that room), you can see all the complex wiring in the “door” and, in the back of the machine itself, those three rows of bronze “pipes” are in fact…Pipes. Oil pipes. This is a machine, they quickly realised that it was worth a lot of effort to keep those disks oiled, both for speed and reliability.

All the workings of the Bombe from the back

Talking of reliability, one other thing my guide said to me. These machines are complex and also have some ability to cope with failures or errors built into them. But of course, you needed to know they were working properly. When these machines were built and set up, they came with a set of diagnostic tests. These were designed to push the machine, try the edge cases and to be as susceptible to mechanical error as possible. The first thing you did to a new or maintained machine was run your tests.

1943, you had awesome parallel processing, incredible speed and test-driven development and regression testing. We almost caught up with all of this in the early 21st Century.

In Defense of Agile Development (and Their Ilk) September 21, 2011

Posted by mwidlake in development, Management.
Tags: , , , ,
10 comments

In my previous post I asked the question “why doesn’t Agile work?”. I’m not sure the nuance of the question came over correctly.

I’d just like to highlight that the question I asked was “Why does agile not work”. It was not “Why is Agile rubbish“. I’ve said a few times in the past couple of weeks that I like the ideology of Agile and I am (and have been for years and years) a strong proponent of prototyping, cyclic development, test driven design and many other things that are part of the Agile or XP methodologies.

That distinction in the title is a really important distinction and one I’d hoped I’d made clear in my post. Looking back at my post though, I think it is clear I failed :-(. I highlighted reasons why I think Agile does not work and in my head I was thinking “if we avoid these, Agile could work” – but when you write something down it does not matter what is in your head if it does not reach the paper.

I’m actually frustrated that in the last few years I have not seen Agile really succeed and also that this must be the normal situation, going on the response you get when the topic of Agile comes up with fellow technicians and comments on my own blog.

However, on that post about Agile two people who’s opinion I deeply respect came back at me to say “Agile does work!”. Cary Millsap, who many of you will have heard of as the “Method R” guy and the person behind Oracle Flexible Architecture. And Mike Cox, who most of you won’t have heard of but Mike taught me a lot about sensible development back in the 90’s. He’s one of the best developers I have ever had the pleasure of working with and I know he has had great success with Agile and RED. I’m not sure if they read my post as “Agile is Rubbish” or they are, like me, simply frustrated that it can work but so often does not.

So I’ve been thinking about this a lot this weekend and I was helped by Cary’s paper on the topic that he mentioned in his comment. I’d highly recommend downloading it as it is an excellent description of not only why Agile can help but describes how and some of the pitfalls {I’d started my own post on that, but go read Cary’s}. I should add, you can see Cary present his case for Agile at the UKOUG conference this year.

So where does this bring me to? Well, I think “Is Agile good or bad” has become almost an “IT religion” topic, people love it or loath it and it is based on what they have seen of the methodology in real life. No, that’s wrong, it is based on what they have seen that has been labelled with that methodology in real life. Or worse, it is based on anecdotal opinion of those around them. The thing is, if you look at what XP is supposed to consist of or what Agile Programming is supposed to consist of, most of us would agree that a great deal of it makes sense in many situations. I’d disagree with some of the details in Cary’s paper but overall I’m in strong agreement. Sadly, What Agile and XP is supposed to be is not well matched by what you see on the ground in most cases. So even if these methodologies are right for the situation, what has been implemented is not the methodology but probably more a slap-dash process that simply jettisons documentation, design and proper testing. This whole thread sprung from my lamenting the demise of database design and several of the comments highlighted that the introduction of Agile seemed to equate, at least in part, with the demise of design. As MIke and Cary say, and as I think anyone who has successfully utilized Agile would say, Design is an integral part of Agile and XP methodology.

Agile can and does work. But many things can and do work, such as taking regular exercise to keep healthy or regularly maintaining your house to keep it weathertight. Like Agile, both take effort but the overall benefit is greater than the cost. And like Agile, do it wrong and you can make things worse. If your window frames are starting to rot and you just slap a new layer of top-coat on them all you will do is seal in the damp and rot and hide the problem – until the glass falls out. Going for a regular 5 mile run is good for you – but not if you are 10 stone (60KG) overweight and have not run in years. A 5 mile run is also not a good idea if you want to be a long-jumper. Right training (methodology) for the right aim. Also, just like keeping healthy, house maintenance or anything that takes effort but works, proponents tend towards extremism – probably as a reaction to the constant {perceived} pig-headedness of critics or the failure of people to just do what now seems so sensible to them {think reformed smokers}. I’ll have to buy Cary and Mike pints to make up for that jibe now, and promise them it was not aimed at them personally…

Sadly, the reality is, Agile does not work 90% of the time it is tried. So, does that mean Agile is actually rubbish? Or at least, not fit for purpose, because many companies are not able to use it? Companies are there to achieve something and the IT systems are part of achieving that something. If Agile cannot aid that IT department then Agile is the wrong way for that department and company.

*sigh* I’ve gone on and on about this and still not got to my own main point, which is this.

– Can we identify reasons for Agile and XP Failing.
– Having identified the Reasons, can we fix them in simple ways?
– Can we create some simple guidelines as to when a project should be more Agile and when it should be more Up-Front design.

I’d love to know people’s opinions on those three points above.

Friday Philosophy – Why doesn’t Agile work? September 16, 2011

Posted by mwidlake in development, Friday Philosophy, Management.
Tags: , ,
13 comments

Why doesn’t Agile Development Methodology seem to work?

I’m going say right here at the start that I like much of what is in Agile, for many, many years I’ve used aspects of Rapid Application Development {which Agile seems to have borrowed extensively from} to great success. However, after my post last week on database design, many of the comments were quite negative about Agile – and I had not even mentioned it in my post!

To nail my flag to the post though, I have not seen an Agile-managed project yet that gave me confidence that Agile itself was really helping to produce a better product, a product more quickly and most certainly not a final system that was going to be easy to maintain. Bring up the topic of Agile with other experienced IT people and I would estimate 90% of the feedback is negative.

That last point about ongoing maintenance of the system is the killer one for me. On the last few projects I have been on where the culture was Agile-fixated I just constantly had this little voice in my head going:

“How is anyone going to know why you did that in six months? You’ve just bolted that onto the side of the design like a kludge and it really is a kludge. When you just said in the standup meeting that we will address that issue ‘later’, is that the same “later” that accounts for the other half-dozen issues that seem to have been forgotten?”.

From what I can determine after the fact, that voice turns out to be reason screaming out against insanity. A major reason Agile fails is that it is implemented in a way that has no consideration for post-implementation.

Agile, as it is often implemented, is all about a headlong rush to get the job done super-quick. Ignore all distractions, work harder, be completely focused and be smarter. It really does seem to be the attitude by those who impose Agile that by being Agile your staff will magically come up with more innovative solutions and will adapt to any change in requirements just because they work under an agile methodology. Go Agile, increase their IQ by 10 points and their work capacity by 25%. Well, it doesn’t work like that. Some people can in fact think on their feet and pull solutions out of thin air, but they can do that irrespective of the methodology. People who are more completer-finishers, who need a while to change direction but boy do they produce good stuff, have you just demoralized and hamstrung them?Agile does not suit the way all people work and to succeed those people it does not suit need to be considered.

The other thing that seems to be a constant theme under Agile is utterly knackered {sorry, UK slang, knackered means tired, worn out and a bit broken} staff. Every scrum is a mad panic to shove it all out of the door and people stop doing other things to cope. Like helping outside the group or keeping an eye on that dodgy process they just adopted as it needed doing. Agile fails when it is used to beat up team. Also, I believe Agile fails when those ‘distractions’ are ignored by everyone and work that does not fall neatly into a scrum is not done.

I suppose it does not help that my role has usually been one that is more Production Support than development and Agile is incompatible with production support. Take the idea of the scrum, where you have x days to analyse, plan, design, unit test and integrate the 6 things you will do in this round. On average I only spend 50% of my time dealing with urgent production issues, so I get allocated several tasks. Guess what, if I end up spending 75% of my time that week on urgent production issues, and urgent production issues have to take priority, I can screw up the scrum all on my own. No, I can’t pass my tasks onto others in the team as (a) they are all fully assigned to their tasks and (b) passing over a task takes extra time. Agile fails when it is used for the wrong teams and work type.

I’ve come to the conclusion that on most projects Agile has some beneficial impact in getting tasks done, as it forces people to justify what they have done each and every day, encourages communication and gives the developers a reason to ignore anything else that could be distracting them as it is not in the scrum. Probably any methodology would help with all of that.

My final issue with Agile is the idiot fanatics. At one customer site I spent a while at, they had an Agile Coach come around to help the team to become more agile. I thought this was a little odd as this team was actually doing a reasonable job with Agile, they had increased productivity and had managed to avoid the worst of the potential negative impacts. This man came along and patronisingly told us we were doing OK, but it was hard for us to raise our game like this, we just needed help to see the core values of Agile and, once we did, once we really believed in it, productivity would go up 500% {That is a direct quote, he actually said “productivity will go up by 500%”}. He was sparkly-eyed and animated and full of the granite confidence of the seriously self-deluded. I think he managed to put back the benefits of Agile by 50%, such was the level of “inspiration” he gave us. Agile fails when it is implemented like a religion. It’s just a methodolgy guys.

I find it all quite depressing as I strongly suspect that, if you had a good team in a positive environment, doing a focused job, Agile could reap great rewards. I’m assured by some of my friends that this is the case. {update – it took my good friend Mike less than an hour to chime in with a comment. I think I hit a nerve}.

Friday Philosophy – The Dying Art of Database Design? September 9, 2011

Posted by mwidlake in Architecture, development, Friday Philosophy, rant.
Tags: , , ,
35 comments

How many people under the age of {Martin checks his age and takes a decade or so off} ohh, mid 30’s does any database design these days? You know, asks the business community what they want the system to do, how the information flows through their business, what information they need to report on. And then construct a logical model of that information? Judging by some of the comments I’ve had on my blog in the last couple of years and also the meandering diatribes of bitter, vitriolic complaints uttered by fellow old(er) hacks in the pub in the evening, it seems to be coming a very uncommon practice – and thus a rare and possibly dying skill.

{update – this topic has obviously been eating at my soul for many years. Andrew Clark and I had a discussion about it in 2008 and he posted a really good article on it and many, many good comments followed}

Everything seems to have turned into “Ready, Fire, Aim”. Ie, you get the guys doing the work in a room, develop some rough idea of what you want to develop (like, look at the system you are replacing), start knocking together the application and then {on more enlightened projects} ask the users what they think. The key points are the that development kicks off before you really know what you need to produce, there is no clear idea of how the stored data will be structured and you steer the ongoing development towards the final, undefined, target. I keep coming across applications where the screen layouts for the end users seem to almost be the design document and then someone comes up with the database – as the database is just this bucket to chuck the data into and scrape it out of again.

The functionality is the important thing, “we can get ‘someone’ to make the database run faster if and when we have a problem”.

Maybe I should not complain as sometimes I am that ‘someone’ making the database run faster. But I am complaining – I’m mad as hell and I ain’t gonna take it anymore! Oh, OK, in reality I’m mildly peeved and I’m going to let off steam about it. But it’s just wrong, it’s wasting people’s time and it results in poorer systems.

Now, if you have to develop a simple system with a couple of screens and a handful of reports, it might be a waste of time doing formal design. You and Dave can whack it together in a week or two, Chi will make the screens nice, it will be used by a handful of happy people and the job is done. It’s like building a wall around a flower bed. Go to the local builders merchants, get a pallet of bricks, some cement and sand (Ready), dig a bit of a trench where you want to start(Aim) and put the wall up, extending it as you see fit (Fire). This approach won’t work when you decide to build an office block and only a fool from the school of stupid would attempt it that way.

You see, as far as I am concerned, most IT systems are all about managing data. Think about it. You want to get your initial information (like the products you sell), present it to the users (those customers), get the new (orders) data, pass it to the next business process (warehouse team) and then mine the data for extra knowledge (sales patterns). It’s a hospital system? You want information about the patients, the staff, the beds and departments, tests that need doing, results, diagnoses, 15,000 reports for the regulators… It’s all moving data. Yes, a well design front end is important (sometimes very important) but the data is everything. If the database can’t represent the data you need, you are going to have to patch an alteration in. If you can’t get the data in quick enough or out quick enough, your screens and reports are not going to be any use. If you can’t link the data together as needed you may well not be able to DO your reports and screens. If the data is wrong (loses integrity) you will make mistakes. Faster CPUS are not going to help either, data at some point has to flow onto and off disks. Those slow spinning chunks of rust. CPUS have got faster and faster, rust-busting has not. So data flow is even more important than it was.

Also, once you have built your application on top of an inadequate database design, you not only have to redesign it, you have to:

  • do some quick, hacky  fixes to get by for now
  • migrate the existing data
  • transform some of it (do some data duplication or splitting maybe)
  • alter the application to cope
  • schedule all of the above to be done together
  • tie it in with the ongoing development of the system as hey, if you are not going to take time to design you are not going to take time to assess things before promising phase 2.

I’m utterly convinced, and experience backs this up, that when you take X weeks up front doing the database design, you save 5*X weeks later on in trying to rework the system, applying emergency hacks and having meetings about what went wrong. I know this is an idea out of the 80’s guys, but database design worked.

*sigh* I’m off to the pub for a pint and to reminisce about the good-old-days.

Friday Philosophy – Tainted by the Team August 26, 2011

Posted by mwidlake in development, Friday Philosophy, humour, Management, rant.
Tags: , , , ,
3 comments

A while ago whilst working on one project, a colleague came back to his desk next to mine and exclaimed “I hate working with that team! – they are so bad that it makes everyone who works with them look incompetent!”

Now there is often an argument to be made that working with people who are not good at their job can be great for you, as you always looks good in comparison {it’s like the old adage about hanging around with someone less attractive than you – but I’ve never found anyone I can do that with…}. It is to an extent true of course, and though it can seem a negative attitude, it is also an opportunity to teach these people and help them improve, so everyone potentially is a winner. I actually enjoy working with people who are clueless, so long as they will accept the clues. You leave them in a better state than when you joined them.

However, my friend was in the situation where the team he was dealing with was so lacking in the skills required that if you provided them with code that worked as specified, which passed back the values stated in the correct format derived from the database with the right logic… their application code would still fall over with exceptions – because it was written to a very, very “strict” interpretation of the spec.

In one example, the specification for a module included a “screen shot” showing 3 detail items being displayed for the parent object. So the application team had written code to accept only up to 3 detail items. Any more and it would crash. Not error, crash. The other part of the application, which the same people in the application team had also written, would let you create as many detail items for the parent as you liked. The data model stated there could be many more than 3 detail items. I suppose you could argue that the specification for the module failed to state “allow more than three items” – but there was a gap in the screen to allow more data, there was the data model and there was the wider concept of the application. In a second example, the same PL/SQL package was used to populate a screen in several modes. Depending on the mode, certain fields were populated or not. The application however would fail if the variables for these unused fields were null. Or it would fail if they were populated. The decision for each one depended on the day that bit of the module had been written, it would seem. *sigh*

The situation was made worse by the team manager being a skilled political animal, who would always try to shift any blame to any and all other teams as his first reaction. In the above examples he tried to immediately lay the blame with my colleague and then with the specification, but my colleague had managed to interpret the spec fine (he did the outrageous thing of asking questions if he was not sure or checked the data model). Further, this manager did not seem to like his people asking us questions, as he felt it would make it look like they did not know what they were doing. Oddly enough they did NOT know what they were doing. Anyway, as a consequence of the manager’s hostile attitude, the opportunity to actually teach the poor staff was strictly limited.

That was really the root of the problem, the manager. It was not the fault of the team members that they could not do the job – they had not had proper training, were unpracticed with the skills, siloed into their team, not encouraged to think beyond the single task in front of them and there was no one available to show them any better. The issue was that they were being made to do work they were not able to do. The problem, to my mind, was with the manager and with the culture of that part of the organisation that did not deal with that manager. He obviously did not believe that rule one of a good manager is to look after the best interests of your team. It was to protect his own backside.

But the bottom line was that this team was so bad that anything they were involved in was a disaster and no one wants to be part of a disaster. If you worked with them, you were part of the disaster. So we took the pragmatic approach. When they had the spec wrong, if we would alter our code to cope, we would alter our code. And document that. It gave us a lot of work and we ended up having a lot of “bugs” allocated to our team. But it got the app out almost on time. On-going maintencance could be a bit of an issue but we did what we could on our side to spell out the odditites.

I still know my friend from above and he still can’t talk about it in the pub without getting really quite agitated :-)

IOT Part 4 – Greatly Boosting Buffer Cache Efficiency August 8, 2011

Posted by mwidlake in development, performance.
Tags: , , ,
11 comments

<..IOT1 – the basics
<….IOT2 – Examples and proofs
<……IOT3 – Significantly reducing IO
……..>IOT5 – Primary Key issues
……….>IOT6a – Slowing Down Insert
…………>IOT6(B) – OLTP Inserts

So far I have covered the basics of Index Organized Tables, created some examples and shown how IOTs can help reduce the cost of single row selects and then how they can greatly aid range scans. Follow the above links for details.

Now I’d like to show that the use of IOTs has the potential to make your block buffer cache (BBC) far more efficient. Going to disc is very,very slow compared to going to memory {NB solid state storage improves this situation but does not remove it}. The block buffer cache has always been critical to oracle SQL Select performance as it allows you to access data in memory rather than disc and in general the more block buffer cache you have the faster your system will be.
{I am of the opinion that the BBC is even more important now than ever. As hard discs get larger we are seeing fewer and fewer spindles per GB of storage and, in essence, disc storage is effectively getting slower – because more data is hosted on the same number of spindles and those spindles are not themselves getting faster – I digress, for more details see posts Big Discs are Bad and IOPs and Form Factors}

In the scenario I’ve covered in my previous posts on IOTs we have a system where child data is coming in for many parents every day for 100 days. With a heap table the data pours into the growing end of the table, usually a record or two per parent each day and no guarantee that if two records come in, they will be put into the same block.

So, when you select a child record for a parent you get the situation shown below:

For many systems, the Block Buffer Cache is holding a lot of data no queries asked for -collateral data

When oracle needs to collect a record from the table, it has to read the whole block. Oracle only reads in tablespace data in whole blocks. That record comes with many other records in it that you did not ask for or want. I refer to this as Collateral Data – innocent bystander data that has got pulled into the BBC just because it was in the same block as required data. The larger the block size, the more collateral data there is.

To get all the child records for the parent, you need to read all those individual table blocks holding one or two records of interest. For our 100 child records you will probably need to read in close to 100 table blocks. Your Block Buffer Cache is filling up with of blocks where only one row out of each block is “of interest”. If that is one row out of 80 in a block, you are effectively wasting 98.75% of the space that table takes up in the block buffer cache.

With an IOT the situation is very different. We have already seen in my previous post on reducing IO that for a range scan on the IOT, oracle does not need to go and collect records from blocks scattered throughout the table. It simply collects the IOT leaf blocks holding the relevant data. Not only does this require less IO, it also results in the fetched blocks mostly holding the required data. The percentage of collateral data is greatly reduced:

IOTs are a powerful tool in reducing collateral data and using the BBC more efficiently

Thus instead of 100 table blocks that mostly hold collateral data, you have 2 or 3 blocks holding mostly the data you are interested in. Your wastage, the collateral data, is about 33-50%. With my example tables from post IOT2, it is indeed an average of 2-3 IOT blocks holding all the data for a single parent and 100 heap table blocks holding the same data.

You can think of it another way.

With my IOT I use only 3% of the memory to cache a single set of records for a parent compared to that needed with a normal HEAP table.

Let’s extend that idea a little. Let’s say I have 100,000 customer and 5% of the customers are active.
Each customer has on average 500 * 200-byte activity records for 100K of data
Each 8K Heap Table block holds 40 records, a very inefficient IOT holds 20 records per block {I’ll go into the details of this in a later post, but I am being a little unfair on the IOT here}.
There is some grouping of records for the customer so that those 500 records are over 400 blocks.
The whole working set of 400 blocks * (100,000*0.05) customers *8K = 2,000,000K

Now replace the table with an IOT keyed on customer ID.
Those 400 records would be spread over (400/20)+2 blocks. The +2 is the start and end of the range.
That is 22 *(100,000*0.05) *8K = 110,000K

Both would also need the overhead of an index structure to be cached also, for the Heap table it is the Primary Key index, for the IOT it is the rest of the IOT structure.

So you would need 2GB or so of Block Buffer Cache dedicated to caching the working set held as a heap table and 110MB of Block Buffer Cache dedicated to caching the IOT equivalent.

How many of you spotted that the space needed to “cache the heap table working set” is actually twice the size of the table? Well, that is because with a 5% working set and 40 records per block, there is a fair chance that some of those Collateral Data records in each block are for the 5% working set. Over all, almost every block will hold data for two or three active users. This is one of the complications of working out how efficient you block buffer cache is likely to be and I’ll have to leave that to another post. In reality you would need to hold 99% of the heap table in memory to cache the whole working set, so still 1GB. The IOT is still far more efficient.

I’ll just finish by saying that on one project I worked on we would have needed several hundred GB of Block Buffer Cache to hold the working set of the main tables if held as heap tables. That volume of memory was simply not available. With IOTs this reduced to about 40GB. This was available. The majority of this working set was able to stay in their SGA and it meant that so long as the instance stayed up, that working set of data mostly stayed in cache. I was able to see on the live system that processing of the data for these active customers was mostly being supported by consistent gets and less than 1% disk reads.

So, by using IOTS I reduced consistent gets dramatically, made more efficient use of the block buffer cache and, as a result of that, reduced the number of physical reads needed to support the consistent gets.

IOT 2 – First examples and proofs July 26, 2011

Posted by mwidlake in development, performance.
Tags: , , , ,
8 comments

<.. IOT1 – Basics
..>IOT3 – Great reductions in IO for IOTs
….>IOT4 – Boosting Buffer Cache Efficiency
……>IOT5 – Primary Key issues
……….>IOT6a – Slowing Down Insert
…………>IOT6(B) – OLTP Inserts

In my first post on IOTs I ran through the basics of what they are. Here I am going to create some test tables and show you a few things.

I am going to create a simple PARENT table with 9,999 records and then two CHILD tables. CHILD_HEAP, a normal table, and CHILD_IOT, an Index Organized Table. They have the same columns and will hold very similar data.

All of this is on Oracle 11.1 but is exactly the same on 10.2. 8K block size, tablespaces are auto segment space managed.

Here are the creation statements:

--first create the parent table, keyed by ID.
-- The other columns are not significant, they just represent "information"
create table mdw.parent
(id       number(10)    not null 
,name     varchar2(100) not null
,date_1   date          not null
,num_1    number(2)
,num_2    number(2) 
,constraint pare_pk primary key(id) 
 using index tablespace index_01
)
tablespace data_01
/
--
--Now put my 9999 parents into the table.
insert into parent
select rownum
,dbms_random.string('U',mod(rownum,10)+50)
,sysdate-(mod(rownum,500)+1000)
,mod(rownum,99)+1
,trunc(dbms_random.value(0,100))
from dual connect by level < 10000
/
--
-- create the table to hold the children as a heap table
create table child_heap
(pare_id   number(10)    not null
,cre_date  date          not null
,vc_1      varchar2(100) not null
,date_1    date
,num_1     number(2)
,num_2     number(2)
,constraint chhe_pk primary key(pare_id,cre_date) 
 using index tablespace index_01
)
tablespace data_01
/
--
-- create the table to hold the children as an IOT table
create table child_iot
(pare_id   number(10)    not null
,cre_date  date          not null
,vc_1      varchar2(100) not null
,date_1    date
,num_1     number(2)
,num_2     number(2)
,constraint chio_pk primary key(pare_id,cre_date) 
-- using index tablespace index_01 -- CANNOT STATE for IOT. State in table definition
)
ORGANIZATION INDEX -- This is it. This makes the table an IOT
tablespace data_01
/

There are only two differences between the statements creating the CHILD_HEAP and the CHILD_IOT tables.

The main one is the inclusion of the line ORGANIZATION INDEX and is what instructs Oracle to create the table as an IOT. Note that it does not state the index and you cannot state the index. The IOT is created based on the Primary Key.
The other change is that you now cannot state the tablespace for the Primary Key index. I’ve not played with this at all but I don’t think you can state anything with the “using index” as the table storage clauses are used for the Primary Key index. I personally find this a little illogical as it is the index segment that is created, but I guess others would find it more natural that you still state this at the table level.

When I create IOTs on a real system, I put the IOT in a table tablespace {I still maintain table and index tablespaces, for reasons I won’t go into here}. I put it there as it holds the actual data. If I lose that Primary Key index I am losing real data, not duplicated data.

I then populated the two CHILD tables with data. The method of creating this test data is very important.

I am simulating a very common situation, where data is coming in for a set of Parents (think customers, accounts, scientific instruments, financial entities) and the data is coming in as a record or set of records each day. ie not where the parent and all of it’s child records are created at one time, like an order and it’s order lines. I am simulating where the child data is created a few records at a time, not all in one go.

The code is simple. it loops for one hundred days and for each day it creates 10,000 records for random parents. On each day any given parent will have none, one or several records. On average, each parent will end up with 100 records, but some will have more and some less. The key thing is that the data for any given parent is created a record at a time, with lots of records created for other parents before the next record for that given parent.

The two tables will have the same pattern of data but not identical data. {I could have seeded the random number generator to make the two data sets the same but this will do}. Below is the statement for one table, you just change the table name to populate each table. {BTW I like using the from dual connect by level <=x method of getting the number of rows desired – it is fast and is neat, once you have seen it once}.

declare
v_num number :=10000; -- number of people
v_str varchar2(60);
begin
dbms_output.put_line (to_char(SYSTIMESTAMP,'HH24:MI:SS.FF'));
for i in 1..100 loop --days to do
  v_str:=dbms_random.string('U',60);
  insert into CHILD_HEAP
    (pare_id,cre_date,vc_1,date_1,num_1,num_2)
  select
    trunc(dbms_random.value(1,v_num))
   ,sysdate-(100-i) + (rownum/(60*60*24) )
   ,substr(v_str,1,51+mod(rownum,10))
   ,sysdate-(100-i) + ((mod(rownum,30)+1)/3)
   ,mod(rownum,20)+1
   ,mod(rownum,99)+1
  from dual connect by level <=v_num;
end loop;
dbms_output.put_line (to_char(SYSTIMESTAMP,'HH24:MI:SS.FF'));
end;
/

I then gathered objects stats on the tables.
Let’s check the size of the tables:

select segment_name, segment_type,tablespace_name,blocks
from dba_segments where owner=USER and segment_name like 'CHILD%';

SEGMENT_NAME    SEGMENT_TYPE    TABLESPACE_NAME     BLOCKS
--------------- --------------- --------------- ----------
CHILD_HEAP      TABLE           DATA_01              12288

1 row selected.

ONE row? Where is the other table, where is CHILD_IOT? It does not exists.

Remember from my first post that I made the comment I would have prefered it if Index Organized Tables had been called something like ‘Table Containing Indexes’? The table data has been placed in the Primary Key index and the table segment does not even exist. If you start using IOTs this will catch you out periodically – it does me anyway and I’ve been using them on and off for years :-).

Let’s look at the size of the primary key indexes:

select segment_name, segment_type,tablespace_name,blocks
from dba_segments where owner=USER and segment_name like 'CH%PK'
and segment_name not like '%ORD%'

SEGMENT_NAME    SEGMENT_TYPE    TABLESPACE_NAME     BLOCKS
--------------- --------------- --------------- ----------
CHHE_PK         INDEX           INDEX_01              4224
CHIO_PK         INDEX           DATA_01              19456

2 rows selected.

Note that the Primary Key index for CHILD_HEAP, CHHE_PK, is there and is 4,224 blocks in size, and the CHILD_IOT Primary Key, CHIO_PK, is a lot larger at 19,456 blocks. In fact, not only is the CHIO_PK index larger than the CHILD_HEAP table, it is larger than the combined size of the CHILD_HEAP table and CHHE_PK index combines. So much for me saying last post that IOTs can save disk space? I’ll come back to that in a later post…

Here are some other stats from one of my scripts:

mdw11> @tab_sci_own
owner for Table: mdw
Name for Table: child_heap

OWNER    TABLE_NAME          NUM_ROWS      BLOCKS AVG_L GLS ULS LST_ANL      PRT  SAMP_SIZE
-------- -------------- ------------- ----------- ----- --- --- ------------ --- ----------
MDW      CHILD_HEAP          1000,000      12,137    83 YES NO  250711 22:01 NO     1000000


INDEX_NAME      TYP PRT UNQ BL     L_BLKS   DIST_KEYS       CLUSTF     LB_KEY     DB_KEY LST_ANL
--------------- --- --- --- -- ---------- ----------- ------------ ---------- ---------- ------------
CHHE_PK         NOR NO  UNI  2      4,034    1000,000      995,857          1          1 250711 22:02

INDEX_NAME                   TABLE_NAME       PSN COL_NAME
---------------------------- ---------------- --- ------------------------------------------------
CHHE_PK                      CHILD_HEAP       1   PARE_ID
CHHE_PK                      CHILD_HEAP       2   CRE_DATE

--
--
owner for Table: mdw
Name for Table: child_iot

OWNER    TABLE_NAME          NUM_ROWS      BLOCKS AVG_L GLS ULS LST_ANL      PRT  SAMP_SIZE
-------- -------------- ------------- ----------- ----- --- --- ------------ --- ----------
MDW      CHILD_IOT           1000,000                83 YES NO  250711 22:03 NO     1000000

INDEX_NAME      TYP PRT UNQ BL     L_BLKS   DIST_KEYS       CLUSTF     LB_KEY     DB_KEY LST_ANL
--------------- --- --- --- -- ---------- ----------- ------------ ---------- ---------- ------------
CHIO_PK         IOT NO  UNI  2     17,855     910,881            0          1          1 250711 22:03

INDEX_NAME                   TABLE_NAME       PSN COL_NAME
---------------------------- ---------------- --- ------------------------------------------------
CHIO_PK                      CHILD_IOT        1   PARE_ID
CHIO_PK                      CHILD_IOT        2   CRE_DATE

Note the lack of BLOCKS for the CHILD_IOT table and the CLUSTERING_FACTOR of 0 for the CHIO_PK.

The clustering factor is the number of times Oracle, when scanning the whole index in order, would have to swap to a different Table block to look up the table record for each index entry. If it is close to the number of blocks in the table, then the clustering factor is low and the order of records in the table matches the order of entries in the index. This would make index range scans that need to visit the table reasonably efficient.

If the clustering factor is close to the number of records in the table then it means there is no correlation between index order and table row order and such index ranges scans that have to visit the table would be inefficient. Again, this is significant and will be the major topic of the next post.

The depth of the index does not change, being 3 in each case (BL or blevel 2)

So, can we see evidence of the theoretical efficiency of looking up single records via the IOT that I mentioned in the fist post? Here we go {oh, usual disclaimer, I run the code twice and show the second run, to remove the parsing overhead}:

-- First the Heap table
select * from child_HEAP where PARE_ID=1234
AND cre_date=to_date('24-JUN-11 20:13:21','DD-MON-YY HH24:MI:SS')

   PARE_ID CRE_DATE  VC_1
---------- --------- ------------------------------------------------------
DATE_1         NUM_1      NUM_2
--------- ---------- ----------
      1234 24-JUN-11  LUTFHOCIJNYREYICQNORREAJOVBRIHFVLXNIGIVZDMFJCTGYFWC
25-JUN-11         11         16
1 row selected.

Execution Plan
------------------------------------------------------------------------------------------
| Id  | Operation                   | Name       | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |            |     1 |    83 |     3   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID| CHILD_HEAP |     1 |    83 |     3   (0)| 00:00:01 |
|*  2 |   INDEX UNIQUE SCAN         | CHHE_PK    |     1 |       |     2   (0)| 00:00:01 |
------------------------------------------------------------------------------------------

Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
          4  consistent gets


--and now the IOT table

select * from child_IOT where PARE_ID=1234
AND cre_date=to_date('24-JUN-11 21:23:41','DD-MON-YY HH24:MI:SS')

   PARE_ID CRE_DATE  VC_1
---------- --------- -------------------------------------------------------
DATE_1         NUM_1      NUM_2
--------- ---------- ----------
      1234 24-JUN-11
CSIGBHSXWNDDTCFRCNWYPRNLEQWPCRYTXQQZHACDEXHOBEYXLNYBHRUHJ
27-JUN-11          7         52
1 row selected.

Execution Plan
-----------------------------------------------------------------------------
| Id  | Operation         | Name    | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |         |     1 |    83 |     2   (0)| 00:00:01 |
|*  1 |  INDEX UNIQUE SCAN| CHIO_PK |     1 |    83 |     2   (0)| 00:00:01 |
-----------------------------------------------------------------------------

Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
          3  consistent gets

{I had to look up the exact values of CRE_DATE of a couple of records to do the above queries}

To look up a single row with the heap table you can see that the explain plan was to carry out a unique scan on the primary key and then look up the row via the rowid and took 4 consistent gets. 3 to walk down the index and get the rowid, one to look up the row block.

For the IOT table the explain plan reveals that there was simply an index unique scan of the Primary Key, nothing more. All data for the row was there in the index entry rather than the rowid. Thus only 3 consistent gets were required.

For single row lookups on the Primary Key, IOTS are more efficient than traditional Heap tables with a Primary Key index. {Please, no one point out that if all the columns you need are in the index you also do not need to go to the table, that is a different topic}.

Quite a few people have shown this efficiency before but the next step is far, far more interesting and shows a much more significant impact of IOTs. That is the topic of the next post :-).

For now, I am going to finish off with what happens with range scans as I suggested they could slow down with an IOT.
Below, I select count(*) for just one of the parent values.

select count(*) from child_heap where pare_id = 2

  COUNT(*)
----------
        98

Execution Plan
-----------------------------------------------------------------------------
| Id  | Operation         | Name    | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |         |     1 |     4 |     3   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE   |         |     1 |     4 |            |          |
|*  2 |   INDEX RANGE SCAN| CHHE_PK |   100 |   400 |     3   (0)| 00:00:01 |
-----------------------------------------------------------------------------

Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
          3  consistent gets

--
--

select count(*) from child_iot where pare_id = 2

  COUNT(*)
----------
        93

Execution Plan
-----------------------------------------------------------------------------
| Id  | Operation         | Name    | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |         |     1 |     4 |     4   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE   |         |     1 |     4 |            |          |
|*  2 |   INDEX RANGE SCAN| CHIO_PK |   100 |   400 |     4   (0)| 00:00:01 |
-----------------------------------------------------------------------------

Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
          4  consistent gets

Both statements carry out a range scan on the Primary Key of the table. For the normal HEAP table this takes 3 consistent gets, which is no suprise as we have an 8k block size and only 100 rows for a given parent, they happen to fit into one block of the index. So Oracle works down the depth of the index and looks at one block.

For the IOT the scan works down the index but has to scan three blocks. Even though there are fewer entries, 93 compared to 98, they span three blocks and thus the total number of consistent gets is 5.

Admittedly I was a little lucky in my example above. Sometimes the entries for one parent will scan 2 blocks for the heap table’s Primary Key and occasionally the entries for the IOT will fit into 2 blocks. But if you look at the number of leaf blocks in the earlier stats (4,034 for the normal and 17,855 for the IOT, both for 10,000 entries) usually the 100 or so entries for single parent in the normal index will all fall into one block and the entries for the IOT will fall into between 2 and 3 blocks.

A select count(*) will full scan the smallest segment that can satisfy the query. Let’s try it:

mdw11> select count(*) from child_heap

  COUNT(*)
----------
   1000000

Execution Plan
-------------------------------------------------------------------------
| Id  | Operation             | Name    | Rows  | Cost (%CPU)| Time     |
-------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |         |     1 |   989   (1)| 00:00:15 |
|   1 |  SORT AGGREGATE       |         |     1 |            |          |
|   2 |   INDEX FAST FULL SCAN| CHHE_PK |  1000K|   989   (1)| 00:00:15 |
-------------------------------------------------------------------------

Statistics
----------------------------------------------------------
          1  recursive calls
          2  db block gets
       4109  consistent gets
       4088  physical reads

mdw11> select count(*) from child_iot

  COUNT(*)
----------
   1000000

Execution Plan
-------------------------------------------------------------------------
| Id  | Operation             | Name    | Rows  | Cost (%CPU)| Time     |
-------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |         |     1 |  4359   (1)| 00:01:05 |
|   1 |  SORT AGGREGATE       |         |     1 |            |          |
|   2 |   INDEX FAST FULL SCAN| CHIO_PK |  1000K|  4359   (1)| 00:01:05 |
-------------------------------------------------------------------------

Statistics
----------------------------------------------------------
          1  recursive calls
          0  db block gets
      19298  consistent gets
      19246  physical reads

The number of consistent gets (and physical reads) are close to the number of leaf blocks in the two segments, though higher. This is because Oracle is scanning the whole index, leaf blocks and branch blocks. The scan is far more expensive for the IOT, simply as the index is so much larger. I’ve not shown timings but on my little laptop, the count(*) takes about 3 seconds on CHILD_HEAP and about 5 seconds on the CHILD_IOT.

That is enough for one post.

Follow

Get every new post delivered to your Inbox.

Join 185 other followers