jump to navigation

The “as a Service” paradigm. October 27, 2015

Posted by mwidlake in Architecture, Hardware, humour.
Tags: , , ,

For the last few days I have been at Oracle Open World 2015 (OOW15) learning about the future plans and directions for Oracle. I’ve come to a striking realisation, which I will reveal at the end.

The message being pressed forward very hard is that of compute services being provided “As A Service”. This now takes three flavours:

  1. Being provided by a 3rd party’s hardware via the internet, ie in The Cloud.
  2. Having your own hardware controlled and maintained by you but providing services with the same tools and quick-provisioning ideology as “cloud”. This is being called On Premise (or just “On Prem” if you are aiming to annoy the audience), irrespective of the probably inaccuracy of that label (think hosting & dedicated compute away from head office)
  3. A mix of the two where you have some of your system in-house and some of it floating in the Cloud. This is called Hybrid Cloud.

There are many types of  “as a Service offerings, the main ones probably being

  • SaaS -Software as a Service
  • PaaS – Platform as a Service
  • DBaas – Database as a a Service
  • Iass – Infrastructure as a Service.

Whilst there is no denying that there is a shift of some computer systems being provided by any of these, or one of the other {X}aaS offerings, it seems to me that what we are really moving towards is providing the hardware, software, network and monitoring required for an IT system. It is the whole architecture that has to be considered and provided and we can think of it as Architecture as a Service or AaaS. This quick provisioning of the architecture is a main win with Cloud, be it externally provided or your own internal systems.

We all know that whilst the provision time is important, it is really the management of the infrastructure that is vital to keeping a service running, avoiding outages and allowing for upgrades. We need a Managed Infrastructure (what I term MI) to ensure the service provided is as good as or better than what we currently have. I see this as a much more important aspect of Cloud.

Finally, it seems to me that the aspects that need to be considered are more than initially spring to mind. Technically the solutions are potentially complex, especially with hybrid cloud, but also there are complications of a legal, security, regulatory and contractual aspect. If I have learnt anything over the last 2+ decades in IT it is that complexity of the system is a real threat. We need to keep things simple where possible – the old adage of Keep It Simple, Stupid is extremely relevant.

I think we can sum up the whole situation by combining these three elements of architecture, managed infrastructure and simplicity into one encompassing concept, which is:




And yes, that was a very long blog post for a pretty weak joke. 5 days of technical presentations and non-technical socialising does strange things to your brain

Friday Philosophy – Building for the Future August 14, 2015

Posted by mwidlake in Architecture, development, Friday Philosophy.
Tags: , ,

I started my Oracle working life as a builder – a Forms & Reports Builder (briefly on SQL*Forms V2.3 but thankfully within a month or two we moved up to SQL*Forms V3, SQL*reportwriter V1.1 and SQL*Menu 5 – who remembers SQL*Menu?). Why were we called Builders? I guess as you could get a long way with those tools by drawing screens, utilising the (pretty much new) RI in the underlying Oracle V7 to enforce simple business rules and adding very simple triggers – theoretically not writing much in the way of code. It was deemed to be more like constructing stuff out of bits I guess. But SQL*Forms V3 had PL/SQL V1 built in and on that project we used it a *lot*.

I had been an “Analyst Programmer” for 3 years before then and I’ve continued to be a developer/programmer/constructor-of-code on and off over the intervening couple of decades. I’m still a developer at times. But sometimes I still think of it as being a “builder” as, if you do it write {sorry, little word-play joke there} you are using bits of existing stuff and code designs/patterns you know work well and constructing your system. The novel part, the bit or bits that have never been done before (at least by me), the “architecting” of those units into something interestingly different or the use of improved programming features or techniques vary from almost-none to a few percent. That is the part which I have always considered true “Software Development”.

So am I by implication denigrating the fine and long-standing occupation of traditional builders? You know, men and women who know what a piece of two-by-four is and put up houses that stay put up? No. Look at the below.

This is part of my neighbour Paul’s house. He is a builder and the black part in the centre with the peaked roof is an extension he added a few years back, by knocking his garage down. The garage was one of three, my two were where the garage doors you can see are and to the left. So he added in his two-story extension, with kitchen below and a very nice en-suite bedroom above, between his house and my ratty, asbestos-riddle garages. Pretty neat. A few years later he knocked down my garages and built me a new one with a study on top (without the asbestos!) and it all looks like it was built with his extension. Good eh? But wait, there is more. You will have noticed the red highlight. What is that white thing?

Closer in - did he forget some plumbing?

Closer in – did he forget some plumbing?

This pipe goes clean through the house

This pipe goes clean through the house

When I noticed that white bit after Paul had finished his extension I figured he had planned more plumbing than he put in. I kept quiet. Then, when he had built my new garage and study, I could not help ask him about the odd plumbing outlet. So he opened it. And it goes through the dividing wall all the way through to the other side of the house. Why?

“Well Martin, putting in cables and pipes and s**t into an existing house that go from one side to the other, especially when there is another building next door, as a real pain in the a**e. It does my ‘ead in. So when a build something that is not detached, I put in a pipe all the way through. Now if I need to run a cable from one side of the house to the other, I have my pipe and I know it is straight, clean, and sloping every so slightly downwards”. Why downwards? “Water Martin. You don’t want water sitting in that pipe!”.

I’ve noticed this about builders. When I’ve had work done that is good, there is at least one person on the team who thinks not just about how to erect or do what needs to be done today, they do indeed think about what you will need after the build is done, or in a few years. Such as hanging doors so they do not smack into the cupboards you will put in next… *sigh*. Paul is the thinking guy in his little team. I suspect one of the others is pretty smart too.

But isn’t this what the architect is for? To think about living with the building? Well, despite the 7 years plus needed to become a true architect (as that term really means, not as some stolen label for software designers with too much ego) I’ve had builders spot the pragmatic needs a couple of times that the architect missed.

And as I think we would all agree, a good software developer always has an eye on future maintenance and modification of the software they develop. And they want to create something that fits in the existing system and looks right. So just like my builder neighbour does.

I’m not a software architect. I’m a code builder. And I’m proud of it.

With Modern Storage the Oracle Buffer Cache is Not So Important. May 27, 2015

Posted by mwidlake in Architecture, Hardware, performance.
Tags: , , , ,

With Oracle’s move towards engineered systems we all know that “more” is being done down at the storage layer and modern storage arrays have hundreds of spindles and massive caches. Does it really matter if data is kept in the Database Buffer Cache anymore?

Yes. Yes it does.

Time for a cool beer

Time for a cool beer

With much larger data sets and the still-real issue of less disk spindles per GB of data, the Oracle database buffer cache is not so important as it was. It is even more important.

I could give you some figures but let’s put this in a context most of us can easily understand.

You are sitting in the living room and you want a beer. You are the oracle database, the beer is the block you want. Going to the fridge in the kitchen to get your beer is like you going to the Buffer Cache to get your block.

It takes 5 seconds to get to the fridge, 2 seconds to pop it open with the always-to-hand bottle opener and 5 seconds to get back to your chair. 12 seconds in total. Ahhhhh, beer!!!!

But – what if there is no beer in the fridge? The block is not in the cache. So now you have to get your car keys, open the garage, get the car out and drive to the shop to get your beer. And then come back, pop the beer in the fridge for half an hour and now you can drink it. That is like going to storage to get your block. It is that much slower.

It is only that much slower if you live 6 hours drive from your beer shop. Think taking the scenic route from New York to Washington DC.

The difference in speed really is that large. If your data happens to be in the memory cache in the storage array, that’s like the beer already being in a fridge – in that shop 6 hours away. Your storage is SSD-based? OK, you’ve moved house to Philadelphia, 2 hours closer.

Let's go get beer from the shop

Let’s go get beer from the shop

To back this up, some rough (and I mean really rough) figures. Access time to memory is measured in Microseconds (“us” – millionths of a second) to hundreds of Nanoseconds (“ns” – billionths of a second). Somewhere around 500ns seems to be an acceptable figure. Access to disc storage is more like Milliseconds (“ms” – thousandths of a second). Go check an AWR report or statspack or OEM or whatever you use, you will see that db file scattered reads are anywhere from low teens to say 2 or 3 ms, depending on what your storage and network is. For most sites, that speed has hardly altered in years as, though hard discs get bigger, they have not got much faster – and often you end up with fewer spindles holding your data as you get allocated space not spindles from storage (and the total sustainable speed of hard disc storage is limited to the total speed of all the spindles involved). Oh, the storage guys tell you that your data is spread over all those spindles? So is the data for every system then, you have maximum contention.

However, memory speed has increased over that time, and so has CPU speed (though CPU speed has really stopped improving now, it is more down to More CPUs).

Even allowing for latching and pinning and messing around, accessing a block in memory is going to be at the very least 1,000 times faster than going to disc, maybe 10,000 times. Sticking to a conservative 2,000 times faster for memory than disc , that 12 seconds trip to the fridge equates to 24,000 seconds driving. That’s 6.66 hours.

This is why you want to avoid physical IO in your database if you possibly can. You want to maximise the use of the database buffer cache as much as you can, even with all the new Exadata-like tricks. If you can’t keep all your working data in memory, in the database buffer cache (or in-memory or use the results cache) then you will have to do that achingly slow physical IO and then the intelligence-at-the-hardware comes into it’s own, true Data Warehouse territory.

So the take-home message is – avoid physical IO, design your database and apps to keep as much as you can in the database buffer cache. That way your beer is always to hand.


Update. Kevin Fries commented to mention this wonderful little latency table. Thanks Kevin.

“Here’s something I’ve used before in a presentation. It’s from Brendan Gregg’s book – Systems Performance: Enterprise and the Cloud”

An Oracle Instance is Like An Upmarket Restaurant January 28, 2015

Posted by mwidlake in Architecture.
Tags: ,

I recently did an Introduction to Oracle presentation, describing how the oracle instance worked – technically, but from a very high level. In it I used the analogy of a restaurant, which I was quite happy with. I am now looking at converting that talk into a set of short articles and it struck me that the restaurant analogy is rather good!

Here is a slide from the talk:

Simple partial overview of an Oracle Instance

Simple partial overview of an Oracle Instance

As a user of the oracle instance, you are the little, red blob at the bottom left. You (well, your process, be it SQL*Plus, SQL*Developer, a Java app or whatever) do nothing to the database directly. It is all done for you by the Oracle Sever Process – and this is your waiter.

Now, the waiter may wait on many tables (Multi-threaded server) but this is a very posh restaurant, you get your own waiter.

You ask the waiter for food and the waiter goes off and asks the restaurant to provide it. There are many people working in the restaurant, most of them doing specific jobs and they go off and do whatever they do. You, the customer, have no idea who they are or what they do and you don’t really care. You don’t see most of them. You just wait for your food (your SQL results) to turn up. And this is exactly how an Oracle Instance works. Lots of specific processes carry out their own tasks but they are coordinated and the do the job without most of us having much of an idea what each bit does. Finally, some of the food is ready and the waiter delivers the starter to you – The server process brings you the first rows of data.

Let’s expand the analogy a bit, see how far we can take it.

When you arrived at the restaurant, the Maître d’ greets you and allocates you to your waiter. This is like the Listener process waiting for connection requests and allocating you a server process. The Listener Process listens on a particular port, which is the front door to the restaurant. When you log onto an oracle database your session is created, ie your table is laid. If someone has only just logged off the database their session might get partially cleared and re-used for you (you can see this as the SID may well get re-used), as creating a session is a large task for the database. If someone had just left the restaurant that table may have a quick brush down and the cutlery refreshed, but the table cloth, candle and silly flower in a vase stay. Completely striping a table and relaying it takes more time.

The restaurant occupies a part of the building, the database occupies part of the server. Other things go in the server, the restaurant is in a hotel.

The PMON process is the restaurant manager or Head of House maybe and SMON is the kitchen manager, keeping an eye on the processes/staff and parts of the restaurant they are responsible for. To be candid, I don’t really know what PMON and SMON do in detail and I have no real idea how you run a large kitchen.

There are lots of other processes, these are equivalent to the Sous-chef, Saucier, commis-chef, Plonger (washes up, the ARC processes maybe?), Ritisseur, Poissonier, Patissier etc. They just do stuff, let’s not worry about the details, we just know there are lots of them making it all happen and we the customer or end user never see them.

The PGA is the table area in the restaurant, where all the dishes are arranged and provided to each customer? That does not quite work as the waiter does not sit at our table and feed us.

The SGA is the kitchen, where the ingredients are gathered together and converted into the dishes – the data blocks are gathered in the block buffer cache and processed. The Block Buffer Cache are the tables and kitchen surfaces, where all the ingredients sit. The Library cache is, yes, the recipes. They keep getting re-used as our kitchen only does certain recipes, it’s a database with a set of standard queries. It’s when some fool orders off-menu that it all goes to pot.

Food is kept in the larder and fridges – the tablespaces on disc. You do not prepare the dishes in the larder or fridge, let alone eat food out of them (well, some of the oracle process might nick the odd piece of cooked chicken or chocolate). everything is brought into the kitchen {the SGA} and processed there, on the kitchen tables.

The orders for food are the requests for change – the redo deltas. Nothing is considered ordered until it is on that board in the kitchen, that is the vital information. All the orders are preserved (so you know what was ordered, you can do the accounts and you can re-stock). The archived redo. You don’t have to keep this information but if you don’t, it’s a lot harder to run the restaurant and you can’t find out what was ordered last night.

The SCN is the clock on the wall and all orders get the time they were place on them, so people get their food prepared in order.

When you alter the ingredients, eg grate some of the Parmesan cheese into a sauce, the rest of the cheese (which, being an ingredient is in the SGA) is not put back into the fridge immediately, ie put back into storage. It will probably be used again soon. That’ll push it up the LRU list. Eventually someone will put it back, probably the Garçon de cuisine (the kitchen boy). A big restaurant will gave more then one Garçon de cuisine, all with DBW1 to x written on the back of their whites, and they take the ingredients back to the larder or kitchen when they get around to it – or are ordered to do so by one of the chefs.

Can we pull in the idea of RAC? I think we can. We can think of it as a large hotel complex which will have several restaurants, or at least places to eat. They have their own kitchens but the food is all stored in the central store rooms of the hotel complex. I can’t think what can be an analogy of block pinging as only a badly designed or run restautant would for example only have one block of Parmesan cheese – oh, maybe it IS a lot like some of the RAC implementations I have seen :-)

What is the Sommelier (wine waiter) in all of this? Suggestions on a post card please.

Does anyone have any enhancements to my analogy?

How do you Explain Oracle in 50 Minutes? December 2, 2014

Posted by mwidlake in Architecture, conference, UKOUG.
Tags: , ,

I’ve done a very “brave”* thing. I’ve put forward a talk to this year’s UKOUG Tech14 conference titled “How Oracle Works – in under 50 minutes”. Yes, I really was suggesting I could explain to people how the core of Oracle functions in that time. Not only that, but the talk is aimed at those new to Oracle technology. And it got accepted, so I have to present it. I can’t complain about that too much, I was on the paper selection committee…

* – “brave”, of course, means “stupid” in this context.

As a result I am now strapped to the chair in front of my desk, preparing an attempt to explain the overall structure of an Oracle instance, how data moves in out of storage, how ACID works and a few other things. Writing this blog is just avoidance behaviour on my part as I delay going back to it.

Is it possible? I’m convinced it is.

If you ignore all the additional bits, the things that not all sites use, such as Partitioning, RAC, Resource Manager, Materialized Views etc, etc, etc, then that removes a lot. And if not everyone uses it, then it is not core.
There is no need or intention on my part to talk about details of the core – for example, how the Cost Based aspect of the optimizer works, Oracle permissions or the steps needed for instance recovery. We all use those but the details are ignored by some people for their whole career {not usually people who I would deem competent, despite them holding down jobs as Oracle technicians, but they do}.

You are left with a relatively small set of things going on. Don’t get me wrong, it is still a lot of stuff to talk about and is almost certainly too much for someone to fully take in and digest in the time I have. I’m going to have to present this material as if I am possessed. But my intention is to describe a whole picture that makes sense and will allow people to understand the flow. Then, when they see presentations on aspects of it later in the conference, there is more chance it will stick. I find I need to be taught something 3 or 4 times. The first time simply opens my mind to the general idea, the second time I retain some of the details and the third or forth time I start integrating it into what I already new.

My challenge is to say enough so that it makes sense and *no more*. I have developed a very bad habit of trying to cram too much into a presentation and of course this is a real danger here. I’m trying to make it all visual. There will be slides of text, but they are more for if you want to download the talk after the conference. However, drawing pictures takes much, much, much longer than banging down a half dozen bullet points.

One glimmer in the dark is that there is a coffee break after my session. I can go right up to the wire and then take questions after I officially stop, if I am not wrestled to the ground and thrown out the room.

If anyone has any suggestions or comments about what I should or should not include, I’d love to hear them.

This is all part of my intention to provide more conference content for those new to Oracle. As such, this “overview” talk is at the start of the first day of the main conference, 10am Monday. I have to thank my fellow content organisers for allowing me to stick it in where I wanted it. If you are coming to the conference and don’t know much Oracle yet – then I am amazed you read my blog (or any other blog other than maybe AskTom). But if you have colleagues or friends coming who are still relatively new to the tech, tell them to look out for my talk. I really hope it will help them get that initial understanding.

I had hoped to create a fully fledged thread of intro talks running through all of Monday and Tuesday, but I brought the idea up too late. We really needed to promote the idea at the call for papers and then maybe sources a couple of talk. However, using the talks that were accepted we did manage to get a good stab at a flow of intro talks through Monday. I would suggest:

  • 08:50 – Welcome and Introduction
    • Get there in time for the intro if you can, as if you are newish to the tech you are probably newish to a conference
  • 09:00 RMAN the basics, by Michael Abbey.
    • If you are a DBA type, backup/recovery is your number one concern.
  • 10:00 – How Oracle Works in 50 Minutes
    • I think I have said enough!
  • 11:30 – All about Joins by Tony Hasler
    • Top presenter, always good content
  • 12:30 – Lunch. Go and talk to people, lots of people, find some people you might like to talk with again. *don’t stalk anyone*
  • 13:20 – Go to the Oracle Keynote.
    • Personally, I hate whole-audience keynotes, I am sick of being told every year how “there has never been a better time to invest in oracle technology” – but this one is short and after it there is a panel discussion by technical experts.
  • 14:30 is a bit tricky. Tim Hall on Analytical Functions is maybe a bit advanced, but Tim is a brilliant teacher and it is an intro to the subject. Failing that, I’d suggest the Oracle Enterprise Manager round table hosted by Dev Nayak as Database-centric oracle people should know OEM.
  • 16:00 – Again a bit tricky for someone new but I’d plump for The role of Privileges and Roles in Oracle 12C by Carl Dudley. He lectures (lectured?) in database technology and knows his stuff, but this is a New Feature talk…
  • 17:00 – Tuning by Explain Plan by Arian Stijf
    • This is a step-by-step guide to understanding the most common tool used for performance tuning
  • 17:50 onwards – go to the exhibition drinks, the community drinks and just make friends. One of the best thing to come out of conferences is meeting people and swapping stories.

I better get back to drawing pictures. Each one takes me a day and I need about 8 of them. Whoops!

Will the Single Box System make a Comeback? December 8, 2011

Posted by mwidlake in Architecture, future, Hardware.
Tags: , ,

For about 12 months now I’ve been saying to people(*) that I think the single box server is going to make a comeback and nearly all businesses won’t need the awful complexity that comes with the current clustered/exadata/RAC/SAN solutions.

Now, this blog post is more a line-in-the-sand and not a well researched or even thought out white paper – so forgive me the obvious mistakes that everyone makes when they make a first draft of their argument and before they check their basic facts, it’s the principle that I want to lay down.

I think we should be able to build incredible powerful machines based on PC-type components, machines capable of satisfying the database server requirements of anything but the most demanding or unusual business systems. And possibly even them. Heck, I’ve helped build a few pretty serious systems where the CPU, memory and inter-box communication is PC-like already. If you take the storage component out of needing to be centralise (and this shared), I think that is a major change is just over the horizon.

At one of his talks at the UKOUG conference this year, Julian Dyke showed a few tables of CPU performance, based on a very simple PL/SQL loop test he has been using for a couple of years now. The current winner is 8 seconds by a… Core i7 2600K. ie a PC chip and one that is popular with gamers. It has 4 cores and runs two threads per core, at 3.4GHz and can boost a single core to 3.8 GHz. These modern chips are very powerful. However, chips are no longer getting faster so much as wider – more cores. More ability to do lots of the same thing at the same speed.

Memory prices continue to tumble, especially with smart devices and SSD demands pushing up the production of memory of all types. Memory has fairly low energy demands so you can shove a lot of it in one box.

Another bit of key hardware for gamers is the graphics card – if you buy a top-of-the-range graphics card for a PC that is a couple of years old, the graphics card probably has more pure compute grunt than your CPU and a massive amount of data is pushed too and fro across the PCIe interface. I was saying something about this to some friends a couple of days ago but James Morle brought it back to mind when he tweeted about this attempt at a standard about using PCI-e for SSD. A PCI-e 16X interface has a theoretical throughput of 4000MB per second – each way. This compares to 600MB for SATA III, which is enough for a modern SSD. A single modern SSD. {what I am not aware of is the latency for PCI-e but I’d be surprised if it was not pretty low}. I think you can see where I am going here.

Gamers and image editors have probably been most responsible for pushing along this increase in performance and intra-system communication.

SSD storage is being produced in packages with a form factor and interface to enable an easy swap into the place of spinning rust, with for example a SATA3 interface and 3.5inch hard disk chassis shape. There is no reason that SSD (or other memory-based) storage cannot be manufactured in all sorts of different form factors, there is no physical constraint of having to house a spinning disc. Density per dollar of course keeps heading towards the basement. TB units will soon be here but maybe we need cheap 256GB units more than anything. So, storage is going to be compact and able to be in form factors like long, thin slabs or even odd shapes.

So when will we start to see cheap machines something like this: Four sockets for 8/16/32 core CPUs, 128GB main memory (which will soon be pretty standard for servers), memory-based storage units that clip to the external housing (to provide all the heat loss they require) that combine many chips to give 1Gb IO rates, interfaced via the PCIe 16X or 32X interface. You don’t need a HBA, your storage is internal. You will have multipath 10GbE going in and out of the box to allow for normal network connectivity and backup, plus remote access of local files if need be.

That should be enough CPU, memory and IO capacity for most business systems {though some quote from the 1960’s about how many companies could possible need a computer spring to mind}. You don’t need shared storage for this, in fact I am of the opinion that shared storage is a royal pain in the behind as you are constantly having to deal with the complexity of shared access and maximising contention on the flimsy excuse of “sweating your assets”. And paying for the benefit of that overly complex, shared, contended solution.

You don’t need a cluster as you have all the cpu, working memory and storage you need in a 1U server. “What about resilience, what if you have a failure?”. Well, I am swapping back my opinion on RAC to where I was in 2002 – it is so damned complex it causes way more outage than it saves. Especially when it comes to upgrades. Talking to my fellow DBA-types, the pain of migration and the number of bugs that come and go from version to version, mix of CRS, RDBMS and ASM versions, that is taking up massive amounts of their time. Dataguard is way simpler and I am willing to bet that for 99.9% of businesses other IT factors cause costly system outages an order of magnitude more times than the difference between what a good MAA dataguard solution can provide you compared to a good stretched RAC one can.

I think we are already almost at the point where most “big” systems that use SAN or similar storage don’t need to be big. If you need hundreds of systems, you can virtualize them onto a small number of “everything local”

A reason I can see it not happening is cost. The solution would just be too cheap, hardware suppliers will resist it because, hell, how can you charge hundreds of thousands of USD for what is in effect a PC on steroids? But desktop games machines will soon have everything 99% of business systems need except component redundancy and, if your backups are on fast SSD and you a way simpler Active/Passive/MAA dataguard type configuration (or the equivalent for your RDBMS technology) rather than RAC and clustering, you don’t need that total redundancy. Dual power supply and a spare chunk of solid-state you can swap in for a failed raid 10 element is enough.

Lack of Index and Constraint Comments November 24, 2011

Posted by mwidlake in Architecture, database design, development.
Tags: , , , ,

Something I’ve just reminded myself of is that under Oracle you cannot add a comment on an index or a constraint. You can only add comments on tables, views, materialized views, columns of those object types and a couple of esoteric things like Operators, Editions and Indextypes.

Here is an example of adding comments to tables and columns:

set pause off feed off
drop table mdw purge;
create table mdw(id number,vc1 varchar2(10));
comment on table mdw is 'Martin Widlake''s simple test table';
comment on column mdw.id is 'simple numeric PK sourced from sequence mdw_seq';
comment on column mdw.vc1 is'allow some random text up to 10 characters';
desc user_tab_comments

 Name                                                  Null?    Type
 ----------------------------------------------------- -------- ------------------------------------
 TABLE_NAME                                            NOT NULL VARCHAR2(30)
 TABLE_TYPE                                                     VARCHAR2(11)
 COMMENTS                                                       VARCHAR2(4000)

select * from dba_tab_comments where table_name='MDW'
OWNER                          TABLE_NAME                     TABLE_TYPE
------------------------------ ------------------------------ -----------
MDW                            MDW                            TABLE
Martin Widlake's simple test table

select * from dba_col_comments where table_name='MDW'
order by column_name
OWNER                          TABLE_NAME                     COLUMN_NAME
------------------------------ ------------------------------ --------------
MDW                            MDW                            ID
simple numeric PK sourced from sequence mdw_seq
MDW                            MDW                            VC1
allow some random text up to 10 characters
-- now to add a big comment so need to use the '-' line continuation character in sqlplus
comment on table mdw is 'this is my standard test table.-
 As you can see it is a simple table and has only two columns.-
 It will be populated with 42 rows as that is the solution to everything.'
select * from dba_tab_comments where table_name='MDW'
OWNER                          TABLE_NAME                     TABLE_TYPE
------------------------------ ------------------------------ -----------
MDW                            MDW                            TABLE
this is my standard test table.  As you can see it is a simple table and has only two columns.  It w
ill be populated with 42 rows as that is the solution to everything.

Adding comments on tables, views and columns seems to have dropped out of fashion over the years but I think it is still a very useful feature of oracle and I still do add them (though I am getting a little slack about it myself over the last 3 or 4 years, which I must stop).

Comments are great, you can put 4000 characters of information into the database about each table, view and column. This can be a brief description of the object, a full explanation of what a column is to hold or even a list of typical entries for a column or table.

But you can’t add a comment on indexes or constraints. Why would I want to? Well, constraints and indexes should only be there for a reason and the reason is not always obvious from either the names of the columns or the name of the constraint or index, especially where you have a naming standard that forces you to name indexes and constraints after the columns they reference.

When you design a database, do a schema diagram or an ERD, you label your relationships between entities/tables. It tells you exactly what the relationship is. You might create an index to support a specific method of access or particular business function. You might alter the index in a way not immediately obvious to the casual observer, such as to allow queries that use the index to avoid having to visit the table. All of those things will, of course, be fully documented in the maintained project documentation in the central repository, available and used by all…

If I was able to add comments to constraints and indexes within the database then they would there. You move the system from one platform to the other, they are there. If for any wildly unlikely reason the central documentation lets you down, the information is always there in the database and easy to check. You may not be able to track down the original design documents but you have the database in front of you, so comments in that will persist and be very easy to find.

Lacking the ability to add comments on indexes and constraints, I have to put them at the table level, which I always feel is a kludge. I might actually raise an enhancement request for this, but as Oracle 12 is already nailed down, it will have to wait until Oracle 14. (A little bird told me Larry said there would be no Oracle 13…).

IOT P6(a) Update November 8, 2011

Posted by mwidlake in Architecture, development, performance, Testing.
Tags: , , , ,

In my last post, IOT part 6, inserts and updates slowed down, I made the point that IOT insert performance on a relatively small Oracle system was very slow, much slower than on a larger system I had used for professional testing. A major contributing factor was that the insert was working on the whole of the IOT as data was created. The block buffer cache was not large enough to hold the whole working set (in this case the whole IOT) once it grew beyond a certain size. Once it no longer fitted in memory, Oracle had to push blocks out of the cache and then read them back in next time they were needed, resulting in escalating physical IO.

I’ve just done another test which backs up this claim. I altered my test database so that the block buffer cache was larger, 232MB compared to 100MB in my first tests. The full IOT is around 200MB

Bottom line, the creation of the IOT was greatly sped up (almost by a factor of 4) and the physical IO dropped significantly, by a factor of 20. As a result, the creation of the IOT was almost as fast as the partitioned IOT. It also shows that the true overhead on insert of using an IOT is more like a factor of 2 to 4 as opposed 6 to 8.

You can see some of the details below. Just to help you understand them, it is worth noting that I had added one new, larger column to the test tables (to help future tests) so the final segments were a little larger (the IOT now being 210MB as opposed to 180MB in the first tests) and there was a little more block splitting.

                        Time in Seconds
Object type           Run with       Run with
                     100MB cache    232MB cache
------------------  ------------    -----------   
Normal Heap table          171.9          119.4   
IOT table                1,483.8          451.4     
Partitioned IOT            341.1          422.6 

-- First reading 100MB cache
-- second reading 232MB cache 
STAT_NAME                            Heap    	IOT	      IOT P
-------------------------------- ---------- -----------  ----------
CPU used by this session            5,716         7,222       6,241
                                    5,498         5,967       6,207

DB time                            17,311       148,866      34,120
                                   11,991        45,459      42,320

branch node splits                     25            76          65
                                       25            82         107

leaf node 90-10 splits                752         1,463       1,466
                                      774         1,465       1,465

leaf node splits                    8,127        24,870      28,841
                                    8,162        30,175      40,678

session logical reads           6,065,365     6,422,071   6,430,281
                                6,150,371     6,544,295   6,709.679

physical read IO requests             123        81,458       3,068
                                      36          4,012       1,959

physical read bytes             2,097,152   668,491,776  25,133,056
                                1,400,832    34,037,760  16,048,128

user I/O wait time                    454       139,585      22,253
                                       39        34,510      19,293

The heap table creation was faster with more memory available. I’m not really sure why, the cpu effort was about the same as before and though there was some reduction in physical IO with the larger cache, I suspect it might be more to do with both the DB and the machine having been recently restarted.

All three tests are doing a little more “work” in the second run due to that extra column and thus slightly fewer rows fitting in each block (more branch node and leaf node splits), but this just highlights even more how much the IOT performance has improved, which correlates with a massive drop in physical IO for the IOT creation. If you check the session logical reads they are increased by a very small, consistent amount. Physical read IO requests have dropped significantly and, in the case of the IOT, plummeted.

I believe the 90:10 leaf node splits are consistent as that will be the maintaining of the secondary index on ACCO_TYPE and ACCO_ID, which are populated in order as the data is created (derived from rownum).

What this second test really shows is that the efficiency with which you are able to make use of the database cache is incredibly significant. Efficiently accessing data via good indexes or tricks like IOTs and hash tables is important but it really helps to also try and consider how data is going to be recycled within the cache or used, pushed out and then reused. A general principle for batch-type work seems to me to be that if you can process it in chunks that can sit in memory, rather than the whole working set, there are benefits to be gained. Of course, partitioning can really help with this.

{If anyone is wondering why, for the heap table, the number of physical IO requests has dropped by 70% but the actual number of bytes has dropped by only 30%, I’m going to point the finger to some multi-block read scan going on, either in recursive code or, more likely, my code that actually gathers those stats! That would also help explain the drop in user IO wait time for the heap run.}

Just for completeness, here is a quick check of my SGA components for the latest tests, just to show I am using the cache size I claim. All of this is on Oracle 11.1 enterprise edition, on a tired old Windows laptop. {NB new laptop arrived today – you have no idea how hard it has been to keep doing this blog and not play with the new toy!!!}. If anyone wants the test scripts in full, send me a quick email and I’ll provide them.:

-- sga_info.sql
-- Martin Widlake /08
-- summary
set pages 32
set pause on
col bytes form 999,999,999,999,999 head byts___g___m___k___b
spool sga_info.lst
select * 
from v$sgainfo
order by name
spool off
clear col
NAME                             byts___g___m___k___b RES
-------------------------------- -------------------- ---
Buffer Cache Size                         243,269,632 Yes
Fixed SGA Size                              1,374,892 No
Free SGA Memory Available                           0
Granule Size                                4,194,304 No
Java Pool Size                              4,194,304 Yes
Large Pool Size                             4,194,304 Yes
Maximum SGA Size                          401,743,872 No
Redo Buffers                                6,103,040 No
Shared IO Pool Size                                 0 Yes
Shared Pool Size                          142,606,336 Yes
Startup overhead in Shared Pool            50,331,648 No
Streams Pool Size                                   0 Yes

IOT Part 6 – Inserts and Updates Slowed Down (part A) November 1, 2011

Posted by mwidlake in Architecture, performance, Testing.
Tags: , , , ,

<..IOT1 – the basics
<….IOT2 – Examples and proofs
<……IOT3 – Significantly reducing IO
<……..IOT4 – Boosting Buffer Cache efficiency
<……….IOT5 – Primary Key Drawback
…………>IOT6(B) – OLTP Inserts

A negative impact of using Index Organized Tables is that inserts are and updates can be significantly slowed down. This post covers the former and the reasons why – and the need to always run tests on a suitable system. (I’m ignoring deletes for now – many systems never actually delete data and I plan to cover IOTs and delete later)

Using an IOT can slow down insert by something like 100% to 1000%. If the insert of data to the table is only part of a load process, this might result in a much smaller overall impact on load, such as 25%. I’m going to highlight a few important contributing factors to this wide impact spread below.

If you think about it for a moment, you can appreciate there is a performance impact on data creation and modification with IOTs. When you create a new record in a normal table it gets inserted at the end of the table (or perhaps in a block marked as having space). There is no juggling of other data.
With an IOT, the correct point in the index has to be found and the row has to be inserted at the right point. This takes more “work”. The inserting of the new record may also lead to an index block being split and the extra work this entails. Similar extra work has to be carried out if you make updates to data that causes the record to move within the IOT.
Remember, though, that an IOT is almost certainly replacing an index on the heap table which, unless you are removing indexes before loading data and recreating them after, would have to be maintained when inserting into the Heap table. So some of the “overhead” of the IOT would still occur for the heap table in maintaining the Primary Key index. Comparing inserts or updates between a heap table with no indexes and an IOT is not a fair test.

For most database applications data is generally written once, modified occasionally and read many times – so the impact an IOT has on insert/update is often acceptable. However, to make that judgement call you need to know

  • what the update activity is on the data you are thinking of putting into an IOT
  • the magnitude of the impact on insert and update for your system
  • the ratio of read to write.

There is probably little point putting data into an IOT if you constantly update the primary key values (NB see IOT-5 as to why an IOT’s PK columns might not be parts of a true Primary Key) or populate previously empty columns or hardly ever read the data.

There is also no point in using an IOT if you cannot load the data fast enough to support the business need. I regularly encounter situations where people have tested the response of a system once populated but fail to test the performance of population.

Now to get down to the details. If you remember the previous posts in this thread (I know, it has been a while) then you will remember that I create three “tables” with the same columns. One is a normal heap table, one is an Index Organized Table and one is a partitioned Index Organized Table, partitioned into four monthly partitions. All tables have two indexes on them, the Primary Key index (which is the table in the case of the IOTs) and another, roughly similar index, pre-created on the table. I then populate the tables with one million records each.

These are the times, in seconds, to create 1 million records in the the HEAP and IOT tables:

                  Time in Seconds
Object type         Run_Normal
------------------  ----------
Normal Heap table        171.9  
IOT table               1483.8

This is the average of three runs to ensure the times were consistent. I am using Oracle V11.1 on a machine with an Intel T7500 core 2 Duo 2.2GHz, 2GB memory and a standard 250GB 5000RPM disk. The SGA is 256MB and Oracle has allocated around 100MB-120MB to the buffer cache.

We can see that inserting the 1 million rows into the IOT takes 860% the time it does with a heap table. That is a significant impact on speed. We now know how large the impact is on Insert of using an IOT and presumably it’s all to do with juggling the index blocks. Or do we?

This proof-of-concept (POC) on my laptop {which you can also run on your own machine at home} did not match with a proof-of-concept I did for a client. That was done on V10.2.0.3 on AIX, on a machine with 2 dual-core CPUS with hyper-threading (so 8 virtual cores), 2GB SGA and approx 1.5GB buffer cache, with enterprise-level storage somewhere in the bowels of the server room. The results on that machine to create a similar number of records were:

                  Time in Seconds
Object type         Run_Normal
------------------  ----------
Normal Heap table        152.0  
IOT table                205.9

In this case the IOT inserts required 135% the time of the Heap table. This was consistent with other tests I did with a more complex indexing strategy in place, the IOT overhead was around 25-35%. I can’t go into too much more detail as the information belongs to the client but the data creation was more complex and so the actual inserts were only part of the process – this is how it normally is in real life. Even so, the difference in overhead between my local-machine POC and the client hardware POC is significant, which highlights the impact your platform can have on your testing.

So where does that leave us? What is the true usual overhead? Below are my more full results from the laptop POC.

                        Time in Seconds
Object type         Run_Normal    Run_quiet    Run_wrong_p
------------------  ----------    ---------    -----------
Normal Heap table        171.9        81.83         188.27  
IOT table               1483.8      1055.35        1442.82
Partitioned IOT          341.1       267.83         841.22 

Note that with the partitioned IOT the creation took 341 second, the performance ratio to a heap table is only 198% and is much better than the normal IOT. Hopefully you are wondering why!

I’m running this test on a windows laptop and other things are going on. The timings for Run_Quiet are where I took steps to shut down all non-essential services and applications. This yielded a significant increase for all three object types but the biggest impact was on the already-fastest Heap table.

The final set of figures is for a “mistake”. I created the partitions wrong such that half the data went into one partition and the rest into another and a tiny fraction into a third, rather than being spread over 4 partitions evenly. You can see that the Heap and normal IOT times are very similar to the Run_Normal results (as you would expect as these test are the same) but for the partitioned IOT the time taken is half way towards the IOT figure.

We need to dig into what is going on a little further to see where the effort is being spent, and it turns out to be very interesting. During my proof-of-concept on the laptop I grabbed the information from v$sesstat for the session before and after each object creation so I could get the figures just for the loads. I then compared the stats between each object population and show some of them below {IOT_P means Partitioned IOT}.

STAT_NAME                            Heap    	IOT	        IOT P
------------------------------------ ---------- -------------  -----------
CPU used by this session                  5,716         7,222        6,241
DB time                                  17,311       148,866       34,120
Heap Segment Array Inserts               25,538            10           10

branch node splits                           25            76           65
leaf node 90-10 splits                      752         1,463        1,466
leaf node splits                          8,127        24,870       28,841

consistent gets                          57,655       129,717      150,835
cleanout - number of ktugct calls        32,437        75,201       88,701
enqueue requests                         10,936        28,550       33,265

file io wait time                     4,652,146 1,395,970,993  225,511,491
session logical reads                 6,065,365     6,422,071    6,430,281
physical read IO requests                   123        81,458        3,068
physical read bytes                   2,097,152   668,491,776   25,133,056
user I/O wait time                          454       139,585       22,253
hot buffers moved to head of LRU         13,077       198,214       48,915
free buffer requested                    64,887       179,653      117,316

The first section shows that all three used similar amounts of CPU, the IOT and partitioned IOT being a little higher. Much of the CPU consumed was probably in generating the fake data.The DB Time of course pretty much matches the elapsed time well as the DB was doing little else.
It is interesting to see that the Heap insert uses array inserts which of course are not available to the IOT and IOT_P as the data has to be inserted in order. {I think Oracle inserts the data into the heap table as an array and then updates the indexes for all the entries in the array – and I am only getting this array processing as I create the data as an array from a “insert into as select” type load. But don’t hold me to any of that}.

In all three cases there are two indexes being maintained but in the case of the IOT and IOT_P, the primary key index holds the whole row. This means there has to be more information per key, less keys per block and thus more blocks to hold the same data {and more branch blocks to reference them all}. So more block splits will be needed. The second section shows this increase in branch node and leaf block splits. Double the branch blocks and triple the leaf block splits. This is probably the extra work you would expect for an IOT. Why are there more leaf block splits for the partitioned IOT? The same data of volume ends up taking up more blocks in the partitioned IOT – 200MB for the IOT_P in four partitions of 40-60MB as opposed to a single 170MB for the IOT. The larger overall size of the partition is just due to a small overhead incurred by using partitions and also a touch of random fluctuation.

So for the IOT and IOT_P there is about three times the index-specific work being done and a similar increase in related statistics such as enqueues, but not three times as it is not just index processing that contribute to these other statistics. However, the elapsed time is much more than three times as much. Also, the IOT_P is doing more index work than the IOT but it’s elapsed time is less. Why?

The fourth section shows why. Look at the file io wait times. This is the total time spent waiting on IO {in millionths of a second} and it is significantly elevated for the IOT and to a lesser degree for the IOT_P. Physical IO is generally responsible for the vast majority of time in any computer system where it has not been completely avoided.
Session logical reads are only slightly elevated, almost negligably so but the number of physical reads to support it increases from 123 for the Heap table insert to 81,458 for the IOT and 3,068 for the IOT_P. A clue as to why comes from the hot buffers moved to head of LRU and free buffer requested statistics. There is a lot more activity in moving blocks around in the buffer cache for the IOT and IOT_P.

Basically, for the IOT, all the blocks in the primary key segment are constantly being updated but eventually they won’t all fit in the block buffer cache – remember I said the IOT is eventually 170MB and the buffer cache on my laptop is about 100MB – so they are flushed down to disk and then have to be read back when altered again. This is less of a problem for the IOT_P as only one partition is being worked on at a time (the IOT_P is partitioned on date and the data is created day by day) and so more of it (pretty much all) will stay in memory between alterations. The largest partition only grows to 60MB and so can be worked on in memory.
For the heap, the table is simply appended to and only the indexes have to be constantly updated and they are small enough to stay in the block buffer cache as they are worked on.

This is why when I got my partitioning “wrong” the load took so much longer. More physical IO was needed as the larger partition would not fit into the cache as it was worked on – A quick check shows that logical reads and in fact almost all statistics were very similar but 26,000 IO requests were made (compared to 81,458 for the IOT and 3,068 for the correct IOT_P).

Of course, I set my SGA size and thus the buffer cache to highlight the issue on my laptop and I have to say even I was surprised by the magnitude of the impact. On the enterprise-level system I did my client’s proof of concept on, the impact on insert was less because the buffer cache could hold the whole working set, I suspect the SAN had a considerable cache on it, there was ample CPU resource to cope with the added latching effort and the time taken to actually create the data inserted was a significant part of the workload, reducing the overall impact of the slowness caused by the IOT.

{Update, in This little update I increase my block buffer cache and show that physical IO plummets and the IOT insert performance increases dramatically}.

This demonstrates that a POC, especially one for what will become a real system, has to be a realistic volume on realistic hardware.
For my client’s POC, I still did have to bear in mind the eventual size of the live working set and the probably size of the live block buffer cache and make some educated guesses.

It also explains why my “run_quiet” timings showed a greater benefit for the heap table than the IOT and IOT_P. A windows machine has lots of pretty pointless things running that take up cpu and a bit of memory, not really IO so much. I reduced the CPU load and it benefits activity that is not IO, so it has more impact on the heap table load. Much of the time for the IOT and IOT_P is taken hammering the disk and that just takes time.

So, in summary:

  • Using an IOT increases the index block splitting and, in turn, enqueues and general workload. The increase is in proportion to the size of the IOT compared to the size of the replaced PK.
  • The performance degredation across the whole load process may well be less than 50% but the only way to really find out is to test
  • You may lose the array processing load that may benefit a heap table load if you do the load via an intermediate table.
  • With an IOT you may run into issues with physical IO if the segment (or part of the segment) you are loading into cannot fit into the buffer cache (This may be an important consideration for partitioning or ordering of the data loaded)
  • If you do a proof of concept, do it on a system that is as similar to the real one as you can
  • Just seeing the elapsed time difference between test is sometimes not enough. You need to find out where that extra time is being spent

I’ve thrown an awful lot at you in this one post, so I think I will stop there. I’ve not added the script to create the test tables here, they are in IOT-5 {lacking only the grabbing of the v$sesstat information}.

Friday Philosophy – The Dying Art of Database Design? September 9, 2011

Posted by mwidlake in Architecture, development, Friday Philosophy, rant.
Tags: , , ,

How many people under the age of {Martin checks his age and takes a decade or so off} ohh, mid 30’s does any database design these days? You know, asks the business community what they want the system to do, how the information flows through their business, what information they need to report on. And then construct a logical model of that information? Judging by some of the comments I’ve had on my blog in the last couple of years and also the meandering diatribes of bitter, vitriolic complaints uttered by fellow old(er) hacks in the pub in the evening, it seems to be coming a very uncommon practice – and thus a rare and possibly dying skill.

{update – this topic has obviously been eating at my soul for many years. Andrew Clark and I had a discussion about it in 2008 and he posted a really good article on it and many, many good comments followed}

Everything seems to have turned into “Ready, Fire, Aim”. Ie, you get the guys doing the work in a room, develop some rough idea of what you want to develop (like, look at the system you are replacing), start knocking together the application and then {on more enlightened projects} ask the users what they think. The key points are the that development kicks off before you really know what you need to produce, there is no clear idea of how the stored data will be structured and you steer the ongoing development towards the final, undefined, target. I keep coming across applications where the screen layouts for the end users seem to almost be the design document and then someone comes up with the database – as the database is just this bucket to chuck the data into and scrape it out of again.

The functionality is the important thing, “we can get ‘someone’ to make the database run faster if and when we have a problem”.

Maybe I should not complain as sometimes I am that ‘someone’ making the database run faster. But I am complaining – I’m mad as hell and I ain’t gonna take it anymore! Oh, OK, in reality I’m mildly peeved and I’m going to let off steam about it. But it’s just wrong, it’s wasting people’s time and it results in poorer systems.

Now, if you have to develop a simple system with a couple of screens and a handful of reports, it might be a waste of time doing formal design. You and Dave can whack it together in a week or two, Chi will make the screens nice, it will be used by a handful of happy people and the job is done. It’s like building a wall around a flower bed. Go to the local builders merchants, get a pallet of bricks, some cement and sand (Ready), dig a bit of a trench where you want to start(Aim) and put the wall up, extending it as you see fit (Fire). This approach won’t work when you decide to build an office block and only a fool from the school of stupid would attempt it that way.

You see, as far as I am concerned, most IT systems are all about managing data. Think about it. You want to get your initial information (like the products you sell), present it to the users (those customers), get the new (orders) data, pass it to the next business process (warehouse team) and then mine the data for extra knowledge (sales patterns). It’s a hospital system? You want information about the patients, the staff, the beds and departments, tests that need doing, results, diagnoses, 15,000 reports for the regulators… It’s all moving data. Yes, a well design front end is important (sometimes very important) but the data is everything. If the database can’t represent the data you need, you are going to have to patch an alteration in. If you can’t get the data in quick enough or out quick enough, your screens and reports are not going to be any use. If you can’t link the data together as needed you may well not be able to DO your reports and screens. If the data is wrong (loses integrity) you will make mistakes. Faster CPUS are not going to help either, data at some point has to flow onto and off disks. Those slow spinning chunks of rust. CPUS have got faster and faster, rust-busting has not. So data flow is even more important than it was.

Also, once you have built your application on top of an inadequate database design, you not only have to redesign it, you have to:

  • do some quick, hacky  fixes to get by for now
  • migrate the existing data
  • transform some of it (do some data duplication or splitting maybe)
  • alter the application to cope
  • schedule all of the above to be done together
  • tie it in with the ongoing development of the system as hey, if you are not going to take time to design you are not going to take time to assess things before promising phase 2.

I’m utterly convinced, and experience backs this up, that when you take X weeks up front doing the database design, you save 5*X weeks later on in trying to rework the system, applying emergency hacks and having meetings about what went wrong. I know this is an idea out of the 80’s guys, but database design worked.

*sigh* I’m off to the pub for a pint and to reminisce about the good-old-days.


Get every new post delivered to your Inbox.

Join 206 other followers