jump to navigation

How NOT to present November 30, 2010

Posted by mwidlake in Meeting notes.
Tags: , ,
9 comments

I’m at the UKOUOG this week and, as ever, the presentations vary in quality. Most are excellent {or even better than that}, some are not. I was in one first thing this morning and, I have to say, it was rushed, garbled, unclear and there was a definite air of unease and panic. I’m not even sure the guy got to his big point and I could think of at least three major things he did not mention at all.

I think his main problem was just starting off in a rush and never settling down. You see, I was stuck on the top floor of my Hotel and had to run to the venue. Yes, the poor presentation was by me :-(.

I usually present well {modesty forbids me from saying I am a very good presenter – but modesty can take a hike, my ego knows I am capable of giving great presentations}. I am one of those lucky people for whom presenting has never been particularly frightening and, in fact, I find it easier to present to a group of people than talk with them.

But not today. I was already worried about the session, have been for weeks, as I was doing interactive demos. But last night I ran through it, wrote down the names of the scripts and the slide numbers so I could just bang through them and timed it all. 50 mins, I would skip one unneeded section. Calm. I got a reasonable night’s sleep, got up early and ran through it all one more time, making sure my Big Point demo worked. And it did. Yes.

Went down to breakfast, had breakfast and back to the room to pick up my stuff. And realised I was late. Less than 10 minutes to do the 5 minutes over to the venue. So I fled the room, stuffing my laptop in my bag. But not my notes. Or my conference pass. I did not think of this as I stood on the top floor of the hotel, I just thought “where are the lifts?”. They were all below me, ferrying hungry people to and from breakfast. After what seemed like an hour and was only 4 or 5 minutes I decided 16 flights of stairs was OK to go down and, to give me credit, I managed those stairs and the few hundred yards to the venue in pretty good time. I did pause for a few seconds at floor 7, I think, when I remembered my notes. Too late.

But I was now panicked and arrived as a dash. I had to mess about with the Audio Visual guy to get going and started 2 mins past my slot start – and then did the 5 minutes of non-relevant stuff I had decided to drop. It was game over from there, I was failing to find the correct scripts, I was skipping relevant sections and I was blathering instead of just taking a few seconds to calm down and concentrate.

Oh well, my first time in a large room at the UKOUG and I messed up. At least I had the key lesson drummed into me. TURN UP EARLY!!!!

You can explain an invalid SQL statement November 27, 2010

Posted by mwidlake in internals.
Tags: , ,
6 comments

I’m in “nightmare weekend before presenting” mode. I’m up to my eyes at work (and have been for ages, thus the quiet blog) and my recent weekends have been full of normal {and abnormal} life.

As is the way, when up against it and putting together my proofs for wild claims, everything breaks subtly and makes my wild claims look a little, well, wild – even though they are real issues I’ve seen, worked through and fixed in the day job. *sigh*. It does not help when you come across little oddities you have never seen before and end up spending valuable time looking into them.

So here is one. I’m just putting together a very, very simple demo of how the number of rows the CBO expects to see drops off as you move outside the known range. In the below you can see the statement I am using (I keep passing in different days of the month and watching the expected number of rows drop until I hit 1 expected row), but look at how it progress to the last entry…

mdw11> select count(*) from date_test_flat where date_1=to_date('&day-02-2011','DD-MM-YYYY')
  2  /
Enter value for day: 01

Execution Plan
----------------------------------------------------------
Plan hash value: 247163334

-------------------------------------------------------------------------------------
| Id  | Operation          | Name           | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |                |     1 |     8 |   215   (0)| 00:00:04 |
|   1 |  SORT AGGREGATE    |                |     1 |     8 |            |          |
|*  2 |   TABLE ACCESS FULL| DATE_TEST_FLAT |    16 |   128 |   215   (0)| 00:00:04 |
-------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("DATE_1"=TO_DATE(' 2011-02-01 00:00:00', 'syyyy-mm-dd
              hh24:mi:ss'))

mdw11> /
Enter value for day: 15

Execution Plan
----------------------------------------------------------
Plan hash value: 247163334

-------------------------------------------------------------------------------------
| Id  | Operation          | Name           | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |                |     1 |     8 |   215   (0)| 00:00:04 |
|   1 |  SORT AGGREGATE    |                |     1 |     8 |            |          |
|*  2 |   TABLE ACCESS FULL| DATE_TEST_FLAT |     2 |    16 |   215   (0)| 00:00:04 |
-------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("DATE_1"=TO_DATE(' 2011-02-15 00:00:00', 'syyyy-mm-dd
              hh24:mi:ss'))

mdw11> /
Enter value for day: 21

Execution Plan
----------------------------------------------------------
Plan hash value: 247163334

-------------------------------------------------------------------------------------
| Id  | Operation          | Name           | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |                |     1 |     8 |   215   (0)| 00:00:04 |
|   1 |  SORT AGGREGATE    |                |     1 |     8 |            |          |
|*  2 |   TABLE ACCESS FULL| DATE_TEST_FLAT |     1 |     8 |   215   (0)| 00:00:04 |
-------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("DATE_1"=TO_DATE(' 2011-02-21 00:00:00', 'syyyy-mm-dd
              hh24:mi:ss'))

mdw11> /
Enter value for day: 30

Execution Plan
----------------------------------------------------------
Plan hash value: 247163334

-------------------------------------------------------------------------------------
| Id  | Operation          | Name           | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |                |     1 |     8 |   215   (0)| 00:00:04 |
|   1 |  SORT AGGREGATE    |                |     1 |     8 |            |          |
|*  2 |   TABLE ACCESS FULL| DATE_TEST_FLAT |    99 |   792 |   215   (0)| 00:00:04 |
-------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("DATE_1"=TO_DATE('30-02-2011','DD-MM-YYYY'))

mdw11>

The expected number of rows drops, becomes and – and has shot up to 99 again (which is the expected number in the known range, as I have 10,000 rows spread over 100 days). My immediate thought is “Wow! Maybe Oracle have put some odd fix in where when you go well out of range it reverts to expecting an average number of rows”. Nope. It is because I asked for the data for 30th February. And I did not get an error.

I think it is because I have set autotrace traceonly explain. This causes the SQL statement not to be executed {if it is just a select, not an insert, update or delete}. It seems the costing section of the CBO is not so good at spotting duff dates, but it then gets the costing wrong.

I’ve spotted that the format of the filter also changes when the date is invalid, I really want to check that out – but I better continue failing to write the presentation!

I know, pretty pointless knowing this but it just amused me. Below is just a quick continuation to show that if the statment is to be executed you get an error and no plan and that utterly duff dates can be passed in.

mdw11> /
Enter value for day: 28

Execution Plan
----------------------------------------------------------
Plan hash value: 247163334

-------------------------------------------------------------------------------------
| Id  | Operation          | Name           | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |                |     1 |     8 |   215   (0)| 00:00:04 |
|   1 |  SORT AGGREGATE    |                |     1 |     8 |            |          |
|*  2 |   TABLE ACCESS FULL| DATE_TEST_FLAT |     1 |     8 |   215   (0)| 00:00:04 |
-------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("DATE_1"=TO_DATE(' 2011-02-28 00:00:00', 'syyyy-mm-dd
              hh24:mi:ss'))

mdw11> SET AUTOTRACE ON
mdw11> /
Enter value for day: 20
any key>

  COUNT(*)
----------
         0

1 row selected.


Execution Plan
----------------------------------------------------------
Plan hash value: 247163334

-------------------------------------------------------------------------------------
| Id  | Operation          | Name           | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |                |     1 |     8 |   215   (0)| 00:00:04 |
|   1 |  SORT AGGREGATE    |                |     1 |     8 |            |          |
|*  2 |   TABLE ACCESS FULL| DATE_TEST_FLAT |     1 |     8 |   215   (0)| 00:00:04 |
-------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("DATE_1"=TO_DATE(' 2011-02-20 00:00:00', 'syyyy-mm-dd
              hh24:mi:ss'))


Statistics
----------------------------------------------------------
          1  recursive calls
          0  db block gets
        821  consistent gets
          0  physical reads
          0  redo size
        421  bytes sent via SQL*Net to client
        415  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          1  rows processed

mdw11> /
Enter value for day: 30
select count(*) from date_test_flat where date_1=to_date('30-02-2011','DD-MM-YYYY')
                                                         *
ERROR at line 1:
ORA-01839: date not valid for month specified


mdw11> set autotrace traceonly explain
mdw11> /
Enter value for day: 30

Execution Plan
----------------------------------------------------------
Plan hash value: 247163334

-------------------------------------------------------------------------------------
| Id  | Operation          | Name           | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |                |     1 |     8 |   215   (0)| 00:00:04 |
|   1 |  SORT AGGREGATE    |                |     1 |     8 |            |          |
|*  2 |   TABLE ACCESS FULL| DATE_TEST_FLAT |    99 |   792 |   215   (0)| 00:00:04 |
-------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("DATE_1"=TO_DATE('30-02-2011','DD-MM-YYYY'))

mdw11> /
Enter value for day: 45

Execution Plan
----------------------------------------------------------
Plan hash value: 247163334

-------------------------------------------------------------------------------------
| Id  | Operation          | Name           | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |                |     1 |     8 |   215   (0)| 00:00:04 |
|   1 |  SORT AGGREGATE    |                |     1 |     8 |            |          |
|*  2 |   TABLE ACCESS FULL| DATE_TEST_FLAT |    99 |   792 |   215   (0)| 00:00:04 |
-------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("DATE_1"=TO_DATE('45-02-2011','DD-MM-YYYY'))

Database Sizing – How much Disk do I need? (The Easy Way) November 11, 2010

Posted by mwidlake in Architecture, development, VLDB.
Tags: , , , ,
7 comments

How much Disk do I need for my new Oracle database? Answer:-

  • 8-10 times the volume of raw data for an OLTP system
  • 2-4 times the raw data volume for a Data Warehouse.
  • The bigger the database, the nearer you will be to the lower multiplication factors.

{Disclaimer. This is of course just my opinion, based on some experience. If you use the above figures for a real project and get the total disc space you need wrong, don’t blame me. If you do and it is right, then of course you now owe me a beer.}

Many of us have probably had to calculate the expected size a database before, but the actual database is only one component of all the things you need to run the Oracle component of your system. You need to size the other components too – Archived redo logs, backup staging area, dataload staging area, external files, the operating system, swap space, the oracle binaries {which generally gets bigger every year but shrink in comparison to the average size of an Oracle DB} etc…

In a similar way to my thoughts on how much database space you need for a person, I also used to check out the total disk space every database I created and those that I came across took up. {A friend emailed me after my earlier posting to ask if I had an obsession about size. I think the answer must be “yes”}.

First of all, you need to know how much “raw data” you have. By this I mean what will become the table data. Back in the early 90’s this could be the total size of the flat files the old system was using, even the size of the data as it was in spreadsheets. An Oracle export file of the system gives a pretty good idea of the raw data volume too. Lacking all these then you need to roughly size your raw data. Do a calculation of “number_of_rows*sum_of_columns” for your biggest 10 tables (I might blog more on this later). Don’t be tempted to overestimate, my multipliers allow for the padding.

Let us say you have done this and it is 60GB of raw data for an OLTP system. Let the storage guys know you will probably want about 500GB of space. They will then mentally put it down as “of no consequence” as if you have dedicated storage guys you probably have many terabytes of storage. {Oh, I should mention that I am not considering redundancy at all but space that is provided. The amount of actual spinning disk is down to the level and type of RAID you storage guys make you use. That is a whole other discussion}.

If you come up with 5TB of raw data for a DW system then you need around 12-15TB of disk storage.

If you come up with more than a Terabyte or so of raw data for an OLTP system or 10 to 20 Terabytes for a DW, when you give you figures to the storage guys/procurement people then they may well go pale and say something like “you have got to be kidding!”. This is part of why the multiplication factor for Data Warehouses and larger systems in general is less, as you are forced to be more careful about the space you allocate and how you use it.

The overhead of total disk space over Raw data reduces as the database gets bigger for a number of reasons:

  • The size of the Oracle binaries and the OS does not change as the database gets bigger.
  • The size of swap space does not increase in line wiht the database as, generally speaking, if you increase the database size from 100GB to 1TB you do not have the luxury of increasing the system memory of your server. It probably doubles.
  • Very large databases tend to have something making them big, like images or embedded documents, which are not indexed. Thus the ratio of table segments to index segments increases.
  • If you have a very large database you start removing indexes (often those that support constraints) to aid performance of data load and management, again improving the ratio of table segments to index segments.
  • Backups become partial or incremental to reduce the size and duration of the backup.
  • As mentioned before, the sheer size of system is such that you just take more care over cleaning up work areas, paring down the archived redo log areas (those files to compress well) and other areas.
  • If things get extreme or you have been doing this for donkeys years {note to none-UK people, this means many, many years} you start altering PCTFREE and checking over extent sizes.

My best ever ratio of database size to raw data was around 1.6 and it took an awful lot of effort and planning to get there. And an IT manager who made me very, very aware of how much the storage was costing him (it is not the disks, it’s all the other stuff).

I should also just mention that the amount of disk you need is only one consideration. If you want your database to perform well you need to consider the number of spindles. After all, you can create a very large database indeed using a single 2TB disc – but any actual IO will perform terribly.

How Big is a Person? November 5, 2010

Posted by mwidlake in Architecture.
Tags: , ,
5 comments

How big are you in the digital world?

By this, I mean how much space do you (as in, a random person) take up in a database? If it is a reasonably well designed OLTP-type database a person takes up 4K. OK, around 4K.

If your database is holding information about people and something about them, then you will have about 4K of combined table and index data per person. So if your database holds 100,000 customers, then your database is between 200MB and 800MB, but probably close to 400MB. There are a couple of situations I know of where I am very wrong, but I’ll come to that.

How do I know this? It is an accident of the projects and places I have worked at for 20 years and the fact that I became strangely curious about this. My first job was with the NHS and back then disk was very, very expensive. So knowing how much you needed was important. Back then, it was pretty much 1.5K per patient. This covered personal details (names, addresses, personal characteristics), GP information, stays at hospitals, visits to outpatient clinics etc,. It also included the “reference “ data, ie the information about consultants, wards and departments, lookups etc. If you included the module for lab tests it went up to just over 2K. You can probably tell that doing this sizing was a job I handled. This was not Oracle, this was a database called MUMPS and we were pretty efficient in how we held that data.

When I moved to work on Oracle-based hospital systems, probably because I had done the data sizing in my previous job and partly because I was junior and lacked any real talent, I got the job to do the table sizings again, and a laborious job it was too. I did it very conscientiously, getting average lengths for columns, taking into account the length bytes, row overhead, block overhead, indexes etc etc etc. When we had built the database I added up the size of all the tables and indexes, divided by the number of patients and… it was 2K. This was when I got curious. Had I wasted my time doing the detailed sizings?

Another role and once again I get the database sizing job, only this time I wrote a little app for it. This company did utilities systems, water, gas, electricity. My app took into account everything I could think of in respect of data sizing, from the fact that the last extent would on average be 50% empty to the tablespace header. It was great. And pointless. Sum up all the tables and indexes on one of the live systems and divide by the number of customers and it came out at 2-3K per customer. Across a lot of systems. It had gone up a little, due to more data being held in your average computer system.

I’ve worked on a few more person-based systems since and for years I could not help myself, I would check the size of the data compared to the number of people. The size of the database is remarkably consistent. It is slowly going up because we hold more and more data, mostly because it is easier to suck up now as all the feeds are electronic and there is no real cost in taking in that data and holding it. Going back to the hospital systems example, back in 1990 it used to be that you would hold the fact a lab test had been requested and the key results information – like the various cell counts for a blood test. This was because sometimes you had to manually enter the results. Now the test results come off another computer and you get everything.

I said there were exceptions. There are three main ones:

  • You are holding a very large number of transaction records for the person. Telephony systems are one of the worst examples of this. Banking, credit cards and other utility systems match the 4K rule.
  • You hold images or other “unstructured” chunks of data for people. In hospital systems this would cover x-rays, ultrasound scans etc. But if you drop them out of the equation (and this is easy as they often are held in separate sub-systems) it remains a few K per person. CVs push it up as they are often in that wonderfully bloaty Word format.
  • You are holding mostly pointers to another system, in which case it can be a lot less than 4K per person. I had to size a system recently and I arrogantly said “4K per person”. It turned out to be less than 1K, but then this system turned out to actually hold most person data in one key data store and “my” system only held transaction information. I bet that datastore was about 4K per person

I have to confess that I have not done this little trick of adding up the size of all the tables and indexes and dividing by the number of people so often over the last couple of years, but the last few times I checked it was still 3-4K – though a couple of times I had to ignore a table or two holding unstructured data.
{The massive explosion in the size of database is at least partly down to holding pictures – scanned forms, photos of products, etc, but when it comes down to the core part of the app for handling people, it seems to have stayed at 4K. The other two main aspects driving up database size seem to me to be the move from regional companies and IT systems to national and international ones, and that fact that people collect and keep all and every piece of information, be it any good for anything or not}.

I’d love to know if your person-based systems come out at around 4K per person but I doubt if many of you would be curious enough to check – I think my affliction is a rare one.

Follow

Get every new post delivered to your Inbox.

Join 159 other followers