jump to navigation

How Fast for £1,000 – Architecture August 5, 2010

Posted by mwidlake in Architecture, performance, Testing.
Tags: , , ,

My previous post proposed the creation of “the fastest Oracle server for a grand”, or at least an investigation into what might be the fastest server. I’ve had some really good feedback {which I very much appreciate and am open to even more of}, so I think I’ll explore this further.

My initial ideas for the hardware configuration, written at the same time as the original post, were:

  • A single-chip, quad core intel core i5 or i7 processor (I would like two chips but the cost of multi-chip motherboards seems too high for my budget)
  • 8GB of memory as the best price point at present, but maybe push to 16GB
  • Multiple small, fast internal disks for storage, maybe expand via eSATA
  • backup to an external drive (cost not included in the budget).
  • USB3 and use of memory sticks for temp and online redo.
  • If budget will stretch, SSD disc for the core database components. like core tables, index tablespaces (who does that any more!).
    ASM or no ASM?
    If I run out of internal motherboard connections for storage, can I mix and match with USB3, external e-SATA or even GB ethernet?

As for the Oracle database considerations, I have a good few things I want to try out also. In the past (both distant and recent) I have had a lot of success in placing components of the database in specific locations. I refer to this as “Physical Implementation” {Physical Implementation, if I remember my old DB Design courses correctly, also includes things like partitioning, extent management, tablespace attributes – how you actually implement the tables, indexes and constraints that came from logical data design}.

Physically placing components like undo and redo logs on your fastest storage is old-hat but I think it gets overlooked a lot these days.
Placing of indexes and tables on different tablespaces on different storage is again an old and partially discredited practice, but I’d like to go back and have a new look at it. Again, I had some success with improved performance with this approach as little as 8 years ago but never got to rigorously test and document it. { As an aside, one benefit I have been (un)fortunate to gain from twice through putting tables and indexes in separate tablespaces is when a tablespace has been lost through file corruption – only for it to be an index tablespace, so I was able to just drop the tablespace and recreate the indexes.}

Then there is the use of clusters, IOTs, Bitmap indexes and Single Table Hash Clusters (are you reading this Piet?) which I want to explore again under 11.

I don’t think I am going to bother with mixed block sizes in one DB, I think you need very specialist needs to make it worth the overhead of managing the various caches and the fact that the CBO is not so great at accurately costing operations in non-standard block sizes {issues with the MBRC fudge factor being one}. But I think I will re-visit use of “keep” and “recycle” caches. For one thing, I want to show that they are just caches with a name and not special, by using the “Recycle” cache as the keep and the “keep” as a recycle cache.

Should I be using RAT for testing all of this? I said I was not going to use any special features beyond Enterprise edition but RAT could be jolly useful. But then I would need two servers. Is anyone willing to give me the other £1000 for it? I’d be ever so grateful! 🙂


How often is v$sys_time_model updated? July 14, 2010

Posted by mwidlake in internals, performance.
Tags: ,

I think this posting might go down as one of my more pointless contributions to the Oracle knowledge sphere 🙂

I was looking at V$SYS_TIME_MODEL and V$SESS_TIME_MODEL and I just happened to run “select * from V$SYS_TIME_MODEL” several times in very quick succession. And I noticed the values for the various counters stayed the same between a couple of the runs.

“Hmmm, interesting” I thought “The values are only flushed down to the view ‘periodically’. I wonder how periodically?”… and thus I wasted a lunch time.

I used the below sql*plus-PL/SQL script to investigate the refreshing of v$sess_time_model. Yes, I know the title says v$sys_time_model but the numbers are smaller and easier to check for the session version of the view and they are virtually the same, I can bet on the results being very similar. This is my script (and it is on on linux):

-- how often is v$sessstat updated
set trims on
set term off
set serveroutput on size unli
spool test_vstm_upd 
  for a in 1..1000  loop
    for vrec in 
      (select rpad(rpad(to_char(a),4)||' '||to_char(systimestamp,'ss.ff')||' '|| stat_name||' ',45)||value rec
      from v$sess_time_model
-- replace with your current session ID
      where sid=1989
      and stat_id in (3649082374,2748282437,2821698184,2643905994)
    end loop;
  end loop;
set term on
spool off

As you can see, it simply loops around selecting four of the values from v$sess_time_model, including the loop counter and current timestamp. Timetamp is evaluated afresh for each executed sql statement.

Here is the output for the first three iterations;

1    53.389576 DB time                       475860419
1    53.389576 DB CPU                        402642660
1    53.389576 sql execute elapsed time      209780319
1    53.389576 PL/SQL execution elapsed time 52290858
2    53.408944 DB time                       475860419
2    53.408944 DB CPU                        402642660
2    53.408944 sql execute elapsed time      209780319
2    53.408944 PL/SQL execution elapsed time 52290858
3    53.429159 DB time                       475860419
3    53.429159 DB CPU                        402642660
3    53.429159 sql execute elapsed time      209780319
3    53.429159 PL/SQL execution elapsed time 52290858

As you can see, the timetamp is increasing by 2/100s of a second or so per loop. Which is not as quick as I hoped but it is a test box. Note that the counters for DB Time, CPU time, SQL execute elapsed time and PL/SQL execution elapsed time are constant.

A few iterations later we see the v$sess_time_model counters increment:

7    53.509351 DB time                       475860419
7    53.509351 DB CPU                        402642660
7    53.509351 sql execute elapsed time      209780319
7    53.509351 PL/SQL execution elapsed time 52291610
--all change!	
8    53.531378 DB time                       475871716
8    53.531378 DB CPU                        402653957
8    53.531378 sql execute elapsed time      209786745
8    53.531378 PL/SQL execution elapsed time 52292793
-- and stable	
9    53.555889 DB time                       475871716
9    53.555889 DB CPU                        402653957
9    53.555889 sql execute elapsed time      209786745
9    53.555889 PL/SQL execution elapsed time 52292793

The counters all increment between iteration 7 and 8 and then stay the same. I can’t tell how long the counters had been the same, I need to wait and see when they change again. How long until they increment again? Well, not very long, in fact just around 0.12 seconds:

14   53.650154 DB time                       475871716
14   53.650154 DB CPU                        402653957
14   53.650154 sql execute elapsed time      209786745
14   53.650154 PL/SQL execution elapsed time 52293064
-- change	
15   53.670358 DB time                       475881268
15   53.670358 DB CPU                        402663509
15   53.670358 sql execute elapsed time      209792803
15   53.670358 PL/SQL execution elapsed time 52294180
-- still changing	
16   53.689011 DB time                       475887530
16   53.689011 DB CPU                        402669771
16   53.689011 sql execute elapsed time      209794387
16   53.689011 PL/SQL execution elapsed time 52294180
-- and still changing	
17   53.710875 DB time                       475889549
17   53.710875 DB CPU                        402671790
17   53.710875 sql execute elapsed time      209796393
17   53.710875 PL/SQL execution elapsed time 52295342
-- ...still changing...
18   53.728168 DB time                       475893032
18   53.728168 DB CPU                        402675273
18   53.728168 sql execute elapsed time      209797665
18   53.728168 PL/SQL execution elapsed time 52295342
-- and stable	
19   53.744725 DB time                       475893032
19   53.744725 DB CPU                        402675273
19   53.744725 sql execute elapsed time      209797665
19   53.744725 PL/SQL execution elapsed time 52295342

This time, the increment occurs over several iterations of the loop before becoming stable again. All four values I am pulling out increment over these iterations.

The next increment comes four iterations or 0.1 seconds later and happens swiftly, between two iterations:

22   53.802486 DB time                       475893032
22   53.802486 DB CPU                        402675273
22   53.802486 sql execute elapsed time      209797665
22   53.802486 PL/SQL execution elapsed time 52295342
-- change	
23   53.822231 DB time                       475897963
23   53.822231 DB CPU                        402680204
23   53.822231 sql execute elapsed time      209800369
23   53.822231 PL/SQL execution elapsed time 52296904
-- stable	
24   53.840085 DB time                       475905724
24   53.840085 DB CPU                        402687965
24   53.840085 sql execute elapsed time      209803330
24   53.840085 PL/SQL execution elapsed time 52296904

So it seem that v$sess_time_model is incremented in steps, not constantly, and does so every 0.10 to 0.13 seconds or so. My work here is done.

Or is it?

No, it is not, as there is now a “massive” gap where the counters do not increment for almost 3/4 of a second, until iteration 127:

126  55.530398 DB time                       475905724
126  55.530398 DB CPU                        402687965
126  55.530398 sql execute elapsed time      209803775
126  55.530398 PL/SQL execution elapsed time 52297583
-- change	
127  55.545085 DB time                       475914013
127  55.545085 DB CPU                        402696254
127  55.545085 sql execute elapsed time      209809518
127  55.545085 PL/SQL execution elapsed time 52298886
-- still changing	
128  55.560141 DB time                       475921342
128  55.560141 DB CPU                        402703583
128  55.560141 sql execute elapsed time      209812345
128  55.560141 PL/SQL execution elapsed time 52299359
-- still changing	
129  55.574806 DB time                       475922705
129  55.574806 DB CPU                        402704946
129  55.574806 sql execute elapsed time      209812345
129  55.574806 PL/SQL execution elapsed time 52299359
-- stable	
130  55.589541 DB time                       475922705
130  55.589541 DB CPU                        402704946
130  55.589541 sql execute elapsed time      209812345
130  55.589541 PL/SQL execution elapsed time 52299359 

Again, the incrementing ran over a small number of iterations of my loop.

I think I have shown that all the values increment together so I will reduce my output to just the one counter and see when it increments and over how many iterations and see if a pattern appears:

25   53.860550 DB time                       475905724
-- 53.84 to 55.54 0 1.7 seconds of stability 
126  55.530398 DB time                       475905724
127  55.545085 DB time                       475914013
128  55.560141 DB time                       475921342
129  55.574806 DB time                       475922705
136  55.682402 DB time                       475922705
137  55.697191 DB time                       475956738
138  55.712266 DB time                       475969859
139  55.727820 DB time                       475974350
140  55.743315 DB time                       475982356
141  55.758749 DB time                       475994069
142  55.773602 DB time                       476004596
143  55.788472 DB time                       476004596
144  55.803295 DB time                       476007541
145  55.818136 DB time                       476011172
146  55.832886 DB time                       476020336
147  55.847772 DB time                       476025376
148  55.865303 DB time                       476036347
-- incrementd with a couple of brief pauses over 0.34 seconds
149  55.881480 DB time                       476041481
150  55.896735 DB time                       476041481
200  56.664783 DB time                       476041481
-- 55.88 to 56.67 0.8 seconds of stability
201  56.679455 DB time                       476049162
-- increments over two iterations, 0.03 seconds
202  56.694092 DB time                       476052385
203  56.708733 DB time                       476052385
261  57.566902 DB time                       476052385
-- 56.69 to 57.59 0.9 seconds of stability
262  57.581582 DB time                       476052842
263  57.596218 DB time                       476058537
270  57.700212 DB time                       476058537
271  57.715371 DB time                       476060552
272  57.730797 DB time                       476063551
273  57.745700 DB time                       476074383
274  57.760351 DB time                       476079741
279  57.835162 DB time                       476079741
280  57.849966 DB time                       476080090
281  57.864782 DB time                       476090799
282  57.879446 DB time                       476100404
283  57.894553 DB time                       476103222
-- stable again after 0.3 seconds and a couple of mini-pauses
284  57.910592 DB time                       476103222
335  58.677438 DB time                       476103222
-- 57.91 to 58.69 0.8 seconds of stability
336  58.694704 DB time                       476113168
337  58.709995 DB time                       476113909
338  58.724782 DB time                       476119452
339  58.740756 DB time                       476119795
340  58.758659 DB time                       476129752
341  58.776040 DB time                       476132036
345  58.854895 DB time                       476132036
346  58.869516 DB time                       476138982
347  58.884100 DB time                       476145880
348  58.898772 DB time                       476160301
349  58.913401 DB time                       476178139
350  58.935391 DB time                       476183281
-- stable again after 0.27 seconds
351  58.955195 DB time                       476183281
395  59.608368 DB time                       476183281
-- 57.93 to 59.60 0.68 seconds of stability
396  59.623062 DB time                       476187188
402  59.713566 DB time                       476187188
403  59.728220 DB time                       476194591
404  59.742900 DB time                       476204006
405  59.757544 DB time                       476210666
406  59.774934 DB time                       476216338
407  59.796595 DB time                       476228874
413  59.890172 DB time                       476228874
414  59.908436 DB time                       476238680
415  59.923166 DB time                       476251316
416  59.937805 DB time                       476259466
417  59.952540 DB time                       476261228
418  59.967215 DB time                       476277094
419  59.981914 DB time                       476282108
-- stable again after 0.29 seconds
420  00.000358 DB time                       476298216
529  01.684500 DB time                       476298216
-- 00.00 to 01.69 1.69 seconds of stability
530  01.699165 DB time                       476301888
531  01.714307 DB time                       476312510

I would say that we can draw a few conclusions from the above

  • It is dangerous to look at a little bit of data and draw a firm conclusion, as I nearly did
  • The data in v$sess_time_model is only maintained in near-time not real-time
  • The counters in v$sess_time_model increment together
  • The counters seem to increment in a slightly messy way over part of a second and then are stable for 3/4 of a second to a second or two

I wonder how many of you went “Oh dear” when I said I could derive what is true for v$sys_time_model from v$sess_time_model? Could I? well, here is the modified script for v$sys_time_model:

-- how often is v$sysstat updated
set trims on
set term off
set serveroutput on size unli
spool test_vstm_upd2 
  for a in 1..1000  loop
    for vrec in 
      (select rpad(rpad(to_char(a),4)||' '||to_char(systimestamp,'ss.ff')||' '|| stat_name||' ',45)||value rec
      from v$sys_time_model
      --where sid=1989
      where stat_id in (3649082374)
    end loop;
--    dbms_output.put_line(chr(9));
  end loop;
set term on
spool off

And a sample of my output:

1    43.187666 DB time                       14429733395433
2    43.188523 DB time                       14429733395755
3    43.188642 DB time                       14429733395905
4    43.188733 DB time                       14429733395905
5    43.188822 DB time                       14429733395905
6    43.188909 DB time                       14429733395905
7    43.188995 DB time                       14429733396491
8    43.189080 DB time                       14429733396491
9    43.189164 DB time                       14429733396491
10   43.189258 DB time                       14429733396491
11   43.189345 DB time                       14429733396491
12   43.189430 DB time                       14429733396491
13   43.189515 DB time                       14429733396491
14   43.189600 DB time                       14429733396491
15   43.189687 DB time                       14429733396491
16   43.189774 DB time                       14429733396491
17   43.189858 DB time                       14429733396491
18   43.189942 DB time                       14429733396491
19   43.190026 DB time                       14429733396491
20   43.190111 DB time                       14429733396491
21   43.190200 DB time                       14429733397436
22   43.190287 DB time                       14429733397436
23   43.190371 DB time                       14429733397436
24   43.190454 DB time                       14429733397436
25   43.190540 DB time                       14429733397436
26   43.190624 DB time                       14429733397436
27   43.190708 DB time                       14429733397436
28   43.190793 DB time                       14429733397764
29   43.190877 DB time                       14429733397764
30   43.190961 DB time                       14429733397764
31   43.191045 DB time                       14429733397764
32   43.191132 DB time                       14429733397764
33   43.191221 DB time                       14429733397764
34   43.191309 DB time                       14429733397764
35   43.191392 DB time                       14429733397764
36   43.191475 DB time                       14429733402416
37   43.191558 DB time                       14429733402416
38   43.191641 DB time                       14429733403070
39   43.191725 DB time                       14429733403070
40   43.191809 DB time                       14429733403070
41   43.191893 DB time                       14429733403070
42   43.191976 DB time                       14429733403070
43   43.192060 DB time                       14429733403070
44   43.192144 DB time                       14429733403070
45   43.192230 DB time                       14429733403070
46   43.192315 DB time                       14429733403070
47   43.192400 DB time                       14429733403070
48   43.192484 DB time                       14429733403070
49   43.192569 DB time                       14429733403070
50   43.192654 DB time                       14429733403070
51   43.192737 DB time                       14429733407045
52   43.192821 DB time                       14429733407045
53   43.192904 DB time                       14429733407045
54   43.192985 DB time                       14429733407045
55   43.193069 DB time                       14429733407045
56   43.193152 DB time                       14429733407045
57   43.193237 DB time                       14429733407045
58   43.193321 DB time                       14429733407045
59   43.193404 DB time                       14429733407045
60   43.193488 DB time                       14429733407045
61   43.193574 DB time                       14429733407045
62   43.193660 DB time                       14429733408897
63   43.193743 DB time                       14429733408897
64   43.193828 DB time                       14429733408897
65   43.193912 DB time                       14429733408897
66   43.193994 DB time                       14429733408897
67   43.194076 DB time                       14429733408897
68   43.194160 DB time                       14429733409208
69   43.194283 DB time                       14429733409208
70   43.194378 DB time                       14429733409208
71   43.194465 DB time                       14429733409267
72   43.194551 DB time                       14429733409267
73   43.194635 DB time                       14429733409267
74   43.194719 DB time                       14429733409267
75   43.194801 DB time                       14429733409267
76   43.194884 DB time                       14429733409267
77   43.194967 DB time                       14429733409863
78   43.195052 DB time                       14429733409863
79   43.195136 DB time                       14429733410499
80   43.195245 DB time                       14429733410499
81   43.195329 DB time                       14429733410499
82   43.195412 DB time                       14429733410499
83   43.195495 DB time                       14429733410499
84   43.195577 DB time                       14429733410499
85   43.195660 DB time                       14429733410499
86   43.195743 DB time                       14429733410499
87   43.195825 DB time                       14429733410499
88   43.195909 DB time                       14429733410499
89   43.195991 DB time                       14429733410499
90   43.196074 DB time                       14429733410499
91   43.196156 DB time                       14429733410499
92   43.196244 DB time                       14429733410499
93   43.196326 DB time                       14429733410499
94   43.196409 DB time                       14429733410499
95   43.196493 DB time                       14429733411732
96   43.196577 DB time                       14429733411732
97   43.196661 DB time                       14429733411732
98   43.196745 DB time                       14429733411732
99   43.196826 DB time                       14429733411732
100  43.196910 DB time                       14429733412107
101  43.196992 DB time                       14429733412410
102  43.197076 DB time                       14429733412410
103  43.197158 DB time                       14429733412410
104  43.197245 DB time                       14429733412410
105  43.197327 DB time                       14429733412410
106  43.197410 DB time                       14429733412410
107  43.197493 DB time                       14429733412410
108  43.197575 DB time                       14429733412410
109  43.197658 DB time                       14429733412410
110  43.197741 DB time                       14429733412981
111  43.197824 DB time                       14429733412981
112  43.197907 DB time                       14429733412981
113  43.197990 DB time                       14429733412981
114  43.198072 DB time                       14429733413001
115  43.198156 DB time                       14429733413001
116  43.198247 DB time                       14429733413001
117  43.198330 DB time                       14429733413001
118  43.198414 DB time                       14429733413300
119  43.198499 DB time                       14429733413300
120  43.198581 DB time                       14429733413300
121  43.198665 DB time                       14429733413300
122  43.198748 DB time                       14429733413300
123  43.198830 DB time                       14429733413300
124  43.198913 DB time                       14429733413300
125  43.198997 DB time                       14429733414262
126  43.199081 DB time                       14429733414262
127  43.199165 DB time                       14429733414262
128  43.199252 DB time                       14429733414262
129  43.199336 DB time                       14429733414262
130  43.199419 DB time                       14429733414262
131  43.199503 DB time                       14429733414262
132  43.199586 DB time                       14429733414569
133  43.199669 DB time                       14429733414569
134  43.199752 DB time                       14429733414569
135  43.199834 DB time                       14429733414569
136  43.199918 DB time                       14429733414569
137  43.200000 DB time                       14429733414569
138  43.200083 DB time                       14429733414569
139  43.200166 DB time                       14429733414569
140  43.200252 DB time                       14429733414569
141  43.200334 DB time                       14429733415145
142  43.200418 DB time                       14429733415145
143  43.200504 DB time                       14429733415335
144  43.200588 DB time                       14429733415335
145  43.200672 DB time                       14429733415335
146  43.200756 DB time                       14429733415335
147  43.200838 DB time                       14429733415335
148  43.200921 DB time                       14429733415335
149  43.201003 DB time                       14429733415335
150  43.201086 DB time                       14429733415335
151  43.201169 DB time                       14429733415335
152  43.201259 DB time                       14429733415335

I would say that we can draw a few conclusions from this latest test above!

  • It is dangerous to look at one thing and assume something closely related will be the same!
  • The data in v$sys_time_model is also being updated in bursts
  • The data in v$sys_time_model is actually updated very, very frequently, at around 1/1000 of a second intervals
  • It might be that v$sess_time_model is being updated for sessions in some sort of round-robin fashion and v$sys_time_model each time the v$sess version is updated
  • You can spend a lot of time looking at really quite obscure and possibly pointless stuff
  • The reason I am losing weight is I keep skipping lunch.

Memory Changes Everything July 12, 2010

Posted by mwidlake in Architecture, performance.
Tags: , , ,

I’ve got this USB memory stick which I use to carry around my scripts, documents, presentations, Oracle manuals and enough music to keep me going for a few days. It is on an 8GB Gizzmo Junior and it is tiny. By tiny I mean as wide as my little finger, the length of a matchstick and about the same thickness of said matchstick. So small that I did indeed lose the damn thing for 6 months before I realised it had got trapped behind a credit card in my wallet.

It cost me ten British pounds about 15 months ago (less than most 4GB USB sticks seem to cost now, but then it is nothing more than the memory chip and connectors wrapped in plastic) and it highlights how cheap solid-state “storage” is becoming.

Connected to this, I was looking at buying a new PC this week and this machine comes with 10 USB slots, if you include the ones on the supplied monitor and stubs on the motherboard.
10 USB slots, 8GB gizzmo memory sticks… That would be 80GB of cheap and fast storage. Now get a few USB hubs and bulk-buy a few dozen cheap USB2 sticks and you could soon have a solid-state database of a few hundred GB for a thousand pounds. Then of course you can have fun seeing where the pinch-points in the system are (USB2 has a maximum speed per port and going USB3 right now is going to break that 1 grand barrier. But give it a year…).

This really started me thinking about when memory-based storage would take over from spinning disk as the best option for enterprise-level storage and my gut feeling is in about 5 years. I think it will be both technically possible and financially viable in much less than that, say as little as 2 years, but the cost of solid-state storage per MB will still be higher than disk by then but potentially much faster. A few considerations going through my mind were:-

  • Disk is getting a lot slower in relation to acreage. By this I mean that, for a single disc drive, capacity is doubling about every 18 months but seek time has hardly reduced in a decade and transfer rate (reading from the physical platters to the units buffer) is again almost stationary, at about 120MB/s for 10,000rpm disk and up towards 180 for those very expensive and noisy 15,000 rpm disks. Being a tad ridiculous to make the point, with modern 3TB disks you could build most Oracle database on one disc. Let’s make it two in a raid 10 configuration for redundancy. My point is, your 3TB database could well be being run right now, for real, across say 5 physical disks with a total sustainable physical throughput of around 500MB a second.
  • Solid state storage seems to be halving in price in more like 8-10 months.
  • IO subsystems are made faster by using RAID so that several physical discs can contribute to get towards the 300MB or so speed of the interface – but solid state is already that fast.
  • IO subsystems are made faster by building big caches into them and pre-fetching data that “might” be requested next. Oh, that is kind of solid state storage already.
  • Solid state storage, at least the cheap stuff in your USB stick, has the problem that you can only write to each bit a thousand or so times before it starts to get unreliable. But physical disk has exactly the same issue.
  • There are new methods of solid-state memory storage coming along – “New Scientist” had a nice article on it a few months ago, and these versions will be even higher density and more long-term reliable.
  • Seek time on solid-state memory is virtually zero, so random IO is going to be particularly fast compared to spinning disk.

Solid state memory needs less power, and thus less cooling, is silent, is potentially denser and is less vulnerable to temperature and humidity fluctuations. I can see it not needing to be kept in a specialist server room with the need for all that air con and ear defenders when you go in the room.
Just somewhere with normal air con and a lock on the door should suffice.
We do not need Solid State storage to match the size of current disks or even be as cheap to take over. As I have already pointed out, it is not acreage you need with physical disks but enough spindles and caches to make it fast enough in relation to the space. Further, we can afford to pay more for solid state if we do not need to keep it in such expensive clean-room like environments.

I can see that in a couple of years for a given computer system, say a mixed-workload order processing system, to support the storage needs we will have maybe a dozen solid-state chunks of storage, perhaps themselves consisting of several small units of memory in some sort of raid for resilience, all able to flood the IO channels into our processing server and the issue will be getting the network and io channels into the server to go fast enough. So don’t, stick all the storage directly into the server. You just got rid of half your SAN considerations.

I’m going to stop there. Partly because I have run out of time and partly because, in checking out what I am writing, I’ve just spotted someone did a better job of this before me. Over to James Morle who did a fantastic post on this very topic back in May. Stupid me for not checking out his blog more often. Jame also mentions that often it is not total throughput you are interested in at all but IOPS. That zero latency of solid-state memory is going to be great for supporting very high IOPS.

Friday Philosophy – The power of cooperation June 27, 2010

Posted by mwidlake in Friday Philosophy, Perceptions, performance.
Tags: , ,

Being the person responsible for the performance of an Oracle-based system can be an oddly lonely working experience. It can also be a very gregarious position as you get to meet a lot of people and discuss lots of different aspects of many systems – all systems seem to have performance issues and so people come along to you, hoping you can identify a “work faster” database setting for them.

But you are often the only person who specialises in Oracle performance. You generally need to be in an organisation that is very large or where performance is key to success for there to be justification for dedicating a person to the topic. To have more than one person dedicated to performance your organisations has to have a very strong focus on getting the best performance out of the Oracle and related systems {or really, really atrocious performance issues 🙂 }. So usually there is no one else around who is as experienced (or more so) as yourself to discuss such things over with or ask for a second opinion.

Which is why I am very lucky at the moment. I’m working in a team of oracle performance people. There are 2.5 of us (one is a manager with other responsibilities, so he only counts as half). Being able to say “Hey, Dave, What do you think of the wait times on scattered reads?” or “how in heck do I force this nested subquery on a view to use a hash join?” and get some other ideas is very valuable.

What is also interesting is how opinions and preferred techniques on tuning can be different and just as valid. As an example, last week I was working on a poorly performing statement. I was at home and it was the evening, so I was not communicating with the rest of the team. I managed to get the code down from 40 minutes to just under 20 by using a combination of a LEADING and USE_HASH hint. I sent the code back to the user. Only to see that within thirty seconds of each other my colleague Graeme had also sent the user a response, again getting the code down to around 20 minutes. Graeme had pulled a chunk of the code into a subquery factoring “WITH” clause and added cardinality hints. Totally different changes.

So Graeme and I then had a “philosophical” discussion about the different changes {“Mine is best” – “No! Mine is, yours is all bloated and complex”- “Your hint is less future-flexible!!!”}. Only joking, we actually discussed the changes and why we each chose what we did. Graeme commented that is showed that tuning was an art and not a science and I countered that it was a science, as we had both identified where we felt the execution plan could be improved but used different techniques to get there. The thing is, Oracle is so complex and has so many options to influence the optimiser that you have flexibility to chose different tools and techniques.

We had both identified the same main improvement but had each come up with different tweaks for later in the plan.

The end result was that we went with Graeme’s main plan {he is bigger than me} but we pulled in my tweak. That bought the execution time down to around 10 minutes, so about four times faster over all and twice as fast of either of us alone. That is one of the advantages of not working alone.

We also then discussed how we could get this code down to seconds with the use of either Materialized views or changing the process that generated the report to do so incrementally and store the daily results. Until one of us realised we had reached the boundary of compulsive tuning disorder. The report running in 10 minutes was enough improvement to satisfy the business, the report was only going to be run over the next couple of months, so spending a day or two re-working it further was going to be fun – but of no advantage to the business. We told each other to finish for the day. So another advantage of not working alone is that not only do you get more technical input but your help prevent each other losing sight of the overall aim.

It really does help to have two people working on the same area.

{There is a sneaky way of getting beyond being a lone performance specialist. If you are in an organisation long enough you can usually find some other idiot who is silly enough to want to learn more about performance and you can train them up. It does not take long before they know enough to start coming up with things you never thought of. Oracle is, after all, full of many ways to do the same thing and you can’t know it all}.

dbms_stats.set_table_stats “defaults” June 21, 2010

Posted by mwidlake in internals, statistics.
Tags: , ,
add a comment

What happens if you call dbms_stats.set_table_stats without passing in any of the values to set?

I know, why would you do it anyway? Well, I did so by accident. If I want to gather quick stats on a test table I execute something like:

exec dbms_stats.gather_table_stats(OWNNAME =>user,TABNAME =>’TEST_TAB1′)
If I am feeling generous I might state ESTIMATE_PERCENT too.

I was doing some testing work and was gathering stats and also setting stats manually. Then I started to see several of my test tables all had 2000 rows and were 100 blocks in size – at least according to the stats. I knew this was not possible. It turned out to be Cut ‘n’ Paste fingerf the trouble and I was issuing.

exec dbms_stats.set_table_stats(OWNNAME =>user,TABNAME =>’TEST_TAB1′)

If the table did not already have stats this set the stats on the table to default values of 2000 rows, 100 blocks. If the table already had stats then they were left as they were.

If those figures ring a bell, then that is because they are the default values used if a table has no stats and you have no dynamic sampling. See this table of defaults

Anyway, below is a little worked example of these default values being set. Oh, version is

TDB> drop table TEST1 purge
  2  /

TDB> select sysdate from dual;
21-JUN-2010 16:52


  4  /

TABLE_NAME                       NUM_ROWS     BLOCKS SAMPLE_SIZE
------------------------------ ---------- ---------- -----------

-- New table, no stats yet gathered, the columns hold null

-- Call dbms_stats.SET_TABLE_STATS, setting nothing
TDB> EXEC dbms_stats.set_table_stats(ownname => user,tabname => 'TEST1')


TABLE_NAME                       NUM_ROWS     BLOCKS SAMPLE_SIZE
------------------------------ ---------- ---------- -----------
TEST1                                2000        100        2000

-- The columns are set to defaults

-- Gather proper stats
TDB> exec dbms_stats.gather_table_stats(ownname =>user,tabname =>'TEST1',estimate_percent=>10)

TABLE_NAME                       NUM_ROWS     BLOCKS SAMPLE_SIZE
------------------------------ ---------- ---------- -----------
TEST1                              205430        956       20543

-- Now use SET_TABLE_STATS as intended, setting numrows to a value
TDB> EXEC dbms_stats.set_table_stats(ownname => user,tabname => 'TEST1',numrows=>5000)

TABLE_NAME                       NUM_ROWS     BLOCKS SAMPLE_SIZE
------------------------------ ---------- ---------- -----------
TEST1                                5000        956       20543

-- Try the naked SET_TABLE_STATS
TDB> EXEC dbms_stats.set_table_stats(ownname => user,tabname => 'TEST1')

TABLE_NAME                       NUM_ROWS     BLOCKS SAMPLE_SIZE
------------------------------ ---------- ---------- -----------
TEST1                                5000        956       20543

TDB> select sysdate from dual;
Any Key>
21-JUN-2010 16:52

-- And let us see how the stats have changed over the last few minutes.
TDB> select table_name,stats_update_time
  2  from dba_tab_stats_history
  3  where table_name = 'TEST1'
  4  /
Any Key>
TABLE_NAME                     STATS_UPDATE_TIME
------------------------------ --------------------------------
TEST1                          21-JUN-10 +00:00
TEST1                          21-JUN-10 +00:00
TEST1                          21-JUN-10 +00:00
TEST1                          21-JUN-10 +00:00

What is AUDSID June 17, 2010

Posted by mwidlake in internals.
Tags: , ,

If you look at v$session you see SID and SERIAL#, which most of us DBA-types know uniquely identify a session and allow you to kill it (*with the relevant permissions and knowledge you are not about to compromise a business process).

But what is AUDSID?

desc v$session
 Name                                      Null?    Type
 ----------------------------------------- -------- ------------
 SADDR                                              RAW(8)
 SID                                                NUMBER
 SERIAL#                                            NUMBER
 AUDSID                                             NUMBER
 PADDR                                              RAW(8)
 USER#                                              NUMBER
 USERNAME                                           VARCHAR2(30)
 COMMAND                                            NUMBER
 OWNERID                                            NUMBER
 TADDR                                              VARCHAR2(16)
 LOCKWAIT                                           VARCHAR2(16)
 STATUS                                             VARCHAR2(8)
 SERVER                                             VARCHAR2(9)
 SCHEMA#                                            NUMBER
 SCHEMANAME                                         VARCHAR2(30)
 OSUSER                                             VARCHAR2(30)

AUDSID is a unique identifier for the session and is used in sys.aud$ , as the SESSIONID column. It is the leading column of the only index on sys.aud$

IND_OWNER   IND_NAME           TAB_NAME           PSN       COL_NAME
----------- ------------------ ------------------ --------- ------------
SYS         I_AUD1             AUD$               1         SESSIONID
                                                  2         SES$TID

All audit records for a session are recorded with the same SESSIONID/AUDSID. Activity can be audited at session level and statement level. If audited at session level, only a single record for that audited thing is recorded in the log. This is updated to indicate any new audited actions that occur on that thing for the duration of the session, so it is important that Oracle can efficiently identify and update that record, as the impact of SQL AUDIT on database functionality wants to be kept to a minimum.

How does AUDSID relate to SID, SERIAL#, USERNAME etc? Let’s have a quick look.

select sid, serial# ,audsid ,username,OSUSER,schemaname
from v$session
       SID    SERIAL#     AUDSID USERNAME             OSUSER                    SCHEMANAME
---------- ---------- ---------- -------------------- ------------------------- --------------
       485       4745     165550 LEX_AAAAAEX          oracle                    LEX_AAAAAEX
       487      23712     165546 LEX_AAAAAEX          oracle                    LEX_AAAAAEX
       497      40388          0                      oracle                    SYS
       498      19269          0                      oracle                    SYS
       502      13362     165432 EAAAAA               govanii                   EAAAAA
       505        407     163821 LEXAAAAA             oracle                    LEXAAAAA
       506       6302 4294967295 SYS                  widlake                   SYS
       511      11702     165518 OAAAAA               backerella                OAAAAA
       512      17076     165490 LEX_AAAAAEX          oracle                    LEX_AAAAAEX
       518       9066     165507 OAAAAA               Manoled                   OAAAAA
       519       6956     163976 LEX_AAAAAEX          oracle                    LEX_AAAAAEX
       521      15272 4294967295 SYS                  widlake                   SYS
       523       4825     163975 LEX_AAAAAEX          oracle                    LEX_AAAAAEX
       536      63941 4294967295 SYS                  backerella                SYS
       524      19740     165548 LEX_AAAAAEX          oracle                    LEX_AAAAAEX
       525        850     165549 LEX_AAAAAEX          oracle                    LEX_AAAAAEX
       526      45112     165516 XXX                  backerella                XXX
       527      11086     163963 LEX_AAAAAEX          oracle                    LEX_AAAAAEX
       529       1662     163818 LEX_AAAAA            oracle                    LEX_AAAAA
       530      62788 4294967295 SYS                  widlake                   SYS
       537          1          0                      oracle                    SYS
       540          1          0                      oracle                    SYS

It does not. But it is interesting to note that this database does not have auditing enabled but the AUDSID is still populated.

It is also interesting that all SYS accounts have either an AUDSID of 0 or the same AUDSID, 4294967295 {which is 1 less than 4294967295…1024*1024*4}. I might come back to that shared AUDSID at a later date…

All the “normal” AUDSIDs are unique and about the same value, 163900 to 165560.

If you look at sys.aud$ the SESSIONID increments as time goes on. By this I do not mean it is in order of SESSIONID, it is not, but as time goes by the SESSIONID increments. That sounds a lot like what you would see with a sequence to me. And in fact that is because the value IS from a sequence:

select * from dba_sequences where sequence_name = 'AUDSES$'

--------------- --------------- ---------- ---------- ------------ ----- - ---------- -----------
SYS             AUDSES$                  1 2000000000            1 Y     N      10000      173695

So the AUDSID is a unique value that comes from a sequence and is used in the audit trail. All SYS accounts have the same AUDSID or have it set to zero.

Why do many sessions have the AUDSID set to zero? Because they are internal Oracle processes:

select sid, serial# ,audsid ,username,OSUSER,schemaname,program
from v$session
where AUDSID = 0
----- ---------- ---------- ------------ ------------ ------------ ----------------------------
  506       6887          0              oracle       SYS          oracle@db25.eng.ham.uk. (q001)
  526      45682          0              oracle       SYS          oracle@db25.eng.ham.uk. (q002)
  531         91          0 YYYYY        oracle       DLPP1        oracle@db25.eng.ham.uk. (J000)
  533          3          0              oracle       SYS          oracle@db25.eng.ham.uk. (CJQ0)
  537          1          0              oracle       SYS          oracle@db25.eng.ham.uk. (q000)
  540          1          0              oracle       SYS          oracle@db25.eng.ham.uk. (QMNC)
  546          1          0              oracle       SYS          oracle@db25.eng.ham.uk. (MMNL)
  547          1          0              oracle       SYS          oracle@db25.eng.ham.uk. (MMON)
  548          1          0              oracle       SYS          oracle@db25.eng.ham.uk. (RECO)
  549          1          0              oracle       SYS          oracle@db25.eng.ham.uk. (SMON)
  550          1          0              oracle       SYS          oracle@db25.eng.ham.uk. (CKPT)
  551          1          0              oracle       SYS          oracle@db25.eng.ham.uk. (LGWR)
  552          1          0              oracle       SYS          oracle@db25.eng.ham.uk. (DBW0)
  553          1          0              oracle       SYS          oracle@db25.eng.ham.uk. (MMAN)
  554          1          0              oracle       SYS          oracle@db25.eng.ham.uk. (PSP0)
  555          1          0              oracle       SYS          oracle@db25.eng.ham.uk. (PMON)

The activity of PMON, SMON, CKPT is not audited of course. And job schedulers doing nothing do not have their activity audited either.

Oh, you remember I said AUDSID was unique? Well, check back at the sequence. It wraps and has a maximum value of 2,000,000,000. It is going to start repeating values after 2 billion sessions. Make a note to ensure you have cleaned up your audit trail before you hit your 2 billion and first session…

INTERNAL_FUNCTION() Impact April 21, 2010

Posted by mwidlake in internals, performance.
Tags: , ,

You may occasionally see something like the following in the filter predicates of an Explain Plan:


What is INTERNAL_FUNCTION, what is the use of INTERNAL_FUNCTION() in Oracle explain plans and why am I putting things like ORACLE INTERNAL_FUNCTION partition exclusion and meaning of internal_function performance almost as random strings in this blog?

Well, I want to get the hit rate up for this post as when I did a quick search on INTERNAL_FUNCTION the current top hit is This unfortunately misleading posting that tells you “It is not a significant overhead in SQL execution performance”. This internal function IS potentially a significant overhead in SQL Execution performance. You can’t add comments to the above site {just book a course or order a book}, so I’m having to create my own page…

INTERNAL_FUNCTION is, as the above posting says, where oracle does an implicit data conversion. In the shown case, as well as in my example in this blog and in most cases I have seen, from timestamp to date.

Why is it an issue? Because it is a function on a column and as such it can have a disasterous impact on sql execution performance.

This code below ( which has most select columns removed for simplicity and to protect the innocent) was performing very poorly, taking over 30 minutes to complete. This is all on Enterprise edition of Oracle.

SELECT trans0_.ID as ID3_1_
, round(transfers0_.AMOUNT,2) as formula10_0_
FROM VW_PMTT trans0_
WHERE  (trans0_.MODIFIED_DT between :p0 AND :p1
        AND trans0_.PERS_ID = :p2)
and transfers0_.ACCOUNT_ID=:p3

This is the plan and the significant filter/access predicates:

Plan hash value: 3261981189
| Id  | Operation                         | Name                 | Rows  |Bytes| Cost   |Pstart|Pstop|
|   0 | SELECT STATEMENT                  |                      |       |     | 67841  |      |     |
|   1 |  SORT ORDER BY                    |                      |     8 | 920 | 67841  |      |     |
|   2 |   VIEW                            | VW_PMTP              |     8 | 920 | 67840  |      |     |
|   3 |    UNION-ALL                      |                      |       |     |        |      |     |
|*  4 |     FILTER                        |                      |       |     |        |      |     |
|   5 |      NESTED LOOPS                 |                      |     4 | 280 | 28490  |      |     |
|*  6 |       TABLE ACCESS BY INDEX ROWID | W_TBLP               |     1 |  14 |     3  |      |
|*  7 |        INDEX UNIQUE SCAN          | PK_W_TBLP            |     1 |     |     2  |      |     |
|   8 |       PARTITION RANGE ALL         |                      |     4 | 224 | 28487  |    1 | 1580|
|*  9 |        TABLE ACCESS BY LOC INd RID| W_PAYMENT            |     4 | 224 | 28487       1 | 1580|
|* 10 |         INDEX RANGE SCAN          | IDX_W_P_TRANS        | 66936 |     |  3345  |    1 | 1580|

Predicate Information (identified by operation id):

   6 - filter("PLY"."ACCOUNTID"=:P3)
   7 - access("PLY"."PERS_ID"=:P2)

As you can see, the view is translated into a simple two-table join {and unioned with a second two-table join, which I removed again for clarity, but if anyone wants the full plan and statement, email me} where:

  • one table is accessed on a unique index (via bind variable P2).
  • Row filtered for P3
  • This table is then joined to a Partitioned table in a nested loop 
  • The table being accessed via a local index range scan.

At first glance, the plan may look fine, but Look at the pstart and pstop. 1 to 1580. That is every partition in the table.
The predicate information shows that the INTERANAL_FUNCTION(“PT”.”MODIFIED”) column is being compared to P0 and P1.

The partition key on the table is:-

Name of Table : W_PAYMENT

------------------------------ ---------- --- -----------
USER.W_PAYMENT                 TABLE       1 MODIFIED

ie the column that is being flitered by. Why no partition pruning?

Partition pruning is not occuring because of the “INTERNAL_FUNCTION” being applied to that column. The CBO is not able to understand how the result of a function will match to the partition values of the column. Not even it’s own, internally used function 😛

For the above, the “date” bind variables were defined as

p0 timestamp := to_timestamp(’01-FEB-2010 00:00′,’DD-MON-YYYY HH24:MI’);
p1 timestamp := to_timestamp(’03-FEB-2010 00:00′,’DD-MON-YYYY HH24:MI’);

I then altered the code such that the bind variables were defined as dates.

p0 date := to_date(’01-FEB-2010′,’DD-MON-YYYY’);
p1 date := to_date(’03-FEB-2010′,’DD-MON-YYYY’);

The plan and significant predicates were now:

Plan hash value: 346162108
| Id  | Operation                         | Name                 | Rows  |Bytes| Cost   |Pstart|Pstop|
|   0 | SELECT STATEMENT                  |                      |       |     | 67841  |      |     |
|   1 |  SORT ORDER BY                    |                      |     4 | 460 | 67841  |      |     |
|   2 |   VIEW                            | VW_PMTP              |     4 | 460 | 67840  |      |     |
|   3 |    UNION-ALL                      |                      |       |     |        |      |     |
|*  4 |     FILTER                        |                      |       |     |        |      |     |
|   5 |      NESTED LOOPS                 |                      |     2 | 140 | 28490  |      |     |
|*  6 |       TABLE ACCESS BY INDEX ROWID | W_TBLP               |     1 |  14 |     3  |      |     |
|*  7 |        INDEX UNIQUE SCAN          | PK_W_TBLP            |     1 |     |     2  |      |     |
|   8 |       PARTITION RANGE ALL         |                      |     2 | 112 | 28487  |  KEY | KEY |
|*  9 |        TABLE ACCESS BY LOC INd RID| W_PAYMENT            |     2 | 112 | 28487  |  KEY | KEY |
|* 10 |         INDEX RANGE SCAN          | IDX_W_P_TRANS        | 66936 |     |  3345  |  KEY | KEY |

Predicate Information (identified by operation id):

   6 - filter("PLY"."ACCOUNTID"=:P3)
   7 - access("PLY"."PERS_ID"=:P2)
   9 - filter(("PT"."TYPE"=1184769 AND "PT"."MODIFIED">=:P0 AND "PT"."MODIFIED"<:P1

The expected rows and bytes drop but the key difference in the plan is Pstart and Pstop are now KEY KEY. ie the CBO will evaluate at run time the values coming back from the Table access of W_TBLP and partition prune the access of the table.

In this situation, the original query took over 30 minutes to return. The new code took 18 seconds.

I tried one final test. I set the bind variables to be varchar2 so that implicit data conversion would occur:
p0 varchar2(20) := ’01-FEB-2010 00:00′;
p1 varchar2(20) := ’03-FEB-2010 00:00′;

With VARCHAR2 the CBO was able to do the implicit conversion without the INTERNAL_FUNCTION and the plan/filter predicates were exactly as for native dates.

It would seem this issue with INTERNAL_FUNCTION blights timestamp to date implicit data conversion but not some other implicit data conversions. Without testing further, I can’t be sure which.

This is a documented feature by Oracle. Unusually (and much to their credit) Oracle even document the potential negative impact of this particular construct with timestamps and dates. See this section in chapter 5 of the Oracle 10g data warehousing guide.

I better balance the fact I linked to an inaccurate posting on the subject with a few good ones.
This is a nice one from McLaud on the OraFAQ site where he has come across it as a problem and with help shows how it is due to implicit conversion of a timestamp from Hibernate.
This blog on the issue arising with Java is a nice example of the fix (use a date to hold a date, not a timestamp).

Finally, this link is to an older but very, very valid rant by Tom Kyte about implicit data conversion and using the correct data type for your data. {Don’t go using numbers or varchars to hold dates, just don’t, and similarly you should avoid implicit data conversion}.

Performance Tipping Points April 13, 2010

Posted by mwidlake in performance, statistics.
Tags: , ,
add a comment

This week I came across a nice example of a performance tipping point. This is where Everything is OK until you reach a point where it all quickly cascades to Not OK.

The below shows the timings for a regulalry run “Alert”. Four times an hour we want to know if something quite important has happened on the system.

---- - ------------- ------------- --------- ------------- -------------------- -----------------
292 S 0410 03:06:00 0410 03:06:31 .000359 0410 03:06:00 20100410 02:51:00 20100410 03:06:00
292 S 0410 03:21:00 0410 03:21:35 .000405 0410 03:21:00 20100410 03:06:00 20100410 03:21:00
292 S 0410 03:36:00 0410 03:36:38 .000440 0410 03:36:00 20100410 03:21:00 20100410 03:36:00
292 S 0410 03:51:00 0410 03:51:33 .000382 0410 03:51:00 20100410 03:36:00 20100410 03:51:00
292 S 0410 04:06:00 0410 04:06:28 .000324 0410 04:06:00 20100410 03:51:00 20100410 04:06:00
292 S 0410 04:21:00 0410 04:21:44 .000509 0410 04:21:00 20100410 04:06:00 20100410 04:21:00
292 S 0410 04:36:00 0410 04:36:27 .000313 0410 04:36:00 20100410 04:21:00 20100410 04:36:00
292 S 0410 04:51:00 0410 04:51:34 .000394 0410 04:51:00 20100410 04:36:00 20100410 04:51:00
292 S 0410 05:06:00 0410 05:06:44 .000509 0410 05:06:00 20100410 04:51:00 20100410 05:06:00
292 S 0410 05:21:00 0410 05:21:43 .000498 0410 05:21:00 20100410 05:06:00 20100410 05:21:00
292 S 0410 05:36:00 0410 05:37:01 .000706 0410 05:36:00 20100410 05:21:00 20100410 05:36:00
292 S 0410 05:51:00 0410 05:52:21 .000938 0410 05:51:00 20100410 05:36:00 20100410 05:51:00
292 S 0410 06:06:00 0410 06:08:09 .001493 0410 06:06:00 20100410 05:51:00 20100410 06:06:00
292 S 0410 06:21:01 0410 06:24:00 .002072 0410 06:21:01 20100410 06:06:00 20100410 06:21:01
292 S 0410 06:36:00 0410 06:40:12 .002917 0410 06:36:00 20100410 06:21:01 20100410 06:36:00
292 S 0410 06:51:00 0410 06:56:54 .004097 0410 06:51:00 20100410 06:36:00 20100410 06:51:00
292 S 0410 07:06:00 0410 07:13:17 .005058 0410 07:06:00 20100410 06:51:00 20100410 07:06:00
292 S 0410 07:21:00 0410 07:29:42 .006042 0410 07:21:00 20100410 07:06:00 20100410 07:21:00
292 S 0410 07:36:00 0410 07:47:48 .008194 0410 07:36:00 20100410 07:21:00 20100410 07:36:00
292 S 0410 07:51:00 0410 08:08:07 .011887 0410 07:51:00 20100410 07:36:00 20100410 07:51:00
292 S 0410 08:08:07 0410 08:29:43 .015000 0410 08:08:07 20100410 07:51:00 20100410 08:08:07
292 S 0410 08:29:43 0410 08:50:10 .014201 0410 08:29:43 20100410 08:08:07 20100410 08:29:43
292 S 0410 08:50:10 0410 09:22:28 .022431 0410 08:50:10 20100410 08:29:43 20100410 08:50:10
292 S 0410 09:22:28 0410 10:27:11 .044942 0410 09:22:28 20100410 08:50:10 20100410 09:22:28
292 S 0410 10:27:11 0410 12:57:16 .104225 0410 10:27:11 20100410 09:22:28 20100410 10:27:11
292 F 0410 12:57:16 0410 14:50:26 .078588 0410 12:57:16 20100410 10:27:11 20100410 12:57:16
292 F 0410 14:50:26 0410 16:49:42 .082824 0410 14:50:26 20100410 10:27:11 20100410 14:50:26
292 F 0410 16:49:42 0410 19:06:10 .094769 0410 16:49:42 20100410 10:27:11 20100410 16:49:42

The first half dozen records show the program ID 292 running in half aminute or so. It varies a little, from just under 30 seconds to 45 seconds. Each run kicks off 15 minutes after the previous and, if you check the START_PARAM and END_PARAM on the right of the listing, you can see that the report runs for the previous 15 minutes. ie the period since the last run ( including the running time of the last itteration).

Then, at 05:36 something happens. The execution takes a full minute. The next run takes 1 minute and 20 seconds. The next run takes over two minutes and each following execution takes a little longer and longer.

At 07:51, 1.5 hours later, something significant happens. The execution takes more than 15 minutes to run. This report is run every 15 minutes. I’m sure you can appreciate that this is a problem. What happens now depends on your architecture.

If you have a scheduler that simply kicks off the report every 15 minutes, at the next scheduled time (08:06 in my case) a new execution will start and you will have two version of the report running for a couple of minutes. As time progresses this overlap increases. When the run time reaches 30 minutes you will now start having 3 reports running at the same time. And soon you will get three, four, five etc version running at the same time. Depending on the number of CPUs and IO bandwidth of your system, how long will it be until it will be doing nothing but running this code?
If the code is doing somthing more than simply selecting data, the chance of the concurrent versions competing for locks or resources is high and can mean that the slow-down of the code escalates rapidly and in a very, very short time, your system is on it’s knees.

In our case, we have some protection against this. The scheduler detects that the previous version has not finished and it waits until it has done so before starting the next execution. so the next execution is delayed. In my example the 08:06 execution is delayed until 08:08, 2 minutes late.

We still have two possible situations. The report could simply continue to slow down at the previous rate and just be delayed longer and longer. In my example it has been slowing down at around 1 minute 20 seconds per run for the last 5 runs.

However, The next execution is further delayed until 08:29 – which is 6 minutes beyond the 15 minute window from 08:08. This latets run of the code is 5 minutes 30 seconds slower than the previous execution, not 1 minute 20 seconds slower. We still have a tipping point. If you remember, this report runs to cover the period since the last execution. As the last execution took more than 15 minutes, the next run has to report on more than 15 minutes. 21 minutes in this case.

The slow-down now rapidly escalates due to not just whatever was slowing the report down initially but also the growing reporting period.

Just 5 runs and 5 hours later, at 12:57:16, the report fails. It now runs for so long that it cannot recreate the data as it looked at the start of the run and we get snapshot-too-old errors. In the listing you just see the second column go from S to F.

It is game over. Thankfully only one copy of the report is running at any time (A version is kicked off as soon as the previous one fails) and so, worst case, is using only one process on the box, one CPU maximum and the IO that one thread can demand.

What was the root cause? Time and statistics and data volume.
Statistics because the job that collects statistics failed earlier.
Time because at 05:50 or so, the run window of 15 minutes was far enough out of the range of known date-times for a partition that the CBO decided only a few rows would be found and swapped to a nested-loop execution.
Data volume as data was coming in quicker as the day progressed and the nested loop plan performance was highly susceptible to increases in data volume.

For an excellent posting on why code suddenly changes it’s execution plan when nothing has happened but time passing, seethis excellent post and comments on the topic on Richard Foote’s blog

{And isn’t his blog so much skinnier than mine? 🙂 }

Friday Philosophy – I killed a presentation April 1, 2010

Posted by mwidlake in Friday Philosophy, performance, statistics.
Tags: , , ,

Well, this week saw the latest Management and Infrastructure SIG. I won’t go into the SIG meeting itself just now, maybe later this weekend. If you want an opinion on it, you might like to see what Graham Oaks said.

Being Chair of the SIG I have to find presenters. Being also one of those rare people who enjoys presenting, I know I can always tap one person to present. Me. {If only I could sing, play an instrument, act, do impression, put both my ankles behind my ears(*) or anything that was of any general interest, getting an audience would be so much easier}.

I decided to tackle a topic I have tried before and which I previously did a very poor show of: “General principles on gathering database statistics”.
It is not a title to strike joy into the soul, I know, but it is really important to getting good and reliable performance out of any modern Oracle database. Even other Oracle DBA types tend to find the topic worthy but boring, but I seem to have been forced to know a lot about it and I’ve become sadly very passionate about it.

So, I tried again. I mean, how long should it take to describe the issues with database statistics and the general principles to gathering them? I took my old presentation and stripped out all code, all SQL syntax, all listing of tables and got it down to describing the process. Then I started adding the bits and pieces you need to know to get this half-right. 132 slides. Hmmmm

I tried again and took it higher level. 48 slides. I can do 48 slides in something between 45 minutes and an hour. I was ready.

I had the last presentation slot of the day. I figure if I am going to foist myself on the audience, they should have the right to leave before I start, without missing any of the good stuff. I had 3/4 of the day’s attendees still with me when I started.

I think I killed my audience. Actually, I know I killed my audience. I lost a few as the slot progressed {I kidded myself they had to get back to the office/catch a train} and I made the fatal mistake of not shutting up when I had done 45 minutes. You see, there was all this important stuff I had not mentioned yet! Thankfully, I had a friend in the second row and when I saw him lose the will to live, I stopped. I’d been at it for 70 minutes. What we really needed was the Chair to shut me up but I was the chair.

Anyway, I came away from the talk with two realisations.

  • The topic of gathering database statistics is too large to cover in one presentation session.
  • It really is very boring, even when you are passionate about it.

I think I have no choice but to make this a blog thread (as I threatened to do so about 1 month ago). But I have also promised myself to finish off SQL Audit before I start a new thread and that was 3 months ago.

So I think I am going to have to put some time into this blogging lark. It’s OK though, I seem to go on about database statistics so much that even my cat is staring to avoid me. I’ll just stop watching Star Trek and drinking wine in the evenings and switch to preparing technical blogs. And drinking wine of course.

(*) Up until 2007 I could do this. I get out more now. Despite the stats fixation

Which Hints to Use March 18, 2010

Posted by mwidlake in performance.
Tags: , ,

When tuning Oracle SQL code, which are the best Hints to use? Are there good and bad Hints? I believe so.

Firstly, I wish Oracle had called Hints something else. As many have commented, Hints are actually directives to the CBO. If the Hint is syntactically correct and can be applied, it will be applied. If you include a “USE_INDEX” Hint to tell the CBO to use an existing index then, no matter how crazy it would be to do so, the CBO will use the hint if doing so does not prevent the query from being logically possible.

That leads onto one of the reasons I do not actually like Hints {and these reasons lead to my definition of good and bad hints}. So long as the Hint remains possible, it will be followed. It has become a vital part of the logic of the code but Hints are rarely reviewed.
Another reason, often mentioned, is that if new functionality is introduced to the Oracle Optimizer, Hints may prevent it being used {eg you might have hinted the use of an index but now Oracle can convert two other, less selective B-tree indexes into bitmaps and merge them together to get a far more selective path – but the Hint forces the use of the original index}.
The above is possible but is rare compared to a far more common issue – You hinted a specific index be used, but if you now create another, more suitable index for that sql statement, the new index will not be used. The hinted one will still be used. Similarly If you drop the index that the Hint references, now the Hint is invalid and the CBO will chose a new access path. You are given no warning when you drop an index that hints reference {and it would be very tricky for Oracle to do this for you in a reliable way}.

A final problem with Hints is that the access path may need to change as data volumes and relationships change. When there are very few rows in a driving table, a nested loop access path may make sense. But if that driving table grows and has many more rows in it, then a hash join becomes more efficient. A Hint can fix this path.

I prefer Hints that give the CBO more information but allow it to still choose the path and to vary as data volumes change.

I like the DYNAMIC_SAMPLING Hint as it is generally just telling oracle to look harder {more intelligently} at the statistical information, at the cost of spending a little longer on the parse . Most systems have OPTIMIZER_DYNAMIC_SAMPLING set at 1 or 2 so by default tables with no stats will have stats gathered. Hinting at level 3 and 4 instructs the CBO to verify estimate guesses for predicates it has made and check correlation between rows. It is probably the Hint I am most happy using.

In version 10.1 I encountered lots and lots of issues with the CBO trying to unnest and merge elements of the SQL into the main body of the query. And getting it wrong. Though it is potentially stopping the CBO from examining useful access paths, I do use NO_MERGE and NO_UNNEST quite often and I “like” them as it leaves the rest of the decisions up to the optimizer. You are basically saying “leave that subquery alone and satisfy it in isolation”. I still encounter lots of such issues on 10.2, but I also use UNNEST more often, to push a subquery into the body of the code.

I am more happy using a CARDINALITY hint than USE_NL or USE_HASH as the CARDINALITY hint is self documenting (it says exactly what correction {or lie} you are giving the optimiser). A USE_NL Hint is enforcing an access path and not really saying why.

If specific hints are going to be used, I like to be very specific. USE_HASH should state both tables that are to be hashed together {It might even be that modern versions of Oracle insist on the form USE_HASH (taba tab) and not just USE_HASH (taba), as I never use the less specific Hint}.
{ NB see comment by Jonathan Lewis as to why I am utterly wrong on this one – USE_HASH basically says “Use a hash join on the tables listed” but does not force a hash between the two stated tables}.

I only Hint specific use of an index if I can’t fix the problem with better gathered stats. I don’t LIKE adding INDEX Hints, even though they are commonly used and easy to understand. For the reasons stated above, I do not like ordering the CBO to use one of the currently existing indexes.

I really do not like using the RULE hint. In Version 9 it was an uphill struggle to get people to not use it as it so often fixed the immediate problem and, of course, oracle used it so much for internal SQL (and still do in 10 and even in 11). How many current DBAs and Developers know what the Rule Based Optimizer rules are? {hands DOWN you over-40 lot}. Using the RULE hint is bordering on homeopathy for databases. It seems to work, you have no idea why and, in reality, it may well be doing nothing, as you are using a feature of Oracle that is incompatible wiht the RBO.

I am very, very uncomfortable about some hints. The bypass_ujvc hint is one of them. It basically tells Oracle it can do a MERGE INTO statement without having the unique constraint in place to support the where clause on the target table that allows it to work reliably. You are telling the optimizer “just trust me”. IE you can lie horribly to the CBO.

All in all, I try and get the stats right rather than hint. I’ll spend ten times as long trying to understand and fix (if I can) why the estimated costs and cardinalites in an Explain Plan are wrong than slapping in an INDEX Hint. I will use Hints if I can’t fix the plan via the stats, but I try and use the more generic Hints. I know from experience that fixing the stats {or at least understanding why I can’t} fixes more code than adding one hint to one SQL statement.

A good rule of thumb is, if the Cardinality is accurate for a step, the plan will be acceptable. {This is a Rule of thumb, not a cast-iron truth!}.