jump to navigation

Performance Tipping Points April 13, 2010

Posted by mwidlake in performance, statistics.
Tags: , ,
trackback

This week I came across a nice example of a performance tipping point. This is where Everything is OK until you reach a point where it all quickly cascades to Not OK.

The below shows the timings for a regulalry run “Alert”. Four times an hour we want to know if something quite important has happened on the system.

ID S START_DATE END_DATE DUR TIME_RAN START_PARAM END_PARAM
---- - ------------- ------------- --------- ------------- -------------------- -----------------
292 S 0410 03:06:00 0410 03:06:31 .000359 0410 03:06:00 20100410 02:51:00 20100410 03:06:00
292 S 0410 03:21:00 0410 03:21:35 .000405 0410 03:21:00 20100410 03:06:00 20100410 03:21:00
292 S 0410 03:36:00 0410 03:36:38 .000440 0410 03:36:00 20100410 03:21:00 20100410 03:36:00
292 S 0410 03:51:00 0410 03:51:33 .000382 0410 03:51:00 20100410 03:36:00 20100410 03:51:00
292 S 0410 04:06:00 0410 04:06:28 .000324 0410 04:06:00 20100410 03:51:00 20100410 04:06:00
292 S 0410 04:21:00 0410 04:21:44 .000509 0410 04:21:00 20100410 04:06:00 20100410 04:21:00
292 S 0410 04:36:00 0410 04:36:27 .000313 0410 04:36:00 20100410 04:21:00 20100410 04:36:00
292 S 0410 04:51:00 0410 04:51:34 .000394 0410 04:51:00 20100410 04:36:00 20100410 04:51:00
292 S 0410 05:06:00 0410 05:06:44 .000509 0410 05:06:00 20100410 04:51:00 20100410 05:06:00
292 S 0410 05:21:00 0410 05:21:43 .000498 0410 05:21:00 20100410 05:06:00 20100410 05:21:00
292 S 0410 05:36:00 0410 05:37:01 .000706 0410 05:36:00 20100410 05:21:00 20100410 05:36:00
292 S 0410 05:51:00 0410 05:52:21 .000938 0410 05:51:00 20100410 05:36:00 20100410 05:51:00
292 S 0410 06:06:00 0410 06:08:09 .001493 0410 06:06:00 20100410 05:51:00 20100410 06:06:00
292 S 0410 06:21:01 0410 06:24:00 .002072 0410 06:21:01 20100410 06:06:00 20100410 06:21:01
292 S 0410 06:36:00 0410 06:40:12 .002917 0410 06:36:00 20100410 06:21:01 20100410 06:36:00
292 S 0410 06:51:00 0410 06:56:54 .004097 0410 06:51:00 20100410 06:36:00 20100410 06:51:00
292 S 0410 07:06:00 0410 07:13:17 .005058 0410 07:06:00 20100410 06:51:00 20100410 07:06:00
292 S 0410 07:21:00 0410 07:29:42 .006042 0410 07:21:00 20100410 07:06:00 20100410 07:21:00
292 S 0410 07:36:00 0410 07:47:48 .008194 0410 07:36:00 20100410 07:21:00 20100410 07:36:00
292 S 0410 07:51:00 0410 08:08:07 .011887 0410 07:51:00 20100410 07:36:00 20100410 07:51:00
292 S 0410 08:08:07 0410 08:29:43 .015000 0410 08:08:07 20100410 07:51:00 20100410 08:08:07
292 S 0410 08:29:43 0410 08:50:10 .014201 0410 08:29:43 20100410 08:08:07 20100410 08:29:43
292 S 0410 08:50:10 0410 09:22:28 .022431 0410 08:50:10 20100410 08:29:43 20100410 08:50:10
292 S 0410 09:22:28 0410 10:27:11 .044942 0410 09:22:28 20100410 08:50:10 20100410 09:22:28
292 S 0410 10:27:11 0410 12:57:16 .104225 0410 10:27:11 20100410 09:22:28 20100410 10:27:11
292 F 0410 12:57:16 0410 14:50:26 .078588 0410 12:57:16 20100410 10:27:11 20100410 12:57:16
292 F 0410 14:50:26 0410 16:49:42 .082824 0410 14:50:26 20100410 10:27:11 20100410 14:50:26
292 F 0410 16:49:42 0410 19:06:10 .094769 0410 16:49:42 20100410 10:27:11 20100410 16:49:42

The first half dozen records show the program ID 292 running in half aminute or so. It varies a little, from just under 30 seconds to 45 seconds. Each run kicks off 15 minutes after the previous and, if you check the START_PARAM and END_PARAM on the right of the listing, you can see that the report runs for the previous 15 minutes. ie the period since the last run ( including the running time of the last itteration).

Then, at 05:36 something happens. The execution takes a full minute. The next run takes 1 minute and 20 seconds. The next run takes over two minutes and each following execution takes a little longer and longer.

At 07:51, 1.5 hours later, something significant happens. The execution takes more than 15 minutes to run. This report is run every 15 minutes. I’m sure you can appreciate that this is a problem. What happens now depends on your architecture.

If you have a scheduler that simply kicks off the report every 15 minutes, at the next scheduled time (08:06 in my case) a new execution will start and you will have two version of the report running for a couple of minutes. As time progresses this overlap increases. When the run time reaches 30 minutes you will now start having 3 reports running at the same time. And soon you will get three, four, five etc version running at the same time. Depending on the number of CPUs and IO bandwidth of your system, how long will it be until it will be doing nothing but running this code?
If the code is doing somthing more than simply selecting data, the chance of the concurrent versions competing for locks or resources is high and can mean that the slow-down of the code escalates rapidly and in a very, very short time, your system is on it’s knees.

In our case, we have some protection against this. The scheduler detects that the previous version has not finished and it waits until it has done so before starting the next execution. so the next execution is delayed. In my example the 08:06 execution is delayed until 08:08, 2 minutes late.

We still have two possible situations. The report could simply continue to slow down at the previous rate and just be delayed longer and longer. In my example it has been slowing down at around 1 minute 20 seconds per run for the last 5 runs.

However, The next execution is further delayed until 08:29 – which is 6 minutes beyond the 15 minute window from 08:08. This latets run of the code is 5 minutes 30 seconds slower than the previous execution, not 1 minute 20 seconds slower. We still have a tipping point. If you remember, this report runs to cover the period since the last execution. As the last execution took more than 15 minutes, the next run has to report on more than 15 minutes. 21 minutes in this case.

The slow-down now rapidly escalates due to not just whatever was slowing the report down initially but also the growing reporting period.

Just 5 runs and 5 hours later, at 12:57:16, the report fails. It now runs for so long that it cannot recreate the data as it looked at the start of the run and we get snapshot-too-old errors. In the listing you just see the second column go from S to F.

It is game over. Thankfully only one copy of the report is running at any time (A version is kicked off as soon as the previous one fails) and so, worst case, is using only one process on the box, one CPU maximum and the IO that one thread can demand.

What was the root cause? Time and statistics and data volume.
Statistics because the job that collects statistics failed earlier.
Time because at 05:50 or so, the run window of 15 minutes was far enough out of the range of known date-times for a partition that the CBO decided only a few rows would be found and swapped to a nested-loop execution.
Data volume as data was coming in quicker as the day progressed and the nested loop plan performance was highly susceptible to increases in data volume.

For an excellent posting on why code suddenly changes it’s execution plan when nothing has happened but time passing, seethis excellent post and comments on the topic on Richard Foote’s blog

{And isn’t his blog so much skinnier than mine? 🙂 }

Comments»

No comments yet — be the first.

Leave a comment