
Friday Philosophy – When Not To Tune! Or Maybe You Should? May 20, 2022

Posted by mwidlake in Friday Philosophy, performance.

This week there was an interesting discussion on Twitter between myself, Andy Sayer, Peter Nosko, and several others, on when you do performance tuning. It was started by Andy tweeting the below.

As a general rule, I am absolutely in agreement with this statement and it is something I have been saying in presentations and courses for years. You tune when there is a business need to do so. As Andy says, if a batch process is running to completion in the timeframe available to it, then spending your effort on tuning it up is of no real benefit to the company. The company pays your salary and you have other things to do, so you are in fact wasting their money doing so.

Please note – I said “as a general rule”. There are caveats and exceptions to this rule, as I alluded to in my reply-in-agreement tweet:

I cite “Compulsive Tuning Disorder” in my tweet. This is a phrase that has been kicking about for many years; I think it was coined by Gaja Krishna Vaidyanatha. It means someone who, once they have sped up some SQL or code, cannot let it lie and has to keep trying and trying and trying to make it faster. I think anyone who has done a fair bit of performance tuning has suffered from CTD at some point, and I know I have. I remember being asked to get a management report that took over an hour to run down to a few minutes, so they could get the output when asked. I got it down to a couple of minutes. But I *knew* I could make it faster, so I worked on it and worked on it and got it down to under 5 seconds.

The business did not need it to run in 5 seconds, they just needed to be able to send the report to the requester within a few minutes of the request (they were not hanging on the phone desperate for this, the manager just wanted it before they forgot they had asked for it). All that extra effort resulted in was a personal sense of achievement and the ability to humble-brag about how much I had sped it up. And my boss being really very annoyed that I had wasted time on it (which is why it sticks out in my mind).

However, there are some very valid reasons for tuning beyond what I’ll call the First Order requirement of satisfying an immediate business need of getting whatever needs to run to run in the time window required. A couple of these I had not really thought about until the Twitter discussion. So here are some of the Second Order reasons for tuning.

1 Freeing Up Resource

All the time that batch job is running for hours it is almost certainly using up resources. If parallel processing is being used (and on any Exadata or similar system it almost certainly is, unless you took steps to prevent it) that could be a lot of CPU, memory, I/O bandwidth, even network. The job might be holding locks or latches. The chances are very high that the four-hour batch process you are looking at is not the only thing running on the system.

If there are other things running at the same time where there would be a business benefit to them running quicker, or they are being impacted by those locks, it could be that tuning those other things would take more effort (or even not be possible) compared to tuning the four-hour job. So tuning the four-hour job releases resources or removes contention.
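As a quick illustration of spotting that contention (a minimal sketch assuming an Oracle database and access to V$SESSION – not something from the Twitter thread), a query like this shows which sessions are currently stuck behind a blocker while the long batch runs:

-- Sessions currently waiting on another session's lock or latch.
-- If the long-running batch's SID keeps appearing as the blocker,
-- the rest of the system is paying for that four-hour run.
SELECT sid, username, event, blocking_session, seconds_in_wait
FROM   v$session
WHERE  blocking_session IS NOT NULL
ORDER  BY seconds_in_wait DESC;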

I had an extreme example of this when some id… IT manager decided to shift another database application on to our server, but the other application was so badly designed it took up all the available resource. The quick win was to add some critical indexes to the other application whilst we “negotiated” about its removal.

2 Future Proofing

Sometimes something needs tuning to complete in line with a Service Level Agreement – say, complete in 10 seconds. You get it done, only to have to repeat the exercise in a few months as the piece of code in question has slowed down again. And you realise that as (for example) data volumes increase, this is likely to be a recurring task.

It may well be worth putting in the extra effort to get the code not just to sub-10-seconds but to significantly less than that. Better still, spend the extra effort to solve the issue causing the code to slow down as more data is present. After all, if you get the code down to 1 second and it creeps up to 5, even though it is meeting the SLA the user may well still be unhappy – “Why does this screen get slow, then speed up for a few months, only to slow down again! At the start of the year it was appearing immediately, now it takes forevvveeeeer!”. End users MUCH prefer stable, average performance to performance that toggles between very fast and simply fast.

3 Impacting Other Time Zones

This is sort-of the first point about resource, but it can be missed as the people being impacted by your overnight batch are in a different part of the globe. When you and the rest of your office are asleep – in America, let’s say – other users in Asia are having a rubbish time with the system being dog slow. Depending on the culture or structure of your organisation, you may not hear the complaints from people in other countries or parts of the wider company.

4 Pre-emptive Tuning

I’m not so convinced by this one. The idea is that you look for the “most demanding” code (be it long-running, resource usage, or execution rate) and tune it up, so that you prevent any issues occurring in the future. This is not the earlier Freeing Up Resources argument – nothing is really slow at the time you do this. I think it is more pandering to someone’s CTD. I think any such effort would be far better spent trying to design performance into a new part of the application.

5 Restart/Recovery

I’d not really thought clearly about this one until Peter Nosko highlighted it. If that four-hour batch that was causing no issues fails and has to be re-run, then the extra four hours could have a serious business impact.

Is this a performance issue or is it a Disaster Recovery consideration? Actually it does not matter what you label it, it is a potential threat to the business. The solution to it is far more complex and depends on several things, one of which is whether the batch can be partially recovered or not.

When I design long running processes I always consider recovery/restart. If it is a one-off job such as initial data take-on I don’t put much effort into it. It works or it fails, and your recovery is to go back to the original system or something. If it is a regular data load of hundreds of GB of critical data, that process has to be gold-plated.

I design such systems so that they load data in batches, in loops. The data is marked with the loop ID, the ascending primary key values are logged, and the instrumentation records which stage it has got to. If someone kicks the plug out of the server or the batch hits a critical error, you can restart it; it will clean up any partially completed loops and then carry on from after the last completed batch. It takes a lot more effort to develop, but the pay-back can be huge. I worked on one project that was so out of control that the structure of the data was not really known and would change over time, so the batch would fail all the bloody time, and I would have to re-run it once someone had fixed the latest issue. The ability to restart with little lost time was a god-send to me – and the business, not that they really noticed.
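To make that pattern concrete, here is a minimal PL/SQL sketch of the shape such a loop can take. It is not my actual code – the control table BATCH_LOOP_CTL, the tables SRC_DATA and TARGET_DATA, the sequence BATCH_SEQ and the 10,000-row slice size are all made up purely to illustrate the restartable-loop idea:

DECLARE
  v_last_id   NUMBER;
  v_batch_id  NUMBER;
  v_rows      NUMBER;
BEGIN
  -- Restart handling first: throw away anything from loops that never reached
  -- 'DONE' (covers designs where a loop commits in sub-steps), then resume
  -- from the highest primary key the last completed loop recorded.
  DELETE FROM target_data
  WHERE  load_batch_id IN (SELECT batch_id FROM batch_loop_ctl
                           WHERE  status <> 'DONE');
  DELETE FROM batch_loop_ctl WHERE status <> 'DONE';
  COMMIT;

  SELECT NVL(MAX(last_id_loaded), 0) INTO v_last_id FROM batch_loop_ctl;

  LOOP
    SELECT batch_seq.NEXTVAL INTO v_batch_id FROM dual;
    INSERT INTO batch_loop_ctl (batch_id, status, started)
    VALUES (v_batch_id, 'RUNNING', SYSDATE);

    -- Load the next slice, marking every row with the loop it belongs to.
    INSERT INTO target_data (id, payload, load_batch_id)
    SELECT id, payload, v_batch_id
    FROM  (SELECT id, payload FROM src_data
           WHERE  id > v_last_id
           ORDER  BY id)
    WHERE  ROWNUM <= 10000;
    v_rows := SQL%ROWCOUNT;

    IF v_rows = 0 THEN
      ROLLBACK;   -- nothing left to load; discard the empty control row
      EXIT;
    END IF;

    SELECT MAX(id) INTO v_last_id
    FROM   target_data WHERE load_batch_id = v_batch_id;

    UPDATE batch_loop_ctl
    SET    status = 'DONE', last_id_loaded = v_last_id, finished = SYSDATE
    WHERE  batch_id = v_batch_id;
    COMMIT;       -- each completed loop is durable; a crash loses at most
                  -- the loop that was in flight
  END LOOP;
END;
/

Nothing from an incomplete loop survives a failure, so a restart simply cleans up and carries on from the last committed loop rather than from the beginning.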

Simply tuning up the batch run to complete more quickly, as Peter suggests, so that it can be blown away and repeated several times, is a pragmatic alternative. NB Peter also said he does something similar to what I describe above.

6 Freeing Up Cloud Resource

Again, I had not thought of this before, but Thomas Teske made the point that if your system is living in the cloud and you are charged by resource usage, you may save money by tuning up anything that uses a lot of resource. You are not satisfying a First Order business need of getting things done in time, but you are satisfying a First Order business need of reducing cost.

7 Learning How To Tune

I think this is greatly under-rated as a reason to tune something, and even to indulge in a little CTD when the First Order business need has already been met. It does not matter how much you read stuff by tuning experts; you are only going to properly learn how to performance tune, and especially how to tune in your specific environment, by practice. By doing it. The philosophy of performance tuning is of course a massive topic, but by looking at why things take time to run, coming up with potential solutions, and testing them, you will become really pretty good at it. And thus better at your job and more useful to your employer.

Of course, this increase in your skills may not be a First, Second, or even Third Order business need of your employer, but it could well help you change employer 😀.

8 Integration Testing

Again highlighted by Peter Nosko: we should all be doing integrated, continuous testing, or at least something as close to that as possible. Testing a four-hour batch run takes, err, more time than a 15-minute batch run. For the sake of integration testing you want to get that time down.

But, as I point out..

{joke – don’t take this tweet seriously}

As I think you can see, there are many reasons why it might be beneficial to tune up something even when the First Order business need is being met, because the Second Order benefits can be significant. I am sure there are some other benefits to speeding up that batch job that finishes at 4am.

I’ll finish with one observation, made by a couple of other people on the Twitter discussion (such as Nenad Noveljic). If you fix the First Order issue, when management is screaming and crying and offering a bonus to whoever fixes it (like that ever happens), you get noticed. Fixing Second Order issues they did not see, especially as they now may not even occur, is going to get you no reward at all.