Sometimes a program you have written just doesn't run fast enough. In this article I'm going to look at how (and how not) to approach such problems along with some tips for speeding things up in various cases.
Stop Right There!
The biggest mistake people make when trying to improve the performance of their programs is jumping straight in and trying to do it without first understanding where the bottleneck lies. Often the real performance issue may not lie where you think it does.
Tools for analyzing program performance are called profilers. They will tell you which functions - and sometimes even which lines - of the program are taking up what fraction of the execution time. This is highly valuable information and can save you a lot of wasted time. Without it, you might set about optimizing a function or method that only accounts for a tiny fraction of the execution time anyway.
For example, if you can make a function run in half the time it used to, but it only took 1% of your program's execution time, then you've made the program maybe half a percent faster. If you're sufficiently aware of a performance problem to be hunting one down, getting half a percent faster isn't really going to help matters. Discovering that a given function contributes 20% of the execution time and getting the same kind of gain out of it, on the other hand, means your program overall gets 10% faster - still a modest gain, but at least one that's likely to be measurable.
I'm not writing this article targeted at any particular language, but profiling tools are out there for pretty much every mainstream language there is. Search this site. Google for it. Ask around on forums or ask your programming friends for recommendations. You could save yourself a lot of wasted time by using one.
You Could Try Cheating
Sometimes you can rely on the human nature of your users to avoid actually doing any performance enhancements at all. An application feels slow to a user if they are sat there waiting and don't notice anything changing for a long time. If you have something that takes ten seconds to do, it is likely to feel a lot slower if the application just freezes up for ten seconds than if it displays a progress bar that increments a step every second or so.
This phenomenon is known as perceived speed. It's the reason that many large applications have splash screens: the user gets some feedback telling them the application really is starting up, while in the background it loads all of the stuff it needs to. In total you're actually making the whole thing take longer, since as well as loading the entire application you now have to load the splash screen and its pretty graphics too. It keeps users happy though, and makes the application feel faster. That can be enough.
Optimizing Database Operations
I am endlessly surprised by the number of times the same old trick works for making slow database queries run a lot faster: adding indexes appropriately.
An index on a field makes lookups based on it a whole lot faster. That means that if you are using the field in the WHERE or JOIN clause of an SQL statement, you stand to gain by adding an index on it - especially in the case of JOINs.
Here is a query from the RSS generating script on my personal site's blog.
SELECT blog.id, blog.dateposted, blog.title, blog.rendered
FROM blog
LEFT JOIN blog_comments ON blog.id = blog_comments.blogentryid
GROUP BY blog.id
ORDER BY blog.id DESC
It had worked just fine a few months back, but then one day I noticed that Facebook had stopped importing posts from my blog. At first I thought the script was in some kind of infinite loop, but eventually I realized it was just running really, really slowly. What had gone wrong?
It turns out that on old, old posts in my blog, automated spambots had posted some 12,000 spam comments. That meant that when I did the JOIN with the blog_comments table, the database now had to search through all of those rows, matching the blogentryid field - the foreign key to the blog posts table - against each blog post.
Guess what? I'd forgotten to put an index on the blogentryid field. I added the index, and suddenly the whole thing was running in around a second, rather than long enough to cause Facebook to time out on it. This is not the first time I have added appropriate indexes and seen incredible performance leaps: for one client I managed to bring a 3 minute long query down to a couple of seconds with such an approach. If your database queries are running slow, check your indexes before you start trying to tweak the query itself.
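The fix itself is a one-liner; something along these lines (the index name is arbitrary, and the exact syntax varies a little between database engines):

CREATE INDEX idx_blogentryid ON blog_comments (blogentryid);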
Some programs do relatively complex or just plain time consuming computations to produce results or reports. Into the time consuming category fall calls to databases. To get a result, you send a query - potentially over the network - which has to be interpreted by the database engine, which then gets the results and sends them back to the application - potentially over the network again - where they are taken and placed into the data structures the program is working with. That takes time.
If the data is not changing regularly or you can get away with showing slightly out of date data, you can improve performance by keeping copies of the results of recent queries around in memory. Then, the next time someone does the same query, you need not go to the database, or do the complex computation: you just return the result from the last time it was done.
This is known as caching. It is a trade-off: you are increasing memory usage of your application in order to decrease the time it takes, on average, to get a result. We do a lot of caching of things from databases here at Programmer's Heaven to keep the site running fast with some very, very modest hardware requirements. The small number of servers we have are kitted out with plenty of RAM, but hey, it's a lot cheaper than load balancing hardware.
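As a rough sketch of the idea in C#, here is a tiny cache keyed on the query text; RunQuery stands in for whatever slow database call or computation you already have, and the results are assumed to fit in strings:

Dictionary<string, string> cache = new Dictionary<string, string>();

string CachedQuery(string sql)
{
    string result;
    if (cache.TryGetValue(sql, out result))
        return result;        // cache hit: no database round trip at all
    result = RunQuery(sql);   // the slow part, only done on a cache miss
    cache[sql] = result;      // keep the result around for next time
    return result;
}

In a real application you would also want some way of expiring or limiting entries, so the cache neither grows forever nor serves data that is too stale.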
Improving Algorithmic Complexity
Time for today's theoretical computer science lesson. Bear with me, because this matters if you want to be able to understand and potentially improve the performance of any algorithms in your code.
The reason that adding indexes to database fields works so well is because you are turning a linear lookup - where the database engine has to look over every row of the table - into perhaps a logarithmic order one if it's using a b-tree for the index, or even a constant order one if it's using a hash table.
Let me explain what I mean by the word order, since some people may not be familiar. Suppose we have a problem of size n - in the database case, n would be the number of records in the table. The order of an algorithm tells us how the time it takes scales with the size of the problem it is working on.
Let's take a simplified problem: we want to find all rows where field X has a value of 10. If we scan every row of the table to find those that match the condition, then we have to look at all n rows. The time this takes scales linearly with the size of the problem - the number of rows - so we say the algorithm is O(n), read "order n".
Suppose we add an index to the table on the field X. This creates a tree data structure of the values in the field across all rows in the table. The notable thing about it is what happens when you consider any node in the tree. If we are trying to locate rows where the field has a value of 10, the value of the node will either be equal to ten, less than ten or greater than ten. If it is equal to ten, then this node tells us what rows in the table have a value of ten. If it is greater than ten, we know that the value must lie in the left subtree. If it is less than ten, we know we must look in the right subtree.
If the tree is balanced - that is, each node has subtrees of roughly equal size - then every time we visit one node in the tree, we halve the number of other nodes we have to consider. This halving of the problem at every step - or in fact cutting it down by any constant factor - gives us something known as logarithmic order, written O(log(n)). As n grows, so does the time taken, but the rate of growth gets slower and slower as n increases: a balanced tree over a million rows can be searched in around twenty steps, since 2^20 is roughly a million.
An Example Of Algorithmic Improvement
We are implementing an application for an intelligence agency. We have a large list of names of people that they are tracking, and constantly have details of financial transactions coming in. If we see a transaction with the name of somebody we are tracking on it, then we need to raise the alarm as quickly as possible.
We don't want to hit the database every time, since that involves connecting to the database, doing a query and getting a result set back. Instead, we decide to keep the list in memory. We're coding it in C#, so our first instinct is, "OK, it's a list of names, which are strings, so let's use the generic List class".
So we declare it and fill it up from the database:
List<string> WeHaveOurEyeOnYou = new List<string>();
// Database loading code here
And every time we want to do a lookup, we use the Contains method:
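// transaction.Name and RaiseAlarm stand in for the incoming transaction
// data and the alerting code, which aren't shown here.
if (WeHaveOurEyeOnYou.Contains(transaction.Name))
{
    RaiseAlarm(transaction);
}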
And it works.
Wanting to make good use of the theory we learned earlier, we think about the order of our algorithm. The Contains method does a linear scan through the list, so our algorithm is O(n). If the number of suspected terrorists we know about increases tenfold, then it takes ten times as long to do the check. Can we do better?
The List class also has a BinarySearch method. This relies on our list of names being sorted, so after our database loading code we do:
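// Sort once, up front, so that BinarySearch can be used for every lookup.
WeHaveOurEyeOnYou.Sort();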
This is an order O(n*log(n)) operation, but we only have to do it once, not for every lookup, so it doesn't affect the lookup time. We then change the conditional to read:
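// BinarySearch returns the index of the match, or a negative number if
// the name is not in the list.
if (WeHaveOurEyeOnYou.BinarySearch(transaction.Name) >= 0)
{
    RaiseAlarm(transaction);
}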
Given a sorted list, BinarySearch takes the name in the middle of the list and compares it to the name we are looking for. If it matches, we have our result. If not, we can forget about half of the list and repeat the process on the half that could still contain the name, until we either find a match or run out of names. Halving the problem every time gives us, as we learned earlier, logarithmic order - O(log(n)). The time taken now scales logarithmically with the number of names - a whole lot better than linearly.
That's a good improvement, but can we do any better? The answer is yes, but we're going to have to trade off some memory for it. There is a data structure that allows us to do constant time lookups - that is, the time it takes to do the lookup does not depend on the number of items at all! We write this as O(1). The data structure that does this is known as a hash table, and in the .Net class library one class that implements it is the generic Dictionary class.
How on earth does it achieve constant time lookups? We know that numerically indexing into an array is a constant time operation: no matter whether we are accessing element 10 or element 10,000, it takes the same amount of time - it's just a memory lookup. We can't index into an array using a string, but it turns out there are good ways to compute an integer value, based upon the characters in a string, that is:
- Within a given range
- Unlikely to be the same as the value calculated for a different string
This is known as hashing. We set the range of values to match the number of slots we have, and then use the number the hash generates as an index into an array holding the values.
It is not quite that simple in reality: we are only promised that collisions are relatively unlikely, not impossible, so the data structure also needs a way of handling two strings that hash to the same index.
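To make the idea concrete, here is a deliberately crude sketch of turning a name into an array index; real hash functions, including the one Dictionary uses internally, are considerably more careful about spreading values out:

int Bucket(string name, int bucketCount)
{
    int code = 0;
    foreach (char c in name)
        code = code * 31 + c;                  // mix each character into the running value
    return (code & 0x7fffffff) % bucketCount;  // force non-negative, squash into range
}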
This data structure is more complex than a simple list, and therefore takes more memory. However, we gain significant performance on large data sets when it comes to doing lookups. One quirk of the .Net Dictionary class is that it maps keys to values; in our case we don't care about the value, just whether the key is present. We can therefore declare our data structure like this:
Dictionary<string, bool> WeHaveOurEyeOnYou = new Dictionary<string, bool>();
We picked bool because it's small. The database loading code can then just add entries like this:
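// name is the string just read from the current database row; the value
// is a throwaway, since we only care whether the key is present.
WeHaveOurEyeOnYou.Add(name, true);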
And lookups become:
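if (WeHaveOurEyeOnYou.ContainsKey(transaction.Name))
{
    RaiseAlarm(transaction);
}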
You had also better sprinkle in some comments about why you used a Dictionary when a List was the obvious thing to do. Otherwise, some programmer may come along later and decide to refactor the performance enhancements they don't understand out of your code, as they wonder why on earth you aren't just using a List.
Algorithmic Improvements Are A Big Win
In the previous example, we made a huge improvement for the case where there are a lot of names. In general, algorithmic improvements will give you the biggest gains, so they are what you should be aiming for.
What algorithm to use is very problem specific, so I can't really give much in the way of general advice here. Certainly, you can try to analyze the algorithms and data structures that you are using to understand the algorithmic complexity (order) of the code you have written, though it can be genuinely hard to do such an analysis correctly at times. There are books of algorithms you can buy, and plenty of resources online too. I'm certainly no walking encyclopedia of algorithms, which is why I have a big thick book of them that I turn to when I'm looking for something better than the obvious way. Sometimes the better way is anything but obvious.
Broadly speaking, tasks come in two types: CPU bound and I/O bound. Those that are CPU bound need CPU time, and the only way you can really improve things here is if you have multiple CPUs or cores at your disposal. If so, look at whether you can spawn a couple of threads (no more than one per execution core) and do parts of the work in parallel.
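For a CPU bound job that splits cleanly into independent chunks, the general shape in C# might look like the sketch below; ProcessRange and items are made-up stand-ins for your own work and data:

Thread first = new Thread(() => ProcessRange(items, 0, items.Length / 2));
Thread second = new Thread(() => ProcessRange(items, items.Length / 2, items.Length));
first.Start();
second.Start();
first.Join();   // wait for both halves to finish
second.Join();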
The other type is I/O bound tasks. These depend on the network and disks, which are relatively slow. You might be able to take advantage of asynchronous I/O, where the operation returns immediately rather than blocking and reports back to you when the operation is complete. In the meantime, you can get on with doing other useful work.
The specifics here vary a lot with the situation, and the options available depend on the platform you are working on, so I won't go any further than raising parallelism and asynchronous I/O as things you can consider. Parallelism is not a silver bullet, though, and the gains are pretty much never what you might wish for. Certainly, splitting a CPU bound task that you previously did in a single thread across two threads is unlikely to come anywhere close to halving the overall execution time, and if the threads have to communicate or are accessing the same areas of memory, that will further limit the gains you can get.
There are many advantages to working in higher level, garbage collected languages over working in languages like C, where you are very close to the hardware but forced to manage memory yourself. However, there are times when rewriting the CPU bound parts that really do need to perform in C, rather than in some other language (especially a dynamic one like Perl or Python), is simply going to give a big performance win. If you are working in such a language and are really stuck, consider this. It should not be your first consideration, though.
If none of the above work, you can always try to micro-optimize. This involves looking at bits of code and finding cute ways to make them run that tiny bit faster. Doing this often involves knowing things about the underlying architecture.
For example, if I am working in C# and .Net, I may have a program that does a lot of method calls on objects instantiated from some class. If the class is not sealed (final in Java terminology), then a lot of the time the lookup of which method to call will be done using the v-table, since the object could in fact be an instance of a subclass and the method could be overridden. If I mark the class as sealed, the compiler knows at compile time which method will be called, and can emit code that avoids the vtable lookup. The problem is that in doing this optimization you have prevented anyone from extending the class in question, which may have been a bad idea. Additionally, it will give a very small gain.
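In code, that is a one-word change (the class name here is made up for illustration):

// Before: public class TransactionMatcher
public sealed class TransactionMatcher
{
    // methods unchanged - but nobody can subclass TransactionMatcher now,
    // and calls made through it can be resolved without the v-table lookup
}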
The side-effects of micro-optimization are often less clear (and sometimes completely impenetrable) code. Unless you are optimizing very tight loops for CPU bound tasks, this kind of approach probably isn't going to be much of a win.
Trying to optimize programs without first understanding why they are performing badly is commonly done, but it's likely to waste your time. You are far better off getting a profiler to give you some hard numerical data on the problem, so you can target your efforts. The biggest wins come from algorithmic improvements, caching and sometimes parallelization. Improving perceived speed can save you having to do any of them, though.
This article is far from the definitive guide to performance programming, and there are plenty of things I have not even touched on. I have instead tried to present some general techniques that I know from personal experience have actually delivered improvements, sometimes significant ones. I hope it has offered some guidance to anyone looking to speed up their sluggish programs.