This is Words and Buttons Online — a collection of interactive #tutorials, #demos, and #quizzes about #mathematics, #algorithms and #programming.

A cheap trick to speed up recursion in C++

Disclaimer. The trick I am about to show is not an example of good engineering. It depends on implementation-specific optimization and as such is unreliable. In some contexts, it may give you some noticeable performance gains, but in others, it may even lead to performance degradation.

However, quoting Donald Knuth, “in established engineering disciplines a 12% improvement, easily obtained, is never considered marginal; and I believe the same viewpoint should prevail in software engineering.”

Tail call optimization vs. no tail call optimization

C++ has tail call optimization. It's when a recursive function that ends with a self-call is replaced with a simple loop. Let me show you how it works.

C++ source code:

int sum_of_first(int n) {
    return (n == 1) ? 1 : n + sum_of_first(n - 1); // tail call
}

int main() {
    return sum_of_first(65536);
}

From the disassembly, you can see that there is no actual self-call in the “recursive” sum_of_first function. From the Valgrind report, you can also see why this is important.

Normally, every time you call a function, you have to store its return address on the stack. Even if the address remains the same. This means that if you call your function 65536 times in a nested way, you will be writing the same address 65536 times only to read it all back on returns. It is pure overhead. But if the self-call is a tail call, meaning that all the nested calls will eventually return to the same place, you don't have to do the calls at all. A simple loop will be enough.

Compilers can do that for you. They rewrite your code to minimize overhead whenever they can. But what if they can't?

C++ source code:

int sum_of_first(int n) {
    if(n == 0)
        return 0;
    auto temp_sum = sum_of_first(n - 1); // not a tail call
    return n == 1 ? 1 : (temp_sum + n);
}

int main() {
    return sum_of_first(65536);
}

When the self-call is not a tail call, and can't even be turned into one, the compiler skips the TCO. And yes, you get all the overhead: all the needless reads and writes.

The idea

What if instead of a self-call, we instantiated a copy of the original function and called that instead? The copy would then call the original. In theory, this allows the compiler to inline the copy, so instead of 65536 nested calls, we will only have 32768. Let's try it out.

C++ source code:

template <int I, int N> // 2 instances: <0, 2> and <1, 2>
int sum_of_first(int n) {
    if(n == 0)
        return 0;
    auto temp_sum = sum_of_first<(I + 1) % N, N>(n - 1);
    return n == 1 ? 1 : temp_sum + n;
}

int main() {
    return sum_of_first<0, 2>(65536);
}

Well, the theory and compiler optimizations don't always work together. Yes, we have fewer calls, but the code itself became larger, so the instruction count remained almost as it was. Although the data read and write count got lower, it is still far from the desired TCO-like result.

But it is something.

Besides, who said we should stop there? Let's try the same principle, but with more instances.

It's not trivial to micro-benchmark this kind of function meaningfully. The thing is, if we make the operation larger, it will just overflow the stack. We can request a larger stack, but then the operation itself will become unrealistic. Let's not measure the time then. Let's stick to instruction counts and memory operations instead.

This is not the same thing as the time of execution, but it more or less correlates with the performance expectations.

Here are the instruction counts for (almost) the same code with TCO, without TCO, and with self-calls spread through 2, 4, 8, 16, and 32 instances.

Now let's measure total data reads and writes for the same family of implementations.

Conclusion

The trick shows some performance gain, and it's safe to use. It does rely on the compiler's implementation details, but in the worst case, it will merely not give you the gain you wished for.

On the other hand, it is merely a cheap trick. If you have a bottleneck in the recursion, it is still best to rewrite it in an iterative way or in a manner that enables TCO. It is more work, but then the benefit will not depend on the current compiler's optimizations. Also, the code will not look stupid in the future, when compilers learn to do this recursion semi-inlining themselves.

All the measurements were conducted on an Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz; the code was compiled with g++ 5.4.0 (-O2). The source code for the benchmark is available on GitHub.