Even if you can't write assembly like a poet, you can read disassembly like a hunter

Reading disassembly is more like reading tracks than reading a book. To read a book you have to know the language. Reading tracks, although it gets better with skills and experience, mostly requires attentiveness and imagination.

Most of the time we read disassembly only to answer one simple question: does the compiler do what we expect it to do? In 3 simple exercises, I’ll show you that often enough you too can answer this question even if you have no previous knowledge of assembly. I’ll use C++ as a source language, but what I’m trying to show is more or less universal, so it doesn’t matter if you write in C or Swift, C# or Rust. If you compile to any kind of machine code — you can benefit from understanding your compiler.

1. Compile-time computation

Any decent compiler tries to make your binary code not only correct but fast. This means doing as little work in runtime as possible. Sometimes it can even conduct the whole computation in compile-time, so your machine code will only contain the precomputed answer.

This source code defines the number of bits in a byte and returns the size of int in bits.

The compiler knows the size of an int. Let's say for the target platform it is 4 bytes. We also set the number of bits in a byte explicitly. Since all we want is a simple multiplication, and both numbers are known during the compilation, a compiler can simply compute the resulting number itself instead of generating the code that computes the same number each time it's being run.

Although, this is not something guaranteed by the standard. A compiler may or may not provide this optimization.

Now look at two possible disassemblies for this source code and decide what variant does compile-time computation and what doesn’t.

↑ If you pick one or another, the hidden explanation (A. K. A. spolier) will appear ↑

Of course, the one on the right does.

On a 32-bit platform int's size is 4 bytes, which is 32 bits, which is exactly the number in the code. You might not know that an integer function conventionally returns its output in eax(← click me! I'm expandable). But the thing is, the code on the right already has an answer in it, so it doesn't even matter.

2. Function inlining

Calling a function implies some overhead by preparing input data in a particular order; then starting the execution from another piece of memory; then preparing output data; and then returning back.

Not that it is all too slow but if you only want to call a function once, you don’t have to actually call the function. It just makes sense to copy or “inline” the function's body to the place it is called from and skip all the formalities. Compilers can often do this for you so you don't even have to bother.

But the standard does not promise that all the functions marked as inline shall get inlined. It's more a suggestion than a directive.

Now look at these two disassembly variants below and choose the one where the function gets inlined after all.

Not really a mystery either. It’s the one on the left. You might not know, that the instruction to call a function is really called the call. But since the disassembly on the left doesn't even contain any recall of square, the function has to be inlined anyway.

3. Loop unrolling

Just like calling functions, going in loops implies some overhead. You have to increment the counter; then compare it against some number; then jump back to the loop's beginning.

Compilers know that in some contexts it is more effective to unroll the loop. It means that some piece of code will actually be repeated several times in a row instead of messing with the counter comparison and jumping here and there.

The compiler has all the reasons to unroll such a simple loop, but it might as well choose not to.

It's the one on the right.

You might not know that j<*> is the family of jump instructions. Then again, the code on the right clearly has a repeating pattern, while the one on the left has a number three that is the loop's exit condition, and that should be enough to make a conclusion.

Conclusion

You can argue that these examples were deliberately simplified. That these are not some real-life examples. This is true to some degree. I refined them to be more demonstrative. But conceptually they are all taken from my own practice.

Using static dispatch instead of dynamic made my image processing pipeline up to 5 times faster. Repairing broken inlining helped to win back 50% of the performance for an edge-to-edge distance function. And changing the counter type to enable loop unrolling won me about 10% performance gain on matrix transformations, which is not much, but since all it took to achieve was simply changing short int to size_t in one place, I think of it as a good return of investment.

Apparently, old versions of MSVC fail to unroll loops with counters of non-native type. Who would have thought? Well, even if you know this particular quirk, you can't possibly know every other quirk of every compiler out there, so looking at disassembly once in a while might be good for you.

And you don't even have to spend years learning every assembly dialect. Reading disassembly is often easier than it looks. Try it.