This is Words and Buttons Online — a collection of interactive #tutorials, #demos, and #quizzes about #mathematics, #algorithms and #programming.

Even if you can't write assembly like a poet, you can read disassembly like a hunter

Reading disassembly is more like reading tracks than reading a book. To read a book you have to know the language. Reading tracks, although it gets better with skills and experience, mostly requires attentiveness and imagination.

By Weiler Greg, USFWS from pixnio.com. CC0.

Most of the time we read disassembly only to answer one simple question: does the compiler do what we expect it to do? In 3 simple exercises, I’ll show you that often enough you too can answer this question even if you have no previous knowledge of assembly. I’ll use C++ as a source language, but what I’m trying to show is more or less universal, so it doesn’t matter if you write in C or Swift, C# or Rust. If you compile to any kind of machine code — you can benefit from understanding your compiler.

1. Compile-time computation

Any decent compiler tries to make your binary code not only correct but fast. This means doing as little work in runtime as possible. Sometimes it can even conduct the whole computation in compile-time, so your machine code will only contain the precomputed answer.

This source code defines the number of bits in a byte and returns the size of int in bits.

static int BITS_IN_BYTE = 8;

int main() {
    return sizeof(int)*BITS_IN_BYTE;
}

The compiler knows the size of an int. Let's say for the target platform it is 4 bytes. We also set the number of bits in a byte explicitly. Since all we want is a simple multiplication, and both numbers are known during the compilation, a compiler can simply compute the resulting number itself instead of generating the code that computes the same number each time it's being run.

Although, this is not something guaranteed by the standard. A compiler may or may not provide this optimization.

Now look at two possible disassemblies for this source code and decide what variant does compile-time computation and what doesn’t.

BITS_IN_BYTE:
  .long 8
main:
  mov eax, DWORD PTR BITS_IN_BYTE[rip]
  cdqe
  sal eax, 2
  ret
main:
  mov eax, 32
  ret

↑ If you pick one or another, the hidden explanation (A. K. A. spolier) will appear ↑

2. Function inlining

Calling a function implies some overhead by preparing input data in a particular order; then starting the execution from another piece of memory; then preparing output data; and then returning back.

Not that it is all too slow but if you only want to call a function once, you don’t have to actually call the function. It just makes sense to copy or “inline” the function's body to the place it is called from and skip all the formalities. Compilers can often do this for you so you don't even have to bother.

If the compiler makes such an optimization, this code:

inline int square(int x)  {
    return x * x;
}


int main(int argc, char** argv)  {
    return square(argc);
}

Virtually becomes this:

// not really a source code, just explaining the idea
int main(int argc, char** argv)  {
    return argc * argc;
}
	

But the standard does not promise that all the functions marked as inline shall get inlined. It's more a suggestion than a directive.

Now look at these two disassembly variants below and choose the one where the function gets inlined after all.

main:
  imul edi, edi
  mov eax, edi
  ret
square(int):
  imul edi, edi
  mov eax, edi
  ret
main:
  sub rsp, 8
  call square(int)
  add rsp, 8
  ret

3. Loop unrolling

Just like calling functions, going in loops implies some overhead. You have to increment the counter; then compare it against some number; then jump back to the loop's beginning.

Compilers know that in some contexts it is more effective to unroll the loop. It means that some piece of code will actually be repeated several times in a row instead of messing with the counter comparison and jumping here and there.

Let's say we have this piece of code:

int main(int argc, char**) {
    int result = 1;
    for(int i = 0; i < 3; ++i)
        result *= argc;
    return result;
}
	

The compiler has all the reasons to unroll such a simple loop, but it might as well choose not to.

Which disassembly has the unrolled loop?

main:
  mov eax, 1
  mov ecx, 3
.LBB0_1:
  imul eax, edi
  dec ecx
  jne .LBB0_1
  ret
main:
  mov eax, edi
  imul eax, edi
  imul eax, edi
  ret

Conclusion

You can argue that these examples were deliberately simplified. That these are not some real-life examples. This is true to some degree. I refined them to be more demonstrative. But conceptually they are all taken from my own practice.

Using static dispatch instead of dynamic made my image processing pipeline up to 5 times faster. Repairing broken inlining helped to win back 50% of the performance for an edge-to-edge distance function. And changing the counter type to enable loop unrolling won me about 10% performance gain on matrix transformations, which is not much, but since all it took to achieve was simply changing short int to size_t in one place, I think of it as a good return of investment.

Apparently, old versions of MSVC fail to unroll loops with counters of non-native type. Who would have thought? Well, even if you know this particular quirk, you can't possibly know every other quirk of every compiler out there, so looking at disassembly once in a while might be good for you.

And you don't even have to spend years learning every assembly dialect. Reading disassembly is often easier than it looks. Try it.