Challenge your performance intuition with C++ operators
I made a mistake one day. More of a typo, really: I used “bitwise and” instead of “logical and” in the code. A colleague of mine noticed it during review and said something like this: “You shouldn't use bitwise operators instead of logical ones; the latter do short-circuiting.”
Yes. Both parts of that sentence are valid. You shouldn't use bitwise operators in place of logical ones, and logical operators in C++ do perform short-circuit evaluation. But the invisible implication between the two parts is wrong.
Short-circuiting may or may not be beneficial for performance. It may or may not happen with bitwise operators too. And sometimes you're better off not doing logic the most logical way to begin with.
I want to propose a game. Let's measure things and see if we can predict the measurements before they come. We did a similar thing last year, and that was fun, so why not?
So here is the benchmark example.
#include <chrono>
#include <iostream>
#include <random>
#include <vector>

int main() {
    using TheType = int;
    constexpr auto TheSize = 16 * 10000000;

    std::mt19937 rng(0);
    std::uniform_int_distribution<TheType> distribution(0, 1);
    std::vector<TheType> xs(TheSize);
    for (auto& digit : xs) {
        digit = distribution(rng);
    }

    volatile auto four_1_in_a_row = 0u;
    auto start = std::chrono::system_clock::now();
    for (auto i = 0u; i < TheSize - 3; ++i)
        if (xs[i] == 1 && xs[i+1] == 1 && xs[i+2] == 1 && xs[i+3] == 1)
            ++four_1_in_a_row;
    auto end = std::chrono::system_clock::now();
    std::cout << "time: " << std::chrono::duration<double>(end - start).count()
              << " 1111s: " << four_1_in_a_row << "\n";
}
I've made several variants of this code, compiled them with clang 3.8.0-2ubuntu4 (-std=c++14 -O2), and measured their run times on an Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz.
Using your intuition and best judgment, try to estimate the relative performance of each pair before reading the answer.
Round 1. && vs &
The main question. Is && faster than &?
for (auto i = 0u; i < TheSize - 3; ++i)
    if (xs[i] == 1
        && xs[i+1] == 1
        && xs[i+2] == 1
        && xs[i+3] == 1)
        ++four_1_in_a_row;

for (auto i = 0u; i < TheSize - 3; ++i)
    if ((xs[i] == 1)
        & (xs[i+1] == 1)
        & (xs[i+2] == 1)
        & (xs[i+3] == 1))
        ++four_1_in_a_row;
They are almost the same. In fact, they generate almost the same code.
movq %rax, %r14
.align 16, 0x90
.LBB0_6:
# =>This Inner Loop Header: Depth=1
cmpl $1, -12(%r12,%rbx,4)
jne .LBB0_11
# BB#7: # in Loop: Header=BB0_6 Depth=1
cmpl $1, -8(%r12,%rbx,4)
jne .LBB0_11
# BB#8: # in Loop: Header=BB0_6 Depth=1
cmpl $1, -4(%r12,%rbx,4)
jne .LBB0_11
# BB#9: # in Loop: Header=BB0_6 Depth=1
cmpl $1, (%r12,%rbx,4)
jne .LBB0_11
# BB#10: # in Loop: Header=BB0_6 Depth=1
incl 4(%rsp)
.align 16, 0x90
.LBB0_11: # %._crit_edge
# in Loop: Header=BB0_6 Depth=1
addq $1, %rbx
cmpq $160000000, %rbx
jne .LBB0_6
movq %rax, %r14
.align 16, 0x90
.LBB0_6:
# =>This Inner Loop Header: Depth=1
cmpl $1, -12(%r12,%rbx,4)
jne .LBB0_10
# BB#7: # in Loop: Header=BB0_6 Depth=1
cmpl $1, -8(%r12,%rbx,4)
jne .LBB0_10
# BB#8: # in Loop: Header=BB0_6 Depth=1
cmpl $1, -4(%r12,%rbx,4)
jne .LBB0_10
# BB#9: # in Loop: Header=BB0_6 Depth=1
cmpl $1, (%r12,%rbx,4)
jne .LBB0_10
# BB#20: # in Loop: Header=BB0_6 Depth=1
incl 4(%rsp)
.LBB0_10: # %.backedge
# in Loop: Header=BB0_6 Depth=1
addq $1, %rbx
cmpq $160000000, %rbx
jne .LBB0_6
The trick is in the context. Since == produces only 0 or 1, short-circuiting on 0 is still possible with &: 0 & anything is still 0, so the compiler is free to branch out early. The same trick doesn't carry over to |, though.
Round 2. & vs *
“Logical and” can also be substituted with the multiplication operator: on 0 and 1, it has the same truth table.
for (auto i = 0u; i < TheSize - 3; ++i)
    if ((xs[i] == 1)
        & (xs[i+1] == 1)
        & (xs[i+2] == 1)
        & (xs[i+3] == 1))
        ++four_1_in_a_row;

for (auto i = 0u; i < TheSize - 3; ++i)
    if ((xs[i] == 1)
        * (xs[i+1] == 1)
        * (xs[i+2] == 1)
        * (xs[i+3] == 1))
        ++four_1_in_a_row;
They are almost the same.
Surprisingly, they perform almost the same while their code is completely different.
There are 4 comparisons and the rest is just about turning flags into registers. I think we can get rid of that.
Round 3. ==, && vs *, +, -
In this simple example, we can substitute almost all the logic with computation. Instead of comparing each value separately, we can calculate the squared error for the whole 4-tuple.
for (auto i = 0u; i < TheSize - 3; ++i)
    if (xs[i] == 1
        && xs[i+1] == 1
        && xs[i+2] == 1
        && xs[i+3] == 1)
        ++four_1_in_a_row;

...
inline int sq(int x) {
    return x*x;
}
...

for (auto i = 0u; i < TheSize - 3; ++i)
    if (sq(xs[i] - 1)
        + sq(xs[i+1] - 1)
        + sq(xs[i+2] - 1)
        + sq(xs[i+3] - 1) == 0)
        ++four_1_in_a_row;
They are almost the same.
And this works perfectly. The code is a few lines longer, but there are no comparisons until the very end. Everything is blunt number crunching. Computers love this kind of code.
We don't really need multiplications. We don't really need squared error. Why don't we calculate an absolute error instead?
...
inline int sq(int x) {
    return x*x;
}
...

for (auto i = 0u; i < TheSize - 3; ++i)
    if (sq(xs[i] - 1)
        + sq(xs[i+1] - 1)
        + sq(xs[i+2] - 1)
        + sq(xs[i+3] - 1) == 0)
        ++four_1_in_a_row;

for (auto i = 0u; i < TheSize - 3; ++i)
    if (std::abs(xs[i] - 1)
        + std::abs(xs[i+1] - 1)
        + std::abs(xs[i+2] - 1)
        + std::abs(xs[i+3] - 1) == 0)
        ++four_1_in_a_row;
They are almost the same.
There's a reason for that. Getting the absolute value of an int is not a trivial job. It's not about clipping a bit off; it is a comparison and a conditional move, and this pair is heavier than a simple multiplication.
That sounds a bit like cheating, because with single-precision types we would read half as much from memory, but let's give it a go.
...
using TheType = double;
...

for (auto i = 0u; i < TheSize - 3; ++i)
    if (xs[i] == 1
        && xs[i+1] == 1
        && xs[i+2] == 1
        && xs[i+3] == 1)
        ++four_1_in_a_row;

...
using TheType = float;
...

for (auto i = 0u; i < TheSize - 3; ++i)
    if (xs[i] == 1
        && xs[i+1] == 1
        && xs[i+2] == 1
        && xs[i+3] == 1)
        ++four_1_in_a_row;
They are almost the same.
And that's the most peculiar thing. Although the code now reads much less from memory, performance-wise it's almost identical to its original form.
This probably has something to do with the bus and the CPU working asynchronously: if the CPU is busy enough, the bus load doesn't matter.
Round 8. 1 vs 0
This is the bonus round. There is a notion, right alongside the one about short-circuiting being good for performance, that comparison with zero is cheaper than any other comparison. Let's try it out.
for (auto i = 0u; i < TheSize - 3; ++i)
    if (xs[i] == 1
        && xs[i+1] == 1
        && xs[i+2] == 1
        && xs[i+3] == 1)
        ++four_1_in_a_row;

for (auto i = 0u; i < TheSize - 3; ++i)
    if (xs[i] == 0
        && xs[i+1] == 0
        && xs[i+2] == 0
        && xs[i+3] == 0)
        ++four_0_in_a_row;
They are almost the same.
Of course, it isn't cheaper here. Comparing with zero may be contextually beneficial: on the Intel architecture there is a dedicated zero flag, and much of the time you can read it from the previous operation, skipping the comparison altogether. But in our example there is no prior operation to set the zero flag, so the code on both sides is almost identical.
movq %rax, %r14
.align 16, 0x90
.LBB0_6:
# =>This Inner Loop Header: Depth=1
cmpl $1, -12(%r12,%rbx,4)
jne .LBB0_11
# BB#7: # in Loop: Header=BB0_6 Depth=1
cmpl $1, -8(%r12,%rbx,4)
jne .LBB0_11
# BB#8: # in Loop: Header=BB0_6 Depth=1
cmpl $1, -4(%r12,%rbx,4)
jne .LBB0_11
# BB#9: # in Loop: Header=BB0_6 Depth=1
cmpl $1, (%r12,%rbx,4)
jne .LBB0_11
# BB#10: # in Loop: Header=BB0_6 Depth=1
incl 4(%rsp)
.align 16, 0x90
.LBB0_11: # %._crit_edge
# in Loop: Header=BB0_6 Depth=1
addq $1, %rbx
cmpq $160000000, %rbx
jne .LBB0_6
movq %rax, %r14
.align 16, 0x90
.LBB0_6:
# =>This Inner Loop Header: Depth=1
cmpl $0, -12(%r12,%rbx,4)
jne .LBB0_11
# BB#7: # in Loop: Header=BB0_6 Depth=1
cmpl $0, -8(%r12,%rbx,4)
jne .LBB0_11
# BB#8: # in Loop: Header=BB0_6 Depth=1
cmpl $0, -4(%r12,%rbx,4)
jne .LBB0_11
# BB#9: # in Loop: Header=BB0_6 Depth=1
cmpl $0, (%r12,%rbx,4)
jne .LBB0_11
# BB#10: # in Loop: Header=BB0_6 Depth=1
incl 4(%rsp)
.align 16, 0x90
.LBB0_11: # %._crit_edge
# in Loop: Header=BB0_6 Depth=1
addq $1, %rbx
cmpq $160000000, %rbx
jne .LBB0_6
Conclusion
Truisms don't help performance. That's my conclusion. We all know that logical operations perform short-circuiting; that floats are shorter than doubles; that throwing away the sign of a number is easier than computing its square. It's all true, and it's all irrelevant without the proper context.