Challenge your performance intuition with C++ operators
I made a mistake one day. More of a typo, really: I used “bitwise and” instead of “logical and” in the code. A colleague of mine noticed it during review and said something like this: “You shouldn't use bitwise operators instead of logical ones; the latter do short-circuiting.”
Yes. Both parts of that sentence are valid. You shouldn't use bitwise operators in place of logical ones, and logical operators in C++ do perform short-circuit evaluation. But the invisible implication between the two parts is wrong.
Short-circuiting may or may not be beneficial for performance. It may or may not happen with bitwise operators too. And sometimes you're better off not doing logic the most logical way to begin with.
I want to propose a game. Let's measure things and see if we can predict the measurements before they come. We did a similar thing last year, and that was fun, so why not?
So here is the benchmark example.
#include <chrono>
#include <iostream>
#include <random>
#include <vector>

int main() {
    using TheType = int;
    constexpr auto TheSize = 16 * 10000000;

    std::mt19937 rng(0);
    std::uniform_int_distribution<TheType> distribution(0, 1);
    std::vector<TheType> xs(TheSize);
    for (auto& digit : xs) {
        digit = distribution(rng);
    }

    volatile auto four_1_in_a_row = 0u;
    auto start = std::chrono::system_clock::now();
    for (auto i = 0u; i < TheSize - 3; ++i)
        if (xs[i] == 1 && xs[i+1] == 1 && xs[i+2] == 1 && xs[i+3] == 1)
            ++four_1_in_a_row;
    auto end = std::chrono::system_clock::now();
    std::cout << "time: " << std::chrono::duration<double>(end - start).count()
              << " 1111s: " << four_1_in_a_row << "\n";
}
I've made several variants of this code, compiled them with clang 3.8.0-2ubuntu4 (-std=c++14 -O2), and measured their run times on an Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz.
Using your intuition and best judgment, try to estimate the relative performance of each pair before reading the answer.
Round 1. && vs &
The main question. Is && faster than &?
for (auto i = 0u; i < TheSize - 3; ++i)
    if (xs[i] == 1
        && xs[i+1] == 1
        && xs[i+2] == 1
        && xs[i+3] == 1)
        ++four_1_in_a_row;

for (auto i = 0u; i < TheSize - 3; ++i)
    if ((xs[i] == 1)
        & (xs[i+1] == 1)
        & (xs[i+2] == 1)
        & (xs[i+3] == 1))
        ++four_1_in_a_row;
They are almost the same. In fact, they generate almost the same code.
movq %rax, %r14
.align 16, 0x90
.LBB0_6:
# =>This Inner Loop Header: Depth=1
cmpl $1, -12(%r12,%rbx,4)
jne .LBB0_11
# BB#7: # in Loop: Header=BB0_6 Depth=1
cmpl $1, -8(%r12,%rbx,4)
jne .LBB0_11
# BB#8: # in Loop: Header=BB0_6 Depth=1
cmpl $1, -4(%r12,%rbx,4)
jne .LBB0_11
# BB#9: # in Loop: Header=BB0_6 Depth=1
cmpl $1, (%r12,%rbx,4)
jne .LBB0_11
# BB#10: # in Loop: Header=BB0_6 Depth=1
incl 4(%rsp)
.align 16, 0x90
.LBB0_11: # %._crit_edge
# in Loop: Header=BB0_6 Depth=1
addq $1, %rbx
cmpq $160000000, %rbx
jne .LBB0_6
movq %rax, %r14
.align 16, 0x90
.LBB0_6:
# =>This Inner Loop Header: Depth=1
cmpl $1, -12(%r12,%rbx,4)
jne .LBB0_10
# BB#7: # in Loop: Header=BB0_6 Depth=1
cmpl $1, -8(%r12,%rbx,4)
jne .LBB0_10
# BB#8: # in Loop: Header=BB0_6 Depth=1
cmpl $1, -4(%r12,%rbx,4)
jne .LBB0_10
# BB#9: # in Loop: Header=BB0_6 Depth=1
cmpl $1, (%r12,%rbx,4)
jne .LBB0_10
# BB#20: # in Loop: Header=BB0_6 Depth=1
incl 4(%rsp)
.LBB0_10: # %.backedge
# in Loop: Header=BB0_6 Depth=1
addq $1, %rbx
cmpq $160000000, %rbx
jne .LBB0_6
The trick is in the context. Since == produces only 0 or 1, short-circuiting on 0 is still possible with &: 0 & anything is still 0, so the compiler is free to branch out early. The same trick doesn't carry over to |, though.
Round 2. & vs *
“Logical and” can also be substituted with the multiplication operator: on 0 and 1, it has the same truth table.
for (auto i = 0u; i < TheSize - 3; ++i)
    if ((xs[i] == 1)
        & (xs[i+1] == 1)
        & (xs[i+2] == 1)
        & (xs[i+3] == 1))
        ++four_1_in_a_row;

for (auto i = 0u; i < TheSize - 3; ++i)
    if ((xs[i] == 1)
        * (xs[i+1] == 1)
        * (xs[i+2] == 1)
        * (xs[i+3] == 1))
        ++four_1_in_a_row;
They are almost the same.
Surprisingly, they perform almost the same while their code is completely different.
There are 4 comparisons and the rest is just about turning flags into registers. I think we can get rid of that.
Round 3. ==, && vs *, +, -
In this simple example, we can substitute almost all the logic with computation. Instead of comparing each value separately, we can calculate the squared error for the whole 4-tuple.
for (auto i = 0u; i < TheSize - 3; ++i)
    if (xs[i] == 1
        && xs[i+1] == 1
        && xs[i+2] == 1
        && xs[i+3] == 1)
        ++four_1_in_a_row;

...
inline int sq(int x) {
    return x*x;
}
...

for (auto i = 0u; i < TheSize - 3; ++i)
    if (sq(xs[i] - 1)
        + sq(xs[i+1] - 1)
        + sq(xs[i+2] - 1)
        + sq(xs[i+3] - 1) == 0)
        ++four_1_in_a_row;
They are almost the same.
And this works perfectly. The code is a few lines longer, but there are no comparisons until the very end. Everything is blunt number crunching. Computers love this kind of code.
We don't really need multiplications. We don't really need squared error. Why don't we calculate an absolute error instead?
...
inline int sq(int x) {
    return x*x;
}
...

for (auto i = 0u; i < TheSize - 3; ++i)
    if (sq(xs[i] - 1)
        + sq(xs[i+1] - 1)
        + sq(xs[i+2] - 1)
        + sq(xs[i+3] - 1) == 0)
        ++four_1_in_a_row;

for (auto i = 0u; i < TheSize - 3; ++i)
    if (std::abs(xs[i] - 1)
        + std::abs(xs[i+1] - 1)
        + std::abs(xs[i+2] - 1)
        + std::abs(xs[i+3] - 1) == 0)
        ++four_1_in_a_row;
They are almost the same.
There's a reason for that. Getting the absolute value of an int is not a trivial job. It's not about clipping a bit off; it is a comparison and a conditional move, and this pair is heavier than a simple multiplication.
That sounds a bit like cheating, because with single-precision types we would read half as much from memory, but let's give it a go.
...
using TheType = double;
...

for (auto i = 0u; i < TheSize - 3; ++i)
    if (xs[i] == 1
        && xs[i+1] == 1
        && xs[i+2] == 1
        && xs[i+3] == 1)
        ++four_1_in_a_row;

...
using TheType = float;
...

for (auto i = 0u; i < TheSize - 3; ++i)
    if (xs[i] == 1
        && xs[i+1] == 1
        && xs[i+2] == 1
        && xs[i+3] == 1)
        ++four_1_in_a_row;
They are almost the same.
And that's the most peculiar thing. Although the code now reads much less from memory, performance-wise it's almost identical to its original form.
This probably has something to do with the bus and the CPU working asynchronously: if the CPU is busy enough, the bus load doesn't matter.
Round 8. 1 vs 0
This is the bonus round. There is a notion, right alongside the one about short-circuiting being good for performance, that comparison with zero is cheaper than any other comparison. Let's try it out.
for (auto i = 0u; i < TheSize - 3; ++i)
    if (xs[i] == 1
        && xs[i+1] == 1
        && xs[i+2] == 1
        && xs[i+3] == 1)
        ++four_1_in_a_row;

for (auto i = 0u; i < TheSize - 3; ++i)
    if (xs[i] == 0
        && xs[i+1] == 0
        && xs[i+2] == 0
        && xs[i+3] == 0)
        ++four_0_in_a_row;
They are almost the same.
Of course, it isn't cheaper here. Comparing with zero may be contextually beneficial: on the Intel architecture there is a dedicated zero flag, and much of the time you can read it from the previous operation, skipping the comparison altogether. But in our example there is no prior operation to set the zero flag, so the code on both sides is almost identical.
movq %rax, %r14
.align 16, 0x90
.LBB0_6:
# =>This Inner Loop Header: Depth=1
cmpl $1, -12(%r12,%rbx,4)
jne .LBB0_11
# BB#7: # in Loop: Header=BB0_6 Depth=1
cmpl $1, -8(%r12,%rbx,4)
jne .LBB0_11
# BB#8: # in Loop: Header=BB0_6 Depth=1
cmpl $1, -4(%r12,%rbx,4)
jne .LBB0_11
# BB#9: # in Loop: Header=BB0_6 Depth=1
cmpl $1, (%r12,%rbx,4)
jne .LBB0_11
# BB#10: # in Loop: Header=BB0_6 Depth=1
incl 4(%rsp)
.align 16, 0x90
.LBB0_11: # %._crit_edge
# in Loop: Header=BB0_6 Depth=1
addq $1, %rbx
cmpq $160000000, %rbx
jne .LBB0_6
movq %rax, %r14
.align 16, 0x90
.LBB0_6:
# =>This Inner Loop Header: Depth=1
cmpl $0, -12(%r12,%rbx,4)
jne .LBB0_11
# BB#7: # in Loop: Header=BB0_6 Depth=1
cmpl $0, -8(%r12,%rbx,4)
jne .LBB0_11
# BB#8: # in Loop: Header=BB0_6 Depth=1
cmpl $0, -4(%r12,%rbx,4)
jne .LBB0_11
# BB#9: # in Loop: Header=BB0_6 Depth=1
cmpl $0, (%r12,%rbx,4)
jne .LBB0_11
# BB#10: # in Loop: Header=BB0_6 Depth=1
incl 4(%rsp)
.align 16, 0x90
.LBB0_11: # %._crit_edge
# in Loop: Header=BB0_6 Depth=1
addq $1, %rbx
cmpq $160000000, %rbx
jne .LBB0_6
Conclusion
Truisms don't help performance. That's my conclusion. We all know that logical operations perform short-circuiting; that floats are shorter than doubles; that throwing away the sign of a number is easier than computing its square. It's all true, and it's all irrelevant without the proper context.