Challenge your performance intuition with nanosecond sorting
In the physical world, nanotechnology predictably brings us surprises. When structured at the nano-scale, the most common materials, such as carbon, iron, or copper, gain new, unforeseen properties. What we know about them at the macro-scale no longer applies.
Surprisingly, a very similar effect happens in computation. If the operation you want to speed up already runs in a few nanoseconds, your reasoning about algorithmic complexity probably won't apply.
The most effective algorithms become mediocre while the useless rise from oblivion to shine and amaze. One of these algorithms is the index sort. It has unconditional O(n²) complexity, which makes it probably the least performant sorting algorithm you've never heard of. It's useless at the macro-scale. But at the nano-scale, it excels.
Index sort
The idea of index sort is this: for every item in the unsorted array, you can find its index in the sorted array by comparing it to all the others.
It's not algorithmically elegant. But it's deterministic, meaning that you can avoid branching completely. On small arrays, this should be a noticeable advantage.
And now for the challenge
Let's compare how different sorts perform on triplets — arrays of three elements. You might think this is a rare task, but when you work with 3D graphics it comes up surprisingly often. Last year alone I sorted triplets of triangle indices, of cubic equation roots, and of axis-aligned bounding box dimensions.
Tasks like that are rarely bottlenecks, but if you can speed up your algorithm by even 12% with no measurable effort, it's probably worth doing.
Using your intuition and best judgment, try to estimate the relative performance of each pair of code samples.
Round 1.
The main question, the one that has to be answered before anything else: why not just use std::sort?
// version 1: std::sort
std::sort(t.begin(), t.end());

// version 2: index sort
const auto a = t[0];
const auto b = t[1];
const auto c = t[2];
t[int(a > b) + int(a > c)] = a;
t[int(b >= a) + int(b > c)] = b;
t[int(c >= a) + int(c >= b)] = c;
Index sort is five times faster in this context.
But in a different context, it might not work all that well. For instance, on ARMv7 (v7l) it's only 2 times faster, and with Clang (Debian clang 3.5.0) the difference is even less drastic.
Clang's sort generally seems to be slightly better on small arrays. On my testing Intel i7 it's also 20% faster than the GCC version. Still, not enough to make it interesting, so let's move along with the index sort.
Round 2.
Sorting triplets is fun, but can we generalise it with no additional overhead?
// triplet index sort
const auto a = t[0];
const auto b = t[1];
const auto c = t[2];
t[int(a > b) + int(a > c)] = a;
t[int(b >= a) + int(b > c)] = b;
t[int(c >= a) + int(c >= b)] = c;
// generalised index sort
template <size_t N>
void index_sort(std::array<int, N>& t) {
std::array<int, N> a = t;
for(auto i = 0u; i < N; ++i) {
auto k = 0u;
for(auto j = 0u; j < N; ++j) {
if(j > i)
k += int(a[i] > a[j]);
else if(j < i)
k += int(a[i] >= a[j]);
}
t[k] = a[i];
}
}
The code is not drastically different, and it has no reason to be. My theory is that this gain is accidental. Most surprisingly, it reproduces well across compilers and CPUs.
Round 3.
For a triplet, there is another std::sort substitute, and it's just three conditional swaps.
// generalised index sort
template <size_t N>
void index_sort(std::array<int, N>& t) {
std::array<int, N> a = t;
for(auto i = 0u; i < N; ++i) {
auto k = 0u;
for(auto j = 0u; j < N; ++j) {
if(j > i)
k += int(a[i] > a[j]);
else if(j < i)
k += int(a[i] >= a[j]);
}
t[k] = a[i];
}
}
// three conditional swaps
void swap_sort(std::array<int, 3>& t) {
auto sort_ = [](auto& a, auto& b) {
if (a > b)
std::swap(a, b);
};
sort_(t[0], t[1]);
sort_(t[1], t[2]);
sort_(t[0], t[1]);
}
The swaps still work faster than std::sort but not as fast as the index sort. However, that's on modern Intel. On ARMv7, branching doesn't make all that much difference, and the measurements are almost identical.
Round 4.
What if we use std::min/std::max instead of conditional swaps?
// conditional swaps
void swap_sort(std::array<int, 3>& t) {
auto sort_ = [](auto& a, auto& b) {
if (a > b)
std::swap(a, b);
};
sort_(t[0], t[1]);
sort_(t[1], t[2]);
sort_(t[0], t[1]);
}
// min/max instead of conditional swaps
void swap_sort_no_ifs(std::array<int, 3>& t) {
auto sort_ = [](auto& a, auto& b) {
const auto temp = std::min(a, b);
b = std::max(a, b);
a = temp;
};
sort_(t[0], t[1]);
sort_(t[1], t[2]);
sort_(t[0], t[1]);
}
The min/max version is branchless! Clang uses conditional moves instead of conditional jumps, and this makes the min/max version five times faster on the very same CPU.
Round 5.
You don't have to rely on the compiler to make the conditional swap deterministic. There is an arithmetic trick that lets you sort a pair of numbers.
The half-sum of a pair minus their half-difference is the lowest number; the half-sum plus the half-difference is the greatest. Of course, they are the same when the half-difference is 0.
This limits the applicability of the trick. With integers, you can accidentally overflow, and with floating-point numbers, you can lose precision if the numbers have different exponents. But let's just presume this works for us and see which is faster.
// min/max swaps
void swap_sort_no_ifs(std::array<int, 3>& t) {
auto sort_ = [](auto& a, auto& b) {
const auto temp = std::min(a, b);
b = std::max(a, b);
a = temp;
};
sort_(t[0], t[1]);
sort_(t[1], t[2]);
sort_(t[0], t[1]);
}
// arithmetic trick
void swap_sort_arithmetic_hack(std::array<int, 3>& t) {
auto sort_ = [](auto& a, auto& b) {
auto sum = a+b;
auto diff = std::abs(a-b);
a = (sum - diff) / 2;
b = (sum + diff) / 2;
};
sort_(t[0], t[1]);
sort_(t[1], t[2]);
sort_(t[0], t[1]);
}
The code is a little excessive, but it is deterministic and branchless.
; swap_sort_arithmetic_hack
mov edx, DWORD PTR [rdi+4]
mov eax, DWORD PTR [rdi]
lea ecx, [rax+rdx]
sub eax, edx
cdq
xor eax, edx
sub eax, edx
mov edx, ecx
sub edx, eax
add eax, ecx
mov ecx, eax
mov esi, edx
shr ecx, 31
shr esi, 31
add eax, ecx
mov ecx, DWORD PTR [rdi+8]
add edx, esi
sar eax
sar edx
lea r8d, [rcx+rax]
sub eax, ecx
mov ecx, eax
sar ecx, 31
xor eax, ecx
sub eax, ecx
mov ecx, r8d
sub ecx, eax
add eax, r8d
mov esi, ecx
shr esi, 31
add esi, ecx
mov ecx, eax
shr ecx, 31
sar esi
add eax, ecx
lea ecx, [rsi+rdx]
sar eax
mov DWORD PTR [rdi+8], eax
mov eax, edx
sub eax, esi
cdq
xor eax, edx
sub eax, edx
mov edx, ecx
sub edx, eax
mov esi, edx
shr esi, 31
add edx, esi
sar edx
add eax, ecx
mov DWORD PTR [rdi], edx
mov edx, eax
shr edx, 31
add eax, edx
sar eax
mov DWORD PTR [rdi+4], eax
ret
Since we're already knee-deep in limitations, why don't we try a truly integer-specific hack?
Round 6.
I'll be honest, I stole this from somewhere on the Internet, and I don't know exactly how it works. Apparently, it works just fine on half of the integer range, so it's also limited.
But would it be faster?
// arithmetic trick
void swap_sort_arithmetic_hack(std::array<int, 3>& t) {
auto sort_ = [](auto& a, auto& b) {
auto sum = a+b;
auto diff = std::abs(a-b);
a = (sum - diff) / 2;
b = (sum + diff) / 2;
};
sort_(t[0], t[1]);
sort_(t[1], t[2]);
sort_(t[0], t[1]);
}
// bit hack
void swap_sort_bit_hack(std::array<int, 3>& t) {
auto sort_ = [](auto& x, auto& y) {
constexpr auto shift =
(sizeof(int) * CHAR_BIT - 1);
const auto temp =
y + ((x - y) & ((x - y) >> shift));
y = x - ((x - y) & ((x - y) >> shift));
x = temp;
};
sort_(t[0], t[1]);
sort_(t[1], t[2]);
sort_(t[0], t[1]);
}
It is even faster than before and it generates even better code.
; swap_sort_arithmetic_hack (repeated for comparison)
mov edx, DWORD PTR [rdi+4]
mov eax, DWORD PTR [rdi]
lea ecx, [rax+rdx]
sub eax, edx
cdq
xor eax, edx
sub eax, edx
mov edx, ecx
sub edx, eax
add eax, ecx
mov ecx, eax
mov esi, edx
shr ecx, 31
shr esi, 31
add eax, ecx
mov ecx, DWORD PTR [rdi+8]
add edx, esi
sar eax
sar edx
lea r8d, [rcx+rax]
sub eax, ecx
mov ecx, eax
sar ecx, 31
xor eax, ecx
sub eax, ecx
mov ecx, r8d
sub ecx, eax
add eax, r8d
mov esi, ecx
shr esi, 31
add esi, ecx
mov ecx, eax
shr ecx, 31
sar esi
add eax, ecx
lea ecx, [rsi+rdx]
sar eax
mov DWORD PTR [rdi+8], eax
mov eax, edx
sub eax, esi
cdq
xor eax, edx
sub eax, edx
mov edx, ecx
sub edx, eax
mov esi, edx
shr esi, 31
add edx, esi
sar edx
add eax, ecx
mov DWORD PTR [rdi], edx
mov edx, eax
shr edx, 31
add eax, edx
sar eax
mov DWORD PTR [rdi+4], eax
ret
; swap_sort_bit_hack
mov ecx, DWORD PTR [rdi]
mov edx, DWORD PTR [rdi+4]
mov eax, ecx
sub eax, edx
mov esi, eax
sar esi, 31
and eax, esi
add edx, eax
sub ecx, eax
mov eax, DWORD PTR [rdi+8]
mov esi, ecx
sub esi, eax
mov r8d, esi
sar r8d, 31
and esi, r8d
sub ecx, esi
add eax, esi
mov DWORD PTR [rdi+8], ecx
mov ecx, edx
sub ecx, eax
mov esi, ecx
sar esi, 31
and ecx, esi
sub edx, ecx
add eax, ecx
mov DWORD PTR [rdi+4], edx
mov DWORD PTR [rdi], eax
ret
Yes. It is GCC-specific, though. On Clang, the best code still comes from the min/max swaps.
Bonus round.
Since we know the algorithm is supposed to work on small arrays, maybe we can help the compiler pick better types. I've always been curious to try uint_fast8_t somewhere.
// plain index sort
template <size_t N>
void index_sort(std::array<int, N>& t) {
std::array<int, N> a = t;
for(auto i = 0u; i < N; ++i) {
auto k = 0u;
for(auto j = 0u; j < N; ++j) {
if(j > i)
k += int(a[i] > a[j]);
else if(j < i)
k += int(a[i] >= a[j]);
}
t[k] = a[i];
}
}
// index sort with uint_fast8_t counters
template <size_t N>
void index_sort_fast_t(std::array<int, N>& t) {
std::array<int, N> a = t;
for(uint_fast8_t i = 0u; i < N; ++i) {
uint_fast8_t k = 0u;
for(uint_fast8_t j = 0u; j < N; ++j) {
if(j > i)
k += uint_fast8_t(a[i] > a[j]);
else if(j < i)
k += uint_fast8_t(a[i] >= a[j]);
}
t[k] = a[i];
}
}
Conclusion
You can always win some performance by using clever tricks: something type-specific, compiler-specific, or processor-specific. However, even in the nano-world, the simplest solution often works best.