Challenge your performance intuition with nanosecond sorting
In the physical world, nanotechnology predictably brings us surprises. When structured at the nano-scale, the most common materials, such as carbon, iron, or copper, gain new, unforeseen properties. What we know about them at the macro-scale no longer applies.
Surprisingly, a very similar effect happens in computation. If the operation you want to speed up already runs in a few nanoseconds, your reasoning about algorithmic complexity probably won't apply.
The most effective algorithms become mediocre while the useless rise from oblivion to shine and amaze. One of these algorithms is the index sort. It has unconditional O(n²) complexity, which makes it probably the least performant sorting algorithm you've never heard of. It's useless at the macro-scale. But at the nano-scale, it excels.
Index sort
The idea of index sort is this: for every item in the unsorted array, you can find its index in the sorted array by comparing it to all the others.
It's not algorithmically elegant. But it's deterministic, meaning that you can avoid branching completely. On small arrays, this should be a noticeable advantage.
And now for the challenge
Let's compare how different sorts perform on triplets — arrays of three elements. You might think this is a rare task, but when you work with 3D graphics it comes up surprisingly often. Last year alone I sorted triplets of triangle indices, of cubic equation roots, and of axis-aligned bounding box dimensions.
Tasks like that are rarely bottlenecks, but if you can speed up your algorithm by even 12% with no measurable effort, it's probably worth doing.
Using your intuition and best judgment, try to estimate the relative performance of each pair of code samples.
Round 1.
The main question, the one that has to be answered before anything else: why not just use std::sort?
// version 1: std::sort
std::sort(t.begin(), t.end());

// version 2: index sort
const auto a = t[0];
const auto b = t[1];
const auto c = t[2];
t[int(a > b) + int(a > c)] = a;
t[int(b >= a) + int(b > c)] = b;
t[int(c >= a) + int(c >= b)] = c;
Index sort is five times faster in this context.
But in a different context, it might not work all that well. For instance, on ARMv7 (v7l) it's only 2 times faster, and with Clang (Debian clang 3.5.0) the difference is even less drastic.
Clang's sort generally seems to be slightly better on small arrays. On my testing Intel i7 it's also 20% faster than the GCC version. Still, not enough to make it interesting, so let's move along with the index sort.
Round 2.
Sorting triplets is fun, but can we generalise it with no additional overhead?
// triplet index sort
const auto a = t[0];
const auto b = t[1];
const auto c = t[2];
t[int(a > b) + int(a > c)] = a;
t[int(b >= a) + int(b > c)] = b;
t[int(c >= a) + int(c >= b)] = c;
// generalised index sort
template <size_t N>
void index_sort(std::array<int, N>& t) {
std::array<int, N> a = t;
for(auto i = 0u; i < N; ++i) {
auto k = 0u;
for(auto j = 0u; j < N; ++j) {
if(j > i)
k += int(a[i] > a[j]);
else if(j < i)
k += int(a[i] >= a[j]);
}
t[k] = a[i];
}
}
The code is not drastically different, and it has no reason to be. My theory is that this gain is accidental. Most surprisingly, it reproduces well across compilers and CPUs.
Round 3.
For a triplet, there is another std::sort substitute, and it's just three conditional swaps.
// generalised index sort
template <size_t N>
void index_sort(std::array<int, N>& t) {
std::array<int, N> a = t;
for(auto i = 0u; i < N; ++i) {
auto k = 0u;
for(auto j = 0u; j < N; ++j) {
if(j > i)
k += int(a[i] > a[j]);
else if(j < i)
k += int(a[i] >= a[j]);
}
t[k] = a[i];
}
}
// three conditional swaps
void swap_sort(std::array<int, 3>& t) {
auto sort_ = [](auto& a, auto& b) {
if (a > b)
std::swap(a, b);
};
sort_(t[0], t[1]);
sort_(t[1], t[2]);
sort_(t[0], t[1]);
}
The swaps still work faster than std::sort but not as fast as the index sort. However, that's on modern Intel. On ARMv7, branching doesn't make all that much difference, and the measurements are almost identical.
Round 4.
What if we use std::min/std::max instead of conditional swaps?
// conditional swaps
void swap_sort(std::array<int, 3>& t) {
auto sort_ = [](auto& a, auto& b) {
if (a > b)
std::swap(a, b);
};
sort_(t[0], t[1]);
sort_(t[1], t[2]);
sort_(t[0], t[1]);
}
// min/max instead of conditional swaps
void swap_sort_no_ifs(std::array<int, 3>& t) {
auto sort_ = [](auto& a, auto& b) {
const auto temp = std::min(a, b);
b = std::max(a, b);
a = temp;
};
sort_(t[0], t[1]);
sort_(t[1], t[2]);
sort_(t[0], t[1]);
}
The min/max version is branchless! Clang uses conditional moves instead of conditional jumps, and this makes the min/max version five times faster on the very same CPU.
Round 5.
You don't have to rely on the compiler to make the conditional swap deterministic. There is an arithmetic trick that lets you sort a pair of numbers.
The half-sum of a pair minus their half-difference is the lowest number; the half-sum plus the half-difference is the greatest. Of course, they are the same when the half-difference is 0.
This limits the applicability of the trick. With integers, you can accidentally overflow, and with floating-point numbers, you can lose precision if the numbers have different exponents. But let's just presume this works for us and see which is faster.
// min/max swaps
void swap_sort_no_ifs(std::array<int, 3>& t) {
auto sort_ = [](auto& a, auto& b) {
const auto temp = std::min(a, b);
b = std::max(a, b);
a = temp;
};
sort_(t[0], t[1]);
sort_(t[1], t[2]);
sort_(t[0], t[1]);
}
// arithmetic trick
void swap_sort_arithmetic_hack(std::array<int, 3>& t) {
auto sort_ = [](auto& a, auto& b) {
auto sum = a+b;
auto diff = std::abs(a-b);
a = (sum - diff) / 2;
b = (sum + diff) / 2;
};
sort_(t[0], t[1]);
sort_(t[1], t[2]);
sort_(t[0], t[1]);
}
The code is a little excessive, but it is deterministic and branchless.
; swap_sort_arithmetic_hack
mov edx, DWORD PTR [rdi+4]
mov eax, DWORD PTR [rdi]
lea ecx, [rax+rdx]
sub eax, edx
cdq
xor eax, edx
sub eax, edx
mov edx, ecx
sub edx, eax
add eax, ecx
mov ecx, eax
mov esi, edx
shr ecx, 31
shr esi, 31
add eax, ecx
mov ecx, DWORD PTR [rdi+8]
add edx, esi
sar eax
sar edx
lea r8d, [rcx+rax]
sub eax, ecx
mov ecx, eax
sar ecx, 31
xor eax, ecx
sub eax, ecx
mov ecx, r8d
sub ecx, eax
add eax, r8d
mov esi, ecx
shr esi, 31
add esi, ecx
mov ecx, eax
shr ecx, 31
sar esi
add eax, ecx
lea ecx, [rsi+rdx]
sar eax
mov DWORD PTR [rdi+8], eax
mov eax, edx
sub eax, esi
cdq
xor eax, edx
sub eax, edx
mov edx, ecx
sub edx, eax
mov esi, edx
shr esi, 31
add edx, esi
sar edx
add eax, ecx
mov DWORD PTR [rdi], edx
mov edx, eax
shr edx, 31
add eax, edx
sar eax
mov DWORD PTR [rdi+4], eax
ret
Since we're already knee-deep in limitations, why don't we try a truly integer-specific hack?
Round 6.
I'll be honest, I stole this from somewhere on the Internet, and I don't know exactly how it works. Apparently, it works just fine on half of the integer range, so it's also limited.
But would it be faster?
// arithmetic trick
void swap_sort_arithmetic_hack(std::array<int, 3>& t) {
auto sort_ = [](auto& a, auto& b) {
auto sum = a+b;
auto diff = std::abs(a-b);
a = (sum - diff) / 2;
b = (sum + diff) / 2;
};
sort_(t[0], t[1]);
sort_(t[1], t[2]);
sort_(t[0], t[1]);
}
// bit hack
void swap_sort_bit_hack(std::array<int, 3>& t) {
auto sort_ = [](auto& x, auto& y) {
constexpr auto shift =
(sizeof(int) * CHAR_BIT - 1);
const auto temp =
y + ((x - y) & ((x - y) >> shift));
y = x - ((x - y) & ((x - y) >> shift));
x = temp;
};
sort_(t[0], t[1]);
sort_(t[1], t[2]);
sort_(t[0], t[1]);
}
It is even faster than before and it generates even better code.
; swap_sort_arithmetic_hack (repeated for comparison)
mov edx, DWORD PTR [rdi+4]
mov eax, DWORD PTR [rdi]
lea ecx, [rax+rdx]
sub eax, edx
cdq
xor eax, edx
sub eax, edx
mov edx, ecx
sub edx, eax
add eax, ecx
mov ecx, eax
mov esi, edx
shr ecx, 31
shr esi, 31
add eax, ecx
mov ecx, DWORD PTR [rdi+8]
add edx, esi
sar eax
sar edx
lea r8d, [rcx+rax]
sub eax, ecx
mov ecx, eax
sar ecx, 31
xor eax, ecx
sub eax, ecx
mov ecx, r8d
sub ecx, eax
add eax, r8d
mov esi, ecx
shr esi, 31
add esi, ecx
mov ecx, eax
shr ecx, 31
sar esi
add eax, ecx
lea ecx, [rsi+rdx]
sar eax
mov DWORD PTR [rdi+8], eax
mov eax, edx
sub eax, esi
cdq
xor eax, edx
sub eax, edx
mov edx, ecx
sub edx, eax
mov esi, edx
shr esi, 31
add edx, esi
sar edx
add eax, ecx
mov DWORD PTR [rdi], edx
mov edx, eax
shr edx, 31
add eax, edx
sar eax
mov DWORD PTR [rdi+4], eax
ret
; swap_sort_bit_hack
mov ecx, DWORD PTR [rdi]
mov edx, DWORD PTR [rdi+4]
mov eax, ecx
sub eax, edx
mov esi, eax
sar esi, 31
and eax, esi
add edx, eax
sub ecx, eax
mov eax, DWORD PTR [rdi+8]
mov esi, ecx
sub esi, eax
mov r8d, esi
sar r8d, 31
and esi, r8d
sub ecx, esi
add eax, esi
mov DWORD PTR [rdi+8], ecx
mov ecx, edx
sub ecx, eax
mov esi, ecx
sar esi, 31
and ecx, esi
sub edx, ecx
add eax, ecx
mov DWORD PTR [rdi+4], edx
mov DWORD PTR [rdi], eax
ret
Yes. It is GCC-specific, though. On Clang, the best code still comes from the min/max swaps.
Bonus round.
Since we know the algorithm is supposed to work on small arrays, maybe we can help the compiler pick better types. I've always been curious to try uint_fast8_t somewhere.
// plain index sort
template <size_t N>
void index_sort(std::array<int, N>& t) {
std::array<int, N> a = t;
for(auto i = 0u; i < N; ++i) {
auto k = 0u;
for(auto j = 0u; j < N; ++j) {
if(j > i)
k += int(a[i] > a[j]);
else if(j < i)
k += int(a[i] >= a[j]);
}
t[k] = a[i];
}
}
// index sort with uint_fast8_t counters
template <size_t N>
void index_sort_fast_t(std::array<int, N>& t) {
std::array<int, N> a = t;
for(uint_fast8_t i = 0u; i < N; ++i) {
uint_fast8_t k = 0u;
for(uint_fast8_t j = 0u; j < N; ++j) {
if(j > i)
k += uint_fast8_t(a[i] > a[j]);
else if(j < i)
k += uint_fast8_t(a[i] >= a[j]);
}
t[k] = a[i];
}
}
Conclusion
You can always win some performance by using clever tricks: something type-specific, compiler-specific, or processor-specific. However, even in the nano-world, the simplest solution often works best.