
Challenge your performance intuition with nanosecond sorting

In the physical world, nanotechnology predictably brings us surprises. When structured on the nano-scale, the most common materials, such as carbon, iron, or copper, gain new, unforeseen properties. What we know about them on the macro-scale does not apply anymore.

Surprisingly, a very similar effect happens in computation. If the operation you want to speed up already runs in a few nanoseconds, your reasoning about algorithmic complexity probably won't apply.

The most effective algorithms become mediocre while the useless rise from oblivion to shine and amaze. One of these algorithms is the index sort. It has unconditional O(n²) complexity, which makes it probably the least performant sorting algorithm you've never heard of. It's useless on the macro-scale. But on the nano-scale, it excels.

Index sort

The idea of index sort is this. For every item in the assorted array, you can find its index in the sorted array by comparing it to all the others.

Example
X (assorted): 3 2 4 2

Xi    vs. 3   vs. 2   vs. 4   vs. 2   Index
 3      -       >       >       >       2
 2     >=       -       >       >       0
 4     >=      >=       -       >       3
 2     >=      >=      >=       -       1

The index of an element is the number of comparisons that hold, so the sorted array is 2 2 3 4.

It's not algorithmically elegant. But it's deterministic, meaning that you can avoid branching completely. On small arrays, this should be a noticeable advantage.
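To make it concrete, here is the example above spelled out as a minimal sketch in C++ (the generic version appears in Round 2 below):

#include <array>
#include <cstdio>

int main() {
    // The example array; every element's position in the sorted
    // array is the number of comparisons that hold for it.
    const std::array<int, 4> x = {3, 2, 4, 2};
    std::array<int, 4> sorted{};
    for (auto i = 0u; i < x.size(); ++i) {
        auto index = 0u;
        for (auto j = 0u; j < x.size(); ++j) {
            // Strict comparison against the elements after x[i],
            // non-strict against the ones before, so that equal
            // elements land on distinct indices.
            if (j > i) index += int(x[i] > x[j]);
            if (j < i) index += int(x[i] >= x[j]);
        }
        sorted[index] = x[i];
    }
    for (auto value : sorted)
        std::printf("%d ", value); // prints: 2 2 3 4
}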

And now for the challenge

Let's compare how different sorting approaches work on triplets — arrays of three elements. You might think this is a rare task, but when you work with 3D graphics, it comes up surprisingly often. Last year alone, I sorted triplets of triangle indices, cubic equation roots, and axis-oriented bounding box dimensions.

Tasks like that are rarely bottlenecks, but if you can speed up your algorithm by even 12% with no measurable effort, it's probably worth doing.

Anyway, here is my benchmark.

#include <array>
#include <chrono>
#include <cstddef>
#include <iostream>
#include <random>

constexpr size_t samples = 10'000'000;
std::array<std::array<int, 3>, samples> g_data;

static void ResetData() {
    std::mt19937 rng(0);
    std::uniform_int_distribution<int> random_number(1, 100);
    for (auto& numbers : g_data) {
        numbers[0] = random_number(rng);
        numbers[1] = random_number(rng);
        numbers[2] = random_number(rng);
    }
}

int CheckDataForMissorts() {
    int missorts = 0;
    for (auto& numbers : g_data) {
        if(numbers[1] < numbers[0] || numbers[2] < numbers[1])
            missorts++;
    }
    return missorts;
}

#define MEASURE(CODE_TO_MEASURE, NAME_TO_PRINT) \
    { \
    ResetData();    \
    auto start = std::chrono::system_clock::now(); \
    for(auto& t : g_data) { \
        CODE_TO_MEASURE \
    }   \
    auto end = std::chrono::system_clock::now(); \
    std::chrono::duration<double> time = end - start; \
    std::cout << time.count() << " - " << NAME_TO_PRINT; \
    std::cout << "\n"; \
    std::cout << "missorts: " << CheckDataForMissorts(); \
    std::cout << "\n\n"; \
    }
    

It generates a lot of triplets, runs a piece of code a lot of times, and reports the total running time.
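The exact invocations are in the repository linked below; a round might be set up roughly like this (a sketch only, assuming the necessary headers and the functions from the rounds below are in scope):

int main() {
    // Each MEASURE call resets the data, runs the given snippet on
    // every triplet in g_data, and prints the time and missort count.
    MEASURE(std::sort(t.begin(), t.end());, "std::sort")
    MEASURE(index_sort(t);, "index_sort")
    return 0;
}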

All the pieces of code are compiled with GCC 5.4.0 as g++ -std=c++14 -O3 and measured on an Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz.

Everything (spoilers included) is available on GitHub.

Using your intuition and best judgment, please estimate the relative performance of each pair of code samples before checking the spoilers.

Round 1.

The main question, the one that has to be answered before anything else: why don't you use std::sort?

std::sort(t.begin(), t.end());
    
const auto a = t[0];
const auto b = t[1];
const auto c = t[2];
t[int(a > b) + int(a > c)] = a;
t[int(b >= a) + int(b > c)] = b;
t[int(c >= a) + int(c >= b)] = c;
	


Round 2.

Sorting triplets is fun, but can we generalise it with no additional overhead?

const auto a = t[0];
const auto b = t[1];
const auto c = t[2];
t[int(a > b) + int(a > c)] = a;
t[int(b >= a) + int(b > c)] = b;
t[int(c >= a) + int(c >= b)] = c;
    
template <size_t N>
void index_sort(std::array<int, N>& t) {
    std::array<int, N> a = t;
    for(auto i = 0u; i < N; ++i) {
        auto k = 0u;
        for(auto j = 0u; j < N; ++j) {
            if(j > i)
                k += int(a[i] > a[j]);
            else if(j < i)
                k += int(a[i] >= a[j]);
        }
        t[k] = a[i];
    }
}
    


Round 3.

For a triplet, there is another std::sort substitute, and it's just three conditional swaps.

template <size_t N>
void index_sort(std::array<int, N>& t) {
    std::array<int, N> a = t;
    for(auto i = 0u; i < N; ++i) {
        auto k = 0u;
        for(auto j = 0u; j < N; ++j) {
            if(j > i)
                k += int(a[i] > a[j]);
            else if(j < i)
                k += int(a[i] >= a[j]);
        }
        t[k] = a[i];
    }
}
    
void swap_sort(std::array<int, 3>& t) {
    auto sort_ = [](auto& a, auto& b) {
        if (a > b)
            std::swap(a, b);
    };
    sort_(t[0], t[1]);
    sort_(t[1], t[2]);
    sort_(t[0], t[1]);
}
    


Round 4.

What if we do std::min/std::max instead of conditional swaps?

void swap_sort(std::array<int, 3>& t) {
    auto sort_ = [](auto& a, auto& b) {
        if (a > b)
            std::swap(a, b);
    };
    sort_(t[0], t[1]);
    sort_(t[1], t[2]);
    sort_(t[0], t[1]);
}
    
void swap_sort_no_ifs
    (std::array<int, 3>& t) {
    auto sort_ = [](auto& a, auto& b) {
        const auto temp = std::min(a, b);
        b = std::max(a, b);
        a = temp;
    };
    sort_(t[0], t[1]);
    sort_(t[1], t[2]);
    sort_(t[0], t[1]);
}
    


Round 5.

You don't have to rely on the compiler to make the conditional swap deterministic. There is an arithmetic trick that lets you sort a pair of numbers yourself.

The half-sum of a pair minus the half-difference is the smaller number; the half-sum plus the half-difference is the greater. Of course, they are the same when the half-difference is 0.
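For example, take the pair (3, 7): the half-sum is 5 and the half-difference is 2, so the smaller number is 5 - 2 = 3 and the greater one is 5 + 2 = 7.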

This limits the applicability of this kind of sorting. With integers, you can accidentally overflow, and with floating-point numbers, you can lose precision if the numbers have different exponents. But let's just presume this works for us. Let's see which is faster.

void swap_sort_no_ifs
    (std::array<int, 3>& t) {
    auto sort_ = [](auto& a, auto& b) {
        const auto temp = std::min(a, b);
        b = std::max(a, b);
        a = temp;
    };
    sort_(t[0], t[1]);
    sort_(t[1], t[2]);
    sort_(t[0], t[1]);
}
    
void swap_sort_arithmetic_hack
    (std::array<int, 3>& t) {
    auto sort_ = [](auto& a, auto& b) {
        auto sum = a+b;
        auto diff = std::abs(a-b);
        a = (sum - diff) / 2;
        b = (sum + diff) / 2;
    };
    sort_(t[0], t[1]);
    sort_(t[1], t[2]);
    sort_(t[0], t[1]);
}
    


Round 6.

I'll be honest, I stole this from somewhere on the Internet, and I don't know exactly how it works. Apparently, it works just fine on an integer half-range, so it's also limited.

But would it be faster?

void swap_sort_arithmetic_hack
    (std::array<int, 3>& t) {
    auto sort_ = [](auto& a, auto& b) {
        auto sum = a+b;
        auto diff = std::abs(a-b);
        a = (sum - diff) / 2;
        b = (sum + diff) / 2;
    };
    sort_(t[0], t[1]);
    sort_(t[1], t[2]);
    sort_(t[0], t[1]);
}
    
void swap_sort_bit_hack
    (std::array<int, 3>& t) {
    auto sort_ = [](auto& x, auto& y) {
        constexpr auto shift =
            (sizeof(int) * CHAR_BIT - 1);
        // On this compiler and target, (x - y) >> shift is all ones
        // when x < y and all zeros otherwise, so the mask keeps
        // either (x - y) or 0: temp becomes min(x, y) and y becomes
        // max(x, y). The x - y subtraction is what restricts the
        // trick to a half-range.
        const auto temp =
            y + ((x - y) & ((x - y) >> shift));
        y = x - ((x - y) & ((x - y) >> shift));
        x = temp;
    };
    sort_(t[0], t[1]);
    sort_(t[1], t[2]);
    sort_(t[0], t[1]);
}
    


Bonus round.

Since we know that the algorithm is supposed to work on small arrays, maybe we can help the compiler pick better types. I have always been curious to try uint_fast8_t somewhere.

template <size_t N>
void index_sort(std::array<int, N>& t) {
    std::array<int, N> a = t;
    for(auto i = 0u; i < N; ++i) {
        auto k = 0u;
        for(auto j = 0u; j < N; ++j) {
            if(j > i)
                k += int(a[i] > a[j]);
            else if(j < i)
                k += int(a[i] >= a[j]);
        }
        t[k] = a[i];
    }
}
    
template <size_t N>
void index_sort_fast_t(std::array<int, N>& t) {
    std::array<int, N> a = t;
    for(uint_fast8_t i = 0u; i < N; ++i) {
        uint_fast8_t k = 0u;
        for(uint_fast8_t j = 0u; j < N; ++j) {
            if(j > i)
                k += uint_fast8_t(a[i] > a[j]);
            else if(j < i)
                k += uint_fast8_t(a[i] >= a[j]);
        }
        t[k] = a[i];
    }
}
    


Links

Challenge your performance intuition with C++ magic squares

Challenge your performance intuition with C++ operators

Challenge your performance intuition with C++ sine

Further reading: Learning the value of good benchmarking technique with C++ magic squares.

GitHub with all the experiments.