
Challenge your performance intuition with C++ sine

It's fascinating how the smallest detail always provokes the most discussion. I've been writing and talking about polynomial synthesis a lot, and I always use the sine function as an example. And there's always someone who approaches me with questions like: does modeling the sine make sense performance-wise? Couldn't I use a lookup table instead? Does it have to be a polynomial?

And I always have a two-part response for them. The first one is "it's just an example." And the second one — "Yes, but..."

The idea of polynomial synthesis, also known as polynomial modeling, is simple: if it's hard to compute some function, or even formulate it in a direct form — don't do that. Use a polynomial that behaves like it instead.

The sine function is a perfect specimen for polynomial modeling. It has a lot of interesting differential and integral properties. It has a lot of practical applications. And yes, quite often, you want to exchange a bit of precision for a bit of performance, and polynomial modeling is starting to make practical sense.

It doesn't have to be the sine specifically, but since it's the sine that provokes questions, I should probably answer them properly.

I wrote a small benchmark to measure the performance of different variants of the sine and did a series of experiments with it.

#include <chrono>
#include <iostream>
#include <random>
#include <array>
#include <vector>

int main() {
    constexpr auto n = 10000000;
    std::mt19937 rng(0);
    std::uniform_real_distribution<double> distribution(0, 2);
    std::vector<double> xs(n);
    for (auto &x : xs) {
        x = distribution(rng);
    }

    auto start = std::chrono::system_clock::now();

    auto side_effect = 0;
    for (auto i = 0u; i < n; ++i) {
        // The run time of the code here will be measured.
        // It has to affect the side effect so the compiler
        // wouldn't optimize it out.
    }

    auto end = std::chrono::system_clock::now();

    std::cout << "time: " << (end-start).count() * 1e-9
              << "  se: " << side_effect << "\n";
}

All the measurements are performed on an Intel(R) Core(TM) i7-9700F CPU @ 3.00GHz. The code is compiled with clang version 11.0.0-2~ubuntu20.04.1 with the flags -O2 -std=c++14 -march=native.

Simply laying down my numbers would be boring. So, as usual, I propose a game. Using your best judgment, please estimate the relative speed of two pieces of code. The goal is to minimize your error in slider pixels, so whoever gets closest to zero wins.

Round 1. Sine vs. a small lookup table

Ok, so why don't we try a lookup table first? The idea of a lookup table is simple: instead of computing a function, we look up its value in a precalculated table. Yes, there will be some amount of error, but we can control it by choosing between smaller but less accurate and larger but more precise tables.
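To get a feel for the error side of this trade-off, here is a quick estimate of the worst-case error for a 100-element table covering [0, 2). This is my own sketch, not a part of the benchmark:

#include <algorithm>
#include <cmath>
#include <iostream>

int main() {
    // Worst-case error of a 100-element table of sine on [0, 2),
    // indexed by rounding down, probed on a much finer grid.
    constexpr auto ln = 100;
    constexpr auto probes = 1000000;
    double max_error = 0.;
    for (auto i = 0; i < probes; ++i) {
        const double x = 2. * i / probes;
        const double from_table =
            std::sin(std::floor(x / 2. * ln) * 2. / ln);
        max_error = std::max(max_error,
            std::abs(from_table - std::sin(x)));
    }
    std::cout << "max error: " << max_error << "\n";  // about 0.02
}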

It's a nice way to balance speed and precision. Let's try it on a very small table of 100 elements.

double sum  = 0.;
for (auto i = 0u; i < n; ++i)
    sum += std::sin(xs[i]);

vs.

// the table covers the input range [0, 2)
constexpr auto ln = 100;
std::array<double, ln> lookup;
for(auto i = 0u; i < ln; ++i)
    lookup[i] = sin(2. * static_cast<double>(i)
        / static_cast<double>(ln));

...

double sum  = 0.;
for (auto i = 0u; i < n; ++i)
    sum += lookup[static_cast<size_t>(xs[i] / 2. * ln)];


Round 2. Sine vs. a large lookup table

Let's emulate a larger application by forcing the lookup table to compete for cache storage with itself. Let's make the table unreasonably large.

double sum  = 0.;
for (auto i = 0u; i < n; ++i)
    sum += std::sin(xs[i]);

vs.

// the table covers the input range [0, 2);
// at 10 000 000 elements, it's too large for the stack,
// so it goes to the heap
constexpr auto ln = 10000000;
std::vector<double> lookup(ln);
for(auto i = 0u; i < ln; ++i)
    lookup[i] = sin(2. * static_cast<double>(i)
        / static_cast<double>(ln));

...

double sum  = 0.;
for (auto i = 0u; i < n; ++i)
    sum += lookup[static_cast<size_t>(xs[i] / 2. * ln)];


Round 3. Double-precision sine vs. single-precision sine

Another option to exchange precision for speed would be to choose a smaller carrier type for the computation. Or would it?

In the early days of floating-point coprocessors, there was indeed a performance difference between computations in single and double precision. Then, on x87, it vanished: all the computation was done in 80-bit numbers regardless of the input and output types. Then, with vectorization, the difference came back, since you can fit more floats than doubles into a single wide register. Also, you can fit more numbers in your cache and read more of them with a single cache line.
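For example, assuming 256-bit vector registers and 64-byte cache lines, which is a reasonable guess for the CPU above, the ratios are easy to compute. A quick illustration of mine, not a part of the benchmark:

#include <iostream>

int main() {
    // Assuming a 256-bit (32-byte) vector register and a 64-byte cache line.
    constexpr auto register_bytes = 32u;
    constexpr auto cache_line_bytes = 64u;
    std::cout << register_bytes / sizeof(float) << " floats or "
              << register_bytes / sizeof(double)
              << " doubles per register\n";
    std::cout << cache_line_bytes / sizeof(float) << " floats or "
              << cache_line_bytes / sizeof(double)
              << " doubles per cache line\n";
}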

So what is the difference these days?

std::vector<double> ds(n);
std::vector<float> fs(n);
for (auto i = 0u; i < n; ++i) {
    const auto x = distribution(rng);
    ds[i] = x;
    fs[i] = static_cast<float>(x);
}

...

double sum  = 0.;
for (auto i = 0u; i < n; ++i)
    sum += std::sin(ds[i]);

vs.

std::vector<double> ds(n);
std::vector<float> fs(n);
for (auto i = 0u; i < n; ++i) {
    const auto x = distribution(rng);
    ds[i] = x;
    fs[i] = static_cast<float>(x);
}

...

float sum  = 0.;
for (auto i = 0u; i < n; ++i)
    sum += std::sin(fs[i]);


Round 4. sin() vs. std::sin()

While doing the previous experiment, I noticed one interesting thing. sin() and std::sin() in C++ are two completely different functions. Now, without googling, can you guess what their performance difference is on single-precision floating-point numbers?
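A hint, and an assumption on my part about a typical Linux setup: the global sin() is the C library function that takes and returns a double, so a float argument gets promoted, while std::sin has a dedicated float overload. A minimal sketch to see the difference:

#include <cmath>
#include <iostream>

int main() {
    float x = 0.5f;
    // The C function promotes the float to double and returns a double...
    std::cout << sizeof(sin(x)) << "\n";       // prints 8 on this setup
    // ...while std::sin picks the float overload and returns a float.
    std::cout << sizeof(std::sin(x)) << "\n";  // prints 4
}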

float sum  = 0.;
for (auto i = 0u; i < n; ++i)
    sum += std::sin(fs[i]);

vs.

float sum  = 0.;
for (auto i = 0u; i < n; ++i)
    sum += sin(fs[i]);


Round 5. Sine vs. a polynomial model

Ok, now for the polynomial model. There is no single best way to model a sine. You can use the Taylor series, or you can write down the properties of the function you want to model, turn them into a linear system, and then solve the system and collect the coefficients for your polynomial.

Let's make a model out of two values of the sine, two of its derivatives, one integral, and the fact that the sine is anti-symmetric, which leaves only the odd powers in the polynomial. Here's the Python code that computes the model for us.

from sympy import *
from math import pi

a, b, c, d, x = symbols('a b c d x')

sine = a*x**7 + b*x**5 + c*x**3 + d*x
sine_d = diff(sine, x)
sine_i = integrate(sine, x)

the_system = [
    sine_i.subs(x, pi / 2) - sine_i.subs(x, 0) - 1,
    sine_d.subs(x, 0) - 1,
    sine_d.subs(x, pi / 2),
    sine.subs(x, pi / 2) - 1,
    # implicitly: sine.subs(x, 0) - 0
]

res = solve(the_system, (a, b, c, d))

for var, exp in res.items():
    print(var, exp)

This model isn't very accurate, but it still looks a lot like the real sine on its most important interval, and when you turn it into a periodic function, it remains continuous and smooth on the whole range of real numbers. A lookup table doesn't even give you continuity, and a periodized Taylor polynomial wouldn't grant you smoothness.
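To make this concrete, here is a minimal sketch of how the model, with the coefficients the SymPy code above produces, extends into a periodic sine for the whole real axis. This is an illustration of mine, not a part of the benchmark:

#include <cmath>

// The polynomial model, valid on [0, pi / 2].
double sine_model(double x) {
    return -0.000182690409228785*x*x*x*x*x*x*x
           +0.00830460224186793*x*x*x*x*x
           -0.166651012143690*x*x*x
           +x;
}

// Extending the model to all real numbers with the sine's symmetries.
double periodic_sine(double x) {
    constexpr double pi = 3.14159265358979323846;
    x = std::fmod(x, 2. * pi);                  // now in (-2*pi, 2*pi)
    if (x < 0.) x += 2. * pi;                   // now in [0, 2*pi)
    if (x > pi) return -periodic_sine(x - pi);  // sin(x) = -sin(x - pi)
    if (x > pi / 2.) x = pi - x;                // sin(x) = sin(pi - x)
    return sine_model(x);
}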

It's also fast.

double sum  = 0.;
for (auto i = 0u; i < n; ++i)
    sum += std::sin(xs[i]);

vs.

double sum  = 0.;
for (auto i = 0u; i < n; ++i) {
    const auto x = xs[i];
    const auto sine =
        -0.000182690409228785*x*x*x*x*x*x*x
        +0.00830460224186793*x*x*x*x*x
        -0.166651012143690*x*x*x
        +x;
    sum += sine;
}


Round 6. Multiplication vs. division in pure computation

But why polynomial? What makes it so special?

Nothing, except that it is the cheapest thing to compute. Computers do additions and multiplications astonishingly well, but even division is comparatively slow. Not to mention exponents and logarithms, which, on the hardware level, are usually substituted with polynomial models anyway.
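To give a sense of how cheap a polynomial is, the seventh-degree model above evaluates with 5 multiplications and 3 additions in Horner's scheme, and not a single division. A sketch of mine, not the benchmark code:

// The degree-7 odd polynomial model in Horner's form:
// 5 multiplications, 3 additions, no divisions.
double sine_horner(double x) {
    const double xx = x * x;
    return x * (1. + xx * (-0.166651012143690
              + xx * (0.00830460224186793
              + xx * (-0.000182690409228785))));
}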

So let's compare two equivalent operations: one made entirely of multiplications, and the other of divisions.

const int n = 10000000;
volatile double a = 2.;
volatile double b = 0.5;

...

double result = 1.;
for (auto i = 0u; i < n; ++i)
    result = result * a * b;

vs.

const int n = 10000000;
volatile double a = 2.;
volatile double b = 0.5;

...

double result = 1.;
for (auto i = 0u; i < n; ++i)
    result = result / a / b;


Round 7. Division vs. multiplication in an array

But the difference between division and multiplication is not necessarily that pronounced. In fact, computation itself is rarely the bottleneck these days. You would normally spend more time waiting for the memory bus than doing actual divisions and multiplications.

constexpr auto n = 10000000;
std::mt19937 rng(0);
std::uniform_int_distribution<unsigned int>
    distribution(0, 2);
std::vector<unsigned int> xs(n);
for (auto &x : xs) {
    x = distribution(rng);
}
volatile double d = 0.5;
volatile double m = 2.0;

...

auto sum  = 0;
for (auto i = 0u; i < n; ++i)
    sum += xs[i] * m;

vs.

constexpr auto n = 10000000;
std::mt19937 rng(0);
std::uniform_int_distribution<unsigned int>
    distribution(0, 2);
std::vector<unsigned int> xs(n);
for (auto &x : xs) {
    x = distribution(rng);
}
volatile double d = 0.5;
volatile double m = 2.0;

...

auto sum  = 0;
for (auto i = 0u; i < n; ++i)
    sum += xs[i] / d;


Links

Challenge your performance intuition with C++ magic squares

Challenge your performance intuition with nanosecond sorting

Challenge your performance intuition with C++ operators

Further reading: Learning the value of good benchmarking technique with C++ magic squares.

GitHub with all the experiments.