
Check if your performance intuition still works with CUDA

For those of you who don't know what CUDA is, let me explain. Imagine buses were never invented. There are cars, trains, planes, and motorcycles, just not buses. And one day someone smart asks himself: “wouldn't it be splendid to have cars that would fit a lot of people? One guy could be driving, and all the rest would enjoy the ride.” “Right, like trucks but for people!” “No-no-no, who on earth would ever want to travel by truck? Like airplanes, but without wings!”

And so they start producing planes with foldable wings, and they do carry a lot of people, and they do go on public roads, and they are a huge success! Especially when it comes to commuting cryptominers to their cryptomines and back.

Now replace “cars” with “personal computers”, and “planes” with “GPUs” — and that's CUDA for you.

CUDA stands for Compute Unified Device Architecture and it is meant to revolutionize computing by making drastically different computational devices work together. It's just that we don't have a lot of those devices on our hands. Or in production. Or even in development. Maybe in the future. Meanwhile, everyone has a GPU, so for now, CUDA is a technology that turns your GPU into a micro-super-computer.

With CUDA, you can write code in C, Fortran, or C++, and make your graphics card run it using its own resources: memory, cache, processors. It has a lot of computing units working in parallel, so if your task is easily parallelizable, you can make it run much faster on the GPU than on the CPU. Of course, these units differ vastly from the general-purpose processors we're all used to, and our performance intuition may not transfer.
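
To make this concrete, here is roughly what it looks like. A function marked __global__ is a kernel: it runs on the GPU, once per thread, and the triple-angle-bracket syntax on the host side says how many threads to launch. This is only a sketch of the idea, not the benchmark; the actual benchmark comes a bit later.

// a sketch: every GPU thread adds one pair of numbers
__global__ void add(const float *xs, const float *ys, float *zs, int size) {
  int i = blockDim.x * blockIdx.x + threadIdx.x;
  if (i < size)
    zs[i] = xs[i] + ys[i];
}

// ...and on the CPU side you launch it almost like a regular function:
// add<<<blocksPerGrid, threadsPerBlock>>>(d_xs, d_ys, d_zs, size);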

So let's see which optimizations work on the GPU and which don't.

But before we start: I have done a whole series of quizzes about micro-optimizations. It's called “Challenge your performance intuition...” with something-something.

1) ...with C++ magic squares;

2) ...with nanosecond sorting;

3) ...with C++ operators;

4) ...with C++ sine.

The code for these quizzes was only measured on the CPU, and mostly on Intel. So what I want to do now is take a few micro-optimizations from these quizzes, rerun them on the GPU, and see whether they still work or not. And, of course, the whole exercise is going to be a quiz too.

This is my benchmark.

#include <iostream>
#include <vector>
#include <random>
#include <cuda_runtime.h>


using TheType = float;
constexpr auto TheSize = 65536u*128u;
constexpr auto TheSizeInBytes = TheSize*sizeof(TheType);
constexpr auto TheInnerLoop = 256u;


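// wraps a CUDA API call: on error, prints the message and the line number, and returns from the enclosing function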
#define attempt(smth) { \
    auto s=(smth); \
    if(s!=cudaSuccess) { \
        std::cout << cudaGetErrorString(s) << \
        " at " << __LINE__ << "\n"; return -1; \
    } \
}

#define measure(smth) {\
    /*timestamp start*/\
    cudaEvent_t start;\
    cudaEventCreate(&start);\
    cudaEventRecord(start, 0);\
    cudaEvent_t stop;\
    cudaEventCreate(&stop); \
\
    /* run it*/\
    int threadsPerBlock = 256;\
    int blocksPerGrid = \
        (TheSize - TheInnerLoop + threadsPerBlock - 1) \
        / threadsPerBlock;\
    smth<<<blocksPerGrid, threadsPerBlock>>> \
        (d_xs, d_ys, d_zs, TheSize);\
    attempt(cudaGetLastError());\
    attempt(cudaDeviceSynchronize());\
\
    /* timestamp stop*/\
    cudaEventRecord(stop, 0); \
    cudaEventSynchronize(stop);\
    float elapsedTime;\
    cudaEventElapsedTime(&elapsedTime, start, stop);\
    std::cout << "Time of " << #smth << \
        ": " << elapsedTime << "\n";}\


int main(void)
{
    // prepare the data
    std::mt19937 rng(0);
    std::uniform_real_distribution<TheType> distribution(0.f, 1.f);
    std::vector<TheType> xs(TheSize);
    std::vector<TheType> ys(TheSize);
    std::vector<TheType> zs(TheSize);
    for (TheType &number : xs) number = distribution(rng);
    for (TheType &number : ys) number = distribution(rng);


    // do the allocations
    float *d_xs = nullptr;
    float *d_ys = nullptr;
    float *d_zs = nullptr;
    attempt(cudaMalloc((void **)&d_xs, TheSizeInBytes));
    attempt(cudaMalloc((void **)&d_ys, TheSizeInBytes));
    attempt(cudaMalloc((void **)&d_zs, TheSizeInBytes));

    // and copying
    attempt(cudaMemcpy(d_xs, xs.data(), TheSizeInBytes,
        cudaMemcpyHostToDevice));
    attempt(cudaMemcpy(d_ys, ys.data(), TheSizeInBytes,
        cudaMemcpyHostToDevice));

    measure(some_kernel); // a placeholder: one of the quiz kernels below goes here


    // copy the results back (only for debugging, not needed for the measurement)
    attempt(cudaMemcpy(zs.data(), d_zs, TheSizeInBytes,
        cudaMemcpyDeviceToHost));

    attempt(cudaFree(d_xs));
    attempt(cudaFree(d_ys));
    attempt(cudaFree(d_zs));
    return 0;
}

All the measurements are performed on a GeForce GTX 1050 Ti Mobile.

Now I'm going to show you snippets of code in pairs. For each pair, please set what you think their relative performance is with the slider below the code (yes, it's a slider), and press the “Reveal the truth” button. Instead of points, you'll get pixels of error: that's how far the slider is set from the real number.

The goal of the quiz is to score as few pixels of error as possible.

Round 1. mul vs. div

There is a well-known optimization quoted among others in Modern Fortran Explained.

For example, if a, b, and c are real variables, the expression a / b / c might be evaluated as a / (b * c) on a processor that can multiply much faster than it can divide.

So... there are processors that are not like that? Let's see if my GPU can divide as fast as it can multiply.

__global__ void mul(const float *xs1,
                    const float *xs2,
                    float *ys,
                    int size) {
  int i = (blockDim.x * blockIdx.x + threadIdx.x);
  auto res = 0.f;
  for(auto j = 0u; j < TheInnerLoop; ++j) {
    res += xs1[i+j] * xs2[i+j];
  }
  ys[i] = res;
}
__global__ void div(const float *xs1,
                    const float *xs2,
                    float *ys,
                    int size) {
  int i = (blockDim.x * blockIdx.x + threadIdx.x);
  auto res = 0.f;
  for(auto j = 0u; j < TheInnerLoop; ++j) {
    res += xs1[i+j] / xs2[i+j];
  }
  ys[i] = res;
}

Set your estimate with a slider and press the button below.

They are almost the same.

Round 2. mul vs. div --use_fast_math

CUDA has a magic compiler flag that promises to make math faster. It's called --use_fast_math. What it does is relax the precision requirements for division (--prec-div=false) and the square root function (--prec-sqrt=false), allow the processor to flush denormalized floating-point numbers to zero (--ftz=true), and allow the fused instruction that does a multiplication and an addition in one go (--fmad=true).

For the previous experiment, --prec-div=false is the most relevant one. Let's see how fast the division becomes with its precision constraints lifted.

// nvcc --use_fast_math
__global__ void mul(const float *xs1,
                    const float *xs2,
                    float *ys,
                    int size) {
  int i = (blockDim.x * blockIdx.x + threadIdx.x);
  auto res = 0.f;
  for(auto j = 0u; j < TheInnerLoop; ++j) {
    res += xs1[i+j] * xs2[i+j];
  }
  ys[i] = res;
}
// nvcc --use_fast_math
__global__ void div(const float *xs1,
                    const float *xs2,
                    float *ys,
                    int size) {
  int i = (blockDim.x * blockIdx.x + threadIdx.x);
  auto res = 0.f;
  for(auto j = 0u; j < TheInnerLoop; ++j) {
    res += xs1[i+j] / xs2[i+j];
  }
  ys[i] = res;
}

They are almost the same.
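
By the way, if you don't want to relax precision for the whole program, CUDA also exposes fast per-call intrinsics. A sketch, not part of the quiz: __fdividef is the fast, less precise single-precision division, so you can trade precision for speed in one spot instead of everywhere.

__global__ void fast_div(const float *xs1,
                         const float *xs2,
                         float *ys,
                         int size) {
  int i = (blockDim.x * blockIdx.x + threadIdx.x);
  auto res = 0.f;
  for(auto j = 0u; j < TheInnerLoop; ++j) {
    // __fdividef is the fast division intrinsic
    res += __fdividef(xs1[i+j], xs2[i+j]);
  }
  ys[i] = res;
}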

Round 3. sqrt vs. sqrt --prec-sqrt=false

How well does this fast math work for square roots?

__global__ void std_sqrt(const float *xs1,
                         const float *xs2,
                         float *ys,
                         int size) {
  int i = (blockDim.x * blockIdx.x + threadIdx.x);
  auto res = 0.f;
  for(auto j = 0u; j < TheInnerLoop; ++j) {
    res += std::sqrt(xs1[i+j]);
  }
  ys[i] = res;
}
// nvcc --prec-sqrt=false
__global__ void std_sqrt(const float *xs1,
                         const float *xs2,
                         float *ys,
                         int size) {
  int i = (blockDim.x * blockIdx.x + threadIdx.x);
  auto res = 0.f;
  for(auto j = 0u; j < TheInnerLoop; ++j) {
    res += std::sqrt(xs1[i+j]);
  }
  ys[i] = res;
}

They are almost the same.

Round 4. sine vs. polynomial sine

There is a way to emulate the sine function with a fast but inaccurate polynomial, described in Challenge your performance intuition with C++ sine. Let's check if it works for CUDA.

__global__ void std_sin(const float *xs1,
                        const float *xs2,
                        float *ys,
                        int size) {
  int i = (blockDim.x * blockIdx.x + threadIdx.x);
  auto res = 0.f;
  for(auto j = 0u; j < TheInnerLoop; ++j) {
    res += std::sin(xs1[i+j]);
  }
  ys[i] = res;
}
__global__ void poly_sin(const float *xs1,
                         const float *xs2,
                         float *ys,
                         int size) {
  int i = (blockDim.x * blockIdx.x + threadIdx.x);
  auto res = 0.f;
  for(auto j = 0u; j < TheInnerLoop; ++j) {
    const auto x = xs1[i+j];
    res += -0.000182690409228785*x*x*x*x*x*x*x
           +0.00830460224186793*x*x*x*x*x
           -0.166651012143690*x*x*x
           +x;
  }
  ys[i] = res;
}

They are almost the same.
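
If you're curious how rough the polynomial approximation actually is, a quick host-side check is enough. A sketch, not part of the benchmark; it probes the error on [0, 1], which is the range of the benchmark's input data.

#include <algorithm>
#include <cmath>
#include <cstdio>

int main() {
    // worst-case error of the polynomial against std::sin on [0, 1]
    double max_err = 0.;
    for (int k = 0; k <= 1000; ++k) {
        const double x = k / 1000.;
        const double poly = -0.000182690409228785*x*x*x*x*x*x*x
                            +0.00830460224186793*x*x*x*x*x
                            -0.166651012143690*x*x*x
                            +x;
        max_err = std::max(max_err, std::abs(poly - std::sin(x)));
    }
    std::printf("max error on [0, 1]: %g\n", max_err);
    return 0;
}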

Round 5. sine: double vs. float

How much faster would it be to compute the polynomial in native single-precision floats instead of the double precision we implicitly requested with the unsuffixed constants?

__global__ void poly_sin(const float *xs1,
                         const float *xs2,
                         float *ys,
                         int size) {
  int i = (blockDim.x * blockIdx.x + threadIdx.x);
  auto res = 0.f;
  for(auto j = 0u; j < TheInnerLoop; ++j) {
    const auto x = xs1[i+j];
    res += -0.000182690409228785*x*x*x*x*x*x*x
           +0.00830460224186793*x*x*x*x*x
           -0.166651012143690*x*x*x
           +x;
  }
  ys[i] = res;
}
__global__ void poly_sin3(const float *xs1,
                          const float *xs2,
                          float *ys,
                          int size) {
  int i = (blockDim.x * blockIdx.x + threadIdx.x);
  auto res = 0.f;
  for(auto j = 0u; j < TheInnerLoop; ++j) {
    const auto x = xs1[i+j];
    res += -0.000182690409228785f*x*x*x*x*x*x*x
           +0.00830460224186793f*x*x*x*x*x
           -0.166651012143690f*x*x*x
           +x;
  }
  ys[i] = res;
}

They are almost the same.

Round 6. Polynomial vs. Horner's scheme

There is a way to evaluate a polynomial with n coefficients using only n-1 multiplications and n-1 additions. It's called Horner's scheme.

ax³ + bx² + cx + d = x(x(xa + b) + c) + d

Normally, this is something a compiler can figure out by itself. But would it?

__global__ void poly_sin3(const float *xs1,
                          const float *xs2,
                          float *ys,
                          int size) {
  int i = (blockDim.x * blockIdx.x + threadIdx.x);
  auto res = 0.f;
  for(auto j = 0u; j < TheInnerLoop; ++j) {
    const auto x = xs1[i+j];
    res += -0.000182690409228785f*x*x*x*x*x*x*x
           +0.00830460224186793f*x*x*x*x*x
           -0.166651012143690f*x*x*x
           +x;
  }
  ys[i] = res;
}
__global__ void poly_sin4(const float *xs1,
                          const float *xs2,
                          float *ys,
                          int size) {
  int i = (blockDim.x * blockIdx.x + threadIdx.x);
  auto res = 0.f;
  for(auto j = 0u; j < TheInnerLoop; ++j) {
    const auto x = xs1[i+j];
    res += x*(x*x*(x*x*(x*x*
           -0.000182690409228785f
           +0.00830460224186793f)
           -0.166651012143690f)
           +1.f);
  }
  ys[i] = res;
}

They are almost the same.
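
For reference, in the general case Horner's scheme is just a left-to-right fold over the coefficients. A sketch:

// evaluates the degree-n polynomial c[0]*x^n + c[1]*x^(n-1) + ... + c[n]
// with n multiplications and n additions
float horner(const float *c, int n, float x) {
    float acc = c[0];
    for (int k = 1; k <= n; ++k)
        acc = acc * x + c[k];
    return acc;
}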

Round 7. sine --use_fast_math vs. Horner's sine

Ok, but we haven't tried running the standard sine with fast math enabled. Let's compare it to our single-precision Horner polynomial.

// nvcc --use_fast_math
__global__ void std_sin(const float *xs1,
                        const float *xs2,
                        float *ys,
                        int size) {
  int i = (blockDim.x * blockIdx.x + threadIdx.x);
  auto res = 0.f;
  for(auto j = 0u; j < TheInnerLoop; ++j) {
    res += std::sin(xs1[i+j]);
  }
  ys[i] = res;
}
__global__ void poly_sin4(const float *xs1,
                          const float *xs2,
                          float *ys,
                          int size) {
  int i = (blockDim.x * blockIdx.x + threadIdx.x);
  auto res = 0.f;
  for(auto j = 0u; j < TheInnerLoop; ++j) {
    const auto x = xs1[i+j];
    res += x*(x*x*(x*x*(x*x*
           -0.000182690409228785f
           +0.00830460224186793f)
           -0.166651012143690f)
           +1.f);
  }
  ys[i] = res;
}

They are almost the same.
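
For completeness, the fast sine is also available as an intrinsic, __sinf, so you can have it in one place without compiling everything with --use_fast_math. A sketch, not part of the quiz:

__global__ void intrinsic_sin(const float *xs1,
                              const float *xs2,
                              float *ys,
                              int size) {
  int i = (blockDim.x * blockIdx.x + threadIdx.x);
  auto res = 0.f;
  for(auto j = 0u; j < TheInnerLoop; ++j) {
    // __sinf is the fast, less precise single-precision sine
    res += __sinf(xs1[i+j]);
  }
  ys[i] = res;
}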

Round 8. Logical && vs. bitwise &

In Challenge your performance intuition with C++ operators, a few alternatives to the logical operations are examined. Apparently, on x86, short-circuiting in logical operations is not necessarily beneficial. Sometimes we want it off. In C++ you don't have legitimate control over it, so we have to use tricks to hint the compiler about our intentions.

Sometimes, using bitwise operations instead of logical ones works.

__global__ void logical_and(const float *xs1,
                            const float *xs2,
                            float *ys,
                            int size) {
  int i = (blockDim.x * blockIdx.x + threadIdx.x);
  bool all_gt = true;
  for(auto j = 0u; j < TheInnerLoop - 3; ++j) {
    all_gt = all_gt
           && (xs1[i+j] > xs2[i+j])
           && (xs1[i+j+1] > xs2[i+j+1])
           && (xs1[i+j+2] > xs2[i+j+2])
           && (xs1[i+j+3] > xs2[i+j+3]);
  }
  ys[i] = all_gt ? 1.f : 0.f;
}
__global__ void bit_and(const float *xs1,
                        const float *xs2,
                        float *ys,
                        int size) {
  int i = (blockDim.x * blockIdx.x + threadIdx.x);
  bool all_gt = true;
  for(auto j = 0u; j < TheInnerLoop - 3; ++j) {
    all_gt = all_gt
           & (xs1[i+j] > xs2[i+j])
           & (xs1[i+j+1] > xs2[i+j+1])
           & (xs1[i+j+2] > xs2[i+j+2])
           & (xs1[i+j+3] > xs2[i+j+3]);
  }
  ys[i] = all_gt ? 1.f : 0.f;
}

They are almost the same.

Round 9. Logical && vs. mul trick

The other trick to make a compiler skip the short-circuiting is turning logic into arithmetic.

__global__ void logical_and(const float *xs1,
                            const float *xs2,
                            float *ys,
                            int size) {
  int i = (blockDim.x * blockIdx.x + threadIdx.x);
  bool all_gt = true;
  for(auto j = 0u; j < TheInnerLoop - 3; ++j) {
    all_gt = all_gt
           && (xs1[i+j] > xs2[i+j])
           && (xs1[i+j+1] > xs2[i+j+1])
           && (xs1[i+j+2] > xs2[i+j+2])
           && (xs1[i+j+3] > xs2[i+j+3]);
  }
  ys[i] = all_gt ? 1.f : 0.f;
}
__global__ void mul_and(const float *xs1,
                        const float *xs2,
                        float *ys,
                        int size) {
  int i = (blockDim.x * blockIdx.x + threadIdx.x);
  bool all_gt = true;
  for(auto j = 0u; j < TheInnerLoop - 3; ++j) {
    all_gt = all_gt
           * (xs1[i+j] > xs2[i+j])
           * (xs1[i+j+1] > xs2[i+j+1])
           * (xs1[i+j+2] > xs2[i+j+2])
           * (xs1[i+j+3] > xs2[i+j+3]);
  }
  ys[i] = all_gt ? 1.f : 0.f;
}

They are almost the same.

Round 10. swap sort vs. index sort

In Challenge your performance intuition with nanosecond sorting, an index sort is compared with the standard sort. We can't call std::sort from device code, so let's pit the index sort against a swap sort instead. The index sort works by counting, for each element, how many elements it should go after, and writing it straight into its final position.

#define swap(a, b) {auto c = a; a = b; b = c;}

__global__ void sort(const float *xs1,
                     const float *xs2,
                     float *ys,
                     int size) {
  int i = (blockDim.x * blockIdx.x + threadIdx.x);
  float checksum = 0.;
  for(auto j = 0u; j < TheInnerLoop - 2; ++j) {
    double s[3] = {xs1[i+j],xs1[i+j+1],xs1[i+j+2]};
    if(s[0] > s[1])
      swap(s[0], s[1]);
    if(s[1] > s[2])
      swap(s[1], s[2]);
    if(s[0] > s[1])
      swap(s[0], s[1]);
    checksum += s[0] + 2*s[1] + 3*s[2];
  }
  ys[i] = checksum;
}
__global__ void index_sort(const float *xs1,
                           const float *xs2,
                           float *ys,
                           int size) {
  int i = (blockDim.x * blockIdx.x + threadIdx.x);
  float checksum = 0.;
  for(auto j = 0u; j < TheInnerLoop - 2; ++j) {
    double s[3] = {xs1[i+j],xs1[i+j+1],xs1[i+j+2]};
    const auto a = s[0];
    const auto b = s[1];
    const auto c = s[2];
    s[int(a > b) + int(a > c)] = a;
    s[int(b >= a) + int(b > c)] = b;
    s[int(c >= a) + int(c >= b)] = c;
    checksum += s[0] + 2*s[1] + 3*s[2];
  }
  ys[i] = checksum;
}

They are almost the same.

More like this:

Challenge your performance intuition with C++ magic squares

Challenge your performance intuition with nanosecond sorting

Challenge your performance intuition with C++ operators

Challenge your performance intuition with C++ sine

Also, there's a GitHub repository with all the experiments.