For those of you who don't know what CUDA is, let me explain. Imagine buses were never invented. There are cars, trains, planes, and motorcycles, just not buses. And one day someone smart asks himself: “Wouldn't it be splendid to have cars that fit a lot of people? One guy could be driving, and all the rest would enjoy the ride.” “Right, like trucks but for people!” “No-no-no, who on earth would ever want to travel by truck? Like airplanes, but without wings!”
And so they start producing planes with foldable wings, and they do carry a lot of people, and they do go on public roads, and they are a huge success! Especially when it comes to commuting cryptominers to their cryptomines and back.
Now replace “cars” with “personal computers”, and “planes” with “GPUs” — and that's CUDA for you.
CUDA stands for Compute Unified Device Architecture and it is meant to revolutionize computing by making drastically different computational devices work together. It's just that we don't have a lot of those devices on our hands. Or in production. Or even in development. Maybe in the future. Meanwhile, everyone has a GPU, so for now, CUDA is a technology that turns your GPU into a micro-super-computer.
With CUDA, you can write code in C, C++, or Fortran and have your graphics card run it using its own resources: memory, cache, processors. A GPU has a lot of computing units working in parallel, so if your task is easily parallelizable, you can make it run much faster on the GPU than on the CPU. Of course, these units differ vastly from the general-purpose processors we are all used to, and our performance intuition doesn't necessarily transfer.
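To make that concrete, here is roughly what the programming model looks like: a function marked __global__, called a kernel, which every GPU thread runs for its own index, launched from the CPU with the triple-angle-bracket syntax. This is a minimal sketch of my own for illustration, not the benchmark used below:

    #include <cuda_runtime.h>

    // Each GPU thread executes this function for one element.
    __global__ void add(const float *a, const float *b, float *c, int n) {
        int i = blockDim.x * blockIdx.x + threadIdx.x;
        if (i < n)
            c[i] = a[i] + b[i];
    }

    // Launched from the host side, e.g.:
    //   add<<<blocks, threadsPerBlock>>>(d_a, d_b, d_c, n);
    // where d_a, d_b, d_c point to GPU memory allocated with cudaMalloc.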
So let's see which optimizations work on the GPU and which don't.
But before we start: I have done a whole series of quizzes about micro-optimizations. They are all called “Challenge your performance intuition with...” something-something:

1) ...with C++ magic squares;
2) ...with nanosecond sorting;
3) ...with C++ operators;
4) ...with C++ sine.
The code for these quizzes was only measured on CPUs, and mostly on Intel. So what I want to do now is take a few micro-optimizations from these quizzes, rerun them on a GPU, and see whether they still work. And, of course, the whole exercise is going to be a quiz too.
This is my benchmark.
#include <iostream>
#include <vector>
#include <random>
#include <cuda_runtime.h>

using TheType = float;
constexpr auto TheSize = 65536u*128u;
constexpr auto TheSizeInBytes = TheSize*sizeof(TheType);
constexpr auto TheInnerLoop = 256u;

#define attempt(smth) { \
    auto s = (smth); \
    if(s != cudaSuccess) { \
        std::cout << cudaGetErrorString(s) << \
            " at " << __LINE__ << "\n"; \
        return -1; \
    } \
}

#define measure(smth) { \
    /* timestamp start */ \
    cudaEvent_t start; \
    cudaEventCreate(&start); \
    cudaEventRecord(start, 0); \
    cudaEvent_t stop; \
    cudaEventCreate(&stop); \
    \
    /* run it */ \
    int threadsPerBlock = 256; \
    int blocksPerGrid = \
        (TheSize - TheInnerLoop + threadsPerBlock - 1) \
        / threadsPerBlock; \
    smth<<<blocksPerGrid, threadsPerBlock>>> \
        (d_xs, d_ys, d_zs, TheSize); \
    attempt(cudaGetLastError()); \
    attempt(cudaDeviceSynchronize()); \
    \
    /* timestamp stop */ \
    cudaEventRecord(stop, 0); \
    cudaEventSynchronize(stop); \
    float elapsedTime; \
    cudaEventElapsedTime(&elapsedTime, start, stop); \
    std::cout << "Time of " << #smth << \
        ": " << elapsedTime << "\n"; \
}

int main(void) {
    // prepare the data
    std::mt19937 rng(0);
    std::uniform_real_distribution<TheType> distribution(0.f, 1.f);
    std::vector<TheType> xs(TheSize);
    std::vector<TheType> ys(TheSize);
    std::vector<TheType> zs(TheSize);
    for (TheType &number : xs)
        number = distribution(rng);
    for (TheType &number : ys)
        number = distribution(rng);

    // do the allocations
    float *d_xs = nullptr;
    float *d_ys = nullptr;
    float *d_zs = nullptr;
    attempt(cudaMalloc((void **)&d_xs, TheSizeInBytes));
    attempt(cudaMalloc((void **)&d_ys, TheSizeInBytes));
    attempt(cudaMalloc((void **)&d_zs, TheSizeInBytes));

    // and copying
    attempt(cudaMemcpy(d_xs, xs.data(), TheSizeInBytes,
        cudaMemcpyHostToDevice));
    attempt(cudaMemcpy(d_ys, ys.data(), TheSizeInBytes,
        cudaMemcpyHostToDevice));

    measure(some kernel function);

    // back (for debug, don't really want it)
    attempt(cudaMemcpy(zs.data(), d_zs, TheSizeInBytes,
        cudaMemcpyDeviceToHost));

    attempt(cudaFree(d_xs));
    attempt(cudaFree(d_ys));
    attempt(cudaFree(d_zs));
    return 0;
}
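The grid-size arithmetic in the measure macro looks odd, so here is a quick sanity check of my own (not part of the original benchmark), reusing the constants above: every thread computes one output element and reads TheInnerLoop consecutive inputs, so the last launched thread must still stay within the arrays.

    // Assumes TheSize and TheInnerLoop from the harness above.
    constexpr auto threadsPerBlock = 256u;
    constexpr auto blocksPerGrid =
        (TheSize - TheInnerLoop + threadsPerBlock - 1) / threadsPerBlock; // 32767
    constexpr auto lastThread = blocksPerGrid * threadsPerBlock - 1;      // 8388351
    constexpr auto lastIndexRead = lastThread + (TheInnerLoop - 1);       // 8388606
    static_assert(lastIndexRead < TheSize, "every kernel stays in bounds");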
All the measurements are performed on GeForce GTX 1050 Ti Mobile.
Now I'm going to show code snippets in pairs. For each pair, please set what you think their relative performance is with the slider below the code (yes, it's a slider), and press the “Reveal the truth” button. Instead of points, you'll get pixels of error: how far the slider was set from the real number.
The goal of the quiz is to score as few pixels of error as possible.
There is a well-known optimization quoted, among other places, in Modern Fortran Explained:
For example, if a, b, and c are real variables, the expression a / b / c might be evaluated as a / (b * c) on a processor that can multiply much faster than it can divide.
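Spelled out in code, the rewrite the quote describes looks like this (a sketch of my own, not part of the benchmark): two divisions traded for one division and one multiplication, at the price of slightly different rounding.

    __device__ float divide_twice(float a, float b, float c) {
        return a / b / c;     // two divisions
    }

    __device__ float divide_once(float a, float b, float c) {
        return a / (b * c);   // one division, one multiplication
    }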
So... there are processors that are not like that? Let's see if my GPU can divide as fast as multiply.
__global__ void mul(const float *xs1, const float *xs2,
                    float *ys, int size) {
    int i = (blockDim.x * blockIdx.x + threadIdx.x);
    auto res = 0.f;
    for(auto j = 0u; j < TheInnerLoop; ++j) {
        res += xs1[i+j] * xs2[i+j];
    }
    ys[i] = res;
}
__global__ void div(const float *xs1, const float *xs2,
                    float *ys, int size) {
    int i = (blockDim.x * blockIdx.x + threadIdx.x);
    auto res = 0.f;
    for(auto j = 0u; j < TheInnerLoop; ++j) {
        res += xs1[i+j] / xs2[i+j];
    }
    ys[i] = res;
}
Set your estimate with a slider and press the button below.
They are almost the same.
CUDA has a magic compiler flag that promises to make math faster. It's called --use_fast_math, and what it does is relax the precision requirements for division (--prec-div=false) and the square root function (--prec-sqrt=false), allow flushing denormalized floating-point numbers to zero (--ftz=true), and allow fusing multiplication and addition into a single instruction (--fmad=true).

For the previous experiment, --prec-div=false would be the most relevant. Let's see how fast the division becomes with its precision constraints lifted.
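As an aside, and not something measured here: if you only want the fast, reduced-precision division in one particular spot instead of program-wide, CUDA also exposes it as the __fdividef device intrinsic. A sketch in the shape of the benchmark kernels:

    __global__ void div_intrinsic(const float *xs1, const float *xs2,
                                  float *ys, int size) {
        int i = (blockDim.x * blockIdx.x + threadIdx.x);
        auto res = 0.f;
        for(auto j = 0u; j < TheInnerLoop; ++j) {
            // __fdividef computes an approximate single-precision quotient
            res += __fdividef(xs1[i+j], xs2[i+j]);
        }
        ys[i] = res;
    }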
// nvcc --use_fast_math
__global__ void mul(const float *xs1, const float *xs2,
                    float *ys, int size) {
    int i = (blockDim.x * blockIdx.x + threadIdx.x);
    auto res = 0.f;
    for(auto j = 0u; j < TheInnerLoop; ++j) {
        res += xs1[i+j] * xs2[i+j];
    }
    ys[i] = res;
}
// nvcc --use_fast_math
__global__ void div(const float *xs1, const float *xs2,
                    float *ys, int size) {
    int i = (blockDim.x * blockIdx.x + threadIdx.x);
    auto res = 0.f;
    for(auto j = 0u; j < TheInnerLoop; ++j) {
        res += xs1[i+j] / xs2[i+j];
    }
    ys[i] = res;
}
They are almost the same.
How well does this fast math work for square roots?
__global__ void std_sqrt(const float *xs1, const float *xs2,
                         float *ys, int size) {
    int i = (blockDim.x * blockIdx.x + threadIdx.x);
    auto res = 0.f;
    for(auto j = 0u; j < TheInnerLoop; ++j) {
        res += std::sqrt(xs1[i+j]);
    }
    ys[i] = res;
}
// nvcc --prec-sqrt=false
__global__ void std_sqrt(const float *xs1, const float *xs2,
                         float *ys, int size) {
    int i = (blockDim.x * blockIdx.x + threadIdx.x);
    auto res = 0.f;
    for(auto j = 0u; j < TheInnerLoop; ++j) {
        res += std::sqrt(xs1[i+j]);
    }
    ys[i] = res;
}
They are almost the same.
There was a way to emulate the sine function with a fast but inaccurate polynomial, described in Challenge your performance intuition with C++ sine. Let's check if it works for CUDA.
__global__ void std_sin(const float *xs1, const float *xs2,
                        float *ys, int size) {
    int i = (blockDim.x * blockIdx.x + threadIdx.x);
    auto res = 0.f;
    for(auto j = 0u; j < TheInnerLoop; ++j) {
        res += std::sin(xs1[i+j]);
    }
    ys[i] = res;
}
__global__ void poly_sin(const float *xs1, const float *xs2,
                         float *ys, int size) {
    int i = (blockDim.x * blockIdx.x + threadIdx.x);
    auto res = 0.f;
    for(auto j = 0u; j < TheInnerLoop; ++j) {
        const auto x = xs1[i+j];
        res += -0.000182690409228785*x*x*x*x*x*x*x
               +0.00830460224186793*x*x*x*x*x
               -0.166651012143690*x*x*x
               +x;
    }
    ys[i] = res;
}
They are almost the same.
How much faster would it be to compute the polynomial in native single-precision floats instead of the double-precision ones we implicitly requested with the unsuffixed literals?
__global__ void poly_sin(const float *xs1, const float *xs2,
                         float *ys, int size) {
    int i = (blockDim.x * blockIdx.x + threadIdx.x);
    auto res = 0.f;
    for(auto j = 0u; j < TheInnerLoop; ++j) {
        const auto x = xs1[i+j];
        res += -0.000182690409228785*x*x*x*x*x*x*x
               +0.00830460224186793*x*x*x*x*x
               -0.166651012143690*x*x*x
               +x;
    }
    ys[i] = res;
}
__global__ void poly_sin3(const float *xs1, const float *xs2,
                          float *ys, int size) {
    int i = (blockDim.x * blockIdx.x + threadIdx.x);
    auto res = 0.f;
    for(auto j = 0u; j < TheInnerLoop; ++j) {
        const auto x = xs1[i+j];
        res += -0.000182690409228785f*x*x*x*x*x*x*x
               +0.00830460224186793f*x*x*x*x*x
               -0.166651012143690f*x*x*x
               +x;
    }
    ys[i] = res;
}
They are almost the same.
There is a way to evaluate a polynomial with n coefficients using only n-1 multiplications and n-1 additions. It's called Horner's scheme.
ax³ + bx² + cx + d = x(x(ax + b) + c) + d
Normally, this is something a compiler can figure out by itself. But would it?
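Just to spell the scheme out before we look at what the compiler does, here is a generic Horner evaluation; the helper and its coefficient order are made up for this illustration:

    // coeffs[0] is the highest-degree coefficient, coeffs[n-1] the constant term.
    __device__ float horner(const float *coeffs, int n, float x) {
        float acc = coeffs[0];
        for (int k = 1; k < n; ++k)
            acc = acc * x + coeffs[k];  // one multiplication, one addition per step
        return acc;
    }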
__global__ void poly_sin3(const float *xs1, const float *xs2,
                          float *ys, int size) {
    int i = (blockDim.x * blockIdx.x + threadIdx.x);
    auto res = 0.f;
    for(auto j = 0u; j < TheInnerLoop; ++j) {
        const auto x = xs1[i+j];
        res += -0.000182690409228785f*x*x*x*x*x*x*x
               +0.00830460224186793f*x*x*x*x*x
               -0.166651012143690f*x*x*x
               +x;
    }
    ys[i] = res;
}
__global__ void poly_sin4(const float *xs1, const float *xs2,
                          float *ys, int size) {
    int i = (blockDim.x * blockIdx.x + threadIdx.x);
    auto res = 0.f;
    for(auto j = 0u; j < TheInnerLoop; ++j) {
        const auto x = xs1[i+j];
        res += x*(x*x*(x*x*(x*x*
               -0.000182690409228785f
               +0.00830460224186793f)
               -0.166651012143690f)
               +1.f);
    }
    ys[i] = res;
}
They are almost the same.
Okay, but we haven't tried running the standard sine with fast math enabled yet. Let's compare it to our single-precision Horner polynomial.
// nvcc --use_fast_math
__global__ void std_sin(const float *xs1, const float *xs2,
                        float *ys, int size) {
    int i = (blockDim.x * blockIdx.x + threadIdx.x);
    auto res = 0.f;
    for(auto j = 0u; j < TheInnerLoop; ++j) {
        res += std::sin(xs1[i+j]);
    }
    ys[i] = res;
}
__global__ void poly_sin4(const float *xs1, const float *xs2,
                          float *ys, int size) {
    int i = (blockDim.x * blockIdx.x + threadIdx.x);
    auto res = 0.f;
    for(auto j = 0u; j < TheInnerLoop; ++j) {
        const auto x = xs1[i+j];
        res += x*(x*x*(x*x*(x*x*
               -0.000182690409228785f
               +0.00830460224186793f)
               -0.166651012143690f)
               +1.f);
    }
    ys[i] = res;
}
They are almost the same.
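For the record, --use_fast_math gets its sine speed by replacing the single-precision sine with the fast __sinf intrinsic, which you can also call directly if you only want it in one place. A sketch in the shape of the benchmark kernels, not one of the measured ones:

    __global__ void intrinsic_sin(const float *xs1, const float *xs2,
                                  float *ys, int size) {
        int i = (blockDim.x * blockIdx.x + threadIdx.x);
        auto res = 0.f;
        for(auto j = 0u; j < TheInnerLoop; ++j) {
            // __sinf is the fast, reduced-precision device sine
            res += __sinf(xs1[i+j]);
        }
        ys[i] = res;
    }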
In Challenge your performance intuition with C++ operators, a few alternatives to logical operations are examined. Apparently, on x86, short-circuiting in logical operations is not necessarily beneficial. Sometimes we want it off. In C++, you don't have legitimate control over it, so we have to use tricks to hint the compiler about our intentions.
Sometimes, using bitwise operations instead of logical ones works.
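To illustrate what short-circuiting actually means, here is a toy example of my own; cheap() and pricey() are hypothetical placeholders. With &&, the right-hand side may be skipped entirely, which requires a branch; with &, both sides are always evaluated and no branch is needed, while the result for bools is the same.

    __device__ bool cheap(float x)  { return x > 0.f; }
    __device__ bool pricey(float x) { return sinf(x) > 0.5f; }

    // Short-circuited: pricey() only runs when cheap() is true.
    __device__ bool and_logical(float x) { return cheap(x) && pricey(x); }

    // Bitwise: both sides always run, branch-free, same result for bools.
    __device__ bool and_bitwise(float x) { return cheap(x) & pricey(x); }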
__global__ void logical_and(const float *xs1, const float *xs2,
                            float *ys, int size) {
    int i = (blockDim.x * blockIdx.x + threadIdx.x);
    bool all_gt = true;
    for(auto j = 0u; j < TheInnerLoop - 3; ++j) {
        all_gt = all_gt
            && (xs1[i+j] > xs1[i+j])
            && (xs1[i+j+1] > xs1[i+j+1])
            && (xs1[i+j+2] > xs1[i+j+2])
            && (xs1[i+j+3] > xs1[i+j+3]);
    }
    ys[i] = all_gt ? 1.f : 0.f;
}
__global__ void bit_and(const float *xs1, const float *xs2,
                        float *ys, int size) {
    int i = (blockDim.x * blockIdx.x + threadIdx.x);
    bool all_gt = true;
    for(auto j = 0u; j < TheInnerLoop - 3; ++j) {
        all_gt = all_gt
            & (xs1[i+j] > xs1[i+j])
            & (xs1[i+j+1] > xs1[i+j+1])
            & (xs1[i+j+2] > xs1[i+j+2])
            & (xs1[i+j+3] > xs1[i+j+3]);
    }
    ys[i] = all_gt ? 1.f : 0.f;
}
They are almost the same.
Another trick to make the compiler skip the short-circuiting is turning logic into arithmetic.
__global__ void logical_and(const float *xs1, const float *xs2,
                            float *ys, int size) {
    int i = (blockDim.x * blockIdx.x + threadIdx.x);
    bool all_gt = true;
    for(auto j = 0u; j < TheInnerLoop - 3; ++j) {
        all_gt = all_gt
            && (xs1[i+j] > xs1[i+j])
            && (xs1[i+j+1] > xs1[i+j+1])
            && (xs1[i+j+2] > xs1[i+j+2])
            && (xs1[i+j+3] > xs1[i+j+3]);
    }
    ys[i] = all_gt ? 1.f : 0.f;
}
__global__ void mul_and(const float *xs1, const float *xs2,
                        float *ys, int size) {
    int i = (blockDim.x * blockIdx.x + threadIdx.x);
    bool all_gt = true;
    for(auto j = 0u; j < TheInnerLoop - 3; ++j) {
        all_gt = all_gt
            * (xs1[i+j] > xs1[i+j])
            * (xs1[i+j+1] > xs1[i+j+1])
            * (xs1[i+j+2] > xs1[i+j+2])
            * (xs1[i+j+3] > xs1[i+j+3]);
    }
    ys[i] = all_gt ? 1.f : 0.f;
}
They are almost the same.
In Challenge your performance intuition with nanosecond sorting, an index sort is compared with the standard sort. CUDA device code doesn't allow std::sort, so we'll use a tiny swap-based sort instead. The index sort, for comparison, computes each element's final position directly from the comparisons and writes it there, with no swaps and no branches.
#define swap(a, b) {auto c = a; a = b; b = c;}

__global__ void sort(const float *xs1, const float *xs2,
                     float *ys, int size) {
    int i = (blockDim.x * blockIdx.x + threadIdx.x);
    float checksum = 0.;
    for(auto j = 0u; j < TheInnerLoop - 2; ++j) {
        double s[3] = {xs1[i+j], xs1[i+j+1], xs1[i+j+2]};
        if(s[0] > s[1]) swap(s[0], s[1]);
        if(s[1] > s[2]) swap(s[1], s[2]);
        if(s[0] > s[1]) swap(s[0], s[1]);
        checksum += s[0] + 2*s[1] + 3*s[2];
    }
    ys[i] = checksum;
}
__global__ void index_sort(const float *xs1, const float *xs2,
                           float *ys, int size) {
    int i = (blockDim.x * blockIdx.x + threadIdx.x);
    float checksum = 0.;
    for(auto j = 0u; j < TheInnerLoop - 2; ++j) {
        double s[3] = {xs1[i+j], xs1[i+j+1], xs1[i+j+2]};
        const auto a = s[0];
        const auto b = s[1];
        const auto c = s[2];
        s[int(a > b) + int(a > c)] = a;
        s[int(b >= a) + int(b > c)] = b;
        s[int(c >= a) + int(c >= b)] = c;
        checksum += s[0] + 2*s[1] + 3*s[2];
    }
    ys[i] = checksum;
}
They are almost the same.
Challenge your performance intuition with C++ magic squares
Challenge your performance intuition with nanosecond sorting
Challenge your performance intuition with C++ operators
Challenge your performance intuition with C++ sine
Also, there's a GitHub repository with all the experiments.