
Outperforming LAPACK with C metaprogramming

We don't have many tools for metaprogramming in C. We don't have the templates or constexpr functions of C++; we don't have mixins like in D; and we certainly don't have the homoiconicity of Lisp or Julia. All we have in C is a preprocessor that supports macros. And we don't even have recursive macros: neither macros that call themselves, nor macros that define other macros. Apparently, you can't implement something as awesome as a compile-time compiler with this.

Yet with preprocessor macros alone you can still explain to your compiler what you want, and the compiler, being a sophisticated machine, will use this knowledge to generate better code. You can make it compute rather elaborate things statically, and, more importantly, you can make it generate zero-overhead code to win quite a lot in performance.

I'll show you a pair of tricks that will turn a crude and inefficient algorithm for solving small linear systems into something that can outrun LAPACK.

1. Quasi-recursive inlinable functions

Function inlining is one of the basics of compiler optimization. When the body of a function is small and the arguments are numerous, it makes perfect sense to build the body directly into the calling code instead of arranging the conventional data setup and jumping there and back in memory. However, some things effectively prevent function inlining, and one of them is recursion.
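
As a minimal illustration, with a function invented just for the example: under gcc -O2, a small static inline function dissolves into its call site, leaving no call instruction and no argument passing behind.

static inline double dot2(double ax, double ay, double bx, double by) {
    return ax*bx + ay*by;
}

double f(double x) {
    // dot2 gets inlined here: f compiles down to a couple of
    // multiply-adds with no call overhead at all
    return dot2(x, 1.0, 2.0, x);
}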

Indeed, when a function calls itself, simply building its body into its own body will not work. You can force GCC to try, but it will only eat up all the memory and crash.

The trick is to promise the compiler that the recursion will end soon. You have to unroll a recursive call into a finite chain of specific function calls. And you can do this with macros.

Yes, the compiler has to generate extra code on the first steps of compilation, but it will eventually eliminate the dead calls once the variables are proven constant at runtime. GCC is very good at this. In fact, you don't even have to help it with const modifiers. If it can, it will minimize the runtime computation to the fullest possible extent.
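
Here is a minimal sketch of that constant propagation, assuming gcc -O2; the function is invented for the example. There is not a single const in it, yet the compiler proves the value constant and folds the whole computation into a literal.

#include "stdio.h"

static inline int twice(int x) { return x + x; }

int main() {
    int a = 21;               // not declared const, but provably constant
    printf("%d", twice(a));   // gcc -O2 turns this into printing 42
    return 0;
}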

In this example, we will make it compute a quasi-recursive factorial function statically, letting the preprocessor and the compiler work together.

The C source code:

#define RECURSIVE(PREFIX, CHUNK1, SUFFIX, END_SUFFIX) \
inline PREFIX##_8 END_SUFFIX \
inline PREFIX##_7 CHUNK1##_8 SUFFIX \
inline PREFIX##_6 CHUNK1##_7 SUFFIX \
inline PREFIX##_5 CHUNK1##_6 SUFFIX \
inline PREFIX##_4 CHUNK1##_5 SUFFIX \
inline PREFIX##_3 CHUNK1##_4 SUFFIX \
inline PREFIX##_2 CHUNK1##_3 SUFFIX \
inline PREFIX##_1 CHUNK1##_2 SUFFIX \
inline PREFIX##_0 CHUNK1##_1 SUFFIX

// The comma after i*f is a macro argument separator: it splits the body
// into CHUNK1 and SUFFIX so that ## can paste the depth index onto f.
RECURSIVE(int f, (int i) {return i == 1 ? 1 : i*f, (i-1);}, (int i){return 0;})

#include "stdio.h"

int main(){
    printf("%d", f_0(4));
}

Of course, this hack has its limits. For instance, when you have to make several calls from a quasi-recursive function, the compiler has to build not a list of bodies but a whole tree, and the tree grows exponentially with depth. This slows down compilation enormously. It takes noticeable time to compile a 4-deep 4-branch quasi-recursion on a modern desktop; a 6-deep 6-branch one might turn it catatonic.
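
To give a taste of the branching version, here is a minimal sketch of a hypothetical 2-call variant of the macro with a quasi-recursive Fibonacci; both names are invented for the example. Every chunk ends with the function name, so the pasted suffix lands exactly on the calls. Note how each level references two copies of the next one: this is precisely the tree growth described above.

#define RECURSIVE_2_CALLS(PREFIX, CHUNK1, CHUNK2, SUFFIX, END_SUFFIX) \
inline PREFIX##_4 END_SUFFIX \
inline PREFIX##_3 CHUNK1##_4 CHUNK2##_4 SUFFIX \
inline PREFIX##_2 CHUNK1##_3 CHUNK2##_3 SUFFIX \
inline PREFIX##_1 CHUNK1##_2 CHUNK2##_2 SUFFIX \
inline PREFIX##_0 CHUNK1##_1 CHUNK2##_1 SUFFIX

// fib_0 computes Fibonacci numbers correctly while the recursion fits
// into the unrolled depth; past it, the last level falls back to 1
RECURSIVE_2_CALLS(int fib,
  (int n) {return n < 2 ? n : fib, (n-1) + fib, (n-2);},
  (int n) {return n < 2 ? n : 1;})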

2. Forced loop unrolling

Compilers are usually good at unrolling loops. In fact, they are usually better at it than us regular people. The good side of loop unrolling is that it removes the overhead of index counting and comparison. The bad side is larger code. Sometimes we win more from a small loop running entirely in the instruction cache than from a huge unrolled loop that has to be read from uncached memory. Usually, compilers balance these things for a specific processor architecture better than people writing assembly.

Sometimes, however, we know better than the compiler. We know that even if the loop is not all that short, and even if it has inner loops, it is still eligible for unrolling because it will shrink after dead code elimination. Here is a trick that leaves the C compiler no choice.

This is a bubble sort performed entirely at compile time.

The C source code:

// The switch deliberately has no breaks: entering at case N falls
// through and runs BODY for ID = N, N-1, ..., 0.
#define LOOP_TO_0(ID, N, BODY) {int ID; switch(N){\
case 4: ID=4;{BODY};\
case 3: ID=3;{BODY};\
case 2: ID=2;{BODY};\
case 1: ID=1;{BODY};\
case 0: ID=0;{BODY};\
}}

// The inner body swaps a[k] and a[k+1] without an if: a[k] gets the
// smaller of the pair, a[k+1] the larger.
#define STATIC_SORT_INT(a) \
LOOP_TO_0(i, sizeof(a)/sizeof(a[0]) - 2, LOOP_TO_0(j, i, \
  int k = i-j; \
  int s = a[k]+a[k+1]; \
  int d = abs(a[k]-a[k+1]); \
  a[k] = (s-d)/2; \
  a[k+1] = (s+d)/2; \
));

#include "stdio.h"
#include "stdlib.h" /* for abs */

int main(){
    int a[4] = {1, 5, 2, 4};
    STATIC_SORT_INT(a);
    printf("%d %d %d %d\n", a[0], a[1], a[2], a[3]);
}

Technically, one can do the very same trick with the inlined pseudo-recursion from before. But this version compiles faster and has proven easier to debug.
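
Just to show the equivalence, here is a minimal sketch of a loop written as pseudo-recursion with the RECURSIVE macro from before; the sum function is invented for the example and only handles arrays of up to 8 elements, one element per unrolled level.

// sum_0(a, n-1) adds up a[n-1] down to a[0] for any n up to 8
RECURSIVE(double sum,
  (const double* a, int i) {if (i == 0) return a[0]; return a[i] + sum, (a, i-1);},
  (const double* a, int i){return 0.;})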

Racing LAPACK

LAPACK is known to be a rather efficient library for solving linear algebra problems. I'm going to measure how it performs on 1M small linear systems, and compare it to a brute Cramer's rule solution: first done with plain loops and recursive functions, then with the metaprogramming tricks that remove the overhead.

That's exactly what I want to measure:

#include "stdio.h"

/* LAPACK's dgesv: solves A*X = B for a general N-by-N matrix
   via LU decomposition with partial pivoting */
void dgesv_( int* n, int* nrhs, double* a, int* lda, int* ipiv,
             double* b, int* ldb, int* info );

#define N 5
#define NRHS 1
#define LDA N
#define LDB N

int main() {
    double sum_x[5] = {0., 0., 0., 0., 0.};
    for(int i = 0; i < 1000000; ++i) {
        int n = N, nrhs = NRHS, lda = LDA, ldb = LDB, info;

        int ipiv[N];
        // column-major layout, as the Fortran routine expects
        double a[LDA*N] = {
         6.80, -2.11,  5.66,  5.97,  8.23,
        -6.05, -3.30,  5.36, -4.44,  1.08,
        -0.45,  2.58, -2.70,  0.27,  9.04,
         8.32,  2.71,  4.35, -7.17,  2.14,
        -9.67, -5.14, -7.26,  6.08, -6.87
        };
        double b[LDB*NRHS] = {
            4.02,  6.19, -8.22, -7.57, -3.03,
        };

        dgesv_( &n, &nrhs, a, &lda, ipiv, b, &ldb, &info );
        for(int j = 0; j < 5; ++j){
            sum_x[j] += b[j];
        }
    }
    printf("%f, %f, %f, %f, %f\n",
      sum_x[0], sum_x[1], sum_x[2], sum_x[3], sum_x[4]);
}
    

I know this is not a perfect benchmark; at the very least, it doesn't account for branch misprediction. But I'm ok with this. The metaprogrammed version I'm going to compare it to simply has no branches to mispredict.

But first, the version with all the overheads:

#include "stdio.h"

// The (i, j) element after division-free elimination with pivot rows
// n through m-1; n == m means the raw matrix element.
double aijn(double* a, double* b, int i, int j, int n, int m){
  return (n == m) ? a[i*m+j] :
    aijn(a, b, i, j, n+1, m) * aijn(a, b, n, n, n+1, m) -
    aijn(a, b, i, n, n+1, m) * aijn(a, b, n, j, n+1, m);
}

// The right-hand side element i after the same elimination steps.
double bin(double* a, double* b, int i, int n, int m){
  return (n == m) ? b[i] :
    aijn(a, b, n, n, n+1, m) * bin(a, b, i, n+1, m) -
    aijn(a, b, i, n, n+1, m) * bin(a, b, n, n+1, m);
}

// Solves for x[i], assuming x[0..i-1] are already known.
void solve_xi(double* a, double* b, double* x, int i, int m){
  double d = bin(a, b, i, i+1, m);
  for(int j = 0; j < i; ++j)
    d-=aijn(a, b, i, j, i+1, m)*x[j];
  x[i] = d / aijn(a, b, i, i, i+1, m);
}

// Solves an m-by-m system a*x = b one unknown at a time.
void runtime_solve(double* a, double * b, double* x, int m){
  for(int k = 0; k < m; ++k)
    solve_xi(a, b, x, k, m);
}

int main() {
    double sum_x[5] = {0., 0., 0., 0., 0.};
    for(int i = 0; i < 1000000; ++i) {
        // row-major layout this time: the handwritten solver indexes a[i*m+j]
        double a[5*5] = {
            6.80, -6.05, -0.45,  8.32, -9.67,
           -2.11, -3.30,  2.58,  2.71, -5.14,
            5.66,  5.36, -2.70,  4.35, -7.26,
            5.97, -4.44,  0.27, -7.17,  6.08,
            8.23,  1.08,  9.04,  2.14, -6.87
        };
        double b[5] = {
            4.02,  6.19, -8.22, -7.57, -3.03,
        };

        double x[5];

        runtime_solve(a, b, x, sizeof(x) / sizeof(x[0]));
        for(int j = 0; j < 5; ++j){
            sum_x[j] += x[j];
        }
    }
    printf("%f, %f, %f, %f, %f\n",
      sum_x[0], sum_x[1], sum_x[2], sum_x[3], sum_x[4]);
}
    

It has a loop, a nested loop and a pair of recursive functions. Nothing we can't linearize with macros. The version with metaprogramming tricks looks like this:

// from static_recursion.h
#define RECURSIVE_4_CALLS(PREFIX, CHUNK1, CHUNK2, CHUNK3, CHUNK4, SUFFIX, END_SUFFIX) \
inline PREFIX##_4 END_SUFFIX \
inline PREFIX##_3 CHUNK1##_4 CHUNK2##_4 CHUNK3##_4 CHUNK4##_4 SUFFIX \
inline PREFIX##_2 CHUNK1##_3 CHUNK2##_3 CHUNK3##_3 CHUNK4##_3 SUFFIX \
inline PREFIX##_1 CHUNK1##_2 CHUNK2##_2 CHUNK3##_2 CHUNK4##_2 SUFFIX \
inline PREFIX##_0 CHUNK1##_1 CHUNK2##_1 CHUNK3##_1 CHUNK4##_1 SUFFIX

// from static_loops.h
#define LOOP_TO_0(ID, N, BODY) {int ID; switch(N){\
case 7: ID=7;{BODY};\
case 6: ID=6;{BODY};\
case 5: ID=5;{BODY};\
case 4: ID=4;{BODY};\
case 3: ID=3;{BODY};\
case 2: ID=2;{BODY};\
case 1: ID=1;{BODY};\
case 0: ID=0;{BODY};\
}}

#define LOOP_TO(ID, START, END, BODY) \
  LOOP_TO_0(ID, (END)-(START), ID=(END)-ID; BODY)

#define LOOP_DOWNTO(ID, START, END, BODY) \
  LOOP_TO_0(ID, (START)-(END), ID=(END)+ID; BODY)

// from static_linear_solver.h
// Note: the commas after every aijn and bin below are not typos. They are
// macro argument separators: they split the function bodies into chunks
// so that ## can paste the depth suffix onto each quasi-recursive call.
RECURSIVE_4_CALLS(double aijn, (double* a, double* b, int i, int j, int n, int m){
  return (n == m) ? a[i*m+j] :
    aijn, (a, b, i, j, n+1, m) * aijn, (a, b, n, n, n+1, m) -
    aijn, (a, b, i, n, n+1, m) * aijn, (a, b, n, j, n+1, m);
}, (double* a, double* b, int i, int j, int n, int m){return a[i*m+j];})

RECURSIVE_4_CALLS(double bin, (double* a, double* b, int i, int n, int m){
  return (n == m) ? b[i] :
    aijn, (a, b, n, n, n+1, m) * bin, (a, b, i, n+1, m) -
    aijn, (a, b, i, n, n+1, m) * bin, (a, b, n, n+1, m);
}, (double* a, double* b, int i, int n, int m){return b[i];})

#define SOLVE_X(A, B, X, I, M) \
  double d = bin_0(A, B, I, (I+1), M); \
  LOOP_TO(j, 0, (I)-1, d-=aijn_0(A, B, I, j, (I+1), M)*X[j];); \
  X[I] = d / aijn_0(A, B, I, I, (I+1), M);

#define STATIC_SOLVE(A, B, X) \
  int m = sizeof(X) / sizeof(X[0]); \
  LOOP_TO(k, 0, m-1, SOLVE_X(A, B, X, k, m););

// and the benchmark itself
#include "stdio.h"
#include "static_linear_solver.h"

int main() {
    double sum_x[5] = {0., 0., 0., 0., 0.};
    for(int i = 0; i < 1000000; ++i) {
        volatile double v_a[5*5] = {
            6.80, -6.05, -0.45,  8.32, -9.67,
           -2.11, -3.30,  2.58,  2.71, -5.14,
            5.66,  5.36, -2.70,  4.35, -7.26,
            5.97, -4.44,  0.27, -7.17,  6.08,
            8.23,  1.08,  9.04,  2.14, -6.87
        };
        double a[5*5] = {
            v_a[0+0],  v_a[0+1],  v_a[0+2],  v_a[0+3],  v_a[0+4],
            v_a[5+0],  v_a[5+1],  v_a[5+2],  v_a[5+3],  v_a[5+4],
            v_a[10+0], v_a[10+1], v_a[10+2], v_a[10+3], v_a[10+4],
            v_a[15+0], v_a[15+1], v_a[15+2], v_a[15+3], v_a[15+4],
            v_a[20+0], v_a[20+1], v_a[20+2], v_a[20+3], v_a[20+4],
        };
        volatile double v_b[5] = {
            4.02,  6.19, -8.22, -7.57, -3.03,
        };
        double b[5] = {
            v_b[0],  v_b[1], v_b[2], v_b[3], v_b[4],
        };

        double x[5];

        STATIC_SOLVE(a, b, x);
        for(int j = 0; j < 5; ++j){
            sum_x[j] += x[j];
        }
    }
    printf("%f, %f, %f, %f, %f\n", sum_x[0], sum_x[1], sum_x[2], sum_x[3], sum_x[4]);
}
    

Note that the input arrays start as volatile but then get copied into plain non-volatile arrays used as STATIC_SOLVE's input. The thing is, if we don't do that, the whole thing gets solved statically at zero runtime cost. That's not what I want to measure. I want the compiler to build code that works on arbitrary data, and that's why the input has to be volatile.

On the other hand, making the data volatile inside the macros would kill the optimization for constant inputs, and I want the static and semi-static solutions measured as well.
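
Here is a minimal sketch of the difference, with a function invented just for illustration. The volatile read forces a runtime computation, while copying the value into a plain double leaves the optimizer free to do whatever it can afterwards.

#include "stdio.h"

static inline double square(double x) { return x * x; }

int main() {
    double c = square(3.0);   // no volatile: gcc -O2 folds this into 9.0
    volatile double v = 3.0;  // volatile: the load must happen at runtime
    double x = v;             // ...but x itself is an ordinary double, so
                              // everything past the copy is fair game again
    printf("%f %f\n", c, square(x));
    return 0;
}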

We can have both static and runtime solutions with explicit copying. This way, a single macro fits every possible configuration of constant and volatile data. In this regard, it makes sense to measure not only LAPACK and the metaprogrammed solution but several other configurations as well. Here are the measurements.

  1. Full static solution. 1 KLOC of assembly running for 0.00 seconds.
  2. Static A matrix, volatile B vector. 5 KLOC of assembly, 0.04 seconds.
  3. Volatile A matrix, static B vector. 7.5 KLOC, 0.10 seconds.
  4. The metaprogrammed solution for all volatile data. 8.3 KLOC, 0.10 seconds.
  5. The solution when the matrix is triangular, meaning the upper right half is all zeros, and the other data is volatile. 6.5 KLOC, 0.08 seconds.
  6. LAPACK. 0.65 seconds.
  7. Original solution with recursion and loops. 10 KLOC, 2.10 seconds.

Of course, these measurements are not at all precise, but I think they still work since the numbers are so distinct.

Conclusion

You can trick the compiler into doing things it wouldn't normally do, and benefit from this performance-wise. Technically, you can even do this without macros, but macros make it easier.

Apart from the tricks themselves, there are also some things worth mentioning.

All the measurements were conducted on an Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz; the code was compiled with g++ 5.4.0 with -std=c++11 -O2. The source code for the benchmark is available on GitHub.