This is Words and Buttons Online — a collection of interactive #tutorials, #demos , and #quizzes about #mathematics, #algorithms and #programming.

Outperforming everything with anything
_{Python? Sure, why not?}

This continues Outperforming LAPACK with C metaprogramming, and Vastly outperforming LAPACK with C++ metaprogramming. These two posts describe language tricks to make compilers generate highly-performant code for you. But do you need compilers at all?

The alternative would be writing programs in plain assembly. Although the horrors of it are vastly exaggerated, this approach has two major flaws.

Assembly code is not portable.
Although it became much easier with modern tools, assembly programming still requires a lot of tedious work.

Thankfully, we all live in the XXI century, and both of these problems have already been addressed.

The solution to the first one would be LLVM. Initially, it meant “Low Level Virtual Machine” and that is exactly what we want to ensure portability. Simply put, it takes some code written in very low-level hardware agnostic language and returns some highly optimized native code for the specific hardware platform. With LLVM we have both the power of low-level programming and the automation of hardware-oriented micro-optimizations.

And the solution to the second problem is any “scripting” language you want. Scheme, Python, Perl, even bash or AWK will do. They are all meant to eliminate tedious work. Everything you use daily for automation you can use to generate highly-performant code.

The plan

Let's repeat the same experiment I did with C and C++ in the posts from before. Let's generate a fully inlined fully unrolled solution and embed it into the benchmarking code. The plan then is as follows.

Use Clang to generate LLVM intermediate code for the benchmark that is supposed to measure yet non-existent function named solve_5
Make Python generate a linear solver code in LLVM.
Hijack a benchmark with a Python script replacing solve_5 call with the generated solver.
Use the LLVM static compiler to turn the intermediate code into the machine code.
Use the GNU assembler and the Clang's linker to make the machine code into an executable binary.

That's how it looks in a Makefile:

all: generate_llvm generate_s generate_elf test
.PHONY: generate_llvm
generate_llvm:
  clang benchmark.c -march=native -S -emit-llvm -o benchmark.bc
  python substitute_solver_call.py benchmark.bc
.PHONY: generate_s
generate_s:
  llc -O2 benchmark.bc -o benchmark.s
.PHONY: generate_elf
generate_elf:
  as benchmark.s -o benchmark.o
  clang -o benchmark benchmark.o

The Python part

We need a linear solver in Python just like we had with C and C++. Here it is:

# this generates n-solver in LLVM code with LLVMCode objects.
# No LLVM stuff yet, just completely Pythonic solution
def solve_linear_system(a_array, b_array, x_array, n_value):
  def a(i, j, n):
    if n == n_value:
      return a_array[i * n_value + j]
    return a(i, j, n+1)*a(n, n, n+1) - a(i, n, n+1)*a(n, j, n+1)

  def b(i, n):
    if n == n_value:
      return b_array[i]
    return a(n, n, n+1)*b(i, n+1) - a(i, n, n+1)*b(n, n+1)

  def x(i):
    d = b(i,i+1)
    for j in range(i):
      d -= a(i, j, i+1) * x_array[j]
    return d / a(i, i, i+1)

  for k in range(n_value):
    x_array[k] = x(k)

return x_array

When we run this on numbers, we have numbers. But we want the code. So let's make an object that pretends to be numbers only to spy on the algorithm. The object that writes down every operation the algorithm wants it to perform in ready to assemble LLVM intermediate language.

# this is basically the whole LLVM layer
I = 0
STACK = []

class LLVMCode:
  # the only constructor for now is by double* instruction
  def __init__(self, io, code = ''):
    self.io = io
    self.code = code
  def __getitem__(self, i):
    global I, STACK
    copy_code = "%" + str(I+1)
    copy_code += " = getelementptr inbounds double, double* "
    copy_code += self.io +", i64 " + str(i) + "\n"
    copy_code += "%" + str(I+2)
    copy_code += " = load double, double* %" + str(I+1)
    copy_code += ", align 8\n"
    I += 2
    STACK += [I]
    return LLVMCode(self.io, copy_code)
  def __setitem__(self, i, other_llvcode):
    global I, STACK
    self.code += other_llvcode.code
    self.code += "%" + str(I+1)
    self.code += " = getelementptr inbounds double, double* "
    self.code += self.io +", i64 " + str(i) + "\n"
    self.code += "store double %" + str(I)
    self.code += ", double* %" + str(I+1) + ", align 8\n"
    I += 1
    STACK = STACK[:-1]
    return self
  def general_arithmetics(self, operator, other_llvcode):
    global I, STACK
    self.code += other_llvcode.code;
    self.code += "%" + str(I+1) + " = f" + operator
    self.code += " double %" + str(STACK[-2]) + ", %"
    self.code += str(STACK[-1]) + "\n";
    I += 1
    STACK = STACK[:-2] + [I]
    return self
  def __add__(self, other_llvcode):
    return self.general_arithmetics('add', other_llvcode)
  def __sub__(self, other_llvcode):
    return self.general_arithmetics('sub', other_llvcode)
  def __mul__(self, other_llvcode):
    return self.general_arithmetics('mul', other_llvcode)
  def __div__(self, other_llvcode):
    return self.general_arithmetics('div', other_llvcode)

Now when we run the solver with this kind of objects, we get a fully functional linear solver written in LLVM intermediate language. Then we put it into the benchmark code to see how fast it is.

Instruction in LLVM are numbered and we want to preserve this enumeration, so the function that inserts our code into our benchmark is not trivial. But it's not very complicated either.

# this replaces the function call
# and updates all the instructions' indices
def replace_call(text, line, params):
  global I, STACK
  # '%12 ' -> 12
  I = int(''.join(
    [xi for xi in params[2] if xi.isdigit()]
    ))
  first_instruction_to_replace = I + 1
  STACK = []
  replacement = solve_linear_system(
    LLVMCode(params[0]),
    LLVMCode(params[1]),
    LLVMCode(params[2]), 5).code
  delta_instruction = I - first_instruction_to_replace + 1
  for i in xrange(first_instruction_to_replace, sys.maxint):
    not_found = sum(
      [text.find('%' + str(i) + c) == -1
        for c in POSSIBLE_CHARS_NUMBER_FOLLOWS_WITH]
      )
    if not_found == 4:
      # the last instruction has already been substituted
      break
    new_i = i + delta_instruction
    for c in POSSIBLE_CHARS_NUMBER_FOLLOWS_WITH:
      # substitute instruction number
      text = text.replace('%' + str(i) + c, '%' + str(new_i) + c)
return text.replace(line, replacement)

The whole piece of code that implements the solver, provides the Python-to-LLVM layer, and does the code insertion is only 100 lines long! You can see it in one piece on GitHub.

The benchmark

The benchmark itself is in C. When we run the Makefile, its call for solve_5 gets replaced by the LLVM code generated from Python by Python.

Step 1. Benchmark C source code
#include "stdio.h"

extern void solve_5(double* a, double* b, double* x); // fake

int main() {
    double sum_x[5] = {0., 0., 0., 0., 0.};
    for(int i = 0; i < 1000000; ++i) {
        double a[5*5] = {
            6.80, -6.05, -0.45,  8.32, -9.67,
           -2.11, -3.30,  2.58,  2.71, -5.14,
            5.66,  5.36, -2.70,  4.35, -7.26,
            5.97, -4.44,  0.27, -7.17,  6.08,
            8.23,  1.08,  9.04,  2.14, -6.87
        };
        double b[5] = {
            4.02,  6.19, -8.22, -7.57, -3.03,
        };
        double x[5];

        solve_5(a, b, x); // this will get replaced later

        for(int j = 0; j < 5; ++j){
            sum_x[j] += x[j];
        }
    }
    printf("%f, %f, %f, %f, %f\n",
        sum_x[0], sum_x[1], sum_x[2], sum_x[3], sum_x[4]);
    return 0;
}

Step 2. LLVM assembly language
... 27 lines before...
; <label>:6                                       ; preds = %3
  %7 = bitcast [25 x double]* %a to i8*
  call void @llvm.memcpy.p0i8.p0i8.i32
    (i8* %7, i8* bitcast ([25 x double]* @main.a to i8*), i32 200, i32 8, i1 false)
  %8 = bitcast [5 x double]* %b to i8*
  call void @llvm.memset.p0i8.i32(i8* %8, i8 0, i32 40, i32 8, i1 false)
  %9 = bitcast i8* %8 to [5 x double]*
  %10 = getelementptr [5 x double], [5 x double]* %9, i32 0, i32 0
  store double 4.020000e+00, double* %10
  %11 = getelementptr [5 x double], [5 x double]* %9, i32 0, i32 1
  store double 6.190000e+00, double* %11
  %12 = getelementptr [5 x double], [5 x double]* %9, i32 0, i32 2
  store double -8.220000e+00, double* %12
  %13 = getelementptr [5 x double], [5 x double]* %9, i32 0, i32 3
  store double -7.570000e+00, double* %13
  %14 = getelementptr [5 x double], [5 x double]* %9, i32 0, i32 4
  store double -3.030000e+00, double* %14
  %15 = getelementptr inbounds [25 x double], [25 x double]* %a, i32 0, i32 0
  %16 = getelementptr inbounds [5 x double], [5 x double]* %b, i32 0, i32 0
  %17 = getelementptr inbounds [5 x double], [5 x double]* %x, i32 0, i32 0
  call void @solve_5(double* %15, double* %16, double* %17) ; to replace
  store i32 0, i32* %j, align 4
  br label %18 ; to update

; <label>:18                                      ; preds = %29, %6
  %19 = load i32, i32* %j, align 4
  %20 = icmp slt i32 %19, 5
  br i1 %20, label %21, label %32
... 58 lines after... Step 3. LLVM after the call replacement
... 44 lines before ...
  %15 = getelementptr inbounds [25 x double], [25 x double]* %a, i32 0, i32 0
  %16 = getelementptr inbounds [5 x double], [5 x double]* %b, i32 0, i32 0
  %17 = getelementptr inbounds [5 x double], [5 x double]* %x, i32 0, i32 0
%18 = getelementptr inbounds double, double* %15, i64 6  ; generated
%19 = load double, double* %18, align 8                  ; by
%20 = getelementptr inbounds double, double* %15, i64 24 ; the
%21 = load double, double* %20, align 8                  ; Python
%22 = fmul double %19, %21                               ; script

... 2407 more lines of auto-generated code ...

%2422 = getelementptr inbounds double, double* %17, i64 3
%2423 = load double, double* %2422, align 8
%2424 = fmul double %2421, %2423
%2425 = fsub double %2419, %2424
%2426 = getelementptr inbounds double, double* %15, i64 24
%2427 = load double, double* %2426, align 8
%2428 = fdiv double %2425, %2427
%2429 = getelementptr inbounds double, double* %17, i64 4
store double %2428, double* %2429, align 8

  store i32 0, i32* %j, align 4
  br label %2430  ; instructions updated by the same script

; :18                                      ; preds = %2441, %6
  %2431 = load i32, i32* %j, align 4
  %2432 = icmp slt i32 %2431, 5
  br i1 %2432, label %2433, label %2444
... still 58 lines after...Step 4. Native optimized assembly
... 139 lines of assembly ...
	vmovsd	352(%esp), %xmm0        # xmm0 = mem[0],zero
	vmulsd	256(%esp), %xmm2, %xmm6
	vmovsd	.LCPI0_0, %xmm4         # xmm4 = mem[0],zero
	vmulsd	%xmm4, %xmm0, %xmm7
	vsubsd	%xmm7, %xmm6, %xmm6
	vmulsd	128(%esp), %xmm0, %xmm5 # 8-byte Folded Reload
	vmovapd	%xmm0, %xmm3
	vmulsd	344(%esp), %xmm2, %xmm7
	vsubsd	%xmm5, %xmm7, %xmm1
	vmovsd	%xmm1, 128(%esp)        # 8-byte Spill
	vmulsd	280(%esp), %xmm2, %xmm7
	vmulsd	192(%esp), %xmm4, %xmm5 # 8-byte Folded Reload
	vsubsd	%xmm5, %xmm7, %xmm7
	vmovsd	%xmm7, 120(%esp)        # 8-byte Spill
	vmovsd	104(%esp), %xmm0        # 8-byte Reload

	vmulsd	%xmm6, %xmm0, %xmm5
	vmulsd	%xmm7, %xmm1, %xmm6
	vsubsd	%xmm6, %xmm5, %xmm5
	vmovsd	%xmm5, 64(%esp)         # 8-byte Spill
	vmovapd	%xmm3, %xmm7
	vmulsd	88(%esp), %xmm7, %xmm3  # 8-byte Folded Reload
	vmulsd	336(%esp), %xmm2, %xmm6
	vsubsd	%xmm3, %xmm6, %xmm3
	vmulsd	%xmm0, %xmm3, %xmm3
	vmulsd	80(%esp), %xmm1, %xmm6  # 8-byte Folded Reload
	vsubsd	%xmm6, %xmm3, %xmm1
	vmovsd	%xmm1, 88(%esp)         # 8-byte Spill
... only 400 more lines of assembly ...

The most noteworthy is how the ultra-verbose intermediate code generated by the Python script turns into some very compact and very effective code for the hardware. It is highly super-scalarized too. But is it good enough to compete with C and C++ solutions?

Here are some approximate numbers for the three cases: C with tricks, C++ with a trick, and Python with LLVM.

The tricks for C don't quite work for Clang, so GCC version is measured. It runs for roughly 70 ms on average.
The C++ version was built with Clang and it runs for 60 ms.
The Python version, the one described here, runs only for 55 ms.

Of course, this speed-up is not something you would kill for. But it shows that you can write programs in Python that outperform their counterparts written in C or C++. You don't have to learn some special language to create high-performant applications or libraries.

Conclusion

I think that the dichotomy between fast compiling languages and slow scripting languages is a bluff. Native code generation may very well be not a core feature but something like a pluggable option. Like a library or an embedded language. Like Numba for Python or Terra for Lua. Think about the benefits: you can do your research and rapid prototyping in one language, and then generate highly-performant code... in the same language.

High-performance computing has no reason to remain a privilege of compiling languages. A compiler is just a soft machine for code generation. You can generate code in any language you want. I believe you can teach Matlab to generate ultra-fast LLVM code if you want to.

All the measurements conducted on Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz, the code is compiled with clang version 3.8.0-2ubuntu4 and g++ 5.4.0 with -march=native -O2. The source code for the benchmark is available on Github.

Outperforming everything with anything Python? Sure, why not?

The plan

The Python part

The benchmark

Conclusion

Outperforming everything with anything
_{Python? Sure, why not?}