This is Words and Buttons Online — a collection of interactive #tutorials, #demos, and #quizzes about #mathematics, #algorithms and #programming.
This continues Outperforming LAPACK with C metaprogramming and Vastly outperforming LAPACK with C++ metaprogramming. Those two posts describe language tricks that make compilers generate highly performant code for you. But do you need a compiler at all?
The alternative would be writing programs in plain assembly. Although its horrors are vastly exaggerated, this approach has two major flaws: assembly code is not portable, and writing it by hand is tedious and error-prone.
Thankfully, we all live in the XXI century, and both of these problems have already been addressed.
The solution to the first problem is LLVM. The name initially stood for “Low Level Virtual Machine”, and that is exactly what we want to ensure portability. Simply put, it takes code written in a very low-level, hardware-agnostic language and returns highly optimized native code for a specific hardware platform. With LLVM we get both the power of low-level programming and the automation of hardware-oriented micro-optimizations.
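To get a taste of what that low-level language looks like, here is a small hand-written LLVM IR function (a hypothetical muladd, not taken from the benchmark) that computes a*b + c on doubles:

```llvm
; multiply-add on doubles: returns a*b + c
define double @muladd(double %a, double %b, double %c) {
  %1 = fmul double %a, %b
  %2 = fadd double %1, %c
  ret double %2
}
```

The unnamed values %1 and %2 are numbered sequentially, which is exactly the property the code-insertion script will have to preserve later on.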
And the solution to the second problem is any “scripting” language you like. Scheme, Python, Perl, even bash or AWK will do. They are all meant to eliminate tedious work. Anything you use daily for automation, you can use to generate highly performant code.
Let's repeat the experiment from those posts: generate a fully inlined, fully unrolled solver and embed it into the benchmarking code. The plan: compile the C benchmark into LLVM intermediate representation, substitute the solver call with code generated by a Python script, compile the result to native assembly, then assemble and link it into an executable.
That's how it looks in a Makefile:
```makefile
all: generate_llvm generate_s generate_elf test

.PHONY: generate_llvm
generate_llvm:
	clang benchmark.c -march=native -S -emit-llvm -o benchmark.bc
	python substitute_solver_call.py benchmark.bc

.PHONY: generate_s
generate_s:
	llc -O2 benchmark.bc -o benchmark.s

.PHONY: generate_elf
generate_elf:
	as benchmark.s -o benchmark.o
	clang -o benchmark benchmark.o
```
We need a linear solver in Python just like we had with C and C++. Here it is:
```python
# this generates n-solver in LLVM code with LLVMCode objects.
# No LLVM stuff yet, just completely Pythonic solution
def solve_linear_system(a_array, b_array, x_array, n_value):
    def a(i, j, n):
        if n == n_value:
            return a_array[i * n_value + j]
        return a(i, j, n+1)*a(n, n, n+1) - a(i, n, n+1)*a(n, j, n+1)

    def b(i, n):
        if n == n_value:
            return b_array[i]
        return a(n, n, n+1)*b(i, n+1) - a(i, n, n+1)*b(n, n+1)

    def x(i):
        d = b(i, i+1)
        for j in range(i):
            d -= a(i, j, i+1) * x_array[j]
        return d / a(i, i, i+1)

    for k in range(n_value):
        x_array[k] = x(k)
    return x_array
```
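Run on plain numbers, this behaves like an ordinary solver. A quick sanity check on a hypothetical 2x2 system (the function is copied verbatim from the listing above; the input values are made up for illustration):

```python
# The solver from the listing above, run on ordinary floats.
def solve_linear_system(a_array, b_array, x_array, n_value):
    def a(i, j, n):
        if n == n_value:
            return a_array[i * n_value + j]
        return a(i, j, n+1)*a(n, n, n+1) - a(i, n, n+1)*a(n, j, n+1)

    def b(i, n):
        if n == n_value:
            return b_array[i]
        return a(n, n, n+1)*b(i, n+1) - a(i, n, n+1)*b(n, n+1)

    def x(i):
        d = b(i, i+1)
        for j in range(i):
            d -= a(i, j, i+1) * x_array[j]
        return d / a(i, i, i+1)

    for k in range(n_value):
        x_array[k] = x(k)
    return x_array

# 2x +  y = 3
#  x + 3y = 5   ->   x = 0.8, y = 1.4
result = solve_linear_system([2., 1., 1., 3.], [3., 5.], [0., 0.], 2)
print(result)  # [0.8, 1.4] up to floating-point rounding
```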
When we run this on numbers, we get numbers. But we want code. So let's make an object that pretends to be a number only to spy on the algorithm: an object that writes down every operation the algorithm asks it to perform, in ready-to-assemble LLVM intermediate representation.
```python
# this is basically the whole LLVM layer
I = 0
STACK = []

class LLVMCode:
    # the only constructor for now is by double* instruction
    def __init__(self, io, code = ''):
        self.io = io
        self.code = code

    def __getitem__(self, i):
        global I, STACK
        copy_code = "%" + str(I+1)
        copy_code += " = getelementptr inbounds double, double* "
        copy_code += self.io + ", i64 " + str(i) + "\n"
        copy_code += "%" + str(I+2)
        copy_code += " = load double, double* %" + str(I+1)
        copy_code += ", align 8\n"
        I += 2
        STACK += [I]
        return LLVMCode(self.io, copy_code)

    def __setitem__(self, i, other_llvcode):
        global I, STACK
        self.code += other_llvcode.code
        self.code += "%" + str(I+1)
        self.code += " = getelementptr inbounds double, double* "
        self.code += self.io + ", i64 " + str(i) + "\n"
        self.code += "store double %" + str(I)
        self.code += ", double* %" + str(I+1) + ", align 8\n"
        I += 1
        STACK = STACK[:-1]
        return self

    def general_arithmetics(self, operator, other_llvcode):
        global I, STACK
        self.code += other_llvcode.code
        self.code += "%" + str(I+1) + " = f" + operator
        self.code += " double %" + str(STACK[-2]) + ", %"
        self.code += str(STACK[-1]) + "\n"
        I += 1
        STACK = STACK[:-2] + [I]
        return self

    def __add__(self, other_llvcode):
        return self.general_arithmetics('add', other_llvcode)

    def __sub__(self, other_llvcode):
        return self.general_arithmetics('sub', other_llvcode)

    def __mul__(self, other_llvcode):
        return self.general_arithmetics('mul', other_llvcode)

    def __div__(self, other_llvcode):
        return self.general_arithmetics('div', other_llvcode)
```
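The operator-overloading trick is easy to demonstrate in miniature. Here is a stripped-down sketch of the same idea — a spy object that records arithmetic as an infix string instead of emitting LLVM IR (the `Spy` class and `dot2` function are hypothetical, made up for this illustration):

```python
# A minimal "spy object": running ordinary numeric code on it
# records the expression instead of computing it.
class Spy:
    def __init__(self, expr):
        self.expr = expr
    def __add__(self, other):
        return Spy('(' + self.expr + ' + ' + other.expr + ')')
    def __mul__(self, other):
        return Spy('(' + self.expr + ' * ' + other.expr + ')')

def dot2(a, b):  # ordinary code, unaware it is being traced
    return a[0] * b[0] + a[1] * b[1]

trace = dot2([Spy('a0'), Spy('a1')], [Spy('b0'), Spy('b1')]).expr
print(trace)  # ((a0 * b0) + (a1 * b1))
```

`LLVMCode` does exactly this, except it records `getelementptr`/`load`/`store` and `fadd`/`fsub`/`fmul`/`fdiv` instructions, and tracks instruction numbers in `I` and intermediate values in `STACK`.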
Now when we run the solver on these objects, we get a fully functional linear solver written in LLVM intermediate representation. Then we put it into the benchmark code to see how fast it is.
Instructions in LLVM are numbered, and we want to preserve that numbering, so the function that inserts our code into the benchmark is not trivial. But it's not very complicated either.
```python
# this replaces the function call
# and updates all the instructions' indices
def replace_call(text, line, params):
    global I, STACK
    # '%12 ' -> 12
    I = int(''.join(
        [xi for xi in params[2] if xi.isdigit()]
    ))
    first_instruction_to_replace = I + 1
    STACK = []
    replacement = solve_linear_system(
        LLVMCode(params[0]),
        LLVMCode(params[1]),
        LLVMCode(params[2]),
        5).code
    delta_instruction = I - first_instruction_to_replace + 1
    for i in xrange(first_instruction_to_replace, sys.maxint):
        not_found = sum(
            [text.find('%' + str(i) + c) == -1
             for c in POSSIBLE_CHARS_NUMBER_FOLLOWS_WITH]
        )
        if not_found == 4:
            # the last instruction has already been substituted
            break
        new_i = i + delta_instruction
        for c in POSSIBLE_CHARS_NUMBER_FOLLOWS_WITH:
            # substitute instruction number
            text = text.replace('%' + str(i) + c,
                                '%' + str(new_i) + c)
    return text.replace(line, replacement)
```
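The renumbering idea itself fits in a few lines. Here is a simplified sketch (a hypothetical `renumber` helper, not the article's code): when a one-line call is replaced by a block that uses extra SSA registers, every later register `%n` must be shifted by the same delta.

```python
import re

# Shift every register %n with n >= first by delta,
# leaving earlier registers untouched.
def renumber(ir_text, first, delta):
    def bump(match):
        n = int(match.group(1))
        return '%' + str(n + delta if n >= first else n)
    return re.sub(r'%(\d+)', bump, ir_text)

print(renumber('%4 = fadd double %2, %3', first=3, delta=10))
# %4 becomes %14, %3 becomes %13, %2 stays as it is
```

The real `replace_call` is more careful: it only rewrites a number when it is followed by one of the few characters that can legally follow a register name, which is what `POSSIBLE_CHARS_NUMBER_FOLLOWS_WITH` is for.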
The whole piece of code that implements the solver, provides the Python-to-LLVM layer, and does the code insertion is only 100 lines long! You can see it in one piece on GitHub.
The benchmark itself is written in C. When we run the Makefile, the benchmark's call to solve_5 gets replaced by the LLVM code generated from Python, by Python.
The most noteworthy thing is how the ultra-verbose intermediate code generated by the Python script turns into very compact and very efficient code for the hardware. It makes heavy use of superscalar execution, too. But is it good enough to compete with the C and C++ solutions?
Here are some approximate numbers for the three cases: C with tricks, C++ with a trick, and Python with LLVM.
Of course, this speed-up is not something you would kill for. But it shows that you can write programs in Python that outperform their counterparts written in C or C++. You don't have to learn a special language to create high-performance applications or libraries.
I think that the dichotomy between fast compiling languages and slow scripting languages is a bluff. Native code generation may very well be not a core feature but something like a pluggable option. Like a library or an embedded language. Like Numba for Python or Terra for Lua. Think about the benefits: you can do your research and rapid prototyping in one language, and then generate highly-performant code... in the same language.
High-performance computing has no reason to remain a privilege of compiling languages. A compiler is just a soft machine for code generation. You can generate code in any language you want. I believe you can teach Matlab to generate ultra-fast LLVM code if you want to.
All the measurements were conducted on an Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz; the code was compiled with clang 3.8.0-2ubuntu4 and g++ 5.4.0 with -march=native -O2. The source code of the benchmark is available on GitHub.