This is Words and Buttons Online, a collection of interactive #tutorials, #demos, and #quizzes about #mathematics, #algorithms and #programming.

Lexical differential highlighting instead of syntax highlighting

In 2013 I was working in nuclear power plant automation. Can't talk much since I still haven't figured out which part of it was classified. But I probably wouldn't get arrested for mentioning that the job required reading a lot of assembly code.

Reading assembly is not as hard as it might seem to an untrained person. In fact, anyone can read a bit of assembly. But in large quantities, it's not exactly easy either. Mnemonics like RCR, SHRD, WBINVD, and CMPXCHG8B are fun to write but hell to read.

What's worse, the standard approach to syntax highlighting doesn't help at all. It's fine that mov doesn't look like eax, but I'd much rather have pmulhw and pmulhuw shown as differently as possible.

So I employed another kind of highlighting. It's not syntax highlighting but lexical differential highlighting. It's “lexical” since it doesn't need true syntax analysis: primitive tokenization and filtering are enough. And it's “differential” because it aims to highlight the difference between lexemes. Ideally, the smaller the lexical difference, the greater the color difference should be.

It works like this.

  movq  %rax, %r14
  .align  16, 0x90
.LBB0_6:
  # =>This Inner Loop Header: Depth=1
  movl  -12(%r12,%rbx,4), %eax
  addl  $-1, %eax
  imull  %eax, %eax
  movl  -8(%r12,%rbx,4), %ecx
  addl  $-1, %ecx
  imull  %ecx, %ecx
  addl  %eax, %ecx
  movl  -4(%r12,%rbx,4), %eax
  addl  $-1, %eax
  imull  %eax, %eax
  addl  %ecx, %eax
  movl  (%r12,%rbx,4), %ecx
  movl  $1, %edx
  subl  %ecx, %edx
  addl  $-1, %ecx
  imull  %edx, %ecx
  cmpl  %ecx, %eax
  jne  .LBB0_7
# BB#17:   #   in Loop: Header=BB0_6 Depth=1
  incl  4(%rsp)
.LBB0_7:   # %.backedge
           #   in Loop: Header=BB0_6 Depth=1
  addq  $1, %rbx
  cmpq  $160000000, %rbx
  jne  .LBB0_6
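
To make the pmulhw and pmulhuw point concrete, here is a minimal sketch of how a lexeme can be turned into a color. It assumes the same weighted character-code sum that the JavaScript highlighter in the P. S. uses; the original 2013 tool may well have done it differently, and the name `tokenColor` is made up for this example.

```javascript
// Hypothetical helper: maps a lexeme to a 3-digit hex color.
// Assumption: weights and channel layout are borrowed from the
// JavaScript highlighter shown in the P. S. of this page.
function tokenColor(token) {
    const weights = [1, 7, 11, 13];
    let codeSum = 0;
    for (let i = 0; i < token.length; ++i)
        codeSum += weights[i % 4] * token.charCodeAt(i);
    const fixedChannel = codeSum % 3;      // which channel is pinned to 3
    const first = 1 + (codeSum % 5) * 2;   // 1, 3, 5, 7 or 9
    const second = 4 + (codeSum % 2) * 5;  // 4 or 9
    return (fixedChannel === 0 ? '3' : '') + first
        + (fixedChannel === 1 ? '3' : '') + second
        + (fixedChannel === 2 ? '3' : '');
}

// One extra letter changes the sum, and with it the color.
console.log(tokenColor('pmulhw'), tokenColor('pmulhuw'));  // prints: 379 743
```

A hash like this only makes similar lexemes likely to differ in color. It doesn't guarantee that the closest lexemes get the most distant colors, which is the ideal described above.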

It's 2019, and I'm getting back to this idea. I'm using lexical differential highlighting not only for assembly but for most of the code published on Words and Buttons. There are two reasons to do so. First, it works with other languages too. And second, it saves your traffic. Yes, including right now.

Even with quotes and comments taken into account, the tokenizer itself can be implemented in about 30 lines of code. And the painting function is even smaller. So instead of doing syntax highlighting statically or dragging in a third-party library as a dependency, I simply rewrite the coloring code specifically for every page.

This way I can highlight code for any assembly dialect or any language including the most obscure and outdated ones. And it only takes a few KB per instance.
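
As a rough illustration of how small such a per-page highlighter can be, here is a sketch for the assembly listing above. It is not the generator's actual output. The separator set, the `#` comment marker (which fits this listing but not every dialect), the HTML escaping, and all the function names are assumptions made for the example.

```javascript
// Hypothetical per-page highlighter for AT&T-style assembly.
const SEPARATORS = /([\s,()]+)/;  // the capturing group keeps separators in split()
const COMMENT_START = '#';        // assumption: line comments start with '#'

function escapeHtml(text) {
    return text.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;');
}

function paintedIn(text, color) {
    return text.length === 0 ? ''
        : '<span style="color:#' + color + '">' + escapeHtml(text) + '</span>';
}

// Same weighted character-code sum as in the JavaScript highlighter in the P. S.
function tokenColor(token) {
    let codeSum = 0;
    for (let i = 0; i < token.length; ++i)
        codeSum += [1, 7, 11, 13][i % 4] * token.charCodeAt(i);
    const fixedChannel = codeSum % 3;
    return (fixedChannel === 0 ? '3' : '') + (1 + (codeSum % 5) * 2)
        + (fixedChannel === 1 ? '3' : '') + (4 + (codeSum % 2) * 5)
        + (fixedChannel === 2 ? '3' : '');
}

function highlightLine(line) {
    // Everything from the comment marker to the end of the line is painted gray.
    const commentAt = line.indexOf(COMMENT_START);
    const code = commentAt === -1 ? line : line.slice(0, commentAt);
    const comment = commentAt === -1 ? '' : line.slice(commentAt);
    const painted = code.split(SEPARATORS).map(function (piece, i) {
        // Odd indices hold the separators captured by the regex; they stay unpainted.
        return i % 2 === 1 ? escapeHtml(piece) : paintedIn(piece, tokenColor(piece));
    }).join('');
    return painted + paintedIn(comment, '555');
}

function highlight(text) {
    return text.split('\n').map(highlightLine).join('\n');
}
```

Calling `highlight` on the text of a listing returns HTML in which every lexeme is wrapped in its own colored span, while whitespace and punctuation are left as they were. Adapting it to another dialect would mostly mean changing the two constants at the top.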

But, of course, rewriting the highlighter by hand would be extravagant, so I made this generator to do it for me.

Highlighter generator

... here will be the highlighter code...

Feel free to use it however you like. Just like every other piece of code on Words and Buttons, it's properly unlicensed.

P. S.

After a very fruitful discussion on Reddit, I tried emulating a bit of syntax highlighting to work together with the lexical highlighting. And it worked! Here's a quasi-syntax-differential highlighter for JavaScript. It's highlighted by itself.

function colorized_with_js_highlighter(text) {
    const separators = ['function ', ' if(', 'return ', 'var ', 'const ', ' for(',
        '\n', ' ', '\t', '.', ',', ':', ';', '+', '-', '/', '*', '(', ')', '<', '>', '[', ']', '{', '}'];
    const quotes = ['\'', '"'];
    const comments = [['//', '\n'], ['/*', '*/']];

    function painted_in(line, color) {
        return line.length == 0 ? "" : "<span style=\"color:#" + color + "\">" + line + "</span>";
    }

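    function colorized(token) {
        // Weighted sum of character codes; position-dependent weights help
        // similar tokens end up with different sums.
        var code_sum = 0;
        for(var i = 0; i < token.length; ++i)
            code_sum += ([1, 7, 11, 13][i % 4] * token.charCodeAt(i));
        // 3-digit hex color: one channel is pinned to 3, the other two depend on the sum.
        var zero_channel = code_sum % 3;
        var color = '' + (zero_channel == 0 ? '3' : '') + (1 + (code_sum % 5) * 2)
            + (zero_channel == 1 ? '3' : '') + (4 + (code_sum % 2) * 5)
            + (zero_channel == 2 ? '3' : '');
        return painted_in(token, color);
    }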

    function separated(line, i) {
        if(i == separators.length)
            return colorized(line);
        return line.split(separators[i]).map(function(subline) {
            return separated(subline, i + 1);}).join(separators[i]);
    }

    function unquoted(line, i) {
        if(i == quotes.length)
            return separated(line, 0);
        var chunk_no = 0;
        return line.split('\\' + quotes[i]).join('\0').split(quotes[i]).map(function (chunk) {
            return chunk.split('\0').join('\\' + quotes[i]);}).map(function (chunk) {
                return ++chunk_no % 2  == 1 ? unquoted(chunk, i + 1) : painted_in(quotes[i] + chunk + quotes[i], "555");}).join('');
    }

    function uncommented(line, i) {
        if(i == comments.length)
            return unquoted(line, 0);
        var chunks = line.split(comments[i][0]);
        return uncommented(chunks[0], i + 1) + chunks.slice(1).map( function(chunk) {
            var in_out_comment = chunk.split(comments[i][1]);
            return painted_in(comments[i][0] + in_out_comment[0] + (in_out_comment.length > 1 ? comments[i][1] : ''), "555")
                + uncommented(in_out_comment.slice(1).join(comments[i][1]), i + 1);}).join('');
    }

    return uncommented(text, 0);
}