This is Words and Buttons Online — a collection of interactive #tutorials, #demos , and #quizzes about #mathematics, #algorithms and #programming.

Challenge your performance intuition with C++ magic squares

This was supposed to be a five-minute interview question. I came up with it on Thursday, then spent a night looking for a good answer. Then the other night. Then the Saturday morning. Then it finally occurred to me that it's more like 500 minutes now and it wouldn't work for an interview.

But it's still a fun problem to fiddle with. And it goes like this.

Here is a magic square:

8	3	4
1	5	9
6	7	2

It has all the numbers from 1 to 9 each occurring exactly once. The sum in every row is 15. The sum in every column is 15. And the sum in both diagonals is 15, too. We can write a square as a 9-character string like this: "834159672".

The problem: propose a function that tells if a 3x3 square written as an ASCII string of digits is magic or not fast.

Here is a benchmark. It has no dependencies, and it's C++11 standard compliant, so it runs in basically every environment.

#include <iostream>
#include <chrono>

using namespace std;

// Here you should implement your
// magic checker. It should return
// true if the square is magic,
// false otherwise.
//
//                8 1 6    this is
// "816357492" =  3 5 7    a magic
//                4 9 2    square
//
bool check_if_magic(const std::string& square)
{
  // add your code here
  // (or above if you need global stuff)




  return false;
}

// this generates all possible combinations
// of 1-9 digits that may or may not
// form a magic square
static string buffer = "000000000";
void generate_or_check(int index_or_check = 8)
{
  if(index_or_check == -1){
    if(check_if_magic(buffer))
      cout << buffer << " ";
    return;
  }

  for(auto i = 1u; i < 10; ++i){
    buffer[index_or_check] = '0' + i;
    generate_or_check(index_or_check-1);
  }
}

// this runs the generator and measures time
int main()
{
  auto start = std::chrono::system_clock::now();
  generate_or_check();
  auto end = std::chrono::system_clock::now();
  chrono::duration<double>
    difference = end - start;
  cout << difference.count() << "\n\n";
}

I ran it with an empty checker to see how much time the generator takes. Then ran thirteen different solutions three times each, and picked the best time for every solution. Now I know what works and what doesn't.

Now let's play a game. There will be twelve rounds, each round is a match between two solutions. Or rather between their execution times. Using your intuition and best judgment, please estimate their relative performance. Use the sliders below the code samples.

All is measured on Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz.
Compiled with g++ 5.4.0 -std=c++11 -O2.

Round 1. Check 15 and check digits vs. check digits and then check 15

The naïve solution would be to check every row, column, and diagonal sum and then assert that all the numbers are there. We don't have to parse actual numbers, we can go with ASCII.

To check that all the numbers are there and each occurs only once, we can fill the array of counters and then check that every number has its counter set to 1.

The question is, what should we do first? On the left, the rows, columns, and diagonals checks come first and then there's a counter check. On the right, the counter check comes first. Let's find out which one if faster.

Now place your bets!

bool check_if_magic(const std::string& sq)
  {
  if ((sq[0] + sq[1] + sq[2] != '5'*3)
    || (sq[3] + sq[4] + sq[5] != '5'*3)
    || (sq[6] + sq[7] + sq[8] != '5'*3)

    || (sq[0] + sq[3] + sq[6] != '5'*3)
    || (sq[1] + sq[4] + sq[7] != '5'*3)
    || (sq[2] + sq[5] + sq[8] != '5'*3)

    || (sq[0] + sq[4] + sq[8] != '5'*3)
    || (sq[2] + sq[4] + sq[6] != '5'*3))
    return false;

  std::array<int, 9> numbers_count{};
  for(auto i = 0u; i < 9; ++i)
    ++numbers_count[sq[i]-'1'];
  for(auto i = 0u; i < 9; ++i)
    if(numbers_count[i] != 1)
      return false;

  return true;
  }

bool check_if_magic(const std::string& sq)
  {
  std::array<int, 9> numbers_count{};
  for(auto i = 0u; i < 9; ++i)
    ++numbers_count[sq[i]-'1'];
  for(auto i = 0u; i < 9; ++i)
    if(numbers_count[i] != 1)
      return false;

  if ((sq[0] + sq[1] + sq[2] != '5'*3)
    || (sq[3] + sq[4] + sq[5] != '5'*3)
    || (sq[6] + sq[7] + sq[8] != '5'*3)

    || (sq[0] + sq[3] + sq[6] != '5'*3)
    || (sq[1] + sq[4] + sq[7] != '5'*3)
    || (sq[2] + sq[5] + sq[8] != '5'*3)

    || (sq[0] + sq[4] + sq[8] != '5'*3)
    || (sq[2] + sq[4] + sq[6] != '5'*3))
    return false;

  return true;
  }

They are equally fast.

Round 2. All checks vs. fewer checks

Not all the checks are that effective though. Sure, the first three are independent but if you have established that all your rows sum up to 15 each that means that the total is 45. Now if we check that the first two columns sum up to 15, the third one has to be 45 - 15*2 = 15. The fifth check is redundant.

And we will gain a bit of performance if we throw it away. But in fact, we can also throw away the third check as well. If we assert that the numbers are unique then the total sum is 45 anyway.

So let's throw away two redundant lines and see what will happen.

bool check_if_magic(const std::string& sq)
  {
  if ((sq[0] + sq[1] + sq[2] != '5'*3)
    || (sq[3] + sq[4] + sq[5] != '5'*3)
    || (sq[6] + sq[7] + sq[8] != '5'*3)

    || (sq[0] + sq[3] + sq[6] != '5'*3)
    || (sq[1] + sq[4] + sq[7] != '5'*3)
    || (sq[2] + sq[5] + sq[8] != '5'*3)

    || (sq[0] + sq[4] + sq[8] != '5'*3)
    || (sq[2] + sq[4] + sq[6] != '5'*3))
    return false;

  std::array<int, 9> numbers_count{};
  for(auto i = 0u; i < 9; ++i)
    ++numbers_count[sq[i]-'1'];
  for(auto i = 0u; i < 9; ++i)
    if(numbers_count[i] != 1)
      return false;

  return true;
  }

bool check_if_magic(const std::string& sq)
  {
  if ((sq[0] + sq[1] + sq[2] != '5'*3)
     || (sq[3] + sq[4] + sq[5] != '5'*3)
//    || (sq[6] + sq[7] + sq[8] != '5'*3)

    || (sq[0] + sq[3] + sq[6] != '5'*3)
    || (sq[1] + sq[4] + sq[7] != '5'*3)
//    || (sq[2] + sq[5] + sq[8] != '5'*3)

    || (sq[0] + sq[4] + sq[8] != '5'*3)
    || (sq[2] + sq[4] + sq[6] != '5'*3))
    return false;

  std::array<int, 9> numbers_count{};
  for(auto i = 0u; i < 9; ++i)
    ++numbers_count[sq[i]-'1'];
  for(auto i = 0u; i < 9; ++i)
    if(numbers_count[i] != 1)
      return false;

  return true;
  }

They are equally fast.

Round 3. Naïve solution vs. micro-optimization

The naïve solution is the eight sum-is-15 checks and the digits-are-unique check. The micro-optimization concerns the latter part. To check if the string is a permutation of digits, we want to fill a map of bools and assert that its content is all true. Yes, with a binary map, if there are three equal digits in a square, the value in the map for the digit will be true. But then we wouldn't have enough digits left to fill all nine positions so it works.

Since we are here for the performance, we would use an uint_fast64_t instead of a real map and trigger its bits. The assertion then would be a mere integer comparison. Also, we will start with the loaded map and check it with 0. All for performance sake, right?

bool check_if_magic(const std::string& sq)
  {
  if ((sq[0] + sq[1] + sq[2] != '5'*3)
    || (sq[3] + sq[4] + sq[5] != '5'*3)
    || (sq[6] + sq[7] + sq[8] != '5'*3)

    || (sq[0] + sq[3] + sq[6] != '5'*3)
    || (sq[1] + sq[4] + sq[7] != '5'*3)
    || (sq[2] + sq[5] + sq[8] != '5'*3)

    || (sq[0] + sq[4] + sq[8] != '5'*3)
    || (sq[2] + sq[4] + sq[6] != '5'*3))
    return false;

  std::array<int, 9> numbers_count{};
  for(auto i = 0u; i < 9; ++i)
    ++numbers_count[sq[i]-'1'];
  for(auto i = 0u; i < 9; ++i)
    if(numbers_count[i] != 1)
      return false;

  return true;
  }

auto c15 = '5' * 3;
uint_fast64_t ideal_char_map =
    static_cast<uint_fast64_t>(0x1FF) << 49;
uint_fast64_t char_map_one = 1u;

bool check_if_magic(const std::string& sq)
  {
  if ((sq[0] + sq[1] + sq[2] != c15)
    || (sq[3] + sq[4] + sq[5] != c15)
    || (sq[6] + sq[7] + sq[8] != c15)

    || (sq[0] + sq[3] + sq[6] != c15)
    || (sq[1] + sq[4] + sq[7] != c15)
    || (sq[2] + sq[5] + sq[8] != c15)

    || (sq[0] + sq[4] + sq[8] != c15)
    || (sq[2] + sq[4] + sq[6] != c15))
    return false;

  auto char_map = ideal_char_map;
  for(auto i = 0u; i < 9; ++i)
    char_map ^= char_map_one << sq[i];
  if (char_map != 0)
    return false;

  return true;
  }

They are equally fast.

Round 4. Microoptimized solution vs. the oddity heuristic

The heuristic proposed is meant to end the function as early as possible. It is based on a fact that in a magic square, the corner values are always even and the rest are odd. If one of the numbers fails to pass the oddity test, then there is no point in the full magic check at all.

This heuristic is rather effective. On random numbers, it narrows down our cases with a pretty good 2⁹ to 1 ratio. So only one string in 512 will get to the actual check. Well, technically our numbers are not entirely random, we have more odds than evens, but still.

auto c15 = '5' * 3;
uint_fast64_t ideal_char_map =
    static_cast<uint_fast64_t>(0x1FF) << 49;
uint_fast64_t char_map_one = 1u;

bool check_if_magic(const std::string& sq)
  {
  if ((sq[0] + sq[1] + sq[2] != c15)
    || (sq[3] + sq[4] + sq[5] != c15)
    || (sq[6] + sq[7] + sq[8] != c15)

    || (sq[0] + sq[3] + sq[6] != c15)
    || (sq[1] + sq[4] + sq[7] != c15)
    || (sq[2] + sq[5] + sq[8] != c15)

    || (sq[0] + sq[4] + sq[8] != c15)
    || (sq[2] + sq[4] + sq[6] != c15))
    return false;

  auto char_map = ideal_char_map;
  for(auto i = 0u; i < 9; ++i)
    char_map ^= char_map_one << sq[i];
  if (char_map != 0)
    return false;

  return true;
  }

auto c15 = '5' * 3;
uint_fast64_t ideal_char_map =
    static_cast<uint_fast64_t>(0x1FF) << 49;
uint_fast64_t char_map_one = 1u;

bool check_if_magic(const std::string& sq)
  {
  if ( (sq[0] & 1) != 0 || (sq[1] & 1) == 0
    || (sq[2] & 1) != 0 || (sq[3] & 1) == 0
    || (sq[4] & 1) == 0
    || (sq[5] & 1) == 0 || (sq[6] & 1) != 0
    || (sq[7] & 1) == 0 || (sq[8] & 1) != 0)
    return false;

  if ((sq[0] + sq[1] + sq[2] != c15)
    || (sq[3] + sq[4] + sq[5] != c15)
    || (sq[6] + sq[7] + sq[8] != c15)

    || (sq[0] + sq[3] + sq[6] != c15)
    || (sq[1] + sq[4] + sq[7] != c15)
    || (sq[2] + sq[5] + sq[8] != c15)

    || (sq[0] + sq[4] + sq[8] != c15)
    || (sq[2] + sq[4] + sq[6] != c15))
    return false;

  auto char_map = ideal_char_map;
  for(auto i = 0u; i < 9; ++i)
    char_map ^= char_map_one << sq[i];
  if (char_map != 0)
    return false;

  return true;
  }

They are equally fast.

Round 5. Direct solution vs. the central 5 heuristic

We can choose less effective but cheaper heuristic. For instance, we know that all the 3x3 magic squares have 5 in the center. Let's try and exploit this fact.

Also, this makes the main check a little bit cheaper since we don't have to check for the central 5 in a sum anymore.

auto c15 = '5' * 3;
uint_fast64_t ideal_char_map =
    static_cast<uint_fast64_t>(0x1FF) << 49;
uint_fast64_t char_map_one = 1u;

bool check_if_magic(const std::string& sq)
  {
  if ((sq[0] + sq[1] + sq[2] != c15)
    || (sq[3] + sq[4] + sq[5] != c15)
    || (sq[6] + sq[7] + sq[8] != c15)

    || (sq[0] + sq[3] + sq[6] != c15)
    || (sq[1] + sq[4] + sq[7] != c15)
    || (sq[2] + sq[5] + sq[8] != c15)

    || (sq[0] + sq[4] + sq[8] != c15)
    || (sq[2] + sq[4] + sq[6] != c15))
    return false;

  auto char_map = ideal_char_map;
  for(auto i = 0u; i < 9; ++i)
    char_map ^= char_map_one << sq[i];
  if (char_map != 0)
    return false;

  return true;
  }

auto magic_number = '5' * 3;
auto not_so_magic_number = '5' * 2;
uint_fast64_t ideal_char_map =
  static_cast<uint_fast64_t>(0x1FF) << 49;
uint_fast64_t char_map_one = 1u;
bool check_if_magic(const std::string& sq)
  {
  if(sq[4] != '5')
    return false;

  if ((sq[0] + sq[1] + sq[2] != magic_number)
    || (sq[3] + sq[5] != not_so_magic_number)
    || (sq[6] + sq[7] + sq[8] != magic_number)

    || (sq[0] + sq[3] + sq[6] != magic_number)
    || (sq[1] + sq[7] != not_so_magic_number)
    || (sq[2] + sq[5] + sq[8] != magic_number)

    || (sq[0] + sq[8] != not_so_magic_number)
    || (sq[2] + sq[6] != not_so_magic_number))
    return false;

  auto char_map = ideal_char_map;
  for(auto i = 0u; i < 9; ++i)
    char_map ^= char_map_one << sq[i];
  if (char_map != 0)
    return false;

  return true;
  }

They are equally fast.

Round 6. Fair magic test vs. precached answers

There are only 8 3x3 magic squares in existence. So do we really have to test the string against the magic definition? Why couldn't we precache the magic values somewhere and simply compare with that?

auto c15 = '5' * 3;
uint_fast64_t ideal_char_map =
    static_cast<uint_fast64_t>(0x1FF) << 49;
uint_fast64_t char_map_one = 1u;

bool check_if_magic(const std::string& sq)
  {
  if ((sq[0] + sq[1] + sq[2] != c15)
    || (sq[3] + sq[4] + sq[5] != c15)
    || (sq[6] + sq[7] + sq[8] != c15)

    || (sq[0] + sq[3] + sq[6] != c15)
    || (sq[1] + sq[4] + sq[7] != c15)
    || (sq[2] + sq[5] + sq[8] != c15)

    || (sq[0] + sq[4] + sq[8] != c15)
    || (sq[2] + sq[4] + sq[6] != c15))
    return false;

  auto char_map = ideal_char_map;
  for(auto i = 0u; i < 9; ++i)
    char_map ^= char_map_one << sq[i];
  if (char_map != 0)
    return false;

  return true;
  }

const std::array<std::string, 8>
  all_magic_squares = {
  "816357492", "492357816",
  "618753294", "294753618",
  "834159672", "672159834",
  "438951276", "276951438"
};

bool check_if_magic(const std::string& sq)
{
  for(auto i = 0u; i < 8; ++i)
    if(sq == all_magic_squares[i])
      return true;

  return false;
}

They are equally fast.

Round 7. String array vs. string set

Well, strings comparisons are relatively slow. But we don't have to do all eight comparisons every time. If we organize our comparison work-flow as a tree-like structure, we could minimize the number of comparisons to somewhere form 1 to 3 per string.

Luckily we don't have to reinvent the tree, we can take the standard set instead.

const std::array<std::string, 8>
  all_magic_squares = {
  "816357492", "492357816",
  "618753294", "294753618",
  "834159672", "672159834",
  "438951276", "276951438"
};

bool check_if_magic(const std::string& sq)
{
  for(auto i = 0u; i < 8; ++i)
    if(sq == all_magic_squares[i])
      return true;

  return false;
}

const std::set<std::string>
  all_magic_squares = {
  "816357492", "492357816",
  "618753294", "294753618",
  "834159672", "672159834",
  "438951276", "276951438"
};

bool check_if_magic(const std::string& sq)
{
  return all_magic_squares.find(sq) !=
    all_magic_squares.end());
}

They are equally fast.

Round 8. String set vs. string unordered set

We have another structure that seems promising in terms of performance. The unordered set is known to have constant complexity for the cases with no collisions. Since our data is small, it is unlikely to cause collisions in a hash map. Let's try that.

const std::set<std::string>
  all_magic_squares = {
  "816357492", "492357816",
  "618753294", "294753618",
  "834159672", "672159834",
  "438951276", "276951438"
};

bool check_if_magic(const std::string& sq)
{
  return all_magic_squares.find(sq) !=
    all_magic_squares.end());
}

const std::unordered_set<std::string>
  all_magic_squares = {
  "816357492", "492357816",
  "618753294", "294753618",
  "834159672", "672159834",
  "438951276", "276951438"
};

bool check_if_magic(const std::string& sq)
{
  return all_magic_squares.find(sq) !=
    all_magic_squares.end());
}

They are equally fast.

Round 9. String array vs. string array plus the oddity heuristic

Now, when we have heavier body, maybe we can gain more from the heuristic. Let's try our oddity one here.

const std::array<std::string, 8>
  all_magic_squares = {
  "816357492", "492357816",
  "618753294", "294753618",
  "834159672", "672159834",
  "438951276", "276951438"
};

bool check_if_magic(const std::string& sq)
{
  for(auto i = 0u; i < 8; ++i)
    if(sq == all_magic_squares[i])
      return true;

  return false;
}

const std::array<std::string, 8>
  all_magic_squares = {
  "816357492", "492357816",
  "618753294", "294753618",
  "834159672", "672159834",
  "438951276", "276951438"
};

bool check_if_magic(const std::string& sq)
{
  if ( (sq[0] & 1) != 0 || (sq[1] & 1) == 0
    || (sq[2] & 1) != 0 || (sq[3] & 1) == 0
    || (sq[4] & 1) == 0
    || (sq[5] & 1) == 0 || (sq[6] & 1) != 0
    || (sq[7] & 1) == 0 || (sq[8] & 1) != 0)
    return false;

  for(auto i = 0u; i < 8; ++i)
    if(sq == all_magic_squares[i])
      return true;

  return false;
}

They are equally fast.

Round 10. Direct solution vs. answers in uint64_t.

Since we know that the center value is always five, we don't have to store it in our answers either. This leaves us with exactly 8 bytes per string that can be cheaply encoded as a single 64-bit value. Why don't we store these values instead of strings?

auto c15 = '5' * 3;
uint_fast64_t ideal_char_map =
    static_cast<uint_fast64_t>(0x1FF) << 49;
uint_fast64_t char_map_one = 1u;

bool check_if_magic(const std::string& sq)
  {
  if ((sq[0] + sq[1] + sq[2] != c15)
    || (sq[3] + sq[4] + sq[5] != c15)
    || (sq[6] + sq[7] + sq[8] != c15)

    || (sq[0] + sq[3] + sq[6] != c15)
    || (sq[1] + sq[4] + sq[7] != c15)
    || (sq[2] + sq[5] + sq[8] != c15)

    || (sq[0] + sq[4] + sq[8] != c15)
    || (sq[2] + sq[4] + sq[6] != c15))
    return false;

  auto char_map = ideal_char_map;
  for(auto i = 0u; i < 9; ++i)
    char_map ^= char_map_one << sq[i];
  if (char_map != 0)
    return false;

  return true;
  }

const std::array<uint64_t, 8> magic_numbers
{
  3545515123101087289, 3690191062107239479,
  3544956562637535289, 3978984379655991859,
  3689073941180135479, 4123101758198592049,
  3977867258728887859, 4122543197735040049
};
bool check_if_magic(const std::string& sq)
{
  if(sq[4] != '5')
    return false;

  uint64_t magic_number =
    *(reinterpret_cast<const uint32_t*>
      (sq.data()));
  magic_number <<= 32;
  magic_number +=
    *(reinterpret_cast<const uint32_t*>
      (sq.data()+5));

  for(auto i = 0u; i < 8; ++i)
    if(magic_number == magic_numbers[i])
      return true;
  return false;
}

They are equally fast.

Round 11. An array on uint64_t vs. a set of uint64_t.

Let's now use a set instead of an array here. There should be fewer comparisons anyway, but would it help?

const std::array<uint64_t, 8> magic_numbers
{
  3545515123101087289, 3690191062107239479,
  3544956562637535289, 3978984379655991859,
  3689073941180135479, 4123101758198592049,
  3977867258728887859, 4122543197735040049
};
bool check_if_magic(const std::string& sq)
{
  if(sq[4] != '5')
    return false;

  uint64_t magic_number =
    *(reinterpret_cast<const uint32_t*>
      (sq.data()));
  magic_number <<= 32;
  magic_number +=
    *(reinterpret_cast<const uint32_t*>
      (sq.data()+5));

  for(auto i = 0u; i < 8; ++i)
    if(magic_number == magic_numbers[i])
      return true;
  return false;
}

const std::set<uint64_t> magic_numbers
{
  3545515123101087289, 3690191062107239479,
  3544956562637535289, 3978984379655991859,
  3689073941180135479, 4123101758198592049,
  3977867258728887859, 4122543197735040049
};
bool check_if_magic(const std::string& sq)
{
  if(sq[4] != '5')
    return false;

  uint64_t magic_number =
    *(reinterpret_cast<const uint32_t*>
      (sq.data()));
  magic_number <<= 32;
  magic_number +=
    *(reinterpret_cast<const uint32_t*>
      (sq.data()+5));

  return magic_numbers.find(magic_number)
    != magic_numbers.end();
}

They are equally fast.

Round 12. An array on uint64_t vs. an unordered set of uint64_t.

Could we gain some speed from using unordered set?

const std::array<uint64_t, 8> magic_numbers
{
  3545515123101087289, 3690191062107239479,
  3544956562637535289, 3978984379655991859,
  3689073941180135479, 4123101758198592049,
  3977867258728887859, 4122543197735040049
};
bool check_if_magic(const std::string& sq)
{
  if(sq[4] != '5')
    return false;

  uint64_t magic_number =
    *(reinterpret_cast<const uint32_t*>
      (sq.data()));
  magic_number <<= 32;
  magic_number +=
    *(reinterpret_cast<const uint32_t*>
      (sq.data()+5));

  for(auto i = 0u; i < 8; ++i)
    if(magic_number == magic_numbers[i])
      return true;
  return false;
}

const std::unordered_set<uint64_t> magic_numbers
{
  3545515123101087289, 3690191062107239479,
  3544956562637535289, 3978984379655991859,
  3689073941180135479, 4123101758198592049,
  3977867258728887859, 4122543197735040049
};
bool check_if_magic(const std::string& sq)
{
  if(sq[4] != '5')
    return false;

  uint64_t magic_number =
    *(reinterpret_cast<const uint32_t*>
      (sq.data()));
  magic_number <<= 32;
  magic_number +=
    *(reinterpret_cast<const uint32_t*>
      (sq.data()+5));

  return magic_numbers.find(magic_number)
    != magic_numbers.end();
}

They are equally fast.

Conclusion

Probably, I shouldn't talk for everyone but in my case, the conclusion would have been: “bad measurements beat good intuition nine times out of ten”. Even though I do have a long history of making things fast, this extreme case — small data, cheap body — neglected my experience to naught. String operations seemed to be cheaper, heuristics — more effective and there were also high hopes for the standard containers. Well, at least slightly higher hopes.

I've learned something from this experience. Hope you did too. If so, then no matter what your score is, you are still a winner.