I moved on with my algorithm improvements, this time taking a step I had particularly been waiting for: I started using threads in the algorithm so that several parts of the search could advance in parallel. This was rather interesting for me, as in my daily work I have very rarely done anything where this kind of parallelism would (have been thought to) bring much benefit, and thus threads have been used in very few of the projects I've worked on.

Number of threads

Before I even got started, I was already wondering what the optimal number of threads to use would be. My idea was that I'd use the maximum number of threads that could be used efficiently, i.e. maximize the advantage of parallel computation without yet suffering from the thread management overhead. I would then create a fixed-size thread pool with that number of threads, and they would automatically pick up more work whenever they became idle.
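The fixed-size pool idea above can be sketched roughly like this — a minimal example, not the actual solver code; the class name and trivial tasks are just placeholders to show how idle pool threads automatically pick up queued work:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class FixedPoolSketch {

    // Submits more trivial tasks than there are threads; the pool queues
    // the excess, and each thread picks up new work whenever it goes idle.
    public static int runTasks(int threadCount, int taskCount) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threadCount);
        AtomicInteger completed = new AtomicInteger();
        for (int i = 0; i < taskCount; i++) {
            pool.submit(() -> { completed.incrementAndGet(); });
        }
        pool.shutdown();                          // no new tasks accepted
        pool.awaitTermination(1, TimeUnit.MINUTES); // wait for queued work to drain
        return completed.get();
    }

    public static void main(String[] args) throws InterruptedException {
        // e.g. 4 threads working through 10 queued tasks
        System.out.println(runTasks(4, 10));
    }
}
```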

To keep the algorithm suitable for general use, I thought I'd go with a thread count equal to the number of CPU cores available on the machine it was run on. For example, I do my development on a 6-core MacBook Pro, and thus felt the algorithm should run with six threads on my machine.

However, reading some online articles and discussions about this, I quickly learned that the question of the optimal number of threads isn't quite this straightforward. I'm definitely still no expert in threading, but as far as I understood it, with simultaneous multithreading (hyper-threading on Intel CPUs) each physical core can run two hardware threads, and thus a number bigger than the core count can (sometimes) perform better. I didn't dive too deep into the hardware details before starting to write the actual implementation. It turned out Java has a built-in method for getting the number of available processors that seemed very promising, Runtime.getRuntime().availableProcessors(), so I decided to at least start with that.
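The method in question is a one-liner to use; note that it returns the number of logical processors the JVM sees (so with hyper-threading, typically twice the physical core count), and the documentation says the value may even change during a run:

```java
public class CoreCount {
    public static void main(String[] args) {
        // Number of logical processors available to the JVM
        int n = Runtime.getRuntime().availableProcessors();
        System.out.println("Available processors: " + n);
    }
}
```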

The needed changes

As my original algorithm implementation used a single solver class with the search progress stored in instance variables, those were effectively global variables from the algorithm's standpoint. That didn't lend itself to parallelism very well, so I had to figure out how to divide the search into independent parts.
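The fix, in essence, is to give each thread its own solver instance so no search state is shared. The sketch below uses hypothetical names (StartPointSolver, search) and a stand-in search method, not the real solver logic:

```java
// Hypothetical per-thread solver: all search state lives in instance
// fields, so each thread can own a separate instance and nothing is shared.
public class StartPointSolver implements Runnable {
    private final int startPoint;
    private int solutionsFound; // per-instance state, not global

    public StartPointSolver(int startPoint) {
        this.startPoint = startPoint;
    }

    @Override
    public void run() {
        // Stand-in for the real depth-first search from this starting point
        solutionsFound = search(startPoint);
    }

    private int search(int start) {
        return start + 1; // placeholder result
    }

    public int getSolutionsFound() {
        return solutionsFound;
    }
}
```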

I picked an approach that was rather simple to implement but should still provide pretty good performance improvements: I decided to run the search from each starting point in a separate thread. That way each thread could start with basically an empty game board, and no complex search state would need to be copied when starting a new thread. The only bigger downside of this approach that I can think of is that it uses only as many threads as there are unique starting points, and for smaller boards that can be fewer than what the machine could handle efficiently. Then again, both a 5×5 board and a 6×6 board already have 6 unique starting points, which matches the number of cores on my machine. I don't know if using more threads would make the search faster or slower, but making the algorithm actually able to use more would take considerably more effort, so this seemed like a good approach — again, at least to start with.

With this approach picked out, the actual implementation was eventually very easy to do. Creating a thread pool was as simple as calling Executors.newWorkStealingPool(), a method that by default uses the number of available processors – the exact value I'd thought I'd use – as its parallelism level. The biggest task was to create a new solver class that finds the solutions from just one given starting position and keeps its own internal search state, but that was also rather straightforward. Then I only had to make that class runnable by a thread, and the whole change was done.
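Put together, the structure looks roughly like this. This is a sketch under assumptions — solveFrom is a stand-in for the real per-start-point solver, and the class and method names are invented for illustration — but the pool setup and the fan-out/fan-in shape match what's described above:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelSearchSketch {

    // Stand-in for the real search from one starting point; the real
    // solver would explore the board and return the solutions it found.
    static int solveFrom(int startPoint) {
        return startPoint * 10;
    }

    // Submits one search task per starting point to a work-stealing pool
    // and combines the per-thread results back in the calling thread.
    static int searchAll(List<Integer> startPoints) throws Exception {
        // Parallelism defaults to Runtime.getRuntime().availableProcessors()
        ExecutorService pool = Executors.newWorkStealingPool();
        List<Future<Integer>> futures = new ArrayList<>();
        for (int sp : startPoints) {
            futures.add(pool.submit(() -> solveFrom(sp)));
        }
        int total = 0;
        for (Future<Integer> f : futures) {
            total += f.get(); // blocks until that search finishes
        }
        pool.shutdown();
        return total;
    }

    public static void main(String[] args) throws Exception {
        // Six unique starting points, as on a 6x6 board
        System.out.println(searchAll(List.of(0, 1, 2, 3, 4, 5)));
    }
}
```

The main thread here plays the role described in the results section: it waits for all the per-start-point searches and then does the sequential post-processing (in the real algorithm, the rotation and mirroring of solutions).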

The results

So, I ended up with an algorithm that uses 6 parallel search threads for finding the solutions from different starting points, after which the rotation and mirroring of those solutions is done in the main thread. I had kind of hoped this would cut the running time of the algorithm to perhaps one fifth (with 6 threads) of what it was before the change. I guess that wasn't really a realistic goal, but I was anyway slightly disappointed by the results. I ran the algorithm several times for a 6×6 board, and the running times were between 47 and 55 seconds. So, about one third of what they were before this latest change.

Even if that's not as good as I had hoped, I should probably still be quite happy about it: it's another ~65% performance improvement over the previous round, and compared to the original naive algorithm, I have now shaved off almost 97% of the running time.