Cilk Plus Solver for a Chess Puzzle or: How I Learned to Love Fast Rejection

In honor of Valentine’s Day, I’ll note that there is much to be said for fast rejection. It saves time and effort that can be better spent searching elsewhere. In this article I’ll discuss a parallel algorithm for solving a chess puzzle that exploits fast rejection. It makes a good demonstration of basic Intel® Cilk™ Plus programming to solve an interesting puzzle.

The puzzle is whether a player’s eight chess pieces (excluding pawns) can attack all squares on a chess board, assuming that the two bishops must be on opposite-color squares. Some others and I published a serial algorithm for the problem in 1989. The algorithm relies on an interesting rejection test that quickly rejects large portions of the search space. It places more than eight pieces on the board at once and checks whether all squares are under attack, ignoring the blocking effects of the pieces. If not, then any subset of those pieces cannot attack all squares. The opening paragraph is worth reading for Skiena’s sly remark about pruning – we were surprised that editors of an academic journal kept it.

The paper notes that the original program took 75 minutes in 1988 on a Sun 3/360. Machines have gotten much faster since then. I lost the original code, but was able to rewrite a parallel version from scratch, without the one-level look-head mentioned in the paper. The parallel version can solve the same problem in less than two seconds on a high-end 16-core machine (a two-socket Intel(R) Xeon(R) Processor E5-2670L).

Here I will explain the parallelization of the algorithm. I’ll assume that you have already at least skimmed over sections 2-3 of the paper to understand the serial algorithm.

The code is attached. It is a single source file. I recommend reading it top to bottom. Two macros affect its behavior:

Compile with –DPARALLEL=0 to compile as serial code.
Compile with -DBISHOPS_CAN_BE_ON_SAME_COLOR=0 to solve the original problem for which I stated times.

Removing the bishop constraint approximately doubles the work. I made the harder problem (unconstrained bishops) the default because enables the program to show some solutions, and modern machines are fast enough to solve it within my patience limit.

Parallelization

Parallelizing with Cilk Plus requires only minor changes to the serial code. The following sections explain these changes.

Fork-Join

The algorithm performs recursive divide-and-conquer. See the paper for details. Here is the key routine:

 void Search( const Board& b ) {
    if( !b.reject() ) {
        int i = b.chooseAxis();
        if( i<0 ) {
            // Found a weak solution
            ++WeakCount;
            if( b.strongAttacks().isAll() )
                // Found a strong solution
                if( !b.hasSuperposition() )
                    // Found solution with no superposition.  Print it.
                    Output << b << std::endl;
        } else {
            // Unfold on axis i and search both halves in parallel
            cilk_spawn Search( Board(b,i,0) );
            Search( Board(b,i,1) );
        }
    }
    // implicit cilk_sync
}

To parallelize it, I had to indicate that the two recursive calls to Search can run in parallel. To do this, I prefixed the first call with cilk_spawn, which says that the caller can keep on going without waiting for the callee to return. I could have also inserted a cilk_sync after the two calls, which would say to wait until the spawned callee returns, but I didn’t since it would be redundant in this example. Cilk Plus always has an implicit cilk_sync at the end of a routine.

Reducers

There is more parallel magic in routine Search than meets the eye. Note the two lines “++WeakCount;” and “Output << b << std::endl;”. Both line operate on global variables unprotected by locks. If I were writing ordinary multithreaded code, these lines would almost surely lead to missing updates to WeakCount and non-deterministic output. But the program is deterministic because I declared WeakCount and Output as reducers.

Reducers are Cilk Plus objects for which different threads get different “views”, and the views are automatically merged in a way to deliver the same result as the equivalent serial program. The views of WeakCount are partial sums that are automatically added together to get the correct total. The program checks that the total matches the value (8715) reported in the paper (bishops constrained). The reducer Output acts like a std::ostream, except that it cleverly merges partial output such that the final output is identical to what the serial version of the program prints.

Scaling

The code scales well because it has a lot of parallel slack (excess available parallelism) and is not memory intensive. If you have Cilk Plus on your system, I invite you to time the serial versus parallel versions of the code. Here are the recommended command lines for compiling it with the Intel compiler on Linux* or Windows* using the Intel compiler:

	Linux	Windows
serial	icc -O2 -xHost chess-cover.cpp –lrt –DPARALLEL=0	icl /O2 /QxHost chess-cover.cpp /DPARALLEL=0
parallel	icc -O2 -xHost chess-cover.cpp –lrt	icl /O2 /QxHost chess-cover.cpp

The options presume that your compiler paths are set up for using TBB, which I used for its portable wallclock timing facility. The option -xHost tells the compiler to optimize for the host machine processor . Using it gained me about a 15% improvement. Theoretical analysis of the program’s parallel speedup requires two numbers:

Work: The total number of instructions executed.
Span: The number of instructions on the critical path.

The ratio work/span is a formal measure of parallelism in the program. For example, if work=span, the parallelism equals one; that is the program is serial.

My program does relatively little work (only 1,832 instructions on average) between fork/join actions, so it is better to use something called “Burdened Span”, which accounts for synchronization overheads.

Since leaves in the search tree have different depths, estimating the span is a bit tricky. So I let the Cilk view scalability analyzer (you can get it from the Intel(R) Cilk(TM) Plus SDK at http://www.cilkplus.org/download) do the work for me. It reports the following statistics for solving the problem with unconstrained bishops:

   Work :                                162,169,367,032 instructions
   Span :                                58,705 instructions
   Burdened span :                       1,123,705 instructions
   Parallelism :                         2762445.57
   Burdened parallelism :                144316.67
   Number of spawns/syncs:               129,851,246
   Average instructions / strand :       416
   Strands along span :                  43

A strand is a sequence of serial instructions between synchronization operations (spawn or sync operations). The average strand is only 416 instructions. Given that small amount, Cilk Plus’s low-cost fork/join is helpful. The “burdened parallelism” shows the ratio “work”/“burdened span”. The value indicates that this program can theoretically scale to 100 thousand hardware threads on an ideal machine. Of course it is just an estimate for an ideal machine. However, I’ve seen the program speed up by 28x on a real 40-core machine.

Summary

Intel Cilk Plus enabled speeding up the puzzle solver with a few changes, and the resulting program scales well and behaves deterministically.

For more information about Intel Cilk Plus, see the website http://cilkplus.org. For questions and discussions about Intel Cilk Plus, see the forum http://software.intel.com/en-us/forums/intel-cilk-plus.