Last time, we covered a list of 10 common performance pitfalls that can prevent users from seeing speedup in a Cilk™ Plus program. To quickly recap, the list is repeated below:
- Parallel regions with insufficient work.
- Parallel regions with insufficient parallelism.
- Code that explicitly depends on the number of workers/cores.
- Tasks that are too fine-grained.
- Different compiler optimizations between serial and parallel versions.
- Reducer inefficiencies.
- Data races or contention from sharing.
- Parallel regions that are memory-bandwidth bound.
- Bugs in timing methodology.
- Nested calls to code parallelized using operating-system threads or OpenMP.
Part 1 discussed the first 5 items. In this article, I briefly describe the 5 remaining pitfalls on the list and some ways to avoid them.
6. Reducer inefficiencies
Reducers are a convenient construct for eliminating races on shared variables. But there are three common reducer-related performance pitfalls that one may encounter.
First, reducers can sometimes interact with compiler optimizations in unexpected ways, as discussed in Pitfall #5 (different compiler optimizations between serial and parallel versions). Consider a simple parallel loop with a nested serial loop, shown in Figure 1(a), that updates a reducer. In an unoptimized implementation of reducers, every update to sum would perform a reducer lookup, i.e., a call into the Cilk Plus runtime to find the appropriate view of the reducer for that update. Since the inner for loop is serial, however, the reducer view is actually guaranteed to remain the same through all m iterations of each instance of the inner loop.
// (a) Nested loops that access a reducer.
cilk::reducer_opadd<int> sum(0);
cilk_for (int i = 0; i < n; ++i) {
    for (int j = 0; j < m; ++j) {
        sum += (i+j);
    }
}

// (b) A manually optimized version of the loop from (a).
cilk::reducer_opadd<int> sum(0);
cilk_for (int i = 0; i < n; ++i) {
    int tmp_sum = 0;
    for (int j = 0; j < m; ++j) {
        tmp_sum += (i+j);
    }
    sum += tmp_sum;
}
Figure 1: A serial loop accessing a reducer, nested inside a parallel loop. The code in (a) shows the normal version, while (b) shows a manually optimized version that eliminates reducer lookups from the inner loop.
For a simple code example like Figure 1(a), the compiler is usually able to optimize by hoisting the lookup of the reducer out of the inner serial loop, so that at most one lookup happens for each iteration of the cilk_for. This optimization can result in a significant speedup compared to the unoptimized version. The effect of this optimization may be further magnified if the compiler manages to vectorize the inner loop.
Unfortunately, as we discussed in Pitfall #5, the compiler may sometimes miss this optimization if code changes prevent it from recognizing the pattern. In these situations, one can often get most (but not necessarily all) of the benefit of this optimization by manually transforming the loop into the code in Figure 1(b).
Second, a Cilk Plus program may have performance problems if it has a large number of reducers active at once. The Cilk Plus runtime creates reducer maps on each worker thread, which, roughly speaking, store the worker's current views for reducers in a hash map. When a function invoked by a cilk_spawn returns, or when a function reaches a cilk_sync statement, the runtime merges reducer maps together, combining the views stored in the maps. Each merge of a reducer map requires time proportional to the number of reducer views stored in the map. Thus, although a program with tens of different reducers is usually fine, a program with hundreds or thousands of different reducers active at once is unlikely to perform as well, since the runtime incurs significant overhead to merge the reducers.
If a parallel computation densely accesses a large set of reducers, it may be better to combine these reducers into a single custom reducer. If a computation always accesses a group of reducers together, then combining them reduces overhead, since the runtime has fewer views to track and merge. Note, however, that the runtime initializes reducer views lazily, so a reducer that is never accessed in a parallel subcomputation does not generate any views to merge from that subcomputation. Thus, in other cases, combining unrelated reducers from otherwise disjoint computations could actually hurt performance.
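For concreteness, here is a minimal sketch of this idea, combining two accumulators that are always updated together into one custom reducer built on the monoid_base interface of the Cilk Plus reducer library. The Stats type and its field names are illustrative, not from any particular program:

#include <cilk/cilk.h>
#include <cilk/reducer.h>

// A single view type holding two accumulators that are always
// updated together.
struct Stats {
    long sum;
    long count;
    Stats() : sum(0), count(0) {}
};

// The monoid tells the runtime how to merge two views. One merge now
// combines both accumulators, instead of two separate reducer merges.
struct StatsMonoid : public cilk::monoid_base<Stats> {
    static void reduce(Stats* left, Stats* right) {
        left->sum += right->sum;
        left->count += right->count;
    }
};

cilk::reducer<StatsMonoid> stats;

void accumulate(const int* a, int n) {
    cilk_for (int i = 0; i < n; ++i) {
        Stats& v = stats.view();  // one view lookup serves both fields
        v.sum += a[i];
        v.count += 1;
    }
}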
Finally, a Cilk Plus program may have some scalability issues if it uses custom reducers with expensive (i.e., non-constant-time) reduce operations. The overhead of reductions generally increases as the number of worker threads increases, since more workers means more successful steals, and each successful steal may create more reducer views that need to be merged. When the reduce function of a custom reducer is expensive, and the number of workers is large (e.g., on an Intel Xeon Phi coprocessor), the work required to execute reductions may become significant. This problem does not arise too often in practice, but it is an issue to keep in mind when using an expensive custom reducer.
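As an illustration, here is a sketch of a custom reducer whose reduce is non-constant-time: merging two views concatenates vectors, at a cost proportional to the size of the right-hand view. The type names are illustrative:

#include <cilk/reducer.h>
#include <vector>

// A reducer whose reduce() is non-constant-time: merging two views
// copies the entire right-hand vector into the left-hand one.
struct VectorMonoid : public cilk::monoid_base<std::vector<int> > {
    static void reduce(std::vector<int>* left, std::vector<int>* right) {
        // Cost is proportional to right->size(); with many steals on
        // many workers, these merges can add up.
        left->insert(left->end(), right->begin(), right->end());
    }
};

cilk::reducer<VectorMonoid> results;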
7. Data races or contention from sharing
Suppose you have a Cilk Plus program that does not suffer from any of the previous pitfalls, i.e., it has sufficient work and parallelism, and its running time on a single worker thread is comparable to the time required to execute the serialization. You notice, however, that it is still not speeding up as you increase the number of workers. Then the program might have a performance problem due to a sharing conflict: either a true sharing conflict (e.g., because of a data race) or a false sharing conflict.
The loop in Figure 2 has a problem with false sharing. This loop is free of data races, since each of the 10 iterations of the cilk_for loop modifies a different position in the array X. But since X[0], X[1], ..., X[9] are stored contiguously, they will likely fall on the same one or two cache lines. In this example, the cache lines for the X array will constantly bounce back and forth between processors, since multiple processors repeatedly try to write to the same cache line. This contention can negate any speedups we might otherwise see from parallelization.
int X[10];

cilk_for (int i = 0; i < 10; ++i) {
    for (int j = 0; j < 100000; ++j) {
        X[i] += f(j);
    }
}
Figure 2: A parallel loop in Cilk Plus that may exhibit a problem with false sharing.
To avoid false sharing in the example in Figure 2, one can pad the elements of the array, thereby ensuring that each iteration of the cilk_for loop updates a different cache line. Also, see Avoiding and Identifying False Sharing Among Threads to read more about this issue.
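A padded version of the loop from Figure 2 might look as follows; this sketch assumes 64-byte cache lines and a C++11 compiler for alignas:

// Pad each counter to occupy its own cache line (assumes 64-byte lines).
struct alignas(64) PaddedInt {
    int value;
};

PaddedInt X[10];

cilk_for (int i = 0; i < 10; ++i) {
    for (int j = 0; j < 100000; ++j) {
        X[i].value += f(j);
    }
}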
Unfortunately, problems with false sharing are not always as obvious or consistent as in the example in Figure 2. A problem might occur only intermittently, especially if it depends on how the memory allocator happens to place shared variables on the heap during a particular execution. When padding data structures for more complicated examples, one may also need to watch out for the alignment of objects, since an unaligned cache-line-sized object may be split across two cache lines. One symptom of this kind of false sharing problem is that the performance of a parallel execution varies significantly between runs, i.e., the program exhibits significant speedups on some runs, but no speedup or a slowdown on others.
Finally, real races can also cause sharing conflicts. For example, the loop in Figure 3 has a race on a random-number generator. Fortunately, we can use Intel Cilk™ screen to check our program for races. For more information, see this introduction to Cilk screen (part 1 and part 2).
// A loop with a race.
cilk_for (int i = 0; i < n; ++i) {
    int x = rand();
    ...
}
Figure 3: A parallel loop with a data race. |
8. Parallel regions that are memory-bandwidth bound
A simple parallel loop that has a high ratio of memory accesses to computation is unlikely to speed up when parallelized using cilk_for, since the performance of the loop is likely to be bound by memory bandwidth.
Consider the following loop in Figure 4, which reads in two arrays a and b and stores the result into an output array c.
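A representative sketch of such a loop follows; the specific arithmetic here is an assumption, chosen to match the operation count described below:

cilk_for (int i = 0; i < n; ++i) {
    // One multiply and two adds per iteration: 3 arithmetic operations
    // for the 3 array elements touched (a[i], b[i], c[i]).
    c[i] = 2.0 * a[i] + b[i] + 1.0;
}

Figure 4: A simple parallel loop that is likely to be bound by memory bandwidth.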
This loop performs only 3 arithmetic operations for every 3 array elements being read from or written to. The time to execute this benchmark is likely to be dominated by how fast the memory system can bring the data in arrays a, b, and c to the processor. Thus, we do not expect to see much of any benefit from parallelization using a cilk_for loop compared to a normal for loop.
Unfortunately, there is usually not a "simple" fix for this kind of problem. Often, the best way to speed up a program whose performance is limited by memory bandwidth is to restructure the code to use an algorithm that is more efficient in its cache usage, i.e., one that performs more arithmetic operations per memory element. Fortunately, there is a large body of existing work in this area, so for a given computational problem, it is quite possible that someone has already found a cache-efficient algorithm to solve it.
In particular, many recursive divide-and-conquer algorithms are known to be cache-oblivious, i.e., they are cache-efficient for any cache size, without needing an explicit tuning parameter for the cache size. Many common cache-oblivious algorithms are easy to parallelize using Cilk Plus, since their divide-and-conquer structure is naturally expressed using recursive task parallelism.
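As an example, here is a minimal sketch of a parallel, cache-oblivious matrix transpose, a classic divide-and-conquer algorithm from this literature; the function name and blocking threshold are illustrative:

#include <cilk/cilk.h>

// Transpose the submatrix A[i0..i1)[j0..j1) of an n-by-n row-major
// matrix A into B, splitting the larger dimension in half at each step.
// At some recursion depth the blocks fit in cache, whatever its size.
void transpose(const double* A, double* B,
               int i0, int i1, int j0, int j1, int n) {
    int di = i1 - i0, dj = j1 - j0;
    if (di <= 32 && dj <= 32) {
        // Base case: a small block, transposed directly.
        for (int i = i0; i < i1; ++i)
            for (int j = j0; j < j1; ++j)
                B[j*n + i] = A[i*n + j];
    } else if (di >= dj) {
        int im = i0 + di/2;  // split the row range
        cilk_spawn transpose(A, B, i0, im, j0, j1, n);
        transpose(A, B, im, i1, j0, j1, n);
        cilk_sync;
    } else {
        int jm = j0 + dj/2;  // split the column range
        cilk_spawn transpose(A, B, i0, i1, j0, jm, n);
        transpose(A, B, i0, i1, jm, j1, n);
        cilk_sync;
    }
}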
9. Bugs in timing methodology
Accurately measuring program performance can be quite tricky. Here are some easy-to-make mistakes to watch out for.
- Measuring startup or cache-warmup effects in only one version of a benchmark when comparing multiple versions.
For a variety of reasons, the first run of a benchmark may sometimes run noticeably slower than subsequent runs. This situation can happen if the first run needs to bring data into cache from main memory, while subsequent runs only need to read that data from cache. Similarly, in a Cilk Plus program, the runtime is initialized lazily, the first time the program encounters parallelism because of a Cilk Plus keyword. Thus, the first run of a benchmark might include the overhead of starting the Cilk Plus runtime, while subsequent runs would not.
Is it fair for us to ignore the first run when measuring performance? Ultimately, the answer depends on our eventual use case, and what we hope to learn from our measurements. But if we are comparing two versions of a code to see which is faster, we should deal with the startup effects consistently for both versions, i.e., include them for both versions or ignore them for both versions.
- Confusing a timer that measures processor cycles with one that measures wall-clock time.
This issue is obvious once one realizes it can be a problem, but it can catch the unwary by surprise. This difference in measurement can cause confusion when trying to determine whether there is speedup.
To be more concrete, suppose we have a parallel loop that takes 100 ms to execute serially, and it speeds up linearly when executed on 4 cores. A measurement of wall-clock time should report that the loop takes 25 ms to execute on 4 processors. A measurement of total processor time should report 100 processor-ms, i.e., 25 ms multiplied by 4 processors.
The clock() function in C++ is specified to return processor time, which may advance faster or slower than the wall-clock time. Thus, this function is not a reliable way to measure times for speedups. Measuring wall-clock time is generally a more reliable way to test for speedups in Cilk Plus programs. For example, in the Cilk Plus code sample for calculating fib, we use gettimeofday on Unix or GetTickCount on Windows to measure wall-clock time (a minimal timing sketch appears after this list). TBB also provides a reliable wall-clock timer via the call tbb::tick_count::now(), as described in the TBB reference manual.
- Measuring times smaller than the timer resolution. If the timer being used for benchmarking can only measure times accurately to a few microseconds, and the running times that you are measuring are not much larger than that, then the effects of roundoff can be noticeable when you are trying to calculate speedups.
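As a concrete illustration of the wall-clock approach mentioned above, here is a minimal timing sketch using gettimeofday on Unix; the helper name wall_seconds is illustrative, not from the fib sample:

#include <stdio.h>
#include <sys/time.h>

// Returns the current wall-clock time in seconds.
static double wall_seconds(void) {
    struct timeval tv;
    gettimeofday(&tv, 0);
    return tv.tv_sec + tv.tv_usec * 1.0e-6;
}

// Usage: difference two readings around the region being measured.
//   double t0 = wall_seconds();
//   ... run the benchmark ...
//   double t1 = wall_seconds();
//   printf("elapsed: %.6f s\n", t1 - t0);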
10. Nested calls to code parallelized using operating-system threads or OpenMP
When calling a library function from within a Cilk Plus program, we may have unexpected performance problems if the function itself is parallelized using threads underneath. For example, suppose a cilk_for loop of 100 iterations calls a library function f, and each instance of f is itself parallelized by creating P threads (e.g., using OpenMP), where P is the number of cores available on the machine. Then, since Cilk Plus itself creates P worker threads and may start P instances of f, we may have a total of P*P threads active at once. In this situation, the system is said to be oversubscribed, and performance usually suffers as a result.
This bug can catch users by surprise, especially when calling code from an optimized third-party library that has been parallelized. A quick test for hidden parallelism within a library function is to run a Cilk Plus program with CILK_NWORKERS=1, and monitor the CPU usage on the system while the program is running. If the program appears to be using more than one CPU, then something in the library itself may be parallelized using multiple threads.
A stopgap measure for alleviating this kind of performance problem is to carefully manage the number of Cilk Plus threads created, and to limit the number of threads allocated to each instance of f so that the total number of threads allocated is at most P. Unfortunately, this approach can be suboptimal, especially if the instances of f are imbalanced in their workloads. A better approach would be, of course, to parallelize a version of the library in question using Cilk Plus, and avoid the overheads of mixing threading platforms. :)
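For reference, the Cilk Plus worker count can be capped programmatically as well as through the environment; here is a minimal sketch, where run_program is a hypothetical placeholder for the program's parallel work:

#include <cilk/cilk_api.h>

void run_program();  // hypothetical: the program's parallel work

int main() {
    // Cap the Cilk Plus runtime at 4 workers, equivalent to running
    // with CILK_NWORKERS=4; must be called before the runtime starts,
    // i.e., before the first cilk_spawn or cilk_for executes.
    __cilkrts_set_param("nworkers", "4");
    run_program();
    return 0;
}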
Summary
That concludes our discussion of common performance pitfalls for Cilk Plus programs. Debugging performance problems can be a challenging task, especially for users who are unfamiliar with programming for multicores using Cilk Plus. Keeping some of these pitfalls and potential solutions in mind can help eliminate some of the mystery and make it easier to get good performance out of your next Cilk Plus program!
For more information about Intel Cilk Plus, see the website http://cilkplus.org. For questions and discussions about Intel Cilk Plus, see the forum http://software.intel.com/en-us/forums/intel-cilk-plus.