Why is Cilk™ Plus not speeding up my program? (Part 2)


Last time, we covered a list of 10 common performance pitfalls that can prevent users from seeing speedup in a Cilk™ Plus program. To quickly recap, the list is repeated below:

  1. Parallel regions with insufficient work.
  2. Parallel regions with insufficient parallelism.
  3. Code that explicitly depends on the number of workers/cores.
  4. Tasks that are too fine-grained.
  5. Different compiler optimizations between serial and parallel versions.
  6. Reducer inefficiencies.
  7. Data races or contention from sharing.
  8. Parallel regions that are memory-bandwidth bound.
  9. Bugs in timing methodology.
  10. Nested calls to code parallelized using operating-system threads or OpenMP.

Part 1 discussed the first 5 items. In this article, I briefly describe the 5 remaining pitfalls on the list and some ways to avoid them.


6. Reducer inefficiencies


Reducers are a convenient construct for eliminating races on shared variables. But there are three common reducer-related performance pitfalls that one may encounter.

First, reducers can sometimes interact with compiler optimizations in unexpected ways, as discussed in Pitfall #5 (Different compiler optimizations between serial and parallel versions). Consider a simple parallel loop with a nested serial loop, shown in Figure 1(a), that updates a reducer. In an unoptimized implementation of reducers, every update to sum would perform a reducer lookup — a call to the Cilk Plus runtime to find the appropriate view of a reducer for each update. Since the inner for loop is serial, however, the reducer view is actually guaranteed to remain the same through all m iterations of each instance of the inner loop.

#include <cilk/cilk.h>
#include <cilk/reducer_opadd.h>

// (a) Nested loops that access a reducer.
cilk::reducer_opadd<int> sum(0);
cilk_for (int i = 0; i < n; ++i) {
   for (int j = 0; j < m; ++j) {
     sum += (i+j);
   }
}

// (b) A manually optimized version of the loop from (a).
cilk::reducer_opadd<int> sum(0);
cilk_for (int i = 0; i < n; ++i) {
   int tmp_sum = 0;
   for (int j = 0; j < m; ++j) {
     tmp_sum += (i+j);
   }
   sum += tmp_sum;
}
Figure 1: A serial loop accessing a reducer, nested inside a parallel loop. The code in (a) shows the normal version, while (b) shows a manually optimized version that eliminates reducer lookups from the inner loop.

For a simple code example like Figure 1(a), the compiler is usually able to optimize by hoisting the lookup of the reducer out of the inner serial loop, so that at most one lookup happens for each iteration of the cilk_for. This optimization can result in a significant speedup compared to the unoptimized version. The effect of this optimization may be further magnified if the compiler manages to vectorize the inner loop.

Unfortunately, as we discussed in Pitfall #5, the compiler may sometimes miss this optimization if code changes prevent it from recognizing the pattern. In these situations, one can often recover most (but not necessarily all) of the benefit of this optimization by manually transforming the loop into the code in Figure 1(b).


Second, a Cilk Plus program may have performance problems if it has a large number of reducers active at once. The Cilk Plus runtime creates a reducer map on each worker thread, which, roughly speaking, stores the worker's current reducer views in a hash map. When a function invoked by cilk_spawn returns, or when a function reaches a cilk_sync statement, the runtime merges reducer maps, combining the views they store. Each merge of a reducer map requires time proportional to the number of reducer views stored in the map. Thus, although a program with tens of different reducers is usually fine, a program with hundreds or thousands of reducers active at once is unlikely to perform as well, since the runtime incurs significant overhead merging them.

If a parallel computation densely accesses a large set of reducers, it may be better to combine them into a single custom reducer: if a computation always accesses a group of reducers together, then combining them reduces overhead. Note, however, that the runtime initializes reducer views lazily, so a reducer that is never accessed in a parallel subcomputation does not generate any views to merge from that subcomputation. Combining unrelated reducers from otherwise disjoint computations could therefore actually hurt performance.
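To illustrate, here is a minimal sketch of this approach, assuming two counters that are always updated together. The Stats struct, its fields, and the accumulate function are hypothetical names invented for this example; the monoid interface comes from the Cilk Plus reducer library.

#include <cilk/cilk.h>
#include <cilk/reducer.h>

// Hypothetical pair of values that the computation always updates together.
struct Stats {
   long sum;
   long count;
   Stats() : sum(0), count(0) {}
};

// Monoid telling the runtime how to create and merge views: identity
// views are default-constructed, and reduce() folds the right view
// into the left.
struct StatsMonoid : cilk::monoid_base<Stats> {
   static void reduce(Stats* left, Stats* right) {
      left->sum   += right->sum;
      left->count += right->count;
   }
};

cilk::reducer<StatsMonoid> stats;

void accumulate(int n) {
   cilk_for (int i = 0; i < n; ++i) {
      Stats& v = stats.view();  // one lookup updates both fields
      v.sum   += i;
      v.count += 1;
   }
}

With one merged reducer, each successful steal produces a single view to merge instead of one view per counter.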

Finally, a Cilk Plus program may have scalability issues if it uses custom reducers with expensive (i.e., non-constant-time) reduce operations. The overhead of reductions generally increases as the number of worker threads increases, since more workers means more successful steals, and each successful steal may create more reducer views that need to be merged. When the reduce function of a custom reducer is expensive and the number of workers is large (e.g., on an Intel Xeon Phi coprocessor), the work required to execute reductions may become significant. This problem does not arise often in practice, but it is worth keeping in mind when using an expensive custom reducer.


7. Data races or contention from sharing


Suppose you have a Cilk Plus program that does not suffer from any of the previous pitfalls, i.e., it has sufficient work and parallelism, and its running time on a single worker thread is comparable to that of its serialization. You notice, however, that it still does not speed up as you increase the number of workers. The program might then have a performance problem due to a sharing conflict: either a true sharing conflict (e.g., because of a data race) or a false sharing conflict.

The loop in Figure 2 has a problem with false sharing. The loop is free of data races, since each of the 10 iterations of the cilk_for modifies a different position in the array X. But since X[0], X[1], ..., X[9] are stored contiguously, they likely fall on the same one or two cache lines. Those cache lines will constantly bounce back and forth between processors, since multiple processors repeatedly try to write to the same line. This contention can negate any speedup we might otherwise see from parallelization.

int X[10];
cilk_for(int i = 0; i < 10; ++i) {
   for (int j = 0; j < 100000; ++j) {
      X[i] += f(j);
   }
}
Figure 2: A parallel loop in Cilk Plus that may exhibit a problem with false sharing.

To avoid false sharing in the example in Figure 2, one can pad the elements of the array, thereby ensuring that each iteration of the cilk_for loop updates a different cache line. Also, see Avoiding and Identifying False Sharing Among Threads to read more about this issue.
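One possible fix is sketched below: pad each element so that every iteration writes to its own cache line. The 64-byte line size is an assumption about the target machine, and f is the same function as in Figure 2.

#include <cilk/cilk.h>

int f(int j);  // as in Figure 2

// Pad each counter out to its own cache line. 64 bytes is a common
// line size, but this is a per-machine assumption worth checking.
struct alignas(64) PaddedInt {
   int value;
};

PaddedInt X[10];

void run() {
   cilk_for (int i = 0; i < 10; ++i) {
      for (int j = 0; j < 100000; ++j) {
         X[i].value += f(j);
      }
   }
}

Note that alignas requires C++11; older compilers offer equivalents such as __declspec(align(64)) or __attribute__((aligned(64))).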

Unfortunately, problems with false sharing are not always as obvious or as consistent as in the example in Figure 2. A problem might occur only intermittently, especially if it depends on how the memory allocator happens to place shared variables on the heap during a particular execution. When padding data structures in more complicated examples, one may also need to watch out for the alignment of objects, since an unaligned cache-line-size object may be split across two cache lines. One symptom of this kind of false-sharing problem is parallel performance that varies significantly between runs, i.e., the program exhibits significant speedup on some runs, but no speedup or even a slowdown on others.

Finally, true races can also cause sharing conflicts. For example, the loop in Figure 3 has a race on a random-number generator. Fortunately, we can use Intel Cilk™ Screen to check the program for races. For more information, see this introduction to Cilk Screen (part 1 and part 2).

// A loop with a race.
cilk_for(int i = 0; i < n; ++i) {
   int x = rand();
   ...
}
Figure 3: A parallel loop with a data race.
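One way to remove this particular race is to give each iteration its own generator state, as in the sketch below. rand_r is POSIX-specific, and the per-iteration seeding shown here is merely illustrative, not statistically strong.

#include <cilk/cilk.h>
#include <stdlib.h>

void example(int n) {
   cilk_for (int i = 0; i < n; ++i) {
      // Private per-iteration state: no generator state is shared.
      unsigned int seed = (unsigned int)(i + 1) * 2654435761u;
      int x = rand_r(&seed);
      // ... use x as before ...
   }
}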

8. Parallel regions that are memory-bandwidth bound


A simple parallel loop that has a high ratio of memory accesses to computation is unlikely to speed up when parallelized using cilk_for, since the performance of the loop is likely to be bound by memory bandwidth.

Consider the following loop in Figure 4, which reads in two arrays a and b and stores the result into an output array c:

int N = 10000000;
double* a = new double[N];
double* b = new double[N];
double* c = new double[N];
cilk_for (int i = 0; i < N; ++i) {
   double x = a[i];
   double y = b[i];
   c[i] = x*x + y*y;
}
Figure 4: A loop that is bound by memory bandwidth.

This loop performs only 3 arithmetic operations for every 3 array elements read or written. The time to execute this benchmark is likely dominated by how fast the memory system can bring the data in arrays a, b, and c to the processor. Thus, we do not expect much, if any, benefit from parallelizing with a cilk_for loop compared to a normal for loop.

Unfortunately, there is usually no simple fix for this kind of problem. Often, the best way to speed up a program whose performance is limited by memory bandwidth is to restructure the code to use an algorithm that makes more efficient use of cache, i.e., that performs more arithmetic operations per memory element. Fortunately, there is a large body of existing work in this area, so for a given computational problem, it is quite possible that someone has already found a cache-efficient algorithm to solve it.

In particular, many recursive divide-and-conquer algorithms are known to be cache-oblivious, i.e., they are cache-efficient for any cache size, without needing an explicit tuning parameter for cache size. Many common cache-oblivious algorithms are also easy to parallelize using Cilk Plus, since their divide-and-conquer structure maps naturally onto recursive task parallelism, as the skeleton below illustrates.
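As a generic illustration of the pattern (a sketch, not a tuned cache-oblivious algorithm; the grain size is arbitrary), here is the recursive divide-and-conquer skeleton in Cilk Plus:

#include <cilk/cilk.h>

// Divide-and-conquer skeleton: split the range, spawn one half,
// recurse on the other, then sync. GRAIN is an illustrative cutoff.
const int GRAIN = 2048;

long dac_sum(const int* a, int lo, int hi) {
   if (hi - lo <= GRAIN) {
      long s = 0;
      for (int i = lo; i < hi; ++i)
         s += a[i];
      return s;
   }
   int mid = lo + (hi - lo) / 2;
   long left = cilk_spawn dac_sum(a, lo, mid);
   long right = dac_sum(a, mid, hi);
   cilk_sync;
   return left + right;
}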


9. Bugs in timing methodology


Accurately measuring program performance can be quite tricky. Here are some easy-to-make mistakes to watch out for.

  • Measuring startup or cache-warmup effects in only one version of a benchmark when comparing multiple versions.

    For a variety of reasons, the first run of a benchmark may be noticeably slower than subsequent runs. This situation can happen if the first run needs to bring data into cache from main memory, while subsequent runs only need to read that data from cache. Similarly, in a Cilk Plus program, the runtime is initialized lazily, the first time the program encounters parallelism via a Cilk Plus keyword. Thus, the first run of a benchmark might include the overhead of starting the Cilk Plus runtime, while subsequent runs would not.

    Is it fair for us to ignore the first run when measuring performance? Ultimately, the answer depends on our eventual use case, and what we hope to learn from our measurements. But if we are comparing two versions of a code to see which is faster, we should deal with the startup effects consistently for both versions, i.e., include them for both versions or ignore them for both versions.

  • Confusing a timer that measures processor-cycles with one that measures wall-clock time.

    This issue is obvious once one realizes it can be a problem, but it can catch the unwary by surprise. This difference in measurement can cause confusion when trying to determine whether there is speedup.

    To be more concrete, suppose we have a parallel loop that takes 100 ms to execute serially and speeds up linearly on 4 cores. A measurement of wall-clock time should report that the loop takes 25 ms on 4 processors. A measurement of total processor time, however, should report 100 processor-ms, i.e., 25 ms multiplied by 4 processors.

    The clock() function in C/C++ is specified to return processor time, which may advance faster or slower than wall-clock time, so it is not a reliable way to measure speedup. Measuring wall-clock time is generally a more reliable way to test for speedup in Cilk Plus programs. For example, the Cilk Plus code sample for calculating fib uses gettimeofday on Unix and GetTickCount on Windows to measure wall-clock time; a minimal helper in this style appears after this list. TBB also provides a reliable wall-clock timer via the call tbb::tick_count::now(), as described in the TBB reference manual.

  • Measuring times smaller than the timer resolution. If the timer being used for benchmarking can only measure times accurately to a few microseconds, and the running times that you are measuring are not much larger than that, then the effects of roundoff can be noticeable when you are trying to calculate speedups.
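
For reference, here is a minimal wall-clock timing helper in the style of the fib sample, assuming a Unix system with gettimeofday; on Windows one would substitute GetTickCount or QueryPerformanceCounter.

#include <sys/time.h>

// Wall-clock time in seconds, via gettimeofday (Unix).
static double wall_seconds() {
   struct timeval tv;
   gettimeofday(&tv, 0);
   return tv.tv_sec + tv.tv_usec * 1.0e-6;
}

// Typical usage:
//   double t0 = wall_seconds();
//   run_benchmark();            // hypothetical code under test
//   double t1 = wall_seconds();
//   double elapsed = t1 - t0;   // seconds of wall-clock time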

10. Nested calls to code parallelized using operating-system threads or OpenMP


When calling a library function from within a Cilk Plus program, we may have unexpected performance problems if the function itself is parallelized using threads underneath. For example, suppose a cilk_for loop of 100 iterations calls a library function f, and each instance of f is itself parallelized by creating P threads (e.g., using OpenMP), where P is the number of cores available on the machine. Then, since Cilk Plus itself creates P worker threads and may start P instances of f, we may have a total of P*P threads active at once. In this situation, the system is said to be oversubscribed, and performance usually suffers as a result.

This bug can catch users by surprise, especially when calling code from an optimized third-party library that has been parallelized. A quick test for hidden parallelism within a library function is to run the Cilk Plus program with CILK_NWORKERS=1 and monitor the CPU usage of the system while the program runs. If the program appears to use more than one CPU, then something in the library itself may be parallelized using multiple threads.
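The same one-worker test can also be applied programmatically via the Cilk Plus API, as in this sketch (run_benchmark is a hypothetical stand-in for the code under test):

#include <cilk/cilk_api.h>

void run_benchmark();  // hypothetical: the parallel code under test

int main() {
   // Equivalent to setting CILK_NWORKERS=1 in the environment.
   // Must be called before the runtime starts, i.e., before the
   // first Cilk keyword executes.
   __cilkrts_set_param("nworkers", "1");
   run_benchmark();
   return 0;
}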

A stopgap measure for alleviating this kind of performance problem is to carefully manage the number of Cilk Plus threads created, and to limit the number of threads allocated to each instance of f so that the total number of threads allocated is at most P. Unfortunately, this approach can be suboptimal, especially if the instances of f are imbalanced in their workloads. A better approach would be, of course, to parallelize a version of the library in question using Cilk Plus, and avoid the overheads of mixing threading platforms. :)


Summary


That concludes our discussion of common performance pitfalls for Cilk Plus programs. Debugging performance problems can be a challenging task, especially for users who are unfamiliar with programming for multicores using Cilk Plus. Keeping some of these pitfalls and potential solutions in mind can help eliminate some of the mystery and make it easier to get good performance out of your next Cilk Plus program!

For more information about Intel Cilk Plus, see the website http://cilkplus.org. For questions and discussions about Intel Cilk Plus, see the forum http://software.intel.com/en-us/forums/intel-cilk-plus.

