Last time, I introduced DotMix, a deterministic
parallel random-number generator (DPRNG) for
Intel® Cilk™ Plus. To briefly summarize the last post, DotMix is a new
contribution that you can download from our Contributed Code section. I previously mentioned
two advantages that DotMix has over a traditional serial random-number
generator (RNG) such as
rand()
:
- DotMix avoids some of the scalability bottlenecks of a traditional serial RNG, and
- DotMix generates random numbers deterministically for most Cilk Plus programs when using the same seed and program input.
Today, I will discuss these advantages in greater detail, and explain why the problem of deterministic random-number generation for Cilk Plus turns out to be more complicated than one might initially expect.
Origins of DotMix
DotMix is a code developed by Tao B. Schardl and myself, with input from Charles E. Leiserson. DotMix comes out of some recent research done at MIT. For those of you who are not familiar with the history of Cilk, Cilk Plus is a descendant of the Cilk project from MIT in the mid 90's. Since that time, there has been a long tradition of Cilk-related research coming out of MIT, particularly from the SuperTech research group, headed by Charles E. Leiserson. In general, at any point in time, there are a number of ongoing research projects relevant to the Cilk Plus community, happening both at MIT and elsewhere. But I am able to tell you the most details about DotMix, since I had the opportunity to participate in this particular research effort.
Before diving in any further, it is useful for me to give you a bit of background about myself. My day job is Intel® Cilk™ Plus runtime developer, which means I am one of a number of Intel employees whose job it is to make sure that Cilk Plus works as it should, and to help fix it when it doesn’t. Before joining Intel however, I was an MIT student, member of the SuperTech group, and a frequent user of Cilk technologies. Thus, you could say that one of my long-time hobbies is writing Cilk code and experimenting with new ways to make Cilk better. In today’s post, I'll remove my Intel® Cilk™ Plus runtime developer hat, put on my Cilk researcher and individual contributor hat, and tell you the story of DotMix.
To the best of my knowledge, the research into what eventually led
to DotMix began sometimes around 2010, while I was a student at
MIT. As those of you who have worked on similar projects may be aware,
it is often the case with any kind of research that the origins of any
particular idea or research topic are often impossible to attribute to
any one person or event. Moreover, the journey towards a solution one
takes towards a solution is inevitably more convoluted than it “should
have been” in hindsight. Nevertheless, for the purposes of telling
this story, I will describe a relatively straightforward path to the
solution we (my coauthors at MIT and I) ended up with. I will also
pretend that our investigations began with a variant of our favorite
Cilk program: fib
. For the latter, I suspect that I'm not
really stretching the truth too far. :)
In particular, we started with (the conceptual equivalent of) the following Cilk Plus program.
#include <cstdio> #include <cstdlib> #include <cilk/cilk.h> #include <cilk/reducer_opadd.h> // Global reducer to add up random numbers. cilk::reducer_opadd<uint64_t> sum(0); int rand_fib(int n) { if (n < 2) { // Generate a random number, add it to sum. sum += rand() * n; return n; } else { int x, y; sum += rand() * n; x = cilk_spawn rand_fib(n-1); sum += rand() * n; y = rand_fib(n-2); cilk_sync; sum += rand() * n; return x+y; } } int main(void) { int n = 30; int ans = rand_fib(n); std::printf("fib(%d) = %d\n", n, ans); std::printf("sum is %llu\n", sum.get_value()); return 0; }
In this program, the rand_fib()
method computes
the nth Fibonacci number, but also uses a (pseudo)random-number
generator (RNG) to generate a random number at various points in the
computation. These random numbers are multiplied by the current value
of n
, and added together into the reducer
sum
. We would like this program to run efficiently in
parallel. After all, the whole point of parallelizing a program is to
make it run faster!
On the other hand, we might also want the program to generate
deterministic output, that is, output the same final sum on each run
of the program. Using a typical implementation of rand()
,
this program has deterministic output for a serial execution, since
rand()
generates a deterministic stream of random
numbers. For a parallel execution of this program, however, we don't
always get deterministic behavior. As an aside, the multiplication of
the random number by n
is significant here. If we simply
added all the random numbers together, we might still get
deterministic output because integer addition is commutative!
I'm sure many of you have experienced firsthand the frustration that stems from trying to track down a bug in a program that you can't consistently reproduce because the program behaves differently every time you run it. All other things being equal, for a parallel execution it would be nice to have a deterministic parallel random-number generator — an RNG that generates the same random numbers each time you run the code, even in parallel. Using such an RNG eliminates a source of nondeterminism in a parallel program, which I assert makes parallel programming easier.
From this point forward, I'm going to use the acronym defined earlier: DPRNG. I claim this acronym rolls off the tongue a bit more easily than "deterministic parallel random-number generator." :)
Can we achieve both good performance and deterministic
execution? As we thought about the problem, we realized that achieving
both goals simultaneously was trickier than we first
thought.
Stream RNGs and Parallel Loops
To understand some of the issues, it is helpful to review how
a typical RNG such as rand()
is implemented. RNGs like
rand()
are typically stream
RNGs— RNGs designed to output a stream of numbers, one
at a time, based on the generator's internal state. Seeding the RNG
initializes the RNG’s internal state. After every call to
rand()
, the RNG transforms this state, and then uses it
to output the next number in the stream.
We can be slightly more mathematical in this description. Let
xi
be the internal state of the RNG after the
i
th call to rand()
, and suppose each call to
rand()
uses some function f
to transform the
internal state. Then, a sequence of n
calls to the RNG
transforms the internal state into xn
as
follows:
x0
x1 = f(x0)
x2 = f(x1) = f(f(x0))
x3 = f(x2) = f(f(f(x0)))
x4 = f(x3) = f(f(f(f(x0))))
...
xn = f(xn-1) = f(n)(x0)
Over the years, many people have put significant effort into
designing good RNGs, and the functions used as f
can get
pretty complicated. I won't pretend I understand enough about them to
explain all the details. Fortunately, understanding the exact
function f
is not actually important for our
discussion. The important thing that the math tells us is that for a
generic f
, there is a serial chain of dependencies to
compute xn
from
x0
.
With that fact in mind, let's consider what happens when we
use a stream RNG in a simple parallel loop. In the following program,
each iteration of the parallel loop generates a random number and uses
it to update a reducer sum
.
#include <cstdio> #include <cstdlib> #include <cilk/cilk.h> #include <cilk/reducer_opadd.h> int main(void) { cilk::reducer_opadd<int> sum(0); int n = 1000000; cilk_for(int i = 0; i < n; ++i) { sum += (i+1) * rand(); } std::printf("Final sum = %d\n", sum); return 0; }
How does this program behave? One of two scenarios seems
likely, depending on the specific implementation of
rand()
.
-
Suppose the
rand()
implementation uses a single stream RNG for all worker threads. In this case, multiple worker threads in Cilk are trying to manipulate the RNG's internal state concurrently to generate a random number. Consider two subcases.- If the implementation of
rand()
is not properly synchronized, then there is a race condition, since multiple worker threads are trying to changes the RNG's internal state concurrently. According to theman
page on my Linux desktop, the functionrand()
is not thread-safe, so technically the originalrand_fib()
function has a bug! - If the implementation of
rand()
were thread-safe, e.g., the internal state is protected by a lock, then the program should run correctly. But in a parallel execution, the internal RNG state that iterationi
sees for its call torand()
depends on the order that the iterations manage to acquire the lock, and this order is likely to vary between program runs. Thus, this approach does not guarantee deterministic output.
cilk_for
loop above is likely to exhibit in either no speedup or even slowdown when executed using more than one worker. - If the implementation of
-
Suppose the
rand()
implementation maintains a separate stream RNG for each worker thread, for example, in thread-local storage. This implementation avoids the performance bug of the previous scenario, since each thread has its own private RNG state. But it still has the problem that the random numbers that are generated are nondeterministic, since the runtime scheduler may execute a given iteration of the cilk_for loop on a different worker between different runs.
To give a concrete example, in two consecutive runs of the same Cilk Plus program on 3 workers, the mapping of iterations to worker threads could be:
Run 1 takes 500000 numbers from worker 0's RNG, while Run 2 takes only 250000 numbers. Thus, the final sum is likely to be different between the two runs.Run 1:
Iterations 0 to 499999 on worker 0
Iterations 750000 to 999999 on worker 1
Iterations 500000 to 749999 on worker 2
Run 2:
Iterations 0 to 249999 on worker 0
Iterations 500000 to 749999 on worker 1
Iterations 750000 to 999999, 250000 to 499999 on worker 2
The nondeterminism in the second case comes from the combination of the dynamic load balancing with thread-local storage. In particular, the Cilk Plus runtime uses a randomized work-stealing scheduler, where one worker can steal work from another when it runs out of work to do. This scheme has the advantage of efficient scheduling even when the work of each loop iteration varies significantly. The tradeoff is, however, that with dynamic load balancing, the identity of the worker that executes a given iteration now varies from run to run in our sample program.
As an aside, this nondeterministic behavior is exactly why we should generally frown upon the use of thread-local storage in Cilk Plus. By default, thread-local storage in Cilk Plus is in some sense, "worker-local" storage, because the state gets associated with a given worker thread. The use of worker-local storage exposes the underlying nondeterminism in the runtime scheduler to the programmer writing a Cilk Plus application, something that the design of Cilk Plus specifically tries to avoid. But I digress, as this topic is best left for another day...
Returning to the story of DotMix, after some investigation of the
simple parallel loop, we concluded that these two implementations of
rand()
left us with nondeterministic Cilk Plus programs,
a situation that was less than ideal. Could we make these programs
deterministic?
Strand-local RNGs
After thinking about the parallel loop example for a bit, we came
up with a third, deterministic implementation for rand()
,
based on "strand-local" RNGs. In the previous code example, each
iteration of the cilk_for
loop represents a separate
strand, a serial chain of instructions without any parallel control in
it. Moreover, each strand has an obvious
name that does not change between program runs — the value of
the loop index for the iteration. If we create a new RNG for each loop
iteration, with the loop index being the RNG's seed, then each
iteration uses an RNG whose identity is independent of the worker the
Cilk Plus runtime used to execute that iteration. Thus, we get a
deterministic parallel RNG for the parallel loop.
This notion could be generalized to an arbitrary Cilk Plus program by creating a new RNG for every strand. As long as you can find a unique and deterministic name for every strand to use as the initial seed for the RNG, then you get a deterministic RNG. One can think of this approach as using a "strand-local" RNG, since each strand has a separate RNG.
As we were investigating prior work on the topic of parallel
random-number generation, we discovered that this notion of a
strand-local RNG was conceptually equivalent to solutions that were
already implemented and used in practice. For instance, SPRNG, a popular
library for parallel random-number generation, provides an API for
essentially spawning new RNGs. If we spawned a new RNG for every
cilk_spawn
statement in rand_fib
, then we
get a deterministic parallel RNG. Problem solved... right? :)
As you might guess, the answer was "not quite." Using strand-local
RNGs leads to a number of issues in Cilk Plus, but the biggest one we
encountered was overhead. Creating a new stream RNG can be an
expensive operation, possibly much more expensive than the overhead of
a cilk_spawn operation itself. For a program like
rand_fib
, creating a new RNG for each strand of execution
is analogous to going on a vacation and buying a new car at my
destination instead of renting one at the airport. While it might be
nice to have my own car while I am traveling, it is not going to be
worth the cost unless I am going to be on vacation for a really long
time.
The situation with a typical stream RNG is analogous. A good
stream RNGs is designed to generate a long stream of n
random numbers, where n
is usually measured in at least
millions or billions. On the other hand, a typical Cilk Plus program
like rand_fib
might have millions of strands, but with
each strand asking for at most ten or a hundred random
numbers. Creating and seeding a new stream RNG for each strand can be
quite expensive, and most of its capabilities are going to waste.
In parallel programs where users create and manage pthreads
explicitly, creating a new stream RNG for each pthread is usually
affordable, since creating a new pthread is relatively expensive. In
contrast, strands in Cilk Plus are created at every
cilk_spawn
or cilk_sync
, and the cost of
creating a new RNG is often unacceptable. Using a new stream RNG for
every strand works, but it does not really use the RNG in the way it
was originally meant to be used.
Hashing for DPRNGs
After that long explanation, I've finally arrived at the approach we settled upon for implementing a DPRNG in Cilk Plus. We decided to implement the DotMix DPRNG based on two ideas: hashing and pedigrees.
Hashing enables DotMix to generate random numbers for strands in a Cilk Plus program. Consider a special case, where every strand only needs to generate a single random number. One can conceptually break a strand into multiple strands if more than one random number is required. Instead of creating a new stream RNG for each strand, one can instead generate a random number by applying a "good" hash function to the strand. At the extreme, one could imagine using a cryptographic hash function, like a variant of SHA or md5sum to generate a number that looks sufficiently random. Alternatively, using a simpler hash function might produce numbers that are "less random", but might be more efficient to compute. But what does it mean to hash a “strand”? That is where pedigrees come in.
Pedigrees provide the key to guaranteeing deterministic execution for DotMix. Suppose that the Cilk Plus runtime provided support for figuring out the pedigree of a currently executing strand — a unique name for that strand that remains the same between different runs of the program. Then, one can generate a random number by hashing the pedigree of strand. Conceptually, the idea of pedigrees seems fairly simple and intuitive. The hard part is, of course, figuring out exactly what the pedigrees should be... more on this topic in a bit.
Once we had a way to assign pedigrees to strands, hashing pedigrees gave us a way to implement an efficient DPRNG for Cilk Plus. Our DPRNG is called DotMix because it hashes a pedigree by computing a dot product of the pedigree with entries from a small table of pre-generated random numbers, and then mixes the result to output a random number. In our paper, we show that the hash function satisfies some nice mathematical properties, namely that the probability of getting collisions is reasonably small.
We also ran some tests on DotMix's hash function using the Dieharder test suite, and observed that DotMix seems to generate random numbers of reasonably high quality. If your application has stringent requirements on random-number quality, then DotMix may or may not be appropriate. But DotMix's hash function is intended to be more than good enough for codes where you simply need output that "looks" random.
If you don't want to take our word for it that DPRNG via hashing is a viable approach, I would encourage you to look at this recent paper on parallel random-number generation, which won the Best Paper Award at Supercomputing 2011. This work, which was done independently from our research at about the same time, discusses ways to generate random numbers by hashing counter values. This approach is conceptually equivalently to hashing a pedigree which is the loop index in a parallel loop. Their paper focused more on the question of picking good hash functions, as opposed to describing a way to maintain pedigrees in programs with complex parallelism structure.
There is more that one might say about hashing for a DPRNG, but I will refer the interested reader to the relevant papers for details. Instead, I'll spend the remainder of this post talking about pedigrees, since in my (obviously biased) opinion, pedigrees are the more exciting concept for Cilk Plus. :)
Pedigrees in Cilk Plus?
For programs whose parallelism has a nice, predictable structure,
there are often obvious ways to conceptually assign pedigrees to
strands. For example, in a simple cilk_for
loop where
each iteration requires a single random number, one can use the loop
index as the pedigree. In a slightly more complicated case, if
iteration i
requires ni
random
numbers, the pedigree could be a pair of numbers — the loop
index i
, plus the count of how random numbers we've taken
from iteration i
thus far. This latter case conceptually
requires two terms in the pedigree instead of one, because in a
parallel execution we need to be able to figure out the pedigree of
each iteration i
independently, without knowing exactly
how many random numbers will be needed in other iterations.
For a Cilk Plus program with a more complicated parallelism
structure (e.g., rand_fib
), how can we define pedigrees?
Recall the sample rand_fib
program using DotMix that I
showed last time (also repeated in the figure below). Notice that the
code above makes no mention of pedigrees anywhere, which means that
Cilk Plus and the DotMix library are secretly working to maintain and
utilize pedigrees under the covers.
How does it all work? Since this post is long enough already, I will actually postpone that discussion until next time. Instead, I'll conclude with an open-ended puzzle.
The figure below shows a dag (directed acyclic graph) representing
the calls to the DotMix DPRNG in an execution of
rand_fib(4)
. Each letter conceptually represents a call
to generate a random number. For example, A
is the first
call to get()
in rand_fib(4)
before the
cilk_spawn
of rand_fib(3)
, and
K
is the call to get()
immediately after the
cilk_spawn
. Each call to get()
needs to read
in a unique but deterministic pedigree. How might we assign pedigree
values to each letter?
![]() | int rand_fib(int n) { if (n < 2) { sum += global_rng.get() * n; return n; } else { int x, y; sum += global_rng.get() * n; x = cilk_spawn rand_fib(n-1); sum += global_rng.get(); y = rand_fib(n-2); cilk_sync; sum += global_rng.get() * n; return x+y; } } |
In general, many answers are possible. Next time, I'll describe the
answer we ended up with. Keep in mind, however, that a good scheme
for assigning pedigrees should not introduce additional dependencies
into the dag below. For example, the pedigree value for
K
, should not depend on how many calls to
get()
happen within rand_fib(3)
. For those
of you who want the answers now, you can always check our paper or
other documentation. :)
Summary
The DotMix deterministic parallel random-number generator (DPRNG) avoids the performance problems and nondeterminism that you can run into when using a traditional serial stream-based random-number generator in Cilk Plus. DotMix generates deterministic random numbers by hashing pedigrees, names for strands in the execution of a Cilk Plus program. Stay tuned for my next post, where I will talk more about pedigrees and how you can use them in Cilk Plus!
For more information about Intel Cilk Plus, see the website http://cilkplus.org. For questions and discussions about Intel Cilk Plus, see the forum http://software.intel.com/en-us/forums/intel-cilk-plus.