Random Number Generation. Was Re: Beowulf Questions

Robert G. Brown rgb at phy.duke.edu
Mon Jan 6 07:15:37 PST 2003


On Mon, 6 Jan 2003, Florent Calvayrac wrote:

> > Taking this method a step further still, MPI latency might even be
> > able to be used (delta time between the compute nodes and the head) 
> 
> this is certainly much simpler to implement than downloading sprng,
> (on http://sprng.cs.fsu.edu/ )
> or just implementing a node-number dependent recurrence formula on each 
> machine, as RGB is also doing. Measuring an MPI latency time
> (in order of several thousands cycles), communicating back and forth,
> and running the same calculation on another cluster to predict the 
> results as you suggest is also certainly faster and more elegant to 
> achieve independence of the streams.

Sure, but a) this sort of thing is very dangerous and b) why bother?
/dev/random ALREADY does a moderately sophisticated (and probably much
better thought through than anything you'd roll yourself) job of doing
exactly this sort of thing.

If the issue is one of "random" (by which I mean "uncorrelated by all
measures of correlation") vs "uniform deviate generator" (a term that
avoids the oxymoronic descriptor "random number generator":-) then it is
extremely difficult to come up with entropy from ANY source or sources
available on OTC systems on an OTC network sufficient to provide "truly
random" numbers fast enough for a random number-hungry application.
There are all kinds of subtle time correlations in nearly anything you
choose to use as an entropy source, short of a piece of hardware
specifically engineered for that purpose, and you have to wait for
several of the longest correlation times of each source in order for its
next output to be "random" relative to the previous one.  By definition,
by the way.

The problem is: what are these correlation times?  In many cases they
turn out to be of order milliseconds to seconds for a full 32 bit
return, not nanoseconds to microseconds.  Or if you prefer, of order
100 microseconds per "random" bit.  And there's the rub.

If a pair of systems were executing some fairly deterministic MPI code
and passing a systematic pattern of same-size messages, you would have
to do a lengthy and nontrivial computation just to MEASURE the
correlation time, and the correlation time for a nominally periodic
process could be VERY LONG.  If you fail to do this measurement and just
"assume" that you know when the times are adequately decorrelated, you
could end up with very significant correlation in your "random" number
stream.  Worse, since your measurement will be heavily state dependent,
you have to ensure that a state with a longer correlation time than
whatever you measure can "never" occur, which is all but impossible on a
system running on a fixed clock.  Basically, you should read "man 4
random" (which uses a variety of sources of entropy, not just one)
before deciding that you can do better, and you should recognize that
/dev/random is really only suitable for generating seeds or other
infrequently required "really random" numbers.

Or, to argue the other way: if you INSIST on using e.g. /dev/random as
a source of a stream of "really random" random numbers, accepting that
it will block until it accumulates enough entropy (by its internal
measures) to return the next one, AND accepting that each number in a
stream generated by abusing it this way is almost by definition "barely
random enough" according to the standards of "barely" implemented by
the designers (and may or may not be random "enough" for your
application), then this is indeed a suitable sort of thing to implement
on a cluster basis!
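
For concreteness, "using /dev/random as a stream" on a node takes
nothing more than something like the following (a minimal sketch with
an arbitrary loop count, not production code); run it and watch the
reads stall as the kernel's entropy pool drains:

 /* sketch: pull a blocking stream of 32 bit uints out of /dev/random */
 #include <stdio.h>

 int main()
 {
  FILE *fp;
  unsigned int r;
  int i;

  if ((fp = fopen("/dev/random","r")) == NULL) return 1;
  for (i = 0; i < 100; i++) {
    /* each fread blocks until the kernel thinks it has enough entropy */
    if (fread(&r,sizeof(r),1,fp) != 1) break;
    printf("%u\n",r);
  }
  fclose(fp);
  return 0;
 }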

Inserting /dev/random reads into my cpu_rate timing harness, I measure
a /dev/random return time on the order of 5 milliseconds per 32 bit
number.  This is a very believable number, suggesting correlation times
averaging maybe 1 millisecond in the entropy sources used (~100
microsecond bit correlation times).  In order to saturate even 100BT
(call it two and a half million 32 bit random numbers per second) one
would need, lessee, 200 per second per node into 2.5x10^6, um, order of
10^4 nodes, PRESUMING that entropy accumulates as rapidly on nodes with
no keyboard, mouse, or significant multitasking as it does on my "noisy"
desktop where I just ran the benchmark.
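
If you want to reproduce that sort of measurement without my cpu_rate
harness, a quick and dirty gettimeofday() loop like the following
sketch (sample count chosen arbitrarily) will put you in the right
ballpark:

 /* rough timing of /dev/random returns; just a sketch, not cpu_rate */
 #include <stdio.h>
 #include <sys/time.h>

 int main()
 {
  FILE *fp = fopen("/dev/random","r");
  struct timeval t0, t1;
  unsigned int r;
  int i, n = 20;   /* arbitrary number of samples */
  double secs;

  if (fp == NULL) return 1;
  gettimeofday(&t0,NULL);
  for (i = 0; i < n; i++)
    if (fread(&r,sizeof(r),1,fp) != 1) return 1;
  gettimeofday(&t1,NULL);
  fclose(fp);
  secs = (t1.tv_sec - t0.tv_sec) + 1.0e-6*(t1.tv_usec - t0.tv_usec);
  printf("%g ms per 32 bit read\n", 1000.0*secs/n);
  return 0;
 }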

Somehow I think that buying a hardware RNG device based on e.g.
radioactivity or thermal noise would be cheaper and would probably work
better as well.

Now, if you only wish to generate "a few" really random (heh!;-) numbers
to use as seeds for UDG's so that they can produce distinct sequences of
uniform deviates (accepting whatever degree of high-dimensional
correlation is associated with the method, as usual), you simply won't
"build" an entropy-based source of the random seed that is any MORE
suitable than /dev/random, and using /dev/random is trivial (with
FILE *fp and unsigned int seed declared and a gsl_rng *random already
allocated):

 /* read a "really random" seed from /dev/random, fall back to 0 */
 if ((fp = fopen("/dev/random","r")) == NULL) {
   if(verbose == 10) printf("Cannot open /dev/random, setting seed to 0\n");
   seed = 0;
 } else {
   if(fread(&seed,sizeof(seed),1,fp) != 1) seed = 0;
   fclose(fp);
   if(verbose == 10) printf("Got seed %u from /dev/random\n",seed);
 }
 gsl_rng_set(random,seed);

Hard to get any simpler.  Then, as noted before, you can either wear
your pants with belt AND suspenders and keep track of all the seeds
used, to ensure that the gods of randomness don't whack you via the
1/2^32 per-draw chance of repeating any given unsigned long int, which
(birthday-paradox style) adds up to a near-certain collision somewhere
if you plan to run say a million sequences, btw.  Or you can accept the
possible overlap and ignore it (reasonable for a few hundred or even a
few thousand runs), or if you are
REALLY going to use a LOT of runs (order 10^9), you can create a list of
all the ulong ints and shuffle it with a good shuffle algorithm for
starters.  
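
The "good shuffle algorithm" in question is essentially Fisher-Yates.
A minimal sketch, shown here for a table of seeds of whatever size you
actually use (the function and variable names are just illustrative)
and driven by a GSL generator you would seed separately:

 /* Fisher-Yates shuffle of a table of seeds (illustrative sketch) */
 #include <stdlib.h>
 #include <gsl/gsl_rng.h>

 void shuffle_seeds(unsigned long *seeds, size_t nseeds, gsl_rng *r)
 {
  size_t i, j;
  unsigned long tmp;

  for (i = nseeds - 1; i > 0; i--) {
    j = gsl_rng_uniform_int(r, i + 1);   /* uniform on 0..i */
    tmp = seeds[i]; seeds[i] = seeds[j]; seeds[j] = tmp;
  }
 }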

However, few UDG's will likely turn out to be adequately decorrelated if
you start saturating the space of potential seeds -- iterated maps
aren't really random and if you literally sample the space of starting
points densely, you will almost certainly get significant but very
difficult to detect correlations in the generated streams.  So once
again, your error estimates (based on independence) will be incorrect,
and other problems can arise.

As always, if you are planning to do anything with very large numbers of
UDG's or random numbers or the like, you are well advised to do some
moderately significant research on UDG's and randomness before
implementing anything at all.  For some applications (e.g. writing a
game:-) it won't matter -- "any" UDG will suffice, as long as it yields
independent play experiences.  For others (doing a long running
simulation for the purposes of publication) it is essential -- there are
famous cases of results obtained at great expense and published in the
most respected journals only to discover (to the embarrassment of the
authors, the referees, everybody) that they were based on a "bad" UDG
(relative to the purpose) and were, alas, cosmic debris, incorrect,
totally erroneous, suitable for the trash can.

Truth be told, I'd say that most of us who actually work in this game
have this as one of our secret nightmares.  The oxymoron is a real one.
There are no RNG's, only UDG's, and it is very, very difficult to
predict whether the correlations that you KNOW are present in a UDG will
significantly affect your answers at any given level of precision,
provided that you avoid the UDGs with known and obvious problems.  Real
Computer Scientists (and theoretical physicists, mathematicians,
statisticians) spend careers studying this problem.  There is almost a
heisenbergian feeling to it -- if you only generate a few UD's, you can
easily enough get adequately decorrelated ones but your statistical
accuracy (in a simulation) will suck.  The more UD's you need, the
greater your statistical accuracy (presuming independence) but the
greater the contamination of those results with the generally occult
correlations.  

It Is Not Easy to know when one will hit the optimum -- the best results
one can obtain that have a reliable estimate of statistical accuracy.
Indeed, one way of determining the PRESENCE of the correlations is to
look for statistically significant deviations from outcomes
theoretically predicted for model problems for perfectly random numbers
(e.g. the mean length of a random walk in N dimensions).  When the
outcome isn't known a priori, and one is using a UDG that passes most of
the "simple" tests for correlation satisfactorily, one is basically
engaged in a crap shoot...maybe your problem is the one that
(eventually) will reveal a Weakness in your particular UDG, or UDGs.
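
To make the random walk example concrete: for an unbiased one
dimensional walk of n unit steps the mean squared displacement is
exactly n, so even a throwaway test like the following sketch (GSL's
mersenne twister, with walk and step counts picked arbitrarily) hands
you a number whose deviation from n you can check for statistical
significance:

 /* crude check of <x^2> = n for a 1d +/-1 random walk (sketch only) */
 #include <stdio.h>
 #include <gsl/gsl_rng.h>

 int main()
 {
  gsl_rng *r = gsl_rng_alloc(gsl_rng_mt19937);
  int nwalks = 100000, nsteps = 1000;   /* arbitrary */
  int i, j;
  double x, sum2 = 0.0;

  for (i = 0; i < nwalks; i++) {
    x = 0.0;
    for (j = 0; j < nsteps; j++)
      x += (gsl_rng_uniform(r) < 0.5) ? -1.0 : 1.0;
    sum2 += x*x;
  }
  printf("<x^2> = %g (expect %d)\n", sum2/nwalks, nsteps);
  gsl_rng_free(r);
  return 0;
 }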

   rgb

Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu