[Beowulf] OT: public random numbers?

Thu Aug 25 05:11:07 PDT 2011

On Thu, 25 Aug 2011, Vincent Diepeveen wrote:

> I noticed that most generated semi-random numbers with software generators,
> had the habit to truely adress a search space of n always in O (n log n).
>
> So if you draw from most software RNG's a number and do it modulo n,
> with n being not too tiny, say quite some millions or even billions, then 
> every
> slot in your 'hashtable' will get hit at least once by the RNG, whereas data
> in reality simply happens to not have that habit simply.
>
> So true random numbers versus generated noise is in this manner easy
> to distinguish by this. Now i didn't study literature whether some other chap
> some long time ago already had invented this. That would be most interesting
> to know.

Some other chap named George Marsaglia (and to some extent another chap
named Donald Knuth) have already invented this.  A number of tests of
the tails of random number generators are already in dieharder.  All
"good" modern rngs pass these tests.

The Martingale betting system you are looking at is even older (at least
Marsaglia and Knuth are still alive).  It dates back to the 18th
century, and is well known to be flawed for a variety of reasons, not
the least of which is that gamblers don't have the infinite wealth
necessary to make this >>even<< a zero-sum strategy and casinos have
betting limits that de facto make it impossible to pursue the requisite
number of steps and in roulette in particular have 0 and/or 00 slots and
aren't zero-sum to begin with.  You can read a decent analysis of
outcomes based on the presumed binomial distribution of a zero-sum game
here:

   http://en.wikipedia.org/wiki/Martingale_%28betting_system%29

Your test below is interesting, though.  The only real problems I can
see with actually using it in dieharder are:

   a) One would need a theoretical estimate of the distribution of
filling given n log n draws on an n-slotted table (for largish n).  That
is, for a perfect rng, what SHOULD the distribution of success/failure
be.

   b) One would then need the CDF for this distribution, to be able to
turn the results of N trials (of n log n pulls each) into a p-value
under the null hypothesis -- the probability of obtaining the particular
number of successes and failures presuming a perfectly random generator.

That way dieharder could apply it rigorously to its 70 or 80 embedded
rngs or to any user's outboard generator.  There probably is theoretical
statistical support for the PD and/or CDF -- you're analyzing the tails
of a poissonian process -- but finding it or doing it yourself (or
myself), aye, that's the rub.  One cannot just say "high degree of
certainty that it is an RNG" (by which one means that the rng in
question fails the test for randomness) in the test.  HOW high?  Perfect
rngs or perfectly random processes will sometimes fill your table, but
how often?  How can you differentiate an "accident" when one does from
an actual failure?  All of those questions require a more rigorous
theory and quantitative result embedded in a test that can be
systematically cranked up to more clearly resolve failures until they
are unambiguous, not marginal maybe yes maybe no.

I suspect that the failures this test would reveal are already more than
covered in dieharder, in particular by the bit distribution tests and
the monkey tests, but I'm not terribly happy with the monkey tests and
would be perfectly thrilled to have a simpler to compute test that
revealed precisely this sort of flaw, systematically.  And it doesn't
hurt at all to have partially or fully redundant tests as long as the
test themselves are rigorously valid.  If you can find or compute the
CDF for your test below, I'd be happy to wrap it up and add it to
dieharder, in other words.  One can always SIMULATE a CDF, of course,
but that requires a known good generator and sort of begs the question
if you don't think that e.g. AES or threefish or KISS are good
generators that would actually pass your test.

Even hardware/quantum sources of random bits are suspect -- they often
are generated by a process that leaves in the traces of an underlying
distribution.  I'm not convinced that >>any<< process in the real world
is >>truly<< random.  Physics is ambiguous on the issue -- the quantum
description of a closed system is just as deterministic as the classical
one, and Master equation unpredictability on open subsets of a large
closed system reflects entropy/ignorance, not actual randomness (hence
Einstein's famous "doesn't play dice" remark).  But lots of this are
sufficiently random that one cannot detect any failure of randomness,
modern crypto class generators being a prime example.

    rgb

>
> In semi pseudo code, let's take an array of size a billion as an example,
> though usually a few million is more than ok:
>
> n = 2^30; // 2 to the power 30
>
> Function TestNumbersForRandomness(RNG,n) {
> declare array hashtable[size n];
>
> guessednlogn = 2 * (log n / log 2) * n;
>
> for( i = 0 ; i < n ; i++ )
>   hashtable[i] = FALSE;
>
> ndraws = filledn = 0;
> while( ndraws  < guessednlogn ) {
>    randomnumber = RNG();
>    r = randomnumber % n; //     randomnumber =  r  (mod n)
>    if( hashtable[r] == FALSE ) {
>       hashtable[r] = TRUE;
>       filledn++;
>       if( filledn >= n )
>         break;
>
>   }
>   ndraws++;
> }
>
> if( filledn >= n )
>    print "With high degree of certainty data generated by a RNG\n");
>  else
>    print "Not so sure it's a RNG\n";
>
> }
>
>
>
>
>
> Regards,
> Vincent
>
>
>
>
>> -- both unpredictable and
>> flat/decorrelated at all orders, and even though there aren't really
>> enough of them for my purposes, I've used them as one of the (small)
>> "gold standard" sources for testing dieharder even as I test them.  For
>> all practical purposes threefish or aes are truly random as well and
>> they are a lot faster and easier to use as gold standard generators,
>> though.
>> 
>> I don't quite understand why the single site restriction is important --
>> this site has been up for years and I don't expect it to go away soon;
>> it is quite reliable.  I don't think there is anything secret about how
>> the numbers are generated, and I'll certify that the numbers it produces
>> don't make dieharder unhappy.  So 1 is fixable with a bit of effort on
>> your part; 6 I don't really understand but the guy who runs the site is
>> clearly willing to construct a custom feed for cash customers, if there
>> is enough value in whatever it is you are trying to do to pay for
>> access.  If it's just a lottery, well, lord, I can think of a dozen ways
>> to make numbers so random that they'd be unimpeachable for any sort of
>> lottery, both unpredictable and uncorrelated, and they don't any of them
>> require any significant amount of entropy to get started.
>> 
>> I will add one warning -- "randomness" is a rather stringent
>> mathematical criterion, and is generally tested against the null
>> hypothesis.  Amateurs who want to make random number generators out of
>> supposedly "random" data streams or fancy algorithms almost invariably
>> fail, sometimes spectacularly so.  There are a half dozen or more
>> really, really good pseudorandom number generators out there and it is
>> easy to hotwire them together into an xor-based high entropy stream that
>> basically never repeats (feeding it a bit of real entropy now and then
>> as it operates).  I would strongly counsel you against trying to take
>> e.g. weather data and make something "random" out of it.  Unless you
>> really know what you are doing, you will probably make something that
>> isn't at all random and may not even be unpredictable.  Even most
>> sources of "quantum" randomness (which is at least possibly "truly
>> random", although I doubt it) aren't flat, so that they carry the
>> signature of their generation process unless/until you manage to
>> transform them into something flat (difficult unless you KNOW the
>> distribution they are producing).  Pseudorandom number generators have
>> the serious advantage of being amenable to at least some theoretical
>> analysis (so you can "guarantee" flatness out to some high
>> dimensionality, say) as well as empirical testing with e.g. dieharder.
>> 
>> HTH,
>>
>>     rgb
>> 
>>> 
>>> Thanks,
>>> 
>>> David Mathog
>>> mathog at caltech.edu
>>> Manager, Sequence Analysis Facility, Biology Division, Caltech
>>> 
>> 
>> Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
>> Duke University Dept. of Physics, Box 90305
>> Durham, N.C. 27708-0305
>> Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu
>> 
>> 
>> _______________________________________________
>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
>> To change your subscription (digest mode or unsubscribe) visit 
>> http://www.beowulf.org/mailman/listinfo/beowulf

Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu