[Beowulf] Mersenne primes?

Sun Jun 15 04:05:56 PDT 2008

hi Mark,

Your question is not so much about lucaslehmer, but rather about FFT  
and DWT.

No doubt some attempts have been made to parallellize FFT already to  
multiply big numbers.
DWT, as used by GIMPS, is also taking implicit modulo, basically  
speeding up a factor 2 nearly.

That's what you lose when using some generic approach to multiply.

Additionally the DWT from GIMPS has been written in SSE2 assembler,  
so it only works at x86
hardware that has some sort of SSE2 support. This is very bloody fast  
code optimized for intel P4 at the time.

You won't be able to approach that speed writing C code.

There is some code that parallellizes all this, yet it would prove  
more numbers to be composite
when running it embarrassingly parallel.

Not because it can't be done, but because that Woltman assembler is  
so bloody fast.

In itself such FFT's can be done a lot faster in integers, just on  
paper. However all the cpu's
seem to be faster for floating point.

AMD is by far fastest right now for integer multiplication (oldie K8  
beating core2 amazingly).
See GMP mailing list for explanation.

Intel seems to be faster for running the Woltman code. See benchmarks  
at mersenneforum.org

Those x86/x64 cpu's are rather slow for multiplication.

To be honest, finding a mersenne is not so interesting now, already  
too many join GIMPS.
A single test single core is like a month with odds like 1 in a 100k  
it is a prime.

Personally i'm trying to find some numbers p = (2^n + 1) / 3
busy around 1 million bits now.

Note those won't win you the $100k, for 2 reasons. First of all they  
can't get proven prime easily,
as they are "industry grade primes", secondly you need that army of 30 
+ Tflop or so that GIMPS
is using.

For future it would make sense IMHO to parallellize DWT within 1  
socket. That allows better
cache reusage. Bandwidth from RAM to the caches already is a problem.  
Let's not start discussing
bandwidth that the highend network cards can deliver :)

There is some C codes there that do this (note i have not managed to  
download it) and there is
rumours Woltman also parallellized his own code one day, but most of  
those codes are indeed only
well usable within 1 socket, already losing quite some throughput there.

If your intention is to just do a single FFT as fast as possible, i'm  
sure there must be something out there.
Would be fun to optimize code like that, yet unpaid without cluster  
access it is hard to get it done :)

Right now the quickest way to find a real big 100% prime is to join  
primegrids search for 321
(p = k * 2^n - 1 where k = 3).

They recently found another one.

It also can use the same transform. For k's above 31, the assembler  
code runs a lot slower,
so you lose big throughput already.

Because you've got to test that many numbers to find a single prime,  
nearly all the dudes in that
prime world, when at home, are busy with throughput anyway and not so  
much highend clusters.

Highend clusters in that domain get used for other purposes, those  
busy with that
will shut up and not post here data that is interesting to read. For  
me it is easy to shout something
there as i'm currently unemployed looking for a job (a job abroad  
outside Netherlands would be cool
too, feel free to ask my CV/resume) and in meantime am busy with an  
innocent
computer chess engine. Where GUI's sell real well in computer games,  
engines are not commercial at all,
they just search for the holy grail massively parallel, thereby  
combining hundreds of algorithms/enhancements.

Vincent

On May 1, 2008, at 10:10 PM, Mark Reynolds wrote:

> I'm fairly new to Beowulfery but am reading TFM from
> http://www.phy.duke.edu/~rgb/Beowulf/beowulf_book.php (great book by
> the way). Does anyone know of any programs to find Mersenne primes
> that are suited to parallelised environments such as a cluster made up
> of nodes with multi-core processors? I've found mprimes from
> http://www.mersenne.org/ but the FAQ states:
> Although, a program could be written for dual-CPU systems (it would be
> quite time-consuming), the machine will still get more throughput by
> working on separate exponents.
> Has anybody tried this and can Lucas-Lehmer testing be effectively
> parallalised? Currently, you have to run an instance for each core for
> it to get any benefit. Also I'd like to modify it so it could, for
> example, search between certain numerical ranges independent of the
> work units sent out through the website or to take advantage of 64-bit
> hardware. Can someone more knowledgeable about this sort of thing tell
> me  whether this is practical or if it has been done already?
>
> Thanks.
>
> -- 
> "To win one hundred victories in one hundred battles is not the
> highest skill. To subdue the enemy without fighting is the highest
> skill."— Sun-Tzu
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit  
> http://www.beowulf.org/mailman/listinfo/beowulf
>