[Beowulf] Parallel memory

Tue Oct 18 21:11:57 PDT 2005

At 06:39 PM 10/18/2005 -0400, Mark Hahn wrote:
>> memory, if you do a memory access like that over openssi it will forward
>> in a very slow manner like 2048 bytes. 

>of course the correct answer is page size, since that's the granularity

Which is not so trivial to change from 2 KB to 64 bytes.

>at which a net-shm can hook app references.  (actually, the segv handler
>*could* trap each faulting instruction and deliver the data in smaller
>pieces.  it could even try to figure out when patching in a whole page 
>would be more effective.  some of this sort of thing is done in papers
>that involve "pointer swizzling" - various amounts of dynamic code patching,
>as well.  but using pages means that further references have zero overhead,
>which is always nice.)
>
>> This is ok if you are streaming in some sort of way.
>
>or at least high locality within the pages you do touch.
>
>> Anyway you'll need to dramatically rewrite the application for it.

>not at all.  the whole point of mmu-based net-shm is to present the app
>with the illusion that the pages are just *there*.  you'll probably want 
>to change your allocator so that you call netshm_alloc for a few big
>chunks of distributed memory, and leave most other allocs local.
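
The granularity and locality points in the quoted exchange can be put in
rough numbers. What follows is only a toy model, not a measurement of
OpenSSI or any real net-shm: it assumes the first touch of a granule (a
2048-byte unit or a 64-byte line) ships the whole granule across once and
that later references to it are free. The function names and the two
access patterns are made up for illustration.

```python
def bytes_moved(addresses, granule):
    """Bytes shipped if every first touch of a granule pulls the whole
    granule across the network exactly once (assumed model)."""
    touched = {addr // granule for addr in addresses}
    return len(touched) * granule


def overhead(addresses, granule, word=8):
    """Bytes moved per byte the application actually reads.

    Assumes each address is the start of one aligned `word`-byte read.
    """
    return bytes_moved(addresses, granule) / (len(set(addresses)) * word)


# Streaming: every 8-byte word of a 64 KiB region, in order.
streaming = range(0, 64 * 1024, 8)
# Sparse: one 8-byte word out of every 2048 bytes of a 16 MiB region.
sparse = range(0, 16 * 1024 * 1024, 2048)

if __name__ == "__main__":
    for name, pattern in (("streaming", streaming), ("sparse", sparse)):
        for granule in (2048, 64):
            print(name, granule, overhead(pattern, granule))
```

Under this model the streaming pattern moves exactly the bytes it uses at
either granularity, while the sparse pattern moves 256 bytes per useful
byte at 2048 bytes and 8 bytes per useful byte at 64 bytes. Per-transfer
latency is ignored, which is the other half of the cost.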

Actually I meant to say that you would need to dramatically rewrite
OpenSSI/OpenMosix to get a speedup > 1.0 out of it for software of
Todd's type.

An SGI machine with 64 Itanium2 processors at 1.6 GHz costs about
$1 million.

Its one-way pingpong latency is 3-4 us.

Suppose you could replace that with a 16-node cluster of dual-Opteron,
dual-core machines at 2.2 GHz, which is priced way under $60k, with
software such as pdsh being free.

Its one-way pingpong latency is also 3-4 us.

If that Opteron cluster doesn't deliver enough gflops compared to the
Itanium2 machine, you could take a 32-node dual-Opteron dual-core cluster
for around $125k.

That one definitely delivers more gflops.
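
The price comparison can be worked out per core. This is just arithmetic
on the figures quoted in this post; it assumes one core per Itanium2
processor and treats "way under $60k" as an upper bound of $60k. The
dictionary keys and the function name are made up for illustration.

```python
# label: (total cores, price in USD), using the figures from this post.
# Assumption: each Itanium2 processor counts as one core.
systems = {
    "sgi_64p_itanium2": (64, 1_000_000),
    "opteron_16_nodes": (16 * 2 * 2, 60_000),   # 2 sockets x 2 cores per node
    "opteron_32_nodes": (32 * 2 * 2, 125_000),
}


def usd_per_core(label):
    """Price divided by core count for one of the systems above."""
    cores, price = systems[label]
    return price / cores


if __name__ == "__main__":
    for label in systems:
        print(label, systems[label][0], "cores", usd_per_core(label), "USD/core")
```

With these inputs the SGI comes to $15,625 per core against $937.50 and
about $977 per core for the two Opteron clusters, roughly a factor of 16.
This says nothing about gflops per core, which differ between the two
architectures.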

So the real big difference is that in this case the example SGI machine is
SSI and the pdsh cluster is not.

I'm not sure how many systems SGI would still sell if a good SSI
alternative existed.

If SSI in itself loses 'just' a factor of 2 in performance compared to
MPI, that would be very acceptable in such a case.

Just consider the price difference... and the ease of porting
applications to run in parallel the way they do on PCs, without the
overhead of all those nasty MPI calls.

SSI is no small thing to build. It definitely isn't something hobbyists
can manage to do very well.

I have written a simple program that measures what you could call
'two-way pingpong time over SSI'. Simply 2 times the one-way pingpong
latency for 8-byte messages is the number to compare it with.

If you can run that on some OpenSSI/OpenMosix clusters and then compare
with optimized Myri/Quadrics/Dolphin drivers built with specialized kernels
for those high-end clusters, then I gladly volunteer to cooperate with
those tests.

If this program of mine turns out to be no more than a factor of 2 slower
in such a test than the MPI equivalent (which is 2 times the one-way
pingpong latency for 8-byte messages), that would really kick some butt.
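
For anyone who wants to see the shape of such a measurement, here is a
minimal sketch of a two-way pingpong timer for 8-byte messages. It is not
the program mentioned above and it is not MPI: it bounces messages between
two threads over a local socket pair, purely as a stand-in for the
interconnect, and interpreter overhead will dominate the result. All names
in it are made up for illustration.

```python
import socket
import struct
import threading
import time

MSG = struct.Struct("<q")  # one 8-byte integer per message


def _recv_exact(sock, n):
    """Read exactly n bytes from a stream socket."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the socket")
        buf += chunk
    return buf


def _echo(sock, rounds):
    """Peer side: bounce each 8-byte message straight back."""
    for _ in range(rounds):
        sock.sendall(_recv_exact(sock, MSG.size))


def two_way_pingpong_us(rounds=10000):
    """Mean round-trip time in microseconds for 8-byte messages."""
    a, b = socket.socketpair()
    peer = threading.Thread(target=_echo, args=(b, rounds))
    peer.start()
    try:
        start = time.perf_counter()
        for i in range(rounds):
            a.sendall(MSG.pack(i))
            (reply,) = MSG.unpack(_recv_exact(a, MSG.size))
            if reply != i:
                raise RuntimeError("reply does not match request")
        elapsed = time.perf_counter() - start
    finally:
        a.close()      # unblocks the peer if the loop ended early
        peer.join()
        b.close()
    return elapsed / rounds * 1e6


if __name__ == "__main__":
    print("mean two-way pingpong: %.1f us" % two_way_pingpong_us())
```

With the 3-4 us one-way latency quoted earlier, the MPI baseline for this
kind of test would be 6-8 us per round trip, so staying within a factor
of 2 means 12-16 us.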

Right now it's more like a factor of 20, if high-end cards work with
OpenMosix/OpenSSI at all, since they require special kernels that do not
work with OpenMosix/OpenSSI.

See the problem?

Vincent