[Beowulf] Seg Fault with pvm_upkstr() and Linux.

Vincent Diepeveen diep at xs4all.nl
Wed Mar 16 13:07:51 PST 2005

At 01:25 PM 3/16/2005 -0700, Josh Zamor wrote:
>On Mar 16, 2005, at 10:37 AM, Vincent Diepeveen wrote:
>> Did you configure GMP correctly?
>> For math with big numbers it does not use FFT multiplication by
>> default but much slower methods. You might want to recompile it with
>> FFT enabled in case you didn't do this yet.
>I actually haven't done this, though I'll certainly try that soon.
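In case it helps, enabling it when building GMP from source looks roughly like this. The `--enable-fft` configure switch is the relevant bit; the install prefix is just an example path:

```shell
# Build GMP with FFT multiplication for huge operands enabled.
# --enable-fft was off by default in GMP 4.x; the prefix is an example.
./configure --enable-fft --prefix=$HOME/gmp
make
make check      # run GMP's own self-tests before trusting the build
make install
```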

To quote a friend of mine: "Good programmers do not blink an eye at
speeding up scientific software by a factor of a million".

>> In general in parallel programming you get the worst performance when
>> all processes must report to 1 central process.
>> It's far more efficient when each process is equal and the work is
>> divided among peers.
>> A simple calculation example from a problem I had at a 512 processor
>> SGI is that each 'hub' can handle at most 680MB of data per second
>> (for 4 processors in total, yes).
>> However, if 499 other processors start reading/writing from/to this
>> 'hub', then real disasters will happen.
>> Things will completely lock up. Not only because all processors must
>> share the small bandwidth, but also because you will get switch
>> latency overhead problems in the routers and switches.
>> If they first must stream a few bytes of data from A to B and then
>> suddenly from C to D, that's far less efficient than when 1
>> switch/router must stream only from A to B.
>> Switches and routers sometimes have their own cache which is simply
>> optimized for those streaming benchmark tests. Switch latency can
>> cause serious problems if all processors want to use the same
>> communication resources.
>> The general rule is to keep the routers/switches as idle as possible
>> and to make your software as embarrassingly parallel as possible.
>This is exactly the sort of thing that I will be looking for shortly.
>Do you have any recommendations on either books or online texts that
>cover this sort of thing (best practices when programming for
>clusters)? I'm currently just experimenting, but this is a field that
>I think I want to get involved in.

Paranoia, sir, is the only thing I can advise you. Never believe any
datapoint a manufacturer gives you until you can verify it yourself.

SGI, for example, claimed to me that a random lookup at a remote
processor on a 512p Origin3800 partition would cost me no more than 460
nanoseconds to get 8-128 bytes.

Of course I benchmarked it when the opportunity was there, at 460
processors, and on average a read of 8 bytes took 5.8 us. That was with
something like 100MB of RAM per processor; for every new 8-byte read,
each processor did a lookup at a random memory location of a random
processor.

SGI is no exception.

Manufacturers in the high-end market have the problem that competitors
quote such unrealistic numbers that the only way they can sell their
stuff is by making an even more incredible claim.

I had a cluster guy from another very large blue company swear to me
that the one-way pingpong latency of the just-built 20xx processor
machine was under 5 us, using the best-selling high-end network card
on the planet in supercomputers.

So I asked a friend to run that pingpong on just a 128-node partition,
and it was 8 us there. Let alone at 1000+ nodes.

When I later confronted that person with it, the answer was: "well, I
hope you realize that what you measure includes the MPI overhead, which
can be significant; the numbers I quoted were measured without that
stupid overhead".

But when you actually write software, you *do* need to count that
'stupid overhead' to get true numbers.

>> Which compiler do you compile with?
>> I hope gcc only, and not intel c++?
>> intel c++ is notorious with floating point in order to get faster at
>> benchmarks.
>> Are you busy with floating point or with integers?
>I am currently using gcc's C compiler, ver 3.3.x, and doing mostly
>integer calculations currently.

gcc should have no bugs there, except for PGO.

>> Are you using PGO with gcc? (PGO = profile guided optimizations)
>> There are major bugs even in the latest gcc 3.4.3 PGO.
>> Those guys are all volunteers and very cool guys.
>> Very slow in bugfixing as they have other jobs too, and I don't blame
>> them.
>Actually, I haven't, but profilers are one of the things that I want
>to get more familiar with... Thank you for the suggestions; I really
>appreciate them, having done only limited parallel programming on this
>scale.

I'm not referring to profilers but to, for example, first compiling with:

# gcc 3.3.3 (SuSE) in the case of x86-64:
  CFLAGS  = -O3 -fprofile-arcs -march=k8 -mcpu=k8 

Then run your program on a single CPU for a while, quit it, and remove
all the object files.

Recompile then with:
  CFLAGS  = -O3 -fbranch-probabilities -march=k8 -mcpu=k8 

# note that gcc 3.4.x the 'mcpu' has been renamed to 'mtune'

Otherwise, by default, use something like this, with the right processor
name for the CPU you use:
  CFLAGS = -O2 -mcpu=athlon-xp -march=athlon-xp 
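Putting the two passes together, the whole PGO cycle looks something like this; the program and source file names are placeholders:

```shell
# Pass 1: build with instrumentation that records branch counts.
gcc -O3 -fprofile-arcs -march=k8 -mcpu=k8 -o myprog myprog.c

# Run a representative single-CPU workload; this writes the profile
# data files next to the objects.
./myprog

# Throw away the instrumented objects, then rebuild using the profile.
rm -f *.o
gcc -O3 -fbranch-probabilities -march=k8 -mcpu=k8 -o myprog myprog.c
```

The quality of the result depends entirely on how representative that training run is of your real workload.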

Take care to optimize GMP for the processor in question; it makes a
big difference.

>-J Zamor
>jzamor at gmail.com
