[Beowulf] Threaded code

Tue Aug 17 10:44:51 PDT 2004

On Tue, 17 Aug 2004, Joe Landman wrote:

> Mark Hahn wrote:
> 
> >>variables? (do you have an NCPU=1 or something like that hanging around?)
> >
> >afaikt, when threads are enabled, atlas compiles in the number of threads,
> >based on what it detects on the machine doing the compilation.  so, for
> >instance, if you happened to compile atlas on this machine with the uni
> >kernel, (or some other uni) you'd get no threads.  this is a bit
> >counterintuitive to anyone used to OMP_NUM_THREADS, but it certainly
> >makes sense for atlas.
> 
> 
> Ok, I haven't used atlas in a while.  Are you saying that it hardcodes 
> the number of processors into the code itself?  Wouldn't this 
> effectively render binary RPMs of Atlas completely useless?  Would also 
> make building static binaries (don't know if it is possible with Atlas 
> libs) a waste of time if you need a portable binary.

The whole point of ATLAS is that it is Automatically Tuned.  The
fundamental flaw in the tuning process is that it is entirely (AFAIK)
build-time tuning -- it literally does a search in a high-dimensional
parameter space for build-time parameters that yield optimal
performance, and then compiles them right into the library.  It
isn't designed to be portable at all -- quite the contrary.  It is
designed to be built on EACH system on which it is to be used, and if
you happen to have an "identical" system onto which you want to copy the
result, well, maybe it is identical and maybe it isn't; that's up to you.
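
To make "search the parameter space at build time" concrete, here is a
minimal sketch in Python.  It is NOT how ATLAS itself is implemented; the
names blocked_matmul and tune_block_size, the candidate tile sizes and the
single tuned parameter are all assumptions made for illustration.  It
times a tiled matrix multiply for a few tile sizes on the machine running
the script and reports the winner, which a build step would then freeze
into the compiled library.

```python
import time

def blocked_matmul(a, b, n, block):
    """Multiply two n x n matrices (lists of lists) using square tiles of size `block`."""
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, block):
        for kk in range(0, n, block):
            for jj in range(0, n, block):
                for i in range(ii, min(ii + block, n)):
                    row_a, row_c = a[i], c[i]
                    for k in range(kk, min(kk + block, n)):
                        aik, row_b = row_a[k], b[k]
                        for j in range(jj, min(jj + block, n)):
                            row_c[j] += aik * row_b[j]
    return c

def tune_block_size(n=48, candidates=(4, 8, 16, 48), repeats=2):
    """Time each candidate tile size on THIS machine and return (best, timings)."""
    a = [[float(i + j) for j in range(n)] for i in range(n)]
    b = [[float(i - j) for j in range(n)] for i in range(n)]
    timings = {}
    for block in candidates:
        best = float("inf")
        for _ in range(repeats):
            t0 = time.perf_counter()
            blocked_matmul(a, b, n, block)
            best = min(best, time.perf_counter() - t0)
        timings[block] = best  # keep the fastest of the repeats
    return min(timings, key=timings.get), timings
```

Calling tune_block_size() on two machines can easily return different
winners, which is the portability problem in miniature: the number that
gets baked in is only known to be good on the box that measured it.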

So sure, ATLAS gets packaged up, but it really needs to be built and
packaged on a per-system-archetype basis.  Even doing that sort of
voids the warranty (so to speak) that the installed library is truly
optimal on any system but the original RPM build system, since even small
differences can push one across the carefully adjusted superlinear
speedup/slowdown thresholds preset in the library.

I'd like to see ATLAS redesigned so that it Automatically Tunes at
runtime, not build time, so that it becomes moderately portable.  Not
badly enough to actually do the work myself, mind you ;-) but I think it is
possible at only a small cost in overall efficiency, and I even have an
idea how to go about doing it.  I've been thinking about proposing it as
a project for some of the upper level CPS students here -- there is an
independent study course where this sort of thing is tackled and this is
an ideal project for the course.
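
One possible shape for runtime tuning, offered only as a sketch and not
necessarily the scheme alluded to above: ship several equivalent kernels,
time them the first time a given problem-size class shows up, and cache
the winner for later calls.  Everything named here (RuntimeTuned,
dot_loop, dot_sum, the power-of-two size bucketing) is hypothetical and
exists only for illustration.

```python
import time

class RuntimeTuned:
    """Pick the fastest of several equivalent kernels the first time a size class is seen."""

    def __init__(self, kernels):
        self.kernels = dict(kernels)  # name -> callable(x, y)
        self.choice = {}              # size class -> name of the winning kernel

    def __call__(self, x, y):
        key = len(x).bit_length()     # crude size class: power-of-two bucket
        if key not in self.choice:
            self.choice[key] = self._tune(x, y)  # one-time cost per size class
        return self.kernels[self.choice[key]](x, y)

    def _tune(self, x, y, repeats=3):
        best_name, best_time = None, float("inf")
        for name, kernel in self.kernels.items():
            for _ in range(repeats):
                t0 = time.perf_counter()
                kernel(x, y)
                elapsed = time.perf_counter() - t0
                if elapsed < best_time:
                    best_name, best_time = name, elapsed
        return best_name

def dot_loop(x, y):
    """Dot product with an explicit accumulation loop."""
    total = 0.0
    for xi, yi in zip(x, y):
        total += xi * yi
    return total

def dot_sum(x, y):
    """Dot product using the built-in sum over a generator."""
    return sum(xi * yi for xi, yi in zip(x, y))

dot = RuntimeTuned({"loop": dot_loop, "sum": dot_sum})
```

The efficiency cost is the handful of extra timed calls per size class;
after that, dot(x, y) goes straight to the cached kernel.  The same
binary can then be copied to a different machine and will re-measure
there instead of trusting numbers from the build host.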

> I have had codes that spent very little time in the parallel sections in 
> the past.  Simply adding another processor/thread does not automagically 
> halve the run-time.  You would need to use some of the more advanced 
> query tools to see what is going on.

Hmmm, given that "top" or "ps" are more advanced query tools, I agree.
However, it shouldn't be horribly difficult.
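
Joe's point about codes that spend little time in their parallel
sections is Amdahl's law, and it is easy to put numbers on it.  The
helper below is a small illustrative sketch (the name amdahl_speedup is
made up here); it assumes the parallel part scales perfectly and ignores
all communication and thread overhead, so it gives an upper bound.

```python
def amdahl_speedup(parallel_fraction, ncpu):
    """Ideal speedup when only `parallel_fraction` of the run time scales with CPUs."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / ncpu)

# Speedup on a dual-CPU box for a few parallel fractions.
for p in (0.1, 0.5, 0.9, 1.0):
    print(f"parallel fraction {p:.1f}: speedup on 2 CPUs = {amdahl_speedup(p, 2):.2f}")
```

A code that is only 10% parallel gains about 5% from the second CPU, and
only a code that is parallel throughout gets the full factor of two.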

It isn't clear to me that ATLAS is multithreaded anyway.  Does anybody
know for sure?  It has been a while since I looked at the code.  If it
isn't, the only way you are likely to see an SMP speedup is to run two
instances of the application and observe that they complete in about the
same time as one, rather than to run one instance and expect it to
complete in half the time it takes on a single-CPU system.
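
A rough way to run that two-instance check, sketched in Python.  The
WORK string is a stand-in for the real application and time_instances is
a hypothetical helper; substitute whatever command actually exercises the
library.

```python
import subprocess
import sys
import time

# Stand-in for a CPU-bound job; replace with the real application.
WORK = "sum(i * i for i in range(2_000_000))"

def time_instances(count, work=WORK):
    """Start `count` independent processes at once; return wall time until all have finished."""
    t0 = time.perf_counter()
    procs = [subprocess.Popen([sys.executable, "-c", work]) for _ in range(count)]
    for p in procs:
        p.wait()
    return time.perf_counter() - t0
```

Compare time_instances(2) with time_instances(1).  On an otherwise idle
dual-CPU machine the ratio should be close to 1; on a single CPU it
should be close to 2.  Watching "top" while the two instances run tells
the same story.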

   rgb

> 
> Joe
> 
> >regards, mark hahn.

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu