[Beowulf] Re: Pretty High Performance Computing

Vincent Diepeveen diep at xs4all.nl
Wed Sep 24 18:36:49 PDT 2008


2%? Come on,

How do you plan to lose 'just 2%' if you make heavy use of MPI?

Let's be realistic: for matrix calculations, HPC can be relatively
efficient. As soon as we discuss algorithms that are sequential by
nature, they are rather hard to parallelize on an HPC box. Even very
good scientists then usually lose a factor of 50 or so algorithmically.
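
(To put a number on that with a standard back-of-envelope formula,
not anything from a particular code: Amdahl's law says that if a
fraction s of the work is inherently sequential, the best possible
speedup on N processors is

    S(N) = 1 / (s + (1 - s) / N)

so with s = 0.02 the speedup can never exceed 1/0.02 = 50, no matter
how many cores you throw at it. On N = 2500 cores that works out to
S ~ 49, i.e. roughly 2% efficiency, which is exactly the sort of
factor-50 gap and 2% number I keep coming back to.)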

It is questionable whether embarrassingly parallel software should be
run on multi-million dollar machines that are easily a factor of 5 less
power-efficient, provided it can run reasonably well on ordinary
PC/CUDA/Brook-type hardware. (Some scientists just love RAM a tad too
much; I'd argue there are always algorithms possible, though sometimes
very complex, that get a lot of performance out of a tad less RAM,
after which you can move to cheaper hardware again.)

I'd argue there is a very BIG market for a shared-memory NUMA approach,
one that has a better solution for I/O and timing, however (so not some
sort of central clock and central I/O processors like SGI used in the
Origin boxes).

The few shared-memory machines that historically were faster than a PC
were so much more expensive than a PC, just to gain a factor of 2 in
speed, that it will be interesting to see what happens here.

The step from writing multithreaded/multiprocess software that works on
NUMA hardware to an MPI-type model is really big.
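
To show what I mean with a toy example (just something I typed up for
this mail, not from any real application), here is the same trivial
reduction written once for shared memory with OpenMP and once with MPI;
assume an mpicc with OpenMP support, e.g. "mpicc -fopenmp sum.c":

#include <stdio.h>
#include <mpi.h>
#include <omp.h>

#define N 1000000

static double a[N];

int main(int argc, char **argv)
{
    for (long i = 0; i < N; i++)
        a[i] = 1.0;

    /* Shared memory: one address space, the runtime splits the loop. */
    double shm_sum = 0.0;
    #pragma omp parallel for reduction(+:shm_sum)
    for (long i = 0; i < N; i++)
        shm_sum += a[i];

    /* MPI: every rank works on its own slice, and the partial sums
       have to be combined explicitly over the interconnect. */
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    long chunk = N / size;
    long lo = (long)rank * chunk;
    long hi = (rank == size - 1) ? N : lo + chunk;

    double local = 0.0, global = 0.0;
    for (long i = lo; i < hi; i++)
        local += a[i];
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0,
               MPI_COMM_WORLD);

    if (rank == 0)
        printf("shared-memory sum %.0f, MPI sum %.0f\n",
               shm_sum, global);

    MPI_Finalize();
    return 0;
}

Even in this toy the MPI version already needs explicit data
decomposition and an explicit communication call, while the
shared-memory version is three lines around the original loop; with
irregular data structures the gap only gets bigger.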

What happens as a result is that those MPI-type codes usually are not
very well optimized programs. The "one-eyed software in the land of the
blind", so to speak.

Sometimes that has very self-serving reasons. I've seen cases where
doing more calculations gives bigger round-off errors, which after a
few months propagate back into the root result big time. That lets the
scientist sometimes draw the conclusion he wanted to draw, instead of
objectively having to explain why the 'commercial' model that gets
calculated quickly (such models sometimes exist, which is how we know
this) doesn't show those weird 'random' results, so no new theory can
be concluded from them.

I would be really amazed if more than 50% of the people on this HPC
list get an efficiency of over 2% on their typical workloads.

We simply shouldn't praise ourselves as better than we are. Having lots
of processors also makes most scientists very lazy. That isn't bad at
all; the reason the majority of scientists use HPC is that you can take
a look further into the future at what happens, which is the advantage
over a PC.

That said, there are a few fields where the efficiency IS really, really high.

But other than some guys who are busy with encryption, I wouldn't be
able to mention a single one to you. Yet you could also argue that
those guys in fact waste the most resources of anyone, as there are
special co-processors (for embedded use, for example) and special
dedicated processors (using a LOT of watts) that are thousands of times
faster than what you can do on a generic CPU, in which case the 2% rule
still holds.

In HPC there is, however, one thing I really miss. I'm convinced it
could exist: a kind of GPU-type CPU with a lot of memory controllers
attached, doing its calculations in double precision. A small team of 5
people could build it, and the clock would be, oh, 300-350 MHz or so?

So the investment in itself isn't big. Getting to 1 teraflop of double
precision per CPU shouldn't be a big problem.
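
(Quick napkin arithmetic on my side, not a design: counting a fused
multiply-add as 2 flops, at those clocks you need on the order of

    10^12 flop/s / (3.0 * 10^8 cycle/s) ~ 3300 flop/cycle ~ 1670 DP FMA units
    10^12 flop/s / (3.5 * 10^8 cycle/s) ~ 2860 flop/cycle ~ 1430 DP FMA units

so a wide but very regular array of units; the hard part is feeding
them, hence all those memory controllers.)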

Where is that CPU?

Did no one care to design it because they can't make billions of
dollars with it?

Vincent

On Sep 25, 2008, at 12:20 AM, Mark Hahn wrote:

>> that, perhaps serendipitously, these service level delays due to  
>> nodes
>> not being completely optimized for cluster use don't result in a
>> significant reduction of computation speed until the size of the
>> cluster is about at the point where one would want a full-time admin
>> just to run the cluster.
>
> no, not really.  the issue is more like "how close to the edge are  
> you?"
> it's the edge-closeness (relative to cluster capabilities) that  
> matters.
>
> that is, if your program has very frequent global synchronization,
> you're going to want low jitter.  yes, exponentially more so as the  
> size of the job grows, but the importance of the issue also grows  
> as your CPU increases in speed, as your interconnect improves, etc.
>
> similarly, if you have an app which is finely cache-tuned,
> it'll hurt, possibly a lot, when monitoring/etc takes a bite out.
>
>> don't worry about these service details too much, just do your work
>> knowing that you're maybe losing 2% speed (this number is a total
>> guesstimate).
>
> 2% might be reasonable if you're doing very non-edge stuff - for  
> instance, a lot of embarrassingly parallel or serial-farm workloads
> that don't use a lot of memory.  it's not that those workloads are  
> less worthy, just that they tolerate a lot more sloppiness.
>
> again, it's the nature of the workload, not just size of the cluster.
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit  
> http://www.beowulf.org/mailman/listinfo/beowulf
>



