[Beowulf] Re: Pretty High Performance Computing
Vincent Diepeveen
diep at xs4all.nl
Wed Sep 24 18:36:49 PDT 2008
2%? Come on.
How do you plan to lose 'just 2%' if you make heavy use of MPI?
Let's be realistic: for matrix calculations HPC can be
relatively efficient.
As soon as we discuss algorithms that are inherently sequential,
however, they are rather hard to parallelize on an HPC box. Even
very good scientists usually lose a factor of 50 or so
algorithmically then.
It is questionable whether software that is embarrassingly
parallel should be run on multi-million-dollar machines that are
easily a factor of 5 less power-efficient, provided it can run
reasonably well on normal PC/CUDA/Brook type hardware.
(Some scientists just love RAM a tad too much; I'd argue there
are always algorithms possible, though sometimes very complex,
that get a lot of performance with a tad less RAM, after which
you can move back to cheaper hardware.)
I'd argue there is a very BIG market for a shared-memory NUMA
approach, one that however has a better solution for I/O and
timing (so not using some sort of central clock and central I/O
processors like SGI did on the Origin boxes).
The few shared-memory machines that historically were faster
than a PC were so much more expensive than a PC, just to double
the speed, that it will be interesting to see what happens here.
The step from writing multithreaded/multiprocessing software
that works on NUMA hardware to an MPI-type model is really big.
The result is that those MPI-type codes usually are not very
well optimized programs: one-eyed software in the land of the
blind, so to speak.
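To make that gap concrete, here is a minimal sketch (my own toy
example, not code from anyone on this thread) of the same global
sum written once for shared memory with OpenMP and once with MPI;
mpicc and an OpenMP-capable compiler are assumed, error handling
is omitted:

    /* Toy comparison: shared-memory vs. message-passing sum. */
    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    #define N 1000000

    int main(int argc, char **argv)
    {
        static double a[N];
        for (int i = 0; i < N; i++) a[i] = 1.0 / (i + 1);

        /* Shared memory: every thread simply sees the whole array. */
        double shared_sum = 0.0;
        #pragma omp parallel for reduction(+:shared_sum)
        for (int i = 0; i < N; i++)
            shared_sum += a[i];

        /* MPI: the data must be explicitly partitioned and the
         * partial results explicitly combined -- a much bigger
         * step for code that grew up on NUMA shared memory. */
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int chunk = N / size;
        int lo = rank * chunk;
        int hi = (rank == size - 1) ? N : lo + chunk;

        double local = 0.0, global = 0.0;
        for (int i = lo; i < hi; i++)
            local += a[i];
        MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM,
                   0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("OpenMP sum %.15g, MPI sum %.15g\n",
                   shared_sum, global);
        MPI_Finalize();
        return 0;
    }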
Sometimes that has rather self-serving reasons. I've seen cases
where doing more calculations gives bigger round-off errors
which, after a few months, propagate back into the root result
big time, so the scientist can sometimes draw the conclusion he
wanted to draw, instead of objectively having to explain why the
'commercial' model that computes the same thing quickly (such
models sometimes exist, which is how we know this) doesn't show
those weird 'random' results, and so no new theory can be
concluded.
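For what it's worth, here is a toy illustration (my example, not
one of the cases above) of the underlying effect: floating-point
addition is not associative, so a reduction that combines many
partial sums, as a parallel code does, evaluates in a different
order than a serial sum and can give a slightly different answer:

    /* Floating-point sums depend on evaluation order. */
    #include <stdio.h>

    #define N 1000000
    #define P 1000   /* pretend 1000-way parallel reduction */

    int main(void)
    {
        static double a[N];
        for (int i = 0; i < N; i++)
            a[i] = 1.0 / (i + 1.0);   /* wide range of magnitudes */

        /* One long sequential sum. */
        double serial = 0.0;
        for (int i = 0; i < N; i++)
            serial += a[i];

        /* P partial sums combined afterwards: a different order. */
        double partial[P] = {0.0};
        for (int i = 0; i < N; i++)
            partial[i % P] += a[i];
        double parallel = 0.0;
        for (int i = 0; i < P; i++)
            parallel += partial[i];

        /* diff is typically small but nonzero. */
        printf("serial   %.17g\nparallel %.17g\ndiff     %g\n",
               serial, parallel, serial - parallel);
        return 0;
    }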
I would be really amazed if more than 50% of the people on this
HPC list get an efficiency of over 2% on their typical workloads.
We simply shouldn't praise ourselves as being better than we are.
Having lots of processors also makes most scientists very lazy.
That isn't bad at all; for the majority of scientists the whole
idea of using HPC is that you can take a look at what happens in
the future, giving an advantage over a PC.
That said, there are a few fields where the efficiency IS really,
really high. But other than some guys who are busy with
encryption, I wouldn't be able to name a single one to you.
Yet you could also argue that those guys in fact waste the most
resources of anyone, as there are special co-processors (for
embedded use, for example) and special dedicated processors
(using a LOT of watts) that are thousands of times faster than
what you can do on a generic CPU, in which case the 2% rule is
still valid.
There is, however, one thing in HPC that I really miss. I'm
convinced it could exist: a kind of GPU-type CPU with a lot of
memory controllers attached, doing its calculations in double
precision. A small team of 5 people could build it, clocked at,
oh, 300-350 MHz or so?
So the investment in itself isn't big. Getting to 1 teraflop
double precision per chip shouldn't be a big problem.
Where is that chip?
Did nobody care to design it because they can't make billions of
dollars with it?
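As a back-of-envelope check (my numbers only, not a real design),
1 Tflop/s double precision at 350 MHz works out to roughly 1400
to 1500 FMA units on the die, counting each fused multiply-add as
2 flops:

    /* Back-of-envelope: DP FMA units needed for 1 Tflop/s. */
    #include <stdio.h>

    int main(void)
    {
        double target_flops  = 1e12;   /* 1 Tflop/s double precision */
        double clock_hz      = 350e6;  /* 350 MHz, as guessed above  */
        double flops_per_fma = 2.0;    /* multiply + add per cycle   */

        double fma_units = target_flops / (clock_hz * flops_per_fma);
        printf("~%.0f DP FMA units needed\n", fma_units);  /* ~1429 */
        return 0;
    }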
Vincent
On Sep 25, 2008, at 12:20 AM, Mark Hahn wrote:
>> that, perhaps serendipitously, these service level delays due to
>> nodes
>> not being completely optimized for cluster use don't result in a
>> significant reduction of computation speed until the size of the
>> cluster is about at the point where one would want a full-time admin
>> just to run the cluster.
>
> no, not really. the issue is more like "how close to the edge are
> you?"
> it's the edge-closeness (relative to cluster capabilities) that
> matters.
>
> that is, if your program has very frequent global synchronization,
> you're going to want low jitter. yes, exponentially more so as the
> size of the job grows, but the importance of the issue also grows
> as your CPU increases in speed, as your interconnect improves, etc.
>
> similarly, if you have an app which is finely cache-tuned,
> it'll hurt, possibly a lot, when monitoring/etc takes a bite out.
>
>> don't worry about these service details too much, just do your work
>> knowing that you're maybe losing 2% speed (this number is a total
>> guesstimate).
>
> 2% might be reasonable if you're doing very non-edge stuff - for
> instance, a lot of embarrassingly parallel or serial-farm workloads
> that don't use a lot of memory. it's not that those workloads are
> less worthy, just that they tolerate a lot more sloppiness.
>
> again, it's the nature of the workload, not just size of the cluster.
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
>