[Beowulf] Correct networking solution for 16-core nodes
Vincent Diepeveen
diep at xs4all.nl
Fri Aug 4 04:52:01 PDT 2006
Thanks Joachim,
This is indeed the case.
The problem with this mailing list is that there are very technical people,
working for high-end companies, who try to emphasize the best-case
performance of their solution. For those who program the software, or even
worse just run software on these machines, it is the worst case that impacts
application performance most, not the best case of the card measured without
MPI overhead and without checking whether it gets flooded by too many short
messages.
One such worst case occurs when several processes on a node share the NIC.
Nowadays we have at least 2 cores per node, in the more luxurious setups 4
cores per node, and this number will only go up; we already saw postings
from at least 2 groups who have 16 cores per node. The relevant figure there
is the latency of the NIC itself when it switches between the processes
using it. And that one is ugly, so it is very relevant for programmers to
know.
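To make that concrete, a back-of-the-envelope model (the 5 us wire latency
and 50 us per-process switch cost are just assumed placeholder numbers here,
not vendor figures):

#include <stdio.h>

int main(void)
{
    /* Assumed numbers, purely for illustration: */
    double t_wire   = 5.0;   /* one-way MPI latency, microseconds */
    double t_switch = 50.0;  /* NIC cost to switch between serviced processes, microseconds */

    /* Worst case: a message arrives while the NIC is busy serving the
     * other k-1 processes on the node, so it waits behind all of them. */
    for (int k = 2; k <= 16; k *= 2)
        printf("%2d cores/node: worst-case latency ~ %.0f us\n",
               k, t_wire + (k - 1) * t_switch);
    return 0;
}

With 16 cores per node the assumed per-process switch cost dominates the
wire latency by an order of magnitude, which is exactly the point.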
Another bad one used to be certain switches that choke when a lot of short
messages flood them, though that is a different order of magnitude of
latency than the above, and the more expensive switches that usually get
quoted here solve it.
The worst one is the runqueue latency of 10-20 ms (10 ms on paper).
Especially on those SGI machines I had a lot of problems with the runqueue
latency. An additional problem there was that timing was done centrally: if
only 1 CPU out of a partition of 512 is the timing CPU, then a process
cannot simply time its own performance, which is a major problem.
Of course my software times itself, because I want to know how much system
time was effectively used by the program itself. The rest of the time the
process is busy polling (you can't idle, because then you suffer the
runqueue latency when waking up), so only timing from within the process
itself works.
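To illustrate what I mean by polling and timing inside the process, a
minimal sketch (not actual Diep code; the two-rank MPI setup and message
contents are just placeholders): one rank busy-polls with MPI_Test instead
of blocking, and measures its own CPU time with getrusage() rather than
relying on a central timing CPU.

#include <mpi.h>
#include <stdio.h>
#include <sys/time.h>
#include <sys/resource.h>

/* CPU time (user + system) consumed by this process so far, in seconds. */
static double cpu_seconds(void)
{
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return ru.ru_utime.tv_sec + ru.ru_utime.tv_usec * 1e-6
         + ru.ru_stime.tv_sec + ru.ru_stime.tv_usec * 1e-6;
}

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {                 /* sender */
        int msg = 42;
        MPI_Send(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {          /* receiver: busy poll */
        int buf, done = 0;
        MPI_Request req;
        MPI_Irecv(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &req);

        double t0 = cpu_seconds();
        while (!done)                /* never sleep, never leave the runqueue */
            MPI_Test(&req, &done, MPI_STATUS_IGNORE);
        printf("rank 1 CPU time while polling: %.6f s\n", cpu_seconds() - t0);
    }

    MPI_Finalize();
    return 0;
}

Run it with mpirun -np 2: the MPI_Test loop keeps the process on the CPU so
it never pays the wakeup penalty, and getrusage() lets each process account
for its own time without any central timing CPU.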
At 130 CPUs, without timing, my chess program Diep achieved perfect scaling
within 10 seconds (not to be confused with the speedup in time to find the
best move sooner, which is around 20% in the case of such scaling). With
timing: never.
Luckily, on clusters you don't have such problems, which a single system
image does have.
Vincent
----- Original Message -----
From: "Joachim Worringen" <joachim at dolphinics.com>
To: <beowulf at beowulf.org>
Sent: Friday, August 04, 2006 10:35 AM
Subject: Re: [Beowulf] Correct networking solution for 16-core nodes
> Greg Lindahl wrote:
>> Vincent wrote:
>>
>>> Only quadrics is clear about its switch latency (probably
>>> competitors have a worse one). It's 50 us for 1 card.
>>
>> We have clearly stated that the Mellanox switch is around 200 usec per
>> hop. Myricom's number is also well known.
>
> I think Vincent meant another latency, not the per-hop latency in the
> switches: the time to switch between different processes communicating to
> the NIC. I never heard of this latency being specified, nor being
> substantial. Can anybody comment?
>
> Joachim
>
> --
> Joachim Worringen, Software Architect, Dolphin Interconnect Solutions
> phone ++49/(0)228/324 08 17 - http://www.dolphinics.com
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
>