[Beowulf] building Infiniband 4x cluster questions

Mon Nov 7 18:33:46 PST 2011

On Nov 8, 2011, at 2:46 AM, Gilad Shainer wrote:

>> I just test things and go for the fastest. But if we do theoretic  
>> math, SHMEM
>> is difficult to beat of course.
>> Google for measurements with shmem, not many out there.
>
> SHMEM within the node or between nodes?

shmem is the programming library that cray had and that quadrics had.
So basically your program doesn't need silly message-catching mpi
commands everywhere. You only define at program start whether an array
is getting tracked by elan4, which nodes it gets updated to, etc.

So no need to check for MPI overflows in the complex code for
starting / stopping cpu's.
You can easily reuse code there to start remote nodes and cpu's.

So while the majority of the latency goes to RDMA reads and/or reads
from remote elan memory, the tough and complicated code to start/stop
cpu's, whose overhead is negligible, is a bit easier to program with
the SHMEM library.

The caches on the quadrics cards have shmem, so you don't access the
RAM at all; it's already in the cards. I didn't check whether those
features got added to mpi somehow.

so you just need to read the card - it's not gonna go through pci-x  
at all at the remote node.

Yet of course all this is not so relevant to explain here, as
quadrics is long gone, and i'm just searching for a cheapo solution :)

So you lose only 2x the pci-x latency, versus 4x pci-e latency in  
such case.

In the case of an RDMA read i doubt the latency of DDR infiniband is
lower than that of quadrics.

That 0.7 you mentioned, if it is microseconds, sounds like a somewhat
overestimated latency for pci-x. The MPI one-way pingpong on the QM500
is 1.3 us; multiplied by 2 that's 2.6 us. Of that 2.6 us, according to
your math 2.8 us (4 x 0.7) would already be the cost of pci-x. Then each
one-way trip, besides its 2x pci-x, has these costs: the receiving elan
130 ns, the switch say 300 ns including cables for a 128 port router,
and the sending elan 100 ns. That's 530 ns, and that times 2 is 1060 ns.
There's really little left for the pci-x, as 2.6 - 1.06 = 1.54 us is
left for 4 times pci-x.

1.54 / 4 = 0.385 us for pci-x.

I used the Los Alamos National Laboratory example numbers here for  
elan4.
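
The budget above can be written out as a small sketch. The figures are the ones quoted in this mail; the function and variable names are made up for illustration, and the model (a round trip crosses pci-x 4 times and the fabric twice) is just the reasoning above, not a measurement.

```python
# Latency budget for a pingpong round trip on QM500, using the
# figures quoted in the mail. All names here are illustrative only.

ONE_WAY_PINGPONG_US = 1.3   # MPI one-way pingpong latency
RECV_ELAN_NS = 130          # receiving elan
SWITCH_NS = 300             # switch incl. cables, 128 port router
SEND_ELAN_NS = 100          # sending elan
PCIX_CROSSINGS = 4          # 2 per direction, 2 directions


def pcix_latency_us(one_way_us, fabric_ns_one_way, crossings):
    """Return what is left per pci-x crossing after subtracting the fabric."""
    round_trip_us = 2 * one_way_us
    fabric_us = 2 * fabric_ns_one_way / 1000.0
    return (round_trip_us - fabric_us) / crossings


fabric_ns = RECV_ELAN_NS + SWITCH_NS + SEND_ELAN_NS
per_crossing_us = pcix_latency_us(ONE_WAY_PINGPONG_US, fabric_ns, PCIX_CROSSINGS)

print(f"fabric per direction: {fabric_ns} ns")
print(f"left per pci-x crossing: {per_crossing_us:.3f} us")
```

With these inputs, 2.6 us minus 1.06 us of fabric leaves 1.54 us, or about 0.385 us per pci-x crossing.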

In the end it is about price, not user friendliness of programming :)

>
>
>> Fact that so few standardized/rewrote their floating point  
>> software to gpu's,
>> is already saying enough about all the legacy codes in HPC world :)
>>
>> Some years ago, when i had a working 2 node cluster here with
>> QM500-A, it had, in 32 bit, 33 MHz pci long sleeve slots, a blocked
>> read latency of under 3 us; that is what i saw on my screen. Sure, i
>> had no switch in between. Direct connection between the 2 elan4's.
>>
>> I'm not sure what pci-x adds to it when clocked at 133Mhz, but it  
>> won't be a
>> big diff with pci-e.
>
> There is a big difference between PCIX and PCIe. PCIe is half the
> latency - from 0.7 to 0.3 more or less.
>

Well i'm not so sure the difference is that huge. All those
measurements in the past were on oldie Xeon P4 machines,
and i've never really seen a good comparison there.

Furthermore, fabrics like Dolphin at the time, with a 66 MHz, 64 bit
PCI card, already got one-way pingpong latencies of like 1.36 us,
not exactly a lot slower than the claimed 1.2 us of qlogic's DDR
infiniband.

>> PCI-e  probably only has a bigger bandwidth isn't it?
>
> Also bandwidth ...:-)

That's a non discussion here. I need latency :)

If i really needed big bandwidth for transport i'd of course use a
boat - 90% of all cargo here gets transported over the rivers and
hand dug canals, especially the river Rhine.

>
>> Beating such hardware 2nd hand is difficult. $30 on ebay and i can  
>> install 4
>> rails or so.
>> Didn't find the cables yet though...
>>
>> So i don't see how to outdo that with old infiniband cards which are
>> $130 and upwards for the connectx, say $150 soon, which would  
>> allow only
>> single rail
>>   or maybe at best 2 rails. So far didn't hear anyone yet who has  
>> more than
>> single rail IB.
>>
>> Is it possible to install 2 rails with IB?
>
> Yes, you can do dual rails

very well

>
>> So if i use your number in pessimistic manner, which means that  
>> there is
>> some overhead of pci-x, then the connectx type IB, can do 1  
>> million blocked
>> reads per second theoretic with 2 rails. Which is $300 or so,  
>> cables not
>> counted.
>
> Are you referring to RDMA reads?
>

As i use all cpu cores 100%, i simply cannot catch mpi messages, let
alone overflow.
So anything that has the card's processor do the job of digging in the
RAM, rather than bugging one of the very busy cores, is a very welcome
form of communication.

99.9% of all communication to remote nodes is 32 byte RDMA writes and
128-256 byte reads.
I can set myself whether it's 128, 192 or 256.
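
To put those read sizes next to a read rate, here is a rough sketch. It assumes one outstanding blocked read per rail and a 2 us round trip per read; that 2 us is an assumption, picked only because it makes 2 rails come out at the 1 million blocked reads per second mentioned earlier in this thread. The function names are made up for illustration.

```python
# Rough estimate: blocked reads per second and the payload bandwidth
# they imply. The round-trip latency is an ASSUMPTION, not a measurement.

ASSUMED_ROUND_TRIP_US = 2.0  # hypothetical blocked read latency per rail


def blocked_reads_per_second(round_trip_us, rails=1):
    """One outstanding read per rail: each rail completes 1/latency reads/s."""
    return rails * 1_000_000 / round_trip_us


def payload_mb_per_second(reads_per_second, read_bytes):
    """Payload only; headers and protocol overhead are ignored."""
    return reads_per_second * read_bytes / 1_000_000


rate = blocked_reads_per_second(ASSUMED_ROUND_TRIP_US, rails=2)
for size in (128, 192, 256):
    mb = payload_mb_per_second(rate, size)
    print(f"{size} byte reads at {rate:.0f} reads/s: {mb:.0f} MB/s payload")
```

Under these assumptions the payload stays in the low hundreds of MB/s even with 256 byte reads, so the number that matters is the latency per read, not the bandwidth.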

Probably i'll make it 128. The number of reads is a few percent more  
than writes.
That other 0.1% is the very complex parallel algorithm that
basically parallelizes a sequential algorithm.

That algorithm is roughly 150 pages of a4, full of insights and
proofs of why it works correctly :)

>
>