[Beowulf] Infiniband Advice which functions to use for what purpose

Mon Apr 9 17:14:52 PDT 2012

hi,

I'm trying to build a new InfiniBand communication model for Diep.
I need some advice on which function calls/libraries to use for the
fastest possible communication over InfiniBand (Mellanox QDR) from one
node to another.

There are a lot of possibilities there, but which one communicates fastest?

I need 2 different types of communication, possibly 3 or more.
I can still set up how the model communicates at this point, so let's
test the waters:

a) Each node has a 1.5 GB cache, so that's 1.5 GB * n in total.
   Each core of each node randomly needs 192 bytes. I don't know in
   advance which node holds them, nor where in the gigabytes of cache
   (hashtable) it needs to read.

   Which library and which function call are best for requesting this?

   Realize that all 8 cores are busy. If I need to keep 1 core free to
   handle all requests from all other nodes, that slows down each
   machine significantly, as I then lose 1 core.
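
   To make the access pattern in a) concrete, here is a rough sketch of
   the addressing side only. It models how a 64-bit hash key could be
   split into a target node and a byte offset, which are the two things
   a remote read of 192 bytes would need. It assumes 1.5 GB means
   1.5 GiB and that entries are laid out back to back; the names
   `locate`, `ENTRY_BYTES`, `CACHE_BYTES` and `ENTRIES_PER_NODE` are
   hypothetical and not Diep's actual layout.

```python
ENTRY_BYTES = 192                      # size of one hashtable entry (from the post)
CACHE_BYTES = 3 * 2**29                # 1.5 GiB per node (assumes GB means GiB)
ENTRIES_PER_NODE = CACHE_BYTES // ENTRY_BYTES   # 8388608 entries per node


def locate(key, n_nodes):
    """Map a 64-bit hash key to (node, byte offset inside that node's cache)."""
    key &= (1 << 64) - 1               # treat the key as an unsigned 64-bit value
    global_slot = key % (ENTRIES_PER_NODE * n_nodes)
    node, slot = divmod(global_slot, ENTRIES_PER_NODE)
    return node, slot * ENTRY_BYTES


if __name__ == "__main__":
    node, offset = locate(0x9E3779B97F4A7C15, n_nodes=4)
    print(f"read {ENTRY_BYTES} bytes from node {node} at offset {offset}")
```

   With these assumptions the requesting core can compute the remote
   node and offset by itself, so the open question is only how to fetch
   those bytes without a core at the remote node having to serve them.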

b) For starting and stopping the different cores (at all nodes) in a
   decentralized manner, some variables are difficult to keep
   decentralized. You want them broadcast to all nodes somehow, updating
   shared memory at the remote nodes, with the Mellanox card writing
   into the RAM without interrupting any of the (probably 8) running
   cores and without needing any of them to handle this.

   Is that possible somehow? If so, is it possible to update it with 1
   function call to all n-1 other nodes?

c) Memory migration: which possibilities are there to do this? I
   probably need to build a manual memory migration for when a specific
   job gets taken over from one node by another. Which function calls
   would you advise using there, and is there documentation on how to
   implement memory migration efficiently?

   I need to migrate roughly 2 kilobytes at a time. This obviously
   doesn't happen too often, yet the algorithms are so complex that I
   can't avoid doing this if I want the utmost performance, as I worked
   out on paper. And yes, I do know there is some stuff that already has
   this built in, but that's possibly too slow for what I need.

d) Atomic reads/writes/spinlocks over InfiniBand. There is probably a
   function to set a lock at a remote memory address; which one is it?

   Is there also a function call that sets a lock and, when the lock is
   successful, directly returns a bunch of bytes from a specific address
   (near the lock)? That would save me from first setting the lock, then
   sitting like a duck waiting until the lock is set, and only then
   issuing the read. That procedure means we ship something from node A
   to B, then once the lock is set at B the reply goes back to A, and
   only then can A finally read its bytes at B, as it holds the lock. Is
   there a combined function that is faster than this and returns those
   bytes to A directly after it gets the lock at B?
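
   The two-step procedure in d) can be modelled locally like this. It
   is not verbs code, just a sketch that counts round trips. It assumes
   the lock is an 8-byte word that can be compare-and-swapped atomically
   at the remote side and that the data sits right behind the lock; the
   class `RemoteNode` and the functions `lock_then_read` and `unlock`
   are hypothetical names for illustration.

```python
import struct

LOCK_FREE, LOCK_BYTES = 0, 8           # lock is an 8-byte word; 0 means unlocked


class RemoteNode:
    """Stand-in for node B's memory; every method call models one round trip."""

    def __init__(self, size):
        self.mem = bytearray(size)
        self.round_trips = 0

    def compare_and_swap(self, addr, expected, new):
        """Swap the 8-byte word at addr if it equals expected; return the old value."""
        self.round_trips += 1
        (old,) = struct.unpack_from("<Q", self.mem, addr)
        if old == expected:
            struct.pack_into("<Q", self.mem, addr, new)
        return old

    def read(self, addr, length):
        """Return a copy of length bytes starting at addr."""
        self.round_trips += 1
        return bytes(self.mem[addr:addr + length])


def lock_then_read(node, lock_addr, owner_id, length):
    """Take the lock at B first, then read the bytes stored right behind it."""
    while node.compare_and_swap(lock_addr, LOCK_FREE, owner_id) != LOCK_FREE:
        pass                           # lock held by someone else: retry (spin)
    return node.read(lock_addr + LOCK_BYTES, length)


def unlock(node, lock_addr, owner_id):
    """Release the lock, but only if owner_id still holds it."""
    node.compare_and_swap(lock_addr, owner_id, LOCK_FREE)


if __name__ == "__main__":
    b = RemoteNode(4096)
    b.mem[8:8 + 192] = bytes(range(192))
    data = lock_then_read(b, 0, owner_id=1, length=192)
    print(len(data), "bytes read in", b.round_trips, "round trips")
    unlock(b, 0, owner_id=1)
```

   In this model the uncontended case costs 2 round trips from A to B
   before A has its bytes; a combined lock-and-read call would cut that
   to 1.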

e) When doing the spinlock from A, is the core A.c that tries to set
   the lock at node B actually spinning?
   My previous experience is that some implementations, now and/or in
   the past, put your core to idle instead of having it spin for a bunch
   of microseconds. Put simply, the core then has to be woken up again
   via the runqueue, which means a 10-30 millisecond delay until it has
   received that data.
   Do cores get put in prison for up to 30 years when trying to set a
   lock with the function call from d)? Do I have both options, or am I
   so lucky?

Many thanks for taking a look at my questions, and even more to those
responding!

Kind Regards,
Vincent