[Beowulf] IP address mapping for new cluster

Wed Aug 1 20:47:58 PDT 2007

Carsten Aulbert wrote:
> Hi,
>
>
<scheme for assigning IP addresses to cluster components>

I clicked reply to say this seems like a lot of trouble to go through
to make it easy to go from IP address to location and function, but it
turns out that we do something very similar in our machines.  A 972-node
SC5832 uses a class B IP address like A.B.y.z/16.  The interconnect
fabric isn't solely an IP network, but we emulate Ethernet/IP using IP
addresses like A.B.200+<module ID>.100+<node ID>/18.  Each node also has
a second IP address that doesn't depend on the interconnect -- the
control plane network -- with an address like
A.B.0+<module ID>.100+<node ID>/18.  These are like your IPMI ports in
function but are actually serial point-to-point IP links, the other end
of which is an interface on a module service processor that does booting
and so forth.  The module service processors each have a control plane
IP address like A.B.0.100+<module ID>/24.  Then the fans, power supplies
and so forth have addresses in A.B.0.20-99/24.  The main service
processor has A.B.0.1 as its interface on the control network.  The
system has a third IP network connecting some gateway nodes on some
modules to the service processor.  These interfaces have addresses like
A.B.150.X/24.

I was going to say "how often do you really deal with the A.B.C.D rather
than DNS names anyway?" but I've just spent a couple of weeks doing just
that and it really is convenient when you are in the weeds.

One comment is that nearly all software that deals with dotted quads
prints in decimal, which makes binary encodings of the meaning awkward.
So using 4-bit fields for the X and Y coordinates is hard to translate
in your head.  Instead, making the third octet be (row*20)+column would
be a lot easier on the brain and supports 12 rows.  This is why we do
things like A.B.200+<module ID>.100+<node ID>/18.  It's a little awkward
to get started, but then it is trivial to map in your brain from IP to
function and position.
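
To make the (row*20)+column idea concrete, here is a minimal Python
sketch.  The function names, the zero-based numbering of rows and
columns, and the 20-column limit are my assumptions for illustration,
not anything from the original scheme.

```python
COLS_PER_ROW = 20   # multiplier suggested above; assumes at most 20 columns
MAX_ROWS = 12       # 12 full rows fit in one octet: 11*20 + 19 = 239 <= 255


def position_to_octet(row, column):
    """Encode a zero-based (row, column) position as the third octet."""
    if not 0 <= row < MAX_ROWS:
        raise ValueError(f"row {row} out of range 0..{MAX_ROWS - 1}")
    if not 0 <= column < COLS_PER_ROW:
        raise ValueError(f"column {column} out of range 0..{COLS_PER_ROW - 1}")
    return row * COLS_PER_ROW + column


def octet_to_position(octet):
    """Decode the third octet back into (row, column)."""
    row, column = divmod(octet, COLS_PER_ROW)
    if not 0 <= row < MAX_ROWS:
        raise ValueError(f"octet {octet} does not encode a valid position")
    return row, column


print(position_to_octet(3, 7))   # 67
print(octet_to_position(67))     # (3, 7)
```

The point is that the decimal value can be decoded by eye: 67 is row 3,
column 7, with no binary arithmetic needed.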

The next issue is how all this gets initialized.  Pretty much the only
way to do it is to have the DHCP servers configured to map MAC addresses
to IP addresses in a stable way.  We don't really have that problem
because pretty much the only interfaces that have random MAC addresses
are the module service processors.  The MAC address maps to the
manufacturing serial number, which is essential for tracking faults, but
the position (slot ID/module ID) is reported in the DHCP request in a
<vendor> field and the DHCP server knows what to do.

It seems like when you install something, you will have to enter its MAC
addresses into the DHCP server database and map them to a stable IP
address given database knowledge of the position and function of the
device.
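
As a rough sketch of that step, assuming an ISC dhcpd server (which uses
"host" declarations with "hardware ethernet" and "fixed-address"), the
static entries could be generated from an inventory of positions.  The
10.1 prefix, the inventory fields, the host names and the use of
(row*20)+column for the third octet are all hypothetical.

```python
PREFIX = "10.1"  # hypothetical stand-in for the A.B part of the address

# Hypothetical inventory: one record per installed interface.
INVENTORY = [
    {"name": "node-r03c07-u1", "mac": "00:11:22:33:44:55",
     "row": 3, "column": 7, "unit": 1},
    {"name": "node-r03c08-u1", "mac": "00:11:22:33:44:66",
     "row": 3, "column": 8, "unit": 1},
]


def stable_ip(entry):
    """Derive the IP from position alone, so it survives a box swap."""
    third = entry["row"] * 20 + entry["column"]
    return f"{PREFIX}.{third}.{entry['unit']}"


def host_stanza(entry):
    """Render one ISC dhcpd host declaration binding a MAC to its IP."""
    return (
        f"host {entry['name']} {{\n"
        f"    hardware ethernet {entry['mac']};\n"
        f"    fixed-address {stable_ip(entry)};\n"
        f"}}"
    )


def render(inventory):
    """Render all host declarations, ready to include from dhcpd.conf."""
    return "\n".join(host_stanza(e) for e in inventory)


print(render(INVENTORY))
```

Replacing a box then means changing only the "mac" field of its record
and regenerating the file; the IP stays with the position.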

It also seems like, as boxes get pulled out of a rack for service,
replaced by spares, and later put back in service somewhere else, you
should maintain a database of MAC address to device serial number so you
can recognize a lemon when it comes back with a different IP address but
the same symptoms.  The database will have to be clever to support
coherent views of FRUs in cases like when an interconnect card is moved
from one flaky motherboard to another, changing the MAC binding but not
the failures.
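
A minimal sketch of the lemon check, assuming faults are logged against
the device serial number rather than the IP address.  The log format,
the serial numbers and the function name are invented for illustration,
and this does not attempt the harder multi-FRU case described above.

```python
from collections import defaultdict

# Hypothetical fault log: (serial number, IP at the time, symptom).
FAULT_LOG = [
    ("SN1001", "10.1.67.1", "ECC errors"),
    ("SN1002", "10.1.68.1", "link flaps"),
    ("SN1001", "10.1.92.1", "ECC errors"),   # same box, new position
    ("SN1002", "10.1.68.1", "link flaps"),   # same box, same position
]


def find_lemons(fault_log):
    """Return serials that showed the same symptom at more than one IP."""
    seen = defaultdict(set)  # (serial, symptom) -> set of IPs
    for serial, ip, symptom in fault_log:
        seen[(serial, symptom)].add(ip)
    return sorted({serial for (serial, _), ips in seen.items() if len(ips) > 1})


print(find_lemons(FAULT_LOG))   # ['SN1001']
```

A fault that follows the serial number across positions points at the
box; one that stays at the same IP after a swap points at the slot.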

For us, there were a number of benefits in going to "IP address maps to
function":
* Humans can debug given the IP addresses alone
* No DNS lookups required in performance-critical paths
* Higher level configuration files for things like SLURM can be nearly
  static

Nevertheless, is mapping IP to physical location really worth it?
Trying to maintain this given the probable frequency of swapping out
boxes will cause trouble with DHCP and ARP.  Either you make the leases
short and wait for them to expire before powering on a replacement, or
you have to go around manually flushing leases and ARP tables.  Ugh.
Instead, it may make more sense to give a type of device a stable IP
address without regard to position, and to maintain a database mapping
MAC/IP to location separately.  For a few thousand devices, grepping the
location file will be faster than walking over to the right rack anyway.
We have this problem with modules.  The service guys want to swap
modules in the backplane to see if a problem follows them, and it has
cost us some DHCP hackery to let the addressing respond smoothly.
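
For the separate location database, a flat file is probably enough at
this scale.  Here is a sketch of the lookup, assuming a made-up
four-column file format (MAC, IP, serial, location); none of these
names or values come from a real system.

```python
# Hypothetical location file: whitespace-separated, '#' starts a comment.
LOCATION_FILE = """\
# mac               ip          serial   location
00:11:22:33:44:55   10.1.0.17   SN1001   rack3/slot7
00:11:22:33:44:66   10.1.0.18   SN1002   rack3/slot8
"""


def parse_locations(text):
    """Index each record by both MAC and IP so either can be looked up."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments and blanks
        if not line:
            continue
        mac, ip, serial, location = line.split()
        record = {"mac": mac.lower(), "ip": ip,
                  "serial": serial, "location": location}
        table[record["mac"]] = record
        table[ip] = record
    return table


table = parse_locations(LOCATION_FILE)
print(table["10.1.0.17"]["location"])   # rack3/slot7
```

When a box moves, only its location column changes; the IP and the DHCP
binding stay put, so there are no leases or ARP entries to flush.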

-Larry