[Beowulf] IP address mapping for new cluster
Larry Stewart
larry.stewart at sicortex.com
Wed Aug 1 20:47:58 PDT 2007
Carsten Aulbert wrote:
> Hi,
>
>
<scheme for assigning IP addresses to cluster components>
I clicked reply to say this seems like a lot of trouble to go through to
make it easy to go from IP address to
location and function, but it turns out that we do something very
similar in our machines. A 972-node
SC5832 uses a class B IP address like A.B.y.z/16. The interconnect
fabric isn't solely an IP network, but we emulate
Ethernet/IP using IP addresses like A.B.200+<module ID>.100+<node
ID>/18. Each node also has a second IP
address that doesn't depend on the interconnect -- the control plane
network -- with an ID like
A.B.0+<module ID>.100+<node ID>/18. These are like your IPMI ports
in function but are actually serial point to point IP
links, the other end of which is an interface on a module service
processor that does booting and so forth.
The module service processors each have a control plane IP address like
A.B.0.100+<module ID>/24. Then
the fans, power supplies and so forth have addresses in A.B.0.20-99/24.
The main service processor has A.B.0.1
as its interface on the control network. The system has a third IP
network connecting some gateway nodes
on some modules to the service processor. These interfaces have addresses
like A.B.150.X/24.
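
A minimal Python sketch of the address arithmetic described above. The
10.1 prefix is a hypothetical stand-in for A.B, and the function names
are invented for illustration; this is not the actual SiCortex tooling.

```python
import ipaddress

# Hypothetical stand-in for the "A.B" prefix; the real value is site-specific.
A, B = 10, 1


def _addr(third, fourth):
    """Build A.B.third.fourth, rejecting octets that do not fit in a byte."""
    for octet in (third, fourth):
        if not 0 <= octet <= 255:
            raise ValueError("octet out of range: %d" % octet)
    return ipaddress.IPv4Address("%d.%d.%d.%d" % (A, B, third, fourth))


def fabric_ip(module_id, node_id):
    """Emulated Ethernet/IP over the interconnect: A.B.200+module.100+node."""
    return _addr(200 + module_id, 100 + node_id)


def control_ip(module_id, node_id):
    """Per-node control plane link: A.B.0+module.100+node."""
    return _addr(module_id, 100 + node_id)


def module_sp_ip(module_id):
    """Module service processor on the control plane: A.B.0.100+module."""
    return _addr(0, 100 + module_id)


def decode_fabric_ip(ip):
    """Recover (module_id, node_id) from a fabric address."""
    octets = ipaddress.IPv4Address(ip).packed
    return octets[2] - 200, octets[3] - 100
```

With this placeholder prefix, fabric_ip(3, 7) gives 10.1.203.107, and
decoding is just subtracting 200 and 100 from the last two octets.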
I was going to say "how often do you really deal with the A.B.C.D rather
than DNS names anyway?" but I've
just spent a couple of weeks doing just that and it really is convenient
when you are in the weeds.
One comment is that nearly all software that deals with dotted quads
prints in decimal, which makes
binary encodings of the meaning awkward. So using 4-bit fields for the
X and Y coordinates is hard
to translate in your head. Instead, making the third octet be
(row*20)+column would be a lot easier
on the brain and supports 12 rows. This is why we do things like
A.B.200+<module ID>.100+<node ID>/18.
It's a little awkward to get started, but then it is trivial to map in
your brain from IP to function
and position.
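
To make the decimal-friendly encoding concrete, here is a small Python
sketch of (row*20)+column and its inverse. The function names are
hypothetical. With 20 columns per row, 12 full rows (0 through 11) fit in
one octet, since 11*20+19 = 239.

```python
COLUMNS_PER_ROW = 20
MAX_ROWS = 256 // COLUMNS_PER_ROW  # 12 full rows fit in one octet


def encode_position(row, column):
    """Pack a rack position into one octet as row*20 + column."""
    if not 0 <= row < MAX_ROWS:
        raise ValueError("row must be in 0..%d" % (MAX_ROWS - 1))
    if not 0 <= column < COLUMNS_PER_ROW:
        raise ValueError("column must be in 0..%d" % (COLUMNS_PER_ROW - 1))
    return row * COLUMNS_PER_ROW + column


def decode_position(octet):
    """Recover (row, column) from the octet."""
    return divmod(octet, COLUMNS_PER_ROW)
```

Row 3, column 7 becomes 67, which is easy to split back into 3 and 7 by
eye. The 4-bit packing of the same position, (3 << 4) | 7, prints as 55.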
The next issue is how all this gets initialized. Pretty much the only
way to do it is to have the DHCP
servers configured to map MAC addresses to IP addresses in a stable
way. We don't really have that
problem because pretty much the only interfaces that have random MAC
addresses are the module
service processors. The MAC address maps to the manufacturing serial
number, which is essential
for tracking faults, but the position (slot ID/module ID) is reported in
the DHCP request in a <vendor>
field and the DHCP server knows what to do.
It seems like when you install something, you will have to enter its MAC
addresses into the DHCP
server database and map to a stable IP address given database knowledge
of the position and function
of the device.
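
One way this could look, as a hedged sketch: generate static host entries
from an inventory of MAC addresses and positions. This assumes an ISC
dhcpd style configuration (host / hardware ethernet / fixed-address). The
inventory, hostnames, 10.1 prefix, and final octet are all made up for
illustration.

```python
# Hypothetical inventory: (hostname, MAC, row, column). All values are made up.
INVENTORY = [
    ("n0307", "00:11:22:33:44:55", 3, 7),
    ("n0400", "00:11:22:33:44:66", 4, 0),
]


def position_ip(row, column):
    """Assumed scheme: 10.1.<row*20+column>.1; prefix and last octet are placeholders."""
    return "10.1.%d.1" % (row * 20 + column)


def host_stanza(name, mac, row, column):
    """Render one ISC dhcpd style host entry binding a MAC to a fixed address."""
    return "host %s { hardware ethernet %s; fixed-address %s; }" % (
        name, mac, position_ip(row, column))


def render(inventory):
    """Render one host entry per inventory record, one per line."""
    return "\n".join(host_stanza(*entry) for entry in inventory)


if __name__ == "__main__":
    print(render(INVENTORY))
```

Regenerating the entries from the inventory whenever a box is installed
or moved keeps the MAC-to-IP mapping in one place.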
It also seems like, as boxes get pulled out of a rack for service,
replaced by spares, and later put back
in service somewhere else, you should maintain a database of MAC
address to device serial
number so you can recognize a lemon when it comes back with a different
IP address but the
same symptoms. The database will have to be clever to support coherent
views of FRUs in cases
like when an interconnect card is moved from one flaky motherboard to
another, changing the
MAC binding but not the failures.
For us, there were a number of benefits in going to "IP address maps to
function":
* Humans can debug given the IP addresses alone
* No DNS lookups required in performance critical paths
* Higher level configuration files for things like SLURM can be nearly
static
Nevertheless, is the benefit of mapping IP to physical location really
valuable? Trying to
maintain this given the probable frequency of swapping out boxes will
cause trouble with
DHCP and ARP. Either you make the leases short and wait for them to
expire before
powering on a replacement, or you have to go around manually flushing
leases and ARP
tables. Ugh. Instead, it may make more sense to give a type of device
a stable IP address
without regard to position, and to maintain a database mapping MAC/IP to
location
separately. For a few thousand devices, grepping the location file
will be faster than
walking over to the right rack anyway. We have this problem with
modules. The service
guys want to swap modules in the backplane to see if a problem follows
them, and it has
cost us some DHCP hackery to let the addressing respond smoothly.
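
A rough sketch of the separate location database idea, assuming a
hypothetical flat file with one "MAC IP location" record per line. The
file format, addresses, and location labels are invented for illustration.

```python
# Hypothetical location file: one "MAC IP location" record per line.
LOCATION_FILE = """\
00:11:22:33:44:55 10.1.9.17 rack3-slot7
00:11:22:33:44:66 10.1.9.18 rack4-slot0
"""


def load_locations(text):
    """Index each record by both MAC and IP so either one can be looked up."""
    by_key = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        mac, ip, location = fields
        by_key[mac.lower()] = location
        by_key[ip] = location
    return by_key


def locate(by_key, key):
    """Return the recorded location for a MAC or IP, or None if unknown."""
    return by_key.get(key.lower())
```

When a box moves, only its location field changes; the MAC and IP stay
put, so no leases or ARP entries need flushing.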
-Larry