[scyld-users] IPC Network vrs Management Network

Thu Feb 24 15:01:23 PST 2005

On Thu, 24 Feb 2005, Joel Krauska wrote:

> I am exploring using my cluster with a different IPC networks.
> IE: Using MPI RNICs or InfiniBand in my cluster.
>
> Some design questions:
>
> 1. Should the head node also be connected to the high-speed IPC network?

The Scyld software is designed to work with either configuration, but
there are significant advantages to putting the master on the high speed
network.

A few of them are:

  - A master is the only type of node that deals with compute node
    additions.  Some networks, such as Myrinet, require a re-mapping
    process when new machines are added.  That's easily done by the
    master when it's part of the network and difficult otherwise.

    Why do other cluster systems not consider this an issue?
    Almost all other cluster systems assume a fixed cluster configuration,
    with hand modification (or equivalently, custom scripts) needed when
    anything changes.  You can use Scyld in that way, but it discards
    the advantage of incremental, on-line scalability possible with clusters.

  - The master can monitor, manage and control the network.  A Beomap
    plug-in can use the network statistics to create a better schedule.

  - Some MPI programs, especially those converted from PVM, expect the
    rank 0 process to be able to do I/O.  This expectation is reflected
    in the default Beomap scheduler, which puts the first process on the
    master.  (See the beomap manual and "--no-local" to change this.)

  - Our preferred MPI library implementation is a true library.  A single
    process runs on the master until the MPI initialization call.  The
    MPI initialization function creates the remote processes with a
    remote fork system call.  This approach copies the initialization to
    the remote processes exactly.
    Most or all other cluster MPI implementations start all processes
    simultaneously with an auxiliary program, usually a script named
    'mpirun' or 'mpiexec'.  This means that the process count is fixed.
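
As a rough local analogy only: the sketch below uses plain os.fork() on
one POSIX machine, not BProc's remote fork and not any MPI library.  It
shows the point made above, that state set up before the fork is copied
to every child exactly.  The function name and the "settings" values are
made up for illustration.

```python
import os


def fork_after_init(num_children):
    """Initialize once in the parent, then fork; children inherit the state."""
    # Work done before the fork: every child gets an exact copy of this.
    settings = {"problem_size": 1000, "seed": 42}

    read_fd, write_fd = os.pipe()
    pids = []
    for rank in range(1, num_children + 1):
        pid = os.fork()
        if pid == 0:
            # Child: report the inherited settings, then exit immediately.
            os.close(read_fd)
            line = "%d:%d:%d\n" % (rank, settings["problem_size"],
                                   settings["seed"])
            os.write(write_fd, line.encode())
            os._exit(0)
        pids.append(pid)

    # Parent: close its write end so the read below sees end-of-file.
    os.close(write_fd)
    for pid in pids:
        os.waitpid(pid, 0)
    with os.fdopen(read_fd) as reports:
        return sorted(reports.read().splitlines())
```

Each child reports the same settings without re-running the
initialization, which is the property the remote fork approach keeps and
a launcher that starts every process from scratch does not.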

The single reason for not putting the master on a high speed network is:
   - Most switches have power-of-two port counts, e.g. 16, 64 or 128
     ports, and many applications want to run on a power-of-two
     processor count.
     The next switch size up often costs more than twice as much.
     [[ This is very appealing for pricing optimization, but consider
     that the first failure removes this advantage. ]]
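
A back-of-the-envelope sketch of that port arithmetic.  The function
names are made up; the switch sizes are the ones mentioned above, and one
port per node is assumed.

```python
def ports_needed(compute_nodes, master_on_fabric):
    """One switch port per compute node, plus one if the master is attached."""
    return compute_nodes + (1 if master_on_fabric else 0)


def smallest_switch(ports, sizes=(16, 64, 128)):
    """Smallest listed switch size with enough ports, or None if none fits."""
    for size in sorted(sizes):
        if size >= ports:
            return size
    return None


def largest_power_of_two_job(working_nodes):
    """Largest power-of-two process count that fits on the working nodes."""
    if working_nodes < 1:
        return 0
    return 1 << (working_nodes.bit_length() - 1)
```

With 64 compute nodes, leaving the master off the fabric fits a 64-port
switch, while attaching it needs 65 ports and so the 128-port size.  Once
one of the 64 nodes fails, only 63 remain and the largest power-of-two
job drops to 32 either way.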

There are many other reasons to put the master on the high speed
network, which I can (and will) go on about at length over a beer.
Almost all of the reasons are summarized as "Yes, we can do things the
way everyone else does, but that would be throwing out advances that I
personally consider really important for usability or performance".

> 2. How do I tweak the /etc/beowulf/config file to support this?

You may not need to do anything, except set the PXE server interface.
See the PXE parameter page in the Beosetup->Settings menu.
This sets the "pxeinterface" line in /etc/beowulf/config.

There is an opportunity that most users are not aware of.
The 'transportmode' keyword controls the underlying caching filesystem.
The default system uses TCP/IP/Ethernet to cache libraries and executables
from the master.  By plugging in different "get-file" programs you can
tune the system to make caching faster or more efficient.  By changing
the boot parameters those get-file programs can be directed to use an
alternate server, rather than the primary master.

> 3. Is it possible to pxeboot/dhcp on one interface, but issue bproc
> starts over the high speed interface?

That's exactly what the 'pxeinterface' configuration setting does.  The
original motivation was working with the Chiba City cluster at Argonne,
which had Myrinet on 32 out of every 33 nodes (uhhhhgggg -- an example
of "don't do this").  A later reinforcing motivation was motherboards
that had both Fast and Gigabit Ethernet ports, but could only PXE
boot from the Fast Ethernet port.

> It seems benchmarks like hpl (linpack) issue lots of Master->Slave
> communication in their normal operation. (This as opposed to pallas,
> which seems to do a lot of  slave<->slave communication..)
>
> This seems to imply that Linpack is somewhat bound to your rsh/ssh/bproc
> choice of spawning mpi apps.  Which seems flawed to me as it's not
> stressing mpi in this way. (comments?)
>
> The above seems to encourage using a higher speed interconnect
> from the head node to issue the bproc calls. (leaving the normal
> ethernet only for pxe and "management things" like stats)

If job spawning performance is a concern, there are better solutions
that we have prototyped.  The current method is
     generate a map for process placement,  map[0] .. map[N-1]
     the parent process remote forks to node map[1] .. map[N-1]
     if map[0] is not the master, the parent process moves to node map[0]
A slight variation is
     generate a map for process placement,  map[0] .. map[N-1]
     if map[0] is not the master, the parent process moves to node map[0]
        the parent process remote forks to node map[1] .. map[N-1]
     else
        the parent process remote forks to node map[1]
        that child remote forks to node map[2] .. map[N-1]
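
The two orderings above can be sketched as plans, i.e. lists of steps,
without touching any real BProc call.  Everything here is illustrative:
the function names, the step tuples, and the use of -1 as the master's
node number are assumptions made for the sketch.

```python
MASTER = -1  # assumed node number for the master, for illustration only


def plan_current(placement, master=MASTER):
    """Current method: fork everything from the master, then move rank 0."""
    steps = [("rfork", master, node) for node in placement[1:]]
    if placement[0] != master:
        steps.append(("move", master, placement[0]))
    return steps


def plan_variation(placement, master=MASTER):
    """Variation: get off the master first, or delegate forking to a child."""
    steps = []
    if placement[0] != master:
        # The parent moves first, then forks from its new node.
        steps.append(("move", master, placement[0]))
        steps += [("rfork", placement[0], node) for node in placement[1:]]
    elif len(placement) > 1:
        # The parent stays on the master and forks once; that child
        # forks the remaining processes.
        steps.append(("rfork", master, placement[1]))
        steps += [("rfork", placement[1], node) for node in placement[2:]]
    return steps


def forks_from_master(steps, master=MASTER):
    """Count the remote forks that originate on the master."""
    return sum(1 for action, source, _ in steps
               if action == "rfork" and source == master)
```

For a placement of [-1, 0, 1, 2] the current method issues three remote
forks from the master and the variation issues one.  For [0, 1, 2] the
variation issues none from the master, since the parent moves first.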

> The "interface" keyword in the config coupled with the
> "pxeinterface" keyword seems to encourage this type of setup,
> but I find that if "interface foo" is set, the pxeserver doesn't
> want to restart if the iprange command doesn't map to the IP
> subnet on interface foo.  (suggesting that the dhcp functionality
> wants to bind to "foo" and not the given pxeinterface)

That's not quite the way it works.

    interface <clusterif>	# Sets the cluster private network
    pxeinterface <pxeif>		# Enables true PXE

If the pxeinterface keyword is not used, the PXE server reverts to
DHCP+TFTP on the cluster interface.  This is slightly different than
true PXE.  It passes most of the PXE options to the client, but doesn't
use an intermediate "port 4011" agent.

If <pxeif> is the same as <clusterif>, the server behavior changes to
true PXE protocol on the cluster private network.

If <pxeif> is a different interface, that interface is assumed to be up
and assigned a valid IP address.  That IP address is typically *not* in
the cluster IP address range.
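
The three cases can be written down as a small decision function.  This
is a sketch of the rules as described here, not the actual server code;
the function names, the returned labels, and the interface names in the
test are made up, and the parser only looks at the two keywords involved.

```python
def read_interfaces(config_text):
    """Pick the 'interface' and 'pxeinterface' settings out of config text."""
    settings = {}
    for raw_line in config_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop trailing comments
        fields = line.split()
        if len(fields) >= 2 and fields[0] in ("interface", "pxeinterface"):
            settings[fields[0]] = fields[1]
    return settings


def boot_mode(settings):
    """Return (mode, interface) for the boot service, per the rules above."""
    cluster_if = settings.get("interface")
    pxe_if = settings.get("pxeinterface")
    if cluster_if is None:
        raise ValueError("no cluster 'interface' line found")
    if pxe_if is None:
        # No pxeinterface: DHCP+TFTP on the cluster interface, no port 4011.
        return ("DHCP+TFTP", cluster_if)
    if pxe_if == cluster_if:
        return ("true PXE on cluster network", cluster_if)
    # A separate interface: it must already be up with a valid address.
    return ("true PXE on separate interface", pxe_if)
```

In the third case the function deliberately does not check the address
against the cluster IP range, since that address is typically outside it.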

A source of confusion here is a decision we made several years ago to
switch which part of the system controls network interfaces.

Originally our tools handled the network interface configuration,
including setting the IP address info and bringing the interface up and
down.  This worked very well.

Over time the de facto "standard" Linux tools became increasingly
insistent on managing the network interface settings.
Those administrative tools really, *really* want to control every
network connection, especially the automatic start-up, shutdown and
firewall configuration.

Thus our infrastructure now has to assume that the interfaces are
correctly configured for the cluster, and can only log a complaint and
quit if they are not configured or inconsistent.

> Thus "interface" must be the pxeinterface. (maybe someone's not parsing
> the pxeinterface command?)

The only known bug in this area is that older versions of Beosetup would
fail to read in an existing pxeinterface specification.  You would have
to set it by hand each time you started Beosetup.  That bug is now fixed.

BTW, a few of the PXE-related keywords are:

   nodeassign <manual|append|insert|locked>
   pxefileserver <serverip>
   pxebootfile [NODE-RANGE] <BOOTFILE>
   pxebootcfgfile [NODE-RANGE] <BOOTFILE>
   pxebootdir <DIR>
   pxeinterface <ifname>
   pxebandwidth <Mbpersecond>

  Syntax:
    pxefileserver <serverip>
  Use SERVERIP as the machine that serves the boot images.

  Syntax:
    pxebootfile [NODE-RANGE] <BOOTFILE>
    pxebootcfgfile [NODE-RANGE] <BOOTFILE>
  Use BOOTFILE as the node bootstrap program or configuration file.
  An unspecified NODE-RANGE means use this BOOTFILE as the default.

  Syntax:
    pxebootdir <DIR>
  Use DIR as the root directory for the TFTP file server.

  Syntax:
    pxeinterface <IFNAME>
  Use IFNAME as the network interface for the PXE server.  Note that this
  must be a physical interface, as the server watches for broadcast
  packets and responds with broadcast packets.

  Syntax:
    pxefileserver <serverip>
  Specify an alternate boot file server.  This instructs the booting
  machine to retrieve all subsequent files from the machine <serverip>.
  This is typically used only in very large cluster configurations, where
  the network load of booting machines may interfere with this master's
  operation.

  Syntax:
    pxebandwidth <Mbpersecond>
  Limit the bandwidth used by the boot subsystem to the integer value
  <Mbpersecond>.  Note that this is in bits, not bytes.  This value
  should not be set unless there is a specific performance problem noted
  while groups of new nodes are booting.
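
Since the bits-versus-bytes distinction is easy to get wrong by a factor
of eight, here is a small conversion sketch.  The function name is made
up, and it assumes "Mb" means 10^6 bits.

```python
def pxebandwidth_bytes_per_second(mbit_per_second):
    """Convert a pxebandwidth value (megabits/s) to bytes per second."""
    if not isinstance(mbit_per_second, int) or mbit_per_second <= 0:
        raise ValueError("pxebandwidth must be a positive integer")
    # Assumes 1 Mb = 1,000,000 bits, and 8 bits per byte.
    return mbit_per_second * 1_000_000 // 8
```

A setting of 100 therefore caps the boot subsystem at 12.5 million bytes
per second, not 100 million.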

Donald Becker				becker at scyld.com
Scyld Software	 			Scyld Beowulf cluster systems
914 Bay Ridge Road, Suite 220		www.scyld.com
Annapolis MD 21403			410-990-9993