[Beowulf] 512 nodes Myrinet cluster Challenges

Mark Hahn hahn at physics.mcmaster.ca
Tue May 2 14:20:26 PDT 2006


> > moving it, stripped them out as I didn't need them.  (I _do_ always require
> > net-IPMI on anything newly purchased.)  I've added more nodes to the cluster
> 
> Net-IPMI on all hardware?  Why? Running a second (or 3rd) network isn't
> a trivial amount of additional complexity, cables, or cost.  What do

I really like being able to reset nodes remotely, power them up and down,
fetch temperatures and fan speeds, etc.
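
for example, with a LAN-enabled BMC, a few ipmitool calls get you power state,
temperatures and fan speeds without touching the host OS.  a rough sketch
(python wrapper; the BMC hostname and credentials are just placeholders):

    # minimal sketch: poll one node's BMC over the net with ipmitool.
    # hostname/user/password below are placeholders, not real values.
    import subprocess

    AUTH = ["-I", "lan", "-H", "node001-ipmi", "-U", "admin", "-P", "secret"]

    def ipmi(*args):
        """run one ipmitool command against the BMC and return its output"""
        return subprocess.check_output(["ipmitool"] + AUTH + list(args)).decode()

    print(ipmi("chassis", "power", "status"))   # "Chassis Power is on" / "off"
    print(ipmi("sdr", "type", "Temperature"))   # board/CPU temperature sensors
    print(ipmi("sdr", "type", "Fan"))           # fan speed sensors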

> you figure you pay extra on the nodes (many vendors charge to add IPMI,
> sun, tyan, supermicro, etc), cables, switches, etc.  As a data point on
> a x2100 I bought recently the IPMI card was $150.

the IPMI add-in for many Tyan boards is a lot less than that ($50?),
and quite a few servers (such as the HP DL145 G2) already have it built in.

and it's not really a whole additional network, since each rack's worth of IPMI
ports can just go to an in-rack switch.  if you have 32-40 nodes/rack
with a better-than-ethernet interconnect, you've probably already got a
gigabit switch in the rack anyway, so all the extra gear stays in-rack.

> Seems like collecting fan speeds and temperatures in-band seems reasonable,
> after all much of the data you want to collect isn't available via IPMI
> anyways (cpu utilization, memory, disk I/O, etc.).

true.  though it's not clear to me how important those extras are to 
the kind of HPC cluster I run.  a job gets complete ownership of its 
CPUs (and usually multiple whole nodes), so it's quite unlike a 
load-balancing cluster, where you actually want realtime info on 
cpu or memory utilization.  load-balanced clusters are not unreasonable
when you have more cores per node, or perhaps for strictly serial
workloads.  for anything that's nontrivially parallel, the job
_must_ completely own all its resources, so there's really no reason 
to worry about unused memory on an already occupied node...
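
(and those in-band numbers are trivial to collect anyway when you do want
them; a rough sketch, just reading /proc on a node, no IPMI involved:

    # rough sketch: in-band stats IPMI doesn't carry, straight from /proc.
    def loadavg():
        with open("/proc/loadavg") as f:
            return float(f.read().split()[0])      # 1-minute load average

    def meminfo():
        info = {}
        with open("/proc/meminfo") as f:
            for line in f:
                key, value = line.split(":", 1)
                info[key] = int(value.split()[0])  # values are reported in kB
        return info

    print("load %.2f, %d kB free" % (loadavg(), meminfo()["MemFree"]))

that sort of collector is what the usual monitoring daemons run in-band.)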

> Upgrading a 208V 3-phase PDU to a switched PDU seems like it costs on the
> order of $30 per node list.  As a side benefit you get easy to query
> load per phase.

that's nice.  but it only lets you power up/down.  you can't do a 
warm reset, only hard power cycles, which shorten the hardware's life.
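
to make the distinction concrete (same sort of sketch as above, same
placeholder host and credentials):

    # sketch: what the BMC can do that an outlet-switched PDU can't.
    import subprocess

    AUTH = ["-I", "lan", "-H", "node001-ipmi", "-U", "admin", "-P", "secret"]

    def power(action):
        # action: one of status, on, off, cycle, reset, soft
        subprocess.check_call(["ipmitool"] + AUTH + ["chassis", "power", action])

    power("reset")    # warm reset: asserts the reset line, power stays applied
    # power("cycle")  # hard cycle: drops and restores power; the only option an
    #                 # outlet-switched PDU gives you, and the kind that wears parts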

> After dealing with a few clusters with PDUs in the airflow blocking
> airflow and physical access to parts of the node I now specify the
> zero-u variety that are outside the airflow.

that's nice.  HP's PDUs have breaker sections which consume about 
1U each, and a set of outlet bars which mount zero-U (but which 
have far too many, and too low-power, outlets).

interestingly, our racks are bayed together, which leaves enough space
for some airflow between racks.  unfortunately, Quadrics switches are
fairly narrow, so there's enough room for a noticeable counter-circulation.



