[Beowulf] 512 nodes Myrinet cluster Challenges
David Kewley
kewley at gps.caltech.edu
Fri Apr 28 14:10:35 PDT 2006
On Friday 28 April 2006 05:04, Mark Hahn wrote:
> > Does anyone know what types of problems/challenges come up for big clusters?
>
> cooling, power, managability, reliability, delivering IO, space.
I'd add: sysadmin or other professional resources to manage the cluster.
Certainly, the more manageable and reliable the cluster is, the less time
the admin(s) will have to spend simply keeping the cluster in good
health. But given manageability and reliability, the bigger issue is: How
many users and how many different codebases do you have? Given the variety
in individual needs, you can end up spending quite a bit of time helping
users get new code working well, and/or making adjustments to the cluster
software to accommodate their needs. At least this has been my experience.
I'm the only admin for a 1024-node cluster with 70+ authorized users (49
unique users in the past 31 days, about 30 of whom are frequent users, I'd
estimate), and probably a couple dozen user applications. Having other
non-sysadmin local staff helping me, as well as having good hardware and
software vendor support, has been critical to multiply the force I can
bring to bear in solving problems.
You know all those best practices you hear about when you're a sysadmin
managing a departmental network? Well, when you have a large cluster, best
practices become critical -- you have to arrange things so that you rarely
have to touch hardware or log in to individual nodes to fix problems.
That kind of attention to individual nodes takes far too much time away
from more productive pursuits, and will lead to lower cluster availability,
which means extra frustration and stress for you and your users.
A few elements of manageability that I use all the time:
* the ability to turn nodes on or off in a remote, scripted, customizable
manner
* the ability to reinstall the OS on all your nodes, or specific nodes,
trivially (e.g. as provided by Rocks or Warewulf)
* the ability to get remote console so you can fix problems without getting
out the crash cart -- hopefully you don't have to use this much (because
it means paying attention to individual systems), but when you need it, it
will speed up your work compared to the alternative
* the ability to gather and analyze node health information trivially, using
embedded hardware management tools and system software tools
* the ability to administratively close a node that has problems, so that
you can deal with the problem later, and meanwhile jobs won't get assigned
to it (see the sketch just after this list)
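To make the first and last of those concrete, here's a rough sketch of the
kind of wrapper I mean, assuming ipmitool-style BMC access and LSF's badmin
for closing hosts; the BMC hostname convention, account name, and password
file path are made up for illustration:

#!/usr/bin/env python
# Sketch: power-control or administratively close compute nodes from the
# head node.  Assumes each node "nodeNNN" has a BMC reachable as
# "nodeNNN-bmc" and that the scheduler is LSF; adjust for your site.
import subprocess
import sys

IPMI_USER = "admin"                 # hypothetical BMC account
IPMI_PASSFILE = "/root/.ipmipass"   # hypothetical password file

def ipmi_power(node, action):
    """Run 'chassis power <action>' (on, off, cycle, status) on a node's BMC."""
    subprocess.check_call([
        "ipmitool", "-I", "lanplus", "-H", node + "-bmc",
        "-U", IPMI_USER, "-f", IPMI_PASSFILE,
        "chassis", "power", action])

def close_node(node):
    """Tell the scheduler to stop dispatching jobs to this node."""
    subprocess.check_call(["badmin", "hclose", node])

if __name__ == "__main__":
    # usage:  nodectl.py <on|off|cycle|status|close> node017 [node018 ...]
    action = sys.argv[1]
    for node in sys.argv[2:]:
        if action == "close":
            close_node(node)
        else:
            ipmi_power(node, action)

The point isn't this particular script; it's that every such operation
should be a one-liner you can run against dozens of nodes without leaving
your chair.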
Think of your compute nodes not as individuals, but as indistinguishable
members of a Borg Collective. You shouldn't care very much about
individual nodes, but only about the overall health of the cluster. Is the
Collective running smoothly? If so, great -- make sure you don't have to
sweat the details very much.
> > we are considering having a 512 node cluster that will be using
> > Myrinet as its main interconnect, and would like to do our homework
I've had excellent experience with Myrinet, in terms of reliability,
functionality, and technical support. It's probably the most trouble-free
part of my cluster and my best overall vendor experience. Myrinet gets
used continuously by my users, but I rarely have to pay attention to it at
all.
> how confident are you at addressing especially the physical issues above?
> cooling and power happen to be prominent in my awareness right now
> because of a 768-node cluster I'm working on. but even ~200 node
> clusters need to have some careful thought applied to managability
> (cleaning up dead jobs, making sure the scheduler doesn't let jobs hang
> around consuming myrinet ports, for instance.) reliability is a fairly
> cut and dried issue, IMO - either you make the right hardware decisions
> at purchase, or not.
A few comments from my personal experience. On my cluster, perhaps 1 in
10,000 or 100,000 job processes ends up unkilled, taking up compute node
resources. It's not been a big problem for me, although it certainly does
come up. Generally the undead processes have been a handful out of a set
of processes that have something in common -- a bad run, a user doing
something weird, or some anomalous system state (e.g. central filesystem
going down).
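If you want to hunt for them, the check is conceptually simple: list the
processes on a node and flag any owned by a user who has no job currently
dispatched there. A rough sketch, assuming LSF's bjobs output format (the
column position and the system-account list are assumptions to adapt):

#!/usr/bin/env python
# Sketch: run on a compute node to flag processes owned by users who have
# no job currently dispatched to this node ("undead" candidates).
import socket
import subprocess

SYSTEM_USERS = set(["root", "daemon", "lsfadmin"])   # adjust for your site

def users_with_jobs_here(host):
    """Users with running jobs dispatched to this host, per the scheduler."""
    try:
        out = subprocess.check_output(
            ["bjobs", "-u", "all", "-r", "-m", host]).decode()
    except subprocess.CalledProcessError as e:
        out = e.output.decode()      # e.g. no unfinished jobs on this host
    users = set()
    for line in out.splitlines()[1:]:            # skip the header line
        fields = line.split()
        if len(fields) > 1:
            users.add(fields[1])                 # USER column
    return users

if __name__ == "__main__":
    allowed = users_with_jobs_here(socket.gethostname()) | SYSTEM_USERS
    ps = subprocess.check_output(
        ["ps", "-eo", "user:32,pid,comm", "--no-headers"]).decode()
    for line in ps.splitlines():
        user, pid, comm = line.split(None, 2)
        if user not in allowed:
            print("suspect process: %s %s %s" % (user, pid, comm))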
I've never had a problem with consumed Myrinet ports, but I'm sure that's
going to depend on the details of your local cluster usage patterns. Most
often the problem has been a job left spinning, consuming CPU and slowing
down legitimate jobs. If I configured my scheduler (LSF) properly, I'm pretty
sure I could avoid even that problem -- just set a threshold on CPU
idleness or load level. I *have* made a couple of scripts to find nodes
that are busier than they should be, or quieter than they should be, based
on the load that the scheduler has placed on them versus the load they're
actually carrying. That helps identify problems, and more frequently it
helps to give confidence that there *aren't* any problems. :)
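For what it's worth, the core of such a script is just a comparison of two
tables: the slots the scheduler thinks each node is running versus the load
the node is actually carrying. A minimal sketch, assuming LSF's default
bhosts and lsload column layouts (the column indices and the mismatch
threshold are assumptions to tune for your site):

#!/usr/bin/env python
# Sketch: compare the job slots LSF reports per node (bhosts NJOBS column)
# against the 1-minute load average it is actually carrying (lsload r1m
# column), and flag large mismatches in either direction.
import subprocess

THRESHOLD = 2.0   # flag nodes whose load differs from NJOBS by more than this

def column_map(command, key_col, val_col):
    """Parse a columnar command output into {key_column: float(value_column)}."""
    out = subprocess.check_output(command).decode()
    table = {}
    for line in out.splitlines()[1:]:          # skip header
        fields = line.split()
        if len(fields) > max(key_col, val_col):
            try:
                table[fields[key_col]] = float(fields[val_col])
            except ValueError:
                pass                           # e.g. '-' for unavailable hosts
    return table

if __name__ == "__main__":
    njobs = column_map(["bhosts"], 0, 4)       # HOST_NAME -> NJOBS
    load  = column_map(["lsload"], 0, 3)       # HOST_NAME -> r1m
    for host in sorted(set(njobs) & set(load)):
        diff = load[host] - njobs[host]
        if diff > THRESHOLD:
            print("%s busier than scheduled (load %.1f, njobs %d)"
                  % (host, load[host], njobs[host]))
        elif diff < -THRESHOLD:
            print("%s quieter than scheduled (load %.1f, njobs %d)"
                  % (host, load[host], njobs[host]))

Even a crude version of this, run periodically, catches both the spinners
and the nodes that are quieter than they should be.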
I'm not sure I agree with Mark that reliability is cut and dried, depending
only on initial hardware decisions. (Yes, I removed or changed a couple of
important qualifying words in there from what Mark wrote. :) Vendor
support methods are critical -- consider that part of the initial hardware
choice if you like. My point here is that it's hardware and vendor choice
taken together, not just hardware choice.
By the way, the idea of rolling-your-own hardware on a large cluster, and
planning on having a small technical team, makes me shiver in horror. If
you go that route, you better have *lots* of experience with clusters and
make very good decisions about cluster components and management methods.
If you don't, your users will suffer mightily, which means you will suffer
mightily too.
David