[Beowulf] 512 nodes Myrinet cluster Challenges
David Kewley
kewley at gps.caltech.edu
Fri Apr 28 14:10:35 PDT 2006
On Friday 28 April 2006 05:04, Mark Hahn wrote:
> > Does anyone know what types of problems/challenges come up for big clusters?
>
> cooling, power, managability, reliability, delivering IO, space.
I'd add: sysadmin or other professional resources to manage the cluster.
Certainly, the more manageable and reliable the cluster is, the less time
the admin(s) will have to spend simply keeping the cluster in good
health. But given manageability and reliability, the bigger issue is: How
many users and how many different codebases do you have? Given the variety
in individual needs, you can end up spending quite a bit of time helping
users get new code working well, and/or making adjustments to the cluster
software to accommodate their needs. At least this has been my experience.
I'm the only admin for a 1024-node cluster with 70+ authorized users (49
unique users in the past 31 days, about 30 of whom are frequent users, I'd
estimate), and probably a couple dozen user applications. Having other
non-sysadmin local staff helping me, as well as having good hardware and
software vendor support, has been critical to multiply the force I can
bring to bear in solving problems.
You know all those best practices you hear about when you're a sysadmin
managing a departmental network? Well, when you have a large cluster, best
practices become critical -- you have to arrange things so that you rarely
have to touch hardware or log in to individual nodes to fix problems.
That kind of attention to individual nodes takes far too much time away
from more productive pursuits, and will lead to lower cluster availability,
which means extra frustration and stress for you and your users.
A few elements of manageability that I use all the time:
* the ability to turn nodes on or off in a remote, scripted, customizable
manner
* the ability to reinstall the OS on all your nodes, or specific nodes,
trivially (e.g. as provided by Rocks or Warewulf)
* the ability to get remote console so you can fix problems without getting
out the crash cart -- hopefully you don't have to use this much (because
it means paying attention to individual systems), but when you need it, it
will speed up your work compared to the alternative
* the ability to gather and analyze node health information trivially, using
embedded hardware management tools and system software tools
* the ability to administratively close a node that has problems, so that
you can deal with the problem later, and meanwhile jobs won't get assigned
to it (see the sketch just after this list)
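To make the first and last of those concrete, here's a rough sketch of the
kind of wrapper I mean, assuming ipmitool-style BMC access and LSF's badmin
for closing hosts; the BMC hostname convention, account name, and password
file path are made up for illustration:

#!/usr/bin/env python
# Sketch: power-control or administratively close compute nodes from the
# head node.  Assumes each node "nodeNNN" has a BMC reachable as
# "nodeNNN-bmc" and that the scheduler is LSF; adjust for your site.
import subprocess
import sys

IPMI_USER = "admin"                 # hypothetical BMC account
IPMI_PASSFILE = "/root/.ipmipass"   # hypothetical password file

def ipmi_power(node, action):
    """Run 'chassis power <action>' (on, off, cycle, status) on a node's BMC."""
    subprocess.check_call([
        "ipmitool", "-I", "lanplus", "-H", node + "-bmc",
        "-U", IPMI_USER, "-f", IPMI_PASSFILE,
        "chassis", "power", action])

def close_node(node):
    """Tell the scheduler to stop dispatching jobs to this node."""
    subprocess.check_call(["badmin", "hclose", node])

if __name__ == "__main__":
    # usage:  nodectl.py <on|off|cycle|status|close> node017 [node018 ...]
    action = sys.argv[1]
    for node in sys.argv[2:]:
        if action == "close":
            close_node(node)
        else:
            ipmi_power(node, action)

The point isn't this particular script; it's that every such operation
should be a one-liner you can run against dozens of nodes without leaving
your chair.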
Think of your compute nodes not as individuals, but as indistinguishable
members of a Borg Collective. You shouldn't care very much about
individual nodes, but only about the overall health of the cluster. Is the
Collective running smoothly? If so, great -- make sure you don't have to
sweat the details very much.
> > we are considering having a 512 node cluster that will be using
> > Myrinet as its main interconnect, and would like to do our homework
I've had excellent experience with Myrinet, in terms of reliability,
functionality, and technical support. It's probably the most trouble-free
part of my cluster and my best overall vendor experience. Myrinet gets
used continuously by my users, but I rarely have to pay attention to it at
all.
> how confident are you at addressing especially the physical issues above?
> cooling and power happen to be prominent in my awareness right now
> because of a 768-node cluster I'm working on. but even ~200 node
> clusters need to have some careful thought applied to managability
> (cleaning up dead jobs, making sure the scheduler doesn't let jobs hang
> around consuming myrinet ports, for instance.) reliability is a fairly
> cut and dried issue, IMO - either you make the right hardware decisions
> at purchase, or not.
A few comments from my personal experience. On my cluster, perhaps 1 in
10,000 or 100,000 job processes ends up unkilled, taking up compute node
resources. It's not been a big problem for me, although it certainly does
come up. Generally the undead processes have been a handful out of a set
of processes that have something in common -- a bad run, a user doing
something weird, or some anomalous system state (e.g. central filesystem
going down).
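If you want to hunt for them, the check is conceptually simple: list the
processes on a node and flag any owned by a user who has no job currently
dispatched there. A rough sketch, assuming LSF's bjobs output format (the
column position and the system-account list are assumptions to adapt):

#!/usr/bin/env python
# Sketch: run on a compute node to flag processes owned by users who have
# no job currently dispatched to this node ("undead" candidates).
import socket
import subprocess

SYSTEM_USERS = set(["root", "daemon", "lsfadmin"])   # adjust for your site

def users_with_jobs_here(host):
    """Users with running jobs dispatched to this host, per the scheduler."""
    try:
        out = subprocess.check_output(
            ["bjobs", "-u", "all", "-r", "-m", host]).decode()
    except subprocess.CalledProcessError as e:
        out = e.output.decode()      # e.g. no unfinished jobs on this host
    users = set()
    for line in out.splitlines()[1:]:            # skip the header line
        fields = line.split()
        if len(fields) > 1:
            users.add(fields[1])                 # USER column
    return users

if __name__ == "__main__":
    allowed = users_with_jobs_here(socket.gethostname()) | SYSTEM_USERS
    ps = subprocess.check_output(
        ["ps", "-eo", "user:32,pid,comm", "--no-headers"]).decode()
    for line in ps.splitlines():
        user, pid, comm = line.split(None, 2)
        if user not in allowed:
            print("suspect process: %s %s %s" % (user, pid, comm))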
I've never had a problem with consumed Myrinet ports, but I'm sure that's
going to depend on the details of your local cluster usage patterns. Most
often the problem has been a job left spinning, consuming CPU and slowing
down legitimate jobs. If I configured my scheduler (LSF) properly, I'm pretty
sure I could avoid even that problem -- just set a threshold on CPU
idleness or load level. I *have* made a couple of scripts to find nodes
that are busier than they should be, or quieter than they should be, based
on the load that the scheduler has placed on them versus the load they're
actually carrying. That helps identify problems, and more frequently it
helps to give confidence that there *aren't* any problems. :)
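For what it's worth, the core of such a script is just a comparison of two
tables: the slots the scheduler thinks each node is running versus the load
the node is actually carrying. A minimal sketch, assuming LSF's default
bhosts and lsload column layouts (the column indices and the mismatch
threshold are assumptions to tune for your site):

#!/usr/bin/env python
# Sketch: compare the job slots LSF reports per node (bhosts NJOBS column)
# against the 1-minute load average it is actually carrying (lsload r1m
# column), and flag large mismatches in either direction.
import subprocess

THRESHOLD = 2.0   # flag nodes whose load differs from NJOBS by more than this

def column_map(command, key_col, val_col):
    """Parse a columnar command output into {key_column: float(value_column)}."""
    out = subprocess.check_output(command).decode()
    table = {}
    for line in out.splitlines()[1:]:          # skip header
        fields = line.split()
        if len(fields) > max(key_col, val_col):
            try:
                table[fields[key_col]] = float(fields[val_col])
            except ValueError:
                pass                           # e.g. '-' for unavailable hosts
    return table

if __name__ == "__main__":
    njobs = column_map(["bhosts"], 0, 4)       # HOST_NAME -> NJOBS
    load  = column_map(["lsload"], 0, 3)       # HOST_NAME -> r1m
    for host in sorted(set(njobs) & set(load)):
        diff = load[host] - njobs[host]
        if diff > THRESHOLD:
            print("%s busier than scheduled (load %.1f, njobs %d)"
                  % (host, load[host], njobs[host]))
        elif diff < -THRESHOLD:
            print("%s quieter than scheduled (load %.1f, njobs %d)"
                  % (host, load[host], njobs[host]))

Even a crude version of this, run periodically, catches both the spinners
and the nodes that are quieter than they should be.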
I'm not sure I agree with Mark that reliability is cut and dried, depending
only on initial hardware decisions. (Yes, I removed or changed a couple of
important qualifying words in there from what Mark wrote. :) Vendor
support methods are critical -- consider that part of the initial hardware
choice if you like. My point here is that it's hardware and vendor choice
taken together, not just hardware choice.
By the way, the idea of rolling-your-own hardware on a large cluster, and
planning on having a small technical team, makes me shiver in horror. If
you go that route, you better have *lots* of experience with clusters and
make very good decisions about cluster components and management methods.
If you don't, your users will suffer mightily, which means you will suffer
mightily too.
David