[Beowulf] [OT] HPC and University IT - forum/mailing list?
hahn at physics.mcmaster.ca
Tue Aug 15 14:50:03 PDT 2006
> - integration of a cluster into a larger University IT infrastructure
> (storage, authentication, policies, et al.)
just say no. we consciously avoid taking any integration steps that
would involve the host institution trusting us or vice versa. well,
not quite - we managed to live for ~5 years with our machines dual-homed,
network-wise (albeit only ssh access.) but most of the (now 16) institutions
are now beginning to worry (though we've _not_ had any security breaches).
IMO, we'll continue to avoid any intimate integration, trust-wise,
and stick to ssh, possibly augmented with an http-based user portal.
the goal of loose coupling is _not_ inherently in conflict with good
design (usability, robustness, manageability, cost-effectiveness.)
> - funding models (central funds, grant based) both for equipment and
yeah, our primary funding is pure-hardware, and we have to do some funny
book-keeping to get people money out of it. (especially people other than
"direct infrastructure support" - for instance, we have parallel programming
consultants, who are not sysadmins, and who can potentially participate in
research. the main funding agency will support sysadmins, but not these.)
> - centralized research IT versus local departmental/school support.
IMO, centralization breeds contempt ;)
the institution where I sit does have central IT, but it's mostly limited
to providing network/phones and groups that do the "business" side of
things like grades, money, registration. pretty much everything
research-related (incl a prof's desktop) is handled outside the central IT. most
depts have some local experts, but a lot of the expertise is in a contract
consulting group which reports to the VP-research. this seems to lead to
a more responsive organization (though that could easily be a matter of
people, history, innate niceness of Canadians, etc ;)
> - education and training
we run classes/seminars/symposia/etc several times a year at each site.
mostly, it doesn't seem too much to expect people to pick up the basics
with little help (use an ssh client, choose a text editor, here's how to
compile, and submit, where to store your files.) it's not all that clear
how HPC fits into curricula, undergrad or higher. we have relatively little
involvement by CS-ish people, and lots from in-domain expert researchers.
we're just starting to figure out how to put some for-credit HPC training
into the computational streams of some depts (physics, etc).
> - deployment issues (who pays F&M?)
facilities and maintenance? it's fuzzy for us. I have a compressor in
a liebert reporting low suction right now, and I'm not sure who's going
to fix it. technically, the hosting university owns the equipment, and
signed a letter stating that they'd provide infrastructural support.
our compute hardware is bought with 3yr NBD onsite support. is chiller
maintenance different from server maintenance?
> - sustainability and growth
donno. our first round of hardware was installed in 2001 and was
based on ~400p of alphas. some smallish refreshes happened, including
1-200p clusters, but the main refresh was installed in 1q06 (~7k cpus,
mostly opterons.) (and all the new stuff was instantly full, of course).
but I wouldn't claim a 75% annual growth rate was sustainable, and
I'm not sure how I'd plan for the next round. since I'm a technologist,
my ideal would be a constant sampling of good-looking parts, and when
the time is right, move quickly to buy a substantial facility. for instance,
2005 was a very bad year for most of our users, since we actually had
_fewer_ cycles available due to renovations. pipelining the acquisitions
would have been a lot better. not to mention that you really want to respond
to products, not be driven entirely by funding cycles. there's a sweet-spot
for buying any product, after the initial wrinkles are clarified, and when
its unique properties are still unique. what we're learning about Core2
right now, for instance, is quite fascinating, and should influence anyone
buying a cluster in the next 6-9 months. after that, perhaps AMD K8L will
need to be considered. interconnect-wise, InfiniPath seems to be still
quite a wise choice, but perhaps the 10G market will finally do something...
that said, it's entirely possible to sustain a "rolling cluster": start
with one generation, and incrementally move it forward. this is easiest
if you have standard parts (plain old ethernet, IPMI, PXE, x86, 110/220
auto-sensing PS, 1U). people will still use old hardware, if you make
it available and easy.
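
for anyone who wants to sanity-check the ~75% annual growth figure above, here is a minimal sketch of the arithmetic. it assumes the rough numbers quoted in this message (~400 processors in 2001, ~7000 cpus in early 2006) and treats the gap as 5 years; the function names are made up for illustration and are not taken from any tool we use.

```python
# rough check of the annual growth figure, using the approximate
# counts from the message (assumed: ~400 in 2001, ~7000 in 1q06)

def annual_growth_rate(start_count, end_count, years):
    """Compound annual growth rate as a fraction (0.75 == 75% per year)."""
    return (end_count / start_count) ** (1.0 / years) - 1.0

def projected_count(start_count, rate, years):
    """Count after compounding `rate` for `years` years."""
    return start_count * (1.0 + rate) ** years

# 5 years is an assumption (2001 to early 2006)
rate = annual_growth_rate(400, 7000, 5)
print(f"implied annual growth: {rate:.0%}")

# what another 5 years at the same rate would require
future = projected_count(7000, rate, 5)
print(f"cpus after 5 more years at that rate: {future:,.0f}")
```

under those assumptions the implied rate comes out in the high 70s percent per year, and holding it for another five years would mean a machine more than an order of magnitude larger again, which is why it doesn't look sustainable.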