[Beowulf] [OT] HPC and University IT - forum/mailing list?

Tue Aug 15 14:50:03 PDT 2006

> - integration of a cluster into a larger University IT infrastructure 
> (storage, authentication, policies, et. al.)

just say no.  we consciously avoid taking any integration steps that 
would involve the host institution trusting us or vice versa.  well, 
not quite - we managed to live for ~5 years with our machines dual-homed,
network-wise (albeit only ssh access.)  but most of the (now 16) institutions 
are now beginning to worry (though we've _not_ had any security breaches).

IMO, we'll continue to avoid any intimate integration, trust-wise,
and stick to ssh, possibly augmented with an http-based user portal.

the goal of loose coupling is _not_ inherently in conflict with good 
design (usability, robustness, manageability, cost-effectiveness.)

> - funding models (central funds, grant based) both for equipment and 
> personnel.

yeah, our primary funding is pure-hardware, and we have to do some funny
book-keeping to get people money out of it.  (especially people other than
"direct infrastructure support" - for instance, we have parallel programming
consultants, who are not sysadmins, and who can potentially participate in 
research.  the main funding agency will support sysadmins, but not these.)

> - centralized research IT versus local departmental/school support.

IMO, centralization breeds contempt ;)
the institution where I sit does have central IT, but it's mostly limited
to providing network/phones and groups that do the "business" side of
things like grades, money, registration.  pretty much everything research-
related (incl a prof's desktop) is handled outside the central IT.  most 
depts have some local experts, but a lot of the expertise is in a contract
consulting group which reports to the VP-research.  this seems to lead to 
a more responsive organization (though that could easily be a matter of 
people, history, innate niceness of Canadians, etc ;)

> - education and training

we run classes/seminars/symposia/etc several times a year at each site.
mostly, it doesn't seem too much to expect people to pick up the basics
with little help (use an ssh client, choose a text editor, here's how to 
compile and submit jobs, where to store your files.)  it's not all that clear
how HPC fits into curricula, undergrad or higher.  we have relatively little
involvement by CS-ish people, and lots from in-domain expert researchers.
we're just starting to figure out how to put some for-credit HPC training
into the computational streams of some depts (physics, etc).

> - deployment issues (who pays F&M?)

facilities and maintenance?  it's fuzzy for us.  I have a compressor in 
a liebert reporting low suction right now, and I'm not sure who's going
to fix it.  technically, the hosting university owns the equipment, and 
signed a letter stating that they'd provide infrastructural support.
our compute hardware is bought with 3yr NBD onsite support.  is chiller
maintenance different from server maintenance?

> - sustainability and growth

donno.  our first round of hardware was installed in 2001 and was 
based on ~400p of alphas.  some smallish refreshes happened, including
1-200p clusters, but the main refresh was installed in 1q06 (~7k cpus,
mostly opterons.)  (and all the new stuff was instantly full, of course).
but I wouldn't claim a 75% annual growth rate was sustainable, and 
I'm not sure how I'd plan for the next round.  since I'm a technologist,
my ideal would be a constant sampling of good-looking parts, and when 
the time is right, move quickly to buy a substantial facility.  for instance,
2005 was a very bad year for most of our users, since we actually had 
_fewer_ cycles available due to renovations.  pipelining the acquisitions
would have been a lot better.  not to mention that you really want to respond
to products, not be driven entirely by funding cycles.  there's a sweet-spot
for buying any product, after the initial wrinkles are ironed out, and when 
its unique properties are still unique.  what we're learning about Core2
right now, for instance, is quite fascinating, and should influence anyone
buying a cluster in the next 6-9 months.  after that, perhaps AMD K8L will
need to be considered.  interconnect-wise, InfiniPath seems to be still
quite a wise choice, but perhaps the 10G market will finally do something...
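
for anyone who wants to check the growth figure: a minimal sketch of the 
compound-annual-growth arithmetic, using the ~400p (2001) and ~7k cpu (1q06)
numbers above.  the 5-year span is an assumption (the exact install dates
aren't given), and the function name is just for illustration.

```python
def annual_growth_rate(start, end, years):
    """Compound annual growth rate as a fraction (0.75 means 75% per year)."""
    return (end / start) ** (1.0 / years) - 1.0

# ~400 processors in 2001, ~7000 cpus in 1q06.
# The elapsed time of 5.0 years is an assumption, not a figure from the post.
rate = annual_growth_rate(400, 7000, 5.0)
print(f"approx. annual growth: {rate:.0%}")
```

stretching the span to a bit over 5 years pulls the result down toward 75%,
so the figure is only as good as the dates you plug in.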

that said, it's entirely possible to sustain a "rolling cluster": start 
with one generation, and incrementally move it forward.  this is easiest 
if you have standard parts (plain old ethernet, IPMI, PXE, x86, 110/220
auto-sensing PS, 1U).  people will still use old hardware, if you make 
it available and easy.
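
a rough sketch of what "rolling" means in bookkeeping terms: each year you 
add a batch of nodes and retire whatever has aged out.  the batch size (64
nodes), the 4-year lifetime, and the function name are all made-up values
for illustration, not anything we actually run.

```python
def roll_cluster(fleet, year, new_nodes, max_age):
    """Return the fleet after retiring aged-out batches and adding this year's buy.

    fleet is a list of (purchase_year, node_count) batches.
    """
    # Keep only batches younger than max_age years, then append the new purchase.
    kept = [(y, n) for (y, n) in fleet if year - y < max_age]
    kept.append((year, new_nodes))
    return kept

# Hypothetical numbers: buy 64 nodes every year, retire batches at 4 years old.
fleet = []
for year in range(2001, 2007):
    fleet = roll_cluster(fleet, year, new_nodes=64, max_age=4)

total_nodes = sum(n for _, n in fleet)
print(fleet, total_nodes)
```

the point is that capacity never drops to zero during a refresh: there are 
always several generations in service at once.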