[Beowulf] Re: GPU boards and cluster servers.
Mark Hahn
hahn at mcmaster.ca
Tue Sep 9 15:41:01 PDT 2008
>> I _do_ wish it was a bit more common to have onsite spares. not sure
>> why vendors (HP at least) don't like to do this. maybe just that it
>> might get kicked around or otherwise abused...
>
> You don't have your own spares kit? For big clusters like yours, it
> doesn't cost much.
could be we don't know how to ask; I'm not aware of HP actually
offering such a kit, and I don't know how much we'd be willing to pay
for one.
it is an interesting question: not just how much does downtime cost you,
but what are the kinds of failures you see and expect? our clusters
have been remarkably robust, in spite of having pretty mundane hardware.
plain old SATA disks, for instance. we have several instances (sites)
with a ~400-disk filesystem, and I think we're at around a 1-2% annual
failure rate. we use raid6, but spares for those disks are the most
obvious thing I'd want. the failure rates for PSUs, motherboards, DIMMs,
etc. are quite a lot lower (maybe 2 PSUs out of 768 nodes per year.)
OTOH, most of this hardware is approaching its third birthday. magic
warranty-related number there :|
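
for a rough sense of what those rates mean for a spares shelf, here is a
minimal Python sketch. the failure counts and rates are the ones quoted
above; the Poisson model (independent failures at a constant rate), the
95% coverage target, and the function names are illustrative assumptions,
not anything measured on these clusters. it also assumes the shelf is not
restocked during the year, which overstates what you'd need if
replacements arrive promptly.

```python
import math


def expected_failures(units: int, annual_failure_rate: float) -> float:
    """Mean number of failures per year for a population of identical parts."""
    return units * annual_failure_rate


def spares_needed(units: int, annual_failure_rate: float,
                  coverage: float = 0.95) -> int:
    """Smallest spare count k such that P(failures in a year <= k) >= coverage.

    Assumes failures follow a Poisson distribution with mean
    units * annual_failure_rate (an assumption, not a measured fact).
    """
    lam = expected_failures(units, annual_failure_rate)
    k = 0
    term = math.exp(-lam)   # P(exactly 0 failures)
    cdf = term
    while cdf < coverage:
        k += 1
        term *= lam / k     # P(exactly k failures) from P(exactly k - 1)
        cdf += term
    return k


if __name__ == "__main__":
    # ~400 disks per site at a 1-2% annual failure rate (figures from the post)
    for afr in (0.01, 0.02):
        print(f"disks, AFR {afr:.0%}: "
              f"mean {expected_failures(400, afr):.1f}/yr, "
              f"spares for 95% coverage: {spares_needed(400, afr)}")

    # roughly 2 PSU failures per 768 nodes per year (figure from the post)
    psu_afr = 2 / 768
    print(f"PSU AFR: {psu_afr:.2%}, "
          f"spares for 95% coverage: {spares_needed(768, psu_afr)}")
```

the mean works out to 4-8 disk failures per site per year, and the PSU
rate to roughly a quarter of a percent, so a disk shelf is where an
onsite kit would earn its keep first.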