[Beowulf] A couple of interesting comments
Gerry Creager
gerry.creager at tamu.edu
Wed Sep 24 07:33:57 PDT 2008
Prentice Bisbal wrote:
> Oops. e-mailed to the wrong address. The cat's out of the bag now! No
> big deal. I was 50/50 about CC-ing the list, anyway. Just remove the
> phrase "off-list" in the first sentence, and that last bit about not
> posting to the list because...
>
> Great. I'll never get a job that requires security clearance now! ;)
>
> --
> Prentice <---- still can't figure out how to use e-mail properly
Recently, I was proven to be unable to handle spreadsheets. That can be
embarrassing when I claim to be able to manage and write numerical models...
> Prentice Bisbal wrote:
>> Gerry,
>>
>> I wanted to let you know off-list that I'm going through the same
>> problems right now. I thought you'd like to know you're not alone. We
>> purchased a cluster from *allegedly* the same vendor. The PXE boot and
>> keyboard errors were the least of our problems.
>>
>> First, our cluster was delayed 2 months due to shortages of the network
>> hardware we specified. It was not the vendor's standard for clustering,
>> but it was still a brand they resold.
>>
>> When it did arrive, the doors were damaged by the inadequately equipped
>> delivery co.
>>
>> When the technician arrived to finish setting up the cluster, he
>> discovered that the IB cables provided were too short to be installed
>> within spec: the bend radius would have been too tight, and there was not
>> enough slack to support them from above the connectors.
>>
>> And, the final problem I'm going to mention: the fiber network cables to
>> connect our ethernet switches to each other (we have Ethernet and IB
>> networks in this cluster) were missing.
>>
>> It's been over two weeks since our cluster arrived, and one week since
>> the technician noticed these shortages and reported them. We still haven't
>> had these problems rectified, and the technician will have to fly to our
>> site again in a couple of weeks to complete the installation.
>>
>> I'm writing an article about this experience for Doug to publish. I
>> haven't posted this to the mailing list b/c I'm not sure what my
>> management will be happy with me sharing (the article will be reviewed
>> by them before publishing).
I'll add that we paid for next-day service, but I continue to be amazed
that this means Matt or I have to evaluate and troubleshoot a node
before the vendor will send out service. Somehow, "next business day"
can be dragged out a few more days.
Our iSCSI cables were only partially delivered, but we were told we'd
gotten what the vendor interpreted to be the right number; we bought more,
and it only took a week or so to get them in. We also discovered that the
RAID shelves we'd gotten were not hardware-RAID6-capable, even though the
RFQ specifically called out RAID6-capable hardware, so we're doing JBOD
with software RAID6 (our experience has proven that we NEED RAID6). When
we enquired about returning the RAID shelves, we were told that wasn't a
possibility.
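For the curious, the software-RAID6 side of that is straightforward; a
rough sketch with mdadm, where the device names and disk count are
placeholders rather than our actual shelf layout:

    # Build a RAID6 array across eight disks presented as JBOD.
    # RAID6 survives two simultaneous disk failures, which is the
    # property we're not willing to give up.
    mdadm --create /dev/md0 --level=6 --raid-devices=8 /dev/sd[b-i]

    # Watch the initial sync and confirm the array state.
    cat /proc/mdstat
    mdadm --detail /dev/md0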
My impression is that the vendor is well-suited to small-to-medium
business clusters, but unfamiliar, overall, with how things work in the
*nix world (I know there are exceptions). I am concerned that each
of our compute nodes is, to them, just another webserver, and if it's
mission critical, we should have bought all sorts of additional services
and a shelf-spare server. Or maybe we should just virtualize (yeah!
that's the ticket! a virtual HPC cluster?).
We're starting to look again for HPC resources, but I doubt they'll be
asked to bid.
gerry
>>> We recently purchased a set of hardware for a cluster from a hardware
>>> vendor. We've encountered a couple of interesting issues with bringing
>>> the thing up that I'd like to get group comments on. Note that the RFP
>>> and negotiations specified this system was for a cluster installation,
>>> so there would be no misunderstanding...
>>>
>>> 1. We specified "No OS" in the purchase so that we could install CentOS
>>> as our base. We got a set of systems with a stub OS, and an EULA for
>>> the diagnostics embedded on the disk. After clicking through the EULA, it
>>> tells us we have no OS on the disk, but it does not fail over to PXE.
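(For context, here's roughly what we intended the nodes to do: PXE-boot
straight into a CentOS kickstart install. A minimal pxelinux.cfg/default
sketch; the paths and the kickstart URL are placeholders:

    DEFAULT centos-install
    LABEL centos-install
        KERNEL centos/vmlinuz
        APPEND initrd=centos/initrd.img ks=http://10.0.0.1/ks/node.cfg

None of this can run, of course, while the stub OS keeps winning the
boot order.)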
>>>
>>> 2. The BIOS had a couple of interesting defaults, including "warn on
>>> keyboard error" (Keyboard? Not intentionally. This is a compute node,
>>> and should never require a keyboard. Ever.) We also found the BIOS is
>>> set to boot from hard disk THEN PXE. But due to item 1, above, we can
>>> never fail over to PXE unless we hook up a keyboard and monitor and hit
>>> F12 to drop to PXE.
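A follow-up on item 2: one workaround is to make the disk itself
unbootable so the BIOS falls through to PXE on its own. A rough sketch,
assuming the stub OS lives on /dev/sda and each node has a reachable BMC
(the hostname and credentials below are placeholders):

    # Zero the MBR boot code (first 446 bytes). The partition table is
    # left intact, but the disk no longer boots, so the BIOS moves on
    # to the next device: PXE.
    dd if=/dev/zero of=/dev/sda bs=446 count=1

    # Alternatively, force a one-time PXE boot remotely over IPMI,
    # no keyboard or monitor required:
    ipmitool -I lanplus -H node01-bmc -U admin -P secret chassis bootdev pxe
    ipmitool -I lanplus -H node01-bmc -U admin -P secret chassis power reset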
>>>
>>> In discussions with our sales rep, I'm told that we'd have had to pay
>>> extra to get a real bare hard disk, and that, for a fee, they'd have
>>> been willing to custom-configure the BIOS. OK, with the BIOS this isn't
>>> too unreasonable: they have a standard BIOS for all systems, and if you
>>> want something special, paying for it is the norm... But, still, this is
>>> a CLUSTER installation we were quoted, not a desktop.
>>>
>>> Also, I'm now told that "almost every customer" ordered their cluster
>>> configuration service at several kilobucks per rack. Since the team I'm
>>> working with has some degree of experience in configuring and installing
>>> hardware and software on computational clusters, now measured in at
>>> least 10 separate cluster installations, this seemed like an unnecessary
>>> expense. However, we're finding vendor gotchas that are annoying at the
>>> least, and sometimes cause significant work-around time/effort.
>>>
>>> Finally, our sales guy yesterday was somewhat baffled as to why we'd
>>> ordered without OS, and further why we were using Linux over Windows for
>>> HPC. Without trying to revive the recent rant-fest about Windows HPC
>>> capabilities: can anyone cite real HPC applications generally run under
>>> Windows on
>>> significant clusters (I'll accept Cornell's work, although I remain
>>> personally convinced that the bulk of their Windows HPC work has been
>>> dedicated to maintaining grant funding rather than doing real work)?
>>>
>>> No, I won't identify the vendor.
>>> --
>>> Gerry Creager -- gerry.creager at tamu.edu
>>> Texas Mesonet -- AATLT, Texas A&M University
>>> Cell: 979.229.5301 Office: 979.862.3982 FAX: 979.862.3983
>>> Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843
>
--
Gerry Creager -- gerry.creager at tamu.edu
Texas Mesonet -- AATLT, Texas A&M University
Cell: 979.229.5301 Office: 979.862.3982 FAX: 979.862.3983
Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843