[Beowulf] A bit OT - scientific workstations - recommendations
Douglas Eadline deadline at clustermonkey.net
Sun Mar 5 13:45:44 PST 2006
> On Fri, 3 Mar 2006, Joe Landman wrote:
>
>> Douglas Eadline wrote:
>>>> - 24/7/365 next day on site support
>>>
>>> Let's consider this idea in light of commodity hardware.
>>> (i.e. why I don't buy service contracts on light bulbs)
>>>
>>> Assuming that you can ship a node back for no cost repair within a
>>> warranty period, the question to ask is how many spare nodes
>>> can I buy for the price of a service contract for all the nodes
>>> in my cluster?
>>
>> We live in the era of disposable computing.
>>
>> If your business case demands that you have no down time, then you
>> engineer around that. If it demands you minimize costs, you need to
>> adjust your expectations on what you will get for those costs.
>
> Your metaphor is a bit strained, Doug. Nodes more often resemble
> employees more than they resemble light bulbs. Losing an employee, or
> even two, is rarely fatal to a business, but there are often highly
> nonlinear costs associated with things like an "epidemic" that affects
> your entire workforce.

Allow me to refine my metaphor: if nodes work as advertised, then they
can be treated like light bulbs. IMO, the situation you faced was due
to not getting what you expected. I assume a product should work as
stated. I also know that there is a "real world" kind of failure rate.
I was also assuming a 1 year "fix or repair" warranty.

> So, the issue is do you get health insurance for your employees that
> more or less guarantees that EVEN an epidemic will cost you only a day
> or so of downtime and a minimal amount of personal stress, or do you
> plan on "playing doctor" for your employees, one at a time, as they
> become ill, living with whatever downtime that process generates and
> paying the cost of their medicines out of pocket.

Well, I think Joe is right, if you are concerned about an epidemic,
then you need health insurance. However, I believe epidemics are
product defects and if a vendor sold you a node it should work.
Based on your experience, it would seem wise to negotiate a clause that
basically says "should the defect rate for any given part exceed X,
then there needs to be complete replacement at vendor expense." This is
similar to the "lemon laws" they have for cars (whoops, I mentioned a
car analogy). Now I admit there is a gray area where you can build your
own epidemic by building your own systems.

> [Hey, at least it isn't a car metaphor;-)]
>
> So Joe's observation is apropos. You engineer for your own particular
> perception of costs of downtime and willingness to accept risks,
> INCLUDING the substantial cost of your own time screwing around with
> things.

Sure, my definition of screwing around with things is putting it in a
box and sending it to the vendor for repair.

> Having lived through one "nightmare" cluster where every node had to be
> re-engineered on the fly, where every node had to have their bios
> reflashed to become semistable, where every node had to be "fixed",
> where every node had to have its cooling fans replaced -- I will not
> willingly do that again. We saved 10% or so on the purchase price, but
> paid it out 10 times over in a mix of direct costs for parts and human
> time.

Then I would say your "light bulbs" were defective.

> For a home cluster, a hobby cluster, a small prototyping cluster (<= 8
> nodes) sure, working without a net is reasonable. For a professional
> grade production cluster you CAN consider doing it and might have an
> entire career where it works for you. Or you could be OH SO SORRY on
> your very first time when the whole damn thing breaks and is out of its
> paltry 90 day mfrs warranty (assuming that you get the cheapest of
> commodity boxes while seeking to save maximal money:-).

Well, that is the issue. If the whole damn thing breaks, then IMO you
are off the bell curve.
With clusters in particular, there is a reasonable expectation that
nodes are independent and the failure of one or two nodes will not
bring down the cluster. If all the nodes fail for some pathological
reason, then, IMO, they never really worked to begin with.

My point is really that the cost of commodity hardware allows one to
re-evaluate the traditional "service contracts" model developed for
proprietary vendor hardware. Of course, it is always nice to have
someone whose job it is to listen to your problems.

Doug
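
The question quoted above (how many spare nodes can I buy for the price of a service contract on every node?) is simple arithmetic, and the independence assumption lets one estimate how often those spares would run out. Here is a rough sketch in Python. Every number in it (cluster size, node price, contract cost, failure probability) is a made-up placeholder, and the function names are hypothetical, not from any tool mentioned in this thread.

```python
from math import comb


def spares_for_contract_price(n_nodes, contract_cost_per_node, node_price):
    """Whole spare nodes purchasable with the money that a service
    contract covering every node in the cluster would cost."""
    return int((n_nodes * contract_cost_per_node) // node_price)


def prob_failures_exceed(n_nodes, p_fail, spares):
    """Probability that more than `spares` of `n_nodes` fail, ASSUMING
    each node fails independently with probability `p_fail`
    (a binomial tail)."""
    top = min(spares, n_nodes)  # cannot lose more nodes than exist
    p_at_most = sum(
        comb(n_nodes, k) * p_fail**k * (1 - p_fail) ** (n_nodes - k)
        for k in range(top + 1)
    )
    return 1.0 - p_at_most


# Hypothetical figures, for illustration only.
n_nodes = 64
node_price = 2000.0              # assumed price of one commodity node
contract_cost_per_node = 300.0   # assumed contract cost per node
p_fail = 0.05                    # assumed chance a node fails in the period

spares = spares_for_contract_price(n_nodes, contract_cost_per_node, node_price)
risk = prob_failures_exceed(n_nodes, p_fail, spares)
print(f"spares affordable: {spares}")
print(f"P(failures exceed spares): {risk:.4f}")
```

Note that the second function only covers the "bell curve" case. An epidemic, where every node shares the same defect, breaks the independence assumption, so this estimate says nothing about it.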
