[Beowulf] Re: Cooling vs HW replacement
Karen Shaeffer shaeffer at neuralscape.com
Thu Jan 27 00:48:44 PST 2005
On Sun, Jan 23, 2005 at 11:14:14PM -0800, Greg Lindahl wrote:
> On Mon, Jan 24, 2005 at 01:57:16AM -0500, Robert G. Brown wrote:
>
> > Otherwise, what I was basically doing is describing the bathtub
>
> Didn't look like that to me, but I just read your rants, I wasn't the
> guy who wrote them.
>
> > As was pointed out by Karen (and I agree) the mfr warranty period is
> > perhaps a better number for most people to pay attention to than MTBF
>
> I disagree. The warranty period tells you about disk lifetime. The
> MTBF tells you about the failure rate in the bottom of the
> bathtub. These are nearly independent quantities; I already pointed
> out that the fraction of disks which fail in the bottom of the bathtub
> is small, even if you multiply it by 2X or 3X. So the major factor in
> the price and length of that warranty is the lifetime.
>
> Lifetime and MTBF are simply different measures. Depending on what you
> are thinking about, you pay attention to one or the other or neither.

These numbers are defined by their collective usage in the industry. I accept your assertion about their definitions. But the MTBF number has no consequential significance to a disk drive manufacturer, and thus has poor confidence associated with it -- and I am going to explain why.

As stated previously, the disk drive business is an extremely high volume, low margin, technology intensive business. Product cycles last about 6 months. A typical disk drive comes out of development and ramps up from zero units to several million units within about 6 weeks. This is an operational miracle in and of itself, but it is standard business in this industry.

Now, I assert that the warranty period and the integration of the failure rate over the six month production run are the only issues the disk drive manufacturer (DDM) cares about. During this period, as production is in progress, the failure rate is dominated by the infant mortality of recently sold drives from this production run.
Even the first production drives sold are only 6 months into their lifecycle at the end of the production run. (Let's define the infant mortality time window to be a weighted 6 week period. This is the left wall of the bathtub.) The rate of failure among the drives that make it through the infant mortality window is a very small component compared to the drives that are failing during infant mortality. In other words, with several million drives a month being sold, this infant mortality of recently sold drives is the dominating term in the equation during the life of the production run.

Now, failed drives are classified in numerous ways by the DDM, but the most important issue is how long each drive lived in production. Was it an infant mortality death? If it was, then the DDM is very interested in it. If it survived the infant mortality failure time window, then the DDM has very little (if any) interest in it during the production run. (Someone asserted the DDM will take failed drives and determine the failure mode. This is only true of failures during infant mortality. Drives that fail in the bottom of the bathtub are generally thrown away and simply replaced. You would need to exceed the standard deviation for the MTBF significantly before the DDM would start analyzing the failure modes in this case. On the other hand, any perceived aberration in the expected infant mortality rate would start a fire drill.)

The point is, as in ALL mass produced products, and especially in semiconductors and disk drives, early detection of statistically significant failure modes is ABSOLUTELY ESSENTIAL to the profitability of the firm. If you have a production problem and don't figure it out until a million drives are out there in the market, then you have just lost a huge amount of money. I'm talking about a whole quarter's profits or worse.
If you have a role in such a disaster -- then your career in the DDM industry just ended, which is why everyone is keenly focused on the issues that matter.

In summary, for a specific disk drive, the bulk of the failed drives that the DDM will replace under warranty will in fact have failed during their infant mortality window, immediately after being put into production. If you integrate these over the product's production run and then calculate the drives that fail based on MTBF numbers associated with the bottom of the bathtub, you will find the latter is a minor term in the total number of failed drives during the warranted time. (This is clearly true in the normal case.) The DDM is entirely focused on these infant mortality failures, because they provide early warning for preventing large scale problems. (This includes the case where actual midlife failure rates would far exceed the projections one could expect based on MTBF. This is an essential element in this discussion. Please keep it in mind.)

The scrutiny of these infant mortality failures is intense. Any aberration from the expected numbers causes the whole production and engineering teams associated with the particular disk drive product to become available resources to resolve the problem. The time to resolve problems is counted in hours. It's that intense. Do the math. Several million units produced per month translates to 27,397 drives produced per hour. If you have a problem, it can become a disaster quickly. (Note the DDM actually puts several thousand drives in a QA lab about 6 weeks prior to shipping the first drive. So they actually have the initial statistical results of the infant mortality before shipping any drives.)

Now, once production ends, the infant mortality deaths drop off as soon as the supply of drives in the channel is all sold. At this point, there is nothing the DDM can do. The drives are out there.
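As a rough check of the production-rate arithmetic above, here is a minimal Python sketch. It assumes an average month of 730 hours (8760 / 12) and round-the-clock production; the monthly volumes are illustrative values, not manufacturer data. Under those assumptions, 27,397 drives per hour corresponds to about 20 million units per month, while a few million units per month works out to several thousand drives per hour. Either way, an undetected problem reaches a very large number of drives within a single day.

```python
# Assumption of this sketch: an average month has 8760 / 12 = 730 hours
# and production runs around the clock.
HOURS_PER_MONTH = 8760 / 12


def drives_per_hour(units_per_month: float) -> float:
    """Average hourly production for a given monthly volume."""
    return units_per_month / HOURS_PER_MONTH


# Illustrative monthly volumes only (hypothetical, not from any DDM).
for units in (3_000_000, 5_000_000, 20_000_000):
    rate = drives_per_hour(units)
    print(f"{units:>12,} units/month -> {rate:>8,.0f} drives/hour")
```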
Whether or not the total number of warranted drives that fail will have exceeded expectations is already known in almost all cases. All resources are now turned to the next product release. Nobody at the DDM gives a hoot about the MTBF numbers. They are not even discussed within the internal workings of the DDM business.

The POINT IS, even if the MTBF numbers are not accurate, and the drives fail at a much higher rate, there is nothing the DDM can actually do about it. Production is over. It is this reality that relegates these numbers to being nothing more than window dressing on marketing literature. And there are numerous products where the failure rates implied by the MTBF wildly understated the actual midlife failure rates -- where the DDM took a big loss. But the reaction, after the fact, would all be focused on why the early detection and component QA processes failed. It would not even consider how the MTBF numbers were derived, because they need to catch the problem early or it is not helpful.

So, now that we know what interests the DDM with respect to failed drives, let's consider the MTBF numbers that are published. As Greg and others have pointed out, these numbers speak of the rate of failure at the bottom of the bathtub. The definition is not in question here. The question is the confidence you can place in those published numbers. I and others have asserted you cannot place much confidence in these numbers, because they have no financial consequence to the DDM. (Except of course if they are wildly wrong -- which brings with it the particular problem of being too late to do anything about it.) I have explained why this is so. I have also explained how the DDM assigns all its resources to the critical problems: the rate of production is so high that time is of the essence in protecting profits. Once production ends, all resources are reassigned to the next product to be released.
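To make concrete what a published MTBF claims about the bottom of the bathtub, here is a minimal Python sketch. It assumes a constant failure rate (an exponential lifetime model), which is the usual reading of MTBF but is an assumption of this sketch, not something a datasheet guarantees. The function names and the 1,000,000 hour MTBF are hypothetical, chosen only for illustration.

```python
import math

HOURS_PER_YEAR = 8760


def annualized_failure_rate(mtbf_hours: float) -> float:
    """Fraction of drives expected to fail in one year of continuous
    operation, assuming a constant failure rate of 1 / MTBF."""
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


def expected_failures(units: int, mtbf_hours: float, years: float) -> float:
    """Expected bottom-of-bathtub failures among `units` drives over
    `years` of continuous operation, under the same assumption."""
    return units * (1.0 - math.exp(-years * HOURS_PER_YEAR / mtbf_hours))


mtbf = 1_000_000  # hypothetical published MTBF, in hours
print(f"Annualized failure rate: {annualized_failure_rate(mtbf):.2%}")
print(f"Expected failures per million drives in one year: "
      f"{expected_failures(1_000_000, mtbf, 1.0):,.0f}")
```

The sketch says nothing about how trustworthy a published MTBF is; it only shows the failure rate the number implies if taken at face value.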
It is my understanding that these MTBF numbers are derived from thermal cycling in ovens as part of the QA process. All the likely failure modes in a disk drive are quite sensitive to thermal conditions. These involve the media, the heads, the spindle, bearings, lubricants, etc., comprising the critical mechanical structure; the temperature dependence of band gaps and other calibrating circuitry within the electronics; nominal currents within the microelectronics and especially the power MOSFET arrays; the servo system calibration; etc. As the thermal cycling QA processes proceed, defects in these systems can be forced to manifest during the testing, and the normal state characteristics and stability of these subsystems can also be extracted from the experiments. These results are then rigorously integrated with the observed profiles and characteristics of drives failing within the infant mortality window. It is all highly integrated within statistical models for expectations.

MTBF numbers are also extrapolated from the results. In effect, the MTBF numbers become the long term projections that are extrapolated from this data. But the primary focus and optimization of the processes is intended to create the statistical underpinning from which to analyze infant mortality drive failures. The uncertainty in these numbers naturally increases for the MTBF extrapolations. It's all perfectly logical.

> Yes. And I have yet to see anything in your complaint that is anything
> but misinterpretation on your part. Reality check time, indeed. You
> can't use MTBF by itself as a measure of quality, period, so complaining
> that it isn't a good single item to measure disk quality is, well,
> operator error.
>
> -- greg

I think the problem with the logic you and others have embraced is that it is not well correlated with the operational priorities of the DDM industry. As with all industries, competitors publish normalized metrics for customers to compare.
(And of course, they want you to think these metrics are really important! They want you to buy their product.) I believe the MTBF number is more of a marketing number than something the DDM goes to great lengths to formulate. On the other hand, in a well executed production run, where everything goes as planned, the MTBF numbers are likely to be accurate. After all, MTBF and infant mortality rates clearly share dependencies in the normal case -- and in fact, the MTBF numbers are derived from the processes optimized for anticipating the expected infant mortality rate during the production run.

If DDMs were interested in helping customers discriminate based on the actual expected lifetime of drives, they would all publish running infant mortality rates, updated weekly, during the production run of their disk drives. After all, this is the one metric the entire organization is focused on during production. But what they hand out to prospective customers is this MTBF number, a number they pay no attention to internally.

HTH,
Karen
--
Karen Shaeffer
Neuralscape, Palo Alto, Ca. 94306
shaeffer at neuralscape.com  http://www.neuralscape.com
