[Beowulf] SSDs for HPC?
landman at scalableinformatics.com
Tue Apr 8 08:25:38 PDT 2014
On 4/8/14, 11:05 AM, Michael Di Domenico wrote:
> On Tue, Apr 8, 2014 at 10:57 AM, Joe Landman
> <landman at scalableinformatics.com> wrote:
>> From a general purpose point of view, Intel and Samsung make great lower end
>> devices. SanDisk makes great higher end devices. We are working on getting
>> some Toshiba's and a few others for enterprise to ultra-high-end testing.
>> With some of the SSDs, we found that a hot plug event was permanently
>> terminal to the device. Neat, huh? Other SSDs we played with had 40+%
>> failure rates.
> is that 40% infant mortality or after some period of time?
That was 40% across a very large swath of parts, within a 2 week window
of each other, for lightly used boot drive SSDs. We ripped them out,
globally, and replaced them. Including non-failed parts.
> i've held off on ssd's in our environment mostly because of the
> general feeling that ssd's still have a much shorter life expectancy
> than hdd's. some anecdotal evidence would be helpful.
The cheap drives are crap. The good drives will cost you. The good
drives will be as reliable as spinning rust, if not more so. The meh
drives have 2-5 random drive writes per day (DWPD) over a 5 year
window. The crappy drives have sub 1 (usually sub 0.1). The good drives
have 10+ DWPD.
Huge hint: if they don't give explicit figures on durability, there is
a very good reason for that.
Huge hint 2: You can take the analysis Prentiss suggested to calculate
the number of single block erasures that the drive can tolerate during
its lifetime. Crap drives are way sub 3k. Meh drives are 3k-7k
(put nothing important on them, and avoid them in write-amplified
scenarios such as RAID5/6). Good drives are 10+k erasures.
For 1PB of total writes during lifetime, a 100GB drive would be written
10k times. If this is over 5 years (call it 1825 days), then you get
roughly 10k/1825 -> 5.5 DWPD. Upper end of meh into "lower good"
range. This is 10k erasure/rewrite cycles.
Note that this analysis is *highly* oversimplified, and a good academic
would take strong issue with it. But it also appears to match reality
quite well from what we observe.
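A minimal Python sketch of the arithmetic above, for anyone who wants to plug in their own numbers. It assumes decimal units (1 PB = 10^15 bytes, 1 GB = 10^9 bytes) and ignores write amplification, so it carries the same oversimplification as the estimate itself. The function names are illustrative, not from any vendor tool.

```python
# Rough endurance arithmetic: total lifetime writes -> full-drive writes -> DWPD.
# Assumptions: decimal units, no write amplification, even wear across the drive.

PB = 10**15  # bytes in a decimal petabyte
GB = 10**9   # bytes in a decimal gigabyte


def full_drive_writes(total_bytes_written, capacity_bytes):
    """Number of times the whole drive is written over its lifetime."""
    return total_bytes_written / capacity_bytes


def dwpd(total_bytes_written, capacity_bytes, lifetime_days):
    """Drive writes per day implied by a lifetime write total."""
    return full_drive_writes(total_bytes_written, capacity_bytes) / lifetime_days


if __name__ == "__main__":
    cycles = full_drive_writes(1 * PB, 100 * GB)
    rate = dwpd(1 * PB, 100 * GB, 5 * 365)
    print(f"full-drive writes: {cycles:.0f}")
    print(f"DWPD over 5 years: {rate:.2f}")
```

With the figures from the example (1PB written to a 100GB drive over 1825 days) this gives 10k full-drive writes and about 5.5 DWPD.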
Our high end SSDs in our siflash box have a lower average yearly failure
rate than our high end spinning rust drives.
Good SSDs will cost you more than the crap ones. But you will not
regret buying the good ones. You will regret buying the crap ones.
Just remember, if you are speccing out a new storage
box/cluster/computing system, that you need to make engineering and cost
tradeoffs. And in the ultracompetitive academic cluster market, it just
may be that the margins are so incredibly thin to begin with, that
anything that helps increase the margin is a good thing for the company
offering the system. I know people here may not be sympathetic to this
viewpoint, and that's OK. Until, that is, you are on the other side,
trying to pay your team with the slivers of margins you make on these
sales. I'd recommend, instead of automatically picking the item with the
cheapest acquisition cost, that you focus on the best. The best will
cost you more, and you will have fewer headaches.
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: landman at scalableinformatics.com
web : http://scalableinformatics.com
twtr : @scalableinfo
phone: +1 734 786 8423 x121
cell : +1 734 612 4615