[Beowulf] SSDs for HPC?

Mon Apr 7 13:40:11 PDT 2014

On 04/07/2014 03:44 PM, Prentice Bisbal wrote:
>> As long as you use enterprise-grade SSDs (e.g., Intel's stuff) with
>> overprovisioning, the nand endurance shouldn't be an issue over the
>> lifetime of a cluster.  We've used SSDs as our nodes' system disks for
>> a few years now (going on four with our oldest, 324-node production
>> system), and there have been no major problems.  The major problems
>> happened when we were using the cheaper commodity SSDs.  Don't give in
>> to the temptation to save a few pennies there.
>
> Thanks for the info. Enterprise typically means MLC instead of SLC, right?

There is a lot of cruft to filter through in the SSD space to understand 
what the hell is really going on.  First of all, "enterprise" can really 
mean anything, but one thing is for certain: it is more expensive. 
Enterprise can mean the same material (flash cells) but a different 
(better) flash translation layer (wear-leveling/garbage collection/etc.), 
or a different feature size (bigger generally means more reliable and 
less dense), or some fancier fab tech.  Blog with more info on the 
topic here:

http://www.violin-memory.com/blog/flash-flavors-mlc-emlc-or-vmlc-part-i/

Either way, more bits per cell is generally seen as less enterprise 
than fewer bits per cell.  So, SLC is high-reliability enterprise, MLC 
can be enterprise in some cases (marketing has even taken it upon itself 
to brand some "eMLC" or enterprise MLC, which has about as much meaning 
as can be expected) and TLC is arguably just commodity.  Fewer bits per 
cell also means lower latencies, particularly for writes/erases.

I guess I disagree with the previous poster that saving money by going 
the commodity route (which, by the way, is not pennies but often upwards 
of 50%) is always bad.  It really depends on your situation/use-case.  I 
wouldn't store permanent data on outright commodity SSDs, but as a LOCAL 
scratch-pad, they can be brilliant (and replacing them later may be far 
more advisable than spending a ton up front and praying they don't fail).

For instance, since you mention Hadoop, you are in a good situation to 
consider commodity SSDs, since Hadoop will automatically fail over to 
another node if one node's SSD dies.  It's not going to kill off your 
whole job.  Hadoop is built to cope with that.  This being said, I am not 
suggesting you necessarily should go the route of putting HDFS on SSDs. 
The bandwidth and capacity concerns you raise are spot on there.  What I 
am suggesting is perhaps using a few commodity SSDs for your 
mapreduce.cluster.local.dir, or where your intermediate data will 
reside.  You suggest "not many users will take advantage of this."  If 
your core application is Hadoop, every user will take advantage of these 
SSDs (unless they explicitly override the tmp path, which is possible), 
and the gains can be significant over HDDs.  Moreover, you aren't 
multiplexing persistent and temporary data onto/from your HDFS HDDs, so 
you can get speedups getting to persistent data as well since you've 
effectively created dedicated storage pools for both types of accesses. 
This can be important.
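
As a rough sketch of the suggestion above, the snippet below builds the 
config fragment that points mapreduce.cluster.local.dir at SSD mount 
points.  The mount points (/mnt/ssd0, /mnt/ssd1) and the mapred/local 
subdirectory are hypothetical placeholders, and depending on your Hadoop 
version a different property may govern where intermediate data lands, 
so check the docs for your release before relying on this.

```python
import xml.etree.ElementTree as ET


def local_dir_property(ssd_mounts, subdir="mapred/local"):
    """Build a <property> element for mapreduce.cluster.local.dir.

    ssd_mounts: SSD mount points (hypothetical paths, adjust to your nodes).
    subdir: directory under each mount for intermediate data (an assumption).
    """
    if not ssd_mounts:
        raise ValueError("need at least one SSD mount point")
    # The property takes a comma-separated list of local directories.
    dirs = ",".join(f"{m.rstrip('/')}/{subdir}" for m in ssd_mounts)
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = "mapreduce.cluster.local.dir"
    ET.SubElement(prop, "value").text = dirs
    return prop


if __name__ == "__main__":
    # Print a fragment that could be pasted into the site configuration file.
    prop = local_dir_property(["/mnt/ssd0", "/mnt/ssd1"])
    print(ET.tostring(prop, encoding="unicode"))
```

Listing several SSDs spreads the intermediate data across them, while 
HDFS data directories stay on the HDDs.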

Caveat #1: Make sure to determine how much temporary space will be 
needed, and acquire enough SSDs to cover that across the cluster.  Either 
that, or instruct your users that "jobs generating up to X TB of 
intermediate data can run on the SSDs, which is done by default, but for 
jobs exceeding that, use these extra parameters to send the tmp data to 
HDDs."  More complexity though.  Depends on the user base.
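
The sizing exercise in Caveat #1 is just arithmetic, and a back-of-the- 
envelope version might look like the following.  It assumes intermediate 
data spreads evenly across nodes (real jobs can be skewed), and the 
function names, the 25% headroom, and the example numbers are all 
illustrative rather than measured.

```python
import math


def ssds_per_node(peak_intermediate_tb, nodes, ssd_capacity_tb, headroom=0.25):
    """SSDs needed per node to hold peak intermediate data.

    Assumes data is spread evenly across nodes; headroom is extra space
    reserved on top of the peak (0.25 means 25% more).
    """
    if nodes <= 0 or ssd_capacity_tb <= 0:
        raise ValueError("nodes and ssd_capacity_tb must be positive")
    per_node_tb = peak_intermediate_tb * (1 + headroom) / nodes
    return math.ceil(per_node_tb / ssd_capacity_tb)


def fits_on_ssds(job_intermediate_tb, nodes, ssds_each, ssd_capacity_tb):
    """True if a job's intermediate data fits in the cluster-wide SSD pool."""
    return job_intermediate_tb <= nodes * ssds_each * ssd_capacity_tb


if __name__ == "__main__":
    # Hypothetical cluster: 40 nodes, 0.48 TB SSDs, 20 TB peak intermediate data.
    n = ssds_per_node(20, nodes=40, ssd_capacity_tb=0.48)
    print("SSDs per node:", n)
    print("30 TB job fits:", fits_on_ssds(30, 40, n, 0.48))
```

The second function gives you the "X TB" threshold to quote to users: 
anything above the cluster-wide SSD pool should be sent to HDDs.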

Caveat #2: If you're building incredibly stacked boxes (e.g., 16+ HDDs) 
you may be resource-limited in a number of ways that make adding SATA 
SSDs unwise.  It may not be worth the effort to squeeze more SSDs in 
there, or PCIe SSDs (tending to be more enterprise anyhow) might be the 
way to go.

Caveat #3: Only certain types of Hadoop jobs really hammer intermediate 
space.  Read- and write-intensive jobs often won't, but those special 
ones that do (e.g., Sort) benefit immensely from a fast intermediate 
space.

Caveat #4: There are probably more caveats.  My advice is to build two 
mock-up machines, one with SSDs and one without, and run a "baby 
cluster" Hadoop instance on each.  This way, if SSDs really don't bring 
the performance gains you want, you avoid buying a bunch, wasting money, 
and probably wasting time replacing them down the road.

More on Wear-Out: This is becoming an issue again for /modern/ feature 
sizes and commodity bit-levels (especially TLC).  For modern drives built 
on feature sizes from a generation or two back, fancy wear-leveling and 
egregious amounts of over-provisioning have more or less made wear-out 
impossible within the lifetime of your machine.
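
If you want to sanity-check wear-out for a particular drive, a common 
first-order estimate is: total NAND writes available (capacity times 
rated program/erase cycles) divided by NAND writes per day (host writes 
times write amplification).  The sketch below does exactly that.  The 
cycle counts, write volume, and write amplification in the example are 
placeholder inputs, not vendor specs; plug in figures from the datasheet 
and from your own workload.

```python
def years_until_wearout(capacity_gb, pe_cycles, host_writes_gb_per_day,
                        write_amplification=2.0):
    """First-order estimate of years until the rated P/E cycles are used up.

    Ignores over-provisioning, retention effects, and uneven wear, so
    treat the result as a rough guide only.
    """
    inputs = (capacity_gb, pe_cycles, host_writes_gb_per_day, write_amplification)
    if min(inputs) <= 0:
        raise ValueError("all inputs must be positive")
    total_nand_writes_gb = capacity_gb * pe_cycles
    nand_writes_gb_per_day = host_writes_gb_per_day * write_amplification
    return total_nand_writes_gb / nand_writes_gb_per_day / 365.0


if __name__ == "__main__":
    # Placeholder inputs: a 480 GB drive taking 500 GB/day of scratch writes.
    for label, cycles in (("higher-endurance cells", 3000),
                          ("lower-endurance cells", 1000)):
        years = years_until_wearout(480, cycles, 500)
        print(f"{label}: about {years:.1f} years")
```

Fewer rated cycles shorten the estimate proportionally, which is why 
bit-level matters so much for a heavily written scratch disk.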

Best,

ellis

-- 
Ph.D. Candidate
Department of Computer Science and Engineering
The Pennsylvania State University
www.ellisv3.com