[Beowulf] SSDs for HPC?
Ellis H. Wilson III
ellis at cse.psu.edu
Mon Apr 7 13:40:11 PDT 2014
On 04/07/2014 03:44 PM, Prentice Bisbal wrote:
>> As long as you use enterprise-grade SSDs (e.g., Intel's stuff) with
>> overprovisioning, the nand endurance shouldn't be an issue over the
>> lifetime of a cluster. We've used SSDs as our nodes' system disks for
>> a few years now (going on four with our oldest, 324-node production
>> system), and there have been no major problems. The major problems
>> happened when we were using the cheaper commodity SSDs. Don't give in
>> to the temptation to save a few pennies there.
> Thanks for the info. Enterprise typically means MLC instead of SLC, right?
There is a lot of cruft to filter through in the SSD space to understand
what the hell is really going on. First of all, "enterprise" can really
mean anything, but one thing is for certain: it is more expensive.
Enterprise can mean the same material (flash cells) but a different
(better) flash translation layer (wear-leveling/garbage collection/etc)
or a different feature size (bigger generally means more reliable and
less dense) or some fancier fab tech. Blog with more info on the
Either way, "more bits-per-cell" is generally seen as less enterprise
than "fewer bits-per-cell." So, SLC is high-reliability enterprise, MLC
can be enterprise in some cases (marketing has even taken it upon itself
to brand some "eMLC" or enterprise MLC, which has about as much meaning
as can be expected) and TLC is arguably just commodity. Fewer bits per
cell also means lower latencies, particularly for writes/erases.
I guess I disagree with the previous poster that saving by going the
commodity route, which, by the way, is not pennies but often upwards of
50%, is always bad. It really depends on your situation/use-case. I
wouldn't store permanent data on outright commodity SSDs, but as a LOCAL
scratch-pad, they can be brilliant (and replacing them later may be far
more advisable than spending a ton up front and praying they don't fail).
For instance, since you mention Hadoop, you are in a good situation to
consider commodity SSDs since it will automatically fail over to another
node if one node's SSD is dead. It's not going to kill off your whole
job. Hadoop is built to cope with that. This being said, I am not
suggesting you necessarily should go the route of putting HDFS on SSDs.
The bandwidth and capacity concerns you raise are spot on there. What I
am suggesting is perhaps using a few commodity SSDs for your
mapreduce.cluster.local.dir, or where your intermediate data will
reside. You suggest "not many users will take advantage of this." If
your core application is Hadoop, every user will take advantage of these
SSDs (unless they explicitly override the tmp path, which is possible),
and the gains can be significant over HDDs. Moreover, you aren't
multiplexing persistent and temporary data onto/from your HDFS HDDs, so
you can get speedups getting to persistent data as well since you've
effectively created dedicated storage pools for both types of accesses.
This can be important.
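
As a rough sketch of what pointing intermediate data at SSDs might look
like, the Python below builds the <property> entry you would paste into
mapred-site.xml. The mount points (/mnt/ssd0, /mnt/ssd1) and the helper
name are made up for illustration; only the property name comes from the
discussion above. Depending on your Hadoop version (YARN-based setups in
particular), the directories that actually get used may be governed by a
different setting, so check the docs for your release.

```python
import xml.etree.ElementTree as ET


def local_dir_property(ssd_mounts):
    """Build a <property> element that sets mapreduce.cluster.local.dir.

    ssd_mounts: list of directories (hypothetical SSD mount points).
    """
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = "mapreduce.cluster.local.dir"
    # The property takes a comma-separated list of directories.
    ET.SubElement(prop, "value").text = ",".join(ssd_mounts)
    return prop


# Hypothetical mount points; substitute wherever your SSDs are mounted.
ssd_mounts = ["/mnt/ssd0/mapred/local", "/mnt/ssd1/mapred/local"]
snippet = ET.tostring(local_dir_property(ssd_mounts), encoding="unicode")
print(snippet)
```

Listing one directory per SSD is meant to spread intermediate I/O
across the devices rather than funnelling it all through one.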
Caveat #1: Make sure to determine how much temporary space will be
needed, and acquire enough SSDs to cover that across the cluster. That,
or instruct your users that "jobs generating up to X TB of intermediate
data can run on the SSDs, which is done by default, but for jobs
exceeding that, use these extra parameters to send the tmp data to HDDs."
More complexity though. Depends on the user-base.
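
A back-of-the-envelope way to do the sizing in Caveat #1 is sketched
below. It assumes intermediate data spreads evenly across nodes, which
real jobs may not do, and the headroom factor, the function name, and
the example numbers are all placeholders rather than recommendations.

```python
import math


def ssds_per_node(intermediate_tb, nodes, ssd_capacity_tb, headroom=1.5):
    """Estimate how many SSDs each node needs for intermediate data.

    intermediate_tb: peak intermediate data across the cluster (TB)
    nodes:           number of worker nodes
    ssd_capacity_tb: usable capacity of one SSD (TB)
    headroom:        safety multiplier (assumed; tune to taste)
    """
    # Assumes an even spread of intermediate data over all nodes.
    per_node_tb = intermediate_tb * headroom / nodes
    return math.ceil(per_node_tb / ssd_capacity_tb)


# Illustrative numbers only: 20 TB peak, 32 nodes, 480 GB drives.
print(ssds_per_node(intermediate_tb=20, nodes=32, ssd_capacity_tb=0.48))
```

Turning it around, nodes * SSDs per node * capacity / headroom gives
the "X TB" figure you would quote to users.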
Caveat #2: If you're building incredibly stacked boxes (e.g., 16+ HDDs)
you may be resource-limited in a number of ways that make adding SATA
SSDs unwise. May not be worth the effort to squeeze more SSDs in there,
or PCIe SSDs (tending to be more enterprise anyhow) might be the way to go.
Caveat #3: Only certain types of Hadoop jobs really hammer intermediate
space. Read- and write-intensive jobs often won't, but those special
ones that do (e.g., Sort) benefit by immense amounts with a fast
Caveat #4: There are probably more caveats. My advice is to build two
mock-up machines, with and without SSDs, and run a "baby cluster" Hadoop
instance. This way, if SSDs really don't bring the performance gains
you want, you avoid buying a bunch, wasting money, and probably wasting
time replacing them down the road.
More on Wear-Out: This is becoming an issue again for /modern/ feature
sizes and commodity bit-levels (especially TLC). For modern drives with
last-gen or two-gens-back feature sizes, fancy wear-leveling and
egregious amounts of over-provisioning have more or less made wear-out
impossible for the lifetime of your machine.
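
To put rough numbers on wear-out, the usual first-order estimate is
(NAND capacity * rated program/erase cycles) / write amplification,
divided by how much you write per day. The sketch below does just that.
Every figure in it (cycle ratings, write amplification, daily write
volume) is an illustrative assumption, not a vendor spec, so plug in the
datasheet values for whatever drive you are actually considering.

```python
def years_to_wearout(capacity_gb, pe_cycles, writes_gb_per_day,
                     write_amplification):
    """First-order estimate of SSD lifetime in years.

    capacity_gb:         NAND capacity of the drive (GB)
    pe_cycles:           rated program/erase cycles per cell (assumed)
    writes_gb_per_day:   host writes per day (GB)
    write_amplification: NAND writes per host write (assumed; good
                         wear-leveling and over-provisioning lower it)
    """
    total_host_writes_gb = capacity_gb * pe_cycles / write_amplification
    return total_host_writes_gb / writes_gb_per_day / 365.0


# Illustrative comparison: same workload, different assumed endurance.
higher_endurance = years_to_wearout(480, 3000, 500, 3.0)
lower_endurance = years_to_wearout(480, 1000, 500, 3.0)
print(f"{higher_endurance:.1f} years vs {lower_endurance:.1f} years")
```

If the estimate comes out comfortably beyond the planned life of the
cluster, wear-out is unlikely to be the thing that bites you; if not,
that is the case for more over-provisioning or fewer bits per cell.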
Department of Computer Science and Engineering
The Pennsylvania State University