[Beowulf] SSDs for HPC?
Prentice Bisbal
prentice.bisbal at rutgers.edu
Tue Apr 8 09:05:12 PDT 2014
On 04/07/2014 04:40 PM, Ellis H. Wilson III wrote:
> On 04/07/2014 03:44 PM, Prentice Bisbal wrote:
>>> As long as you use enterprise-grade SSDs (e.g., Intel's stuff) with
>>> overprovisioning, NAND endurance shouldn't be an issue over the
>>> lifetime of a cluster. We've used SSDs as our nodes' system disks for
>>> a few years now (going on four with our oldest, a 324-node production
>>> system), and there have been no major problems. The only real problems
>>> happened when we were using cheaper commodity SSDs. Don't give in
>>> to the temptation to save a few pennies there.
>>
>> Thanks for the info. Enterprise typically means MLC instead of SLC,
>> right?
After reading the link below, I think I got my concept of SLC and MLC
backwards. Sorry. Not an SSD expert by any stretch of the imagination.
>
> There is a lot of cruft to filter through in the SSD space to
> understand what the hell is really going on. First of all,
> "enterprise" can really mean anything, but one thing is for certain:
> it is more expensive. Enterprise can mean the same material (flash
> cells) but a different (better) flash translation layer
> (wear-leveling/garbage collection/etc.), a different feature size
> (bigger generally means more reliable and less dense), or some
> fancier fab tech. A blog post with more info on the topic is here:
>
> http://www.violin-memory.com/blog/flash-flavors-mlc-emlc-or-vmlc-part-i/
Thanks. This was an excellent read.
>
> Either way, "more bits per cell" is generally seen as less enterprise
> than "fewer bits per cell." So, SLC is high-reliability enterprise,
> MLC can be enterprise in some cases (marketing has even taken it upon
> itself to brand some "eMLC" or enterprise MLC, which has about as much
> meaning as can be expected), and TLC is arguably just commodity. Fewer
> bits per cell also means lower latencies, particularly for writes/erases.
>
> I guess I disagree with the previous poster that saving by going the
> commodity route, which, by the way, is not pennies but often upwards
> of 50%, is always bad. It really depends on your situation/use-case.
> I wouldn't store permanent data on outright commodity SSDs, but as a
> LOCAL scratch-pad they can be brilliant (and replacing them later may
> be far more advisable than spending a ton up front and praying they
> never need replacing).
>
> For instance, since you mention Hadoop, you are in a good position to
> consider commodity SSDs, since Hadoop will automatically fail over to
> another node if one node's SSD dies. It's not going to kill off
> your whole job. Hadoop is built to cope with that. This being said,
> I am not suggesting you necessarily should go the route of putting
> HDFS on SSDs. The bandwidth and capacity concerns you raise are spot
> on there. What I am suggesting is perhaps using a few commodity SSDs
> for your mapreduce.cluster.local.dir, or where your intermediate data
> will reside. You suggest "not many users will take advantage of
> this." If your core application is Hadoop, every user will take
> advantage of these SSDs (unless they explicitly override the tmp path,
> which is possible), and the gains can be significant over HDDs.
> Moreover, you aren't multiplexing persistent and temporary data
> onto/from your HDFS HDDs, so you can see speedups for persistent-data
> access as well, since you've effectively created dedicated storage
> pools for the two access types. This can be important.
I don't know much about Hadoop, but this seems like a good middle
ground. I like technology to be transparent to users. If they have to
do something specific on their end to improve performance, nine times
out of ten they won't, either because they're unaware the feature is
available, don't understand the impact of using it, or are just too
lazy to change their habits.
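
For the archives, if I do go this route, my understanding is that it
comes down to one property in mapred-site.xml. A minimal sketch, with
hypothetical SSD mount points (/ssd0, /ssd1):

    <!-- Point MapReduce intermediate (shuffle/spill) data at local SSDs.
         Multiple directories can be comma-separated to spread I/O
         across devices. -->
    <property>
      <name>mapreduce.cluster.local.dir</name>
      <value>/ssd0/mapred/local,/ssd1/mapred/local</value>
    </property>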
>
> Caveat #1: Make sure to determine how much temporary space will be
> needed, and acquire enough SSDs to cover that across the cluster.
> That, or instruct your users that "jobs generating up to X TB of
> intermediate data can run on the SSDs, which is the default, but
> for jobs exceeding that, use these extra parameters to send the tmp
> data to HDDs." That adds complexity, though. Depends on the user base.
That could be a problem. I really don't know the users' habits yet:
this will be my first Hadoop resource, so I have no job-size data
to go on. Also, see my previous comment.
>
> Caveat #2: If you're building incredibly stacked boxes (e.g., 16+
> HDDs), you may be resource-limited in a number of ways that make
> adding SATA SSDs unwise. It may not be worth the effort to squeeze
> more SSDs in there, or PCIe SSDs (which tend to be more enterprise
> anyhow) might be the way to go.
>
> Caveat #3: Only certain types of Hadoop jobs really hammer
> intermediate space. Read- and write-intensive jobs often won't, but
> the special ones that do (e.g., Sort) benefit immensely from a fast
> intermediate space.
If that's the case, then your suggestion may not be worth it at this
point, since I don't have any usage data to determine whether it's a
wise investment.
>
> Caveat #4: There are probably more caveats. My advice is to build
> two mock-up machines, one with SSDs and one without, and run a "baby
> cluster" Hadoop instance. That way, if the SSDs really don't bring the
> performance gains you want, you avoid buying a bunch of them, wasting
> money, and probably spending time replacing them down the road.
This is the ideal approach. Unfortunately, I don't have the time or
resources for this. :(
>
> More on Wear-Out: This is becoming an issue again at /modern/ feature
> sizes and commodity bit-levels (especially TLC). For current drives
> built on feature sizes a generation or two back, fancy wear-leveling
> and generous amounts of over-provisioning have more or less made
> wear-out a non-issue for the lifetime of your machine.
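Good to know. One thing I'll probably do regardless is keep an eye on
the drives' SMART wear counters. A sketch of the kind of check I have
in mind, using smartctl (attribute names vary by vendor; Intel drives
report a Media_Wearout_Indicator that counts down from 100 as the NAND
wears):

    # Dump vendor-specific SMART attributes for the drive at /dev/sda
    # (substitute your actual device). Watch the wear/endurance
    # attribute's trend over time rather than any single reading.
    smartctl -A /dev/sda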
>
> Best,
>
> ellis
>