[Beowulf] SSDs for HPC?
Prentice Bisbal
prentice.bisbal at rutgers.edu
Tue Apr 8 09:05:12 PDT 2014
On 04/07/2014 04:40 PM, Ellis H. Wilson III wrote:
> On 04/07/2014 03:44 PM, Prentice Bisbal wrote:
>>> As long as you use enterprise-grade SSDs (e.g., Intel's stuff) with
>>> overprovisioning, NAND endurance shouldn't be an issue over the
>>> lifetime of a cluster. We've used SSDs as our nodes' system disks for
>>> a few years now (going on four with our oldest, a 324-node production
>>> system), and there have been no major problems. The only real problems
>>> happened when we were using cheaper commodity SSDs. Don't give in
>>> to the temptation to save a few pennies there.
>>
>> Thanks for the info. Enterprise typically means MLC instead of SLC,
>> right?
After reading the link below, I think I got my concept of SLC and MLC
backwards. Sorry. Not an SSD expert by any stretch of the imagination.
>
> There is a lot of cruft to filter through in the SSD space to
> understand what the hell is really going on. First of all,
> "enterprise" can really mean anything, but one thing is for certain:
> it is more expensive. Enterprise can mean the same material (flash
> cells) but a different (better) flash translation layer
> (wear-leveling/garbage collection/etc.), a different feature size
> (bigger generally means more reliable and less dense), or some
> fancier fab tech. A blog post with more info on the topic is here:
>
> http://www.violin-memory.com/blog/flash-flavors-mlc-emlc-or-vmlc-part-i/
Thanks. This was an excellent read.
>
> Either way, "more bits per cell" is generally seen as less enterprise
> than "fewer bits per cell." So, SLC is high-reliability enterprise,
> MLC can be enterprise in some cases (marketing has even taken it upon
> itself to brand some "eMLC" or enterprise MLC, which has about as much
> meaning as can be expected), and TLC is arguably just commodity. Fewer
> bits per cell also means lower latencies, particularly for writes/erases.
>
> I guess I disagree with the previous poster that saving by going the
> commodity route, which, by the way, is not pennies but often upwards
> of 50%, is always bad. It really depends on your situation/use-case.
> I wouldn't store permanent data on outright commodity SSDs, but as a
> LOCAL scratch-pad they can be brilliant (and replacing them later may
> be far more advisable than spending a ton up front and praying they
> never need replacing).
>
> For instance, since you mention Hadoop, you are in a good position to
> consider commodity SSDs, since Hadoop will automatically fail over to
> another node if one node's SSD dies. It's not going to kill off
> your whole job. Hadoop is built to cope with that. This being said,
> I am not suggesting you necessarily should go the route of putting
> HDFS on SSDs. The bandwidth and capacity concerns you raise are spot
> on there. What I am suggesting is perhaps using a few commodity SSDs
> for your mapreduce.cluster.local.dir, or where your intermediate data
> will reside. You suggest "not many users will take advantage of
> this." If your core application is Hadoop, every user will take
> advantage of these SSDs (unless they explicitly override the tmp path,
> which is possible), and the gains can be significant over HDDs.
> Moreover, you aren't multiplexing persistent and temporary data
> onto/from your HDFS HDDs, so you can see speedups for persistent-data
> access as well, since you've effectively created dedicated storage
> pools for the two access types. This can be important.
I don't know much about Hadoop, but this seems like a good middle
ground. I like technology to be transparent to users. If they have to
do something specific on their end to improve performance, nine times
out of ten they won't, either because they're unaware the feature is
available, don't understand the impact of using it, or are just too
lazy to change their habits.
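
For the archives, if I do go this route, my understanding is that it
comes down to one property in mapred-site.xml. A minimal sketch, with
hypothetical SSD mount points (/ssd0, /ssd1):

    <!-- Point MapReduce intermediate (shuffle/spill) data at local SSDs.
         Multiple directories can be comma-separated to spread I/O
         across devices. -->
    <property>
      <name>mapreduce.cluster.local.dir</name>
      <value>/ssd0/mapred/local,/ssd1/mapred/local</value>
    </property>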
>
> Caveat #1: Make sure to determine how much temporary space will be
> needed, and acquire enough SSDs to cover that across the cluster.
> That, or instruct your users that "jobs generating up to X TB of
> intermediate data can run on the SSDs, which is the default, but
> for jobs exceeding that, use these extra parameters to send the tmp
> data to HDDs." That adds complexity, though. Depends on the user base.
That could be a problem. I really don't know the users' habits yet:
this will be my first Hadoop resource, so I have no job-size data
to go on. Also, see my previous comment.
>
> Caveat #2: If you're building incredibly stacked boxes (e.g., 16+
> HDDs), you may be resource-limited in a number of ways that make
> adding SATA SSDs unwise. It may not be worth the effort to squeeze
> more SSDs in there, or PCIe SSDs (which tend to be more enterprise
> anyhow) might be the way to go.
>
> Caveat #3: Only certain types of Hadoop jobs really hammer
> intermediate space. Read- and write-intensive jobs often won't, but
> the special ones that do (e.g., Sort) benefit immensely from a fast
> intermediate space.
If that's the case, then your suggestion may not be worth it at this
point, since I don't have any usage data to determine whether it's a
wise investment.
>
> Caveat #4: There are probably more caveats. My advice is to build
> two mock-up machines, one with SSDs and one without, and run a "baby
> cluster" Hadoop instance. That way, if the SSDs really don't bring the
> performance gains you want, you avoid buying a bunch of them, wasting
> money, and probably spending time replacing them down the road.
This is the ideal approach. Unfortunately, I don't have the time or
resources for this. :(
>
> More on Wear-Out: This is becoming an issue again at /modern/ feature
> sizes and commodity bit-levels (especially TLC). For current drives
> built on feature sizes a generation or two back, fancy wear-leveling
> and generous amounts of over-provisioning have more or less made
> wear-out a non-issue for the lifetime of your machine.
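Good to know. One thing I'll probably do regardless is keep an eye on
the drives' SMART wear counters. A sketch of the kind of check I have
in mind, using smartctl (attribute names vary by vendor; Intel drives
report a Media_Wearout_Indicator that counts down from 100 as the NAND
wears):

    # Dump vendor-specific SMART attributes for the drive at /dev/sda
    # (substitute your actual device). Watch the wear/endurance
    # attribute's trend over time rather than any single reading.
    smartctl -A /dev/sda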
>
> Best,
>
> ellis
>