[Beowulf] SSD caching for parallel filesystems

Ellis H. Wilson III ellis at cse.psu.edu
Sat Feb 9 07:16:40 PST 2013

On 02/09/13 13:16, Mark Hahn wrote:
>> They buy a controller design from one place (some
>> make this component), SSD packages from someplace else, some channel
>> controllers, etc, etc, and strap it all together.  Which is totally
> well, I only pay attention to the SATA SSD market, but the media
> controller is in the same chip as the flash controller, wear logic, etc.

Of course -- there is no reason to have to translate to PCIe.  You are 
already getting things encoded for SATA from the NAND flash, so 
everything is easy.  It's only when you use a different protocol outside 
that you need to translate, and this is the case I'm referring to. 
Pardon my snipping the previous aspects of this conversation -- for SATA 
SSDs this entire conversation is moot.  The NAND packages there are as 
good as they could be relative to the overhead I'm discussing.

> so yes, there is some shopping around of flash components, but having
> industry-wide flash interface standards is hardly a bad thing.

I like standards too, and this is why these guys made a standard for the 
PCIe protocol:

I'm going to make a bold statement here, and you can take it or leave 
it, but I truly believe SATA is on its way out for NVM devices.  We 
have to stop thinking about these things as "disks" and start thinking 
about them as slow, huge memory: just another step in the hierarchy 
(registers, L1, L2, DRAM, NVM).  Obviously we have some hurdles to 
overcome before it's totally usable as plain-ol' OS-managed memory, 
but we're getting there.  I think PCIe is a step in the right direction, 
and does a service to these devices in that it allows them to really 
perform (NVM devices are out-pacing SATA by a lot, at least right now). 
But ultimately, these things need to be either on-board or in DIMMs. 
Just my opinion of course.

>> fine, but the problem arises because the volume for NAND flash packages
>> is in SATA-based drives.  This results in most of the NAND packages
>> within exporting a SATA protocol.
> that confuses me.  flash chips have a generic interface which I can't
> really see as being at all specific to a particular blockdev interface.

I'm obviously doing a horrible job explaining this -- my apologies! 
Please see page 4 of this whitepaper and the diagram at the top of page 
5, which I think convey this overhead more clearly than I have:


To summarize, getting more technical than I was willing to earlier: 
in bridge-oriented PCIe-based SSDs, protocol encodings have to be 
converted between the SATA controller and the HBA controller, and that 
conversion can cost as much as 25% of the bandwidth.  Is this a 
game-ender for the end-user?  No.  Is this something to keep in mind? 
Sure.  That's all I was angling at.  It was just a suggestion for 
Prentice to read up on before committing to a bunch of PCIe 
drives.  I don't work for Micron, Samsung, Intel, or any other company 
in this space.  No tomfoolery going on here :D.
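
To put rough numbers on how a line encoding eats into bandwidth, here 
is a minimal sketch.  It uses the published line rates and encodings 
for SATA 6Gb/s (8b/10b), PCIe 2.0 (8b/10b) and PCIe 3.0 (128b/130b); 
the function names are made up for illustration.  It only shows the 
cost of the encoding itself and is not a model of the bridge chip or a 
derivation of the 25% figure above.

```python
def payload_rate_mb_s(line_rate_gbit_s, payload_bits, coded_bits):
    """Usable payload bandwidth (MB/s) of one link using a block line code."""
    payload_bit_rate = line_rate_gbit_s * 1e9 * payload_bits / coded_bits
    return payload_bit_rate / 8 / 1e6  # bits -> bytes -> megabytes


def coding_overhead(payload_bits, coded_bits):
    """Extra bits on the wire, as a fraction of the payload bits."""
    return (coded_bits - payload_bits) / payload_bits


# (name, line rate in Gbit/s, payload bits per code group, coded bits)
links = [
    ("SATA 6Gb/s", 6.0, 8, 10),
    ("PCIe 2.0 x1", 5.0, 8, 10),
    ("PCIe 3.0 x1", 8.0, 128, 130),
]

for name, rate, payload, coded in links:
    print(f"{name:12s} {payload_rate_mb_s(rate, payload, coded):7.1f} MB/s "
          f"payload, {coding_overhead(payload, coded):.1%} extra bits on the wire")
```

With 8b/10b, every 8 payload bits travel as 10 bits on the wire, i.e. 
25% extra relative to the payload (or 20% of the raw line rate). 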

> no, it doesn't.  Micron has simply invented their own
> flash-to-disk-interface.  if you're saying "skipping SATA is important",
> well, maybe.  it looks from Micron's whitepapers that they are focused
> almost entirely on small random reads (not unreasonable).  but that's a
> workload that doesn't stress the oddity of flash (managing pre-erased
> blocks and wear levelling).  maybe I'm just picking at semantics, but

Small random reads are actually the toughest things to deal with for 
properly burnt-in SSDs.  Sure, the underlying NAND prefers reads to 
writes, but ultimately, load-leveling across channels, packages, dies, 
and planes is much tougher for reads that are destined for specific data 
that already "lives" somewhere.  This is particularly true for the tiny 
queues available for SATA drives, where request reordering is extremely 
limited.  OTOH, small writes can easily be spread out over all of the 
internal storage in parallel because the device can "choose" where they 
go.  Anyhow, I think we are getting into overly detailed semantic 
issues here.  The take-away is that, because of the very different 
encodings used by PCIe and SATA, it's inane to have SATA-controllered 
NAND in a PCIe-based flash device and transcode all the time.  Just fix 
the damn bug.  That's my story and I'm sticking to it ;).
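
The read-versus-write point can be sketched with a toy model.  The 
assumptions are all mine: 8 independent channels (a hypothetical 
figure), a batch of 32 outstanding requests (the SATA NCQ limit), 
reads landing on a uniformly random channel because the data already 
lives there, and writes placed round-robin by the device.  It ignores 
dies, planes, garbage collection and the latency difference between 
reads and writes, so treat it as an illustration only.

```python
import random
from collections import Counter

N_CHANNELS = 8    # hypothetical number of independent flash channels
QUEUE_DEPTH = 32  # outstanding requests; 32 is the SATA NCQ limit


def busiest_channel_reads(n_requests, n_channels, rng):
    """Reads must go to whichever channel already holds the data."""
    hits = Counter(rng.randrange(n_channels) for _ in range(n_requests))
    return max(hits.values())


def busiest_channel_writes(n_requests, n_channels):
    """Writes can be placed round-robin, so the load is as even as possible."""
    return -(-n_requests // n_channels)  # ceiling division


rng = random.Random(2013)  # fixed seed so the run is repeatable
trials = [busiest_channel_reads(QUEUE_DEPTH, N_CHANNELS, rng)
          for _ in range(1000)]
avg_read = sum(trials) / len(trials)
write = busiest_channel_writes(QUEUE_DEPTH, N_CHANNELS)

# The batch finishes when the busiest channel finishes, so a larger
# number here means less internal parallelism is actually being used.
print(f"writes: busiest channel serves {write} requests")
print(f"reads:  busiest channel serves {avg_read:.2f} requests on average")
```

In this model the reads can never do better than the writes, since the 
round-robin placement is already the most even split possible. 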

>> Managing the raw flash at the filesystem or driver level is an option,
>> but not what I was talking about (nor what is currently in vogue,
>> to my knowledge).
> well, it's interesting that some people are talking about adding flash
> to dram dimms - very unclear to me how that would work.  but maybe they're
> merely using the dimm form-factor and proposing a new interface.

Who are these folks?  I'd love to read what they are saying -- this is 
where my next research is geared, so this is exciting to hear!

> I guess I have more respect for SATA than you do.  the Micron thing is

Oh, don't get me wrong, I love SATA...for disks.  I am just increasingly 
seeing these devices less as disks than as giant, slow memory.  I 
worry that if we force SATA to keep up with these things as they rapidly 
improve, we may hurt things for magnetic storage, which still has a very 
real and valuable place in the storage hierarchy.  This is a new tier, 
and we've treated it so far like a tier in storage.  Maybe now that it's 
so dang fast, it's time to rethink that.

> still just a disk interface - block shuffling.  it arguably removes one
> level of protocol reformatting, but I'm not sure how much difference
> that would make to the consumer.  a raid0 across SATA channels does a
> pretty good job of piling up IOPs...

This encoding issue hits bandwidth harder than IOPS, fwiw. 
You're probably correct about SATA working (in the short term at least) 
for IOPS.

> OTOH, a PCIe card that really did map flash blocks directly into the
> memory space would be quite interesting.  it just sounds tricky to get

I agree completely.  As you can tell from my previous ramblings, I'm 
hoping we see more in this direction in the near future.  Sure, there 
are hurdles, but I do not believe they are any harder than those we 
surmounted in making it a disk-like device.


