[Beowulf] SSD caching for parallel filesystems

Sat Feb 9 10:16:43 PST 2013

>>> solid devices as well (also expensive however).  I think Micron also has
>>> a native PCIe device in the wild now, the P320h?  Anybody know of other,
>>> native PCIe devices?
>>
>> I'm not even sure what "native" PCIe flash would look like.  Do you mean
>> that the driver and/or filesystem would have to do the wear-levelling and
>> block remapping and garbage collection explicitly?
>
> No, this is referring to the internal protocols of the SSD.  The SSD is
> just exposing a given protocol, but internally is managing many discrete
> storage devices (think baby RAID and micro-os in a box).

yes, I know.  that's why I asked: all the extra stuff that it takes 
to make flash usable is done entirely by the controller, which is 
tightly coupled to the disk interface (normally SATA, of course.)

> Right now, many SSD manufacturers are just SSD "repackagers" (including
> OCZ to my knowledge).

it's true that a lot of SATA SSDs contain the same Sandforce controller
with a small variety of different flash types, but there are several vendors
who either developed or acquired their controller IP.  OCZ was a Marvell
client (eg Indilinx), but bought IP and hired a SoC team and now seems 
independent.  Intel and Samsung also both have independent designs afaik.

> They buy a controller design from one place (some
> make this component), SSD packages from someplace else, some channel
> controllers, etc, etc, and strap it all together.  Which is totally

well, I only pay attention to the SATA SSD market, but the media 
controller is in the same chip as the flash controller, wear logic, etc.
so yes, there is some shopping around of flash components, but having
industry-wide flash interface standards is hardly a bad thing.
having so many different-branded SSDs with basically the same Sandforce
controller is a bit odd, but probably just a phase.  some SF-based vendors
do claim to have customized the firmware (Intel, for instance.)

> fine, but the problem arises because the volume for NAND flash packages
> are for SATA based drives.  This results in most of the NAND packages
> within to export a SATA protocol.

that confuses me.  flash chips have a generic interface which I can't 
really see as being at all specific to a particular blockdev interface.

> This requires these re-packaging
> companies to have to translate to and from the SATA and PCIe protocols.

well, PCIe has basically two interfaces: a register-based command interface
and a memory-mapped one.  while I can imagine mapping flash chips directly
into the PCIe memory space, I'm not sure it would be practical to do all
the coddling that flash needs to survive onboard.  a block interface gives
the controller a lot of visibility into, and flexibility over, the stream
of operations, so it can express them in flash-friendly terms.
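
to make "flash-friendly terms" concrete, here is a minimal sketch of the
kind of remapping a controller does behind a block interface.  everything
in it is an assumption for illustration (the ToyFTL name, one logical block
per erase block, erasing immediately rather than deferring to GC); it is
not any vendor's design.

```python
class ToyFTL:
    """Toy flash translation layer: out-of-place writes, naive wear levelling."""

    def __init__(self, num_blocks):
        self.erase_count = [0] * num_blocks  # wear per physical block
        self.free = set(range(num_blocks))   # pre-erased physical blocks
        self.map = {}                        # logical block -> physical block
        self.data = {}                       # physical block -> payload

    def write(self, lba, payload):
        if not self.free:
            raise RuntimeError("no pre-erased blocks left")
        # wear levelling: pick the least-erased free block (ties by index)
        target = min(self.free, key=lambda b: (self.erase_count[b], b))
        self.free.remove(target)
        self.data[target] = payload
        old = self.map.get(lba)
        self.map[lba] = target               # remap; never overwrite in place
        if old is not None:
            self._erase(old)                 # a real controller would defer this

    def _erase(self, block):
        del self.data[block]
        self.erase_count[block] += 1
        self.free.add(block)

    def read(self, lba):
        return self.data[self.map[lba]]
```

rewriting one logical block over and over spreads the erases across all
the physical blocks, which is the whole point: the host sees a stable block
address while the controller moves the data around underneath.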

>  For another explanation, please see the fourth paragraph of:
>
> http://www.anandtech.com/show/4408/microns-p320h-a-custom-controller-native-pcie-ssd-in-350700gb-capacities
>
> Hopefully this explains better the issue I'm referring to.

no, it doesn't.  Micron has simply invented their own
flash-to-disk-interface.  if you're saying "skipping SATA is important",
well, maybe.  it looks from Micron's whitepapers that they are focused
almost entirely on small random reads (not unreasonable).  but that's a 
workload that doesn't stress the oddity of flash (managing pre-erased 
blocks and wear levelling).  maybe I'm picking at semantics, but 
their interface looks block-transfer (disk-like) to me, so I wouldn't 
call it native PCIe.  I'm guessing they have the usual sort of ring
interface to the controller, though perhaps simply with a deeper queue
than SATA permits (NCQ tops out at 32 commands per channel).

> Managing the raw flash at the filesystem or driver level is an option,
> but not what I was talking about (nor what is currently in vogue right
> now, to my knowledge).

well, it's interesting that some people are talking about adding flash
to dram dimms - very unclear to me how that would work.  but maybe they're
merely using the dimm form-factor and proposing a new interface.

> vendors.  In even a year or two this probably won't be the case, but
> right now there is a lot of junk on the market.

I guess I have more respect for SATA than you do.  the Micron thing is 
still just a disk interface - block shuffling.  it arguably removes one
level of protocol reformatting, but I'm not sure how much difference 
that would make to the consumer.  a raid0 across SATA channels does a 
pretty good job of piling up IOPS...

offhand, I think a card that implemented 8x standard SATA-SSD channels
could keep up with the Micron card.  (perhaps not in write endurance,
since Micron uses SLC and everyone else is MLC.)
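
a back-of-envelope sketch of that argument: plain raid0 striping spreads
consecutive stripes across channels, and Little's law (throughput =
outstanding requests / latency) gives an upper bound on IOPS per channel.
the function names are made up here, and any latency you plug in is a
placeholder, not a measurement of a real device.

```python
def raid0_locate(lba, stripe_blocks, channels):
    """Map a logical block to (channel, block on that channel) for plain RAID0."""
    stripe, offset = divmod(lba, stripe_blocks)
    row, channel = divmod(stripe, channels)
    return channel, row * stripe_blocks + offset


def iops_bound(queue_depth, latency_s, channels=1):
    """Little's law upper bound: outstanding requests / latency, per channel,
    summed over channels (assumes requests spread evenly across them)."""
    return channels * queue_depth / latency_s
```

for example, iops_bound(32, 400e-6, channels=8) uses the SATA queue depth
of 32 with an assumed 400us latency at full queue; the bound scales
linearly with the channel count, which is why striping piles up IOPS as
long as the workload really does spread evenly.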

OTOH, a PCIe card that really did map flash blocks directly into the 
memory space would be quite interesting.  it just sounds tricky to get 
the semantics right, given that flash sometimes just has to take a 
breather to catch up on GC/erasures, and would need to profile all the 
memory-mapped IO to each flash block to do wear levelling...
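
a sketch of the bookkeeping that last point implies, assuming a
hypothetical card that could observe every store into its mapped window:
each store is attributed to the erase block(s) it touches, so the wear
logic knows which blocks are hot.  the class name and the idea of a fixed
erase-block size are illustrative assumptions only.

```python
from collections import Counter


class MappedWriteProfiler:
    """Count stores per erase block for a memory-mapped flash window."""

    def __init__(self, erase_block_bytes):
        self.erase_block_bytes = erase_block_bytes
        self.writes = Counter()              # erase block index -> store count

    def record_store(self, addr, length=1):
        # a store may straddle an erase-block boundary; charge every block it touches
        first = addr // self.erase_block_bytes
        last = (addr + length - 1) // self.erase_block_bytes
        for block in range(first, last + 1):
            self.writes[block] += 1

    def hottest(self, n=1):
        """Return the n most-written blocks as (block, count) pairs."""
        return self.writes.most_common(n)
```

even this toy version has to do work on every single store, which is part
of why getting the semantics right on real hardware sounds tricky.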

regards, mark hahn.