[Beowulf] SSD caching for parallel filesystems

Mark Hahn hahn at mcmaster.ca
Mon Feb 11 21:04:07 PST 2013


this is getting absurd.  I think we all know the relative prices
and performances of off-the-shelf disks/ssd/ram.  each has peculiarities
that make its use somewhat complex.

- with disks, you have to think about seek time, since it can range from
zero to ~15ms.  for some workloads, a saving grace is that with request
sorting (helped also by disk-level queue reordering), you can perform
several transactions along the way.  disks have historically followed
a pretty steep curve of improved density, somewhat akin to Moore's law,
which has delivered ever-higher capacity and attendant bandwidth.

- with ram, you get the proverbial random access.  like most proverbs,
that's only a little true: ram has banking and paging effects, and 
emphatically rewards sequential access.  it also suffers from a very
stiff industry that hasn't managed to adopt a transactional interface,
even though cpus have ever more internal concurrency.  (caches have let
dram designers stay very lazy...)

- with flash, you'll probably never have a random-access interface - 
it'll always be a disk-like block-transfer thing.  why?  because flash
has to be remapped to be useful, and that remapping has to change 
during use (indeed, *because of* use).
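
to make the remapping point concrete, here is a toy sketch of a flash
translation layer.  everything in it (the TinyFTL class, its method names,
the page counts) is made up for illustration and is not any vendor's
design; it only assumes that flash pages can't be overwritten in place,
so each rewrite of a logical block has to land on a fresh physical page.

```python
class TinyFTL:
    """Toy flash translation layer (hypothetical): logical block -> physical page."""

    def __init__(self, num_pages):
        self.free = list(range(num_pages))  # erased pages, ready to be programmed
        self.map = {}                       # logical block address -> physical page
        self.stale = set()                  # pages holding superseded data
        self.data = {}                      # physical page -> contents

    def write(self, lba, payload):
        # no in-place overwrite: every write consumes a fresh erased page
        if not self.free:
            raise RuntimeError("no erased pages left; a real FTL would garbage-collect here")
        page = self.free.pop(0)
        self.data[page] = payload
        old = self.map.get(lba)
        if old is not None:
            self.stale.add(old)             # the old copy becomes garbage
        self.map[lba] = page                # the mapping changes *because of* use
        return page

    def read(self, lba):
        return self.data[self.map[lba]]


ftl = TinyFTL(num_pages=4)
first = ftl.write(7, b"old")
second = ftl.write(7, b"new")               # same logical block, different physical page
```

the host only ever sees block 7; where it physically lives moves on every
write, which is why the interface stays block-oriented.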

the discussion of PCIe and NVMe was pretty much a diversion, since 
neither substantially alters the block-transfer nature of flash.
yes, something like NVMe does simplify the protocol being employed, but
it's still a mechanism for queueing block-transfer requests, like any 
IO device (SATA, SCSI, RAID, even eth/IB networks.)
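
a queue of block requests is also what makes the request sorting
mentioned for disks possible.  the sketch below is only an illustration
under simple assumptions: positions are abstract track numbers, seek cost
is taken as proportional to distance travelled, and the function names
and the sample queue are invented for the example.

```python
def head_travel(start, requests):
    """Total distance the head moves when servicing requests in the given order."""
    pos, total = start, 0
    for r in requests:
        total += abs(r - pos)
        pos = r
    return total


def elevator_order(start, requests):
    """One sweep upward from the start position, then one sweep back down."""
    up = sorted(r for r in requests if r >= start)
    down = sorted((r for r in requests if r < start), reverse=True)
    return up + down


# hypothetical pending queue of track numbers, and a hypothetical head position
queue = [98, 183, 37, 122, 14, 124, 65, 67]
start = 53

fifo = head_travel(start, queue)                          # arrival order
swept = head_travel(start, elevator_order(start, queue))  # sorted sweep
print(fifo, swept)
```

with this made-up queue the sorted sweep covers less than half the
distance of arrival order, servicing several requests "along the way".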

it would be amusing to see a flash vendor take a page from networks,
and offer "flash rdma".  but frankly, I'm not sure there's enough niche
for high-end flash at all.  high-volume devices will always just follow
the capacity-performance bounds of current flash fabs.  an awfully big
chunk of the IT industry wants *distributed* performance (the Googles 
of the world) and won't normally want to pay the premium of a PCIe flash
device, since they can get arbitrary aggregate performance with big clusters,
and need big clusters anyway for other reasons.

