[Beowulf] Checkpointing using flash
Lux, Jim (337C)
james.p.lux at jpl.nasa.gov
Fri Sep 21 09:58:10 PDT 2012
On 9/21/12 9:44 AM, "Ellis H. Wilson III" <ellis at cse.psu.edu> wrote:
>On 09/21/12 12:29, Lux, Jim (337C) wrote:
>> Flash is slow, though... SLC NAND flash (pretty fast, 8 Gbit part) is 250
>> microseconds to write a 4 kbyte (approx) page. Erasing is about 700
>> microseconds (reading is 25 microseconds)
>>
>> MLC flash (say 512 Gbit parts with 8 kbyte pages) takes 1.3 milliseconds to
>> write a page, 3.8 ms to erase (75 us to read)... And has a life of 3000
>> write/erase cycles.
>
>Modern MLC has at least a 10k cycle guarantee per-page, and research I'm
>doing at PSU has shown to me at least that this is a very low bar.
>Often it's way higher than that.
Yes.. I've tested 100k cycle SLC and it lasted way more than a million.
However, it's very temperature sensitive.. Get it warm and it forgets and
wears out faster.
>
>> That's 53 Mbps streaming to the part. Yeah, any practical design is going
>> to have multiple interleaved devices, etc. so you can probably do it
>> faster..
>>
>> But still, say you are checkpointing 8Gbyte.. That's 1300 seconds (yep,
>> about 20 minutes), assuming you've previously erased everything.
>
>As you mention there are multiple interleaved devices. Specifically,
>modern flash devices (SSDs, which is what they plan to do this
>checkpointing with) have many layers of parallelism within them --
>channels, packages and dies to be exact. Something like 4-8 channels,
>each having multiple packages on each channel (8-16 I think in modern
>devices) and each package having multiple dies inside (2-4 is common).
>And inside of each die you finally are looking at an individual flash
>page/block/cell/etc.
>
>So you can't calculate their speeds like HDDs -- it doesn't work like
>that.
>
>Basic COTS SSDs can provide upwards of 200MB/s sustained writes until
>erases have to be done or you've filled greater than ~80% of the drive.
> So it's more like 40-60 seconds for 8GB, certainly not 1300 seconds.
>Use PCI-E flash devices and you're looking at much closer to 500MB/s to
>1GB/s, depending on what you are willing to spend.
>
>I mean, think about it -- modern HDDs can easily hit 100MB/s streaming
>sequential writes. At 1300s to do 8GB you're suggesting flash is much
>slower (around 6MB/s) than that, which is definitely not the case.
>Maybe for USB thumb drives or some ridiculous single-deviced medium, but
>not real SSDs (especially PCI-E flash devices).
Yes.. You raise good points. I suppose in this sort of application, one
can afford to use the "optimized for streaming write" sort of devices.
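
For reference, a rough sketch of the arithmetic behind the two estimates in
this exchange. It assumes decimal units (1 Gbyte = 1e9 bytes), a single die
programming pages back to back with no interleaving, and the sustained rates
quoted above. The helper names are made up for illustration.

```python
# Back-of-envelope checkpoint times from the figures quoted in this thread.
# Helper names are hypothetical; decimal units (1 Gbyte = 1e9 bytes) assumed.

def page_write_rate(page_bytes, program_seconds):
    """Bytes/s for one die programming pages back to back (no interleaving)."""
    return page_bytes / program_seconds


def checkpoint_seconds(checkpoint_bytes, bytes_per_second):
    """Time to stream one checkpoint at a given sustained write rate."""
    return checkpoint_bytes / bytes_per_second


CHECKPOINT = 8e9  # 8 Gbyte checkpoint, as in the thread

# 8 kbyte page, 1.3 ms program time (the MLC figures quoted above)
mlc_die = page_write_rate(8192, 1.3e-3)

rates = {
    "single MLC die": mlc_die,
    "COTS SSD, 200 MB/s": 200e6,
    "PCI-E flash, 1 GB/s": 1e9,
}

for name, rate in rates.items():
    secs = checkpoint_seconds(CHECKPOINT, rate)
    print(f"{name}: {rate / 1e6:.1f} MB/s, {secs:.0f} s per checkpoint")
```

The single-die case comes out at roughly 6 MB/s and about 1300 seconds, and
the 200 MB/s case at 40 seconds, so the gap between the two estimates is
entirely the parallelism inside the SSD.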
>
>> Fast compared to disk, maybe, but very slow. Why not just mirror memory
>> (other than cost and power: RAM is much less dense than flash)
>
>The cost and power concerns you mention with RAM mirroring are
>absolutely huge. Flash is a steal compared to RAM on both counts.
Is it.. I haven't looked at the power consumption of DRAM recently, but I
suppose compared to RAM the flash doesn't draw any power between
checkpoints, and the duty cycle is low, so even if the flash part *is*
drawing a lot of power it's ok (the 512Gb parts I was looking at are on
the order of 50mA at 3.3V during a program/write operation)
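
A rough sketch of what that program current implies per checkpoint. It
assumes the 50 mA at 3.3 V is drawn for the full 1.3 ms page program time
quoted earlier for the same 512 Gbit parts, and it ignores controller,
interface, and erase energy, so treat it as a lower bound. The function names
are made up for illustration.

```python
# Energy to program one 8 Gbyte checkpoint into MLC NAND, flash array only.
# Assumptions (not measurements): 50 mA at 3.3 V for the whole 1.3 ms page
# program time, 8 kbyte pages, decimal Gbyte. Names are hypothetical.

def program_power_watts(current_amps, volts):
    """Power drawn by one part while it is programming a page."""
    return current_amps * volts


def checkpoint_energy_joules(checkpoint_bytes, page_bytes, program_seconds, watts):
    """Total program energy; independent of how many parts work in parallel."""
    pages = checkpoint_bytes / page_bytes
    return pages * program_seconds * watts


power = program_power_watts(0.050, 3.3)
energy = checkpoint_energy_joules(8e9, 8192, 1.3e-3, power)

print(f"program power per part: {power * 1000:.0f} mW")
print(f"program energy per 8 Gbyte checkpoint: {energy:.0f} J")
```

Under those assumptions it is on the order of a couple hundred joules per
checkpoint, however the page writes are spread across parts.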
>
>> There's also the write cycle limit.. If you're looking for very high
>> densities (USB thumb drive) you're looking at
>> A) serial interfaces
>> B) MLC NAND with maybe 10k cycle life on each page
>
>Let's say you do one checkpoint that saturates your flash every 4 hours
>and let the flash trickle that out to the underlying HDDs over the next
>4 hours before your next checkpoint. Even with MLC (10k guarantee)
>that's around 5 years before you hit the guarantee, and I bet you'll be
>able to go a while after that. Given that major supers don't last 5
>years, this is a non-issue.
Yes.. If that's the frequency of checkpoints. I was thinking more like 1
checkpoint per second or 10 seconds.
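
A minimal sketch of how strongly the wear-out estimate depends on checkpoint
frequency. It assumes every checkpoint costs exactly one write/erase cycle on
every page (perfect wear leveling, no write amplification), which is
optimistic. The function name is made up for illustration.

```python
# Time until the rated write/erase cycle count is reached, as a function of
# how often a checkpoint fills the flash. Assumes one cycle per page per
# checkpoint (an idealization); the name below is hypothetical.

def hours_to_cycle_limit(cycle_limit, checkpoint_interval_s):
    """Hours of operation before every page has used its rated cycles."""
    return cycle_limit * checkpoint_interval_s / 3600.0


HOURS_PER_YEAR = 365.25 * 24

for label, interval_s in [("every 4 hours", 4 * 3600),
                          ("every 10 seconds", 10),
                          ("every second", 1)]:
    hours = hours_to_cycle_limit(10_000, interval_s)
    print(f"10k cycles, checkpoint {label}: "
          f"{hours:.1f} hours ({hours / HOURS_PER_YEAR:.3f} years)")
```

At one checkpoint every 4 hours the 10k-cycle rating lasts 40,000 hours, a
bit over 4.5 years. At one per 10 seconds it is gone in about 28 hours, and at
one per second in under 3 hours.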
>