[Beowulf] Big storage

Bruce Allen ballen at gravity.phys.uwm.edu
Fri Aug 24 14:03:40 PDT 2007


Hi Jeff,

OK, I see the point.  You are not worried about multiple unreadable 
sectors making it impossible to reconstruct lost data.  You are worried 
about 'whole disk' failure.

I definitely agree that this is a possible problem.  In fact we operate 
all of our UWM data archives (about 300 TB) as RAID-6 to reduce the 
probability of this.  The idea of a second disk failing in a RAID-5 array 
during rebuild does not make for a good night's sleep!
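
For a sense of scale, here is a rough back-of-envelope sketch of the
second-failure risk (the failure rate and rebuild time below are
assumptions for illustration, not measurements from our arrays):

    # Chance that a *second* whole disk fails while a RAID-5 rebuild runs.
    # Assumes independent failures and a constant annual failure rate (AFR).
    N_DISKS = 12          # surviving disks in the array during the rebuild
    AFR = 0.03            # assumed 3% annual failure rate per disk
    REBUILD_HOURS = 24    # assumed rebuild time for one large SATA disk

    HOURS_PER_YEAR = 24 * 365
    p_one = AFR * REBUILD_HOURS / HOURS_PER_YEAR       # per surviving disk
    p_second = 1 - (1 - p_one) ** N_DISKS              # any survivor failing
    print("second disk failure during rebuild: %.2f%%" % (100 * p_second))

With these made-up inputs that is only about 0.1% per rebuild, but it
compounds across many arrays and many rebuilds, and an unrecoverable read
error during the rebuild is far more likely still.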

Cheers,
 	Bruce

On Fri, 24 Aug 2007, Jeffrey B. Layton wrote:

> Bruce,
>
> I urge you to read Garth's comments. Your description of what
> RAID controllers do is very good when there are no failed drives.
> If a drive fails though, you can't scan the disks looking for bad
> sectors.
>
> During a reconstruction, the RAID controller is reconstructing
> the data based on the remaining drives and the parity.
> Unfortunately, the controller is likely to be block-based, so it has
> to rebuild every block of the failed disk. But if the controller hits
> a URE while doing the reconstruction, the reconstruction process just
> stops and the controller cries uncle. That means restoring the failed
> array, the entire volume, from a backup.
>
> With drives getting larger and larger all the time, the window of
> vulnerability during reconstruction (where a second drive failure
> will fail the entire volume) has grown because it takes longer and
> longer to reconstruct so much data. This is why people are moving
> to RAID-6. But RAID-6 is expensive in terms of capacity and performance
> (Note: it has worse write performance than RAID-5). It gives the
> ability to tolerate a second drive failure, but it may not reduce the
> window of vulnerability during reconstruction because it takes longer
> to reconstruct.
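>
> (For what it's worth, one way to see the write penalty mentioned above is
> the classic read-modify-write accounting for small random writes; a toy
> sketch, ignoring caches and full-stripe writes:)
>
>     # Parity RAID small write: read old data and old parity block(s),
>     # then write new data and new parity block(s).
>     def small_write_ios(parity_disks):
>         return (1 + parity_disks) * 2
>
>     print("RAID-5 I/Os per small random write:", small_write_ios(1))   # 4
>     print("RAID-6 I/Os per small random write:", small_write_ios(2))   # 6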
>
> Here's an article where Garth talks about this (it's at the end):
>
> http://www.eweek.com/article2/0,1895,2168821,00.asp
>
> I wanted to note one quick thing from the article:
>
> "The probability of the disk failing to read back data is the same as
> it was long ago, so today you can expect at least one failed read every
> 10TB to 100TB. But the reconstruction of a failed 500GB disk in an
> 11-disk array has to read 5TB, so there can be an unacceptably large
> chance of failure to rebuild every one of the 1 billion sectors on the
> failed disk."
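>
> A quick sanity check of those numbers (a back-of-envelope sketch using an
> assumed URE spec of one error per 1e14 bits read, which is the usual SATA
> figure, not vendor data for any particular drive):
>
>     URE_RATE = 1e-14              # unrecoverable read errors per bit read
>     BYTES_TO_READ = 10 * 500e9    # rebuilding a 500GB disk in an 11-disk
>                                   # array means reading 5TB from the survivors
>     SECTOR = 512
>
>     bits = BYTES_TO_READ * 8
>     p_hit_ure = 1 - (1 - URE_RATE) ** bits
>     print("sectors on the failed disk: %.1e" % (500e9 / SECTOR))    # ~1e9
>     print("chance the rebuild hits a URE: %.0f%%" % (100 * p_hit_ure))
>
> With those assumptions the rebuild hits a URE roughly one time in three, so
> "the reconstruction process just stops" is not a rare corner case.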
>
> So if a reconstruction fails, you have to copy 5TB of data from the
> backup to the volume. If you do this from tape, you're going to wait
> a long time. You can do it from a disk backup, but it still may take
> some time to move 5TB across the wire, depending upon how you have
> everything connected.
>
> Jeff
>
>
>> Hi Jeff,
>> 
>> For this reason, in a RAID system with a lot of disks it is important to 
>> scan the disks looking for unreadable (UNC = uncorrectable) data blocks on 
>> a regular basis.  If these are found, then the missing data at that Logical 
>> Block Address (LBA) has to be reconstructed from the *other* disks and 
>> rewritten onto the disk that returned the read error.
>> 
>> In a well-designed (hardware or software) RAID implementation, you can 
>> reconstruct the missing data by only reading a handful of logical blocks 
>> from the redundant disks.  It is not necessary to read the entire disk 
>> surface just to get a few 512-byte sectors of data.  So a read failure on 
>> different data somewhere else on the disk should not (in principle) prevent 
>> reconstruction of the lost/missing data.  In a poorly-designed RAID 
>> implementation, you have to read the ENTIRE disk surface to get data from a 
>> few sectors.  In this case, another uncorrectable disk sector can be 
>> crippling.
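>>
>> To make "only a handful of blocks" concrete, here is a toy sketch of
>> single-block RAID-5 reconstruction (real controllers do this per stripe,
>> with rotating parity and larger chunk sizes):
>>
>>     def xor_blocks(blocks):
>>         """XOR equal-sized byte blocks together."""
>>         out = bytearray(len(blocks[0]))
>>         for blk in blocks:
>>             for i, b in enumerate(blk):
>>                 out[i] ^= b
>>         return bytes(out)
>>
>>     # One stripe: four 512-byte data blocks plus their parity block.
>>     data = [bytes([d] * 512) for d in (1, 2, 3, 4)]
>>     parity = xor_blocks(data)
>>
>>     # "Lose" block 2 and rebuild it from the corresponding blocks only.
>>     rebuilt = xor_blocks([data[0], data[1], data[3], parity])
>>     assert rebuilt == data[2]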
>> 
>> Most good hardware RAID cards have an option for continuous disk scanning. 
>> ARECA, for example, calls this 'consistency checking'.  It should be done 
>> on a regular basis.
>> 
>> You can also use smartmontools to do this, by carrying out regular read 
>> scans of the disk surface and then forcing a RAID consistency check/rebuild 
>> if there is a read failure at some disk block.
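>>
>> A minimal sketch of that idea (the device names and the md array path are
>> assumptions for a Linux software RAID box; a hardware controller would use
>> its own management tool instead):
>>
>>     import subprocess
>>
>>     DISKS = ["/dev/sda", "/dev/sdb", "/dev/sdc"]   # adjust to the hardware
>>     MD_SYNC = "/sys/block/md0/md/sync_action"      # Linux md (software RAID)
>>
>>     for disk in DISKS:
>>         # Start a SMART long self-test: a full surface read scan done by
>>         # the drive firmware; results show up later in 'smartctl -a'.
>>         subprocess.run(["smartctl", "-t", "long", disk], check=True)
>>
>>     # Ask md to verify the whole array so that unreadable sectors get
>>     # reconstructed from redundancy and rewritten.
>>     with open(MD_SYNC, "w") as f:
>>         f.write("check")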
>> 
>> Note that continuous scanning is also needed for ECC memory, to prevent 
>> correctable single-bit errors from becoming uncorrectable double-bit 
>> errors.  In the RAM/memory context it is called 'memory scrubbing'.
>> 
>> Cheers,
>>     Bruce
>> 
>> On Thu, 23 Aug 2007, Jeffrey B. Layton wrote:
>> 
>>> This isn't really directed at Jeff, but it seemed like a good segue
>>> for a comment. Everyone - please read some of Garth Gibson's recent
>>> articles about large-capacity disks and large numbers of disks in a
>>> RAID group. Just to cut to the chase, given the
>>> Unrecoverable Read Error (URE) rate and large disks, during
>>> a rebuild you are almost guaranteed to hit a URE. When that
>>> happens, the rebuild stops and you have to restore everything
>>> from a backup. RAID-6 can help, but given enough disks and
>>> large enough disks, the same thing can happen (plus RAID-6
>>> rebuilds take longer since there are more computations involved).
>>> 
>>> Jeff
>>> 
>>> P.S. I guess I should disclose that my day job is at Panasas. But
>>> regardless, I would recommend reading some of Garth's comments.
>>> Maybe I can also get one of his presentations to pass around.
>>> 
>>> P.P.S. If you don't know Garth, he's one of the fathers of RAID.
>>> 
>>>> Hello Jakob,
>>>> A couple of things...
>>>> 1. ClusterFS has an easy-to-understand calculation on why RAID-6 is
>>>> necessary for the number of disks you're considering. You do need to
>>>> plan for multi-disk failure, especially with the rebuild time of 1TB
>>>> disks.
>>>> http://manual.lustre.org/manual/LustreManual16_HTML/DynamicHTML-10-1.html#wp1037512 
>>>> 
>>>> 2. Avoid tape if you can. At this scale, the administrative time and
>>>> costs far outweigh the benefits. Of course if you need to move your
>>>> data to a secure vault that's another thing. If you really want to do
>>>> tape, some people choose to do disk > disk > tape. This eliminates the
>>>> read interrupts on the primary storage and provides some added
>>>> redundancy.
>>>> 
>>>> 3. We do use Nexsan's SATABeasts for storage similar to this. Without
>>>> commenting on costs, the JackRabbit is technologically superior.
>>>> 
>>>> Thanks,
>>>>                 jeff
>>>> 
>>>> On 8/23/07, Jakob Oestergaard <jakob at unthought.net> wrote:
>>>> 
>>>>> On Thu, Aug 23, 2007 at 07:56:15AM -0400, Joe Landman wrote:
>>>>> 
>>>>>> Greetings Jakob:
>>>>>> 
>>>>>> 
>>>>> Hi Joe,
>>>>> 
>>>>> Thanks for answering!
>>>>> 
>>>>> ...
>>>>> 
>>>>>> up front disclaimer: we design/build/market/support such things.
>>>>>> 
>>>>> That does not disqualify you  :)
>>>>> 
>>>>> 
>>>>>>> I'm looking at getting some big storage. Of all the parameters, getting 
>>>>>>> dollars/(month*GB) as low as possible is by far the most important. The 
>>>>>>> price of acquiring and maintaining the storage solution is the number 
>>>>>>> one concern.
>>>>>>> 
>>>>>> Should I presume density, reliability, and performance also factor in
>>>>>> somewhere as 2,3,4 (somehow) on the concern list?
>>>>>> 
>>>>> I expect that the major components of the total cost of running this 
>>>>> beast will be something like
>>>>>
>>>>>    acquisition
>>>>>  + power
>>>>>  + cooling
>>>>>  + payroll (disk-replacing admins :)
>>>>> 
>>>>> Real estate is a concern as well, of course. The rent isn't free. It would 
>>>>> be nice to pack this in as few racks as possible.  Reliability, well... I 
>>>>> expect frequent drive failures, and I would expect that we'd run some form 
>>>>> of RAID to mitigate this. If the rest of the hardware is just reasonably 
>>>>> well designed, the most frequently failing components should be redundant 
>>>>> and hot-swap replaceable (fans and PSUs).
>>>>> 
>>>>> It's acceptable that a head-node fails for a short period of time. The 
>>>>> entire system will not depend on all head nodes functioning simultaneously.
>>>>> 
>>>>> 
>>>>>>> The setup will probably have a number of "head nodes" which receive a 
>>>>>>> large amount of data over standard gigabit from a large number of remote 
>>>>>>> sources. Data is read infrequently from the head nodes by remote systems. 
>>>>>>> The primary load on the system will be data writes.
>>>>>>> 
>>>>>> Ok, so you are write dominated.  Could you describe (guesses are fine)
>>>>>> what the writes will look like?  Large sequential data, small random
>>>>>> data (seek, write, close)?
>>>>>> 
>>>>> I would expect something like 100-1000 simultaneous streaming writes to 
>>>>> just as many files (one file per writer). The files will be everything 
>>>>> from a few hundred MiB to many GiB.
>>>>> 
>>>>> I guess that on most filesystems these streaming sequential writes will 
>>>>> result in something close to "random writes" to the block layer. However, 
>>>>> we can be very generous with write buffering.
>>>>> 
>>>>> 
>>>>>>> The head nodes need not see the same unified storage; so I am not 
>>>>>>> required to have one big shared filesystem. If beneficial, each of the 
>>>>>>> head nodes could have their own local storage.
>>>>>>> 
>>>>>> There are some interesting designs with a variety of systems, including
>>>>>> GFS/Lustre/... on those head nodes, and a big pool of drives behind
>>>>>> them.  These designs will add to the overall cost, and increase 
>>>>>> complexity.
>>>>>> 
>>>>> Simple is nice :)
>>>>> 
>>>>> 
>>>>>>> The storage pool will start out at around 100TiB and will grow to ~1PiB 
>>>>>>> within a year or two (too early to tell). It would be nice to use as few 
>>>>>>> racks as possible, and as little power as possible  :)
>>>>>>> 
>>>>>> Ok, so density and power are important.  This is good.  Coupled with 
>>>>>> the low management cost and low acquisition cost, we have about 3/4 of 
>>>>>> what we need.  Just need a little more description of the writes.
>>>>>> 
>>>>> I hope the above helped.
>>>>> 
>>>>> 
>>>>>> Also, do you intend to back this up?
>>>>>> 
>>>>> That is a *very* good question.
>>>>> 
>>>>> 
>>>>>> How important is resiliency of the
>>>>>> system?  Can you tolerate a failed unit (assume the units have hot
>>>>>> spares, RAID-6, etc.)?
>>>>>> 
>>>>> Yes. Single head nodes may fail. They must be fairly quick to get back 
>>>>> online (having a replacement box, I would expect no more than an hour of 
>>>>> downtime).
>>>>> 
>>>>> 
>>>>>> When you look at storage of this size, you have to
>>>>>> start planning for the eventual (and likely) failure of a chassis (or
>>>>>> some number of them), and think about a RAIN configuration.
>>>>>> 
>>>>> Yep. I don't know how likely a "many-disk" failure would be... If I have 
>>>>> a full replacement chassis, I would guess that I could simply pull out all 
>>>>> the disks from a failed system, move them to the replacement chassis and 
>>>>> be up and running again in "short" time.
>>>>> 
>>>>> If a PSU decides to fry everything connected to it, including the disks, 
>>>>> then yes, I can see the point in RAIN or a full backup.
>>>>> 
>>>>> It's a business decision whether a full node loss would be acceptable. I 
>>>>> honestly don't know that, but it is definitely interesting to consider 
>>>>> both "yes" and "no".
>>>>> 
>>>>> 
>>>>>> Either
>>>>>> that, or invest in massive low-level redundancy (which should be
>>>>>> scope-limited to the box it is on anyway).
>>>>>> 
>>>>> Yes; I had something like RAID-5 or so in mind on the nodes.
>>>>> 
>>>>> 
>>>>>>> It *might* be possible to offload older files to tape; does anyone have 
>>>>>>> experience with HSM on Linux?  Does it work?  Could it be worthwhile to 
>>>>>>> investigate?
>>>>>>> 
>>>>>> Hmmm...  First, I would suggest avoiding tape; you should likely be
>>>>>> looking at disk-to-disk backup, using slower nearline mechanisms.
>>>>>> 
>>>>> Why would you avoid tape?
>>>>> 
>>>>> Let's say there was software which allowed me to offload data to tape in 
>>>>> a reasonable manner. Considering the running costs of disk versus tape, 
>>>>> tape would win hands down on power, cooling and replacements.
>>>>> 
>>>>> Sure, the random seek time of a tape library sucks golf balls through a 
>>>>> garden hose, but assuming that one could live with that, are there more 
>>>>> important reasons to avoid tape?
>>>>> 
>>>>> 
>>>>>>> One setup I was looking at is simply using SunFire X4500 systems (you 
>>>>>>> can put 48 standard 3.5" SATA drives in each 4U system). Assuming I can 
>>>>>>> buy them with 1T SATA drives shortly, I could start out with 3 systems 
>>>>>>> (12U) and grow the entire setup to 1P with 22 systems in a little over 
>>>>>>> two full racks.
>>>>>>> 
>>>>>>> Any better ideas?  Is there a way to get this more dense without paying 
>>>>>>> an arm and a leg?  Has anyone tried something like this with HSM?
>>>>>>> 
>>>>>> Yes, but I don't want to turn this into a commercial, so I will be
>>>>>> succinct.  Scalable Informatics (my company) has a similar product,
>>>>>> which does have a good price and price per gigabyte, while providing
>>>>>> excellent performance.  Details (white paper, benchmarks, presentations) 
>>>>>> at the http://jackrabbit.scalableinformatics.com web site.
>>>>>> 
>>>>> Yep, I was just looking at that actually.
>>>>> 
>>>>> The hardware looks similar in concept to the SunFire, but as I see it 
>>>>> you guys have thought about a number of services on top of that 
>>>>> (RAIN etc.).
>>>>> 
>>>>> 
>>>>> Very interesting!
>>>>> 
>>>>> -- 
>>>>>
>>>>>  / jakob
>>>>> 
>>>>> _______________________________________________
>>>>> Beowulf mailing list, Beowulf at beowulf.org
>>>>> To change your subscription (digest mode or unsubscribe) visit 
>>>>> http://www.beowulf.org/mailman/listinfo/beowulf
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>> 
>>> _______________________________________________
>>> Beowulf mailing list, Beowulf at beowulf.org
>>> To change your subscription (digest mode or unsubscribe) visit 
>>> http://www.beowulf.org/mailman/listinfo/beowulf
>>> 
>> 
>
>


