[Beowulf] Big storage

Bruce Allen ballen at gravity.phys.uwm.edu
Thu Aug 30 01:44:19 PDT 2007


Hi Jeffrey,

OK, I agree with your points below.  I'm glad we have converged!  To 
summarize:

(1) With RAID-5, if a disk fails, then even a single uncorrectable sector 
on the remaining disks can destroy your filesystem and data.

(2) With RAID-6, you are protected against this scenario if only one disk 
fails.  If two disks fail, then the problem is again present.

(3) The probability of an uncorrectable sector on a large modern disk is 
high, so scenario (1) is very probable.  You can greatly reduce this 
probability by ensuring that the RAID controller scans the disks 
continuously to identify (and then rewrite and reallocate) uncorrectable 
sectors.
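
As a rough back-of-the-envelope sketch of (3), assuming the commonly quoted
consumer SATA spec of one unrecoverable read error per 10^14 bits read, and
treating errors as independent (the disk size and array width below are just
illustrative), a few lines of arithmetic give the flavor:

    # Chance of hitting at least one URE while rebuilding a degraded RAID-5
    ure_per_bit = 1e-14   # assumed spec: 1 unrecoverable error per 1e14 bits
    disk_tb = 1.0         # illustrative per-disk capacity, in TB
    n_disks = 8           # illustrative array width (one disk has failed)

    bits_read = (n_disks - 1) * disk_tb * 1e12 * 8  # surviving disks, end to end
    p_ure = 1.0 - (1.0 - ure_per_bit) ** bits_read
    print("P(>=1 URE during rebuild) = %.2f" % p_ure)   # ~0.43 with these numbers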

As for my own experience and recommendations: I don't buy RAID-5 
systems. I only purchase RAID-6 systems that carry out regular 
background/repair scans for uncorrectable sectors.
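
For Linux software RAID (md), for example, the kernel exposes a scrub trigger
that can be driven from cron; a minimal sketch, assuming an array named md0
(hardware controllers have their own equivalent, e.g. a scheduled
'consistency check'):

    # Start a background scrub ("check") of a Linux md array and report its
    # current sync state.  Assumes a software RAID device named md0.
    md = "md0"
    with open("/sys/block/%s/md/sync_action" % md, "w") as f:
        f.write("check\n")
    with open("/sys/block/%s/md/sync_action" % md) as f:
        print("%s sync_action: %s" % (md, f.read().strip()))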

Cheers,
 	Bruce

On Mon, 27 Aug 2007, Jeffrey B. Layton wrote:

> Bruce,
>
> IMHO the fundamental problem is not necessarily the bad sectors
> that happen from time to time, although you have to have some
> way of recovering the data (I don't know much about specific
> RAID cards and what they do, but I'm pretty sure that a number
> of storage vendors don't scan for bad sectors at all). I don't
> believe this is the main point.
>
> I think the point is that if a RAID array has a bad disk (for whatever
> reason) then the array has to be reconstructed from the
> remaining data and parity. During this reconstruction process,
> the probability of encountering a read error is high. The probability
> depends upon the number of disks, and the URE rate.
>
> If you have a RAID-5 volume (N disks) and you are rebuilding and
> hit a read error, the reconstruction stops and you have to restore from
> backup. If you have a RAID-6 volume (N disks) and one disk has
> failed (N-1) and you are reconstructing, then the reconstruction can
> continue because you have the ability to tolerate two failed disks.
> I'm not really sure what happens if, during the reconstruction with
> N-1 disks, it hits a read error. It may reconstruct the bad block from
> the other N-2 drives, or it may just mark that drive as down and
> continue the rebuild from the remaining N-2 disks.
>
> In general you are vulnerable during the reconstruction period. If
> you have a RAID-5 volume, lose a disk and start reconstruction, you
> have a period of time where if you lose another disk you will lose
> all the data on the volume. You could also consider hitting a read
> error during reconstruction as a "failure". How long this period of
> time is, is fairly important. If you can reconstruct during this time
> period, you are fine (if you have enough disks for a spare or you
> can put a disk in to act as a spare).
>
> If you have a RAID-6 volume, lose a disk and start reconstruction,
> you also have a period of time where you are vulnerable.  The problem
> with RAID-6 is that it takes more work to reconstruct the data. So
> while you have some extra protection from the second disk, it takes
> longer to reconstruct the data. I don't know the reconstruction times
> of RAID-5 vs. RAID-6 unfortunately. So this window may be larger
> or smaller than the RAID-5 window. I'm guessing that it's smaller,
> but I don't know for sure.
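>
> To get a feel for how long that window actually is, a rough sketch (the
> disk size and sustained rebuild rate below are only guesses):
>
>     # Back-of-the-envelope rebuild window (illustrative numbers only)
>     disk_gb      = 1000      # assumed capacity of the failed disk, in GB
>     rebuild_mb_s = 50        # assumed sustained rebuild rate, in MB/s
>
>     hours = disk_gb * 1000.0 / rebuild_mb_s / 3600.0
>     print("rebuild window: about %.1f hours" % hours)   # ~5.6 hours here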
>
> I think there are several important points here.
>
> 1. The sectors on disk need to be scanned continually to find bad
> sectors (to have them remapped and have the data on the sectors
> rebuilt).
>
> 2. If you have a RAID controller and a RAID-5 volume and lose a
> disk and then hit a read error, the volume is failed and you have to
> restore the volume from backup. As disks get bigger it could take
> a long time to do this.
>
> 3. If you have a RAID controller and a RAID-6 volume and lose
> a disk, then you can reconstruct. I'm not sure what a read error
> does on the remaining N-1 disks, you might or might not have
> problems.
>
> So it's reconstruction that is a concern.
>
> Jeff
>
>> Jeff,
>> 
>> I did read Garth's comments. I believe that there are two types of possible 
>> problems:
>> 
>> (1) A sector or handful of sectors on a disk become unreadable
>> (2) An entire disk fails (all sectors become unreadable)
>> 
>> Problems of type (1) can be handled well by high-quality RAID
>> implementations.  They are not serious, in principle, because the necessary
>> redundant data for those few blocks exists elsewhere on the array, and is
>> statistically very unlikely to also be unreadable. Also, a high-quality
>> implementation regularly scans the disks looking for uncorrectable blocks, so
>> that these can be rewritten from redundant data. A high-quality RAID-6 
>> implementation can also handle failures of type (1) on the redundant disks, 
>> even when rebuilding one of two failed disks. More serious is the problem 
>> of having two failed disks (2) and THEN encountering unreadable sectors on 
>> the remaining disks.
>> 
>> In short, as I see it, the real issue is with failed disks, not with 
>> unreadable sectors.  Unreadable sectors are unlikely to happen at the same 
>> LBAs on two disks, unless the entire disk has failed. So the right question 
>> is (for RAID-6) what is the probability of two failed disks within the 
>> rebuild time window, and how likely is it that uncorrectable sectors have 
>> appeared during that time?
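>>
>> A crude way to put a number on the first part of that question, assuming
>> independent drive failures and a constant failure rate (the annual failure
>> rate, array width and rebuild window below are only guesses):
>>
>>     # Rough odds of a second whole-disk failure inside the rebuild window
>>     afr          = 0.03     # assumed annual failure rate per drive (3%)
>>     n_surviving  = 7        # drives still in the array (illustrative)
>>     window_hours = 24.0     # assumed time to finish the rebuild
>>
>>     p_per_drive = afr * window_hours / (24 * 365)
>>     p_second    = 1 - (1 - p_per_drive) ** n_surviving
>>     print("P(another disk fails during rebuild) ~ %.4f" % p_second)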
>> 
>> Cheers,
>>     Bruce
>> 
>> 
>> On Fri, 24 Aug 2007, Jeffrey B. Layton wrote:
>> 
>>> Bruce Allen wrote:
>>>> Hi Jeff,
>>>> 
>>>> OK, I see the point.  You are not worried about multiple unreadable 
>>>> sectors making it impossible to reconstruct lost data.  You are worried 
>>>> about 'whole disk' failure.
>>> 
>>> Well, no actually. I'm worried about unrecoverable reads on the
>>> remaining disks during reconstruction. :) Is that what you are referring
>>> to?
>>> 
>>>> I definitely agree that this is a possible problem.  In fact we operate 
>>>> all of our UWM data archives (about 300 TB) as RAID-6 to reduce the 
>>>> probability of this.  The idea of a second disk failing in a RAID-5 array 
>>>> during rebuild does not make for a good night's sleep!
>>> 
>>> Did you see Garth's comments? Even using a number of 500GB drives
>>> greatly increases the probability of a URE during reconstruction. RAID-6
>>> helps you sleep, but not as much as you think :) Scares the cr** out of me.
>>> I'm looking to build a home server and I think I'm going to do RAID-61
>>> to give myself some extra protection. I just have to figure out how to
>>> power all of them and find a case where they can fit and a motherboard
>>> with enough SATA connectors :)
>>> 
>>> Enjoy!
>>> 
>>> Jeff
>>> 
>>>> 
>>>> Cheers,
>>>>     Bruce
>>>> 
>>>> On Fri, 24 Aug 2007, Jeffrey B. Layton wrote:
>>>> 
>>>>> Bruce,
>>>>> 
>>>>> I urge you to read Garth's comments. Your description of what
>>>>> RAID controllers do is very good when there are no failed drives.
>>>>> If a drive fails though, you can't scan the disks looking for bad
>>>>> sectors.
>>>>> 
>>>>> During a reconstruction, the RAID controller is reconstructing
>>>>> the data based on the remaining drives and the parity.
>>>>> Unfortunately, the controller is likely to be block based so it has
>>>>> to rebuild every block of the failed disk. But if the controller is
>>>>> doing a reconstruction and hits a URE, then the reconstruction
>>>>> process just stops and the controller cries uncle. This means you
>>>>> have to restore the failed array from a backup. This means the
>>>>> entire volume.
>>>>> 
>>>>> With drives getting larger and larger all the time, the window of
>>>>> vulnerability during reconstruction (where a second drive failure
>>>>> will fail the entire volume) has grown because it takes longer and
>>>>> longer to reconstruct so much data. This is why people are moving
>>>>> to RAID-6. But RAID-6 is expensive in terms of capacity and performance
>>>>> (Note: it has worse write performance than RAID-5). It gives the
>>>>> ability to tolerate a second drive failure, but it may not reduce the
>>>>> window of vulnerability during reconstruction because it takes longer
>>>>> to reconstruct.
>>>>> 
>>>>> Here's an article where Garth talks about this (it's at the end):
>>>>> 
>>>>> http://www.eweek.com/article2/0,1895,2168821,00.asp
>>>>> 
>>>>> I wanted to note one quick thing from the article:
>>>>> 
>>>>> "The probability of the disk failing to read back data is the same as
>>>>> it was long ago, so today you can expect at least one failed read every
>>>>> 10TB to 100TB. But the reconstruction of a failed 500GB disk in an
>>>>> 11-disk array has to read 5TB, so there can be an unacceptably large
>>>>> chance of failure to rebuild every one of the 1 billion sectors on the
>>>>> failed disk."
>>>>> 
>>>>> So if a reconstruction fails, you have to copy 5TB of data from the
>>>>> backup to the volume. If you do this from tape - you're going to wait
>>>>> a long time. You can do it from a disk backup but it still may take
>>>>> some time to move 5TB across the wire depending upon how you have
>>>>> everything connected.
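>>>>>
>>>>> A quick sanity check of the figures in that quote, using decimal GB/TB
>>>>> and 512-byte sectors:
>>>>>
>>>>>     # Checking Garth's example: 11 x 500GB drives, RAID-5 rebuild
>>>>>     sector_bytes = 512
>>>>>     disk_gb      = 500
>>>>>     n_disks      = 11
>>>>>
>>>>>     sectors = disk_gb * 1e9 / sector_bytes        # ~1e9 sectors per disk
>>>>>     tb_read = (n_disks - 1) * disk_gb / 1000.0    # 5 TB read to rebuild
>>>>>     print("sectors per failed disk: %.1e" % sectors)
>>>>>     print("data read during rebuild: %.0f TB" % tb_read)
>>>>>     # at "one failed read every 10TB to 100TB":
>>>>>     print("expected UREs: %.2f to %.2f" % (tb_read / 100, tb_read / 10))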
>>>>> 
>>>>> Jeff
>>>>> 
>>>>> 
>>>>>> Hi Jeff,
>>>>>> 
>>>>>> For this reason, in a RAID system with a lot of disks it is important 
>>>>>> to scan the disks looking for unreadable (UNC = uncorrectable) data 
>>>>>> blocks on a regular basis.  If these are found, then the missing data 
>>>>>> at that Logical Block Address (LBA) has to be reconstructed from the
>>>>>> *other* disks and re-written onto the affected disk.
>>>>>> 
>>>>>> In a well-designed (hardware or software) RAID implementation, you can 
>>>>>> reconstruct the missing data by only reading a handful of logical 
>>>>>> blocks from the redundant disks.  It is not necessary to read the 
>>>>>> entire disk surface just to get a few 512 byte sectors of data.  So a 
>>>>>> failure for different data somewhere else on a disk should not (in 
>>>>>> principle) prevent reconstruction of the lost/missing data.  In a 
>>>>>> poorly-designed RAID implementation, you have to read the ENTIRE disk 
>>>>>> surface to get data from a few sectors.  In this case, another 
>>>>>> uncorrectable disk sector can be crippling.
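>>>>>>
>>>>>> To make the 'handful of blocks' point concrete: with single parity, the
>>>>>> missing block is just the XOR of the corresponding blocks on the
>>>>>> surviving disks.  A toy sketch (the block contents are made up):
>>>>>>
>>>>>>     # Toy single-parity reconstruction of one lost block
>>>>>>     def xor_blocks(blocks):
>>>>>>         out = bytearray(len(blocks[0]))
>>>>>>         for blk in blocks:
>>>>>>             for i, b in enumerate(blk):
>>>>>>                 out[i] ^= b
>>>>>>         return bytes(out)
>>>>>>
>>>>>>     d0, d1 = b"data block zero.", b"data block one.."
>>>>>>     parity = xor_blocks([d0, d1])            # what the array stored
>>>>>>     assert xor_blocks([d1, parity]) == d0    # d0 rebuilt from survivors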
>>>>>> 
>>>>>> Most good hardware RAID cards have an option for continuous disk
>>>>>> scanning.  For example, ARECA calls this 'consistency checking'.  It
>>>>>> should be done on a regular basis.
>>>>>> 
>>>>>> You can use smartmontools to do this also, by carrying out regular read
>>>>>> scans of the disk surface and then forcing a RAID consistency 
>>>>>> check/rebuild if there is a read failure at some disk block.
>>>>>> 
>>>>>> Note that continuous scanning is also needed for ECC memory to prevent
>>>>>> correctable single-bit errors from becoming uncorrectable double-bit
>>>>>> errors.  In this RAM/memory context it is called 'memory scrubbing'.
>>>>>> 
>>>>>> Cheers,
>>>>>>     Bruce
>>>>>> 
>>>>>> On Thu, 23 Aug 2007, Jeffrey B. Layton wrote:
>>>>>> 
>>>>>>> This isn't really directed at Jeff, but it seemed like a good segue
>>>>>>> for a comment. Everyone - please read some recent articles by
>>>>>>> Garth Gibson about large-capacity disks and large numbers of
>>>>>>> disks in a RAID group. Just to cut to the chase, given the
>>>>>>> Unrecoverable Read Error (URE) rate and large disks, during
>>>>>>> a rebuild you are almost guaranteed to hit a URE. When that
>>>>>>> happens, the rebuild stops and you have to restore everything
>>>>>>> from a backup. RAID-6 can help, but given enough disks and
>>>>>>> large enough disks, the same thing can happen (plus RAID-6
>>>>>>> rebuilds take longer since there are more computations involved).
>>>>>>> 
>>>>>>> Jeff
>>>>>>> 
>>>>>>> P.S. I guess I should disclose that my day job is at Panasas. But
>>>>>>> regardless, I would recommend reading some of Garth's comments.
>>>>>>> Maybe I can also get one of his presentations to pass around.
>>>>>>> 
>>>>>>> P.P.S. If you don't know Garth, he's one of the fathers of RAID.
>>>>>>> 
>>>>>>>> Hello Jakob,
>>>>>>>> A couple of things...
>>>>>>>> 1. ClusterFS has an easy-to-understand calculation on why RAID-6 is
>>>>>>>> necessary for the number of disks you're considering. You do need to
>>>>>>>> plan for multi-disk failure, especially with the rebuild time of 1TB
>>>>>>>> disks.
>>>>>>>> http://manual.lustre.org/manual/LustreManual16_HTML/DynamicHTML-10-1.html#wp1037512 
>>>>>>>> 2. Avoid tape if you can. At this scale, the administrative time and
>>>>>>>> costs far outweigh the benefits. Of course if you need to move your
>>>>>>>> data to a secure vault that's another thing. If you really want to do
>>>>>>>> tape, some people choose to do disk > disk > tape. This eliminates 
>>>>>>>> the
>>>>>>>> read interrupts on the primary storage and provides some added
>>>>>>>> redundancy.
>>>>>>>> 
>>>>>>>> 3. We do use Nexsan's satabeasts for storage similar to this. Without
>>>>>>>> commenting on costs, the jackrabbit is technologically superior.
>>>>>>>> 
>>>>>>>> Thanks,
>>>>>>>>                 jeff
>>>>>>>> 
>>>>>>>> On 8/23/07, Jakob Oestergaard <jakob at unthought.net> wrote:
>>>>>>>> 
>>>>>>>>> On Thu, Aug 23, 2007 at 07:56:15AM -0400, Joe Landman wrote:
>>>>>>>>> 
>>>>>>>>>> Greetings Jakob:
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> Hi Joe,
>>>>>>>>> 
>>>>>>>>> Thanks for answering!
>>>>>>>>> 
>>>>>>>>> ...
>>>>>>>>> 
>>>>>>>>>> up front disclaimer: we design/build/market/support such things.
>>>>>>>>>> 
>>>>>>>>> That does not disqualify you  :)
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>>> I'm looking at getting some big storage. Of all the parameters,
>>>>>>>>>>> getting as low a dollars/(month*GB) figure as possible is by far the
>>>>>>>>>>> most important. The price of acquiring and maintaining the storage
>>>>>>>>>>> solution is the number one concern.
>>>>>>>>>>> 
>>>>>>>>>> Should I presume density, reliability, and performance also factor 
>>>>>>>>>> in
>>>>>>>>>> somewhere as 2,3,4 (somehow) on the concern list?
>>>>>>>>>> 
>>>>>>>>> I expect that the major components of the total cost of running this 
>>>>>>>>> beast will
>>>>>>>>> be something like
>>>>>>>>>
>>>>>>>>>    acquisition
>>>>>>>>>  + power
>>>>>>>>>  + cooling
>>>>>>>>>  + payroll (disk-replacing admins :)
>>>>>>>>> 
>>>>>>>>> Real-estate is a concern as well, of course. The rent isn't free. It 
>>>>>>>>> would be
>>>>>>>>> nice to pack this in as few racks as possible.  Reliability, well... 
>>>>>>>>> I expect
>>>>>>>>> frequent drive failures, and I would expect that we'd run some form 
>>>>>>>>> of RAID to
>>>>>>>>> mitigate this. If the rest of the hardware is just reasonably well 
>>>>>>>>> designed,
>>>>>>>>> the most frequently failing components should be redundant and 
>>>>>>>>> hot-swap
>>>>>>>>> replaceable (fans and PSUs).
>>>>>>>>> 
>>>>>>>>> It's acceptable that a head-node fails for a short period of time. 
>>>>>>>>> The entire
>>>>>>>>> system will not depend on all head nodes functioning simultaneously.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>>> The setup will probably have a number of "head nodes" which 
>>>>>>>>>>> receive a large
>>>>>>>>>>> amount of data over standard gigabit from a large number of remote
>>>>>>>>>>> sources.
>>>>>>>>>>> Data is read infrequently from the head nodes by remote systems. 
>>>>>>>>>>> The primary
>>>>>>>>>>> load on the system will be data writes.
>>>>>>>>>>> 
>>>>>>>>>> Ok, so you are write dominated.  Could you describe (guesses are 
>>>>>>>>>> fine)
>>>>>>>>>> what the writes will look like?  Large sequential data, small 
>>>>>>>>>> random
>>>>>>>>>> data (seek, write, close)?
>>>>>>>>>> 
>>>>>>>>> I would expect something like 100-1000 simultaneous streaming writes 
>>>>>>>>> to just as
>>>>>>>>> many files (one file per writer). The files will be everything from 
>>>>>>>>> a few
>>>>>>>>> hundred MiB to many GiB.
>>>>>>>>> 
>>>>>>>>> I guess that on most filesystems these streaming sequential writes 
>>>>>>>>> will result
>>>>>>>>> in something close to "random writes" to the block layer. However, 
>>>>>>>>> we can be
>>>>>>>>> very generous with write buffering.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>>> The head nodes need not see the same unified storage; so I am not 
>>>>>>>>>>> required to
>>>>>>>>>>> have one big shared filesystem. If beneficial, each of the head 
>>>>>>>>>>> nodes could
>>>>>>>>>>> have their own local storage.
>>>>>>>>>>> 
>>>>>>>>>> There are some interesting designs with a variety of systems, 
>>>>>>>>>> including
>>>>>>>>>> GFS/Lustre/... on those head nodes, and a big pool of drives behind
>>>>>>>>>> them.  These designs will add to the overall cost, and increase 
>>>>>>>>>> complexity.
>>>>>>>>>> 
>>>>>>>>> Simple is nice :)
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>>> The storage pool will start out at around 100TiB and will grow to 
>>>>>>>>>>> ~1PiB within
>>>>>>>>>>> a year or two (too early to tell). It would be nice to use as few 
>>>>>>>>>>> racks as
>>>>>>>>>>> possible, and as little power as possible  :)
>>>>>>>>>>> 
>>>>>>>>>> Ok, so density and power are important.  This is good.  Coupled 
>>>>>>>>>> with the
>>>>>>>>>>  low management cost and low acquisition cost, we have about 3/4 of 
>>>>>>>>>> what
>>>>>>>>>> we need.  Just need a little more description of the writes.
>>>>>>>>>> 
>>>>>>>>> I hope the above helped.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> Also, do you intend to back this up?
>>>>>>>>>> 
>>>>>>>>> That is a *very* good question.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> How important is resiliency of the
>>>>>>>>>> system?  Can you tolerate a failed unit (assume the units have hot
>>>>>>>>>> spares, RAID6, etc).
>>>>>>>>>> 
>>>>>>>>> Yes. Single head nodes may fail. They must be fairly quick to get 
>>>>>>>>> back on line
>>>>>>>>> (having a replacement box I would expect no more than an hour of 
>>>>>>>>> downtime).
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> When you look at storage of this size, you have to
>>>>>>>>>> start planning for the eventual (and likely) failure of a chassis 
>>>>>>>>>> (or
>>>>>>>>>> some number of them), and think about a RAIN configuration.
>>>>>>>>>> 
>>>>>>>>> Yep. I don't know how likely a "many-disk" failure would be... If I 
>>>>>>>>> have a full
>>>>>>>>> replacement chassis, I would guess that I could simply pull out all 
>>>>>>>>> the disks
>>>>>>>>> from a failed system, move them to the replacement chassis and be up 
>>>>>>>>> and
>>>>>>>>> running again in "short" time.
>>>>>>>>> 
>>>>>>>>> If a PSU decides to fry everything connected to it including the 
>>>>>>>>> disks, then
>>>>>>>>> yes, I can see the point in RAIN or a full backup.
>>>>>>>>> 
>>>>>>>>> It's a business decision if a full node loss would be acceptable. I 
>>>>>>>>> honestly
>>>>>>>>> don't know that, but it is definitely interesting to consider both 
>>>>>>>>> "yes" and
>>>>>>>>> "no".
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> Either
>>>>>>>>>> that, or invest in massive low-level redundancy (which should be
>>>>>>>>>> scope limited to the box it is on anyway).
>>>>>>>>>> 
>>>>>>>>> Yes; I had something like RAID-5 or so in mind on the nodes.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>>> It *might* be possible to offload older files to tape; does anyone 
>>>>>>>>>>> have
>>>>>>>>>>> experience with HSM on Linux?  Does it work?  Could it be 
>>>>>>>>>>> worthwhile to
>>>>>>>>>>> investigate?
>>>>>>>>>>> 
>>>>>>>>>> Hmmm...  First I would suggest avoiding tape; you should likely be
>>>>>>>>>> looking at disk-to-disk backup, using slower nearline mechanisms.
>>>>>>>>>> 
>>>>>>>>> Why would you avoid tape?
>>>>>>>>> 
>>>>>>>>> Let's say there was software which allowed me to offload data to 
>>>>>>>>> tape in a
>>>>>>>>> reasonable manner. Considering the running costs of disk versus 
>>>>>>>>> tape, tape
>>>>>>>>> would win hands down on power, cooling and replacements.
>>>>>>>>> 
>>>>>>>>> Sure, the random seek time of a tape library sucks golf balls 
>>>>>>>>> through a garden
>>>>>>>>> hose, but assuming that one could live with that, are there more 
>>>>>>>>> important
>>>>>>>>> reasons to avoid tape?
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>>> One setup I was looking at is simply using SunFire X4500 systems
>>>>>>>>>>> (you can put
>>>>>>>>>>> 48 standard 3.5" SATA drives in each 4U system). Assuming I can 
>>>>>>>>>>> buy them with
>>>>>>>>>>> 1T SATA drives shortly, I could start out with 3 systems (12U) and 
>>>>>>>>>>> grow the
>>>>>>>>>>> entire setup to 1P with 22 systems in little over two full racks.
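>>>>>>>>>>>
>>>>>>>>>>> A rough check of that density claim, counting raw decimal capacity
>>>>>>>>>>> before any RAID overhead:
>>>>>>>>>>>
>>>>>>>>>>>     # X4500-style build-out: 22 boxes x 48 drives x 1TB, 4U per box
>>>>>>>>>>>     boxes, drives_per_box, tb_per_drive, u_per_box = 22, 48, 1, 4
>>>>>>>>>>>     print("raw capacity: %d TB" % (boxes * drives_per_box * tb_per_drive))
>>>>>>>>>>>     print("rack units: %d U" % (boxes * u_per_box))  # just over two 42U racks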
>>>>>>>>>>> 
>>>>>>>>>>> Any better ideas?  Is there a way to get this more dense without 
>>>>>>>>>>> paying an arm
>>>>>>>>>>> and a leg?  Has anyone tried something like this with HSM?
>>>>>>>>>>> 
>>>>>>>>>> Yes, but I don't want to turn this into a commercial, so I will be
>>>>>>>>>> succinct.  Scalable Informatics (my company) has a similar product,
>>>>>>>>>> which does have a good price and price per gigabyte, while 
>>>>>>>>>> providing
>>>>>>>>>> excellent performance.  Details (white paper, benchmarks, 
>>>>>>>>>> presentations)
>>>>>>>>>> at the http://jackrabbit.scalableinformatics.com web site.
>>>>>>>>>> 
>>>>>>>>> Yep, I was just looking at that actually.
>>>>>>>>> 
>>>>>>>>> The hardware looks similar in concept to the SunFire, but as I see it
>>>>>>>>> you guys have thought about a number of services on top of that
>>>>>>>>> (RAIN etc.).
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Very interesting!
>>>>>>>>> 
>>>>>>>>> -- 
>>>>>>>>>
>>>>>>>>>  / jakob
>>>>>>>>> 
>>> 
>> 
>


