[Beowulf] Big storage
Bruce Allen
ballen at gravity.phys.uwm.edu
Fri Aug 24 14:03:40 PDT 2007
Hi Jeff,
OK, I see the point. You are not worried about multiple unreadable
sectors making it impossible to reconstruct lost data. You are worried
about 'whole disk' failure.
I definitely agree that this is a possible problem. In fact we operate
all of our UWM data archives (about 300 TB) as RAID-6 to reduce the
probability of this. The idea of a second disk failing in a RAID-5 array
during rebuild does not make for a good night's sleep!
Cheers,
Bruce
On Fri, 24 Aug 2007, Jeffrey B. Layton wrote:
> Bruce,
>
> I urge you to read Garth's comments. Your description of what
> RAID controllers do is very good when there are no failed drives.
> If a drive fails, though, you can't scan the disks looking for bad
> sectors.
>
> During a reconstruction, the RAID controller is reconstructing
> the data based on the remaining drives and the parity.
> Unfortunately, the controller is likely to be block-based, so it has
> to rebuild every block of the failed disk. And if the controller hits
> a URE while doing the reconstruction, the rebuild just stops and the
> controller cries uncle. At that point you have to restore the failed
> array, the entire volume, from a backup.
>
> With drives getting larger and larger all the time, the window of
> vulnerability during reconstruction (during which a second drive
> failure takes out the entire volume) has grown, because it takes
> longer and longer to reconstruct so much data. This is why people are
> moving to RAID-6. But RAID-6 is expensive in terms of capacity and
> performance (note: it has worse write performance than RAID-5). It
> gives you the ability to tolerate a second drive failure, but it may
> not shrink the window of vulnerability during reconstruction, because
> the reconstruction itself takes longer.
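>
> To give a rough feel for how that window scales with drive size, here
> is a quick sketch (the 50 MB/s sustained rebuild rate is an assumed
> figure of mine; arrays rebuilding under live I/O load are often
> considerably slower):
>
>   # Rough window-of-vulnerability estimate: time to rebuild one drive.
>   # The rebuild rate is a hypothetical assumption, not a measurement.
>   rebuild_rate_mb_s = 50.0
>
>   for capacity_gb in (250, 500, 1000):
>       hours = capacity_gb * 1000.0 / rebuild_rate_mb_s / 3600.0
>       print("%4d GB drive: ~%.1f hours to rebuild" % (capacity_gb, hours))
>   # 250 GB -> ~1.4 h, 500 GB -> ~2.8 h, 1000 GB -> ~5.6 h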
>
> Here's an article where Garth talks about this (it's at the end):
>
> http://www.eweek.com/article2/0,1895,2168821,00.asp
>
> I wanted to note one quick thing from the article:
>
> "The probability of the disk failing to read back data is the same as
> it was long ago, so today you can expect at least one failed read every
> 10TB to 100TB. But the reconstruction of a failed 500GB disk in an
> 11-disk array has to read 5TB, so there can be an unacceptably large
> chance of failure to rebuild every one of the 1 billion sectors on the
> failed disk."
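>
> To put numbers on that claim, a back-of-the-envelope sketch (the URE
> rate of one error per 1e14 bits read is a typical consumer SATA
> datasheet figure I am assuming here, not a number from the article):
>
>   import math
>
>   # Assumed unrecoverable read error (URE) rate: 1 per 1e14 bits read,
>   # i.e. roughly one error per 12.5 TB transferred.
>   ure_per_bit = 1e-14
>
>   # Rebuilding a 500 GB disk in an 11-disk RAID-5 array means reading
>   # the 10 surviving disks in full: 10 * 500 GB = 5 TB.
>   bits_read = 5e12 * 8
>
>   # Probability of hitting at least one URE during the rebuild.
>   p_fail = 1.0 - math.exp(-ure_per_bit * bits_read)
>   print("P(rebuild hits a URE) = %.0f%%" % (100 * p_fail))  # ~33%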
>
> So if a reconstruction fails, you have to copy 5TB of data from the
> backup to the volume. If you do this from tape, you're going to wait
> a long time. You can do it from a disk backup, but it still may take
> some time to move 5TB across the wire, depending upon how you have
> everything connected.
>
> Jeff
>
>
>> Hi Jeff,
>>
>> For this reason, in a RAID system with a lot of disks it is important to
>> scan the disks looking for unreadable (UNC = uncorrectable) data blocks on
>> a regular basis. If one is found, the missing data at that Logical
>> Block Address (LBA) has to be reconstructed from the *other* disks and
>> rewritten onto the disk that returned the error.
>>
>> In a well-designed (hardware or software) RAID implementation, you can
>> reconstruct the missing data by reading only a handful of logical blocks
>> from the redundant disks. It is not necessary to read the entire disk
>> surface just to get a few 512-byte sectors of data. So an unreadable
>> sector holding different data somewhere else on a disk should not (in
>> principle) prevent reconstruction of the lost/missing data. In a
>> poorly-designed RAID implementation, you have to read the ENTIRE disk
>> surface to get data from a few sectors. In that case, another
>> uncorrectable disk sector can be crippling.
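>>
>> For RAID-5, that per-stripe reconstruction is just an XOR across the
>> corresponding blocks on the surviving disks. A toy sketch of the idea
>> (illustrative only, not any particular controller's implementation):
>>
>>   from functools import reduce
>>
>>   def rebuild_block(surviving_blocks):
>>       """Recover the missing RAID-5 block of one stripe by XOR-ing the
>>       corresponding blocks (data + parity) from all surviving disks."""
>>       return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)),
>>                     surviving_blocks)
>>
>>   # Toy example: 4-byte "sectors" on a 4-disk array (3 data + 1 parity).
>>   d0, d1, d2 = b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"
>>   parity = rebuild_block([d0, d1, d2])          # parity = d0 ^ d1 ^ d2
>>   assert rebuild_block([d0, d2, parity]) == d1  # lost d1, recovered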
>>
>> Most good hardware RAID cards have an option for continuous disk scanning.
>> For example, ARECA calls this 'consistency checking'. It should be done on
>> a regular basis.
>>
>> You can use smartmontools to do this also, by carrying out regular read
>> scans of the disk surface and then forcing a RAID consistency check/rebuild
>> if there is a read failure at some disk block.
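>>
>> As a minimal sketch of what that could look like for a Linux software
>> (md) RAID array (the device names and the crude "scrub on any SMART
>> complaint" policy below are placeholders of mine; smartctl's long
>> self-test and md's sync_action interface are the standard pieces):
>>
>>   import subprocess
>>
>>   DISKS = ["/dev/sda", "/dev/sdb", "/dev/sdc"]   # placeholder devices
>>   MD_ARRAY = "md0"                               # placeholder md array
>>
>>   # Start a long (full-surface) SMART self-test on every member disk.
>>   for disk in DISKS:
>>       subprocess.run(["smartctl", "-t", "long", disk], check=True)
>>
>>   # Later, once the self-tests have finished: smartctl's exit status is
>>   # a bit-mask of problem flags, so treating any non-zero value as
>>   # "time to scrub" is a deliberately crude policy for this sketch.
>>   unhealthy = any(subprocess.run(["smartctl", "-H", d]).returncode != 0
>>                   for d in DISKS)
>>   if unhealthy:
>>       # Ask md to scrub the array and rewrite bad sectors from redundancy.
>>       with open("/sys/block/%s/md/sync_action" % MD_ARRAY, "w") as f:
>>           f.write("repair\n")
>>
>> (In practice smartd can schedule the self-tests itself via a -s directive
>> in smartd.conf, so a script like this only needs to handle the reaction.)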
>>
>> Note that continuous scanning is also needed for ECC memory, to prevent
>> correctable single-bit errors from becoming uncorrectable double-bit
>> errors. In the RAM/memory context it is called 'memory scrubbing'.
>>
>> Cheers,
>> Bruce
>>
>> On Thu, 23 Aug 2007, Jeffrey B. Layton wrote:
>>
>>> This isn't really directed at Jeff, but it seemed like a good segue
>>> for a comment. Everyone, please read some of the recent articles by
>>> Garth Gibson about large-capacity disks and large numbers of
>>> disks in a RAID group. Just to cut to the chase, given the
>>> Unrecoverable Read Error (URE) rate and large disks, during
>>> a rebuild you are almost guaranteed to hit a URE. When that
>>> happens, the rebuild stops and you have to restore everything
>>> from a backup. RAID-6 can help, but given enough disks and
>>> large enough disks, the same thing can happen (plus RAID-6
>>> rebuilds take longer since there are more computations involved).
>>>
>>> Jeff
>>>
>>> P.S. I guess I should disclose that my day job is at Panasas. But
>>> regardless, I would recommend reading some of Garth's comments.
>>> Maybe I can also get one of his presentations to pass around.
>>>
>>> P.P.S. If you don't know Garth, he's one of the fathers of RAID.
>>>
>>>> Hello Jakob,
>>>> A couple of things...
>>>> 1. ClusterFS has an easy-to-understand calculation of why RAID-6 is
>>>> necessary for the number of disks you're considering. You do need to
>>>> plan for multi-disk failure, especially with the rebuild time of 1TB
>>>> disks.
>>>> http://manual.lustre.org/manual/LustreManual16_HTML/DynamicHTML-10-1.html#wp1037512
>>>>
>>>> 2. Avoid tape if you can. At this scale, the administrative time and
>>>> costs far outweigh the benefits. Of course, if you need to move your
>>>> data to a secure vault, that's another thing. If you really want to do
>>>> tape, some people choose to do disk-to-disk-to-tape. This takes the
>>>> tape-backup read load off the primary storage and provides some added
>>>> redundancy.
>>>>
>>>> 3. We do use Nexsan SATABeasts for storage similar to this. Without
>>>> commenting on costs, the JackRabbit is technologically superior.
>>>>
>>>> Thanks,
>>>> jeff
>>>>
>>>> On 8/23/07, Jakob Oestergaard <jakob at unthought.net> wrote:
>>>>
>>>>> On Thu, Aug 23, 2007 at 07:56:15AM -0400, Joe Landman wrote:
>>>>>
>>>>>> Greetings Jakob:
>>>>>>
>>>>>>
>>>>> Hi Joe,
>>>>>
>>>>> Thanks for answering!
>>>>>
>>>>> ...
>>>>>
>>>>>> up front disclaimer: we design/build/market/support such things.
>>>>>>
>>>>> That does not disqualify you :)
>>>>>
>>>>>
>>>>>>> I'm looking at getting some big storage. Of all the parameters,
>>>>>>> getting dollars/(month*GB) as low as possible is by far the most
>>>>>>> important. The price of acquiring and maintaining the storage
>>>>>>> solution is the number one concern.
>>>>>>>
>>>>>> Should I presume density, reliability, and performance also factor in
>>>>>> somewhere as 2,3,4 (somehow) on the concern list?
>>>>>>
>>>>> I expect that the major components of the total cost of running this
>>>>> beast will
>>>>> be something like
>>>>>
>>>>> acquisition
>>>>> + power
>>>>> + cooling
>>>>> + payroll (disk-replacing admins :)
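>>>>>
>>>>> To show how those components roll up into the dollars/(month*GB)
>>>>> figure, a quick sketch with made-up numbers (every input below is a
>>>>> hypothetical placeholder, not a quote or a measurement):
>>>>>
>>>>>   # All inputs are hypothetical placeholders for illustration only.
>>>>>   capacity_gb     = 100e3    # 100 TB usable
>>>>>   acquisition_usd = 150e3    # purchase price, amortized below
>>>>>   lifetime_months = 36
>>>>>   power_kw        = 5.0      # draw including cooling overhead
>>>>>   usd_per_kwh     = 0.10
>>>>>   admin_usd_month = 1000.0   # fraction of an admin's time
>>>>>
>>>>>   power_usd_month = power_kw * 24 * 30 * usd_per_kwh
>>>>>   total_usd_month = (acquisition_usd / lifetime_months
>>>>>                      + power_usd_month + admin_usd_month)
>>>>>   print("$%.4f per GB per month" % (total_usd_month / capacity_gb))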
>>>>>
>>>>> Real estate is a concern as well, of course. The rent isn't free. It
>>>>> would be nice to pack this into as few racks as possible. Reliability,
>>>>> well... I expect frequent drive failures, and I would expect that we'd
>>>>> run some form of RAID to mitigate this. If the rest of the hardware is
>>>>> just reasonably well designed, the most frequently failing components
>>>>> should be redundant and hot-swap replaceable (fans and PSUs).
>>>>>
>>>>> It's acceptable that a head-node fails for a short period of time. The
>>>>> entire
>>>>> system will not depend on all head nodes functioning simultaneously.
>>>>>
>>>>>
>>>>>>> The setup will probably have a number of "head nodes" which receive a
>>>>>>> large
>>>>>>> amount of data over standard gigabit from a large amount of remote
>>>>>>> sources.
>>>>>>> Data is read infrequently from the head nodes by remote systems. The
>>>>>>> primary
>>>>>>> load on the system will be data writes.
>>>>>>>
>>>>>> Ok, so you are write dominated. Could you describe (guesses are fine)
>>>>>> what the writes will look like? Large sequential data, small random
>>>>>> data (seek, write, close)?
>>>>>>
>>>>> I would expect something like 100-1000 simultaneous streaming writes to
>>>>> just as
>>>>> many files (one file per writer). The files will be everything from a
>>>>> few
>>>>> hundred MiB to many GiB.
>>>>>
>>>>> I guess that on most filesystems these streaming sequential writes will
>>>>> result
>>>>> in something close to "random writes" to the block layer. However, we
>>>>> can be
>>>>> very generous with write buffering.
>>>>>
>>>>>
>>>>>>> The head nodes need not see the same unified storage; so I am not
>>>>>>> required to
>>>>>>> have one big shared filesystem. If beneficial, each of the head nodes
>>>>>>> could
>>>>>>> have their own local storage.
>>>>>>>
>>>>>> There are some interesting designs with a variety of systems, including
>>>>>> GFS/Lustre/... on those head nodes, and a big pool of drives behind
>>>>>> them. These designs will add to the overall cost, and increase
>>>>>> complexity.
>>>>>>
>>>>> Simple is nice :)
>>>>>
>>>>>
>>>>>>> The storage pool will start out at around 100TiB and will grow to
>>>>>>> ~1PiB within
>>>>>>> a year or two (too early to tell). It would be nice to use as few
>>>>>>> racks as
>>>>>>> possible, and as little power as possible :)
>>>>>>>
>>>>>> Ok, so density and power are important. This is good. Coupled with
>>>>>> the
>>>>>> low management cost and low acquisition cost, we have about 3/4 of
>>>>>> what
>>>>>> we need. Just need a little more description of the writes.
>>>>>>
>>>>> I hope the above helped.
>>>>>
>>>>>
>>>>>> Also, do you intend to back this up?
>>>>>>
>>>>> That is a *very* good question.
>>>>>
>>>>>
>>>>>> How important is resiliency of the
>>>>>> system? Can you tolerate a failed unit (assume the units have hot
>>>>>> spares, RAID6, etc).
>>>>>>
>>>>> Yes. Single head nodes may fail. They must be fairly quick to get back
>>>>> on line
>>>>> (having a replacement box I would expect no more than an hour of
>>>>> downtime).
>>>>>
>>>>>
>>>>>> When you look at storage of this size, you have to
>>>>>> start planning for the eventual (and likely) failure of a chassis (or
>>>>>> some number of them), and think about a RAIN configuration.
>>>>>>
>>>>> Yep. I don't know how likely a "many-disk" failure would be... If I have
>>>>> a full
>>>>> replacement chassis, I would guess that I could simply pull out all the
>>>>> disks
>>>>> from a failed system, move them to the replacement chassis and be up and
>>>>> running again in "short" time.
>>>>>
>>>>> If a PSU decides to fry everything connected to it including the disks,
>>>>> then
>>>>> yes, I can see the point in RAIN or a full backup.
>>>>>
>>>>> It's a business decision whether a full node loss would be acceptable.
>>>>> I honestly don't know that, but it is definitely interesting to
>>>>> consider both "yes" and "no".
>>>>>
>>>>>
>>>>>> Either that, or invest in massive low-level redundancy (which should
>>>>>> be scope-limited to the box it is on anyway).
>>>>>>
>>>>> Yes; I had something like RAID-5 or so in mind on the nodes.
>>>>>
>>>>>
>>>>>>> It *might* be possible to offload older files to tape; does anyone
>>>>>>> have
>>>>>>> experience with HSM on Linux? Does it work? Could it be worthwhile
>>>>>>> to
>>>>>>> investigate?
>>>>>>>
>>>>>> Hmmm... First, I would suggest avoiding tape; you should likely be
>>>>>> looking at disk-to-disk for backup, using slower nearline mechanisms.
>>>>>>
>>>>> Why would you avoid tape?
>>>>>
>>>>> Let's say there was software which allowed me to offload data to tape in
>>>>> a
>>>>> reasonable manner. Considering the running costs of disk versus tape,
>>>>> tape
>>>>> would win hands down on power, cooling and replacements.
>>>>>
>>>>> Sure, the random seek time of a tape library sucks golf balls through a
>>>>> garden
>>>>> hose, but assuming that one could live with that, are there more
>>>>> important
>>>>> reasons to avoid tape?
>>>>>
>>>>>
>>>>>>> One setup I was looking at is simply using SunFire X4500 systems (you
>>>>>>> can put 48 standard 3.5" SATA drives in each 4U system). Assuming I
>>>>>>> can buy them with 1T SATA drives shortly, I could start out with 3
>>>>>>> systems (12U) and grow the entire setup to 1P with 22 systems in a
>>>>>>> little over two full racks.
>>>>>>>
>>>>>>> Any better ideas? Is there a way to get this more dense without
>>>>>>> paying an arm
>>>>>>> and a leg? Has anyone tried something like this with HSM?
>>>>>>>
>>>>>> Yes, but I don't want to turn this into a commercial, so I will be
>>>>>> succinct. Scalable Informatics (my company) has a similar product,
>>>>>> which does have a good price and price per gigabyte, while providing
>>>>>> excellent performance. Details (white paper, benchmarks,
>>>>>> presentations)
>>>>>> at the http://jackrabbit.scalableinformatics.com web site.
>>>>>>
>>>>> Yep, I was just looking at that actually.
>>>>>
>>>>> The hardware looks similar in concept to the SunFire, but as I see it
>>>>> you guys have layered a number of services on top of that (RAIN etc.).
>>>>>
>>>>>
>>>>> Very interesting!
>>>>>
>>>>> --
>>>>>
>>>>> / jakob
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>
>