[Beowulf] Big storage

Bruce Allen ballen at gravity.phys.uwm.edu
Wed Apr 16 09:08:00 PDT 2008


What was needed to fix the systems?  Reboot?  Hardware replacement?

On Wed, 16 Apr 2008, Gerry Creager wrote:

> We've had two fail rather randomly.  The failures did cause disk corruption, 
> but not of the undetected/undetectable sort.  They started throwing errors 
> to syslog, then fell over and stopped accessing disks.
>
> gerry
>
> Bruce Allen wrote:
>> Hi Gerry,
>> 
>> So far the only problem we have had is with one Areca card that had a bad 
>> 2GB memory module.  This generated lots of (correctable) single bit errors 
>> but eventually caused real problems.  Could you say something about the 
>> reliability issues you have seen?
>> 
>> Cheers,
>>     Bruce
>> 
>> 
>> On Wed, 16 Apr 2008, Gerry Creager wrote:
>> 
>>> We've used AoE (CoRAID hardware) with pretty good success (modulo one RAID 
>>> shelf fire that was caused by a manufacturing defect and dealt with 
>>> promptly by CoRAID).  We've had some reliability issues with Areca cards 
>>> but no data corruption on the systems we've built that way.
>>> 
>>> gerry
>>> 
>>> Bruce Allen wrote:
>>>> Hi Xavier,
>>>> 
>>>>>>>> PPS: We've also been doing some experiments with putting 
>>>>>>>> OpenSolaris+ZFS on some of our generic (Supermicro + Areca) 16-disk 
>>>>>>>> RAID systems, which were originally intended to run Linux.
>>>>
>>>>>>>  I think that DESY demonstrated some data corruption with such a 
>>>>>>> configuration, so they switched to OpenSolaris+ZFS.
>>>> 
>>>>>> I'm confused.  I am also talking about OpenSolaris+ZFS.  What did DESY 
>>>>>> try, and what did they switch to?
>>>> 
>>>>> Sorry, I was indeed not clear. As far as I know, DESY found data 
>>>>> corruption using Linux and Areca cards. They moved from Linux to 
>>>>> OpenSolaris and ZFS, avoiding further corruption. This has been discussed 
>>>>> in the HEPiX storage workgroup. However, I cannot speak on their behalf at 
>>>>> all. I'll try to put you in touch with someone more aware of this issue, 
>>>>> as my statements lack figures.
>>>> 
>>>> I think that would be very interesting to the entire Beowulf mailing 
>>>> list, so please suggest that they respond to the entire group, not just 
>>>> to me personally.  Here is an LKML thread about silent data corruption:
>>>> http://kerneltrap.org/mailarchive/linux-kernel/2007/9/10/191697
>>>> 
>>>> So far we have not seen any signs of data corruption on Linux+Areca 
>>>> systems (and our data files carry both internal and external checksums, 
>>>> so we would be sensitive to this).
>>>> 
>>>> Cheers,
>>>>     Bruce


