[Beowulf] Re: building a RAID system: A long-delayed follow-on

Thu Aug 19 16:21:32 PDT 2004

I just found this note from last year's discussion.  I've some 
follow-up.  If you're not interested, I'll understand.  Just hit 
<delete> and go on...

We implemented a 1.6 TB RAID-5 system using HighPoint Technology 
controllers and Maxtor 200 GB parallel IDE drives.  The performance 
wasn't what we expected, and careful examination showed that, just as 
chronicled below, the additional overhead, especially a complete second 
round of buffering, was really slowing things down.
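
For anyone sizing a similar box, the capacity arithmetic can be sketched 
as below.  RAID-5 gives up one drive's worth of space to parity, so 
usable space is (n - 1) times the drive size.  The drive count is not 
stated above; the 9-drive figure is only what the formula implies if 
1.6 TB is the usable size, and the function names are made up for 
illustration.

```python
def raid5_usable_gb(drives, drive_gb):
    """Usable capacity of a RAID-5 set: one drive's worth goes to parity."""
    if drives < 3:
        raise ValueError("RAID-5 needs at least 3 drives")
    return (drives - 1) * drive_gb


def drives_for_usable(target_gb, drive_gb):
    """Smallest RAID-5 drive count whose usable capacity reaches target_gb."""
    data_drives = -(-target_gb // drive_gb)  # ceiling division
    return max(data_drives + 1, 3)           # add the parity drive's worth


# Assumption: "1.6 TB" means 1600 GB usable, built from 200 GB drives.
print(drives_for_usable(1600, 200))  # 9 drives under that assumption
print(raid5_usable_gb(8, 200))       # 1400 GB if only 8 drives were used
```

If 1.6 TB was instead the raw total of eight drives, the usable figure 
would be the 1400 GB shown on the last line.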

OK, the next manufacturer up the proverbial food chain was Promise.  Got 
the hardware: better, but still far less than stellar performance.

Oh, and the drivers were several kernel releases behind, and in some 
cases I considered the kernel updates mandatory for security.

We started looking at 3Ware, but work got in the way of the fun stuff. 
Also, a collaborator (co-conspirator is more accurate) at another 
institution had been doing similar work and suggested we look at 
software RAID.  OK.  It's quick to configure, we need the box back up, 
and it can't run any worse than the HighPoint stuff.

Well, I'm still thinking I'd like to go with the 3Ware hardware, but 
that'll have to wait 'til we build the next 2 TB system... soon, real 
soon.  And if it's slower than s/w RAID, I'll go back to that.

Since we went to the s/w RAID-5 config we've seen 1 failure caused by 
stupid sysadmin tricks and an inadequate UPS when the campus went down. 
To confess completely, when RAID didn't come back up cleanly I 
attributed it to a missing entry in /etc/rc.d/rc.local... and 
technically, I was right.  I did a raidstart and mounted the drive, 
without a cursory fsck.  My bad.  We got a "clean" mount, and went 
merrily ahead.  To add to the confusion, I was doing all this from my 
laptop, at 70+mph (my wife was driving for most of this) using a Sprint 
1xRTT connection, once we got into Minnesota.  Iowa doesn't have Sprint 
coverage we could find, save for a 2-block stretch of Ames.

About 3 days after "recovering" we started seeing a bunch of disk 
errors.  By now, I was in _rural_ Wisconsin.  We didn't have cellphone 
coverage of any sort at the inlaws, and on a good day, we got 26k 
dialup... throttling down to 9600 sometimes.  I opted to drive into town 
and suck down coffee where I could get a 1xRTT connection... marginally 
acceptable.  I took the array offline and started an 'fsck -a' which 
would run for hours with little to look at to indicate the system was 
even still responding... and then roll over for "too many errors" and a 
message to run without the '-a' option.  'fsck -y' was little better. 
We fought this for the rest of the vacation, whenever I had 
connectivity, and I never got the disk happy.

Came home, immediately flew to DC and wrote a perl script on the plane 
to tell fsck in manual mode "yes, dammit" to all the 'do ya wanna fix 
this?' questions.  Got into DC at 8pm, started the script, went to 
dinner.  Came back: script was still running and the screen was full of 
the Q&A.  Went to bed.  Got up, same thing.  Went to the first day of 
meetings, and returned at 9pm.  Still running.  Another day of meetings, 
and back to the room.  Still running but it completed while I was 
changing clothes before going to dinner.
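
The original perl script isn't reproduced here, but the idea can be 
sketched in Python roughly as follows.  This is a minimal sketch, not 
the script that was actually run: it starts a command on a 
pseudo-terminal (interactive repair tools generally want a terminal 
rather than a pipe) and sends a reply whenever the output ends in 
something that looks like a question.  The prompt pattern, the reply 
string and the function name answer_yes are all assumptions and would 
have to be matched to whatever the real tool prints and expects.

```python
import os
import pty
import re

# Hypothetical prompt shape: output that currently ends in "? ".
PROMPT = re.compile(rb"\? $")


def answer_yes(argv, prompt=PROMPT, reply=b"y\n"):
    """Run argv on a pseudo-terminal and send reply at every prompt.

    Returns (exit_code, number_of_replies_sent).
    """
    pid, fd = pty.fork()
    if pid == 0:
        # Child: become the target command, attached to the pty.
        try:
            os.execvp(argv[0], argv)
        except OSError:
            os._exit(127)
    answers = 0
    pending = b""
    while True:
        try:
            chunk = os.read(fd, 4096)
        except OSError:
            break  # Linux reports EIO once the child has closed the pty
        if not chunk:
            break  # other platforms report end-of-file instead
        # Keep a short tail so a prompt split across reads still matches.
        pending = (pending + chunk)[-256:]
        if prompt.search(pending):
            os.write(fd, reply)
            answers += 1
            pending = b""
    os.close(fd)
    _, status = os.waitpid(pid, 0)
    return os.waitstatus_to_exitcode(status), answers  # Python 3.9+
```

A call would look something like answer_yes(["fsck", "/dev/md0"]), 
where the device name is a placeholder.  Blindly answering yes is only 
sensible once you've already decided to accept every repair, which by 
that point I had.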

Overall, fsck on a 1.7 TB machine appears to take about 96 hours to 
run when you've really abused it.

I restarted the box, restarted RAID, remounted, manually started the LDM 
data collection system, and got on an airplane.  By the time I was back 
in Texas, all the missing data from the 2-day odyssey was replaced and 
the system was back up to speed.

We're using this system to cache 30 days of all the Level II radar 
data.  I'll be doing some radar processing on a little 16-node dual 
Opteron cluster (ob:cluster) to see about running some of the newer 
processing codes to better render the data.  We'll also be extracting 
some of the data to initialize the MM5 and WRF models, once I figure out 
how to handle that.

We'll still try 3Ware.  I've got indications it's pretty good, from 
another guy.  However, kudos to the kernel and RAID developers in 
Linux-land.  They done good.

Gerry

pesch at attglobal.net wrote:
> You write:
> 
> "The problem with offloading is, that while it made great sense in the
> days of 1 MHz CPUs, it really doesn't make a noticeable difference in the
> load on your typical N GHz processor."
> 
> Did you have a maximum data storage size in mind? - or to put it differently: at what data size do you see the
> practical limit of SW RAID?
> 
> Paul
> 
> Jakob Oestergaard wrote:
> 
> 
>>On Thu, Oct 09, 2003 at 08:50:17PM +0200, Daniel Fernandez wrote:
>>
>>>Hi again,
>>
>>...
>>
>>Others have already answered your other questions, I'll try to take one
>>that went unanswered (as far as I can see).
>>
>>...
>>
>>>But must be noted that HW RAID offers better response time.
>>
>>In a HW RAID setup you *add* an extra layer: the dedicated CPU on the
>>RAID card.  Remember, this CPU also runs software - calling it
>>'hardware RAID' in itself is misleading, it could just as well be called
>>'offloaded SW RAID'.
>>
>>The problem with offloading is, that while it made great sense in the
>>days of 1 MHz CPUs, it really doesn't make a noticeable difference in the
>>load on your typical N GHz processor.
>>
>>However, you added a layer with your offloaded-RAID. You added one extra
>>CPU in the 'chain of command' - and an inferior CPU at that. That layer
>>means latency even in the most expensive cards you can imagine (and
>>bottleneck in cheap cards).  No matter how you look at it, as long as
>>the RAID code in the kernel is fairly simple and efficient (which it
>>was, last I looked), then the extra layers needed to run the PCI
>>commands thru the CPU and then to the actual IDE/SCSI controller *will*
>>incur latency.  And unless you pick a good controller, it may even be
>>your bottleneck.
>>
>>Honestly I don't know how much latency is added - it's been years since
>>I toyed with offload-RAID last  ;)
>>
>>I don't mean to be handwaving and spreading FUD - I'm just trying to say
>>that the people who advocate SW RAID here are not necessarily smoking
>>crack - there are very good reasons why SW RAID will outperform HW RAID
>>in many scenarios.
>>
>>
>>>HW raid offers hotswap capability and offload our work instead of
>>>maintaining a SW raid solution ...we'll see ;)
>>
>>That, is probably the best reason I know of for choosing hardware RAID.
>>And depending on who you will have administering your system, it can be
>>a very important difference.
>>
>>There are certainly scenarios where you will be willing to trade a lot
>>of performance for a blinking LED marking the failed disk - I am not
>>kidding.
>>
>>Cheers,
>>
>>--
>>................................................................
>>:   jakob at unthought.net   : And I see the elder races,         :
>>:.........................: putrid forms of man                :
>>:   Jakob Østergaard      : See him rise and claim the earth,  :
>>:        OZ9ABN           : his downfall is at hand.           :
>>:.........................:............{Konkhra}...............:
>>_______________________________________________
>>Beowulf mailing list, Beowulf at beowulf.org
>>To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
> 
> 

-- 
Gerry Creager -- gerry.creager at tamu.edu
Texas Mesonet -- AATLT, Texas A&M University	
Cell: 979.229.5301 Office: 979.458.4020
FAX:  979.847.8578 Pager:  979.228.0173
Office: 903A Eller Bldg, TAMU, College Station, TX 77843