[Beowulf] Re: building a RAID system: A long-delayed follow-on
Gerry Creager n5jxs
gerry.creager at tamu.edu
Thu Aug 19 16:21:32 PDT 2004
I just found this note from last year's discussion. I've some
follow-up. If you're not interested, I'll understand. Just hit
<delete> and go on...
We implemented a 1.6 TB RAID-5 system using HighPoint Technology
controllers and Maxtor 200 GB parallel IDE drives. The performance
wasn't what we expected, but some careful examination revealed that,
just as chronicled below, the additional overhead, especially a complete
second round of buffering, was really slowing things down.
OK, the next manufacturer up the proverbial food chain was Promise. We
got the hardware; performance was better, but still far less than stellar.
Oh, and the drivers were several kernel releases behind, and in some cases
I considered the kernel updates mandatory for security.
We started looking at 3Ware, but work got in the way of the fun stuff.
Also, a collaborator (co-conspirator is more accurate) at another
institution had been doing similar work and suggested we look at
software RAID. OK. It's quick to configure, we needed the box back up,
and it couldn't run any worse than the HighPoint stuff.
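For the record, "quick to configure" is no exaggeration: the whole array is
described by a short /etc/raidtab. Something along these lines is all
raidstart needs for a RAID-5 set (the device names and disk count here are
illustrative, not our actual layout):

    raiddev /dev/md0
        raid-level              5
        nr-raid-disks           4
        nr-spare-disks          0
        persistent-superblock   1
        parity-algorithm        left-symmetric
        chunk-size              64
        device                  /dev/hde1
        raid-disk               0
        device                  /dev/hdf1
        raid-disk               1
        device                  /dev/hdg1
        raid-disk               2
        device                  /dev/hdh1
        raid-disk               3

Run mkraid against that, drop a filesystem on /dev/md0 with mke2fs, and
you're in business.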
Well, I'm still thinking I'd like to go with the 3Ware hardware, but
that'll have to wait 'til we build the next 2 TB system... soon, real
soon. And if it's slower than s/w RAID, I'll go back to that.
Since we went to the s/w RAID-5 config, we've seen one failure, caused by
stupid sysadmin tricks and an inadequate UPS when the campus power went
down. To confess completely: when the RAID didn't come back up cleanly, I
attributed it to a missing entry in /etc/rc.d/rc.local... and
technically, I was right. I did a raidstart and mounted the drive
without even a cursory fsck. My bad. We got a "clean" mount and went
merrily ahead. To add to the confusion, I was doing all this from my
laptop at 70+ mph (my wife was driving for most of this) over a Sprint
1xRTT connection, once we got into Minnesota. Iowa doesn't have any Sprint
coverage we could find, save for a 2-block stretch of Ames.
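The fix, for what it's worth, is just to make the bring-up order explicit:
start the array, fsck it, and only then mount it. A minimal perl sketch of
the sort of thing that rc.local entry could call (the device name and mount
point are made up, and a real script would want more error handling):

    #!/usr/bin/perl
    # Bring up a software RAID set after an unclean shutdown:
    # assemble, check, then mount, in that order.
    use strict;
    use warnings;

    my $md    = '/dev/md0';   # illustrative device name
    my $mount = '/data';      # illustrative mount point

    system('raidstart', $md) == 0
        or die "raidstart $md failed\n";

    # The step I skipped: fsck exit codes of 4 or higher mean errors
    # were left uncorrected; an exit of 1 just means it fixed things.
    my $rc = system('fsck', '-y', $md) >> 8;
    die "fsck left errors on $md (exit $rc)\n" if $rc >= 4;

    system('mount', $md, $mount) == 0
        or die "mount $md on $mount failed\n";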
About three days after "recovering," we started seeing a bunch of disk
errors. By now I was in _rural_ Wisconsin. We didn't have cellphone
coverage of any sort at the in-laws', and on a good day we got 26k
dialup... throttling down to 9600 sometimes. I opted to drive into town
and suck down coffee where I could get a 1xRTT connection... marginally
acceptable. I took the array offline and started an 'fsck -a', which
would run for hours with little output to indicate the system was even
still responding... and then roll over with "too many errors" and a
message to run without the '-a' option. 'fsck -y' was little better.
We fought this for the rest of the vacation, whenever I had
connectivity, and I never got the disk happy.
Came home, immediately flew to DC, and wrote a perl script on the plane
to tell fsck in manual mode "yes, dammit" to all the 'do ya wanna fix
this?' questions. Got into DC at 8 pm, started the script, went to
dinner. Came back; the script was still running and the screen was full
of the Q&A. Went to bed. Got up, same thing. Went to the first day of
meetings, and returned at 9 pm. Still running. Another day of meetings,
and back to the room. Still running, but it completed while I was
changing clothes before going to dinner.
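The idea is simple enough: run fsck in manual mode on a pty and feed it a
"y" for every prompt. Something along these lines would do it (this sketch
leans on the Expect module from CPAN; the device name and prompt pattern
are illustrative, not necessarily what I used):

    #!/usr/bin/perl
    # A "yes, dammit" wrapper for an interactive fsck: answer 'y' to
    # every "Fix<y>?"-style question and let it grind away for days.
    use strict;
    use warnings;
    use Expect;

    my $device = shift || '/dev/md0';   # illustrative default

    my $exp = Expect->spawn('fsck', $device)
        or die "could not spawn fsck on $device\n";

    $exp->expect(undef,   # no timeout; let it run as long as it needs
        # e2fsck's manual-mode prompts end in something like "<y>? "
        [ qr/<y>\?/ => sub { my $e = shift; $e->send("y\n"); exp_continue; } ],
    );

    # expect() returns when fsck finally exits (EOF on the pty)
    $exp->soft_close();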
Overall, fsck on a 1.7 TB filesystem appears to take about 96 hours, give
or take, to run when you've really abused it.
I restarted the box, restarted the RAID, remounted, manually started the
LDM data collection system, and got on an airplane. By the time I was
back in Texas, all the data missing from the 2-day odyssey had been
replaced and the system was back up to speed.
We're using this system to cache 30 days of all the Level II radar data.
I'll be doing some radar processing on a little 16-node dual-Opteron
cluster (ob: cluster) to see about running some of the newer processing
codes to better render the data. We'll also be extracting some of the
data to initialize the MM5 and WRF models, once I figure out how to
handle that.
We'll still try 3Ware; I've had indications from another user that it's
pretty good. However, kudos to the kernel and RAID developers in
Linux-land. They done good.
Gerry
pesch at attglobal.net wrote:
> You write:
>
> "The problem with offloading is, that while it made great sense in the
> days of 1 MHz CPUs, it really doesn't make a noticable difference in the
> load on your typical N GHz processor."
>
> Did you have a maximum data storage size in mind? Or, to put it differently:
> at what data size do you see the practical limit of SW RAID?
>
> Paul
>
> Jakob Oestergaard wrote:
>
>
>>On Thu, Oct 09, 2003 at 08:50:17PM +0200, Daniel Fernandez wrote:
>>
>>>Hi again,
>>
>>...
>>
>>Others have already answered your other questions, I'll try to take one
>>that went unanswered (as far as I can see).
>>
>>...
>>
>>>But it must be noted that HW RAID offers better response time.
>>
>>In a HW RAID setup you *add* an extra layer: the dedicated CPU on the
>>RAID card. Remember, this CPU also runs software - calling it
>>'hardware RAID' in itself is misleading, it could just as well be called
>>'offloaded SW RAID'.
>>
>>The problem with offloading is that while it made great sense in the
>>days of 1 MHz CPUs, it really doesn't make a noticeable difference in the
>>load on your typical N GHz processor.
>>
>>However, you added a layer with your offloaded RAID. You added one extra
>>CPU in the 'chain of command' - and an inferior CPU at that. That layer
>>means latency even in the most expensive cards you can imagine (and a
>>bottleneck in cheap cards). No matter how you look at it, as long as
>>the RAID code in the kernel is fairly simple and efficient (which it
>>was, last I looked), the extra layers needed to run the PCI
>>commands through the CPU and then to the actual IDE/SCSI controller
>>*will* incur latency. And unless you pick a good controller, it may
>>even be your bottleneck.
>>
>>Honestly I don't know how much latency is added - it's been years since
>>I toyed with offload-RAID last ;)
>>
>>I don't mean to be handwaving and spreading FUD - I'm just trying to say
>>that the people who advocate SW RAID here are not necessarily smoking
>>crack - there are very good reasons why SW RAID will outperform HW RAID
>>in many scenarios.
>>
>>
>>>HW raid offers hotswap capability and offloads our work instead of us
>>>maintaining a SW raid solution ...we'll see ;)
>>
>>That is probably the best reason I know of for choosing hardware RAID.
>>And depending on who you will have administering your system, it can be
>>a very important difference.
>>
>>There are certainly scenarios where you will be willing to trade a lot
>>of performance for a blinking LED marking the failed disk - I am not
>>kidding.
>>
>>Cheers,
>>
>>--
>>................................................................
>>: jakob at unthought.net : And I see the elder races, :
>>:.........................: putrid forms of man :
>>: Jakob Østergaard : See him rise and claim the earth, :
>>: OZ9ABN : his downfall is at hand. :
>>:.........................:............{Konkhra}...............:
--
Gerry Creager -- gerry.creager at tamu.edu
Texas Mesonet -- AATLT, Texas A&M University
Cell: 979.229.5301 Office: 979.458.4020
FAX: 979.847.8578 Pager: 979.228.0173
Office: 903A Eller Bldg, TAMU, College Station, TX 77843