[Beowulf] Software Raid

Joe Landman landman at scalableinformatics.com
Tue Dec 13 20:12:46 PST 2005

Michael T. Prinkey wrote:
> Honestly, I am wondering if the Software/Hardware RAID argument has
> devolved to the state of SCSI versus ATA or (heaven forfend) emacs versus
> vi.  8)

Possibly, though I am getting the sense that one of the participants is 
missing some of the points being made, possibly due to a language issue.

> My experiences with hardware raid have been consistently lackluster over
> 7 years and several generations of hardware.  My experiences with software
> RAID servers (specifically built for the task) have been largely positive.  
> When I read comments extolling the virtues of hardware RAID solutions, I
> find myself constantly wondering if I could be missing something after
> so many years and many dozens of deployed units.

Hardware raid has a specific domain of utility, as does software raid. 
There is some overlap.  I am still not sure if I can use software RAID1 
to do automatic (i.e. no fingers touching the keyboard) rebuilds on a 
failed mirrored boot/root drive.  I would like to.
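
For what it is worth, the detection half of a hands-off rebuild is easy
to sketch.  What follows is a minimal sketch, assuming the usual
/proc/mdstat layout (an "mdN : active raid1 ..." line followed by a
status line ending in something like [2/1] [U_]).  The function name and
the sample text are made up for illustration, and the actual re-add of a
replacement drive is deliberately left out.

```python
import re

# Hypothetical sample in the usual /proc/mdstat layout (an assumption, not
# captured from a real machine): md0 is healthy, md1 has lost one mirror half.
SAMPLE_MDSTAT = """\
Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0]
      104320 blocks [2/2] [UU]

md1 : active raid1 sda2[0]
      78043648 blocks [2/1] [U_]

unused devices: <none>
"""

def degraded_arrays(mdstat_text):
    """Return names of md arrays whose member map shows a missing device."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\d+)\s*:", line)
        if header:
            current = header.group(1)
            continue
        # Status line looks like "[2/1] [U_]"; '_' marks a missing member.
        status = re.search(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", line)
        if status and current:
            if "_" in status.group(3):
                degraded.append(current)
            current = None
    return degraded

if __name__ == "__main__":
    # On a live system you would read open("/proc/mdstat").read() instead.
    print(degraded_arrays(SAMPLE_MDSTAT))
```

Something like this could run from cron and raise an alarm.  Whether the
rebuild itself can then proceed with no fingers on the keyboard is the
part I am still unsure about.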

Under very heavily loaded (heavy memory usage, heavy CPU usage) 
situations, the possibility for a deadlock in the raid to file system 
path or in contention for buffer manipulation versus memory space may 
exist.  I say may as I have not seen this under heavy load, but it is a 
possible "corner case".  I do see regular old file system access, say 
with ext3, losing performance rather badly due to its journaling issues. 
I haven't looked at ext3 code recently, so I cannot comment on 
whether or not the journaling code is a point of serialization.  Under 
heavy load, you are far more likely to be running into ext3 limits than 
the SW raid system.  Of course you could use a better file system, in 
which case you are more likely to stress the SW raid.

We use xfs on RAID0 for local disk performance on compute nodes.  With 
a little tuning, we are hitting a fairly nice (sustained) 140 MB/s 
across two SCSI disks on large block reads (per node) for some 
applications that need this, using SW raid.  We are hitting about 
120 MB/s on SATA drives for a similar IO system.
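
If you want to check a number like that yourself, a large-block
sequential read is simple to time.  This is a minimal sketch, not the
benchmark we used: the function name and the 1 MB block size are my own
choices for illustration, and the scratch file is only there so the
sketch runs anywhere.  For a real measurement you would point it at a
file on the array that is much larger than RAM, otherwise you are
timing the page cache.

```python
import os
import tempfile
import time

def sequential_read_mb_per_s(path, block_size=1024 * 1024):
    """Read a file front to back in large blocks; return (bytes_read, MB/s)."""
    total = 0
    start = time.perf_counter()
    # buffering=0 gives raw reads, so each read() maps to one OS-level read.
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(block_size)
            if not chunk:
                break
            total += len(chunk)
    elapsed = max(time.perf_counter() - start, 1e-9)  # guard against zero
    return total, total / elapsed / 1e6

if __name__ == "__main__":
    # Small scratch file for demonstration only; it will be served from cache.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(os.urandom(8 * 1024 * 1024))
        scratch = tmp.name
    try:
        nbytes, rate = sequential_read_mb_per_s(scratch)
        print(f"read {nbytes} bytes at {rate:.1f} MB/s (cached, not a disk number)")
    finally:
        os.remove(scratch)
```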

> To provide numbers, I am really only concerned if the raid array can
> saturate the gigabit line feeding it.  On-server performance is pretty
> useless as no work is ever done directly on the RAID servers.  For reading

We have customers running atop/iftop all the time.  It's nice to see your 
NFS server pushing 300+ MB/s through the switches to the clients beating 
on it.  The problem we are running into is interrupts on the cards. 
Most of the kernels are compiled with NAPI off, so there is no way, 
short of rebuilding kernels, to tell if turning it on will mitigate a 
real live interrupt tsunami from heavy NFS IO.  For a number of reasons, 
we would like to avoid rebuilding kernels (ask RGB why it is a bad idea).
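
To put a number on the interrupt load without touching the kernel, you
can difference two snapshots of /proc/interrupts.  A minimal sketch,
assuming the usual layout (a header row of CPU columns, then one row per
IRQ with per-CPU counts and the device names at the end); the function
names and the sample snapshots are invented for illustration.

```python
# Hypothetical snapshots in the usual /proc/interrupts layout (made-up numbers).
BEFORE = """\
           CPU0       CPU1
  0:     500000          0   IO-APIC-edge      timer
 16:       1000        500   IO-APIC-level     eth0
 17:        200        100   IO-APIC-level     eth1, uhci_hcd
"""
AFTER = """\
           CPU0       CPU1
  0:     502000          0   IO-APIC-edge      timer
 16:      41000      20500   IO-APIC-level     eth0
 17:        200        100   IO-APIC-level     eth1, uhci_hcd
"""

def irq_count(interrupts_text, device):
    """Sum the per-CPU interrupt counts of every IRQ line naming `device`."""
    lines = interrupts_text.splitlines()
    ncpu = len(lines[0].split())          # header row: one column per CPU
    total = 0
    for line in lines[1:]:
        fields = line.split()
        if len(fields) < 2 or not fields[0].endswith(":"):
            continue
        counts = []
        for tok in fields[1:1 + ncpu]:    # some rows have fewer count columns
            if not tok.isdigit():
                break
            counts.append(int(tok))
        # Shared IRQs list several devices, comma separated.
        names = [n.rstrip(",") for n in fields[1 + len(counts):]]
        if device in names:
            total += sum(counts)
    return total

def irq_rate(before, after, device, seconds):
    """Interrupts per second for `device` between two snapshots."""
    return (irq_count(after, device) - irq_count(before, device)) / seconds

if __name__ == "__main__":
    # Live use: read /proc/interrupts, sleep N seconds, read it again.
    print(irq_rate(BEFORE, AFTER, "eth0", seconds=2.0), "interrupts/s on eth0")
```

This only tells you how bad the storm is, not whether NAPI would calm
it; for that you still need a kernel built with it.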

> data, the server could certainly saturate gigabit...Bonnie on the NFS
> mount gave roughly 85 MB/sec for software RAID5.  When we deployed an
> 8-drive RAID5 array using hardware RAID on the SATA 3ware card,
> performance was on the order of 15 MB/sec.  We had initially deployed
> these raid servers using the hardware RAID5 setup, but we had several user
> complaints about poor storage performance.  So we retooled with software
> RAID and the rest is history.
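
For scale, here is the back-of-the-envelope arithmetic on the figures in
this thread.  It is a rough sketch that uses the raw gigabit line rate
(10^9 bits/s, so 125 MB/s) and ignores Ethernet, IP and NFS overhead,
which is my simplifying assumption; the helper names are made up.

```python
import math

# Raw gigabit line rate in MB/s, before any protocol overhead (assumption).
GIGABIT_MB_PER_S = 1_000_000_000 / 8 / 1_000_000

def fraction_of_gigabit(mb_per_s):
    """How much of one gigabit link a given throughput would occupy."""
    return mb_per_s / GIGABIT_MB_PER_S

def links_needed(mb_per_s):
    """Minimum number of gigabit links needed to carry a given throughput."""
    return math.ceil(mb_per_s / GIGABIT_MB_PER_S)

if __name__ == "__main__":
    for label, rate in [("SW RAID5 over NFS", 85), ("HW RAID5 (3ware)", 15)]:
        print(f"{label}: {rate} MB/s = {fraction_of_gigabit(rate):.0%} of one link")
    print("300 MB/s needs at least", links_needed(300), "gigabit links")
```

So the software RAID5 figure is within sight of a single link, the
hardware figure is nowhere near it, and an NFS server pushing 300+ MB/s
has to be spreading that over several links.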

Use what works.  We use both.  HW raid where we must have 
no-fingers-on-keyboard hotswap.  SW raid for other things (local drive 
performance).  Note though that 3ware and others don't generally perform 
well out of the box without some tweaking.  After a little tuning 
(blockdev and other bits), they can scream.  If your server is 
overloaded with interrupts from lousy network cards (grr), you probably 
don't want to add more context switching sources (SW RAID).
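
As one example of the kind of knob involved: read-ahead is among the
settings blockdev exposes (--getra / --setra, in 512-byte sectors), and
the same value is visible through sysfs as queue/read_ahead_kb.  Whether
read-ahead is the setting that matters for a given card is an assumption
on my part here, and the right value is something you have to measure.
A minimal sketch that reads it; the helper name and the sysfs_root
parameter are mine, added so it can be pointed at a test directory.

```python
import os

def readahead_sectors(device, sysfs_root="/sys/block"):
    """Return a block device's read-ahead in 512-byte sectors.

    sysfs reports the value in kilobytes; blockdev reports 512-byte
    sectors, so one kilobyte corresponds to two sectors.
    """
    path = os.path.join(sysfs_root, device, "queue", "read_ahead_kb")
    with open(path) as f:
        kb = int(f.read().strip())
    return kb * 2

if __name__ == "__main__":
    root = "/sys/block"
    if os.path.isdir(root):
        for dev in sorted(os.listdir(root)):
            try:
                print(dev, readahead_sectors(dev), "sectors")
            except (OSError, ValueError):
                # Not every entry exposes a readable queue/read_ahead_kb.
                pass
```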

It's a design choice.  Both are good, both have domains of applicability. 
Anyone suggesting otherwise might not be talking about the same thing 
we are discussing here.  File system bugs are nasty, and no block device, 
software or hardware, is going to save you from them.

> Clearly, YMMV.
> Mike
> On Tue, 13 Dec 2005, Joe Landman wrote:
>>Vincent Diepeveen wrote:
>>>>The remaining advantage of hardware is still hot-swapping
>>>>failed drives without having to shutdown the server.
>>>Those same nerds of above, they do not take into account that if 
>>>something complex like a raid array gets suddenly handled in 
>>>software instead of hardware, that even the tiniest 
>>>undiscovered bug in a file system, will impact you.
>><scratch />
>><scratch />
>><huh? />
>>As the raid device is being created at the block device level, and the 
>>file system resides above this, a file system bug will be just as 
>>detrimental to a hardware raid system as it would a software raid system.
>>Of course, you could have meant a bug in the software raid block device 
>>driver.  Yes such things do exist.  So do bugs in the hardware raid 
>>controllers.  In *neither* case do you want to touch the buggy code. 
>>Best case is completely innocuous behavior.  Worst case, well, let's not 
>>get into that.
>>Bugs can and do occur in any software.  Whether burned into firmware, 
>>written as VHDL/Verilog that creates the ASICs or FPGAs on the hardware 
>>raid, or in the software raid block device.
>>>And be sure that there are bugs. So doing a hardware XOR (or whatever) in
>>>RAM of the raid controller instead of in the software, is a huge advantage.
>>RAID is *far more* than doing hardware XOR.  Most XOR implementations 
>>tend to be bug free given how atomic this operation is.  The code around 
>>it however occasionally has bugs.  Firmware and software code.
>>>It reduces complexity of what software has to do, so it reduces the
>>>chance that a bug will occur in the OS somewhere, causing you to lose
>>>all your files.
>>No.  Absolutely not.  Software raid simply does in software what the 
>>hardware may do in part in hardware.  Any bug, anywhere in this process 
>>(in either HW or SW raid) and you can have problems.  Problems and bugs 
>>are not just the province of SW raid.
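
To show how small the XOR piece quoted above really is, here is a
minimal sketch of RAID5-style parity over one stripe.  The function name
and the toy four-byte blocks are made up for illustration; a real
implementation is everything around this (stripe layout, rebuild state,
write ordering), which is where the bugs tend to live.

```python
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    length = len(blocks[0])
    if any(len(b) != length for b in blocks):
        raise ValueError("all blocks must have the same length")
    # zip(*blocks) walks the blocks column by column, one byte from each.
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

if __name__ == "__main__":
    data = [b"AAAA", b"BBBB", b"CCCC"]   # three data blocks in one stripe
    parity = xor_blocks(data)            # what the parity disk would hold

    # Pretend the disk holding data[1] died: XOR the survivors with parity.
    rebuilt = xor_blocks([data[0], data[2], parity])
    print("rebuilt block matches original:", rebuilt == data[1])
```

The same few lines serve for both computing parity and reconstructing a
lost block, in software or in a controller's firmware.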

Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 734 786 8452
cell : +1 734 612 4615
