[Beowulf] Software Raid

Jakob Oestergaard jakob at unthought.net
Tue Dec 13 13:40:20 PST 2005

On Mon, Dec 12, 2005 at 04:26:44PM -0800, Paul wrote:
> I read in a post somewhere that it was not possible to use a Linux 
> software RAID configuration for shared file storage in a cluster. I know 
> that it is possible to use software RAID on individual compute nodes but 
> the post stated that software RAID would not properly support 
> simultaneous accesses on a file server. Is this true?

Depends on what you mean.

Software RAID is just a block-level driver like LVM or like your
partitioning support or like parts of whatever filesystem you're using.

So in that sense, of course it will support concurrent access in any
setup imaginable.  Anything else would be completely bogus. How would SW
RAID detect that simultaneous accesses came from a cluster in order to
screw them up?  No, that doesn't make sense.

However, a common misunderstanding (hey, we all forget to think things
through at times, that's fair enough) has been that you could mount
*multiple* instances of a filesystem on *multiple* nodes, using the
*same* lower-level block devices, by means of Software RAID and network
block devices. Now that, of course, won't work.

The setup that *will* of course work, no matter if it's SW RAID, HW RAID
or no RAID at all is:


Node 1:
 disk-A ---\
            >- SW/HW/no-RAID --- FS --- NFS ---> network shared storage
 disk-B ---/

Nodes 2-X:
  -----> NFS mount -> storage.
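In commands, the working setup above sketches out roughly like this
(device names, mount points, and the choice of ext3 are just examples;
adapt them to your own hardware):

```shell
# Node 1: mirror two whole disks into one md device (Linux SW RAID-1).
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda /dev/sdb

# Put one ordinary (non-cluster) filesystem on the mirror and mount it:
mkfs.ext3 /dev/md0
mount /dev/md0 /export/storage

# Export the mounted filesystem over NFS (entry in /etc/exports):
#   /export/storage  *(rw,sync)
exportfs -ra

# Nodes 2-X simply NFS-mount it:
#   mount node1:/export/storage /mnt/storage
```

The point is that exactly one node mounts the filesystem; everyone else
goes through NFS, which handles the sharing.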


The setup that can't possibly work is:


Node 1:
 disk-A --- NBD export ->  shared_storage_A
                            >-- SW RAID 1 --- FS
 shared_storage_B import --/

Node 2:
 shared_storage_A import --\
                            >-- SW RAID 1 --- FS
 disk-B --- NBD export ->  shared_storage_B
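To make the mistake concrete, here is roughly what that broken layout
amounts to in commands. Shown only so you recognize it; do NOT run this
(the old-style positional nbd-server/nbd-client invocation is assumed,
and all names are examples):

```shell
# Node 1: export local disk-A, import Node 2's disk-B, mirror them:
nbd-server 2000 /dev/sda            # disk-A -> network
nbd-client node2 2000 /dev/nbd0     # remote disk-B -> /dev/nbd0
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda /dev/nbd0
mount /dev/md0 /mnt/fs              # ext3 or similar -- NOT cluster-aware

# Node 2: the mirror image of the above, assembling its *own* md device
# on top of the very same pair of disks.  Two live, non-coherent
# filesystem instances now sit on one mirrored block device.
```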


Now, naïvely, a write on Node1 to the FS would result in a write to
disk-A and a proper mirrored write to disk-B.  And a write on Node2
would result in a write to disk-B and a proper mirrored write to the
remote disk-A.

So, you have a mirrored filesystem on two nodes then?  No!  Each node
has its own internal caches and filesystem structures in memory that
are NOT COHERENT OVER THE NETWORK.  And just because the block devices
are mirrored it does NOT mean that the LIVE filesystems are in any way
consistent with each other.

Changing blocks underneath the FS mounted on Node 1 by writing on Node 2
will absolutely positively definitely lead to death and destruction of
the data on every single disk involved in that experiment.
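A toy model shows why. This is not real kernel code, just a few lines
simulating two nodes that each keep a private write-back cache on top
of the same mirrored block store:

```python
# Two "nodes" share a mirrored block device (the dict `disk`), but each
# keeps its own non-coherent in-memory cache -- like two live, mounted
# filesystem instances on the same blocks.

class Node:
    def __init__(self, name, disk):
        self.name = name
        self.disk = disk          # shared "mirrored" block device
        self.cache = {}           # private, non-coherent cache

    def read(self, block):
        # Reads are served from the private cache when possible, so a
        # node never sees the other node's later writes to that block.
        if block not in self.cache:
            self.cache[block] = self.disk[block]
        return self.cache[block]

    def write(self, block, data):
        self.cache[block] = data  # write-back: dirty data sits in cache

    def flush(self):
        self.disk.update(self.cache)  # dirty blocks hit the disk late

disk = {0: "free-block-bitmap v1"}
node1, node2 = Node("node1", disk), Node("node2", disk)

node1.read(0)                        # node1 caches the bitmap
node2.write(0, "bitmap: block 7 now in use")
node2.flush()                        # node2's update reaches the disk
node1.write(0, "bitmap: block 9 now in use")  # based on its STALE copy
node1.flush()                        # ...and silently overwrites node2

print(disk[0])                       # node2's allocation is gone
```

Node 1 never learned about node 2's allocation, so its eventual flush
clobbers it. Real filesystems cache far more state than one bitmap
block, which is why the corruption is total, not occasional.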

A clustered FS is a complex beast and it requires all sorts of fancy
locking and ordering schemes. That is something that cannot, by any
means imaginable, be implemented at the block layer alone, which is
where SW RAID lives.

So, in short, SW RAID can of course be used in clusters just fine. But
it is only RAID; it cannot magically implement cluster-wide cache
coherency, locking, ordering, multiversion concurrency,
commit/rollback/abort, deadlock detection and everything else that would
be needed to transform a non-cluster-aware FS like ext3 or XFS into a
cluster filesystem.  Nothing working at the block layer alone can, so a
trillion dollars of hardware wouldn't solve the problem either.

> Assuming that hardware RAID is required (or at least preferable)

Memory backed write cache on HW RAID controllers can often provide a
significant speedup on FS operations.

But again, it is not a correctness issue. SW RAID can do what HW RAID
can do, correctness wise.  HW RAID can do some things faster (and bad HW
RAID controllers can do a lot of things slower) than SW RAID, but both
will give you correct processing of all data if they are used and
installed as intended.

> I was 
> wondering if the built in RAID on some motherboards would be adequate or 
> do we need to look into a dedicated piece of hardware.

RAID on motherboards is for the most part just SW RAID implemented in
the vendor's drivers. Meaning, it is SW RAID anyway, only not as well
tuned or tested as the md code already in Linux.
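A quick way to check what a board actually gives you (a diagnostic
sketch; exact output varies by distro and chipset):

```shell
cat /proc/mdstat          # Linux md (software RAID) arrays, if any
lspci | grep -i raid      # does a real RAID controller show up?

# If lspci shows a plain SATA/IDE controller running in "RAID mode",
# it is firmware-assisted software RAID (handled by dmraid or vendor
# drivers), not a hardware RAID controller with its own processor
# and cache.
```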

Some boards have real RAID controllers. However, unless you get battery
backed write cache, there will be very little performance gain in the
best case, and some performance loss in worse scenarios.

My stance on HW RAID is: either do it *properly* and get all the
benefits, or don't do it at all.

> We will have 
> about 10 - 12 cpus initially that will be connected with giganet 
> network. We currently have about a terabyte of storage space and are 
> planning to mount it using NFS in a RAID 5 configuration.
> Our 
> applications for now will be database intensive bioinformatics apps. I 
> would be very interested in any comments. Thanks

I'd seriously consider multiple RAID-1 sets instead of one large RAID-5.

I know it'll cost you. But how long will a RAID-5 resync take on a
terabyte array if you have heavy all-hours traffic on the array?  It
will be days, and during that time performance will suck.  Of course, if
you have large 12-hour windows with near-zero activity this is likely
not a problem.
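For a rough feel for the numbers (the rates below are illustrative
assumptions, not measurements):

```python
# Back-of-the-envelope resync estimate: a full RAID-5 resync must read
# and rewrite the whole array, so time scales with size over rate.

array_size_gb = 1000          # ~1 TB array, as in the question
resync_mb_per_sec = 5         # md throttles resync hard under load
                              # (see /proc/sys/dev/raid/speed_limit_min)

seconds = array_size_gb * 1024 / resync_mb_per_sec
print(f"resync under heavy load: ~{seconds / 3600:.0f} hours")
# At 5 MB/s a 1 TB resync takes roughly 2.4 days; the same array
# resyncing while idle at 50 MB/s would finish in under 6 hours.
```

Several smaller RAID-1 sets also mean a failed disk only degrades one
mirror, and its resync copies just that one disk's worth of data.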


 / jakob
