[Beowulf] Software RAID?

Bill Broadley bill at cse.ucdavis.edu
Tue Nov 27 02:22:03 PST 2007


Long reply with some actual numbers I've collected; if you read nothing
else, please read the last paragraph.

Joe Landman wrote:
> Ekechi Nwokah wrote:

> Hmmm... Anyone with a large disk count SW raid want to run a few 
> bonnie++ like loads on it and look at the interrupt/csw rates?  Last I 

Bonnie++ runs lots of things; a smaller, more focused test might be more
useful.  I.e., did you want large contiguous reads?  Writes?  Random?
Small?

> looked on a RAID0 (2 disk) we were seeing very high interrupt/csw rates. 

What is very high?

I just did 16GB (total) of dd's from two 8-disk RAID5s (at the same time) at
600MB/sec (16GB in 27.206 seconds).

230,644 context switches: 8,477 per second, 2,119 per CPU per second.

Interrupts serviced: 366,688 total, 13,478 per second, 3,369 per CPU per second.
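
If anyone wants to reproduce this kind of measurement, here's a minimal
sketch (the md device names and sizes are just examples); the "cs" and "in"
columns of vmstat's output are where numbers like these come from:

$ vmstat 1 > vmstat.log &
$ dd if=/dev/md0 of=/dev/null bs=1M count=8192 &
$ dd if=/dev/md1 of=/dev/null bs=1M count=8192
$ wait %2        # let the backgrounded dd finish too
$ kill %1        # stop vmstat, then inspect vmstat.log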

The system feels normal under this load.  Maybe I could run netperf with a
small packet size to measure network performance at the same time... other
suggestions?
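
Something like this netperf invocation (a sketch; the target host is a
placeholder) would measure small-packet request/response performance while
the dd's run:

$ netperf -H otherhost -t TCP_RR -l 30 -- -r 64,64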

BTW, as far as maximum context switch performance goes, lmbench 3.0-a8
reports (times in microseconds):

$ ./lat_ctx 2 4
"size=0k ovr=2.55
2 3.14
4 3.10

4 in parallel:
$ ./lat_ctx -P 4 8 16 24
8 1.12
16 1.36
24 1.36



>  This would quickly swamp any perceived advantages of "infinitely many" 
> or "infinitely fast" cores. 

Hrm, why?  Do context switches not scale with core speed?  Or with the
number of cores?  Can't interrupts be spread across CPUs?  Hell, if you are
really worried about it you could put 2 RAID controllers, each connected to
PCI-e attached to the hypertransport on separate Opteron sockets.
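
Spreading interrupts around is just a bitmask write per IRQ; a sketch,
assuming the controller turned up as IRQ 24 in /proc/interrupts (the IRQ
number and CPU mask are examples):

$ cat /proc/interrupts                 # find the controller's IRQ line
$ echo 2 > /proc/irq/24/smp_affinity   # hex CPU mask, pins IRQ 24 to CPU 1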

Have you seen cases where practical I/O loads were limited by context switches
or interrupts?

Personally, seeing 600MB/sec out of 16 disks in 2 RAID-5s keeps me pretty
happy.  Do you have numbers for a hardware RAID?  I have a crappy hardware
RAID, I think it's the newest Dell PERC, with 6 of the fancy 15k rpm 36GB SAS
drives in a single RAID-5.  I manage 171MB/sec, with 494,499 context switches
(2,587 per second) and 910,294 interrupts (4,763/sec).

So to compare:
                        software-raid  hardware-raid   s/h
Context switches/sec            8,477          2,587  3.28
Interrupts/sec                 13,478          4,763  2.82

Which sounds like a big win until you realize that the software RAID was 3.5
times faster (600MB/sec vs. 171MB/sec); normalized per MB/sec moved, the
software RAID actually generated slightly fewer context switches and
interrupts.

> Sort of like an Amdahl's law.  Make the 
> expensive parallel computing portion take zero time, and you are still 
> stuck with the serial time (which you can't do much about).  Worse, it 

What part of context switching and interrupt handling doesn't scale with
core speed or the number of cores, er, well, at least sockets?

> is size extensive, so as you increase the number of disks, you have to 
> increase the interrupt rate (one controller per drive currently), and 

Er, why is that?  Say I have 1000 disks.  If you want to read 64KB, it's
going to hit a few disks (unless you have an oddly small stripe size), so
you generate a few interrupts (not 1000s).

Of course if you want to sustain 50MB/sec to each of 1000 disks then the
interrupts go up by a factor of 1000, but you will bottleneck elsewhere
first.

Why would interrupts scale with the number of disks instead of performance?
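
Rough arithmetic: at 600MB/sec, assuming one completion interrupt per 64KB
request, you'd expect on the order of 600MB / 64KB ~= 9,600 interrupts/sec
regardless of how many disks sit behind the array, which is in the same
ballpark as the 13,478/sec measured above.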

I've not noticed hardware RAID scaling any better than software RAID
per disk.  I've not personally tested any system with more than 16 drives
though; I prefer commodity parts, including RAID controllers, power supplies,
and cases.  The 24-48 drive setups seem pretty exotic and low volume, and
make me nervous about cooling, drive spin up, weight, etc.  If you
need to install a 48 disk server at the top of a 48U rack I am definitely
busy ;-).  Not to mention I'd bet that under most workloads 4 16-disk
servers are going to be faster than 1 48-disk server... and cheaper.  Probably
worse on total power draw, but maybe not worse on performance/watt.

I wouldn't turn down the opportunity to benchmark software RAID on a 48 drive
system though.  Sun recommends software RAID on their 48 disk server.

> the base SATA drivers seem to have a problem with lots of CSW.

Which drivers?  I've got a couple Areca 16 port (not the fast new one) and a
couple 3ware 16 port of some 9550SX flavor (I'd have to check if it's
the fastest they have in that size).  I'd happily buy a 16 port non-RAID
card if I could find one; I haven't so far.

>> RAID 0/1/5/6, etc., hotswap, SAS/SATA capability, etc.
>>
>>> Oh, and how do you measure performance?  Bandwidth?  Seeks?
>>> Transactions?
>>> Transaction size?  Mostly read? write?
>>>
>>
>>
>> All of the above. We would be max per-drive performance, say 70MB/s
>> reads with 100 IOPs on SATA, 120MB/s reads with 300 IOPs on SAS using 4k
>> transaction sizes. Hopefully eliminate any queueing bottlenecks on the
>> hardware RAID card.

I didn't mean to trim the attribution (if I did it), but yeah, 40-60MB/sec
per drive seems common with large sequential transfers and software RAID.  In
my experience (up to 16 drives) software RAID seems to scale much better than
hardware RAID; I've seen many 8-16 drive hardware RAIDs peak out at 2-3 times
the single disk performance.  Of course you will see much less than
40-60MB/sec with smaller random transfers.  My favorite benchmark for
quantifying this is postmark: you can set the transaction size, the ratio of
reads to writes, and the number of transactions.  Bonnie++ by comparison
seems kind of silly because it does all kinds of things, so you can't really
say X RAID is better than Y RAID at Z because of context switches or
interrupts; it's all averaged together.
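
A minimal postmark config along these lines exercises exactly those knobs
(the path, file counts, sizes, and read bias are just example values):

$ cat pm.cfg
set location /mnt/raid
set number 5000
set size 500 10000
set read 4096
set write 4096
set transactions 20000
set bias read 5
run
quit
$ postmark pm.cfg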


>> Assume that we are using RDMA as the network transfer protocol so there
>> are no network interrupts on the cpus being used to do the XORs, etc.
> 
> er .... so your plan is to use something like a network with RDMA to 
> attach the disks.  So you are not using SATA controllers.  You are using 
> network controllers.  With some sort of offload capability (RDMA without 
> it is a little slow).

I've yet to see RDMA or TOE justified from a performance perspective.
I've seen 800MB/sec over InfiniPath, but I wasn't driving it from a storage
array.  I could try pushing 600MB/sec from disks to IB; I'd be kind of
surprised if I hit some context switch or interrupt wall.  If there's
a real workload that has this problem and you think it's hitting that
wall, I'm game for trying to set it up.

> You sort-of have something like this today, in Coraid's AOE units.  If

Coraid didn't impress me as supporting very good bandwidth per disk.  If
you want some kind of block level transport I'd suggest iSCSI over
whatever you want; InfiniBand or 10G would be the obvious choices.  When I
tested Coraid and a few of the turnkey 16-24 disk NAS-like devices
that run Linux with hardware RAID, I was VERY disappointed in their
performance, which is kind of shocking considering the 5 figure prices.

> you don't have experience with them, you should ask about what happens 
> to the user load under intensive IO operations.  Note:  there is nothing 
> wrong with Coraid units, we like them (and in full disclosure, we do 
> resell them, and happily connect them with our JackRabbit units).

For cheap block level storage, sure, but the discussion seemed to be whether
software RAID can be just as good as or better than hardware RAID.  I don't
see Coraid as particularly relevant to that.

>> Right now, all the hardware cards start to precipitously drop in
>> performance under concurrent access, particularly read/write mixes.
> 
> Hmmm.... Are there particular workloads you are looking at?  Huge reads 
> with a tiny write?  Most of the RAID systems we have seen suffer from 
> small block random I/O. 

Right, hardware or software.

> There your RAID system will get in the way (all 
> the extra seeks and computations will slow you down relative to single 
> disks).  There you want RAID10's.

Sure, at least depending on your mix of reads/writes and access patterns,
best quantified (IMO) with postmark.  I'd love to find something better...

> We have put our units (as well as software RAIDs) through some pretty 
> hard tests: single RAID card feeding 4 simultaneous IOzone and bonnie++ 
> tests (each test 2x the RAM in the server box) through channel bonded 
> quad gigabit.  Apart from uncovering some kernel OOPses due to the 
> channel bond driver not liking really heavy loads, we sustained 360-390 
> MB/s out of the box, with large numbers of concurrent reads and writes. 
>  We simply did not see degradation.  Could you cite some materials I can 
> go look at, or help me understand which workloads you are talking about?

I don't see any reason that software RAID + quad GigE or IB/10G couldn't
do similar or better.

>> Areca is the best of the bunch, but it's not saying much compared to
>> Tier 1 storage ASICs/FPGAs. 
> 
> You get what you pay for.

My experience is just the opposite.  Low volume, high margin, expensive
storage solutions often leave me shocked and horrified at their performance,
even on the easy things like bandwidth, let alone the harder things like
random I/O or write intensive workloads.

>> The idea here is twofold. Eliminate the cost of the hardware RAID and
> 
> I think you are going to wind up paying more than that cost in other 
> elements, such as networking, JBOD cards (good ones, not the crappy 
> driver ones).

I'd love a cheap fast JBOD card, but alas I've been buying 3ware/Areca
16 ports just because I've not found anything cheaper.  I'd rather have
one 16 port card than two 8 ports (or four 4 ports) just for complexity
reasons, and the 16 ports I've tried seem to scale reasonably.  I have even
higher expectations for software RAID performance with the newest
3ware/Areca, especially now with 8x PCIe instead of 4x, which I expect is
the limiting factor I'm seeing currently.
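
For reference on that limit: a 4x PCIe (gen 1) slot tops out at 4 x 250MB/sec
= 1GB/sec raw per direction, and real transfers land noticeably below that,
so 600MB/sec from a single card is already in that neighborhood; 8x doubles
the headroom.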

> The parity calculations are fairly simple, and last I checked, at MD 
> driver startup, it *DOES* check which method makes the parity check 
> fastest in the md assemble stage.  In fact, you can see, in the Linux 
> kernel source, SSE2, MMX, Altivec implementations of RAID6. 
> Specifically, look at raid6sse2.c

Right:
[642361.177665] raid5: using function: generic_sse (7821.000 MB/sec)
[642361.665406] raid6: using algorithm sse2x4 (5333 MB/s)
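
Those lines show up in the kernel log when the md modules load; on a running
box this digs them out:

$ dmesg | grep -i raid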

> If this is limited by anything (just eyeballing it), it would be a) a 
> lack of functional units, b) SSE2 issue rate, c) SSE2 operand width.

I'd pick D) disks.  If a single core can handle 5-8GB/sec of parity, then a
vanishingly small fraction of a SINGLE core (out of the usual 4-8) covers it,
leaving the large majority of the CPUs free for handling the application
code, file system, etc.  Even at 600MB/sec that's something like 7% of a
single CPU (600 / 7821 = 7.7%), and I'd much rather have that 7% doing RAID
calcs than have it free and get less performance with hardware RAID (which
has been my experience).

> Lack of functional units can sort of be handled by more cores.  However, 
> this code is assembly (in C) language.  Parallel assembly programming is 
> not fun.

Er, the MD code is already written, very well tested, and doesn't need to be
touched.  Just write your application as you normally would and let the Linux
software RAID do its thing.
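
Setting one up really is that hands-off (a sketch; the device names are just
examples, and an 8-disk RAID5 matches the arrays benchmarked above):

$ mdadm --create /dev/md0 --level=5 --raid-devices=8 /dev/sd[b-i]
$ cat /proc/mdstat    # watch the initial parity build, then mkfs and go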

> Moreover, OS jitter, context switching away from these calculations will 
> be *expensive* as you have to restore not just the full normal register 
> stack and frame, but all of the SSE2 registers.  You would want to be 
> able to dedicate entire cores to this, and isolate interrupt handling to 
> other cores.

Sure, context switches get more expensive.  But if your machine can do
a context switch in 3-4us per CPU, or in 1.25us (with 4 CPUs), and you only
need 10k/sec, is it really that big of a deal?  I don't really see 10k/sec
being a big issue if the system can handle almost 1M/sec, even if switches
get 1/2 as fast because of the SSE register restores.
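
Putting numbers on it: 10,000 switches/sec at the slower 3.5us figure above
is 35ms of CPU time per second, about 3.5% of one core; even doubled for the
SSE register restores it's only ~7%.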

>> I just haven't seen something like that and I was not aware that md
>> could acheive anything close to the performance of a hardware RAID card
>> across a reasonable number of drives (12+), let alone provide the
>> feature set. 
> 
> Due to SATA driver CSW/interrupt handling, I would be quite surprised if 
> it were able to do this (achieve similar performance).  I would bet 

Pick an exact workload (ideally something like postmark, so we can pin
down things like context switches, scaling, and interrupts) and we
can compare.

> performance would top out below 8 drives.

I've not seen this.  Granted, RAID throughput per drive in general decreases
as the number of drives increases.  From what I can tell this is just because
with a large number of disks you effectively get one virtual head.  So if
you have a ton of reads/writes in the 100GB around block X you are golden.
But if you run a second disk intensive process with I/O clustered around
block Y, all your drives are suddenly seeking a ton.  Much better to make two
RAIDs out of 1/2 the drives each and split the workload across them.  This of
course is hard to do in practice, and it creates justifications for
replication, fancy file systems, load balancing, migrations, and other high
end storage features that are, from what I can tell, completely independent
of software vs hardware RAID.

>  My own experience suggests 4 
> drives.  After that, you have to start spending money on those SATA 
> controllers.  And you will still be plagued by interrupts/CSW.  Which 
> will limit your performance.  Your costs will start approaching the 
> "expensive" RAID cards.

Exactly what workloads are you seeing this scaling on with hardware RAID and
not with software RAID?  I'll try just about any source I can grab a hold of,
or a config file for postmark.

> What we have found is, generally, performance on SATA is very much a 
> function of the quality of the driver, the implementation details of the 

I've tried nvidia (1-4 drives), Promise (1-4), Areca (1-16), 3ware (1-16)
and seen no scaling problems.  I've tried older 3ware (1-16),
StorageWorks (1-15), Dell PERC (1-15), a bunch of $2k-$5k SCSI
controllers, and a few others I can't remember (at least one MegaRAID or
similar) and found them all rather lacking in the scaling department with
hardware RAID.

> controller, how it handles heavy IO (does it swamp the motherboard with 
> interrupts?).  I have a SuperMicro 8 core deskside unit with a small 
> RAID0 on 3 drives.  When I try to push the RAID0 hard, I swamp the 
> motherboard with huge numbers of interrupts/CSW.  Note that this is not 
> even doing RAID calculations, simply IO.

Can you quantify this?  I've seen behavior like this with older kernels, but
it seemed to be more related to the scheduler letting processes get swapped
out and starve; it wasn't anything related to context switches or interrupts.
This seems basically fixed in the newer kernels.

> You are rate limited by how fast the underlying system can handle IO. 
> The real value of any offload processor is how it, not so oddly enough, 
> offloads stuff (calculations, interrupts, IO, ...) from the main CPUs. 
> Some of the RAID cards for these units do a pretty good job of 
> offloading, some are crap (and even with SW raid issues, it is faster 

It's a bit of a mystery to me exactly what is being offloaded.  The RAID calc
is trivial.  The hard stuff (applications, file system, buffering I/O in main
memory, etc.) is the same either way.  Handling 50MB/sec streams from very
latency tolerant (on the scale of a 2GHz CPU) disks is pretty straightforward.

In any case, I'm open to the idea that hardware RAID has significant
superiority for some workload.  I've yet to see it (despite some trying).
I've not seen any bottlenecks in memory bandwidth (5-7GB/sec per socket),
I/O bandwidth (1-4GB/sec per socket), context switches, interrupt handling,
or CPU cycles that would improve with hardware RAID.  I'd love to see some
I/O intensive workload with greater application performance on hardware RAID.



