[Beowulf] Software RAID?

Tue Nov 27 07:00:28 PST 2007

Hi Bill

Bill Broadley wrote:
> 
> Long reply, some actual numbers I've collected, if you read nothing
> else please read the last paragraph.

Read it all.  Thanks for the reply.

> 
> Joe Landman wrote:
>> Ekechi Nwokah wrote:
> 
>> Hmmm... Anyone with a large disk count SW raid want to run a few 
>> bonnie++ like loads on it and look at the interrupt/csw rates?  Last I 
> 
> Bonnie++ runs lots of things, seems like a smaller test might be more
> useful.  I.e. did you want large contiguous reads?  Writes?  Random?
> Small?
> 
>> looked on a RAID0 (2 disk) we were seeing very high interrupt/csw rates. 
> 
> What is very high?

8500 csw/s, 7500 ints/s for the unit I mentioned, 3 drive RAID0.  We 
have done larger 4 drive RAID10's, and some 8 drive RAID5's with a hot 
spare.  We saw 20+k CSW/s, with 20+k interrupts/s.  System with 4 cores 
was quite sluggish.

Replacing the SW RAID with even a slow 3ware dropped the interrupt and 
CSW rates to under 10k, and the system felt less sluggish, more responsive.

> I just did a 16GB (total) dd's from 2 8 disk raid5s (at the same time) @ 
> 600MB/sec. (16GB in 27.206 seconds).

Yeah I have done these too, with software and hardware RAIDs.  They 
aren't indicative of real workloads as far as I can tell, and if you do 
them large enough, you remove any system caching behavior.

You can see something I wrote about this a while ago: 
http://scalability.org/?p=394#more-394

My test code looks like this

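#!/bin/bash
# Sequential write test: 100 blocks of 128 MiB (about 13 GB), using direct IO.
sync                      # flush pending writes before timing starts
echo -n "start at "
date
dd if=/dev/zero of=/local/big.file bs=134217728 count=100 oflag=direct
sync                      # make sure everything has reached the disks
echo -n "stop at "
date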

I changed 100 to 1000 and then to 10000.  For 100 (13 GB), our 13 drive 
RAID6 (11 data drives) saw 812 MB/s (13421772800 bytes copied, 16.5194 
seconds, 812 MB/s).  For 1000 (134 GB) our 13 drive RAID6 saw 
(134217728000 bytes copied, 191.932 seconds, 699 MB/s) 699 MB/s.  Going 
to 10000, our 13 drive RAID6 saw (1342177280000 bytes copied, 2191.99 
seconds, 612 MB/s) 612 MB/s.
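
For anyone checking the arithmetic, the MB/s figures are just bytes 
divided by seconds, in decimal megabytes.  A minimal sketch (mb_per_s is 
a made-up helper name, not something dd provides):

```shell
#!/bin/bash
# mb_per_s BYTES SECONDS: decimal megabytes per second, rounded to the
# nearest whole MB/s.  Illustrative helper, not part of dd.
mb_per_s() {
    awk -v bytes="$1" -v secs="$2" 'BEGIN { printf "%.0f\n", bytes / secs / 1e6 }'
}

# The three runs quoted above (count=100, 1000, 10000 at bs=134217728).
mb_per_s 13421772800   16.5194    # prints 812
mb_per_s 134217728000  191.932    # prints 699
mb_per_s 1342177280000 2191.99    # prints 612
```

The drop from 812 to 612 MB/s as the file grows is the effect under 
discussion: the larger the run, the less any caching helps.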

Our CSW rate was under 4000 the entire time, our interrupt rate was 
about 3000 the entire time.
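
If you want to collect comparable CSW and interrupt rates on your own 
box, here is a rough sketch.  It assumes a Linux /proc/stat with the 
usual "ctxt" and "intr" lines; the function names are invented for this 
example.

```shell
#!/bin/bash
# Illustrative helpers (hypothetical names), assuming Linux /proc/stat
# contains "ctxt <total>" and "intr <total> ..." lines.

# read_counters [FILE]: print "<context switches> <interrupts>" since boot.
read_counters() {
    awk '$1 == "ctxt" { c = $2 } $1 == "intr" { i = $2 } END { print c, i }' \
        "${1:-/proc/stat}"
}

# per_second BEFORE AFTER SECONDS: integer rate between two counter samples.
per_second() {
    echo $(( ($2 - $1) / $3 ))
}

# sample_rates [SECONDS]: take two samples SECONDS apart and print both rates.
sample_rates() {
    local secs=${1:-5} before after
    before=$(read_counters)
    sleep "$secs"
    after=$(read_counters)
    echo "csw/s: $(per_second "${before% *}" "${after% *}" "$secs")" \
         "ints/s: $(per_second "${before#* }" "${after#* }" "$secs")"
}
```

Run something like "sample_rates 10" in another terminal while the dd is 
going.  The "in" and "cs" columns of "vmstat 1" report the same two 
quantities if you would rather not script it.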

> 
> 230644 context switches, 8477 per second, 2119 per cpu per second.

Now try to do something else on that machine that requires kernel CSW 
and interrupts.  We have, and it wasn't pretty.  Specifically, try to 
pull this data out of a network connection that does generate interrupts 
or context switches.

> Interrupts services was 366,688, 13,478 per second, 3369 per cpu per 
> second.
> 
> System feels normal, maybe I could run a netperf with a small packet size
> to measure performance under this load... other suggestions?

Real use cases.  Multiple readers and writers over ~4 GbE channels doing 
heavy NFS file IO.

Again, we have done this, and have helped end users/customers with 
this.  It isn't pretty.

If you are using 3ware/Areca/LSI RAID cards in JBOD mode, their drivers 
handle much of the pain for you.  The pain in this case is the SATA 
interrupts and context switches.  Motherboard mounted SATA controllers 
(you can get systems with up to 14 ports: 6 SATA, 8 SAS) are the case I 
am focused upon.

[...]

>>  This would quickly swamp any perceived advantages of "infinitely 
>> many" or "infinitely fast" cores. 
> 
> Hrm, why?  Does context switches not scale with core speed?  Or number
> of cores?  Can't interrupts be spread across CPUs?  Hell if you are

Not for serialized access to IO.  Any point of serialization is bad, and 
IO is a great serializer.  Most IO systems don't do concurrent IO's, 
they have all sorts of tricks (elevators, io-schedulers, etc) to provide 
better throughput, but at the end of the day, they are serial devices.

So having more CPUs handle more CSW is great up until the data has 
to flow out of system cache and onto devices.  Which reminds me.  Please 
try the code fragment above with the 1000 case.  I am curious what 
happens well outside cache.  The 13 GB number isn't "fair" as it still 
makes very good use of cache, and few IO servers that I am aware of 
right now have 128 GB ram (a few do, so we need larger tests).  The 134 
GB test works out quite well for obliterating cache effects and getting 
to the raw controller/disk performance.  You will be IO bound for that 
(and in this case, this is write bound, and interrupt/csw bound).
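
One way to pick a dd count that lands well outside cache is to size the 
file as a multiple of installed RAM.  This is only a sketch: the helper 
name and the multiple of 8 in the usage line are my assumptions, and the 
block size defaults to the 128 MiB used in the script above.

```shell
#!/bin/bash
# dd_count_for RAM_KB MULTIPLE [BS_BYTES]: number of dd blocks needed so the
# test file is at least MULTIPLE times RAM.  Hypothetical helper; BS_BYTES
# defaults to 134217728 (128 MiB).
dd_count_for() {
    local ram_kb=$1 multiple=$2 bs=${3:-134217728}
    local total=$(( ram_kb * 1024 * multiple ))
    echo $(( (total + bs - 1) / bs ))    # round up to whole blocks
}

# ram_kb: installed memory in kB as reported by Linux /proc/meminfo.
ram_kb() {
    awk '$1 == "MemTotal:" { print $2 }' /proc/meminfo
}

# Example (not run here): dd_count_for "$(ram_kb)" 8
```

For a 16 GB machine, "dd_count_for 16777216 8" gives 1024, which is 
close to the count=1000 case.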

> really worried about it you could put 2 RAID controllers each connected
> to the PCI-e attached to hypertransport on separate opteron sockets.
> 
> Have you seen cases where practical I/O loads were limited by context 
> switches
> or interrupts?

Yes.  NFS servers under heavy load serving content via gigabit.  Local 
BLAST servers running large numbers of queries against the nt database.  
And others.

In the latter case, I had our JackRabbit 4 core 13 drive RAID6 unit run 
1000 sequences of A. thaliana against nt from July 07.  I also ran this 
on the RAID0 machine with 8 cores.  The RAID0 machine can achieve 210 
MB/s read speeds according to bonnie++.  The RAID0 machine has 8 GB ram, 
the JackRabbit has 16 GB.  The indexed nt DB's fit into ram, and the 
code was optimized for x86_64.

In both cases, I used 1 thread per core.

The 4 core JackRabbit server finished its calculations (involving 
mmap'ed reading/rereading of the nt indexes) in about 14 minutes.  The 8 
core Pegasus desktop finished its calculations in something on the order 
of 95 minutes.

During this work, the JackRabbit was usable/functional as an NFS server, 
and I was running a VMWare server session on it at the same time, with a 
copy of Windows XP x64 with 2 CPUs, not to mention some streaming tars 
we were doing via nfs.  We saw in aggregate about 5000 CSW/s, and under 
7000 ints/s.  System was not sluggish, and scaled very well.

During this work, the Pegasus was sluggish as a desktop system.  We saw 
in aggregate about 18000 CSW/s, and about 16000 ints/s.

Same binary code, same data set, not a dd, but something the folks 
visiting us on this show floor might do themselves.

> Personally seeing 600MB/sec out of 16 disks in 2 RAID-5's keeps me 
> pretty happy.  Do you have numbers for a hardware RAID?  I have a crappy 

We are seeing sustained 650-800 MB/s out of 1 13-disk RAID6.  Out of a 
two RAID controller system, we are seeing something north of 1.2 GB/s 
sustained.  Far outside of cache that is (not just simple dd tests, but 
real workloads).

In all cases, we are not hitting a context switch wall.  Nor are we 
hitting an interrupt wall.

Please note that some of this is distribution dependent, in that some 
distributions have gone off in some not-so-helpful directions with 
regards to kernel stacks and other bits.  We get our best performance 
numbers with Ubuntu, lose about 8-10% going to SuSE, lose ~20% going to 
Fedora.  Nothing compared to what you lose going to RHEL4, but still, 
worth at least knowing.  We recommend a 2.6.22.6 kernel we built that 
does a nice job on pretty much every workload we have thrown on it.  Our 
.debs should be up on our download site.  The .rpms take a little more 
work (I wish building kernel RPMs for RHEL/Centos/SuSE was even close to 
as easy as it is building new .debs for Ubuntu/Debian).

> hardware RAID, I think it's the newest dell Perc, 6 of the fancy 15k rpm 
> 36GB SAS drives in a single RAID-5.  I manage 171MB/sec. 494,499 context 
> switches, 2587
> per second.  910,294 interrupts, 4763/sec.
> 
> So to compare:
>                         software-raid  hardware-raid  s/h
> Context switches/sec      8477         2587           3.28
> Interrupts/sec          13,478         4763           2.82
> 
> Which sounds like a big win until you realize that the software raid was
> 3.5 times faster.

I haven't seen Dell Perc's described as "fast".  We don't use them, we 
don't have them, so I can't comment on them.  Our HW raids outperform 
our SW raids by quite a bit.  If we built our SW raids atop our HW RAID 
cards running in JBOD mode, then we would likely achieve "similar" 
numbers.  Built atop the motherboard SATA (which the original poster was 
suggesting), you will experience the issues that I had indicated.  If 
you build your SW raid atop the same controllers that others build their 
hardware RAID atop of, you aren't necessarily doing what the original 
poster asked for (they wanted to avoid spending money on the very cards 
you are using, and leverage the existing cheap SATA cards).

> 
>> Sort of like an Amdahl's law.  Make the expensive parallel computing 
>> portion take zero time, and you are still stuck with the serial time 
>> (which you can't do much about).  Worse, it 
> 
> What part of context switching and interrupt handling doesn't scale with
> core speed or cores, er, well at least sockets?

Anything where the data sink/source is serial, such as IO, networking, ...

Then you are sharing a fixed sized resource among more 
processors/sockets.  You get a classic 1/N problem which looks/scales 
exactly like the OpenMP false sharing does.  Adding more cores/sockets 
actually slows it down.

In the case of the PCIe connected RAID cards, you have about 2 GB/s pipe 
into the unit.  The RAID card handles the SATA controllers for you (does 
  all the interrupt servicing via the processor on the card).  Does all 
the local cache management, etc.  If you are attaching SATA to this, and 
operating it as HW or SW RAID, you have only interrupts to the cards, 
not to the controllers.  If on the other hand, and speaking to the point 
of the original poster, you attach SATAs to the SATA ports on the 
motherboard, the CPUs have to handle all the controllers, CSWs and 
interrupts.  Very different scenarios.

>> is size extensive, so as you increase the number of disks, you have to 
>> increase the interrupt rate (one controller per drive currently), and 
> 
> Er, why is that?  Say I have 1000 disks.  You want to read 64KB it's
> going to be a few disks (unless you have an oddly small strip size), so
> you generate a few interrupts (not 1000s).

Back to the original point of the poster, the motherboard controllers 
(remember, this person does not want to buy RAID cards and then use them 
as SATA controllers) will generate interrupts per disk transfer, and you 
have one controller per disk.

If you use a 3ware/Areca/LSI/Adaptec as a SATA controller (RAID in JBOD 
mode), this is a different story.  One interrupt per controller card, 
and you are avoiding using functionality you have paid for (which in the 
majority of these cases, is not a bad thing).

> Of course if you want to support 50MB/sec to 1000 disks then the interrupts
> go up by a factor of 1000, of course you will bottleneck elsewhere.
> 
> Why would interrupts scale with the number of disks instead of performance?

See above, going to the original intent of the poster.  One controller 
per disk.  Controllers generate interrupts per transfer.  N disks 
generate N*M interrupts for large transfers.
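
If you want to see this on a given box, you can total the interrupt 
counts per controller from /proc/interrupts.  A rough sketch follows, 
assuming the usual Linux layout (one header row of CPU columns, device 
name in the last field).  The helper name is invented, and the driver 
names to match ("ahci", "sata_nv", and so on) depend on your hardware.

```shell
#!/bin/bash
# irq_totals PATTERN [FILE]: for every /proc/interrupts line matching
# PATTERN (case-insensitive regex), sum the per-CPU counts and print
# "<last field> <total>".  Hypothetical helper; on shared IRQ lines the
# last field is only one of the devices on that line.
irq_totals() {
    awk -v pat="$1" '
        NR == 1 { ncpu = NF; next }           # header row: one column per CPU
        tolower($0) ~ tolower(pat) {
            total = 0
            for (f = 2; f <= ncpu + 1; f++) total += $f
            printf "%s %.0f\n", $NF, total
        }' "${2:-/proc/interrupts}"
}
```

Run "irq_totals 'ahci|sata'" before and after a large transfer and 
subtract.  With one controller per disk on the motherboard, you should 
see one counter per disk growing; behind a RAID card in JBOD mode you 
should see only the card's.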

> I've not noticed hardware raid scaling any better than software raid
> per disk.  I've not personally tested any system with more than 16 
> drives though, I prefer commodity parts including RAID controllers, 
> power supplies,

We have, on motherboard/cheap SATA controller connected SW RAID.

> and cases.  The 24-48 drive setups seem pretty exotic, low volume, and

Not exotic.  Not high volume.

> make me nervous about cooling, drive spin up, weight, etc.  If you

Weight is an issue, you want good sturdy racks.  Airflow is not an 
issue.  Noise is.  You need to move lots of air to keep these drives 
cool.  Drive spin up is not an issue.  We do this with delays.  Works fine.

> need to install a 48 disk server at the top of a 48U rack I am definitely
> busy ;-).  Not to mention I'd bet that under most work loads 4 16 disk

Darn it, I was going to call you and ask for a hand with this :)

> servers are going to be faster than 1 48... and cheaper.  Probably worse
> per watt, maybe not worse performance/watt.

Actually no.  The other way around.  1 x 48 (ok, one of our 48s) costs 
less than 4 x 16's (4 of our 16's) with the same drives.  If you care 
about single file system name space, then you have to run a clustered 
file system, which complicates the 4 x 16s.  Each 16 runs about 700W 
max. Each 48 runs about 1300W max.
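
Working the power numbers through, as a quick sketch using only the 
maximum figures above (helper names are made up for the example):

```shell
#!/bin/bash
# max_watts UNITS WATTS_PER_UNIT: total maximum draw for a group of servers.
max_watts() { echo $(( $1 * $2 )); }

# watts_per_drive WATTS DRIVES: maximum watts per drive slot, rounded.
watts_per_drive() {
    awk -v w="$1" -v d="$2" 'BEGIN { printf "%.0f\n", w / d }'
}

echo "4 x 16-drive: $(max_watts 4 700) W total, $(watts_per_drive 700 16) W/drive"
echo "1 x 48-drive: $(max_watts 1 1300) W total, $(watts_per_drive 1300 48) W/drive"
```

That is 2800 W against 1300 W.  The four 16s hold 64 drives rather than 
48, which is why the per-drive figure (about 44 W against about 27 W) is 
the fairer comparison.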

> I wouldn't turn down the opportunity to benchmark software RAID on a 48 
> drive
> though.  Sun recommends software raid on their 48 disk server.

... for which they charge significantly more than others who use HW RAID 
with 48 drives (and who achieve somewhat better performance than the SW 
RAID).

>> the base SATA drivers seem to have a problem with lots of CSW.
> 
> Which?  I've got a couple areca 16 port (not the fast new one) and a
> couple 3ware 16 port of some 9550sx flavor (I've have to check if it's
> the fastest they have in that size).  I'd happily buy a 16 port non-raid
> card if I could find them, I haven't so far.

Areca and 3Ware are not SATA adapters, they are RAID adapters, which 
have a JBOD mode.  If you are using these in your discussion, then you 
are leveraging all of the advantages of the HW RAID, with the simple 
difference of doing the RAID calculations on the CPU rather than on the 
card.  The card handles all the SATA controller issues for you.  As well 
as the interrupts the controller generates.  It doesn't present the SATA 
controllers as pass-through.  3ware is known to be "not-fast" on RAID calcs.

As the original poster indicated, they wanted to do this without 
spending money on the RAID cards (with JBOD mode).

Which you can do with the 14 drive SuperMicro motherboards.  I was 
talking about the latter, you seem to be talking about the former.

[...]

>> er .... so your plan is to use something like a network with RDMA to 
>> attach the disks.  So you are not using SATA controllers.  You are 
>> using network controllers.  With some sort of offload capability (RDMA 
>> without it is a little slow).
> 
> I've yet to see RDMA or TOE be justified from a performance perspective,
> I've seen 800MB/sec over infinipath, but I wasn't driving it from a storage
> array.  I could try pushing 600MB/sec from disks to IB, I'd be kind of
> surprised if I hit some context switch or interrupt wall.  If there's
> a real workload that has this problem and you think it's hitting the
> wall I'm game for trying to set it up.

Greg might comment on this, but Infinipath drivers operated in 
effectively a polling mode, and the cards did some of their own offload 
processing of some sort.

We have seen RDMA/TOE make sense for users in a real code scenario.  We 
used Ammasso 1100 cards for a customer running iWARP, and ran StarCD on 
it.  Running on the same machines and the same network switch without 
TOE/iWARP was 1/4 the speed of running with it, for this MPI job 
(latency sensitive).  I keep hearing people denigrate TOE and RDMA, but 
how many have actually used it?  We have, and it has made some 
noticeable differences in the real world apps.

Worth the cost?  That is a separate discussion.  I wouldn't pay a huge 
premium for it.  This would be hard to justify apart from exceptional cases.

>> You sort-of have something like this today, in Coraid's AOE units.  If
> 
> coraid didn't impress me as supporting very good bandwidth per disk,

They don't.  They are good, cheap, bulk storage.

> if you want some kind of block level transport I'd suggest iSCSI over
> whatever you want, infiniband or 10G would be the obvious choices.  My

I fail to see how iSCSI over gigabit would be any faster than AoE over 
gigabit.

> testing of coraid and a few of the turn key 16-24 disk NAS like devices
> that run linux with hardware RAID and I was VERY disappointed in their
> performance, kind shocking considering the 5 figure prices.

Wow ... our devices have 4 figure prices, and are quite a bit faster 
than most units with 5 figure pricing.  Maybe I should bug you offline 
if you are willing to share information.

Coraid's sweet spot is bulk storage.

> 
>> you don't have experience with them, you should ask about what happens 
>> to the user load under intensive IO operations.  Note:  there is 
>> nothing wrong with Coraid units, we like them (and in full disclosure, 
>> we do resell them, and happily connect them with our JackRabbit units).
> 
> For cheap block level storage sure, but the discussion seemed to be can 
> software raid be just as good or better than hardware RAID. I don't see 
> that as being particularly relevant to the coraid.

This was related to his RDMA point.  The RDMA adapters all come with a 
price premium.  Just like the HW RAID adapters.  The poster did not want 
to pay a premium for HW RAID adapters (like Areca/3ware/LSI/Adaptec), so 
I was confused as to why they wanted to pay a premium for RDMA.  It was 
a wash in the end.

Coraid is cheap block storage.

The discussion as I read it was can you achieve HW RAID performance with 
SW RAID without spending money on the HW RAID adapters.

Correct me if I am wrong, but you are using the proprietary HW RAID 
adapters in JBOD mode?

[...]

>> We have put our units (as well as software RAIDs) through some pretty 
>> hard tests: single RAID card feeding 4 simultaneous IOzone and 
>> bonnie++ tests (each test 2x the RAM in the server box) through 
>> channel bonded quad gigabit.  Apart from uncovering some kernel OOPses 
>> due to the channel bond driver not liking really heavy loads, we 
>> sustained 360-390 MB/s out of the box, with large numbers of 
>> concurrent reads and writes.  We simply did not see degradation.  
>> Could you cite some materials I can go look at, or help me understand 
>> which workloads you are talking about?
> 
> I don't see any reason that software raid + quad GigE or IB/10G couldn't
> do similar or better.

We could just as easily turn off the HW RAID portion and do the same 
thing in SW RAID.  The point of the poster was not to spend the money on 
the HW RAID adapter in the first place.  If you don't spend the money on 
the HW RAID adapter, even if you run it solely in JBOD mode, and use the 
motherboard SATA, you will not achieve what we are talking about.

>>> Areca is the best of the bunch, but it's not saying much compared to
>>> Tier 1 storage ASICs/FPGAs. 
>>
>> You get what you pay for.
> 
> My experience is just the opposite.  Low volume high margin expensive 
> storage solutions often leave me shocked and horrified as to their 
> performance, even
> on the easy things like bandwidth, let alone the harder things like random
> I/O or write intensive workloads.

Correcting the context.  The Bluearc/DDN/... FPGAs are highly tuned 
processors for storage.  If you want them, you need to pay for them.

> 
>>> The idea here is twofold. Eliminate the cost of the hardware RAID and
>>
>> I think you are going to wind up paying more than that cost in other 
>> elements, such as networking, JBOD cards (good ones, not the crappy 
>> driver ones).
> 
> I'd love a cheap fast JBOD card, but alas I've been buying 3ware/areca
> 16 ports just because I've not found anything cheaper.  I'd rather have

This is my point ... you are using the expensive RAID cards (in JBOD 
mode) while the poster wanted to "Eliminate the cost of the hardware 
RAID".  You won't eliminate the cost of the hardware RAID by running it 
in JBOD mode.

-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
        http://jackrabbit.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 866 888 3112
cell : +1 734 612 4615