[Beowulf] Software RAID?

Tue Nov 27 14:18:49 PST 2007

See below. 

> -----Original Message-----
> From: Joe Landman [mailto:landman at scalableinformatics.com] 
> Sent: Monday, November 26, 2007 6:56 PM
> To: Ekechi Nwokah
> Cc: Bill Broadley; Beowulf Mailing List
> Subject: Re: [Beowulf] Software RAID?
> 
> Ekechi Nwokah wrote:
> > Reposting with (hopefully) more readable formatting.
> 
> [...]
> 
> >> Of course there are a zillion things you didn't mention.  How many
> >> drives did you want to use?  What kind? (SAS? SATA?)  If you want 16
> >> drives often you get hardware RAID hardware even if you don't use it.
> >> What config did you want?
> >> Raid-0? 1? 5? 6? Filesystem?
> >>
> > 
> > So let's say it's 16. But in theory it could be as high as 192. Using
> > multiple JBOD cards that present the drives individually (as separate
> > LUNs, for lack of a better term), and use software RAID to do all the
> > things that a 3ware/Areca, etc. card would do across the total span of
> > drives:
> 
> Hmmm... Anyone with a large disk count SW raid want to run a few
> bonnie++ like loads on it and look at the interrupt/csw rates?  Last I
> looked on a RAID0 (2 disk) we were seeing very high interrupt/csw rates.
> This would quickly swamp any perceived advantages of "infinitely many"
> or "infinitely fast" cores.  Sort of like an Amdahl's law.  Make the
> expensive parallel computing portion take zero time, and you are still
> stuck with the serial time (which you can't do much about).  Worse, it
> is size extensive, so as you increase the number of disks, you have to
> increase the interrupt rate (one controller per drive currently), and
> the base SATA drivers seem to have a problem with lots of CSW.
> 
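
For anyone who wants to collect those interrupt/csw numbers on Linux, here is a minimal sketch. It assumes the standard /proc/stat layout, where the `intr` line starts with the total interrupt count and the `ctxt` line holds the total context switch count since boot. The function names are made up for illustration.

```python
import time


def parse_counters(stat_text):
    """Return (total_interrupts, context_switches) from /proc/stat text."""
    intr = ctxt = None
    for line in stat_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "intr":
            # First number on the line is the total across all IRQs.
            intr = int(fields[1])
        elif fields[0] == "ctxt":
            ctxt = int(fields[1])
    if intr is None or ctxt is None:
        raise ValueError("intr/ctxt lines not found")
    return intr, ctxt


def rates(before, after, seconds):
    """Per-second (interrupt, context switch) rates between two samples."""
    return tuple((a - b) / seconds for b, a in zip(before, after))


def sample(interval=1.0, path="/proc/stat"):
    """Read the counters twice, `interval` seconds apart (Linux only)."""
    with open(path) as f:
        first = parse_counters(f.read())
    time.sleep(interval)
    with open(path) as f:
        second = parse_counters(f.read())
    return rates(first, second, interval)
```

Calling `sample()` once on an idle box and once while the benchmark is running gives a rough before/after comparison. These are system-wide totals, so they include everything else running on the machine.
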
> > 
> > RAID 0/1/5/6, etc., hotswap, SAS/SATA capability, etc.
> > 
> >> Oh, and how do you measure performance?  Bandwidth?  Seeks?
> >> Transactions?
> >> Transaction size?  Mostly read? write?
> >>
> > 
> > 
> > All of the above. We would want max per-drive performance, say 70MB/s
> > reads with 100 IOPS on SATA, 120MB/s reads with 300 IOPS on SAS using
> > 4k transaction sizes. Hopefully eliminate any queueing bottlenecks on
> > the hardware RAID card.
> 
> This (queuing bottleneck) hasn't really been an issue in most 
> of the workloads we have seen.  Has anyone seen this as an 
> issue on their workloads?
> 

Hmmm....we hit this bottleneck with *very* little concurrency with a
number of different workloads. Even with sequential I/O, it doesn't take
much. Maybe 4 - 8 streams.

> > Assume that we are using RDMA as the network transfer protocol so
> > there are no network interrupts on the cpus being used to do the
> > XORs, etc.
> 
> er .... so your plan is to use something like a network with 
> RDMA to attach the disks.  So you are not using SATA 
> controllers.  You are using network controllers.  With some 
> sort of offload capability (RDMA without it is a little slow).
> 
> How does this save money/time again?  You are replacing 
> "expensive" RAID controllers with "expensive" Network 
> controller (unless you forgo offload, in which case RDMA 
> doesn't make much sense)?
> 

No. I am looking to replace expensive and single-bus ASICs/FPGAs on the
hardware RAID cards with less expensive software running on the cheap
and plentiful cores. 

The (offload) network controllers are much cheaper than the RAID
controllers.

> Which network were you planning on using for the disks?  
> Gigabit?  10 GbE?  IB?
>

IB.

> You sort-of have something like this today, in Coraid's AOE 
> units.  If you don't have experience with them, you should 
> ask about what happens to the user load under intensive IO 
> operations.  Note:  there is nothing wrong with Coraid units, 
> we like them (and in full disclosure, we do resell them, and 
> happily connect them with our JackRabbit units).
> 

Never heard of it. Will check it out.

> > Right now, all the hardware cards start to precipitously drop in 
> > performance under concurrent access, particularly read/write mixes.
> 
> Hmmm.... Are there particular workloads you are looking at?  
> Huge reads with a tiny write?  Most of the RAID systems we 
> have seen suffer from small block random I/O.  There your 
> RAID system will get in the way (all the extra seeks and 
> computations will slow you down relative to single disks).  
> There you want RAID10's.
>

Yes - well the issue is that even sequential I/O from multiple streams
trends towards random by the time the block requests hit the RAID card.
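
A toy illustration of that effect, as a sketch only. It assumes each stream reads consecutive block numbers from its own region of the array and that requests from the streams arrive strictly round-robin, which is a worst case; real I/O schedulers and controllers merge and batch requests. The function names and the `stream_spacing` parameter are hypothetical.

```python
def interleave(num_streams, blocks_per_stream, stream_spacing=1_000_000):
    """Round-robin merge of per-stream sequential block requests.

    Stream s reads blocks s*stream_spacing, s*stream_spacing + 1, ...
    Assumes blocks_per_stream < stream_spacing so regions do not overlap.
    """
    requests = []
    for i in range(blocks_per_stream):
        for s in range(num_streams):
            requests.append(s * stream_spacing + i)
    return requests


def sequential_fraction(requests):
    """Fraction of requests that start right after the previous one."""
    if len(requests) < 2:
        return 1.0
    hits = sum(1 for prev, cur in zip(requests, requests[1:])
               if cur == prev + 1)
    return hits / (len(requests) - 1)
```

With one stream every request follows its predecessor, so the fraction is 1.0. With four streams interleaved this way, no request is contiguous with the previous one, and the card sees what looks like a seek on every request.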

> We have put our units (as well as software RAIDs) through 
> some pretty hard tests: single RAID card feeding 4 
> simultaneous IOzone and bonnie++ tests (each test 2x the RAM 
> in the server box) through channel bonded quad gigabit.  
> Apart from uncovering some kernel OOPses due to the channel 
> bond driver not liking really heavy loads, we sustained 
> 360-390 MB/s out of the box, with large numbers of concurrent
> reads and writes.  We simply did not see degradation.  Could you cite
> some materials I can go look at, or help me understand which
> workloads you are talking about?
> 

360MB/s from 4 read/write streams....that would be with *very* large
block request sizes - like 8MB or something - hitting the RAID. How many
workloads will consistently generate those request sizes? 
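
The arithmetic behind that claim, as a rough sketch. It assumes 1 MB = 1024 KB and ignores caching and request merging. The helper names are made up for illustration.

```python
def required_iops(throughput_mb_s, request_kb):
    """Requests per second needed to sustain a throughput at a request size."""
    return throughput_mb_s * 1024 / request_kb


def throughput_mb_s(iops, request_kb):
    """Throughput in MB/s delivered by a given request rate and size."""
    return iops * request_kb / 1024


# 360 MB/s with 4k requests vs. 8MB requests
small = required_iops(360, 4)          # 92160.0 requests/s
large = required_iops(360, 8 * 1024)   # 45.0 requests/s

# One SATA drive at 100 IOPS of 4k random I/O
per_drive = throughput_mb_s(100, 4)    # 0.390625 MB/s
```

So sustaining 360MB/s takes about 45 requests per second at 8MB, but over 90,000 per second at 4k. That is far beyond what a handful of drives at 100-300 IOPS each can deliver once the access pattern turns random.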

> > Areca is the best of the bunch, but it's not saying much compared to
> > Tier 1 storage ASICs/FPGAs.
> 
> You get what you pay for.
> 
> 

Yes. But you should be able to get 90% of the way there with current
commodity hardware technology. But the software isn't delivering. At
least I'm not aware of any such software, hence my post. And by software
I mean the whole stack - drivers, scheduler, network transport, etc.

Let's take the data path from production (disk) to consumption
(application). There are 4 places you need hardware processing in
today's environment, IMHO: reading/writing data from disks (server),
reading/writing data to network (server), reading/writing data to
network (client), consuming the data (client). Everything else in the
data path should be a bus or a network link driven by software running on
those processors. Eventually, I hope we'll see this reduced to 2
processors, one on each end of the data path.

That's not a lot of money for processors: 2 network ASICs for transport
offload, and 2 commodity cpus. Using available technology, I think you
could drive 2-4GB/s through that data path if the software layers are
there.

> > The idea here is twofold. Eliminate the cost of the 
> hardware RAID and
> 
> I think you are going to wind up paying more than that cost 
> in other elements, such as networking, JBOD cards (good ones, 
> not the crappy driver ones).
> 

Not at all. If you look at the hardware cost metrics at scale, you'd be
surprised. Particularly in relation to cost/performance.

> > handle concurrent accesses better. My theory is that 8 cores
> > would handle concurrent ARRAY access much better than the chipsets on
> > the hardware cards, and that if you did the parity calculations, CRC,
> > etc. using SSE instruction set you could achieve a high level of
> > parallelism and performance.
> 
> The parity calculations are fairly simple, and last I 
> checked, at MD driver startup, it *DOES* check which method 
> makes the parity check fastest in the md assemble stage.  In 
> fact, you can see, in the Linux kernel source, SSE2, MMX, 
> Altivec implementations of RAID6. 
> Specifically, look at raid6sse2.c
> 
> /*
>   * raid6sse2.c
>   *
>   * SSE-2 implementation of RAID-6 syndrome functions
>   *
>   */
> 
> You can see the standard calc, the unrolled by 2 calc, etc.
> 
> If this is limited by anything (just eyeballing it), it would be a) a 
> lack of functional units, b) SSE2 issue rate, c) SSE2 operand width.
> 
> Lack of functional units can sort of be handled by more cores.  However,
> this code is assembly (in C) language.  Parallel assembly programming is
> not fun.
> 
> Moreover, OS jitter, context switching away from these calculations will
> be *expensive* as you have to restore not just the full normal register
> stack and frame, but all of the SSE2 registers.  You would want to be
> able to dedicate entire cores to this, and isolate interrupt handling to
> other cores.
> 

Interesting.
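
For reference, the simplest form of the parity calculation under discussion is a byte-wise XOR across the data blocks of a stripe. The sketch below is the single-parity (RAID-5 style) case in plain Python. It is not the RAID-6 syndrome code in raid6sse2.c, which does Galois field arithmetic and uses SIMD instructions, and the function names are made up for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks (the parity block)."""
    size = len(blocks[0])
    if any(len(b) != size for b in blocks):
        raise ValueError("all blocks must have the same length")
    # zip(*blocks) yields one tuple of byte values per byte position.
    return bytes(reduce(lambda a, b: a ^ b, column)
                 for column in zip(*blocks))


def rebuild_missing(surviving_blocks, parity):
    """Recover the one missing data block from the survivors plus parity."""
    return xor_blocks(list(surviving_blocks) + [parity])


data = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
parity = xor_blocks(data)                          # b"\xee\x22"
lost = rebuild_missing([data[0], data[2]], parity)  # equals data[1]
```

The calculation itself is cheap, which fits the point above: the cost is more likely in moving the data and in being switched away from the work than in the XOR.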

> > I just haven't seen something like that and I was not aware that md
> > could achieve anything close to the performance of a hardware RAID card
> > across a reasonable number of drives (12+), let alone provide the
> > feature set.
> 
> Due to SATA driver CSW/interrupt handling, I would be quite surprised if
> it were able to do this (achieve similar performance).  I would bet
> performance would top out below 8 drives.  My own experience suggests 4
> drives.  After that, you have to start spending money on those SATA
> controllers.  And you will still be plagued by interrupts/CSW.  Which
> will limit your performance.  Your costs will start approaching the
> "expensive" RAID cards.
> 
> What we have found is, generally, performance on SATA is very much a
> function of the quality of the driver, the implementation details of the
> controller, how it handles heavy IO (does it swamp the motherboard with
> interrupts?).  I have a SuperMicro 8 core deskside unit with a small
> RAID0 on 3 drives.  When I try to push the RAID0 hard, I swamp the
> motherboard with huge numbers of interrupts/CSW.  Note that this is not
> even doing RAID calculations, simply IO.
> 

Is it a SATA driver *implementation* issue, or an issue with the SATA
spec itself? (I don't know much about low-level SATA technology).

If it's an implementation issue, how hard would it be to rewrite the
driver?

> You are rate limited by how fast the underlying system can handle IO.
> The real value of any offload processor is how it, not so oddly enough,
> offloads stuff (calculations, interrupts, IO, ...) from the main CPUs.
> Some of the RAID cards for these units do a pretty good job of
> offloading, some are crap (and even with SW raid issues, it is faster
> than the crappy ones).
> 
> 

Good point. I suppose my contention is that with cores as cheap as they
are today, you can "offload" most of these calculations to separate
cores, eliminating the need for an offload processor. The key is indeed
reducing csw/interrupts. I didn't realize the SATA drivers generated so
many interrupts.
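
As a sketch of what dedicating cores could look like from user space on Linux, a process can restrict itself to a set of cores with `os.sched_setaffinity`. This only covers the process side. Steering interrupts away from those cores is a separate step (for example via /proc/irq/N/smp_affinity), and the in-kernel md threads are not controlled this way. The `pin_to_cores` helper is hypothetical.

```python
import os


def pin_to_cores(cores, pid=0):
    """Restrict a process (0 = the caller) to the given CPU cores.

    Linux only. Returns the affinity set actually in effect afterwards.
    """
    if not hasattr(os, "sched_setaffinity"):
        raise OSError("sched_setaffinity is not available on this platform")
    allowed = os.sched_getaffinity(pid)
    wanted = set(cores) & allowed
    if not wanted:
        raise ValueError("none of the requested cores are available")
    os.sched_setaffinity(pid, wanted)
    return os.sched_getaffinity(pid)
```

For example, `pin_to_cores([6, 7])` in a parity worker would keep it on cores 6 and 7 of an 8 core box, provided those cores are in the process's allowed set.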

Alternatively, you could place a commodity quad-core processor on a PCI
card and eliminate the $5 million or whatever in FPGA/ASIC development
costs.

> 
> > 
> > -- Ekechi
> > 
> 
> -- 
> Joseph Landman, Ph.D
> Founder and CEO
> Scalable Informatics LLC,
> email: landman at scalableinformatics.com
> web  : http://www.scalableinformatics.com
>         http://jackrabbit.scalableinformatics.com
> phone: +1 734 786 8423
> fax  : +1 866 888 3112
> cell : +1 734 612 4615
>

Thanks Joseph. Appreciate the feedback.

Ekechi