[Beowulf] Software RAID?
Joe Landman landman at scalableinformatics.com
Tue Nov 27 07:00:28 PST 2007
- Previous message: [Beowulf] Software RAID?
- Next message: [Beowulf] Software RAID?
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Hi Bill

Bill Broadley wrote:
> Long reply, some actual numbers I've collected, if you read nothing
> else please read the last paragraph.

Read it all. Thanks for the reply.

> Joe Landman wrote:
>> Ekechi Nwokah wrote:
>>
>> Hmmm... Anyone with a large disk count SW raid want to run a few
>> bonnie++ like loads on it and look at the interrupt/csw rates? Last I
>
> Bonnie++ runs lots of things, seems like a smaller test might be more
> useful. I.e. did you want large contiguous reads? Writes? Random?
> Small?
>
>> looked on a RAID0 (2 disk) we were seeing very high interrupt/csw rates.
>
> What is very high?

8500 csw/s and 7500 ints/s for the unit I mentioned, a 3-drive RAID0. We have done larger 4-drive RAID10s, and some 8-drive RAID5s with a hot spare. We saw 20k+ CSW/s with 20k+ interrupts/s. The system, with 4 cores, was quite sluggish. Replacing the SW RAID with even a slow 3ware dropped the interrupt and CSW rates to under 10k, and the system felt less sluggish and more responsive.

> I just did a 16GB (total) dd's from 2 8 disk raid5s (at the same time) @
> 600MB/sec. (16GB in 27.206 seconds).

Yeah, I have done these too, with software and hardware RAIDs. They aren't indicative of real workloads as far as I can tell, and if you make them large enough, you remove any system caching behavior. You can see something I wrote about this a while ago:

http://scalability.org/?p=394#more-394

My test code looks like this:

#!/bin/bash
sync
echo -n "start at "
date
dd if=/dev/zero of=/local/big.file bs=134217728 count=100 oflag=direct
sync
echo -n "stop at "
date

I changed 100 to 1000 and then to 10000. For count=100 (13 GB), our 13-drive RAID6 (11 data drives) saw 812 MB/s (13421772800 bytes copied, 16.5194 seconds, 812 MB/s). For count=1000 (134 GB), our 13-drive RAID6 saw 699 MB/s (134217728000 bytes copied, 191.932 seconds, 699 MB/s). Going to count=10000, our 13-drive RAID6 saw 612 MB/s (1342177280000 bytes copied, 2191.99 seconds, 612 MB/s).
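As a sanity check on those figures: dd reports decimal megabytes (10^6 bytes), so the rate for any run can be reproduced directly from the byte count and elapsed time it prints. For the 13 GB run above:

```shell
# Reproduce dd's reported rate from its "bytes copied" line.
# dd uses decimal MB (10^6 bytes), not MiB.
awk 'BEGIN { printf "%.0f MB/s\n", 13421772800 / 16.5194 / 1e6 }'
```

This matches the 812 MB/s dd reported for the count=100 case.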
Our CSW rate was under 4000 the entire time; our interrupt rate was about 3000 the entire time.

> 230644 context switches, 8477 per second, 2119 per cpu per second.

Now try to do something else on that machine that requires kernel CSW and interrupts. We have, and it wasn't pretty. Specifically, try to pull this data out over a network connection, which also generates interrupts and context switches.

> Interrupts serviced was 366,688, 13,478 per second, 3369 per cpu per
> second.
>
> System feels normal, maybe I could run a netperf with a small packet size
> to measure performance under this load... other suggestions?

Real use cases: multiple readers and writers over ~4 GbE channels doing heavy NFS file IO. Again, we have done this, and have helped end users/customers with this. It isn't pretty. If you are using 3ware/Areca/LSI RAID cards in JBOD mode, their drivers handle much of the pain for you. The pain in this case is the SATA interrupts and context switches. If you are using motherboard-mounted SATA controllers (you can get systems with up to 14: 6 SATA + 8 SAS), that is the case I am focused upon.

[...]

>> This would quickly swamp any perceived advantages of "infinitely
>> many" or "infinitely fast" cores.
>
> Hrm, why? Does context switches not scale with core speed? Or number
> of cores? Can't interrupts be spread across CPUs? Hell if you are

Not for serialized access to IO. Any point of serialization is bad, and IO is a great serializer. Most IO systems don't do concurrent IOs; they have all sorts of tricks (elevators, io-schedulers, etc.) to provide better throughput, but at the end of the day they are serial devices. So having more CPUs handle more CSW is great up until the data has to flow out of system cache and onto devices.

Which reminds me: please try the code fragment above with the count=1000 case. I am curious what happens well outside cache.
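For anyone wanting to reproduce these CSW/interrupt rates during a dd run, the system-wide counters can be sampled from /proc/stat on Linux (a minimal sketch, not the method we used; vmstat or sar give the same numbers):

```shell
#!/bin/bash
# Sample system-wide context-switch and interrupt counters from
# /proc/stat, then print approximate per-second rates (Linux-specific:
# "ctxt" is total context switches, field 2 of "intr" is total interrupts).
read_ctxt() { awk '/^ctxt/ { print $2 }' /proc/stat; }
read_intr() { awk '/^intr/ { print $2 }' /proc/stat; }

c0=$(read_ctxt); i0=$(read_intr)
sleep 5
c1=$(read_ctxt); i1=$(read_intr)

echo "csw/s : $(( (c1 - c0) / 5 ))"
echo "ints/s: $(( (i1 - i0) / 5 ))"
```

Run it in a second terminal while the dd test is in flight to see the rates under load versus at idle.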
The 13 GB number isn't "fair", as it still makes very good use of cache, and few IO servers that I am aware of right now have 128 GB of RAM (a few do, so we need larger tests). The 134 GB test works out quite well for obliterating cache effects and getting to the raw controller/disk performance. You will be IO bound for that (and in this case, write bound and interrupt/csw bound).

> really worried about it you could put 2 RAID controllers each connected
> to the PCI-e attached to hypertransport on separate opteron sockets.
>
> Have you seen cases where practical I/O loads were limited by context
> switches or interrupts?

Yes. NFS servers under heavy load serving content via gigabit. Local BLAST servers running large numbers of queries against the nt database. And others.

In the latter case, I had our JackRabbit 4-core, 13-drive RAID6 unit run 1000 sequences of A. thaliana against nt from July 07. I also ran this on the RAID0 machine with 8 cores. The RAID0 machine can achieve 210 MB/s read speeds according to bonnie++. The RAID0 machine has 8 GB of RAM; the JackRabbit has 16 GB. The indexed nt DBs fit into RAM, and the code was optimized for x86_64. In both cases, I used 1 thread per core.

The 4-core JackRabbit server finished its calculations (involving mmap'ed reading/rereading of the nt indexes) in about 14 minutes. The 8-core Pegasus desktop finished its calculations in something on the order of 95 minutes.

During this work, the JackRabbit was usable/functional as an NFS server, and I was running a VMware Server session on it at the same time, with a copy of Windows XP x64 with 2 CPUs, not to mention some streaming tars we were doing via NFS. We saw in aggregate about 5000 CSW/s and under 7000 ints/s. The system was not sluggish, and scaled very well.

During this work, the Pegasus was sluggish as a desktop system. We saw in aggregate about 18000 CSW/s and about 16000 ints/s.
Same binary code, same data set; not a dd, but something the folks visiting us on this show floor might do themselves.

> Personally seeing 600MB/sec out of 16 disks in 2 RAID-5's keeps me
> pretty happy. Do you have numbers for a hardware RAID? I have a crappy

We are seeing a sustained 650-800 MB/s out of one 13-disk RAID6. Out of a two-RAID-controller system, we are seeing something north of 1.2 GB/s sustained. Far outside of cache, that is (not just simple dd tests, but real workloads). In all cases, we are not hitting a context switch wall. Nor are we hitting an interrupt wall.

Please note that some of this is distribution dependent, in that some distributions have gone off in some not-so-helpful directions with regard to kernel stacks and other bits. We get our best performance numbers with Ubuntu, lose about 8-10% going to SuSE, and lose ~20% going to Fedora. Nothing compared to what you lose going to RHEL4, but still worth at least knowing. We recommend a 2.6-series kernel we built that does a nice job on pretty much every workload we have thrown at it. Our .debs should be up on our download site. The .rpms take a little more work (I wish building kernel RPMs for RHEL/CentOS/SuSE was even close to as easy as building new .debs for Ubuntu/Debian).

> hardware RAID, I think it's the newest dell Perc, 6 of the fancy 15k rpm
> 36GB SAS drives in a single RAID-5. I manage 171MB/sec. 494,499 context
> switches, 2587 per second. 910,294 interrupts, 4763/sec.
>
> So to compare:
>                          software-raid   hardware-raid   s/h
> Context switches/sec     8477            2587            3.28
> Interrupts/sec           13,478          4,763           2.82
>
> Which sounds like a big win until you realize that the software raid was
> 3.5 times faster.

I haven't seen Dell Percs described as "fast". We don't use them, we don't have them, so I can't comment on them. Our HW RAIDs outperform our SW RAIDs by quite a bit. If we built our SW RAIDs atop our HW RAID cards running in JBOD mode, then we would likely achieve "similar" numbers.
Built atop the motherboard SATA (which the original poster was suggesting), you will experience the issues that I indicated. If you build your SW RAID atop the same controllers that others build their hardware RAID atop, you aren't necessarily doing what the original poster asked for (they wanted to avoid spending money on the very cards you are using, and leverage the existing cheap SATA ports).

>> Sort of like an Amdahl's law. Make the expensive parallel computing
>> portion take zero time, and you are still stuck with the serial time
>> (which you can't do much about). Worse, it
>
> What part of context switching and interrupt handling doesn't scale with
> core speed or cores, er, well at least sockets?

Anything where the data sink/source is serial, such as IO, networking, etc. You are then sharing a fixed-size resource among more processors/sockets. You get a classic 1/N problem, which looks/scales exactly like OpenMP false sharing does: adding more cores/sockets actually slows it down.

In the case of the PCIe-connected RAID cards, you have about a 2 GB/s pipe into the unit. The RAID card handles the SATA controllers for you (does all the interrupt servicing via the processor on the card), does all the local cache management, etc. If you are attaching SATA to this, and operating it as HW or SW RAID, you have only interrupts to the card, not to the individual controllers. If, on the other hand, and speaking to the point of the original poster, you attach SATA drives to the SATA ports on the motherboard, the CPUs have to handle all the controllers, CSWs, and interrupts. Very different scenarios.

>> is size extensive, so as you increase the number of disks, you have to
>> increase the interrupt rate (one controller per drive currently), and
>
> Er, why is that? Say I have 1000 disks. You want to read 64KB it's
> going to be a few disks (unless you have an oddly small strip size), so
> you generate a few interrupts (not 1000s).
Back to the original point of the poster: the motherboard controllers (remember, this person does not want to buy RAID cards and then use them as SATA controllers) will generate interrupts per disk transfer, and you have one controller per disk. If you use a 3ware/Areca/LSI/Adaptec as a SATA controller (RAID in JBOD mode), this is a different story: one interrupt per controller card, though you are then avoiding using functionality you have paid for (which, in the majority of these cases, is not a bad thing).

> Of course if you want to support 50MB/sec to 1000 disks then the interrupts
> go up by a factor of 1000, of course you will bottleneck elsewhere.
>
> Why would interrupts scale with the number of disks instead of performance?

See above, going to the original intent of the poster. One controller per disk. Controllers generate interrupts per transfer. N disks at M transfers each generate N*M interrupts.

> I've not noticed hardware raid scaling any better than software raid
> per disk. I've not personally tested any system with more than 16
> drives though. I prefer commodity parts including RAID controllers,
> power supplies,

We have, on motherboard/cheap-SATA-controller-connected SW RAID.

> and cases. The 24-48 drive setups seem pretty exotic, low volume, and

Not exotic. Not high volume.

> make me nervous about cooling, drive spin up, weight, etc. If you

Weight is an issue; you want good sturdy racks. Airflow is not an issue, but noise is: you need to move lots of air to keep these drives cool. Drive spin-up is not an issue; we handle it with staggered delays. Works fine.

> need to install a 48 disk server at the top of a 48U rack I am definitely
> busy ;-). Not to mention I'd bet that under most work loads 4 16 disk

Darn it, I was going to call you and ask for a hand with this :)

> servers are going to be faster than 1 48... and cheaper. Probably worse
> per watt, maybe not worse performance/watt.

Actually no. The other way around.
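To put rough numbers on that N*M scaling argument (illustrative figures only, not measurements; the per-disk completion rate is a made-up workload parameter):

```shell
# Illustrative only: interrupt load when every disk sits on its own
# motherboard SATA controller, so each transfer completion interrupts
# the host CPUs directly.
N=13     # disks (the RAID6 unit discussed above)
M=1500   # hypothetical transfer completions per second, per disk

echo "per-disk controllers: $(( N * M )) ints/s hitting the host CPUs"
# A single RAID card in JBOD mode instead services the per-drive
# completions on the card and presents one interrupt source to the host.
```

The point is that the host-side interrupt load grows linearly with disk count in the motherboard-SATA case, independent of how fast the CPUs are.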
1 x 48 (OK, one of our 48s) costs less than 4 x 16s (4 of our 16s) with the same drives. If you care about a single file system name space, then you have to run a clustered file system, which complicates the 4 x 16s. Each 16 runs about 700W max. Each 48 runs about 1300W max.

> I wouldn't turn down the opportunity to benchmark software RAID on a 48
> drive though. Sun recommends software raid on their 48 disk server.

... for which they charge significantly more than others who use HW RAID with 48 drives (and who achieve somewhat better performance than the SW RAID).

>> the base SATA drivers seem to have a problem with lots of CSW.
>
> Which? I've got a couple areca 16 port (not the fast new one) and a
> couple 3ware 16 port of some 9550sx flavor (I'd have to check if it's
> the fastest they have in that size). I'd happily buy a 16 port non-raid
> card if I could find them, I haven't so far.

Areca and 3ware are not SATA adapters; they are RAID adapters which have a JBOD mode. If you are using these in your discussion, then you are leveraging all of the advantages of the HW RAID, with the simple difference of doing the RAID calculations on the CPU rather than on the card. The card handles all the SATA controller issues for you, as well as the interrupts the controllers generate. It doesn't present the SATA controllers as pass-through. 3ware is known to be "not fast" on RAID calcs.

As the original poster indicated, they wanted to do this without spending money on the RAID cards (with JBOD mode), which you can do with the 14-drive SuperMicro motherboards. I was talking about the latter; you seem to be talking about the former.

[...]

>> er .... so your plan is to use something like a network with RDMA to
>> attach the disks. So you are not using SATA controllers. You are
>> using network controllers. With some sort of offload capability (RDMA
>> without it is a little slow).
> I've yet to see RDMA or TOE be justified from a performance perspective,
> I've seen 800MB/sec over infinipath, but I wasn't driving it from a storage
> array. I could try pushing 600MB/sec from disks to IB, I'd be kind of
> surprised if I hit some context switch or interrupt wall. If there's
> a real workload that has this problem and you think it's hitting the
> wall I'm game for trying to set it up.

Greg might comment on this, but InfiniPath drivers operated in effectively a polling mode, and the cards did some of their own offload processing of some sort.

We have seen RDMA/TOE make sense for users in a real code scenario. We used Ammasso 1100 cards for a customer running iWARP, and ran StarCD on it. On the same machines and the same network switch, running without TOE/iWARP was 1/4 the speed of running with it, for this MPI job (latency sensitive).

I keep hearing people denigrate TOE and RDMA, but how many have actually used it? We have, and it has made some noticeable differences in real-world apps. Worth the cost? That is a separate discussion. I wouldn't pay a huge premium for it; it would be hard to justify apart from exceptional cases.

>> You sort-of have something like this today, in Coraid's AOE units. If
>
> coraid didn't impress me as supporting very good bandwidth per disk,

They don't. They are good, cheap, bulk storage.

> if you want some kind of block level transport I'd suggest iSCSI over
> whatever you want, infiniband or 10G would be the obvious choices. My

I fail to see how iSCSI over gigabit would be any faster than AoE over gigabit.

> testing of coraid and a few of the turn key 16-24 disk NAS like devices
> that run linux with hardware RAID and I was VERY disappointed in their
> performance, kind of shocking considering the 5 figure prices.

Wow ... our devices have 4-figure prices, and are quite a bit faster than most units with 5-figure pricing. Maybe I should bug you offline if you are willing to share information. Coraid's sweet spot is bulk storage.
>> you don't have experience with them, you should ask about what happens
>> to the user load under intensive IO operations. Note: there is
>> nothing wrong with Coraid units, we like them (and in full disclosure,
>> we do resell them, and happily connect them with our JackRabbit units).
>
> For cheap block level storage sure, but the discussion seemed to be can
> software raid be just as good or better than hardware RAID. I don't see
> that as being particularly relevant to the coraid.

This was related to his RDMA point. The RDMA adapters all come with a price premium, just like the HW RAID adapters. The poster did not want to pay a premium for HW RAID adapters (like Areca/3ware/LSI/Adaptec), so I was confused as to why they would want to pay a premium for RDMA. It was a wash in the end.

Coraid is cheap block storage. The discussion as I read it was: can you achieve HW RAID performance with SW RAID without spending money on the HW RAID adapters? Correct me if I am wrong, but you are using the proprietary HW RAID adapters in JBOD mode?

[...]

>> We have put our units (as well as software RAIDs) through some pretty
>> hard tests: single RAID card feeding 4 simultaneous IOzone and
>> bonnie++ tests (each test 2x the RAM in the server box) through
>> channel bonded quad gigabit. Apart from uncovering some kernel OOPses
>> due to the channel bond driver not liking really heavy loads, we
>> sustained 360-390 MB/s out of the box, with large numbers of
>> concurrent reads and writes. We simply did not see degradation.
>> Could you cite some materials I can go look at, or help me understand
>> which workloads you are talking about?
>
> I don't see any reason that software raid + quad GigE or IB/10G couldn't
> do similar or better.

We could just as easily turn off the HW RAID portion and do the same thing in SW RAID. The point of the poster was not to spend the money on the HW RAID adapter in the first place.
If you don't spend the money on the HW RAID adapter, even one run solely in JBOD mode, and instead use the motherboard SATA, you will not achieve what we are talking about.

>>> Areca is the best of the bunch, but it's not saying much compared to
>>> Tier 1 storage ASICs/FPGAs.
>>
>> You get what you pay for.
>
> My experience is just the opposite. Low volume high margin expensive
> storage solutions often leave me shocked and horrified as to their
> performance, even on the easy things like bandwidth, let alone the
> harder things like random I/O or write intensive workloads.

Correcting the context: the BlueArc/DDN/... FPGAs are highly tuned processors for storage. If you want them, you need to pay for them.

>>> The idea here is twofold. Eliminate the cost of the hardware RAID and
>>
>> I think you are going to wind up paying more than that cost in other
>> elements, such as networking, JBOD cards (good ones, not the crappy
>> driver ones).
>
> I'd love a cheap fast JBOD card, but alas I've been buying 3ware/areca
> 16 ports just because I've not found anything cheaper. I'd rather have

This is my point: you are using the expensive RAID cards (in JBOD mode), while the poster wanted to "Eliminate the cost of the hardware RAID". You won't eliminate the cost of the hardware RAID by running it in JBOD mode.

--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
       http://jackrabbit.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 866 888 3112
cell : +1 734 612 4615