[Beowulf] NFS over RDMA performance confusion

Ellis H. Wilson III ellis at cse.psu.edu
Thu Sep 13 05:56:29 PDT 2012

On 09/13/2012 08:34 AM, Joe Landman wrote:
> On 09/13/2012 07:52 AM, holway at th.physik.uni-frankfurt.de wrote:
>> If I set up a single machine to hammer the fileserver with IOzone I see
>> something like 50,000 IOPS but if all four machines are hammering the
>> filesystem concurrently we got it up to 180,000 IOPS.
> I wouldn't recommend IOzone for this sort of testing.  It's not a very
> good load generator, and it has a tendency to report things that are
> not actually seen at the hardware level.  I noticed this some years
> ago, when running some of our benchmark tests on these units: an
> entire IOzone benchmark completed with very few activity lights going
> on the disks, which suggested the test was happily running entirely
> within cache.

I assume so, but just to be clear: did you witness this behavior even 
with the -I (direct I/O) parameter?
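For reference, a direct-I/O run looks something like this (just a 
sketch: the /mnt/nfs mount point and the sizes are assumptions, and 
the run is skipped if iozone or the mount point is missing):

```shell
#!/bin/sh
# Sketch: IOzone with -I sets O_DIRECT so requests bypass the client
# page cache.  Mount point, record size, and file size are assumptions.
MNT=${MNT:-/mnt/nfs}
RECSIZE=4k          # record size per request
FILESIZE=1g         # total file size for the test
if command -v iozone >/dev/null 2>&1 && [ -d "$MNT" ]; then
    # -i 0 = write/rewrite, -i 2 = random read/write, -I = O_DIRECT
    iozone -I -i 0 -i 2 -r "$RECSIZE" -s "$FILESIZE" \
        -f "$MNT/iozone.tmp" || echo "iozone run failed" >&2
else
    echo "iozone or $MNT not available; skipping run" >&2
fi
```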

>> Can anyone tell me what might be the bottleneck on the single machines?
>> Why can I not get 180,000 IOPS when running on a single machine?

Can you rerun those tests with 16 and 32 procs?  I've run into some 
pretty wacky relationships between the numbers of cores, procs, and 
disks in the subsystem.  I assume your machine has 8 cores, and I tend 
to find around 2 processes per core to be ideal if the number of disks 
you're trying to run against is greater than the number of cores.  This 
is a really hand-wavy rule of thumb, but it's served me alright in the 
past as a first benchmark.
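Concretely, IOzone's throughput mode (-t N spawns N processes) makes 
it easy to sweep the process count and see where IOPS stop scaling.  A 
sketch, with the target directory and sizes as assumptions; it only 
invokes iozone if the binary and mount point exist:

```shell
#!/bin/sh
# Sketch: sweep IOzone throughput mode over several process counts.
# Paths and sizes below are assumptions, not the original poster's.
MNT=${MNT:-/mnt/nfs}
for NPROCS in 8 16 32; do
    if command -v iozone >/dev/null 2>&1 && [ -d "$MNT" ]; then
        # -i 0/-i 2 = sequential write + random read/write,
        # -F names one file per process (f1..fN under $MNT)
        iozone -t "$NPROCS" -i 0 -i 2 -r 4k -s 256m \
            -F $(seq -f "$MNT/f%g" "$NPROCS") \
            || echo "iozone run with $NPROCS procs failed" >&2
    else
        echo "would run iozone with $NPROCS procs against $MNT" >&2
    fi
done
```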

Also, are you flushing your caches in between runs?  Regardless of what 
iozone promises, that's the safest thing to do.  If you somehow have 
root access to your file servers I'd do the same on them as well to be 
absolutely safe.
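For the record, the usual Linux incantation between runs is a sync 
followed by dropping the page cache; this sketch only writes to 
drop_caches when it actually has root:

```shell
#!/bin/sh
# Sketch: flush dirty pages to disk, then drop the clean page cache,
# dentries, and inodes so the next benchmark run starts cold.
sync
if [ "$(id -u)" -eq 0 ] && [ -w /proc/sys/vm/drop_caches ]; then
    echo 3 > /proc/sys/vm/drop_caches   # 3 = pagecache + slab objects
    FLUSHED=yes
else
    FLUSHED=no
    echo "not root; page cache not dropped" >&2
fi
```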

> Some observations ... these don't sound like disk subsystems.  A 15k RPM
> drive will give you ~300 IOPs.  To get 50k IOPs, you would need 167 disk

Maybe I'm misreading the iozone command line, as I haven't played with 
it for 6+ months, but it looks like you're writing 8 tiny ten-megabyte 
files.  Benchmarking at that small a scale lends a lot of credence to 
Joe's suspicion that this is running entirely out of cache.

I almost always try to do tests that are twice the total memory of all 
of my machines (all clients + all file servers) to get a real idea of 
steady-state throughput.  But I'm mostly concerned with absolutely huge 
I/O, so perhaps this isn't your use case.  If your use case really is 
just 80 megs, screw it: you don't need a file server, you need to run 
MySQL purely in main memory.
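As a rough sizing helper, twice a node's RAM is easy to read off 
/proc/meminfo on Linux (a sketch: you'd still sum the figure across 
all clients and file servers by hand):

```shell
#!/bin/sh
# Sketch: compute 2x this node's RAM in kB as its contribution to the
# total test size; falls back to 0 if /proc/meminfo is unreadable.
MEM_KB=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo 2>/dev/null || echo 0)
TARGET_KB=$((MEM_KB * 2))
echo "aim for at least ${TARGET_KB} kB of test data for this node"
```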


