[Beowulf] NFS over RDMA performance confusion

William Law law.will at gmail.com
Thu Sep 13 08:25:59 PDT 2012

I'm a little wary about entering the conversation, but I've been spending way too much time with ZFS so perhaps this will help.

First, benchmark whatever you will really run. ZFS is odd enough that a synthetic benchmark may not match up to an actual application. I guess I'd argue that is true of any technology.

With databases or database-like workloads, performance can be dramatically influenced by the ZFS recordsize. The general principle is to set the recordsize to the same size as your writes. On some workloads (which I have not yet seen myself, though Oracle is a frequently cited example, so it certainly applies to some DBs) logbias needs to be set to throughput to see reasonable performance. Here is what Oracle says for MySQL: 

I'd tend to follow their practices as Nexenta is mostly silent on applications at the moment.
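
To make the recordsize advice above concrete, here is a rough back-of-the-envelope sketch in Python. It assumes aligned writes and that every record touched by a write gets rewritten in full; it ignores compression and caching. The function name is made up for illustration, 128K is the usual ZFS default recordsize, and 16K is the test size from the original question.

```python
import math


def write_amplification(recordsize: int, write_size: int) -> float:
    """Bytes rewritten on disk per byte the application asked to write.

    Simplified model (an assumption, not measured behaviour): writes are
    aligned, and every record touched by a write is rewritten in full.
    """
    if recordsize <= 0 or write_size <= 0:
        raise ValueError("sizes must be positive")
    records_touched = math.ceil(write_size / recordsize)
    return records_touched * recordsize / write_size


KIB = 1024
if __name__ == "__main__":
    # Compare a few recordsize settings against 16K application writes.
    for rs in (128 * KIB, 16 * KIB, 8 * KIB):
        amp = write_amplification(rs, 16 * KIB)
        print(f"recordsize={rs // KIB}K: {amp:.1f}x")
```

Under this model a 16K write into a 128K record costs 8x, while a matching 16K recordsize costs 1x. On a real system the property is set per dataset, along the lines of `zfs set recordsize=16K pool/dataset` (the dataset name here is hypothetical), and it only affects newly written data.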

Network tuning is also an issue. If you are using anything IP based (which NFS over RDMA fortunately gets you out of), also look at things like turning off Nagle's algorithm. Last time I used IB on Solaris the drivers were a little weird, but I will admit it was on a Niagara box about… 4 years ago?
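
For readers who have not met it, "turning off Nagle's algorithm" means setting the TCP_NODELAY option on a socket. The Python sketch below shows this for an application's own TCP connection (for example a database client talking over IPoIB). It is only an illustration: the helper name is made up, and an NFS client's sockets live in the kernel, so they are not reachable this way.

```python
import socket


def make_nodelay_socket() -> socket.socket:
    """Create a TCP socket with Nagle's algorithm disabled."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # TCP_NODELAY=1 asks the stack to send small segments immediately
    # instead of holding them back while earlier data is unacknowledged.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock


if __name__ == "__main__":
    s = make_nodelay_socket()
    enabled = s.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0
    print("TCP_NODELAY enabled:", enabled)
    s.close()
```

The trade-off is latency against packet count: small requests go out right away, at the cost of more, smaller packets on the wire.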

Remember too that L2ARC only caches reads and the ZIL only applies to writes (synchronous ones, and I think specifically writes smaller than 32k). Under random IO, ZFS is limited to roughly the performance of a single disk in each VDEV, which is drastically different from most other storage systems.
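
The per-VDEV rule of thumb above can be turned into a quick estimate. This Python sketch assumes each VDEV delivers the random IOPS of one disk, with nothing served from ARC or L2ARC; the function name, the 150 IOPS per-disk figure, and the two layouts are all assumptions chosen for illustration, not measurements.

```python
def estimate_pool_random_iops(num_vdevs: int, iops_per_disk: float) -> float:
    """Rule-of-thumb random IOPS for a pool: each VDEV counts as one disk."""
    if num_vdevs < 1 or iops_per_disk <= 0:
        raise ValueError("need at least one VDEV and a positive IOPS figure")
    return num_vdevs * iops_per_disk


DISK_IOPS = 150  # assumed figure for a single spinning disk

if __name__ == "__main__":
    # The same 12 disks arranged two different ways.
    layouts = {"2 x 6-disk raidz": 2, "6 x 2-disk mirror": 6}
    for name, vdevs in layouts.items():
        iops = estimate_pool_random_iops(vdevs, DISK_IOPS)
        print(f"{name}: ~{iops:.0f} random IOPS")
```

The point of the comparison is that the number of VDEVs, not the number of disks, sets the estimate. Mirrors can do somewhat better than this on reads, since either side of a mirror can serve a read.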

It is impressive technology but is also…  a bit complicated.


On Sep 13, 2012, at 4:52 AM, holway at th.physik.uni-frankfurt.de wrote:

> Hi,
> I am a bit confused.
> I have 4 top notch dual socket machines with 128GB ram each. I also have a
> Nexenta box which is my NFS / ZFS server. Everything is connected together
> with QDR infiniband.
> I want to use this setup for MySQL databases, so I am testing 16K
> random and stride performance.
> If I set up a single machine to hammer the fileserver with IOzone I see
> something like 50,000 IOPS but if all four machines are hammering the
> filesystem concurrently we get up to 180,000 IOPS.
> Can anyone tell me what might be the bottleneck on the single machines?
> Why can I not get 180,000 IOPS when running on a single machine?
> If I test using IPoIB in connected mode I see this: http://pastie.org/4708542
> Some kind of buffer problem?
> Thanks,
> Andrew
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
