NFS Performance (was Re: [Beowulf] GPFS on Linux (x86))

Fri Sep 15 09:39:52 PDT 2006

Michael Will wrote:
> We are not using jumbo packets (they are not on by default, are they?)

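They are off by default (the standard Ethernet MTU is 1500), but it is
worth checking.  Run

	ifconfig | grep -i mtu

and see whether any interface reports an MTU above 1500.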
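
If you would rather not eyeball the ifconfig output on every node, here is a rough sketch of the same check done against sysfs.  It assumes a 2.6 or later kernel that exposes /sys/class/net/<iface>/mtu; the function name list_jumbo_ifaces is made up for this example.

```shell
# Print "<iface> <mtu>" for every interface whose MTU is above 1500.
# Assumption: the kernel exposes /sys/class/net/<iface>/mtu.
# The optional argument lets you point the check at another directory tree.
list_jumbo_ifaces() {
    netdir="${1:-/sys/class/net}"
    for f in "$netdir"/*/mtu; do
        [ -r "$f" ] || continue              # skip unreadable or unmatched paths
        mtu=$(cat "$f")
        iface=$(basename "$(dirname "$f")")
        if [ "$mtu" -gt 1500 ]; then
            echo "$iface $mtu"
        fi
    done
}

list_jumbo_ifaces
```

Expect the loopback interface to show up, since its MTU is normally well above 1500.  The ones to care about are the ethernet interfaces.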

> and I have tried both udp and tcp mounts, same symptom. The nics tested
> were syskonnect as well as tg3 and it behaved identically.

We had Intel NICs that worked great on the server, and Broadcom NICs
(tg3 driver) on the client.  When I swapped in an Intel NIC the problem
went away, but that is an expensive fix for a cluster if you have to buy
the cards and insert them.  When we moved to the bcm5700 driver, forced
tcp mounts, and went back to a "normal" MTU (1500), the problems went
away.  I seem to remember trying a SysKonnect card using the sk98lin
module on the clients, and it worked fine, but I pulled down the new
version of the driver for some reason.
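
For reference, the tcp-mount and standard-MTU part of that boils down to two commands per client.  The sketch below only prints them, so nothing gets changed by accident.  The interface name, export path and mount point are placeholders I made up (dualcore is just the server from my example further down), and print_tcp_mount_cmds is a hypothetical helper, not something from a distribution.

```shell
# Print (do not run) the commands for a standard-MTU, TCP NFS mount.
# All four arguments are placeholders; review the output, then run it as root.
print_tcp_mount_cmds() {
    iface="$1"; server="$2"; export_path="$3"; mountpoint="$4"
    echo "ifconfig $iface mtu 1500"
    echo "mount -t nfs -o tcp $server:$export_path $mountpoint"
}

print_tcp_mount_cmds eth0 dualcore /export/home /mnt/home
```

Swapping the driver itself (tg3 to bcm5700) is a separate step and depends on how your distribution loads modules, so I have left it out.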

You might also be filtering RPC and portmap, or the "smart" switch could 
be doing some of that as well.

What does rpcinfo report?  This is from a CentOS 4.3 client looking at a
SuSE 10.1 server.

[root@crunch-r ~]# rpcinfo -p dualcore
    program vers proto   port
     100000    2   tcp    111  portmapper
     100000    2   udp    111  portmapper
     100024    1   udp  32768  status
     100021    1   udp  32768  nlockmgr
     100021    3   udp  32768  nlockmgr
     100021    4   udp  32768  nlockmgr
     100024    1   tcp  53220  status
     100021    1   tcp  53220  nlockmgr
     100021    3   tcp  53220  nlockmgr
     100021    4   tcp  53220  nlockmgr
[root@crunch-r ~]#

I am not exporting anything right now, so mountd is not registered with
the portmapper and showmount fails:

[root@crunch-r ~]# showmount -e dualcore
mount clntudp_create: RPC: Program not registered
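
Note that the rpcinfo listing above has no mountd or nfs entry, which is consistent with that showmount error.  A quick way to spot this on your own server is to filter the rpcinfo output for the services a mount needs.  This is a minimal sketch that assumes the service name is in the fifth column, as in the listing above; missing_nfs_services is a name I made up.

```shell
# Read `rpcinfo -p` output on stdin and print each service an NFS mount
# needs that is not registered.  Prints nothing if all three are present.
# Assumption: the first line is a header and column 5 is the service name.
missing_nfs_services() {
    awk '
        NR > 1 { seen[$5] = 1 }
        END {
            n = split("portmapper mountd nfs", need, " ")
            for (i = 1; i <= n; i++)
                if (!(need[i] in seen)) print need[i]
        }
    '
}

# Usage (replace dualcore with your own server):
#   rpcinfo -p dualcore | missing_nfs_services
```

If portmapper itself shows up as missing, or rpcinfo just hangs, suspect filtering on the host or the switch before blaming the NFS server.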

> 
> -----Original Message-----
> From: Joe Landman [mailto:landman at scalableinformatics.com] 
> Sent: Friday, September 15, 2006 9:28 AM
> To: Michael Will
> Cc: Brent Franks; Chris Samuel; beowulf at beowulf.org
> Subject: Re: NFS Performance (was Re: [Beowulf] GPFS on Linux (x86))
> 
> Michael Will wrote:
>> I am puzzled by an sles9sp3 (2.6.9 kernel) nfs server that serves rhel3
>> (2.4.21 kernel) compute nodes. For some reason a lot of times the
>> mounts fail (with default as well as modified parameters). The symptom
>> is mount: rpc timeout. The server logs all authentication requests as
>> successful.  The switch is an oversubscribed hp 4108gl.
> 
> Yes.  This is what we ran into last year.  A SuSE box serving a Rocks
> cluster (Rocks 4.0).  Basic idea: use tcp mounts and turn off jumbo
> packets.  Also, we had major issues with the tg3 driver, and moved the
> RHEL units to a BCM5700 driver.  After this, most of the problems went
> away.
> 
> 

-- 

Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 734 786 8452 or +1 866 888 3112
cell : +1 734 612 4615