[Beowulf] MPI and Redhat9 NFS slow down

Jack Chen chimou at mail.wsu.edu
Fri Aug 27 14:11:54 PDT 2004


Hi Laurence,

Thanks for the suggestion.  Changing the export from sync to async made 
a HUGE difference.

The same job finished in 252 seconds, compared with 10407 seconds before
the change.

The sync option is the default export setting.
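
For reference, the change amounts to adding async to the option list in
/etc/exports on the NFS server. A minimal sketch, assuming a hypothetical
export path (/export/data) and client subnet (192.168.1.0/255.255.255.0);
neither is taken from the actual setup:

```
# /etc/exports on the NFS server
# /export/data and the subnet below are placeholders, not real values.
# "async" lets the server reply before the data is committed to disk.
/export/data  192.168.1.0/255.255.255.0(rw,async)
```

After editing the file, running "exportfs -ra" on the server should
re-export everything with the new options. One caveat: because async
acknowledges writes before they reach stable storage, a server crash can
lose or corrupt data that clients believe was written.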

Jack


Laurence Liew wrote:

> hi,
> 
> try adding in async to NFS
> 
> it speeds up the IO on our RHEL V3 cluster by an order of magnitude... 
> not too sure the RH9 kernel and nfs support async though
> 
> laurence
> 
> Jack Chen wrote:
> 
>> Hi all,
>>
>> I'm not sure if this is the right place to post this question.  If it
>> is not, please tell me where's the best place to get help on this,
>> thanks..
>>
>> We recently built an 8-node PC Linux cluster running RedHat 9 (kernel:
>> 2.4.20-8smp #1 SMP).  We use this system to run EPA's CMAQ
>> photochemical grid model.  I have installed the latest MPICH 1.2.6
>> with the Portland Group compiler (5.2-1), using ssh.  Everything worked
>> fine with the MPI example programs (cpi, pi3p, etc.) and 'make testing'.
>> However, when I try to run any program that writes output to other
>> NFS-mounted drives, I get a very long delay.  I'm not sure where the
>> problem is.  I know the NFS automount is working fine because if I
>> start the job with just one processor (mpirun -np 1), I don't
>> experience the slowdown.
>>
>> For example, if I start the job on the master node using 4 processors
>> (mpirun -np 4)
>> and write to the master node (master2 0),
>> PIxxx file:
>> master2 0 /master2/home/chenj/CMAQ_v4.3/Run/cctm/CCTM_e2a
>> node103 1 /master2/home/chenj/CMAQ_v4.3/Run/cctm/CCTM_e2a
>> node103 1 /master2/home/chenj/CMAQ_v4.3/Run/cctm/CCTM_e2a
>> node104 1 /master2/home/chenj/CMAQ_v4.3/Run/cctm/CCTM_e2a
>>
>> the run takes 168 sec
>>
>> If I start the same job but write the output to any NFS-mounted
>> drive other than the master node's, the job is extremely slow.  In
>> this case the same job took 10962 sec.
>>
>> I have tried mounting the drive with different parameters (rw,soft
>> and rw,hard,bg,intr,noac) and increased the number of nfsd daemons
>> from 8 to 16 on the NFS server, but nothing changed.
>>
>> If you have any idea on what is going on, please help!
>>
>> Any help or suggestions are greatly appreciated.
>>
>> Jack
>>
>>  Jack Chen
>>  Laboratory for Atmospheric Research
>>  Dept. of Civil & Environmental Engineering
>>  Washington State University
>>  Pullman, WA 99164-2910
>>  509.335.5738
>>  509.335.7632 (FAX)
>>
>>  
>> _______________________________________________
>> Beowulf mailing list, Beowulf at beowulf.org
>> To change your subscription (digest mode or unsubscribe) visit 
>> http://www.beowulf.org/mailman/listinfo/beowulf
>>



