[Beowulf] MPI and Redhat9 NFS slow down
Jack Chen
chimou at mail.wsu.edu
Fri Aug 27 14:11:54 PDT 2004
Hi Laurence,
Thanks for the suggestion. Changing the export from sync to async made
a HUGE difference.
The same job finished in 252 seconds, compared to 10407 seconds before the
change.
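For scale, the two timings above can be turned into a speedup factor with a quick calculation. This is only arithmetic on the numbers reported in this message; the variable names are illustrative.

```python
# Run times reported in this message, in seconds.
before_sync = 10407   # export with the sync option
after_async = 252     # export with the async option

speedup = before_sync / after_async
reduction = 1 - after_async / before_sync

print(f"speedup: {speedup:.1f}x")               # speedup: 41.3x
print(f"run time reduced by {reduction:.1%}")   # run time reduced by 97.6%
```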
The sync option is the default export setting.
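For anyone who wants to check which of their exports are sync or async, here is a minimal Python sketch that reads /etc/exports-style text and reports the write mode per client. The export paths and client patterns in SAMPLE are hypothetical, not the ones from this cluster, and the parser is deliberately simplified: it ignores line continuations and quoted paths.

```python
def export_write_modes(exports_text):
    """Map (path, client) to 'sync', 'async', or 'unspecified'."""
    modes = {}
    for raw in exports_text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments and blank lines
        if not line:
            continue
        path, *clients = line.split()
        for client in clients:
            # A client entry looks like host(opt1,opt2,...) or just host.
            host, _, opts = client.partition("(")
            options = opts.rstrip(")").split(",") if opts else []
            if "async" in options:
                mode = "async"
            elif "sync" in options:
                mode = "sync"
            else:
                mode = "unspecified"          # the server default applies
            modes[(path, host)] = mode
    return modes


# Hypothetical entries, for illustration only.
SAMPLE = """\
# path           clients
/export/data     node*(rw,sync)
/export/scratch  node*(rw,async)  master2(rw)
"""

if __name__ == "__main__":
    for (path, host), mode in export_write_modes(SAMPLE).items():
        print(f"{path:16} {host:10} {mode}")
```

After editing /etc/exports on the server, `exportfs -ra` re-exports the entries. Keep in mind that async lets the server reply before the data has reached the disk, so a server crash can lose or corrupt recently written output.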
Jack
Laurence Liew wrote:
> hi,
>
> try adding the async option to the NFS export
>
> it speeds up the IO on our RHEL V3 cluster by an order of magnitude...
> not too sure whether the RH9 kernel and nfs support async, though
>
> laurence
>
> Jack Chen wrote:
>
>> Hi all,
>>
>> I'm not sure if this is the right place to post this question. If it
>> is not, please tell me the best place to get help with this.
>> Thanks.
>>
>> We recently built an 8-node PC Linux cluster running RedHat 9 (kernel:
>> 2.4.20-8smp #1 SMP). We use this system to run EPA's CMAQ
>> photochemical grid model. I have installed the latest MPICH 1.2.6
>> with Portland Group Compiler (5.2-1) using ssh. Everything worked
>> fine with the mpi example programs (cpi, pi3p, etc.) and 'make testing'.
>> However, when I run any program that writes output to other NFS-mounted
>> drives, I get a very long delay. I'm not sure where the problem
>> is. I know the NFS automount is working fine because if I start the
>> job with just one processor (mpirun -np 1), I don't experience the
>> slowdown.
>>
>> For example, if I start the job on the master node using 4 processors
>> (mpirun -np 4) and write to the master node (master2 0), with this
>> PIxxx file:
>> master2 0 /master2/home/chenj/CMAQ_v4.3/Run/cctm/CCTM_e2a
>> node103 1 /master2/home/chenj/CMAQ_v4.3/Run/cctm/CCTM_e2a
>> node103 1 /master2/home/chenj/CMAQ_v4.3/Run/cctm/CCTM_e2a
>> node104 1 /master2/home/chenj/CMAQ_v4.3/Run/cctm/CCTM_e2a
>>
>> the run takes 168 sec
>>
>> If I start the same job but write the output to any NFS-mounted drive
>> other than the master node's, the job is extremely slow. In this case
>> the same job took 10962 sec.
>>
>> I have tried mounting the drive with different parameters (rw,soft
>> and rw,hard,bg,intr,noac) and increased the number of nfsd daemons from
>> 8 to 16 on the NFS server, but nothing changed.
>>
>> If you have any idea what is going on, please help!
>>
>> Any help or suggestions are greatly appreciated.
>>
>> Jack
>>
>> Jack Chen
>> Laboratory for Atmospheric Research
>> Dept. of Civil & Environmental Engineering
>> Washington State University
>> Pullman, WA 99164-2910
>> 509.335.5738
>> 509.335.7632 (FAX)
>>
>>