[Beowulf] Questions about a large job

Bogdan Costescu Bogdan.Costescu at iwr.uni-heidelberg.de
Tue Apr 18 12:35:03 PDT 2006


On Tue, 18 Apr 2006, Leandro Tavares Carneiro wrote:

> The MPI used was LAM-MPI. I have run some tests with 10 nodes and they
> ran well. But when I tried to run with 2296 CPUs, the job wouldn't start.

Are you able to run a simple "hello world" test? If not, you might be
hitting the per-process file descriptor limit, as each process will try
to open a TCP connection to every other process. In that case you should
still be able to run a job on something like 500 nodes (= 1000
processes at 2 per node, slightly less than the usual limit of 1024
descriptors per process).
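
To make the arithmetic concrete, here is a rough sketch of the estimate
above. It assumes one TCP socket per peer process plus a small allowance
for stdin/stdout/stderr and similar; the allowance (RESERVED_FDS) and the
function names are my own guesses for illustration, not figures or calls
taken from LAM-MPI.

```python
# Rough estimate only: assumes every MPI process opens one TCP socket to
# every other process, plus a few descriptors for stdin/stdout/stderr etc.
# RESERVED_FDS is a guessed allowance, not a number taken from LAM-MPI.
RESERVED_FDS = 16


def descriptors_needed(num_processes, reserved=RESERVED_FDS):
    """Descriptors one process needs in a fully connected job."""
    return (num_processes - 1) + reserved


def fits(num_processes, limit=1024, reserved=RESERVED_FDS):
    """True if the estimate stays within the per-process limit."""
    return descriptors_needed(num_processes, reserved) <= limit


if __name__ == "__main__":
    for n in (1000, 2296):
        print(n, descriptors_needed(n), fits(n))
```

Under these assumptions a 1000-process job stays below 1024 descriptors
per process, while a 2296-process job does not.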

> Various errors happened, one for each try. The Torque version installed
> is 2.0.0p8 and is working fine with other larger jobs, with 1000 CPUs.

This just confirms my suspicion expressed above: 1000 processes stay
under the 1024 descriptor limit, while 2296 do not.

To raise the limit on a Red Hat-like system, add a line like:

*	-	nofile	4096

to /etc/security/limits.conf.
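
To check which limit a process actually sees, something like the
following sketch may help. It only uses Python's standard resource
module; the helper names are mine. It is probably best run the same way
the MPI processes are started (e.g. from inside a batch job), since, as
far as I know, limits.conf is applied through PAM at login and a process
started by a daemon may see a different value.

```python
import resource


def nofile_limits():
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def raise_soft_to_hard():
    """Try to raise the soft limit up to the hard limit.

    Returns the soft limit in effect afterwards. If the system refuses
    the change, the previous soft limit is left in place.
    """
    soft, hard = nofile_limits()
    if soft != hard:
        try:
            resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
        except (ValueError, OSError):
            pass  # not permitted here; keep the old soft limit
    return nofile_limits()[0]


if __name__ == "__main__":
    print("before:", nofile_limits())
    print("soft limit now:", raise_soft_to_hard())
```

If the hard limit printed here is still 1024, the new setting has not
reached this process.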

-- 
Bogdan Costescu

IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen
Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY
Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868
E-mail: Bogdan.Costescu at IWR.Uni-Heidelberg.De



