[Beowulf] Questions about a large job
Bogdan Costescu
Bogdan.Costescu at iwr.uni-heidelberg.de
Tue Apr 18 12:35:03 PDT 2006
On Tue, 18 Apr 2006, Leandro Tavares Carneiro wrote:
> The MPI used was LAM-MPI. I have run some tests with 10 nodes and it
> runs well. But, when I tried to run with 2296 CPUs, the job won't start.
Are you able to run a simple "hello world" test? If not, you might be
hitting the per-process file descriptor limit, as each process will try
to open a TCP connection to every other process. In that case you should
still be able to run a job on something like 500 nodes (= 1000
processes, slightly below the maximum of 1024 descriptors per
process).
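
The arithmetic behind that estimate can be sketched roughly as below. This
is only an illustration: it assumes one descriptor per peer connection, and
the "reserved" allowance for stdio and other open files is a hypothetical
figure, not something measured from LAM-MPI.

```python
def descriptors_needed(num_processes, reserved=16):
    """Estimate the descriptors one process needs in a fully connected job.

    Assumes one TCP socket per peer process. `reserved` is a hypothetical
    allowance for stdin/stdout/stderr and other open files.
    """
    return (num_processes - 1) + reserved


def fits(num_processes, limit=1024, reserved=16):
    """Return True if the estimate stays within the descriptor limit."""
    return descriptors_needed(num_processes, reserved) <= limit


if __name__ == "__main__":
    # Compare the job size that works with the one that fails to start.
    for n in (1000, 2296):
        print(n, descriptors_needed(n), fits(n))
```

Under these assumptions a 1000-process job stays below 1024 descriptors per
process, while a 2296-process job needs more than twice that.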
> Various errors happened, one for each try. The Torque version installed
> is 2.0.0p8 and is working fine with other larger jobs, with 1000 CPUs.
This just confirms my suspicion expressed above: 1000 processes stay
under the 1024 descriptor limit, while 2296 do not.
To raise the limit on a Red Hat-like system, add a line like:
* - nofile 4096
to /etc/security/limits.conf.
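
The change only applies to sessions started after the edit, so it is worth
checking what a process actually gets on a compute node. A minimal sketch
using Python's standard resource module follows; the helper name
check_nofile and the required value of 2296 are just illustrations taken
from the job size discussed here.

```python
import resource


def check_nofile(required):
    """Return True if the soft descriptor limit allows `required` descriptors."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # RLIM_INFINITY means no limit is enforced at all.
    if soft == resource.RLIM_INFINITY:
        return True
    return soft >= required


if __name__ == "__main__":
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print("soft limit:", soft, "hard limit:", hard)
    # 2296 is the process count of the failing job; treat it as a lower bound.
    print("enough for 2296 processes:", check_nofile(2296))
```

Running it both from a login shell and from inside a batch job shows
whether the two environments end up with the same limit.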
--
Bogdan Costescu
IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen
Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY
Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868
E-mail: Bogdan.Costescu at IWR.Uni-Heidelberg.De