[Beowulf] Fwd: warewulf - cannot log into nodes
Duke Nguyen
duke.lists at gmx.com
Thu Nov 29 02:52:13 PST 2012
On 11/28/12 1:56 AM, Gus Correa wrote:
> On 11/27/2012 01:52 PM, Gus Correa wrote:
>> On 11/27/2012 02:14 AM, Duke Nguyen wrote:
>>> On 11/27/12 1:44 PM, Christopher Samuel wrote:
>>>> On 27/11/12 15:51, Duke Nguyen wrote:
>>>>
>>>>> Thanks! Yes, I am trying to get the system working with
>>>>> Torque/Maui/OpenMPI now.
>>>> Make sure you build Open MPI with support for Torque's TM interface,
>>>> that will save you a lot of hassle as it means mpiexec/mpirun will
>>>> find out directly from Torque what nodes and processors have been
>>>> allocated for the job.
>>> Christopher, how would I check that? I got Torque/Maui/OpenMPI up and
>>> working as root (not as a normal user yet :( !!!), tried mpirun and it
>>> worked fine:
>>>
> PS - Do 'qsub myjob' as a regular user, not as root.
>
>>> # /usr/lib64/openmpi/bin/mpirun -pernode --hostfile
>>> /home/mpiwulf/.openmpihostfile /home/mpiwulf/test/mpihello
>>> Hello world! I am process number: 3 on host node0118
>>> Hello world! I am process number: 1 on host node0104
>>> Hello world! I am process number: 0 on host node0103
>>> Hello world! I am process number: 2 on host node0117
>>>
>>> Thanks,
>>>
>>> D.
>> D.
>>
>> Try to omit the hostfile from your mpirun command line,
>> put the mpirun command inside a Torque/PBS script instead,
>> and submit it with qsub.
>> Like this:
>>
>> *********************************
>> myPBSScript.tcsh
>> *********************************
>> #! /bin/tcsh
>> #PBS -l nodes=2:ppn=8 [Assuming your Torque 'nodes' file has np=8]
>> #PBS -q batch@mycluster.mydomain
>> #PBS -N hello
>> @ NP = `cat $PBS_NODEFILE | wc -l`
>> mpirun -np ${NP} ./mpihello
>> *********************************
>>
>> $ qsub myPBSScript.tcsh
>>
>>
>> If OpenMPI was built with Torque support,
>> the job will run on the nodes/processors allocated by Torque.
>> [The nodes/processors are listed in $PBS_NODEFILE,
>> but you don't need to refer to it in the mpirun line if
>> OpenMPI was built with Torque support. If OpenMPI lacks
>> Torque support, then you can use $PBS_NODEFILE as your hostfile:
>> mpirun -hostfile $PBS_NODEFILE.]
>>
>> If Torque was installed in a standard place, say under /usr,
>> then OpenMPI configure will pick it up automatically.
>> If not in a standard location, then add
>> --with-tm=/torque/directory
>> to the OpenMPI configure line.
>> [./configure --help is your friend!]
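>>
>> [For example, assuming Torque lives under /usr/local/torque
>> (the path is just an illustration), the configure step would
>> look something like:
>>
>> ./configure --prefix=/usr/local --with-tm=/usr/local/torque
>> make all install
>> ]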
>>
>> Another check:
>>
>> $ ompi_info [tons of output that you can grep for "tm" to see
>> if Torque was picked up.]
>>
>>
OK, after a huge headache with torque/maui, I finally found out that my
master node's system was a mess :D. There were multiple versions of torque
installed (via yum, from source, etc...), which caused confusion for
different users logging in (root vs. normal users) - mainly because I had
followed different guides on the net. Then I decided to delete everything
related to pbs (torque, maui, openmpi) and start from scratch. So I built
torque rpms for the master/nodes and installed them, then built a maui rpm
with support for torque and installed it, then built an openmpi rpm with
support for torque too. This time I think I have almost everything working:
[mpiwulf@biobos:~]$ ompi_info | grep tm
MCA ras: tm (MCA v2.0, API v2.0, Component v1.6.3)
MCA plm: tm (MCA v2.0, API v2.0, Component v1.6.3)
MCA ess: tm (MCA v2.0, API v2.0, Component v1.6.3)
openmpi now works with infiniband:
[mpiwulf@biobos:~]$ /usr/local/bin/mpirun -mca btl ^tcp -pernode
--hostfile /home/mpiwulf/.openmpihostfile /home/mpiwulf/test/mpihello
Hello world! I am process number: 3 on host node0118
Hello world! I am process number: 1 on host node0104
Hello world! I am process number: 2 on host node0117
Hello world! I am process number: 0 on host node0103
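By the way, "-mca btl ^tcp" just excludes the TCP BTL; on this setup a
roughly equivalent, more explicit selection (same binary and hostfile)
would be:

/usr/local/bin/mpirun -mca btl openib,sm,self -pernode --hostfile
/home/mpiwulf/.openmpihostfile /home/mpiwulf/test/mpihello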
openmpi also works with torque:
----------------
[mpiwulf@biobos:~]$ cat test/KCBATCH
#!/bin/bash
#
#PBS -l nodes=6:ppn=1
#PBS -N kcTEST
#PBS -m be
#PBS -e qsub.er.log
#PBS -o qsub.ou.log
#
{ time {
/usr/local/bin/mpirun /home/mpiwulf/test/mpihello
} } &>output.log
[mpiwulf@biobos:~]$ qsub test/KCBATCH
21.biobos
[mpiwulf@biobos:~]$ cat output.log
--------------------------------------------------------------------------
The OpenFabrics (openib) BTL failed to initialize while trying to
allocate some locked memory. This typically can indicate that the
memlock limits are set too low. For most HPC installations, the
memlock limits should be set to "unlimited". The failure occured
here:
Local host: node0103
OMPI source: btl_openib_component.c:1200
Function: ompi_free_list_init_ex_new()
Device: mthca0
Memlock limit: 65536
You may need to consult with your system administrator to get this
problem fixed. This FAQ entry on the Open MPI web site may also be
helpful:
http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.
Local host: node0103
Local device: mthca0
--------------------------------------------------------------------------
Hello world! I am process number: 5 on host node0103
Hello world! I am process number: 0 on host node0104
Hello world! I am process number: 2 on host node0110
Hello world! I am process number: 4 on host node0118
Hello world! I am process number: 1 on host node0109
Hello world! I am process number: 3 on host node0117
[node0104:02221] 5 more processes have sent help message
help-mpi-btl-openib.txt / init-fail-no-mem
[node0104:02221] Set MCA parameter "orte_base_help_aggregate" to 0 to
see all help / error messages
[node0104:02221] 5 more processes have sent help message
help-mpi-btl-openib.txt / error in device init
real 0m0.291s
user 0m0.034s
sys 0m0.043s
----------------
Unfortunately I still have the "error registering openib memory" problem
with non-interactive jobs. Any experience with this would be great.
Thanks,
D.