[Beowulf] mpich2 complain about nodes that i dont use

Ru-Zhen Li r.li at qmul.ac.uk
Sat Oct 1 03:43:32 PDT 2005


Dear Martin and Mark,

Thanks for the reply. The interesting thing, however, is that I used nearly 
exactly the same input files for my application, and the first job, which I 
submitted several days ago, ran fine with no errors.

Also, when I ran ulimit -s, it reported 10240. The cluster is not very 
stable; for example, when I ran mpdboot on every node before, it did not 
give any errors. So I think the problem might be caused by communication 
problems between the nodes, but I am not sure.
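
Since the limit seen by a process started through the MPI daemons may differ 
from the one shown in a login shell, it can help to query it from inside a 
process on each node. The following is only a minimal sketch in Python: the 
helper names stack_limit_kb and describe are made up for illustration, while 
resource.getrlimit and RLIMIT_STACK come from the Python standard library.

```python
import resource


def stack_limit_kb():
    """Return the (soft, hard) stack limits in units of 1024 bytes.

    A value of None means the limit is unlimited.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_STACK)

    def to_kb(value):
        # RLIM_INFINITY is the sentinel the OS uses for "no limit".
        return None if value == resource.RLIM_INFINITY else value // 1024

    return to_kb(soft), to_kb(hard)


def describe(limit_kb):
    """Format a limit the way a person would read it."""
    return "unlimited" if limit_kb is None else f"{limit_kb} kB"


if __name__ == "__main__":
    soft, hard = stack_limit_kb()
    print(f"stack soft limit: {describe(soft)}")
    print(f"stack hard limit: {describe(hard)}")
```

The soft limit is the value that ulimit -s normally prints in bash. If it is 
lower on the compute nodes than in the login shell, that would be worth 
ruling out before looking at communication between the nodes.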

Thanks again.




Yesterday is history, tomorrow is mystery, only today is a gift, that's why 
we call it present !

========================================================================
Ru-Zhen Li

0044 020 7882 6327
Materials Department
Queen Mary
University of London
E1 4NS

Email: r.li at qmul.ac.uk
Homepage: http://www.freewebs.com/lrz/
----- Original Message ----- 
From: "Martin Siegert" <siegert at sfu.ca>
To: "Mark Hahn" <hahn at physics.mcmaster.ca>
Cc: "Ru-Zhen Li" <r.li at qmul.ac.uk>; <beowulf at beowulf.org>
Sent: Saturday, October 01, 2005 3:37 AM
Subject: Re: [Beowulf] mpich2 complain about nodes that i dont use


> On Fri, Sep 30, 2005 at 09:47:46PM -0400, Mark Hahn wrote:
>> > I am using mpich2 on linux cluster, I kept having errors like the 
>> > following
>> >
>> > rank 14 in job 2  cn128_57798   caused collective abort of all ranks
>> >   exit status of rank 14: killed by signal 9
>>
>> signal 9 is sigkill (not segv or abrt, etc), and I'd be a bit surprised
>> if this happened other than by someone killing the process.
>
> I indeed was surprised when I saw that (signal 9) with one of our codes
> as well. In that case it turned out to be code that needed a larger
> stacksize than was permitted under the current settings (ulimit, etc.).
> Thus, if "ulimit -s" shows something like 8192 you may want to increase
> that and try again.
> I could imagine that something like this could also happen with code
> that has a memory leak and runs the system out of memory.
>
> - Martin
>
> -- 
> Martin Siegert
> Head, HPC at SFU
> WestGrid Site Manager
> Academic Computing Services                        phone: (604) 291-4691
> Simon Fraser University                            fax:   (604) 291-4242
> Burnaby, British Columbia                          email: siegert at sfu.ca
> Canada  V5A 1S6
> 



