[Beowulf] Stupid MPI programming question

Robert G. Brown rgb at phy.duke.edu
Thu Sep 28 10:25:29 PDT 2006


On Thu, 28 Sep 2006, Michael Will wrote:

> That's weird. On my Scyld cluster it worked fine once I had created
> /tmp/oooo/ on all compute nodes before running the job.

Maybe we should ask something like "what compiler/kernel/distro are you
using?"  Although he's already begging for mercy from the list now that
his immediate problem is solved:-)

How about it, Brent?  Want more meta-comments and advice?  Mark already
kind of hammered you a bit for "bad style" and I personally refrained
from doing the same (held back by being really busy, mostly:-) but I
agree that your code needs to be a lot prettier.  You need to do things
like making directories and creating files therein in a ritual fashion:
check the return value of every call and print the resulting error code
when one fails.  Otherwise, accept the fact that you're going to have to
deal with serious debugging issues when things fail because of trivial,
syntactically allowed typos.  Well-written code is almost by definition
relatively easy to debug, and trying out a new, complex library is NOT a
good time to get sloppy with style...
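
To make the "ritual" concrete, here is a minimal sketch in C of a checked
directory creation.  The helper name make_dir_checked() and the 0755 mode
are assumptions for illustration, not anything from Brent's code.  Note
that POSIX mkdir() takes the mode as a second argument.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>

/* make_dir_checked() is a hypothetical helper name, not from the thread.
 * Returns 0 if the directory was created or already exists as a
 * directory; otherwise prints the reason to stderr and returns -1. */
static int make_dir_checked(const char *path, mode_t mode)
{
    struct stat sb;
    int saved;

    if (mkdir(path, mode) == 0)
        return 0;                   /* created */

    saved = errno;                  /* keep it: later calls may change errno */

    if (saved == EEXIST && stat(path, &sb) == 0 && S_ISDIR(sb.st_mode))
        return 0;                   /* already there, and it is a directory */

    fprintf(stderr, "mkdir(\"%s\") failed: %s (errno %d)\n",
            path, strerror(saved), saved);
    errno = saved;                  /* let the caller inspect it too */
    return -1;
}
```

Called as make_dir_checked("/tmp/oooo/NEWDIRNAME", 0755), this would
report something like ENOENT if /tmp/oooo does not exist on that node,
instead of a bare -1.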

How far you go to make your code comply with standard coding practice
and style, whether for its own sake or in the name of portability and
robustness, is up to you.  (Most standard or "good" coding practice is
the way it is for infinitely practical reasons: experienced coders are
practical people and don't waste effort on "style" for no reason:-)  If
it is really quick-and-dirty code, fine, but EXPECT problems with
debugging then and prepare to spend the time -- you're gambling that
your experience will make up for a lack of checking, and if you lose you
pay the piper.  OTOH, if you are writing a commercial-grade app you
might actually dust off the *stat() calls and check ownership,
permissions, and the existence of paths/parents before trying to make a
directory or file.  You would also handle at least some of the more
likely error codes at the perror() level or better ("better" because
some errors can be trapped, and in an interactive application the user
can get another chance to enter a string correctly or the like).  You
can also decide whether or not to trap the actual signals to avoid
crashing the parent program, for the same reasons.
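
As a rough sketch of the stat()-style pre-check described above, the
following C function tests that a parent path exists, is a directory,
and is writable and searchable before anything is created inside it.
The name parent_is_usable() is made up for this example.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* parent_is_usable() is a hypothetical name.  Returns 0 if `parent`
 * exists, is a directory, and is writable and searchable by the calling
 * process; otherwise prints the reason to stderr and returns -1. */
static int parent_is_usable(const char *parent)
{
    struct stat sb;

    if (stat(parent, &sb) != 0) {
        fprintf(stderr, "stat(\"%s\"): %s\n", parent, strerror(errno));
        return -1;
    }
    if (!S_ISDIR(sb.st_mode)) {
        fprintf(stderr, "\"%s\" exists but is not a directory\n", parent);
        return -1;
    }
    /* access() checks with the real (not effective) user and group IDs. */
    if (access(parent, W_OK | X_OK) != 0) {
        fprintf(stderr, "access(\"%s\"): %s\n", parent, strerror(errno));
        return -1;
    }
    return 0;
}
```

A pre-check like this only improves the error message.  The directory
can still vanish or change between the check and the mkdir(), so the
return value of the mkdir() itself has to be checked regardless.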

In an MPI program this may be a really good thing to do, as on certain
errors you may crash the entire distributed application and not just the
particular subtask on some given node.  If you can trap the errors and
recover (even crudely) you may be able to keep the main computation
going, or send out a message that forces a checkpoint to be written so
that the app can be fixed and restarted, or at the very least get the
error message back to where you can find it and figure out WHICH
subprogram/node failed and why.
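
One cheap way to get the "WHICH node and why" part is to tag every error
string with the rank and host before it is printed or sent anywhere.
The sketch below is plain C with no MPI calls in it; it assumes the
caller already has the rank (for example from MPI_Comm_rank) and a host
name (for example from MPI_Get_processor_name).  The function name
format_node_error() is invented for this example.

```c
#include <stdio.h>
#include <string.h>

/* format_node_error() is a hypothetical helper.  It writes a one-line,
 * rank-tagged description of a failed call into buf and returns the
 * snprintf() result (the length the full message needs).
 *   rank, host: assumed to come from the MPI runtime, not obtained here
 *   what:       name of the call that failed, e.g. "mkdir"
 *   err:        the errno value saved right after the failure */
static int format_node_error(char *buf, size_t len, int rank,
                             const char *host, const char *what, int err)
{
    return snprintf(buf, len, "[rank %d on %s] %s failed: %s (errno %d)",
                    rank, host, what, strerror(err), err);
}
```

The resulting string can go to stderr on the node, or be shipped back to
the master rank as an ordinary message so that all failures end up in
one place.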

I personally tend to be gaudy about documenting, indenting, and using a
very consistent (if idiosyncratic) style, including the embedding of
runtime debugging code, in nearly all apps I write.  After all, I may be
the one fixing it (as I am today) years or months after writing it...;-)

    rgb

> Michael
>
> -----Original Message-----
> From: 	Clements, Brent M (SAIC) [mailto:brent.clements at bp.com]
> Sent:	Thu Sep 28 07:11:27 2006
> To:	Jakob Oestergaard; Robert G. Brown
> Cc:	beowulf at beowulf.org
> Subject:	RE: [Beowulf] Stupid MPI programming question
>
> What I ended up doing was just stripping the program down to like 10 lines of code and I have a simple sprintf to create the directory name.
>
> What is weird is that (I haven't done error reporting yet):
>
> mkdir("NEWDIRNAME"); works (creates a directory in the current working directory)
>
> but mkdir("/tmp/oooo/NEWDIRNAME") doesn't work. I even tried a chdir (which came back successful) and then wrote the above mkdir("NEWDIRNAME");
>
> Anyway..I'm starting to get off-topic.
>
> Nevertheless, I got it working minimally to what I wanted to do, so I have a great template for a simple MPI program.
>
> Thanks to everyone who helped out!!!
>
> ________________________________
>
> From: Jakob Oestergaard [mailto:jakob at unthought.net]
> Sent: Thu 9/28/2006 8:09 AM
> To: Robert G. Brown
> Cc: Clements, Brent M (SAIC); beowulf at beowulf.org
> Subject: Re: [Beowulf] Stupid MPI programming question
>
>
>
> On Thu, Sep 28, 2006 at 08:57:28AM -0400, Robert G. Brown wrote:
>> On Thu, 28 Sep 2006, Jakob Oestergaard wrote:
> ...
>> Ah, that's it.  I'd forgotten this one and missed the write to a static
>> string, although it has bitten me in the past (partly because back in
>> the remote K&R past one could nearly always get away with it).  This is
>> also a way that buffer overwrite attacks can begin if any nefarious
>> human can control the string that is overwritten IIRC...
>>
>> Although in this particular case, that should have produced a very
>> different error than -1 on the mkdir call, should it not?
>
> Well, if the write doesn't give him a segfault and he's allowed to write
> to memory that shouldn't be written to, then I guess pretty much
> anything can happen after that.
>
>> And he was
>> writing out the results string per node right before calling as well, so
>> his compiler was probably letting him get away with it or failing in
>> some odd way later.
>
> Yup
>
> I wonder if valgrind would have caught it...
>
>>
>>> What you probably want to do is:
>>> ---
>>> char foo[1024];          // 1KiB on the stack - writable
>>> strncpy(foo, sizeof foo, "my test"); // Assign contents by copying
>>> ...
>>> foo[0] = ' ';             // <- fine
>>> ---
>>
>> Absolutely.
>
> Uh, except I meant strncpy(foo, "my test", sizeof foo) of course...
>
> Cheers,
>
> --
>
> / jakob

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu




