Robert G. Brown
rgb at phy.duke.edu
Wed May 8 01:25:08 PDT 2002
On Tue, 7 May 2002, Timothy W. Moore wrote:
> This is getting frustrating. I have an application where each node
> creates its own data file. When I go to process these files with a
> serial application on the host, it can only read the first timestep
contained within the file. Could this have something to do with NFS...I
> am the owner? If I rsh to the node and look back at /home, the owner is
> Admin. (FYI - I did not create such an ID). One other thing.../home is
> a software raid. Is that a good/bad idea for the host node?
I am still having a bit of trouble visualizing your difficulty. The
following simply cannot fail unless something is >>seriously<< broken in
your setup:
a) Export /home from your server (s1) to all b-nodes. b-nodes must
also mount /home (df should show home correctly mounted) AND s1 and the
b-nodes must share a common uid/gid space (same /etc/passwd, /etc/group,
/etc/shadow) AND the export and mount options must allow rw, not just
ro, for users if not for root. This is done correctly when you can rsh
to a node, touch (or cp or write in some other way) a file in your home
directory, and have it appear or be modified.
WHEN THIS IS TRUE, a node process is no different from the cp, mv,
touch, or any other command that modifies files. Provided that the node
process "belongs" to you (its UID/GID are your UID/GID, which will be
true as long as you haven't deliberately tried to make it otherwise
with some very specific commands that generally require root privileges
to execute, or forked it out of a root-owned daemon process, or
something else heinous) it can create a file in your home directory
/home/you (PROVIDED that you own it and have sane permissions set),
write to it, close it. SO...
b) On b1, write a SIMPLE task that opens b1.out, writes "Hello,
World!" on 100 lines, closes b1.out. Run it. (Writing this should take
you ten minutes in C, five minutes in perl - less time than it is taking
me to write this response.)
c) You damn well ought to be able to read b1.out on any host that
correctly mounts /home/you with e.g. less /home/you/b1.out.
d) Repeat for all nodes.
As I said, if you can edit files on the nodes in your home directory
with an ordinary file editor or execute simple cp commands, this cannot
really fail unless you've installed a fundamentally broken distribution,
have a fundamentally broken network, or have some other really serious
problem.
Now, assuming that you've arranged for all that to work, and you
STILL have the problem, one can try to debug it. We've now verified
that it isn't NFS, the kernel, a broken libc, or anything in your basic
configuration or hardware. It can only be in the programs themselves.
In other words, you have a pure and simple bug in your code.
You can prove this to yourself by inserting your hello world code into
your programs as a subroutine executed at the very beginning followed
(for the moment) by an exit statement. Again, it "cannot" fail unless
you've got a basic permissions problem (suid-something program or the
like). So if, once you remove the exit call, your program still cannot
write its file correctly, it is time to look at IT to find out why.
Things to check:
a) Simple bug. You think you're executing the fprintf loop many times,
but really you're executing it only once. Stick a printf in the loop so
it prints to stdout as well, then run the code. One line? Or many?
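b) Ordinarily, stdio write buffers get flushed when a line is terminated
with a LF (\n) only if the stream is line buffered, as stdout is when it
goes to a terminal; output to a regular file (NFS-mounted or not) is
normally fully buffered and goes out only when the buffer fills, when
you call fflush, or when the file is closed. NFS itself is designed to
sync writes back to the home disk "immediately" (buffered, yes, but it
doesn't sit on writebacks like AFS does until the file is closed).
Still, I've never found that it hurts to insert an fflush call into a
write loop if I'm concerned that buffering is screwing up the
synchronization between writes and the appearance of the written text
in the file or on stdout.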
c) Ditto: closing the file will ALWAYS flush. The point is that it is
"possible" that if you try reading the file when its output buffers
haven't been flushed (yet), the image you will read won't be consistent
with what will eventually be written.
d) IF you are opening the output file on b1 and writing, and AT THE
SAME TIME opening the output file on s1 and trying to read it into your
collection task a line at a time as it is written, then No, No, No You
Cannot Do This. NFS (and file-based I/O in general) is not a form of
poor man's IPC. I mean, think about it. You open the file on b1 and
write a line to it. s1 opens the file, reads and buffers its contents,
and reads that first line. b1 writes another line. s1, however, is now
reading the BUFFER it first created and has no way to know or suspect
that it has been updated and all its stat information is wrong. It
thinks it is at EOF. To see this, try the following:
(window a) cat > dummy
type a line or two of text... do not ctrl-d close.
(window b) less dummy
(you will see the lines you have typed so far, i.e. any lines you've
terminated with an "enter" (LF) in window a.)
Do not exit less.
(window a) type some more text
and some more
(window b) Hit g and G to go to the beginning and end of the buffer.
Oops! The new text is missing! less has no idea that you've written more
to the file since it opened the file and buffered it. There are times
and places where it might do better (in a pipe, for example, or possibly
if the file were so long it wouldn't fit in its default buffer) but you
see the point.
If you END less and rerun it, you see the new text (at least that part
that has been flushed). If you END cat with a ctrl-d, you flush it all,
and a new instance of less will now read all the data.
If d) is your problem, then you have a couple of choices. You can wait
until the b1 task completes (closing the file) and THEN read it on s1,
or variants thereof. I do this all the time when I graze for data being
generated by embarrassingly parallel tasks. In fact, I have scripts
that harvest ALL the data in all of the files that are still open and
being written to by the tasks when I run them, basically recreating the
aggregate dataset from scratch each time I run it. After all,
statistical postprocessing takes only a second or two, so it is no big
deal to do it over each time I want to "look" at the latest results.
OR, you can write a real IPC grazer, and write your results back to it
in a master/slave paradigm. PVM does this very simply -- pvm spawn all
your worker tasks, and send the results back to the master as they come
in. Your master task can then do whatever you like with them. Note
that pvm has what NFS is missing -- ways of informing the master that
the slave has written a line to be read, for example. Of course you can
easily do the same thing with raw sockets, MPI, or the like. You can
even manage it with NFS without closing and reopening the files if you
try really hard (using e.g. rewind), but I wouldn't recommend it.
Hope this helps.
Robert G. Brown http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu