DQS drops jobs on SuSE 6.3 cluster

Fri Nov 3 06:43:48 PST 2000

Dear Michael,

Thanks a lot for your hint. In the meantime I have been experimenting a bit
more, and I now think the problem was due to something else, so I haven't
applied the patch (yet).

I observed that which queue actually executes the job depends on the machine
I submit it from. So, I started to look at the file systems.

The example dqs.sh uses the -cwd flag to run the job in the current
directory, and to put the output files there as well. This obviously only
works when the current directory is mounted on all systems under an
identical name. The job runs only on those systems (i.e. queues) which
happen to have a directory with the same name.

To achieve this, I used symbolic links, but it looks like qsub resolves the
symbolic link to the name of its target (which is NOT common to all systems).

I have the following setup:
- 4 Linux machines, called pp1,pp2,...
- each has a /data partition, which they export (for NFS)
- pp1 mounts the /data partitions of the other machines as /pp2-data etc.
- On pp1, I linked /data as /pp1-data.

Net result: on each machine, you can do 'cd /ppx-data/bla', and end up in
the same physical location.

However, if I am on pp1, 'cd /pp1-data/bla', 'qsub dqs.sh', it turns out
that the job ONLY runs when it was assigned to a queue on pp1. Looking at
the output of the job, I see that PWD was set to /data/bla, and not to
/pp1-data/bla as expected.
On the other hand, when I do exactly the same, but from (say) pp2,
everything works fine.
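
A guess at the mechanism, which I have NOT checked in the DQS sources: the
getcwd() call always returns the physical path, with symbolic links
resolved, while the shell keeps the logical name in PWD. If qsub records the
former, that would explain what I see. The sketch below only illustrates the
general effect in Python; the temporary directories stand in for /data and
/pp1-data, and the function name is made up for this example.

```python
import os
import tempfile


def logical_and_physical_cwd():
    """Enter a directory through a symlink and report both views of the cwd."""
    # Resolve the temp dir first so that only our own link differs below.
    base = os.path.realpath(tempfile.mkdtemp())
    real_dir = os.path.join(base, "data")      # stands in for /data
    link_dir = os.path.join(base, "pp1-data")  # stands in for /pp1-data
    os.makedirs(os.path.join(real_dir, "bla"))
    os.symlink(real_dir, link_dir)

    old_cwd = os.getcwd()
    try:
        logical = os.path.join(link_dir, "bla")  # the name I typed after 'cd'
        os.chdir(logical)
        physical = os.getcwd()                   # symlinks are resolved here
    finally:
        os.chdir(old_cwd)
    return logical, physical


if __name__ == "__main__":
    logical, physical = logical_and_physical_cwd()
    print("logical :", logical)
    print("physical:", physical)
```

The physical path ends in data/bla rather than pp1-data/bla, which matches
the PWD I see in the job output.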

[ I tested this by creating a /data/bla on pp2 as well. Then indeed the job
also runs in a queue on pp2, with output in pp2:/data/bla, and not in
pp1:/data/bla ]

So, at the moment, everything seems to work fine when I submit from a
machine which mounts the cwd.
An alternative solution would be to rename the partitions as pp1:/pp1-data,
such that I wouldn't need the symbolic link.
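
A further workaround I have not tried would be to translate the resolved
path back to its cluster-wide name on the submitting host, and have the job
script 'cd' to that name instead of relying on -cwd. A minimal sketch,
assuming the /<host>-data naming scheme described above; the function name
and its default argument are hypothetical:

```python
import posixpath


def cluster_wide_path(physical_path, host, local_root="/data"):
    """Map a host-local path under local_root to its /<host>-data name."""
    physical_path = posixpath.normpath(physical_path)
    inside = (physical_path == local_root
              or physical_path.startswith(local_root + "/"))
    if not inside:
        # Not on the exported partition: leave the path untouched.
        return physical_path
    return "/" + host + "-data" + physical_path[len(local_root):]


if __name__ == "__main__":
    # On pp1, /data/bla is known to the other machines as /pp1-data/bla.
    print(cluster_wide_path("/data/bla", "pp1"))
```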

Personally, I find this behaviour of DQS with symbolic links unexpected, and
worth putting in the documentation (or changing in the code...)

Also, I would expect the non-existence of the cwd on a system to be flagged
in the DQS err_file. That doesn't seem to happen, though.

I'll hold off on applying the patch until I run into other problems.

Many thanks,

Kris

>
> Dear Kris:
>
> I think I can help you with this.  This behavior sounds like it is due to
> a known bug in DQS 3.3.1 (and presumably earlier versions), for which I
> have a patch from the DQS authors at Florida State University.   I attach
> a portion of an email I received from DQS support a while back regarding
> this issue, which contains a context 'diff' of the necessary patch.  I
> hope this helps.  We are running DQS 3.3.1 on a Red Hat based cluster here
> and it works very well.
>
>
>
> On Thu, 2 Nov 2000, Kris Thielemans wrote:
>
> > Hi,
> >
> > I'm trying to get DQS running on our cluster of 4 SuSE 6.3 systems. I
> > tried 3 different versions of DQS:
> > - the RPM package on the original CD
> > - the RPM package provided on the SuSE website to update it to fix a y2k
> > problem (version 3.2.7)
> > - the newest version (3.3.1) from ftp.scri.fsu.edu (compiled from
> > sources)
> >
> > All 3 versions have the same problem:
> > jobs are occasionally dropped from the queue, or even not started
> >
> > Symptoms:
> > qsub somejob.sh   -> works ok
> > qstat -f                -> lists job
> >
> > (a little bit later)
> > qstat -f                -> job gone
> >
> > This happens with the simple dqs.sh example script that they provide for
> > testing.
> >
> > There is NO error message in the dqs err_file, or anything in the
> > log_file.
> >
> > This problem also occurs when I disable all queues except 1 (on the same
> > node as the qmaster).