DQS drops jobs on SuSE 6.3 cluster

Michael D. Bartberger mdb at chem.ucla.edu
Thu Nov 2 06:28:04 PST 2000


Dear Kris:

I think I can help you with this.  This behavior sounds like it is due to
a known bug in DQS 3.3.1 (and presumably earlier versions), for which I
have a patch from the DQS authors at Florida State University.   I attach
a portion of an email I received from DQS support a while back regarding
this issue, which contains a context 'diff' of the necessary patch.  I
hope this helps.  We are running DQS 3.3.1 on a Red Hat based cluster here
and it works very well.

With best regards,
-Michael

----------------------------

[deletia]

Since you mentioned that you are setting up a Linux cluster, I should
warn you that the qmaster has had some trouble with dropping its listen
socket on some Red Hat Linux hosts.  Apparently the syslog(3) function in
glibc was closing the wrong file descriptor, which turned out to be the
socket descriptor for the qmaster listen.  We have fixed the problem and
here is a context diff of the fix:

*** dqs_log.c	2000/04/30 09:49:34	1.7
--- dqs_log.c	2000/07/24 04:08:58
***************
*** 319,325 ****
--- 319,331 ----
       
  {
    
+ #ifdef linux
+   openlog("",LOG_LOCAL0,LOG_LOCAL0);
+ #endif
    syslog(log_level,"%s",err_str);
+ #ifdef linux
+   closelog();
+ #endif
    return;
    
  }
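
For anyone who cannot apply the diff directly, the same idea can be
sketched as a standalone routine: open the log connection explicitly
before each syslog() call and close it again afterwards, so the logging
code only ever closes a descriptor it opened itself.  This is an
illustration and not DQS source.  The names log_bracketed, fd_is_open
and descriptors_survive_logging, and the ident string "dqs-sketch", are
hypothetical.  The sketch also passes 0 as the option argument of
openlog(), where the patch above passes LOG_LOCAL0 in that position.

```c
#include <fcntl.h>
#include <syslog.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if fd is an open descriptor, else 0. */
int fd_is_open(int fd)
{
    return fcntl(fd, F_GETFD) != -1;
}

/*
 * Hypothetical stand-in for the patched logging routine.  The log
 * connection is opened and closed around every message, mirroring the
 * openlog()/closelog() pair that the patch adds around syslog().
 */
void log_bracketed(int log_level, const char *err_str)
{
    openlog("dqs-sketch", 0, LOG_LOCAL0);  /* ident, option, facility */
    syslog(log_level, "%s", err_str);
    closelog();
}

/*
 * Rough check of the symptom described above: open two unrelated
 * descriptors (a pipe here, standing in for the qmaster listen
 * socket), log one message, and confirm both are still open.
 * Returns 1 if both survived, 0 otherwise.
 */
int descriptors_survive_logging(void)
{
    int fds[2];
    int ok;

    if (pipe(fds) != 0)
        return 0;

    log_bracketed(LOG_INFO, "descriptor check");
    ok = fd_is_open(fds[0]) && fd_is_open(fds[1]);

    close(fds[0]);
    close(fds[1]);
    return ok;
}
```

Calling descriptors_survive_logging() from a small test program on the
qmaster host is one way to see whether the installed glibc leaves other
descriptors alone when logging is bracketed this way.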



-----------------



On Thu, 2 Nov 2000, Kris Thielemans wrote:

> Hi,
> 
> I'm trying to get DQS running on our cluster of 4 SuSE 6.3 systems. I tried
> 3 different versions of DQS
> - the RPM package on the original CD
> - the RPM package provided on the SuSE website to update it to fix a y2k
> problem (version 3.2.7)
> - the newest version  (3.3.1) from ftp.scri.fsu.edu (compiled from
> sources)
> 
> All 3 versions have the same problem:
> jobs are occasionally dropped from the queue, or even not started
> 
> Symptoms:
> qsub somejob.sh   -> works ok
> qstat -f                -> lists job
> 
> (a little bit later)
> qstat -f                -> job gone
> 
> This happens with the simple dqs.sh example script that they provide for
> testing.
> 
> There is NO error message in the dqs err_file, or anything in the log_file.
> 
> This problem also occurs when I disable all queues except 1 (on the same
> node as the qmaster).
> 
> 
> Any ideas?
> 
> Thanks,
> 
> Kris Thielemans
> 
> MRC Cyclotron Unit,
> Hammersmith Hospital,
> DuCane Rd,London W12 0NN, United Kingdom
> 
> 
> _______________________________________________
> Beowulf mailing list
> Beowulf at beowulf.org
> http://www.beowulf.org/mailman/listinfo/beowulf
> 







