[Beowulf] torque: 4GB resources_used.mem limit

Bernd Schubert bernd-schubert at gmx.de
Wed Jun 29 03:01:08 PDT 2005


Hello,

I already posted this to torqueusers at supercluster.org, but I think that list 
has rather little traffic, and I guess more people are subscribed to this 
list, some of whom might already have had this problem.

We have a cluster running a combination of torque + maui (just for those who 
might not know, torque is the actively maintained fork of openpbs). 
In principle it's running fine; we have only one pretty annoying problem:
torque does not correctly account for jobs using more than 4GB. For such jobs
qstat always shows only 'actual_size - 4GB' as resources_used.mem.
If it were only a qstat problem, we wouldn't care. Unfortunately it also
prevents torque from killing improperly specified jobs. So it can happen, and
has already happened several times, that one job required all the memory on a
node, but torque happily started another job on that node, because a user
didn't properly specify how much memory his/her jobs required and torque
didn't kill those jobs automatically. Of course, this results in heavy swap
usage and slows down both jobs dramatically. 

We hoped this issue would be solved by installing the 64-bit version of
torque (it's a 32/64-bit biarch debian system), but this didn't help.
Does anyone here have an idea what's going on, how to debug it, or even how
to solve it?
I'm pretty unfamiliar with torque+maui (we don't maintain the basic stuff
ourselves) and also haven't looked into the source code. Thinking in terms of
the C language, I can only imagine that someone has explicitly used a 32-bit
integer for the memory variable, but who would do this?
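
To illustrate what I suspect, here is a minimal sketch in C. It is not taken
from the torque source (which I haven't read); the function names are made up,
and it only assumes that the memory counter is kept in bytes in a 32-bit
unsigned integer somewhere along the way.

```c
#include <stdint.h>

/* Hypothetical helper, NOT from the torque source: stores a byte count in
 * a 32-bit unsigned integer, as a too-narrow memory counter would. */
uint32_t store_bytes_32bit(uint64_t actual_bytes)
{
    /* Converting to uint32_t keeps only the low 32 bits, i.e. the value
     * modulo 2^32 bytes (4GB). */
    return (uint32_t)actual_bytes;
}

/* Hypothetical counterpart with a 64-bit counter: nothing is lost. */
uint64_t store_bytes_64bit(uint64_t actual_bytes)
{
    return actual_bytes;
}
```

With these, store_bytes_32bit(5ULL << 30), i.e. a 5GB job, yields 1GB, which
matches the 'actual_size - 4GB' we see in qstat for jobs between 4GB and 8GB.
Note that on a 64-bit Linux build an 'unsigned long' is 64 bits wide, but
'int' and 'unsigned int' stay at 32 bits, so a counter declared as a plain int
would still wrap after recompiling for 64-bit.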

The torque version is 1.2.0p3 and maui is 3.2.6p11-2.


Thanks in advance,
        Bernd


-- 
Bernd Schubert
PCI / Theoretische Chemie
Universität Heidelberg
INF 229
69120 Heidelberg
e-mail: bernd.schubert at pci.uni-heidelberg.de



