[Beowulf] [External] SLURM - Where is this exit status coming from?

Skylar Thompson skylar.thompson at gmail.com
Thu Aug 13 14:37:46 PDT 2020


I think this is an artifact of the job process running as a child process of
the job script: POSIX records the terminating signal in the low-order bits of
the child's wait status, and the shell running the job script reports that as
an exit code of 128 plus the signal number.

As others noted, 137 is 128+9, where 9 is SIGKILL (sent when a job exceeds its
memory request, and also when it exceeds its runtime request, at least in the
Grid Engine world).
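
To see how the numbers in the different logs line up, here is a minimal
sketch (plain C, not SLURM code) that decodes a raw wait(2) status the way
the macros in <sys/wait.h> do; it assumes the "status 35072" in slurmd.log is
such a raw wait status:

/* decode.c - sketch of how a raw wait(2) status, a shell exit code,
 * and a signal number relate.  Build: cc -o decode decode.c && ./decode */
#include <stdio.h>
#include <signal.h>
#include <sys/wait.h>

int main(void)
{
    int raw = 35072;              /* "status 35072" from slurmd.log */

    if (WIFEXITED(raw))           /* exited normally; high byte is the exit code */
        printf("exited, WEXITSTATUS = %d\n", WEXITSTATUS(raw));   /* prints 137 */

    if (WIFSIGNALED(raw))         /* would be true if the status encoded death by signal */
        printf("killed by signal %d\n", WTERMSIG(raw));

    /* 137 is the shell convention 128 + signal, so the script's child
     * was killed by signal 137 - 128 = 9, i.e. SIGKILL. */
    printf("128 + %d = %d, SIGKILL = %d\n", 137 - 128, 137, SIGKILL);
    return 0;
}

So slurmd's raw status 35072 is 137 << 8, slurmctld's WEXITSTATUS 137 is
128 + 9, and presumably the 9:0 in the accounting record is that same signal
number surfacing again: the same SIGKILL seen at three different layers.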

On Thu, Aug 13, 2020 at 02:24:49PM -0700, Alex Chekholko via Beowulf wrote:
> This may be a "cargo cult" answer from old SGE days but IIRC "137" was
> "128+9" and it means the process got signal 9 which means _something_ sent
> it a SIGKILL.
> 
> On Thu, Aug 13, 2020 at 2:22 PM Prentice Bisbal via Beowulf <
> beowulf at beowulf.org> wrote:
> 
> > I think you dialed the wrong number. We're the Beowulf people! Although,
> > I'm sure we can still help you. ;)
> >
> > --
> > Prentice
> > On 8/13/20 4:14 PM, Altemara, Anthony wrote:
> >
> > Cheers SLURM people,
> >
> >
> >
> > We’re seeing some intermittent job failures in our SLURM cluster, all with
> > the same 137 exit code. I’m having difficulty in determining whether this
> > error code is coming from SLURM (timeout?) or the Linux OS (process killed,
> > maybe memory).
> >
> >
> >
> > In this example, there’s the WEXITSTATUS in the slurmctld.log, error:0
> > status 35072 in the slurmd.log, and ExitCode 9:0 in the accounting log…?
> >
> >
> >
> > Does anyone have insight into how all these correlate? I’ve spent a
> > significant amount of time digging through the documentation, and I don’t
> > see a clear way to interpret all of these…
> >
> >
> >
> >
> >
> > Example: Job: 62791
> >
> >
> >
> > [root at XXXXXXXXXXXXX]  /var/log/slurm# grep -ai jobid=62791 slurmctld.log
> >
> > [2020-08-13T10:58:28.599] _slurm_rpc_submit_batch_job: JobId=62791
> > InitPrio=4294845347 usec=679
> >
> > [2020-08-13T10:58:29.080] sched: Allocate JobId=62791 NodeList=
> > XXXXXXXXXXXXX #CPUs=1 Partition=normal
> >
> > [2020-08-13T11:17:45.275] _job_complete: JobId=62791 WEXITSTATUS 137
> >
> > [2020-08-13T11:17:45.294] _job_complete: JobId=62791 done
> >
> >
> >
> >
> >
> > [root@ XXXXXXXXXXXXX]  /var/log/slurm# grep 62791 slurmd.log
> >
> > [2020-08-13T10:58:29.090] _run_prolog: prolog with lock for job 62791 ran
> > for 0 seconds
> >
> > [2020-08-13T10:58:29.090] Launching batch job 62791 for UID 847694
> >
> > [2020-08-13T11:17:45.280] [62791.batch] sending
> > REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 35072
> >
> > [2020-08-13T11:17:45.405] [62791.batch] done with job
> >
> >
> >
> >
> >
> > [root at XXXXXXXXXXXXX]  /var/log/slurm# sacct -j 62791
> >
> >        JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
> > ------------ ---------- ---------- ---------- ---------- ---------- --------
> > 62791        nf-normal+     normal     (null)          0     FAILED      9:0
> >
> >
> >
> > [root at XXXXXXXXXXXXX]  /var/log/slurm# sacct -lc | tail -n 100 | grep 62791
> >
> > JobID           UID    JobName  Partition   NNodes        NodeList      State               Start                 End  Timelimit
> > 62791        847694 nf-normal+     normal        1 XXXXXXXXXXX.+      FAILED 2020-08-13T10:58:29 2020-08-13T11:17:45  UNLIMITED
> >
> >
> >
> >
> >
> > Thank you!
> >
> >
> >
> > Anthony
> >
> >
> >
> >
> >
> > --
> > Prentice Bisbal
> > Lead Software Engineer
> > Research Computing
> > Princeton Plasma Physics Laboratory
> > http://www.pppl.gov
> >

> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit https://beowulf.org/cgi-bin/mailman/listinfo/beowulf


-- 
Skylar

