[Beowulf] [External] SLURM - Where is this exit status coming from?

Alex Chekholko alex at calicolabs.com
Thu Aug 13 14:24:49 PDT 2020


This may be a "cargo cult" answer from old SGE days, but IIRC "137" is
"128 + 9": the process got signal 9, i.e. _something_ sent it a SIGKILL.
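For the curious, the arithmetic tying the three numbers together can be checked with the POSIX wait-status macros. A minimal Python sketch, using the raw status 35072 and exit code 137 from the logs quoted below (the mapping of sacct's 9:0 back to a signal is my inference from the 128+N convention, not something I've verified in the Slurm source):

```python
import os
import signal

# Raw status reported by slurmd ("status 35072"). This is a POSIX wait
# status: the exit code sits in the high byte, signal info in the low byte.
raw = 35072

# Low byte is zero, so as far as wait() is concerned the process "exited"
# (the batch script exited; the signal happened inside it).
assert os.WIFEXITED(raw)

# High byte: 35072 >> 8 == 137, the WEXITSTATUS that slurmctld logs.
code = os.WEXITSTATUS(raw)
print(code)                            # 137

# Shell convention: exit code 128 + N means death by signal N.
sig = code - 128
print(sig, signal.Signals(sig).name)   # 9 SIGKILL
```

So all three numbers look like the same event in different encodings: slurmd logs the raw wait status (137 << 8 == 35072), slurmctld logs WEXITSTATUS (137), and the accounting record's 9:0 is consistent with 137 - 128 = signal 9, i.e. something SIGKILLed the job's process (the kernel OOM killer is a common culprit, which would fit the memory theory).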

On Thu, Aug 13, 2020 at 2:22 PM Prentice Bisbal via Beowulf <
beowulf at beowulf.org> wrote:

> I think you dialed the wrong number. We're the Beowulf people! Although,
> I'm sure we can still help you. ;)
>
> --
> Prentice
> On 8/13/20 4:14 PM, Altemara, Anthony wrote:
>
> Cheers SLURM people,
>
>
>
> We’re seeing some intermittent job failures in our SLURM cluster, all with
> the same 137 exit code. I’m having difficulty in determining whether this
> error code is coming from SLURM (timeout?) or the Linux OS (process killed,
> maybe memory).
>
>
>
> In this example, there’s WEXITSTATUS 137 in the slurmctld.log, "error:0
> status 35072" in the slurmd.log, and ExitCode 9:0 in the accounting log…???
>
>
>
> Does anyone have insight into how all these correlate? I’ve spent a
> significant amount of time digging through the documentation, and I don’t
> see a clear explanation of how to interpret them…
>
>
>
>
>
> Example: Job: 62791
>
>
>
> [root@XXXXXXXXXXXXX]  /var/log/slurm# grep -ai jobid=62791 slurmctld.log
>
> [2020-08-13T10:58:28.599] _slurm_rpc_submit_batch_job: JobId=62791 InitPrio=4294845347 usec=679
>
> [2020-08-13T10:58:29.080] sched: Allocate JobId=62791 NodeList=XXXXXXXXXXXXX #CPUs=1 Partition=normal
>
> [2020-08-13T11:17:45.275] _job_complete: JobId=62791 WEXITSTATUS 137
>
> [2020-08-13T11:17:45.294] _job_complete: JobId=62791 done
>
>
>
>
>
> [root@XXXXXXXXXXXXX]  /var/log/slurm# grep 62791 slurmd.log
>
> [2020-08-13T10:58:29.090] _run_prolog: prolog with lock for job 62791 ran for 0 seconds
>
> [2020-08-13T10:58:29.090] Launching batch job 62791 for UID 847694
>
> [2020-08-13T11:17:45.280] [62791.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 35072
>
> [2020-08-13T11:17:45.405] [62791.batch] done with job
>
>
>
>
>
> [root@XXXXXXXXXXXXX]  /var/log/slurm# sacct -j 62791
>
>        JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
> ------------ ---------- ---------- ---------- ---------- ---------- --------
> 62791        nf-normal+     normal     (null)          0     FAILED      9:0
>
>
>
> [root@XXXXXXXXXXXXX]  /var/log/slurm# sacct -lc | tail -n 100 | grep 62791
>
> JobID    UID    JobName  Partition   NNodes        NodeList      State               Start                 End  Timelimit
> 62791        847694 nf-normal+     normal        1 XXXXXXXXXXX.+    FAILED 2020-08-13T10:58:29 2020-08-13T11:17:45  UNLIMITED
>
>
>
>
>
> Thank you!
>
>
>
> Anthony
>
>
>
>
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
>
> --
> Prentice Bisbal
> Lead Software Engineer
> Research Computing
> Princeton Plasma Physics Laboratory
> http://www.pppl.gov
>
>