<div dir="ltr">This may be a "cargo cult" answer from the old SGE days, but IIRC "137" is "128+9": the process got signal 9, which means _something_ sent it a SIGKILL.</div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Aug 13, 2020 at 2:22 PM Prentice Bisbal via Beowulf &lt;<a href="mailto:beowulf@beowulf.org">beowulf@beowulf.org</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div>
<p>I think you dialed the wrong number. We're the Beowulf people!
Although, I'm sure we can still help you. ;) <br>
</p>
<p>--<br>
Prentice<br>
</p>
<div>On 8/13/20 4:14 PM, Altemara, Anthony
wrote:<br>
</div>
<blockquote type="cite">
<div>
<p class="MsoNormal">Cheers SLURM people,<u></u><u></u></p>
<p class="MsoNormal"><u></u> <u></u></p>
<p class="MsoNormal">We’re seeing some intermittent job failures
in our SLURM cluster, all with the same 137 exit code. I’m
having difficulty in determining whether this error code is
coming from SLURM (timeout?) or the Linux OS (process killed,
maybe memory).<u></u><u></u></p>
<p class="MsoNormal"><u></u> <u></u></p>
        <p class="MsoNormal">In this example, there’s WEXITSTATUS 137 in
          slurmctld.log, "error:0 status 35072" in slurmd.log, and
          ExitCode 9:0 in the accounting log….???</p>
<p class="MsoNormal"><u></u> <u></u></p>
        <p class="MsoNormal">Does anyone have insight into how all
          these correlate? I’ve spent a significant amount of time
          digging through the documentation, and I don’t see a clear
          way to interpret all of these…</p>
<p class="MsoNormal"><u></u> <u></u></p>
<p class="MsoNormal"><u></u> <u></u></p>
<p class="MsoNormal">Example: Job: 62791<u></u><u></u></p>
<p class="MsoNormal"><u></u> <u></u></p>
<p class="MsoNormal">[root@XXXXXXXXXXXXX] /var/log/slurm# grep
-ai jobid=62791 slurmctld.log<u></u><u></u></p>
<p class="MsoNormal">[2020-08-13T10:58:28.599]
_slurm_rpc_submit_batch_job: JobId=62791 InitPrio=4294845347
usec=679<u></u><u></u></p>
<p class="MsoNormal">[2020-08-13T10:58:29.080] sched: Allocate
JobId=62791 NodeList= XXXXXXXXXXXXX #CPUs=1 Partition=normal<u></u><u></u></p>
<p class="MsoNormal">[2020-08-13T11:17:45.275] _job_complete:
JobId=62791 <span style="background:yellow">
WEXITSTATUS 137</span><u></u><u></u></p>
<p class="MsoNormal">[2020-08-13T11:17:45.294] _job_complete:
JobId=62791 done<u></u><u></u></p>
<p class="MsoNormal"><u></u> <u></u></p>
<p class="MsoNormal"><u></u> <u></u></p>
        <p class="MsoNormal">[root@XXXXXXXXXXXXX] /var/log/slurm# grep
          62791 slurmd.log</p>
<p class="MsoNormal">[2020-08-13T10:58:29.090] _run_prolog:
prolog with lock for job 62791 ran for 0 seconds<u></u><u></u></p>
<p class="MsoNormal">[2020-08-13T10:58:29.090] Launching batch
job 62791 for UID 847694<u></u><u></u></p>
<p class="MsoNormal">[2020-08-13T11:17:45.280] [62791.batch]
sending REQUEST_COMPLETE_BATCH_SCRIPT,
<span style="background:yellow">error:0
status 35072</span><u></u><u></u></p>
<p class="MsoNormal">[2020-08-13T11:17:45.405] [62791.batch]
done with job<u></u><u></u></p>
<p class="MsoNormal"><u></u> <u></u></p>
<p class="MsoNormal"><u></u> <u></u></p>
<p class="MsoNormal">[root@XXXXXXXXXXXXX] /var/log/slurm# sacct
-j 62791<u></u><u></u></p>
<pre>       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
62791        nf-normal+     normal     (null)          0     FAILED      <span style="background:yellow">9:0</span>
</pre>
<p class="MsoNormal"><u></u> <u></u></p>
<p class="MsoNormal">[root@XXXXXXXXXXXXX] /var/log/slurm# sacct
-lc | tail -n 100 | grep 62791<u></u><u></u></p>
<pre>JobID   UID      JobName      Partition   NNodes   NodeList        State    Start                 End                   Timelimit
62791   847694   nf-normal+   normal      1        XXXXXXXXXXX.+   FAILED   2020-08-13T10:58:29   2020-08-13T11:17:45   UNLIMITED
</pre>
<p class="MsoNormal"><u></u> <u></u></p>
<p class="MsoNormal"><u></u> <u></u></p>
<p class="MsoNormal">Thank you!<u></u><u></u></p>
<p class="MsoNormal"><u></u> <u></u></p>
        <p class="MsoNormal">Anthony</p>
<p class="MsoNormal"><u></u> <u></u></p>
</div>
<pre>_______________________________________________
Beowulf mailing list, <a href="mailto:Beowulf@beowulf.org" target="_blank">Beowulf@beowulf.org</a> sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit <a href="https://beowulf.org/cgi-bin/mailman/listinfo/beowulf" target="_blank">https://beowulf.org/cgi-bin/mailman/listinfo/beowulf</a>
</pre>
</blockquote>
<pre cols="72">--
Prentice Bisbal
Lead Software Engineer
Research Computing
Princeton Plasma Physics Laboratory
<a href="http://www.pppl.gov" target="_blank">http://www.pppl.gov</a></pre>
</div>
</blockquote></div>
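<div dir="ltr">To make the "128+9" arithmetic concrete, here is a minimal Python sketch that splits the "status 35072" from slurmd.log into an exit code and a terminating signal. It assumes the usual Linux wait-status layout (exit code in the second byte, signal in the low 7 bits), and the helper names decode_wait_status and implied_signal are made up for illustration; neither is part of Slurm.</div>

```python
import signal


def decode_wait_status(status):
    """Split a raw wait status, assuming the common Linux encoding."""
    term_signal = status & 0x7F        # low 7 bits: signal that killed the process, 0 if none
    exit_code = (status >> 8) & 0xFF   # second byte: exit code of a normally exited process
    return exit_code, term_signal


def implied_signal(exit_code):
    """Shell convention: an exit code of 128 + N reports a child killed by signal N."""
    if exit_code <= 128:
        return None
    try:
        return signal.Signals(exit_code - 128).name
    except ValueError:
        # 128 + N where N is not a known signal number on this platform
        return None


exit_code, term_signal = decode_wait_status(35072)
print(exit_code, term_signal)     # 137 0
print(implied_signal(exit_code))  # SIGKILL (on Linux)
```

<div dir="ltr">So 35072 is just 137 * 256: under that layout the batch script itself exited with code 137 rather than being signalled directly, and by shell convention that usually means something it ran was killed by signal 9. That would fit an out-of-memory kill, but the number alone doesn't say who sent the SIGKILL.</div>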