[Beowulf] [External] SLURM - Where is this exit status coming from?

Prentice Bisbal pbisbal at pppl.gov
Thu Aug 13 14:20:57 PDT 2020


I think you dialed the wrong number. We're the Beowulf people! Although 
I'm sure we can still help you. ;)
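
For what it's worth, those numbers look like three views of the same exit
status. Here's a minimal decoding sketch in C, assuming the "status 35072"
in slurmd.log is a raw POSIX wait status; the 35072 and 137 come from your
logs, everything else below is just illustration:

    #include <stdio.h>
    #include <sys/wait.h>

    int main(void)
    {
        int raw = 35072;  /* raw wait status, as reported by slurmd */

        if (WIFEXITED(raw)) {
            int code = WEXITSTATUS(raw);   /* 35072 >> 8 == 137 */
            printf("batch script exited with code %d\n", code);
            if (code > 128)
                /* shell convention: 128 + N means a child was killed by signal N */
                printf("that is 128 + %d, so probably killed by signal %d (SIGKILL == 9)\n",
                       code - 128, code - 128);
        } else if (WIFSIGNALED(raw)) {
            printf("killed by signal %d\n", WTERMSIG(raw));
        }
        return 0;
    }

So slurmd's raw status 35072 and slurmctld's WEXITSTATUS 137 are the same
event, and 137 follows the usual 128 + N convention for a process killed by
signal N, here SIGKILL (9). That kind of SIGKILL often points at the kernel
OOM killer or at Slurm enforcing a memory or time limit on the node. I'd
guess the 9 in sacct's ExitCode column is recording that same signal 9, but
check the sacct man page for exactly how it encodes the code:signal pair.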

--
Prentice

On 8/13/20 4:14 PM, Altemara, Anthony wrote:
>
> Cheers SLURM people,
>
> We’re seeing some intermittent job failures in our SLURM cluster, all 
> with the same 137 exit code. I’m having difficulty determining 
> whether this exit code is coming from SLURM (a timeout?) or from the 
> Linux OS (process killed, maybe for memory).
>
> In this example, there’s WEXITSTATUS 137 in slurmctld.log, "error:0 
> status 35072" in slurmd.log, and ExitCode 9:0 in the accounting log…?
>
> Does anyone have insight into how all of these correlate? I’ve spent a 
> significant amount of time digging through the documentation, and I 
> don’t see a clear way to interpret them…
>
> Example: Job: 62791
>
> [root@XXXXXXXXXXXXX]  /var/log/slurm# grep -ai jobid=62791 slurmctld.log
>
> [2020-08-13T10:58:28.599] _slurm_rpc_submit_batch_job: JobId=62791 
> InitPrio=4294845347 usec=679
>
> [2020-08-13T10:58:29.080] sched: Allocate JobId=62791 
> NodeList=XXXXXXXXXXXXX #CPUs=1 Partition=normal
>
> [2020-08-13T11:17:45.275] _job_complete: JobId=62791 WEXITSTATUS 137
>
> [2020-08-13T11:17:45.294] _job_complete: JobId=62791 done
>
> [root@XXXXXXXXXXXXX]  /var/log/slurm# grep 62791 slurmd.log
>
> [2020-08-13T10:58:29.090] _run_prolog: prolog with lock for job 62791 
> ran for 0 seconds
>
> [2020-08-13T10:58:29.090] Launching batch job 62791 for UID 847694
>
> [2020-08-13T11:17:45.280] [62791.batch] sending 
> REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 35072
>
> [2020-08-13T11:17:45.405] [62791.batch] done with job
>
> [root@XXXXXXXXXXXXX]  /var/log/slurm# sacct -j 62791
>
> JobID         JobName     Partition   Account     AllocCPUS   State     ExitCode
> ------------  ----------  ----------  ----------  ----------  --------  --------
> 62791         nf-normal+  normal      (null)      0           FAILED    9:0
>
> [root@XXXXXXXXXXXXX]  /var/log/slurm# sacct -lc | tail -n 100 | grep 62791
>
> JobID   UID     JobName     Partition  NNodes  NodeList       State   Start                End                  Timelimit
> 62791   847694  nf-normal+  normal     1       XXXXXXXXXXX.+  FAILED  2020-08-13T10:58:29  2020-08-13T11:17:45  UNLIMITED
>
> Thank you!
>
> Anthony
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit https://beowulf.org/cgi-bin/mailman/listinfo/beowulf

-- 
Prentice Bisbal
Lead Software Engineer
Research Computing
Princeton Plasma Physics Laboratory
http://www.pppl.gov
