Cheers SLURM people,

We're seeing intermittent job failures in our SLURM cluster, all with the same exit code, 137. I'm having difficulty determining whether this exit code comes from SLURM (a timeout?) or from the Linux OS (the process was killed, maybe for memory).

In this example, slurmctld.log shows WEXITSTATUS 137, slurmd.log shows error:0 status 35072, and the accounting log shows ExitCode 9:0.

Does anyone have insight into how all these correlate? I've spent a significant amount of time digging through the documentation, and I don't see a clear way to interpret them.

Example: Job: 62791

[root@XXXXXXXXXXXXX] /var/log/slurm# grep -ai jobid=62791 slurmctld.log
[2020-08-13T10:58:28.599] _slurm_rpc_submit_batch_job: JobId=62791 InitPrio=4294845347 usec=679
[2020-08-13T10:58:29.080] sched: Allocate JobId=62791 NodeList= XXXXXXXXXXXXX #CPUs=1 Partition=normal
[2020-08-13T11:17:45.275] _job_complete: JobId=62791 WEXITSTATUS 137
[2020-08-13T11:17:45.294] _job_complete: JobId=62791 done

[root@XXXXXXXXXXXXX] /var/log/slurm# grep 62791 slurmd.log
[2020-08-13T10:58:29.090] _run_prolog: prolog with lock for job 62791 ran for 0 seconds
[2020-08-13T10:58:29.090] Launching batch job 62791 for UID 847694
[2020-08-13T11:17:45.280] [62791.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 35072
[2020-08-13T11:17:45.405] [62791.batch] done with job

[root@XXXXXXXXXXXXX] /var/log/slurm# sacct -j 62791
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
62791        nf-normal+     normal     (null)          0     FAILED      9:0

[root@XXXXXXXXXXXXX] /var/log/slurm# sacct -lc | tail -n 100 | grep 62791
JobID    UID    JobName  Partition   NNodes        NodeList      State               Start                 End  Timelimit
62791        847694 nf-normal+     normal        1 XXXXXXXXXXX.+     FAILED 2020-08-13T10:58:29 2020-08-13T11:17:45  UNLIMITED
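
One possible reading of these numbers, as a sketch only: if the 35072 in slurmd.log is a raw POSIX wait status (an assumption, not something confirmed from the SLURM documentation), its high byte is the exit code, 35072 >> 8 = 137, which matches the WEXITSTATUS in slurmctld.log. Under the common shell convention that an exit code of 128 + N means a child was killed by signal N, 137 would imply signal 9 (SIGKILL), which may be where the 9 in ExitCode 9:0 comes from. The helper name decode_wait_status is hypothetical, and the bit layout assumed is the one Linux uses.

```python
import signal


def decode_wait_status(status: int) -> dict:
    """Split a raw 16-bit wait status into its parts (Linux bit layout assumed)."""
    exit_code = (status >> 8) & 0xFF  # high byte: exit code of a normally exited process
    term_signal = status & 0x7F       # low 7 bits: signal that killed the process, 0 if none

    # Shell convention (assumption): exit code 128 + N means "killed by signal N".
    implied_signal = exit_code - 128 if exit_code > 128 else 0
    try:
        implied_name = signal.Signals(implied_signal).name if implied_signal else None
    except ValueError:
        implied_name = None  # not a signal number known on this platform

    return {
        "exit_code": exit_code,
        "term_signal": term_signal,
        "implied_signal": implied_signal,
        "implied_name": implied_name,
    }


# Value taken from the slurmd.log line for job 62791.
print(decode_wait_status(35072))
```

If this reading holds, the batch script itself exited normally with code 137, and something inside it was killed with SIGKILL. That alone does not say who sent the signal (the kernel OOM killer, a cgroup memory limit, or something else), so the node's kernel log would still need checking.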

Thank you!


