[Beowulf] [External] SLURM - Where is this exit status coming from?
Skylar Thompson
skylar.thompson at gmail.com
Thu Aug 13 14:41:06 PDT 2020
Hmm, apparently math is hard today. I of course meant 2^7, not 2^8.
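
Since the question further down is how WEXITSTATUS 137, the slurmd "status
35072", and the 128+signal convention fit together, here is a minimal sketch
(mine, not part of the original messages) of how the POSIX <sys/wait.h> macros
separate "exited with a code" from "killed by a signal". The raw value 35072 is
taken from the slurmd.log quoted below; everything else is just illustration.

/*
 * Minimal sketch (not from the original thread): decode a raw wait status
 * with the POSIX <sys/wait.h> macros.  35072 is the value reported by
 * slurmd in the quoted logs; 9 is shown for comparison.
 */
#include <stdio.h>
#include <sys/wait.h>

static void decode(int status)
{
    if (WIFEXITED(status))
        printf("status %d: exited with code %d\n", status, WEXITSTATUS(status));
    else if (WIFSIGNALED(status))
        printf("status %d: terminated by signal %d\n", status, WTERMSIG(status));
}

int main(void)
{
    decode(35072);   /* 137 << 8: normal exit, exit code 137              */
    decode(9);       /* signal in the low bits: terminated by SIGKILL (9) */
    return 0;
}

On Linux this prints "exited with code 137" and "terminated by signal 9", which
lines up with the quoted messages below: a child of the batch script gets
SIGKILLed, the script's shell exits with 128 + 9 = 137, and slurmd reports the
raw wait status 137 << 8 = 35072.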
On Thu, Aug 13, 2020 at 02:37:46PM -0700, Skylar Thompson wrote:
> I think this is an artifact of the job process running as a child of the
> job script: POSIX puts the terminating signal in the low-order bits of the
> child's wait status, and the shell surfaces it as 128 plus the signal number.
>
> As others noted, 137 is 2^8+9, where 9 is SIGKILL (exceeding memory, also
> exceeding the runtime request at least in the Grid Engine world).
>
> On Thu, Aug 13, 2020 at 02:24:49PM -0700, Alex Chekholko via Beowulf wrote:
> > This may be a "cargo cult" answer from old SGE days but IIRC "137" was
> > "128+9" and it means the process got signal 9 which means _something_ sent
> > it a SIGKILL.
> >
> > On Thu, Aug 13, 2020 at 2:22 PM Prentice Bisbal via Beowulf <
> > beowulf at beowulf.org> wrote:
> >
> > > I think you dialed the wrong number. We're the Beowulf people! Although,
> > > I'm sure we can still help you. ;)
> > >
> > > --
> > > Prentice
> > > On 8/13/20 4:14 PM, Altemara, Anthony wrote:
> > >
> > > Cheers SLURM people,
> > >
> > >
> > >
> > > We’re seeing some intermittent job failures in our SLURM cluster, all with
> > > the same 137 exit code. I’m having difficulty in determining whether this
> > > error code is coming from SLURM (timeout?) or the Linux OS (process killed,
> > > maybe memory).
> > >
> > >
> > >
> > > In this example, there's WEXITSTATUS 137 in slurmctld.log, error:0
> > > status 35072 in slurmd.log, and ExitCode 9:0 in the accounting log…
> > >
> > >
> > >
> > > Does anyone have insight into how all these correlate? I've spent a
> > > significant amount of time digging through the documentation, and I don't
> > > see a clear explanation of how to interpret them…
> > >
> > >
> > >
> > >
> > >
> > > Example: Job: 62791
> > >
> > >
> > >
> > > [root@XXXXXXXXXXXXX] /var/log/slurm# grep -ai jobid=62791 slurmctld.log
> > >
> > > [2020-08-13T10:58:28.599] _slurm_rpc_submit_batch_job: JobId=62791 InitPrio=4294845347 usec=679
> > >
> > > [2020-08-13T10:58:29.080] sched: Allocate JobId=62791 NodeList=XXXXXXXXXXXXX #CPUs=1 Partition=normal
> > >
> > > [2020-08-13T11:17:45.275] _job_complete: JobId=62791 WEXITSTATUS 137
> > >
> > > [2020-08-13T11:17:45.294] _job_complete: JobId=62791 done
> > >
> > >
> > >
> > >
> > >
> > > [root@XXXXXXXXXXXXX] /var/log/slurm# grep 62791 slurmd.log
> > >
> > > [2020-08-13T10:58:29.090] _run_prolog: prolog with lock for job 62791 ran for 0 seconds
> > >
> > > [2020-08-13T10:58:29.090] Launching batch job 62791 for UID 847694
> > >
> > > [2020-08-13T11:17:45.280] [62791.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 35072
> > >
> > > [2020-08-13T11:17:45.405] [62791.batch] done with job
> > >
> > >
> > >
> > >
> > >
> > > [root@XXXXXXXXXXXXX] /var/log/slurm# sacct -j 62791
> > >
> > >        JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
> > > ------------ ---------- ---------- ---------- ---------- ---------- --------
> > >        62791 nf-normal+     normal     (null)          0     FAILED      9:0
> > >
> > >
> > >
> > > [root@XXXXXXXXXXXXX] /var/log/slurm# sacct -lc | tail -n 100 | grep 62791
> > >
> > > JobID   UID      JobName     Partition  NNodes  NodeList       State   Start                End                  Timelimit
> > > 62791   847694   nf-normal+  normal     1       XXXXXXXXXXX.+  FAILED  2020-08-13T10:58:29  2020-08-13T11:17:45  UNLIMITED
> > >
> > >
> > >
> > >
> > >
> > > Thank you!
> > >
> > >
> > >
> > > Anthony
> > >
> > >
> > >
> > >
> > >
> > > --
> > > Prentice Bisbal
> > > Lead Software Engineer
> > > Research Computing
> > > Princeton Plasma Physics Laboratory
> > > http://www.pppl.gov
> > >
> > > _______________________________________________
> > > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> > > To change your subscription (digest mode or unsubscribe) visit
> > > https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
> > >
>
>
> --
> Skylar
--
Skylar