[Beowulf] SIMD exception kernel panic on Skylake-EP triggered by OpenFOAM?
chris at csamuel.org
Tue Sep 25 18:42:09 PDT 2018
On 10/09/18 11:16, Joe Landman wrote:
> If you have dumps from the crash, you could load them up in the
> debugger. Would be the most accurate route to determine why that was
Thanks Joe, after a bit of experimentation we've now successfully got a
crash dump. It seems to confirm what I thought was the case, in that the
process is off in kernel space dealing with an APIC interrupt (a timer
in this case) when a SIMD exception gets raised.
PID: 138341 TASK: ffff9fd7eb3c6eb0 CPU: 27 COMMAND: "shuangTwoPhaseE"
#0 [ffff9ff02ee6bc38] machine_kexec at ffffffff938629da
#1 [ffff9ff02ee6bc98] __crash_kexec at ffffffff93916692
#2 [ffff9ff02ee6bd68] crash_kexec at ffffffff93916780
#3 [ffff9ff02ee6bd80] oops_end at ffffffff93f1d738
#4 [ffff9ff02ee6bda8] die at ffffffff9382f96b
#5 [ffff9ff02ee6bdd8] math_error at ffffffff9382cca8
#6 [ffff9ff02ee6be98] do_simd_coprocessor_error at ffffffff9382cec8
#7 [ffff9ff02ee6bec0] simd_coprocessor_error at ffffffff93f28c9e
#8 [ffff9ff02ee6bf48] apic_timer_interrupt at ffffffff93f26791
RIP: 00002b1b5d406828 RSP: 00007fff1f596148 RFLAGS: 00000293
RAX: 00000000000005c8 RBX: 0000000000002bce RCX: 0000000002c979e0
RDX: 00000000000005cb RSI: 0000000002dcedf0 RDI: 00000000000000b9
RBP: 00007fff1f5a25d8 R8: 0000000000002d00 R9: 00000000000000b4
R10: 0000000000000000 R11: 00000000026bcb48 R12: ffff9ff05c1461e8
R13: 0000000000000000 R14: ffff9ff05c146200 R15: 0000000000010082
ORIG_RAX: ffffffffffffff10 CS: 0033 SS: 002b
The kernel code that handles this is pretty short; in the RHEL7 kernel
it basically comes down to:
Are we in user space?
No? Oh dear.
Is there a fixup registered for this address?
No? OK, goodbye cruel world...
I've reached out to the maintainers of the arch/x86/ part of the tree
in case they had any general ideas on whether this was all the kernel
could be expected to do. Only feedback so far is that yes this is odd,
and a query to another developer regarding whether some additional
checks that are done for when the process is in user space might be
applicable if that process has called into the kernel at that point.
My suspicion is that the process is off doing some AVX stuff when
the timer interrupt occurs, and an exception is either generated or just
happens to be delivered from the AVX unit at a bad time.
Going to see if I can persuade EasyBuild to compile OpenFOAM without
AVX-512 optimisations first and then, if that doesn't fix it, try
turning off different things until the problem goes away.
All the best,
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC