[Beowulf] SIMD exception kernel panic on Skylake-EP triggered by OpenFOAM?

Christopher Samuel chris at csamuel.org
Tue Sep 25 18:42:09 PDT 2018


On 10/09/18 11:16, Joe Landman wrote:

> If you have dumps from the crash, you could load them up in the
> debugger.  Would be the most accurate route to determine why that was
> triggered.

Thanks Joe, after a bit of experimentation we've now successfully got a 
crash dump. It seems to confirm what I thought was the case, in that the
process is off in kernel space dealing with an APIC interrupt (a timer
in this case) when a SIMD exception gets raised.

crash> bt
PID: 138341  TASK: ffff9fd7eb3c6eb0  CPU: 27  COMMAND: "shuangTwoPhaseE"
  #0 [ffff9ff02ee6bc38] machine_kexec at ffffffff938629da
  #1 [ffff9ff02ee6bc98] __crash_kexec at ffffffff93916692
  #2 [ffff9ff02ee6bd68] crash_kexec at ffffffff93916780
  #3 [ffff9ff02ee6bd80] oops_end at ffffffff93f1d738
  #4 [ffff9ff02ee6bda8] die at ffffffff9382f96b
  #5 [ffff9ff02ee6bdd8] math_error at ffffffff9382cca8
  #6 [ffff9ff02ee6be98] do_simd_coprocessor_error at ffffffff9382cec8
  #7 [ffff9ff02ee6bec0] simd_coprocessor_error at ffffffff93f28c9e
  #8 [ffff9ff02ee6bf48] apic_timer_interrupt at ffffffff93f26791
     RIP: 00002b1b5d406828  RSP: 00007fff1f596148  RFLAGS: 00000293
     RAX: 00000000000005c8  RBX: 0000000000002bce  RCX: 0000000002c979e0
     RDX: 00000000000005cb  RSI: 0000000002dcedf0  RDI: 00000000000000b9
     RBP: 00007fff1f5a25d8   R8: 0000000000002d00   R9: 00000000000000b4
     R10: 0000000000000000  R11: 00000000026bcb48  R12: ffff9ff05c1461e8
     R13: 0000000000000000  R14: ffff9ff05c146200  R15: 0000000000010082
     ORIG_RAX: ffffffffffffff10  CS: 0033  SS: 002b
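
As an aside, the register frame under apic_timer_interrupt can be
sanity-checked by hand. The sketch below is only an illustration, not
something crash produces. It assumes the usual x86 conventions: the low
two bits of CS hold the privilege level, and bit 9 of RFLAGS is the
interrupt-enable flag. The helper names are made up for the example.

```python
# Values copied from the register frame in the backtrace above.
SAVED_CS = 0x0033
SAVED_RFLAGS = 0x00000293

RPL_MASK = 0x3        # low two bits of a segment selector: privilege level
RFLAGS_IF = 1 << 9    # interrupt-enable flag


def privilege_level(cs):
    """Return the privilege level (0 = kernel, 3 = user) encoded in CS."""
    return cs & RPL_MASK


def interrupts_enabled(rflags):
    """Return True if the IF bit was set in the saved RFLAGS."""
    return bool(rflags & RFLAGS_IF)


print("privilege level:", privilege_level(SAVED_CS))
print("interrupts enabled:", interrupts_enabled(SAVED_RFLAGS))
```

A privilege level of 3 here would mean the timer interrupt arrived
while the process was running user code, which fits the picture of the
SIMD exception then being raised while the kernel was busy handling
that interrupt.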

The kernel code that handles this is pretty short; in the RHEL7 kernel
it basically comes down to:

Are we in user space?
No?  Oh dear.
Is there a fixup registered for this address?
No?  OK, goodbye cruel world...
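
To make that decision flow concrete, here is a minimal sketch of it as
a toy Python model. This is not the kernel source: the Regs class, the
fixup table and the returned strings are all hypothetical stand-ins,
and the user-space outcome (a SIGFPE delivered to the process) is the
usual Linux behaviour rather than something spelled out above.

```python
from dataclasses import dataclass

USER_RPL = 3  # x86 privilege level used for user space


@dataclass
class Regs:
    """Hypothetical stand-in for the saved register frame."""
    cs: int   # code segment selector when the exception was raised
    rip: int  # instruction pointer when the exception was raised


def in_user_mode(regs):
    # The low two bits of CS hold the privilege level; 3 means user space.
    return (regs.cs & 0x3) == USER_RPL


def handle_simd_exception(regs, fixups):
    """Model the three possible outcomes for a SIMD exception."""
    if in_user_mode(regs):
        # Are we in user space? Yes: the process just gets a signal.
        return "SIGFPE"
    if regs.rip in fixups:
        # Is there a fixup registered for this address? Yes: resume there.
        regs.rip = fixups[regs.rip]
        return "fixup"
    # Kernel space and no fixup: goodbye cruel world.
    return "die"
```

With made-up addresses, handle_simd_exception(Regs(cs=0x10, rip=0x1000), {})
returns "die", which is the path the backtrace shows being taken
(math_error calling die).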

I've reached out to the maintainers of the arch/x86/ part of the tree
in case they had any general ideas on whether this was all the kernel
could be expected to do.  The only feedback so far is that yes, this is
odd, plus a query to another developer about whether some additional
checks that are done when the process is in user space might also be
applicable if that process has called into the kernel at that point.

My suspicion is that the process is off doing some AVX stuff when
the timer interrupt occurs and an exception is either generated or
just happens to be delivered from the AVX unit at a bad time.

Going to see if I can persuade EasyBuild to compile OpenFOAM without
AVX-512 optimisations first and then, if that doesn't fix it, try
turning off different things until the problem goes away.

All the best,
Chris
-- 
  Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC

