[Beowulf] SIMD exception kernel panic on Skylake-EP triggered by OpenFOAM?
Joe Landman
joe.landman at gmail.com
Sun Sep 9 18:16:56 PDT 2018
I haven't seen this one before, but after looking around a bit, I wonder
whether the code path hit a denormal or underflow condition in a SIMD
instruction without the appropriate SIMD exception mask set. See
https://software.intel.com/en-us/articles/x87-and-sse-floating-point-assists-in-ia-32-flush-to-zero-ftz-and-denormals-are-zero-daz
for background.
Basically, if a SIMD floating-point exception is raised, and it is
neither masked in MXCSR (or avoided by enabling FTZ/DAZ) nor caught by a
registered exception handler, it could wind up somewhere like this.
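
To make the masking and FTZ/DAZ idea concrete, here is a minimal sketch in
C, assuming an x86-64 build with GCC or Clang and the standard
<xmmintrin.h> intrinsics. The function names are hypothetical helpers of
mine, not anything taken from OpenFOAM or the user's binary. It checks
whether all six SSE exception mask bits are set in MXCSR, and turns on
flush-to-zero and denormals-are-zero for the calling thread.

```c
#include <float.h>      /* FLT_MIN */
#include <xmmintrin.h>  /* _mm_getcsr, _mm_setcsr, _MM_MASK_MASK, _MM_SET_FLUSH_ZERO_MODE */

/* Hypothetical helper: nonzero if all six SSE exception mask bits
 * (MXCSR bits 7-12) are set, i.e. no SIMD FP exception can trap. */
static int simd_exceptions_all_masked(void) {
    return (_mm_getcsr() & _MM_MASK_MASK) == _MM_MASK_MASK;
}

/* Hypothetical helper: enable FTZ (MXCSR bit 15) and DAZ (MXCSR bit 6)
 * for the calling thread. MXCSR is per-thread state, so every worker
 * thread would need to do this for itself. */
static void enable_ftz_daz(void) {
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    _mm_setcsr(_mm_getcsr() | 0x0040u);  /* DAZ bit, set directly */
}

/* Multiplies the smallest normal float by 0.3, which yields a denormal
 * result unless FTZ is on. volatile keeps the compiler from folding
 * the product at compile time. */
static float tiny_product(void) {
    volatile float smallest_normal = FLT_MIN;
    volatile float scale = 0.3f;
    return smallest_normal * scale;
}
```

Calling tiny_product() before and after enable_ftz_daz() should show a
small nonzero denormal turning into exactly zero, while
simd_exceptions_all_masked() stays true throughout. Whether the failing
binary changes MXCSR anywhere is something I can't tell from the trace, so
treat this only as an illustration of the mechanism.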
If you have dumps from the crash, you could load them up in the
debugger. That would be the most accurate route to determining why the
exception was triggered.
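
Before opening the dumps, one cheap sanity check is to subtract the
reported "Kernel Offset" from each RIP in the traces quoted below, to see
whether both panics land on the same unrelocated kernel address. A small
sketch in C follows; the helper and constant names are hypothetical, and
the values are copied from the two console logs.

```c
#include <stdint.h>

/* Hypothetical helper: remove the KASLR relocation so the address can
 * be compared across boots or looked up against the kernel image. */
static uint64_t unrelocated_address(uint64_t runtime_addr,
                                    uint64_t kernel_offset) {
    return runtime_addr - kernel_offset;
}

/* RIP and "Kernel Offset" from the 2018-09-09 trace. */
static const uint64_t RIP_SEP09    = 0xffffffffbe121791ULL;
static const uint64_t OFFSET_SEP09 = 0x3ca00000ULL;

/* RIP and "Kernel Offset" from the 2018-09-07 trace. */
static const uint64_t RIP_SEP07    = 0xffffffffa6721791ULL;
static const uint64_t OFFSET_SEP07 = 0x25000000ULL;
```

Both pairs reduce to 0xffffffff81721791, which matches the fact that both
traces report apic_timer_interrupt+0x141. That single address is the one
to look up once the dump and a matching kernel image are loaded.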
On 09/09/2018 09:04 PM, Christopher Samuel wrote:
> Hi folks,
>
> We've had 2 different nodes crash over the past few days with kernel
> panics triggered by (what is recorded as) a "simd exception" (console
> messages below). In both cases the triggering application is given as
> the same binary, a user application built against OpenFOAM v16.06.
>
> This doesn't happen every time; I can see about 28 successful runs of
> the application this month (the binary was built at the end of August).
>
> The system in question has 2 x 16C Xeon Gold 6140 Skylake-EP CPUs.
>
> Any ideas?
>
>
> ------------------8< snip snip 8<------------------
>
> 2018-09-09 17:14:34 [179203.697285] simd exception: 0000 [#1] SMP
> 2018-09-09 17:14:34 [179203.701527] Modules linked in: squashfs loop
> 8021q garp mrp stp llc nvidia_uvm(POE) nvidia(POE) xfs skx_edac
> intel_powerclamp coretemp intel_rapl iosf_mbi irqbypass crc32_pclmul
> ghash_clmulni_intel iTCO_wdt iTCO_vendor_support rdma_ucm ib_ucm
> dcdbas aesni_intel mgag200 lrw gf128mul glue_helper ablk_helper ttm
> ib_uverbs cryptd drm_kms_helper dm_mod syscopyarea sysfillrect ib_umad
> sysimgblt fb_sys_fops drm mei_me sg ipmi_si mei lpc_ich i2c_i801
> shpchp nfit ipmi_devintf ipmi_msghandler libnvdimm tpm_crb acpi_pad
> acpi_power_meter binfmt_misc overlay(OET) osc(OE) mgc(OE) lustre(OE)
> lmv(OE) fld(OE) mdc(OE) fid(OE) lov(OE) ko2iblnd(OE) rdma_cm iw_cm
> ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) ib_ipoib ib_cm sr_mod
> cdrom sd_mod crc_t10dif crct10dif_generic hfi1 rdmavt i2c_algo_bit
> i2c_core ahci crct10dif_pclmul crct10dif_common crc32c_intel libahci
> ib_core libata megaraid_sas pps_core libcrc32c [last unloaded: pcspkr]
> 2018-09-09 17:14:34 [179203.784359] CPU: 2 PID: 159455 Comm:
> shuangTwoPhaseE Tainted: P OE ------------ T
> 3.10.0-862.9.1.el7.x86_64 #1
> 2018-09-09 17:14:34 [179203.795389] Hardware name: Dell Inc. PowerEdge
> R740/06G98X, BIOS 1.4.8 05/21/2018
> 2018-09-09 17:14:34 [179203.802958] task: ffff995c1aee8fd0 ti:
> ffff995c1988c000 task.ti: ffff995c1988c000
> 2018-09-09 17:14:34 [179203.810539] RIP: 0010:[<ffffffffbe121791>]
> [<ffffffffbe121791>] apic_timer_interrupt+0x141/0x170
> 2018-09-09 17:14:34 [179203.819515] RSP: 0000:ffff995c1da46200 EFLAGS:
> 00010082
> 2018-09-09 17:14:34 [179203.824928] RAX: ffff995c1988ff70 RBX:
> 0000000001a95e00 RCX: 0000000000000090
> 2018-09-09 17:14:34 [179203.832146] RDX: 0000000000000000 RSI:
> ffff995c1da46200 RDI: ffff995c1988ff70
> 2018-09-09 17:14:34 [179203.839364] RBP: 00007ffd8b8ba848 R08:
> 0000000000000c40 R09: 0000000000000031
> 2018-09-09 17:14:34 [179203.846591] R10: 0000000000000000 R11:
> 0000000000e72148 R12: 0000000001c4e770
> 2018-09-09 17:14:34 [179203.853827] R13: 0000000000000007 R14:
> 00000000011935b0 R15: 0000000000000038
> 2018-09-09 17:14:34 [179203.861040] FS: 00002ad83f7afa00(0000)
> GS:ffff995c1da40000(0000) knlGS:0000000000000000
> 2018-09-09 17:14:34 [179203.869213] CS: 0010 DS: 0000 ES: 0000 CR0:
> 0000000080050033
> 2018-09-09 17:14:34 [179203.875042] CR2: 0000000002a18000 CR3:
> 00000017963f8000 CR4: 00000000007607e0
> 2018-09-09 17:14:34 [179203.882274] DR0: 0000000000000000 DR1:
> 0000000000000000 DR2: 0000000000000000
> 2018-09-09 17:14:34 [179203.889495] DR3: 0000000000000000 DR6:
> 00000000fffe0ff0 DR7: 0000000000000400
> 2018-09-09 17:14:34 [179203.896714] PKRU: 55555554
> 2018-09-09 17:14:34 [179203.899530] Call Trace:
> 2018-09-09 17:14:34 [179203.902065] Code: 48 39 cc 77 2f 48 8d 81 00
> fe ff ff 48 39 e0 77 23 57 48 29 e1 65 48 8b 3c 25 78 0e 01 00 48 83
> c7 28 48 29 cf 48 89 f8 48 89 e6 <f3> a4 48 89 c4 5f 48 89 e6 65 ff 04
> 25 60 0e 01 00 65 48 0f 44
> 2018-09-09 17:14:34 [179203.922628] RIP [<ffffffffbe121791>]
> apic_timer_interrupt+0x141/0x170
> 2018-09-09 17:14:34 [179203.929259] RSP <ffff995c1da46200>
> 2018-09-09 17:14:34 [179203.933970] ---[ end trace 3912e5e8b3b86da4 ]---
> 2018-09-09 17:14:34 [179203.984039] Kernel panic - not syncing: Fatal
> exception
> 2018-09-09 17:14:34 [179203.989451] Kernel Offset: 0x3ca00000 from
> 0xffffffff81000000 (relocation range:
> 0xffffffff80000000-0xffffffffbfffffff)
>
> ------------------8< snip snip 8<------------------
>
> ------------------8< snip snip 8<------------------
>
> 2018-09-07 22:37:16 [201527.171417] simd exception: 0000 [#1] SMP
> 2018-09-07 22:37:16 [201527.176270] Modules linked in: squashfs loop
> 8021q garp mrp stp llc nvidia_uvm(POE) nvidia(POE) xfs skx_edac
> intel_powerclamp coretemp intel_rapl iosf_mbi mgag200 ttm
> drm_kms_helper irqbypass syscopyarea sysfillrect crc32_pclmul
> sysimgblt iTCO_wdt fb_sys_fops ib_ucm iTCO_vendor_support
> ghash_clmulni_intel rdma_ucm dm_mod dcdbas drm ib_uverbs aesni_intel
> lrw gf128mul glue_helper ablk_helper cryptd mei_me sg lpc_ich i2c_i801
> shpchp ib_umad mei ipmi_si ipmi_devintf ipmi_msghandler nfit libnvdimm
> tpm_crb acpi_pad acpi_power_meter binfmt_misc overlay(OET) osc(OE)
> mgc(OE) lustre(OE) lmv(OE) fld(OE) mdc(OE) fid(OE) lov(OE)
> ko2iblnd(OE) rdma_cm iw_cm ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE)
> ib_ipoib ib_cm sd_mod sr_mod cdrom crc_t10dif crct10dif_generic hfi1
> rdmavt i2c_algo_bit ahci i2c_core crct10dif_pclmul libahci
> crct10dif_common crc32c_intel ib_core libata megaraid_sas pps_core
> libcrc32c [last unloaded: pcspkr]
> 2018-09-07 22:37:16 [201527.264079] CPU: 17 PID: 32227 Comm:
> shuangTwoPhaseE Tainted: P W OE ------------ T
> 3.10.0-862.9.1.el7.x86_64 #1
> 2018-09-07 22:37:16 [201527.275789] Hardware name: Dell Inc. PowerEdge
> R740/06G98X, BIOS 1.4.8 05/21/2018
> 2018-09-07 22:37:16 [201527.284045] task: ffff9e345a42eeb0 ti:
> ffff9e2f88a0c000 task.ti: ffff9e2f88a0c000
> 2018-09-07 22:37:16 [201527.292302] RIP: 0010:[<ffffffffa6721791>]
> [<ffffffffa6721791>] apic_timer_interrupt+0x141/0x170
> 2018-09-07 22:37:16 [201527.301978] RSP: 0000:ffff9e345c006200 EFLAGS:
> 00010082
> 2018-09-07 22:37:16 [201527.308091] RAX: ffff9e2f88a0ff70 RBX:
> 00007fffe0d21d78 RCX: 0000000000000090
> 2018-09-07 22:37:16 [201527.316032] RDX: 0000000000000000 RSI:
> ffff9e345c006200 RDI: ffff9e2f88a0ff70
> 2018-09-07 22:37:16 [201527.323969] RBP: 00007fffe0d21d78 R08:
> 000000000001e800 R09: 00000000000007a0
> 2018-09-07 22:37:16 [201527.331906] R10: 0000000000000000 R11:
> 0000000002818868 R12: 0000000002d20790
> 2018-09-07 22:37:16 [201527.339839] R13: 00007fffe0d159d0 R14:
> 00007fffe0d15b40 R15: 00007fffe0d15a20
> 2018-09-07 22:37:16 [201527.347772] FS: 00002b835d26da00(0000)
> GS:ffff9e345c000000(0000) knlGS:0000000000000000
> 2018-09-07 22:37:16 [201527.356659] CS: 0010 DS: 0000 ES: 0000 CR0:
> 0000000080050033
> 2018-09-07 22:37:16 [201527.363209] CR2: 0000000003a6ff88 CR3:
> 0000002fdd8f6000 CR4: 00000000007607e0
> 2018-09-07 22:37:16 [201527.371144] DR0: 0000000000000000 DR1:
> 0000000000000000 DR2: 0000000000000000
> 2018-09-07 22:37:16 [201527.379079] DR3: 0000000000000000 DR6:
> 00000000fffe0ff0 DR7: 0000000000000400
> 2018-09-07 22:37:16 [201527.387010] PKRU: 55555554
> 2018-09-07 22:37:16 [201527.390523] Call Trace:
> 2018-09-07 22:37:16 [201527.393780] Code: 48 39 cc 77 2f 48 8d 81 00
> fe ff ff 48 39 e0 77 23 57 48 29 e1 65 48 8b 3c 25 78 0e 01 00 48 83
> c7 28 48 29 cf 48 89 f8 48 89 e6 <f3> a4 48 89 c4 5f 48 89 e6 65 ff 04
> 25 60 0e 01 00 65 48 0f 44
> 2018-09-07 22:37:16 [201527.415810] RIP [<ffffffffa6721791>]
> apic_timer_interrupt+0x141/0x170
> 2018-09-07 22:37:16 [201527.423189] RSP <ffff9e345c006200>
> 2018-09-07 22:37:16 [201527.428646] ---[ end trace a6a14aed798e889f ]---
> 2018-09-07 22:37:17 [201527.477875] Kernel panic - not syncing: Fatal
> exception
> 2018-09-07 22:37:17 [201527.484041] Kernel Offset: 0x25000000 from
> 0xffffffff81000000 (relocation range:
> 0xffffffff80000000-0xffffffffbfffffff)
>
> ------------------8< snip snip 8<------------------
>
>
> All the best!
> Chris
--
Joe Landman
e: joe.landman at gmail.com
t: @hpcjoe
w: https://scalability.org
g: https://github.com/joelandman
l: https://www.linkedin.com/in/joelandman