[Beowulf] SIMD exception kernel panic on Skylake-EP triggered by OpenFOAM?

Jonathan Engwall engwalljonathanthereal at gmail.com
Sun Sep 9 21:23:18 PDT 2018


In case it helps, there are a few similar bug reports, generally considered unreproducible. One thread calls it a bogus xcomp_bv: the kernel clobbers itself by writing zeroes when that is not the actual state. Spectre also came up. One suggestion is to disable IBRS, but according to other sources IBRS protects against Spectre and is dangerous to disable. Maybe OpenFOAM is to blame.
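
Before changing anything about IBRS, it may be worth looking at what the kernel itself reports about its Spectre mitigations. A minimal sketch, assuming the kernel exposes the standard sysfs directory /sys/devices/system/cpu/vulnerabilities (mainline 4.15 and later, plus distribution kernels that backported it). The function name read_mitigations is made up for illustration.

```python
from pathlib import Path

# Standard sysfs location; it is an assumption that the running kernel has it.
VULN_DIR = Path("/sys/devices/system/cpu/vulnerabilities")

def read_mitigations(vuln_dir=VULN_DIR):
    """Return {vulnerability name: kernel status string}, or {} if absent."""
    vuln_dir = Path(vuln_dir)
    if not vuln_dir.is_dir():
        return {}
    status = {}
    for entry in sorted(vuln_dir.iterdir()):
        if entry.is_file():
            # Each file holds one line, e.g. "Mitigation: ..." or "Vulnerable"
            status[entry.name] = entry.read_text().strip()
    return status

if __name__ == "__main__":
    for name, state in read_mitigations().items():
        print(f"{name}: {state}")
```

Running this on an affected node and on a healthy one would show whether they differ in mitigation state. It does not say anything about the cause of the panic.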

Something interesting about Spectre:
https://wiki.ubuntu.com/SecurityTeam/KnowledgeBase/SpectreAndMeltdown/MitigationControls

And something with a little similarity:
https://www.suse.com/support/kb/doc/?id=7017833

Unable to handle null pointer:
https://groups.google.com/forum/m/#!msg/linux.kernel/NQjqgvrJ18o/4DoP2nggAgAJ

And here is one involving nvidia. I see you have NICs, and it seems things went wrong with POE. Maybe this can help:
https://devtalk.nvidia.com/default/topic/972567/crash-in-centos-with-driver-319-76/
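
For reading the module list in the quoted console output: the letters in parentheses after a module name, such as (POE) or (OE), are the kernel's per-module taint flags. A small sketch that decodes them follows. The meanings of P, O and E are the mainline ones; the meaning given for T is an assumption (a RHEL-specific "technology preview" flag), and flagged_modules is a hypothetical helper name.

```python
import re

# Per-module taint letters as printed in "Modules linked in:".
# P, O and E follow mainline; T is ASSUMED to be RHEL's tech-preview flag.
MODULE_FLAGS = {
    "P": "proprietary module",
    "O": "out-of-tree module",
    "E": "unsigned module",
    "T": "technology preview (assumed, RHEL-specific)",
}

def flagged_modules(modules_line):
    """Map each flagged module name to a list of decoded flag meanings."""
    result = {}
    for name, flags in re.findall(r"(\w+)\(([A-Z]+)\)", modules_line):
        result[name] = [MODULE_FLAGS.get(f, f"unknown flag {f}") for f in flags]
    return result

# Excerpt of the module list from the quoted panic output
sample = "llc nvidia_uvm(POE) nvidia(POE) xfs overlay(OET) lustre(OE)"
for module, meanings in flagged_modules(sample).items():
    print(module, "->", ", ".join(meanings))
```

Read this way, the flags only say that the nvidia and Lustre modules are out-of-tree and unsigned (and, for nvidia, proprietary). They do not by themselves show that those modules misbehaved.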


On September 9, 2018, at 6:05 PM, Christopher Samuel <chris at csamuel.org> wrote:

Hi folks,

We've had 2 different nodes crash over the past few days with kernel
panics triggered by (what is recorded as) a "simd exception" (console
messages below). In both cases the triggering application is given as
the same binary, a user application built against OpenFOAM v16.06.

This doesn't happen every time; I can see about 28 successful runs of
the application this month (the binary was built at the end of August).

The system in question has 2 x 16C Xeon Gold 6140 Skylake-EP CPUs.

Any ideas?


------------------8< snip snip 8<------------------

2018-09-09 17:14:34 [179203.697285] simd exception: 0000 [#1] SMP
2018-09-09 17:14:34 [179203.701527] Modules linked in: squashfs loop 
8021q garp mrp stp llc nvidia_uvm(POE) nvidia(POE) xfs skx_edac 
intel_powerclamp coretemp intel_rapl iosf_mbi irqbypass crc32_pclmul 
ghash_clmulni_intel iTCO_wdt iTCO_vendor_support rdma_ucm ib_ucm dcdbas 
aesni_intel mgag200 lrw gf128mul glue_helper ablk_helper ttm ib_uverbs 
cryptd drm_kms_helper dm_mod syscopyarea sysfillrect ib_umad sysimgblt 
fb_sys_fops drm mei_me sg ipmi_si mei lpc_ich i2c_i801 shpchp nfit 
ipmi_devintf ipmi_msghandler libnvdimm tpm_crb acpi_pad acpi_power_meter 
binfmt_misc overlay(OET) osc(OE) mgc(OE) lustre(OE) lmv(OE) fld(OE) 
mdc(OE) fid(OE) lov(OE) ko2iblnd(OE) rdma_cm iw_cm ptlrpc(OE) 
obdclass(OE) lnet(OE) libcfs(OE) ib_ipoib ib_cm sr_mod cdrom sd_mod 
crc_t10dif crct10dif_generic hfi1 rdmavt i2c_algo_bit i2c_core ahci 
crct10dif_pclmul crct10dif_common crc32c_intel libahci ib_core libata 
megaraid_sas pps_core libcrc32c [last unloaded: pcspkr]
2018-09-09 17:14:34 [179203.784359] CPU: 2 PID: 159455 Comm: 
shuangTwoPhaseE Tainted: P           OE  ------------ T 
3.10.0-862.9.1.el7.x86_64 #1
2018-09-09 17:14:34 [179203.795389] Hardware name: Dell Inc. PowerEdge 
R740/06G98X, BIOS 1.4.8 05/21/2018
2018-09-09 17:14:34 [179203.802958] task: ffff995c1aee8fd0 ti: 
ffff995c1988c000 task.ti: ffff995c1988c000
2018-09-09 17:14:34 [179203.810539] RIP: 0010:[<ffffffffbe121791>] 
[<ffffffffbe121791>] apic_timer_interrupt+0x141/0x170
2018-09-09 17:14:34 [179203.819515] RSP: 0000:ffff995c1da46200  EFLAGS: 
00010082
2018-09-09 17:14:34 [179203.824928] RAX: ffff995c1988ff70 RBX: 
0000000001a95e00 RCX: 0000000000000090
2018-09-09 17:14:34 [179203.832146] RDX: 0000000000000000 RSI: 
ffff995c1da46200 RDI: ffff995c1988ff70
2018-09-09 17:14:34 [179203.839364] RBP: 00007ffd8b8ba848 R08: 
0000000000000c40 R09: 0000000000000031
2018-09-09 17:14:34 [179203.846591] R10: 0000000000000000 R11: 
0000000000e72148 R12: 0000000001c4e770
2018-09-09 17:14:34 [179203.853827] R13: 0000000000000007 R14: 
00000000011935b0 R15: 0000000000000038
2018-09-09 17:14:34 [179203.861040] FS:  00002ad83f7afa00(0000) 
GS:ffff995c1da40000(0000) knlGS:0000000000000000
2018-09-09 17:14:34 [179203.869213] CS:  0010 DS: 0000 ES: 0000 CR0: 
0000000080050033
2018-09-09 17:14:34 [179203.875042] CR2: 0000000002a18000 CR3: 
00000017963f8000 CR4: 00000000007607e0
2018-09-09 17:14:34 [179203.882274] DR0: 0000000000000000 DR1: 
0000000000000000 DR2: 0000000000000000
2018-09-09 17:14:34 [179203.889495] DR3: 0000000000000000 DR6: 
00000000fffe0ff0 DR7: 0000000000000400
2018-09-09 17:14:34 [179203.896714] PKRU: 55555554
2018-09-09 17:14:34 [179203.899530] Call Trace:
2018-09-09 17:14:34 [179203.902065] Code: 48 39 cc 77 2f 48 8d 81 00 fe 
ff ff 48 39 e0 77 23 57 48 29 e1 65 48 8b 3c 25 78 0e 01 00 48 83 c7 28 
48 29 cf 48 89 f8 48 89 e6 <f3> a4 48 89 c4 5f 48 89 e6 65 ff 04 25 60 
0e 01 00 65 48 0f 44
2018-09-09 17:14:34 [179203.922628] RIP  [<ffffffffbe121791>] 
apic_timer_interrupt+0x141/0x170
2018-09-09 17:14:34 [179203.929259]  RSP <ffff995c1da46200>
2018-09-09 17:14:34 [179203.933970] ---[ end trace 3912e5e8b3b86da4 ]---
2018-09-09 17:14:34 [179203.984039] Kernel panic - not syncing: Fatal 
exception
2018-09-09 17:14:34 [179203.989451] Kernel Offset: 0x3ca00000 from 
0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)

------------------8< snip snip 8<------------------
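
A note on reading the "Code:" line in the trace above: the byte in angle brackets marks the instruction pointer at the time of the exception. The sketch below locates that marker and prints the bytes starting there. The helper name faulting_bytes is hypothetical. The bytes f3 a4 are the x86 encoding of REP MOVSB (copy RCX bytes from the address in RSI to the address in RDI), and RCX is 0x90 in the register dump. That is an observation about the dump, not a diagnosis.

```python
def faulting_bytes(code_line):
    """Return (offset, bytes from the fault onward) for an oops 'Code:' dump."""
    tokens = code_line.split()
    for offset, token in enumerate(tokens):
        if token.startswith("<") and token.endswith(">"):
            # Strip the <> marker and keep everything from that byte on
            tail = [token.strip("<>")] + tokens[offset + 1:]
            return offset, bytes.fromhex("".join(tail))
    raise ValueError("no <..> marker found in Code: line")

# Code bytes copied from the 2018-09-09 trace, line wraps rejoined
code = ("48 39 cc 77 2f 48 8d 81 00 fe ff ff 48 39 e0 77 23 57 48 29 e1 65 "
        "48 8b 3c 25 78 0e 01 00 48 83 c7 28 48 29 cf 48 89 f8 48 89 e6 <f3> "
        "a4 48 89 c4 5f 48 89 e6 65 ff 04 25 60 0e 01 00 65 48 0f 44")

offset, tail = faulting_bytes(code)
print(offset, tail[:2].hex(" "))  # prints: 43 f3 a4
```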

------------------8< snip snip 8<------------------

2018-09-07 22:37:16 [201527.171417] simd exception: 0000 [#1] SMP
2018-09-07 22:37:16 [201527.176270] Modules linked in: squashfs loop 
8021q garp mrp stp llc nvidia_uvm(POE) nvidia(POE) xfs skx_edac 
intel_powerclamp coretemp intel_rapl iosf_mbi mgag200 ttm drm_kms_helper 
irqbypass syscopyarea sysfillrect crc32_pclmul sysimgblt iTCO_wdt 
fb_sys_fops ib_ucm iTCO_vendor_support ghash_clmulni_intel rdma_ucm 
dm_mod dcdbas drm ib_uverbs aesni_intel lrw gf128mul glue_helper 
ablk_helper cryptd mei_me sg lpc_ich i2c_i801 shpchp ib_umad mei ipmi_si 
ipmi_devintf ipmi_msghandler nfit libnvdimm tpm_crb acpi_pad 
acpi_power_meter binfmt_misc overlay(OET) osc(OE) mgc(OE) lustre(OE) 
lmv(OE) fld(OE) mdc(OE) fid(OE) lov(OE) ko2iblnd(OE) rdma_cm iw_cm 
ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) ib_ipoib ib_cm sd_mod sr_mod 
cdrom crc_t10dif crct10dif_generic hfi1 rdmavt i2c_algo_bit ahci 
i2c_core crct10dif_pclmul libahci crct10dif_common crc32c_intel ib_core 
libata megaraid_sas pps_core libcrc32c [last unloaded: pcspkr]
2018-09-07 22:37:16 [201527.264079] CPU: 17 PID: 32227 Comm: 
shuangTwoPhaseE Tainted: P        W  OE  ------------ T 
3.10.0-862.9.1.el7.x86_64 #1
2018-09-07 22:37:16 [201527.275789] Hardware name: Dell Inc. PowerEdge 
R740/06G98X, BIOS 1.4.8 05/21/2018
2018-09-07 22:37:16 [201527.284045] task: ffff9e345a42eeb0 ti: 
ffff9e2f88a0c000 task.ti: ffff9e2f88a0c000
2018-09-07 22:37:16 [201527.292302] RIP: 0010:[<ffffffffa6721791>] 
[<ffffffffa6721791>] apic_timer_interrupt+0x141/0x170
2018-09-07 22:37:16 [201527.301978] RSP: 0000:ffff9e345c006200  EFLAGS: 
00010082
2018-09-07 22:37:16 [201527.308091] RAX: ffff9e2f88a0ff70 RBX: 
00007fffe0d21d78 RCX: 0000000000000090
2018-09-07 22:37:16 [201527.316032] RDX: 0000000000000000 RSI: 
ffff9e345c006200 RDI: ffff9e2f88a0ff70
2018-09-07 22:37:16 [201527.323969] RBP: 00007fffe0d21d78 R08: 
000000000001e800 R09: 00000000000007a0
2018-09-07 22:37:16 [201527.331906] R10: 0000000000000000 R11: 
0000000002818868 R12: 0000000002d20790
2018-09-07 22:37:16 [201527.339839] R13: 00007fffe0d159d0 R14: 
00007fffe0d15b40 R15: 00007fffe0d15a20
2018-09-07 22:37:16 [201527.347772] FS:  00002b835d26da00(0000) 
GS:ffff9e345c000000(0000) knlGS:0000000000000000
2018-09-07 22:37:16 [201527.356659] CS:  0010 DS: 0000 ES: 0000 CR0: 
0000000080050033
2018-09-07 22:37:16 [201527.363209] CR2: 0000000003a6ff88 CR3: 
0000002fdd8f6000 CR4: 00000000007607e0
2018-09-07 22:37:16 [201527.371144] DR0: 0000000000000000 DR1: 
0000000000000000 DR2: 0000000000000000
2018-09-07 22:37:16 [201527.379079] DR3: 0000000000000000 DR6: 
00000000fffe0ff0 DR7: 0000000000000400
2018-09-07 22:37:16 [201527.387010] PKRU: 55555554
2018-09-07 22:37:16 [201527.390523] Call Trace:
2018-09-07 22:37:16 [201527.393780] Code: 48 39 cc 77 2f 48 8d 81 00 fe 
ff ff 48 39 e0 77 23 57 48 29 e1 65 48 8b 3c 25 78 0e 01 00 48 83 c7 28 
48 29 cf 48 89 f8 48 89 e6 <f3> a4 48 89 c4 5f 48 89 e6 65 ff 04 25 60 
0e 01 00 65 48 0f 44
2018-09-07 22:37:16 [201527.415810] RIP  [<ffffffffa6721791>] 
apic_timer_interrupt+0x141/0x170
2018-09-07 22:37:16 [201527.423189]  RSP <ffff9e345c006200>
2018-09-07 22:37:16 [201527.428646] ---[ end trace a6a14aed798e889f ]---
2018-09-07 22:37:17 [201527.477875] Kernel panic - not syncing: Fatal 
exception
2018-09-07 22:37:17 [201527.484041] Kernel Offset: 0x25000000 from 
0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)

------------------8< snip snip 8<------------------
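
The two traces show different RIP values only because of kernel address randomization. Subtracting the "Kernel Offset" printed at the end of each trace from its RIP gives the address in the unrandomized image. A short worked sketch, using only numbers from the two traces above (the function name unrelocated is made up for illustration):

```python
def unrelocated(rip, kernel_offset):
    """Subtract the KASLR offset so the address matches the unrandomized image."""
    return rip - kernel_offset

# (RIP, Kernel Offset) pairs copied from the two panic traces
crashes = {
    "2018-09-09": (0xffffffffbe121791, 0x3ca00000),
    "2018-09-07": (0xffffffffa6721791, 0x25000000),
}

for day, (rip, offset) in crashes.items():
    print(day, hex(unrelocated(rip, offset)))
# Both lines print 0xffffffff81721791
```

Both crashes therefore land on the same instruction, which matches the identical apic_timer_interrupt+0x141/0x170 symbol offset and identical Code: bytes in the two traces.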


All the best!
Chris
-- 
  Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

