[Beowulf] NMI (Non maskable interrupts)

Mark Hahn hahn at mcmaster.ca
Mon Mar 17 15:02:24 PDT 2008


> From my understanding, NMI is not good since the processors really
> have to handle these interrupts right away and these might degrade the
> performance of the nodes.

I think you're mistaken - NMIs of the sort you're talking about would
result in a panic.  the ones you're seeing counted are probably just
low-level kernel synchronization, where one CPU needs to make the others do
something immediately, such as changing the status of a page in their MMUs.

for instance, I notice that more recent kernels classify interrupts
more finely:

[root@experiment ~]# cat /proc/interrupts
            CPU0       CPU1       CPU2       CPU3
   0:         68          0          0          0   IO-APIC-edge      timer
   1:          0          0          0         10   IO-APIC-edge      i8042
   4:          0          0          0          2   IO-APIC-edge
   8:          0          0          0          0   IO-APIC-edge      rtc
   9:          0          0          0          0   IO-APIC-fasteoi   acpi
  12:          0          0          0          4   IO-APIC-edge      i8042
  14:          0          0          0          0   IO-APIC-edge      ide0
  17:          0          0          0          0   IO-APIC-fasteoi   sata_nv
  18:          0          0          0          0   IO-APIC-fasteoi   sata_nv
  19:     123229        148        514       4698   IO-APIC-fasteoi   sata_nv
 362:  127524168    5281605     236961     121506   PCI-MSI-edge      eth1
 377:     519748   12731137     607115   42573852   PCI-MSI-edge      eth0:MSI-X-2-RX
 378:     109154      80191  302109913    6487104   PCI-MSI-edge      eth0:MSI-X-1-TX
 NMI:          0          0          0          0   Non-maskable interrupts
 LOC:  300446104  300446082  300446060  300446038   Local timer interrupts
 RES:    2698262      44102    2234502    3677120   Rescheduling interrupts
 CAL:       4135       4379       4460        415   function call interrupts
 TLB:      14018      15088       4079       7251   TLB shootdowns
 TRM:          0          0          0          0   Thermal event interrupts
 THR:          0          0          0          0   Threshold APIC interrupts
 SPU:          0          0          0          0   Spurious interrupts
 ERR:          0

I suspect that, in earlier kernels, all the counts listed after RES were
lumped into NMI.  obviously, rescheduling, function call and TLB shootdown
interrupts are perfectly normal and don't indicate any error (though you
might want to minimize them as well...)
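
if you want to check whether the NMI line is actually moving on your nodes,
and how it compares with the RES, CAL and TLB rows, a few lines of Python
will do.  this is only a rough sketch: the names parse_interrupts, delta and
SAMPLE are made up for illustration (SAMPLE is a handful of rows copied from
the output above), and it assumes /proc/interrupts has the layout shown
there, which may differ on other kernels.

```python
"""Sketch: per-CPU counts for the rows of /proc/interrupts (names are hypothetical)."""

# A few rows copied from the output quoted in this message, used as test input.
SAMPLE = """\
            CPU0       CPU1       CPU2       CPU3
  19:     123229        148        514       4698   IO-APIC-fasteoi   sata_nv
 NMI:          0          0          0          0   Non-maskable interrupts
 RES:    2698262      44102    2234502    3677120   Rescheduling interrupts
 CAL:       4135       4379       4460        415   function call interrupts
 TLB:      14018      15088       4079       7251   TLB shootdowns
 ERR:          0
"""


def parse_interrupts(text):
    """Return {label: [count_cpu0, count_cpu1, ...]} for every row.

    The header line ("CPU0 CPU1 ...") gives the number of CPUs; rows with
    fewer counts than CPUs (such as ERR) keep only the counts they have.
    """
    lines = text.splitlines()
    ncpu = len(lines[0].split())
    rows = {}
    for line in lines[1:]:
        label, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "label: counts ..." row
        counts = []
        for field in rest.split()[:ncpu]:
            if not field.isdigit():
                break  # reached the description column
            counts.append(int(field))
        rows[label.strip()] = counts
    return rows


def delta(before, after):
    """Per-label increase in the summed count between two snapshots."""
    return {k: sum(after[k]) - sum(before.get(k, [])) for k in after}


if __name__ == "__main__":
    rows = parse_interrupts(SAMPLE)
    for label in ("NMI", "RES", "CAL", "TLB"):
        print(label, sum(rows[label]))
```

on a live node you would read /proc/interrupts twice, some seconds apart,
pass both texts through parse_interrupts and hand the results to delta.  an
NMI delta that stays at zero while RES, CAL and TLB keep climbing would fit
the picture above.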

how about trying a newer kernel?  the output above is from 2.6.24.3.  note
that there are important security fixes that you might be missing if you're
running certain ranges of old kernels...
