intermittent crashing of programs

Donald Becker becker at scyld.com
Thu Feb 21 08:48:45 PST 2002


On Thu, 21 Feb 2002, Patrick Geoffray wrote:

> Kris Thielemans wrote:
> > Any suggestions on how we figure out what the problem is (aside from
> > replacing all memory chips)? Is it necessarily RAM, or could it be e.g. the
> > hard disk controller or so?
> 
> It's usually RAM, but it can also be a PCI device whining. I have seen 
> NMIs from SCSI boards when they were waiting too long to access the PCI 
> bus for example.

Could you elaborate?  What PCI problems cause an NMI, and on which
motherboards?  You obviously have some first-hand experience with the
problem.  I'm guessing that you have helped many customers debug their
hardware problems.

I think of the connection between parity errors and NMI as an obscure
legacy part of the PC architecture, much like the "A20" line being
switched by the keyboard controller.  If that backwards compatibility
broke, no one would notice.

> The last time I got one, it was a bad RAM chip and memtest didn't find 
> anything. Try to swap memory with another node to see if the NMIs 
> migrate with the chips.

A good point: memory tests often fail to find problems that programs
such as 'gcc' trigger immediately.  We compile kernels overnight to test
new machines.
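
For illustration only, an overnight compile test of that kind could be
sketched as below.  This is a minimal sketch, not the script we actually
run: the function name stress_loop and its parameters are hypothetical,
and it assumes the build command signals a failure (for example a
compiler killed by signal 11) through a non-zero exit status.

```python
import subprocess


def stress_loop(command, iterations, workdir=None):
    """Run `command` repeatedly and collect the runs that failed.

    `command` is an argument list, e.g. ["make"] (hypothetical choice).
    Returns a list of (iteration, returncode) pairs, one per failed run.
    On POSIX a negative returncode means the process itself was killed
    by that signal number.
    """
    failures = []
    for i in range(1, iterations + 1):
        # Output is discarded; only the exit status matters here.
        result = subprocess.run(
            command,
            cwd=workdir,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        if result.returncode != 0:
            failures.append((i, result.returncode))
    return failures
```

A call such as stress_loop(["sh", "-c", "make clean && make"], 100,
workdir="/usr/src/linux") would rebuild from a clean tree each time; the
path and the make targets are assumptions.  On sound hardware the same
build should succeed every time, so failures that come and go between
identical runs suggest the machine rather than the source tree.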

Donald Becker				becker at scyld.com
Scyld Computing Corporation		http://www.scyld.com
410 Severn Ave. Suite 210		Second Generation Beowulf Clusters
Annapolis MD 21403			410-990-9993



