intermittent crashing of programs
Daniel Kidger
Daniel.Kidger at quadrics.com
Thu Feb 21 09:46:00 PST 2002
Donald Becker wrote:
>I think of parity errors being connected to NMI as being an obscure
>legacy part of the PC architecture, much like the "A20" line being
>switched by the keyboard controller. If the backwards compatibility
>broke, no one would notice.
Nope, not legacy. Just look, for example, at any brand-new Dell Pentium 4
system with RAMBUS ECC memory.
Any multi-bit error generates an NMI.
Single-bit errors in ECC memory get spotted by the BIOS too, but the OS will
not be told, since the hardware corrects them on the fly as the data is
read. Hence 'memtest' will never detect these single-bit errors.
The other thing to get is 'ecc.o'. This is a kernel module that polls the
motherboard chipset every second. It reports the single-bit and multi-bit
error counts in /proc/ram, collated by memory bank.
e.g.
[dan at fridge8]$ cat /proc/ram
Chipset ECC capability : ECC detection and correction
Current ECC mode : ECC detection and correction
Bank  Size  Type  ECC     SBE  MBE
   0  256M  RMBS    Y  202758    0
   1  256M  RMBS    Y       0    5
   2  256M  RMBS    Y       0    2
   3  256M  RMBS    Y       0    0
   4  256M  RMBS    Y       0    0
   5  256M  RMBS    Y       0  257
   6  256M  RMBS    Y       0    0
   7  256M  RMBS    Y       0    0
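
As a rough sketch (this is not part of ecc.o), the per-bank counters above
could be checked from a script. The Python below assumes the /proc/ram layout
is exactly as in the sample output, with six whitespace-separated columns per
bank. The names parse_ecc_report and suspect_banks are hypothetical helpers
made up for illustration.

```python
# Sketch only: assumes /proc/ram looks exactly like the sample output above.
# parse_ecc_report and suspect_banks are hypothetical helper names.

SAMPLE = """\
Chipset ECC capability : ECC detection and correction
Current ECC mode : ECC detection and correction
Bank Size Type ECC SBE MBE
0 256M RMBS Y 202758 0
1 256M RMBS Y 0 5
2 256M RMBS Y 0 2
3 256M RMBS Y 0 0
4 256M RMBS Y 0 0
5 256M RMBS Y 0 257
6 256M RMBS Y 0 0
7 256M RMBS Y 0 0
"""


def parse_ecc_report(text):
    """Return one dict per memory bank found in /proc/ram-style text."""
    banks = []
    for line in text.splitlines():
        fields = line.split()
        # Bank rows have exactly six fields and begin with a numeric index;
        # the two status lines and the column header are skipped.
        if len(fields) != 6 or not fields[0].isdigit():
            continue
        bank, size, mem_type, ecc, sbe, mbe = fields
        banks.append({
            "bank": int(bank),
            "size": size,
            "type": mem_type,
            "ecc": ecc == "Y",
            "sbe": int(sbe),  # single-bit (corrected) error count
            "mbe": int(mbe),  # multi-bit (uncorrectable) error count
        })
    return banks


def suspect_banks(banks):
    """Return the indices of banks that have logged any ECC error."""
    return [b["bank"] for b in banks if b["sbe"] > 0 or b["mbe"] > 0]


if __name__ == "__main__":
    for b in parse_ecc_report(SAMPLE):
        if b["sbe"] or b["mbe"]:
            print("bank %d: SBE=%d MBE=%d" % (b["bank"], b["sbe"], b["mbe"]))
```

On a machine with the module loaded, the text would presumably come from
open("/proc/ram").read() instead of the embedded SAMPLE string.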
Yours,
Daniel.
--------------------------------------------------------------
Dr. Dan Kidger, Quadrics Ltd. daniel.kidger at quadrics.com
One Bridewell St., Bristol, BS1 2AA, UK 0117 915 5505
----------------------- www.quadrics.com --------------------