intermittent crashing of programs
Donald Becker
becker at scyld.com
Thu Feb 21 08:43:01 PST 2002
On Thu, 21 Feb 2002, Kris Thielemans wrote:
> (2nd resubmit after subscribing with a different email address...)
OK, I just deleted them from the moderation-hold queue.
I usually approve held posts in a few hours during the week. The volume
of attempted spam has become very high in the past few months, so I'm
unlikely to loosen the requirement that non-member messages be held for
moderation.
> we have a cluster of 4 dual Pentium III 600 MHz systems, running SuSE Linux
> 7.1. On one of the PCs, our programs occasionally crash with a segmentation
> fault. This also happens with an ordinary serial program with all its IO to
> local disks. (It does use NIS to get user info though, so I cannot easily
> test it without network). The crash NEVER occurs on any of the other
> systems.
This is pretty clearly a hardware problem. Luckily you have other
similar systems to compare against.
> Feb 21 14:22:58 pp4 kernel: Uhhuh. NMI received. Dazed and confused, but
> trying to continue
> Feb 21 14:22:58 pp4 kernel: You probably have a hardware problem with your
> RAM chips
Hmmm, there is a similar problem reported in the eepro100 list on a Dell
4400 server. There the problem occurs when a PCI device is accessed
(and of course the driver is blamed). I'm guessing that problem
is a datapath parity error, which is slightly different than a PCI
parity error.
You might want to read that thread, which starts on 16 Feb 2002.
http://www.scyld.com/pipermail/eepro100/2002-February/
The important detail to remember is that NMI is once again being used to
report system data errors, and there are additional error sources beyond
memory parity errors.
> So, we ran memtest86-2.5 for 4 days continuously. No error was reported.
I would swap RAM between two systems and see if the problem follows. If
the problem just goes away, you should still relegate the suspect RAM to
a machine that doesn't need to be reliable.
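
To judge whether the problem follows the RAM, it helps to count NMI
reports per host before and after the swap. Below is a minimal sketch,
assuming classic syslog-format lines like the ones quoted above. The
function name count_nmi_by_host and the regular expression are
illustrative only, not part of any existing tool.

```python
import re
from collections import Counter

# Assumed line format (classic syslog), e.g.:
#   Feb 21 14:22:58 pp4 kernel: Uhhuh. NMI received. Dazed and confused, but
NMI_LINE = re.compile(
    r"^\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2}\s+(?P<host>\S+)\s+kernel:.*NMI received"
)


def count_nmi_by_host(lines):
    """Return a Counter mapping hostname -> number of 'NMI received' lines."""
    counts = Counter()
    for line in lines:
        match = NMI_LINE.match(line)
        if match:
            counts[match.group("host")] += 1
    return counts
```

To use it, pass an open log file, for example
count_nmi_by_host(open("/var/log/messages", errors="replace")). The log
path is an assumption and depends on the syslog configuration. If the
count moves to the machine that received the suspect RAM, the RAM is the
likely culprit. If it stays with the original machine, look at the
motherboard, the PCI devices, or the power supply instead.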
Donald Becker becker at scyld.com
Scyld Computing Corporation http://www.scyld.com
410 Severn Ave. Suite 210 Second Generation Beowulf Clusters
Annapolis MD 21403 410-990-9993