intermittent crashing of programs
Donald Becker becker at scyld.com
Thu Feb 21 08:43:01 PST 2002
On Thu, 21 Feb 2002, Kris Thielemans wrote:

> (2nd resubmit after subscribing with a different email address...)

OK, I just deleted them from the moderation-hold queue. I usually
approve held posts in a few hours during the week. The volume of
attempted spam has become very high in the past few months, so I'm
unlikely to loosen the requirement that non-member messages be held for
moderation.

> we have a cluster of 4 dual Pentium III 600 MHz systems, running SuSE Linux
> 7.1. On one of the PCs, our programs occasionally crash with a segmentation
> fault. This also happens with an ordinary serial program with all its IO to
> local disks. (It does use NIS to get user info though, so I cannot easily
> test it without network). The crash NEVER occurs on any of the other
> systems.

This is pretty clearly a hardware problem. Luckily you have other
similar systems to compare against.

> Feb 21 14:22:58 pp4 kernel: Uhhuh. NMI received. Dazed and confused, but
> trying to continue
> Feb 21 14:22:58 pp4 kernel: You probably have a hardware problem with your
> RAM chips

Hmmm, there is a similar problem reported on the eepro100 list on a Dell
4400 server. There the problem occurs when a PCI device is accessed
(and of course the driver is blamed). I'm guessing that problem is a
datapath parity error, which is slightly different than a PCI parity
error. You might want to read that thread, which starts 16 Feb 2002.
  http://www.scyld.com/pipermail/eepro100/2002-February/

The important detail to remember is that NMI is once again being used to
report system data errors, and there are additional error sources beyond
memory parity errors.

> So, we ran memtest86-2.5 for 4 days continuously. No error was reported.

I would swap RAM between two systems and see if the problem follows the
RAM. If the problem just goes away, you should still relegate the
suspect RAM to a machine that doesn't need to be reliable.
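
To judge whether the problem follows the RAM after a swap, it may help to count the NMI reports per host in the system logs before and after. The following is only a rough sketch, not part of the original advice: it assumes the classic syslog line layout shown in the quoted log above, and the function name `count_nmi_by_host` and the log path in the usage comment are hypothetical.

```python
import re
import sys
from collections import Counter

# Assumed layout, taken from the quoted log lines:
#   "Feb 21 14:22:58 pp4 kernel: Uhhuh. NMI received. ..."
NMI_LINE = re.compile(
    r"^\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2}\s+(?P<host>\S+)\s+kernel:.*NMI received"
)


def count_nmi_by_host(lines):
    """Return a Counter mapping hostname -> number of kernel NMI reports."""
    counts = Counter()
    for line in lines:
        match = NMI_LINE.match(line)
        if match:
            counts[match.group("host")] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical usage: python nmi_count.py /var/log/messages
    for path in sys.argv[1:]:
        with open(path, errors="replace") as log:
            for host, n in sorted(count_nmi_by_host(log).items()):
                print(f"{path}: {host}: {n} NMI report(s)")
```

Running this against the logs of both machines before and after the swap gives a simple count to compare. A count that moves with the RAM points at the memory; a count that stays with the original machine suggests one of the other NMI sources.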
Donald Becker                   becker at scyld.com
Scyld Computing Corporation     http://www.scyld.com
410 Severn Ave. Suite 210       Second Generation Beowulf Clusters
Annapolis MD 21403              410-990-9993
More information about the Beowulf mailing list
