Mysterious kernel hangs
award at andorra.ad
Thu Mar 15 06:35:35 PST 2001
It may seem simplistic, but do you have any reason to think
your machines aren't simply overheating?
There can be a lot of Joules going 'round in a dual box. Try
running them at, say, 800 MHz and see if there's a difference.
Same with the case open.
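If you want something more than a hunch, one rough way to check is to log a temperature reading every few seconds and force each line to disk, so the last value before a freeze survives the reset. The sketch below is only an illustration: the function names and the sensor path are hypothetical, and where (or whether) a temperature shows up as a readable file depends on your kernel and sensor driver.

```python
import os
import time


def read_sensor(path):
    """Return the stripped contents of a sensor file, or None if unreadable."""
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return None


def log_heartbeat(log_path, sensor_path, samples, interval_s=1.0):
    """Append `samples` timestamped readings, forcing each one to disk."""
    with open(log_path, "a") as log:
        for i in range(samples):
            reading = read_sensor(sensor_path)
            log.write(f"{time.time():.0f} {reading}\n")
            log.flush()
            os.fsync(log.fileno())  # push the line to disk before a possible freeze
            if i + 1 < samples:
                time.sleep(interval_s)


# Hypothetical usage; replace the sensor path with whatever your system exposes:
# log_heartbeat("/var/log/heartbeat.log", "/path/to/cpu_temperature", 100000, 5.0)
```

If the last logged readings before each freeze are consistently high, that points at heat; if the log just stops at ordinary values, look elsewhere.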
Felix Rauch wrote:
> We recently bought a new 16 node cluster with dual 1 GHz PentiumIII
> nodes, but machines mysteriously freeze :-(
> The nodes have STL2 boards (Version A28808-301), onboard Adaptec SCSI
> controllers (7899P), onboard Intel Fast Ethernet adapters (82557
> [Ethernet Pro 100]) and additional Packet Engines Hamachi GNIC-II
> Gigabit Ethernet cards.
> We tried kernels 2.2.x, 2.4.1 and now even 2.4.2-ac20, but it seems to
> be the same problem with all kernels: When we run experiments which
> use the network intensively, any of the machines will just freeze
> after a few hours. The frozen machine does not respond to anything, and
> so far we have not been able to see any log entries related to the
> freeze on virtual console 10 :-( We have now switched on all the "Kernel
> Hacking" options in the kernel configuration (especially the logging)
> and will try again; hopefully we will at least see some log output.
> The freezes also happen if we run non-network-intensive jobs on
> the machines (e.g. SETI@home), but clearly they happen less often.
> Does anyone have any idea what could be going wrong, or what we could
> try in order to find the cause of the problems?
> Felix Rauch | Email: rauch at inf.ethz.ch
> Institute for Computer Systems | Homepage: http://www.cs.inf.ethz.ch/~rauch/
> ETH Zentrum / RZ H18 | Phone: ++41 1 632 7489
> CH - 8092 Zuerich / Switzerland | Fax: ++41 1 632 1307
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf