HPL residual check failure
Patrick Geoffray
patrick at myri.com
Thu Nov 8 03:53:08 PST 2001
Yoon Jae Ho wrote:
> but I guess if you use Myrinet instead of 10/100 LAN. then please check the Cable & Myrinet mpich version.
FYI, bad Myrinet cables do not produce corrupted data: there is
a hardware CRC check on the NIC. Corrupted packets are simply dropped,
so the symptoms of bad cables are messages timing out or running very
slowly. You can look at the number of bad CRCs (badcrc_cnt) with
"gm_counters" (if you are using GM).
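As a rough illustration of why a CRC check turns corruption into packet loss rather than bad data, here is a minimal software sketch. It is only an analogy: the Myrinet check is done in hardware on the NIC, and its CRC polynomial and framing are not described here, so CRC-32 from Python's zlib and the helper names frame() and receive() are assumptions made for the example.

```python
import zlib

def frame(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 trailer to the payload (hypothetical framing)."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")

def receive(packet: bytes):
    """Return the payload if the CRC matches; return None if the packet is dropped."""
    payload, trailer = packet[:-4], packet[-4:]
    if zlib.crc32(payload).to_bytes(4, "big") != trailer:
        return None  # corrupted packet: dropped, never delivered to the application
    return payload

good = frame(b"HPL panel data")

# Simulate a bad cable by flipping a single bit in transit.
bad = bytearray(good)
bad[0] ^= 0x01
bad = bytes(bad)

delivered_good = receive(good)  # payload comes through unchanged
delivered_bad = receive(bad)    # None: the receiver sees a missing message
```

The application never sees the damaged bytes. It only sees a message that did not arrive, which is why the symptom is a timeout or a slowdown and not a wrong numerical result.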
In the context of Keaton's failure, bad memory is certainly the
problem. Usually, if things work after cooling the unit, it's
very likely to be overheating hardware.
Patrick
----------------------------------------------------------
| Patrick Geoffray, Ph.D. patrick at myri.com
| Myricom, Inc. http://www.myri.com
| Cell: 865-389-8852 685 Emory Valley Rd (B)
| Phone: 865-425-0978 Oak Ridge, TN 37830
----------------------------------------------------------