Myrinet hardware reliability
v.pennington at man.ac.uk
Fri Feb 7 01:40:05 PST 2003
We have a 113-node IBM x330 cluster with Myrinet 2000. We're
experiencing very high failure rates on Myrinet switch ports
(averaging 3 per month) and, to a lesser extent, on Myrinet NICs
(about 1 per month). Ports and NICs are fine one minute,
then one or the other just dies (for good). Cables
(fibre, not copper) seem fine - only one or two failures in
nearly a year.
There is no pattern in the failures, and they are entirely
unrelated to usage levels; seldom-used nodes are just as
likely to have failures as heavily used nodes.
We have another small IBM cluster with Myrinet 2000
(16-port switch with copper cables), and this has run solidly
for nearly 2 years with not one Myrinet hardware fault.
I'd be really interested to know of others' experiences with
Myrinet kit, especially in larger clusters.
Dr Victoria Pennington
Manchester Computing, Kilburn Building,
University of Manchester,
Oxford Road, Manchester M13 9PL
tel. 0161 275 6830, email: v.pennington at man.ac.uk