Myrinet hardware reliability

Eric Wages wages at eece.maine.edu
Fri Feb 7 09:38:54 PST 2003


Victoria,

You're not alone on this one. We have a 416-processor P3 cluster with
Myrinet 2000 as well, installed in January of 2002 and running since
then. If your situation is like ours, your cluster only started to show
these signs after about six months of operation, correct?

To quote an email from Myricom that we received back in October:

"From all of the Myrinet-fiber products shipped between 1 July 2001 (the
beginning of the present series of fiber products) to 30 September 2002,
and all of the RMAs from 1 July 2001 to 11 October 2002, we can tell you
that the experienced annualized failure rate (AFR) of these products is:

          Product                AFR
          -------                ---
  M3F-PCI64B & M3F-PCI64C        0.88%
  M3-SW16-8F & M3-SPINE-8F       3.8% (0.48% per port)

"By industry standards, these AFR numbers are good, particularly for the
first 15 months of shipments of a given product.  (The field failure
rates of installed products will generally be highest initially, and
decrease over time.)  These AFRs are not, however, up to Myricom's
reliability standards.  Most earlier Myrinet products had AFRs of ~0.2%
per port, and we expect AFRs in this range for current production.

"As you've observed, we have had a reliability problem with the fiber
transceivers.  The failed fiber transceivers have all been sent back to
their manufacturer, E2O, for root-cause analysis.  You are right that
the great majority of these failures are failures of the laser diode
(the light emitter).

"What is particularly insidious about this particular failure mode is
that it tends to occur only after ~200 days of operation.  In our
monthly monitoring of failure rates, we observed *zero* transceiver
failures in the first 5 months and ~30M hours of field operation,
causing us to believe initially that these transceivers were quite
reliable.  It was not until July 2002 that this failure mode of E2O
transceivers -- in particular, of the laser diode used inside of the
transceiver -- became evident from field failures of components shipped
during 4Q01." 
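
For what it's worth, the per-port figure in that table is just the
switch/spine AFR spread over the fiber ports. A quick sanity check, as a
toy Python sketch with the numbers taken straight from the quote (the
8-port count is my own reading of the "-8F" product names):

    # Quick sanity check on the per-port AFR quoted by Myricom above.
    # Assumes the "-8F" products carry 8 fiber ports each (my assumption).
    switch_afr_percent = 3.8   # annualized failure rate, M3-SW16-8F & M3-SPINE-8F
    fiber_ports = 8

    per_port = switch_afr_percent / fiber_ports
    print(f"per-port AFR = {per_port:.3f}%")   # 0.475%, i.e. the ~0.48% quoted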


Now, that being said, I am still seeing errors on my old equipment,
though at a much lower rate than at the peak. At one point I had to swap
out 7 M3 chassis cards and 3 PCI cards in one week; now I'm down to one
or two a week. Problem tracking tools are quite important for keeping
everything straight... I recommend Bugzilla.

ALL of the equipment I'm getting from Myricom now uses fibre
transceivers from a different manufacturer, and I have had zero problems
with the new equipment. To Myricom's credit, I have no issues getting
equipment RMA'd... and determining the immediate issue is pretty easy.

Pull the plug out of a failed node and take a quick peek to see whether
the laser emitter on the PCI card is lit. If there is no light, that
card's emitter is bad. If there is light, look at the fibre coming from
the switch; if you don't see light from that side, chances are the
emitter died in the switch port.
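
If it helps, that same check written out as a toy sketch (pure
illustration: the function name and its inputs are mine, not any Myricom
tool; you still have to eyeball the light yourself):

    # Toy encoding of the visual triage described above. The inputs are
    # what you observe on the NIC's emitter and on the fibre from the switch.
    def diagnose(nic_emitter_lit, switch_side_lit):
        """Guess which component died based on what you see."""
        if not nic_emitter_lit:
            return "NIC (PCI card) transceiver is likely dead"
        if not switch_side_lit:
            return "switch port transceiver is likely dead"
        return "both emitters look alive -- check the cable or the software side"

    # Example: NIC light is on, but nothing coming from the switch end.
    print(diagnose(nic_emitter_lit=True, switch_side_lit=False))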

Like you, I have seen a higher failure rate on switch ports than on PCI
cards... I'm guessing it's *somewhat* related to heat, but I have no way
to confirm that on my side.

-Eric


On Fri, 2003-02-07 at 04:40, Victoria Pennington wrote:
> Hi,
> 
> We have a 113 node IBM x330 cluster with Myrinet 2000.  We're
> experiencing very high failure rates on Myrinet switch ports
> (average 3 per month) and on Myrinet NICs to a lesser extent
> (about 1 per month).  Ports and NICs are fine one minute,
> then one or the other just dies (for good).  Cables
> (fibre, not copper) seem fine - one or two failures only in
> nearly a year.
> 
> There is no pattern in the failures, and they are entirely
> unrelated to usage levels; seldom used nodes are just as
> likely to have failures as heavily used nodes.
> 
> We have another small IBM cluster with Myrinet 2000
> (16 port switch with copper cables), and this has run solidly
> for nearly 2 years with not one Myrinet hardware fault.
> 
> I'd be really interested to know of others' experiences with
> Myrinet kit, especially in larger clusters.
> 
> Thanks
> Victoria
> ---
> Dr Victoria Pennington
> Manchester Computing, Kilburn Building,
> University of Manchester,
> Oxford Road, Manchester M13 9PL
> tel. 0161 275 6830, email: v.pennington at man.ac.uk
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
-- 

Eric Wages		Supercomputer Manager
Tel: (207) 866-6510	University of Maine
Fax: (207) 866-6510	20 Godfrey Drive
wages at eece.maine.edu    Orono, ME 04473