[Beowulf] help on building Beowulf

Patrick Geoffray patrick at myri.com
Tue Nov 20 16:34:09 PST 2007


Hi Bo,

Li, Bo wrote:
> In my experience running HPC applications at the Shanghai Supercomputing Center, the Myrinet interconnect caused many failures, even with small applications. All the users were frustrated with the interconnect, and we had to restart applications again and again. I am not sure whether there was any improvement when Myrinet was involved. During the three months I stayed there, Myrinet did nothing when the people from Dawning called them for help. Sorry if I have put too many personal opinions into this.

I have looked at all 46 Help Tickets opened by Dawning with 
Myricom Tech Support between 2004 and 2007, and each of them received a 
first response within 2 business days. Final resolution took from a few 
hours to one week (RMA of a switch enclosure).

Doing a cross-reference with Shanghai Supercomputing Center 
(Dawning4000A cluster), I saw the same software problem reported 
multiple times over a period of several months. It was answered each time 
the following day. The reported problem was MPICH-GM unable to open a GM 
port (which could have many causes but a common one was MPI jobs 
terminating abnormally and not being cleaned up properly). We were not 
made aware of continuing problems after relevant information was sent. 
Further tickets referred to performance tuning, not operational stability.
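
To illustrate the cleanup issue mentioned above, here is a minimal sketch 
of how leftover processes from an abnormally terminated job could be 
spotted on a Linux node. It does not use any GM or MPICH-GM interface; it 
only reads /proc. The program name "mpi_app" is hypothetical, and the 
test "parent PID is 1" is an assumption that may not hold on systems 
where orphans are adopted by another process.

```python
import os
import re


def parse_stat(text):
    """Parse the contents of /proc/<pid>/stat into (pid, ppid, comm)."""
    # The command name sits in parentheses and may itself contain
    # spaces or parentheses, so split on the first "(" and last ")".
    left = text.index("(")
    right = text.rindex(")")
    pid = int(text[:left])
    comm = text[left + 1:right]
    fields = text[right + 1:].split()
    ppid = int(fields[1])  # fields[0] is the process state
    return pid, ppid, comm


def read_proc_table(proc_dir="/proc"):
    """Return a list of (pid, ppid, comm) for all readable processes."""
    table = []
    for entry in os.listdir(proc_dir):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_dir, entry, "stat")) as f:
                table.append(parse_stat(f.read()))
        except (OSError, ValueError):
            # The process exited while we were scanning, or the
            # entry could not be parsed; skip it.
            continue
    return table


def find_orphans(table, name_pattern):
    """Return (pid, comm) for processes whose parent is PID 1 and
    whose command name matches name_pattern."""
    rx = re.compile(name_pattern)
    return [(pid, comm) for pid, ppid, comm in table
            if ppid == 1 and rx.search(comm)]


if __name__ == "__main__":
    # "mpi_app" is a hypothetical executable name.
    for pid, comm in find_orphans(read_proc_table(), r"^mpi_app$"):
        print(pid, comm)
```

The script only lists candidates. Whether a listed process is really a 
leftover, and whether it is safe to kill, still has to be checked against 
the jobs currently scheduled on the node.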

When exactly did you experience the problems on this machine?

We do our best to support our customers. Sometimes, communication is 
hard due to language barriers and lack of steady contact. Other times, 
problems do not reach us because integrators/customers try to fix them 
internally. This is not perfect, but we tend to fix things that are broken.

Patrick
-- 
Patrick Geoffray
Myricom, Inc.
http://www.myri.com
