[Beowulf] lost in parallel computing

Joe Landman landman at scalableinformatics.com
Tue Dec 13 12:25:41 PST 2005


Hi Xiaoming:

CHEN, XIAOMING wrote:
> Dear all,
> 
> I've been practicing scientific parallel computing for 3~4 years, but as
> a remote user I never really touched the subjects on parallel computer
> management. Things work out if the remote computers I am working on are
> managed well. However, when they are not in good hands, they will go on
> 'strike' for a long time. 

Yes... it has been our experience that good hardware that is not well 
managed is painful to use.

> This is what I am experiencing now. One remote
> cluster just relocated recently and it lost Myrinet.

As in no longer functioning? Lots of possible reasons for this.

> A new cluster
> purchased from Dell hasn't been working since it was installed 3 months
> ago.

:(

I am going to hold my comment back on this.

> Another one has some strange behavior. For example, sometimes it
> writes data twice into a file in a random order; a user cannot kill his
> process unless he terminates the X Window session (i.e., exit).

Is this true of all applications or just one?  Sounds like something is 
not quite right with the I/O system (that's the limit of the diagnosis 
one can make from this information).  There are many possible reasons 
for this, and it would take time to isolate and fix.

> I guess during
> this holiday season nobody will stand out to solve the problem. 

Not true.

> But it
> seems such problems will continue to exist and evolve as computer
> technologies evolve themselves. I am wondering if an inexpensive but
> robust parallel execution environment is possible to build.

Yes.  Depends upon how well it is designed and implemented.  To make 
something easy to manage, much complexity needs to be eschewed.

>  If it is so
> difficult to maintain a parallel computer, how can we persuade people to
> invest money in parallel computers? 

It shouldn't be difficult.

> 
> 
> This is the first time for me to post a message. Please kindly remind me
> if I do not follow the rules. I appreciate your response. 
> 
> Xiaoming Chen
> University of South Carolina
> 

-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 734 786 8452
cell : +1 734 612 4615


