[Beowulf] Personal Introduction & First Beowulf Cluster Question
gus at ldeo.columbia.edu
Mon Dec 8 10:45:23 PST 2008
Hello Steve and list
In the likely case that the original vendor no longer supports this
5-year-old cluster,
you can try installing the Rocks cluster suite, which is free from SDSC,
and which you already came across.
This would be the path of least resistance, and may get your cluster up and
running again with relatively little effort.
Of course there are many other solutions, but they may require more effort
from the system administrator.
Rocks is well supported and documented.
It is based on CentOS (free version of RHEL).
There is no support for SLES on Rocks,
so if you must keep the current OS distribution, it won't work for you.
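For what it's worth, once a Rocks frontend (head node) is installed,
pushing the OS out to the compute nodes is largely automatic.
A rough sketch of the procedure (node names like compute-0-0 are just
the Rocks naming convention):

```
# On the Rocks frontend, after the frontend install finishes:
insert-ethers --appliance compute   # listen for DHCP/PXE requests

# Now power on each compute node in turn (set to boot from the network).
# insert-ethers captures its MAC address, names it compute-0-0,
# compute-0-1, ..., and kicks off a fully automatic install on it.
```

No manual OS install on the 64 nodes, which is exactly what you want
to avoid.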
I read your last paragraph, but you may argue to your bosses that the
age of this
machine doesn't justify being picky about the particular OS flavor.
Bringing it back to life and making it a useful asset,
with a free software stack, would be a great benefit.
You would spend money only on application software (e.g. a Fortran
compiler, Matlab, etc.).
Other solutions (e.g. Moab) cost money, and may not work with
this old hardware.
Sticking to SLES may be a catch-22, a shot in the foot.
Rocks has a relatively large user base, and an active mailing list for help.
Moreover, Rocks minimally requires 1 GB of RAM on every node,
two Ethernet ports on the head node, and one Ethernet port on each
compute node.
Check the hardware you have.
Although PXE boot capability is not strictly required, it makes
installation much easier.
Check your motherboard and BIOS.
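Rocks sets up the PXE plumbing (DHCP plus TFTP) for you, but for
reference, the network side of a PXE boot is just a DHCP stanza
pointing the node at a TFTP server. The addresses and filename below
are illustrative only; Rocks writes its own version of this:

```
# Illustrative dhcpd.conf fragment -- Rocks generates the real one.
subnet 10.1.0.0 netmask 255.255.0.0 {
    next-server 10.1.1.1;        # TFTP server (the head node)
    filename "pxelinux.0";       # PXE bootloader served over TFTP
    range 10.1.255.200 10.1.255.254;
}
```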
I have a small cluster made of five salvaged dual-CPU Dell Precision 410s,
running Rocks 4.3, and it works well.
For old hardware Rocks is a very good solution, requiring a modest
investment of time,
and virtually no money.
(In my case I only had to buy cheap SOHO switches and Ethernet cables,
but you probably already have switches.)
If you are going to run parallel programs with MPI,
the cheapest thing would be to have GigE ports and switches.
I wouldn't invest in a fancier interconnect for such an old machine.
(Do you have any fancier interconnect already, say Myrinet?)
However, you can buy cheap GigE NICs for $15-$20, and high-end ones (say,
Intel Pro/1000) for $30 or less.
This would be needed only if the nodes don't already have GigE ports.
Probably your motherboards have dual GigE ports, but I don't know.
MPI over 100T Ethernet is a real pain; don't do it unless you have no
alternative.
A 64-port GigE switch to support the MPI traffic would also be a worthwhile
investment.
Keeping MPI on a separate network, distinct from the I/O and cluster
control net, is a good thing.
It avoids contention and improves performance.
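With Open MPI, for instance, you can pin the MPI traffic to the
dedicated interface through an MCA parameter file. A minimal sketch
(eth1 is an assumption; use whatever interface sits on your MPI-only
network):

```
# $HOME/.openmpi/mca-params.conf -- restrict MPI's TCP transport to
# the dedicated network.  eth1 is an assumed interface name.
btl = tcp,self
btl_tcp_if_include = eth1
```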
A natural precaution would be to back up all home directories,
and any precious data or filesystems, before you start.
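Nothing fancy is needed for the backup; tar into a dated directory on
spare storage is enough. A minimal sketch, demonstrated on throwaway
directories so it runs anywhere (on the real cluster SRC would be /home
and DEST spare storage that survives the reinstall, ideally on another
machine):

```shell
#!/bin/sh
# Tar up each home directory into a dated backup area.
SRC=$(mktemp -d)                        # stand-in for /home
DEST=$(mktemp -d)/homes-$(date +%Y%m%d) # stand-in for spare storage
mkdir -p "$DEST" "$SRC/alice" "$SRC/bob"
echo "thesis draft" > "$SRC/alice/notes.txt"

for d in "$SRC"/*; do
    u=$(basename "$d")
    # One compressed archive per user; -C keeps paths relative to $SRC.
    tar czf "$DEST/$u.tar.gz" -C "$SRC" "$u"
done
ls "$DEST"
```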
I suggest sorting out the hardware issues before anything else.
It would be good to evaluate the status of your RAID,
and perhaps use that particular node as a separate storage appliance.
You can try just rebuilding the RAID, and see if it works, or perhaps
replace the defective disk(s),
if the RAID controller is still good.
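If the array turns out to be Linux software RAID rather than a hardware
controller, the kernel and mdadm report its state directly. A hedged
sketch (the device names are assumptions; hardware controllers have
their own vendor BIOS utilities or CLIs instead):

```
cat /proc/mdstat                 # quick view of all software-RAID arrays
mdadm --detail /dev/md0          # per-array state; lists failed members
mdadm /dev/md0 --add /dev/sdb1   # after swapping a bad disk: re-add it,
                                 # which starts the rebuild
```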
Another thing to look at is whether your Ethernet (or GigE)
switch or switches are functional,
and, if you have more than one switch, how they are or can be connected to
each other.
(One for the whole cluster? Two or more separate ones? Some specific topology
connecting many switches?)
I hope this helps,
Gustavo J. Ponce Correa, PhD - Email: gus at ldeo.columbia.edu
Lamont-Doherty Earth Observatory - Columbia University
P.O. Box 1000 [61 Route 9W] - Palisades, NY, 10964-8000 - USA
Steve Herborn wrote:
> Good day to the group. I would like to make a brief introduction to
> myself and raise my first question to the forum.
> My name is Steve Herborn and I am a new employee at the United States
> Naval Academy in the Advanced Research Computing group which supports
> the IT systems used for faculty research. Part of my responsibilities
> will be the care & feeding of our Beowulf Cluster which is a
> commercially procured Cluster from Aspen Systems. It purchased &
> installed about four or five years ago. As delivered the system was
> originally configured with two Head nodes each with 32 compute nodes.
> One head node was running SUSE 9.x and the other Head Node was running
> Scyld (version unknown) also with 32 compute nodes. While I don’t
> know all of the history, apparently this system was not very actively
> maintained and had numerous hardware & software issues, including
> losing the array on which Scyld was installed. Prior to my arrival a
> decision was made to reconfigure the system from having two different
> head nodes running two different OS Distributions to one Head Node
> controlling all 64 Compute Nodes. In addition SUSE Linux Enterprise
> Server (10SP2) (X86-64) was selected as the OS for all of the nodes.
> Now on to my question, which will more than likely be the first of
> many. In the collective group wisdom what would be the most efficient
> & effective way to “push” the SLES OS out to all of the compute nodes
> once it is fully installed & configured on the Head Node. In my
> research I’ve read about various Cluster packages/distributions that
> have that capability built in, such as ROCKS & OSCAR which appear to
> have the innate capability to do this as well as some additional tools
> that would be very nice to use in managing the system. However, from
> my current research it appears that they do not support SLES 10sp2 for
> the AMD 64-bit Architecture (although since I am so new at this I
> could be wrong). Are there any other “free” (money is always an issue)
> products or methodologies I should be looking at to push the OS out &
> help me manage the system? It appears that a commercial product Moab
> Cluster Builder will do everything I need & more, but I do not have
> the funds to purchase a solution. I also certainly do not want to
> perform a manual OS install on all 64 Compute Nodes.
> Thanks in advance for any & all help, advice, guidance, or pearls of
> wisdom that you can provide this Neophyte. Oh and please don’t ask why
> SLES 10sp2, I’ve already been through that one with management. It is
> what I have been provided & will make work.
> Steven A. Herborn
> U.S. Naval Academy
> Advanced Research Computing
> 410-293-6480 (Desk)
> 757-418-0505 (Cell)
>Beowulf mailing list, Beowulf at beowulf.org
>To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf