[Beowulf] Remote console management
Michael Will
mwill at penguincomputing.com
Sun Sep 25 13:35:44 PDT 2005
This will of course impact the performance of
the compute jobs whenever I/O happens.
An alternative approach could be to reshuffle the money from the
distributed local storage setup you sketched out: have cheaper diskless
(and therefore almost stateless) compute nodes (or nodes with a single
non-RAIDed drive as scratch space for intermediate results), plus a
gang of storage nodes that act as dedicated access points to a bunch of
iSCSI- or Fibre Channel-attached drive enclosures.
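As a minimal sketch of what attaching one of those storage-node exports
to a diskless compute node could look like, assuming the open-iscsi
tools (the portal host, target IQN, device name, and mount point below
are all illustrative, not from the setup above):

    # Sketch: attach an iSCSI-exported scratch volume on a diskless
    # compute node, assuming open-iscsi is installed. The portal,
    # target IQN, device, and mount point are hypothetical.
    import subprocess

    PORTAL = "storage01:3260"                     # hypothetical storage node
    TARGET = "iqn.2005-09.net.example:scratch42"  # hypothetical target IQN

    def run(cmd):
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True)

    # Discover targets offered by the storage node, then log in.
    run(["iscsiadm", "-m", "discovery", "-t", "sendtargets", "-p", PORTAL])
    run(["iscsiadm", "-m", "node", "-T", TARGET, "-p", PORTAL, "--login"])

    # After login the LUN appears as a local SCSI disk (e.g. /dev/sdb),
    # which can then be mounted as scratch space.
    run(["mount", "/dev/sdb1", "/scratch"])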
Michael
Bruce Allen wrote:
> Doug,
>
> Good to "see you" in this discussion -- I think this thread would be
> the basis for a nice article.
>
> Spending the $$$ to buy some extra nodes won't work in our case. We
> don't just use the cluster for computing, we also use it for data
> storage. Each of the 400+ nodes will have four 250GB disks and a
> hardware RAID controller (3ware 9500 or Areca 1110). If a node is
> acting odd, we'd like to be able to diagnose/fix/reboot/restore it
> quickly if possible. To replicate the data from a distant tape-backed
> repository will take many hours. So having some 'extra' machines
> doesn't help us so much, since we wouldn't know what data to keep on
> them, and moving the data onto them when needed would normally take
> much longer than bringing back to life the node that's gone down.
>
> Cheers,
> Bruce
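To put rough numbers behind "many hours": a back-of-the-envelope
calculation using the 4 x 250 GB per-node figure quoted above. The
sustained transfer rate is an assumption (optimistic gigabit Ethernet);
pulling the data from a distant tape-backed repository would be far
slower still.

    # Back-of-the-envelope restore time for one node's data, using the
    # 4 x 250 GB figure from the message above. The sustained rate is
    # an assumption (best-case gigabit Ethernet).
    capacity_gb = 4 * 250          # 1000 GB of data per node
    rate_mb_s = 100                # ~GigE wire speed, best case
    seconds = capacity_gb * 1000 / rate_mb_s
    print(f"{seconds / 3600:.1f} hours")   # ~2.8 hours, and that's the floor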
>
>
> On Sat, 24 Sep 2005, Douglas Eadline wrote:
>
>>
>>> We're getting ready to put together our next large Linux compute
>>> cluster.
>>> This time around, we'd like to be able to interact with the machines
>>> remotely. By this I mean that if a machine is locked up, we'd like
>>> to be
>>> able to see what's on the console, power cycle it, mess with BIOS
>>> settings, and so on, WITHOUT having to drive to work, go into the
>>> cluster
>>> room, etc.
>>>
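One concrete way to get that kind of access, sketched below, is IPMI:
Serial over LAN gives you the node's text console from your desk, and
the same tool can hard power-cycle the box. This assumes the BMCs speak
IPMI 2.0 and ipmitool is available; the BMC hostname and credentials
are placeholders.

    # Sketch: remote console and power control for a hung node,
    # assuming an IPMI 2.0 BMC reachable over the network and ipmitool
    # installed. Hostname and credentials are placeholders.
    import subprocess

    BMC = "node042-ipmi"           # hypothetical BMC hostname
    AUTH = ["-I", "lanplus", "-H", BMC, "-U", "admin", "-P", "secret"]

    # Watch the node's serial console remotely...
    subprocess.run(["ipmitool", *AUTH, "sol", "activate"])

    # ...and hard power-cycle it if it is truly wedged.
    subprocess.run(["ipmitool", *AUTH, "chassis", "power", "cycle"])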
>> This brings up an interesting point, and I realize this does come
>> down to a design philosophy, but cluster economics sometimes create
>> non-standard solutions. So here is another way to look at "out of
>> band monitoring". Instead of adding layers of monitoring and control,
>> why not take that cost and buy extra nodes (but make sure you have a
>> remote hard power-cycle capability)? If a node dies and cannot be
>> rebooted, turn it off and fix it later. Of course monitoring fans and
>> temperatures is a good thing (tm), but if a node will not boot, and
>> you have to play with the BIOS, then I would consider it broken.
>>
>> Because you have "over capacity" in your cluster (you bought extra
>> nodes), this does not impact the amount of work that needs to get
>> done. Indeed, prior to the failure you can have the extra nodes
>> working for you. You fully understand that at various times one or
>> two nodes will be offline. They are taken out of the scheduler, and
>> there is no need to fix them right away.
>>
>> This approach also depends on what you are doing with your cluster,
>> the cost of nodes, etc. In some cases out-of-band access is a good
>> thing. In other cases, the "STONITH-AFIT" (shoot the other node in
>> the head and fix it tomorrow) approach is also reasonable.
>>
>>
>> --
>> Doug
>>
>> check out http://www.clustermonkey.net
>>
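A sketch of what that "take it offline, shoot it, fix it tomorrow"
workflow could look like, assuming a Torque/PBS scheduler (pbsnodes)
and IPMI-reachable BMCs; the node name, BMC hostname, and credentials
are placeholders:

    # Sketch of the STONITH-AFIT workflow described above, assuming
    # Torque/PBS and ipmitool-reachable BMCs. Names and credentials
    # are placeholders.
    import subprocess

    NODE = "node042"               # hypothetical dead node
    BMC = NODE + "-ipmi"           # hypothetical BMC hostname

    def run(cmd):
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True)

    # 1. Mark the node offline in the scheduler so no new jobs land on it.
    run(["pbsnodes", "-o", NODE])

    # 2. Hard power it off remotely; it gets fixed on the next site visit.
    run(["ipmitool", "-I", "lanplus", "-H", BMC, "-U", "admin",
         "-P", "secret", "chassis", "power", "off"])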
--
Michael Will
Penguin Computing Corp.
Sales Engineer
415-954-2822
415-954-2899 fx
mwill at penguincomputing.com