[Beowulf] Remote console management

Joe Landman landman at scalableinformatics.com
Sun Sep 25 19:09:03 PDT 2005


   If the point is read speed over data safety, you really want to be 
looking at RAID0, and software RAID0 at that.  I am currently sustaining 
in excess of 110 MB/s on a few real-world codes (i.e. not 
microbenchmarks like bonnie) on a 2-disk RAID0 array attached to a 
machine we use for testing.   If you want safety and speed, you should 
look seriously at RAID10 solutions.  RAID5 and variants are safe against 
the loss of a single disk (as is RAID10), but you have that annoying 
read-modify-write penalty to contend with.   You could build the RAID10 
in software, without the hardware RAID controllers (lowering cost per 
node), and drastically increase your read performance (RAID1 
implementations are typically faster on reads, as they can amortize 
large reads across two or more drives as long as the array is in an 
operational/non-degraded state).  You will lose a little on the storage 
side (you will get 500 GB/machine with 250 GB disks versus 750 
GB/machine with the RAID5 version).  If you really need the larger 
capacity, you could simply go for the larger 500 GB disks from Hitachi.  
This may be price neutral or slightly more costly than the RAID5, but it 
gives you 1 TB of disk from which you should be able to sustain a 
somewhat higher read speed.
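To make the capacity trade-off concrete, here is a quick sketch of the 
usable-capacity arithmetic for the 4 x 250 GB per-node configuration 
being discussed.  The overheads are just the standard ones for each RAID 
level; nothing here is specific to any particular controller:

```python
# Usable capacity of n identical disks under common RAID levels.
def usable_gb(level, n_disks, disk_gb):
    if level == "raid0":       # pure striping, no redundancy
        return n_disks * disk_gb
    if level == "raid10":      # mirrored pairs, then striped across pairs
        return (n_disks // 2) * disk_gb
    if level == "raid5":       # one disk's worth of capacity goes to parity
        return (n_disks - 1) * disk_gb
    raise ValueError(level)

for level in ("raid0", "raid10", "raid5"):
    print(level, usable_gb(level, 4, 250), "GB")
# raid0 1000 GB, raid10 500 GB, raid5 750 GB
```

which is where the 500 GB/machine (RAID10) versus 750 GB/machine (RAID5) 
figures above come from.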

Also, since you are interested in read speed, there are a number of 
tunables you can play with under the 2.6 kernel to help.  In addition, 
the choice of file system is absolutely critical here.  If you are doing 
large-block sequential reads/writes, there really aren't too many good 
alternatives to xfs right now (jfs is possible).   Couple that tuning 
with stripe-aware settings in the xfs file system, and that system would 
be fast.
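For what it's worth, a sketch of the kind of setup this implies.  The 
device names and the 64 KB chunk size are just examples; the su/sw 
geometry handed to mkfs.xfs has to match how the array was actually 
built:

```shell
# Build a 2-disk software RAID0 with a 64 KB chunk (example devices)
mdadm --create /dev/md0 --level=0 --raid-devices=2 \
      --chunk=64 /dev/sda1 /dev/sdb1

# Make the xfs file system stripe-aware:
#   su = stripe unit (the md chunk size), sw = number of data disks
mkfs.xfs -d su=64k,sw=2 /dev/md0

# Bump readahead for large sequential reads (value in 512-byte sectors)
blockdev --setra 8192 /dev/md0
```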


Bruce Allen wrote:
>> An alternative approach could be to reshuffle the money from the 
>> distributed local storage that you sketched out, and have cheaper 
>> diskless (and therefore almost stateless) compute nodes (or with a 
>> non-raided single drive for scratchspace of intermediate results) plus 
>> a gang of storage nodes that are dedicated access points to a bunch of 
>> iscsi or fibre attached drive enclosures.
> Nope -- not enough bandwidth to the data.  With our current plan our 
> bandwidth to the data will be 400 x 100 MB/sec = 40 GB/sec.  This is 
> enough to read ALL 400 TB of data on the cluster in 10000 sec, or about 
> three hours.  You can't even come close with centralized 
> (non-distributed) storage systems.
> Cheers,
>     Bruce
>> Bruce Allen wrote:
>>> Doug,
>>> Good to "see you" in this discussion -- I think this thread would be 
>>> the basis for a nice article.
>>> Spending the $$$ to buy some extra nodes won't work in our case.  We 
>>> don't just use the cluster for computing, we also use it for data 
>>> storage. Each of the 400+ nodes will have four 250GB disks and a 
>>> hardware RAID controller (3ware 9500 or Areca 1110).  If a node is 
>>> acting odd, we'd like to be able to diagnose/fix/reboot/restore it 
>>> quickly if possible.  To replicate the data from a distant 
>>> tape-backed repository will take many hours. So having some 'extra' 
>>> machines doesn't help us so much, since we wouldn't know what data to 
>>> keep on them, and moving the data onto them when needed would 
>>> normally take much longer than bringing back to life the node that's 
>>> gone down.
>>> Cheers,
>>>     Bruce
>>> On Sat, 24 Sep 2005, Douglas Eadline wrote:
>>>>> We're getting ready to put together our next large Linux compute 
>>>>> cluster.
>>>>> This time around, we'd like to be able to interact with the machines
>>>>> remotely.  By this I mean that if a machine is locked up, we'd like 
>>>>> to be
>>>>> able to see what's on the console, power cycle it, mess with BIOS
>>>>> settings, and so on, WITHOUT having to drive to work, go into the 
>>>>> cluster
>>>>> room, etc.
>>>> This brings up an interesting point and I realize this does come 
>>>> down to
>>>> a design philosophy, but cluster economics sometimes create 
>>>> non-standard
>>>> solutions. So here is another way to look at "out of band monitoring".
>>>> Instead of adding  layers of monitoring and control, why not take that
>>>> cost and buy extra nodes. (but make sure you have a remote hard power
>>>> cycle capability). If a node dies and cannot be rebooted, turn it 
>>>> off, and
>>>> fix it later. Of course monitoring fans and temperatures is a good 
>>>> thing
>>>> (tm), but if a node will not boot, and you have to play with the BIOS, 
>>>> then
>>>> I would consider it broken.
>>>> Because you have "over capacity" in your cluster (you bought extra 
>>>> nodes)
>>>> this does not impact the amount of work that needs to get done. Indeed, 
>>>> prior
>>>> to the failure you can have the extra nodes working for you. You fully
>>>> understand that at various times one or two nodes will be offline. They
>>>> are taken out of the scheduler and there is no need to fix them right
>>>> away.
>>>> This approach also depends on what you are doing with your
>>>> cluster and the cost of nodes etc. In some cases out-of-band access
>>>> is a good thing. In other cases, the "STONITH-AFIT" (shoot the other 
>>>> node
>>>> in the head and fix it tomorrow) approach is also reasonable.
>>>> -- 
>>>> Doug
>>>> check out http://www.clustermonkey.net
>>> _______________________________________________
>>> Beowulf mailing list, Beowulf at beowulf.org
>>> To change your subscription (digest mode or unsubscribe) visit 
>>> http://www.beowulf.org/mailman/listinfo/beowulf
>> -- 
>> Michael Will
>> Penguin Computing Corp.
>> Sales Engineer
>> 415-954-2822
>> 415-954-2899 fx
>> mwill at penguincomputing.com
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit 
> http://www.beowulf.org/mailman/listinfo/beowulf
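A quick check of the aggregate-bandwidth arithmetic in Bruce's reply 
above, using the decimal units (1 GB = 1000 MB) conventional for disk 
throughput:

```python
# Aggregate read bandwidth of the distributed design Bruce describes:
nodes = 400
per_node_mb_s = 100                      # MB/s sustained per node
total_gb_s = nodes * per_node_mb_s / 1000.0
print(total_gb_s, "GB/s")                # 40.0 GB/s

# Time to stream the full 400 TB data set at that rate:
data_tb = 400
seconds = data_tb * 1000.0 / total_gb_s
print(seconds, "s =", seconds / 3600.0, "hours")
# 10000 s, i.e. a bit under three hours
```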

Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 734 786 8452
cell : +1 734 612 4615
