[Beowulf] SATA II - PXE+NFS - diskless compute nodes
mwill at penguincomputing.com
Fri Dec 8 13:14:31 PST 2006
Geoff Jacobs wrote:
> Mark Hahn wrote:
>> it's interesting that SAS advertising has obscured the fact that SAS is
>> just a further development of SCSI, and not interchangeable
>> with SATA. for instance, no SATA controller will support any SAS disk,
>> and any SAS setup uses a form of encapsulation to communicate with
>> the foreign SATA protocol. SAS disks follow the traditional price
>> formula of SCSI disks (at least 4x more than non-boutique disks),
>> and I suspect the rest of SAS infrastructure will be in line with that.
> Yes, SAS encapsulates SATA, but not vice-versa. The ability to use a
> hardware raid SAS controller with large numbers of inexpensive SATA
> drives is very attractive. I was also trying to be thorough.
>>> and be mindful of reliability issues with desktop drives.
>> I would claim that this is basically irrelevant for beowulf.
>> for small clusters (say, < 100 nodes), you'll be hitting a negligible
>> number of failures per year. for larger clusters, you can't afford
>> any non-ephemeral install on the disks anyway - reboot-with-reimage
>> should only take a couple minutes more than a "normal" reboot.
>> and if you take the no-install (NFS root) approach (which I strongly
>> recommend) the status of a node-local disk can be just a minor node
>> property to be handled by the scheduler.
> PXE/NFS is absolutely the slickest way to go, but any service nodes
> should have some guarantee of reliability. In my experience, disks
> (along with power supplies) are two of the most common points of failure
Most of the clusters we configure for our customers use diskless compute
nodes, precisely for the reason you mentioned: to minimize compute node
failures. The exceptions are when the application can benefit from
additional local scratch space (e.g. software RAID 0 over four SATA
drives allows reading and writing large data streams at 280 MB/s in a 1U
server with 3 TB of disk space on each compute node), or when customers
sometimes need to run jobs that require more virtual memory than they
can afford to install physically, which calls for local swap space.
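To put the scratch-space figures in perspective, here is a rough sketch of the arithmetic in Python. It assumes ideal striping, so that throughput and capacity divide evenly across the four drives; that is an assumption of the sketch, not a measurement, and the helper names are made up for illustration.

```python
# Back-of-the-envelope numbers for the four-drive RAID 0 scratch example.
# Assumption (not from a measurement): ideal striping, so array-wide
# throughput and capacity divide evenly across the member drives.

def per_drive_share(array_total, drives):
    """Even per-drive share of an array-wide total."""
    return array_total / drives

def full_pass_hours(capacity_bytes, rate_bytes_per_s):
    """Hours needed to stream the whole array once at a sustained rate."""
    return capacity_bytes / rate_bytes_per_s / 3600

DRIVES = 4            # SATA drives striped with software RAID 0
ARRAY_MB_PER_S = 280  # quoted sequential throughput of the array
ARRAY_TB = 3          # quoted capacity per compute node

mb_per_s_per_drive = per_drive_share(ARRAY_MB_PER_S, DRIVES)
tb_per_drive = per_drive_share(ARRAY_TB, DRIVES)
hours = full_pass_hours(ARRAY_TB * 1e12, ARRAY_MB_PER_S * 1e6)

print(f"{mb_per_s_per_drive:.0f} MB/s and {tb_per_drive:.2f} TB per drive")
print(f"about {hours:.1f} hours to read or write the full array once")
```

In other words, each drive only has to sustain about a quarter of the quoted rate, and filling the whole scratch array once takes on the order of three hours at that rate.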
We find that customers don't typically want to pay the premium for
redundant power supplies, PDUs, and cabling for the compute nodes,
though; that is typically requested for head nodes and NFS servers.
Also, we find that NFS offloading on the NFS server with the rapidfile
card helps avoid scalability issues where the NFS server bogs down under
massively parallel requests from, say, 128 cores in a 32-node cluster of
dual-CPU, dual-core compute nodes. The rapidfile card is a PCI-X card
with two Fibre Channel ports, two GigE ports, and an NFS/CIFS offloading
processor on the same card. Since most bulk data transfer is redirected
from Fibre Channel to the GigE NFS clients without passing through the
NFS server's CPU and RAM, the NFS server's CPU load does not become the
bottleneck; we find it is rather the number of spindles before saturating the two
We configure clusters for our customers with Scyld Beowulf, which does
not use an NFS root but rather just NFS-mounts the home directories
because of its compute node model (PXE booting into RAM), and so does
not run into the NFS-root scalability issues.
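As a rough sanity check on the 32-node example above, the following Python sketch works out the client count and what an even split of the card's two GigE ports would give each client. The even split and the use of raw line rate with no protocol overhead are assumptions of the sketch, not measured figures, and the function names are hypothetical.

```python
# Fan-in arithmetic for the 32-node, dual-CPU, dual-core example.
# Assumptions (not measurements): both GigE ports run at raw line rate,
# protocol overhead is ignored, and bandwidth is shared evenly.

GIGE_MB_PER_S = 125  # 1 Gbit/s expressed in MB/s (10^9 bits / 8 / 10^6)

def total_cores(nodes, cpus_per_node, cores_per_cpu):
    """Number of cores that can issue NFS requests in parallel."""
    return nodes * cpus_per_node * cores_per_cpu

def even_share(aggregate_mb_per_s, clients):
    """Bandwidth per client if the aggregate is split evenly."""
    return aggregate_mb_per_s / clients

NODES = 32
GIGE_PORTS = 2

cores = total_cores(NODES, cpus_per_node=2, cores_per_cpu=2)
aggregate = GIGE_PORTS * GIGE_MB_PER_S
per_node = even_share(aggregate, NODES)
per_core = even_share(aggregate, cores)

print(f"{cores} cores sharing {aggregate} MB/s")
print(f"{per_node:.1f} MB/s per node, {per_core:.2f} MB/s per core")
```

Under those assumptions each node sees only a few MB/s when every node pulls data at once, which is one reason to keep the root filesystem in RAM and NFS-mount only what the jobs actually need.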
SE Technical Lead / Penguin Computing / www.penguincomputing.com