[Beowulf] HPC and SAN

Sat Dec 18 17:39:46 PST 2004

Michael Will wrote:

>Veritas has something called VxFS that could be used for that, and there are also special cluster filesystems

Hmmm... Last I heard VxFS was limited to 4 or 8 hosts.  Not very HPC like...

>like GFS and Lustre that are supposed to solve that problem. In that case, you can also have just some
>compute nodes act as storage nodes, so you don't need Fibre Channel cards in all of them. The
>storage nodes then act much like redundant NFS servers.

I remain skeptical of the value proposition for a SAN in a cluster.

In short, you need to avoid single points of information flow within 
clusters.  The best aggregate bandwidth you are going to get will come 
from local storage.  At 50+ MB/s per drive, a SATA drive in each compute 
node, multiplied by N compute nodes, rapidly outdistances every hardware 
storage design that I am aware of (save one), and it does so at a tiny 
fraction of the cost.  Unfortunately, you then have N namespaces for 
your files (think of the file URI as file://node/path/to/filename.ext, 
where the value of "node" varies).  Most codes assume a single shared 
storage area, or a common namespace for the files.  This is where the 
file systems folks earn their money (well, one does anyway, IMO).
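A back-of-the-envelope sketch of the arithmetic above, in Python.  The 
50 MB/s figure is the one quoted in this post; the node counts, node 
names, and function names are made up for illustration, and the result 
is an ideal upper bound that assumes every node touches only its own 
local disk.

```python
# Illustrative sketch only: node counts and node names are assumptions.
PER_DRIVE_MB_S = 50  # the "50+ MB/s" SATA figure from the post


def aggregate_bandwidth_mb_s(n_nodes, per_drive_mb_s=PER_DRIVE_MB_S):
    """Ideal aggregate bandwidth when each node reads only its local drive."""
    return n_nodes * per_drive_mb_s


def local_file_uris(nodes, path):
    """One URI per node: the same path names a different file on each node."""
    return [f"file://{node}{path}" for node in nodes]


if __name__ == "__main__":
    for n in (16, 64, 256):
        print(n, "nodes ->", aggregate_bandwidth_mb_s(n), "MB/s aggregate")
    # The N-namespaces problem: identical path, N distinct files.
    for uri in local_file_uris(["node001", "node002"], "/path/to/filename.ext"):
        print(uri)
```

The second function is the catch: the bandwidth scales with N, but so 
does the number of places a given path can live.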

>Another interesting case is PVFS (and hopefully soon PVFS2) that accumulates local storage of the
>nodes into a parallel virtual filesystem allowing distributed storage and access. In case of PVFS

Having used PVFS (or at least tried to use PVFS) for a project, I 
discovered rather quickly that some functionality was missing (soft 
links, etc.).  This resulted in large chunks of wrapper code not working 
(and no, it made no sense to change the wrapper code to suit this file 
system), and at least 2 MPI codes that I played with did not like it.  
I don't want to knock all the hard work that went into it, but I am not 
sure I would try PVFS2 without a very convincing argument that it 
implements the full Unix file system (POSIX) interfaces and that things 
work transparently.
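One cheap way to find out before committing wrapper code to a file 
system is to probe it.  The following is a minimal sketch in Python, 
assuming you have write access to a directory on the file system in 
question; the function name is hypothetical, and it checks soft links 
only, not the rest of the POSIX interface.

```python
import os
import tempfile


def supports_symlinks(directory):
    """Return True if a soft link can be created and resolved in `directory`."""
    # Work in a throwaway subdirectory so the probe cleans up after itself.
    with tempfile.TemporaryDirectory(dir=directory) as scratch:
        target = os.path.join(scratch, "target.txt")
        link = os.path.join(scratch, "link.txt")
        with open(target, "w") as fh:
            fh.write("probe\n")
        try:
            os.symlink(target, link)
        except (OSError, NotImplementedError):
            # The file system (or platform) refused to create the link.
            return False
        # The link must exist as a link and resolve back to the target.
        return os.path.islink(link) and (
            os.path.realpath(link) == os.path.realpath(target)
        )


if __name__ == "__main__":
    # Probes the system temp directory; point this at the mount to be tested.
    print(supports_symlinks(tempfile.gettempdir()))
```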

>the data is not distributed redundantly, which means that one node going down means part of
>your filesystem data disappears - so unless you have rock solid nodes connected to a UPS, this
>might be good only for a large fast /tmp.

There are alternatives to this that work today.
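To put the quoted concern about non-redundant distribution in numbers, 
here is a rough sketch.  It assumes node failures are independent and 
that every node holds part of the data; the per-node probability of 
being down (1% below) is a made-up figure for illustration, not a 
measurement.

```python
def prob_all_data_available(n_nodes, p_node_down):
    """Probability that every node is up, i.e. no part of the data is missing.

    Assumes independent failures and no redundancy across nodes.
    """
    return (1.0 - p_node_down) ** n_nodes


if __name__ == "__main__":
    # p_node_down = 0.01 is an assumed value, chosen only for illustration.
    for n in (16, 64, 256):
        print(n, "nodes ->", prob_all_data_available(n, 0.01))
```

Under these assumptions the chance of having the whole file system 
available falls off geometrically with the node count.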

-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
phone: +1 734 612 4615