[Beowulf] Re: scratch File system for small cluster

Jan Heichler jan.heichler at gmx.net
Thu Sep 25 10:44:24 PDT 2008


Hello Greg,

On Thursday, September 25, 2008, you wrote:


> Glen,
>
> I have had great success with the *right* 10GbE NIC and NFS.  The important things to consider are:


I have to say my experience was different. 



> How much bandwidth will your backend storage provide?  With 2 x 4Gb FC I'm guessing best case is 600 MB/s, but likely less.


600 MB/s is already a good value for SAN-based storage ;-)
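
As a back-of-envelope check (my own assumptions, not numbers from the thread): 4Gb FC carries roughly 400 MB/s of payload per link after 8b/10b encoding, so two links top out around 800 MB/s, and 600 MB/s is a plausible real-world ceiling. A quick Python sketch:

    # Rough 2 x 4Gb FC bandwidth estimate (illustrative assumptions).
    links = 2
    payload_mb_s_per_link = 400   # 4Gb FC payload after 8b/10b encoding
    efficiency = 0.75             # assumed protocol/array overhead

    theoretical = links * payload_mb_s_per_link        # 800 MB/s
    realistic = theoretical * efficiency               # ~600 MB/s
    print(f"theoretical: {theoretical} MB/s, realistic: ~{realistic:.0f} MB/s")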


> What access patterns do the "typical apps" have?
> All nodes read from a single file (no problem for NFS, and fscache may help even more)
> All nodes write to a single file (NFS may need some help or may be too slow even when tuned for this)
> All nodes read and write to separate files (NFS is fine if the files aren't too big for the OS to cache reasonably).

> The number of IO servers really is a function of how much disk throughput you have on the backend, frontend, and through the kernel/filesystem goo.  My experience is a 10GbE NIC from Myricom can easily sustain 500-700 MB/s if the storage behind it can and the access patterns aren't evil.


My experience was this: you get approximately half of what you see at the block-device level onto the network. I had a setup with 16 x 15k rpm SAS drives. A RAID5 across them showed 1.1 GB/s read (probably limited by the PCIe x8 slot) and 550 MB/s write (the controller was an LSI 8888ELP). Exporting this to a number of clients, I was not able to get more than approximately 500 MB/s read and 400 MB/s write. I can show the real measurements if that is of interest.
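
For reference, a minimal sketch of the kind of per-client streaming-write probe behind numbers like these (my own sketch, not the actual benchmark; the mount point is hypothetical). Run one instance on each client against the NFS mount and sum the reported rates:

    # Minimal per-client streaming-write probe (illustrative only).
    import os
    import time

    PATH = "/scratch/nfs_test.dat"   # hypothetical NFS-mounted path
    BLOCK = b"\0" * (1 << 20)        # 1 MiB writes
    TOTAL_MB = 4096                  # 4 GiB, large enough to defeat caching

    start = time.time()
    with open(PATH, "wb") as f:
        for _ in range(TOTAL_MB):
            f.write(BLOCK)
        f.flush()
        os.fsync(f.fileno())         # make sure the data reached the server
    elapsed = time.time() - start
    print(f"{TOTAL_MB / elapsed:.0f} MB/s")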

If you look at the hardware that was thrown at the problem, the result is a little pathetic.

My experience with Lustre is that it eats up 10 to 15% of the block-device speed, and the rest you get over the network.

So a cheap Lustre setup for scratch would probably consist of two servers with internal storage, exported to the cluster over 10GbE or InfiniBand. Internal storage is cheap, and it is easy to achieve 500+ MB/s on SATA drives. That way you can reach 1 GB/s with just two servers and 32 to 48 disks involved.
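
The arithmetic behind that (the per-disk rate and the overhead factor are my assumptions; 2008-era SATA drives stream somewhere around 60-80 MB/s each):

    # Back-of-envelope sizing for the two-server Lustre scratch above.
    servers = 2
    disks_per_server = 16      # low end of the 32-48 disks mentioned
    mb_s_per_disk = 65         # assumed streaming rate per SATA drive
    overhead = 0.5             # assumed RAID/filesystem/network losses

    per_server = disks_per_server * mb_s_per_disk * overhead
    aggregate = servers * per_server
    print(f"per server: ~{per_server:.0f} MB/s, aggregate: ~{aggregate:.0f} MB/s")
    # -> roughly 500 MB/s per server, ~1 GB/s aggregate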


> Other NICs from large and small vendors can fall apart at 3-4 Gb, so be careful and test the network first before assuming your FS is the troublemaker.  There are cheap switches with 2 or 4 10GbE CX4 connectors that make this much simpler and safer, with or without the parallel FS options.


I never tested anything but Myricom 10GbE, but you can find cheap Intel-based cards with CX4 (and I doubt that they are bad). The Dell PowerConnect 62xx series can give you cheap CX4 uplinks, and you get a decent switch that is stackable.



> Depending on how big/small and how "scratch" the need is... a big tmpfs/ramdisk can be fun :)


I once tried to export tmpfs via NFS; it didn't work out of the box.
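
For what it's worth, the usual stumbling block there (my guess at the cause, not verified against this setup) is that tmpfs has no persistent filesystem identity, so the kernel NFS server refuses the export unless you assign one explicitly with fsid= in /etc/exports:

    # /etc/exports -- hypothetical example; the fsid= option is the key part
    /scratch  192.168.0.0/24(rw,no_subtree_check,fsid=1)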

Bye, Jan