[Beowulf] Re: how large of an installation have people used NFS with? would 300 mounts kill performance?

Sat Oct 3 12:10:26 PDT 2009

Hi Rahul,

I implemented a custom NFS solution based on Gentoo Linux for a
cluster some time ago. It has been working fine so far, but it is a
very small cluster: an 8 node machine which will be upgraded to
16 this year. Still, some of the codes used there write a lot to disk,
so the NFS link could easily be saturated. There are no
funds for 10GbE, FC or Infiniband, so I decided to do NFS + local disk
for all compute nodes. The main goal was to keep all compute nodes
updated with little effort, not to increase I/O performance. It's
also easier to maintain than Lustre, so I went for it.
It goes something like this:

- The NFS server is the entry node and has RAID 1. It stores the base
install, which is not bootable, plus a small per-node copy of the
directories that must be writable (/etc and /var), with the rest
being bind mounted. I export those directories to the nodes.

- Nodes boot a kernel image by PXE and mount the exported filesystem
as /. At boot they write some files (not much data) to /etc and /var;
the rest is read only, with the exception of /tmp and /home (and some
swap, for safety reasons), which live on a single SAS disk in each
node. Scratch files typically go to either /home or /tmp, which keeps
the pressure off the single link to the NFS server. I have dedicated
one GbE port on each blade to the NFS shares, leaving the other one
for MPI, which we aren't using anyway because it's too slow for the
codes run there.

- All configuration and user management is done in the base install,
and the changes are then rsync'd to every node's copy of /etc and
/var. This is a fast procedure by now and can run on the fly without
problems for the compute nodes. Backup is also easy: it's just a
backup of the base install, which is always in an "unbootable" state,
with no redundant files.
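
The export and sync steps above could be sketched roughly as follows.
This is not the actual set of scripts, only a minimal illustration
assuming a hypothetical layout: /export/base for the base install and
/export/nodes/<node> for each node's writable /etc and /var. The
paths, the function names and the RSYNC override are all made up for
the example.

```shell
# Hypothetical layout (assumption); adjust to the real one.
BASE=${BASE:-/export/base}              # base install, never booted
NODES_DIR=${NODES_DIR:-/export/nodes}   # one writable tree per node

# Print /etc/exports lines: the base install read-only for the whole
# subnet, each node's private tree read-write for that node only.
gen_exports() {
    subnet=$1
    shift
    printf '%s %s(ro,sync,no_root_squash)\n' "$BASE" "$subnet"
    for node in "$@"; do
        printf '%s/%s %s(rw,sync,no_root_squash)\n' \
            "$NODES_DIR" "$node" "$node"
    done
}

# Copy the base /etc and /var into every node's private copy.
# Set RSYNC="echo rsync" to print the commands instead of running them.
sync_nodes() {
    for node in "$@"; do
        for dir in etc var; do
            ${RSYNC:-rsync} -a "$BASE/$dir/" "$NODES_DIR/$node/$dir/" \
                || return 1
        done
    done
}
```

With made-up node names, this would be used as
`gen_exports 192.168.0.0/24 node01 node02 >> /etc/exports`, then
`exportfs -ra`, then `sync_nodes node01 node02` after each change to
the base install. The sketch leaves out --delete on purpose so that
files the nodes wrote at boot are left alone; whether the real scripts
do the same is an assumption.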

So far it has been working great and adding nodes is very easy. I
would say something like this is feasible for 300 nodes, because it
puts so little pressure on the network: the nodes basically only load
the executable file and shared libraries at the start of a job and
that's it.
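
To put a rough number on that, here is a back-of-the-envelope sketch of
the worst case, where every node pulls its binaries through the
server's single link at the same moment with cold caches. The function
name and the input values are assumptions for illustration, not
measurements from this cluster.

```shell
# Worst-case seconds for all nodes to read their binaries through one
# server link at once (integer arithmetic, cold client caches).
startup_seconds() {
    nodes=$1         # number of compute nodes
    mb_per_node=$2   # MB of executable + libraries per node (assumed)
    link_mb_s=$3     # usable server link throughput in MB/s (assumed)
    echo $(( nodes * mb_per_node / link_mb_s ))
}
```

For example, `startup_seconds 300 50 100` prints 150: with a
hypothetical 50 MB per node and 100 MB/s as a round figure for one GbE
link, 300 nodes starting together would need about two and a half
minutes, once per job, after which the NFS link is mostly idle.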

I can provide the scripts I have set up to do this if you want to take
a look at them.

Best regards,
Tiago Marques

On Thu, Sep 24, 2009 at 10:54 PM, Rahul Nabar <rpnabar at gmail.com> wrote:
> On Thu, Sep 10, 2009 at 11:18 AM, Joe Landman
> <landman at scalableinformatics.com> wrote:
>
>>
>> root at dv4:~# mpirun -np 4 ./io-bm.exe -n 32 -f /data2/test/file -r -d  -v
>
> In order to see how good (or bad) my current operating point is I was
> trying to replicate your test. But what is "io-bm.exe"? Is that some
> proprietary code or could I have it to run a similar test?
>
> --
> Rahul