[Beowulf] NFS alternative for 200 core compute (beowulf) cluster

Robert Taylor rgt at wi.mit.edu
Thu Aug 10 19:43:34 UTC 2023


Two 4TB spinning drives are not going to have a lot of throughput, and with
40 tasks all working on different files, if it's random I/O, I think they
will get crushed.
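
If you want to quantify that, a quick fio run from one compute node gives a
rough picture of random I/O against the share (just a sketch; it assumes fio
is installed and the share is mounted at /mnt/nfsshare, so adjust the path
and job count to match your setup):

fio --name=randread --directory=/mnt/nfsshare --rw=randread --bs=4k \
    --size=1G --numjobs=8 --runtime=60 --time_based --direct=1 \
    --group_reporting

Compare the IOPS and bandwidth it reports with the sequential numbers below;
on two striped HDDs the random figures will likely be a small fraction.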

What are the sequential read and write rates from any one node doing
single-threaded I/O to the NFS server?

Can you do a dd test?
This should write a 1 GB file straight from memory on the node it is run on
(the filename below is just an example; point it at a file inside the NFS
mount, not at the mount point itself):

dd if=/dev/zero of=/mnt/nfsshare/ddtest.bin bs=1M count=1000 conv=fdatasync

The conv=fdatasync makes dd wait for the data to actually reach the server
before it reports the rate, so the client page cache doesn't inflate the number.

(make sure ZFS compression is off on the dataset you're writing to, or the
all-zero data will compress away and give bogus numbers)
You should get a time summary and a throughput speed.
That is pure sequential I/O sourced from memory, which is probably the best
that one machine can do (unless the dd becomes CPU bound).
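
For the read direction, something along these lines should work (assuming the
test file written above exists; dropping the client page cache as root first
means you measure the network and server rather than local RAM):

sync && echo 3 > /proc/sys/vm/drop_caches
dd if=/mnt/nfsshare/ddtest.bin of=/dev/null bs=1M

And on the file server you can confirm compression really is off on the
dataset backing the share (substitute your own pool/dataset name):

zfs get compression <pool>/<dataset>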

We have some high-end NetApp and Isilon storage systems where I work, and
I've gotten between 400 MB/s and 1 GB/s out of NFS; the 1 GB/s case was, I
believe, bottlenecked at the source node, because all it had was a 10 GbE
connection to the network. Once I can get the nodes to 25 GbE I will test
again, but I'm not there yet.

Also, are you sure the storage traffic is going over IB and not GigE? (Is
the Cat6 link 1 Gb Ethernet, or do you have copper 10 GbE?)
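
A quick way to check which path the NFS traffic actually takes (the interface
names below are just examples, yours may differ):

nfsstat -m             # shows the server address and options per NFS mount
ip -s link show ib0    # watch the byte counters here and on eth0 while a dd
ip -s link show eth0   # test runs; whichever one climbs carries the traffic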

rgt





On Thu, Aug 10, 2023 at 3:29 PM Bernd Schubert <bernd.schubert at fastmail.fm>
wrote:

>
>
> On 8/10/23 21:18, leo camilo wrote:
> > Hi everyone,
> >
> > I was hoping to get some sage advice from you guys.
> >
> > At my department we have built this small prototyping cluster with 5
> > compute nodes, 1 name node and 1 file server.
> >
> > Up until now, the name node contained the scratch partition, which
> > consisted of 2x4TB HDDs forming an 8 TB striped ZFS pool. The pool is
> > shared to all the nodes using NFS. The compute nodes and the name node
> > are connected with both Cat6 Ethernet cable and InfiniBand. Each compute
> > node has 40 cores.
> >
> > Recently I have attempted to launch a computation from each node (40 tasks
> > per node), so 1 computation per node, and the performance was abysmal.
> > I reckon I might have reached the limits of NFS.
> >
> > I then realised that this was due to very poor performance from NFS. I
> > am not using stateless nodes, so each node has about 200 GB of local SSD
> > storage, and running directly from there was a lot faster.
> >
> > So, to solve the issue, I reckon I should replace NFS with something
> > better. I have ordered 2x4TB NVMe drives for the new scratch and I was
> > thinking of:
> >
> >   * using the 2x4TB NVMe drives in a striped ZFS pool and using a
> >     single-node GlusterFS to replace NFS
> >   * using the 2x4TB NVMe drives with GlusterFS in a distributed
> >     arrangement (still a single node)
> >
> > Some people told me to use Lustre, but I reckon that might be overkill.
> > And I would only use a single file server machine (1 node).
> >
> > Could you guys give me some sage advice here?
> >
>
> So GlusterFS is using FUSE, which doesn't have the best performance
> reputation (although hopefully not for long - feel free to search for
> "fuse" + "uring").
>
> If you want to avoid the complexity of Lustre, maybe look into BeeGFS. Well,
> I would recommend looking into it anyway (as a former developer I'm biased,
> of course ;) ).
>
>
> Cheers,
> Bernd
>