[Beowulf] diskless booting over jumbo frames

Amrik Singh asingh at ideeinc.com
Wed Apr 25 08:02:30 PDT 2007


Bogdan Costescu wrote:
> On Tue, 24 Apr 2007, Mark Hahn wrote:
>
>> so the main question is whether jumbo is worth the effort.
>
> I would rephrase it to say: whether jumbo is worth the effort for the 
> root FS. When I used NFSroot, most of the traffic was queries of 
> file/dir existence and file/dir attributes, which are small, so a 
> large maximum packet size would not help. Furthermore, most of the 
> files accessed were small which means that the client could be quite 
> successful in caching them for long times and the actual transfer (if 
> the cache is emptied) would not take too long.
>
I agree that jumbo frames would not be a great help with the root file 
system, but we hope to get better performance out of the other NFS 
servers. Since all machines on the same subnet have to use jumbo frames, 
I have to boot the nodes from a server that has jumbo frames enabled. 
(Otherwise I would need an extra ethernet card in every node just for 
booting, so that the boot server could sit on a separate subnet with a 
1500-byte MTU.)

We are quite sure that our current bottlenecks are at the NFS level: 
neither the hard drives nor the ethernet are saturated. Even though NFS 
is extremely slow, copying files between a client and a server over scp 
is still very fast. We have tried all the usual ways of tuning NFS for 
better performance (increasing the number of NFS daemons on the servers, 
changing rsize and wsize, TCP vs. UDP, async vs. sync, noatime, timeo). 
The only thing we have not been able to try yet is jumbo frames. We 
could redistribute our data across even more NFS servers, but that is 
not possible with the current state of the application. If we don't 
find a solution soon, we may have to give up on NFS and try a clustered 
file system instead.
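
For reference, the client mount options we have been experimenting with 
look roughly like the /etc/fstab line below (the server name, export 
path and values are examples from our tests, not a recommendation), and 
the server-side thread count is raised via RPCNFSDCOUNT (in 
/etc/sysconfig/nfs or /etc/default/nfs-kernel-server, depending on the 
distribution):

# example client-side mount of one of the data servers
nfsserver1:/data  /data  nfs  rw,hard,intr,tcp,rsize=32768,wsize=32768,timeo=600,noatime  0 0

# example server-side setting, then restart the nfs service
RPCNFSDCOUNT=32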


> I think that it is more important to think thoroughly the placement 
> and exporting of the files on the NFS server. If you can manage to 
> export a single directory which is mounted as-is by the clients and 
> have the few client-specific files either mounted one by one or 
> copied/generated on the fly and placed on a tmpfs (and linked from 
> there), you can speed up the serving of the files, as the most 
> accessed files will stay in the cache of the server. The Stateless 
> Linux project from Red Hat/Fedora used such a system (single root FS 
> then client-specific files mounted one by one) last time I looked at it.
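
If I understand this correctly, the client-specific files would be 
handled at boot time by something along these lines (just a sketch with 
made-up paths, not something we have tried):

# early in each node's boot sequence: RAM-backed area for per-node files
mount -t tmpfs tmpfs /var/node
# copy or generate the node-specific files there (hypothetical layout)
cp /nfs-shared/node-configs/$(hostname -s)/fstab /var/node/fstab
# the shared read-only root already contains symlinks such as
#   /etc/fstab -> /var/node/fstab
# so every client sees its own copy while the server exports a single tree
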
>
>> here's a cute hack: if you're using pxelinux, it has an "ipappend" 
>> feature,
>> ...
>> I haven't had time to try this...
>
> It works as you described it.
>
> But even the first idea that you mentioned, using dhclient to get an 
> IP would work just as fine if the number of nodes is not too big - I 
> have 100+ nodes configured that way, with 2 DHCP requests per boot of 
> node (the PXE one and the dhclient one) as I was just too lazy to try 
> to eliminate the second DHCP request by re-using the info from PXE - 
> and the master node doesn't feel the load at all, although it is 
> hardware-wise the poorest machine from the cluster (as opposed to most 
> other clusters that I know of ;-)).
>
The way the nodes boot now, they use pxelinux to get an IP address and 
then download the kernel from the TFTP server. The pxelinux 
configuration is as follows:

DEFAULT bzImage
APPEND acpi=off debug earlyprintk=vga  initcall_debug console=tty0 
initrd=bzImage ramdisk=40960 root=/dev/nfs 
nfsroot=192.168.1.254:/slave/root ro ip=dhcp

Is it being suggested that the MTU can somehow be configured here? 
/sbin/dhclient-script will not be available until the NFS root is 
mounted. Am I missing something?
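
The only workaround I can think of so far (completely untested) would be 
to boot with a small initramfs whose init script raises the MTU before 
the NFS root is mounted, roughly like this (busybox tools assumed; the 
helper script path is made up):

#!/bin/sh
# minimal init: bring the NIC up with a jumbo MTU, get an address over
# DHCP, mount the NFS root and hand over to the real init
mount -t proc proc /proc
ifconfig eth0 up
ifconfig eth0 mtu 9000
udhcpc -i eth0 -s /udhcpc.script   # udhcpc applies the lease via this script
mount -t nfs -o ro,nolock 192.168.1.254:/slave/root /newroot
exec switch_root /newroot /sbin/init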



thanks

Amrik




