[Beowulf] PXE/kickstart and 10GBase-T issues
joshua.bakerlepain at gmail.com
Thu Feb 28 14:53:35 PST 2019
I've got a few-hundred node cluster here that I've had humming along
for several years. All the nodes are set to PXE boot. The default
entry in the PXE menu is to boot off the local hard drive, and we drop
in a kickstart if need be (new nodes, node refreshes, I just feel like
it, etc). I'm currently moving the cluster from CentOS-6 to CentOS-7.
At the same time, I have ~200 nodes with onboard 10GBase-T NICs
(X540-AT2 based) that had been plugged into 1Gbps switches (from
Brocade) that I'm moving over to 10Gbps switches (Cisco Nexus
C93120TX). The ones I'm currently working with have fairly short
cable runs (<7ft), and are using Cat 6a cables.
I'm running into a major issue where a large percentage (well over 50%)
of attempted PXE kickstarts fail. The failures occur in multiple
places, but all seem to be related to slow initialization of the
network interface. I've seen:
1) dracut-initqueue timeouts leading to "/dev/root does not exist"
2) the node loads the kickstart file but then fails while trying to
read the repo metadata.
3) the kickstart actually succeeds, but during reboot a bunch of
network services (NFS mounts, SGE, etc) attempt to start but fail
because the network isn't fully up yet.
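Since all three failure modes point at slow link initialization, one way to check that is to time how long the NIC takes to report carrier after it is brought up. The following is only a minimal sketch: it assumes a Linux node where the kernel exposes /sys/class/net/<iface>/carrier, and the function name, defaults, and the sysfs_root argument are made up for illustration.

```python
import os
import time


def wait_for_carrier(iface, timeout=120.0, poll=0.5, sysfs_root="/sys/class/net"):
    """Return seconds elapsed until <iface> reports carrier, or None on timeout.

    sysfs_root is a hypothetical override so the function can be pointed
    at a fake directory tree for testing.
    """
    path = os.path.join(sysfs_root, iface, "carrier")
    start = time.monotonic()
    while True:
        try:
            with open(path) as f:
                if f.read().strip() == "1":
                    return time.monotonic() - start
        except OSError:
            # Reading carrier fails while the interface is administratively
            # down or not present yet; treat that as "no link yet".
            pass
        if time.monotonic() - start >= timeout:
            return None
        time.sleep(poll)
```

Run right after the interface is brought up (for example wait_for_carrier("em1"), where "em1" is a placeholder interface name), the return value gives a rough per-node link-up time that can be compared between the 1Gbps and 10Gbps switch ports.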
To fix things, I've tried:
1) adding "inst.waitfornet=120 rd.net.timeout.carrier=120
rd.net.timeout.iflink=100 rd.net.timeout.ifup=120 rd.net.dhcp.retry=5"
to the kernel parameters in the PXE menu *and* the default grub
2) adding "LINKDELAY=120" to the ifcfg-$INTERFACE scripts (still using
the network service here, not NetworkManager)
3) turning on PortFast on the network ports, i.e. "spanning-tree port
Nothing has really made a huge difference. PortFast seemed to at
first, but larger scale tests still have rather high failure rates.
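When boot-time timeouts appear to have no effect, it can be worth confirming that they actually reached the running kernel's command line. The sketch below checks a command-line string for the parameter names listed above; the function names and the EXPECTED list are illustrative, and the simple whitespace split assumes no quoted values containing spaces.

```python
def parse_cmdline(text):
    """Split a kernel command line into a dict; bare flags map to None."""
    params = {}
    for token in text.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def missing_params(text, expected):
    """Return the expected parameter names that are absent from the command line."""
    present = parse_cmdline(text)
    return [name for name in expected if name not in present]


# Parameter names taken from the list of attempted fixes above.
EXPECTED = [
    "inst.waitfornet",
    "rd.net.timeout.carrier",
    "rd.net.timeout.iflink",
    "rd.net.timeout.ifup",
    "rd.net.dhcp.retry",
]
```

On a node, passing the contents of /proc/cmdline to missing_params(..., EXPECTED) returns an empty list if every parameter was handed to the kernel, and otherwise names the ones that were dropped.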
Has anyone seen anything like this? And, more importantly, has anyone
fixed it? Thanks!
QB3 Shared Cluster Sysadmin