[Beowulf] PXE/kickstart and 10GBase-T issues
engwalljonathanthereal at gmail.com
Thu Feb 28 16:31:36 PST 2019
Cisco's documentation on PortFast makes me wonder how it did you any good at
all during a transition. Any misconfiguration could block all ports, with
some configurations being flagged as "type-inconsistent."
I love these puzzles and will watch this carefully. Sorry I cannot be of
On Thu, Feb 28, 2019, 2:54 PM Joshua Baker-LePain <
joshua.bakerlepain at gmail.com> wrote:
> I've got a few-hundred node cluster here that I've had humming along
> for several years. All the nodes are set to PXE boot. The default
> entry in the PXE menu is to boot off the local hard drive, and we drop
> in a kickstart if need be (new nodes, node refreshes, I just feel like
> it, etc). I'm currently moving the cluster from CentOS-6 to CentOS-7.
> At the same time, I have ~200 nodes with onboard 10GBase-T NICs
> (X540-AT2 based) that had been plugged into 1Gbps switches (from
> Brocade) that I'm moving over to 10Gbps switches (Cisco Nexus
> C93120TX). The ones I'm currently working with have fairly short
> cable runs (<7ft), and are using Cat 6a cables.
> I'm running into a major issue where a large percentage (well over 50%)
> of attempted PXE kickstarts fail. The failures occur in multiple
> places, but all seem to be related to slow initialization of the
> network interface. I've seen:
> 1) dracut-initqueue timeouts leading to "/dev/root does not exist"
> 2) the node loads the kickstart file but then fails while trying to
> read the repo metadata.
> 3) the kickstart actually succeeds, but during reboot a bunch of
> network services (NFS mounts, SGE, etc) attempt to start but fail
> because the network isn't fully up yet.
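All three failure modes above point at the link taking a long time to come up, so it may help to measure that delay directly on an affected node. What follows is a minimal sketch, not something from the original thread. It assumes Python 3 is available (CentOS 7 ships Python 2.7 by default) and polls the standard Linux sysfs file /sys/class/net/<iface>/carrier. The function name wait_for_carrier, its default values, and the sysfs_root parameter are hypothetical and exist only for illustration.

```python
import os
import time


def wait_for_carrier(iface, timeout=120.0, interval=0.5,
                     sysfs_root="/sys/class/net"):
    """Poll the carrier file for `iface` until it reads "1".

    Returns the number of seconds waited, or None if the timeout
    expires first. `sysfs_root` is only a parameter so the function
    can be exercised against a fake directory tree.
    """
    path = os.path.join(sysfs_root, iface, "carrier")
    start = time.monotonic()
    while True:
        try:
            with open(path) as f:
                if f.read().strip() == "1":
                    return time.monotonic() - start
        except OSError:
            # The file is missing, or the interface is administratively
            # down (reading carrier then fails). Keep polling.
            pass
        if time.monotonic() - start >= timeout:
            return None
        time.sleep(interval)
```

Calling wait_for_carrier("em1") right after bringing the interface up (the name "em1" is a placeholder) and logging the result across many nodes would show whether the delay is consistently long or only occasionally so, which is useful when choosing timeout values.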
> To fix things, I've tried:
> 1) adding "inst.waitfornet=120 rd.net.timeout.carrier=120
> rd.net.timeout.iflink=100 rd.net.timeout.ifup=120 rd.net.dhcp.retry=5"
> to the kernel parameters in the PXE menu *and* the default grub
> 2) adding "LINKDELAY=120" to the ifcfg-$INTERFACE scripts (still using
> the network service here, not NetworkManager)
> 3) turning on PortFast on the network ports, i.e. "spanning-tree port
> type edge".
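Since none of these changes helped much, one thing worth ruling out for item 1 is whether the parameters actually reached the booted kernel. The sketch below is an illustration under assumptions, not part of the original message: it compares the text of /proc/cmdline against the values quoted above. The names parse_cmdline, differing_params, and EXPECTED are hypothetical, and the parser splits on whitespace only, so it does not handle quoted values containing spaces.

```python
# Parameter values copied from the list above.
EXPECTED = {
    "inst.waitfornet": "120",
    "rd.net.timeout.carrier": "120",
    "rd.net.timeout.iflink": "100",
    "rd.net.timeout.ifup": "120",
    "rd.net.dhcp.retry": "5",
}


def parse_cmdline(text):
    """Split a kernel command line into a {key: value} dict.

    Bare flags such as "ro" map to None. If a key appears more
    than once, the last occurrence wins in this sketch.
    """
    params = {}
    for token in text.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def differing_params(cmdline_text, expected=EXPECTED):
    """Return {key: actual value} for every expected parameter that is
    absent (actual value None) or set to a different value."""
    actual = parse_cmdline(cmdline_text)
    return {k: actual.get(k) for k, v in expected.items()
            if actual.get(k) != v}
```

On a booted node, passing the contents of /proc/cmdline to differing_params should return an empty dict if every parameter arrived as intended. Anything else suggests the PXE menu or grub entry being used is not the one that was edited.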
> Nothing has really made a huge difference. PortFast seemed to at
> first, but larger scale tests still have rather high failure rates.
> Has anyone seen anything like this? And, more importantly, has anyone
> fixed it? Thanks!
> Joshua Baker-LePain
> QB3 Shared Cluster Sysadmin
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing