[Beowulf] PXE/kickstart and 10GBase-T issues

Thu Feb 28 16:31:36 PST 2019

Cisco's documentation on PortFast makes me wonder how it did you any good
at all during a transition. Any misconfiguration could block all ports,
with some configurations being flagged "type-inconsistent."
I love these puzzles and will watch this carefully. Sorry I cannot be of
more help.
Jonathan Engwall

On Thu, Feb 28, 2019, 2:54 PM Joshua Baker-LePain <
joshua.bakerlepain at gmail.com> wrote:

> I've got a few-hundred node cluster here that I've had humming along
> for several years.  All the nodes are set to PXE boot.  The default
> entry in the PXE menu is to boot off the local hard drive, and we drop
> in a kickstart if need be (new nodes, node refreshes, I just feel like
> it, etc).  I'm currently moving the cluster from CentOS-6 to CentOS-7.
> At the same time, I have ~200 nodes with onboard 10GBase-T NICs
> (X540-AT2 based) that had been plugged into 1Gbps switches (from
> Brocade) that I'm moving over to 10Gbps switches (Cisco Nexus
> C93120TX).  The ones I'm currently working with have fairly short
> cable runs (<7ft), and are using Cat 6a cables.
>
> I'm running into a major issue where a large percentage (well over 50%)
> of attempted PXE kickstarts fails.  The failures occur in multiple
> places, but all seem to be related to slow initialization of the
> network interface.  I've seen:
>
> 1) dracut-initqueue timeouts leading to "/dev/root does not exist"
>
> 2) the node loads the kickstart file but then fails while trying to
> read the repo metadata.
>
> 3) the kickstart actually succeeds, but during reboot a bunch of
> network services (NFS mounts, SGE, etc) attempt to start but fail
> because the network isn't fully up yet.
>
> To fix things, I've tried:
>
> 1) adding "inst.waitfornet=120 rd.net.timeout.carrier=120
> rd.net.timeout.iflink=100 rd.net.timeout.ifup=120 rd.net.dhcp.retry=5"
> to the kernel parameters in the PXE menu *and* the default grub
> parameters
>
> 2) adding "LINKDELAY=120" to the ifcfg-$INTERFACE scripts (still using
> the network service here, not NetworkManager)
>
> 3) turning on PortFast on the network ports, i.e. "spanning-tree port
> type edge".
>
> Nothing has really made a huge difference.  PortFast seemed to at
> first, but larger scale tests still have rather high failure rates.
> Has anyone seen anything like this?  And, more importantly, has anyone
> fixed it?  Thanks!
>
> --
> Joshua Baker-LePain
> QB3 Shared Cluster Sysadmin
> UCSF
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
>
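
For anyone wanting to put numbers on the slow interface initialization
described in the quoted message, here is a minimal sketch in Python. It is
not from the thread. It polls the Linux sysfs carrier file for an interface
and reports how long the link took to come up. The function name
wait_for_carrier and its defaults are made up for illustration; the
120-second default only mirrors the timeouts quoted above. Note that
carrier only reflects the physical link: a port can have carrier while
spanning tree is still holding it out of forwarding, and DHCP can still
fail afterwards, so treat the result as a lower bound on "network is
usable".

```python
import time
from pathlib import Path


def wait_for_carrier(carrier_file, timeout=120.0, interval=0.5):
    """Poll a sysfs-style carrier file until it reads "1".

    Returns the number of seconds waited, or None if the timeout
    expired before carrier was seen.
    """
    path = Path(carrier_file)
    start = time.monotonic()
    while True:
        try:
            # sysfs reports "1" once the link has carrier, "0" otherwise.
            if path.read_text().strip() == "1":
                return time.monotonic() - start
        except OSError:
            # Reading carrier fails while the interface is administratively
            # down, or if the file does not exist; treat both as "no link".
            pass
        if time.monotonic() - start >= timeout:
            return None
        time.sleep(interval)
```

On a node this would be called as
wait_for_carrier("/sys/class/net/IFACE/carrier"), with IFACE replaced by
the real interface name. Logging the returned value across many boots
would show whether link negotiation itself is slow or whether the delay
sits further up (spanning tree, DHCP).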