[Beowulf] substantial RX packet drops during Pallas over e1000 (Rocks 4.1)

Sat May 27 17:00:43 PDT 2006

Greetings,

	I have an update on this problem. Many list members emailed me with 
suggestions (sysctl, rx buffers, etc). None of those had any effect on 
the problem. Intel engineers had me edit the e1000 driver source to 
repartition the FIFO on the 82573V controller, but that did not help either.

	I swapped in an Intel 82546EB PCI-X/133 card on a few nodes and 
retested. The dropped packets disappeared on the nodes with the PCI-X 
NIC. I then added some PCIe adapters, still Intel, that were different 
from those on the motherboard. I reran Pallas and the nodes with the 
PCIe NIC still showed dropped packets.
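
	For anyone who wants to repeat the PCI-X versus PCIe comparison, here 
is a rough Python sketch for snapshotting the receive counters in 
/proc/net/dev before and after a run and printing the difference. It was 
not part of my testing; the function names are made up for illustration, 
and the interface name (eth0 below) is an assumption.

```python
def parse_proc_net_dev(text):
    """Parse /proc/net/dev text into {iface: {rx_packets, rx_drop, rx_fifo}}."""
    stats = {}
    # The first two lines of /proc/net/dev are column headers.
    for line in text.splitlines()[2:]:
        if ":" not in line:
            continue
        # Older kernels may print "eth0:1234" with no space after the colon,
        # so split on the colon rather than on whitespace.
        name, _, rest = line.partition(":")
        fields = rest.split()
        if len(fields) < 5:
            continue
        # Receive columns: bytes packets errs drop fifo frame ...
        stats[name.strip()] = {
            "rx_packets": int(fields[1]),
            "rx_drop": int(fields[3]),
            "rx_fifo": int(fields[4]),
        }
    return stats


def rx_delta(before, after, iface):
    """Return the per-counter increase for one interface between snapshots."""
    return {key: after[iface][key] - before[iface][key] for key in after[iface]}


def read_proc_net_dev(path="/proc/net/dev"):
    """Read the raw counter table (Linux only)."""
    with open(path) as handle:
        return handle.read()
```

Usage would be along the lines of: take 
before = parse_proc_net_dev(read_proc_net_dev()), run the benchmark, take 
a second snapshot, then print rx_delta(before, after, "eth0") on each node.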

	It appears that this issue affects PCIe only. The PCI-X NICs are
stable. I am running 2.6.16.11. PCIe support is enabled in the kernel. 
The onboard PCIe NICs are PCIe x1. The PCIe card I added is PCIe x4.

Does anyone have any thoughts as to why these dropped packets would only 
appear under PCIe?

--Jeff

#----------previous posting--------------------#
Greetings,

    Running Rocks 4.1 on a 30 node system and seeing serious RX packet
loss, drops and overruns while running heavy MPI i/o over e1000. I have
replaced cabling and switches, updated e1000 drivers, run multiple
kernels, etc. None of these changes seems to affect the issue. I am
pursuing a hardware resolution with Intel and Supermicro but I am posting
here in case someone has seen similar events.

    System details:
       30 nodes - Intel Pentium-D 840, 4GB RAM, 80GB SATA
             Supermicro PDSMI motherboard
             Intel 82573E and 82573L gigabit ethernet controllers
             (only one network connected)
             2.6.9-34.ELsmp and 2.6.16.11
             e1000-7.0.38-1 driver

    Run details:
       mpirun -nolocal -np 18 -machinefile /home/test/machines.20-29 \
           /home/test/IMB-MPI1 Alltoall -npmin 18 -msglen /home/test/Lengths
(msglen values of 32, 256, 512 and 1024 have each been run on their own,
and each run resulted in packet drops)

   Packet drop example: (other nodes post similar numbers)
           RX packets:1843133 errors:0 dropped:1245 overruns:0 frame:0
           TX packets:1764828 errors:0 dropped:0 overruns:0 carrier:0
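
    To put the numbers above in proportion, here is a small Python sketch
that pulls the counters out of an ifconfig-style RX line and reports
dropped frames as a percentage of RX packets. The function name is
hypothetical, and it assumes the older "RX packets:N ... dropped:N"
output format shown above. For the sample line it works out to roughly
0.068%.

```python
import re

# Matches the classic net-tools ifconfig RX summary line.
RX_LINE = re.compile(
    r"RX packets:(\d+) errors:(\d+) dropped:(\d+) overruns:(\d+) frame:(\d+)"
)


def rx_drop_percent(ifconfig_line):
    """Return dropped frames as a percentage of the RX packet count."""
    match = RX_LINE.search(ifconfig_line)
    if match is None:
        raise ValueError("not an ifconfig RX line: %r" % ifconfig_line)
    packets, _errors, dropped, _overruns, _frame = map(int, match.groups())
    if packets == 0:
        return 0.0
    # Simple ratio of the two counters printed on the same line.
    return 100.0 * dropped / packets
```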

    I have tried increasing the e1000 RxDescriptors value to the maximum
of 4096, thinking that the Alltoall test might be exhausting receive
buffer resources, but the drops still occur.

    On Intel's advice I enabled ARP filtering, but it did nothing to change
the behavior of the problem. (/proc/sys/net/ipv4/conf/all/arp_filter)
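
    To confirm the setting actually took effect on every node, something
like the following Python sketch could be run across the cluster. The
helper name is made up for illustration; the only thing taken from the
setup above is the /proc path. A value of 1 means ARP filtering is
enabled, 0 means it is off.

```python
def read_arp_filter(path="/proc/sys/net/ipv4/conf/all/arp_filter"):
    """Return the current arp_filter setting as an integer (Linux only)."""
    with open(path) as handle:
        return int(handle.read().strip())


def arp_filter_enabled(path="/proc/sys/net/ipv4/conf/all/arp_filter"):
    """True when the kernel reports arp_filter as non-zero."""
    return read_arp_filter(path) != 0
```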

Any ideas?

--Jeff