[Beowulf] Whats up with these newer Intel NICs?

jeff.johnson jeff.johnson at wsm.com
Tue Sep 25 14:55:33 PDT 2007


Joe Landman wrote:

(responses embedded)
> Jeff Johnson wrote:
>   
>> Joe,
>>
>>    I think you may be dealing with a PCIe fifo issue.
>>     
>
> Hi Jeff:
>
>   Possibly.  I had thought about that.  I was thinking more along the
> lines of "it is a motherboard NIC, so we don't need no steenkeen high
> performance things like 64 bit buffers ..."

The controller in question is a LOM component but not quite a "desktop 
grade" chip. It is a server-class controller. IMHO, Intel and the other 
silicon spinners haven't quite grasped the difference between 
"enterprise grade" and "HPC grade". I think it is highly likely that 
this chip can cut the mustard for your storage cluster, but the driver 
and I/O options may not run well for your application without some 
Kentucky windage.
>   
>>    I have seen issues with the Intel PCIe gigabit ethernet onboard parts
>> when compared to PCIe slot cards and PCIX cards like the ones you are
>> testing. Specifically the partitioning of the controller's buffers
>> between rcv and xmit operations (internal to the controller chip itself)
>> and the controller's relationship with the PCIe buffer on the
>> northbridge. PCIe, being serial, has different challenges when reaching
>> the top end of a device's performance capabilities. In this case you are
>> suffering some buffer throttling.
>>     
>
> I played with some (OS/NIC) buffer settings, txqueuelen, and a few other
> tunables.  Nothing seems to have impacted it.
>   
The way the Intel controller and the e1000 driver interact is that the 
e1000 driver sets up the rcv buffer at initialization time and the 
*remainder* is left for xmit. This is not something that can be adjusted 
with ethtool or a module-load option. You have to get into the e1000 
driver source, find the rcv buffer size definition, and change it to 
suit your evil needs. Recompile and enjoy. Here is where the Kentucky 
windage comes in, as you may have to try a few values. Lather, rinse, 
repeat until you get it right.
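
For what it's worth, the piece that *is* reachable from userspace is the 
descriptor ring sizing ("ethtool -g/-G", which is just the 
ETHTOOL_GRINGPARAM/SRINGPARAM ioctls underneath), but those are 
host-memory rings, not the on-chip rcv/xmit split I'm talking about. A 
quick sketch of reading them, written from memory and untested; "eth0" 
is just an example name:

/* Read the descriptor ring sizes via the ethtool ioctl (what
 * "ethtool -g" reports).  These are host-memory rings; the on-chip
 * rcv/xmit FIFO split still has to be changed in the e1000 source.
 * "eth0" is just an example interface name. */
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <linux/types.h>
#include <linux/ethtool.h>
#include <linux/sockios.h>

int main(void)
{
    struct ifreq ifr;
    struct ethtool_ringparam ring;
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    if (fd < 0) {
        perror("socket");
        return 1;
    }
    memset(&ifr, 0, sizeof(ifr));
    strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);

    memset(&ring, 0, sizeof(ring));
    ring.cmd = ETHTOOL_GRINGPARAM;
    ifr.ifr_data = (char *)&ring;

    if (ioctl(fd, SIOCETHTOOL, &ifr) == 0)
        printf("rcv ring %u/%u  xmit ring %u/%u\n",
               ring.rx_pending, ring.rx_max_pending,
               ring.tx_pending, ring.tx_max_pending);
    else
        perror("SIOCETHTOOL");
    return 0;
}

If the rings look sane and you still see the behavior, that points back 
at the on-chip split.
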
>   
>>    By default the buffers are partitioned for "one size fits most"
>> scenario. If you know your i/o profile you can use ethtool (or modify
>>     
>
> Yeah ...
>
>   
>> e1000 driver source) to repartition the controller's fifo to favor rcv
>> or xmit operations. This results in better performance in situations
>> where you know you will have heavier writes over reads or vice versa.
>>     
>
> Of course, though without knowing your workload in advance you can't
> really tune this.
>
> Aside from that, I can't say I have seen many people tune their storage
> clusters for workloads of one particular type.  You basically never know
> what users will throw your way, and you really don't want one "corner
> case" test being the important thing that drives down overall performance.
>   
Unless you are building a generic-use resource, it is possible to figure 
out whether the environment favors reads over writes, etc. You don't 
have to be exact. Right now you are dealing with a 50/50 balance in your 
ethernet and PCIe rcv/xmit buffer resources. Moving to 60/40 in favor of 
one direction could be the difference between exhausting your buffer 
resources and avoiding the slowdown.
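
To put purely illustrative numbers on it: assume a 48 KB on-chip packet 
buffer (a made-up round figure -- check the datasheet for your part). A 
50/50 split gives 24 KB each way; 60/40 in favor of rcv gives roughly 
29 KB rcv / 19 KB xmit. On a read-heavy storage node, that extra rcv 
headroom can be the difference between riding the edge and overflowing.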

I could bury an 8-node mpich run of Pallas on the 82573 (first-gen Intel 
gigabit PCIe LOM) until I monkeyed with the buffer settings. Running 
Pallas, even at very small message sizes, the buffers were getting 
buried so badly that it wasn't a matter of a slowdown but of rapidly 
incrementing dropped packets.
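
If anyone wants to check whether they are in the same boat, watch the 
rcv drop/fifo counters while the benchmark runs -- if they climb, the 
buffers are getting buried. A crude watcher like this does it (reads 
/proc/net/dev once a second; "eth0" is just an example, untested):

/* Watch the rcv drop and fifo counters for one interface in
 * /proc/net/dev.  Run alongside the benchmark; Ctrl-C to stop.
 * "eth0" is just an example interface name. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    char line[512];

    for (;;) {
        FILE *f = fopen("/proc/net/dev", "r");
        if (!f)
            return 1;
        while (fgets(line, sizeof(line), f)) {
            char name[32];
            unsigned long rx_bytes, rx_pkts, rx_errs, rx_drop, rx_fifo;
            /* per-interface lines look like "  eth0: bytes packets errs drop fifo ..." */
            if (sscanf(line, " %31[^:]: %lu %lu %lu %lu %lu",
                       name, &rx_bytes, &rx_pkts, &rx_errs,
                       &rx_drop, &rx_fifo) == 6 &&
                strcmp(name, "eth0") == 0)
                printf("rcv drop %lu  rcv fifo %lu\n", rx_drop, rx_fifo);
        }
        fclose(f);
        sleep(1);
    }
}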

Any chance you are running jumbo frames? If so, turn them off and retest.
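
(For the archives: dropping back to the standard 1500-byte MTU is just 
"ifconfig eth0 mtu 1500". If you want it inside a test harness, 
something like the sketch below does the same thing -- "eth0" is an 
example name, needs root, untested.)

/* Set the interface back to a standard 1500-byte MTU -- same effect
 * as "ifconfig eth0 mtu 1500".  Needs root; "eth0" is an example. */
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <unistd.h>

int main(void)
{
    struct ifreq ifr;
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    if (fd < 0) {
        perror("socket");
        return 1;
    }
    memset(&ifr, 0, sizeof(ifr));
    strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);
    ifr.ifr_mtu = 1500;

    if (ioctl(fd, SIOCSIFMTU, &ifr) != 0)
        perror("SIOCSIFMTU");
    close(fd);
    return 0;
}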

Also, use the driver from e1000.sourceforge.net. If you are using a 
driver from Intel or a distro, ditch it.
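
An easy way to confirm which build you actually have loaded is 
"ethtool -i eth0", or programmatically the ETHTOOL_GDRVINFO ioctl it 
wraps -- the driver/version strings tell you whether you are on the 
distro e1000 or the sourceforge one. Rough sketch, "eth0" again just an 
example, untested:

/* Print the loaded driver name and version (what "ethtool -i" shows).
 * "eth0" is just an example interface name. */
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <linux/types.h>
#include <linux/ethtool.h>
#include <linux/sockios.h>

int main(void)
{
    struct ifreq ifr;
    struct ethtool_drvinfo info;
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    if (fd < 0) {
        perror("socket");
        return 1;
    }
    memset(&ifr, 0, sizeof(ifr));
    strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);

    memset(&info, 0, sizeof(info));
    info.cmd = ETHTOOL_GDRVINFO;
    ifr.ifr_data = (char *)&info;

    if (ioctl(fd, SIOCETHTOOL, &ifr) == 0)
        printf("driver %s  version %s\n", info.driver, info.version);
    else
        perror("SIOCETHTOOL");
    return 0;
}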

One of the comments in your original message is key: PCI-X works, PCIe 
is slower. With PCIe being serial, you have both the ethernet buffering 
and the PCIe buffering to contend with.

>>    *OR* it is because you are using a Supermicro motherboard..  =)
>>     
>
> Owie ... that left a mark ...
>   
Try deploying a 256-node cluster with a motherboard defect that the 
vendor wouldn't acknowledge. That leaves a mark, as well as hefty bar tabs.
> I thought it was that I hadn't given the appropriate HPC deity their
> burnt (processor) offering ...
>   
The gods prefer FBDIMMs these days.


-- 
Best Regards,

Jeff Johnson
Vice President
Engineering/Technology
Western Scientific, Inc
jeff.johnson at wsm.com
http://www.wsm.com

5444 Napa Street - San Diego, CA 92110
Tel 800.443.6699  +001.619.220.6580
Fax +001.619.220.6590

"Braccae tuae aperiuntur"



