[Beowulf] tcp error: Need ideas!

Gerry Creager gerry.creager at tamu.edu
Wed Jan 21 14:40:26 PST 2009


History/background/description of the cluster:
* 126-node Dell 1950 cluster with dual quad-core Xeons
* HP 5412zl switch for the gigabit cluster backplane and 10GbE interconnect
to selected services (file server, etc.)
* Gigabit interconnect
* Hand-compiled 2.6.26 kernel
* bnx2 module loaded for the onboard Broadcom NICs
* Switch, compute nodes, and head node set to 9000-byte MTU

We're seeing the following error in WRF compiled with Open MPI and the 
PGI 7.2 compiler:
mca_btl_tcp_frag_send:writev failed with errno=104
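
As a reference point for the errno value in that message, here is a minimal Python sketch that maps an errno number to its symbolic name and description. It assumes a Linux node, where 104 is ECONNRESET (the peer reset the connection); other platforms number errno values differently. The helper name describe_errno is made up for illustration.

```python
import errno
import os


def describe_errno(code):
    """Return 'NAME: description' for an errno number on this platform."""
    # errno.errorcode maps numbers to symbolic names for the local platform.
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{name}: {os.strerror(code)}"


if __name__ == "__main__":
    # On Linux, 104 is ECONNRESET ("Connection reset by peer").
    print(describe_errno(104))
```

A reset on one rank's socket usually means the process or the TCP connection at the other end went away, so the node named in the error is not necessarily the one at fault.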

All nodes were reachable before the run and returned the expected 
output when queried over ssh with a command, but two nodes now return 
something like this:
[gerry at brazos SCOOP12km]$ ssh c0522
Received disconnect from 192.168.200.154: 2: Bad packet length 808464432.
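
One way to look at the odd length in that disconnect message: an SSH binary packet begins with a 4-byte big-endian length field, so it can help to view the reported number as raw bytes. The sketch below does that. The function name is hypothetical, and reading anything into the result is a guess rather than a diagnosis.

```python
import struct


def length_as_bytes(value):
    """Pack a 32-bit unsigned integer as 4 big-endian bytes."""
    return struct.pack(">I", value)


if __name__ == "__main__":
    raw = length_as_bytes(808464432)
    # Show both the hex form and the raw bytes.
    print(raw.hex(), raw)
```

The four bytes come out as 0x30 0x30 0x30 0x30, i.e. the ASCII text "0000", which may hint that one side read ordinary text or corrupted data where it expected a packet header.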

I'm stumped and looking for causes and solutions.  And yes, this same 
WRF build did run before the change to jumbo frames.

Should I reduce the frame size to something smaller, like 8800 
bytes? 7500?  1500?
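
Before settling on a smaller frame size, it may be worth checking whether full-size frames actually make it across the path unfragmented. A minimal sketch, assuming IPv4 and the Linux iputils ping (its -M do option forbids fragmentation): the largest ICMP payload that fits in one frame is the MTU minus 20 bytes of IPv4 header and 8 bytes of ICMP header. The function names are illustrative, and c0522 is just the node name from the ssh example above.

```python
IPV4_HEADER = 20  # bytes, IPv4 header without options
ICMP_HEADER = 8   # bytes, ICMP echo header


def max_ping_payload(mtu):
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER - ICMP_HEADER


def ping_command(host, mtu, count=3):
    """Build a Linux iputils ping command with fragmentation prohibited."""
    return ["ping", "-M", "do", "-c", str(count),
            "-s", str(max_ping_payload(mtu)), host]


if __name__ == "__main__":
    # Print one probe command per candidate MTU; run them by hand on a node.
    for mtu in (9000, 8800, 7500, 1500):
        print(mtu, " ".join(ping_command("c0522", mtu)))
```

If the 8972-byte probe fails between two nodes while the 1472-byte one succeeds, some hop on that path is probably not passing 9000-byte frames.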

I'm not completely out of ideas, but I am stumped.

Thanks, gerry
-- 
Gerry Creager -- gerry.creager at tamu.edu
Texas Mesonet -- AATLT, Texas A&M University
Cell: 979.229.5301 Office: 979.458.4020 FAX: 979.862.3983
Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843


