[Beowulf] Re: Bugfix for Broadcom NICs losing connectivity

Tina Friedrich Tina.Friedrich at diamond.ac.uk
Fri Jun 4 01:39:38 PDT 2010


We've had that happen on some of our servers. We're currently using the 
disable_msi workaround, which seems to have stopped it. I believe 
there's supposed to be a fix in the latest Red Hat kernel, but we haven't 
really tested that yet.
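
For anyone wanting to check whether an interface is actually using MSI 
before or after setting disable_msi (a bnx2 module parameter), here is a 
minimal Python sketch. It assumes the usual /proc/interrupts layout, where 
the interrupt type (a string containing "MSI" for MSI/MSI-X) and the device 
name appear on the same line. The function names are made up for 
illustration, and the exact type strings differ between kernels, so treat 
it as a rough check rather than a definitive test.

```python
def irq_lines_for(interrupts_text, ifname):
    """Return the /proc/interrupts lines that mention the given interface.

    Assumption: the device name shows up as a whitespace-separated field,
    either exactly (eth0) or with a queue suffix (eth0-0, eth0-1, ...).
    """
    matches = []
    for line in interrupts_text.splitlines():
        for field in line.split():
            name = field.strip(",")
            if name == ifname or name.startswith(ifname + "-"):
                matches.append(line)
                break
    return matches


def uses_msi(interrupts_text, ifname):
    """True if any interrupt line for ifname has an MSI or MSI-X type."""
    return any("MSI" in line for line in irq_lines_for(interrupts_text, ifname))
```

On a live machine one would pass in the contents of /proc/interrupts, e.g. 
uses_msi(open("/proc/interrupts").read(), "eth0").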

You lose all network connectivity (including IPMI) to the server, but not 
all connectivity: a serial console (not SOL; a proper serial console, or 
one via a console server) still works, as would a locally attached 
keyboard/monitor. Unless you require network to log in :) . If one runs 
into this, it's a really weird one (before you find the bug report): to 
all appearances the server is working happily, with no strangeness in the 
logs, just the network gone completely.

It's not one that's easy to trigger, which makes it hard to track down. 
We had 610s and 710s for a while before this first happened (and there 
are still loads we've never seen it on). We first saw it on a rather 
heavily used NFS server (i.e. lots of network I/O).

Tina


Cris Rhea wrote:
>> In case it helps anyone using Dell R410 / 610 / 710 etc. servers: I have had
>> machines lose their eth connections periodically (CentOS 5.4 bnx2 driver).
>> Seems like a bug with the Broadcom NIC drivers. [luckily read of it on a
>> Dell mailing list]
>>
>> Bug Reports:
>>
>> http://kbase.redhat.com/faq/docs/DOC-26837
>> http://patchwork.ozlabs.org/patch/51106
>>
>> Not sure yet if this is exactly my issue but I'm giving it a shot now.
>> Thought I'd post since, anecdotally I've seen many people use these servers
>> on the list.
>>
>> -- 
>> Rahul
> 
> I've been following this on the Dell list as I have approx. 50 R410s  
> in our cluster.
> 
> One thing that isn't clear--  When this happens, do you lose all 
> connectivity to the node (i.e., do you have to reboot the node to 
> re-establish eth0)?
> 
> My R410s are running CentOS 5.2 - 5.4 and I rarely have one go 
> down.
> 
> --- Cris
> 
> 


-- 
Tina Friedrich, Computer Systems Administrator, Diamond Light Source Ltd
Diamond House, Harwell Science and Innovation Campus - 01235 77 8442


