[Beowulf] InfiniBand VL15 error

Prentice Bisbal prentice at ias.edu
Tue Dec 2 14:02:59 PST 2008


See my answers inline.

Nifty Tom Mitchell wrote:
> On Tue, Dec 02, 2008 at 10:24:15AM -0500, Prentice Bisbal wrote:
>> I'm getting this error when I run ibchecknet on my cluster:
>>
>> #warn: counter VL15Dropped = 476        (threshold 100) lid 1 port 1
>> Error check on lid 1 (aurora HCA-1) port 1:  FAILED
>>
>> I've googled around this morning, but haven't found anything helpful.
>> Most of the hits turn up code with the phrase "VL15Dropped", but nothing
>> explaining what this error means, what causes it, or how to fix it.
>>
>> After clearing the counters with 'perfquery -r', the VL15Dropped count
>> starts increasing from zero almost immediately.
>>
>> Any ideas what this error represents or how to fix? Could it be a bad
>> cable?
>>
> 
> Can you be specific about the hardware (HCA and switch) and software?
> How large is the fabric?
> What subnet manager is running and where?
>
> The host behind LID-1 is the one of interest.

IB Switch: Cisco 7012 D, 144-port
HCAs: Cisco, which is really Mellanox:

# lspci | grep Infini
0b:00.0 InfiniBand: Mellanox Technologies MT25208 InfiniHost III Ex (Tavor compatibility mode) (rev 20)

The subnet manager is OpenSM 3.1.8-1.el5, which is provided by my Linux
distro, PU_IAS 5.2 (a rebuild of RHEL 5.2). It is running on the master
node, aurora. The HCA with the error is on this node (see the error
message in the original post).

> 
> If I recall correctly, VL15  is reserved exclusively for subnet management
> and is not optional.  Traffic to VL15 might be randomly dropped by the
> switch, SMA or interrupt handler.  As long as the subnet is OK modest
> dropped traffic on VL15 may not be an issue.
> 
> What is running on the fabric concurrently with ibchecknet (and on the LID-1 host)?

Not sure what you mean. Do you want to see the output of ibchecknet?

> 
> Subnet management traffic should be light, very light.  Tell us about 
> the subnet manager situation on your fabric.   There should only
> be one active subnet manager.   Mixed and uncooperating  SMs could
> cause this, as could basic IB errors (connectors, cables, connections).
> If the SM is running on LID-1 then traffic will reflect the fabric size.

There is only one SM running. It's running on the master node. The other
nodes don't even have the OpenSM package installed.

>
> What other IB errors are you seeing?  If the port for LID-1 is not seeing
> IB errors other than VL15 you should be OK -- do look for multiple SMs.

I'm not seeing any other errors. This one is a new development, too.

> If you stop your subnet manager, does the counter reflect the pause?
> 

Haven't tried yet. And since it's almost quitting time, I'm not going to
try until tomorrow.
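
The suggested test (stop the SM and see whether the counter pauses) could
probably be scripted along these lines. This is only a rough sketch: the
helper names vl15_dropped and vl15_delta are made up for illustration, and
it assumes the warning lines keep exactly the format shown in the
ibchecknet output quoted above. It only parses saved output; it does not
talk to the fabric itself.

```shell
#!/bin/sh
# Hypothetical helper: read ibchecknet-style warning lines on stdin and
# print the VL15Dropped value reported for the given LID and port.
# Assumed line format (taken from the output quoted in this thread):
#   #warn: counter VL15Dropped = 476        (threshold 100) lid 1 port 1
# Prints nothing if no matching line is found.
vl15_dropped() {
    awk -v lid="$1" -v port="$2" '
        $1 == "#warn:" && $2 == "counter" && $3 == "VL15Dropped" &&
        $(NF-3) == "lid" && $(NF-2) == lid &&
        $(NF-1) == "port" && $NF == port {
            print $5
            exit
        }'
}

# Hypothetical helper: growth of the counter between two samples.
vl15_delta() {
    echo $(( $2 - $1 ))
}

# Demo on the warning line from the original post.
printf '%s\n' \
    '#warn: counter VL15Dropped = 476        (threshold 100) lid 1 port 1' |
    vl15_dropped 1 1
```

The idea would be to save the ibchecknet output once with the SM running
and once a few minutes after stopping it, pull the value out of each with
vl15_dropped, and compare them with vl15_delta. If the counter stops
growing while the SM is down, the drops are most likely SM traffic.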

-- 
Prentice


