[Beowulf] InfiniBand VL15 error
Prentice Bisbal
prentice at ias.edu
Tue Dec 2 14:02:59 PST 2008
See my answers inline.
Nifty Tom Mitchell wrote:
> On Tue, Dec 02, 2008 at 10:24:15AM -0500, Prentice Bisbal wrote:
>> I'm getting this error when I run ibchecknet on my cluster:
>>
>> #warn: counter VL15Dropped = 476 (threshold 100) lid 1 port 1
>> Error check on lid 1 (aurora HCA-1) port 1: FAILED
>>
>> I've googled around this morning, but haven't found anything helpful.
>> Most of the hits turn up code with the phrase "VL15Dropped", but nothing
>> explaining what this error means, what causes it, or how to fix it.
>>
>> After clearing the counters with 'perfquery -r', the VL15Dropped count
>> starts increasing from zero almost immediately.
>>
>> Any ideas what this error represents or how to fix? Could it be a bad
>> cable?
>>
>
> Can you be specific about the hardware (HCA and switch) and software?
> How large is the fabric?
> What subnet manager is running and where?
>
> The host behind LID-1 is the one of interest.
IB Switch: Cisco 7012 D, 144-port
HCAs: Cisco, which is really Mellanox:
# lspci | grep Infini
0b:00.0 InfiniBand: Mellanox Technologies MT25208 InfiniHost III Ex (Tavor compatibility mode) (rev 20)
The subnet manager is OpenSM 3.1.8-1.el5, which is provided by my Linux
distro, PU_IAS 5.2, a rebuild of RHEL 5.2. It is running on the
master node, aurora. The HCA with the error is on this node (see the error
message in my original post).
>
> If I recall correctly, VL15 is reserved exclusively for subnet management
> and is not optional. Traffic to VL15 might be randomly dropped by the
> switch, SMA or interrupt handler. As long as the subnet is OK modest
> dropped traffic on VL15 may not be an issue.
>
> What is running on the fabric concurrently with ibchecknet (and on the LID-1 host)?
Not sure what you mean. Do you want to see the output of ibchecknet?
>
> Subnet management traffic should be light, very light. Tell us about
> the subnet manager situation on your fabric. There should only
> be one active subnet manager. Mixed and uncooperating SMs could
> cause this, as could basic IB errors (connectors, cables, connections).
> If the SM is running on LID-1 then traffic will reflect the fabric size.
There is only one SM running. It's running on the master node. The other
nodes don't even have the OpenSM package installed.
>
> What other IB errors are you seeing? If the port for LID-1 is not seeing
> IB errors other than VL15 you should be OK -- do look for multiple SMs.
I'm not seeing any other errors. This one is a new development, too.
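To double-check that VL15Dropped really is the only counter being flagged, a small script could scan saved ibchecknet output for warning lines. This is only a rough sketch: the pattern is modeled on the single "#warn" line quoted earlier in this thread, so other versions of the tool may format warnings differently, and the function names are made up for illustration.

```python
import re

# Pattern modeled on the one warning line quoted in this thread.
# Assumption: other ibchecknet versions may print warnings differently.
WARN_RE = re.compile(
    r"#warn: counter (?P<counter>\w+) = (?P<value>\d+) "
    r"\(threshold (?P<threshold>\d+)\) lid (?P<lid>\d+) port (?P<port>\d+)"
)


def parse_warnings(text):
    """Return one dict per '#warn: counter ...' line found in text."""
    results = []
    for match in WARN_RE.finditer(text):
        fields = match.groupdict()
        results.append({
            "counter": fields["counter"],
            "value": int(fields["value"]),
            "threshold": int(fields["threshold"]),
            "lid": int(fields["lid"]),
            "port": int(fields["port"]),
        })
    return results


def other_counters(warnings, ignore="VL15Dropped"):
    """Sorted names of flagged counters other than the ignored one."""
    return sorted({w["counter"] for w in warnings if w["counter"] != ignore})
```

Feeding it the saved output and getting an empty list back from other_counters() would match what I'm seeing: VL15Dropped and nothing else.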
> If you stop your subnet manager, does the counter reflect the pause?
>
Haven't tried yet. And since it's almost quitting time, I'm not going to
try until tomorrow.
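For that test, one way to compare "SM running" against "SM stopped" would be to note the VL15Dropped value at a few points in time and turn the readings into a rate. The sketch below assumes the readings are collected by hand (or by some script) as (seconds, count) pairs; the function name and the sample numbers are hypothetical, not measurements from my fabric.

```python
def drop_rate(samples):
    """Average counter increase per second.

    samples: list of (seconds, count) pairs in time order.
    Returns 0.0 if there are fewer than two samples or no time elapsed.
    Assumes the counter was not reset between the first and last sample.
    """
    if len(samples) < 2:
        return 0.0
    (t_first, c_first), (t_last, c_last) = samples[0], samples[-1]
    elapsed = t_last - t_first
    if elapsed <= 0:
        return 0.0
    return (c_last - c_first) / elapsed


if __name__ == "__main__":
    # Hypothetical readings, for illustration only.
    with_sm = [(0, 0), (30, 55), (60, 120)]
    without_sm = [(0, 0), (30, 0), (60, 0)]
    print("SM running:", drop_rate(with_sm), "drops/sec")
    print("SM stopped:", drop_rate(without_sm), "drops/sec")
```

If the rate falls to zero while the SM is stopped and climbs again after a restart, that would point at subnet management traffic rather than a cable.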
--
Prentice