[Beowulf] Mellanox ConnectX-3 MT27500 problems
Jörg Saßmannshausen
j.sassmannshausen at ucl.ac.uk
Sun Apr 28 00:35:42 PDT 2013
Hi Brice,
thanks for the feedback. Good to hear that it is working for you and that I
got the right cards! For a moment I thought I had ordered the wrong thing.
> These cards are QDR and even FDR, you should get 56Gbit/s (we see about
> 50Gbit/s in benchmarks iirc). That's what I get on sandy-bridge servers
> with the exact same IB card model.
>
> $ ibv_devinfo -v
> [...]
> active_width: 4X (2)
> active_speed: 14.0 Gbps (16)
>
That is what I get:
$ ibv_devinfo -v
hca_id: mlx4_0
transport: InfiniBand (0)
fw_ver: 2.10.700
[ ... ]
active_width: 4X (2)
active_speed: 5.0 Gbps (2)
phys_state: LINK_UP (5)
I installed the latest firmware, as that was my first thought.
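
As a sanity check on these numbers: the link rate is just the lane count times
the per-lane speed. Below is a minimal Python sketch of that arithmetic. The
function name and the regular expressions are my own and not part of any IB
tool; it only assumes the two field names shown in the ibv_devinfo output above.

```python
import re

def ib_link_rate(devinfo_text):
    """Return (lanes, per_lane_gbps, total_gbps) from `ibv_devinfo -v` text.

    Hypothetical helper: it only looks for the active_width and
    active_speed fields as they appear in the output quoted above.
    """
    width = re.search(r"active_width:\s*(\d+)X", devinfo_text)
    speed = re.search(r"active_speed:\s*([\d.]+)\s*Gbps", devinfo_text)
    if width is None or speed is None:
        raise ValueError("active_width/active_speed not found")
    lanes = int(width.group(1))
    per_lane = float(speed.group(1))
    return lanes, per_lane, lanes * per_lane

# The two outputs from this thread.
mine = "active_width: 4X (2)\nactive_speed: 5.0 Gbps (2)\n"
brice = "active_width: 4X (2)\nactive_speed: 14.0 Gbps (16)\n"

for name, text in (("mine", mine), ("brice", brice)):
    lanes, per_lane, total = ib_link_rate(text)
    print(f"{name}: {lanes} lanes x {per_lane} Gbps = {total} Gb/sec")
```

So my port has negotiated 4 x 5.0 = 20 Gb/sec (DDR), while Brice's runs at
4 x 14.0 = 56 Gb/sec (FDR) on the same card model.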
> These nodes have been running Debian testing/wheezy (default kernel and
> IB packages) for 9 months without problems.
That sounds encouraging. For a moment I thought there was a problem with OFED.
> I had to fix the cables to get 56Gbit/s link state. Without Mellanox FDR
> cables, I was only getting 40. So maybe check your cables. And if you're
> not 100% sure about your switch, try connecting the nodes back-to-back.
The switch is brand new. I am not saying a problem there is impossible, but I
think it is unlikely, given that I see the same behaviour when I plug the
cable into the main switch as well.
Is there a physical difference I can see on the cables that tells me what
speed they are supposed to run at? Like one being black and the other being
grey?
> You can try upgrading the IB card firmware too. Mine is 2.10.700 (likely
> not up to date anymore, but at least this one works fine).
That is the same version I have (as above).
> Where does your "8.5Gbit/s" come from? IB status or benchmarks?
$ ibstatus
Infiniband device 'mlx4_0' port 1 status:
default gid: fe80:0000:0000:0000:0025:90ff:ff17:8f65
base lid: 0x4b
sm lid: 0x1
state: 4: ACTIVE
phys state: 5: LinkUp
rate: 20 Gb/sec (4X DDR)
link_layer: InfiniBand
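
Note that the rate printed by ibstatus is the signalling rate, not the usable
data rate. A rough Python sketch of the conversion, assuming the usual line
encodings (8b/10b for SDR/DDR/QDR, 64b/66b for FDR); the helper names are my
own and the results are approximate, before any protocol overhead:

```python
import re

# Payload bits per bit on the wire for each link generation.
ENCODING = {"SDR": 8 / 10, "DDR": 8 / 10, "QDR": 8 / 10, "FDR": 64 / 66}

def parse_rate(line):
    """Parse an ibstatus rate line such as 'rate: 20 Gb/sec (4X DDR)'."""
    m = re.search(r"rate:\s*([\d.]+)\s*Gb/sec\s*\((\d+)X\s+(\w+)\)", line)
    if m is None:
        raise ValueError(f"unrecognised rate line: {line!r}")
    return float(m.group(1)), int(m.group(2)), m.group(3)

def data_rate_gbps(line):
    """Approximate usable data rate after line encoding (hypothetical helper)."""
    signalling, _lanes, generation = parse_rate(line)
    return signalling * ENCODING[generation]

for line in ("rate: 20 Gb/sec (4X DDR)",   # the new nodes
             "rate: 40 Gb/sec (4X QDR)",   # the older nodes
             "rate: 56 Gb/sec (4X FDR)"):  # what the cards should reach
    print(f"{line} -> about {data_rate_gbps(line):.1f} Gb/sec of data")
```

With these assumptions 4X DDR carries about 16 Gb/sec of data, 4X QDR about
32 Gb/sec and 4X FDR about 54 Gb/sec, which fits the roughly 50Gbit/s Brice
sees in benchmarks.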
> If
> benchmarks, it could be related to the PCIe link speed. Upgrading the
> BIOS and IB firmware helped me too (some reboots gave PCIe Gen1 instead of
> Gen3). Here's what you should see in lspci if you get PCIe Gen3 8x as
> expected:
>
> $ sudo lspci -d 15b3: -vv
> [...]
> LnkSta: Speed 8GT/s, Width x8
That is interesting, this is what I get:
$ lspci -d 15b3: -vv
LnkSta: Speed unknown, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
I will check if I can get an upgrade of the BIOS. Maybe it is as simple as
that.
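
To see why the PCIe generation matters here, this is a small Python sketch that
turns an lspci LnkSta line into a raw bandwidth figure. It assumes the standard
PCIe line encodings (8b/10b for Gen1/Gen2, 128b/130b for Gen3) and ignores
protocol overhead; the function name is my own, and it returns None when lspci
does not report a speed it recognises, as in my output above.

```python
import re

# GT/s per lane -> fraction of transferred bits that carry payload.
PCIE_ENCODING = {2.5: 8 / 10, 5.0: 8 / 10, 8.0: 128 / 130}

def pcie_bandwidth_gbps(lnksta):
    """Raw one-direction bandwidth in Gb/sec for an lspci LnkSta line.

    Hypothetical helper: returns None if the speed is missing or not
    one of the Gen1/Gen2/Gen3 rates listed in PCIE_ENCODING.
    """
    width = re.search(r"Width x(\d+)", lnksta)
    if width is None:
        raise ValueError(f"no link width in: {lnksta!r}")
    speed = re.search(r"Speed ([\d.]+)GT/s", lnksta)
    if speed is None:
        return None  # e.g. "Speed unknown"
    gts = float(speed.group(1))
    efficiency = PCIE_ENCODING.get(gts)
    if efficiency is None:
        return None
    return gts * efficiency * int(width.group(1))

for line in ("LnkSta: Speed 2.5GT/s, Width x8",   # Gen1
             "LnkSta: Speed 5GT/s, Width x8",     # Gen2
             "LnkSta: Speed 8GT/s, Width x8",     # Gen3, as Brice sees
             "LnkSta: Speed unknown, Width x8"):  # what I see
    print(line, "->", pcie_bandwidth_gbps(line))
```

Under these assumptions a Gen1 x8 link tops out at about 16 Gb/sec and Gen2 x8
at about 32 Gb/sec, so only Gen3 x8 (about 63 Gb/sec) leaves room for an FDR
link.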
Thanks for your help and have a nice Sunday!
Jörg
>
> Brice
>
> On 27/04/2013 22:05, Jörg Saßmannshausen wrote:
> > Dear all,
> >
> > I was wondering whether somebody has/had similar problems as I have.
> >
> > We have recently purchased a bunch of new nodes. These are Sandy Bridge
> > ones with Mellanox ConnectX-3 MT27500 InfiniBand cards, and this is
> > where I have problems.
> >
> > I usually use Debian Squeeze for my clusters (kernel
> > 2.6.32-5-amd64). Unfortunately, as it turned out, I cannot use that
> > kernel as my Intel NIC is not supported by it. So I upgraded to
> > 3.2.0-0.bpo.2-amd64 (the backport kernel for squeeze). With that I get
> > network, but InfiniBand is not working: the device is not even
> > recognized by ibstatus. Thus, I decided to do an upgrade (not a
> > dist-upgrade) to wheezy to get the newer OFED stack.
> >
> > Here I get the InfiniBand working but only with 8.5 Gb/sec. A simple
> > reseating of the plug increases that to 20 Gb/sec (4X DDR), which is
> > still slower than the speed of the older nodes (40 Gb/sec (4X QDR)).
> >
> > So I upgraded completely to wheezy (dist-upgrade now) but the problem
> > does not vanish.
> > I re-installed squeeze and installed a vanilla kernel (3.8.8) and
> > the latest OFED stack from their site. And guess what, same result:
> > after a reboot the InfiniBand speed is 8.5 Gb/sec and reseating the plug
> > increases that to 20 Gb/sec. It does not matter whether I connect to the
> > edge switch or to the main switch; in both cases I see the same
> > behaviour.
> >
> > Frankly, I am out of ideas now. I don't think the observed speed change
> > after reseating the plug should happen. I am in touch with the technical
> > support here as well but I think we both are a bit confused.
> >
> > Now, am I right to assume that the Mellanox ConnectX-3 MT27500 are QDR
> > cards so I should get 40 Gb/sec and not 20 Gb/sec?
> >
> > Has anybody had similar experiences? Any ideas?
> >
> > All the best from London
> >
> > Jörg
--
*************************************************************
Jörg Saßmannshausen
University College London
Department of Chemistry
Gordon Street
London
WC1H 0AJ
email: j.sassmannshausen at ucl.ac.uk
web: http://sassy.formativ.net
Please avoid sending me Word or PowerPoint attachments.
See http://www.gnu.org/philosophy/no-word-attachments.html