[Beowulf] Infinipath memory parity errors
Nifty niftyompi Mitch
niftyompi at niftyegg.com
Wed Aug 13 17:12:40 PDT 2008
On Wed, Aug 13, 2008 at 05:03:46PM +0100, Dave Love wrote:
> [I know in an ideal world the vendor between us and PathScale^WQlogic
> would sort this out.]
>
> I'm interested in the cause (and possible cure!) of intermittent errors
> on various nodes in our Infinipath system which stop MPI jobs with
> kernel messages like this, in case anyone's familiar with them:
>
> lvinfi095:21.Hardware problem: {[RXE EAGERTID Memory Parity]}
>
> They seem to be new with an upgrade to Linux 2.6.22 from 2.6.11, but
> probably just manifested themselves in some other way previously.
>
> Google didn't produce any leads, and a brief look in the source suggests
> that tracking it down where it's generated in the ib_ipath module is
> non-trivial and likely won't tell me a lot.
>
> For what it's worth, the adaptors are
>
> 06:00.0 InfiniBand: PathScale, Inc InfiniPath HT-400 (rev 02)
>
> in two different sorts of Supermicro whose model numbers I don't know.
>
Dave,
Which driver is active, and which Infinipath software release
is installed? The tool "ipath_control -i" can show which.
The kernel.org/OFED driver does not have as rich a set of error-recovery
code for this card as the shipped driver. The recovery code was seen
as undesirable and was not accepted by the kernel.org folks.
After a kernel update, the shipped driver will not have been recompiled
for the new kernel, so the kernel.org driver would become active.
The Install Guide covers how to rebuild it:
# To rebuild the drivers, do the following (as root):
# cd /usr/src/infinipath/drivers
# ./make-install.sh
# /etc/init.d/infinipath restart
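
One way to see which ib_ipath build the running kernel picks up, before
and after the rebuild, is sketched below. It uses only standard tools
(uname, modinfo, /proc/modules). The function name report_ipath_driver is
made up for illustration, and the assumption that the two drivers can be
told apart by module path or version string is mine, not something from
the Install Guide; the version field may be empty on some builds.

```shell
#!/bin/sh
# Sketch: report which ib_ipath module file the running kernel would load.
# report_ipath_driver is a hypothetical helper name, not part of any
# Infinipath release.
report_ipath_driver() {
    mod="${1:-ib_ipath}"
    echo "kernel: $(uname -r)"
    # modinfo -n prints the path of the module file for the running kernel.
    if path=$(modinfo -n "$mod" 2>/dev/null); then
        echo "module file: $path"
        # modinfo -F prints a single field; it may be empty for this module.
        ver=$(modinfo -F version "$mod" 2>/dev/null || true)
        echo "module version: ${ver:-unknown}"
    else
        echo "module $mod not found for this kernel"
    fi
    # /proc/modules lists currently loaded modules, one per line.
    if grep -q "^$mod " /proc/modules 2>/dev/null; then
        echo "loaded: yes"
    else
        echo "loaded: no"
    fi
}

report_ipath_driver
```

Comparing the output from before and after running make-install.sh may
show whether the module file changed; "ipath_control -i" remains the
tool to trust for the installed release.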
--
T o m M i t c h e l l
Got a great hat... now what.