[Beowulf] Anyone having IPMI problems on Intel S3200 series
henning.fehrmann at aei.mpg.de
Wed Apr 22 00:18:15 PDT 2009
On Mon, Apr 20, 2009 at 07:22:40AM -0400, Joe Landman wrote:
> Henning Fehrmann wrote:
>> Hi Greg,
>> On Mon, Apr 20, 2009 at 01:28:50AM -0700, Greg Lindahl wrote:
>>> On Mon, Apr 20, 2009 at 09:24:09AM +0200, Henning Fehrmann wrote:
>>>> We also had this problem with Supermicro boards and IPMI cards in a large
>>>> scale. Finally we found a solution by upgrading the firmware of the NICs which are
>>>> actually from Intel.
>>> I take it that you didn't get the version with the dedicated ethernet
>>> port for IPMI?
>> Correct, we do not have IPMI cards with a dedicated ethernet port, at
>> least for the nodes.
> We've found that, for nodes like this, when you reboot the machine, the
> OS tears down the NIC setup, and things like arp responses suddenly
> disappear. Which means either keeping a log of arp -> IP mapping, as
> the nics, with a torn down arp daemon, won't respond ...
Well, there was no obvious correlation between nodes going down and
non-working IPMI cards. We have been able to access nodes via IPMI-SOL
even when the OS was not loaded yet and, conversely, some of the nodes that
stayed up for months suddenly lost the IPMI connection.
> .. or avoid buying the no-NIC BMCs. We started doing that a few years
> ago, as our customers were getting annoyed by those problems (and our
> 'solution' of maintaining arp tables with a script to force the mac->ip
> mapping, while working, was at best a hack).
> From our experience ... spend the extra several dollars on the NIC
> version, and the cable. It will save time/effort over the long haul.
> Fewer surprises.
We did it for the server right from the beginning.
But hmmm ... times 1680 nodes, it will be difficult to get the money ;).
In the end, installing very recent firmware on all involved components
solved the problem for free.
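
For reference, the ARP workaround Joe mentions above (a script that forces the
mac->ip mapping) might look roughly like the sketch below. This is not the
script anyone in this thread actually used. The table contents, the interface
name "eth0" and the function name static_arp_commands are made up for
illustration, and it assumes a Linux management host with iproute2. It only
builds the commands and prints them; applying them is left to the admin.

```python
import ipaddress
import re

# Hypothetical BMC inventory: IP address -> MAC address (made-up example values).
BMC_TABLE = {
    "10.0.0.101": "00:11:22:33:44:55",
    "10.0.0.102": "00:11:22:33:44:56",
}

MAC_RE = re.compile(r"^[0-9a-f]{2}(:[0-9a-f]{2}){5}$")


def static_arp_commands(table, dev="eth0"):
    """Return one `ip neigh replace` argument list per entry, pinning IP -> MAC."""
    commands = []
    for ip, mac in sorted(table.items()):
        ipaddress.ip_address(ip)  # raises ValueError on a malformed address
        mac = mac.lower()
        if not MAC_RE.match(mac):
            raise ValueError(f"bad MAC address: {mac!r}")
        # A permanent neighbour entry does not depend on the BMC answering ARP.
        commands.append(
            ["ip", "neigh", "replace", ip, "lladdr", mac,
             "dev", dev, "nud", "permanent"]
        )
    return commands


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for argv in static_arp_commands(BMC_TABLE):
        print(" ".join(argv))
```

As Joe says, this kind of pinning is at best a hack. In our case the firmware
upgrade made it unnecessary.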