[Beowulf] any creative ways to crash Linux?: does a shared NIC IPMI always remain responsive?

Sun Oct 25 06:17:33 PDT 2009

Oh, it doesn't always work, but even with the Dell hardware, it USUALLY 
does.  As recently as Friday, I had to ssh into an IPMI module and 
reboot busybox (this was on a Supermicro system, not a Dell) because 
IPMI had gotten stupid.  When I did, it regained its intelligence and 
has performed properly ever since.  No idea what confused it, though, 
which is a bit disconcerting.

gerry

Rahul Nabar wrote:
> Now that I have remote-IPMI and SOL working my next step is to try and
> crash Linux to see if there might be "pathological crash cases" where
> I will end up having to go to the server room. So far, whatever I do
> I'm pleasantly surprised that "chassis power cycle" always seems to
> work!
> 
> I tried:
> 
>  `echo "c" > /proc/sysrq-trigger` to produce kernel panic. The node
> still reboots on its IPMI interface.
> 
> What surprised me was that even if I take down my eth interface with
> an ifdown the IPMI still works. How does it do that? I mean I am using
> the shared NIC approach and I was expecting the IPMI to clam up the
> moment the OS took a port down.
> 
> On Sept 30 Joe Landman said:
> 
>> After years of configuring and helping run/manage both, we recommend strongly *against* the shared physical connector approach.  The extra cost/hassle of the extra cheap switch and wires is well worth the money.
>> Why do we take this view?  Many reasons, but some of the bigger ones are
> 
> 
> (I know Joe Landman and others had warned me against this but I tried
> to start with configuring a single shared NIC and then go for two
> NICs. Just keeping things simple to start with.)
> 
> But my single shared NIC results seem good enough already. Which is
> why I was trying to see if there are any worse possibilities of
> crashes that will render contacting the IPMI impossible.
> 
> On Sept 30 Joe Landman said:
> 
>> a) when the OS takes the port down, your IPMI no longer responds to arp requests.  Which means ping, and any other service (IPMI) will fail without a continuous updating of the arp tables, or a forced hardwire of those ips to those mac addresses.
> 
> Another point that surprises me is how the IPMI kept working even
> after CentOS took the port down. I definitely see Joe Landman's
> arguments about why it shouldn't be responding to ARPs any more
> (unless I did something special). That's why I am a bit surprised that
> my IPMI IP continues to respond to pings even after the primary
> IP is dead.
> 
> # Ping primary IP address
> ping 10.0.0.25
> [no response]
> 
> #Ping IPMI IP address
> ping 10.0.0.26
> PING 10.0.0.26 (10.0.0.26) 56(84) bytes of data.
> 64 bytes from 10.0.0.26: icmp_seq=1 ttl=64 time=0.574 ms
> 64 bytes from 10.0.0.26: icmp_seq=2 ttl=64 time=0.485 ms
> 
> Interestingly arp shows the primary IP as incomplete but the secondary
> IP resolves to the correct MAC. This means that the BMC continues to
> respond on the second MAC even after the OS took the eth port down.
> How exactly does this "magic" happen? I'm just curious.
> 
> node25                           (incomplete)                              bond0
> 10.0.0.26                ether   00:24:E8:63:D6:9E   C                     bond0
> 
> Another mysterious observation was this: Whenever I took eth down via
> the OS there was a latent period when the IPMI stopped responding, but
> then it somehow resurrected itself and started working again.
> 
> Just making sure this isn't a fluke case. Any comments or more
> disaster scenario simulations are welcome!
>
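
The ARP check described in the quoted message (primary address incomplete, IPMI address still resolving to a MAC) can be scripted so it is easy to repeat after each crash test. What follows is only a minimal sketch: the function name arp_state is made up for illustration, and it assumes the column layout shown in the quoted arp output (address first, then either "(incomplete)" or "ether" followed by the MAC). Other arp versions may print differently.

```shell
#!/bin/sh
# arp_state: hypothetical helper, not part of any standard tool.
# Reads arp-style output on stdin and prints one line per entry:
#   <host> incomplete
#   <host> resolved <mac>
# Assumes the layout in the quoted output: host, then "(incomplete)"
# or "ether <mac>". Header lines and anything else are ignored.
arp_state() {
    awk '
        /\(incomplete\)/ { print $1, "incomplete"; next }
        $2 == "ether"    { print $1, "resolved", $3 }
    '
}
```

Typical use would be to pipe the live table through it, e.g. `arp | arp_state`, once right after the ifdown and again after the latent period, and compare the entry for the IPMI address.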