[Beowulf] Good post-mortem of a Lustre outage at CSC
Adam DeConinck
ajdecon at ajdecon.org
Fri Apr 1 08:53:08 PDT 2016
In case some of the folks on this list haven't seen this particular
horror story yet :)
https://csc.fi/web/blog/post/-/blogs/the-largest-unplanned-outage-in-years-and-how-we-survived-it
"The DDN controller replacement went quite smoothly and around 10 a.m.
we were ready to bring the system back online. However, when
restarting the Lustre filesystem, the metadata server reported
anomalies in its filesystem and requested to do a filesystem check
(fsck). Typically these are fairly routine operations, especially when
the filesystem has been up for a long time. Any problems that the
check finds are typically fixed automatically with no impact.
In this case, however, the tool could not fix all the problems it
identified. A faulty inode persisted. Trying to bring the Lustre up
resulted in a system crash (kernel panic) with this inode a very
likely cause."
-Adam