[Beowulf] delayed savings time crashes
David Kewley
kewley at gps.caltech.edu
Wed Apr 12 12:57:30 PDT 2006
On Wednesday 12 April 2006 11:34, David Mathog wrote:
> Hmm, now that we know the cause of it that might explain
> why all those that did reboot were plugged into just 2 surge
> suppressors, where the loss was 9/10 machines, whereas the
> other 2 surge suppressors lost 0/10 machines. Each surge
> suppressor is on its own circuit which is 1/3rd of a 3 phase line.
> Maybe only one phase had the glitch and by good luck the
> two circuits which lost no machines were wired between the
> two good phases?
I don't know how that would work, but I saw something similar and even
stranger. Our UPS feeds two PDUs, each responsible for about half of the
computers. On one PDU, all computers on phases 1 & 2 failed; on the other,
all computers on phases 1 & 3 failed. On both PDUs, every computer on the
third, unaffected phase stayed up. I have no idea how to explain this.
> > This info comes from the responsible EE at Caltech. As for its
> > effects, believe me, I know about it the hard way, as it took down 2/3
> > of our compute nodes, 1/3 of our disk shelves, and 3/4 of our
> > fileservers.
>
> That's a lot of machines in your case. Did any sustain permanent
> damage?
It was a voltage drop rather than a spike, and that probably explains why we
had no hardware damage. We just have quite a bit of filesystem corruption to
clean up, which leaves a small subset of user files either lost or with
corrupted data.
David