[Beowulf] delayed savings time crashes

David Mathog mathog at caltech.edu
Wed Apr 12 11:34:33 PDT 2006


> The reboots were due to a City of Pasadena power glitch at 9:17 that
> morning. :)  It was raining, and a 34kV city feeder line that runs between
> the generating plant at the entrance of the 110 and a substation at Del Mar
> & Los Robles faulted.  The responsible breaker took 13 cycles to break,
> during which time the single-phase voltage seen at Caltech dropped to about
> 75V.
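
For scale, here is a rough sketch of what those numbers work out to.
The 60 Hz grid frequency is standard for North America; the 120 V
nominal single-phase voltage is my assumption, since the quoted text
does not state it, and the function names are just illustrative.

```python
GRID_HZ = 60.0     # North American grid frequency
NOMINAL_V = 120.0  # ASSUMED nominal single-phase voltage (not given above)


def sag_duration_ms(cycles, hz=GRID_HZ):
    """Convert a breaker clearing time in AC cycles to milliseconds."""
    return cycles / hz * 1000.0


def sag_fraction(v_during, v_nominal=NOMINAL_V):
    """Voltage remaining during the sag, as a fraction of nominal."""
    return v_during / v_nominal


if __name__ == "__main__":
    print(round(sag_duration_ms(13), 1))  # about 216.7 ms
    print(sag_fraction(75.0))             # 0.625 of the assumed nominal
```

So, under that assumption, the nodes saw roughly 60% of nominal
voltage for a bit over a fifth of a second.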

I was on campus at that time and didn't notice it.  My desktop
machine didn't even hiccup.

Hmm, now that we know the cause, that might explain why all
the nodes that did reboot were plugged into just 2 surge
suppressors, where the loss was 9/10 machines, whereas the
other 2 surge suppressors lost 0/10 machines.  Each surge
suppressor is on its own circuit, which is 1/3rd of a 3 phase line.
Maybe only one phase had the glitch, and by good luck the
two circuits which lost no machines were wired between the
two good phases?

Usually power glitches just crash the nodes and they stay down,
but this one may have looked enough like power off/power on to have
allowed a reboot.  The servers are all plugged into UPSs, so
they saw none of this.

> This info comes from the responsible EE at Caltech.  As for its effects, 
> believe me, I know about it the hard way, as it took down 2/3 of our 
> compute nodes, 1/3 of our disk shelves, and 3/4 of our fileservers.  

That's a lot of machines in your case.  Did any sustain permanent
damage?

> As for the time glitch, that is probably induced by the fact that Daylight
> Savings Time changes only take place on the "system" clock,

Right, that makes perfect sense.  There had been no planned shutdown
since the DST change, so the nodes would have come up an hour off.
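
A minimal sketch of that arithmetic, assuming the hardware clock
is kept in local time and is only rewritten from the system clock
during a clean shutdown (I have not checked how these nodes are
actually configured).  The date below is hypothetical, and the fixed
PST/PDT offsets and function names are just for illustration.

```python
from datetime import datetime, timedelta, timezone

PST = timezone(timedelta(hours=-8), "PST")  # standard time offset
PDT = timezone(timedelta(hours=-7), "PDT")  # daylight time offset


def clock_after_reboot(true_time_utc):
    """Return (correct_local, rtc_local) for an unplanned reboot.

    ASSUMPTION: the hardware clock still holds standard-time wall
    time, and the booting system reads it as if it were current
    local (daylight) time.
    """
    correct = true_time_utc.astimezone(PDT)
    # Stale wall-clock digits, relabelled as current local time.
    rtc = true_time_utc.astimezone(PST).replace(tzinfo=PDT)
    return correct, rtc


def rtc_error(true_time_utc):
    """How far the rebooted node's clock lags the correct time."""
    correct, rtc = clock_after_reboot(true_time_utc)
    return correct - rtc


if __name__ == "__main__":
    # Hypothetical instant: 09:17 PDT on an arbitrary April 2006 day.
    glitch = datetime(2006, 4, 11, 16, 17, tzinfo=timezone.utc)
    print(rtc_error(glitch))  # 1:00:00
```

Under that assumption a crashed node comes back exactly one hour
behind, whereas a node that was shut down cleanly after the change
would have had its hardware clock corrected.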

Thanks,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech


