[Beowulf] reboot without passing through BIOS?

Kilian CAVALOTTI kilian at stanford.edu
Thu Jul 31 13:00:45 PDT 2008


On Wednesday 30 July 2008 09:13:56 am David Mathog wrote:
> If one were to build nodes without ECC memory it would probably be a
> good idea to reboot them from time to time to clean out whatever bad
> bits might have accumulated.  It then occurred to me that doing so
> would require a trip through the BIOS on every reboot, at least on
> every x86 based computer I'm familiar with.  That is not a terrible
> thing, but it made me wonder if it is really necessary. 

I may be totally missing the point, but doesn't the memory need to be 
physically (as in electrically) reset in order to clean out those bad 
bits? And doesn't this require a hard reboot, for the machine to be 
power cycled, so that memory cells are reinitialized? 

I mean, if the BIOS stage is skipped, as in kexec'ing a new kernel,
electrical initialization doesn't occur, and the bad bits will probably
stay where they are. Unless the kernel does this kind of scrubbing
during its initialization phase (I don't know whether it does), I don't
see any reason why the errors would be cleared from memory.
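
For reference, a kexec-based reboot (the one that skips the BIOS stage)
could be sketched roughly as below. This is only an illustration: the
helper name kexec_commands and the kernel/initrd paths are made up for
the example, and the commands are only printed, not executed. It
assumes the kexec-tools command-line utility is what gets used.

```python
import shlex


def kexec_commands(kernel, initrd):
    """Build the two kexec-tools invocations for a BIOS-less reboot.

    The paths passed in are assumptions for illustration; nothing is
    executed here.
    """
    # Stage the new kernel and initrd, reusing the running kernel's
    # command line.
    load = ["kexec", "-l", kernel, f"--initrd={initrd}", "--reuse-cmdline"]
    # Jump straight into the staged kernel: no firmware, no power cycle,
    # so no electrical re-initialization of the memory.
    execute = ["kexec", "-e"]
    return [load, execute]


if __name__ == "__main__":
    # Hypothetical paths; adjust to the node's actual boot files.
    for cmd in kexec_commands("/boot/vmlinuz", "/boot/initrd.img"):
        print(shlex.join(cmd))
```

Note that "kexec -e" jumps into the new kernel immediately, so on a
real node services would have to be stopped and filesystems synced
first.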


Another thing I wonder is whether a reboot would do any good for
non-ECC memory anyway. As far as I understand it, a memory error is
either a repeatable, hard one, like a bad chip, in which case a reboot
won't change anything since the hardware is faulty, or a transient,
soft one, where a bad value is read once but subsequent reads are fine.
So unless there is some sort of accumulation in the soft case, I don't
really understand what a reboot could do about it.
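
To make the "accumulation" case concrete, here is a toy model. It rests
on two assumptions that are not established in this thread: that a
flipped bit stays wrong until its word is rewritten, and that a reboot
rewrites all of memory. The function name simulate and all the numbers
(memory size, flip rate) are invented for the illustration.

```python
import random


def simulate(words=1024, bits=64, flips_per_day=3, days=30,
             reboot_every=None, seed=0):
    """Return how many bits are still wrong after `days` days."""
    rng = random.Random(seed)
    corrupted = set()  # (word, bit) positions currently holding a wrong value
    for day in range(1, days + 1):
        for _ in range(flips_per_day):
            pos = (rng.randrange(words), rng.randrange(bits))
            # Flipping an already-flipped bit restores its original value.
            corrupted ^= {pos}
        if reboot_every and day % reboot_every == 0:
            # Assumption: the reboot rewrites every word of memory.
            corrupted.clear()
    return len(corrupted)


if __name__ == "__main__":
    print("never rebooted:  ", simulate())
    print("weekly reboot:   ", simulate(reboot_every=7))
    print("reboot on day 30:", simulate(reboot_every=30))
```

Under these assumptions, periodic reboots bound the number of bad bits
to whatever accumulated since the last one. If instead a soft error
only affects a single read, the model does not apply and the reboot
buys nothing.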

If you have some light to shed on this, I'd be interested.

Cheers,
-- 
Kilian


