updating the Linux kernel

Crutcher Dunnavant dunna001 at bama.ua.edu
Fri Jun 9 19:45:50 PDT 2000


Now, I might be completely missing something here, but shouldn't all *distributed* 
parallel programs assume that a node may not return? After all, what do you 
assume about hardware failures? So, while it may not be a *good* way to do it, 
in a properly parallelized application, shouldn't you be able to take down any 
random node other than the job allocation node, AT ANY TIME, and have that job 
reallocated? (Yeah, you lose the local work, but those tasks should be 
checkpointed frequently.) I just don't think that you should EVER be able to 
lose more than 5-10 minutes' worth of work on a given node, and if you can, you 
should re-examine your program design. So just kill the boxes and update 
them, one at a time.
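
To make the checkpointing point concrete, here is a rough sketch of what I 
mean, in Python. Everything in it is made up for illustration: the names 
(save_checkpoint, load_checkpoint, run_task), the JSON state file, and the 
die_at hook that fakes a node being killed. It assumes the checkpoint lives 
somewhere that outlasts the node (shared or persistent storage), and it is 
not taken from any particular job scheduler.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state to a temp file, then rename it over the old checkpoint.

    The rename is atomic on POSIX, so a node killed mid-write leaves the
    previous checkpoint intact rather than a half-written file.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the last saved state, or a fresh state if none exists."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"next_item": 0, "total": 0}


def run_task(path, n_items, checkpoint_every=100, die_at=None):
    """Sum the squares of 0..n_items-1, checkpointing periodically.

    die_at is a hypothetical test hook: it raises just before processing
    that item, standing in for the box being taken down.
    """
    state = load_checkpoint(path)
    for i in range(state["next_item"], n_items):
        if die_at is not None and i == die_at:
            raise RuntimeError("node killed")
        state["total"] += i * i
        state["next_item"] = i + 1
        if state["next_item"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["total"]
```

If the task is killed at item 550 with checkpoint_every=100, the next run 
picks up at item 500, so at most one checkpoint interval of work is redone. 
Pick the interval so that it corresponds to a few minutes of wall-clock time.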

-Crutcher Dunnavant
"Elegant, Documented, On Time; Choose 2"
Email: dunna001 at bama.ua.edu
Resume: http://resumes.dice.com/crutcher
Home:(256)-232-7883




