[Beowulf] [EXTERNAL] Power Cycling Question
Lux, Jim (US 7140)
james.p.lux at jpl.nasa.gov
Sat Jul 17 00:38:01 UTC 2021
An interesting question.
The power cycling reliability thing is probably not a big deal - the temperatures change a lot between light load and heavy load already, and if a "server class" PC can't take a power cycle per day, when the grungiest consumer unit can do it, I'd be surprised. It's not like you're cycling between -40C and 70C every hour like in an automotive application.
Managing the chillers, though - That might be a bigger problem.
And as Jörg points out, there's a fair amount of sophistication needed in setting your turn on and turn off thresholds.
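To make the threshold point concrete, here is a minimal sketch of an on/off decision with hysteresis. It is not Slurm's logic or anyone's production policy. The names (PowerPolicy, decide) and every number are hypothetical, chosen only to illustrate the idea. The turn-off and turn-on conditions are deliberately different, and a minimum off time keeps a node from flapping.

```python
from dataclasses import dataclass


@dataclass
class PowerPolicy:
    # All values are illustrative assumptions, not recommendations.
    idle_off_s: float = 1800.0  # power off after this much continuous idle time
    queue_on_jobs: int = 1      # power on once this many jobs are waiting
    min_off_s: float = 600.0    # once off, stay off at least this long


def decide(powered_on: bool, idle_s: float, off_s: float,
           queued_jobs: int, policy: PowerPolicy) -> str:
    """Return one of: 'stay_on', 'power_off', 'stay_off', 'power_on'."""
    if powered_on:
        # Only shut down when nothing is waiting AND the node has been idle
        # long enough that a power cycle is likely to pay for itself.
        if queued_jobs == 0 and idle_s >= policy.idle_off_s:
            return "power_off"
        return "stay_on"
    # Node is off: wake it only when work is queued and the minimum
    # off period has elapsed, which limits rapid on/off cycling.
    if queued_jobs >= policy.queue_on_jobs and off_s >= policy.min_off_s:
        return "power_on"
    return "stay_off"
```

With the defaults above, a node that has been idle for 2000 s with an empty queue gets "power_off", while a node that has been off for only 100 s stays off even if a job is waiting. A real policy would probably also look at chiller state and expected queue arrivals.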
On the "spool RAM to disk" idea - That's sort of like checkpointing, and it can take surprisingly long, so there's another tradeoff there.
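A rough back-of-the-envelope sketch of that tradeoff, under stated assumptions: the spool time is just RAM size divided by sustained write bandwidth, and the break-even idle time assumes the node draws its active power during the whole save/boot/restore overhead and its off (standby) power for the rest. The function names and all the figures are hypothetical, not measurements.

```python
def spool_seconds(ram_gib: float, write_mib_per_s: float) -> float:
    """Time to write ram_gib of memory to disk at a sustained write rate."""
    if write_mib_per_s <= 0:
        raise ValueError("write bandwidth must be positive")
    return ram_gib * 1024.0 / write_mib_per_s


def break_even_idle_seconds(overhead_s: float, active_w: float,
                            idle_w: float, off_w: float) -> float:
    """Idle duration beyond which powering off uses less energy than idling.

    Staying idle for T seconds costs idle_w * T.
    Power cycling costs active_w * overhead_s + off_w * (T - overhead_s).
    Setting the two equal and solving for T gives the expression below.
    """
    if idle_w <= off_w:
        raise ValueError("powering off never saves energy if idle_w <= off_w")
    return overhead_s * (active_w - off_w) / (idle_w - off_w)


if __name__ == "__main__":
    # Assumed example: 256 GiB of RAM, 500 MiB/s sustained disk writes.
    t_spool = spool_seconds(256, 500)
    print(f"spool time: {t_spool:.0f} s")
    # Assumed example: 600 s total overhead, 300 W active, 100 W idle, 10 W off.
    t_even = break_even_idle_seconds(600, 300, 100, 10)
    print(f"break-even idle time: {t_even:.0f} s")
```

With those assumed numbers the spool alone takes about 524 s, a little under nine minutes, and the node has to sit unused for roughly 1900 s before the power cycle saves any energy.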
On 7/16/21, 12:35 PM, "Beowulf on behalf of Douglas Eadline" <beowulf-bounces at beowulf.org on behalf of deadline at eadline.org> wrote:
Reducing power use has become an important topic. One
of the questions I always wondered about is
why more clusters do not turn off unused nodes. Slurm
has hooks to turn nodes off when not in use and
turn them on when resources are needed.
My understanding is that power cycling creates
temperature cycling, that then leads to premature node
failure. Makes sense and has anyone ever studied/tested
The only other reason I can think of is that the delay
in server boot time makes job starts slow or power
I'm curious about other ideas or experiences.