[Beowulf] [EXTERNAL] Power Cycling Question

Prentice Bisbal pbisbal at pppl.gov
Mon Jul 19 15:46:33 UTC 2021


> On the "spool RAM to disk" idea - That's sort of like checkpointing, and it can take surprisingly long, so there's another tradeoff there.

Not really, especially not with NVMe drives. I have NVMe drives in 
both my laptop and my desktop, and it is startling how fast they boot 
and resume from suspend.
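
As a back-of-the-envelope check, the time to spool RAM to disk is bounded 
below by RAM size divided by sustained write bandwidth. The sketch below 
is only that: the RAM sizes and write rates are hypothetical figures 
chosen to show the scaling, not measurements, and it ignores compression 
and any skipping of unused pages, both of which would shorten the time.

```python
def spool_seconds(ram_gib: float, write_gib_per_s: float) -> float:
    """Rough lower bound on the time to write ram_gib of memory to disk."""
    if write_gib_per_s <= 0:
        raise ValueError("write bandwidth must be positive")
    return ram_gib / write_gib_per_s


# Hypothetical configurations (assumed numbers, for illustration only).
examples = [
    ("laptop, 16 GiB RAM, 2 GiB/s writes", 16, 2.0),
    ("server, 512 GiB RAM, 2 GiB/s writes", 512, 2.0),
    ("server, 512 GiB RAM, 0.5 GiB/s writes", 512, 0.5),
]

for label, ram_gib, bandwidth in examples:
    print(f"{label}: about {spool_seconds(ram_gib, bandwidth):.0f} s")
```

Under these assumptions the time grows linearly with the amount of RAM, 
so a node with far more memory than a laptop takes correspondingly 
longer on the same drive.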

I think the bigger issue with this approach is whether enterprise 
servers would support it. I believe it requires some level of hardware 
support, which I doubt servers designed for constant-on use have. 
Someone please jump in and correct me if I'm wrong here.

Prentice

On 7/16/21 8:38 PM, Lux, Jim (US 7140) via Beowulf wrote:
> An interesting question.
> The power cycling reliability thing is probably not a big deal - the temperatures change a lot between light load and heavy load already, and if a "server class" PC can't take a power cycle per day, when the grungiest consumer unit can do it, I'd be surprised. It's not like you're cycling between -40C and 70C every hour like in an automotive application.
>
> Managing the chillers, though - That might be a bigger problem.
>
> And as Jörg points out, there's a fair amount of sophistication needed in setting your turn on and turn off thresholds.
>
> On the "spool RAM to disk" idea - That's sort of like checkpointing, and it can take surprisingly long, so there's another tradeoff there.
>
>
> On 7/16/21, 12:35 PM, "Beowulf on behalf of Douglas Eadline" <beowulf-bounces at beowulf.org on behalf of deadline at eadline.org> wrote:
>
>
>      Hi everyone:
>
>      Reducing power use has become an important topic. One
>      of the questions I always wondered about is
>      why more clusters do not turn off unused nodes. Slurm
>      has hooks to turn nodes off when not in use and
>      turn them on when resources are needed.
>
>      My understanding is that power cycling creates
>      temperature cycling, which then leads to premature node
>      failure. That makes sense, but has anyone ever studied
>      or tested this?
>
>      The only other reasons I can think of are that the delay
>      in server boot time makes job starts slow, or that powering
>      nodes back on causes power surge issues.
>
>      I'm curious about other ideas or experiences.
>
>      Thanks
>
>      --
>      Doug
>
>      _______________________________________________
>      Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
>      To change your subscription (digest mode or unsubscribe) visit https://beowulf.org/cgi-bin/mailman/listinfo/beowulf

