[Beowulf] Power Cycling Question

Andrew M.A. Cater amacater at einval.com
Sat Jul 17 10:19:42 UTC 2021


On Sat, Jul 17, 2021 at 12:43:27AM +0100, Jörg Saßmannshausen wrote:
> Hi Doug,
> 
> interesting topic and quite apt when I look at the flooding in Germany, 
> Belgium and the Netherlands. 
> 
> I guess there are a number of reasons why people are not doing it. Discarding 
> the usual "we've never done that" one, I guess the main problem is: when do you 
> want to turn a node off? After 5 mins being idle? Maybe 10 mins? One hour? How 
> often do you then need to boot it up again and how much energy does that 
> cost? From chatting to a few people who tried it in the past, it 
> transpired that you do not save as much energy as you were hoping for. 
> 
> However, one thing came to my mind: is it possible to simply suspend a node to 
> disk and then let it sleep? That way, you wake the node up quicker and it 
> probably needs less power while it is suspended. Think of laptops. 
> 
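
The "when do you turn it off" question quoted above is really a break-even
calculation. A minimal sketch, assuming a very simple model: a node draws a
constant idle power, a small residual power when off (BMC, standby), and one
shutdown-plus-boot cycle costs a fixed amount of energy. All names and figures
here are hypothetical placeholders, not measurements from any real cluster.

```python
def break_even_idle_seconds(idle_watts, off_watts, cycle_joules):
    """Idle time beyond which powering off uses less energy than idling.

    idle_watts   -- assumed power draw of an idle node (W)
    off_watts    -- assumed residual draw when powered off, e.g. BMC (W)
    cycle_joules -- assumed energy cost of one shutdown plus boot (J)
    """
    saved_watts = idle_watts - off_watts
    if saved_watts <= 0:
        raise ValueError("powering off must draw less than idling")
    return cycle_joules / saved_watts


def energy_saved_joules(idle_seconds, idle_watts, off_watts, cycle_joules):
    """Net energy saved by powering off for an idle period (negative = loss)."""
    return (idle_watts - off_watts) * idle_seconds - cycle_joules


if __name__ == "__main__":
    # Hypothetical node: 110 W idle, 10 W off, 60 kJ per off/on cycle.
    t = break_even_idle_seconds(110.0, 10.0, 60_000.0)
    print(f"break-even idle time: {t:.0f} s")
```

If the typical idle gap between jobs is shorter than the break-even time, power
cycling costs energy rather than saving it, which may be what the people who
tried it ran into.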

If your disks are spinning rust - and have been spinning for a long time -
you may not want to power them down and back up again. If you've got a 
storage shelf of RAID-ed disks that is fairly large, there's a small but
non-zero chance that one of the individual disks will fail with power/heat
cycling. I've seen shelves run with five or six disks out - which is a 
very bad sign, because it means that no-one has looked at them in a while -
but still running.
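
To put a rough number on "small but non-zero": a sketch assuming each disk
fails independently with the same probability per power cycle. Real failures
in an old shelf are probably correlated, so this is only illustrative, and the
per-disk probability is a made-up parameter, not a measured rate.

```python
def prob_any_disk_fails(n_disks, p_fail_per_cycle):
    """Probability that at least one of n_disks fails during one power cycle.

    Assumes independent, identical per-disk failure probability
    (a simplification; p_fail_per_cycle is a hypothetical input).
    """
    if not 0.0 <= p_fail_per_cycle <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    return 1.0 - (1.0 - p_fail_per_cycle) ** n_disks


if __name__ == "__main__":
    # Hypothetical shelf sizes with an assumed 0.5% per-disk chance per cycle.
    for n in (12, 24, 60):
        print(n, round(prob_any_disk_fails(n, 0.005), 3))
```

Even a per-disk chance well under one percent adds up quickly across a large
shelf, and more so across repeated cycles.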

A second-hand anecdote from a few years ago: a supercomputer in an aerospace
facility in Hyderabad. Apparently when the aircon couldn't cope, they'd race
to shut down the computer nodes before they failed. As a consequence, they had
enough failures that they couldn't really do any work.

> The other way around would simply be: we know in say the summer, there is less 
> demand so we simply turn X number of nodes off and might do some maintenance 
> on them. So you are running the whole cluster for say 6 weeks with limited 
> capacity. That might mean a few jobs are queuing but that also will give us a 
> window to do things. Once people are coming back, the maintenance is done and 
> the cluster can run at full capacity again. 
> 
That's a much more realistic point.

All best,

Andy Cater

> Just some (crazy?) ideas.
> 
> All the best
> 
> Jörg
> 
> On Friday, 16 July 2021, 20:35:11 BST, Douglas Eadline wrote:
> > Hi everyone:
> > 
> > Reducing power use has become an important topic. One
> > of the questions I always wondered about is
> > why more clusters do not turn off unused nodes. Slurm
> > has hooks to turn nodes off when not in use and
> > turn them on when resources are needed.
> > 
> > My understanding is that power cycling creates
> > temperature cycling, which then leads to premature node
> > failure. That makes sense, but has anyone ever studied or
> > tested this?
> > 
> > The only other reasons I can think of are that the delay
> > in server boot time makes job starts slow, or that powering
> > nodes back on causes power surge issues.
> > 
> > I'm curious about other ideas or experiences.
> > 
> > Thanks
> > 
> > --
> > Doug
> 

