[Beowulf] [External] Power Cycling Question
Prentice Bisbal
pbisbal at pppl.gov
Mon Jul 19 16:12:24 UTC 2021
Doug,
I don't think thermal cycling is as big of an issue as it used to be.
From what I've always been told/read, the biggest problem with thermal
cycling was "chip creep", where the expansion/contraction of a chip in a
socket would cause the chip to eventually work itself loose enough to
cause faulty connections. 20+ years ago, I remember looking at
motherboards with chips inserted into sockets. On a modern motherboard,
just about everything is soldered to the motherboard, except the CPU and
DIMMs. The CPUs are usually locked securely into place so chip creep
won't happen, and the DIMMs have a latching mechanism, although anyone
who has ever reseated a DIMM to fix a problem knows that mechanism
isn't perfect.
As someone else has pointed out, components with moving parts, like
spinning disks, are at higher risk of failure. Here, too, that risk is
disappearing, as SSDs are becoming more common, with even NVMe drives
available in servers.
I know there is a direct relationship between system failure and
operating temperature, but I don't know if that applies to all
components, or just those with moving parts. Someone somewhere must
have done research on this. I know Google did research on hard drive
failure that was pretty popular. I would imagine they would have
researched this, too.
As an example, when I managed an IBM Blue Gene/P, I remember IBM touting
that all the components on a node (which was only the size of a PCI
card) were soldered to the board - nothing was plugged into a socket.
This was to completely eliminate chip creep and increase reliability.
Also, the BG/P would shut down nodes between jobs, just as you're asking
about here. If there was another job waiting in the queue for those
nodes, the nodes would at least reboot between every job.
I do have to say that even though my BG/P was small for a Blue Gene, it
still had 2048 nodes, and given that number of nodes, I had extremely
few hardware problems at the node level, so there's something to be said
for that logic. I did, however, have to occasionally reseat a node into
a node card, which is the same as reseating a DIMM or a PCI card in a
regular server.
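
As for the Slurm hooks you mention, the power-saving knobs live in
slurm.conf. A rough sketch (the script paths and values below are only
placeholders for whatever a site uses to power nodes off and on, e.g.
via IPMI) looks something like this:

  # power-saving excerpt from slurm.conf (illustrative values only)
  SuspendProgram=/usr/local/sbin/node_poweroff.sh  # site script, e.g. wraps "ipmitool chassis power off"
  ResumeProgram=/usr/local/sbin/node_poweron.sh    # site script, e.g. wraps "ipmitool chassis power on"
  SuspendTime=1800     # seconds a node sits idle before it is powered down
  SuspendTimeout=120   # seconds allowed for a node to finish powering off
  ResumeTimeout=600    # seconds allowed for a node to boot and rejoin
  SuspendRate=10       # max nodes powered down per minute
  ResumeRate=10        # max nodes powered up per minute (limits inrush)
  SuspendExcNodes=node[001-004]  # nodes that should never be powered down

SuspendRate/ResumeRate also speak to the power-surge question, since
they cap how many nodes go down or come up per minute, and
ResumeTimeout bounds how long a queued job waits for nodes to boot.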
Prentice
On 7/16/21 3:35 PM, Douglas Eadline wrote:
> Hi everyone:
>
> Reducing power use has become an important topic. One
> of the questions I always wondered about is
> why more clusters do not turn off unused nodes. Slurm
> has hooks to turn nodes off when not in use and
> turn them on when resources are needed.
>
> My understanding is that power cycling creates
> temperature cycling, which then leads to premature node
> failure. Makes sense, but has anyone ever studied/tested
> this?
>
> The only other reasons I can think of are that the delay
> in server boot time makes job starts slow, or that there
> are power surge issues.
>
> I'm curious about other ideas or experiences.
>
> Thanks
>
> --
> Doug
>