[Beowulf] immersion

Scott Atchley e.scott.atchley at gmail.com
Sun Mar 24 17:46:03 UTC 2024


On Sat, Mar 23, 2024 at 10:40 AM Michael DiDomenico <mdidomenico4 at gmail.com>
wrote:

> i'm curious to know
>
> 1 how many servers per vat or U
> 2 i saw a slide mention 1500w/sqft, can you break that number into kw per
> vat?
> 3 can you shed any light on the heat exchanger system? it looks like
> there's just two pipes coming into the vat, is that chilled water or oil?
> is there a CDU somewhere off camera?
> 4 that power bar in the middle is that DUG custom?
> 5 any stats on reliability?  like have you seen a decrease in the hw
> failures?
>
> are you selling the vats/tech as a product?  can i order one? :)
>
> since cpu's are pushing 400w/chip, nvidia is teasing 1000w/chip coming in
> the near future, and i'm working on building a new site, i'm keenly
> interested in thoughts on DLC or immersion tech from anyone else too
>

As with all things in life, everything has trade-offs.

We have looked at immersion at ORNL and these are my thoughts:

*Immersion*

   - *Pros*
      - Low Power Usage Effectiveness (PUE) - as low as 1.03. This means that
      you spend only $0.03 on cooling for each $1.00 that the system consumes
      in power. In contrast, air-cooled data centers can range from 1.30 to
      1.60 or higher.
      - No special racks - can install white box servers and remove the
      fans.
      - No cooling loops - no fittings that can leak, get kinked, or
      accidentally clamped off.
      - No bio-growth issues
   - *Cons*
      - Low power density - take a vertical rack and lay it sideways. DLC
      allows the same power density with the rack being vertical.
      - Messy - depends on the fluid, but oil is common and cheap. Many
      centers build a crane to hoist out servers and then let them drip dry
      for a day before servicing.
      - High Mean-Time-To-Repair (MTTR) - unless you have two cranes, you
      cannot insert a new node until the old one has dripped dry and been
      removed from the crane.
      - Some solutions can be expensive and/or lead to part failures due to
      residue buildup on processor pins.
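
The PUE arithmetic in the first immersion pro can be checked with a short
sketch. PUE is total facility power divided by the power drawn by the IT
equipment, so the overhead per dollar of IT power is PUE - 1. Strictly, that
overhead covers all non-IT facility load, of which cooling is usually the
largest part. The function name and output format are illustrative
assumptions; the PUE values are the ones quoted above.

```python
def overhead_per_it_dollar(pue: float) -> float:
    """Return the facility overhead cost for each $1.00 of IT power.

    PUE = total facility power / IT equipment power, so overhead = PUE - 1.
    """
    if pue < 1.0:
        raise ValueError("PUE cannot be below 1.0")
    # Round to suppress floating-point noise such as 0.030000000000000027.
    return round(pue - 1.0, 4)


# PUE values quoted in the message above.
examples = [
    ("immersion (best case)", 1.03),
    ("air-cooled (low end)", 1.30),
    ("air-cooled (high end)", 1.60),
]

for label, pue in examples:
    cost = overhead_per_it_dollar(pue)
    print(f"{label}: ${cost:.2f} overhead per $1.00 of IT power")
```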

*Direct Liquid Cooling (DLC)*

   - *Pros*
      - Low PUE compared to air-cooled. It depends on how much of the heat
      the water captures. Summit uses hybrid DLC (water for CPUs and GPUs and
      air for DIMMs, NICs, SSDs, and power supplies) with ~22°C water.
      Summit's PUE can range from 1.03 to 1.10 depending on the time of year.
      Frontier, on the other hand, is 100% DLC (no fans in the compute racks)
      with 32°C water. Frontier's PUE can range from 1.03 to 1.06 depending
      on the time of year. Both PUEs include the pumps for the water towers
      and the pumps that move the water between the Central Energy Plant and
      the data center.
      - High power density - the HPE Cray EX 4000 "cabinet" can supply up
      to 400 kW and is equivalent in space to two racks (i.e., 200 kW per
      standard rack). If your data center is space constrained, this is a
      crucial factor.
      - No mess - DLC systems using deionized water (DI water) or propylene
      glycol water (PGW) use dripless connectors.
      - Low MTTR - remove a server and insert another if you have a spare.
   - *Cons*
      - Special racks - HPE cabinets are non-standard and require HPE
      designed servers. This is changing. I saw many examples of ORv3 racks at
      GTC that use the OCP standard with DLC manifolds.
      - Cooling loops - Loops can leak at fittings, or be kinked or crimped,
      which restricts flow and causes overheating. Hybrid loops are simpler
      while 100% DLC loops are more complex (i.e., expensive). Servers tend to
      include drip sensors to detect leaks, but we have found that the DIMMs
      are better drip sensors (i.e., the drips hit them before finding the
      drip sensor). 😆
      - Bio-growth
         - DI water includes biocides and you have to manage it. We have
         learned that no system can be bio-growth free (e.g., inserting a
         blade will recontaminate the system). That said, Summit has never
         had any bio-growth-induced overheating and Frontier has gone close
         to nine months without overheating issues due to growth.
         - PGW systems should be immune to any bio-growth but you lose ~30%
         of the heat removal capacity compared to DI water. Depending on your
         environment, you might be able to avoid trim water (i.e., mixing in
         chilled water to reduce the temperature).
      - Can be expensive to upgrade the facility (i.e., to install
      evaporative coolers, piping, pumps, etc.).
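
Two of the DLC figures above are simple arithmetic and can be sketched as
follows: the per-rack power density of a cabinet that occupies two rack
footprints, and the rough effect of the ~30% PGW penalty on a loop sized for
DI water. The function names are hypothetical, and treating the PGW penalty
as a flat multiplier is a simplifying assumption; real capacity depends on
flow rate, glycol concentration, and temperatures.

```python
def kw_per_standard_rack(cabinet_kw: float, rack_footprints: float) -> float:
    """Spread a cabinet's power over the standard-rack footprints it occupies."""
    if rack_footprints <= 0:
        raise ValueError("rack_footprints must be positive")
    return cabinet_kw / rack_footprints


def pgw_capacity_kw(di_capacity_kw: float, loss_fraction: float = 0.30) -> float:
    """Estimate PGW heat removal from a DI-water figure.

    Assumption: the ~30% loss quoted above applies as a flat multiplier.
    """
    if not 0.0 <= loss_fraction < 1.0:
        raise ValueError("loss_fraction must be in [0, 1)")
    return di_capacity_kw * (1.0 - loss_fraction)


# Figures from the message: a 400 kW cabinet the size of two racks.
per_rack = kw_per_standard_rack(400.0, 2)
print(f"Per standard rack: {per_rack:.0f} kW")

# Hypothetical loop that removes that much heat with DI water.
print(f"Same loop on PGW: about {pgw_capacity_kw(per_rack):.0f} kW")
```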

For ORNL, we are space constrained. For that alone, we prefer DLC over
immersion.