[Beowulf] immersion
Scott Atchley
e.scott.atchley at gmail.com
Sun Mar 24 17:46:03 UTC 2024
On Sat, Mar 23, 2024 at 10:40 AM Michael DiDomenico <mdidomenico4 at gmail.com>
wrote:
> i'm curious to know
>
> 1 how many servers per vat or U
> 2 i saw a slide mention 1500w/sqft, can you break that number into kw per
> vat?
> 3 can you shed any light on the heat exchanger system? it looks like
> there's just two pipes coming into the vat, is that chilled water or oil?
> is there a CDU somewhere off camera?
> 4 that power bar in the middle is that DUG custom?
> 5 any stats on reliability? like have you seen a decrease in the hw
> failures?
>
> are you selling the vats/tech as a product? can i order one? :)
>
> since cpu's are pushing 400w/chip, nvidia is teasing 1000w/chip coming in
> the near future, and i'm working on building a new site, i'm keenly
> interested in thoughts on DLC or immersion tech from anyone else too
>
As with all things in life, everything has trade-offs.
We have looked at immersion at ORNL and these are my thoughts:
*Immersion*
- *Pros*
- Low Power Usage Effectiveness (PUE) - as low as 1.03. This means that
you spend only $0.03 on cooling for each $1.00 of power that the system
consumes. In contrast, the PUE of air-cooled data centers can range
from 1.30 to 1.60 or higher.
- No special racks - can install white box servers and remove the
fans.
- No cooling loops - no fittings that can leak, get kinked, or
accidentally clamped off.
- No bio-growth issues
- *Cons*
- Low power density - an immersion tank is essentially a vertical rack
laid on its side, so it takes up more floor space. DLC allows the same
power density with the rack standing vertical.
- Messy - depends on the fluid, but oil is common and cheap. Many
centers build a crane to hoist out servers and then let them drip dry for
a day before servicing.
- High Mean-Time-To-Repair (MTTR) - unless you have two cranes, you
cannot insert a new node until the old one has dripped dry and been
removed from the crane.
- Some solutions can be expensive and/or lead to part failures due to
residue buildup on processor pins.
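To make the PUE numbers above concrete, here is a minimal Python sketch of the arithmetic. It assumes PUE is total facility power divided by IT power and treats all of the non-IT overhead as cooling, which is a simplification. The function names and the 1,000 kW example load are hypothetical and only for illustration.

```python
def overhead_per_dollar(pue: float) -> float:
    """Overhead spend (treated here as cooling) per $1.00 of IT power."""
    if pue < 1.0:
        raise ValueError("PUE cannot be below 1.0")
    return pue - 1.0


def facility_power_kw(it_power_kw: float, pue: float) -> float:
    """Total facility power for a given IT load and PUE."""
    return it_power_kw * pue


if __name__ == "__main__":
    it_load_kw = 1000.0  # hypothetical IT load, not a figure from this thread
    cases = [
        ("immersion, best case", 1.03),
        ("air-cooled, low end", 1.30),
        ("air-cooled, high end", 1.60),
    ]
    for label, pue in cases:
        extra = overhead_per_dollar(pue)
        total = facility_power_kw(it_load_kw, pue)
        print(f"{label}: PUE {pue:.2f}, ${extra:.2f} overhead per $1.00, "
              f"{total:.0f} kW total for {it_load_kw:.0f} kW of IT load")
```

The same two functions apply to the DLC figures as well, since PUE is defined the same way regardless of the cooling technology.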
*Direct Liquid Cooling (DLC)*
- *Pros*
- Low PUE compared to air-cooled. It depends on how much of the heat the
water captures. Summit uses hybrid DLC (water for CPUs and GPUs and air
for DIMMs, NICs, SSDs, and power supplies) with ~22°C water. Summit's PUE
can range from 1.03 to 1.10 depending on the time of year. Frontier, on
the other hand, is 100% DLC (no fans in the compute racks) with 32°C
water. Frontier's PUE can range from 1.03 to 1.06 depending on the time
of year. Both PUEs include the pumps for the water towers and the pumps
that move the water between the Central Energy Plant and the data center.
- High power density - the HPE Cray EX 4000 "cabinet" can supply up
to 400 kW and is equivalent in space to two racks (i.e., 200 kW per
standard rack). If your data center is space constrained, this is a
crucial factor.
- No mess - DLC systems using deionized water (DI water) or propylene
glycol water (PGW) use dripless connectors.
- Low MTTR - remove a server and insert another if you have a spare.
- *Cons*
- Special racks - HPE cabinets are non-standard and require
HPE-designed servers. This is changing. I saw many examples of ORv3 racks
at GTC that use the OCP standard with DLC manifolds.
- Cooling loops - Loops can leak at fittings, or be kinked or crimped,
which restricts flow and causes overheating. Hybrid loops are simpler,
while 100% DLC loops are more complex (i.e., more expensive). Servers
tend to include drip sensors to detect leaks, but we have found that the
DIMMs are better drip sensors (i.e., the drips hit them before reaching
the drip sensor). 😆
- Bio-growth
- DI water includes biocides and you have to manage it. We have
learned that no system can be bio-growth free (e.g., inserting a blade
will recontaminate the system). That said, Summit has never had any
bio-growth-induced overheating and Frontier has gone close to nine months
without overheating issues due to growth.
- PGW systems should be immune to any bio-growth but you lose ~30%
of the heat removal capacity compared to DI water. Depending on your
environment, you might be able to avoid trim water (i.e., mixing in
chilled water to reduce the temperature).
- Can be expensive to upgrade the facility (i.e., to install
evaporative coolers, piping, pumps, etc.).
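The density and coolant figures above can be turned into rough sizing arithmetic. The Python sketch below uses the 400 kW cabinet occupying two rack footprints and the ~30% PGW capacity loss quoted above. It assumes the PGW loss scales linearly with loop capacity, which is a simplification, and the function names and the 1,000 kW target load are hypothetical.

```python
import math


def kw_per_rack_footprint(cabinet_kw: float, rack_footprints: float) -> float:
    """Power per standard-rack footprint for a cabinet spanning several footprints."""
    return cabinet_kw / rack_footprints


def pgw_capacity_kw(di_capacity_kw: float, loss_fraction: float = 0.30) -> float:
    """Approximate heat removal with PGW, given the capacity with DI water.

    Assumes a flat fractional loss (~30% per the text above).
    """
    return di_capacity_kw * (1.0 - loss_fraction)


def footprints_needed(total_kw: float, kw_per_footprint: float) -> int:
    """Rack footprints required for a target load, rounded up."""
    return math.ceil(total_kw / kw_per_footprint)


if __name__ == "__main__":
    per_rack = kw_per_rack_footprint(400.0, 2.0)   # figures from the text
    per_rack_pgw = pgw_capacity_kw(per_rack)       # same loop, PGW instead of DI
    target_kw = 1000.0                             # hypothetical target load
    print(f"DI water: {per_rack:.0f} kW per rack footprint, "
          f"{footprints_needed(target_kw, per_rack)} footprints for {target_kw:.0f} kW")
    print(f"PGW:      {per_rack_pgw:.0f} kW per rack footprint, "
          f"{footprints_needed(target_kw, per_rack_pgw)} footprints for {target_kw:.0f} kW")
```

Whether a PGW loop actually limits per-rack power this way depends on flow rates and supply temperature, so treat the output as a first-order estimate only.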
For ORNL, we are space constrained. For that alone, we prefer DLC over
immersion.