[Beowulf] immersion
Scott Atchley
e.scott.atchley at gmail.com
Sun Mar 24 19:59:04 UTC 2024
On Sun, Mar 24, 2024 at 2:38 PM Michael DiDomenico <mdidomenico4 at gmail.com>
wrote:
> Thanks, there's some good info in there. Just to be clear to others who
> might chime in: I'm less interested in the immersion/DLC debate than in
> getting updates from people who have sat on either side of the fence.
> DLC has been around a while and so has immersion, but what I can't get from
> sales glossies is real-world maintenance over time.
>
> Being in the DoD space, I'm well aware of the HPE stuff, but it's also
> what's making me look at other options. I'm not real keen on 100+ kW racks;
> there are many safety concerns with that much amperage in a single
> cabinet. Not to mention all that custom hardware comes at a stiff cost and,
> in my opinion, doesn't have a good ROI if you're not buying hundreds of
> racks' worth of it. But your space-constrained issue is definitely one I'm
> familiar with. Our new space is smaller than I think we should build, but
> we're also geographically constrained.
>
> The other info I'm seeking is futures. DLC seems like a right-now solution
> to ride the AI wave. I'm curious if others think DLC might hit a power
> limit sooner or later, like air cooling already has, given chips keep
> climbing in watts. And maybe it's not even a power limit per se, but DLC
> is pretty complicated with all the piping/manifolds/connectors/CDUs; does
> there come a point where it's just not worth it unless it's a big custom
> solution like the HPE stuff?
>
The ORv3 rack design's maximum power is the number of power shelves times
the power per shelf. Reach out to me directly at <my first name> @ ornl.gov
and I can connect you with some vendors.
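As a quick illustration of that arithmetic (a rough sketch; the shelf count
and the 18 kW / 33 kW shelf ratings are assumptions for the example, not
vendor specs):

    # Rough ORv3 sizing sketch: rack max power = power shelves x power per shelf.
    # Shelf ratings below are illustrative assumptions; check the vendor spec.
    def max_rack_power_kw(num_shelves: int, kw_per_shelf: float) -> float:
        return num_shelves * kw_per_shelf

    print(max_rack_power_kw(6, 18.0))  # 108 kW with six 18 kW shelves
    print(max_rack_power_kw(6, 33.0))  # 198 kW with six 33 kW shelves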
>
>
> On Sun, Mar 24, 2024 at 1:46 PM Scott Atchley <e.scott.atchley at gmail.com>
> wrote:
>
>> On Sat, Mar 23, 2024 at 10:40 AM Michael DiDomenico <
>> mdidomenico4 at gmail.com> wrote:
>>
>>> I'm curious to know:
>>>
>>> 1. How many servers per vat or U?
>>> 2. I saw a slide mention 1500 W/sq ft; can you break that number into kW
>>> per vat?
>>> 3. Can you shed any light on the heat exchanger system? It looks like
>>> there are just two pipes coming into the vat; is that chilled water or oil?
>>> Is there a CDU somewhere off camera?
>>> 4. That power bar in the middle, is that DUG custom?
>>> 5. Any stats on reliability? Like, have you seen a decrease in HW
>>> failures?
>>>
>>> Are you selling the vats/tech as a product? Can I order one? :)
>>>
>>> Since CPUs are pushing 400 W/chip, NVIDIA is teasing 1000 W/chip in
>>> the near future, and I'm working on building a new site, I'm keenly
>>> interested in thoughts on DLC or immersion tech from anyone else too.
>>>
>>
>> As with all things in life, everything has trade-offs.
>>
>> We have looked at immersion at ORNL and these are my thoughts:
>>
>> *Immersion*
>>
>> - *Pros*
>> - Low Power Usage Effectiveness (PUE) - as low as 1.03. This means
>> you spend only $0.03 on cooling and other overhead for every $1.00 of
>> power the system itself consumes. In contrast, air-cooled data centers can
>> range from 1.30 to 1.60 or higher (see the quick arithmetic sketch after
>> this list).
>> - No special racks - can install white box servers and remove the
>> fans.
>> - No cooling loops - no fittings that can leak, get kinked, or
>> accidentally clamped off.
>> - No bio-growth issues
>> - *Cons*
>> - Low power density per floor area - a tank is essentially a rack laid
>> on its side, so it takes more floor space. DLC delivers the same power
>> density with the rack standing vertical.
>> - Messy - it depends on the fluid, but oil is common and cheap. Many
>> centers install a crane to hoist servers out and then let them drip dry
>> for a day before servicing.
>> - High Mean-Time-To-Repair (MTTR) - unless you have two cranes,
>> you cannot insert a new node until the old one has dripped dry and been
>> removed from the crane.
>> - Some solutions can be expensive and/or lead to part failures due
>> to residue build-up on processor pins.
>>
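>> To put rough numbers on the PUE comparison (a minimal sketch using the
>> definition PUE = total facility power / IT power; the 1 MW IT load is
>> just an illustrative assumption):
>>
>>     # PUE = total facility power / IT equipment power
>>     it_load_mw = 1.0  # assumed IT load for illustration
>>     for pue in (1.03, 1.30, 1.60):
>>         overhead_kw = it_load_mw * (pue - 1.0) * 1000  # cooling + other overhead
>>         print(f"PUE {pue:.2f}: {overhead_kw:.0f} kW overhead per MW of IT load")
>>     # PUE 1.03:  30 kW  (~$0.03 per $1.00 of IT energy)
>>     # PUE 1.30: 300 kW
>>     # PUE 1.60: 600 kW
>>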
>> *Direct Liquid Cooling (DLC)*
>>
>> - *Pros*
>> - Low PUE compared to air-cooled. It depends on how much of the heat
>> the water captures. Summit uses hybrid DLC (water for CPUs and GPUs, air
>> for DIMMs, NICs, SSDs, and power supplies) with ~22°C water. Summit's PUE
>> ranges from 1.03 to 1.10 depending on the time of year. Frontier, on the
>> other hand, is 100% DLC (no fans in the compute racks) with 32°C water.
>> Frontier's PUE ranges from 1.03 to 1.06 depending on the time of year.
>> Both PUEs include the pumps for the cooling towers and for moving the
>> water between the Central Energy Plant and the data center.
>> - High power density - the HPE Cray EX4000 "cabinet" can supply
>> up to 400 kW and is equivalent in space to two racks (i.e., 200 kW per
>> standard rack). If your data center is space constrained, this is a crucial
>> factor.
>> - No mess - DLC systems using deionized water (DI water) or propylene
>> glycol water (PGW) use dripless connectors.
>> - Low MTTR - remove a server and insert another if you have a
>> spare.
>> - *Cons*
>> - Special racks - HPE cabinets are non-standard and require
>> HPE-designed servers. This is changing; I saw many examples of ORv3 racks
>> at GTC that use the OCP standard with DLC manifolds.
>> - Cooling loops - loops can leak at fittings, or be kinked or
>> crimped, which restricts flow and causes overheating. Hybrid loops are
>> simpler, while 100% DLC loops are more complex (i.e., expensive). Servers
>> tend to include drip sensors to detect leaks, but we have found that the
>> DIMMs are better drip sensors (i.e., the drips hit them before finding the
>> drip sensor). 😆
>> - Bio-growth
>> - DI water includes biocides, and you have to manage them. We have
>> learned that no system can be bio-growth free (e.g., inserting a blade will
>> recontaminate the system). That said, Summit has never had any
>> bio-growth-induced overheating, and Frontier has gone close to nine months
>> without overheating issues due to growth.
>> - PGW systems should be immune to any bio-growth, but you lose
>> ~30% of the heat-removal capacity compared to DI water (rough sizing
>> sketch after this list). Depending on your environment, you might be able
>> to avoid trim water (i.e., mixing in chilled water to reduce the
>> temperature).
>> - Can be expensive to upgrade the facility (e.g., installing
>> evaporative coolers, piping, pumps, etc.).
>>
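>> A rough sizing sketch for the PGW trade-off above (the flow rate and the
>> supply/return temperature rise are illustrative assumptions; the ~30%
>> derating is the rule of thumb quoted above, not a measured value):
>>
>>     # Heat removal Q = m_dot * cp * dT for a single cooling loop.
>>     flow_kg_s = 10.0    # assumed coolant mass flow (kg/s)
>>     delta_t_c = 10.0    # assumed supply/return temperature rise (deg C)
>>     cp_water = 4.18     # specific heat of DI water, kJ/(kg*K)
>>     q_water_kw = flow_kg_s * cp_water * delta_t_c   # ~418 kW with DI water
>>     q_pgw_kw = q_water_kw * (1.0 - 0.30)             # ~293 kW after ~30% PGW derating
>>     print(f"DI water: {q_water_kw:.0f} kW, PGW: {q_pgw_kw:.0f} kW")
>>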
>> For ORNL, we are space constrained. For that alone, we prefer DLC over
>> immersion.
>>
>>
>>
>> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
>