[Beowulf] [External] Re: traverse @ princeton

Prentice Bisbal pbisbal at pppl.gov
Thu Oct 10 10:13:40 PDT 2019


I forgot to add that power capacity per rack might have something to do 
with it, too. I don't remember the number of PDUs in those racks, or the 
power input to each one (single-phase 60 A, etc.).
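
A back-of-the-envelope way to turn PDU ratings into a usable rack power
budget, purely as a sketch (the 208 V feed, the two-PDU count, and the 80%
continuous-load derating below are assumptions, not the actual Traverse
configuration):

    # Rough usable rack power from PDU ratings (all values assumed).
    def usable_rack_kw(pdus: int, volts: float, amps: float, derate: float = 0.8) -> float:
        """Continuous power budget per rack in kW, with an 80% derating."""
        return pdus * volts * amps * derate / 1000.0

    # e.g. two single-phase 60 A PDUs on an assumed 208 V feed
    print(usable_rack_kw(pdus=2, volts=208, amps=60))  # ~20 kW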

On 10/10/19 1:09 PM, Prentice Bisbal wrote:
>
> It's four racks of 10 and one rack of 6, for a total of five racks, not 
> counting the storage system.
>
> I believe this is because of the power/cooling limitations of the 
> air-cooled systems. We have water-cooled rear-door heat exchangers, 
> but they're only good up to about 35 kW/rack. Since we have 4 GPUs per 
> server, these things consume more power and put out more heat than 
> your average 1U pizza-box or blade server. Bill can answer more 
> authoritatively, since he was involved in those discussions.
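
A quick sanity check on that rack sizing, as a sketch only (the per-node
draw below is an assumed ballpark for a two-socket, 4x V100 AC922, not a
measured figure):

    # How many 4-GPU nodes fit under the rear-door heat exchanger limit.
    NODE_KW = 2.8          # assumed draw per AC922 node under load
    RDHX_LIMIT_KW = 35.0   # rear-door heat exchanger capacity cited above

    print(int(RDHX_LIMIT_KW // NODE_KW))  # 12 -> consistent with ~10 nodes/rack plus headroom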
>
> --
> Prentice
>
> On 10/10/19 12:57 PM, Scott Atchley wrote:
>> That is better than 80% of peak, nice.
>>
>> Is it three racks of 15 nodes? Or two racks of 18 and 9 in the third 
>> rack?
>>
>> You went with a single-port HCA per socket rather than the dual-port 
>> HCA in the shared PCIe slot?
>>
>> On Thu, Oct 10, 2019 at 8:48 AM Bill Wichser <bill at princeton.edu> wrote:
>>
>>     Thanks for the kind words.  Yes, we installed something more like a
>>     mini-Sierra machine, which is air cooled.  There are 46 IBM AC922
>>     nodes, each with two sockets and 4 V100s, and each socket runs SMT4.
>>     So two 16-core chips, 32 cores/node, 128 threads per node.  The GPUs
>>     all use NVLink.
>>
>>     There are two EDR connections per host, each tied to a CPU socket:
>>     1:1 within each rack of 12 and 2:1 between racks.  We have a 2 PB
>>     scratch filesystem running GPFS.  Each node also has a 3 TB NVMe
>>     card for local scratch.
>>
>>     And we're running Slurm as our scheduler.
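
A rough idea of how that node layout maps onto a Slurm configuration; the
node names, memory figure, and GRES type here are illustrative assumptions,
not the actual slurm.conf:

    # Hypothetical slurm.conf fragment for a 2-socket, 16-core, SMT4, 4-GPU node
    NodeName=traverse[001-046] Sockets=2 CoresPerSocket=16 ThreadsPerCore=4 RealMemory=250000 Gres=gpu:v100:4 State=UNKNOWN

    # Matching gres.conf entry (device paths assumed)
    Name=gpu Type=v100 File=/dev/nvidia[0-3]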
>>
>>     We'll see if it makes the Top500 in November.  It fits there today,
>>     but who knows what else has gotten on there since June.  With the
>>     help of NVIDIA we managed to get 1.09 PF across 45 nodes.
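
For the efficiency figure, a rough reconstruction (the per-GPU FP64 peak
and the CPU contribution are assumed values; only the 1.09 PF and 45 nodes
come from the run described above):

    # Rough HPL efficiency estimate; peak rates are assumptions.
    nodes = 45
    gpus_per_node = 4
    v100_fp64_tf = 7.8         # assumed SXM V100 FP64 peak, TFLOP/s
    cpu_tf_per_node = 1.0      # assumed POWER9 contribution, TFLOP/s

    rpeak_tf = nodes * (gpus_per_node * v100_fp64_tf + cpu_tf_per_node)
    rmax_tf = 1090.0           # 1.09 PF from the run above
    print(rmax_tf / rpeak_tf)  # ~0.75 with these peaks; nearer 0.8 with a lower assumed Rpeak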
>>
>>     Bill
>>
>>     On 10/10/19 7:45 AM, Michael Di Domenico wrote:
>>     > for those that may not have seen
>>     >
>>     >
>>     https://insidehpc.com/2019/10/traverse-supercomputer-to-accelerate-fusion-research-at-princeton/
>>     >
>>     > Bill Wichser and Prentice Bisbal are frequent contributors to the
>>     > list. Congrats on the acquisition.  It's nice to see more HPC
>>     > expansion in our otherwise barren hometown... :)
>>     >
>>     > Maybe one of them will pass along some detail on the machine...