[Beowulf] China aims for 100 PF
Bill Broadley
bill at cse.ucdavis.edu
Tue Jun 21 18:54:18 PDT 2016
On 06/21/2016 05:14 AM, Remy Dernat wrote:
> Hi,
>
> 100 PF is really not far from reality right now:
> http://www.top500.org/news/new-chinese-supercomputer-named-worlds-fastest-system-on-latest-top500-list/
I was curious about the CPU/architecture and I found:
http://www.netlib.org/utk/people/JackDongarra/PAPERS/sunway-report-2016.pdf
I do wonder if they will sell this CPU on the open market and how hard it is to
port normal Linux+MPI codes to it.
My quick summary of possible interest. This seems like a pretty novel design.
Kind of odd that they claim a node = socket. But a socket has 4 core groups,
each with access to 8GB of memory. So while Sunway describes a single 32GB RAM
node, the normal terminology would call it four 8GB nodes in a single socket.
Cluster:
* 40 racks
* 93 PFlop/sec
* 74.16% efficiency (much better than ORNL Titan and NUDT Tianhe-2)
* 1024 nodes per rack
* 40960 nodes total
* 6 Gflops/watt (around 3x anything in the top 6)
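A quick sanity check on the cluster numbers above, as a rough sketch. It only uses the figures quoted in this list; the variable names are mine, and it assumes the 6 Gflops/watt figure applies to the Linpack result rather than to peak.

```python
# Figures quoted in the cluster list above.
linpack_pflops = 93.0      # measured Linpack result, PFlop/sec
efficiency = 0.7416        # Linpack result / theoretical peak
nodes = 40960
gflops_per_watt = 6.0      # assumed to be measured against the Linpack result

# Theoretical peak implied by the efficiency figure.
peak_pflops = linpack_pflops / efficiency

# Implied peak per node (1 PFlop = 1000 Tflop).
peak_tflops_per_node = peak_pflops * 1000 / nodes

# Implied power draw: 1 PFlop = 1e6 Gflop, then watts -> megawatts.
power_mw = linpack_pflops * 1e6 / gflops_per_watt / 1e6

print(f"implied peak:          {peak_pflops:.1f} PFlop/sec")
print(f"implied peak per node: {peak_tflops_per_node:.2f} Tflop/sec")
print(f"implied power:         {power_mw:.1f} MW")
```

That works out to roughly 125 PFlop/sec peak and about 15.5 MW, and the per-node figure agrees with the 3.06 Tflop/sec listed for each node.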
Physical layout:
* 4 core groups per socket
* 1 socket per node
* two nodes per card
* four cards per board
* 32 boards per supernode
* 4 supernodes per rack
* 40 racks in cluster
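The layout above multiplies out to the node counts quoted elsewhere in this summary. A small sketch of that arithmetic (variable names are mine; the 260 cores per node comes from the per-node list):

```python
# Physical layout figures from the list above.
nodes_per_card = 2
cards_per_board = 4
boards_per_supernode = 32
supernodes_per_rack = 4
racks = 40

nodes_per_supernode = nodes_per_card * cards_per_board * boards_per_supernode
nodes_per_rack = nodes_per_supernode * supernodes_per_rack
total_nodes = nodes_per_rack * racks

# Each node: 4 core groups, each with 1 MPE plus an 8x8 grid of CPEs.
cores_per_node = 4 * (1 + 8 * 8)
total_cores = total_nodes * cores_per_node

print(nodes_per_supernode, nodes_per_rack, total_nodes, total_cores)
```

So 256 nodes per supernode, 1024 per rack, 40960 in the cluster, and a bit over 10.6 million cores in total.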
Network:
* 70TB/sec bisection bandwidth
* nodes connected using PCIe 3.0 connections
* supernode contains 256 nodes
* network diameter of 7
* node MPI bandwidth of 12GB/sec and a latency of about 1us.
Each rack:
* 4 supernodes
* 256 nodes per supernode
* total 1024 nodes
Each node has:
* 3.06 Tflop/sec
* 1 socket, 260 cores (4 MPE and 4x64 CPE)
* 4 core groups, each with:
  + 8x8 grid of cores (CPEs)
  + own memory space managed by MPE (management processing element)
  + 1 management CPU (MPE)
  + access to 8GB of DDR3 memory
* 4 128 bit memory controllers (DDR3-2133), each connected to 8GB of DDR3,
total theoretical peak = 136.51GB/sec per chip.
* Network on chip (NoC) - bidirectional bandwidth of 16GB/sec to network, around
1us latency.
* 6 Gflops/watt for processor, memory, and interconnect.
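The memory bandwidth figure follows directly from the controller specs, and comparing it with the peak flops shows how little memory bandwidth there is per flop. A rough sketch, with my own variable names and decimal GB (1e9 bytes):

```python
# Memory controller figures from the node list above.
controllers = 4
bus_width_bits = 128
transfers_per_sec = 2133e6     # DDR3-2133: 2133 million transfers/sec
node_peak_gflops = 3060.0      # 3.06 Tflop/sec per node

# Bytes per transfer per controller, times transfer rate, times controllers.
bytes_per_transfer = bus_width_bits / 8
bandwidth_gb = controllers * bytes_per_transfer * transfers_per_sec / 1e9

# How many flops the chip can do per byte fetched from DDR3 at peak.
flops_per_byte = node_peak_gflops / bandwidth_gb

print(f"peak memory bandwidth: {bandwidth_gb:.2f} GB/sec")
print(f"peak flops per byte:   {flops_per_byte:.1f}")
```

That is around 22 flops per byte of memory traffic, so codes that do not get heavy reuse out of the scratch pad memory will be bandwidth bound long before they get near peak.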
Each management core (4 per chip, one per 64 CPEs) has:
* 64 bit risc OoO core
* 256 bit vector instruction
* 32KB L1i / 32KB L1d
* 256KB L2
* 16 flops/cycle
Each CPE (256 per chip) has:
* 64 bit risc OoO core
* supports only user mode
* 256 bit vector instruction
* 16KB L1i
* 64KB scratch pad memory SPM
* 8 double flops/cycle per core (6 at linpack)
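The per-core flops/cycle numbers can be checked against the 3.06 Tflop/sec per node. Note the clock speed is NOT in my summary above; the sketch below assumes 1.45 GHz, so treat that value as an assumption. Variable names are mine.

```python
# ASSUMPTION: clock frequency is not listed in the summary above.
clock_ghz = 1.45

cpes_per_chip = 256
cpe_flops_per_cycle = 8        # double precision, peak
cpe_linpack_flops_per_cycle = 6
mpes_per_chip = 4
mpe_flops_per_cycle = 16

# Peak contribution of each core type, in Gflop/sec.
cpe_gflops = cpes_per_chip * cpe_flops_per_cycle * clock_ghz
mpe_gflops = mpes_per_chip * mpe_flops_per_cycle * clock_ghz
node_peak_tflops = (cpe_gflops + mpe_gflops) / 1000

# Fraction of CPE peak reached if each CPE sustains 6 flops/cycle.
cpe_linpack_fraction = cpe_linpack_flops_per_cycle / cpe_flops_per_cycle

print(f"CPEs: {cpe_gflops:.1f} Gflop/sec, MPEs: {mpe_gflops:.1f} Gflop/sec")
print(f"node peak: {node_peak_tflops:.2f} Tflop/sec")
print(f"CPE fraction of peak at linpack: {cpe_linpack_fraction:.0%}")
```

Under that assumed clock the node comes out at about 3.06 Tflop/sec, and 6 of 8 flops/cycle is 75%, which is in line with the 74.16% Linpack efficiency quoted for the whole cluster.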