[Beowulf] China aims for 100 PF

Tue Jun 21 18:54:18 PDT 2016

On 06/21/2016 05:14 AM, Remy Dernat wrote:
> Hi,
>
> 100 PF is really not far from reality right now:
> http://www.top500.org/news/new-chinese-supercomputer-named-worlds-fastest-system-on-latest-top500-list/

I was curious about the CPU/architecture and I found:
   http://www.netlib.org/utk/people/JackDongarra/PAPERS/sunway-report-2016.pdf

I do wonder if they will sell this CPU on the open market and how hard it
would be to port normal Linux+MPI codes to it.

My quick summary, of possible interest.  This seems like a pretty novel design.
It is kind of odd that they claim a node = socket, because a socket has 4 core
groups, each with access to 8GB of memory.  So while Sunway describes a single
32GB RAM node, the normal terminology would call it four 8GB nodes in a single
socket.

Cluster:
* 40 racks
* 93 PFlop/sec
* 74.16% efficiency (much better than ORNL Titan and NUDT Tianhe-2)
* 1024 nodes per rack
* 40960 nodes total
* 6 Gflops/watt (around 3x anything in the top 6)
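
The efficiency figure can be roughly cross-checked from the numbers in this
summary.  A minimal sketch, assuming the per-node peak of 3.06 Tflop/sec listed
in the node section below and the rounded 93 PFlop/sec Linpack result; the
variable names are mine, and because the inputs are rounded it lands near, not
exactly on, 74.16%:

```python
# Rough cross-check of Linpack efficiency from the rounded figures in this post.
total_nodes = 40960          # nodes in the cluster
node_peak_tflops = 3.06      # per-node peak, from the node summary
linpack_pflops = 93.0        # rounded Linpack result

peak_pflops = total_nodes * node_peak_tflops / 1000.0
efficiency = linpack_pflops / peak_pflops

print(f"peak ~{peak_pflops:.1f} PF, efficiency ~{efficiency:.1%}")
# prints: peak ~125.3 PF, efficiency ~74.2%
```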

Physical layout:
* 4 core groups per socket
* 1 socket per node
* two nodes per card
* four cards per board
* 32 boards per supernode
* 4 supernodes per rack
* 40 racks in cluster
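
The layout above multiplies out to the node counts quoted in the cluster
summary.  A quick check (variable names are illustrative, the counts are the
ones in the list):

```python
# Node counts implied by the physical layout listed above.
nodes_per_card = 2
cards_per_board = 4
boards_per_supernode = 32
supernodes_per_rack = 4
racks = 40

nodes_per_supernode = nodes_per_card * cards_per_board * boards_per_supernode
nodes_per_rack = nodes_per_supernode * supernodes_per_rack
total_nodes = nodes_per_rack * racks

print(nodes_per_supernode, nodes_per_rack, total_nodes)  # 256 1024 40960
```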

Network:
* 70TB/sec bisection bandwidth
* nodes connected using PCIe 3.0 connections
* supernode contains 256 nodes
* network diameter of 7
* node MPI bandwidth of 12GB/sec and a latency of about 1us.

Each rack:
   * 4 supernodes
   * 256 nodes per supernode
   * total 1024 nodes

Each node has:
* 3.06 Tflop/sec
* 1 socket, 260 cores (4 MPE and 4x64 CPE)
* 4 Core groups, each with:
   + 8x8 grid of cores (CPEs)
   + own memory space managed by MPE (management processing element)
   + 1 management CPU (MPE)
   + access to 8GB of DDR3 memory
* 4 128 bit memory controllers (DDR3-2133), each connected to 8GB of DDR3,
   total theoretical peak = 136.51GB/sec per chip.
* Network on chip (NoC) - bidirectional bandwidth of 16GB/sec to network,
   around 1us latency.
* 6 Gflops/watt for processor, memory, and interconnect.
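
The memory bandwidth figure follows from the controller count, bus width, and
transfer rate.  A small sketch, assuming DDR3-2133 is taken as exactly 2133
MT/sec (which is what reproduces 136.51); the bytes-per-flop ratio at the end is
my own derived number, not one from the report:

```python
# Peak DDR3 bandwidth per chip from the figures listed above.
controllers = 4
bus_width_bits = 128
transfers_per_sec = 2133e6   # assumption: DDR3-2133 taken as 2133 MT/sec

bytes_per_transfer = bus_width_bits // 8
peak_bw_gb = controllers * bytes_per_transfer * transfers_per_sec / 1e9

# Ratio of memory bandwidth to peak compute (3.06 Tflop/sec per node).
node_peak_gflops = 3.06e3
bytes_per_flop = peak_bw_gb / node_peak_gflops

print(f"{peak_bw_gb:.2f} GB/sec")           # 136.51 GB/sec
print(f"{bytes_per_flop:.3f} bytes/flop")   # 0.045 bytes/flop
```

That ratio is very low, which suggests bandwidth-bound codes would struggle to
get near peak on this chip.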

Each management core (4 per chip, one per 64 CPEs) has:
   * 64 bit RISC OoO core
   * 256 bit vector instructions
   * 32KB L1i / 32KB L1d
   * 256KB L2
   * 16 flops/cycle

Each CPE (256 per chip) has:
   * 64 bit RISC OoO core
   * supports only user mode
   * 256 bit vector instructions
   * 16KB L1i
   * 64KB scratch pad memory (SPM)
   * 8 double flops/cycle per core (6 at linpack)
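
The per-core numbers can be tied back to the 3.06 Tflop/sec per node figure.
The clock speed is not listed in this summary, so the sketch below back-derives
it from the other numbers rather than assuming one; the variable names are
illustrative:

```python
# Flops per cycle for one chip, from the per-core figures above.
cpes_per_chip = 256
mpes_per_chip = 4
cpe_flops_per_cycle = 8
mpe_flops_per_cycle = 16
node_peak_flops = 3.06e12    # 3.06 Tflop/sec per node

flops_per_cycle = (cpes_per_chip * cpe_flops_per_cycle
                   + mpes_per_chip * mpe_flops_per_cycle)

# Clock implied by the peak figure (derived here, not quoted from the report).
implied_clock_ghz = node_peak_flops / flops_per_cycle / 1e9

# Fraction of CPE peak reached at Linpack (6 of 8 flops/cycle).
linpack_fraction = 6 / cpe_flops_per_cycle

print(flops_per_cycle)                    # 2112
print(f"{implied_clock_ghz:.2f} GHz")     # 1.45 GHz
print(linpack_fraction)                   # 0.75
```

The 6-of-8 flops/cycle at Linpack works out to 75%, which lines up with the
74.16% efficiency reported for the whole cluster.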