<div dir="ltr"><div dir="ltr"><br><div>All,</div><div><br></div><div>I think the comparison with RoadRunner is off. Any application that already has</div><div>a CUDA version can largely be converted to run on AMD GPUs with a Perl script</div><div>and some minor adjustments. Those without GPU implementations will have to </div><div>be converted (many are already having this done under ECP at the labs), but that</div><div>seems to be the price of getting power-efficient exaflop performance. And there</div><div>are other options, like OpenMP 4.5 and 5.0, to help if you do not want to write in</div><div>HIP.</div><div><br></div><div>In my opinion, the main issue will be getting a suite of fully performant ROCm libraries, </div><div>analogous to those NVIDIA already provides, ready to deliver performance on AMD's</div><div>devices. I am sure some already exist and that AMD will be devoting significant resources </div><div>to this task in the almost 2 years until delivery.  </div><div><br></div><div>rbw</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, May 8, 2019 at 10:51 AM Prentice Bisbal via Beowulf <<a href="mailto:beowulf@beowulf.org">beowulf@beowulf.org</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><br>

On 5/7/19 6:14 PM, Lux, Jim (337K) wrote:<br>

><br>

> On 5/7/19, 2:00 PM, "Beowulf on behalf of Prentice Bisbal via Beowulf" <<a href="mailto:beowulf-bounces@beowulf.org" target="_blank">beowulf-bounces@beowulf.org</a> on behalf of <a href="mailto:beowulf@beowulf.org" target="_blank">beowulf@beowulf.org</a>> wrote:<br>

><br>

>      >   I think it is interesting that they are using AMD for<br>

>      > both the CPUs and GPUs<br>

>      <br>

>      I agree. That means a LOT of codes will have to be ported from CUDA to<br>

>      whatever AMD uses. I know AMD announced their HIP interface to convert<br>

>      CUDA code into something that will run on AMD processors, but I don't<br>

>      know how well that works in theory. Frankly, I haven't heard anything<br>

>      about it since it was announced at SC a few years ago.<br>

>      <br>

>      I would not be surprised if AMD pursued this bid quite aggressively,<br>

>      possibly at a significant loss, for the opportunity to prove their GPUs<br>

>      can compete with NVIDIA and demonstrate that codes can be successfully<br>

>      converted from CUDA to something AMD GPUs can use to demonstrate GPU<br>

>      users don't need to be locked in to a single vendor. If so, this could<br>

>      be a costly gamble for the DOE and AMD, but if it pays off, I imagine it<br>

>      could change AMD's fortunes in HPC.<br>

>      <br>

>        "Win on Sunday, sell on Monday" doesn't apply just to cars.<br>

>      <br>

>      Prentice<br>

>      <br>

><br>

> --<br>

> I think they're deliberately looking for architectural diversity, rather than "ease of porting from existing machine"<br>

><br>

> "CORAL-2 has a mandate to field architecturally diverse machines in a way that manages risk during a period of rapid technological evolution. “Regardless of which system or systems are being discussed, the systems residing at or planned to reside at ORNL and ANL must be diverse from one another,” notes the CORAL-2 RFP cover letter [PDF]."<br>

><br>

> <a href="https://asc.llnl.gov/coral-2-benchmarks/" rel="noreferrer" target="_blank">https://asc.llnl.gov/coral-2-benchmarks/</a><br>

<br>

<br>

I understand the requirement for architectural diversity. The 3 DOE <br>

Leadership Computing Facilities (LCFs) have always practiced hardware <br>

diversity. ANL typically used IBM Hardware in the form of Blue Genes <br>

(Intrepid, Mira), and ORNL typically used Cray. Those two sites used <br>

bleeding-edge architectures, and NERSC, the 3rd DOE LCF, would usually <br>

go with less bleeding-edge systems.<br>

<br>

However, this particular choice brings the risk of users not being able <br>

to, or not wanting to, port their code to a unique architecture. Not only <br>

is it different than past DOE Leadership systems, it is using an <br>

architecture that currently has about 0% market share, so the work of <br>

porting code to this architecture to run on a single system may not be <br>

enough incentive for some users, despite the performance advantage, <br>

since the cost of that effort can't be spread over a larger number of <br>

other systems they can use today (based on current market trends, at least).<br>

<br>

LANL's RoadRunner is a good analog to consider. It was the first <br>

petascale system, but it had a rather unique architecture. The DOE <br>

decommissioned the system when it was about 5 years old, even though it <br>

was still ranked quite highly on the Top500. Its replacement was Cielo, <br>

which wasn't much newer or faster than RoadRunner. From conversations <br>

I've had with people familiar with RoadRunner, I heard it was difficult <br>

to program, and too expensive to continue supporting. I don't know how <br>

accurate those statements are, because I don't remember the DOE saying <br>

much about why they EOLed RoadRunner, but those explanations seemed <br>

reasonable.<br>

<br>

And yes, I know DOE LCF systems are a bit unique in the market they <br>

serve - their users are bleeding-edge users who probably are willing to <br>

port their codes to new or unique architectures for the benefit of more <br>

compute capabilities, but I think it's safe to say Roadrunner's user <br>

base had the same or very similar characteristics.<br>

<br>

Prentice<br>

<br>

<br>


</blockquote></div></div>