<div dir="ltr"><div><div><div>I think this analysis is missing some important points.<br><br></div></div>1) Comparing a single low-power APU to a single high-power discrete GPU doesn't make sense for HPC. Rather, we should compare racks of equipment that operate within the same power envelope.<br>
<br>2) You can bolt GDDR5 onto an APU, eliminating the local bandwidth advantage (AMD is doing exactly this for the PS4). Also, we should really be comparing the bandwidth available to each GPU "core".<br><br></div>
3) Almost every GPGPU research paper devotes significant space (perhaps the whole paper) to making some step of the algorithm that is trivial on a CPU run efficiently on a GPU. Avoiding round trips between host and device is a driving force in most algorithm development. So programming an APU should be easier, even if you still need (for now) the explicit API calls for memory "transfers".<br>
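<br>To make point 1 concrete, here is a rough Python sketch that compares devices at a fixed power budget instead of one-to-one. The figures are illustrative assumptions, not measurements: 25 W and 250 W come from Vincent's question quoted below, the 9x compute ratio is the midpoint of the ~8-10X quoted below, and the 2x transfer ratio is the ~2X from the same message. The `Device` class, the `aggregate` helper, and the 1000 W budget are hypothetical, and the model ignores host CPU power for the discrete cards, the interconnect, and scaling losses.<br>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    """Hypothetical device description; all ratios are relative, not measured."""
    name: str
    watts: float      # assumed power draw per device
    compute: float    # relative compute throughput per device
    transfer: float   # relative host<->device transfer rate per device


def aggregate(dev: Device, budget_watts: float) -> dict:
    """Total relative compute and transfer for as many devices as fit the budget."""
    count = int(budget_watts // dev.watts)
    return {
        "count": count,
        "compute": count * dev.compute,
        "transfer": count * dev.transfer,
    }


# Assumed inputs: 25 W vs 250 W, ~9x compute for the discrete GPU,
# ~2x transfer advantage for the APU.
apu = Device("APU", watts=25.0, compute=1.0, transfer=2.0)
dgpu = Device("discrete GPU", watts=250.0, compute=9.0, transfer=1.0)

for dev in (apu, dgpu):
    print(dev.name, aggregate(dev, budget_watts=1000.0))
```

Under these assumed numbers the aggregate compute of the two racks lands in the same range, while the APUs keep their transfer advantage. Different inputs will of course shift the result.<br>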
<div><div><div><div><div><div class="gmail_extra"><br><br><div class="gmail_quote">On Sun, Mar 10, 2013 at 5:55 PM, Joshua mora acosta <span dir="ltr"><<a href="mailto:joshua_mora@usa.net" target="_blank">joshua_mora@usa.net</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">See this paper<br>
<a href="http://synergy.cs.vt.edu/pubs/papers/daga-saahpc11-apu-efficacy.pdf" target="_blank">http://synergy.cs.vt.edu/pubs/papers/daga-saahpc11-apu-efficacy.pdf</a><br>
<br>
While discrete GPUs underperform APUs on host to/from device transfers by a<br>
ratio of ~2X, they more than compensate with ~8-10X the computing power and<br>
local bandwidth.<br>
<br>
You can cook up a test, though, where you do little computation and it is all<br>
bound by the host to/from device transfers.<br>
<br>
Programming-wise there is no difference: there isn't coherence yet, so<br>
explicit transfers through API calls are needed.<br>
<span class=""><font color="#888888"><br>
Joshua<br>
</font></span><div class=""><div class="h5"><br>
------ Original Message ------<br>
Received: 04:06 PM CDT, 03/10/2013<br>
From: Vincent Diepeveen <<a href="mailto:diep@xs4all.nl">diep@xs4all.nl</a>><br>
To: Mark Hahn <<a href="mailto:hahn@mcmaster.ca">hahn@mcmaster.ca</a>><br>
Cc: Beowulf List <<a href="mailto:beowulf@beowulf.org">beowulf@beowulf.org</a>><br>
Subject: Re: [Beowulf] difference between accelerators and co-processors<br>
<br>
><br>
> On Mar 10, 2013, at 9:03 PM, Mark Hahn wrote:<br>
><br>
> >> Is there any line/point to make distinction between accelerators and<br>
> >> co-processors (that are used in conjunction with the primary CPU<br>
> >> to boost<br>
> >> up the performance)? or these terms can be used interchangeably?<br>
> ><br>
> > IMO, a coprocessor executes the same instruction stream as the<br>
> > "primary" processor. this was the case with the x87, for instance,<br>
> > though the distinction became less significant once the x87 came<br>
> > onchip.<br>
> > (though you certainly notice that FPU on any of these chips is mostly<br>
> > separate - not sharing functional units or register files,<br>
> > sometimes even<br>
> > with separate micro-op schedulers.)<br>
> ><br>
> >> Specifically, the word "accelerator" is used commonly with GPU. On<br>
> >> the<br>
> >> other hand the word "co-processors" is used commonly with Xeon Phi.<br>
> ><br>
> > I don't think it is a useful distinction: both are basically<br>
> > independent<br>
> > computers. obviously, the programming model of Phi is dramatically<br>
> > more<br>
> > like a conventional processor than Nvidia.<br>
> ><br>
><br>
> Mark, that's the marketing talk about Xeon Phi.<br>
><br>
> It's surprisingly the same of course except for the cache coherency;<br>
> big vector processors.<br>
><br>
> > there is a meaningful distinction between offload and coprocessor<br>
> > approaches.<br>
> > that is, offload means you use the device to accelerate a set of<br>
> > libraries<br>
> > (offload matrix multiply, eig, fft, etc). to use a coprocessor, I<br>
> > think the<br>
> > expectation is that the main code will be very much aware of the<br>
> > state of the<br>
> > PCIe-attached hardware.<br>
> ><br>
> > I suppose one might suggest that "accelerator" to some extent implies<br>
> > offload usage: you're accelerating a library.<br>
> ><br>
> > another interesting example is AMD's upcoming HSA concept: since<br>
> > nearly all<br>
> > GPUs are now on-chip, AMD wants to integrate the CPU and GPU<br>
> > programming<br>
> > models (at least to some extent). as far as I understand it, HSA<br>
> > is based<br>
> > on introducing a quite general intermediate ISA that can be<br>
> > executed using<br>
> > all available hardware resources: CPU and/or GPU. although Nvidia<br>
> > does have<br>
> > its own intermediate ISA, they don't seem to be trying to make it<br>
> > general,<br>
> > *and* they don't seem interested in making it work on both C/GPU.<br>
> > (well,<br>
> > so far at least - I wouldn't be surprised if they _did_ have a PTX<br>
> > JIT for<br>
> > their ARM-based C/GPU chips...)<br>
> ><br>
> > I think HSA is potentially interesting for HPC, too.<br>
> > I really expect<br>
> > AMD and/or Intel to ship products this year that have a C/GPU chip<br>
> > mounted on<br>
> > the same interposer as some high-bandwidth ram.<br>
><br>
> How can an integrated gpu outperform a gpgpu card?<br>
><br>
> Something like, what is it, 25 watts versus 250 watts: which will be faster?<br>
><br>
> I assume you will not build 10 nodes with 10 cpu's with integrated<br>
> gpu in order to rival a<br>
> single card.<br>
><br>
> > a fixed amount of very high<br>
> > performance memory sounds very tasty to me. a surprising amount of<br>
> > power<br>
> > in current systems is spent getting high-speed signals off-socket.<br>
> ><br>
> > imagine a package dissipating say 40W containing a, say, 4 CPU cores,<br>
> > 256 GPU ALUs and 2GB of gddr5. the point would be to tile 32 of them<br>
> > in a 1U box. (dropping socketed, off-package dram would probably make<br>
> > it uninteresting for memcached and some space-intensive HPC.)<br>
> ><br>
> > then again, if you think carefully about the numbers, any code today<br>
> > that has a big working set is almost as anachronistic as codes that<br>
> > use<br>
> > disk-based algorithms. (same conceptual thing happening: capacity is<br>
> > growing much faster than the pipe.)<br>
> ><br>
> > regards, mark hahn.<br>
> > _______________________________________________<br>
> > Beowulf mailing list, <a href="mailto:Beowulf@beowulf.org">Beowulf@beowulf.org</a> sponsored by Penguin<br>
> > Computing<br>
> > To change your subscription (digest mode or unsubscribe) visit<br>
> > <a href="http://www.beowulf.org/mailman/listinfo/beowulf" target="_blank">http://www.beowulf.org/mailman/listinfo/beowulf</a><br>
><br>
</div></div></blockquote></div><br></div></div></div></div></div></div></div>