[Beowulf] difference between accelerators and co-processors
Joshua mora acosta
joshua_mora at usa.net
Tue Mar 12 08:48:38 PDT 2013
Good comments.
My comments inline.
Joshua
------ Original Message ------
Received: 11:02 PM CDT, 03/11/2013
From: Brendan Moloney <moloney.brendan at gmail.com>
To: Joshua mora acosta <joshua_mora at usa.net> Cc: Vincent Diepeveen
<diep at xs4all.nl>, Mark Hahn <hahn at mcmaster.ca>, Beowulf List
<beowulf at beowulf.org>
Subject: Re: [Beowulf] difference between accelerators and co-processors
> I think this analysis is missing some important points.
>
> 1) Comparing a single low power APU to a single high power discrete GPU
> doesn't make sense for HPC. Rather we should compare a rack of equipment
> that can operate in the same power envelope.
[Joshua] I was comparing, or rather the paper compares, one system (APU) vs
another system (CPU+GPU).
If you add the network, then you would need to add it for both: say, compare a
full rack of APUs vs server nodes with GPUs connected over IB.
I am not sure how much insight that would provide, as the interconnect would be
the same technology for both systems. I would rather simplify/reduce the
analysis to within the system, where the main differences are observed. I am
not disagreeing, though, with a view/analysis from a scalability point of view.
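To make the rack-level comparison concrete, here is a back-of-the-envelope
sketch. All per-node figures are illustrative assumptions for the sake of the
arithmetic, not measurements of any real APU or GPU node:

```python
# Back-of-the-envelope rack comparison under a fixed power envelope.
# All figures below are illustrative assumptions, not measurements.

RACK_POWER_W = 20_000  # assumed usable power budget per rack

# Hypothetical per-node power and throughput.
apu_node = {"power_w": 200, "gflops": 500}      # low-power APU node
gpu_node = {"power_w": 1_000, "gflops": 3_000}  # CPU + discrete GPU node

for name, node in (("APU rack", apu_node), ("GPU rack", gpu_node)):
    n_nodes = RACK_POWER_W // node["power_w"]
    aggregate = n_nodes * node["gflops"]
    print(f"{name}: {n_nodes} nodes, {aggregate} GFLOPS aggregate")
```

The point of the exercise is only that the winner depends on the per-node
numbers you plug in, which is why the within-system comparison is the
interesting one.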
>
> 2) You can bolt GDDR5 onto an APU, eliminating the local bandwidth
> advantage (AMD is doing exactly this for the PS4). Also, we should really
> be comparing the bandwidth available to each GPU "core".
[Joshua] I believe there are power constraints on what you can do with APUs in
terms of high-speed memory. That is why you get discrete GPUs burning ~250W
but capable of feeding the streaming cores at an aggregated ~150GB/s from
global memory. If you have to chunk the work for the APUs by a factor of at
least 10X, you will incur higher transfer overheads. I don't think the
performance/watt ratio including those overheads is going to be better on an
APU for a wide variety of HPC apps. This is like the cloud studies of
aggregated Atom microservers vs a single multisocket Xeon server. You have to
meet a bar of performance first to make it attractive. That does not mean
there isn't a market for it (e.g. consolidation) where performance is not the
#1 priority.
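The tradeoff can be sketched with a simple time model: end-to-end time =
host-device transfer time + kernel compute time. The ratios below follow the
thread (~2X transfer advantage for the APU, ~10X compute advantage for the
discrete GPU); the absolute bandwidth and GFLOPS numbers are illustrative
assumptions only:

```python
# Rough model of end-to-end time = host<->device transfer + kernel compute.
# Ratios follow the thread (~2X transfer advantage for the APU, ~8-10X
# compute advantage for the discrete GPU); absolute numbers are assumptions.

def total_time(bytes_moved, flops, xfer_gbps, compute_gflops):
    """Seconds to move the data over the host link and then compute on it."""
    return bytes_moved / (xfer_gbps * 1e9) + flops / (compute_gflops * 1e9)

apu = dict(xfer_gbps=12.0, compute_gflops=100.0)   # faster transfers
gpu = dict(xfer_gbps=6.0, compute_gflops=1000.0)   # 10X the compute

data = 1e9  # move 1 GB
for flops in (1e9, 1e11, 1e13):  # sweep arithmetic intensity
    t_apu = total_time(data, flops, **apu)
    t_gpu = total_time(data, flops, **gpu)
    winner = "APU" if t_apu < t_gpu else "discrete GPU"
    print(f"{flops:.0e} flops on 1 GB -> {winner}")
```

With little computation per byte moved, the APU wins; as arithmetic intensity
grows, the discrete GPU's compute advantage dominates despite the slower
transfers, which is exactly the "cooked test" caveat below.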
> 3) Almost every GPGPU research paper devotes significant space (perhaps the
> whole paper) to figuring out ways of doing some step in their algorithm,
> that is trivial on a CPU, efficiently on a GPU. Avoiding round trips is a
> driving force in most algorithm development. So the programming should be
> easier, even if you still need (for now) the explicit API calls for memory
> "transfers".
[Joshua] GPUs/accelerators provide API calls to discover the devices and their
features so you can do the right blocking and transfers, and you do "manual"
coherence. Because that information is exposed, I don't see this becoming
easier for the programmer. It is just more tedious, and you have to build
intelligence into your code to autotune. That isn't easier...
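As a hypothetical sketch of the kind of device-query-driven blocking logic
this forces on the programmer (the device figures are made up; a real code
would query them via clGetDeviceInfo or cudaGetDeviceProperties, and each
chunk would be an explicit transfer):

```python
# Hypothetical sketch: the host code discovers device memory limits and
# chunks the working set to fit, issuing one explicit transfer per chunk.
# Device figures are made up for illustration.

devices = [
    {"name": "discrete_gpu", "global_mem_bytes": 4 * 2**30},    # 4 GiB
    {"name": "apu",          "global_mem_bytes": 512 * 2**20},  # 512 MiB
]

def plan_chunks(total_bytes, device, headroom=0.8):
    """Split the working set into chunks that fit in device memory."""
    chunk = int(device["global_mem_bytes"] * headroom)
    n = -(-total_bytes // chunk)  # ceiling division
    return n, chunk

workset = 8 * 2**30  # an 8 GiB working set
for dev in devices:
    n, chunk = plan_chunks(workset, dev)
    # each chunk means another explicit host->device round trip
    print(f"{dev['name']}: {n} chunks of up to {chunk} bytes")
```

The smaller-memory device ends up with many more chunks, hence many more
explicit transfers and more autotuning logic in the host code.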
>
>
> On Sun, Mar 10, 2013 at 5:55 PM, Joshua mora acosta
> <joshua_mora at usa.net> wrote:
>
> > See this paper
> > http://synergy.cs.vt.edu/pubs/papers/daga-saahpc11-apu-efficacy.pdf
> >
> > While discrete GPUs underperform the APU on host to/from device transfers
> > by a ratio of ~2X, they compensate by far with ~8-10X the computing power
> > and local bandwidth.
> >
> > You can, though, cook up a test where you do little computation and it is
> > all bound by the host to/from device transfers.
> >
> > Programming-wise there is no difference, as there isn't yet coherence, so
> > explicit transfers through API calls are needed.
> >
> > Joshua
> >
> > ------ Original Message ------
> > Received: 04:06 PM CDT, 03/10/2013
> > From: Vincent Diepeveen <diep at xs4all.nl>
> > To: Mark Hahn <hahn at mcmaster.ca> Cc: Beowulf List <beowulf at beowulf.org>
> > Subject: Re: [Beowulf] difference between accelerators and co-processors
> >
> > >
> > > On Mar 10, 2013, at 9:03 PM, Mark Hahn wrote:
> > >
> > > >> Is there any line/point to make distinction between accelerators and
> > > >> co-processors (that are used in conjunction with the primary CPU
> > > >> to boost
> > > >> up the performance)? or these terms can be used interchangeably?
> > > >
> > > > IMO, a coprocessor executes the same instruction stream as the
> > > > "primary" processor. this was the case with the x87, for instance,
> > > > though the distinction became less significant once the x87 came
> > > > on-chip.
> > > > (though you certainly notice that FPU on any of these chips is mostly
> > > > separate - not sharing functional units or register files,
> > > > sometimes even
> > > > with separate micro-op schedulers.)
> > > >
> > > >> Specifically, the word "accelerator" is used commonly with GPU. On
> > > >> the
> > > >> other hand the word "co-processors" is used commonly with Xeon Phi.
> > > >
> > > > I don't think it is a useful distinction: both are basically
> > > > independent
> > > > computers. obviously, the programming model of Phi is dramatically
> > > > more
> > > > like a conventional processor than Nvidia.
> > > >
> > >
> > > Mark, that's the marketing talk about Xeon Phi.
> > >
> > > It's surprisingly much the same, of course, except for the cache
> > > coherency: both are big vector processors.
> > >
> > > > there is a meaningful distinction between offload and coprocessor
> > > > approaches.
> > > > that is, offload means you use the device to accelerate a set of
> > > > libraries
> > > > (offload matrix multiply, eig, fft, etc). to use a coprocessor, I
> > > > think the
> > > > expectation is that the main code will be very much aware of the
> > > > state of the
> > > > PCIe-attached hardware.
> > > >
> > > > I suppose one might suggest that "accelerator" to some extent implies
> > > > offload usage: you're accelerating a library.
> > > >
> > > > another interesting example is AMD's upcoming HSA concept: since
> > > > nearly all
> > > > GPUs are now on-chip, AMD wants to integrate the CPU and GPU
> > > > programming
> > > > models (at least to some extent). as far as I understand it, HSA
> > > > is based
> > > > on introducing a quite general intermediate ISA that can be
> > > > executed using
> > > > all available hardware resources: CPU and/or GPU. although Nvidia
> > > > does have
> > > > its own intermediate ISA, they don't seem to be trying to make it
> > > > general,
> > > > *and* they don't seem interested in making it work on both C/GPU.
> > > > (well,
> > > > so far at least - I wouldn't be surprised if they _did_ have a PTX
> > > > JIT for
> > > > their ARM-based C/GPU chips...)
> > > >
> > > > I think HSA is potentially interesting for HPC, too.
> > > > I really expect
> > > > AMD and/or Intel to ship products this year that have a C/GPU chip
> > > > mounted on
> > > > the same interposer as some high-bandwidth ram.
> > >
> > > How can an integrated GPU outperform a GPGPU card?
> > >
> > > Something like what is it 25 watt versus 250 watt, what will be faster?
> > >
> > > I assume you will not build 10 nodes with 10 CPUs with integrated
> > > GPUs in order to rival a single card.
> > >
> > > > a fixed amount of very high
> > > > performance memory sounds very tasty to me. a surprising amount of
> > > > power
> > > > in current systems is spent getting high-speed signals off-socket.
> > > >
> > > > imagine a package dissipating, say, 40W and containing 4 CPU cores,
> > > > 256 GPU ALUs and 2GB of gddr5. the point would be to tile 32 of them
> > > > in a 1U box. (dropping socketed, off-package dram would probably make
> > > > it uninteresting for memcached and some space-intensive HPC.)
> > > >
> > > > then again, if you think carefully about the numbers, any code today
> > > > that has a big working set is almost as anachronistic as codes that
> > > > use
> > > > disk-based algorithms. (same conceptual thing happening: capacity is
> > > > growing much faster than the pipe.)
> > > >
> > > > regards, mark hahn.
> > > > _______________________________________________
> > > > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin
> > > > Computing
> > > > To change your subscription (digest mode or unsubscribe) visit
> > > > http://www.beowulf.org/mailman/listinfo/beowulf
> > >