[Beowulf] difference between accelerators and co-processors

Joshua Mora Acosta joshua_mora at usa.net
Sun Mar 10 17:55:47 PDT 2013


See this paper:
http://synergy.cs.vt.edu/pubs/papers/daga-saahpc11-apu-efficacy.pdf

While discrete GPUs underperform the APU on host-to/from-device transfers by
a ratio of ~2X, they compensate by far with ~8-10X the computing power and
local memory bandwidth.
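
(To make the tradeoff concrete, here is a back-of-the-envelope model; only
the ~2X and ~8-10X ratios come from the paper, the absolute bandwidth and
FLOP numbers below are assumptions:)

    #include <stdio.h>

    int main(void) {
        double pcie = 6.0;     /* discrete GPU host<->device, GB/s (assumed) */
        double apu  = 12.0;    /* APU host<->device, ~2X faster              */
        double dgpu = 1000.0;  /* discrete GPU compute, GFLOP/s (assumed)    */
        double igpu = 100.0;   /* APU compute, ~10X slower                   */
        double gb   = 1.0;     /* data moved to and from the device, GB      */

        /* total time = transfer + compute; solve t_dgpu == t_apu for F */
        double F = gb * (1.0 / pcie - 1.0 / apu) / (1.0 / igpu - 1.0 / dgpu);
        printf("break-even: %.1f GFLOP per GB moved (~%.0f flops/byte)\n",
               F, F / gb);    /* above this, the discrete card wins */
        return 0;
    }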

You can, though, cook up a test that does little computation and is entirely
bound by the host-to/from-device transfers.

Programming-wise there is no difference: there isn't yet coherence between
host and device memory, so explicit transfers through API calls are needed.
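
(A minimal sketch of such a transfer-bound test, in CUDA for concreteness;
the same explicit-copy pattern applies with OpenCL's
clEnqueueWriteBuffer/clEnqueueReadBuffer. Buffer size and the kernel are
illustrative, error checks omitted:)

    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda_runtime.h>

    // Trivial kernel: almost no arithmetic per byte moved.
    __global__ void add_one(float *x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] += 1.0f;
    }

    int main(void) {
        const int n = 1 << 22;             // 16 MB of floats
        size_t bytes = n * sizeof(float);
        float *h = (float *)calloc(n, sizeof(float));
        float *d;
        cudaMalloc(&d, bytes);

        // No host/device coherence, so every round trip is explicit:
        cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);  // host -> device
        add_one<<<(n + 255) / 256, 256>>>(d, n);          // negligible work
        cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);  // device -> host

        printf("h[0] = %f\n", h[0]);       // 1.0: the copies dominated
        cudaFree(d); free(h);
        return 0;
    }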

Joshua

------ Original Message ------
Received: 04:06 PM CDT, 03/10/2013
From: Vincent Diepeveen <diep at xs4all.nl>
To: Mark Hahn <hahn at mcmaster.ca>
Cc: Beowulf List <beowulf at beowulf.org>
Subject: Re: [Beowulf] difference between accelerators and co-processors

> 
> On Mar 10, 2013, at 9:03 PM, Mark Hahn wrote:
> 
> >> Is there any line/point that makes a distinction between accelerators
> >> and co-processors (which are used in conjunction with the primary CPU
> >> to boost performance)? Or can these terms be used interchangeably?
> >
> > IMO, a coprocessor executes the same instruction stream as the
> > "primary" processor.  this was the case with the x87, for instance,
> > though the distinction became less significant once the x87 came  
> > on-chip.
> > (though you certainly notice that the FPU on any of these chips is mostly
> > separate - not sharing functional units or register files,  
> > sometimes even
> > with separate micro-op schedulers.)
> >
> >> Specifically, the word "accelerator" is commonly used with GPUs. On
> >> the other hand, the word "co-processor" is commonly used with the Xeon Phi.
> >
> > I don't think it is a useful distinction: both are basically
> > independent
> > computers.  obviously, the programming model of Phi is dramatically  
> > more
> > like a conventional processor than Nvidia.
> >
> 
> Mark, that's the marketing talk about Xeon Phi.
> 
> It's surprisingly much the same, of course, except for the cache coherency:
> big vector processors.
> 
> > there is a meaningful distinction between offload and coprocessor  
> > approaches.
> > that is, offload means you use the device to accelerate a set of  
> > libraries
> > (offload matrix multiply, eig, fft, etc).  to use a coprocessor, I  
> > think the
> > expectation is that the main code will be very much aware of the  
> > state of the
> > PCIe-attached hardware.
> >
> > I suppose one might suggest that "accelerator" to some extent implies
> > offload usage: you're accelerating a library.
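
(For concreteness, a minimal sketch of the offload style, assuming CUDA and
cuBLAS as the offloaded library; matrix size and names are illustrative,
error checks omitted:)

    #include <cublas_v2.h>
    #include <cuda_runtime.h>

    // Offload style: the host code accelerates one library call (SGEMM)
    // and is otherwise oblivious to the state of the device.
    void offload_sgemm(const float *A, const float *B, float *C, int n) {
        size_t bytes = (size_t)n * n * sizeof(float);
        float *dA, *dB, *dC;
        cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dC, bytes);
        cudaMemcpy(dA, A, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(dB, B, bytes, cudaMemcpyHostToDevice);

        cublasHandle_t h;
        cublasCreate(&h);
        const float alpha = 1.0f, beta = 0.0f;
        // C = alpha*A*B + beta*C, column-major as BLAS expects
        cublasSgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                    &alpha, dA, n, dB, n, &beta, dC, n);
        cublasDestroy(h);

        cudaMemcpy(C, dC, bytes, cudaMemcpyDeviceToHost);
        cudaFree(dA); cudaFree(dB); cudaFree(dC);
    }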
> >
> > another interesting example is AMD's upcoming HSA concept: since  
> > nearly all
> > GPUs are now on-chip, AMD wants to integrate the CPU and GPU  
> > programming
> > models (at least to some extent).  as far as I understand it, HSA  
> > is based
> > on introducing a quite general intermediate ISA that can be  
> > executed using
> > all available hardware resources: CPU and/or GPU.  although Nvidia  
> > does have
> > its own intermediate ISA, they don't seem to be trying to make it  
> > general,
> > *and* they don't seem interested in making it work on both CPU and GPU.  
> > (well,
> > so far at least - I wouldn't be surprised if they _did_ have a PTX  
> > JIT for
> > their ARM-based C/GPU chips...)
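
(PTX is in fact already JIT-compiled at load time by the driver; a minimal
sketch with the CUDA driver API, where the .ptx file and kernel name are
placeholders and error handling is omitted:)

    #include <cuda.h>   // CUDA driver API; link with -lcuda

    int main(void) {
        cuInit(0);
        CUdevice dev;  cuDeviceGet(&dev, 0);
        CUcontext ctx; cuCtxCreate(&ctx, 0, dev);

        // The driver JIT-compiles the intermediate ISA (PTX) here,
        // for whatever GPU happens to be present.
        CUmodule mod;  cuModuleLoad(&mod, "kernel.ptx");
        CUfunction f;  cuModuleGetFunction(&f, mod, "my_kernel");
        // ... set up arguments and cuLaunchKernel(f, ...) as usual ...

        cuModuleUnload(mod);
        cuCtxDestroy(ctx);
        return 0;
    }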
> >
> > I think HSA is potentially interesting for HPC, too.
> >   I really expect
> > AMD and/or Intel to ship products this year that have a C/GPU chip  
> > mounted on
> > the same interposer as some high-bandwidth ram.
> 
> How can an integrated GPU outperform a GPGPU card?
> 
> Something like, what is it, 25 watts versus 250 watts: which will be faster?
> 
> I assume you will not build 10 nodes with 10 CPUs with integrated GPUs
> in order to rival a single card.
> 
> >   a fixed amount of very high
> > performance memory sounds very tasty to me.  a surprising amount of  
> > power
> > in current systems is spent getting high-speed signals off-socket.
> >
> > imagine a package dissipating, say, 40W and containing 4 CPU cores,
> > 256 GPU ALUs and 2GB of gddr5.  the point would be to tile 32 of them
> > in a 1U box.  (dropping socketed, off-package dram would probably make
> > it uninteresting for memcached and some space-intensive HPC.)
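
(Spelling out the aggregate numbers for that hypothetical 1U box; all
inputs are the figures imagined above:)

    #include <stdio.h>

    int main(void) {
        int tiles = 32;                                  /* per 1U box */
        printf("CPU cores: %d\n",   tiles * 4);          /* 128        */
        printf("GPU ALUs:  %d\n",   tiles * 256);        /* 8192       */
        printf("GDDR5:     %d GB\n", tiles * 2);         /* 64 GB      */
        printf("power:     %d W\n",  tiles * 40);        /* 1280 W     */
        return 0;
    }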
> >
> > then again, if you think carefully about the numbers, any code today
> > that has a big working set is almost as anachronistic as codes that  
> > use
> > disk-based algorithms.  (same conceptual thing happening: capacity is
> > growing much faster than the pipe.)
> >
> > regards, mark hahn.