[Beowulf] difference between accelerators and co-processors
Joshua Mora Acosta
joshua_mora at usa.net
Sun Mar 10 17:55:47 PDT 2013
See this paper:
http://synergy.cs.vt.edu/pubs/papers/daga-saahpc11-apu-efficacy.pdf
While discrete GPUs underperform the APU on host to/from device transfers by a
ratio of ~2X, they compensate by far through compute power and local memory
bandwidth that are ~8-10X higher.
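As a rough back-of-the-envelope model (only the ~2X and ~8-10X ratios come from
the discussion above; every other number below is invented for illustration):

/* Toy model of the transfer-vs-compute tradeoff. Only the ~2X transfer
 * and ~8X compute ratios come from the discussion; the absolute numbers
 * are assumptions for illustration. */
#include <stdio.h>

int main(void) {
    double bytes      = 256e6;   /* data moved each way (bytes)               */
    double flops      = 100e9;   /* floating-point work done on that data     */

    double apu_bw     = 6e9;     /* APU host<->device path, ~6 GB/s (assumed) */
    double apu_flops  = 0.5e12;  /* APU compute, 0.5 TFLOP/s (assumed)        */
    double dgpu_bw    = 3e9;     /* discrete GPU: ~2X slower transfers        */
    double dgpu_flops = 4e12;    /* discrete GPU: ~8X more compute            */

    double t_apu  = 2.0 * bytes / apu_bw  + flops / apu_flops;
    double t_dgpu = 2.0 * bytes / dgpu_bw + flops / dgpu_flops;

    printf("APU: %.3f s   discrete GPU: %.3f s\n", t_apu, t_dgpu);
    return 0;
}

Shrink the flops term and the APU wins; grow it and the discrete card wins.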
You can, though, cook up a test that does little computation and is entirely
bound by the host to/from device transfers.
Programming-wise there is no difference: there is no coherence yet, so
explicit transfers through API calls are needed.
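To make that concrete, here is a minimal sketch of such explicit transfers
using the OpenCL buffer API (an illustration only, not from the paper; the
buffer size and the assumption of a single GPU device are mine, and error
handling is mostly omitted):

/* Minimal sketch: explicit host <-> device transfers with OpenCL.
 * Assumes one platform with one GPU device; kernel launch omitted. */
#include <CL/cl.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    cl_platform_id platform;
    cl_device_id   device;
    cl_int         err;

    err  = clGetPlatformIDs(1, &platform, NULL);
    err |= clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
    if (err != CL_SUCCESS) return 1;

    cl_context       ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
    cl_command_queue q   = clCreateCommandQueue(ctx, device, 0, &err);

    const size_t n = 1 << 24;                      /* 16M floats, 64 MB */
    float *host = malloc(n * sizeof(float));
    for (size_t i = 0; i < n; i++) host[i] = 1.0f;

    /* Buffer in the device's own memory space. */
    cl_mem dev = clCreateBuffer(ctx, CL_MEM_READ_WRITE,
                                n * sizeof(float), NULL, &err);

    /* Explicit host -> device copy; no coherence does this for you. */
    clEnqueueWriteBuffer(q, dev, CL_TRUE, 0, n * sizeof(float),
                         host, 0, NULL, NULL);

    /* ... enqueue kernels that operate on 'dev' here ... */

    /* Explicit device -> host copy to get results back. */
    clEnqueueReadBuffer(q, dev, CL_TRUE, 0, n * sizeof(float),
                        host, 0, NULL, NULL);
    clFinish(q);

    clReleaseMemObject(dev);
    clReleaseCommandQueue(q);
    clReleaseContext(ctx);
    free(host);
    return 0;
}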
Joshua
------ Original Message ------
Received: 04:06 PM CDT, 03/10/2013
From: Vincent Diepeveen <diep at xs4all.nl>
To: Mark Hahn <hahn at mcmaster.ca>
Cc: Beowulf List <beowulf at beowulf.org>
Subject: Re: [Beowulf] difference between accelerators and co-processors
>
> On Mar 10, 2013, at 9:03 PM, Mark Hahn wrote:
>
> >> Is there any line/point to make distinction between accelerators and
> >> co-processors (that are used in conjunction with the primary CPU
> >> to boost
> >> up the performance)? or these terms can be used interchangeably?
> >
> > IMO, a coprocessor executes the same instruction stream as the
> > "primary" processor. this was the case with the x87, for instance,
> > though the distinction became less significant once the x87 came
> > on-chip. (though you certainly notice that the FPU on any of these
> > chips is mostly separate - not sharing functional units or register
> > files, sometimes even with separate micro-op schedulers.)
> >
> >> Specifically, the word "accelerator" is used commonly with GPU. On
> >> the
> >> other hand the word "co-processors" is used commonly with Xeon Phi.
> >
> > I don't think it is a useful distinction: both are basically
> > independent computers. obviously, the programming model of Phi is
> > dramatically more like a conventional processor than Nvidia's.
> >
>
> Mark, that's the marketing talk about Xeon Phi.
>
> It's surprisingly much the same, of course, except for the cache
> coherency; both are big vector processors.
>
> > there is a meaningful distinction between offload and coprocessor
> > approaches. that is, offload means you use the device to accelerate
> > a set of libraries (offload matrix multiply, eig, fft, etc.). to use
> > a coprocessor, I think the expectation is that the main code will be
> > very much aware of the state of the PCIe-attached hardware.
> >
> > I suppose one might suggest that "accelerator" to some extent implies
> > offload usage: you're accelerating a library.
> >
> > another interesting example is AMD's upcoming HSA concept: since
> > nearly all GPUs are now on-chip, AMD wants to integrate the CPU and
> > GPU programming models (at least to some extent). as far as I
> > understand it, HSA is based on introducing a quite general
> > intermediate ISA that can be executed using all available hardware
> > resources: CPU and/or GPU. although Nvidia does have its own
> > intermediate ISA, they don't seem to be trying to make it general,
> > *and* they don't seem interested in making it work on both CPU and
> > GPU. (well, so far at least - I wouldn't be surprised if they _did_
> > have a PTX JIT for their ARM-based C/GPU chips...)
> >
> > I think HSA is potentially interesting for HPC, too. I really expect
> > AMD and/or Intel to ship products this year that have a C/GPU chip
> > mounted on the same interposer as some high-bandwidth RAM.
>
> How can an integrated GPU outperform a GPGPU card?
>
> Something like, what is it, 25 watts versus 250 watts - which will be
> faster?
>
> I assume you will not build 10 nodes with 10 CPUs with integrated
> GPUs in order to rival a single card.
>
> > a fixed amount of very high-performance memory sounds very tasty to
> > me. a surprising amount of power in current systems is spent getting
> > high-speed signals off-socket.
> >
> > imagine a package dissipating say 40W containing, say, 4 CPU cores,
> > 256 GPU ALUs and 2GB of GDDR5. the point would be to tile 32 of them
> > in a 1U box. (dropping socketed, off-package DRAM would probably make
> > it uninteresting for memcached and some space-intensive HPC.)
> >
> > then again, if you think carefully about the numbers, any code today
> > that has a big working set is almost as anachronistic as codes that
> > use disk-based algorithms. (same conceptual thing happening: capacity
> > is growing much faster than the pipe.)
> >
> > regards, mark hahn.
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf