[Beowulf] i7-4770R 128MB L4 cache CPU in compact 0.79 litre box - DIY cluster?

Robin Whittle rw at firstpr.com.au
Mon Jan 20 19:02:10 PST 2014


Hi Mark,

Thanks for your reply.  I am not professing expertise or arguing
strongly for anything - just noting that these i7-4770R Brix units
might make it easy for someone to craft their own compact,
power-efficient cluster, provided they were happy with gigabit
Ethernet, did not want specialised industrial-strength HPC servers,
and either did not need more cores per system or wanted faster
per-core performance than multi-socket systems generally offer.

8-core (4 cores per die), 3.2GHz, 32nm G34 Opteron 6328s with coolers
can be found on eBay for USD$565, and dual-socket motherboards for less
than this, so with RAM a 16-core machine could probably be built for
not much more than $2k - roughly $2k / 16 = $125 per core, or half the
cost per core of the Brix approach I suggested.  The cores, inter-core
communications, caching and main memory arrangements are totally
different, but I think 16 cores per machine is generally better for HPC
than 4, assuming the application works well with the cache and with
contention for the two sets of main memory.  This assumes HPC
applications which require shared memory and/or intensive communication
between threads.

You wrote:

>> the 128MB cache is supported by Linux kernel 3.12 (Nov 2013).
> 
> I'm not really sure why the kernel would need to do anything...

I have no detailed knowledge, but I guess the kernel would need some
way of managing how much of the 128MB cache is used by the GPU rather
than by the CPU cores.

> as for performance, this pretty much says it all:
> 
> http://images.anandtech.com/doci/6993/latency.png

This graph is from these two articles:

http://www.anandtech.com/print/7003/the-haswell-review-intel-core-i74770k-i54560k-tested

http://www.anandtech.com/print/6993/intel-iris-pro-5200-graphics-review-core-i74950hq-tested

The green and blue lines are for the i7-3770K and i7-4770K, both of
which are conventional single-chip devices with 8MB of L3 cache.  The
red line is for the i7-4950HQ, which has the 128MB eDRAM L4 cache chip,
like the i7-4770R.  I guess the two are much the same device, with
2.4GHz and 3.2GHz base clock speeds respectively.  Thanks for pointing
to this - it is the only information I have seen so far on the
performance of the 128MB cache.

The second article above suggests that 128MB was bigger than Intel
thought was really necessary.  I haven't checked the veracity of this:

->  Intel found that hit rate rarely dropped below 95%. It turns
->  out that for current workloads, Intel didn't see much benefit
->  beyond a 32MB eDRAM however it wanted the design to be future
->  proof. Intel doubled the size to deal with any increases in game
->  complexity, and doubled it again just to be sure. I believe the
->  exact wording Intel's Tom Piazza used during his explanation of
->  why 128MB was "go big or go home".


> in other words, if your working set size happens to fit in this cache,
> you might find it very interesting.  notice, though, how the observed
> memory latency for larger data is made worse by the added layer of caching.

Indeed, beyond 64MB the red curve rises to about 115 clock cycles,
rather than ~101.  The benefit is that at the 16, 32 and 64MB points
the 128MB L4 device shows ~57 cycle latency, compared to ~101 for the
parts without eDRAM.  There's no 128MB step on these graphs.
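For anyone wanting to measure such a curve on their own hardware,
below is a minimal pointer-chasing sketch of the kind of benchmark
that produces these graphs - my own illustration, not the code
AnandTech used.  The sizes and the step count are my own choices:

/* Pointer-chase latency sketch.  Walks a randomly permuted cycle of
   pointers so each load depends on the previous one and hardware
   prefetch cannot help.  Compile with:  gcc -O2 chase.c -o chase  */

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
    srand(1);
    for (size_t mb = 1; mb <= 256; mb *= 2) {  /* straddle the 128MB L4 */
        size_t n = mb * 1024 * 1024 / sizeof(void *);
        void **buf = malloc(n * sizeof(void *));
        size_t *idx = malloc(n * sizeof(size_t));
        if (!buf || !idx) return 1;

        /* Fisher-Yates shuffle to build one big random cycle.
           rand() modulo bias is fine for a sketch. */
        for (size_t i = 0; i < n; i++) idx[i] = i;
        for (size_t i = n - 1; i > 0; i--) {
            size_t j = (size_t)rand() % (i + 1);
            size_t t = idx[i]; idx[i] = idx[j]; idx[j] = t;
        }
        for (size_t i = 0; i < n; i++)
            buf[idx[i]] = &buf[idx[(i + 1) % n]];
        free(idx);

        /* Chase: each load's address comes from the previous load. */
        const size_t steps = 1 << 25;
        void **p = (void **)buf[0];
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t i = 0; i < steps; i++)
            p = (void **)*p;
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = ((t1.tv_sec - t0.tv_sec) * 1e9
                     + (t1.tv_nsec - t0.tv_nsec)) / (double)steps;
        /* Printing p stops the compiler optimising the loop away. */
        printf("%4zu MB: %6.2f ns/load  (%p)\n", mb, ns, (void *)p);
        free(buf);
    }
    return 0;
}

On an i7-4770R one would hope to see the ns-per-load step up past the
8MB L3 to the eDRAM level, and then step up again only past 128MB,
rather than going straight to full DRAM latency at 8MB.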


> I'm not really sure whether this 8-64M sweetspot is special in the
> marketplace - perhaps you have in mind some particular workload,
> especially one which is small-footprint?  perhaps financial MC?

I don't have any particular workload in mind, but this substantial
latency reduction for working sets of 16 to 64MB - and probably out to
nearly 128MB, I would guess - strikes me as valuable for HPC in
general, despite the reported comments that Intel regards 32MB as
generally sufficient for, I guess, desktop PCs and gaming.
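
As a concrete (and entirely hypothetical) illustration of how a code
might target that sweet spot: block the computation so each pass
touches a working set sized to stay inside the eDRAM.  The 1536 tile
edge below is my own arithmetic for a ~57MB target, nothing from the
articles:

/* Blocked matrix multiply sized for a large L4: three T x T tiles of
   doubles, 3 * 1536^2 * 8 bytes = ~57MB, comfortably under 128MB.
   C is assumed zero-initialised by the caller. */

#include <stddef.h>

void matmul_blocked(size_t n, const double *A, const double *B,
                    double *C)
{
    const size_t T = 1536;  /* tile edge chosen for the eDRAM size */
    for (size_t ii = 0; ii < n; ii += T)
        for (size_t kk = 0; kk < n; kk += T)
            for (size_t jj = 0; jj < n; jj += T)
                /* One tile-triple pass; working set ~3 * T*T * 8B. */
                for (size_t i = ii; i < ii + T && i < n; i++)
                    for (size_t k = kk; k < kk + T && k < n; k++) {
                        double a = A[i * n + k];
                        for (size_t j = jj; j < jj + T && j < n; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}

Whether a real application's working set can be carved up this way is
of course application-dependent.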


> most HPC I see has a memory footprint of 1-2G/core - arguably the
> working set size is smaller.  maybe a quarter that, but still >100 MB.
> 
> it seems like vendors are waiting for memory density to increase a bit more
> before really jumping on the ram-in-package thing.  it's also not clear
> that everyone is going to follow the path of making it last-level cache.

I think it is neat to see a significant step towards larger caches turn
up in a consumer chip first, and be available in these tiny Brix boxes.
However, these chips are not suited to dual or quad socket
motherboards, which I think is generally the best way to improve
performance in HPC.


>> Ordinary mass market PC cases, motherboards and i7 CPUs might work out a
>> little cheaper, but they would not be as compact, would not have the
> 
> why is compactness such an important goal?  for a conventional, air-cool
> datacenter, it's pretty comfortable to configure 32-40 cores per U.
> I'm not really sure using desktop chips gives much of an advantage
> in power-per-ghz*core, but perhaps your argument is more about consumer
> rather than server prices?

I think these Brix devices might suit a company or researcher who
wanted a lot of CPU power, was OK with a larger number of smaller
servers using Ethernet, and wanted for some reason to cobble it
together themselves in a more compact form than traditional (and, I am
sure, more power-hungry) PC boxes, while also - maybe for cost reasons
- wanting to avoid the rack-mount servers which are specifically made
for HPC.  Those generally use pairs of Xeons and ECC memory.  The Xeons
available at any one time tend to be somewhat slower and more expensive
than the bleeding-edge consumer i7s, though the really expensive ones
have larger caches and perhaps more cores.  The i7-4770R at 3.2GHz is
not quite bleeding edge, since the base (non-turbo) clock speed of
other i7s is up to 3.5GHz - but those faster devices don't have the
128MB L4 cache.

For real HPC, as is needed on large-scale clusters, as far as I know
InfiniBand, ECC memory, and Xeons or G34 Opterons are mandatory
(ignoring for the moment GPUs as co-processors and the Intel Xeon Phi).

The Brix approach might be good for applications in which each thread is
largely independent, with only low levels of communication to a central
process on another machine which coordinates them.
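
To make that pattern concrete, here is a hypothetical master/worker
skeleton in MPI, which runs happily over gigabit Ethernet.  The task
count, message layout and stand-in "work" are all invented for
illustration - my own sketch, not anyone's production code:

/* Master/worker skeleton: rank 0 coordinates; every other rank
   computes independently and exchanges only tiny messages, so gigabit
   Ethernet between Brix boxes would not be a bottleneck.
   Build: mpicc mw.c -o mw    Run: mpirun -np 8 ./mw  */

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int tasks = 1000;               /* invented task count */

    if (rank == 0) {
        /* Coordinator: reply to each incoming message with the next
           task ID, or -1 once all tasks are handed out; stop when
           every worker has been sent -1. */
        int handed = 0, stopped = 0;
        double sum = 0.0;
        while (stopped < size - 1) {
            double r;
            MPI_Status st;
            MPI_Recv(&r, 1, MPI_DOUBLE, MPI_ANY_SOURCE, 0,
                     MPI_COMM_WORLD, &st);
            sum += r;   /* first message from each worker is a dummy 0.0 */
            int task = (handed < tasks) ? handed++ : -1;
            if (task < 0) stopped++;
            MPI_Send(&task, 1, MPI_INT, st.MPI_SOURCE, 0,
                     MPI_COMM_WORLD);
        }
        printf("sum of results: %g\n", sum);
    } else {
        /* Worker: long independent compute, tiny messages. */
        double result = 0.0;              /* first send just requests work */
        for (;;) {
            MPI_Send(&result, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
            int task;
            MPI_Recv(&task, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            if (task < 0) break;
            result = 0.0;                 /* stand-in for the real work */
            for (long i = 0; i < 50000000L; i++)
                result += task * 1e-9;
        }
    }

    MPI_Finalize();
    return 0;
}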

 - Robin



